I’m trying to import the EMNIST Letters dataset into an artificial intelligence program I have written in Python, but I can’t seem to do it correctly. How should I import it into the following program?
...
# Import Statements
...
emnist = spio.loadmat("EMNIST/emnist-letters.mat")
...
# The problem appears to originate below: I am trying to set these variables to the corresponding parts of the EMNIST dataset, without success
x_train = emnist["dataset"][0][0][0][0][0][0]
x_train = x_train.astype(np.float32)
y_train = emnist["dataset"][0][0][0][0][0][1]
x_test = emnist["dataset"][0][0][1][0][0][0]
x_test = x_test.astype(np.float32)
y_test = emnist["dataset"][0][0][1][0][0][1]
train_labels = y_train
test_labels = y_test
x_train /= 255
x_test /= 255
x_train = x_train.reshape(x_train.shape[0], 1, 28, 28, order="A")
x_test = x_test.reshape(x_test.shape[0], 1, 28, 28, order="A")
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
# Does not work:
plt.imshow(x_train[54000][0], cmap='gray')
plt.show()
# Compilation and Fitting
...
I did not expect an error message at all, but received:
Traceback (most recent call last):
  File "OCIR_EMNIST.py", line 61, in <module>
    y_train = keras.utils.to_categorical(y_train, 10)
  File "/home/user/.local/lib/python3.7/site-packages/keras/utils/np_utils.py", line 34, in to_categorical
    categorical[np.arange(n), y] = 1
IndexError: index 23 is out of bounds for axis 1 with size 10
Amendment: The MNIST dataset is not suitable for this project, as it contains only handwritten digits, not handwritten letters.
3 Answers
I wasn’t familiar with the EMNIST data set, but after some research I found that it directly matches the MNIST data set, found at this link. Since it is the same data set, I’d advise just using MNIST, although I don’t know whether you need EMNIST for a particular reason. Using the MNIST data set is simple with Keras:
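The loading snippet was omitted above; it presumably looked something like this minimal sketch, which uses the built-in `keras.datasets.mnist` loader (downloads the data on first use):

```python
import numpy as np
from tensorflow import keras

# Load MNIST directly through Keras; returns uint8 arrays of
# shape (60000, 28, 28) / (10000, 28, 28) plus integer labels.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale pixel values from [0, 255] to [0, 1]
x_train = x_train.astype(np.float32) / 255
x_test = x_test.astype(np.float32) / 255
```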
Normalize the data points before sending them through whatever machine learning method you wish to use. Note that y_train and y_test are just the labels.
Hope this helps; you should get the same data set in a much shorter and easier manner.
EDIT: Since you are looking for a letter data set rather than just digits, I would advise getting the data set from this link. The letter-recognition.data file should be what you can use: it contains the letter, as well as 16 feature values describing each letter. You can load it as a CSV file and partition your data for training/validation, then perform some type of ML on it (I have trained an ANN on this data set). One note: you may need to map the letters in the downloaded data file to numerical values for your ground truths (A=0, B=1, …, Z=25).
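As a sketch of that loading step: each row of letter-recognition.data is a comma-separated record whose first field is the letter and whose remaining 16 fields are integer features. The helper name `load_letter_data` below is hypothetical, not part of any library:

```python
import csv

import numpy as np

def load_letter_data(path="letter-recognition.data"):
    """Load the UCI letter-recognition CSV: one letter + 16 features per row."""
    features, labels = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            labels.append(ord(row[0]) - ord("A"))        # map A=0, ..., Z=25
            features.append([int(v) for v in row[1:]])   # 16 integer features
    return np.array(features, dtype=np.float32), np.array(labels)
```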
MNIST is a classic case to learn Machine Learning and Data Mining. Here is the code I used to load the MNIST when I was comparing the performance of CNN, SVR and decision tree.
Note: be careful with the indentation of the first line, which should be dedented four spaces. With this dataset reader, you can just call the load_mnist function to load the dataset, which will keep your code neat.
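The reader itself was omitted above; a common implementation of such a load_mnist function reads the gzipped IDX files distributed on the MNIST site (the file names below follow that site's convention and are an assumption here):

```python
import gzip
import os
import struct

import numpy as np

def load_mnist(path, kind="train"):
    """Load gzipped MNIST IDX files from `path` (kind is 'train' or 't10k')."""
    labels_path = os.path.join(path, f"{kind}-labels-idx1-ubyte.gz")
    images_path = os.path.join(path, f"{kind}-images-idx3-ubyte.gz")
    with gzip.open(labels_path, "rb") as f:
        _magic, _n = struct.unpack(">II", f.read(8))   # big-endian header
        labels = np.frombuffer(f.read(), dtype=np.uint8)
    with gzip.open(images_path, "rb") as f:
        _magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        images = np.frombuffer(f.read(), dtype=np.uint8).reshape(n, rows * cols)
    return images, labels
```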
Or you can just use the built-in Keras dataset loader. The details are available in the Keras documentation.
I hope this helps.
Maybe you should take a look at: https://github.com/christianversloot/extra_keras_datasets
It’s not a popular library (at the time of writing) and I have not tried it myself, but it seems easy to use and well documented.
To load the EMNIST dataset with it, you can do it just like you would with Keras: