
I’m trying to import the EMNIST Letters dataset into a machine-learning program I have written in Python, and I can’t seem to do it correctly. How should I import it into the following program?

...
# Import Statements
...


emnist = spio.loadmat("EMNIST/emnist-letters.mat")
...

# The problem appears to originate below: I am trying to set these variables to the
# corresponding parts of the EMNIST dataset and cannot get it to work

x_train = emnist["dataset"][0][0][0][0][0][0]
x_train = x_train.astype(np.float32)

y_train = emnist["dataset"][0][0][0][0][0][1]

x_test = emnist["dataset"][0][0][1][0][0][0]
x_test = x_test.astype(np.float32)

y_test = emnist["dataset"][0][0][1][0][0][1]

train_labels = y_train
test_labels = y_test

x_train /= 255
x_test /= 255

x_train = x_train.reshape(x_train.shape[0], 1, 28, 28, order="A")
x_test = x_test.reshape(x_test.shape[0], 1, 28, 28, order="A")

y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# Does not work:
plt.imshow(x_train[54000][0], cmap='gray')
plt.show()

# Compilation and Fitting
...

I did not expect an error message at all, but received:

Traceback (most recent call last):
  File "OCIR_EMNIST.py", line 61, in <module>
    y_train = keras.utils.to_categorical(y_train, 10)
  File "/home/user/.local/lib/python3.7/site-packages/keras/utils/np_utils.py", line 34, in to_categorical
    categorical[np.arange(n), y] = 1
IndexError: index 23 is out of bounds for axis 1 with size 10

Amendment: The MNIST dataset is not suitable for this project as it does not contain handwritten letters; it only contains handwritten numbers.
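
Update: looking at the traceback again, the out-of-bounds index 23 suggests the labels themselves run past 10. The emnist-letters labels are 1–26 (one per letter), so to_categorical would need 26 classes and zero-based labels. A numpy-only sketch of that relabelling (the sample labels below are made up; one_hot reproduces what keras.utils.to_categorical would return):

```python
import numpy as np

# EMNIST Letters labels run 1..26 (a..z); shift them to 0..25 and
# one-hot encode with 26 classes instead of 10.
y_train = np.array([[1], [23], [26]])     # stand-in labels from the .mat file
y_shifted = y_train.flatten() - 1         # 1..26 -> 0..25

num_classes = 26
one_hot = np.zeros((len(y_shifted), num_classes), dtype=np.float32)
one_hot[np.arange(len(y_shifted)), y_shifted] = 1

print(one_hot.argmax(axis=1).tolist())    # [0, 22, 25]
```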

3 Answers


  1. I wasn’t familiar with the EMNIST data set, but after some research I found that it directly matches the MNIST data set, found at this link. Since it is the same data set, I’d advise just using MNIST, although I don’t know whether you need EMNIST for a particular reason. Loading MNIST is simple with Keras:

    mnist = keras.datasets.mnist #loads in the data set
    (x_train, y_train), (x_test, y_test) = mnist.load_data() #separates data for training/validation
    x_train = x_train / 255.0
    x_test = x_test  / 255.0
    

    Normalize the data points before sending them through whatever machine learning method you wish to use. Note that y_train and y_test are just the labels.

    Hope this helps; you should get the same data set in a much shorter and easier way.

    EDIT: Since you are looking for a database of letters rather than just numbers, I would advise getting the data set from this link. The letter-recognition.data file should be what you need. It contains the letter as well as 16 feature values describing each letter. You can load it as a CSV file, partition the data for training/validation, and then run some type of ML on it (I have trained an ANN with this data set). One note: you may need to convert the letters in the downloaded data file to numerical values for your ground truths (A=0, B=1, …, Z=25).
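
    A possible sketch of that CSV preprocessing (the two sample rows below are made up; the real letter-recognition.data file has one letter plus 16 integer features per line):

```python
import csv
import io

# Two made-up rows in the letter-recognition.data layout:
# a letter, then 16 integer feature values.
sample = ("T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8\n"
          "A,1,1,3,2,1,8,2,2,2,8,2,8,1,6,2,7\n")

features, labels = [], []
for row in csv.reader(io.StringIO(sample)):
    labels.append(ord(row[0]) - ord('A'))      # A=0, B=1, ..., Z=25
    features.append([int(v) for v in row[1:]])

print(labels)             # [19, 0]
print(len(features[0]))   # 16
```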

  2. MNIST is a classic data set for learning machine learning and data mining. Here is the code I used to load MNIST when I was comparing the performance of a CNN, SVR, and a decision tree.

    def load_mnist(path, kind='train'):
        """Load MNIST data from `path`"""
        import os
        import gzip
        import numpy as np

        labels_path = os.path.join(path,
                                   '%s-labels-idx1-ubyte.gz'
                                   % kind)
        images_path = os.path.join(path,
                                   '%s-images-idx3-ubyte.gz'
                                   % kind)

        with gzip.open(labels_path, 'rb') as lbpath:
            labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                                   offset=8)

        with gzip.open(images_path, 'rb') as imgpath:
            images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                                   offset=16).reshape(len(labels), 784)

        return images, labels
    

    With this dataset reader, you can just call the “load_mnist” function to load the dataset, which keeps your code neat.
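
    Taken on its own, the reader’s format assumptions (an 8-byte header before the labels, a 16-byte header before the 28x28 images) can be sanity-checked with tiny synthetic files; the arrays below are made up, not real MNIST data:

```python
import gzip
import os
import tempfile
import numpy as np

def load_mnist(path, kind='train'):
    """Load MNIST-style IDX data from `path` (same logic as above)."""
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)
    return images, labels

# Synthetic files: 8-byte label header, 16-byte image header,
# three all-zero 28x28 "images" with labels 0, 1, 2.
tmp = tempfile.mkdtemp()
with gzip.open(os.path.join(tmp, 'train-labels-idx1-ubyte.gz'), 'wb') as f:
    f.write(bytes(8) + bytes([0, 1, 2]))
with gzip.open(os.path.join(tmp, 'train-images-idx3-ubyte.gz'), 'wb') as f:
    f.write(bytes(16) + bytes(3 * 784))

images, labels = load_mnist(tmp, kind='train')
print(images.shape, labels.tolist())   # (3, 784) [0, 1, 2]
```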

    Or you can just use the Keras dataset loader. The details are available in the Keras documentation.

    from keras.datasets import mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    

    I hope this helps.

  3. Maybe you should take a look at: https://github.com/christianversloot/extra_keras_datasets

    It’s not a popular library (at the time of writing), and I have not tried it yet; however, it seems easy to use and well documented.

    To load the EMNIST dataset with it, you can do it just as you would with Keras:

    from extra_keras_datasets import emnist
    (input_train, target_train), (input_test, target_test) = emnist.load_data(type='balanced')
    