Keras is my favourite deep learning library when I need to get things done quickly. When training a deep learning model, we sometimes face an enormous data set that doesn't fit into RAM. Instead of loading all rows into memory, we can load only what we need by using the fit_generator() function. In this post, I'll share my experience of using fit_generator(). The sample program reads a CSV file of the IRIS data set in batches, then feeds those batches into a simple neural network classifier. This post will not focus on which method is best for classifying the data, as this data set is also too small for deep learning; it only has 100 rows. Instead, it focuses on the way to load rows from a CSV file and use fit_generator() to train the model. I used Keras 2.0.6 with Tensorflow-GPU 1.2.1 when writing this post.

First, we have to create a generator function that reads the data from a file and stores it in numpy arrays. The function can take as many parameters as we'd like, but it can't have a return value; it has to yield the data to be processed by Keras instead. The code below is the generator function. I give it two parameters, filename and batch_size. The first is obviously the CSV filename, while the second determines how many rows go into each batch the generator yields. The infinite loop exists so that fit_generator() can read the same file again after an epoch finishes.

import numpy as np
from keras.utils import np_utils

def read_rows(filename, batch_size):
    while True:  # loop forever so the next epoch can re-read the file
        f_iris = open(filename, "r")
        bufferX, bufferY = None, None
        for i in range(0, 100):
            line = f_iris.readline()
            data = line.split(",")

            # the first eight columns are the features
            X = [float(x) for x in data[0:8]]
            X = np.reshape(X, (1, 8))

            # the last column is the class name
            if data[8].strip() == "setosa":
                y = 0
            elif data[8].strip() == "versicolor":
                y = 1
            elif data[8].strip() == "virginica":
                y = 2

            Y = np_utils.to_categorical(y, num_classes=3)

            # start a new buffer or append the row to the current one
            if bufferX is None:
                bufferX, bufferY = X, Y
            else:
                bufferX = np.r_["0,2", bufferX, X]
                bufferY = np.r_["0,2", bufferY, Y]

            # yield the buffer once it holds a full batch
            if bufferX.shape[0] == batch_size:
                yield bufferX, bufferY
                bufferX, bufferY = None, None

        # yield any leftover rows that didn't fill a whole batch
        if bufferX is not None:
            yield bufferX, bufferY
        f_iris.close()
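
Before wiring the generator into a model, it can help to pull one batch from it manually and check the shapes. A minimal sanity check, assuming the CSV file is named iris.csv:

gen = read_rows("iris.csv", 2)
X_batch, Y_batch = next(gen)
print(X_batch.shape, Y_batch.shape)  # expect (2, 8) and (2, 3)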

In each iteration (reading a line from the file), we first read the features, which are the first eight columns, and store them in a numpy array. Then we transform the class name (setosa, versicolor, or virginica) into a one-hot vector; that's where the np_utils.to_categorical() function comes into play. The buffering block at the bottom of the loop appends rows to a buffer until it holds batch_size rows; when that happens, we yield the buffer so it can be used to train the model. numpy.r_[] is used to concatenate the data; the "0,2" directive passed as its first item makes sure the arrays are concatenated along the first axis as 2-D arrays. We also yield once more after reading all the rows in the file, so that a final partial batch is not left behind. One thing to note: when you are really doing this in your research/production environment, don't forget to pre-process the data before putting them into the buffer. Usually, you need to normalise the features in advance.
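
To make the "0,2" directive and the normalisation advice more concrete, here is a minimal sketch; the values and the feature_min/feature_max arrays are made up for illustration:

import numpy as np

a = np.reshape([1.0, 2.0], (1, 2))
b = np.reshape([3.0, 4.0], (1, 2))

# "0,2" tells np.r_ to concatenate along the first axis while
# treating every input as at least a 2-D array
batch = np.r_["0,2", a, b]
print(batch.shape)  # prints (2, 2)

# hypothetical min-max scaling, assuming the per-feature minima
# and maxima were computed over the whole data set beforehand
feature_min = np.array([1.0, 2.0])
feature_max = np.array([3.0, 4.0])
batch_scaled = (batch - feature_min) / (feature_max - feature_min)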

After defining the generator function, we just need to build the model and train it using fit_generator(). The code directly below shows how to initialise a Sequential model. In this post, I just arbitrarily use two hidden layers with 10 and 5 neurons respectively. If you notice, the last Dense layer in the code is the output layer, which has three neurons with a softmax activation function.

from keras.models import Sequential
from keras.layers import Dense

def init_model():
    input_dimension = 8
    model = Sequential()

    # two hidden layers, then a 3-neuron softmax output layer
    model.add(Dense(10, activation="relu", input_shape=(input_dimension,)))
    model.add(Dense(5, activation="relu"))
    model.add(Dense(3, activation="softmax"))

    model.compile(loss="categorical_crossentropy", optimizer="adadelta")

    return model
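
If you want to double-check the resulting architecture, Keras can print a layer-by-layer overview:

model = init_model()
model.summary()  # shows each Dense layer's output shape and parameter count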

The last thing to do is to put all of those functions to use by calling fit_generator(). The code below shows how to train the model this way. First, we need to define steps_per_epoch. In Keras, an epoch is finished when the system has read every distinct row from the data set, so steps_per_epoch should be the number of rows in the data set divided by the batch size (the number of rows yielded by the generator function at a time). In this example, we have 100 distinct rows; if we set the batch size to 2, then steps_per_epoch will be 50.

from sys import argv

filename = argv[1]
batch_size = int(argv[2])

# 100 rows in the file, so an epoch takes 100 // batch_size generator calls
steps_per_epoch = 100 // batch_size

model = init_model()
model.fit_generator(read_rows(filename, batch_size),
                    steps_per_epoch=steps_per_epoch,
                    epochs=10, verbose=1)
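
After training, the same generator can be reused for a quick sanity check of the predictions; np.argmax() turns the softmax outputs back into class indices:

X_batch, Y_batch = next(read_rows(filename, batch_size))
predictions = model.predict(X_batch)
print(np.argmax(predictions, axis=1))  # predicted classes
print(np.argmax(Y_batch, axis=1))      # true classes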

That's all for now. I was a bit confused about how to use fit_generator() properly at first. I thought we could only yield one row at a time, but it turned out we can concatenate rows into a batch by using numpy.r_[]. By making the batch size bigger, we can speed up training and use the GPU more efficiently. However, this example won't show a big improvement, as it only uses a small data set. If you have any comment regarding the post, please feel free to write it down below. The full working code of this post is available on my GitHub repository.
