Dataset Loading

This repo aims to be a centralized resource for loading commonly used image datasets such as CIFAR, PASCAL VOC, MNIST, ImageNet and others.

Some of these datasets will fit easily in memory (CIFAR and MNIST), but many of the others won’t. This means we have to set up threads to load them into memory as we need them. TensorFlow provides some ability to do this, but after several attempts at using these resources, we found them far too opaque and difficult to use. This package does essentially the same thing as TensorFlow, but using Python’s threading, multiprocessing and queue packages.

Threads vs Processes

Initially this package used only Python’s threading package to parallelize tasks. It quickly became apparent that this caps the benefits of parallelization: because of CPython’s global interpreter lock, all of these threads together use at most one processor core when running Python code. In practice, we want to use more processors for data loading to reduce bottlenecks. We are adding multiprocessing support for the heavy lifting tasks (in particular, loading and preprocessing images into the ImageQueue), although this is still untested.

Dataset Specific Usage

For instructions on how to call the functions to get images in for common datasets, see their help pages. These functions wrap around the General Usage functions and are provided for convenience. If your application doesn’t quite fit into these functions, or if you have a new dataset, have a look at General Usage, as it was designed to make queueing for any dataset type as easy as possible.

General Usage

For the bigger datasets, we need 2 queues and several threads (perhaps on multiple processors) to load images in parallel.

  1. A File Queue to store a list of file names. Sequencing can be done by shuffling the file names before inserting into the queue.
    • One thread should be enough to manage this queue.
  2. An Image Queue to load images into.
    • Several threads will likely be needed to read file names from the file queue, load from disk, and put into the Image Queue. We may get away with running these all in one Python process, but may need to use more.
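
The two-queue layout above can be sketched with only the standard library. This is a minimal illustration of the pattern, not this package's implementation: fake_load, loader, run_pipeline and the SENTINEL convention are hypothetical names chosen for the example.

```python
import queue
import threading

SENTINEL = None  # stop signal for loader threads (an assumption of this sketch)


def fake_load(name):
    """Stand-in for reading and decoding an image from disk."""
    return name.upper()


def loader(file_q, image_q):
    """Pull file names, 'load' them, and push the results to the image queue."""
    while True:
        name = file_q.get()
        if name is SENTINEL:
            break
        image_q.put(fake_load(name))


def run_pipeline(names, num_loaders=3):
    names = list(names)
    file_q = queue.Queue()
    image_q = queue.Queue(maxsize=100)  # bounded, to cap memory use
    for name in names:
        file_q.put(name)
    for _ in range(num_loaders):
        file_q.put(SENTINEL)  # one stop signal per loader thread
    threads = [threading.Thread(target=loader, args=(file_q, image_q))
               for _ in range(num_loaders)]
    for t in threads:
        t.start()
    # The main program is the consumer: it drains the image queue.
    results = [image_q.get() for _ in names]
    for t in threads:
        t.join()
    return results
```

With several loaders the output order is not guaranteed, which is one reason to keep each label paired with its file name as it moves through the queues.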

The FileQueue

A FileQueue is used to store a list of file names (e.g. jpegs). This is also the location of sequencing (there is an option to shuffle the entries in this queue when adding) and where we set the limits on the number of epochs processed (if we wish to). For example, this would set up a file queue for 50 epochs:

import os
import dataset_loading as dl
IM_DIR = '/path/to/images'
files = os.listdir(IM_DIR)
files = [f for f in files if os.path.splitext(f)[1] == '.jpeg']
file_queue = dl.FileQueue()
file_queue.load_epochs(files, max_epochs=50)
...
...
file_queue.join()

The load_epochs method will also start a single thread to manage the queue and refill it if it’s getting low (shuffling along as it goes).
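
The refill-and-shuffle behaviour can be pictured as a generator that walks the file list once per epoch. This is only a sketch under stated assumptions: epoch_stream and its shuffle and seed parameters are hypothetical and are not part of the package's API.

```python
import random


def epoch_stream(files, max_epochs, shuffle=True, seed=None):
    """Yield file names epoch by epoch, reshuffling at each epoch boundary."""
    rng = random.Random(seed)
    epoch = 0
    # Comparing against float('inf') also works, giving an endless stream.
    while epoch < max_epochs:
        order = list(files)  # copy so the caller's list is left untouched
        if shuffle:
            rng.shuffle(order)
        for name in order:
            yield name
        epoch += 1
```

Every file appears exactly once per epoch, so the consumer can count samples to tell where one epoch ends and the next begins.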

If you know what the labels are, you should also feed them to the File Queue alongside the file names in a list of (file, label) tuples. E.g.:

# Assume <labels> is a list of all of the labels and <files> is a
# list of the files.
file_queue = dl.FileQueue()
file_queue.load_epochs(list(zip(files, labels)), max_epochs=float('inf'))

Note that when you are done with the queue, you should call the queue’s join method, which will make sure the queue is empty and the loader thread exits.

The ImageQueue

An ImageQueue (the ImgQueue class) holds a set number of images (not the entire dataset, but enough to keep the main program happily fed). This class has a method for starting image reader threads, and you can choose how many of these you need to meet your main program’s demand. Following the above code, you add an image queue like so:

from time import sleep
img_queue = dl.ImgQueue(maxsize=1000)
img_queue.start_loaders(file_queue, num_threads=3, img_dir=IM_DIR)
# Wait for the image queue to fill up
sleep(2)
data, labels = img_queue.get_batch(batch_size=100)
...
...
img_queue.join()

The ImgQueue.start_loaders method will start num_threads threads, each of which reads from the file_queue, loads from disk and feeds into the image queue.

If you want the loaders to pre-process images before putting them into the image queue, you can provide a callable to ImgQueue.start_loaders to do this (see its docstring for more info). For example:

import numpy as np
img_queue = dl.ImgQueue()
def preprocess(x):
    x = x.astype(np.float32)
    x = x - np.mean(x)
    x = x/max(1, np.std(x))
    return x
img_queue.start_loaders(file_queue, num_threads=3, transform=preprocess)

The ImgQueue.get_batch method has two extra options (block and timeout), instructing it how to handle cases when the image queue doesn’t have a full batch worth of images (should we return with whatever’s there, or wait for the loaders to catch up?). See its docstring for more info.
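
One way to picture the block and timeout options is with the standard library's queue.Queue. The sketch below shows the general pattern only. The function here merely shares its name with the method, and its exact semantics are an assumption, so the docstring of ImgQueue.get_batch remains the authority on the real behaviour.

```python
import queue
import time


def get_batch(q, batch_size, block=True, timeout=None):
    """Collect up to batch_size items from q.

    block=False returns whatever is available immediately. block=True waits,
    giving up after `timeout` seconds in total if a timeout is given.
    """
    batch = []
    deadline = None if timeout is None else time.monotonic() + timeout
    while len(batch) < batch_size:
        try:
            if not block:
                batch.append(q.get_nowait())
            elif deadline is None:
                batch.append(q.get())
            else:
                remaining = max(0.0, deadline - time.monotonic())
                batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # return the partial batch rather than waiting any longer
    return batch
```

Returning a short batch keeps the main program moving, while blocking without a timeout guarantees full batches at the cost of stalling if the loaders fall behind.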

For synchronization with epochs, the ImageQueue has an attribute last_batch that will be set to True when an epoch’s worth of images has been pulled from the ImageQueue.

data, labels = img_queue.get_batch(batch_size=100)
last_batch = img_queue.last_batch
if last_batch:
    # Print summary info...
    pass
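
A flag like last_batch can be derived simply by counting the samples handed out. The class below is a hypothetical sketch of that bookkeeping. EpochTracker is not part of the package, and how the package handles a batch that straddles an epoch boundary is not stated here, so the carry-over rule is an assumption.

```python
class EpochTracker:
    """Flag the batch that completes each pass over `epoch_size` samples."""

    def __init__(self, epoch_size):
        self.epoch_size = epoch_size
        self.seen = 0  # samples handed out in the current epoch
        self.last_batch = False

    def update(self, batch_len):
        """Record one batch and report whether it finished an epoch."""
        self.seen += batch_len
        self.last_batch = self.seen >= self.epoch_size
        if self.last_batch:
            # Assumption: any overshoot counts towards the next epoch.
            self.seen -= self.epoch_size
        return self.last_batch
```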

You can monitor the queue size and fetch times for the ImgQueue too (to check whether you need to tweak some settings). This works by writing the info to a tensorboard summary file (currently the only supported way of doing it). All you need to do is create a tf.summary.FileWriter (you can use the same one the rest of your main program is using), and call the ImgQueue.add_logging method. This will add the data to your tensorboard file.

import numpy as np
import tensorflow as tf
img_queue = dl.ImgQueue()
def preprocess(x):
    x = x.astype(np.float32)
    x = x - np.mean(x)
    x = x/max(1, np.std(x))
    return x
img_queue.start_loaders(file_queue, num_threads=3, transform=preprocess)
file_writer = tf.summary.FileWriter('./log', tf.get_default_graph())
# Write period is the sample period in numbers of batches for dumping data
img_queue.add_logging(file_writer, write_period=10)
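
To make clear what is being sampled, here is a sketch of the same idea without tensorboard: time each fetch and record the queue size every write_period batches. FetchMonitor and timed_get are hypothetical names, and this is not how add_logging is implemented.

```python
import queue
import time


class FetchMonitor:
    """Record queue size and fetch time once every `write_period` batches."""

    def __init__(self, q, write_period=10):
        self.q = q
        self.write_period = write_period
        self.batches = 0
        self.records = []  # tuples of (batch index, queue size, fetch seconds)

    def timed_get(self, n):
        start = time.perf_counter()
        items = [self.q.get() for _ in range(n)]
        elapsed = time.perf_counter() - start
        self.batches += 1
        if self.batches % self.write_period == 0:
            self.records.append((self.batches, self.q.qsize(), elapsed))
        return items
```

A queue size that sits near zero together with long fetch times suggests the loaders cannot keep up, which is the cue to add threads or lighten the preprocessing.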

Note that when you are done with the image queue, you should call its join method, which will make sure the queue is empty and the loader threads exit.

Small Datasets

If you have a special case where the dataset is small, and so can fit into memory (like CIFAR or MNIST), then you won’t need the same complexity to get batches of data and labels. However, it may still be beneficial to use the ImgQueue class for two reasons:

  • It keeps the same programmatic interface regardless of the dataset.
  • You may still want to parallelize things if you preprocess images before putting them in the queue.

For this, use ImgQueue.take_dataset instead of ImgQueue.start_loaders. This method also has options like whether to shuffle the samples or not (will shuffle by default), and can take a callable function to apply to the images before putting them in the queue. The default number of threads to create is 1, but this can be increased with the num_threads parameter.

Note: to avoid duplicating things in memory, the ImgQueue will not copy the data/labels. This means that once your main program calls the take_dataset method, it shouldn’t modify the arrays.

E.g.

import numpy as np
import dataset_loading as dl
import dataset_loading.cifar as dlcifar
train_d, train_l, test_d, test_l, val_d, val_l = \
    dlcifar.load_cifar_data('/path/to/data')
img_queue = dl.ImgQueue()
img_queue.take_dataset(train_d, train_l)
data, labels = img_queue.get_batch(100)
# Or say we want to use more parallel threads and morph the image
def preprocess(x):
    x = x.astype(np.float32)
    x = x - np.mean(x)
    x = x/max(1, np.std(x))
    return x
img_queue = dl.ImgQueue()
img_queue.take_dataset(train_d, train_l, num_threads=3,
                       transform=preprocess)
data, labels = img_queue.get_batch(100)

Installation

To install directly from GitHub (useful if you use pip freeze) and get the master branch, try:

$ pip install -e git+https://github.com/fbcotter/dataset_loading#egg=dataset_loading

or for a specific tag (e.g. 0.0.1), try:

$ pip install -e git+https://github.com/fbcotter/dataset_loading.git@0.0.1#egg=dataset_loading

Alternatively, clone the repository and pip install from your local copy:

$ git clone https://github.com/fbcotter/dataset_loading
$ cd dataset_loading
$ pip install -r requirements.txt
$ pip install -e .

It is recommended to download and install (with the editable flag), as it is likely you’ll want to tweak things/add functions more quickly than we can handle pull requests.

Further documentation

There is more documentation available online and you can build your own copy via the Sphinx documentation system:

$ python setup.py build_sphinx

Compiled documentation may be found in build/docs/html/ (index.html will be the homepage).