API Guide

Core Functions

class dataset_loading.FileQueue(maxsize=0)[source]

Bases: queue.Queue

A queue to hold filename strings

This queue is used to indicate the order in which jpeg files should be read. It may also be a good idea to put the class label alongside the filename as a tuple, so the main program can access both at the same time.

Create the class, and then call the load_epochs() method to start a thread to manage the queue and refill it as it gets low.

The maxsize is not provided as an option as we want the queue to be able to take entire epochs and not be restricted on the upper limit by a maxsize. Memory use should not be a problem, as each queue entry is only a filename string (or a small tuple of filename and label).

epoch_count

The current epoch count

epoch_size

Gives the size of one epoch of data

filling

Returns true if the file queue is being filled

get(block=True, timeout=None)[source]

Get a single item from the File Queue

join()[source]

Method to signal any threads that are filling this queue to stop.

Threads will clean themselves up if the epoch limit is reached, but in case you want to kill them manually before that, you can signal them to stop here.

Note: Overloads the queue join method which normally blocks until the queue has been emptied. This will return even if the queue has data in it.

killed

Returns true if the queue has been asked to die

load_epochs(files, shuffle=True, max_epochs=inf)[source]

Starts a thread to load the file names into the file queue.

Parameters:
  • files (list) – Can either be a list of filename strings or a list of tuples of (filenames, labels)
  • shuffle (bool) – Whether to shuffle the list before adding it to the queue.
  • max_epochs (int or infinity) – Maximum number of epochs to allow before queue manager stops refilling the queue.

Notes

Even if the shuffle input is set to False, that doesn’t necessarily mean that all images in the image queue will be in the same order across epochs. For example, suppose thread A pulls the first image from the list and thread B pulls the second one. If thread A takes slightly longer to read in its image than thread B, its image gets inserted into the Image Queue afterwards. Synchronizing across both queues could be done, but it would add unnecessary complications and overhead.

Raises:ValueError - If the files list was empty
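
The behaviour described for load_epochs() can be pictured with a small sketch built only on the standard library. This is not the library's implementation: the function name fill_epochs, its seed argument and its finite max_epochs default are assumptions made for illustration, and it pushes every epoch up front rather than refilling as the queue gets low.

```python
import queue
import random
import threading


def fill_epochs(file_queue, files, shuffle=True, max_epochs=2, seed=0):
    """Hypothetical stand-in for a queue manager: push whole epochs of names."""
    if not files:
        raise ValueError("files list was empty")
    rng = random.Random(seed)
    for _ in range(max_epochs):
        epoch = list(files)          # copy so the caller's list is untouched
        if shuffle:
            rng.shuffle(epoch)
        for item in epoch:
            file_queue.put(item)     # unbounded queue, so put() never blocks


file_queue = queue.Queue()           # maxsize=0 means no upper limit
files = [("a.jpg", 0), ("b.jpg", 1), ("c.jpg", 0)]
manager = threading.Thread(
    target=fill_epochs, args=(file_queue, files), daemon=True
)
manager.start()
manager.join()

# Each epoch contains every (filename, label) tuple exactly once.
first_epoch = [file_queue.get() for _ in range(len(files))]
```

Storing (filename, label) tuples, as suggested above, keeps the label attached to its file no matter how the epoch is shuffled.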
class dataset_loading.ImgQueue(maxsize=1000, name='')[source]

Bases: queue.Queue

A queue to hold images

This queue can hold images which will be loaded from the main program. Multiple file reader threads can fill up this queue as needed to make sure demand is met.

Each entry in the image queue will be a tuple of (data, label). If the data is loaded using a filename queue and image loader threads and a label is not provided, each queue item will still be a tuple, but the label will be None. If you don’t want to return this label, then you can set the nolabel input to the start_loaders function.

To get a batch of samples from the ImageQueue, see the get_batch() method.

If you are lucky enough to have an entire dataset that fits easily into memory, you won’t need to use multiple threads to start loading data. You may however want to keep the same interface. In this case, you can call the take_dataset function with the dataset and labels, and then call the get_batch() method in the same manner.

Parameters:
  • maxsize (positive int) – Maximum number of images to hold in the queue. Must not be 0, otherwise the queue will keep filling up until you run out of memory.
  • name (str) – Queue name
Raises:

ValueError if the maxsize parameter is incorrect.

add_logging(writer, write_period=10)[source]

Adds ability to monitor queue sizes and fetch times.

Will try to import tensorflow and issue a warning via warnings.warn if the import fails.

Parameters:
  • writer (tensorflow FileWriter object) – Uses this object to write out summaries.
  • write_period (int) – After how many calls to get_batch should we write to the logger.
epoch_count

Returns what epoch we are currently at

epoch_size

The epoch size (as interpreted from the File Queue)

filling

Returns true if the file queue is being filled

get(block=True, timeout=None)[source]

Get a single item from the Image Queue

get_batch(batch_size, timeout=3)[source]

Tries to get a batch from the Queue.

If there are fewer images than a full batch, it will grab them all. If the epoch size was set and the tracking counter sees there are fewer than <batch_size> images left before the end of the epoch, then it will cap the number of images grabbed so that the batch ends exactly at the epoch boundary.

Parameters:
  • batch_size (int) – How many samples we want to get.
  • timeout (int or float) – How long to wait for an image before timing out
Returns:

  • data (list of ndarray) – List of numpy arrays representing the transformed images.
  • labels (list of ndarray or None) – List of labels. Will be None if there were no labels in the FileQueue.

Notes

When we pull the last batch from the image queue, the property last_batch is set to true. This allows the calling function to synchronize tests with the end of an epoch.

Raises:
  • FileQueueNotStarted - when trying to get a batch but the file queue manager hasn’t started.
  • FileQueueDepleted - when we have hit the epoch limit.
  • ImgQueueNotStarted - when trying to get a batch but no image loaders have started.
  • queue.Empty - If timed out on trying to read an image
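
The epoch capping and timeout behaviour of get_batch() can be sketched with a plain queue.Queue. The function below is a hypothetical stand-in, not the library's code: it takes read_count and epoch_size as arguments, returns the last-batch flag as a third value instead of exposing a property, and always waits for a full (capped) batch rather than grabbing whatever is available.

```python
import queue


def get_batch(q, batch_size, read_count, epoch_size=None, timeout=3):
    """Hypothetical sketch: pull up to batch_size items, stopping at the epoch edge."""
    wanted = batch_size
    if epoch_size is not None:
        remaining = epoch_size - (read_count % epoch_size)
        wanted = min(batch_size, remaining)
    # queue.Queue.get raises queue.Empty if nothing arrives within `timeout`.
    items = [q.get(block=True, timeout=timeout) for _ in range(wanted)]
    data = [d for d, _ in items]
    labels = [label for _, label in items]
    last_batch = (
        epoch_size is not None and (read_count + wanted) % epoch_size == 0
    )
    return data, labels, last_batch


q = queue.Queue(maxsize=100)
for i in range(5):
    q.put((f"img{i}", i % 2))

# Epoch of 5 items, batches of 3: the second batch is capped at 2 items.
d1, l1, last1 = get_batch(q, 3, read_count=0, epoch_size=5)
d2, l2, last2 = get_batch(q, 3, read_count=3, epoch_size=5)
```

Capping the final batch means no batch ever mixes samples from two epochs, which is what makes an end-of-epoch flag meaningful.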
img_shape

Return what the image size is of the images in the queue

This may be useful to check the output shape after any preprocessing has been done.

Returns:img_size – Returns the shape of the images in the queue or None if it could not determine what they were.
Return type:list of ints or None
join()[source]

Method to signal any threads that are filling this queue to stop.

Threads will clean themselves up if the epoch limit is reached, but in case you want to kill them manually before that, you can signal them to stop here. Note that if these threads are blocked waiting on input, they will still stay alive (and blocked) until whatever is blocking them frees up. This shouldn’t be a problem though, as they will not be taking up any processing power.

If there is a file queue associated with this image queue, those threads will be stopped too.

Note: Overloads the queue join method which normally blocks until the queue has been emptied. This will return even if the queue has data in it.

killed

Returns True if the queue has been asked to die.

label_shape

Return what the label shape is of the labels in the queue

This may be useful to check the output shape after any preprocessing has been done.

Returns:label_shape – Returns the shape of the labels in the queue or None if it could not determine what they were.
Return type:list of ints or None
last_batch

Check whether the previously read batch was the last batch in the epoch.

Reading this value will set it to False. This allows you to do something like this:

while True:
    while not train_queue.last_batch:
        data, labels = train_queue.get_batch(batch_size)

    ...
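
The read-and-reset behaviour of last_batch is what lets the inner loop above restart on the next epoch. A minimal sketch of that behaviour follows. StubQueue is a hypothetical class written for this example and mimics only the flag, not the real ImgQueue.

```python
class StubQueue:
    """Hypothetical stand-in that mimics only the last_batch behaviour."""

    def __init__(self, batches_per_epoch):
        self._batches_per_epoch = batches_per_epoch
        self._served = 0
        self._last_batch = False

    @property
    def last_batch(self):
        # Reading the flag clears it, so the next epoch's loop can start.
        value = self._last_batch
        self._last_batch = False
        return value

    def get_batch(self, batch_size):
        self._served += 1
        if self._served % self._batches_per_epoch == 0:
            self._last_batch = True
        return [0] * batch_size, [0] * batch_size


train_queue = StubQueue(batches_per_epoch=3)
batches_in_epoch = []
for epoch in range(2):
    count = 0
    while not train_queue.last_batch:
        data, labels = train_queue.get_batch(4)
        count += 1
    batches_in_epoch.append(count)   # a validation pass could run here
```

Because the flag is cleared on read, it should be checked once per batch. Reading it twice in a row would hide the end of the epoch from the second reader.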
read_count

Returns how many images have been read from this queue.

start_loaders(file_queue, num_threads=3, img_dir=None, img_size=None, transform=None)[source]

Starts the threads to load the images into the ImageQueue

Parameters:
  • file_queue (FileQueue object) – An instance of the file queue
  • num_threads (int) – How many parallel threads to start to load the images
  • img_dir (str) – Offset to add to the strings fetched from the file queue so that a call to load the file in will succeed.
  • img_size (tuple of (height, width) or None) – What size to resize all the images to. If None, no resizing will be done.
  • transform (function handle or None) – Pre-filtering operation to apply to the images before adding to the Image Queue. If None, no operation will be applied. Otherwise, has to be a function handle that takes the numpy array and returns the transformed image as a numpy array.
Raises:

ValueError: if called after take_dataset.

take_dataset(data, labels=None, shuffle=True, num_threads=1, transform=None, max_epochs=inf)[source]

Save the image dataset to the class for feeding back later.

If we don’t need a file queue (we have all the dataset in memory), we can give it to the ImgQueue class with this method. Images will still flow through the queue (so you still need to be careful about how big to set the queue’s maxsize), but now the preprocessed images will be fed into the queue, ready to retrieve quickly by the main program.

Parameters:
  • data (ndarray of floats) – The images. Should be in the form your main program is happy to receive them in, as no reshaping will be done. For example, if the data is of shape [10000, 32, 32, 3], then we randomly sample from the zeroth axis when we call get_batch().
  • labels (ndarray numeric or None) – The labels. If not None, the zeroth axis has to match the size of the data array. If None, then no labels will be returned when calling get_batch().
  • shuffle (bool) – Normally the ordering will be done in the file queue, as we are skipping this, the ordering has to be done here. Set this to true if you want to receive samples randomly from data.
  • num_threads (int) – How many threads to start to fill up the image queue with the preprocessed data.
  • transform (None or callable) – Transform to apply to images. Should accept a single image (although isn’t fussy about what size/shape it is in), and return a single image. This will be applied to all the images independently before putting them in the Image Queue.

Notes

Even if the shuffle input is set to False, that doesn’t necessarily mean that all images in the image queue will be in the same order across epochs. For example, suppose thread A pulls the first 100 images from the list and thread B pulls the second 100. If thread A takes slightly longer to process its images than thread B, they get inserted into the Image Queue afterwards. Synchronizing across both queues could be done, but it would add unnecessary complications and overhead.

Raises:AssertionError if data and labels don’t match up in size.
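
The reordering described in the Notes above can be reproduced with two threads and a shared queue. This is an illustrative sketch using only the standard library. The threading.Event is an assumption added to force thread A to be the slower one, so the outcome is deterministic.

```python
import queue
import threading

img_queue = queue.Queue(maxsize=10)
b_done = threading.Event()


def loader_a():
    # Thread A took item 0 first but is slower: it waits for B to finish.
    b_done.wait()
    img_queue.put("item-0")


def loader_b():
    # Thread B took item 1 second but finishes its work first.
    img_queue.put("item-1")
    b_done.set()


threads = [threading.Thread(target=loader_a), threading.Thread(target=loader_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Items come out in completion order, not in the order they were taken.
order = [img_queue.get(), img_queue.get()]
```

If strict ordering matters, using a single loader thread avoids this race, at the cost of throughput.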

Exceptions

exception dataset_loading.ImgQueueNotStarted(value)[source]

Exception Raised when trying to pull from an Image queue that hasn’t had its feeders started.

exception dataset_loading.FileQueueNotStarted(value)[source]

Exception Raised when trying to pull from a File queue that hasn’t had its manager started.

exception dataset_loading.FileQueueDepleted(value)[source]

Exception Raised when the file queue has been depleted. Will be raised when the epoch limit is reached.

Dataset Specific

MNIST

dataset_loading.mnist.extract_images(f)[source]

Extract the images into a 4D uint8 numpy array [index, y, x, depth].

Parameters:f (file object) – file that can be passed into a gzip reader.
Returns:data
Return type:A 4D uint8 numpy array [index, y, x, depth]
Raises:ValueError: If the bytestream does not start with 2051.
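
The value 2051 is the magic number at the start of an MNIST image file, which is followed by the image count, row count and column count as big-endian 32-bit integers. A minimal sketch of that header check, using only the standard library, is shown below. The function name read_idx_image_header is an assumption for this example, and unlike extract_images it returns raw bytes rather than a numpy array.

```python
import gzip
import io
import struct


def read_idx_image_header(f):
    """Sketch: read the 16-byte header and pixel bytes of an MNIST image file."""
    with gzip.GzipFile(fileobj=f) as stream:
        magic, count, rows, cols = struct.unpack(">IIII", stream.read(16))
        if magic != 2051:
            raise ValueError("Invalid magic number %d in MNIST image file" % magic)
        pixels = stream.read(count * rows * cols)   # one byte per pixel
    return count, rows, cols, pixels


# Build a tiny in-memory file: 2 images of 2x2 pixels.
raw = struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8))
buf = io.BytesIO(gzip.compress(raw))
count, rows, cols, pixels = read_idx_image_header(buf)
```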
dataset_loading.mnist.extract_labels(f, one_hot=False, num_classes=10)[source]

Extract the labels into a 1D uint8 numpy array [index].

Parameters:
  • f (file object) – A file object that can be passed into a gzip reader.
  • one_hot (bool) – Does one hot encoding for the result.
  • num_classes (int) – Number of classes for the one hot encoding.
Returns:

labels

Return type:

a 1D uint8 numpy array.

Raises:

ValueError: If the bytestream doesn’t start with 2049.
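
As a rough illustration of what the one_hot and num_classes parameters control, the sketch below converts integer labels to one-hot rows. It uses plain Python lists to stay dependency-free, whereas the library returns numpy arrays. The helper name to_one_hot is hypothetical.

```python
def to_one_hot(labels, num_classes=10):
    """Hypothetical helper: turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError("label %d outside [0, %d)" % (label, num_classes))
        row = [0] * num_classes
        row[label] = 1               # a single 1 at the class index
        rows.append(row)
    return rows


encoded = to_one_hot([0, 2, 1], num_classes=3)
```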

dataset_loading.mnist.get_mnist_queues(data_dir, val_size=2000, transform=None, maxsize=10000, num_threads=(2, 2, 2), max_epochs=inf, get_queues=(True, True, True), one_hot=True, download=False, _rand_data=False)[source]

Get Image queues for MNIST

MNIST is a small dataset. This function loads it into memory and creates several ImgQueue instances to feed the training, testing and validation data through to the main function. Preprocessing can be done by providing a callable to the transform parameter. Note that by default, the black and white MNIST images will be returned as a [28, 28, 1] shape numpy array. You can of course modify this with the transform function.

Parameters:
  • data_dir (str) – Path to the folder containing the MNIST data. This should be the folder holding the gzip files downloaded from yann.lecun.com.
  • val_size (int) – How big you want the validation set to be. Will be taken from the end of the train data.
  • transform (None or callable or tuple of callables) – Callable function that accepts a numpy array representing one image, and transforms it/preprocesses it. E.g. you may want to remove the mean and divide by standard deviation before putting into the queue. If tuple of callables, needs to be of length 3 and should be in the order (train_transform, test_transform, val_transform). Setting it to None means no processing will be done before putting into the image queue.
  • maxsize (int or tuple of 3 ints) – How big the image queues will be. Increase this if your main program is chewing through the data quickly, but increasing it will also mean more memory is taken up. If tuple of ints, needs to be length 3 and of the form (train_qsize, test_qsize, val_qsize).
  • num_threads (int or tuple of 3 ints) – How many threads to use for the train, test and validation threads (if tuple, needs to be of length 3 and in that order).
  • max_epochs (int or infinity) – How many epochs to run before raising FileQueueDepleted exceptions
  • get_queues (tuple of 3 bools) – In case you only want to have training data, or training and validation, or any subset of the three queues, you can mask the individual queues by putting a False in its position in this tuple of 3 bools.
  • one_hot (bool) – True if you want the labels pushed into the queue to be a one-hot vector. If false, will push in a one-of-k representation.
  • download (bool) – True if you want the dataset to be downloaded for you. It will be downloaded into the data_dir provided in this case.
Returns:

  • train_queue (ImgQueue instance or None) – Queue with the training data in it. None if get_queues[0] == False
  • test_queue (ImgQueue instance or None) – Queue with the test data in it. None if get_queues[1] == False
  • val_queue (ImgQueue instance or None) – Queue with the validation data in it. Will be None if the val_size parameter was 0 or get_queues[2] == False

Notes

If the max_epochs parameter is set to a finite amount, then when the queues run out of data, they will raise a dataset_loading.FileQueueDepleted exception.
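
Several parameters above accept either a single value or a tuple of three values in (train, test, val) order, and get_queues masks individual queues. One way to picture how such arguments could be normalized is sketched below. The as_triple helper and the config dictionary are hypothetical and are not part of the library.

```python
def as_triple(value):
    """Hypothetical helper: expand a scalar to (train, test, val)."""
    if isinstance(value, (tuple, list)):
        if len(value) != 3:
            raise ValueError("expected 3 entries, got %d" % len(value))
        return tuple(value)
    return (value, value, value)


maxsize = as_triple(10000)           # same queue size for all three queues
num_threads = as_triple((2, 1, 1))   # per-queue thread counts
get_queues = (True, False, True)     # skip the test queue
names = ("train", "test", "val")

# Keep settings only for the queues that were requested.
config = {
    name: {"maxsize": m, "num_threads": n}
    for name, m, n, wanted in zip(names, maxsize, num_threads, get_queues)
    if wanted
}
```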

dataset_loading.mnist.load_mnist_data(data_dir, val_size=2000, one_hot=True, download=False)[source]

Load mnist data

Parameters:
  • data_dir (str) –

    Path to the folder with the mnist files in them. These should be the gzip files downloaded from yann.lecun.com

  • val_size (int) – Size of the validation set.
  • one_hot (bool) – True to return one hot labels
  • download (bool) – True if you don’t have the data and want it to be downloaded for you.
Returns:

  • trainx (ndarray) – Array containing training images. There will be 60000 - val_size images in this.
  • trainy (ndarray) – Array containing training labels. These will be one hot if the one_hot parameter was true, otherwise the standard one of k.
  • testx (ndarray) – Array containing test images. There will be 10000 test images in this.
  • testy (ndarray) – Test labels
  • valx (ndarray) – Array containing validation images. Will be None if val_size was 0.
  • valy (ndarray) – Array containing validation labels. Will be None if val_size was 0.
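
The relationship between val_size and the returned arrays can be sketched as a split taken from the end of the training data, which is how get_mnist_queues describes it. The helper below is hypothetical and works on plain sequences rather than ndarrays.

```python
def split_validation(train_items, val_size):
    """Hypothetical helper: carve the validation set off the end of the train set."""
    if val_size == 0:
        return list(train_items), None       # no validation set requested
    if not 0 < val_size < len(train_items):
        raise ValueError("val_size must be smaller than the training set")
    return list(train_items[:-val_size]), list(train_items[-val_size:])


train, val = split_validation(list(range(10)), 2)
full_train, no_val = split_validation(list(range(10)), 0)
```

With the real MNIST training set of 60000 images and the default val_size of 2000, the same split leaves 58000 training images.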

CIFAR

dataset_loading.cifar.load_cifar_data(data_dir, cifar10=True, val_size=2000, one_hot=True, download=False)[source]

Load cifar10 or cifar100 data

Parameters:
  • data_dir (str) –

    Path to the folder with the cifar files in them. These should be the python files as downloaded from cs.toronto

  • cifar10 (bool) – True if cifar10, false if cifar100
  • val_size (int) – Size of the validation set.
  • one_hot (bool) – True to return one hot labels
  • download (bool) – True if you don’t have the data and want it to be downloaded for you.
Returns:

  • trainx (ndarray) – Array containing training images. There will be 50000 - val_size images in this.
  • trainy (ndarray) – Array containing training labels. These will be one hot if the one_hot parameter was true, otherwise the standard one of k.
  • testx (ndarray) – Array containing test images. There will be 10000 test images in this.
  • testy (ndarray) – Test labels
  • valx (ndarray) – Array containing validation images. Will be None if val_size was 0.
  • valy (ndarray) – Array containing validation labels. Will be None if val_size was 0.

dataset_loading.cifar.get_cifar_queues(data_dir, cifar10=True, val_size=2000, transform=None, maxsize=10000, num_threads=(2, 2, 2), max_epochs=inf, get_queues=(True, True, True), one_hot=True, download=False, _rand_data=False)[source]

Get Image queues for CIFAR

CIFAR10/100 are both small datasets. This function loads them both into memory and creates several ImgQueue instances to feed the training, testing and validation data through to the main function. Preprocessing can be done by providing a callable to the transform parameter. Note that by default, the CIFAR images returned will be of shape [32, 32, 3] but this of course can be changed by the transform function.

Parameters:
  • data_dir (str) – Path to the folder containing the cifar data. For cifar10, this should be the path to the folder called ‘cifar-10-batches-py’. For cifar100, this should be the path to the folder ‘cifar-100-python’.
  • cifar10 (bool) – True if we are using cifar10.
  • val_size (int) – How big you want the validation set to be. Will be taken from the end of the train data.
  • transform (None or callable or tuple of callables) – Callable function that accepts a numpy array representing one image, and transforms it/preprocesses it. E.g. you may want to remove the mean and divide by standard deviation before putting into the queue. If tuple of callables, needs to be of length 3 and should be in the order (train_transform, test_transform, val_transform). Setting it to None means no processing will be done before putting into the image queue.
  • maxsize (int or tuple of 3 ints) – How big the image queues will be. Increase this if your main program is chewing through the data quickly, but increasing it will also mean more memory is taken up. If tuple of ints, needs to be length 3 and of the form (train_qsize, test_qsize, val_qsize).
  • num_threads (int or tuple of 3 ints) – How many threads to use for the train, test and validation threads (if tuple, needs to be of length 3 and in that order).
  • max_epochs (int or infinity) – How many epochs to run before raising FileQueueDepleted exceptions
  • get_queues (tuple of 3 bools) – In case you only want to have training data, or training and validation, or any subset of the three queues, you can mask the individual queues by putting a False in its position in this tuple of 3 bools.
  • one_hot (bool) – True if you want the labels pushed into the queue to be a one-hot vector. If false, will push in a one-of-k representation.
  • download (bool) – True if you want the dataset to be downloaded for you. It will be downloaded into the data_dir provided in this case.
Returns:

  • train_queue (ImgQueue instance or None) – Queue with the training data in it. None if get_queues[0] == False
  • test_queue (ImgQueue instance or None) – Queue with the test data in it. None if get_queues[1] == False
  • val_queue (ImgQueue instance or None) – Queue with the validation data in it. Will be None if the val_size parameter was 0 or get_queues[2] == False

Notes

If the max_epochs parameter is set to a finite amount, then when the queues run out of data, they will raise a dataset_loading.FileQueueDepleted exception.
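
A training loop with a finite max_epochs typically ends by catching that exception. The sketch below shows the pattern with local stand-ins so that it runs without the library: both the FileQueueDepleted class defined here and the FiniteQueue class are assumptions for illustration, not the library's own objects.

```python
class FileQueueDepleted(Exception):
    """Local stand-in for dataset_loading.FileQueueDepleted."""


class FiniteQueue:
    """Hypothetical queue that serves a fixed number of batches, then is depleted."""

    def __init__(self, total_batches):
        self._left = total_batches

    def get_batch(self, batch_size):
        if self._left == 0:
            raise FileQueueDepleted("epoch limit reached")
        self._left -= 1
        return [0] * batch_size, [0] * batch_size


train_queue = FiniteQueue(total_batches=5)
steps = 0
finished = False
try:
    while True:
        data, labels = train_queue.get_batch(32)
        steps += 1                   # a training step would go here
except FileQueueDepleted:
    finished = True                  # all requested epochs have been consumed
```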

PASCAL

dataset_loading.pascal.img_sets()[source]

List all the image sets from Pascal VOC. Don’t bother computing this on the fly, just remember it. It’s faster.

dataset_loading.pascal.load_pascal_data(data_dir, max_epochs=None, thread_count=3, imsize=(128, 128))[source]

Uses a filename queue and an image queue to load the data.