The devil in the details : feeding images to Theano

Another lesson from the Kaggle competition : pushing your code to ultimate speed is still labour intensive.

High-level libraries like Numpy and Theano really help, but when you are training complex deep-learning models, rushing through the epochs takes more than typing a couple of imports, enabling the GPU, and hoping for the best.

Seemingly trivial tasks — input, output, copying buffers, rearranging arrays — may easily become the critical bottleneck. Those treacherous operations always raise flags for practioners of high-performance computing. The rest of us, as we fixate on numerical computations, tend to overlook that moving data around easily dominates the costs.

Sometimes the bottleneck is a bolt out of the blue, a straightforward task that ends up being painfully slow. Case in point, the pipeline of importing an image from PIL to Theano through a Numpy array.

Opening an image using PIL is certainly very simple :

>>> from PIL import Image
>>> image = Image.open('color.png', 'r')
>>> image
<PIL.PngImagePlugin.PngImageFile image mode=RGB size=12x8 at 0x1D3E518>

PIL will be so charming as even to deduce by itself the necessary decoder from the file extension or headers.

From the returned object, you can fetch metadata :

>>> image.format
'PNG'
>>> image.getbbox()
(0, 0, 12, 8)
>>> image.getbands()
('R', 'G', 'B')

Or you can get an object that gives access to the actual pixels with :

>>> image.getdata()
<ImagingCore object at 0x7f21e5bb2350>

If you want to use the image as input for a model in Theano, you’ll need first to convert it to a Numpy array. This can very easily be done with :

>>> import numpy
>>> array1 = numpy.asarray(image)
>>> array1 
array([[[255,   0,   0],
        [255,   0,   0],
        [128,   0,   0],
        [128,   0,   0],
        [  0,   0,   0],
        [ 64,  64,  64],
        [128, 128, 128],
        [255, 255, 255],
        [  0,   0, 255],
        [  0,   0, 255],
        [  0,   0, 128],
        [  0,   0, 128]],

       [[255,   0,   0],
        [255,   0,   0],
        [128,   0,   0],
        [128,   0,   0],
        [  0,   0,   0],
        [ 64,  64,  64],
        [128, 128, 128],
        [255, 255, 255],
        [  0,   0, 255],
        [  0,   0, 255],
        [  0,   0, 128],
        [  0,   0, 128]],

       [[255,   0,   0],
        [255,   0,   0],
        [128,   0,   0],
        [128,   0,   0],
        [  0,   0,   0],
        [ 64,  64,  64],
        [128, 128, 128],
        [255, 255, 255],
        [  0,   0, 255],
        [  0,   0, 255],
        [  0,   0, 128],
        [  0,   0, 128]],

       [[255,   0,   0],
        [255,   0,   0],
        [128,   0,   0],
        [128,   0,   0],
        [  0,   0,   0],
        [ 64,  64,  64],
        [128, 128, 128],
        [255, 255, 255],
        [  0,   0, 255],
        [  0,   0, 255],
        [  0,   0, 128],
        [  0,   0, 128]],

       [[255, 255,   0],
        [255, 255,   0],
        [  0, 255, 255],
        [  0, 255, 255],
        [  0, 255,   0],
        [  0, 255,   0],
        [  0, 128,   0],
        [  0, 128,   0],
        [128, 128,   0],
        [128, 128,   0],
        [  0, 128, 128],
        [  0, 128, 128]],

       [[255, 255,   0],
        [255, 255,   0],
        [  0, 255, 255],
        [  0, 255, 255],
        [  0, 255,   0],
        [  0, 255,   0],
        [  0, 128,   0],
        [  0, 128,   0],
        [128, 128,   0],
        [128, 128,   0],
        [  0, 128, 128],
        [  0, 128, 128]],

       [[255,   0, 255],
        [255,   0, 255],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0, 255,   0],
        [  0, 255,   0],
        [  0, 128,   0],
        [  0, 128,   0],
        [128,   0, 128],
        [128,   0, 128],
        [255, 255, 255],
        [255, 255, 255]],

       [[255,   0, 255],
        [255,   0, 255],
        [  0,   0,   0],
        [  0,   0,   0],
        [  0, 255,   0],
        [  0, 255,   0],
        [  0, 128,   0],
        [  0, 128,   0],
        [128,   0, 128],
        [128,   0, 128],
        [255, 255, 255],
        [255, 255, 255]]], dtype=uint8)

You may, at this point, specify a data type. For injecting images into Theano, we’ll often want to convert them to float32 to allow GPU processing :

array1 = numpy.asarray(image, dtype='float32')

Or for enhanced portability, you can use Theano’s default float datatype (which will be float32 if the code is intended to run in the current generation of GPUs) :

import theano
array1 = numpy.asarray(image, dtype=theano.config.floatX)

Alternatively, the array can be converted from the ImagingCore object :

>>> array2 = numpy.asarray(image.getdata())
>>> array2
array([[255,   0,   0],
       [255,   0,   0],
       [128,   0,   0],
       [128,   0,   0],
       [  0,   0,   0],
       [ 64,  64,  64],
       [128, 128, 128],
       [255, 255, 255],
       [  0,   0, 255],
       [  0,   0, 255],
       [  0,   0, 128],
       [  0,   0, 128],
       [255,   0,   0],
       [255,   0,   0],
       [128,   0,   0],
       [128,   0,   0],
       [  0,   0,   0],
       [ 64,  64,  64],
       [128, 128, 128],
       [255, 255, 255],
       [  0,   0, 255],
       [  0,   0, 255],
       [  0,   0, 128],
       [  0,   0, 128],
       [255,   0,   0],
       [255,   0,   0],
       [128,   0,   0],
       [128,   0,   0],
       [  0,   0,   0],
       [ 64,  64,  64],
       [128, 128, 128],
       [255, 255, 255],
       [  0,   0, 255],
       [  0,   0, 255],
       [  0,   0, 128],
       [  0,   0, 128],
       [255,   0,   0],
       [255,   0,   0],
       [128,   0,   0],
       [128,   0,   0],
       [  0,   0,   0],
       [ 64,  64,  64],
       [128, 128, 128],
       [255, 255, 255],
       [  0,   0, 255],
       [  0,   0, 255],
       [  0,   0, 128],
       [  0,   0, 128],
       [255, 255,   0],
       [255, 255,   0],
       [  0, 255, 255],
       [  0, 255, 255],
       [  0, 255,   0],
       [  0, 255,   0],
       [  0, 128,   0],
       [  0, 128,   0],
       [128, 128,   0],
       [128, 128,   0],
       [  0, 128, 128],
       [  0, 128, 128],
       [255, 255,   0],
       [255, 255,   0],
       [  0, 255, 255],
       [  0, 255, 255],
       [  0, 255,   0],
       [  0, 255,   0],
       [  0, 128,   0],
       [  0, 128,   0],
       [128, 128,   0],
       [128, 128,   0],
       [  0, 128, 128],
       [  0, 128, 128],
       [255,   0, 255],
       [255,   0, 255],
       [  0,   0,   0],
       [  0,   0,   0],
       [  0, 255,   0],
       [  0, 255,   0],
       [  0, 128,   0],
       [  0, 128,   0],
       [128,   0, 128],
       [128,   0, 128],
       [255, 255, 255],
       [255, 255, 255],
       [255,   0, 255],
       [255,   0, 255],
       [  0,   0,   0],
       [  0,   0,   0],
       [  0, 255,   0],
       [  0, 255,   0],
       [  0, 128,   0],
       [  0, 128,   0],
       [128,   0, 128],
       [128,   0, 128],
       [255, 255, 255],
       [255, 255, 255]])

Note that the techniques are not exactly interchangeable. Converting directly from the Image object preserves the shape, while converting from .getdata() creates a flatter array :

>>> array1.shape
(8, 12, 3)
>>> array2.shape
(96, 3)

In both cases the image channels are “interleaved”, i.e., for each row and column of the image, the values of the channels (in this example: red, green and blue) appear in sequence. This organization is helpful if you are doing local transformations in the pixels (say, changing colorspaces) as it keeps referential locality. For deep learning, however, that arrangement is a nuisance.

Even with 2D input images, we will typically want “3D” convolutional filters in order to reach through — and mix and match — the entire stack of input channels. We need the input channels separated by planes, i.e., all red data, then all green data, etc. That is a breeze with rollaxis and reshape :

>>> array1 = numpy.rollaxis(array1, 2, 0)
>>> array2 = array2.T.reshape(3,8,12)

Some neural nets go one step further and stack together all the images from a given batch into a single tensor. The LeNet/MNIST sample from deeplearning.net does exactly that. That strategy can increase GPU utilization by giving it bigger chunks to chew at once. If you decide to adopt it, it’s not difficult to assemble the tensor :

ImageSize = (512, 512)
NChannelsPerImage = 3
imagesData = [ Image.open(f, 'r').getdata() for f in batch ]
for i in imagesData :
    assert i.size == ImageSize
    assert i.bands == NChannelsPerImage

allImages = numpy.asarray(imagesData)
nImages = len(batch)
allImages = numpy.rollaxis(allImages, 2, 1).reshape(nImages, NChannelsPerImage, ImageSize[0], ImageSize[1])
print allImages.shape

The code above checks that all images conform to a given shape (essential to convolutional networks, which are very rigid about input sizes). And it works… but it will rather walk than run.

As you try to discover what is holding back the speed, it is easy to suspect the array reshaping operations, but the real culprit is the innocent-loking image conversion to array. For some reason importing images to Numpy either from Image objects or from ImagingCore objects — as we have been trying so far — takes an absurd amount of time.

The solution is not exactly elegant but it makes the conversion so much faster, you might want to consider it. You have to bridge the conversion with a pair of .tostring / .fromstring operations :

ImageSize = (512, 512)
NChannelsPerImage = 3
images = [ Image.open(f, 'r') for f in batch ]
for i in images :
    assert i.size == ImageSize
    assert len(i.getbands()) == NChannelsPerImage

ImageShape =  (1,) + ImageSize + (NChannelsPerImage,)
allImages = [ numpy.fromstring(i.tostring(), dtype='uint8', count=-1, sep='') for i in images ]
allImages = [ numpy.rollaxis(a.reshape(ImageShape), 3, 1) for a in allImages ]
allImages = numpy.concatenate(allImages)

The snippet above has exactly the same effect than the previous one, but it will run up to 20 times faster. In both cases, the array will be ready to be fed to the network.

* * *

TL;DR ?

If speed is important to you, do not convert an image from PIL to Numpy like this…

from PIL import Image
import numpy
image = Image.open('color.png', 'r')
array = numpy.asarray(image)

…nor like this…

from PIL import Image
import numpy
imageData = Image.open('color.png', 'r').getdata()
imageArray = numpy.asarray(imageData).reshape(imageData.shape + (imageData.bands,))

…because although both methods work, they will be very very slow. Do it like this :

from PIL import Image
import numpy
image = Image.open('color.png', 'r').getdata()
imageArray = numpy.fromstring(image.tostring(), dtype='uint8', count=-1, sep='').reshape(image.shape + (len(image.getbands()),))

It will be up to 20⨉ faster.

(Also, you’ll probably have to work on the shape of the array before you feed it to a convolutional network, but for that, I’m afraid, you’ll have to read the piece from the top.)

Deep Neural Network

From instance launch to model accuracy: an AWS/Theano walkthrough

My team has recently participated at Kaggle’s Diabetic Retinopathy challenge, and we won… experience. It was our first Kaggle challenge and we found ourselves unprepared for the workload.

But it was fun — and it was the opportunity to learn new skills, and to sharpen old ones. As the deadline approached, I used Amazon Web Services a lot, and got more familiar with it. Although we have our GPU infrastructure at RECOD, the extra boost provided by AWS allowed exploring extra possibilities.

But it was in the weekend just before the challenge deadline that AWS proved invaluable. Our in-house cluster went AWOL. What are the chances of having a power outage bringing down your servers and a pest control blocking your physical access to them in the weekend before a major deadline ? Murphy knows. Well, AWS allowed us to go down fighting, instead of throwing in the towel.

In this post, I’m compiling Markus Beissinger’s how-to and deeplearning.net tutorials into a single hyper-condensed walkthrough to get you as fast as possible from launching an AWS instance until running a simple convolutional deep neural net. If you are anything like me, I know that you are aching to see some code running — but after you scratch that itch, I strongly suggest you to go back to those sources and study them at leisure.

I’ll assume that you already know :

  1. How to create an AWS account ;
  2. How to manage AWS users and permissions ;
  3. How to launch an AWS instance.

Those preparations out of way, let’s get started ?

Step 1: Launch an instance at AWS, picking :

  • AMI (Amazon Machine Image) : Ubuntu Server 14.04 LTS (HVM), SSD Volume Type – 64-bit
  • Instance type : GPU instances / g2.2xlarge

For the other settings, you can use the defaults, but be careful with the security group and access key to not lock yourself out of the instance.

Step 2 : Open a terminal window, and log into your instance. In my Mac I type :

ssh -i private.pem ubuntu@xxxxxxxx.amazonaws.com

Where private.pem is the private key file of the key pair used when creating the instance, and xxxxxxxx.amazonaws.com is the public DNS of the instance. You might get an angry message from SSH, complaining that your .pem file is too open. If that happens, change its permissions with :

chmod go-rxw private.pem

Step 3 : Install Theano.

Once you’re inside the machine, this is not complicated. Start by making the machine up-to date :

sudo apt-get update
sudo apt-get -y dist-upgrade

Install Theano’s dependencies :

sudo apt-get install -y gcc g++ gfortran build-essential git wget linux-image-generic libopenblas-dev python-dev python-pip python-nose python-numpy python-scipy

Get the package for CUDA and install it :

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb
sudo apt-get update
sudo apt-get install -y cuda

This last command is the only one that takes some time — you might want to go brew a cuppa while you wait. Once it is over, put CUDA on the path and reboot the machine :

echo 'export PATH=/usr/local/cuda/bin:$PATH' >> .bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> .bashrc
sudo reboot

Log into the instance again and query for the GPU :

nvidia-smi -q

This should spit a lengthy list of details about the installed GPU.

Now you just have to install Theano. The one-liner below installs the latest version, and after the wait for the CUDA driver, runs anticlimactically fast :

sudo pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

And that’s it ! You have Theano on your system.

Step 4 : Run an example.
Let’s take Theano for a run. The simplest sample from deeplearning.net that’s already interesting is the convolutional/MNIST digits tutorial. The sample depends on code written in the previous tutorials, MLP and Logistic Regression, so you have to download those too. You also have to download the data. The commands below do all that:

mkdir theano
mkdir theano/data
mkdir theano/lenet
cd theano/data
wget http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz
cd ../lenet
wget http://deeplearning.net/tutorial/code/logistic_sgd.py
wget http://deeplearning.net/tutorial/code/mlp.py
wget http://deeplearning.net/tutorial/code/convolutional_mlp.py

Finally, hit the magical command :

python convolutional_mlp.py

What ? All this work and the epochs will go by as fast as molasses on a winter’s day. What gives ?

You have to tell Theano to run on the GPU, otherwise it will crawl on the CPU. You can paste the lines below into your ~/.theanorc file :

[global]
floatX=float32
device=gpu

[lib]
cnmem=0.9

[nvcc]
fastmath=True

…or you can use the one-liner below to create it:

echo -e '[global]\nfloatX=float32\ndevice=gpu\n\n[lib]\ncnmem=0.9\n\n[nvcc]\nfastmath=True' > ~/.theanorc

Try running the example again.

python convolutional_mlp.py

With some luck, you’ll note two differences: first, Theano will announce the use of the GPU…

Using gpu device 0: GRID K520 (CNMeM is enabled)

…and second, the epochs will run much, much faster !

(Image credit : networkworld.com).