$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

3D synthetic image generation using Torch

(Updated version of CS480 final project)

Steve Kommrusch

Introduction

There are certain problem categories where having synthetic examples of the input can improve an algorithm's ability to cover the data space. One recent example in the literature is the CLGEN program [1], which generates synthetic GPU kernels using an LSTM network trained on code from GitHub. The generated programs are then used to improve compilation techniques, because a broader sample of possible programs can be analyzed during compilation. Synthetic data can also be useful when directly using the data in question has privacy implications (such as health data associated with specific patients).

For this project, I am interested in training a neural network to generate new examples of 3D shapes that fit within a broad learned category. I will implement the neural network using the Torch framework [2], as it is a platform used broadly in the community. I have already ported HW3 from CS480 into Torch, and many concepts and functions are easily remapped. For example, Torch has many of the same array access functions that numpy has [3]. This project's ipynb file was edited using 'itorch notebook' instead of the 'jupyter notebook' used in CS480.
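As a quick illustration of the numpy-to-Torch mapping in [3], here are a few equivalences. This is an illustrative sketch only; the tensor 'x' is a throwaway example and the cell assumes the torch package is loaded.

In [ ]:
-- Illustrative numpy-to-Torch equivalences (see reference [3]).
x = torch.zeros(3, 4)         -- numpy: np.zeros((3, 4))
x[{1, {}}] = 1                -- numpy: x[0, :] = 1  (Torch indexing starts at 1)
print(x:size())               -- numpy: x.shape
print(x:sum(), x:max())       -- numpy: x.sum(), x.max()
y = x:t()                     -- numpy: x.T (2D transpose)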

While the model I hope to produce should be generally applicable to learning and creating 3D shapes, for this project I plan to use only this specific training shape set:

  • All shapes will be 48x48x48 voxels, grayscale
  • Single spheres of various sizes
  • Single cubes of various sizes and 45 degree rotations
  • Rods (cylinders) of various sizes and 90 degree rotations
  • Cones of various sizes and 90 degree rotations
  • A large sphere connected to a small sphere by a rod and 90 degree rotations
  • A large cube connected to a small cube by a rod and 90 degree rotations
  • A large cone connected to a small cone by a rod and 90 degree rotations
  • A large sphere connected to a small cube by a rod and 90 degree rotations
  • A large cube connected to a small cone by a rod and 90 degree rotations
  • A large cone connected to a small sphere by a rod and 90 degree rotations

I am interested in exploring the best algorithms to use for this generation. I did not find LSTMs easy to use in Torch (the examples and tutorials were cumbersome and often did not work correctly), so I will not be testing image generation with LSTMs (see Appendix B for more details). But preliminary testing on 2D shape generation with a standard 3-layer fully connected network showed promise. The 'Methodology' section will discuss algorithm ideas a bit more, but I am interested in exploring autoencoding through a bottleneck layer and various Torch training approaches for learned 3D image generation. I will try to find the algorithm and network parameters that produce the best generated images in the shortest time. To that end, I will use the CUDA options within Torch (which are easy to use). (Appendix A covers Torch and CUDA experiments I have done. The 'Results' section includes some discussion of the performance stability issues I had with CUDA.)
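For reference, the preliminary 2D test mentioned above used a plain fully connected autoencoder along the following lines. This is a hypothetical reconstruction rather than the exact code; the 24x24 grayscale input size and the layer widths are illustrative assumptions.

In [ ]:
-- Hypothetical sketch of the 3-layer fully connected 2D autoencoder used in
-- preliminary testing; layer sizes here are illustrative assumptions.
require 'nn'
net2d = nn.Sequential()
net2d:add(nn.Reshape(24*24))
net2d:add(nn.Linear(24*24, 256))
net2d:add(nn.Tanh())
net2d:add(nn.Linear(256, 4))       -- small middle layer
net2d:add(nn.Tanh())
net2d:add(nn.Linear(4, 24*24))
net2d:add(nn.Reshape(24, 24))
criterion2d = nn.MSECriterion()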

Some key algorithm questions are:

  • Using only fully connected network layers, what are the best parameters for 3D generation?
  • How difficult is it in Torch to interact with the internal bottleneck layer of an autoencoder?
  • Can Torch with CUDA use scaled conjugate gradient?

Some key data questions are:

  • When using an autoencoder, how are the shapes described above mapped into the bottleneck layer? See the results section for plots of shapes relative to 4-neuron bottleneck layer.
  • Note that there are missing permutations in the training data set I propose (e.g., a large cube connected to a small sphere). What will the novel images the system generates tend to look like?
  • Will the model overtrain, or will it be able to generate interesting new shapes that were not explicitly in the training data set?

These data and algorithms are interesting for several reasons. As noted, the general problem of synthetic data generation is useful in certain medical applications, which relates to research I am doing outside of CS480. Also, the algorithms best suited to 3D shape learning are interesting more broadly: large 3D data sets arise in many contexts, and learning how to interact with such data efficiently using neural networks will be broadly valuable.

Methodology

I broke the problem of generating 3D images within a trained range into these steps:

  1. Generate 3D training images
  2. Explore optimal training setups (CG, SGD, various Torch packages)
  3. Explore CUDA performance improvements
  4. Explore optimal parameters for training the bottleneck network
  5. Explore how to observe and generate image information at the bottleneck layer
  6. Process generation network output for final 3D generated image results

Regarding steps 4, 5, and 6, below is a figure showing the final network that I found worked best for this problem. The autoencoder network has 7 layers of neurons, with 2048, 1024, 64, 4, 64, 1024, and 2048 neurons per layer. This network is trained to reproduce its input at its output. The bottleneck layer can be observed to see how well spaced the source images are in the 4-dimensional feature space. Then the weights learned for the autoencoder can be copied to a new network for image generation that takes 4 inputs. Since the network uses tanh for nonlinearity, the range of values on the 4 inputs is -1 to 1. Four random numbers in that range can be fed into the generation network to create a novel image in a style similar to those the network trained on. Images similar to a particular input image can be generated by varying the 4 feature values slightly from that input's mapping.

In [3]:
itorch.image("Autoencode.png")
Out[3]:

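In code terms, the workflow in the figure amounts to roughly the following condensed sketch; the full, runnable cells appear later in this notebook (this cell assumes the 'net', 'gen', 'criterion', and 'X24' objects defined below).

In [ ]:
-- Condensed sketch of the autoencode-then-generate workflow (see later cells).
mse = criterion:forward(net:forward(X24), X24)         -- 1) train 'net' to reproduce X24
features = net:get(9).output                           -- 2) inspect the 4 bottleneck values
img = gen:forward(torch.Tensor({0.2, -0.5, 0.7, 0.0})) -- 3) sample the 4-input generator
itorch.image(img:reshape(24, 24*24))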
3D source image generation

The next cells create functions for generating the sphere, cube, rod, and cone image groups discussed in the introduction. I generate 100 48x48x48 source images. The first images are spheres and the last are large cones connected to a sphere by a rod. Sizes and orientations are randomized. During training, the order in which images are presented to the network is randomized.

In [4]:
-- Here are the torch packages required for executing the
-- cells in this notebook
require 'torch'
require 'nn'       -- neural networks
require 'optim'    -- Improved optimizers including Conjugate Gradient
require 'cutorch'  -- Improves performance by using GPU compute
require 'cunn'
In [5]:
-- Set up functions that allow shapes of various sizes, offsets, and rotations to be built
sphere = function(r,xoff,yoff,zoff,i) 
    local rsq = r*r
    for xi = 1-xoff,size-xoff do
        x = xi-(size/2)-0.5
        for yi = 1-yoff,size-yoff do
            y = yi-(size/2)-0.5
            for zi = 1-zoff,size-zoff do
                z = zi - (size/2)-0.5
                if (x*x + y*y + z*z <rsq) then
                    X[i][xi+xoff][yi+yoff][zi+zoff] = 1.0
                end
            end
        end
    end
end
cube = function(r,xoff,yoff,zoff,rotation,i)
    for xi = 1-xoff,size-xoff do          -- size of world to position in
        for yi = 1-yoff,size-yoff do
            for zi = 1-zoff,size-zoff do
                if (rotation < 0.25) then
                    x = xi - (size/2) - 0.5
                    y = yi - (size/2) - 0.5
                    z = zi - (size/2) - 0.5
                elseif (rotation < 0.5) then
                    x = 1.414 * (xi - (size/2) - 0.5)
                    y = zi + yi - size - 1
                    z = zi - yi 
                elseif (rotation < 0.75) then
                    x = xi + zi - size - 1
                    y = 1.414 * (yi - (size/2) - 0.5)
                    z = zi - xi
                else
                    x = xi - yi
                    y = xi + yi - size - 1
                    z = 1.414 * (zi - (size/2) - 0.5)
                end
                if (x > -r and x < r and y > -r and y < r and z > -r and z < r) then
                    X[i][xi+xoff][yi+yoff][zi+zoff] = 1.0
                end
            end
        end
    end
end
rod = function(r,xoff,yoff,zoff,rotation,i)
    local rsq = r*r/4 -- This results in a thin cylinder
    for xi = 1-xoff,size-xoff do
        for yi = 1-yoff,size-yoff do
            for zi = 1-zoff,size-zoff do
                if (rotation < 0.33) then
                    x = xi - (size/2) - 0.5
                    y = yi - (size/2) - 0.5
                    z = zi - (size/2) - 0.5
                elseif (rotation < 0.67) then
                    x = zi - (size/2) - 0.5
                    y = xi - (size/2) - 0.5
                    z = yi - (size/2) - 0.5
                else
                    x = yi - (size/2) - 0.5
                    y = zi - (size/2) - 0.5
                    z = xi - (size/2) - 0.5
                end
                if (x > -r and x < r and y*y + z*z < rsq) then
                    X[i][xi+xoff][yi+yoff][zi+zoff] = 1.0
                end
            end
        end
    end
end
cone = function(r,xoff,yoff,zoff,rotation,i)
    local rsq
    for xi = 1-xoff,size-xoff do
        for yi = 1-yoff,size-yoff do
            for zi = 1-zoff,size-zoff do
                if (rotation < 0.167) then
                    x = xi - (size/2) - 0.5
                    y = yi - (size/2) - 0.5
                    z = zi - (size/2) - 0.5
                elseif (rotation < 0.333) then
                    x = -(xi - (size/2) - 0.5)
                    y = yi - (size/2) - 0.5
                    z = zi - (size/2) - 0.5
                elseif (rotation < 0.5) then
                    x = zi - (size/2) - 0.5
                    y = xi - (size/2) - 0.5
                    z = yi - (size/2) - 0.5    
                elseif (rotation < 0.667) then
                    x = -(zi - (size/2) - 0.5)
                    y = xi - (size/2) - 0.5
                    z = yi - (size/2) - 0.5
                elseif (rotation < 0.833) then
                    x = yi - (size/2) - 0.5
                    y = zi - (size/2) - 0.5
                    z = xi - (size/2) - 0.5
                else
                    x = -(yi - (size/2) - 0.5)
                    y = zi - (size/2) - 0.5
                    z = xi - (size/2) - 0.5
                end
                rsq = (r-x)*(r-x)/4 -- radius of yz circle grows as x changes
                if (x > -r and x < r and y*y + z*z < rsq) then
                    X[i][xi+xoff][yi+yoff][zi+zoff] = 1.0
                end
            end
        end
    end
end
In [6]:
-- Generate input training images
images = 100*6  -- Must be multiple of 6 to allow XYZ order permutations
size = 48
X = torch.Tensor(images,size,size,size):fill(0)
for i = 1,images/6 do
    shape = 6.0*i/images  -- ~6% spheres, 9% each for cube/rod/cone, 11-12% each for the 6 shape pairs
    rotation = torch.uniform()  -- random orientation of single shape or position of pair
    r = (torch.uniform()+1)*(size/4)  -- single shape 50-100% of available radius
    -- Double shapes have large and small radii, and rod rotation
    rlarge = torch.uniform()*size*0.075+size*0.225   -- 45% to 60% of available radius
    rsmall = torch.uniform()*size*0.05+size*0.175   -- 35% to 45% of available radius
    if (rotation < 0.17) then xdir=1 elseif (rotation < 0.33) then xdir=-1 else xdir=0 end
    if (rotation < 0.33) or (rotation >= 0.67) then zdir=0 
      elseif (rotation < 0.5) then zdir=-1 else zdir=1 end
    if (rotation < 0.67) then ydir=0 elseif (rotation < 0.83) then ydir=-1 else ydir=1 end
    if (shape < 0.07) then -- Sphere
        sphere(r,0,0,0,i)
    elseif (shape < 0.16) then -- cube with random 45 degree rotation
        cube(r,0,0,0,rotation,i)
    elseif (shape < 0.25) then -- rod with 90 degree rotation
        rod(r,0,0,0,rotation,i)
    elseif (shape < 0.34) then -- cone with 90 degree rotation  
        cone(r,0,0,0,rotation,i)    
    elseif (shape < 0.45) then -- large sphere connected to small sphere by rod
        rod(size/4,0,0,0,rotation,i)
        sphere(rlarge,xdir*size*0.2,ydir*size*0.2,zdir*size*0.2,i)
        sphere(rsmall,-xdir*size*0.25,-ydir*size*0.25,-zdir*size*0.25,i)
    elseif (shape < 0.56) then -- large cube connected to small cube by rod
        rod(size/4,0,0,0,rotation,i)
        cube(rlarge,xdir*size*0.2,ydir*size*0.2,zdir*size*0.2,torch.uniform(),i)
        cube(rsmall,-xdir*size*0.25,-ydir*size*0.25,-zdir*size*0.25,torch.uniform(),i)
    elseif (shape < 0.67) then -- large cone connected to small cone by rod
        rod(size/4,0,0,0,rotation,i)
        cone(rlarge,xdir*size*0.2,ydir*size*0.2,zdir*size*0.2,torch.uniform(),i)
        cone(rsmall,-xdir*size*0.25,-ydir*size*0.25,-zdir*size*0.25,torch.uniform(),i)
    elseif (shape < 0.78) then -- large sphere connected to small cube by rod
        rod(size/4,0,0,0,rotation,i)
        sphere(rlarge,xdir*size*0.2,ydir*size*0.2,zdir*size*0.2,i)
        cube(rsmall,-xdir*size*0.25,-ydir*size*0.25,-zdir*size*0.25,torch.uniform(),i)
    elseif (shape < 0.89) then -- large cube connected to small cone by rod
        rod(size/4,0,0,0,rotation,i)
        cube(rlarge,xdir*size*0.2,ydir*size*0.2,zdir*size*0.2,torch.uniform(),i)
        cone(rsmall,-xdir*size*0.25,-ydir*size*0.25,-zdir*size*0.25,torch.uniform(),i)
    else                      -- large cone connected to small sphere by rod
        rod(size/4,0,0,0,rotation,i)
        cone(rlarge,xdir*size*0.2,ydir*size*0.2,zdir*size*0.2,torch.uniform(),i)
        sphere(rsmall,-xdir*size*0.25,-ydir*size*0.25,-zdir*size*0.25,i)
    end
end

Below are three of the 48x48x48 images. Each is displayed as 48 2D slices, one for each plane of the 3D image. The first shows a sphere; the last two show two different examples of a large cube connected to a small cone.

In [7]:
itorch.image(X[1])
itorch.image(X[81])
itorch.image(X[82])

The full 48x48x48 source images are unnecessarily large for training, so the step below averages each 2x2x2 block of 8 voxels to create 24x24x24 images for training. Also, 5 permutations (reflections and rotations) of each initial randomly oriented image are produced. This mimics how a limited input image set can be augmented to improve network training.

In [8]:
-- TIME NOTE: This cell takes about 2 minutes to run on my laptop 
-- Generate 24x24x24 images by down-sampling 48x48x48
X24 = torch.Tensor(images,size/2,size/2,size/2):fill(0)
for i = 1,images/6 do for x = 1,size/2 do for y = 1,size/2 do for z = 1,size/2 do
    X24[i*6][x][y][z] = (X[i][x*2][y*2][z*2] + X[i][x*2-1][y*2][z*2]
                        +X[i][x*2][y*2-1][z*2] + X[i][x*2-1][y*2-1][z*2]
                        +X[i][x*2][y*2][z*2-1] + X[i][x*2-1][y*2][z*2-1]
                        +X[i][x*2][y*2-1][z*2-1] + X[i][x*2-1][y*2-1][z*2-1])/8
    X24[i*6-5][1+size/2-x][z][y] = X24[i*6][x][y][z]  -- Permutation speeds up this generation
    X24[i*6-4][y][x][z] = X24[i*6][x][y][z]           --  step and emulates methods for creating
    X24[i*6-3][1+size/2-y][z][x] = X24[i*6][x][y][z]  --  more training samples from a limited
    X24[i*6-2][z][x][y] = X24[i*6][x][y][z]           --  source data set.
    X24[i*6-1][1+size/2-z][y][x] = X24[i*6][x][y][z]
end end end end

In addition to being easier to train on, the 24x24x24 images are easier to display. Below is one image from each shape group along with one of its 5 permutations. Each row shows all 24 2D slices of one image side by side.

In [34]:
for i = 1,91,10 do -- Print example of all 10 shapes
    itorch.image(X24[i*6]:reshape(24,24*24))   -- One of the 100 random sized/rotated originals
    itorch.image(X24[i*6-3]:reshape(24,24*24)) -- One of 5 rotate/reflects of the original
end

Neural network for autoencoding

Now that we have images, below I define the neural network for training. In the Results section I'll summarize experiments with parameters, but this network was ultimately the best configuration I found.

In [10]:
net = nn.Sequential()              -- 'net' is the name for the autoencoder network
net:add(nn.Reshape(24*24*24))
net:add(nn.Linear(24*24*24, 2048))
net:add(nn.Tanh())                 -- Hyperbolic tangent for non-linearity
net:add(nn.Linear(2048,1024))
net:add(nn.Tanh())
net:add(nn.Linear(1024,64))
net:add(nn.Tanh())
net:add(nn.Linear(64,4))           -- Bottleneck layer has 4 neurons
net:add(nn.Tanh())
net:add(nn.Linear(4,64))
net:add(nn.Tanh())
net:add(nn.Linear(64,1024))
net:add(nn.Tanh())
net:add(nn.Linear(1024,2048))
net:add(nn.Tanh())
net:add(nn.Linear(2048, 24*24*24))
net:add(nn.Reshape(24,24,24))
criterion = nn.MSECriterion()   -- Use mean square error function to train on
In [11]:
-- Transfer everything to CUDA memory organization to aid performance
net = net:cuda()
criterion = criterion:cuda()
X24=X24:cuda()
dataset={}  -- The 'trainer' class wants input/target organized in this format
function dataset:size() return images end
for i = 1,images do
    dataset[i] = {X24[i],X24[i]}
end
In [12]:
-- Use criterion and net in quick test to catch simple typos
predict=net:forward(X24)
print(criterion:forward(predict, X24))      -- MSE of untrained network
itorch.image(predict[1]:reshape(24,24*24))  -- show example output of untrained network
Out[12]:
0.11937122792006	

The first training method I tested is a simple explicit loop, as documented in the Torch repository [10].

In [100]:
-- TIME NOTE: 100 iterations takes about 104 seconds on my laptop
timer = torch.Timer();
for i = 0, 100 do   
  -- feed the neural network and the criterion
  mse = criterion:forward(net:forward(X24), X24)
  -- train over this example in 3 steps
  -- (1) zero the accumulation of the gradients
  net:zeroGradParameters()
  -- (2) accumulate gradients
  net:backward(X24, criterion:backward(net.output, X24))
  -- (3) update parameters with learning rate
  net:updateParameters(0.01)
  if (i%50 == 0) then
    print("Iteration "..i..": mse="..mse)
  end
end
print('Time elapsed: ' .. timer:time().real .. ' seconds')
Out[100]:
Iteration 0: mse=0.014246131293476	
Out[100]:
Iteration 50: mse=0.014246057718992	
Out[100]:
Iteration 100: mse=0.014246016740799	
Time elapsed: 107.11159110069 seconds	

The nn.StochasticGradient trainer is also explained in the Torch repository [10]; it runs slower per iteration but converges better than the explicit loop above. However, the nn package does not include a conjugate gradient option. (It would be interesting to know more details of this package, but I found that the optim package worked better and so didn't explore the issue.)

In [45]:
-- TIME NOTE: 2 iterations takes about 138 seconds on my laptop
timer = torch.Timer()
trainer = nn.StochasticGradient(net, criterion)
trainer.learningRate = 0.01
trainer.maxIteration = 2    -- Set to 2 for an example, see results section for experiments
trainer:train(dataset)
print('Time elapsed: ' .. timer:time().real .. ' seconds')
Out[45]:
# StochasticGradient: training	
Out[45]:
# current error = 0.12775345425432	
Out[45]:
# current error = 0.12118925068217	
Out[45]:
# current error = 0.11564730634602	
Out[45]:
# current error = 0.11073783615294	
Out[45]:
# current error = 0.10628282715877	
# StochasticGradient: you have reached the maximum number of iterations	
# training error = 0.10628282715877	
Time elapsed: 345.95418787003 seconds	

Having seen the improvement the conjugate gradient method can have over stochastic gradient descent (see Appendix A), I learned from references [11] and [12] that the 'optim' package provides a cg function that can be used for training. First, I tested optim's stochastic gradient descent as shown in the following cells.

In [46]:
x, dl_dx = net:getParameters()
sgd_params = {
   learningRate = 0.01
}
stepsgd = function(batch_size)
    local current_loss = 0
    local shuffle = torch.randperm(dataset:size())
    batch_size = batch_size or 1000
    
    for t = 1,dataset:size(),batch_size do
        -- setup inputs for this mini-batch
        -- no need to setup targets, since they are the same
        local size = math.min(t + batch_size, 1+ dataset:size()) - t
        local inputs = torch.Tensor(size, 24, 24, 24):cuda()
        for i = 1,size do
            inputs[i] = dataset[shuffle[i+t-1]][1]
        end
        
        local feval = function(x_new)
            -- reset data
            if x ~= x_new then x:copy(x_new) end
            dl_dx:zero()

            -- compute loss and gradients for this mini-batch (used by optim.sgd)
            local loss = criterion:forward(net:forward(inputs), inputs)
            net:backward(inputs, criterion:backward(net.output, inputs))

            return loss, dl_dx
        end
        
        _, fs = optim.sgd(feval, x, sgd_params)
        -- fs is a table containing value of the loss function
        -- (just 1 value for the SGD optimization)
        current_loss = current_loss + fs[1]
    end

    return current_loss
end
In [47]:
--TIME NOTE: 100 iterations takes about 110 seconds on my laptop
timer = torch.Timer()
max_iters=100
do
    for i = 1,max_iters do
        local loss = stepsgd()
        if (i%50 == 0) then
            print(string.format('Epoch: %d Current loss: %4f', i, loss))
        end
    end
end
print('Time elapsed: ' .. timer:time().real .. ' seconds')
Out[47]:
Epoch: 100 Current loss: 0.103493	
Out[47]:
Epoch: 200 Current loss: 0.102821	
Time elapsed: 221.05263090134 seconds	

Finally, the best training I got was with the optim.cg function, which is used by the cells below. As discussed in the Results section, these 2 cells describe the method I used for my best autoencoder training results. Running this notebook sequentially does a little training with 4 different methods, but just running the following 2 cells after redefining the neural network 'net' results in the lowest error.

In [15]:
x, dl_dx = net:getParameters()
cg_params = {
    maxIter = 20
}
stepcg = function(batch_size)
    local current_loss = 0
    local shuffle = torch.randperm(dataset:size())
    batch_size = batch_size or 1000
    
    for t = 1,dataset:size(),batch_size do
        -- setup inputs for this mini-batch
        -- no need to setup targets, since they are the same
        local size = math.min(t + batch_size, 1+ dataset:size()) - t
        local inputs = torch.Tensor(size, 24, 24, 24):cuda()
        for i = 1,size do
            inputs[i] = dataset[shuffle[i+t-1]][1]
        end
        
        local feval = function(x_new)
            -- reset data
            if x ~= x_new then x:copy(x_new) end
            dl_dx:zero()

            -- compute loss and gradients for this mini-batch (used by optim.cg)
            local loss = criterion:forward(net:forward(inputs), inputs)
            net:backward(inputs, criterion:backward(net.output, inputs))

            return loss, dl_dx
        end
        
        _, fs = optim.cg(feval, x, cg_params)
        -- fs is a table containing value of the loss function
        current_loss = current_loss + fs[1]
    end

    return current_loss
end
In [18]:
-- TIME NOTE: Running 100 iterations takes about 500 seconds on my laptop with 'fast' CUDA
timer = torch.Timer()
max_iters=1000
for i = 1,max_iters do
    local loss = stepcg()
    if (i%1000 == 0) then
        print(string.format('Epoch: %d Current loss: %4f', i, loss))
    end
end
print('Time elapsed: ' .. timer:time().real .. ' seconds')
Out[18]:
Epoch: 1000 Current loss: 0.002434	
Time elapsed: 4493.0059530735 seconds	

I have now shown the data setup and autoencoder neural network methodology I developed. In the Results section, I'll analyze the training and experimentation in more detail. I also leave the description of the 'gen' network used to generate images, and of the post-processing step that creates the final 48x48x48 generated outputs, until the latter half of the Results section, as they will be more understandable after seeing the results of the autoencoder training.

Results

During the setup and configuration of this project, I explored various approaches to learn which path would yield the best training in the least amount of time.

1) My first attempts at training used the loop that explicitly calls the net:backward and net:updateParameters steps. I found that autoencoder models with initial layers of 1,000 and 4,000 neurons did not converge well, even over 100,000 iterations. As an example from this initial testing, 500 training iterations with a network consisting of layers of 1024,512,64,8,3,8,64,512,1024 neurons took 160 seconds and yielded 0.1508 from the MSECriterion (lower is better; the image results were very poor). With CUDA, the same 500 iterations took 11.6 sec and yielded a similar 0.1502 MSE.

2) Next, based on internet research, I trained with the nn.StochasticGradient(net, criterion):train() feature. 100 iterations with a 1024,512,64,8,3,8,64,512,1024 network on 100 images took 730 sec and yielded 0.1137 from the MSECriterion. With CUDA, the same 100 iterations took 76 sec and yielded 0.1195 MSE. Below are more experimental results with CUDA for the nn.StochasticGradient approach. An MSE of 0.0452 gives objects of about the right size, but the shapes are still formless blobs.

Network layer sizes | 100 iter time (s) | 100 iter MSE | 1900 more iter time (s) | 1900 more iter MSE
1024,512,64,8,3,8,64,512,1024 | 76 | 0.1195 | 1422 | 0.0452
4096,512,64,8,3,8,64,512,4096 | 285 | 0.0770 | 5740 | 0.0447
512,64,8,3,8,64,512 | 38 | 0.1336 | 763 | 0.0527

3) In preparation for this project I had tested 2D image autoencoding with the conjugate gradient code provided in CS480 by Professor Chuck Anderson. Given that long training times with SGD were not yielding good results, I researched conjugate gradient in Torch and found that the 'optim' package supports both SGD and CG. With CUDA and the 512,64,8,3,8,64,512 network, 20,000 optim.sgd iterations ran in 355 sec and got 0.1021, while 2,000 SGD iterations took 34 seconds to get 0.1489. 200 CG iterations took 127 seconds and got 0.0295 MSE - the best result yet, and in a reasonable time!

4) And now for something completely different. During optim.cg testing my CUDA performance fell off for some reason. I've seen this before, and it may be related to the GPU clock getting stuck in a low-power, low-frequency mode. I thought the latest approved Ubuntu updates might help and agreed to the download, but then my system would no longer start the windowing system, and I lost wireless and Ethernet even in console mode. Fortunately, I had a backup of this project file that was only 1 hour of work old. After surfing for answers and trying various patches and edits, I ended up re-installing Ubuntu 16.04.2 and reloading CUDA and Torch (and Anaconda and Java). It only cost me about 8 wall-clock hours and maybe 4 hours of effort. Lesson learned - backing up your work is good! Sadly, after the reinstall, CUDA was still slower than the earlier numbers for a couple of days, then it got better for a day, then it got worse again. After the slowdown, 500 iterations with the explicit net:backward calls for 1024,512,64,8,3,8,64,512,1024 took 180 sec without CUDA and 90 seconds with CUDA (compared to 160 and 11.6 in note 1 above). The general recommendation on the internet for such a slowdown is to update drivers and/or reinstall Torch, but I don't want to risk breaking my system again.

5) Given the great results with optim.cg, I tested the sphere, cube, rod, and cone 100-image training with various parameters:

Network layer sizes | CG iters | Train iters | Time (s) | MSE | Comment
512,64,8,3,8,64,512 | 20 | 200 | 127 (old) | 0.0295 | Pre-slowdown CUDA
512,64,8,3,8,64,512 | 20 | 200 | 1069 (new) | 0.0205 | Newer CUDA behavior
512,64,8,3,8,64,512 | 20 | 5000 | 24882 | 0.0079 | Matched 100 simple shapes OK
1024,512,64,8,3,8,64,512,1024 | 20 | 200 | 2103 | 0.0272 | Worse than smaller
512,64,8,2,8,64,512 | 20 | 200 | 1093 | 0.0229 | 2 feature neurons
512,32,2,32,512 | 20 | 200 | 1037 | 0.0301 | Worse than larger
512,128,16,2,16,128,512 | 20 | 200 | 1051 | 0.0218 | Bit better than 512,64,8
512,128,32,8,2,8,32,128,512 | 20 | 200 | 1034 | 0.0381 | Worse than smaller
256,64,8,2,8,64,256 | 20 | 200 | 627 | 0.0358 | Worse than 512
256,64,8,2,8,64,256 | 10 | 200 | 322 | 0.0479 | Worse than 20 CG
256,64,8,2,8,64,256 | 40 | 100 | 630 | 0.0388 | Bit worse than 20/200

6) After the above experiments with the 4 simple shapes, I added the 6 shape pairs as planned and created the 600-image training set described in the Methodology section.

Network layer sizes | CG iters | Train iters | Time (s) | MSE | Comment
256,64,8,2,8,64,256 | 20 | 200 | 1998 | 0.0354 | Harder to converge complex shapes
512,128,16,3,16,128,512 | 20 | 2000 | ~30000 | ~0.0130 | Not quite done overnight
512,128,16,2,16,128,512 | 20 | 100 | 1820 | 0.0396 | Worse than 200 w/ 256 in layer
256,128,16,2,16,128,256 | 20 | 100 | 1050 | 0.0390 | Faster than 100 w/ 512
256,128,16,2,16,128,256 | 20 | 10 | 57 | 0.0496 | Initial set of 10
256,128,16,2,16,128,256 | 20 | 500 | 2926 | 0.0246 | Correct object size and count, but not shape
256,128,16,2,16,128,256 | 20 | 50 | 294 | 0.0500 | Randomness made worse than 10 iterations
256,128,16,2,16,128,256 | 20 | 50 | 36 | 0.0481 | CUDA's back! View random loss range
256,128,16,2,16,128,256 | 20 | 50 | 37 | 0.0467 | Repeat, compare to 0.0481
256,128,16,2,16,128,256 | 20 | 100 | 75 | 0.0390 | Baseline for param explore
128,128,16,2,16,128,128 | 20 | 100 | 18 | 0.0394 | Similar to 256
128,16,2,16,128 | 20 | 100 | 46 | 0.0391 | Similar to more layers
512,16,2,16,512 | 20 | 100 | 137 | 0.0305 | Slower but better
512,32,2,32,512 | 20 | 100 | 136 | 0.0315 | Maybe same
512,2,512 | 20 | 100 | 134 | 0.0380 | Worse, too few layers
512,32,2,32,512 | 20 | 100 | 136 | 0.0321 | Repeat, compare to 0.0315
512,16,2,16,512 | 20 | 100 | 135 | 0.0386 | Repeat, compare to 0.0305
2048,16,2,16,2048 | 20 | 100 | 494 | 0.0297 | More nodes seems better
2048,32,2,32,2048 | 20 | 100 | 500 | 0.0275 | 32 layer seems better
2048,64,2,64,2048 | 20 | 100 | 496 | 0.0288 | 32 layer seems better
1024,32,2,32,1024 | 20 | 100 | 241 | 0.0303 | Faster, a bit worse than 2048
1024,32,3,32,1024 | 20 | 100 | 245 | 0.0285 | 3 features have good variance in plots
1024,32,4,32,1024 | 20 | 100 | 245 | 0.0258 | 4 features have good variance in plots
2048,32,4,32,2048 | 20 | 100 | 499 | 0.0196 |
2048,32,4,32,2048 | 20 | 2000 | 9986 | 0.0061 | Shapes recognizable
2048,32,4,32,2048 | 20 | 7000 | 33499 | 0.0035 | Shape pairs look good
4096,64,4,64,4096 | 20 | 10 | n/a | n/a | Out of memory error
2048,32,4,32,2048 | 20 | 200 | 7322 | 0.0156 | Ack! CUDA's slow again!
2048,32,4,32,2048 | 20 | 900 | 33814 | 0.0145 | Final overnight training with slow CUDA

I ran a few more experiments after the CS480 course ended. I went ahead and updated Ubuntu with the latest changes, ensured CUDA was up to date and the libraries were correct in LD_LIBRARY_PATH, and reran 'luarocks install' on all the required Torch components; CUDA performance seems fast and stable for now. Note that the runs above and below often have different random sets of 600 images, as well as random network starting weights, so the same network can train to different MSEs, making comparisons difficult. In the end I prefer the 2048,1024,64,4,64,1024,2048 configuration - it converges well and has a good distribution across the 4-neuron bottleneck layer. I suspect the larger network may handle more naturally occurring shapes better, and the 64-neuron layer perhaps allows the 4-neuron bottleneck distribution to be processed better.

Network layer sizes | CG iters | Train iters | Time (s) | MSE | Comment
4096,64,4,64,4096 | 20 | 50 | n/a | n/a | Out of memory error
3100,64,4,64,3100 | 20 | 50 | n/a | n/a | Out of memory error
2600,64,4,64,2600 | 20 | 80 | n/a | n/a | Out of memory error
2048,64,4,64,2048 | 20 | 100 | 505 | 0.0208 | Fast CUDA! 64 seems worse than 32
2304,64,4,64,2304 | 20 | 90 | n/a | n/a | Out of memory error
2176,64,4,64,2176 | 20 | 95 | 537 | 0.0261 | Worse than 2048
2048,32,4,32,2048 | 20 | 100 | 500 | 0.0239 | Variability rather large
2176,64,4,64,2176 | 20 | 100 | 562 | 0.0304 | Initially worse than 2048
2176,64,4,64,2176 | 20 | 100 | ? | 0.0316 | Data down to the 2000 iter row from same run
2176,64,4,64,2176 | 20 | 200 | ? | 0.0272 |
2176,64,4,64,2176 | 20 | 360 | ? | 0.0195 | Compare to 100 iter, 0.0196 for 2048,32 above
2176,64,4,64,2176 | 20 | 590 | ? | 0.0155 | Compare to 200 iter, 0.0156 for 2048,32 above
2176,64,4,64,2176 | 20 | 700 | ? | 0.0145 | Compare to 900 iter, 0.0145 for 2048,32 above
2176,64,4,64,2176 | 20 | 900 | ? | 0.0133 |
2176,64,4,64,2176 | 20 | 1000 | ? | 0.0128 |
2176,64,4,64,2176 | 20 | 1500 | ? | 0.0100 |
2176,64,4,64,2176 | 20 | 2000 | 11239 | 0.0090 | Not as good as 2048,32 after 2000 iter
2176,1024,64,4,64,1024,2176 | 20 | 100 | n/a | n/a | Out of memory error
2048,1024,64,4,64,1024,2048 | 20 | 100 | 541 | 0.0174 | Different 600 images from 2176,64 run above
2048,1024,64,4,64,1024,2048 | 20 | 100 | ? | 0.0288 | Data down to 2000 iter row from same run
2048,1024,64,4,64,1024,2048 | 20 | 200 | ? | 0.0224 |
2048,1024,64,4,64,1024,2048 | 20 | 700 | ? | 0.0103 |
2048,1024,64,4,64,1024,2048 | 20 | 900 | ? | 0.0091 |
2048,1024,64,4,64,1024,2048 | 20 | 1000 | ? | 0.0086 |
2048,1024,64,4,64,1024,2048 | 20 | 1500 | ? | 0.0068 |
2048,1024,64,4,64,1024,2048 | 20 | 2000 | 10400 | 0.0061 |
2048,1024,64,4,64,1024,2048 | 20 | 3000 | ? | 0.0052 | Continuing 8000 more after 2000 start
2048,1024,64,4,64,1024,2048 | 20 | 7000 | ? | 0.0039 | Worse than 2048,32 result
2048,1024,64,4,64,1024,2048 | 20 | 10000 | 49977 | 0.0034 | Total time for 10,000 iters
2048,32,4,32,2048 | 20 | 100 | 503 | 0.0251 | Same 600 images from 2048,1024,64 run above
2048,32,4,32,2048 | 20 | 200 | ? | 0.0209 | Data from 100 iter to 10000 is same run
2048,32,4,32,2048 | 20 | 700 | ? | 0.0137 |
2048,32,4,32,2048 | 20 | 900 | ? | 0.0128 |
2048,32,4,32,2048 | 20 | 1000 | 5034 | 0.0121 |
2048,32,4,32,2048 | 20 | 2000 | ? | 0.0094 |
2048,32,4,32,2048 | 20 | 3000 | ? | 0.0084 |
2048,32,4,32,2048 | 20 | 7000 | ? | 0.0051 | Worse than previous 2048,32 and 2048,1024
2048,1024,64,4,64,1024,2048 | 20 | 16000 | 71876 | 0.0024 | New 600 images

Below is a plot of some of the above data to aid visualization. I compare 3 of the networks for which I gathered a fair number of trials.

In [35]:
Plot = require("itorch.Plot")
local plot = Plot()  -- Plot selected data from experiments above
plot:circle({0,100   ,100   ,200   ,900   ,2000  ,7000  ,100   ,200   ,700   ,900   ,1000  ,2000  ,3000  ,7000},
            {0,0.0196,0.0239,0.0156,0.0145,0.0061,0.0035,0.0251,0.0209,0.0137,0.0128,0.0121,0.0094,0.0084,0.0051},
            'red','2048,32')
plot:circle({0,100   ,100   ,200   ,360   ,590   ,700   ,900   ,1000  ,1500  ,2000},
            {0,0.0304,0.0316,0.0272,0.0195,0.0155,0.0145,0.0133,0.0128,0.0100,0.0090},
            'green','2176,64')
plot:circle({0,100   ,100   ,200   ,700   ,900   ,1000  ,1500,  2000,  3000,  7000, 10000},
            {0,0.0174,0.0288,0.0224,0.0103,0.0091,0.0086,0.0068,0.0061,0.0052,0.0039,0.0034},
            'blue','2048,1024,64')
plot:title('2048,32 vs 2176,64 vs 2048,1024,64')
plot:xaxis('Training iterations'):yaxis('MSE')
plot:legend(true)
plot:draw()
itorch.image("MSE.png") -- note that 'plot' is not saved in notebook, so load in a saved copy

In addition to the testing above, I also tried mixing SGD and CG a bit, to see if some SGD iterations might help CG move away from a local minimum, but I didn't find it very helpful. I ran some more training during the day, hence the final MSE below is better than the last entry in the chart above. Note that when CUDA runs fast, overnight training reaches trained images with good quality and an MSE of 0.0035.
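The mixing experiments were along these lines (a minimal sketch, assuming the stepsgd and stepcg functions defined above; this is not the exact schedule I used):

In [ ]:
-- Minimal sketch of interleaving SGD and CG epochs (assumes stepsgd/stepcg above).
for i = 1, 20 do
    for j = 1, 5 do stepsgd() end    -- a few cheap SGD passes over the data
    local loss = stepcg()            -- then one CG pass over the data
    print(string.format('Mixed epoch %d, CG loss: %4f', i, loss))
end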

In [22]:
predict = net:forward(X24)  -- Use the network on the training data
print("MSE for trained network used in remaining notebook cells: "..criterion:forward(predict, X24))
for i = 1,91,10 do  -- Print 1 of each of 10 shapes and predicted shape
    itorch.image(X24[i*6]:reshape(24,24*24))
    itorch.image(predict[i*6]:reshape(24,24*24))
end
-- The shapes shown are: sphere, cube, rod, cone, 2 spheres, 2 cubes, 2 cones, 
--                       large sphere and small cube, large cube and small cone, large cone and small sphere
Out[22]:
MSE for trained network used in remaining notebook cells: 0.0024338632356375	

Below are diagrams showing the position of the 600 images in X24 in the 4-dimensional feature space. I plot neuron 1 vs neuron 2 in one plot and neuron 3 vs neuron 4 in the other. The pair of 2D plots is a simplification of the full 4-dimensional space, but it gives a sense of the distribution of the images. As the network trained, the 4-dimensional space seemed to be used more fully and become less sparse - an indication that the network was learning more useful ways to condense the inputs down to 4 dimensions. Note that a goal of this project is to have an expressive but dense set of feature neurons: I want a space where any random location maps to a 'reasonable' image given the training inputs. Four neurons seems to be a good number for this problem - the space still has some sparsity, indicating there is room for more learning and differentiation, but 4 values give good training results and are not too sparse, as will be explored next.

In [24]:
predict = net:forward(X24)
features = net:get(9).output
spheres=features[{{1,36},{}}]
cubes=features[{{37,90},{}}]
rods=features[{{91,144},{}}]
cones=features[{{145,198},{}}]
spheresphere=features[{{199,264},{}}]
cubecube=features[{{265,330},{}}]
conecone=features[{{331,396},{}}]
spherecube=features[{{397,462},{}}]
cubecone=features[{{463,528},{}}]
conesphere=features[{{529,600},{}}]
Plot = require("itorch.Plot")
local plot = Plot()
plot:circle(spheres[{{},1}],spheres[{{},2}],'red','Spheres')
plot:circle(cubes[{{},1}],cubes[{{},2}],'green','Cubes')
plot:circle(rods[{{},1}],rods[{{},2}],'blue','Rods')
plot:circle(cones[{{},1}],cones[{{},2}],'black','Cones')
plot:circle(spheresphere[{{},1}],spheresphere[{{},2}],'cyan','SphereSphere')
plot:circle(cubecube[{{},1}],cubecube[{{},2}],'magenta','CubeCube')
plot:circle(conecone[{{},1}],conecone[{{},2}],'yellow','ConeCone')
plot:circle(spherecube[{{},1}],spherecube[{{},2}],'orange','SphereCube')
plot:circle(cubecone[{{},1}],cubecone[{{},2}],'purple','CubeCone')
plot:circle(conesphere[{{},1}],conesphere[{{},2}],'grey','ConeSphere')
plot:title('Feature neuron values for training images')
plot:xaxis('Neuron 1'):yaxis('Neuron 2')
plot:legend(true)
plot:draw()
plot = Plot()
plot:circle(spheres[{{},3}],spheres[{{},4}],'red','Spheres')
plot:circle(cubes[{{},3}],cubes[{{},4}],'green','Cubes')
plot:circle(rods[{{},3}],rods[{{},4}],'blue','Rods')
plot:circle(cones[{{},3}],cones[{{},4}],'black','Cones')
plot:circle(spheresphere[{{},3}],spheresphere[{{},4}],'cyan','SphereSphere')
plot:circle(cubecube[{{},3}],cubecube[{{},4}],'magenta','CubeCube')
plot:circle(conecone[{{},3}],conecone[{{},4}],'yellow','ConeCone')
plot:circle(spherecube[{{},3}],spherecube[{{},4}],'orange','SphereCube')
plot:circle(cubecone[{{},3}],cubecone[{{},4}],'purple','CubeCone')
plot:circle(conesphere[{{},3}],conesphere[{{},4}],'grey','ConeSphere')
plot:title('Feature neuron values for training images')
plot:xaxis('Neuron 3'):yaxis('Neuron 4')
plot:legend(true)
plot:draw()
itorch.image("Neurons12.png") -- 'plot' is not saved in notebook, so load in a saved copy
itorch.image("Neurons34.png") -- 'plot' is not saved in notebook, so load in a saved copy

Now we have observed the values of the 4 feature neurons from the trained autoencoder. Torch provides a way to set the weights and bias values in one network based on the training of another. First, we create a 4-input network whose layers to the output match the decoder half of the autoencoder (see the network diagram in the introduction for reference).

In [25]:
gen = nn.Sequential()
gen:add(nn.Linear(4,64))
gen:add(nn.Tanh())
gen:add(nn.Linear(64,1024))
gen:add(nn.Tanh())
gen:add(nn.Linear(1024,2048))
gen:add(nn.Tanh())
gen:add(nn.Linear(2048, 24*24*24))
gen:add(nn.Reshape(24,24,24))

Next we simply copy the parameters from the autoencoder output layers to the new network that can be used to generate 3D images.

In [26]:
print(net:parameters())  -- Print weight and bias tensors for autoencoder network
print(gen:parameters())  -- Print weight and bias tensors for generation network 
for i = 1,8 do  -- Assign weights and biases in generation network from trained autoencoder
    gen:parameters()[i]:copy(net:parameters()[i+8])
end
Out[26]:
{
  1 : CudaTensor - size: 2048x13824
  2 : CudaTensor - size: 2048
  3 : CudaTensor - size: 1024x2048
  4 : CudaTensor - size: 1024
  5 : CudaTensor - size: 64x1024
  6 : CudaTensor - size: 64
  7 : CudaTensor - size: 4x64
  8 : CudaTensor - size: 4
  9 : CudaTensor - size: 64x4
  10 : CudaTensor - size: 64
  11 : CudaTensor - size: 1024x64
  12 : CudaTensor - size: 1024
  13 : CudaTensor - size: 2048x1024
  14 : CudaTensor - size: 2048
  15 : CudaTensor - size: 13824x2048
  16 : CudaTensor - size: 13824
}
{
  1 : CudaTensor - size: 2048x13824
  2 : CudaTensor - size: 2048
  3 : CudaTensor - size: 1024x2048
  4 : CudaTensor - size: 1024
  5 : CudaTensor - size: 64x1024
  6 : CudaTensor - size: 64
  7 : CudaTensor - size: 4x64
  8 : CudaTensor - size: 4
  9 : CudaTensor - size: 64x4
  10 : CudaTensor - size: 64
  11 : CudaTensor - size: 1024x64
  12 : CudaTensor - size: 1024
  13 : CudaTensor - size: 2048x1024
  14 : CudaTensor - size: 2048
  15 : CudaTensor - size: 13824x2048
  16 : CudaTensor - size: 13824
}
{
  1 : DoubleTensor - size: 64x4
  2 : DoubleTensor - size: 64
  3 : DoubleTensor - size: 1024x64
  4 : DoubleTensor - size: 1024
  5 : DoubleTensor - size: 2048x1024
  6 : DoubleTensor - size: 2048
  7 : DoubleTensor - size: 13824x2048
  8 : DoubleTensor - size: 13824
}
{
  1 : DoubleTensor - size: 64x4
  2 : DoubleTensor - size: 64
  3 : DoubleTensor - size: 1024x64
  4 : DoubleTensor - size: 1024
  5 : DoubleTensor - size: 2048x1024
  6 : DoubleTensor - size: 2048
  7 : DoubleTensor - size: 13824x2048
  8 : DoubleTensor - size: 13824
}

To test this approach, below are the 4 feature neuron values for one of the more complex images (a large cube and small cone). The cell after that generates 8 images: the first is the original 24x24x24 training input to the autoencoder; the second is the output of the autoencoder for that image; the 3rd shows that the generator network produces exactly the same output as the autoencoder when fed the same 4 feature values; the 4th through 8th show how the generated image changes as one of the feature inputs is varied. As training progresses, the 4-coordinate feature value for an image moves around a bit as improved feature mappings are found.

In [27]:
print(net:get(9).output[486])
Out[27]:
-0.1864
-0.8316
 0.6090
-0.0173
[torch.CudaTensor of size 4]

In [28]:
itorch.image(X24[486]:reshape(24,24*24))  -- original image
itorch.image(net.output[486]:reshape(24,24*24))
for i=0,5 do -- Check that gen outputs same image with same inputs, then vary 1 input a bit
    itorch.image(gen:forward(torch.Tensor({-0.1864,-0.8316,0.6090,-0.0173+0.2*i})):reshape(24,24*24))
end

To explore the image generation more, below are 10 images generated from the 'gen' network. The first is the image from location 0,0,0,0 and the other 9 are images generated from random inputs to 'gen'.

In [29]:
gen_10 = torch.Tensor(15,size/2,size/2,size/2)
gen_10[1] = gen:forward(torch.Tensor({0,0,0,0})) -- Center of generaton space
for i = 2,10 do -- 9 'random' generated images
    gen_10[i] = gen:forward(torch.Tensor({torch.uniform()*2-1,torch.uniform()*2-1,
                         torch.uniform()*2-1,torch.uniform()*2-1}))
end
for i = 1,10 do
    itorch.image(gen_10[i]:reshape(24,24*24))
end

The final step for the 48x48x48 image generation is to decompress the 24x24x24 outputs from 'gen' into the final desired format. The step below is an algorithm that I played around with a bit (using various weightings for the original voxel value and its neighbor voxel values). In the end, the algorithm expands each 'gen' output voxel into 8 voxels and cleans up some of the grey background from the network output.

In [30]:
-- Regenerate 48x48x48 images from 24x24x24 images - do for
-- original X data, autoencoder output, and generative output.
-- To save time, just do 10 examples of each type for viewing.
X24_10 = torch.Tensor(10,size/2,size/2,size/2)
X_10 = torch.Tensor(10,size,size,size)
X48 = torch.Tensor(10,size,size,size)
predict_10 = torch.Tensor(10,size/2,size/2,size/2)
predict48 = torch.Tensor(10,size,size,size)
gen48 = torch.Tensor(10,size,size,size)
neighbors = torch.Tensor(3,3,3)
voxelavg = function(orig,corner,edge1,edge2,edge3,face1,face2,face3)
    retval = orig - 0.4 + (corner*2+edge1+edge2+edge3+face1+face2+face3)/9
    if (retval > 1.0) then     -- strongly on voxel
        return 1.0
    elseif (retval > 0.5) then -- partly on voxel
        return retval
    else
        return 0               -- off voxel
    end
end
decompress = function(in24,out48,i)
    for x = 1,size/2 do for y = 1,size/2 do for z = 1,size/2 do
        -- Process X24 into X48 to evaluate this decompression step
        neighbors:fill(0)   -- Start all dark, applies to 'out-of-bounds' values
        for dx = -1,1 do for dy = -1,1 do for dz = -1,1 do
            if (x+dx > 0 and x+dx <= size/2 and y+dy >0 and y+dy <= size/2 
                                    and z+dz > 0 and z+dz <= size/2) then
                neighbors[dx+2][dy+2][dz+2] = in24[i][x+dx][y+dy][z+dz]                            
            end
        end end end
        for dx = -1,0 do for dy = -1,0 do for dz = -1,0 do -- decompress to 8 voxels 
            out48[i][x*2+dx][y*2+dy][z*2+dz] = voxelavg(in24[i][x][y][z],
                    neighbors[3+2*dx][3+2*dy][3+2*dz],
                    neighbors[3+2*dx][3+2*dy][2],neighbors[3+2*dx][2][3+2*dz],neighbors[2][3+2*dy][3+2*dz],
                    neighbors[3+2*dx][2][2],neighbors[2][3+2*dy][2],neighbors[2][2][3+2*dz])
        end end end
    end end end
end
In [31]:
-- TIME NOTE: This cell takes about 2 minutes on my laptop
for i = 1,10 do 
    X24_10[i]:copy(X24[i*60-54])  -- Sample each shape type 
    X_10[i]:copy(X[i*10-9])       -- One of the 100 originals
    predict_10[i]:copy(predict[i*60-54])
    decompress (X24_10,X48,i)
    decompress (predict_10,predict48,i)
    decompress (gen_10,gen48,i)
end
In [32]:
X_10=X_10:cuda()
X48=X48:cuda()
predict48=predict48:cuda()
print("Test criterion of same matrix: "..criterion:forward(X_10,X_10))
print("Test compress/decompress result: "..criterion:forward(X_10,X48))
print("Test decompressed autoencoder results: "..criterion:forward(X_10,predict48))
Out[32]:
Test criterion of same matrix: 0	
Test compress/decompress result: 0.0031157785560936	
Test decompressed autoencoder results: 0.0059013832360506	
In [33]:
-- Show example 48x48x48 images
itorch.image(X_10[9])  -- Original cube and cone is interesting (spheres are 'easy' to learn)
itorch.image(X24_10[9]:reshape(24,24*24))  -- 24x24x24 image from 48x48x48 original
itorch.image(X48[9])                       -- Reconstructed original (no neural net, just decompression)
itorch.image(predict_10[9]:reshape(24,24*24)) -- 24x24x24 autoencoder output same cube and cone
itorch.image(predict48[9])                 -- Reconstructed 48x48x48 from autoencoder output
itorch.image(gen_10[2]:reshape(24,24*24))  -- Randomly generated image from 'gen' network
itorch.image(gen48[2])                     -- 48x48x48 version of random 'gen' output
itorch.image(gen_10[3]:reshape(24,24*24))  -- 2nd randomly generated image from 'gen' network
itorch.image(gen48[3])                     -- 48x48x48 version of 2nd random 'gen' output

Conclusion

This approach presents a valid way to generate 3D shapes that fall within a broadly defined group of learned objects, as well as a way to generate similar objects relative to an existing sample object. The input and usage mechanisms were created so that a new set of 48x48x48 input shapes could be provided for training this system. There are various stages of Torch code used by this algorithm that could be optimized, but the main need for optimization is the training loop using conjugate gradients. Even when CUDA is working well, I expect there remain opportunities for improved performance through good compiler or library coding techniques. This code can be used to augment sparse image repositories in various useful ways.
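As one small example of the kind of training-loop optimization available, the stepcg/stepsgd functions above allocate a new CUDA tensor for every mini-batch; a preallocated buffer could be reused instead. This is a sketch of the idea only (not something I measured), using the default batch size of 1000 from the cells above:

In [ ]:
-- Sketch of reusing a preallocated CUDA buffer for mini-batches (not measured).
batch_size = 1000
batch_buffer = torch.CudaTensor(batch_size, 24, 24, 24)
-- Inside the batch loop, use a narrowed view of the buffer instead of a new tensor:
--   local inputs = batch_buffer:narrow(1, 1, size)
--   for i = 1,size do inputs[i]:copy(dataset[shuffle[i+t-1]][1]) end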

References

[1] CLGEN site with overview diagrams, slide deck, and code: https://github.com/ChrisCummins/clgen

[2] Main page for Torch: http://torch.ch/

[3] Tables with numpy to Torch conversions: https://github.com/torch/torch7/wiki/Torch-for-Numpy-users

[4] Description of the Torch nn class: https://github.com/torch/nn/blob/master/doc/overview.md#nn.overview.dok

[5] Torch tutorial on nn and cuda: http://ronan.collobert.com/torch/

[6] Overview of Torch cudnn modules (like LSTM): https://github.com/soumith/cudnn.torch

[7] Documentation on Torch nn modules https://github.com/torch/nn/blob/master/doc/convolution.md#nn.TemporalConvolution

[8] Example usage for Torch cudnn.LSTM and other modules: https://github.com/soumith/cudnn.torch/blob/master/test/test_rnn.lua

[9] A Torch autoencoder: http://rnduja.github.io/2015/11/06/torch-autoencoder/

[10] Example of training a network in torch: https://github.com/torch/nn/blob/master/doc/training.md

[11] Example of using optim package for training a network: https://github.com/vic-w/torch-practice/blob/master/mnist/trainMnist.lua

[12] Description of optimization algorithms for optim in Torch: https://github.com/torch/optim/blob/master/doc/algos.md

Appendix A: Torch performance for CPU, GPU, SGD, and CG

My experiments for this project were informed by some testing I did with CS480 HW3 using Torch with CPU and GPU modes.

These experiments were run on my laptop, which has this hardware:

- Intel Core i7-6700HQ @2.6GHz (4 cores, 8 threads, 3.5GHz max)
- NVIDIA GeForce GTX 965M (1,024 CUDA cores, 1.15GHz max)

The neural network model from HW3 had 18 inputs, 1 layer of 15 neurons, 1 layer of 10 neurons, and 1 output. It was trained for 1000 iterations. (The HW3 code uses SCG instead of SGD for the backpropagation, which likely improves the rate of convergence but shouldn't affect the performance comparisons discussed in this appendix.)
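For reference, the Torch version of this small network used in the timing comparisons looked roughly like the following (a hypothetical reconstruction, not the HW3 code itself):

In [ ]:
-- Hypothetical Torch sketch of the HW3-sized network (18 inputs, hidden layers of
-- 15 and 10 neurons, 1 output) used for the CPU/GPU timing comparisons.
require 'nn'
hw3net = nn.Sequential()
hw3net:add(nn.Linear(18, 15))
hw3net:add(nn.Tanh())
hw3net:add(nn.Linear(15, 10))
hw3net:add(nn.Tanh())
hw3net:add(nn.Linear(10, 1))
hw3criterion = nn.MSECriterion()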

- HW3 code using SCG from NeuralNetworks.py: 4.02s
- HW3 code using SGD from NeuralNetworks.py: 2.03s
- Torch CPU using SGD: 2.05s
- Torch GPU using SGD: 0.42s
- HW3 code for SCG trained to same RMSE as SGD: 0.19s

As another note, I compared CPU and CUDA matrix multiplies in Torch: a 1,000x200,000 matrix times a 200,000x1,000 matrix, producing a 1,000x1,000 result. The GPU was over 10X faster:

- CPU: 2.79s
- GPU: 0.21s

This performance difference was visible in some of the results in the main report above, although CUDA sometimes seems to get stuck in a low-performance mode on my laptop. Here are some sample experiments repeated for easy reference.

- Train a network with 13,824 inputs through 9 layers using updateParameters loop:
  - CPU time: 160 seconds
  - GPU time: 11.6 seconds
- Train nn.StochasticGradient:
  - CPU time: 730 seconds
  - GPU time: 76 seconds
In [7]:
-- Example code showing how to set up matrices for CUDA use in Torch
a=torch.rand(1000,200000)
b=torch.rand(200000,1000)
c=torch.Tensor(1000,1000)
timer = torch.Timer() -- the Timer starts to count now
c:mm(a,b)
print(c[5][5])
print('Time elapsed using CPU: ' .. timer:time().real .. ' seconds')
a=torch.rand(1000,200000):cuda()
b=torch.rand(200000,1000):cuda()
c=torch.Tensor(1000,1000):cuda()
timer = torch.Timer() -- the Timer starts to count now
c:mm(a,b)
print(c[4][4])
print('Time elapsed using CUDA: ' .. timer:time().real .. ' seconds')
a=torch.rand(10,20):cuda() -- shrink arrays so large arrays are garbage collected
b=torch.rand(20,10):cuda()
c=torch.Tensor(10,10):cuda()
Out[7]:
50136.746942773	
Time elapsed using CPU: 4.71653008461 seconds	
Out[7]:
49769.2421875	
Time elapsed using CUDA: 0.21396207809448 seconds	

Appendix B: LSTM experiments in Torch

For CS480 HW3 I analyzed stock price correlations, and I thought that would be a good data set for testing LSTM usage in Torch. For part of this project I had wanted to test using an LSTM to generate 3D images, since CLGEN [1] had success with that approach. In the end, I could not get meaningful learning to occur using either the cudnn.LSTM or the rnn LSTM package. I suspect something about the data organization or the LSTM parameter set was not correct in my code. In any case, for 3D images I expect LSTMs would not have been ideal anyway, so I shifted my focus to the networks discussed in the main body of this project report. Below is some of the code I was attempting to use for LSTM training, based on references [6], [7], and [8].

In [ ]:
-- This cell is example code and may not run well if attempted.
require 'cudnn'
lstm = cudnn.LSTM(18, 9, 3)
lstm = lstm:cuda()
print(lstm)
rnn = cudnn.RNN(18,9,3)
rnn = rnn:cuda()
print(rnn)
module = nn.Module(18,9,3)
module = module:cuda()
print(module)
LSTMtrain=torch.Tensor(90,16,18):cuda()
LSTMT=torch.Tensor(90,16,1):cuda()
for i = 1,90 do
    for j = 1,16 do
        LSTMtrain[{i,j,{}}] = Xtrain[{i*16+j-16,{}}]
        LSTMT[{i,j,1}] = T[i*16+j-16]
    end
end
w,dw = lstm:getParameters()
criterion = nn.MSECriterion():cuda()
timer = torch.Timer() -- the Timer starts to count now
for i = 0, 1000 do     -- Train for 1000 epochs
  -- reset LSTM state
  lstm:reset()
  -- feed it to the neural network and the criterion
  criterion:forward(lstm:forward(LSTMtrain), LSTMT)

  -- train over this example in 3 steps
  -- (1) zero the accumulation of the gradients
  dw:zero()
  -- (2) accumulate gradients
  lstm:backward(LSTMtrain, criterion:backward(lstm.output, LSTMT))
  -- (3) update parameters with a 0.02 learning rate
  lstm:updateParameters(0.02)
end
print('Time elapsed: ' .. timer:time().real .. ' seconds')