Convolutional Autoencoder

Sticking with the MNIST dataset, let's improve our autoencoder's performance using convolutional layers. Again, we start by loading the modules and the data.

In [ ]:
%matplotlib inline

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
In [ ]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', validation_size=0)
In [3]:
img = mnist.train.images[2]
plt.imshow(img.reshape((28, 28)), cmap='Greys_r')
Out[3]:
<matplotlib.image.AxesImage at 0x1280806d8>

Network Architecture

The encoder part of the network will be a typical convolutional pyramid. Each convolutional layer will be followed by a max-pooling layer to reduce the dimensions of the layers. The decoder though might be something new to you. The decoder needs to convert from a narrow representation to a wide reconstructed image. For example, the representation could be a 4x4x8 max-pool layer. This is the output of the encoder, but also the input to the decoder. We want to get a 28x28x1 image out from the decoder so we need to work our way back up from the narrow decoder input layer. A schematic of the network is shown below.

Here our final encoder layer has size 4x4x8 = 128. The original images have size 28x28 = 784, so the encoded vector is roughly 16% the size of the original image. These are just suggested sizes for each of the layers. Feel free to change the depths and sizes, but remember our goal here is to find a small representation of the input data.

What's going on with the decoder

Okay, so the decoder has these "Upsample" layers that you might not have seen before. First off, I'll discuss a bit what these layers aren't. Usually, you'll see transposed convolution layers used to increase the width and height of the layers. They work almost exactly the same as convolutional layers, but in reverse. A stride in the input layer results in a larger stride in the transposed convolution layer. For example, if you have a 3x3 kernel, a 3x3 patch in the input layer will be reduced to one unit in a convolutional layer. Comparatively, one unit in the input layer will be expanded to a 3x3 patch in a transposed convolution layer. The TensorFlow API provides us with an easy way to create the layers, tf.nn.conv2d_transpose.
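For reference, here is a minimal sketch of what such a layer could look like with tf.nn.conv2d_transpose. It isn't used in this notebook; the tensor and filter names are just placeholders for illustration.

# Sketch only: upsample a 4x4x8 tensor to 8x8x8 with a stride-2 transposed convolution.
# For conv2d_transpose the filter shape is [height, width, out_channels, in_channels].
x = tf.placeholder(tf.float32, (None, 4, 4, 8))
filt = tf.Variable(tf.truncated_normal((3, 3, 8, 8), stddev=0.1))
batch = tf.shape(x)[0]
upsampled = tf.nn.conv2d_transpose(x, filt,
                                   output_shape=tf.stack([batch, 8, 8, 8]),
                                   strides=[1, 2, 2, 1],
                                   padding='SAME')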

However, transposed convolution layers can lead to artifacts in the final images, such as checkerboard patterns. This is due to overlap in the kernels, which can be avoided by setting the stride and kernel size equal. In this Distill article from Augustus Odena et al., the authors show that these checkerboard artifacts can be avoided by resizing the layers using nearest neighbor or bilinear interpolation (upsampling) followed by a convolutional layer. In TensorFlow, this is easily done with tf.image.resize_images, followed by a convolution. Be sure to read the Distill article to get a better understanding of deconvolutional layers and why we're using upsampling.
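As a minimal sketch of that resize-then-convolve pattern (again, the names are just for illustration and this snippet isn't part of the network we build below):

# Sketch only: nearest-neighbor upsampling followed by a regular convolution.
small = tf.placeholder(tf.float32, (None, 7, 7, 8))
upsampled = tf.image.resize_images(small, size=(14, 14),
                                   method=tf.image.ResizeMethod.NEAREST_NEIGHBOR)
conv = tf.layers.conv2d(upsampled, filters=8, kernel_size=(3, 3),
                        padding='same', activation=tf.nn.relu)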

Exercise: Build the network shown above. Remember that a convolutional layer with strides of 1 and 'same' padding won't reduce the height and width. That is, if the input is 28x28 and the convolution layer has stride = 1 and 'same' padding, the convolutional layer will also be 28x28. The max-pool layers are used to reduce the width and height. A stride of 2 will reduce the size by a factor of 2. Odena et al. claim that nearest neighbor interpolation works best for the upsampling, so make sure to include that as a parameter in tf.image.resize_images or use tf.image.resize_nearest_neighbor. For convolutional layers, use tf.layers.conv2d. For example, you would write conv1 = tf.layers.conv2d(inputs, 32, (5,5), padding='same', activation=tf.nn.relu) for a layer with a depth of 32, a 5x5 kernel, a stride of (1,1), 'same' padding, and a ReLU activation. Similarly, for the max-pool layers, use tf.layers.max_pooling2d.
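For instance, the pooling call paired with that convolution could look like the following sketch (the variable names are placeholders):

# Sketch only: a 2x2 max-pool with stride 2 halves the height and width of conv1.
maxpool1 = tf.layers.max_pooling2d(conv1, pool_size=(2, 2), strides=(2, 2), padding='same')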

In [4]:
mnist.train.images.shape[1]
Out[4]:
784
In [5]:
image_size = mnist.train.images.reshape((mnist.train.images.shape[0], 28, 28)).shape[1:]
print('image size: ', image_size)
image size:  (28, 28)
In [6]:
learning_rate = 0.001

image_size = mnist.train.images.shape[1]
print('image size: ', image_size)

# Input and target placeholders
inputs_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='inputs')
print("This is our placeholder for the inputs: ", inputs_)

targets_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='targets')
print("This is our placeholder for the targets: ", targets_)
image size:  784
This is our placeholder for the inputs:  Tensor("inputs:0", shape=(?, 28, 28, 1), dtype=float32)
This is our placeholder for the targets:  Tensor("targets:0", shape=(?, 28, 28, 1), dtype=float32)
In [7]:
### Encoder

# Convolutional Layer in Tensorflow example:
# conv1 = tf.layers.conv2d(
#         inputs=input_layer,
#         filters=32,
#         kernel_size=[5, 5],
#         padding="same",
#         activation=tf.nn.relu)
conv1 = tf.layers.conv2d(inputs=inputs_, filters=16, kernel_size=(3, 3), padding='same', activation=tf.nn.relu)
print("Convolution Layer conv1", conv1)
print('Now 28x28x16')

# Max Pooling layer
# max_pooling2d(
#    inputs,
#    pool_size,
#    strides,
#    padding='valid',
#    data_format='channels_last',
#    name=None)
maxpool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=2, strides=2, padding='same')
print("Maxpooling Layer maxpoo11", maxpool1)
print('Now 14x14x16')

conv2 = tf.layers.conv2d(inputs=maxpool1, filters=8, kernel_size=3, padding='same', activation=tf.nn.relu)
print("Convolutinal Layer conv2", conv2)
print('Now 14x14x8')

maxpool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=2, strides=2, padding='same')
print("Maxpooling Layer maxpoo12", maxpool2)
print('Now 7x7x8')

conv3 = tf.layers.conv2d(inputs=maxpool2, filters=8, kernel_size=3, padding='same', activation=tf.nn.relu)
print("Convolutinal Layer conv3", conv3)
print('Now 7x7x8')

encoded = tf.layers.max_pooling2d(inputs=conv3, pool_size=2, strides=2, padding='same')
print("Encoded", encoded)
print('Now 4x4x8')
Convolution Layer conv1 Tensor("conv2d/Relu:0", shape=(?, 28, 28, 16), dtype=float32)
Now 28x28x16
Maxpooling Layer maxpool1 Tensor("max_pooling2d/MaxPool:0", shape=(?, 14, 14, 16), dtype=float32)
Now 14x14x16
Convolutional Layer conv2 Tensor("conv2d_2/Relu:0", shape=(?, 14, 14, 8), dtype=float32)
Now 14x14x8
Maxpooling Layer maxpool2 Tensor("max_pooling2d_2/MaxPool:0", shape=(?, 7, 7, 8), dtype=float32)
Now 7x7x8
Convolutional Layer conv3 Tensor("conv2d_3/Relu:0", shape=(?, 7, 7, 8), dtype=float32)
Now 7x7x8
Encoded Tensor("max_pooling2d_3/MaxPool:0", shape=(?, 4, 4, 8), dtype=float32)
Now 4x4x8
In [8]:
### Decoder

#resize_images(
#    images,
#    size,
#    method=ResizeMethod.BILINEAR,
#    align_corners=False
#)
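# Note: with no method argument, resize_images defaults to bilinear interpolation
# (hence the ResizeBilinear ops in the output below). To follow the nearest-neighbor
# suggestion from the exercise, you could pass method=tf.image.ResizeMethod.NEAREST_NEIGHBOR
# or use tf.image.resize_nearest_neighbor instead.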
upsample1 = tf.image.resize_images(images=encoded, size=(7,7))
print("Resize Layer  upsample1", upsample1)
print('Now 7x7x8')

conv4 = tf.layers.conv2d(inputs=upsample1, filters=8, kernel_size=3, padding='same', activation=tf.nn.relu)
print("Convolutinal Layer conv4", conv4)
print('Now 7x7x8')

upsample2 = tf.image.resize_images(images=conv4, size=(14, 14))
print("Resize Layer upsample2", upsample2)
print('Now 14x14x8')

conv5 = tf.layers.conv2d(inputs=upsample2, filters=8, kernel_size=3, padding='same', activation=tf.nn.relu)
print("Convolutinal Layer conv5", conv5)
print('Now 14x14x8')

upsample3 = tf.image.resize_images(images=conv5, size=(28, 28))
print("Resize Layer upsample3", upsample3)
print('Now 28x28x8')

conv6 = tf.layers.conv2d(inputs=upsample3, filters=16, kernel_size=3, padding='same', activation=tf.nn.relu)
print("Convolutinal Layer conv6", conv6)
print('Now 28x28x16')

logits = tf.layers.conv2d(inputs=conv6, filters=1, kernel_size=3, padding='same')
print("Logits", logits)
print('Now 28x28x1')

# Pass logits through sigmoid to get reconstructed image
decoded = tf.nn.sigmoid(logits)
print("Decoded", decoded)

# Pass logits through sigmoid and calculate the cross-entropy loss
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=targets_)
print("Loss", loss)

# Get cost and define the optimizer
cost = tf.reduce_mean(loss)
print("Cost", cost)
opt = tf.train.AdamOptimizer(learning_rate).minimize(cost)
print("Opt", opt)
Resize Layer  upsample1 Tensor("ResizeBilinear:0", shape=(?, 7, 7, 8), dtype=float32)
Now 7x7x8
Convolutional Layer conv4 Tensor("conv2d_4/Relu:0", shape=(?, 7, 7, 8), dtype=float32)
Now 7x7x8
Resize Layer upsample2 Tensor("ResizeBilinear_1:0", shape=(?, 14, 14, 8), dtype=float32)
Now 14x14x8
Convolutional Layer conv5 Tensor("conv2d_5/Relu:0", shape=(?, 14, 14, 8), dtype=float32)
Now 14x14x8
Resize Layer upsample3 Tensor("ResizeBilinear_2:0", shape=(?, 28, 28, 8), dtype=float32)
Now 28x28x8
Convolutional Layer conv6 Tensor("conv2d_6/Relu:0", shape=(?, 28, 28, 16), dtype=float32)
Now 28x28x16
Logits Tensor("conv2d_7/BiasAdd:0", shape=(?, 28, 28, 1), dtype=float32)
Now 28x28x1
Decoded Tensor("Sigmoid:0", shape=(?, 28, 28, 1), dtype=float32)
Loss Tensor("logistic_loss:0", shape=(?, 28, 28, 1), dtype=float32)
Cost Tensor("Mean:0", shape=(), dtype=float32)
Opt name: "Adam"
op: "NoOp"
input: "^Adam/update_conv2d/kernel/ApplyAdam"
input: "^Adam/update_conv2d/bias/ApplyAdam"
input: "^Adam/update_conv2d_1/kernel/ApplyAdam"
input: "^Adam/update_conv2d_1/bias/ApplyAdam"
input: "^Adam/update_conv2d_2/kernel/ApplyAdam"
input: "^Adam/update_conv2d_2/bias/ApplyAdam"
input: "^Adam/update_conv2d_3/kernel/ApplyAdam"
input: "^Adam/update_conv2d_3/bias/ApplyAdam"
input: "^Adam/update_conv2d_4/kernel/ApplyAdam"
input: "^Adam/update_conv2d_4/bias/ApplyAdam"
input: "^Adam/update_conv2d_5/kernel/ApplyAdam"
input: "^Adam/update_conv2d_5/bias/ApplyAdam"
input: "^Adam/update_conv2d_6/kernel/ApplyAdam"
input: "^Adam/update_conv2d_6/bias/ApplyAdam"
input: "^Adam/Assign"
input: "^Adam/Assign_1"

Training

As before, here we'll train the network. Instead of flattening the images though, we can pass them in as 28x28x1 arrays.

In [9]:
sess = tf.Session()
In [10]:
epochs = 20
batch_size = 200
sess.run(tf.global_variables_initializer())
for e in range(epochs):
    for ii in range(mnist.train.num_examples//batch_size):
        batch = mnist.train.next_batch(batch_size)
        imgs = batch[0].reshape((-1, 28, 28, 1))
        batch_cost, _ = sess.run([cost, opt], feed_dict={inputs_: imgs,
                                                         targets_: imgs})

    print("Epoch: {}/{}...".format(e+1, epochs),
          "Training loss: {:.4f}".format(batch_cost))
Epoch: 1/20... Training loss: 0.1692
Epoch: 2/20... Training loss: 0.1439
Epoch: 3/20... Training loss: 0.1297
Epoch: 4/20... Training loss: 0.1258
Epoch: 5/20... Training loss: 0.1244
Epoch: 6/20... Training loss: 0.1163
Epoch: 7/20... Training loss: 0.1139
Epoch: 8/20... Training loss: 0.1124
Epoch: 9/20... Training loss: 0.1123
Epoch: 10/20... Training loss: 0.1098
Epoch: 11/20... Training loss: 0.1124
Epoch: 12/20... Training loss: 0.1124
Epoch: 13/20... Training loss: 0.1083
Epoch: 14/20... Training loss: 0.1064
Epoch: 15/20... Training loss: 0.1051
Epoch: 16/20... Training loss: 0.1059
Epoch: 17/20... Training loss: 0.1058
Epoch: 18/20... Training loss: 0.1003
Epoch: 19/20... Training loss: 0.1018
Epoch: 20/20... Training loss: 0.0996
In [11]:
fig, axes = plt.subplots(nrows=2, ncols=10, sharex=True, sharey=True, figsize=(20,4))
in_imgs = mnist.test.images[:10]
reconstructed = sess.run(decoded, feed_dict={inputs_: in_imgs.reshape((10, 28, 28, 1))})

for images, row in zip([in_imgs, reconstructed], axes):
    for img, ax in zip(images, row):
        ax.imshow(img.reshape((28, 28)), cmap='Greys_r')
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)


fig.tight_layout(pad=0.1)
In [12]:
sess.close()

Denoising

As I've mentioned before, autoencoders like the ones you've built so far aren't too useful in practice. However, they can be used to denoise images quite successfully just by training the network on noisy images. We can create the noisy images ourselves by adding Gaussian noise to the training images, then clipping the values to be between 0 and 1. We'll use the noisy images as input and the original, clean images as targets. Here's an example of the noisy images I generated and the denoised images.

Denoising autoencoder
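A quick sketch of that noise process, assuming imgs is a NumPy array of images scaled to [0, 1] (this is the same recipe used in the training loop further down):

# Add zero-mean Gaussian noise, then clip back into the valid pixel range [0, 1].
noise_factor = 0.5
noisy_imgs = imgs + noise_factor * np.random.randn(*imgs.shape)
noisy_imgs = np.clip(noisy_imgs, 0., 1.)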

Since this is a harder problem for the network, we'll want to use deeper convolutional layers here, that is, more feature maps. I suggest something like 32-32-16 for the depths of the convolutional layers in the encoder, and the same depths going backward through the decoder. Otherwise the architecture is the same as before.

Exercise: Build the network for the denoising autoencoder. It's the same as before, but with deeper layers. I suggest 32-32-16 for the depths, but you can play with these numbers, or add more layers.

In [13]:
learning_rate = 0.001
inputs_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='inputs')
targets_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='targets')

### Encoder
conv1 = tf.layers.conv2d(inputs=inputs_, filters=32, kernel_size=(3, 3), padding='same', activation=tf.nn.relu)
# Now 28x28x32
maxpool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=2, strides=2, padding='same')
# Now 14x14x32
conv2 = tf.layers.conv2d(inputs=maxpool1, filters=32, kernel_size=(3, 3), padding='same', activation=tf.nn.relu)
# Now 14x14x32
maxpool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=2, strides=2, padding='same')
# Now 7x7x32
conv3 = tf.layers.conv2d(inputs=maxpool2, filters=16, kernel_size=(3, 3), padding='same', activation=tf.nn.relu)
# Now 7x7x16
encoded = tf.layers.max_pooling2d(inputs=conv3, pool_size=2, strides=2, padding='same')
# Now 4x4x16

### Decoder
upsample1 = tf.image.resize_images(images=encoded, size=(7,7))
# Now 7x7x16
conv4 = tf.layers.conv2d(inputs=upsample1, filters=16, kernel_size=(3, 3), padding='same', activation=tf.nn.relu)
# Now 7x7x16
upsample2 = tf.image.resize_images(images=conv4, size=(14,14))
# Now 14x14x16
conv5 = tf.layers.conv2d(inputs=upsample2, filters=32, kernel_size=(3, 3), padding='same', activation=tf.nn.relu)
# Now 14x14x32
upsample3 = tf.image.resize_images(images=conv5, size=(28,28))
# Now 28x28x32
conv6 = tf.layers.conv2d(inputs=upsample3, filters=32, kernel_size=(3, 3), padding='same', activation=tf.nn.relu)
# Now 28x28x32

logits = tf.layers.conv2d(inputs=conv6, filters=1, kernel_size=3, padding='same')
#Now 28x28x1

# Pass logits through sigmoid to get reconstructed image
decoded = tf.nn.sigmoid(logits)

# Pass logits through sigmoid and calculate the cross-entropy loss
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=logits, labels=targets_)

# Get cost and define the optimizer
cost = tf.reduce_mean(loss)
opt = tf.train.AdamOptimizer(learning_rate).minimize(cost)
In [14]:
sess = tf.Session()
In [ ]:
epochs = 100
batch_size = 200
# Sets how much noise we're adding to the MNIST images
noise_factor = 0.5
sess.run(tf.global_variables_initializer())
for e in range(epochs):
    for ii in range(mnist.train.num_examples//batch_size):
        batch = mnist.train.next_batch(batch_size)
        # Get images from the batch
        imgs = batch[0].reshape((-1, 28, 28, 1))
        
        # Add random noise to the input images
        noisy_imgs = imgs + noise_factor * np.random.randn(*imgs.shape)
        # Clip the images to be between 0 and 1
        noisy_imgs = np.clip(noisy_imgs, 0., 1.)
        
        # Noisy images as inputs, original images as targets
        batch_cost, _ = sess.run([cost, opt], feed_dict={inputs_: noisy_imgs,
                                                         targets_: imgs})

    print("Epoch: {}/{}...".format(e+1, epochs),
          "Training loss: {:.4f}".format(batch_cost))
Epoch: 1/100... Training loss: 0.1824
Epoch: 2/100... Training loss: 0.1574
Epoch: 3/100... Training loss: 0.1495
Epoch: 4/100... Training loss: 0.1391
Epoch: 5/100... Training loss: 0.1323
Epoch: 6/100... Training loss: 0.1278
Epoch: 7/100... Training loss: 0.1270
Epoch: 8/100... Training loss: 0.1213
Epoch: 9/100... Training loss: 0.1207
Epoch: 10/100... Training loss: 0.1201
Epoch: 11/100... Training loss: 0.1213
Epoch: 12/100... Training loss: 0.1246
Epoch: 13/100... Training loss: 0.1152
Epoch: 14/100... Training loss: 0.1152
Epoch: 15/100... Training loss: 0.1148
Epoch: 16/100... Training loss: 0.1153
Epoch: 17/100... Training loss: 0.1116
Epoch: 18/100... Training loss: 0.1127
Epoch: 19/100... Training loss: 0.1128
Epoch: 20/100... Training loss: 0.1067
Epoch: 21/100... Training loss: 0.1056
Epoch: 22/100... Training loss: 0.1094
Epoch: 23/100... Training loss: 0.1090
Epoch: 24/100... Training loss: 0.1052
Epoch: 25/100... Training loss: 0.1104
Epoch: 26/100... Training loss: 0.1077

Checking out the performance

Here I'm adding noise to the test images and passing them through the autoencoder. It does a surprisingly great job of removing the noise, even though it's sometimes difficult to tell what the original number is.

In [29]:
fig, axes = plt.subplots(nrows=2, ncols=10, sharex=True, sharey=True, figsize=(20,4))
in_imgs = mnist.test.images[:10]
noisy_imgs = in_imgs + noise_factor * np.random.randn(*in_imgs.shape)
noisy_imgs = np.clip(noisy_imgs, 0., 1.)

reconstructed = sess.run(decoded, feed_dict={inputs_: noisy_imgs.reshape((10, 28, 28, 1))})

for images, row in zip([noisy_imgs, reconstructed], axes):
    for img, ax in zip(images, row):
        ax.imshow(img.reshape((28, 28)), cmap='Greys_r')
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)

fig.tight_layout(pad=0.1)