Git Repository

MNIST

The MNIST dataset, or Modified National Institute of Standards and Technology dataset, is a labeled, preprocessed dataset of handwritten digits taken from American Census Bureau employees and American high school students. The database contains 70,000 $28\times 28$ grayscale images, split into 60,000 training and 10,000 testing images. Training a model to recognize these digits is something of a “Hello world” for machine learning.

Fashion MNIST is a dataset of the same size, with grayscale images of the same dimensions, each depicting one of 10 classes of clothing articles. An artificial neural network that can be trained on MNIST to recognize digits should also be trainable on Fashion MNIST to recognize articles of clothing, although getting a good score may be harder.
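
Both datasets are available through torchvision. A minimal loading sketch; the "data" root directory and the ToTensor transform are my own choices here.

from torchvision import datasets, transforms

# Each sample is a 28x28 grayscale image paired with a label from 0 to 9.
transform = transforms.ToTensor()
mnist_train = datasets.MNIST(root="data", train=True, download=True, transform=transform)
fashion_train = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)

image, label = mnist_train[0]
print(len(mnist_train), image.shape, label)  # 60000 torch.Size([1, 28, 28]) 5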

Artificial neural networks

We only consider feedforward neural networks for simplicity. An artificial neural network $\mathcal{N}$ is a directed weighted graph composed of layers, including an input layer, hidden layers, and an output layer. Each layer is made up of nodes, termed “neurons”, that hold values that are a function of values in previous layers. Neurons in a given layer $l$ are connected to neurons in the next layer $l+1$.

3 fully connected layers. By Glosser.ca / CC BY-SA 3.0

If there are $M$ neurons in layer $l-1$ and $N$ neurons in layer $l$, then the weights between the layers form a matrix $W_l\in\mathbb{R}^{N\times M}$, and the biases form a vector $b_l\in\mathbb{R}^{N}$. We call the vector of neuron values in layer $l$ $x_l$.

$$x_l=\sigma(W_l\cdot x_{l-1}+b_l)$$

where $\sigma$ is a nonlinear activation function, the operation between $W_l$ and $x_{l-1}$ is matrix multiplication, and adding $b_l$ is vector addition.

Consider a network with $L$ layers. Let $\Theta=\{W_1,\dots,W_L,b_1,\dots,b_L\}$. The entire network $f:\mathbb{R}^I\times\Theta\rightarrow\mathbb{R}^O$, where $I$ and $O$ are the input and output dimensions, is a composition of layers:

$$f(x,\Theta)=\sigma_L(W_L\sigma_{L-1}(W_{L-1}\dots\sigma_1(W_1x+b_1)\dots+b_{L-1})+b_L)$$
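
As a sketch, the whole composition can be written as a loop over layers; the layer sizes below are arbitrary and sigmoid is used for every $\sigma_l$.

import torch

# A sketch of f(x, Theta) as a loop over layers.
def forward(x, weights, biases):
    for W, b in zip(weights, biases):
        x = torch.sigmoid(W @ x + b)  # x_l = sigma(W_l x_{l-1} + b_l)
    return x

sizes = [784, 128, 64, 10]  # input layer, two hidden layers, output layer
weights = [torch.randn(n, m) for m, n in zip(sizes, sizes[1:])]
biases = [torch.randn(n) for n in sizes[1:]]

x = torch.rand(784)  # a flattened 28x28 image
print(forward(x, weights, biases).shape)  # torch.Size([10])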

Activation function

The purpose of the activation function is to introduce non-linearity into the network. Without an activation function, all layers can collapse into a single linear layer:

$$W_2(W_1x+b_1)+b_2=(W_2W_1)x+(W_2b_1+b_2)=Mx+b$$

for some $M$ and $b$. By induction, this can be repeated for $L$ layers.
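
This collapse is easy to verify numerically with two PyTorch linear layers (nn.Linear computes $xW^T+b$); the sizes here are arbitrary.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation in between...
l1 = nn.Linear(4, 8)
l2 = nn.Linear(8, 3)
x = torch.randn(5, 4)

# ...are equivalent to a single linear layer with M = W2 W1 and b = W2 b1 + b2.
M = l2.weight @ l1.weight           # (3, 8) @ (8, 4) -> (3, 4)
b = l2.weight @ l1.bias + l2.bias   # (3,)
print(torch.allclose(l2(l1(x)), x @ M.T + b, atol=1e-6))  # True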

A linear layer cannot represent or approximate non-linear functions. In contrast, the Universal Approximation Theorem states that

multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.

Examples of activation functions

Table from Wikipedia

| Name | Function | Derivative | Range | Order of continuity |
|---|---|---|---|---|
| Binary step | $\begin{cases}0&\textrm{if }x<0\\1&\textrm{if }x\geq 0\end{cases}$ | $0$ | $\{0,1\}$ | $\mathcal{C}^{-1}$ |
| Sigmoid | $\sigma(x)=\frac{1}{1+\mathrm{e}^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | $(0,1)$ | $\mathcal{C}^\infty$ |
| ReLU | $\mathrm{max}(0,x)$ | $\begin{cases}0&\textrm{if }x<0\\1&\textrm{if }x>0\\\textrm{undefined}&\textrm{at }x=0\end{cases}$ | $[0,\infty)$ | $\mathcal{C}^0$ |
| Leaky ReLU | $\begin{cases}0.01x&\textrm{if }x\leq 0\\x&\textrm{if }x>0\end{cases}$ | $\begin{cases}0.01&\textrm{if }x<0\\1&\textrm{if }x>0\end{cases}$ | $(-\infty,\infty)$ | $\mathcal{C}^0$ |

Sigmoid is apt for binary classification problems, whereas softmax is a generalization of the sigmoid function for multiclass classification problems. Given a layer $x=[x_1,\dots,x_n]$,

$$\mathrm{softmax}(x)_i=\frac{\mathrm{e}^{x_i}}{\sum_{j=1}^n\mathrm{e}^{x_j}}$$

ReLU and its variants are commonly used in hidden layers, while sigmoid and softmax are commonly used in output layers. ReLU and its variants are preferred for their computational efficiency. Variants of ReLU primarily serve to preserve a signal from neurons with negative pre-activations, whereas plain ReLU has sparse activation and suffers less from vanishing gradients. Sigmoid and softmax give values between $0$ and $1$, which are easier to interpret and process as outputs.
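
A quick numeric check of softmax; subtracting $\max(x)$ before exponentiating is the standard trick to avoid overflow, and it does not change the result.

import torch

x = torch.tensor([2.0, 1.0, 0.1])

# Softmax: exponentiate, then normalize so the outputs sum to 1.
manual = torch.exp(x - x.max()) / torch.exp(x - x.max()).sum()
print(manual)                   # tensor([0.6590, 0.2424, 0.0986])
print(torch.softmax(x, dim=0))  # same values
print(manual.sum())             # sums to 1: the outputs form a probability distribution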

Training

Artificial neural networks, like all machine learning models, are statistical models. They are trained to minimize a loss function $\mathcal{L}$, which quantifies the discrepancy between the model outputs $\hat{y}$ and the true labels $y$. Differentiating the loss guides gradient-based optimization. In supervised learning, true labels are provided; in unsupervised learning, they are derived. The MNIST datasets are labelled, so I will only discuss loss functions used with supervised learning models. Training repeats four steps, sketched in code after this list:

  1. Forward propagation: The model produces an output $\hat{y}$ from a given input $x$.
  2. Loss function: The loss $\mathcal{L}(\hat{y},y)$ is computed.
  3. Backpropagation: Gradients of the loss, via the error terms $\delta$, are computed for each weight and bias.
  4. Optimization algorithm: Weights and biases are updated using these gradients.
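
A minimal sketch of these four steps in PyTorch; the placeholder model, Adam, and cross entropy are illustrative choices, and train_loader is assumed to yield batches of (image, label) pairs.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_one_epoch(train_loader):
    for x, y in train_loader:
        y_hat = model(x)           # 1. forward propagation
        loss = loss_fn(y_hat, y)   # 2. loss function
        optimizer.zero_grad()
        loss.backward()            # 3. backpropagation (fills .grad for every parameter)
        optimizer.step()           # 4. optimization algorithm updates weights and biases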

Loss functions

| Loss function | Formula |
|---|---|
| Mean squared error | $\mathcal{L}=\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2$ |
| Binary cross entropy | $\mathcal{L}=-\frac{1}{N}\sum_{i=1}^N[y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)]$ |
| Cross entropy | $\mathcal{L}=-\frac{1}{N}\sum_{i=1}^N\sum_{c=1}^Cy_{i,c}\log(\hat{y}_{i,c})$ |

Mean squared error is used in regression tasks, i.e. tasks with continuous output. Cross entropy is used for classification tasks: binary cross entropy for two classes, and (categorical) cross entropy for $C$ classes, such as MNIST's 10 digit classes. Other loss functions are commonly used depending on what task is being performed.
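
In PyTorch, cross entropy is usually computed from raw logits with F.cross_entropy, which applies softmax internally; the manual version below matches the formula in the table. The batch and targets here are made up for illustration.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # a batch of 4 samples, 10 classes (raw scores)
targets = torch.tensor([3, 7, 0, 9])  # true class indices

# Built-in cross entropy: softmax + negative log likelihood, averaged over the batch.
builtin = F.cross_entropy(logits, targets)

# Manual version of the formula: -1/N * sum_i log(softmax(logits_i)[y_i]).
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), targets].mean()

print(torch.allclose(builtin, manual))  # True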

Backpropagation

The derivative of a composite function can be computed using the chain rule. Given a function $\mathcal{L}(f(g(h(x))))$,

$$\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}x}=\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}f}\cdot\frac{\mathrm{d}f}{\mathrm{d}g}\cdot\frac{\mathrm{d}g}{\mathrm{d}h}\cdot\frac{\mathrm{d}h}{\mathrm{d}x}$$

Thus, the error term $\delta_l$ is computed for each layer iteratively in reverse order. Let $a_l$ be the output of layer $l$, and $z_l$ be its output before the activation function is applied, so that $a_l=\sigma_l(z_l)$ and $a_L=\hat{y}$.

$$\delta_L=\nabla_{z_L}\mathcal{L}=\frac{\partial\mathcal{L}}{\partial\hat{y}}\odot\sigma_L'(z_L)$$

$$\delta_l=\nabla_{z_l}\mathcal{L}=\sigma_l'(z_l)\odot(W_{l+1}^T\delta_{l+1}),\quad 1\leq l<L$$

where $\odot$ represents the Hadamard (elementwise) product, and $\nabla$ is the gradient operator. The parameter gradients then follow as $\nabla_{W_l}\mathcal{L}=\delta_l\,a_{l-1}^T$ and $\nabla_{b_l}\mathcal{L}=\delta_l$, which are exactly what the optimizer uses to update the weights and biases.
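
As a sanity check, these recurrences can be compared against PyTorch's autograd on a tiny two-layer sigmoid network with an MSE loss; all sizes here are arbitrary.

import torch

torch.manual_seed(0)

# Tiny network: 3 inputs -> 4 hidden (sigmoid) -> 2 outputs (sigmoid), MSE loss.
x = torch.randn(3)
y = torch.rand(2)
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(4, requires_grad=True)
W2 = torch.randn(2, 4, requires_grad=True)
b2 = torch.randn(2, requires_grad=True)

# Forward pass, keeping the pre-activations z_l and activations a_l.
z1 = W1 @ x + b1
a1 = torch.sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = torch.sigmoid(z2)
loss = ((a2 - y) ** 2).mean()
loss.backward()  # autograd applies the chain rule for us

# Manual backpropagation using the delta recurrences above (sigma'(z) = a * (1 - a)).
delta2 = (2 * (a2 - y) / y.numel()) * a2 * (1 - a2)  # delta_L = dL/dy_hat (elementwise) sigma'(z_L)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)             # delta_l = sigma'(z_l) (elementwise) W_{l+1}^T delta_{l+1}

# Parameter gradients: dL/dW_l = delta_l a_{l-1}^T and dL/db_l = delta_l.
print(torch.allclose(W2.grad, torch.outer(delta2, a1)))  # True
print(torch.allclose(b2.grad, delta2))                   # True
print(torch.allclose(W1.grad, torch.outer(delta1, x)))   # True
print(torch.allclose(b1.grad, delta1))                   # True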

Optimizers

Optimizers update model parameters. They typically take hyperparameters, or user-defined parameters, as input.

| Optimizer | Update rule |
|---|---|
| SGD | $W_{l+1}=W_l-\eta\nabla_{W_l}\mathcal{L}(W_l)$ |
| SGD w/ momentum | $V_l=\gamma V_{l-1}+\eta\nabla_{W_l}\mathcal{L}(W_l)$, $\quad W_{l+1}=W_l-V_l$ |
| RMSprop | $E_l=\beta E_{l-1}+(1-\beta)\nabla_{W_l}\mathcal{L}(W_l)\odot\nabla_{W_l}\mathcal{L}(W_l)$, $\quad W_{l+1}=W_l-\frac{\eta}{\sqrt{E_l+\epsilon}}\odot\nabla_{W_l}\mathcal{L}(W_l)$ |
| Adam | $M_l=\gamma M_{l-1}+(1-\gamma)\nabla_{W_l}\mathcal{L}(W_l)$, $\quad V_l=\beta V_{l-1}+(1-\beta)\nabla_{W_l}\mathcal{L}(W_l)\odot\nabla_{W_l}\mathcal{L}(W_l)$, $\quad W_{l+1}=W_l-\frac{\eta}{\sqrt{V_l}+\epsilon}\odot M_l$ |

with hyperparameters including the learning rate $\eta$ (typically $0.001$), momentum coefficient $\gamma$ (typically $0.9$), stabilizer $\epsilon$ (around $10^{-8}$), and decay rate $\beta$ (typically in $(0.9,0.999)$).
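
These optimizers are all available in torch.optim; below they are constructed with the typical values above (the linear model is just a placeholder, and alpha is RMSprop's name for the decay rate).

import torch
import torch.nn as nn

model = nn.Linear(784, 10)  # placeholder model whose parameters get optimized

sgd = torch.optim.SGD(model.parameters(), lr=0.001)
sgd_momentum = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99, eps=1e-8)
adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)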

A better optimizer improves the rate of convergence, stability, and generalization, so it is acceptable for the update rule itself to be slightly slower to compute. Adam is a very popular optimizer for deep learning tasks, although it may overfit the data.

Overfitting

Overfitting refers to a model learning the data it is trained on too well. That is, it may learn the noise and outliers of the training data, limiting its applicability to unseen, general problems. Overfitting occurs when a model is too complex for the data it is approximating, when it is trained on limited data, or when the model isn’t regularized. It is the reason a distinction is made between training data and testing data. A good way to detect overfitting is to look for high performance on training data (measured with the loss function) together with low performance on test data.
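
A sketch of that check, assuming `model`, `loss_fn`, `train_loader`, and `test_loader` already exist from training; these names are placeholders, not objects defined elsewhere in this post.

import torch

@torch.no_grad()
def average_loss(model, loss_fn, loader):
    model.eval()  # disable training-only behaviour such as dropout
    total, count = 0.0, 0
    for x, y in loader:
        total += loss_fn(model(x), y).item() * len(y)
        count += len(y)
    return total / count

# A large gap (low training loss, high test loss) suggests overfitting.
# print(average_loss(model, loss_fn, train_loader), average_loss(model, loss_fn, test_loader))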

Regularization

Regularization refers to methods that reduce variance and prevent overfitting. Penalized regression techniques, such as $L_1$ and $L_2$ regularization, add penalty terms to the loss function that either increase sparsity (decrease the number of weights influencing a decision) or uniformly shrink the effect of individual parameters. Dropout is a practice in which neuron activations are randomly set to zero during training, forcing the model to learn representations of patterns that are redundant across the network rather than concentrated in a few neurons.
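
In PyTorch, $L_2$ regularization is typically added through the optimizer's weight_decay argument, and dropout through an nn.Dropout layer; the architecture below is just an illustrative placeholder.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes half of the activations during training
    nn.Linear(256, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty on weights

model.train()  # dropout active while training
model.eval()   # dropout disabled for evaluation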

Network architecture

Dense/Fully connected layers have connections between every neuron of one layer and every neuron of the next. Outputs are computed using matrix multiplication and addition.

Convolution layers pass a filter $K$ of weights of a given size over 2-dimensional input data. Filter size and stride (step size) are hyperparameters. Optionally, padding is applied to preserve dimensionality. The filter produces a single number as output at each step. CNNs are good at learning spatial patterns in images. Multiple filters can be used in a single layer to learn more patterns.
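
A quick shape check of a convolution layer versus a dense layer on a 28×28 input; the convolution matches LeNet-5's first layer below, while the dense layer's 10 outputs are an arbitrary choice.

import torch
import torch.nn as nn

# 6 filters of size 5x5, stride 1, padding 2, so the 28x28 spatial size is preserved.
conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
fc = nn.Linear(in_features=28 * 28, out_features=10)

x = torch.rand(1, 1, 28, 28)   # (batch, channels, height, width)
print(conv(x).shape)           # torch.Size([1, 6, 28, 28])
print(fc(x.flatten(1)).shape)  # torch.Size([1, 10])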

LeNet-5

LeNet is a series of convolutional neural network architectures published by AT&T’s Bell Labs between 1988 and 1998 to classify handwritten digits. The “Le” in LeNet refers to Yann LeCun, who led the research group that worked on these models. LeCun is a big name in the field. Five “LeNet” architectures were published, with LeNet-5 being the one benchmarked on the MNIST dataset.

LeNet-5 architecture. By dvgodoy / CC BY 4.0

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import sys

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # LeNet-5: two convolution + pooling stages followed by three fully connected layers.
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0)
        self.fc1 = nn.Linear(in_features=400, out_features=120)  # 16 * 5 * 5 = 400 flattened features
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=10)    # one output per digit class

    def forward(self, x):
        x = torch.sigmoid(self.conv1(x))              # (1, 28, 28) -> (6, 28, 28)
        x = F.avg_pool2d(x, kernel_size=2, stride=2)  # (6, 28, 28) -> (6, 14, 14)
        x = torch.sigmoid(self.conv2(x))              # (6, 14, 14) -> (16, 10, 10)
        x = F.avg_pool2d(x, kernel_size=2, stride=2)  # (16, 10, 10) -> (16, 5, 5)
        x = torch.flatten(x, 1)                       # flatten everything except the batch dimension
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        x = self.fc3(x)                               # raw logits; softmax is applied when computing the loss
        return x

Written using PyTorch. This replaces the original Gaussian (RBF) output layer with a fully connected layer whose outputs go through softmax when the loss is computed.

This model achieves 97.90% accuracy on MNIST, and 86.54% accuracy on FashionMNIST, after 10 epochs using the Adam optimizer. The performance of the model can be greatly improved by replacing the average pooling layers with max pooling layers, and the sigmoid activations with ReLU.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0)
        self.fc1 = nn.Linear(in_features=400, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=10)

    def forward(self, x):
        x = F.relu(self.conv1(x))                     # sigmoid -> ReLU
        x = F.max_pool2d(x, kernel_size=2, stride=2)  # average pooling -> max pooling
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, kernel_size=2, stride=2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

With these changes made, the model achieves 99.07% accuracy on MNIST and 90.45% accuracy on FashionMNIST after 10 epochs.

Future work

There is a lot that could be done to expand upon this. I could compare model performance when different choices are made; I compared some parameters here.

I could explore other datasets, such as CIFAR, or other types of models. I could train models for tasks other than classifying images, or explore lower-level implementations of a model. And so on.