Git Repository

MNIST

The MNIST dataset, or Modified National Institute of Standards and Technology dataset, is a labeled, preprocessed dataset of handwritten digits taken from American Census Bureau employees and American high school students. The database contains 70,000 $28\times 28$ grayscale images, split into 60,000 training and 10,000 testing images. Training a model to recognize these digits is something of a “Hello world” for machine learning.

Fashion MNIST is a dataset of the same size, with grayscale images of the same dimensions, each depicting one of 10 classes of clothing articles. An artificial neural network that can be trained on MNIST to recognize digits should also be trainable on Fashion MNIST to recognize articles of clothing, although getting a good score may be harder.
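
Both datasets are available through torchvision. A minimal loading sketch; the "data" root directory and the ToTensor transform are my own choices here.

from torchvision import datasets, transforms

# Each sample is a 28x28 grayscale image paired with a label from 0 to 9.
transform = transforms.ToTensor()
mnist_train = datasets.MNIST(root="data", train=True, download=True, transform=transform)
fashion_train = datasets.FashionMNIST(root="data", train=True, download=True, transform=transform)

image, label = mnist_train[0]
print(len(mnist_train), image.shape, label)  # 60000 torch.Size([1, 28, 28]) 5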

Artificial neural networks

We only consider feedforward neural networks for simplicity. An artificial neural network $\mathcal{N}$ is a directed weighted graph composed of layers, including an input layer, hidden layers, and an output layer. Each layer is made up of nodes, termed “neurons”, that hold values that are a function of values in previous layers. Neurons in a given layer $l$ are connected to neurons in the next layer $l+1$.

3 fully connected layers. By Glosser.ca / CC BY-SA 3.0

If there are $M$ neurons in layer $l-1$ and $N$ neurons in layer $l$, then the weights between the layers form a matrix $W_l\in\mathbb{R}^{N\times M}$, and the biases form a vector $b_l\in\mathbb{R}^{N}$. We call the vector of neuron values in layer $l$ $x_l$.

$$x_l=\sigma(W_l\cdot x_{l-1}+b_l)$$

where $\sigma$ is a nonlinear activation function, the operation between $W_l$ and $x_{l-1}$ is matrix multiplication, and adding $b_l$ is vector addition.

Consider a network with $L$ layers. Let $\Theta=\{W_1,\dots,W_L,b_1,\dots,b_L\}$. The entire network $f:\mathbb{R}^I\times\Theta\rightarrow\mathbb{R}^O$, where $I$ and $O$ are the input and output dimensions, is a composition of layers:

$$f(x,\Theta)=\sigma_L(W_L\sigma_{L-1}(W_{L-1}\dots\sigma_1(W_1x+b_1)\dots+b_{L-1})+b_L)$$
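
As a sketch, the whole composition can be written as a loop over layers; the layer sizes below are arbitrary and sigmoid is used for every $\sigma_l$.

import torch

# A sketch of f(x, Theta) as a loop over layers.
def forward(x, weights, biases):
    for W, b in zip(weights, biases):
        x = torch.sigmoid(W @ x + b)  # x_l = sigma(W_l x_{l-1} + b_l)
    return x

sizes = [784, 128, 64, 10]  # input layer, two hidden layers, output layer
weights = [torch.randn(n, m) for m, n in zip(sizes, sizes[1:])]
biases = [torch.randn(n) for n in sizes[1:]]

x = torch.rand(784)  # a flattened 28x28 image
print(forward(x, weights, biases).shape)  # torch.Size([10])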

Activation function

The purpose of the activation function is to introduce non-linearity into the network. Without an activation function, all layers can collapse into a single linear layer:

$$W_2(W_1x+b_1)+b_2=(W_2W_1)x+(W_2b_1+b_2)=Mx+b$$

for some $M$ and $b$. By induction, this can be repeated for $L$ layers.
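
This collapse is easy to verify numerically with two PyTorch linear layers (nn.Linear computes $xW^T+b$); the sizes here are arbitrary.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers with no activation in between...
l1 = nn.Linear(4, 8)
l2 = nn.Linear(8, 3)
x = torch.randn(5, 4)

# ...are equivalent to a single linear layer with M = W2 W1 and b = W2 b1 + b2.
M = l2.weight @ l1.weight           # (3, 8) @ (8, 4) -> (3, 4)
b = l2.weight @ l1.bias + l2.bias   # (3,)
print(torch.allclose(l2(l1(x)), x @ M.T + b, atol=1e-6))  # True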

A linear layer cannot represent or approximate non-linear functions. In contrast, the Universal Approximation Theorem states that

multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.

Examples of activation functions

Table from Wikipedia

| Name | Function | Derivative | Range | Order of continuity |
|---|---|---|---|---|
| Binary step | $\begin{cases}0&\textrm{if }x<0\\1&\textrm{if }x\geq 0\end{cases}$ | $0$ | $\{0,1\}$ | $\mathcal{C}^{-1}$ |
| Sigmoid | $\sigma(x)=\frac{1}{1+\mathrm{e}^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | $(0,1)$ | $\mathcal{C}^\infty$ |
| ReLU | $\mathrm{max}(0,x)$ | $\begin{cases}0&\textrm{if }x<0\\1&\textrm{if }x>0\\\textrm{undefined}&\textrm{at }x=0\end{cases}$ | $[0,\infty)$ | $\mathcal{C}^0$ |
| Leaky ReLU | $\begin{cases}0.01x&\textrm{if }x\leq 0\\x&\textrm{if }x>0\end{cases}$ | $\begin{cases}0.01&\textrm{if }x<0\\1&\textrm{if }x>0\end{cases}$ | $(-\infty,\infty)$ | $\mathcal{C}^0$ |

Sigmoid is apt for binary classification problems, whereas softmax is a generalization of the sigmoid function for multiclass classification problems. Given a layer $x=[x_1,\dots,x_n]$,

$$\mathrm{softmax}(x)_i=\frac{\mathrm{e}^{x_i}}{\sum_{j=1}^n\mathrm{e}^{x_j}}$$

ReLU and its variants are commonly used in hidden layers, while sigmoid and softmax are commonly used in output layers. ReLU and its variants are preferred for their computational efficiency. Variants of ReLU primarily serve to preserve a signal from neurons with negative pre-activations, whereas plain ReLU has sparse activation and suffers less from vanishing gradients. Sigmoid and softmax give values between $0$ and $1$, which are easier to interpret and process as outputs.
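
A quick numeric check of softmax; subtracting $\max(x)$ before exponentiating is the standard trick to avoid overflow, and it does not change the result.

import torch

x = torch.tensor([2.0, 1.0, 0.1])

# Softmax: exponentiate, then normalize so the outputs sum to 1.
manual = torch.exp(x - x.max()) / torch.exp(x - x.max()).sum()
print(manual)                   # tensor([0.6590, 0.2424, 0.0986])
print(torch.softmax(x, dim=0))  # same values
print(manual.sum())             # sums to 1: the outputs form a probability distribution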

Training

Artificial neural networks, like all machine learning models, are statistical models. They are trained to minimize a loss function $\mathcal{L}$, which quantifies the discrepancy between the model outputs $\hat{y}$ and the true labels $y$. Differentiating the loss guides gradient-based optimization. In supervised learning, true labels are provided; in unsupervised learning, they are derived. The MNIST datasets are labelled, so I will only discuss loss functions used with supervised learning models. Training repeats four steps, sketched in code after this list:

  1. Forward propagation: The model produces an output $\hat{y}$ from a given input $x$.
  2. Loss function: The loss $\mathcal{L}(\hat{y},y)$ is computed.
  3. Backpropagation: Gradients of the loss, via the error terms $\delta$, are computed for each weight and bias.
  4. Optimization algorithm: Weights and biases are updated using these gradients.
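
A minimal sketch of these four steps in PyTorch; the placeholder model, Adam, and cross entropy are illustrative choices, and train_loader is assumed to yield batches of (image, label) pairs.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_one_epoch(train_loader):
    for x, y in train_loader:
        y_hat = model(x)           # 1. forward propagation
        loss = loss_fn(y_hat, y)   # 2. loss function
        optimizer.zero_grad()
        loss.backward()            # 3. backpropagation (fills .grad for every parameter)
        optimizer.step()           # 4. optimization algorithm updates weights and biases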

Loss functions

| Loss function | Formula |
|---|---|
| Mean squared error | $\mathcal{L}=\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y}_i)^2$ |
| Binary cross entropy | $\mathcal{L}=-\frac{1}{N}\sum_{i=1}^N[y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)]$ |
| Cross entropy | $\mathcal{L}=-\frac{1}{N}\sum_{i=1}^N\sum_{c=1}^Cy_{i,c}\log(\hat{y}_{i,c})$ |

Mean squared error is used in regression tasks, i.e. tasks with continuous output. Cross entropy is used for classification tasks: binary cross entropy for two classes, and (categorical) cross entropy for $C$ classes, such as MNIST's 10 digit classes. Other loss functions are commonly used depending on what task is being performed.
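
In PyTorch, cross entropy is usually computed from raw logits with F.cross_entropy, which applies softmax internally; the manual version below matches the formula in the table. The batch and targets here are made up for illustration.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # a batch of 4 samples, 10 classes (raw scores)
targets = torch.tensor([3, 7, 0, 9])  # true class indices

# Built-in cross entropy: softmax + negative log likelihood, averaged over the batch.
builtin = F.cross_entropy(logits, targets)

# Manual version of the formula: -1/N * sum_i log(softmax(logits_i)[y_i]).
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), targets].mean()

print(torch.allclose(builtin, manual))  # True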

Backpropagation

The derivative of a composite function can be computed using the chain rule. Given a function $\mathcal{L}(f(g(h(x))))$,

$$\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}x}=\frac{\mathrm{d}\mathcal{L}}{\mathrm{d}f}\cdot\frac{\mathrm{d}f}{\mathrm{d}g}\cdot\frac{\mathrm{d}g}{\mathrm{d}h}\cdot\frac{\mathrm{d}h}{\mathrm{d}x}$$

Thus, the error term $\delta_l$ is computed for each layer iteratively in reverse order. Let $a_l$ be the output of layer $l$, and $z_l$ be its output before the activation function is applied, so that $a_l=\sigma_l(z_l)$ and $a_L=\hat{y}$.

$$\delta_L=\nabla_{z_L}\mathcal{L}=\frac{\partial\mathcal{L}}{\partial\hat{y}}\odot\sigma_L'(z_L)$$

$$\delta_l=\nabla_{z_l}\mathcal{L}=\sigma_l'(z_l)\odot(W_{l+1}^T\delta_{l+1}),\quad 1\leq l<L$$

where $\odot$ represents the Hadamard (elementwise) product, and $\nabla$ is the gradient operator. The parameter gradients then follow as $\nabla_{W_l}\mathcal{L}=\delta_l\,a_{l-1}^T$ and $\nabla_{b_l}\mathcal{L}=\delta_l$, which are exactly what the optimizer uses to update the weights and biases.
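
As a sanity check, these recurrences can be compared against PyTorch's autograd on a tiny two-layer sigmoid network with an MSE loss; all sizes here are arbitrary.

import torch

torch.manual_seed(0)

# Tiny network: 3 inputs -> 4 hidden (sigmoid) -> 2 outputs (sigmoid), MSE loss.
x = torch.randn(3)
y = torch.rand(2)
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(4, requires_grad=True)
W2 = torch.randn(2, 4, requires_grad=True)
b2 = torch.randn(2, requires_grad=True)

# Forward pass, keeping the pre-activations z_l and activations a_l.
z1 = W1 @ x + b1
a1 = torch.sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = torch.sigmoid(z2)
loss = ((a2 - y) ** 2).mean()
loss.backward()  # autograd applies the chain rule for us

# Manual backpropagation using the delta recurrences above (sigma'(z) = a * (1 - a)).
delta2 = (2 * (a2 - y) / y.numel()) * a2 * (1 - a2)  # delta_L = dL/dy_hat (elementwise) sigma'(z_L)
delta1 = (W2.T @ delta2) * a1 * (1 - a1)             # delta_l = sigma'(z_l) (elementwise) W_{l+1}^T delta_{l+1}

# Parameter gradients: dL/dW_l = delta_l a_{l-1}^T and dL/db_l = delta_l.
print(torch.allclose(W2.grad, torch.outer(delta2, a1)))  # True
print(torch.allclose(b2.grad, delta2))                   # True
print(torch.allclose(W1.grad, torch.outer(delta1, x)))   # True
print(torch.allclose(b1.grad, delta1))                   # True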

Optimizers

Optimizers update model parameters. They typically take hyperparameters, or user-defined parameters, as input.

| Optimizer | Update rule |
|---|---|
| SGD | $W_{l+1}=W_l-\eta\nabla_{W_l}\mathcal{L}(W_l)$ |
| SGD w/ momentum | $V_l=\gamma V_{l-1}+\eta\nabla_{W_l}\mathcal{L}(W_l)$, $\quad W_{l+1}=W_l-V_l$ |
| RMSprop | $E_l=\beta E_{l-1}+(1-\beta)\nabla_{W_l}\mathcal{L}(W_l)\odot\nabla_{W_l}\mathcal{L}(W_l)$, $\quad W_{l+1}=W_l-\frac{\eta}{\sqrt{E_l+\epsilon}}\odot\nabla_{W_l}\mathcal{L}(W_l)$ |
| Adam | $M_l=\gamma M_{l-1}+(1-\gamma)\nabla_{W_l}\mathcal{L}(W_l)$, $\quad V_l=\beta V_{l-1}+(1-\beta)\nabla_{W_l}\mathcal{L}(W_l)\odot\nabla_{W_l}\mathcal{L}(W_l)$, $\quad W_{l+1}=W_l-\frac{\eta}{\sqrt{V_l}+\epsilon}\odot M_l$ |

with hyperparameters including the learning rate $\eta$ (typically $0.001$), momentum coefficient $\gamma$ (typically $0.9$), stabilizer $\epsilon$ (around $10^{-8}$), and decay rate $\beta$ (typically in $(0.9,0.999)$).
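
These optimizers are all available in torch.optim; below they are constructed with the typical values above (the linear model is just a placeholder, and alpha is RMSprop's name for the decay rate).

import torch
import torch.nn as nn

model = nn.Linear(784, 10)  # placeholder model whose parameters get optimized

sgd = torch.optim.SGD(model.parameters(), lr=0.001)
sgd_momentum = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
rmsprop = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99, eps=1e-8)
adam = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)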

A better optimizer improves the rate of convergence, stability, and generalization, so it is acceptable for the update rule itself to be slightly slower to compute. Adam is a very popular optimizer for deep learning tasks, although it may overfit the data.

Overfitting

Overfitting refers to a model learning the data it is trained on too well. That is, it may learn the noise and outliers of the training data, limiting its applicability to unseen, general problems. Overfitting occurs when a model is too complex for the data it is approximating, when it is trained on limited data, or when the model isn’t regularized. It is the reason a distinction is made between training data and testing data. A good way to detect overfitting is to look for high performance on training data (measured with the loss function) together with low performance on test data.
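
A sketch of that check, assuming `model`, `loss_fn`, `train_loader`, and `test_loader` already exist from training; these names are placeholders, not objects defined elsewhere in this post.

import torch

@torch.no_grad()
def average_loss(model, loss_fn, loader):
    model.eval()  # disable training-only behaviour such as dropout
    total, count = 0.0, 0
    for x, y in loader:
        total += loss_fn(model(x), y).item() * len(y)
        count += len(y)
    return total / count

# A large gap (low training loss, high test loss) suggests overfitting.
# print(average_loss(model, loss_fn, train_loader), average_loss(model, loss_fn, test_loader))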

Regularization

Regularization refers to methods that reduce variance and prevent overfitting. Penalized regression techniques, such as $L_1$ and $L_2$ regularization, add penalty terms to the loss function that either increase sparsity (decrease the number of weights influencing a decision) or uniformly shrink the effect of individual parameters. Dropout is a practice in which neuron activations are randomly set to zero during training, forcing the model to learn representations of patterns that are redundant across the network rather than concentrated in a few neurons.
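
In PyTorch, $L_2$ regularization is typically added through the optimizer's weight_decay argument, and dropout through an nn.Dropout layer; the architecture below is just an illustrative placeholder.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes half of the activations during training
    nn.Linear(256, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 penalty on weights

model.train()  # dropout active while training
model.eval()   # dropout disabled for evaluation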

Network architecture

Dense/Fully connected layers have connections between every neuron of one layer and every neuron of the next. Outputs are computed using matrix multiplication and addition.

Convolution layers pass a filter $K$ of weights of a given size over 2-dimensional input data. Filter size and stride (step size) are hyperparameters. Optionally, padding is applied to preserve dimensionality. The filter produces a single number as output at each step. CNNs are good at learning spatial patterns in images. Multiple filters can be used in a single layer to learn more patterns.
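
A quick shape check of a convolution layer versus a dense layer on a 28×28 input; the convolution matches LeNet-5's first layer below, while the dense layer's 10 outputs are an arbitrary choice.

import torch
import torch.nn as nn

# 6 filters of size 5x5, stride 1, padding 2, so the 28x28 spatial size is preserved.
conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
fc = nn.Linear(in_features=28 * 28, out_features=10)

x = torch.rand(1, 1, 28, 28)   # (batch, channels, height, width)
print(conv(x).shape)           # torch.Size([1, 6, 28, 28])
print(fc(x.flatten(1)).shape)  # torch.Size([1, 10])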

LeNet-5

LeNet is a series of convolutional neural network architectures published by AT&T’s Bell Labs between 1988 and 1998 to classify handwritten digits. The “Le” in LeNet refers to Yann LeCun, who led the research group that worked on these models. LeCun is a big name in the field. Five “LeNet” architectures were published, with LeNet-5 being the one benchmarked on the MNIST dataset.

LeNet-5 architecture. By dvgodoy / CC BY 4.0

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import sys

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # LeNet-5: two convolution + pooling stages followed by three fully connected layers.
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0)
        self.fc1 = nn.Linear(in_features=400, out_features=120)  # 16 * 5 * 5 = 400 flattened features
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=10)    # one output per digit class

    def forward(self, x):
        x = torch.sigmoid(self.conv1(x))              # (1, 28, 28) -> (6, 28, 28)
        x = F.avg_pool2d(x, kernel_size=2, stride=2)  # (6, 28, 28) -> (6, 14, 14)
        x = torch.sigmoid(self.conv2(x))              # (6, 14, 14) -> (16, 10, 10)
        x = F.avg_pool2d(x, kernel_size=2, stride=2)  # (16, 10, 10) -> (16, 5, 5)
        x = torch.flatten(x, 1)                       # flatten everything except the batch dimension
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        x = self.fc3(x)                               # raw logits; softmax is applied when computing the loss
        return x

Written using PyTorch. This replaces the original Gaussian (RBF) output layer with a fully connected layer whose outputs go through softmax when the loss is computed.

This model achieves 97.90% accuracy on MNIST, and 86.54% accuracy on FashionMNIST, after 10 epochs using the Adam optimizer. The performance of the model can be greatly improved by replacing the average pooling layers with max pooling layers, and the sigmoid activations with ReLU.

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0)
        self.fc1 = nn.Linear(in_features=400, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=10)

    def forward(self, x):
        x = F.relu(self.conv1(x))                     # sigmoid -> ReLU
        x = F.max_pool2d(x, kernel_size=2, stride=2)  # average pooling -> max pooling
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, kernel_size=2, stride=2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

With these changes made, the model achieves 99.07% accuracy on MNIST and 90.45% accuracy on FashionMNIST after 10 epochs.

Future work

There is a lot that could be done to expand upon this. I could compare model performance when different choices are made; I compared some parameters here.

I could explore other datasets, such as CIFAR, or other types of models. I could train models for tasks other than classifying images, or explore lower-level implementations of a model. And so on.