MNIST
The MNIST dataset, or Modified National Institute of Standards and Technology dataset, is a labeled, preprocessed dataset of handwritten digits written by American Census Bureau employees and American high school students. The database contains 28×28 grayscale images, split into 60,000 training images and 10,000 testing images. Training a model to recognize these digits is something of a “Hello world” for machine learning.
Fashion MNIST is a dataset of the same size, with grayscale images of the same dimensions that each depict one of 10 classes of clothing. An artificial neural network that can be trained on the MNIST dataset to recognize digits should also be trainable on the Fashion MNIST dataset to recognize articles of clothing, although getting a good score may be harder.
Artificial neural networks
We only consider feedforward neural networks for simplicity. An artificial neural network is a directed weighted graph composed of layers, including an input layer, hidden layers, and an output layer. Each layer is made up of nodes, termed “neurons”, that hold values that are a function of the values in the previous layer. Neurons in a given layer are connected to neurons in the next layer.
Figure: a network with 3 fully connected layers. By Glosser.ca / CC BY-SA 3.0
If there are $n$ neurons in layer $l-1$, and $m$ neurons in layer $l$, then the weights between the layers are an $m \times n$ matrix $W^{(l)}$, and the biases are an $m$-dimensional vector $b^{(l)}$. We call the vector of neuron values in layer $l$ the activation $a^{(l)}$:

$$a^{(l)} = \sigma\!\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$$

where $\sigma$ is a nonlinear activation function, the operation between $W^{(l)}$ and $a^{(l-1)}$ is matrix multiplication, and the operation with $b^{(l)}$ is matrix addition.
Consider a network with $L$ layers. Let $f^{(l)}(x) = \sigma\!\left(W^{(l)} x + b^{(l)}\right)$. The entire network is a composition of layers:

$$F(x) = \left(f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}\right)(x)$$
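As a minimal sketch of this computation (the layer sizes and random values are purely illustrative), a single layer is just a matrix-vector product, a bias addition, and an element-wise activation:

import torch

def layer(a_prev, W, b, activation=torch.sigmoid):
    # a^(l) = sigma(W^(l) a^(l-1) + b^(l))
    return activation(W @ a_prev + b)

torch.manual_seed(0)
W1, b1 = torch.randn(3, 4), torch.randn(3)  # layer 1: 4 inputs -> 3 neurons
W2, b2 = torch.randn(2, 3), torch.randn(2)  # layer 2: 3 neurons -> 2 outputs

x = torch.randn(4)
output = layer(layer(x, W1, b1), W2, b2)    # F(x) = f2(f1(x))
print(output)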
Activation function
The purpose of the activation function is to introduce non-linearity into the network. Without an activation function, all layers can collapse into a single linear layer:

$$W^{(2)}\left(W^{(1)} x + b^{(1)}\right) + b^{(2)} = W' x + b'$$

for some $W' = W^{(2)} W^{(1)}$ and $b' = W^{(2)} b^{(1)} + b^{(2)}$. By induction, this can be repeated for $L$ layers.
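A quick numerical check of this collapse, with random weights chosen purely for illustration:

import torch

torch.manual_seed(0)
W1, b1 = torch.randn(3, 4), torch.randn(3)
W2, b2 = torch.randn(2, 3), torch.randn(2)
x = torch.randn(4)

two_linear_layers = W2 @ (W1 @ x + b1) + b2        # no activation in between
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2           # the collapsed single layer
print(torch.allclose(two_linear_layers, W_prime @ x + b_prime))  # True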
A linear layer can neither solve nor approximate non-linear problems. In contrast, the Universal Approximation Theorem states that
multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.
Examples of activation functions
Table from Wikipedia
Name | Function | Derivative | Range | Order of continuity |
---|---|---|---|---|
Binary step | $\begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}$ | $0$ for $x \ne 0$ | $\{0, 1\}$ | $C^{-1}$ |
Sigmoid | $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ | $\sigma(x)\left(1 - \sigma(x)\right)$ | $(0, 1)$ | $C^{\infty}$ |
ReLU | $\max(0, x)$ | $\begin{cases} 0 & x < 0 \\ 1 & x > 0 \end{cases}$ | $[0, \infty)$ | $C^{0}$ |
Leaky ReLU | $\max(0.01x, x)$ | $\begin{cases} 0.01 & x < 0 \\ 1 & x > 0 \end{cases}$ | $(-\infty, \infty)$ | $C^{0}$ |
Sigmoid is apt for binary classification problems, whereas softmax is a generalization of the sigmoid function for multiclass classification problems. Given a layer $z$ with $K$ neurons,

$$\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
ReLU and its variants are commonly used in hidden layers, and sigmoid and softmax are commonly used in output layers. ReLU and its variants are preferred for their computational efficiency; plain ReLU gives sparse activations and suffers less from vanishing gradients, while its variants primarily serve to preserve a small gradient for neurons with negative pre-activations. Sigmoid and softmax give values between $0$ and $1$, which are easier to process as outputs.
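As a quick illustration using PyTorch's built-in implementations (the input values are made up):

import torch
import torch.nn.functional as F

z = torch.tensor([2.0, -1.0, 0.5])   # example pre-activations for 3 neurons

print(torch.relu(z))        # tensor([2.0000, 0.0000, 0.5000]); negatives zeroed
print(torch.sigmoid(z))     # values in (0, 1), computed element-wise
print(F.softmax(z, dim=0))  # values in (0, 1) that sum to 1 across the layer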
Training
Artificial neural networks, like all machine learning models, are statistical models. They are trained to minimize a loss function $\mathcal{L}$, which quantifies the discrepancy between model outputs, $\hat{y}$, and true labels, $y$. Differentiating the loss guides gradient-based optimization. For supervised learning models, true labels are provided. In unsupervised learning models, they are derived. MNIST datasets are labelled, so I will only discuss loss functions used with supervised learning models.
- Forward propagation: A model produces an output $\hat{y} = F(x)$ based on given inputs $x$.
- Loss function: The loss $\mathcal{L}(\hat{y}, y)$ is computed.
- Backpropagation: Gradients of the loss are computed for each weight and bias.
- Optimization algorithm: Weights and biases are updated.
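These four steps map directly onto a standard PyTorch training loop. A minimal sketch, assuming a model net, a DataLoader train_loader, a loss function criterion, and an optimizer have already been constructed (for example, the LeNet-5 model and MNIST loaders used later in this post):

def train_one_epoch(net, train_loader, criterion, optimizer):
    net.train()
    for images, labels in train_loader:
        optimizer.zero_grad()              # clear gradients from the last step
        outputs = net(images)              # 1. forward propagation
        loss = criterion(outputs, labels)  # 2. loss function
        loss.backward()                    # 3. backpropagation
        optimizer.step()                   # 4. optimization step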
Loss functions
Loss function | Formula |
---|---|
Mean squared error | $\mathcal{L} = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ |
Binary cross entropy | $\mathcal{L} = -\dfrac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$ |
Cross entropy | $\mathcal{L} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$ |
Mean squared error is used in regression tasks, or tasks with continuous output. Cross entropy is used for classification tasks. Other loss functions are commonly used depending on what task is being performed.
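Both are available directly in PyTorch; as a quick sketch (the numbers are arbitrary examples):

import torch
import torch.nn as nn

# Regression: mean squared error between predictions and targets
mse = nn.MSELoss()
print(mse(torch.tensor([2.5, 0.0]), torch.tensor([3.0, -0.5])))  # 0.25

# Classification: cross entropy over raw logits and integer class labels
ce = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, -1.0]])  # one sample, three classes
label = torch.tensor([0])                  # the true class index
print(ce(logits, label))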
Backpropagation
The derivative of a composite function can be computed using the chain rule. Given a function $h(x) = f(g(x))$,

$$h'(x) = f'(g(x))\, g'(x)$$

Thus, the gradient of the loss is computed for each layer iteratively in reverse order. Let $a^{(l)} = \sigma(z^{(l)})$ be the output of a layer, and $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$ be the output of the layer before the activation function is applied.

$$\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'(z^{(L)})$$
$$\delta^{(l)} = \left( (W^{(l+1)})^{\top} \delta^{(l+1)} \right) \odot \sigma'(z^{(l)})$$
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$$

where $\odot$ represents the Hadamard product, and $\nabla$ is the differential operator.
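As a sanity check, these equations can be compared against PyTorch's autograd for a single sigmoid layer with a mean squared error loss (shapes and values are arbitrary):

import torch

torch.manual_seed(0)
W = torch.randn(3, 4, requires_grad=True)
b = torch.randn(3, requires_grad=True)
a_prev = torch.randn(4)
y = torch.randn(3)

z = W @ a_prev + b
a = torch.sigmoid(z)
loss = ((a - y) ** 2).mean()
loss.backward()

# Manual gradients from the backpropagation equations above
delta = ((2.0 / 3) * (a - y) * a * (1 - a)).detach()       # dL/da * sigma'(z), elementwise
print(torch.allclose(W.grad, torch.outer(delta, a_prev)))  # True
print(torch.allclose(b.grad, delta))                       # True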
Optimizers
Optimizers update model parameters. They typically take hyperparameters, or user-defined parameters, as input.
Optimizer | Function |
---|---|
SGD | $\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}$ |
SGD w/ momentum | $v_{t+1} = \gamma v_t + \eta \nabla_\theta \mathcal{L}, \quad \theta_{t+1} = \theta_t - v_{t+1}$ |
RMSprop | $E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2, \quad \theta_{t+1} = \theta_t - \dfrac{\eta}{\sqrt{E[g^2]_t} + \epsilon}\, g_t$ |
Adam | $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \quad \hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}, \quad \theta_{t+1} = \theta_t - \dfrac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$ |

with hyperparameters including the learning rate ($\eta$), momentum coefficient ($\gamma$), stabilizer ($\epsilon$), and decay rates ($\rho$, $\beta_1$, $\beta_2$).
A better optimizer will increase the rate of convergence, stability, and generalization. Thus, it is acceptable for the update rule to be slightly slower to compute. Adam is a very popular optimizer for deep learning tasks, although it may overfit data.
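In PyTorch, each of these is available in torch.optim, and swapping them is a one-line change. A sketch with a placeholder model and typical default hyperparameters (not tuned values):

import torch.nn as nn
import torch.optim as optim

net = nn.Linear(10, 2)  # placeholder model; any nn.Module works

opt_sgd      = optim.SGD(net.parameters(), lr=0.01)
opt_momentum = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
opt_rmsprop  = optim.RMSprop(net.parameters(), lr=0.001, alpha=0.99, eps=1e-8)
opt_adam     = optim.Adam(net.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)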
Overfitting
Overfitting refers to a model learning the data it is trained on too well. That is, it may learn noise and outliers of the training data, limiting its applicability to unseen, general problems. Overfitting occurs when a model is too complex for the data it is approximating, when it is trained on limited data, and when the model isn’t regularized. It is the reason a distinction is made between training data and testing data. A good way to detect overfitting is to look for high performance on training data (as measured by the loss function) combined with low performance on test data.
Regularization
Regularization refers to methods that reduce variance and prevent overfitting. There are penalized regression techniques, such as $L_1$ and $L_2$ regularization, that add penalty terms to the loss function to increase sparsity (decrease the number of weights influencing a decision) or uniformly shrink the effect of individual parameters. Dropout is a practice in which neuron activations are randomly set to zero during training, forcing the model to learn representations of patterns that are redundant across the network.
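In PyTorch, L2 regularization is commonly applied through an optimizer's weight_decay argument, and dropout through an nn.Dropout layer. A sketch with common default rates rather than tuned values:

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zero half of the activations during training
    nn.Linear(256, 10),
)

# weight_decay adds an L2 penalty on the parameters to each update
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)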
Network architecture
Dense/Fully connected layers have connections between every neuron of one layer and every neuron of the next. Outputs are computed using matrix multiplication and addition.
Convolution layers pass a filter of weights of a given size over 2-dimensional input data. Filter size and stride/step size are hyperparameters. Optionally, padding is also applied to preserve dimensionality. The filter produces a single number as output after each step. Convolutional neural networks (CNNs) are good at learning spatial patterns in images. Multiple filters can be used in a single layer to learn more patterns.
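A quick shape check of both layer types on an MNIST-sized input (the convolution matches the first layer of the LeNet-5 model below):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)      # one 28x28 grayscale image

dense = nn.Linear(28 * 28, 10)
print(dense(x.flatten(1)).shape)   # torch.Size([1, 10])

conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2)
print(conv(x).shape)               # torch.Size([1, 6, 28, 28]): 6 filters, padding keeps 28x28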
LeNet-5
LeNet is a series of convolutional neural network architectures published by AT&T’s Bell Labs between 1988 and 1998 to classify handwritten digits, such as those in the MNIST dataset. The “Le” in LeNet refers to Yann LeCun, who led the research group that worked on these models. LeCun is a big name in the field. Five “LeNet” architectures were published.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import sys
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Two convolutional layers followed by three fully connected layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0)
        self.fc1 = nn.Linear(in_features=400, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=10)

    def forward(self, x):
        x = F.sigmoid(self.conv1(x))                  # 1x28x28 -> 6x28x28
        x = F.avg_pool2d(x, kernel_size=2, stride=2)  # -> 6x14x14
        x = F.sigmoid(self.conv2(x))                  # -> 16x10x10
        x = F.avg_pool2d(x, kernel_size=2, stride=2)  # -> 16x5x5
        x = torch.flatten(x, 1)                       # -> 400
        x = F.sigmoid(self.fc1(x))
        x = F.sigmoid(self.fc2(x))
        x = self.fc3(x)                               # raw logits for 10 classes
        return x
Written using PyTorch. This replaces the original Gaussian (RBF) output layer with a fully connected layer whose logits are passed to softmax cross entropy during training.
This model achieves 97.90% accuracy on MNIST, and 86.54% accuracy on FashionMNIST after 10 epochs, using the Adam optimizer. The performance of the model can be greatly improved by replacing the average pooling layers with max pooling layers and the sigmoid activations with ReLU.
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1, padding=0)
        self.fc1 = nn.Linear(in_features=400, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=10)

    def forward(self, x):
        x = F.relu(self.conv1(x))                     # sigmoid replaced with ReLU
        x = F.max_pool2d(x, kernel_size=2, stride=2)  # average pooling replaced with max pooling
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, kernel_size=2, stride=2)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
The result, with these changes made, is a model that achieves 99.07% accuracy on MNIST and 90.45% accuracy on FashionMNIST after 10 epochs.
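For context, here is a sketch of the surrounding training and evaluation script, using the imports and Net class above; the batch size and normalization constants are common choices for MNIST, not necessarily the exact ones used for these results:

from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),   # standard MNIST mean/std
])
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=1000)

net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters())

for epoch in range(10):
    net.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(net(images), labels)
        loss.backward()
        optimizer.step()

net.eval()
correct = 0
with torch.no_grad():
    for images, labels in test_loader:
        correct += (net(images).argmax(dim=1) == labels).sum().item()
print(f"test accuracy: {correct / len(test_set):.4f}")

For Fashion MNIST, datasets.FashionMNIST can be swapped in with no other changes.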
Future work
There is a lot that could be done to expand upon this. I could compare model performance when different choices are made. Compared parameters here.
I could explore other datasets, such as CIFAR. I could explore other types of models. I could also train models for tasks other than classifying images. I could explore lower-level implementations of a model. Etc.