Git Repository

The results are in the training_results folder.

Problem Statement

The goal is to determine how the choice of optimizer affects LeNet-5’s performance on the MNIST and Fashion MNIST benchmarks. The baseline is LeNet-5 trained with SGD, using a learning rate of 0.001 and momentum of 0.9; without momentum, SGD barely learns at all.
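
For reference, here is a minimal sketch of how this baseline might be set up in PyTorch; the placeholder model and variable names are illustrative, not the repository’s actual code:

import torch.nn as nn
import torch.optim as optim

# Placeholder model; the real experiments use LeNet-5 (defined in the repository).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Baseline optimizer: SGD with lr=0.001 and momentum=0.9.
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Without momentum, the same learning rate barely learns:
# optimizer = optim.SGD(model.parameters(), lr=0.001)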

Results

The table reports test-set accuracy from the 9th or 10th epoch, whichever is higher. Rows beginning with ^ add the listed optimizer to the ReLU and MaxPool configuration.

| Configuration | MNIST accuracy | Fashion MNIST accuracy | Notes |
|---|---|---|---|
| LeNet-5, lr=0.001 | 11% | 11% | Didn’t learn. |
| LeNet-5, lr=0.01 | 91.91% | 77.20% | Slow start; more epochs needed. |
| ReLU, lr=0.001 | 87.28% | 72.69% | |
| ReLU, lr=0.01 | 98.94% | 90.22% | Slow start. |
| MaxPool, lr=0.01 | 94.57% | 79.37% | Slow start. |
| ReLU and MaxPool, lr=0.001 | 98.11% | 87.21% | Trained steadily; would benefit from more epochs. |
| ReLU and MaxPool, lr=0.01 | 99.07% | 90.45% | Same as Adam. |
| ^ and ASGD, lr=0.01 | 98.52% | | |
| ^ and Rprop, lr=0.01 | 91.95% | | |
| ^ and RMSprop, lr=0.001 | 98.95% | 90.64% | Best Fashion MNIST score. |
| ^ and Adadelta, lr=0.001 | 80.37% | 65.66% | |
| ^ and Adafactor, lr=0.01 | 99.15% | 89.87% | Best MNIST score. |
| ^ and Adagrad, lr=0.01 | 98.93% | | |
| ^ and Adagrad, lr=0.001 | 96.07% | | Would likely match lr=0.01 with more epochs. |
| ^ and Adam, lr=0.01 | 98.31% | | Jumped around a lot; lr obviously too high. |
| ^ and Adam, lr=0.001 | 99.07% | 90.35% | 97.34% and 84.45% accuracy after the first epoch. |
| ^ and AdamW, lr=0.001 | 98.94% | 89.48% | |
| ^ and Adamax, lr=0.001 | 98.95% | 89.07% | |
| ^ and NAdam, lr=0.001 | 99.08% | | |
| ^ and NAdam, lr=0.002 | 99.01% | | |
| ^ and RAdam, lr=0.001 | 98.97% | | Jumps around a decent amount. |
| ^ and RAdam, lr=0.0001 | 98.23% | | Needs more epochs. |

In conclusion, using ReLU over Sigmoid and MaxPool over AvgPool provides the most significant benefit. Most optimizers perform about the same on this small dataset, with Adafactor performing best on MNIST and RMSprop performing best on Fashion MNIST. However, SGD is extremely sensitive to the learning rate, so it would likely perform significantly worse on larger models. RMSprop, Adam, and NAdam seem promising. The default parameters provided by PyTorch seem very sane and often come close to the best results.
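
To make the two architectural changes concrete, here is a rough sketch of a LeNet-5 variant with ReLU and max pooling, along with how the optimizer rows map to torch.optim calls. The class name and exact layer sizes are my own reconstruction and may differ from the repository’s code.

import torch.nn as nn
import torch.optim as optim

class LeNet5(nn.Module):
    # LeNet-5 with ReLU activations and max pooling instead of the classic sigmoid/average pooling.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                            # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()

# The optimizer rows in the table correspond to torch.optim constructions like these:
optimizer = optim.RMSprop(model.parameters(), lr=0.001)   # best Fashion MNIST result above
# optimizer = optim.Adam(model.parameters(), lr=0.001)    # PyTorch's default lr for Adam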

In future tests, I will drop Rprop, Adadelta, and Adagrad. Additionally, I will pick one Adam-based optimizer based on its performance when training a larger model.

LBFGS

LBFGS has problems. First off, it is exceptionally slow, and PyTorch’s documentation notes that it is very memory-intensive. The loss quickly went to NaN. I attempted to resolve this by setting the learning rate to 0.1 and changing the batch size to 512, which only delayed the NaN loss from the first epoch to the third. LBFGS also requires the program to be modified slightly: its step function must be passed a closure that recomputes the model’s output and loss. Modifying the training function as follows satisfies that requirement:

def train(model, device, train_loader, optimizer, epoch):
    # criterion, log_interval, and dataset are assumed to be defined elsewhere in the script.
    model.train()
    losses = []
    correct = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        # LBFGS may call this closure several times per step to re-evaluate the loss.
        def closure():
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            return loss

        # Perform the optimization step
        loss = optimizer.step(closure)
        losses.append(loss.item())

        # Calculate accuracy with updated weights
        with torch.no_grad():
            output = model(data)
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

        if batch_idx % log_interval == 0 and batch_idx != 0:
            print(f'{dataset}: Epoch {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}] Loss: {loss.item():.6f}')

    avg_loss = sum(losses) / len(losses)
    accuracy = 100. * correct / len(train_loader.dataset)
    print(f'{dataset}: Epoch {epoch} - Avg Loss: {avg_loss:.6f}, Accuracy: {accuracy:.2f}%')

However, even with this change the model reaches only 9.8% accuracy and the loss is still NaN. There is more I could try, but it would change the model into something I consider different, and the comparison would no longer be fair.

Adding batch normalization, applying explicit weight initialization, reducing the learning rate to 0.01 (the default for LBFGS is 1), applying gradient clipping, and changing the batch size again allows it to achieve 98.77% accuracy on MNIST, which is still worse than Adam manages without any of those changes. To reiterate, it is also significantly slower than every other algorithm tried.
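
For reference, here is a rough sketch of what those mitigations might look like; the layer placement, initialization scheme, and clipping threshold are illustrative guesses rather than the exact changes used.

import torch
import torch.nn as nn
import torch.optim as optim

# LeNet-5-style network with batch normalization added after each convolution.
model = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.BatchNorm2d(6), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(6, 16, kernel_size=5), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)

# Explicit (Kaiming) weight initialization for conv and linear layers.
for m in model.modules():
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight)
        nn.init.zeros_(m.bias)

# LBFGS with a much smaller learning rate than its default of 1.
optimizer = optim.LBFGS(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# One dummy batch stands in for a real DataLoader batch here.
data = torch.randn(512, 1, 28, 28)
target = torch.randint(0, 10, (512,))

def closure():
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    # Gradient clipping to keep the line search from blowing the loss up.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    return loss

loss = optimizer.step(closure)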

The Wikipedia article on L-BFGS states that “L-BFGS has been called ‘the algorithm of choice’ for fitting log-linear (MaxEnt) models and conditional random fields with ℓ2-regularization” [2][3].

So it may simply be a poor fit for this problem.

Under the same conditions, with default arguments, Adam achieves 99.11% accuracy. I won’t test anything else, as I have changed way too much at once for that to be meaningful.

Future Work

Try SparseAdam and regularization in general, train for more epochs, vary other hyperparameters, and further explore the math and reasoning behind these optimizers.