Results are in the training_results folder.
Problem Statement
To determine how the choice of optimizer affects LeNet-5’s performance on the MNIST and Fashion MNIST benchmarks. The baseline is LeNet-5 trained with SGD at a learning rate of 0.001 and momentum of 0.9; without momentum, SGD barely learns at all.
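For reference, a minimal sketch of the baseline setup, assuming a LeNet5 model class like the one in this repository (the class name, the device handling, and the cross-entropy loss are illustrative assumptions; only the SGD settings come from the baseline described above):

```python
import torch
import torch.optim as optim

# Baseline: classic LeNet-5 (sigmoid activations, average pooling) trained with
# SGD at lr=0.001 and momentum=0.9. `LeNet5` stands in for the model class
# defined in this repository; cross-entropy is assumed as the loss.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LeNet5().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
```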
Results
The table reports test-set accuracy from either the 9th or 10th epoch, whichever is higher. Rows that don’t name an optimizer use the baseline SGD with momentum; rows beginning with ^ apply the listed optimizer to the ReLU and MaxPool variant, which is sketched after the table.
| Choices | MNIST Performance | Fashion MNIST Performance | Notes |
|---|---|---|---|
| LeNet-5 lr=0.001 | 11% | 11% | Didn’t learn. |
| LeNet-5 lr=0.01 | 91.91% | 77.20% | Slow start; more epochs needed. |
| ReLU lr=0.001 | 87.28% | 72.69% | _ |
| ReLU lr=0.01 | 98.94% | 90.22% | Slow start. |
| MaxPool lr=0.01 | 94.57% | 79.37% | Slow start. |
| ReLU and MaxPool lr=0.001 | 98.11% | 87.21% | Trained steadily; would benefit from more epochs. |
| ReLU and MaxPool lr=0.01 | 99.07% | 90.45% | Same MNIST accuracy as Adam lr=0.001. |
| ^ and ASGD lr=0.01 | 98.52% | _ | _ |
| ^ and Rprop lr=0.01 | 91.95% | _ | _ |
| ^ and RMSprop lr=0.001 | 98.95% | 90.64% | **Best Fashion MNIST score.** |
| ^ and Adadelta lr=0.001 | 80.37% | 65.66% | _ |
| ^ and Adafactor lr=0.01 | 99.15% | 89.87% | **Best MNIST score.** |
| ^ and Adagrad lr=0.01 | 98.93% | _ | _ |
| ^ and Adagrad lr=0.001 | 96.07% | _ | Would likely match lr=0.01 with more epochs. |
| ^ and Adam lr=0.01 | 98.31% | _ | Accuracy jumped around a lot; learning rate clearly too high. |
| ^ and Adam lr=0.001 | 99.07% | 90.35% | 97.34% / 84.45% accuracy after the first epoch. |
| ^ and AdamW lr=0.001 | 98.94% | 89.48% | _ |
| ^ and Adamax lr=0.001 | 98.95% | 89.07% | _ |
| ^ and NAdam lr=0.001 | 99.08% | _ | _ |
| ^ and NAdam lr=0.002 | 99.01% | _ | _ |
| ^ and RAdam lr=0.001 | 98.97% | _ | Accuracy jumps around a fair amount. |
| ^ and RAdam lr=0.0001 | 98.23% | _ | Needs more epochs. |
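For context, a minimal sketch of the ReLU and MaxPool variant referenced above, assuming the classic LeNet-5 layer sizes with padding on the first convolution to handle 28x28 inputs; the exact definition lives in the repository, so treat the class name and details as illustrative:

```python
import torch.nn as nn

class LeNet5ReluMaxPool(nn.Module):
    """Classic LeNet-5 layout with sigmoid -> ReLU and average pooling -> max pooling."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # 1x28x28 -> 6x28x28
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),                 # 6x28x28 -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),             # 6x14x14 -> 16x10x10
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),                 # 16x10x10 -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```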
In conclusion, using ReLU over Sigmoid and MaxPool over AvgPool provides the most significant benefit. Most optimizers perform about the same on this small dataset, with Adafactor performing best on MNIST and RMSprop performing best on Fashion MNIST. However, SGD is extremely sensitive to the learning rate, so it would likely perform significantly worse on larger models. RMSprop, Adam, and NAdam seem promising. The default parameters provided by PyTorch seem very sane and often perform close to the best.
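The optimizer comparisons above boil down to swapping the constructor while keeping everything else fixed. A rough sketch of that sweep, using standard torch.optim classes but an otherwise illustrative structure (the device, data loaders, and the train/test helpers are assumed from the training script, and this is not necessarily the exact loop used):

```python
import torch.optim as optim

# Maps a few of the table rows to optimizer constructors; learning rates match the table.
# The dict/loop structure is illustrative; only the torch.optim classes and their
# arguments are real (Adafactor requires a recent PyTorch release).
optimizer_configs = {
    "SGD lr=0.01":       lambda p: optim.SGD(p, lr=0.01, momentum=0.9),
    "ASGD lr=0.01":      lambda p: optim.ASGD(p, lr=0.01),
    "RMSprop lr=0.001":  lambda p: optim.RMSprop(p, lr=0.001),
    "Adafactor lr=0.01": lambda p: optim.Adafactor(p, lr=0.01),
    "Adam lr=0.001":     lambda p: optim.Adam(p, lr=0.001),
    "AdamW lr=0.001":    lambda p: optim.AdamW(p, lr=0.001),
    "NAdam lr=0.002":    lambda p: optim.NAdam(p, lr=0.002),
    "RAdam lr=0.001":    lambda p: optim.RAdam(p, lr=0.001),
}

for name, make_optimizer in optimizer_configs.items():
    model = LeNet5ReluMaxPool().to(device)    # fresh weights for every run
    optimizer = make_optimizer(model.parameters())
    for epoch in range(1, 11):                # 10 epochs, matching the table
        train(model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader)
```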
In future tests, I will be dropping Rprop, Adadelta, and Adagrad. Additionally, I will pick one Adam-based optimizer based on how each performs when training a larger model.
LBFGS
LBFGS has problems. First off, it is exceptionally slow, and PyTorch’s documentation also notes that it is very memory-heavy. The loss quickly went to NaN. I attempted to resolve this by setting the learning rate to 0.1 and changing the batch size to 512, which only pushed the NaN loss from the first epoch to the third. LBFGS also requires the program to be modified a bit: its step function must be passed a closure that recomputes the model’s output and loss. Changing the training function to the following satisfies that requirement:
```python
def train(model, device, train_loader, optimizer, epoch):
    # criterion, log_interval, and dataset are module-level globals from the existing script.
    model.train()
    losses = []
    correct = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        # LBFGS re-evaluates the model several times per step, so it needs a
        # closure that recomputes the output and loss and returns the loss.
        def closure():
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            return loss

        # Perform the optimization step
        loss = optimizer.step(closure)
        losses.append(loss.item())

        # Calculate accuracy with updated weights
        with torch.no_grad():
            output = model(data)
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

        if batch_idx % log_interval == 0 and batch_idx != 0:
            print(f'{dataset}: Epoch {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}] Loss: {loss.item():.6f}')

    avg_loss = sum(losses) / len(losses)
    accuracy = 100. * correct / len(train_loader.dataset)
    print(f'{dataset}: Epoch {epoch} - Avg Loss: {avg_loss:.6f}, Accuracy: {accuracy:.2f}%')
```
However, even with this, the model has an accuracy of 9.8% and a loss of NaN. There is more I could try, but it would change the model into something I consider different, and thus the comparison wouldn’t be fair.
Adding batch normalization, weight initialization, reducing the learning rate to 0.01 (the default for LBFGS is 1), applying gradient clipping, and changing the batch size again lets it reach 98.77% accuracy on MNIST, worse than Adam manages without any of those changes (a sketch of these changes appears at the end of this section). To reiterate, it is also significantly slower than every other algorithm tried. The Wikipedia article on L-BFGS states:
> L-BFGS has been called “the algorithm of choice” for fitting log-linear (MaxEnt) models and conditional random fields with ℓ2-regularization.[2][3]
So perhaps it is simply a poor fit for this problem.
Under the same conditions, with default arguments, Adam achieves 99.11% accuracy. I won’t test anything else, as I have changed way too much at once for that to be meaningful.
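For concreteness, a sketch of the kinds of changes described above for LBFGS. The LBFGS constructor, the init helpers, and the clipping call are standard PyTorch; the specific values and the Kaiming initialization scheme are illustrative assumptions, and model, criterion, data, and target come from the training function shown earlier:

```python
import torch
import torch.nn as nn

# Reduced learning rate: torch.optim.LBFGS defaults to lr=1, which diverged here.
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.01)

# Weight initialization; Kaiming init is an illustrative choice, not necessarily
# the scheme actually used.
def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_weights)

# Inside the training closure, clip gradients after the backward pass so a single
# bad LBFGS step cannot blow the loss up to NaN.
def closure():
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    return loss

# Batch normalization goes in the model definition itself, e.g. an
# nn.BatchNorm2d(6) after the first convolution.
```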
Future Work
SparseAdam and regularization in general, more epochs, changing other hyperparameters, and further exploring the math and reasoning behind each optimizer.