A Hands-on Approach for Implementing Stochastic Optimization Algorithms from Scratch
Education Short Course
Important
This documentation is related to SC-1 Short Courses @ ICASSP’23.
Last update: June 9th, 2023.
Lecture slides are intended to be viewed in presentation mode (full screen).
Partially supported by the Army Research Office (ARO) under Grant # W911NF-22-1-0296.
Course description
Summary
Gradient descent (GD) is a well-known first-order optimization method that uses the gradient of the loss function, along with a step-size (or learning rate), to iteratively update the solution. When the loss (cost) function depends on a dataset of large cardinality, as is typical in deep learning (DL), GD becomes impractical.
In this scenario, stochastic GD (SGD), which uses a noisy gradient approximation (computed over a random fraction of the dataset), has become crucial. Several variants of and improvements over the “vanilla” SGD exist, such as RMSprop, Adagrad, Adadelta, Adam, and Nadam, which are usually provided as black boxes by most DL libraries (TensorFlow, PyTorch, MXNet, etc.).
The primary objective of this course is to combine the essential theoretical aspects of SGD and its variants with hands-on experience programming them in Python from scratch (i.e., without relying on DL libraries such as TensorFlow, PyTorch, or MXNet): the SGD, RMSprop, Adagrad, Adadelta, Adam, and Nadam algorithms are implemented and their performance is tested on the MNIST and CIFAR-10 datasets using shallow networks (consisting of up to two ReLU layers followed by a softmax output layer).
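To make the GD/SGD distinction above concrete, here is a minimal sketch (not the course's code) of both methods on a least-squares loss; all names (X, y, w, lr) and the synthetic data are illustrative choices:

```python
# Sketch: SGD vs. GD on the least-squares loss f(w) = (1/2N)||Xw - y||^2.
# Synthetic data stands in for a real dataset; all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.standard_normal((N, d))
w_true = rng.standard_normal(d)
y = X @ w_true

def full_gradient(w):
    # GD: gradient computed over the whole dataset (costly for large N)
    return X.T @ (X @ w - y) / N

def minibatch_gradient(w, batch=32):
    # SGD: noisy gradient approximation over a random fraction of the data
    idx = rng.integers(0, N, size=batch)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch

w = np.zeros(d)
lr = 0.1                                # step-size (learning rate)
for _ in range(500):
    w -= lr * minibatch_gradient(w)     # swap in full_gradient(w) for GD
```

The only change between the two methods is which gradient estimate feeds the update; that is why the SGD variants later in the course can all be written as small modifications of this loop.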
Syllabus
Introduction
Basic concepts
Bayes’ theorem.
MAP (maximum a posteriori).
Linear regression.
Logistic and softmax regression.
Gradient descent (GD) and stochastic GD.
Hands-on 1
The MNIST dataset.
Data preparation.
GD implementation; simple (quadratic) test.
SGD implementation; multiclass regression for the MNIST dataset.
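A hedged sketch of the multiclass (softmax) regression SGD step covered in this hands-on; synthetic data stands in for MNIST (784 features, 10 classes), and all names and hyperparameter values are illustrative, not the course's reference implementation:

```python
# Softmax regression trained by SGD; synthetic stand-in for MNIST.
import numpy as np

rng = np.random.default_rng(1)
N, d, K = 512, 784, 10
X = rng.standard_normal((N, d)) * 0.1
labels = rng.integers(0, K, size=N)
Y = np.eye(K)[labels]                       # one-hot targets

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)    # subtract max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

W = np.zeros((d, K))
lr, batch = 0.5, 64
for step in range(200):
    idx = rng.integers(0, N, size=batch)
    P = softmax(X[idx] @ W)                   # predicted class probabilities
    grad = X[idx].T @ (P - Y[idx]) / batch    # cross-entropy gradient
    W -= lr * grad
```

The gradient of the softmax/cross-entropy pair reduces to the simple residual form `P - Y`, which is what makes this model a convenient first SGD exercise.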
Part A: Accelerated GD (AGD)
Adaptive step-sizes
Momentum.
Nesterov acceleration.
Anderson acceleration.
Hands-on 2.A
Accelerated GD implementation.
Comparisons w.r.t. GD.
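As a sketch of the kind of comparison done in this hands-on, the snippet below runs plain GD and heavy-ball momentum on an ill-conditioned quadratic; the matrix A, step-size, and momentum coefficient are illustrative choices, not the course's values:

```python
# Heavy-ball momentum vs. plain GD on the quadratic f(w) = 0.5 w^T A w.
import numpy as np

A = np.diag([1.0, 10.0, 100.0])           # ill-conditioned quadratic
grad = lambda w: A @ w

def run(use_momentum, steps=100, lr=0.015, beta=0.9):
    w = np.ones(3)
    v = np.zeros(3)
    for _ in range(steps):
        if use_momentum:
            v = beta * v - lr * grad(w)   # velocity accumulates past gradients
            w = w + v
        else:
            w = w - lr * grad(w)          # plain GD step
    return float(np.linalg.norm(w))       # distance to the minimizer (0)

print(run(False), run(True))  # momentum is expected to end closer to 0
```

On quadratics like this, momentum damps the slow modes (small eigenvalues) that plain GD struggles with, which is exactly the behavior the comparison w.r.t. GD is meant to expose.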
Part B: SGD variants
Momentum (SGD-MTM)
Nesterov (SGD-NTRV)
SG Clipping (SGC)
Adagrad
Adadelta
RMSprop
Adam
AdaMax
Nadam
AdaBelief
SGD variants’ taxonomy
Hands-on 2.B
SGD variants implementation.
Multiclass regression for the MNIST dataset.
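As one representative of the Part B variants, here is a hedged sketch of the Adam update rule applied to a toy least-squares problem; the data are synthetic and the hyperparameters follow commonly used default values, not necessarily the course's settings:

```python
# Adam on a toy least-squares problem; all names/values are illustrative.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((256, 4))
w_true = rng.standard_normal(4)
y = X @ w_true

w = np.zeros(4)
m = np.zeros(4)            # first-moment (mean) estimate of the gradient
v = np.zeros(4)            # second-moment (uncentered variance) estimate
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):
    idx = rng.integers(0, 256, size=32)
    g = X[idx].T @ (X[idx] @ w - y[idx]) / 32   # minibatch gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)                     # bias correction
    v_hat = v / (1 - b2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)    # per-coordinate step-size
```

Most of the other variants in the list (Adagrad, Adadelta, RMSprop, Nadam, AdaMax) differ mainly in how `m` and `v` are accumulated and combined, which is what the variants' taxonomy organizes.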
Part A: Hidden layers
Introduction.
Linear vs. non-linear.
Activation functions.
Hands-on 3.A
Impact of adding one ReLU hidden layer with random (untrained) weights.
Classification of the MNIST dataset.
Classification of the CIFAR-10 dataset.
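The "random-weight ReLU hidden layer" idea above can be sketched as follows: a fixed random matrix maps inputs to non-linear features, and only a softmax layer on top would be trained. Dimensions mimic MNIST (784 inputs, 10 classes), but the data and names here are illustrative:

```python
# Fixed random ReLU layer producing non-linear features (layer untrained).
import numpy as np

rng = np.random.default_rng(3)
d, h, K = 784, 128, 10
W1 = rng.standard_normal((d, h)) / np.sqrt(d)   # random, untrained weights

def relu_features(X):
    return np.maximum(X @ W1, 0.0)              # ReLU activation

X = rng.standard_normal((8, d))                 # 8 synthetic "images"
H = relu_features(X)                            # shape (8, h) feature matrix
# a trained softmax layer of shape (h, K) would then map H to class logits
```

Because W1 is never trained, any accuracy gain over plain softmax regression isolates the effect of the non-linear feature map itself.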
Part B: Computing gradients
Introduction.
The backpropagation (BP) algorithm.
SGD and BP working together.
Hands-on 3.B
BP and SGD along with one ReLU hidden layer.
Classification of the CIFAR-10 dataset.
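A hedged sketch of BP and SGD working together on one ReLU hidden layer with a softmax/cross-entropy output, as in this hands-on; the toy sizes, the synthetic linearly separable labels, and the hyperparameters are all illustrative assumptions:

```python
# Backpropagation through one ReLU hidden layer, trained by SGD.
import numpy as np

rng = np.random.default_rng(4)
N, d, h, K = 256, 20, 32, 2
X = rng.standard_normal((N, d))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy separable targets
Y = np.eye(K)[labels]

W1 = rng.standard_normal((d, h)) / np.sqrt(d)
W2 = rng.standard_normal((h, K)) / np.sqrt(h)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

lr, batch = 0.3, 32
for _ in range(500):
    idx = rng.integers(0, N, size=batch)
    Xb, Yb = X[idx], Y[idx]
    # forward pass
    A = np.maximum(Xb @ W1, 0.0)        # hidden ReLU activations
    P = softmax(A @ W2)
    # backward pass (chain rule, layer by layer)
    dZ2 = (P - Yb) / batch              # gradient at the output logits
    dW2 = A.T @ dZ2
    dA = dZ2 @ W2.T                     # propagate back through W2
    dZ1 = dA * (A > 0)                  # ReLU derivative mask
    dW1 = Xb.T @ dZ1
    W1 -= lr * dW1                      # SGD step on both layers
    W2 -= lr * dW2
```

The backward pass simply reuses the activations cached in the forward pass; adding a second hidden layer (as in Hands-on 4.A) repeats the same mask-and-propagate pattern once more.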
Part A: Hands-on 4.A
BP and SGD along with two hidden layers / SGD variants.
Classification of the CIFAR-10 dataset.
Part B: DL (deep learning) overview
Introduction.
Convolutional layer
Other layers: maxpool, dropout, dense, etc.
DL libraries: TensorFlow, PyTorch, MXNet.
Hands-on 4.B
Using TensorFlow (TF).
Performance comparison (w.r.t. previously developed code).
Implementing your own solver in TF.
Using simple DL networks.