It can be difficult to scale machine learning projects when a data scientist must make manual decisions in the pipeline. An example in deep learning is finding the optimal learning rate that will minimize the loss function at each step. In his papers *Cyclical Learning Rates for Training Neural Networks* and *A Disciplined Approach to Neural Network Hyper-Parameters*, research scientist Leslie Smith introduces a method for systematically determining an optimal learning rate.
With this method, data scientists can train their models faster and potentially achieve better metrics than by manually selecting an arbitrary value. In this process, the model is trained on a few batches of data, and the learning rate is increased after each batch until the loss begins to grow exponentially (as seen in the graph below). After plotting loss against learning rate, a good learning rate can be read off the curve: either the value roughly one power of ten below the inflection point, or the value at the point of steepest downward slope. While this method is useful for finding a learning rate, there are still many situations where the rate it suggests is sub-optimal.
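The steepest-slope heuristic described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not Smith's or fastai's exact code; the function name `suggest_lr` and the smoothing window are our own choices.

```python
import numpy as np

def suggest_lr(lrs, losses):
    """Pick a learning rate from range-test results via the
    steepest-slope heuristic: the learning rate where the smoothed
    loss falls fastest (most negative slope w.r.t. log-LR).
    `lrs` and `losses` are the values recorded while the learning
    rate was increased batch by batch (a sketch, not Smith's code)."""
    lrs = np.asarray(lrs, dtype=float)
    losses = np.asarray(losses, dtype=float)
    # Lightly smooth the loss curve to suppress per-batch noise.
    smooth = np.convolve(losses, np.ones(3) / 3, mode="same")
    # Slope of the loss with respect to log10(learning rate).
    slopes = np.gradient(smooth, np.log10(lrs))
    # Ignore the edges, where the smoothing window is unreliable.
    core = slice(2, -2)
    return lrs[core][int(np.argmin(slopes[core]))]

# Synthetic range-test curve: loss plateaus, drops sharply around
# lr = 1e-3, then explodes once the rate becomes too large.
lrs = np.logspace(-6, 0, 100)
losses = (1.0
          - 0.8 / (1 + np.exp(-(np.log10(lrs) + 3) * 4))
          + np.maximum(0, np.log10(lrs) + 1) ** 2)
lr = suggest_lr(lrs, losses)
```

On this synthetic curve the suggestion lands near `1e-3`, the midpoint of the sharp drop, which matches the "steepest slope" reading of the plot.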
After Smith’s work was published, Novetta and ESRI, both members of the fastai community, each developed an open source algorithm that automatically finds a learning rate. Novetta’s method was developed with natural language processing tasks in mind to better automate the deployment of ML models in the Novetta Mission Analytics platform. ESRI’s algorithm focuses on computer vision tasks.
Since Novetta uses automated learning rate selection in production, we wanted to compare the three methods (Smith's, Novetta's, and ESRI's) across a range of tasks.
Datasets, Training Paradigm, and Metrics
In order to capture a wide range of common deep learning domains, we evaluated these methods on a total of six datasets across four domains, as seen in the table below:
| Domain and Task | Datasets |
|---|---|
| Computer Vision Classification | Oxford Pets, fastai's ImageWoof |
| Computer Vision Segmentation | CAMVID |
| Tabular Classification | Adult Sample |
| Natural Language Processing | IMDB, one internal Novetta dataset |
For the Adult Sample dataset, we used Smith's cyclical training regimen in a single step. The remaining datasets were trained with fastai's fine-tuning regimen, which is geared toward transfer learning and involves a two-step training process built on Smith's regimen. For each dataset, we report the mean accuracy of the trained network.
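Smith's single-step cyclical regimen can be illustrated with a one-cycle learning-rate schedule: ramp up from a small rate to a chosen maximum, then anneal back down. The sketch below uses cosine ramps and parameter names of our own choosing (`pct_warmup`, `div`); it is in the spirit of the regimen, not fastai's exact internals.

```python
import math

def one_cycle_lr(step, total_steps, lr_max, pct_warmup=0.3, div=25.0):
    """One-cycle schedule sketch: cosine warm-up from lr_max/div to
    lr_max over the first pct_warmup of training, then cosine
    annealing back down to lr_max/div for the remaining steps.
    (Illustrative parameter names; not fastai's exact defaults.)"""
    lr_min = lr_max / div
    warmup_steps = int(total_steps * pct_warmup)
    if step < warmup_steps:
        # Warm-up phase: ramp 0 -> 1, mapped onto [lr_min, lr_max].
        t = step / max(1, warmup_steps)
        return lr_min + (lr_max - lr_min) * (1 - math.cos(math.pi * t)) / 2
    # Annealing phase: ramp 1 -> 0 over the remaining steps.
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + (lr_max - lr_min) * (1 + math.cos(math.pi * t)) / 2
```

With `total_steps=100` and `lr_max=1e-2`, the schedule starts at `4e-4`, peaks at `1e-2` after the warm-up, and decays back toward `4e-4` by the end; the maximum is typically the value a learning rate finder suggests.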
The table below shows absolute performance improvement over Smith's algorithm on the six datasets:
| Algorithm | Oxford Pets | ImageWoof | CAMVID | Adult Sample | IMDB | Internal Novetta text dataset |
|---|---|---|---|---|---|---|
Over these six datasets, both Novetta's and ESRI's algorithms outperform the baseline approach. Compared to each other, the two methods achieve nearly the same performance, with each gaining a few percentage points in the domain it was designed for. Only three results showed a statistically significant difference between the two algorithms: Adult Sample, CAMVID, and IMDB. On CAMVID and IMDB, Novetta's algorithm chose an appropriate learning rate, while the rate ESRI's algorithm chose was too high, leading to exploding gradients and a loss of accuracy. We also saw a small drop in performance with Novetta's algorithm on the Adult Sample dataset. Based on our experiments, both approaches improve on traditional methods for automatically choosing a learning rate.
As part of Novetta’s continued involvement in the open source community, we provided the source code for our learning rate finder to the fastai community. Novetta is working with fastai developers to integrate the finder into their package so that it is more accessible to users. With these developments and our comparison research, we look forward to the evolution of increasingly accurate learning rate finders.