This post is part 1 of a series. Part 2 is an opinionated introduction to AutoML and neural architecture search, and Part 3 looks at Google’s AutoML in particular.
There are frequent media headlines about both the scarcity of machine learning talent (see here, here, and here) and about the promises of companies claiming their products automate machine learning and eliminate the need for ML expertise altogether (see here, here, and here). In his keynote at the TensorFlow DevSummit, Google’s head of AI Jeff Dean estimated that there are tens of millions of organizations that have electronic data that could be used for machine learning but lack the necessary expertise and skills. I follow these issues closely since my workat fast.ai focuses on enabling more people to use machine learning and on making it easier to use.
In thinking about how we can automate some of the work of machine learning, as well as how to make it more accessible to people with a wider variety of backgrounds, it’s first necessary to ask, what is it that machine learning practitioners do? Any solution to the shortage of machine learning expertise requires answering this question: whether it’s so we know what skills to teach, what tools to build, or what processes to automate.
This post is the first in a 3-part series. It will address what it is that machine learning practitioners do, with Part 2 explaining AutoML and neural architecture search (which several high profile figures have suggested will be key to decreasing the need for data scientists) and Part 3 will cover Google’s heavily hyped AutoML product in particular.
Building Data Products is Complex Work
While many academic machine learning sources focus almost exclusively on predictive modeling, that is just one piece of what machine learning practitioners do in the wild. The processes of appropriately framing a business problem, collecting and cleaning the data, building the model, implementing the result, and then monitoring for changes are interconnected in many ways that often make it hard to silo off just a single piece (without at least being aware of what the other pieces entail). As Jeremy Howard et al. wrote in Designing great data products, Great predictive modeling is an important part of the solution, but it no longer stands on its own; as products become more sophisticated, it disappears into the plumbing.
glue code: massive amount of supporting code written to get data into and out of general-purpose packages
pipeline jungles: the system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output
re-use input signals in ways that create unintended tight coupling of otherwise disjoint systems
risk that changes in the external world may make models or input signals change behavior in unintended ways, and these can be difficult to monitor
The authors write, A remarkable portion of real-world “machine learning” work is devoted to tackling issues of this form… It’s worth noting that glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated “research” and “engineering” roles… It may be surprising to the academic community to know that only a tiny fraction of the code in many machine learning systems is actually doing “machine learning”. (emphasis mine)
When machine learning projects fail
In a previous post, I identified some failure modes in which machine learning projects are not effective in the workplace:
The data science team builds really cool stuff that never gets used. There’s no buy-in from the rest of the organization for what they’re working on, and some of the data scientists don’t have a good sense of what can realistically be put into production.
There is a backlog with data scientists producing models much faster than there is engineering support to put them in production.
The data infrastructure engineers are separate from the data scientists. The pipelines don’t have the data the data scientists are asking for now, and the data scientists are under-utilizing the data sources the infrastructure engineers have collected.
The company has definitely decided on feature/product X. They need a data scientist to gather some data that supports this decision. The data scientist feels like the PM is ignoring data that contradicts the decision; the PM feels that the data scientist is ignoring other business logic.
The data science team interviews a candidate with impressive math modeling and engineering skills. Once hired, the candidate is embedded in a vertical product team that needs simple business analytics. The data scientist is bored and not utilizing their skills.
I framed these as organizational failures in my original post, but they can also be described as various participants being overly focused on just one slice of the complex system that makes up a full data product. These are failures of communication and goal alignment between different parts of the data product pipeline.
So, what do machine learning practitioners do?
As suggested above, building a machine learning product is a multi-faceted and complex task. Here are some of the things that machine learning practitioners may need to do during the process:
Understanding the context:
identify areas of the business that could benefit from machine learning
communicate with other stakeholders about what machine learning is and is not capable of (there are often many misconceptions)
develop understanding of business strategy, risks, and goals to make sure everyone is on the same page
identify what kind of data the organization has
appropriately frame and scope the task
understand operational constraints (e.g. what data is actually available at inference time)
proactively identify ethical risks, including how your work could be mis-used by harassers, trolls, authoritarian governments, or for propaganda/disinformation campaigns (and plan how to reduce these risks)
fit model resource needs into constraints (e.g. will the completed model need to run on an edge device, in a low memory or high latency environment, etc)
choose hyperparameters (e.g. in the case of deep learning, this includes choosing an architecture, loss function, and optimizer)
train the model (and debug why it’s not training). This can involve:
adjusting hyperparmeters (e.g. such as the learning rate)
outputing intermediate results to see how the loss, training error, and validation error are changing with time
inspecting the data the model is wrong on to look for patterns
identifying underlying errors or issues with the data
realizing you need to change how you clean and pre-process the data
realizing you need more or different data augmentation
realizing you need more or different data
trying out different models
identifying if you are under- or over-fitting
creating an API or web app with your model as an endpoint in order to productionize
exporting your model into the needed format
plan for how often your model will need to be retrained with updated data (e.g. perhaps you will retrain nightly or weekly)
track model performance over time
monitor the input data, to identify if it changes with time in a way that would invalidate your model
communicate your results to the rest of the organization
have a plan in place for how you will monitor and respond to mistakes or unexpected consequences
Certainly, not every machine learning practitioner needs to do all of the above steps, but components of this process will be a part of many machine learning applications. Even if you are working on just a subset of these steps, a familiarity with the rest of the process will help ensure that you are not overlooking considerations that would keep your project from being successful!
Two of the hardest parts of Machine Learning
For myself and many others I know, I would highlight two of the most time-consuming and frustrating aspects of machine learning (in particular, deep learning) as:
Dealing with data formatting, inconsistencies, and errors is often a messy and tedious process.
Training deep learning models is a notoriously brittle process right now.
Is cleaning data really part of ML? Yes.
Dealing with data formatting, inconsistencies, and errors is often a messy and tedious process. People will sometimes describe machine learning as separate from data science, as though for machine learning, you can just begin with your nicely cleaned, formatted data set. However, in my experience, the process of cleaning a data set and training a model are usually interwoven: I frequently find issues in the model training that cause me to go back and change the pre-processing for the input data.
Training Deep Learning Models is Brittle and Finicky (for now)
The difficulty of getting models to train deters many beginners, who often wind up feeling discouraged. Even experts frequently complain of how frustrating and fickle the training process can be. One AI researcher at Stanford told me, I taught a course on deep learning and had all the students do their own projects. It was so hard. The students couldn’t get their models to train, and we were like “well, that’s deep learning”. Ali Rahimi, an AI researcher with over a decade of experience and winner of the NIPS 2017 Test of Time Award, complained about the brittleness of training in his NIPS award speech. How many of you have designed a deep net from scratch, built it from the ground up, architecture and all, and when it didn’t work, you felt bad about yourself? Rahimi asked the audience of AI researchers, and many raised their hands. Rahimi continued, This happens to me about every 3 months.
The fact that even AI experts sometimes have trouble training new models implies that the process has yet to be automated in a way where it could be incorporated into a general-purpose product. Some of the biggest advances in deep learning will come through discovering more robust training methods. We have already seen this some with advances like dropout, super convergence, and transfer learning, all of which make training easier. Through the power of transfer learning (to be discussed in Part 3) training can be a robust process when defined for a narrow enough problem domain; however, we still have a ways to go in making training more robust in general.
For Academic Researchers
Even if you are working on theoretical machine learning research, it is useful to understand the process that machine learning practitioners working on practical problems go through, as that might provide insights on what the most relevant or high-impact areas of research are.
As Googler engineers D. Sculley et al. wrote, Technical debt is an issue that both engineers and researchers need to be aware of. Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice… Paying down technical debt is not always as exciting as proving a new theorem, but it is a critical part of consistently strong innovation. And developing holistic, elegant solutions for complex machine learning systems is deeply rewarding work. (emphasis mine)
Now that we have an overview of some of the tasks that machine learning practitioners do as part of their work, we are ready to evaluate attempts to automate this work. As it’s name suggests, AutoML is one field in particular that has focused on automating machine learning, and a subfield of AutoML called neural architecture search is currently receiving a ton of attention. In part 2, I will explain what AutoML and neural architecture search are, and in part 3, look at Google’s AutoML in particular.
Be sure to check out Part 2 here, and stay tuned for Part 3!
Note from Jeremy: Welcome to fast.ai’s first scholar-in-residence, Sylvain Gugger. What better way to introduce him than to publish the results of his first research project at fast.ai. We’ll be using the results of this research to change how we train models in the next version of our course and in our fastai library, as a result of which students and practitioners will be able to reliably train their models far faster than previous approaches.
The Adam roller-coaster
The journey of the Adam optimizer has been quite a roller coaster. First introduced in 2014, it is, at its heart, a simple and intuitive idea: why use the same learning rate for every parameter, when we know that some surely need to be moved further and faster than others? Since the square of recent gradients tells us how much signal we’re getting for each weight, we can just divide by that to ensure even the most sluggish weights get their chance to shine. Adam takes that idea, adds on the standard approach to momentum, and (with a little tweak to keep early batches from being biased) that’s it!
When first released, the deep learning community was full of excitement after seeing charts like this one from the original paper:
200% speed up in training! “Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field machine learning” concluded the paper. Ah yes, those were the days, over three years ago now, a life-time in deep-learning-years. But it started to become clear that all was not as we hoped. Few research articles used it to train their models, new studies began to clearly discourage to apply it and showed on several experiments that plain ole SGD with momentum was performing better. By the time the 2018 fast.ai course had come around, the decision was made to cut poor Adam from the early lessons.
But at the end of 2017, Adam seemed to get a new lease of life. Ilya Loshchilov and Frank Hutter pointed out in their paper that the way weight decay is implemented in Adam in every library seems to be wrong, and proposed a simple way (which they call AdamW) to fix it. Although their results were slightly mixed, they did show some encouraging charts, such as this one:
We expected to see the Adam enthusiasm return, since it seemed those first results could perhaps be found again. But that’s not what happened. Indeed, the only deep learning framework that implemented the fix was fastai, using code written by Sylvain. Without broad framework availability, day-to-day practitioners were stuck with the old, “broken” Adam.
But that’s not the only problem. More obstacles lay ahead. Two separate papers pointed out apparent problems with the convergence proof of poor Adam, although one of them claimed a fix (and won a “best paper” award at the prestigious ICLR conference), which they called amsgrad. But if we’ve learned anything from this potted history of this most dramatic life (at least, dramatic by optimizer standards), it’s that nothing is as it seems. And indeed, PhD student Jeremy Bernstein has pointed out that the claimed convergence problems are actually just signs of poorly chosen hyper-parameters, and that perhaps amsgrad won’t fix things anyway. Another PhD student, Filip Korzeniowski, showed some early results that seemed to support this discouraging view of amsgrad.
Getting off the roller-coaster
So for those of us that just want to train accurate models fast, what do we do? Let’s solve this debate the same way scientific debates have been solved for hundreds of years: with experiments! We’ll tell you all the details in just a moment, but first, here’s a summary of the results:
Properly tuned, Adam really works! We got new state of the art results (in terms of training time) on various tasks like
training CIFAR10 to >94% accuracy in as few as 18 epochs with Test Time Augmentation or with 30 epochs without, as in the DAWNBench competition;
fine-tuning Resnet50 to 90% accuracy on the Cars Stanford Dataset in just 60 epochs (previous reports to the same accuracy used 600);
training from scratch an AWD LSTM or QRNN in 90 epochs (or 1 hour and a half on a single GPU) to state-of-the-art perplexity on Wikitext-2 (previous reports used 750 for LSTMs, 500 for QRNNs).
That means that we’ve seen (for the first time we’re aware of) super convergence using Adam! Super convergence is a phenomenon that occurs when training a neural net with high learning rates, growing for half the training. Before it was understood, training CIFAR10 to 94% accuracy took about 100 epochs.
In contrast to previous work, we see Adam getting about as good accuracy as SGD+Momentum on every CNN image problem we’ve tried it on, as long as it’s properly tuned, and it’s nearly always a bit faster too.
The suggestions that amsgrad are a poor “fix” are correct. We consistently found that amsgrad didn’t achieve any gain in accuracy (or other relevant metric) than plain Adam/AdamW.
When you hear people saying that Adam doesn’t generalize as well as SGD+Momentum, you’ll nearly always find that they’re choosing poor hyper-parameters for their model. Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam.
Understanding AdamW: Weight decay or L2 regularization?
L2 regularization is a classic method to reduce over-fitting, and consists in adding to the loss function the sum of the squares of all the weights of the model, multiplied by a given hyper-parameter (all equations in this article use python, numpy, and pytorch notation):
…where wd is the hyper-parameter to set. This is also called weight decay, because when applying vanilla SGD it’s equivalent to updating the weight like this:
(Note that the derivative of w2 with respect to w is 2w.) In this equation we see how we subtract a little portion of the weight at each step, hence the name decay.
All libraries we have looked at use the first of these forms. (In practice, it is nearly always implemented by adding wd*w to the gradients, rather than actually changing the loss function: we don’t want to add more computations by modifying the loss when there is an easier way.)
So why make a distinction between those two concepts if they are the same thing? The answer is that they are only the same thing for vanilla SGD, but as soon as we add momentum, or use a more sophisticated optimizer like Adam, L2 regularization (first equation) and weight decay (second equation) become different. In the rest of this article, when we talk about weight decay, we will always refer to this second formula (decay the weight by a little bit) and talk about L2 regularization if we want to mention the classic way.
Let’s look at SGD with momentum for instance. Using L2 regularization consists in adding wd*w to the gradients (as we saw earlier) but the gradients aren’t subtracted from the weights directly. First we compute a moving average:
…and it’s this moving average that will be multiplied by the learning rate and subtracted from w. So the part linked to the regularization that will be taken from w is lr* (1-alpha)*wd * w plus a combination of the previous weights that were already in moving_avg.
On the other hand, weight decay’s update will look like
We can see that the part subtracted from w linked to regularization isn’t the same in the two methods. When using the Adam optimizer, it gets even more different: in the case of L2 regularization we add this wd*w to the gradients then compute a moving average of the gradients and their squares before using both of them for the update. Whereas the weight decay method simply consists in doing the update, then subtract to each weight.
Clearly those are two different approaches. And after experimenting with this, Ilya Loshchilov and Frank Hutter suggest in their article we should use weight decay with Adam, and not the L2 regularization that classic deep learning libraries implement.
How can we do this? Very easily if you’re using the fastai library since its implemented inside. Specifically if you use the fit function, just add the argument use_wd_sched=True:
If you prefer the new training API, you can use the argument wd_loss=False (for weight decay not computed in the loss) in each of your training phases:
Here’s a quick summary of how we implemented this in fastai. Inside the step function of the optimizer, only the gradients are used to modify the parameters, the value of the parameters themselves isn’t used at all (except for the weight decay, but we will be dealing with that outside). We can then implement weight decay by simply doing it before the step of the optimizer. It still has to be done after the gradients are computed (otherwise it would impact the gradients values) so inside your training loop, you have to look for this spot.
loss.backward()#Do the weight decay here!optimizer.step()
Of course, the optimizer should have been set with wd=0 otherwise it will do some L2 regularization, which is exactly what we don’t want right now. Now in that spot, we have to loop over all the parameters and do our little weight decay update. Your parameters should all be inside the dictionary param_groups of your optimizer, so the loop looks something like this:
Our first tests on computer vision problems were very encouraging. Specifically, the accuracy we managed to get in 30 epochs (which is the necessary time for SGD to get to 94% accuracy with a 1cycle policy) with Adam and L2 regularization was at 93.96% on average, going over 94% one time out of two. We consistently reached values between 94% and 94.25% with Adam and weight decay. To do this, we found the optimal value for beta2 when using a 1cycle policy was 0.99. We treated the beta1 parameter as the momentum in SGD (meaning it goes from 0.95 to 0.85 as the learning rates grow, then goes back to 0.95 when the learning rates get lower).
Even more impressive, using Test Time Augmentation (i.e. taking the average of the predictions on one image of the test set and four data-augmented versions of it), we can get to those 94% accuracy in just 18 epochs (93.98% on average)! With plain Adam and L2 regularization, going over the 94% happened once every twenty tries.
One thing to take into account in those comparisons is that changing the way we regularize changes the best values of weight decay or learning rate. In the tests we ran, the best learning rate with L2 regularization was 1e-6 (with a maximum learning rate of 1e-3) while 0.3 was the best value for weight decay (with a learning rate of 3e-3). The difference of orders of magnitude has been very consistent in all our tests, and comes primarily from the fact that L2 regularization gets effectively divided by the average norm of the gradients (which are pretty low) and that learning rates with Adam are pretty small (so the update of weight decay needs a stronger coefficient).
So, weight decay is always better than L2 regularization with Adam then? We haven’t found a situation where it’s significantly worse, but for either a transfer-learning problem (e.g. fine-tuning Resnet50 on Stanford cars) or RNNs, it didn’t give better results.
Amsgrad was introduced in a recent article by Sashank J. Reddi, Satyen Kale and Sanjiv Kumar. By analyzing the proof of convergence for the Adam optimizer, they spotted a mistake in the update rule that could cause the algorithm to converge to a sub-optimal point. They designed theoretical experiments that showed situations where Adam would fail and proposed a simple fix.
To understand the error and the fix, let’s have a look at the update rule of Adam (if you need a refresher, Sebastian got you covered):
We’ve just skipped the bias correction (useful for the beginning of training) to focus on the important point. The error in the proof of Adam the authors spotted is that it requires the quantity
lr / sqrt(avg_squared)
…which is the step we take in the direction of our average gradients, to be decreasing over training. Since the learning rate is often taken constant or decreasing (except for crazy people like us trying to obtain super-convergence), the fix the authors proposed was to force the avg_squared quantity to be increasing by adding another variable to keep track of their maximums.
The associated article won an award at ICLR 2018 and gained such popularity that it’s already implemented in two of the main deep learning libraries, pytorch and Keras. There is little to do except turn the option on with amsgrad=True.
This causes the weight update code from the previous section to be changed to something like this:
Results of amsgrad experiments: a lot of noise for nothing
Amsgrad turns out to be very disappointing. In none of our experiments did we find that it helped the slightest bit, and even if it’s true that the minimum found by amsgrad is sometimes slightly lower (in terms of loss) than the one reached by Adam, the metrics (accuracy, f1 score…) always end up worse (see the tables in our introduction, or more examples here)
The proof of convergence for the Adam optimizer in deep learning (since it’s for convex problems) and the mistake they found in it mattered for synthetic experiments that have nothing to do with real-life problems. Actual tests show that when those avg_squared gradients want to decrease, it’s best for the final result to do so.
This shows that even if the focus on theory can be useful to gain some new ideas, nothing replaces experiments (and lots of them!) to make sure these ideas actually help practitioners train better models.
Appendix: Full results
Training of CIFAR10 from scratch (model is a wide resnet 22, average of the error on the test set with five models shown):
Fine-tuning Resnet50 on the Stanford Cars dataset using the standard head introduced by the fastai library (training the head for 20 epochs before unfreezing and training with differential learning rates for 40 epochs).:
Training an AWD LSTM with the hyper-parameters from the github repo (results show the perplexity on the validation/test set, with or without cache pointer):
With cache pointer
Adam + amsgrad
AdamW + amsgrad
And the same with QRNNs instead of LSTMs:
With cache pointer
Adam + amsgrad
AdamW + amsgrad
For this specific task, we used a modified version of the 1cycle policy, growing the learning rate faster, then having a long period of high constant learning rates before going down again.
The values of all relevant hyper-parameters as well as the code used to produce these results are available here.
Today we are launching the 2018 edition of Cutting Edge Deep Learning for Coders, part 2 of fast.ai’s free deep learning course. Just as with our part 1 Practical Deep Learning for Coders, there are no pre-requisites beyond high school math and 1 year of coding experience—we teach you everything else you need along the way. This course contains all new material, including new state of the art results in NLP classification (up to 20% better than previously known approaches), and shows how to replicate recent record-breaking performance results on Imagenet and CIFAR10. The main libraries used are PyTorch and fastai (we explain why we use PyTorch and why we created the fastai library in this article).
Each of the seven lessons includes a video that’s around two hours long, an interactive Jupyter notebook, and a dedicated discussion thread on the fast.ai forums. The lessons cover many topics, including: multi-object detection with SSD and YOLOv3; how to read academic papers; customizing a pre-trained model with a custom head; more complex data augmentation (for coordinate variables, per-pixel classification, etc); NLP transfer learning; handling very large (billion+ token) text corpuses with the new fastai.text library; running and intepreting ablation studies; state of the art NLP classification; multi-modal learning; multi-task learning; bidirectional LSTM with attention for seq2seq; neural translation; customizing resnet architectures; GANs, WGAN, and CycleGAN; data ethics; super resolution; image segmentation with u-net.
Lesson 8 starts with a quick recap of what we learned in part 1, and introduces the new focus of this part of the course: cutting edge research. We talk about how to read papers, and what you’ll need to build your own deep learning box to run your experiments. Even if you’ve never read an academic paper before, we’ll show you how to do so in a way that you don’t get overwhelmed by the notation and writing style. Another difference in this part is that we’ll be digging deeply into the source code of the fastai and Pytorch libraries: in this lesson we’ll show you how to quickly navigate and build an understanding of the code. And we’ll see how to use python’s debugger to deepen your understand of what’s going on, as well as to fix bugs.
The main topic of this lesson is object detection, which means getting a model to draw a box around every key object in an image, and label each one correctly. You may be surprised to discover that we can use transfer learning from an Imagenet classifier that was never even trained to do detection! There are two main tasks: find and localize the objects, and classify them; we’ll use a single model to do both these at the same time. Such multi-task learning generally works better than creating different models for each task—which many people find rather counter-intuitive. To create this custom network whilst leveraging a pre-trained model, we’ll use fastai’s flexible custom head architecture.
In this lesson we’ll move from single object to multi-object detection. It turns out that this slight difference makes things much more challenging. In fact, most students found this the most challenging lesson in the whole course. Not because any one piece is highly complex, but because there’s a lot of pieces, so it really tests your understanding of the foundations we’ve learnt so far. So don’t worry if a lot of details are unclear on first viewing – come back to this lesson from time to time as you complete the rest of the course, and you should find more and more of it making sense!
Our focus is on the single shot multibox detector (SSD), and the related YOLOv3 detector. These are ways to handle multi-object detection by using a loss function that can combine losses from multiple objects, across both localization and classification. They also use a custom architecture that takes advantage of the difference receptive fields of different layers of a CNN. And we’ll see how to handle data augmentation in situations like this one where the dependent variable requires augmentation too. Finally, we discuss a simple but powerful trick called focal loss which is used to get state of the art results in this field.
After reviewing what we’ve learned about object detection, in lesson 10 we jump into NLP, starting with an introduction to the new fastai.text library. This is a replacement for torchtext which is faster and more flexible in many situations. A lot of this class will be very familiar—we’re covering a lot of the same ground as lesson 4. But this lesson will show you how to get much more accurate results, by using transfer learning for NLP.
Transfer learning has revolutionized computer vision, but until now it largely has failed to make much of an impact in NLP (and to some extent has been simply ignored). In this class we’ll show how pre-training a full language model can greatly surpass previous approaches based on simple word vectors. We’ll use this language model to show a new state of the art result in text classification.
In lesson 11 we’re going to learn to translate French into English! To do so, we’ll learn how to add attention to an LSTM in order to build a sequence to sequence (seq2seq) model. But before we do, we’ll do a review of some key RNN foundations, since a solid understanding of those will be critical to understanding the rest of this lesson.
A seq2seq model is one where both the input and the output are sequences, and can be of difference lengths. Translation is a good example of a seq2seq task. Because each translated word can correspond to one or more words that could be anywhere in the source sentence, we learn an attention mechanism to figure out which words to focus on at each time step. We’ll also learn about some other tricks to improve seq2seq results, including teacher forcing and bidirectional models.
We finish the lesson by discussing the amazing DeVISE paper, which shows how we can bridge the divide between text and images, using them both in the same model!
We start this lesson with a deep dive into the DarkNet architecture used in YOLOv3, and use it to better understand all the details and choices that you can make when implementing a resnet-ish architecture. The basic approach discussed here is what we used to win the DAWNBench competition!
Then we’ll learn about Generative Adversarial Networks (GANs). This is, at its heart, a different kind of loss function. GANs have a generator and a discriminator that battle it out, and in the process combine to create a generative model that can create highly realistic outputs. We’ll be looking at the Wasserstein GAN variant, since it’s easier to train and more resilient to a range of hyperparameters.
For the start of lesson 13 we’ll cover the CycleGAN, which is a breakthrough idea in GANs that allows us to generate images even where we don’t have direct (paired) training data. We’ll use it to turn horses into zebras, and visa versa; this may not be an application you need right now… but the basic idea is likely to be transferable to a wide range of very valuable applications. One of our students is already using it to create a new form of visual art.
But generative models (and many other techniques we’ve discussed) can cause harm just as easily as they can benefit society. So we spend some time talking about data ethics. It’s a topic that really deserves its own whole course; whilst we can’t go into the detail we’d like in the time available, hopefully you’ll get a taste of some of the key issues, and ideas for where to learn more.
We finish the lesson by looking at style transfer, an interesting approach that allows us to change the style of images in whatever way we like. The approach requires us to optimize pixels, instead of weights, which is an interesting different way of looking at optimization.
In this final lesson, we do a deep dive into super resolution, an amazing technique that allows us to restore high resolution detail in our images, based on a convolutional neural network. In the process, we’ll look at a few modern techniques for faster and more reliable training of generative convnets.
We close with a look at image segmentation, in particular using the Unet architecture, a state of the art technique that has won many Kaggle competitions and is widely used in industry. Image segmentation models allow us to precisely classify every part of an image, right down to pixel level.
DAWNBench is a Stanford University project designed to allow different deep learning methods to be compared by running a number of competitions. There were two parts of the Dawnbench competition that attracted our attention, the CIFAR 10 and Imagenet competitions. Their goal was simply to deliver the fastest image classifier as well as the cheapest one to achieve a certain accuracy (93% for Imagenet, 94% for CIFAR 10).
In the CIFAR 10 competition our entries won both training sections: fastest, and cheapest. Another fast.ai student working independently, Ben Johnson, who works on the DARPA D3M program, came a close second in both sections.
In the Imagenet competition, our results were:
Fastest on publicly available infrastructure, fastest on GPUs, and fastest on a single machine (and faster than Intel’s entry that used a cluster of 128 machines!)
Lowest actual cost (although DAWNBench’s official results didn’t use our actual cost, as discussed below).
Overall, our findings were:
Algorithmic creativity is more important than bare-metal performance
Pytorch, developed by Facebook AI Research and a team of collaborators, allows for rapid iteration and debugging to support this kind of creativity
AWS spot instances are an excellent platform for rapidly and inexpensively running many experiments.
In this post we’ll discuss our approach to each competition. All of the methods discussed here are either already incorporated into the fastai library, or are in the process of being merged into the library.
fast.ai is a research lab dedicated to making deep learning more accessible, both through education, and developing software that simplifies access to current best practices. We do not believe that having the newest computer or the largest cluster is the key to success, but rather utilizing modern techniques and the latest research with a clear understanding of the problem we are trying to solve. As part of this research we recently developed a new library for training deep learning models based on Pytorch, called fastai.
Over time we’ve been incorporating into fastai algorithms from a number of research papers which we believe have been largely overlooked by the deep learning community. In particular, we’ve noticed a tendency of the community to over-emphasize results from high-profile organizations like Stanford, DeepMind, and OpenAI, whilst ignoring results from less high-status places. One particular example is Leslie Smith from the Naval Research Laboratory, and his recent discovery of an extraordinary phenomenon he calls super convergence. He showed that it is possible to train deep neural networks 5-10x faster than previously known methods, which has the potential to revolutionize the field. However, his paper was not accepted to an academic publishing venue, nor was it implemented in any major software.
Within 24 hours of discussing this paper in class, a fast.ai student named Sylvain Gugger had completed an implementation of the method, which was incorporated into fastai and he also developed an interactive notebook showing how to experiment with other related methods too. In essence, Smith showed that if we very slowly increase the learning rate during training, whilst at the same time decreasing momentum, we can train at extremely high learning rates, thus avoiding over-fitting, and training in far fewer epochs.
Such rapid turnaround of new algorithmic ideas is exactly where Pytorch and fastai shine. Pytorch allows for interactive debugging, and the use of standard Python coding methods, whilst fastai provides many building blocks and hooks (such as, in this case, callbacks to allow customization of training, and fastai.sgdr for building new learning rate annealing methods). Pytorch’s tensor library and CUDA allow for fast implementation of new algorithms for exploration.
We have an informal deep learning study group (free for anyone to join) that meets each day to work on projects together during the course, and we thought it would be interesting to see whether this newly contributed code would work as well as Smith claimed. We had heard that Stanford University was running a competition called DAWNBench, which we thought would be an interesting opportunity to test it out. The competition finished just 10 days from when we decided to enter, so timing was tight!
Both CIFAR 10 and Imagenet are image recognition tasks. For instance, imagine that we have a set of pictures of cats and dogs, and we want to build a tool to separate them automatically. We build a model and then train it on many pictures so that afterwards we can classify dog and cat pictures we haven’t seen before. Next, we can take our model and apply it to larger data sets like CIFAR, a collection of pictures of ten various objects like cats and dogs again as well as other animals/vehicles, for example frogs and airplanes. The images are small (32 pixels by 32 pixels) and so this dataset is small (160MB) and easy to work with. It is, nowadays, a rather under-appreciated dataset, simply because it’s older and smaller than the datasets that are fashionable today. However, it is very representative of the amount of data most organizations have in the real world, and the small image size makes it both challenging but also accessible.
When we decided to enter the competition, the current leader had achieved a result of 94% accuracy in a little over an hour. We quickly discovered that we were able to train a Resnet 50 model with super-convergence in around 15 minutes, which was an exciting moment! Then we tried some different architectures, and found that Resnet 18 (in its preactivation variant) achieved the same result in 10 minutes. We discussed this in class, and Ben Johnson independently further developed this by adding a method fast.ai developed called “concat pooling” (which concatenates max pooling and average pooling in the penultimate layer of the network) and got down to an extraordinary 6 minutes on a single NVIDIA GPU.
In the study group we decided to focus on multi-GPU training, in order to get the fastest result we could on a single machine. In general, our view is that training models on multiple machines adds engineering and sysadmin complexity that should be avoided where possible, so we focus on methods that work well on a single machine. We used a library from NVIDIA called NCCL that works well with Pytorch to take advantage of multiple GPUs with minimal overhead.
Most papers and discussions of multi-GPU training focus on the number of operations completed per second, rather than actually reporting how long it takes to train a network. However, we found that when training on multiple GPUs, our architectures showed very different results. There is clearly still much work to be done by the research community to really understand how to leverage multiple GPUs to get better end-to-end training results in practice. For instance, we found that training settings that worked well on single GPUs tended to lead to gradients blowing up on multiple GPUs. We incorporated all the recommendations from previous academic papers (which we’ll discuss in a future paper) and got some reasonable results, but we still weren’t really leveraging the full power of the machine.
In the end, we found that to really leverage the 8 GPUs we had in the machine, we actually needed to give it more work to do in each batch—that is, we increased the number of activations in each layer. We leveraged another of those under-appreciated papers from less well-known institutions: Wide Residual Networks, from Université Paris-Est, École des Ponts. This paper does an extensive analysis of many different approaches to building residual networks, and provides a rich understanding of the necessary building blocks of these architectures.
Another of our study group members, Brett Koonce, started running experiments with lots of different parameter settings to try to find something that really worked well. We ended up creating a “wide-ish” version of the resnet-34 architecture which, using Brett’s carefully selected hyper-parameters, was able to reach the 94% accuracy with multi-GPU training in under 3 minutes!
AWS and spot instances
We were lucky enough to have some AWS credits to use for this project (thanks Amazon!) We wanted to be able to run many experiments in parallel, without spending more credits than we had to, so study group member Andrew Shaw built out a python library which would allow us to automatically spin up a spot instance, set it up, train a model, save the results, and shut the instance down again, all automatically. Andrew even set things up so that all training occurred automatically in a tmux session so that we could log in to any instance and view training progress at any time.
Based on our experience with this competition, our recommendation is that for most data scientists, AWS spot instances are the best approach for training a large number of models, or for training very large models. They are generally about a third of the cost of on-demand instances. Unfortunately, the official DAWNBench results do not report the actual cost of training, but instead report the cost based on an assumption of on-demand pricing. We do not agree that this is the most useful approach, since in practice spot instance pricing is quite stable, and is the recommended approach for training models of this type.
Google’s TPU instances (now in beta) may also a good approach, as the results of this competition show, but be aware that the only way to use TPUs is if you accept lock-in to all of:
Google’s hardware (TPU)
Google’s software (Tensorflow)
Google’s cloud platform (GCP).
More problematically, there is no ability to code directly for the TPU, which severely limits algorithmic creativity (which as we have seen, is the most important part of performance). Given the limited neural network and algorithm support on TPU (e.g. no support for recurrent neural nets, which are vital for many applications, including Google’s own language translation systems), this limits both what problems you can solve, and how you can solve them.
AWS, on the other hand, allows you to run any software, architecture, and algorithm, and you can then take the results of that code and run them on your own computers, or use a different cloud platform. The ability to use spot instances also means you we were able to save quite a bit of money compared to Google’s platform (Google has something similar in beta called “preemptible instances”, but they don’t seem to support TPUs, and automatically kill your job after 24 hours).
For single GPU training, another great option is Paperspace, which is the platform we use for our new courses. They are significantly less complex to set up than AWS instances, and have the whole fastai infrastructure pre-installed. On the other hand, they don’t have the features and flexibility of AWS. They are more expensive than AWS spot instances, but cheaper that AWS on-demand instances. We used a Paperspace instance to win the cost category of this competition, with a cost of just $0.26.
Half precision arithmetic
Another key to fast training was the use of half precision floating point. NVIDIA’s most recent Volta architecture contains tensor cores that only work with half-precision floating point data. However, successfully training with this kind of data has always been complex, and very few people have shown successful implementations of models trained with this data.
NVIDIA was kind enough to provide an open-source demonstration of training Imagenet using half-precision floating point, and Andrew Shaw worked to incorporate these ideas directly into fastai. We’ve now gotten it to a point where you simply write learn.half() in your code, and from there on all the necessary steps to train quickly and correctly with half-precision floating point are automatically done for you.
Imagenet is a different version of the same problem as CIFAR 10, but with larger images (224 pixels, 160GB) and more categories (1000). Smith showed super convergence on Imagenet in his paper, but he didn’t reach the same level of accuracy as other researchers had on this dataset. We had the same problem, and found that when training with really high learning rates that we couldn’t achieve the required 93% accuracy.
Instead, we turned to a method we’d developed at fast.ai, and teach in lessons 1 & 2 of our deep learning course: progressive resizing. Variations of this technique have shown up in the academic literature before (Progressive Growing of GANs and Enhanced Deep Residual Networks) but have never to our knowledge been applied to image classification. The technique is very simple: train on smaller images at the start of training, and gradually increase image size as you train further. It makes intuitive sense that you don’t need large images to learn the general sense of what cats and dogs look like (for instance), but later on when you’re trying to learn the difference between every breed of dog, you’ll often need larger images.
Many people incorrectly believe that networks trained on one size of images can’t be used for other sizes. That was true back in 2013 when the VGG architecture was tied to one specific size of image, but hasn’t been true since then, on the whole. One problem is that many implementations incorrectly used a fixed-size pooling layer at the end of the network instead of a global/adaptive pooling layer. For instance none of the official pytorch torchvision models use the correct adaptive pooling layer. This kind of issue is exactly why libraries like fastai and keras are important—libraries built by people who are committed to ensuring that everything works out-of-the-box and incorporates all relevant best practices. The engineers building libraries like pytorch and tensorflow are (quite rightly) focused on the underlying foundations, not on the end-user experience.
By using progressive resizing we were both able to make the initial epochs much faster than usual (using 128x128 images instead of the usual 224x224), but also make the final epochs more accurate (using 288x288 images for even higher accuracy). But performance was only half of the reason for this success; the other impact is better generalization performance. By showing the network a wider variety of image sizes, it helps it to avoid over-fitting.
A word on innovation and creativity
I’ve been working with machine learning for 25 years now, and throughout that time I’ve noticed that engineers are drawn to using the biggest datasets they can get, on the biggest machines they can access, like moths flitting around a bright light. And indeed, the media loves covering stories about anything that’s “biggest”. The truth though is that throughout this time the genuine advances consistently come from doings things differently, not doing things bigger. For instance, dropout allows us to train on smaller datasets without over-fitting, batch normalization lets us train faster, and rectified linear units avoid gradient explosions during training; these are all examples of thoughtful researchers thinking about doing things differently, and allowing the rest of us to train better networks, faster.
I worry when I talk to my friends at Google, OpenAI, and other well-funded institutions that their easy access to massive resources is stifling their creativity. Why do things smart when you can just throw more resources at them? But the world is a resource-constrained place, and ignoring that fact means that you will fail to build things that really help society more widely. It is hardly a new observation to point out that throughout history, constraints have been drivers of innovation and creativity. But it’s a lesson that few researchers today seem to appreciate.
Worse still are the people I speak to that don’t have access to such immense resources, and tell me they haven’t bothered trying to do cutting edge research because they assume that without a room full of GPUs, they’ll never be able to do anything of value. To me, they are thinking about the problem all wrong: a good experimenter with a slow computer should always be able to overtake a poor experimenter with a fast one.
We’re lucky that there folks like the Pytorch team that are building the tools that creative practitioners need to rapidly iterate and experiment. I hope that seeing that a small non-profit self-funded research lab and some part-time students can achieve these kinds of top-level results can help bring this harmful myth to an end.
Despite what you may have heard, you can use deep learning for the type of data you might keep in a SQL database, a Pandas DataFrame, or an Excel spreadsheet (including time-series data). I will refer to this as tabular data, although it can also be known as relational data, structured data, or other terms (see my twitter poll and comments for more discussion).
Tabular data is the most commonly used type of data in industry, but deep learning on tabular data receives far less attention than deep learning for computer vision and natural language processing. This post covers some key concepts from applying neural networks to tabular data, in particular the idea of creating embeddings for categorical variables, and highlights 2 relevant modules of the fastai library:
fastai.structured: this module works with Pandas DataFrames, is not dependent on PyTorch, and can be used separately from the rest of the fastai library to process and work with tabular data.
fastai.column_data: this module also works with Pandas DataFrames, and provides methods to convert DataFrames (with both continuous and categorical variables) into ModelData objects that can easily be used when training neural networks. It also includes an implementation for creating embeddings of categorical variables, a powerful technique I will explain below.
A key technique to making the most of deep learning for tabular data is to use embeddings for your categorical variables. This approach allows for relationships between categories to be captured. Perhaps Saturday and Sunday have similar behavior, and maybe Friday behaves like an average of a weekend and a weekday. Similarly, for zip codes, there may be patterns for zip codes that are geographically near each other, and for zip codes that are of similar socio-economic status.
Taking Inspiration from Word Embeddings
A way to capture these multi-dimensional relationships between categories is to use embeddings. This is the same idea as is used with word embeddings, such as Word2Vec. For instance, a 3-dimensional version of a word embedding might look like:
[0.9, 1.0, 0.0]
[1.0, 0.2, 0.0]
[0.0, 1.0, 0.9]
[0.0, 0.2, 1.0]
Notice that the first dimension is capturing something related to being a dog, and the second dimension captures youthfulness. This example was made up by hand, but in practice you would use machine learning to find the best representations (while semantic values such as dogginess and youth would be captured, they might not line up with a single dimension so cleanly). You can check out my workshop on word embeddings for more details about how word embeddings work.
Applying Embeddings for Categorical Variables
Similarly, when working with categorical variables, we will represent each category by a vector of floating point numbers (the values of this representation are learned as the network is trained).
For instance, a 4-dimensional version of an embedding for day of week could look like:
[.8, .2, .1, .1]
[.1, .2, .9, .9]
[.2, .1, .9, .8]
Here, Monday and Tuesday are fairly similar, yet they are both quite different from Sunday. Again, this is a toy example. In practice, our neural network would learn the best representations for each category while it is training, and each dimension (or direction, which doesn’t necessarily line up with ordinal dimensions) could have multiple meanings. Rich relationships can be captured in these distributed representations.
Reusing Pretrained Categorical Embeddings
Embeddings capture richer relationships and complexities than the raw categories. Once you have learned embeddings for a category which you commonly use in your business (e.g. product, store id, or zip code), you can use these pre-trained embeddings for other models. For instance, Pinterest has created 128-dimensional embeddings for its pins in a library called Pin2Vec, and Instacart has embeddings for its grocery items, stores, and customers.
The fastai library contains an implementation for categorical variables, which work with Pytorch’s nn.Embedding module, so this is not something you need to code from hand each time you want to use it.
Treating some Continuous Variables as Categorical
We generally recommend treating month, year, day of week, and some other variables as categorical, even though they could be treated as continuous. Often for variables with a relatively small number of categories, this results in better performance. This is a modeling decision that the data scientist makes. Generally, we want to keep continuous variables represented by floating point numbers as continuous.
Although we can choose to treat continuous variables as categorical, the reverse is not true: any variables that are categorical must be treated as categorical.
Time Series Data
The approach of using neural networks together with categorical embeddings can be applied to time series data as well. In fact, this was the model used by students of Yoshua Bengio to win 1st place in the Kaggle Taxi competition(paper here), using a trajectory of GPS points and timestamps to predict the length of a taxi ride. It was also used by the 3rd place winners in the Kaggle Rossmann Competition, which involved using time series data from a chain of stores to predict future sales. The 1st and 2nd place winners of this competition used complicated ensembles that relied on specialist knowledge, while the 3rd place entry was a single model with no domain-specific feature engineering.
In our Lesson 3 jupyter notebook we walk through a solution for the Kaggle Rossmann Competition. This data set (like many data sets) includes both categorical data (such as the state the store is located in, or being one of 3 different store types) and continuous data (such as the distance to the nearest competitor or the temperature of the local weather). The fastai library lets you enter both categorical and continuous variables as input to a neural network.
When applying machine learning to time-series data, you nearly always want to choose a validation set that is a continuous selection with the latest available dates that you have data for. As I wrote in a previous post, “If your data is a time series, choosing a random subset of the data will be both too easy (you can look at the data both before and after the dates your are trying to predict) and not representative of most business use cases (where you are using historical data to build a model for use in the future).”
One key to successfully using deep learning with time series data is to split the date into multiple categorical variables (year, month, week, day of week, day of month, and Booleans for whether it’s the start/end of a month/quarter/year). The fastai library has implemented a method to handle this for you, as described below.
Modules to Know in the Fastai Library
We will be releasing more documentation for the fastai library in coming months, but it is already available on pip and on github, and it is used in the Practical Deep Learning for Coders course. The fastai library is built on top of Pytorch and encodes best practices and helpful high-level abstractions for using neural networks. The fastai library achieves state-of-the-art results and was recently used to win the Stanford DAWNBench competition (fastest CIFAR10 training).
fastai.column_data.ColumnarModelData takes a Pandas DataFrame as input and creates a type of ModelData object (an object which contains data loaders for the training, validation, and test sets, and which is the fundamental way of keeping track of your data while training models).
The fastai.structured module of the fastai library is built on top of Pandas, and includes methods to transform DataFrames in a number of ways, improving the performance of machine learning models by pre-processing the data appropriately and creating the right types of variables.
For instance, fastai.structured.add_datepart converts dates (e.g. 2000-03-11) into a number of variables (year, month, week, day of week, day of month, and booleans for whether it’s the start/end of a month/quarter/year.)
Other useful methods in the module allow you to:
Fill in missing values with the median whilst adding a boolean indicator variable (fix_missing)
Change any columns of strings in a Pandas DataFrame to a column of categorical values (train_cats)