What HBR Gets Wrong About Algorithms and Bias

The Harvard Business Review recently published an article by Alex P. Miller, Want Less-Biased Decisions? Use Algorithms. The article focuses on the fact that humans make very biased decisions (which is true), yet ignores many important related issues, including:

  • algorithms are often implemented without any appeals method in place (due to the misconception that algorithms are objective, accurate, and won’t make mistakes)
  • algorithms are often used at a much larger scale than human decision makers, in many cases, replicating an identical bias at scale (part of the appeal of algorithms is how cheap they are to use)
  • users of algorithms may not understand probabilities or confidence intervals (even if these are provided), and may not feel comfortable overriding the algorithm in practice (even if this is technically an option)
  • instead of just focusing on the least-terrible existing option, it is more valuable to ask how we can create better, less biased decision-making tools by leveraging the strengths of humans and machines working together
Credit: Jonathan Lidbeck https://www.flickr.com/photos/jondissed/2278335691

Miller acknowledges that critics of the “algorithmic revolution” are “concerned that algorithms are often opaque, biased, and unaccountable tools being wielded in the interests of institutional power”, although he then focuses exclusively on the biased part for the remainder of the article, without addressing the opaque or unaccountable charges (as well as how these interact with bias).

Humans vs. machines is not a helpful framing

The media often frames advances in AI through a lens of humans vs. machines: who is the champion at X task. This framing is both inaccurate as a description of how most algorithms are actually used and a very limited way to think about AI. In all cases, algorithms have a human component, in terms of who gathers the data (and what biases they have), which design decisions are made, how they are implemented, how results are used to make decisions, the understanding various stakeholders have of correct uses and limitations of the algorithm, and so on.

Most people working on medical applications of AI are not trying to replace doctors; they are trying to create tools that will allow doctors to be more accurate and more efficient, improving quality of care. The best chess “players” are neither humans nor computers, but rather, teams of humans and computers working together.

Miller’s HBR article points out (correctly) that humans are very biased, and then compares our current not-so-great approaches to see which is less terrible. It does not ask the far more interesting and important question: how can we develop less biased ways to make decisions, perhaps using some combination of humans and algorithms?

Algorithms are often used differently than human decision makers

Algorithms are often used at a larger scale, mass-producing identical biases, and assumed to be error-proof or objective. The studies that Miller shares compare humans and algorithms in an apples-to-apples way, which doesn’t acknowledge how differently they are often used in practice.

Cathy O’Neil writes in Weapons of Math Destruction that the algorithms she is critiquing “tend to punish the poor. They specialize in bulk, and they’re cheap. That’s part of their appeal. The wealthy, by contrast, often benefit from personal input. A white-shoe law firm or an exclusive prep school will lean far more on recommendations and face-to-face interviews than will a fast-food chain or a cash-strapped urban school district. The privileged, we’ll see time and again, are processed more by people, the masses by machines.” (emphasis mine)

One example from O’Neil’s book is that of a college student with bipolar disorder who wanted to get a summer job bagging groceries. Every store he applied to was using the same psychometric evaluation software to screen candidates, and he was rejected from every store. This captures another danger of algorithms: even though humans often have similar biases, not all humans will make the exact same decisions (e.g. perhaps that college student would have been able to find one place to hire him, even if some of the people making decisions had biases about mental health).

Many people will put more trust in algorithmic decisions than they might in human decisions. While the researchers designing the algorithms may have a good grasp on probability and confidence intervals, often the general public using them will not. Even if people are given the power to override algorithmic decisions, it is crucial to understand if they will feel comfortable doing so in practice.

The need for meaningful appeals and explanations

Many of the most chilling stories of algorithmic bias don’t involve meaningful explanations or a meaningful appeals process. This seems to be a particular trend amongst algorithmic decision-making systems, perhaps because people mistakenly assume algorithms are objective and therefore believe there is no need for appeals. Also, as explained above, algorithmic decision-making systems are often used as a cost-cutting device, and allowing appeals would be more expensive.

Cathy O’Neil recounts the story of a teacher who is beloved by her students, their parents, and the principal, yet is inexplicably fired by an algorithm. She is never able to get an answer as to why she was fired. Stories like this would be somewhat less disturbing if there had been a relatively quick and simple way for her to appeal the decision, or even to know for sure what factors it was based on.

The Verge investigated software used in over half of U.S. states to determine how much healthcare people receive. After its implementation in Arkansas, many people (including many with severe disabilities) had their healthcare drastically cut. For instance, Tammy Dobbs, a woman with cerebral palsy who needs an aide to help her get out of bed, go to the bathroom, get food, and more, had her hours of help suddenly reduced by 20 hours a week. She couldn’t get any explanation for why her healthcare was cut. Eventually, a court case revealed that there were mistakes in the software implementation of the algorithm, negatively impacting people with diabetes or cerebral palsy. However, Dobbs and many other people reliant on these healthcare benefits live in fear that their benefits could again be cut suddenly and inexplicably.

source: wikimedia commons

The creator of the algorithm, a professor who earns royalties off of this software, was asked whether there should be a way to communicate decisions and responded, “It’s probably something we should do. I should also probably dust under my bed.” He later clarified that he thought it was someone else’s responsibility. We cannot keep claiming the problems caused by our technology are someone else’s responsibility.

For a separate computer system used in Colorado to determine public benefits in the mid-2000s, it was discovered that more than 900 incorrect rules had been coded into the system, resulting in problems like pregnant women being denied Medicaid. It is often hard for lawyers to even discover these flaws, since the inner workings of the algorithms are typically protected as trade secrets. Systems used to make decisions related to healthcare, hiring/firing, criminal justice, and other life-altering areas should include some sort of human appeals process that is relatively fast and easy to navigate. Many of the most chilling stories of algorithmic decision making would not be nearly as concerning if there had been an easy way to appeal and correct faulty decisions. Mistakes are possible in anything we do, so it’s important to have a tight loop in which we make it easy to discover and correct mistakes.

Complicated, real-world systems

When we think about AI, we need to think about complicated, real-world systems. The studies in the HBR article treat decision making as an isolated action, without taking into account that this decision-making happens within complicated real-world systems. A decision about whether someone is likely to commit another crime is not an isolated decision: it lives within the complicated system of our criminal justice system. We have a responsibility to understand the real-world systems with which our work will interact, and to not lose sight of the actual people who will be impacted.

The COMPAS recidivism algorithm is used in some US courtrooms for decisions related to pre-trial bail, sentencing, and parole. It was the subject of a ProPublica investigation finding that the false positive rate (people that were labeled “high risk” but were not re-arrested) for white defendants was 24% and for Black defendants was 45%. Later research found that COMPAS (which uses 137 inputs in a black-box algorithm) was no more accurate than a simple linear equation on two variables. COMPAS was also not more accurate than untrained Mechanical Turk workers. (You can find out more about various definitions of fairness in Princeton CS Professor Arvind Narayanan’s excellent 21 Definitions of Fairness talk).

Kristian Lum, statistics PhD and lead data scientist at the Human Rights Data Analysis Group, organized a workshop together with Elizabeth Bender, a staff attorney for the NY Legal Aid Society and former public defender, and Terrence Wilkerson, an innocent man who had been arrested and could not afford bail. Together, they shared first-hand experience of the obstacles and inefficiencies that occur in the legal system, providing valuable context to the debate around COMPAS. Bender explained that for public defenders to meet with defendants held at Rikers Island, where many pre-trial detainees in NYC who can’t afford bail are kept, involves a bus ride that is two hours each way, after which they only get 30 minutes to see the defendant, assuming the guards are on time (which is not always the case). Wilkerson explained how frequently innocent defendants who can’t afford bail accept guilty plea bargains just so they can get out of jail faster. Again, all this is for people who have not even faced a trial yet! This panel was an excellent way to illuminate the real-world systems and to show their first-hand impact. I hope more statisticians and computer scientists will follow this example.

Dr. Kristian Lum's Panel at the Fairness, Accountability, and Transparency in ML Conference

As this example shows, algorithms can often exacerbate underlying societal problems. There are deep, structural problems with the US courts and prison systems, including racial bias, the use of cash bail (nearly half a million people in the USA are languishing in jail before even facing a trial, because they are too poor to afford bail), predatory for-profit prisons, and extreme over-use of prisons (the US is home to 4% of the world’s population and 22% of the world’s prisoners). We have a responsibility to understand the systems and underlying problems our algorithms may interact with.

Most critics of unjust bias aren't anti-algorithm

Most critics of biased algorithms are opposed to unjust bias; they are not people who hate algorithms. Miller says that critics of biased algorithms “rarely ask how well the systems they analyze would operate without algorithms,” suggesting that those speaking out against biased algorithms are perhaps unaware of how biased humans are or perhaps just don’t like algorithms. I spent a great deal of time researching and writing about studies of human bias (particularly as to how they pertain to the tech industry), long before I began writing about bias in machine learning.

When I tweet or share about biased or unethical algorithms, I frequently encounter push-back that I must be anti-algorithms or opposed to tech. This couldn’t be further from the truth: I have a PhD in math; I’ve worked as a quant, data scientist, and software engineer; I created a free, online Computational Linear Algebra course and co-founded fast.ai, which runs Practical Deep Learning for Coders and won Stanford’s Computer Vision Speed Test through clever use of algorithms.

I’m in no way unique in this: most of the outspoken critics of biased algorithms that come to mind have PhDs in computer science, math, or statistics, and continue to be active in their fields. Just check out some of the speakers from the Fairness Accountability and Transparency Conference (and watch their talks!). One such example is Arvind Narayanan, a computer science professor at Princeton, winner of the Kaggle Social Network Challenge, and teacher of a popular cryptocurrency course, who also speaks out against algorithmic bias.

I hope that the popular discussion of biased algorithms can move beyond unnuanced rebuttals and more deeply engage with the issues involved.

Google's AutoML: Cutting Through the Hype

This is part 3 in a series. Part 1 is here and Part 2 is here.

To announce Google’s AutoML, Google CEO Sundar Pichai wrote, “Today, designing neural nets is extremely time intensive, and requires an expertise that limits its use to a smaller community of scientists and engineers. That’s why we’ve created an approach called AutoML, showing that it’s possible for neural nets to design neural nets. We hope AutoML will take an ability that a few PhDs have today and will make it possible in three to five years for hundreds of thousands of developers to design new neural nets for their particular needs.” (emphasis mine)

Google CEO Sundar Pichai says that we all need to design our own neural nets

When Google’s Head of AI, Jeff Dean, suggested that 100x computational power could replace the need for machine learning expertise, computationally expensive neural architecture search was the only example he gave to illustrate this point. (around 23:50 in his TensorFlow DevSummit keynote)

This raises a number of questions: do hundreds of thousands of developers need to “design new neural nets for their particular needs” (to quote Pichai’s vision), or is there an effective way for neural nets to generalize to similar problems? Can large amounts of computational power really replace machine learning expertise?

In evaluating Google’s claims, it’s valuable to keep in mind that Google has a vested financial interest in convincing us that the key to effective use of deep learning is more computational power, because this is an area where they clearly beat the rest of us. If true, we may all need to purchase Google products. On its own, this doesn’t mean that Google’s claims are false, but it’s good to be aware of what financial motivations could underlie their statements.

In my previous posts, I shared an introduction to the history of AutoML, defined what neural architecture search is, and pointed out that for many machine learning projects, designing or choosing an architecture is nowhere near the hardest, most time-consuming, or most painful part of the problem. In today’s post, I want to look specifically at Google’s AutoML, a product which has received a lot of media attention, and address what it is, how it relates to transfer learning, and why the hype has been so disproportionate.

What is Google's AutoML?

Although the field of AutoML has been around for years (including open-source AutoML libraries, workshops, research, and competitions), in May 2017 Google co-opted the term AutoML for its neural architecture search. In blog posts accompanying announcements made at the conference Google I/O, Google CEO Sundar Pichai wrote, “That’s why we’ve created an approach called AutoML, showing that it’s possible for neural nets to design neural nets” and Google AI researchers Barret Zoph and Quoc Le wrote “In our approach (which we call “AutoML”), a controller neural net can propose a “child” model architecture…”

Google’s Cloud AutoML was announced in January 2018 as a suite of machine learning products. So far it consists of one publicly available product, AutoML Vision, an API that identifies or classifies objects in pictures. According to the product page, Cloud AutoML Vision relies on two core techniques: transfer learning and neural architecture search. Since we’ve already explained neural architecture search, let’s now take a look at transfer learning, and see how it relates to neural architecture search.

Headlines from just a few of the many, many articles written about Google's AutoML and Neural Architecture Search

Note: Google Cloud AutoML also has a drag-and-drop ML product that is still in alpha. I applied for access to it over 2 months ago, but I have not heard back from Google yet. I plan to write a post once it’s released.

What is transfer learning?

Transfer learning is a powerful technique that lets people with smaller datasets or less computational power achieve state-of-the-art results, by taking advantage of pre-trained models that have been trained on similar, larger data sets. Because the model learned via transfer learning doesn’t have to learn from scratch, it can generally reach higher accuracy with much less data and computation time than models that don’t use transfer learning.

Transfer learning is a core technique that we use throughout our free Practical Deep Learning for Coders course– and that our students have been applying in production in everything from their own startups to Fortune 500 companies. Although transfer learning seems to be considered “less sexy” than neural architecture search, it is being used to achieve ground-breaking academic results, such as in Jeremy Howard and Sebastian Ruder’s application of transfer learning to NLP, which achieved state-of-the-art classification on 6 datasets and is serving as a basis for further research in this area at OpenAI.
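To make this concrete, here is a minimal sketch of image transfer learning in plain PyTorch (an illustration of the general idea, not the fastai library’s recipe): start from a network pre-trained on ImageNet, freeze its weights, and train only a small new head for your own categories. The class count is a hypothetical placeholder.

```python
# A minimal transfer learning sketch in plain PyTorch (illustrative only):
# reuse an ImageNet-pretrained ResNet and train just a new classification
# head on a smaller dataset.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical: number of categories in your own dataset

model = models.resnet34(pretrained=True)   # weights learned on ImageNet
for param in model.parameters():
    param.requires_grad = False            # freeze the pretrained backbone

# Replace the final fully-connected layer with a new "head" for our classes.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new head is optimized at first; once it is trained, you can
# unfreeze the backbone and fine-tune the whole network at a lower rate.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
```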

Neural architecture search vs. transfer learning: two opposing approaches

The underlying idea of transfer learning is that neural net architectures will generalize for similar types of problems: for example, that many images have underlying features (such as corners, circles, dog faces, or wheels) that show up in a variety of different types of images. In contrast, the underlying idea of promoting neural architecture search for every problem is the opposite: that each dataset has a unique, highly specialized architecture it will perform best with.

Examples from Matthew Zeiler and Rob Fergus of 4 features learned by image classifiers: corners, circles, dog faces, and wheels

When neural architecture search discovers a new architecture, you must learn weights for that architecture from scratch, while with transfer learning, you begin with existing weights from a pre-trained model. In this sense, you can’t use neural architecture search and transfer learning on the same problem: if you’re learning a new architecture, you would need to train new weights for it; whereas if you are using transfer learning on a pretrained model, you can’t make substantial changes to the architecture.

Of course, you can apply transfer learning to an architecture learned through neural architecture search (which I think is a good idea!). This requires only that a few researchers use neural architecture search and open-source the models that they find. It is not necessary for all machine learning practitioners to be using neural architecture search themselves on all problems when they can instead use transfer learning. However, Jeff Dean’s keynote, Sundar Pichai’s blog post, Google Cloud’s promotional materials, and the media coverage all suggest the opposite: that everybody needs to be able to use neural architecture search directly.

What Neural Architecture Search is good for

Neural architecture search is good for finding new architectures! Google’s AmoebaNet was learned via neural architecture search, and (with the inclusion of fast.ai advances such as an aggressive learning schedule and changing the image size as training progresses) is now the cheapest way to train ImageNet on a single machine!

AmoebaNet was not designed with a reward function that involved the ability to scale, and so it didn’t scale as well as ResNet to multiple machines, but a neural net that scales well could potentially be learned in the future, optimized for different qualities.

In need of more evidence

We haven’t seen evidence that every dataset would be best modeled with its own custom model, as opposed to fine-tuning an existing model. Since neural architecture search requires a larger training set, this would be a particular issue for smaller data sets. Even some of Google’s own research uses transferable techniques instead of finding a new architecture for each data set, such as NASNet (blog post here), which learned an architectural building block on Cifar10 and then used that building block to create an architecture for ImageNet. I don’t know of any widely-entered machine learning competitions that have been won using neural architecture search yet.

Furthermore, we don’t know that the mega-computationally expensive approach to neural architecture search that Google touts is the superior approach. For instance, more recent papers such as Efficient Neural Architecture Search (ENAS) and Differentiable architecture search (DARTS) propose significantly more efficient algorithms. DARTS takes just 4 GPU days, compared to 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet (all learned to the same accuracy on Cifar-10). Jeff Dean is an author on the ENAS paper, which proposed a technique that is 1000x less computationally expensive, which seems inconsistent with his emphasis at the TF DevSummit one month later on using approaches that are 100x more computationally expensive.

Then why all the hype about Google's AutoML?

Given the above limitations, why has Google AutoML’s hype been so disproportionate to its proven usefulness (at least so far)? I think there are a few explanations:

  1. Google’s AutoML highlights some of the dangers of having an academic research lab embedded in a for-profit corporation. There is a temptation to try to build products around interesting academic research, without assessing if they fulfill an actual need. This is also the story of many AI start-ups, such as MetaMind or Geometric Intelligence, that end up as acquihires without ever having produced a product. My advice for startup founders is to avoid productionizing your PhD thesis and to avoid hiring only academic researchers.

  2. Google excels at marketing. Artificial intelligence is seen as an inaccessible and intimidating field by many outsiders, who don’t feel that they have a way to evaluate claims, particularly from lionized companies like Google. Many journalists fall prey to this as well, and uncritically channel Google’s hype into glowing articles. I periodically talk to people that do not work in machine learning, yet are excited about various Google ML products that they’ve never used and can’t explain anything about.

    One example of Google’s misleading coverage of its own achievements occurred when Google AI researchers released “a deep learning technology to reconstruct the true human genome”, compared their own work to Nobel prize-winning discoveries (the hubris!), and the story was picked up by Wired. However, Steven Salzberg, a distinguished professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University debunked Google’s post. Salzberg pointed out that the research didn’t actually reconstruct the human genome and was “little more than an incremental improvement over existing software, and it might be even less than that.” A number of other genomics researchers chimed in to agree with Salzberg.

    There is some great work happening at Google, but it would be easier to appreciate if we didn’t have to sift through so much misleading hype to figure out what is legitimate.

  3. Google has a vested interest in convincing us that the key to effective use of deep learning is more computational power, because this is an area where they clearly beat the rest of us. AutoML is often very computationally expensive, such as in the examples of Google using 450 K40 GPUs for 7 days (the equivalent of 3150 GPU days) to learn AmoebaNet.

    While engineers and the media often drool over bare-metal power and anything bigger, history has shown that innovation is often birthed instead by constraint and creativity. Google works on the biggest data possible using the most expensive computers possible; how well can this really generalize to the problems that the rest of us face living in a constrained world of limited resources?

    Innovation comes from doing things differently, not from doing things bigger. The recent success of fast.ai in the Stanford DAWNBench competition is one example of this.

How can we address the shortage of machine learning expertise?

To return to the issue that Jeff Dean raised in his TensorFlow DevSummit keynote about the global shortage of machine learning practitioners, a different approach is possible. We can remove the biggest obstacles to using deep learning in several ways by:

  1. making deep learning easier to use
  2. debunking myths about what it takes to do deep learning
  3. increasing access for people that lack the money or credit cards needed to use a cloud GPU

Making Deep Learning Easier to Use

Research to make deep learning easier to use has a huge impact, making it faster and simpler to train better networks. Examples of exciting discoveries that have now become standard practice are:

  • Dropout allows training on smaller datasets without over-fitting.
  • Batch normalization allows for faster training.
  • Rectified linear units help avoid vanishing gradients.

Newer research to improve ease of use includes:

  • The learning rate finder makes the training process more robust.
  • Super convergence speeds up training, requiring fewer computational resources.
  • “Custom heads” for existing architectures (e.g. modifying ResNet, which was initially designed for classification, so that it can be used to find bounding boxes or perform style transfer) allow for easier architecture reuse across a range of problems.

None of the above discoveries involve bare-metal power; instead, all of them were creative ideas of ways to do things differently.

Address Myths About What it Takes to Do Deep Learning

Another obstacle is the many myths that cause people to believe that deep learning isn’t for them: falsely believing that their data is too small, that they don’t have the right education or background, or that their computers aren’t big enough. One such myth says that only machine learning PhDs are capable of using deep learning, and many companies that can’t afford to hire expensive experts don’t even bother trying. However, it’s not only possible for companies to train the employees they already have to become machine learning experts, it’s even preferable, because your current employees already have domain expertise for the area you work in!


In my talk at the MIT Technology Review Conference, I addressed 6 Myths that lead people to incorrectly believe that using deep learning is harder than it is.


For the vast majority of people I talk with, the barriers to entry for deep learning are far lower than they expected: one year of coding experience and access to a GPU.

Increasing Access: Google Colab Notebooks

Although the cost of cloud GPUs (around 50 cents per hour) is within the budgets of many of us, I’m periodically contacted by students from around the world who can’t afford any GPU use at all. In some countries, rules about banking and credit cards can make it difficult for students to use services like AWS, even when they have the money. Google Colab notebooks are a solution! Colab notebooks provide a Jupyter notebook environment that requires no setup to use, runs entirely in the cloud, and gives users access to a free GPU (although long-running GPU use is not allowed). They can also be used to create documentation that contains working code samples running in an interactive environment. Google Colab notebooks will do much more to democratize deep learning than Google’s AutoML will; perhaps this would be a better target for Google’s marketing machine in the future.

If you haven’t already, check out Part 1: What is it that machine learning practitioners do? and Part 2: An Opinionated Introduction to AutoML Neural Architecture Search of this series.

An Opinionated Introduction to AutoML and Neural Architecture Search

This is part 2 in a series. Check out part 1 here and part 3 here.

Researchers from CMU and DeepMind recently released an interesting new paper, called Differentiable Architecture Search (DARTS), offering an alternative approach to neural architecture search, a very hot area of machine learning right now. Neural architecture search has been heavily hyped in the last year, with Google’s CEO Sundar Pichai and Google’s Head of AI Jeff Dean promoting the idea that neural architecture search and the large amounts of computational power it requires are essential to making machine learning available to the masses. Google’s work on neural architecture search has been widely and adoringly covered by the tech media (see here, here, here, and here for examples).

Headlines from just a few of the many, many articles written about Google's AutoML and Neural Architecture Search

During his keynote (starts around 22:20) at the TensorFlow DevSummit in March 2018, Jeff Dean posited that perhaps in the future Google could replace machine learning expertise with 100x computational power. He gave computationally expensive neural architecture search as a primary example (the only example he gave) of why we need 100x computational power in order to make ML accessible to more people.

Slide from Jeff Dean's Keynote at the TensorFlow Dev Summit

What is neural architecture search? Is it the key to making machine learning available to non-machine learning experts? I will dig into these questions in this post, and in my next post, I will look specifically at Google’s AutoML. Neural architecture search is a part of a broader field called AutoML, which has also been receiving a lot of hype and which we will consider first.

What is AutoML?

The term AutoML has traditionally been used to describe automated methods for model selection and/or hyperparameter optimization. These methods exist for many types of algorithms, such as random forests, gradient boosting machines, neural networks, and more. The field of AutoML includes open-source AutoML libraries, workshops, research, and competitions. Beginners often feel like they are just guessing as they test out different hyperparameters for a model, and automating the process could make this piece of the machine learning pipeline easier, as well as speeding things up even for experienced machine learning practitioners.

There are a number of AutoML libraries, the oldest of which is AutoWEKA, which was first released in 2013 and automatically chooses a model and selects hyperparameters. Other notable AutoML libraries include auto-sklearn (which extends AutoWEKA to Python), H2O AutoML, and TPOT. AutoML.org (formerly known as ML4AAD, Machine Learning for Automated Algorithm Design) has been organizing AutoML workshops at the academic machine learning conference ICML yearly since 2014.
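As a toy illustration of the kind of search these libraries automate, here is a hedged sketch using scikit-learn’s built-in randomized hyperparameter search over a random forest. This is a hand-rolled baseline for illustration, not the API of auto-sklearn, H2O AutoML, or TPOT.

```python
# A toy illustration of the kind of search AutoML libraries automate:
# randomized hyperparameter search over a random forest with scikit-learn.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_digits(return_X_y=True)

param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 5, 10, 20],
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,          # try 20 random configurations
    cv=3,               # score each one with 3-fold cross-validation
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```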

How useful is AutoML?

AutoML provides a way to select models and optimize hyper-parameters. It can also be useful in getting a baseline to know what level of performance is possible for a problem. So does this mean that data scientists can be replaced? Not yet, as we need to keep in mind the context of everything else that machine learning practitioners do.

For many machine learning projects, choosing a model is just one piece of the complex process of building machine learning products. As I covered in my previous post, projects can fail if participants don’t see how interconnected the various parts of the pipeline are. I thought of over 30 different steps that can be involved in the process. I highlighted two of the most time-consuming aspects of machine learning (in particular, deep learning) as cleaning data (and yes, this is an inseparable part of machine learning) and training models. While AutoML can help with selecting a model and choosing hyperparameters, it is important to keep perspective on what other data expertise is still needed and on the difficult problems that remain.

I will suggest some alternate approaches to AutoML for making machine learning practitioners more effective in the final section.

What is neural architecture search?

Now that we’ve covered some of what AutoML is, let’s look at a particularly active subset of the field: neural architecture search. Google CEO Sundar Pichai wrote that, “designing neural nets is extremely time intensive, and requires an expertise that limits its use to a smaller community of scientists and engineers. That’s why we’ve created an approach called AutoML, showing that it’s possible for neural nets to design neural nets.”

What Pichai refers to as using “neural nets to design neural nets” is known as neural architecture search; typically reinforcement learning or evolutionary algorithms are used to design the new neural net architectures. This is useful because it allows us to discover architectures far more complicated than what humans may think to try, and these architectures can be optimized for particular goals. Neural architecture search is often very computationally expensive.

To be precise, neural architecture search usually involves learning something like a layer (often called a “cell”) that can be assembled as a stack of repeated cells to create a neural network:

Diagram from Zoph et. al. 2017. On the left is the full neural network of stacked cells, and on the right is the inside structure of a cell
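For intuition, here is a schematic sketch (hypothetical code, not an actual searched architecture) of how a single searched cell gets repeated in a stack to form the full network.

```python
# A schematic sketch: once a "cell" has been found by neural architecture
# search, the full network is typically just that cell repeated in a stack.
import torch.nn as nn

class Cell(nn.Module):
    """Stand-in for a searched cell; a real one would be a learned
    combination of convolutions, pooling ops, and skip connections."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.ops(x)   # residual connection around the cell

def build_network(num_cells=8, channels=64, num_classes=10):
    stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
    cells = [Cell(channels) for _ in range(num_cells)]   # the repeated stack
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(channels, num_classes))
    return nn.Sequential(stem, *cells, head)
```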

The literature of academic papers on neural architecture search is extensive, so I will highlight just a few recent papers here:

  • The term AutoML jumped to “mainstream” prominence with work by Google AI researchers (paper here) Quoc Le and Barret Zoph, which was featured at Google I/O in May 2017. This work used reinforcement learning to find new architectures for the computer vision problem Cifar10 and the NLP problem Penn Tree Bank, and achieved similar results to existing architectures.
Diagram from Le and Zoph's blog post: the simpler architecture on the left was designed by a human and the more complicated architecture on the right was designed by a neural net.
  • NASNet from Learning Transferable Architectures for Scalable Image Recognition (blog post here). This work searches for an architectural building block on a small data set (Cifar10) and then builds an architecture for a large data set (ImageNet). This research was very computationally intensive, taking 1800 GPU days (the equivalent of almost 5 years for 1 GPU) to learn the architecture (the team at Google used 500 GPUs for 4 days!).

  • AmoebaNet from Regularized Evolution for Image Classifier Architecture Search. This research was even more computationally intensive than NASNet, taking the equivalent of 3150 GPU days (almost 9 years for 1 GPU) to learn the architecture (the team at Google used 450 K40 GPUs for 7 days!). AmoebaNet consists of “cells” learned via an evolutionary algorithm, showing that artificially-evolved architectures can match or surpass human-crafted and reinforcement learning-designed image classifiers. After incorporating advances from fast.ai such as an aggressive learning schedule and changing the image size as training progresses, AmoebaNet is now the cheapest way to train ImageNet on a single machine.

  • Efficient Neural Architecture Search (ENAS): used far fewer GPU-hours than previously existing automatic model design approaches, and notably, was 1000x less expensive than standard Neural Architecture Search. This research was done using a single GPU for just 16 hours.

What about DARTS?

Differentiable architecture search (DARTS). This research was recently released by a team from Carnegie Mellon University and DeepMind, and I’m excited about the idea. DARTS assumes the space of candidate architectures is continuous, not discrete, and this allows it to use gradient-based approaches, which are vastly more efficient than the black-box search used by most neural architecture search algorithms.

Diagram from DARTS, which treats the space of all possible architectures as continuous, not discrete

To learn a network for Cifar-10, DARTS takes just 4 GPU days, compared to 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet (all learned to the same accuracy). This is a huge gain in efficiency! Although more exploration is needed, this is a promising research direction. Given how Google frequently equates neural architecture search with huge computational expense, efficient ways to do architecture search have most likely been under-explored.

How useful is Neural Architecture Search?

In his TensorFlow DevSummit keynote (starts around 22:20), Jeff Dean suggested that a significant part of deep learning work is trying out different architectures. This was the only step of machine learning that Dean highlighted in his short talk, and I was surprised by his emphasis. Sundar Pichai’s blog post contained a similar assertion.

Jeff Dean's slide showing that neural architecture search can try 20 different models to find the most accurate

However, choosing a model is just one piece of the complex process of building machine learning products. In most cases, architecture selection is nowhere near the hardest, most time-consuming, or most significant part of the problem. Currently, there is no evidence that each new problem would be best modeled with its own unique architecture, and most practitioners consider it unlikely that this will ever be the case.

Organizations like Google working on architecture design and sharing the architectures they discover with the rest of us are providing an important and helpful service. However, the underlying architecture search method is only needed for that tiny fraction of researchers that are working on foundational neural architecture design. The rest of us can just use the architectures they find via transfer learning.

How else could we make machine learning practitioners more effective? AutoML vs. Augmented ML

The field of AutoML, including neural architecture search, has been largely focused on the question: how can we automate model selection and hyperparameter optimization? However, automation ignores the important role of human input. I’d like to propose an alternate question: how can humans and computers work together to make machine learning more effective? The focus of augmented ML is on figuring out how a human and machine can best work together to take advantage of their different strengths.

An example of augmented ML is Leslie Smith’s learning rate finder (paper here), which is implemented in the fastai library (a high level API that sits on top of PyTorch) and taught as a key technique in our free deep learning course. The learning rate is a hyperparameter that can determine how quickly your model trains, or even whether it successfully trains at all. The learning rate finder allows a human to find a good learning rate in a single step, by looking at a generated chart. It’s faster than AutoML approaches to the same problem, improves the data scientist’s understanding of the training process, and encourages more powerful multi-step approaches to training models.

Diagram from Surmenok's blog post on the learning rate finder, showing relationship between learning rate and loss
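Here is a minimal sketch of the idea behind the learning rate range test (an illustration of the technique, not the fastai lr_find implementation; the helper name and default values are my own):

```python
# A minimal sketch of a learning-rate range test: increase the learning
# rate exponentially over a number of mini-batches, record the loss, and
# plot loss vs. learning rate to choose a good value by eye.
import math
import torch

def lr_range_test(model, train_loader, criterion,
                  lr_start=1e-7, lr_end=10, num_iters=100):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_start)
    gamma = (lr_end / lr_start) ** (1 / num_iters)   # multiplicative LR step
    lrs, losses = [], []
    data_iter = iter(train_loader)
    for _ in range(num_iters):
        try:
            xb, yb = next(data_iter)
        except StopIteration:
            data_iter = iter(train_loader)
            xb, yb = next(data_iter)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]["lr"])
        losses.append(loss.item())
        if not math.isfinite(losses[-1]):
            break                                    # loss exploded: stop
        for group in optimizer.param_groups:
            group["lr"] *= gamma                     # exponential LR growth
    return lrs, losses   # plot these and pick an lr just before the minimum
```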

There’s another problem with the focus on automating hyperparameter selection: it overlooks the possibility that some types of model are more widely useful, have fewer hyperparameters to tune, and are less sensitive to choice of hyperparameters. For example, a key benefit of random forests over gradient boosting machines (GBMs) is that random forests are more robust, whereas GBMs tend to be fairly sensitive to minor changes in hyperparameters. As a result, random forests are widely used in industry. Researching ways to effectively remove hyperparameters (through smarter defaults, or through new models) can have a huge impact. When I first became interested in deep learning in 2013, it was overwhelming to feel that there were such a huge number of hyperparameters, and I’m happy that newer research and tools have helped eliminate many of them (especially for beginners). For instance, in the fast.ai course, beginners start by only having to choose a single hyperparameter, the learning rate, and we even give you a tool to do that!

Stay tuned...

Now that we have an overview of what the fields of AutoML and neural architecture search are, we can take a closer look at Google’s AutoML in the next post.

Please be sure to check out Part 3 of this post next week!

What do machine learning practitioners actually do?

This post is part 1 of a series. Part 2 is an opinionated introduction to AutoML and neural architecture search, and Part 3 looks at Google’s AutoML in particular.

There are frequent media headlines about both the scarcity of machine learning talent (see here, here, and here) and about the promises of companies claiming their products automate machine learning and eliminate the need for ML expertise altogether (see here, here, and here). In his keynote at the TensorFlow DevSummit, Google’s head of AI Jeff Dean estimated that there are tens of millions of organizations that have electronic data that could be used for machine learning but lack the necessary expertise and skills. I follow these issues closely since my work at fast.ai focuses on enabling more people to use machine learning and on making it easier to use.

In thinking about how we can automate some of the work of machine learning, as well as how to make it more accessible to people with a wider variety of backgrounds, it’s first necessary to ask, what is it that machine learning practitioners do? Any solution to the shortage of machine learning expertise requires answering this question: whether it’s so we know what skills to teach, what tools to build, or what processes to automate.

What do machine learning practitioners do? (Source: #WOCinTech Chat)

This post is the first in a 3-part series. It will address what it is that machine learning practitioners do, with Part 2 explaining AutoML and neural architecture search (which several high profile figures have suggested will be key to decreasing the need for data scientists) and Part 3 will cover Google’s heavily hyped AutoML product in particular.

Building Data Products is Complex Work

While many academic machine learning sources focus almost exclusively on predictive modeling, that is just one piece of what machine learning practitioners do in the wild. The processes of appropriately framing a business problem, collecting and cleaning the data, building the model, implementing the result, and then monitoring for changes are interconnected in many ways that often make it hard to silo off just a single piece (without at least being aware of what the other pieces entail). As Jeremy Howard et al. wrote in Designing great data products, “Great predictive modeling is an important part of the solution, but it no longer stands on its own; as products become more sophisticated, it disappears into the plumbing.”

Building Data Products is Complex Work (Source: Wikimedia Commons)

A team from Google, D. Sculley et al., wrote the classic Machine Learning: The High-Interest Credit Card of Technical Debt about the code complexity and technical debt often created when using machine learning in practice. The authors identify a number of system-level interactions, risks, and anti-patterns, including:

  • glue code: massive amount of supporting code written to get data into and out of general-purpose packages
  • pipeline jungles: the system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output
  • re-use input signals in ways that create unintended tight coupling of otherwise disjoint systems
  • risk that changes in the external world may make models or input signals change behavior in unintended ways, and these can be difficult to monitor

The authors write, “A remarkable portion of real-world ‘machine learning’ work is devoted to tackling issues of this form… It’s worth noting that glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated ‘research’ and ‘engineering’ roles… It may be surprising to the academic community to know that only a tiny fraction of the code in many machine learning systems is actually doing ‘machine learning’.” (emphasis mine)

When machine learning projects fail

In a previous post, I identified some failure modes in which machine learning projects are not effective in the workplace:

  • The data science team builds really cool stuff that never gets used. There’s no buy-in from the rest of the organization for what they’re working on, and some of the data scientists don’t have a good sense of what can realistically be put into production.
  • There is a backlog with data scientists producing models much faster than there is engineering support to put them in production.
  • The data infrastructure engineers are separate from the data scientists. The pipelines don’t have the data the data scientists are asking for now, and the data scientists are under-utilizing the data sources the infrastructure engineers have collected.
  • The company has definitely decided on feature/product X. They need a data scientist to gather some data that supports this decision. The data scientist feels like the PM is ignoring data that contradicts the decision; the PM feels that the data scientist is ignoring other business logic.
  • The data science team interviews a candidate with impressive math modeling and engineering skills. Once hired, the candidate is embedded in a vertical product team that needs simple business analytics. The data scientist is bored and not utilizing their skills.

I framed these as organizational failures in my original post, but they can also be described as various participants being overly focused on just one slice of the complex system that makes up a full data product. These are failures of communication and goal alignment between different parts of the data product pipeline.

So, what do machine learning practitioners do?

As suggested above, building a machine learning product is a multi-faceted and complex task. Here are some of the things that machine learning practitioners may need to do during the process:

Understanding the context:

  • identify areas of the business that could benefit from machine learning
  • communicate with other stakeholders about what machine learning is and is not capable of (there are often many misconceptions)
  • develop understanding of business strategy, risks, and goals to make sure everyone is on the same page
  • identify what kind of data the organization has
  • appropriately frame and scope the task
  • understand operational constraints (e.g. what data is actually available at inference time)
  • proactively identify ethical risks, including how your work could be mis-used by harassers, trolls, authoritarian governments, or for propaganda/disinformation campaigns (and plan how to reduce these risks)
  • identify potential biases and potential negative feedback loops

Data:

  • make plans to collect more or different data (if needed and if possible)
  • stitch together data from many different sources: often this data has been collected in different formats or with inconsistent conventions
  • deal with missing or corrupted data
  • visualize the data
  • create appropriate training, validation, and test sets

Modeling:

  • choose which model to use
  • fit model resource needs into constraints (e.g. will the completed model need to run on an edge device, in a low memory or high latency environment, etc)
  • choose hyperparameters (e.g. in the case of deep learning, this includes choosing an architecture, loss function, and optimizer)
  • train the model (and debug why it’s not training). This can involve:
    • adjusting hyperparameters (e.g. the learning rate)
    • outputting intermediate results to see how the loss, training error, and validation error are changing with time
    • inspecting the data the model is wrong on to look for patterns
    • identifying underlying errors or issues with the data
    • realizing you need to change how you clean and pre-process the data
    • realizing you need more or different data augmentation
    • realizing you need more or different data
    • trying out different models
    • identifying if you are under- or over-fitting

Productionize:

  • creating an API or web app with your model as an endpoint in order to productionize
  • exporting your model into the needed format
  • plan for how often your model will need to be retrained with updated data (e.g. perhaps you will retrain nightly or weekly)

Monitor:

  • track model performance over time
  • monitor the input data, to identify if it changes with time in a way that would invalidate your model
  • communicate your results to the rest of the organization
  • have a plan in place for how you will monitor and respond to mistakes or unexpected consequences

Certainly, not every machine learning practitioner needs to do all of the above steps, but components of this process will be a part of many machine learning applications. Even if you are working on just a subset of these steps, a familiarity with the rest of the process will help ensure that you are not overlooking considerations that would keep your project from being successful!

Two of the hardest parts of Machine Learning

For myself and many others I know, I would highlight two of the most time-consuming and frustrating aspects of machine learning (in particular, deep learning) as:

  1. Dealing with data formatting, inconsistencies, and errors is often a messy and tedious process.
  2. Training deep learning models is a notoriously brittle process right now.

Is cleaning data really part of ML? Yes.

Dealing with data formatting, inconsistencies, and errors is often a messy and tedious process. People will sometimes describe machine learning as separate from data science, as though for machine learning, you can just begin with your nicely cleaned, formatted data set. However, in my experience, the process of cleaning a data set and training a model are usually interwoven: I frequently find issues in the model training that cause me to go back and change the pre-processing for the input data.

Dealing with messy and inconsistent data is necessary

Training Deep Learning Models is Brittle and Finicky (for now)

The difficulty of getting models to train deters many beginners, who often wind up feeling discouraged. Even experts frequently complain of how frustrating and fickle the training process can be. One AI researcher at Stanford told me, I taught a course on deep learning and had all the students do their own projects. It was so hard. The students couldn’t get their models to train, and we were like “well, that’s deep learning”. Ali Rahimi, an AI researcher with over a decade of experience and winner of the NIPS 2017 Test of Time Award, complained about the brittleness of training in his NIPS award speech. How many of you have designed a deep net from scratch, built it from the ground up, architecture and all, and when it didn’t work, you felt bad about yourself? Rahimi asked the audience of AI researchers, and many raised their hands. Rahimi continued, This happens to me about every 3 months.

The fact that even AI experts sometimes have trouble training new models implies that the process has yet to be automated in a way that could be incorporated into a general-purpose product. Some of the biggest advances in deep learning will come through discovering more robust training methods. We have already seen some of this with advances like dropout, super convergence, and transfer learning, all of which make training easier. Through the power of transfer learning (to be discussed in Part 3), training can be a robust process when defined for a narrow enough problem domain; however, we still have a ways to go in making training more robust in general.

For Academic Researchers

Even if you are working on theoretical machine learning research, it is useful to understand the process that machine learning practitioners working on practical problems go through, as that might provide insights on what the most relevant or high-impact areas of research are.

As Google engineers D. Sculley et al. wrote, “Technical debt is an issue that both engineers and researchers need to be aware of. Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice… Paying down technical debt is not always as exciting as proving a new theorem, but it is a critical part of consistently strong innovation. And developing holistic, elegant solutions for complex machine learning systems is deeply rewarding work.” (emphasis mine)

AutoML

Now that we have an overview of some of the tasks that machine learning practitioners do as part of their work, we are ready to evaluate attempts to automate this work. As its name suggests, AutoML is one field in particular that has focused on automating machine learning, and a subfield of AutoML called neural architecture search is currently receiving a ton of attention. In Part 2, I will explain what AutoML and neural architecture search are, and in Part 3, I will look at Google’s AutoML in particular.

Be sure to check out Part 2 here, and stay tuned for Part 3!

AdamW and Super-convergence is now the fastest way to train neural nets

Note from Jeremy: Welcome to fast.ai’s first scholar-in-residence, Sylvain Gugger. What better way to introduce him than to publish the results of his first research project at fast.ai. We’ll be using the results of this research to change how we train models in the next version of our course and in our fastai library, as a result of which students and practitioners will be able to reliably train their models far faster than previous approaches.

The Adam roller-coaster

The journey of the Adam optimizer has been quite a roller coaster. First introduced in 2014, it is, at its heart, a simple and intuitive idea: why use the same learning rate for every parameter, when we know that some surely need to be moved further and faster than others? Since the square of recent gradients tells us how much signal we’re getting for each weight, we can just divide by that to ensure even the most sluggish weights get their chance to shine. Adam takes that idea, adds on the standard approach to momentum, and (with a little tweak to keep early batches from being biased) that’s it!
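In code form, here is a minimal sketch of the Adam update for a single parameter array, following the description above: exponential moving averages of the gradient and of its square, plus a bias correction for the early steps. The function wrapper and defaults are just for illustration.

```python
# A minimal sketch of the Adam update rule (illustrative, not a library
# implementation): keep running averages of the gradient and its square,
# correct their early-step bias, and scale the step per parameter.
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # momentum: average gradient
    v = beta2 * v + (1 - beta2) * grad**2        # average of squared gradient
    m_hat = m / (1 - beta1**t)                   # bias correction (early steps)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size
    return w, m, v
```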

When first released, the deep learning community was full of excitement after seeing charts like this one from the original paper:

Comparison between Adam and other optimizers

200% speed up in training! “Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field machine learning” concluded the paper. Ah yes, those were the days, over three years ago now, a life-time in deep-learning-years. But it started to become clear that all was not as we hoped. Few research articles used it to train their models, new studies began to clearly discourage applying it, and several experiments showed that plain ole SGD with momentum was performing better. By the time the 2018 fast.ai course had come around, the decision was made to cut poor Adam from the early lessons.

But at the end of 2017, Adam seemed to get a new lease of life. Ilya Loshchilov and Frank Hutter pointed out in their paper that the way weight decay is implemented in Adam in every library seems to be wrong, and proposed a simple way (which they call AdamW) to fix it. Although their results were slightly mixed, they did show some encouraging charts, such as this one:

Comparison between Adam and AdamW

We expected to see the Adam enthusiasm return, since it seemed those first results could perhaps be found again. But that’s not what happened. Indeed, the only deep learning framework that implemented the fix was fastai, using code written by Sylvain. Without broad framework availability, day-to-day practitioners were stuck with the old, “broken” Adam.

But that’s not the only problem. More obstacles lay ahead. Two separate papers pointed out apparent problems with the convergence proof of poor Adam, although one of them also claimed a fix (which they called amsgrad, and which won a “best paper” award at the prestigious ICLR conference). But if we’ve learned anything from this potted history of this most dramatic life (at least, dramatic by optimizer standards), it’s that nothing is as it seems. And indeed, PhD student Jeremy Bernstein has pointed out that the claimed convergence problems are actually just signs of poorly chosen hyper-parameters, and that perhaps amsgrad won’t fix things anyway. Another PhD student, Filip Korzeniowski, showed some early results that seemed to support this discouraging view of amsgrad.

Getting off the roller-coaster

So for those of us who just want to train accurate models fast, what do we do? Let’s settle this debate the same way scientific debates have been settled for hundreds of years: with experiments! We’ll tell you all the details in just a moment, but first, here’s a summary of the results:

  • Properly tuned, Adam really works! We got new state of the art results (in terms of training time) on various tasks like
    • training CIFAR10 to >94% accuracy in as few as 18 epochs with Test Time Augmentation or with 30 epochs without, as in the DAWNBench competition;
    • fine-tuning Resnet50 to 90% accuracy on the Stanford Cars dataset in just 60 epochs (previous reports reaching the same accuracy used 600);
    • training an AWD LSTM or QRNN from scratch in 90 epochs (or an hour and a half on a single GPU) to state-of-the-art perplexity on Wikitext-2 (previous reports used 750 epochs for LSTMs and 500 for QRNNs).
  • That means that we’ve seen (for the first time we’re aware of) super convergence using Adam! Super convergence is a phenomenon that occurs when training a neural net with very high learning rates, growing the learning rate for the first half of training. Before it was understood, training CIFAR10 to 94% accuracy took about 100 epochs.
  • In contrast to previous work, we see Adam getting about as good accuracy as SGD+Momentum on every CNN image problem we’ve tried it on, as long as it’s properly tuned, and it’s nearly always a bit faster too.
  • The suggestion that amsgrad is a poor “fix” is correct: we consistently found that amsgrad didn’t achieve any gain in accuracy (or other relevant metrics) over plain Adam/AdamW.

When you hear people saying that Adam doesn’t generalize as well as SGD+Momentum, you’ll nearly always find that they’re choosing poor hyper-parameters for their model. Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam.

Here’s an overview of the rest of this article:

  1. AdamW
    1. Understanding AdamW
    2. Implementing AdamW
    3. Results of AdamW experiments
  2. amsgrad
    1. Understanding amsgrad
    2. Implementing amsgrad
    3. Results of amsgrad experiments
  3. Tables of full results

AdamW

Understanding AdamW: Weight decay or L2 regularization?

L2 regularization is a classic method to reduce over-fitting, and consists in adding to the loss function the sum of the squares of all the weights of the model, multiplied by a given hyper-parameter (all equations in this article use python, numpy, and pytorch notation):

final_loss = loss + wd * all_weights.pow(2).sum() / 2

…where wd is the hyper-parameter to set. This is also called weight decay, because when applying vanilla SGD it’s equivalent to updating the weight like this:

w = w - lr * w.grad - lr * wd * w

(Note that the derivative of w**2 with respect to w is 2w.) In this equation we see how we subtract a little portion of the weight at each step, hence the name decay.

All libraries we have looked at use the first of these forms. (In practice, it is nearly always implemented by adding wd*w to the gradients, rather than actually changing the loss function: we don’t want to add more computations by modifying the loss when there is an easier way.)
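
As a minimal, self-contained sketch of that trick (toy model and hypothetical hyper-parameter values; not any particular library’s implementation), this is what it looks like for vanilla SGD:

import torch
# Sketch: L2 regularization applied by adding wd*w to the gradients rather than
# modifying the loss, followed by a vanilla SGD step (toy model, hypothetical values).
model = torch.nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)
lr, wd = 0.1, 1e-4
loss = torch.nn.functional.mse_loss(model(x), y)   # note: no L2 term in the loss itself
loss.backward()
with torch.no_grad():
    for w in model.parameters():
        w.grad += wd * w              # same gradient as adding wd*||w||^2/2 to the loss
        w -= lr * w.grad              # SGD step: w <- w - lr*w.grad - lr*wd*w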

So why make a distinction between those two concepts if they are the same thing? The answer is that they are only the same thing for vanilla SGD, but as soon as we add momentum, or use a more sophisticated optimizer like Adam, L2 regularization (first equation) and weight decay (second equation) become different. In the rest of this article, when we talk about weight decay, we will always refer to this second formula (decay the weight by a little bit) and talk about L2 regularization if we want to mention the classic way.

Let’s look at SGD with momentum for instance. Using L2 regularization consists in adding wd*w to the gradients (as we saw earlier) but the gradients aren’t subtracted from the weights directly. First we compute a moving average:

moving_avg = alpha * moving_avg + (1-alpha) * (w.grad + wd*w)

…and it’s this moving average that will be multiplied by the learning rate and subtracted from w. So the part linked to the regularization that will be taken from w is lr * (1-alpha) * wd * w, plus a combination of the previous weights that were already in moving_avg.

On the other hand, weight decay’s update will look like

moving_avg = alpha * moving_avg + (1-alpha) * w.grad 
w = w - lr * moving_avg - lr * wd * w

We can see that the part subtracted from w linked to regularization isn’t the same in the two methods. When using the Adam optimizer, the difference is even bigger: with L2 regularization, we add wd*w to the gradients, then compute a moving average of the gradients and of their squares, and use both of them for the update; with weight decay, we do the normal Adam update and then subtract lr * wd * w from each weight.
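
To make the contrast concrete, here is a toy, single-step sketch of both variants (hypothetical values; bias correction and epsilon are omitted here, as elsewhere in this article):

import numpy as np
# Toy, single-step sketch contrasting Adam + L2 regularization with AdamW.
rng = np.random.default_rng(0)
w, grad = rng.normal(size=5), rng.normal(size=5)          # weights and their gradient
avg_grads, avg_squared = np.zeros(5), 0.1 * np.ones(5)    # running Adam statistics
lr, wd, beta1, beta2 = 1e-3, 0.1, 0.9, 0.99
# Adam + L2 regularization: wd*w is folded into the gradient, so it also gets
# averaged and divided by sqrt(avg_squared).
g = grad + wd * w
m = beta1 * avg_grads + (1 - beta1) * g
v = beta2 * avg_squared + (1 - beta2) * g**2
w_l2 = w - lr * m / np.sqrt(v)
# AdamW (decoupled weight decay): the statistics only see the raw gradient, and
# the decay is subtracted from the weights directly.
m = beta1 * avg_grads + (1 - beta1) * grad
v = beta2 * avg_squared + (1 - beta2) * grad**2
w_wd = w - lr * m / np.sqrt(v) - lr * wd * w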

Clearly those are two different approaches. And after experimenting with this, Ilya Loshchilov and Frank Hutter suggest in their article that we should use weight decay with Adam, and not the L2 regularization that classic deep learning libraries implement.

Implementing AdamW

How can we do this? Very easily if you’re using the fastai library, since it’s implemented inside. Specifically, if you use the fit function, just add the argument use_wd_sched=True:

learn.fit(lr, 1, wds=1e-4, use_wd_sched=True)

If you prefer the new training API, you can use the argument wd_loss=False (for weight decay not computed in the loss) in each of your training phases:

phases = [TrainingPhase(1, optim.Adam, lr, wds=1e-4, wd_loss=False)]
learn.fit_opt_sched(phases)

Here’s a quick summary of how we implemented this in fastai. Inside the step function of the optimizer, only the gradients are used to modify the parameters; the values of the parameters themselves aren’t used at all (except for the weight decay, but we will be dealing with that outside). We can then implement weight decay by simply doing it before the step of the optimizer. It still has to be done after the gradients are computed (otherwise it would impact the gradient values), so inside your training loop, you have to look for this spot:

loss.backward()
#Do the weight decay here!
optimizer.step()

Of course, the optimizer should have been set with wd=0, otherwise it will do some L2 regularization, which is exactly what we don’t want right now. Now, in that spot, we have to loop over all the parameters and do our little weight decay update. Your parameters should all be inside the param_groups of your optimizer (a list of dictionaries), so the loop looks something like this:

loss.backward()
# Decay the weights in place before the optimizer step: w <- w - lr * wd * w
for group in optimizer.param_groups:          # param_groups is a list of dicts
    for param in group['params']:
        param.data.mul_(1 - wd * group['lr'])
optimizer.step()

Results of AdamW experiments: does it work?

Our first tests on computer vision problems were very encouraging. Specifically, the accuracy we managed to get in 30 epochs (the time it takes SGD to reach 94% accuracy with a 1cycle policy) with Adam and L2 regularization was 93.96% on average, going over 94% one time in two. We consistently reached values between 94% and 94.25% with Adam and weight decay. To do this, we found the optimal value for beta2 when using a 1cycle policy was 0.99. We treated the beta1 parameter like the momentum in SGD (meaning it goes from 0.95 to 0.85 as the learning rate grows, then back to 0.95 when the learning rate gets lower).

Accuracy with L2 regularization or weight decay
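
As a rough illustration of the schedule just described, here is a simplified, hypothetical sketch of a 1cycle-style schedule for the learning rate and beta1 (the values and the function are placeholders; this is not fastai’s actual 1cycle implementation):

# Simplified sketch: over one cycle the learning rate ramps up then back down,
# while beta1 moves in the opposite direction (0.95 -> 0.85 -> 0.95).
def one_cycle(step, total_steps, lr_max=3e-3, lr_min=3e-5, b1_max=0.95, b1_min=0.85):
    half = total_steps / 2
    pct = step / half if step < half else (step - half) / half   # 0 -> 1 within each half
    if step < half:                    # first half: lr rises while beta1 falls
        return lr_min + pct * (lr_max - lr_min), b1_max - pct * (b1_max - b1_min)
    return lr_max - pct * (lr_max - lr_min), b1_min + pct * (b1_max - b1_min)

# At each iteration you could then update the optimizer's hyper-parameters, e.g.:
#   lr, beta1 = one_cycle(i, n_iters)
#   for group in optimizer.param_groups:
#       group['lr'], group['betas'] = lr, (beta1, 0.99)   # beta2 fixed at 0.99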

Even more impressive, using Test Time Augmentation (i.e. taking the average of the predictions on one image of the test set and four data-augmented versions of it), we can reach 94% accuracy in just 18 epochs (93.98% on average)! With plain Adam and L2 regularization, going over 94% happened only once in every twenty tries.
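
Here is a rough sketch of that kind of Test Time Augmentation; the model and augment transform are placeholders, and this is not fastai’s actual TTA implementation:

import torch
# Sketch of Test Time Augmentation: average the model's predictions on the original
# image and on four randomly augmented copies of it.
def tta_predict(model, image, augment, n_aug=4):
    model.eval()
    with torch.no_grad():
        batch = torch.stack([image] + [augment(image) for _ in range(n_aug)])
        return model(batch).mean(dim=0)    # average the five sets of predictions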

One thing to take into account in those comparisons is that changing the way we regularize changes the best values of weight decay and learning rate. In the tests we ran, the best value of the weight decay hyper-parameter when using L2 regularization was 1e-6 (with a maximum learning rate of 1e-3), while 0.3 was the best value when using true weight decay (with a learning rate of 3e-3). This difference of several orders of magnitude has been very consistent in all our tests, and comes primarily from the fact that L2 regularization gets effectively divided by the average norm of the gradients (which are pretty low), and that learning rates with Adam are pretty small (so the update from weight decay needs a stronger coefficient).

So, is weight decay always better than L2 regularization with Adam? We haven’t found a situation where it’s significantly worse, but for transfer-learning problems (e.g. fine-tuning Resnet50 on Stanford cars) and for RNNs, it didn’t give better results either.

amsgrad

Understanding amsgrad

Amsgrad was introduced in a recent article by Sashank J. Reddi, Satyen Kale and Sanjiv Kumar. By analyzing the proof of convergence for the Adam optimizer, they spotted a mistake in the update rule that could cause the algorithm to converge to a sub-optimal point. They designed theoretical experiments that showed situations where Adam would fail and proposed a simple fix.

To understand the error and the fix, let’s have a look at the update rule of Adam (if you need a refresher, Sebastian’s got you covered):

avg_grads = beta1 * avg_grads + (1-beta1) * w.grad
avg_squared = beta2 * (avg_squared) + (1-beta2) * (w.grad ** 2)
w = w - lr * avg_grads / sqrt(avg_squared)

We’ve just skipped the bias correction (useful for the beginning of training) to focus on the important point. The error the authors spotted in the proof of Adam is that it requires the quantity

lr / sqrt(avg_squared)

…which is the step we take in the direction of our average gradients, to be decreasing over training. Since the learning rate is often held constant or decreased (except for crazy people like us trying to obtain super-convergence), the fix the authors proposed was to force the avg_squared quantity to be increasing, by adding another variable to keep track of its maximum.

Implementing amsgrad

The associated article won an award at ICLR 2018 and gained such popularity that it’s already implemented in two of the main deep learning libraries, pytorch and Keras. There is little to do except turn the option on with amsgrad=True.
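
For example, with pytorch’s built-in Adam it is just a keyword argument (minimal sketch with a toy model for illustration):

import torch
# Minimal sketch: enabling amsgrad on pytorch's built-in Adam (toy model).
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)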

This causes the weight update code from the previous section to be changed to something like this:

avg_grads = beta1 * avg_grads + (1-beta1) * w.grad
avg_squared = beta2 * (avg_squared) + (1-beta2) * (w.grad ** 2)
max_squared = max(avg_squared, max_squared)   # element-wise maximum (e.g. torch.max for tensors)
w = w - lr * avg_grads / sqrt(max_squared)

Results of amsgrad experiments: a lot of noise for nothing

Amsgrad turns out to be very disappointing. In none of our experiments did we find that it helped the slightest bit, and even if it’s true that the minimum found by amsgrad is sometimes slightly lower (in terms of loss) than the one reached by Adam, the metrics (accuracy, f1 score…) always end up worse (see the tables in our introduction, or more examples here).

The convergence proof for the Adam optimizer doesn’t matter much in deep learning (since it’s for convex problems, and deep learning losses aren’t convex), and the mistake the authors found in it only mattered for synthetic experiments that have nothing to do with real-life problems. Actual tests show that when those avg_squared gradients want to decrease, letting them do so gives the best final result.

This shows that even though a focus on theory can be useful for generating new ideas, nothing replaces experiments (and lots of them!) to make sure these ideas actually help practitioners train better models.

Appendix: Full results

Training CIFAR10 from scratch (the model is a wide resnet 22; the average error on the test set over five models is shown):

Method | Without amsgrad | With amsgrad
AdamW  | 5.66%           | 6.31%
Adam   | 6.06%           | 6.64%

Fine-tuning Resnet50 on the Stanford Cars dataset using the standard head introduced by the fastai library (training the head for 20 epochs before unfreezing and training with differential learning rates for 40 epochs):

Method | Without amsgrad | With amsgrad
AdamW  | 10.8% / 9.5%    | 10.1% / 9.5%
Adam   | 10.4% / 9%      | 10.1% / 9%

Training an AWD LSTM with the hyper-parameters from the github repo (results show the perplexity on the validation/test set, with or without cache pointer):

Method          | Raw model   | With cache pointer
Adam            | 68.7 / 65.5 | 52.9 / 50.9
Adam + amsgrad  | 69.4 / 66.5 | 53.1 / 51.3
AdamW           | 68.9 / 65.7 | 52.8 / 50.9
AdamW + amsgrad | 72.7 / 69   | 57 / 54.7

And the same with QRNNs instead of LSTMs:

Method          | Raw model   | With cache pointer
Adam            | 69.6 / 66.7 | 53.6 / 51.7
Adam + amsgrad  | 71.5 / 68.4 | 54.2 / 52.2
AdamW           | 70.5 / 67.3 | 55.5 / 53.3
AdamW + amsgrad | 74.3 / 70.9 | 57.8 / 55.6

For this specific task, we used a modified version of the 1cycle policy, growing the learning rate faster, then keeping it high and constant for a long period before going back down.

Comparison between Adam and other optimizers

The values of all relevant hyper-parameters as well as the code used to produce these results are available here.