At fast.ai, we want to do our part to increase diversity in deep learning and to lower the unnecessary barriers to entry for everyone. We are providing diversity scholarships for our updated part-time, in-person Practical Deep Learning for Coders course presented in conjunction with the University of San Francisco Data Institute, to be offered one evening per week for 7 weeks, starting October 22, in downtown San Francisco. Women, people of Color, LGBTQ people, people with disabilities, and/or veterans are eligible to apply.
The deadline to apply is September 17, 2018. Details on how to apply, and FAQ, are at the end of this post.
However, there is also great potential for harm. We are worried about unethical uses of data science, and about the ways that society’s racial and gender biases (summary here) are being encoded into our machine learning systems. We are concerned that an extremely homogeneous group is building technology that impacts everyone. People can’t address problems that they’re not aware of, and with more diverse practitioners, a wider variety of important societal problems will be tackled.
We want to get deep learning into the hands of as many people as possible, from as many diverse backgrounds as possible. People with different backgrounds have different problems they’re interested in solving. The traditional approach is to start with an AI expert and then give them a problem to work on; at fast.ai we want people who are knowledgeable and passionate about the problems they are working on, and we’ll teach them the deep learning they need.
People sometimes ask if I think it's risky for everyone to have access to AI. I think it's MORE risky for an exclusive & homogeneous group alone to develop tech that impacts us all. https://t.co/0jRfmgXGDJ
While some people worry that it’s risky for more people to have access to AI; I believe the opposite. We’ve already seen the harm wreaked by elite and exclusive companies such as Facebook, Palantir, and YouTube/Google. Getting people from a wider range of backgrounds involved can help us address these problems.
The fast.ai approach
We began fast.ai with an experiment: to see if we could teach deep learning to coders, with no math pre-requisites beyond high school math, and get them to state-of-the-art results in just 7 weeks. This was very different from other deep learning materials, many of which assume a graduate level math background, focus on theory, only work on toy problems, and don’t even include the practical tips. We didn’t even know if what we were attempting was possible, but the fast.ai course has been a huge success!
fast.ai is not just an educational resource; we also do cutting-edge research and have achieved state-of-the-art results. Our wins (and here) in Stanford’s DAWNBench competition against much better funded teams from Google and Intel were covered in the MIT Tech Review and the Verge. Jeremy’s work with Sebastian Ruder achieving state-of-the art on 6 language classification datasets was accepted by ACL and is being built upon by OpenAI. All this research is incorporated into our course, teaching students state-of-the-art techniques.
How to Apply
Deep Learning Part 1 covers the use of deep learning for image recognition, recommendation systems, sentiment analysis, and time-series prediction. Wondering if you’re qualified? The only requirements are:
At least 1 year of coding experience (the course is taught in Python)
At least 8 hours a week to commit to the course (includes time for homework)
Curiosity and a willingness to work hard
Identify as a woman, person of Color, LGBTQ person, person with a disability, and/or veteran
Be available to attend in-person 6:30-9pm in downtown San Francisco (SOMA), one evening per week (exact schedule found here under details, day of the week varies)
The number of scholarships we are able to offer depends on how much funding we receive (if your organization may be able to sponsor one or more places, please let us know). To apply for the fellowship, you will need to submit a resume and statement of purpose. The statement of purpose will include the following:
1 paragraph describing one or more problems you’d like to apply deep learning to
I’m not eligible for the diversity scholarship, but I’m still interested. Can I take the course? Absolutely! You can register here.
I don’t live in the San Francisco Bay Area; can I participate remotely? Yes! Once again, we will be offering remote international fellowships. Stay tuned for details to be released in a blog post in the next few weeks.
Will this course be made available online later? Yes, this course will be made freely available online afterwards. Benefits of taking the in-person course include earlier access, community and in-person interaction, and more structure (for those that struggle with motivation when taking online courses).
Is fast.ai able to sponsor visas or provide stipends for living expenses? No, we are not able to sponsor visas nor to cover living expenses.
How will this course differ from the previous fast.ai courses? Our goal at fast.ai is to push the state-of-the-art. Each year, we want to make deep learning increasingly intuitive to use while giving better results. With our fastai library, we are beating our own state-of-the-art results from last year. This year’s course will coincide with our release of an updated version of the fastai library (built on top of PyTorch).
What language is the course taught in? The course is taught in Python, using the fastai library and PyTorch. Some of our students have gone on to use the fastai library in production at Fortune 500 companies.
Note from Jeremy: I’ll be teaching Deep Learning for Coders at the University of San Francisco starting in October; if you’ve got at least a year of coding experience, you can apply here.
A team of fast.ai alum Andrew Shaw, DIU researcher Yaroslav Bulatov, and I have managed to train Imagenet to 93% accuracy in just 18 minutes, using 16 public AWS cloud instances, each with 8 NVIDIA V100 GPUs, running the fastai and PyTorch libraries. This is a new speed record for training Imagenet to this accuracy on publicly available infrastructure, and is 40% faster than Google’s DAWNBench record on their proprietary TPU Pod cluster. Our approach uses the same number of processing units as Google’s benchmark (128) and costs around $40 to run.
DIU and fast.ai will be releasing software to allow anyone to easily train and monitor their own distributed models on AWS, using the best practices developed in this project. The main training methods we used (details below) are: fast.ai’s progressive resizing for classification, and rectangular image validation; NVIDIA’s NCCL with PyTorch’s all-reduce; Tencent’s weight decay tuning; a variant of Google Brain’s dynamic batch sizes, gradual learning rate warm-up (Goyal et al 2018, and Leslie Smith 2018). We used the classic ResNet-50 architecture, and SGD with momentum.
Four months ago, fast.ai (with some of the students from our MOOC and our in person course at the Data Instituate at USF) achieved great success in the DAWNBench competition, winning the competition for fastest training of CIFAR-10 (a small dataset of 25,000 32x32 pixel images) overall, and fastest training of Imagenet (a much larger dataset of over a million megapixel images) on a single machine (a standard AWS public cloud instance). We previously wrote about the approaches we used in this project. Google also put in a very strong showing, winning the overall Imagenet speed category using their “TPU Pod” cluster, which is not available to the public. Our single machine entry took around three hours, and Google’s cluster entry took around half an hour. Before this project, training ImageNet on the public cloud generally took a few days to complete.
We entered this competition because we wanted to show that you don’t have to have huge resources to be at the cutting edge of AI research, and we were quite successful in doing so. We particularly liked the headline from The Verge: “An AI speed test shows clever coders can still beat tech giants like Google and Intel.”
However, lots of people asked us – what would happen if you trained on multiple publicly available machines. Could you even beat Google’s impressive TPU Pod result? One of the people that asked this question was Yaroslav Bulatov, from DIU (the Department of Defense’s Silicon Valley-based experimental innovation unit.) Andrew Shaw (One of our DAWNBench team members) and I decided to team up with Yaroslav to see if we could achieve this very stretch goal. We were encouraged to see that recently AWS had managed to train Imagenet in just 47 minutes, and in their conclusion said: “A single Amazon EC2 P3 instance with 8 NVIDIA V100 GPUs can train ResNet50 with ImageNet data in about three hours [fast.ai] using Super-Convergence and other advanced optimization techniques. We believe we can further lower the time-to-train across a distributed configuration by applying similar techniques.” They were kind enough to share the code for this article as open source, where we found some helpful network configuration tips there to ensure we got the most out of the Linux networking stack and the AWS infrastructures.
Iterating quickly required solving challenges such as:
How to easily run multiple experiments across multiple machines, without having a large pool of expensive instances running constantly?
How to conveniently take advantage of AWS’s spot instances (which are around 70% cheaper than regular instances) but that need to be set up from scratch each time you want to use them?
fast.ai built a system for DAWNBench that included a Python API for launching and configuring new instances, running experiments, collecting results, and viewing progress. Some of the more interesting design decisions in the systems included:
Not to use a configuration file, but instead configuring experiments using code leveraging a Python API. As a result, we were able to use loops, conditionals, etc to quickly design and run structured experiments, such as hyper-parameter searches
Writing a Python API wrapper around tmux and ssh, and launching all setup and training tasks inside tmux sessions. This allowed us to later login to a machine and connect to the tmux session, to monitor its progress, fix problems, and so forth
Keeping everything as simple as possible – avoiding container technologies like Docker, or distributed compute systems like Horovod. We did not use a complex cluster architecture with separate parameter servers, storage arrays, cluster management nodes, etc, but just a single instance type with regular EBS storage volumes.
Independently, DIU faced a similar set of challenges and developed a cluster framework, with analogous motivation and design choices, providing the ability to run many large scale training experiments in parallel. The solution, nexus-scheduler, was inspired by Yaroslav’s experience running machine learning experiments on Google’s Borg system.
The set of tools developed by fast.ai focused on fast iteration with single-instance experiments, whilst the nexus-scheduler developed by DIU was focused on robustness and multi-machine experiments. Andrew Shaw merged parts of the fast.ai software into nexus-scheduler, so that we had the best pieces of each, and we used this for our experiments.
Using nexus-scheduler helped us iterate on distributed experiments, such as:
Launching multiple machines for a single experiment, to allow distributed training. The machines for a distributed run are automatically put into a placement group, which results in faster network performance
Providing monitoring through Tensorboard (a system originally written for Tensorflow, but which now works with Pytorch and other libraries) with event files and checkpoints stored on a region-wide file system.
Automating setup. Various necessary resources for distributed training, like VPCs, security groups, and EFS are transparently created behind the scenes.
AWS provides a really useful API that allowed us to build everything we needed quickly and easily. For distributed computation we used NVIDIA’s excellent NCCL library, which implements ring-style collectives that are integrated with PyTorch’s all-reduce distributed module. We found that AWS’s instances were very reliable and provided consistent performance, which was very important for getting best results from the all-reduce algorithm.
The first official release of nexus-scheduler, including the features merged from the fast.ai tools, is planned for Aug 25th.
A simple new training trick: rectangles!
I mentioned to Andrew after DAWNBench finished that I thought deep learning practitioners (including us!) were doing something really dumb: we were taking rectangular images (such as those used in Imagenet) and cropping out just the center piece when making predictions. Or a (very slow) alternative widely used is to pick 5 crops (top and bottom left and right, plus center) and average the predictions. Which leaves the obvious question: why not just use the rectangular image directly?
A lot of people mistakenly believe that convolutional neural networks (CNNs) can only work with one fixed image size, and that that must be rectangular. However, most libraries support “adaptive” or “global” pooling layers, which entirely avoid this limitation. It doesn’t help that some libraries (such as Pytorch) distribute models that do not use this feature – it means that unless users of these libraries replace those layers, they are stuck with just one image size and shape (generally 224x224 pixels). The fastai library automatically converts fixed-size models to dynamically sized models.
I’ve never seen anyone try to train with rectangular images before, and haven’t seen them mentioned in any research papers yet, and none of the standard deep learning libraries I’ve found support this. So Andrew went away and figured out how to make it work with fastai and Pytorch for predictions.
The result was amazing – we saw an immediate speedup of 23% in the amount of time it took to reach the benchmark accuracy of 93%. You can see a comparison of the different approaches in this notebook, and compare the accuracy of them in this notebook.
Progressive resizing, dynamic batch sizes, and more
One of our main advances in DAWNBench was to introduce progressive image resizing for classification – using small images at the start of training, and gradually increasing size as training progresses. That way, when the model is very inaccurate early on, it can quickly see lots of images and make rapid progress, and later in training it can see larger images to learn about more fine-grained distinctions.
In this new work, we additionally used larger batch sizes for some of the intermediate epochs – this allowed us to better utilize the GPU RAM and avoid network latency.
Recently, Tencent published a very nice paper showing <7 minute training of Imagenet on 2,048 GPUs. They mentioned a trick we hadn’t tried before, but makes perfect sense: removing weight decay from batchnorm layers. That allowed us to trim another couple of epochs from our training time. (The Tencent paper also used a dynamic learning rate approach developed by NVIDIA research, called LARS, which we’ve also been developing for fastai, but is not included yet in these results.)
When we put all this together, we got a training time of 18 minutes on 16 AWS instances, at a total compute cost (including the cost of machine setup time) of around $40. The benefits of being able to train on datasets of >1 million images are significant, such as:
Organizations with large image libraries, such as radiology centers, car insurance companies, real estate listing services, and e-commerce sites, can now create their own customized models. Whilst with transfer learning using so many images is often overkill, for highly specialized image types or fine-grained classification (as is common in medical imaging) using larger volumes of data may give even better results
Smaller research labs can experiment with different architectures, loss functions, optimizers, and so forth, and test on Imagenet, which many reviewers expect to see in published papers
By allowing the use of standard public cloud infrastructure, no up-front capital expense is required to get started on cutting-edge deep learning research.
Unfortunately, big companies using big compute tend to get far more than their fair share of publicity. This can lead to AI commentators coming to the conclusion that only big companies can compete in the most important AI research. For instance, following Tencent’s recent paper, OpenAI’s Jack Clark claimed “for all the industries talk about democratization it’s really the case that advantages accrue to people with big computers”. OpenAI (and Jack Clark) are working to democratize AI, such as through the excellent OpenAI Scholars program and Jack’s informative Import AI newsletter; however, this misperception that success in AI comes down to having bigger computers can distort the research agenda.
I’ve seen variants of the “big results need big compute” claim continuously over the last 25 years. It’s never been true, and there’s no reason that will change. Very few of the interesting ideas we use today were created thanks to people with the biggest computers. Ideas like batchnorm, ReLU, dropout, adam/adamw, and LSTM were all created without any need for large compute infrastructure. And today, anyone can access massive compute infrastructure on demand, and pay for just what they need. Making deep learning more accessible has a far higher impact than focusing on enabling the largest organizations - because then we can use the combined smarts of millions of people all over the world, rather than being limited to a small homogeneous group clustered in a couple of geographic centers.
We’re in the process of incorporating all of the best practices used in this project directly into the fastai library, including automating the selection of hyper-parameters for fast and accurate training.
And we’re not even done yet - we have some ideas for further simple optimizations which we’ll be trying out. We don’t know if we’ll hit the sub seven minute mark of Tencent, since they used a faster network than AWS provides, but there’s certainly plenty of room to go faster still.
Special thanks to the AWS and PyTorch teams who helped us by patiently answering our questions throughout this project, and for the wonderfully pragmatic products that they’ve made available for everyone to use!
The Harvard Business Review recently published an article, Want Less-Biased Decisions? Use Algorithms. by Alex P. Miller. The article focuses on the fact that humans make very biased decisions (which is true), yet ignores many important related issues, including:
algorithms are often implemented without any appeals method in place (due to the misconception that algorithms are objective, accurate, and won’t make mistakes)
algorithms are often used at a much larger scale than human decision makers, in many cases, replicating an identical bias at scale (part of the appeal of algorithms is how cheap they are to use)
users of algorithms may not understand probabilities or confidence intervals (even if these are provided), and may not feel comfortable overriding the algorithm in practice (even if this is technically an option)
instead of just focusing on the least-terrible existing option, it is more valuable to ask how we can create better, less biased decision-making tools by leveraging the strengths of humans and machines working together
Miller acknowledges that critics of the “algorithmic revolution” are “concerned that algorithms are often opaque, biased, and unaccountable tools being wielded in the interests of institutional power”, although he then focuses exclusively on the biased part for the remainder of the article, without addressing the opaque or unaccountable charges (as well as how these interact with bias).
Humans vs. machines is not a helpful framing
The media often frames advances in AI through a lens of humans vs. machines: who is the champion at X task. This framework is both inaccurate as to how most algorithms are used, as well as a very limited way to think about AI. In all cases, algorithms have a human component, in terms of who gathers the data (and what biases they have), which design decisions are made, how they are implemented, how results are used to make decisions, the understanding various stakeholders have of correct uses and limitations of the algorithm, and so on.
Most people working on medical applications of AI are not trying to replace doctors; they are trying to create tools that will allow doctors to be more accurate and more efficient, improving quality of care. The best chess “players” are neither humans nor computers, but rather, teams of humans and computers working together.
Miller’s HBR article points out (correctly) that humans are very biased, and then compares our current not-so-great approaches to see which is less terrible. The article does not ask the question, how can we develop less biased ways to make decisions (perhaps using some combination of humans and algorithms)? which is a far more interesting and important question.
Algorithms are often used differently than human decision makers
Algorithms are often used at a larger scale, mass-producing identical biases, and assumed to be error-proof or objective. The studies that Miller shares compares them in an apples-to-apples way, which doesn’t acknowledge how differently they are often used in practice.
Cathy O’Neil writes in Weapons of Math Destruction that the algorithms she is critiquing tend to punish the poor. They specialize in bulk, and they’re cheap. That’s part of their appeal. The wealthy, by contrast, often benefit from personal input. A white-shoe law firm or an exclusive prep school will lean far more on recommendations and face-to-face interviews than will a fast-food chain or a cash-strapped urban school district. The privileged, we’ll see time and again, are processed more by people, the masses by machines. (emphasis mine)
One example from O’Neil’s book is that of a college student with bipolar disorder who wanted to get a summer job bagging groceries. Every store he applied to was using the same pyschometric evaluation software to screen candidates, and he was rejected from every store. This captures another danger of algorithms: even though humans often have similar biases, not all humans will make the exact same decisions (e.g. perhaps that college student would have been able to find one place to hire him, even if some of the people making decisions had biases about mental health).
Many people will put more trust in algorithmic decisions than they might in human decisions. While the researchers designing the algorithms may have a good grasp on probability and confidence intervals, often the general public using them will not. Even if people are given the power to override algorithmic decisions, it is crucial to understand if they will feel comfortable doing so in practice.
The need for meaningful appeals and explanations
Many of the most chilling stories of algorithmic bias don’t involve meaningful explanations or a meaningful appeals process. This seems to be a particular trend amongst algorithmic decision making systems, perhaps since people mistakenly assume algorithms are objective, they believe there is no need for appeals. Also, as explained above, algorithmic decision making systems are often used as a cost-cutting device, and allowing appeals would be more expensive.
Cathy O’Neil writes the account of a teacher who is beloved by her students, their parents, and the principal, yet is inexplicably fired by an algorithm. She is never able to get an answer as to why she was fired. Stories like this would be somewhat less disturbing if there had been a relatively quick and simple way for her to appeal the decision, or even to know for sure what factors it was related to.
The Verge investigated software used in over half of U.S. states to determine how much healthcare people receive. After its implemention in Arkansas, people (many with severe disabilities) drastically had their healthcare cut. For instance, Tammy Dobbs, a woman with cerebral palsy who needs an aid to help her to get out of bed, to go to the bathroom, to get food, and more, had her hours of help suddenly reduced by 20 hours a week. She couldn’t get any explanation for why her healthcare was cut. Eventually, a court case revealed that there were mistakes in the software implementation of the algorithm, negatively impacting people with diabetes or cerebral palsy. However, Dobbs and many other people reliant on these health care benefits live in fear that their benefits could again be cut suddenly and inexplicably.
The creator of the algorithm, who is a professor and earning royalties off of this software, was asked whether there should be a way to communicate decisions, “It’s probably something we should do. I should also probably dust under my bed.” He later clarified that he thought it was someone else’s responsibility. We can not keep claiming the problems caused by our technology are someone else’s responsibility.
For a separate computer system used in Colorado to determine public benefits in the mid-2000s, it was discovered that more than 900 incorrect rules had been coded into the system, resulting in problems like pregnant women being denied Medicaid. It is often hard for lawyers to even discover these flaws, since the inner-workings of the algorithms are typically protected as trade secrets. Systems used to make decisions related to healthcare, hiring/firing, criminal justice, and other life-altering areas should include some sort of human appeals process, that is relatively fast and easy to navigate. Many of the most chilling stories of algorithmic decision making would not be nearly as concerning if there had been an easy way to appeal and correct faulty decisions. Mistakes are possible in anything we do, so it’s important to have a tight loop in which we make it easy to discover and correct mistakes.
Complicated, real-world systems
When we think about AI, we need to think about complicated, real-world systems. The studies in the HBR article treat decision making as an isolated action, without taking into account that this decision-making happens within complicated real-world systems. A decision about whether someone is likely to commit another crime is not an isolated decision: it lives within the complicated system of our criminal justice system. We have a responsibility to understand the real-world systems with which our work will interact, and to not lose sight of the actual people who will be impacted.
The COMPAS recidivism algorithm is used in some US courtrooms for decisions related to pre-trial bail, sentencing, and parole. It was the subject of a ProPublica investigation finding that the false positive rate (people that were labeled “high risk” but were not re-arrested) for white defendants was 24% and for Black defendants was 45%. Later research found that COMPAS (which uses 137 inputs in a black-box algorithm) was no more accurate than a simple linear equation on two variables. COMPAS was also not more accurate than untrained Mechanical Turk workers. (You can find out more about various definitions of fairness in Princeton CS Professor Arvind Narayanan’s excellent 21 Definitions of Fairness talk).
Kristian Lum, statistics PhD and lead data scientist at the Human Rights Digital Analysis Group, organized a workshop together with Elizabeth Bender, a staff attorney for the NY Legal Aid Society and former public defender, and Terrence Wilkerson, an innocent man who had been arrested and could not afford bail. Together, they shared first hand experience about the obstacles and inefficiencies that occur in the legal system, providing valuable context to the debate around COMPAS. Bender shared that for public defenders meet with defendants at Rikers Island, where many pre-trial detainees in NYC who can’t afford bail are held, involves a bus ride that is two hours each way and they then only get 30 minutes to see the defendant, assuming the guards are on time (which is not always the case). Wilkerson explained how frequently innocent defendents who can’t afford bail accept guilty plea bargains just so they can get out of jail faster. Again, all this is for people that have not even faced a trial yet! This panel was an excellent way to illuminate the real-world systems and educate about the first-hand impact. I hope more statisticians and computer scientists will follow this example.
As this example shows, algorithms can often exacerbate underlying societal problems. There are deep, structural problems with the US courts and prison systems, including racial bias, the use of cash bail (nearly half a million people in the USA are languishing in jail before even facing a trial, because they are too poor to afford bail), predatory for-profit prisons, and extreme over-use of prisons (the US is home to 4% of the world’s population and 22% of the world’s prisoners). We have a responsibility to understand the systems and underlying problems our algorithms may interact with.
Most critics of unjust bias aren't anti-algorithm
Most critics of biased algorithms are opposed to unjust bias; they are not people who hate algorithms. Miller says that critics of biased algorithms “rarely ask how well the systems they analyze would operate without algorithms,” suggesting that those speaking out against biased algorithms are perhaps unaware of how biased humans are or perhaps just don’t like algorithms. I spent a great deal of time researching and writingaboutstudiesof human bias (particularly as to how they pertain to the tech industry), long before I began writing about bias in machine learning.
To announce Google’s AutoML, Google CEO Sundar Pichai wrote, “Today, designing neural nets is extremely time intensive, and requires an expertise that limits its use to a smaller community of scientists and engineers. That’s why we’ve created an approach called AutoML, showing that it’s possible for neural nets to design neural nets. We hope AutoML will take an ability that a few PhDs have today and will make it possible in three to five years for hundreds of thousands of developers to design new neural nets for their particular needs.” (emphasis mine)
When Google’s Head of AI, Jeff Dean, suggested that 100x computational power could replace the need for machine learning expertise, computationally expensive neural architecture search was the only example he gave to illustrate this point. (around 23:50 in his TensorFlow DevSummit keynote)
This raises a number of questions: do hundreds of thousands of developers need to “design new neural nets for their particular needs” (to quote Pichai’s vision), or is there an effective way for neural nets to generalize to similar problems? Can large amounts of computational power really replace machine learning expertise?
In evaluating Google’s claims, it’s valuable to keep in mind Google has a vested financial interest in convincing us that the key to effective use of deep learning is more computational power, because this is an area where they clearly beat the rest of us. If true, we may all need to purchase Google products. On its own, this doesn’t mean that Google’s claims are false, but it’s good be aware of what financial motivations could underlie their statements.
In my previous posts, I shared an introduction to the history of AutoML, defined what neural architecture search is, and pointed out that for many machine learning projects, designing/choosing an architecture is nowhere near the hardest, most time-consuming, or most painful part of the problem. In today’s post, I want to look specifically at Google’s AutoML, a product which has received a lot of media attention, and address the following:
Google’s Cloud AutoML was announced in January 2018 as a suite of machine learning products. So far it consists of one publicly available product, AutoML Vision, an API that identifies or classifies objects in pictures. According to the product page, Cloud AutoML Vision relies on two core techniques: transfer learning and neural architecture search. Since we’ve already explained neural architecture search, let’s now take a look at transfer learning, and see how it relates to neural architecture search.
Note: Google Cloud AutoML also has a drag-and-drop ML product that is still in alpha. I applied for access to it over 2 months ago, but I have not heard back from Google yet. I plan to write a post once it’s released.
What is transfer learning?
Transfer learning is a powerful technique that lets people with smaller datasets or less computational power achieve state-of-the-art results, by taking advantage of pre-trained models that have been trained on similar, larger data sets. Because the model learned via transfer learning doesn’t have to learn from scratch, it can generally reach higher accuracy with much less data and computation time than models that don’t use transfer learning.
Transfer learning is a core technique that we use throughout our free Practical Deep Learning for Coders course– and that our students have been applying in production in everything from their own startups to Fortune 500 companies. Although transfer learning seems to be considered “less sexy” than neural architecture search, it is being used to achieve ground-breaking academic results, such as in Jeremy Howard and Sebastian Ruder’s application of transfer learning to NLP, which achieved state-of-the-art classification on 6 datasets and is serving as a basis for further research in this area at OpenAI.
Neural architecture search vs. transfer learning: two opposing approaches
The underlying idea of transfer learning is that neural net architectures will generalize for similar types of problems: for example, that many images have underlying features (such as corners, circles, dog faces, or wheels) that show up in a variety of different types of images. In contrast, the underlying idea of promoting neural architecture search for every problem is the opposite: that each dataset has a unique, highly specialized architecture it will perform best with.
When neural architecture search discovers a new architecture, you must learn weights for that architecture from scratch, while with transfer learning, you begin with existing weights from a pre-trained model. In this sense, you you can’t use neural architecture search and transfer learning on the same problem: if you’re learning a new architecture, you would need to train new weights for it; whereas if you are using transfer learning on a pretrained model, you can’t make substantial changes to the architecture.
Of course, you can apply transfer learning to an architecture learned through neural architecture search (which I think is a good idea!). This requires only that a few researchers use neural architecture search and open-source the models that they find. It is not necessary for all machine learning practitioners to be using neural architecture search themselves on all problems when they can instead use transfer learning.
However, Jeff Dean’s keynote, Sundar Pichai’s blog post, Google Cloud’s promotional materials, and the media coverage all suggest the opposite: that everybody needs to be able to use neural architecture search directly.
What Neural Architecture Search is good for
Neural architecture search is good for finding new architectures! Google’s AmoebaNet was learned via neural architecture search, and (with the inclusion of fast.ai advances such as an aggressive learning schedule and changing the image size as training progresses) is now the cheapest way to train ImageNet on a single machine!
AmoebaNet was not designed with a reward function that involved the ability to scale, and so it didn’t scale as well as ResNet to multiple machines, but a neural net that scales well could potentially be learned in the future, optimized for different qualities.
In need of more evidence
We haven’t seen evidence that every dataset would be best modeled with its own custom model, as opposed to instead fine-tuning an existing model. Since neural architecture search requires a larger training set, this would particularly be an issue for smaller data sets. Even some of Google’s own research uses transferable techniques instead of finding a new architecture for each data set, such as NASNet (blog post here), which learned an architectural building block on Cifar10 and then used that building block to create an architecture for ImageNet. I don’t know of any widely-entered machine learning competitions that have been won using neural architectures search yet.
Furthermore, we don’t know that the mega-computationally expensive approach to neural architecture search that Google touts is the superior approach. For instance, more recent papers such as Efficient Neural Architecture Search (ENAS) and Differentiable architecture search (DARTS) propose significantly more efficient algorithms. DARTS takes just 4 GPU days, compared to 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet (all learned to the same accuracy on Cifar-10). Jeff Dean is an author on the ENAS paper, which proposed a technique that is 1000x less computationally expensive, which seems inconsistent with his emphasis at the TF DevSummit one month later on using approaches that are 100x more computationally expensive.
Then why all the hype about Google's AutoML?
Given the above limitations, why has Google AutoML’s hype been so disproportionate to its proven usefulness (at least so far)? I think there are a few explanations:
Google’s AutoML highlights some of the dangers of having an academic research lab embedded in a for-profit corporation. There is a temptation to try to build products around interesting academic research, without assessing if they fulfill an actual need. This is also the story of many AI start-ups, such as MetaMind or Geometric Intelligence, that end up as acquihires without ever having produced a product. My advice for startup founders is to avoid productionizing your PhD thesis and to avoid hiring only academic researchers.
Google excels at marketing. Artificial intelligence is seen as an inaccessible and intimidating field by many outsiders, who don’t feel that they have a way to evaluate claims, particularly from lionized companies like Google. Many journalists fall prey to this as well, and uncritically channel Google’s hype into glowing articles. I periodically talk to people that do not work in machine learning, yet are excited about various Google ML products that they’ve never used and can’t explain anything about.
One example of Google’s misleading coverage of its own achievements occurred when Google AI researchers released “a deep learning technology to reconstruct the true human genome”, compared their own work to Nobel prize-winning discoveries (the hubris!), and the story was picked up by Wired. However, Steven Salzberg, a distinguished professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University debunked Google’s post. Salzberg pointed out that the research didn’t actually reconstruct the human genome and was “little more than an incremental improvement over existing software, and it might be even less than that.” A number of other genomics researchers chimed in to agree with Salzberg.
There is some great work happening at Google, but it would be easier to appreciate if we didn’t have to sift through so much misleading hype to figure out what is legitimate.
Google’s DeepVariant “is little more than an incremental improvement over existing software, and it might be even less than that.” @StevenSalzberg1
Google has a vested interest in convincing us that the key to effective use of deep learning is more computational power, because this is an area where they clearly beat the rest of us. AutoML is often very computationally expensive, such as in the examples of Google using 450 K40 GPUs for 7 days (the equivalent of 3150 GPU days) to learn AmoebaNet.
While engineers and the media often drool over bare-metal power and anything bigger, history has shown that innovation is often birthed instead by constraint and creativity. Google works on the biggest data possible using the most expensive computers possible; how well can this really generalize to the problems that the rest of us face living in a constrained world of limited resources?
Innovation comes from doing things differently, not from doing things bigger. The recent success of fast.ai in the Stanford DAWNBench competition is one example of this.
How can we address the shortage of machine learning expertise?
To return to the issue that Jeff Dean raised in his TensorFlow DevSummit keynote about the global shortage of machine learning practitioners, a different approach is possible. We can remove the biggest obstacles to using deep learning in several ways by:
“Custom heads” for existing architectures (e.g. modifying ResNet, which was initially designed for classification, so that it can be used to find bounding boxes or perform style transfer) allow for easier architecture reuse across a range of problems.
None of the above discoveries involve bare-metal power; instead, all of them were creative ideas of ways to do things differently.
Address Myths About What it Takes to Do Deep Learning
Another obstacle is the many myths that cause people to believe that deep learning isn’t for them: falsely believing that their data is too small, that they don’t have the right education or background, or that their computers aren’t big enough. One such myth says that only machine learning PhDs are capable of using deep learning, and many companies that can’t afford to hire expensive experts don’t even bother trying. However, it’s not only possible for companies to train the employees they already have to become machine learning experts, it’s even preferable, because your current employees already have domain expertise for the area you work in!
In my talk at the MIT Technology Review Conference, I addressed 6 Myths that lead people to incorrectly believe that using deep learning is harder than it is.
Although the cost of cloud GPUs (around 50 cents per hour) are within the budgets of many of us, I’m periodically contacted by students from around the world that can’t afford any GPU use at all. In some countries, rules about banking and credit cards can make it difficult for students to use services like AWS, even when they have the money. Google Colab notebooks are a solution! Colab notebooks provide a Jupyter notebook environment that requires no setup to use, runs entirely in the cloud, and gives users access to a free GPU (although long-running GPU use is not allowed). They can also be used to create documentation that contains working code samples running in an interactive environment. Google colab notebooks will do much more to democratize deep learning than Google’s AutoML will; perhaps this would be a better target for Google’s marketing machine in the future.
Researchers from CMU and DeepMind recently released an interesting new paper, called Differentiable Architecture Search (DARTS), offering an alternative approach to neural architecture search, a very hot area of machine learning right now. Neural architecture search has been heavily hyped in the last year, with Google’s CEO Sundar Pichai and Google’s Head of AI Jeff Dean promoting the idea that neural architecture search and the large amounts of computational power it requires are essential to making machine learning available to the masses. Google’s work on neural architecture search has been widely and adoringly covered by the tech media (see here, here, here, and here for examples).
During his keynote (starts around 22:20) at the TensorFlow DevSummit in March 2018, Jeff Dean posited that perhaps in the future Google could replace machine learning expertise with 100x computational power. He gave computationally expensive neural architecture search as a primary example (the only example he gave) of why we need 100x computational power in order to make ML accessible to more people.
What is neural architecture search? Is it the key to making machine learning available to non-machine learning experts? I will dig into these questions in this post, and in my next post, I will look specifically at Google’s AutoML. Neural architecture search is a part of a broader field called AutoML, which has also been receiving a lot of hype and which we will consider first.
The term AutoML has traditionally been used to describe automated methods for model selection and/or hyperparameter optimization. These methods exist for many types of algorithms, such as random forests, gradient boosting machines, neural networks, and more. The field of AutoML includes open-source AutoML libraries, workshops, research, and competitions. Beginners often feel like they are just guessing as they test out different hyperparameters for a model, and automating the process could make this piece of the machine learning pipeline easier, as well as speeding things up even for experienced machine learning practitioners.
There are a number of AutoML libraries, the oldest of which is AutoWEKA, which was first released in 2013 and automatically chooses a model and selects hyperparameters. Other notable AutoML libraries include auto-sklearn (which extends AutoWEKA to python), H2O AutoML, and TPOT. AutoML.org (formerly known as ML4AAD, Machine Learning for Automated Algorithm Design) has been organzing AutoML workshops at the academic machine learning conference ICML yearly since 2014.
How useful is AutoML?
AutoML provides a way to select models and optimize hyper-parameters. It can also be useful in getting a baseline to know what level of performance is possible for a problem. So does this mean that data scientists can be replaced? Not yet, as we need to keep the context of what else it is that machine learning practitioners do.
For many machine learning projects, choosing a model is just one piece of the complex process of building machine learning products. As I covered in my previous post, projects can fail if participants don’t see how interconnected the various parts of the pipeline are. I thought of over 30 different steps that can be involved in the process. I highlighted two of the most time-consuming aspects of machine learning (in particular, deep learning) as cleaning data (and yes, this is an inseparable part of machine learning) and training models. While AutoML can help with selecting a model and choosing hyperparameters, it is important to keep perspective on what other data expertise is still needed and on the difficult problems remain.
I will suggest some alternate approaches to AutoML for making machine learning practitioners more effective in the final section.
What is neural architecture search?
Now that we’ve covered some of what AutoML is, let’s look at a particularly active subset of the field: neural architecture search. Google CEO Sundar Pichai wrote that, “designing neural nets is extremely time intensive, and requires an expertise that limits its use to a smaller community of scientists and engineers. That’s why we’ve created an approach called AutoML, showing that it’s possible for neural nets to design neural nets.”
What Pichai refers to as using “neural nets to design neural nets” is known as neural architecture search; typically reinforcement learning or evolutionary algorithms are used to design the new neural net architectures. This is useful because it allows us to discover architectures far more complicated than what humans may think to try, and these architectures can be optimized for particular goals. Neural architecture search is often very computationally expensive.
To be precise, neural architecture search usually involves learning something like a layer (often called a “cell”) that can be assembled as a stack of repeated cells to create a neural network:
The term AutoML jumped to “mainstream” prominence with work by Google AI researchers (paper here) Quoc Le and Barret Zoph, which was featured at Google I/O in May 2017. This work used reinforcement learning to find new architectures for the computer vision problem Cifar10 and the NLP problem Penn Tree Bank, and achieved similar results to existing architectures.
NASNet from Learning Transferable Architectures for Scalable Image Recognition (blog post here). This work searches for an architectural building block on a small data set (Cifar10) and then builds an architecture for a large data set (ImageNet). This research was very computationally intensive with it taking 1800 GPU days (the equivalent of almost 5 years for 1 GPU) to learn the architecture (the team at Google used 500 GPUs for 4 days!).
AmoebaNet from Regularized Evolution for Image Classifier Architecture Search This research was even more computationally intensive than NASNet, with it taking the equivalent of 3150 GPU days (the equivalent of almost 9 years for 1 GPU) to learn the architecture (the team at Google used 450 K40 GPUs for 7 days!). AmoebaNet consists of “cells” learned via an evolutionary algorithm, showing that artificially-evolved architectures can match or surpass human-crafted and reinforcement learning-designed image classifiers. After incorporating advances from fast.ai such as an aggressive learning schedule and changing the image size as training progresses, AmoebaNet is now the cheapest way to train ImageNet on a single machine.
Efficient Neural Architecture Search (ENAS): used much fewer GPU-hours than previously existing automatic model design approaches, and notably, was 1000x less expensive than standard Neural Architecture Search. This research was done using a single GPU for just 16 hours.
What about DARTS?
Differentiable architecture search (DARTS). This research was recently released from a team at Carnegie Mellon University and DeepMind, and I’m excited about the idea. DARTS assumes the space of candidate architectures is continuous, not discrete, and this allows it to use gradient-based aproaches, which are vastly more efficient than the inefficient black-box search used by most neural architecture search algorithms.
To learn a network for Cifar-10, DARTS takes just 4 GPU days, compared to 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet (all learned to the same accuracy). This is a huge gain in efficiency! Although more exploration is needed, this is a promising research direction. Given how Google frequently equates neural architecture search with huge computational expense, efficient ways to do architecture search have most likely been under-explored.
How useful is Neural Architecture Search?
In his TensorFlow DevSummit keynote (starts around 22:20), Jeff Dean suggested that a significant part of deep learning work is trying out different architectures. This was the only step of machine learning that Dean highlighted in his short talk, and I was surprised by his emphasis. Sundar Pichai’s blog post contained a similar assertion.
However, choosing a model is just one piece of the complex process of building machine learning products. In most cases, architecture selection is nowhere near the hardest, most time-consuming, or most significant part of the problem. Currently, there is no evidence that each new problem would be best modeled with it’s own unique architecture, and most practitioners consider it unlikely this will ever be the case.
Organizations like Google working on architecture design and sharing the architectures they discover with the rest of us are providing an important and helpful service. However, the underlying architecture search method is only needed for that tiny fraction of researchers that are working on foundational neural architecture design. The rest of us can just use the architectures they find via transfer learning.
How else could we make machine learning practitioners more effective? AutoML vs. Augmented ML
The field of AutoML, including neural architecture search, has been largely focused on the question: how can we automate model selection and hyperparameter optimization? However, automation ignores the important role of human input. I’d like to propose an alternate question: how can humans and computers work together to make machine learning more effective? The focus of augmented ML is on figuring out how a human and machine can best work together to take advantage of their different strengths.
An example of augmented ML is Leslie Smith’s learning rate finder (paper here), which is implemented in the fastai library (a high level API that sits on top of PyTorch) and taught as a key technique in our free deep learning course. The learning rate is a hyperparameter that can determine how quickly your model trains, or even whether it successfully trains at all. The learning rate finder allows a human to find a good learning rate in a single step, by looking at a generated chart. It’s faster than AutoML approaches to the same problem, improves the data scientist’s understanding of the training process, and encourages more powerful multi-step approaches to training models.
There’s another problem with the focus on automating hyperparameter selection: it overlooks the possibility that some types of model are more widely useful, have fewer hyperparameters to tune, and are less sensitive to choice of hyperparameters. For example, a key benefit of random forests over gradient boosting machines (GBMs) is that random forests are more robust, whereas GBMs tend to be fairly sensitive to minor changes in hyperparameters. As a result, random forests are widely used in industry. Researching ways to effectively remove hyperparameters (through smarter defaults, or through new models) can have a huge impact. When I first became interested in deep learning in 2013, it was overwhelming to feel that there were such a huge number of hyperparameters, and I’m happy that newer research and tools has helped eliminate many of those (especially for beginners). For instance, in the fast.ai course, beginners start by only having to choose a single hyperparameter, the learning rate, and we even give you a tool to do that!
Now that we have an overview of what the fields of AutoML and neural architecture search are, we can take a closer look at Google’s AutoML in the next post.
Please be sure to check out Part 3 of this post next week!