Question: I’m an undergrad student passionate about machine learning, and I feel a bit of pressure to get a PhD. Would it maybe make more sense to go into industry for a couple years and then consider going back to school? Any advice you have would be greatly appreciated.
Conversations around whether or not to do a PhD often suffer from selection bias: people considering PhDs ask successful people with PhDs for their advice. On the other side, there are many people doing fascinating and cutting-edge work without PhDs, who are less likely to be asked for advice on the topic. Other important factors, such as the disproportionately high rate of depression amongst graduate students or the opportunity cost of doing a PhD, are rarely discussed. As someone with a math PhD, I regret spending so many years over-focusing on a narrow area, while neglecting many other important skills. Once I joined the workforce, I felt like I was playing catch-up on many crucial skills and experiences!
Understanding Opportunity Costs
I grossly underestimated how much I could learn by working in industry. I believed the falsehood that the best way to always keep learning is to stay in academia, and I didn’t have a good grasp on the opportunity costs of doing a PhD. My undergraduate experience had been magical, and I had always both excelled at and enjoyed being in school. The idea of getting paid to be in school sounded like a sweet deal!
As I wrote about here, I later realized that my traditional academic success was actually a weakness, as I’d learned how to solve problems I was given, but not how to find and scope interesting problems on my own. I think for many top students (my former self included), getting a PhD feels like a “safe” option: it’s a well-defined path to doing something considered prestigious. But this can just be a way of postponing many necessary personal milestones: of learning to define and set your own goals apart from a structured academic system and of connecting more deeply with your own intrinsic motivations and values.
At the time, I felt like I was learning a lot during my PhD: taking advanced courses, reading papers, conducting research, regularly giving presentations, organizing two conferences in my field, coordinating a student-run graduate course, serving as an elected representative for grad students in my department, and writing a thesis. In hindsight, all of these were part of a narrower range of skills than I realized, and many of these skills were less transferable than I’d hoped. For instance, academic writing is very different from the type of writing I do through my blogging (which reaches a much wider audience!), and understanding academic politics was very different from startup politics, since the structure and incentives are so different.
I finished my PhD and started my first full-time adult job around the time I turned 27. (Note: I was earning a stipend through various research and teaching fellowships in graduate school, but that was different.) I had a lot to learn about working in industry and major gaps in my practical skills. Despite taking 2 years of C++ in high school, minoring in CS in college, and doing a few programming projects during my math PhD, I had focused on the more theoretical parts of computer science and was lacking in many practical computer skills. In contrast, my fast.ai co-founder Jeremy Howard started his first full-time adult job at 18 as a McKinsey consultant, and by the same age when I was first entering the workforce, Jeremy had been working full-time for nearly a decade and had founded two start-ups that are still operational today. I could have learned so many other things working in tech during the time I instead did my PhD.
To be clear, life is not a race. You can switch into tech and learn new skills at any age. The tech industry is deeply ageist, and the glorification of young founders is a harmful myth. However, I am never again going to have the energy I did in my early 20s (I eat healthy, lift heavy weights, and prioritize sleep, but I don’t feel the same), and I regret spending that time and energy being miserable while over-focusing on a narrow subject area and neglecting a lot of other skills.
You don’t need a PhD
Just off the top of my head, I thought of the following people who don’t have PhDs and who are doing interesting, cutting-edge work in deep learning (this list is incomplete and there are tons of others):
Chris Olah, co-editor of distill.pub, creator of insightful visualizations, researcher at Google Brain (no college degree)
Jeremy Howard, co-founder of fast.ai, founder of Enlitic (1st start-up to apply deep learning to medicine), previous #1-ranked Kaggler and Kaggle president, founder of fastmail and Optimal Decisions Group
In all the jobs I’ve had, including a couple that technically “required” a PhD, I had teammates without graduate degrees. My teammates without PhDs were often more productive and helpful than those of us with PhDs (perhaps because they had more practical experience).
Depression, Isolation, & Mental Health Problems among Grad Students
“67 percent of graduate students said they had felt hopeless at least once in the last year; 54 percent felt so depressed they had a hard time functioning; and nearly 10 percent said they had considered suicide, a 2004 survey found. By comparison, an estimated 9.5 percent of American adults suffer from depressive disorders in a given year, according to the National Institute of Mental Health,” according to research on UC Berkeley students.
“Grad school is not all fun and personal enrichment for many people. It can involve poverty-level wages, uncertain employment conditions, contradictory demands by supervisors, irrelevant research projects, and disrespectful treatment by both the tenured faculty members and the undergraduates (both of whom behave, all too often, as management and customers.) Grad school is a confidence-killing daily assault of petty degradations. All of this is compounded by the fear that it is all for nothing; that you are a useful fool,” one professor wrote in the Chronicle of Higher Education, in an article that was about humanities students in particular, yet applies to many STEM students as well. “I hardly know anyone who was a grad student in the last decade who is not deeply embittered. Because of my columns on this site, a few people have told me how their graduate-school years coincided with long periods of suicidal ideation. More commonly, grad students suffer from untreated chronic ailments such as weight fluctuation, fatigue, headache, stomach pain, nervousness, and alcoholism.”
While sexism and harassment contributed to my own negative experience in graduate school, many of my male classmates were miserable as well, due to isolation, bullying, or humiliating treatment from professors, and an exploitative system dominated by egos, rigid hierarchy, and obsession with prestige. One of the authors of a comprehensive report from the National Academy of Sciences stated, “Scientists have equated rigor and being critical with being cruel.”
Sexism and Racism in Academia
In science, engineering, & medicine, between 20%-50% of female students and more than 50% of women faculty have experienced harassment, according to a National Academy of Sciences report. In interviews with 60 women of Color who work in STEM research, 100% of them had experienced discrimination, and the particular negative stereotypes they faced differed depending on their race.
Credentials can be more important for people from underrepresented groups, who frequently face a higher level of scrutiny due to unconscious bias (particularly if they are self-taught). While underrepresented minorities may need the credentials more, unfortunately, due to the sexism and racism in higher education, they also may face worse environments in trying to obtain those credentials. I don’t have an answer for this, but wanted to note the tension.
I've watched a lot of very smart women get driven out of math PhD programs. Math is not open to all with a community like that.
Piper Harron, a Black woman who earned her PhD in math at Princeton, wrote in her thesis: “Respected research math is dominated by men of a certain attitude. Even allowing for individual variation, there is still a tendency towards an oppressive atmosphere, which is carefully maintained and even championed by those who find it conducive to success. As any good grad student would do, I tried to fit in, mathematically. I absorbed the atmosphere and took attitudes to heart. I was miserable, and on the verge of failure. The problem was not individuals, but a system of self-preservation that, from the outside, feels like a long string of betrayals, some big, some small, perpetrated by your only support system.”
Toxic Graduate School is Worse than Other Toxic Jobs
I consider my time in graduate school as one of the two most toxic environments I’ve been in. While most of the advice I gave for coping with toxic jobs applies to toxic graduate school as well, there is one key distinction: it is much, much harder to switch graduate programs than it is to switch jobs. This makes the power difference between student and professor much greater than the power difference between an employee and boss in the tech industry (which thus means there is greater potential for abuse or exploitation).
I know people who have switched advisors or even switched programs, and yes, this can set you back years. However, the costs (in terms of mental and physical health, as well as opportunity costs) of staying in a toxic program are very high, and I know people who have spent years recovering from graduate school. It becomes even more complex if you are an immigrant on a student visa and have to consider visa/residency issues. There is not an easy solution for toxic graduate school situations.
Higher Education is Changing
The only situation where you definitely need a PhD is to become a professor. However, higher education is changing a lot: the shift to more adjuncts, the overproduction of PhDs, severe budget cuts to research funding in the USA, an increasing number of schools laying off tenured faculty, having to make repeated major moves for a series of post-docs, and unsustainable levels of student loan debt amongst undergraduates. I’m not sure what the future holds for higher education, but I think it will be different than the past (and this played a significant role in my own change of career goals).
I feel skeptical now when I hear undergraduates (including my younger self) say that they are certain they want to become professors, as it can be hard coming straight from undergraduate to understand the huge breadth and depth of career options that are out there, even if they have had a few internships or part-time jobs. Also, at that point, many students have primarily been surrounded by professors and students.
Coding bootcamps and MOOCs such as Coursera were not invented until I was well into my transition into tech, but both can be useful and are having a big impact on education. I’ve taken and benefitted from a number of online courses, and I would have benefitted from a coding bootcamp if they’d existed 10 or 15 years ago. In the past few years, I’ve worked both as an instructor for an in-person bootcamp and been a co-founder in building fast.ai’s MOOCs, which include Practical Deep Learning for Coders and Computational Linear Algebra. I’ve seen how powerful and useful these new educational formats can be when done well (there are also plenty of useless or sketchy bootcamps and MOOCs out there, so do your research).
You may be interested in some of my previous posts and talks on related topics:
When considering a PhD, it is important to carefully weigh the opportunity costs and risks, as well as to consider the experiences of a variety of people: those that have found success without PhDs, the many who have had negative graduate school experiences, and those that have succeeded following a traditional academic path.
At fast.ai, we want to do our part to increase diversity in deep learning and to lower the unnecessary barriers to entry for everyone. We are providing diversity scholarships for our updated part-time, in-person Practical Deep Learning for Coders course presented in conjunction with the University of San Francisco Data Institute, to be offered one evening per week for 7 weeks, starting October 22, in downtown San Francisco. Women, people of Color, LGBTQ people, people with disabilities, and/or veterans are eligible to apply.
The deadline to apply is September 17, 2018. Details on how to apply, and FAQ, are at the end of this post.
However, there is also great potential for harm. We are worried about unethical uses of data science, and about the ways that society’s racial and gender biases (summary here) are being encoded into our machine learning systems. We are concerned that an extremely homogeneous group is building technology that impacts everyone. People can’t address problems that they’re not aware of, and with more diverse practitioners, a wider variety of important societal problems will be tackled.
We want to get deep learning into the hands of as many people as possible, from as many diverse backgrounds as possible. People with different backgrounds have different problems they’re interested in solving. The traditional approach is to start with an AI expert and then give them a problem to work on; at fast.ai we want people who are knowledgeable and passionate about the problems they are working on, and we’ll teach them the deep learning they need.
People sometimes ask if I think it's risky for everyone to have access to AI. I think it's MORE risky for an exclusive & homogeneous group alone to develop tech that impacts us all. https://t.co/0jRfmgXGDJ
While some people worry that it’s risky for more people to have access to AI, I believe the opposite. We’ve already seen the harm wreaked by elite and exclusive companies such as Facebook, Palantir, and YouTube/Google. Getting people from a wider range of backgrounds involved can help us address these problems.
The fast.ai approach
We began fast.ai with an experiment: to see if we could teach deep learning to coders, with no math pre-requisites beyond high school math, and get them to state-of-the-art results in just 7 weeks. This was very different from other deep learning materials, many of which assume a graduate level math background, focus on theory, only work on toy problems, and don’t even include the practical tips. We didn’t even know if what we were attempting was possible, but the fast.ai course has been a huge success!
fast.ai is not just an educational resource; we also do cutting-edge research and have achieved state-of-the-art results. Our wins (and here) in Stanford’s DAWNBench competition against much better funded teams from Google and Intel were covered in the MIT Tech Review and the Verge. Jeremy’s work with Sebastian Ruder achieving state-of-the-art on 6 language classification datasets was accepted by ACL and is being built upon by OpenAI. All this research is incorporated into our course, teaching students state-of-the-art techniques.
How to Apply
Deep Learning Part 1 covers the use of deep learning for image recognition, recommendation systems, sentiment analysis, and time-series prediction. Wondering if you’re qualified? The only requirements are:
At least 1 year of coding experience (the course is taught in Python)
At least 8 hours a week to commit to the course (includes time for homework)
Curiosity and a willingness to work hard
Identify as a woman, person of Color, LGBTQ person, person with a disability, and/or veteran
Be available to attend in-person 6:30-9pm in downtown San Francisco (SOMA), one evening per week (exact schedule found here under details, day of the week varies)
The number of scholarships we are able to offer depends on how much funding we receive (if your organization may be able to sponsor one or more places, please let us know). To apply for the fellowship, you will need to submit a resume and statement of purpose. The statement of purpose will include the following:
1 paragraph describing one or more problems you’d like to apply deep learning to
I’m not eligible for the diversity scholarship, but I’m still interested. Can I take the course? Absolutely! You can register here.
I don’t live in the San Francisco Bay Area; can I participate remotely? Yes! Once again, we will be offering remote international fellowships. Stay tuned for details to be released in a blog post in the next few weeks.
Will this course be made available online later? Yes, this course will be made freely available online afterwards. Benefits of taking the in-person course include earlier access, community and in-person interaction, and more structure (for those that struggle with motivation when taking online courses).
Is fast.ai able to sponsor visas or provide stipends for living expenses? No, we are not able to sponsor visas nor to cover living expenses.
How will this course differ from the previous fast.ai courses? Our goal at fast.ai is to push the state-of-the-art. Each year, we want to make deep learning increasingly intuitive to use while giving better results. With our fastai library, we are beating our own state-of-the-art results from last year. This year’s course will coincide with our release of an updated version of the fastai library (built on top of PyTorch).
What language is the course taught in? The course is taught in Python, using the fastai library and PyTorch. Some of our students have gone on to use the fastai library in production at Fortune 500 companies.
Note from Jeremy: I’ll be teaching Deep Learning for Coders at the University of San Francisco starting in October; if you’ve got at least a year of coding experience, you can apply here.
A team of fast.ai alum Andrew Shaw, DIU researcher Yaroslav Bulatov, and I have managed to train Imagenet to 93% accuracy in just 18 minutes, using 16 public AWS cloud instances, each with 8 NVIDIA V100 GPUs, running the fastai and PyTorch libraries. This is a new speed record for training Imagenet to this accuracy on publicly available infrastructure, and is 40% faster than Google’s DAWNBench record on their proprietary TPU Pod cluster. Our approach uses the same number of processing units as Google’s benchmark (128) and costs around $40 to run.
DIU and fast.ai will be releasing software to allow anyone to easily train and monitor their own distributed models on AWS, using the best practices developed in this project. The main training methods we used (details below) are: fast.ai’s progressive resizing for classification, and rectangular image validation; NVIDIA’s NCCL with PyTorch’s all-reduce; Tencent’s weight decay tuning; a variant of Google Brain’s dynamic batch sizes, gradual learning rate warm-up (Goyal et al 2018, and Leslie Smith 2018). We used the classic ResNet-50 architecture, and SGD with momentum.
Four months ago, fast.ai (with some of the students from our MOOC and our in-person course at the Data Institute at USF) achieved great success in the DAWNBench competition, winning the competition for fastest training of CIFAR-10 (a small dataset of 60,000 32x32 pixel images) overall, and fastest training of Imagenet (a much larger dataset of over a million megapixel images) on a single machine (a standard AWS public cloud instance). We previously wrote about the approaches we used in this project. Google also put in a very strong showing, winning the overall Imagenet speed category using their “TPU Pod” cluster, which is not available to the public. Our single machine entry took around three hours, and Google’s cluster entry took around half an hour. Before this project, training ImageNet on the public cloud generally took a few days to complete.
We entered this competition because we wanted to show that you don’t have to have huge resources to be at the cutting edge of AI research, and we were quite successful in doing so. We particularly liked the headline from The Verge: “An AI speed test shows clever coders can still beat tech giants like Google and Intel.”
However, lots of people asked us: what would happen if you trained on multiple publicly available machines? Could you even beat Google’s impressive TPU Pod result? One of the people that asked this question was Yaroslav Bulatov, from DIU (the Department of Defense’s Silicon Valley-based experimental innovation unit). Andrew Shaw (one of our DAWNBench team members) and I decided to team up with Yaroslav to see if we could achieve this very stretch goal. We were encouraged to see that recently AWS had managed to train Imagenet in just 47 minutes, and in their conclusion said: “A single Amazon EC2 P3 instance with 8 NVIDIA V100 GPUs can train ResNet50 with ImageNet data in about three hours [fast.ai] using Super-Convergence and other advanced optimization techniques. We believe we can further lower the time-to-train across a distributed configuration by applying similar techniques.” They were kind enough to share the code for this article as open source, where we found some helpful network configuration tips to ensure we got the most out of the Linux networking stack and the AWS infrastructure.
Iterating quickly required solving challenges such as:
How to easily run multiple experiments across multiple machines, without having a large pool of expensive instances running constantly?
How to conveniently take advantage of AWS’s spot instances (which are around 70% cheaper than regular instances), given that they need to be set up from scratch each time you want to use them?
fast.ai built a system for DAWNBench that included a Python API for launching and configuring new instances, running experiments, collecting results, and viewing progress. Some of the more interesting design decisions in the systems included:
Not using a configuration file, but instead configuring experiments in code through a Python API. As a result, we were able to use loops, conditionals, etc to quickly design and run structured experiments, such as hyper-parameter searches
Writing a Python API wrapper around tmux and ssh, and launching all setup and training tasks inside tmux sessions. This allowed us to later log in to a machine and connect to the tmux session, to monitor its progress, fix problems, and so forth
Keeping everything as simple as possible – avoiding container technologies like Docker, or distributed compute systems like Horovod. We did not use a complex cluster architecture with separate parameter servers, storage arrays, cluster management nodes, etc, but just a single instance type with regular EBS storage volumes.
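The first two design decisions above can be sketched roughly as follows. This is a minimal illustration and not the actual fast.ai tooling: the function names, the train.py script and its flags, the host name gpu-box-1, and the hyper-parameter values are all hypothetical. The sketch only builds command lists; it does not connect to anything.

```python
import itertools
import shlex


def experiment_grid(learning_rates, batch_sizes):
    """Yield one config per hyper-parameter combination, using plain Python instead of a config file."""
    combos = itertools.product(learning_rates, batch_sizes)
    for index, (lr, bs) in enumerate(combos):
        yield {"name": f"exp{index}", "lr": lr, "bs": bs}


def tmux_ssh_command(host, config):
    """Build an ssh command that starts training inside a detached tmux session on `host`."""
    # train.py and its flags are placeholders for whatever training script is used.
    train_cmd = f"python train.py --lr {config['lr']} --bs {config['bs']}"
    # `tmux new-session -d -s NAME CMD` runs CMD in a detached session called NAME.
    remote = f"tmux new-session -d -s {shlex.quote(config['name'])} {shlex.quote(train_cmd)}"
    return ["ssh", host, remote]


# Hypothetical hyper-parameter search: 2 learning rates x 2 batch sizes = 4 experiments.
commands = [tmux_ssh_command("gpu-box-1", cfg) for cfg in experiment_grid([0.1, 0.4], [256, 512])]
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be handed to something like subprocess.run. Because the job lives in a named tmux session, one can later ssh into the machine and run tmux attach -t exp0 to watch it.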
Independently, DIU faced a similar set of challenges and developed a cluster framework, with analogous motivation and design choices, providing the ability to run many large scale training experiments in parallel. The solution, nexus-scheduler, was inspired by Yaroslav’s experience running machine learning experiments on Google’s Borg system.
The set of tools developed by fast.ai focused on fast iteration with single-instance experiments, whilst the nexus-scheduler developed by DIU was focused on robustness and multi-machine experiments. Andrew Shaw merged parts of the fast.ai software into nexus-scheduler, so that we had the best pieces of each, and we used this for our experiments.
Using nexus-scheduler helped us iterate on distributed experiments by:
Launching multiple machines for a single experiment, to allow distributed training. The machines for a distributed run are automatically put into a placement group, which results in faster network performance
Providing monitoring through TensorBoard (a system originally written for TensorFlow, but which now works with PyTorch and other libraries) with event files and checkpoints stored on a region-wide file system.
Automating setup. Various necessary resources for distributed training, like VPCs, security groups, and EFS are transparently created behind the scenes.
AWS provides a really useful API that allowed us to build everything we needed quickly and easily. For distributed computation we used NVIDIA’s excellent NCCL library, which implements ring-style collectives that are integrated with PyTorch’s all-reduce distributed module. We found that AWS’s instances were very reliable and provided consistent performance, which was very important for getting best results from the all-reduce algorithm.
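To give a feel for what a ring-style all-reduce does, here is a toy single-process simulation of the textbook ring algorithm. It is not NCCL's or PyTorch's implementation, and it assumes (for simplicity) that each worker's gradient is split into exactly as many chunks as there are workers. Every worker only ever talks to its neighbour in the ring, and every step has to wait for all links to finish, which is one way to see why consistent per-instance performance matters.

```python
def ring_all_reduce(worker_data):
    """Simulate a sum all-reduce over a ring; worker_data[i] is worker i's list of n numeric chunks."""
    n = len(worker_data)
    data = [list(chunks) for chunks in worker_data]  # copy so the inputs are left untouched
    assert all(len(chunks) == n for chunks in data), "this sketch uses one chunk per worker"

    # Phase 1 (reduce-scatter): after n-1 steps, worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        # Collect every message first so that all workers "send" simultaneously.
        messages = [((i + 1) % n, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for dest, chunk, value in messages:
            data[dest][chunk] += value

    # Phase 2 (all-gather): pass the completed chunks around the ring until everyone has them all.
    for step in range(n - 1):
        messages = [((i + 1) % n, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for dest, chunk, value in messages:
            data[dest][chunk] = value
    return data


# Three workers, each holding a 3-chunk "gradient".
result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

After both phases every worker ends up with the element-wise sum of all the gradients, having sent and received only one chunk per step.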
The first official release of nexus-scheduler, including the features merged from the fast.ai tools, is planned for Aug 25th.
A simple new training trick: rectangles!
I mentioned to Andrew after DAWNBench finished that I thought deep learning practitioners (including us!) were doing something really dumb: we were taking rectangular images (such as those used in Imagenet) and cropping out just the center piece when making predictions. Or a (very slow) alternative widely used is to pick 5 crops (top and bottom left and right, plus center) and average the predictions. Which leaves the obvious question: why not just use the rectangular image directly?
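As a rough illustration of what center cropping throws away, the following sketch computes the fraction of pixels that survive when the largest centered square is cut out of a rectangular image. The 500x375 size is just an illustrative 4:3 example, not a figure from this project.

```python
def center_crop_fraction(width, height):
    """Fraction of pixels kept when cropping the largest centered square from a width x height image."""
    side = min(width, height)
    return (side * side) / (width * height)


kept = center_crop_fraction(500, 375)
print(f"A center crop of a 500x375 image keeps {kept:.0%} of the pixels")
```

The five-crop alternative sees more of the image, but at the price of five forward passes per image instead of one.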
A lot of people mistakenly believe that convolutional neural networks (CNNs) can only work with one fixed image size, and that it must be square. However, most libraries support “adaptive” or “global” pooling layers, which entirely avoid this limitation. It doesn’t help that some libraries (such as PyTorch) distribute models that do not use this feature, which means that unless users of these libraries replace those layers, they are stuck with just one image size and shape (generally 224x224 pixels). The fastai library automatically converts fixed-size models to dynamically sized models.
I’ve never seen anyone try to train with rectangular images before, and haven’t seen them mentioned in any research papers yet, and none of the standard deep learning libraries I’ve found support this. So Andrew went away and figured out how to make it work with fastai and PyTorch for predictions.
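The reason global pooling removes the fixed-size restriction can be shown with a few lines of plain Python. This is a toy sketch of global average pooling over nested lists, not fastai or PyTorch code (in PyTorch the corresponding layer is nn.AdaptiveAvgPool2d): each channel's feature map is averaged down to a single number, so the output length depends only on the number of channels, not on the height or width.

```python
def global_average_pool(feature_maps):
    """Average each channel's 2-D feature map down to one number, whatever its height and width."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


# Two channels from a square (2x2) input and two channels from a rectangular (2x3) input.
square = [[[1, 1], [1, 1]], [[2, 2], [2, 2]]]
rectangular = [[[1, 1, 1], [1, 1, 1]], [[0, 2, 4], [0, 2, 4]]]

print(global_average_pool(square))       # one value per channel
print(global_average_pool(rectangular))  # still one value per channel
```

Both calls return a vector with one entry per channel, so the fully connected layers that follow see the same input size either way.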
The result was amazing – we saw an immediate speedup of 23% in the amount of time it took to reach the benchmark accuracy of 93%. You can see a comparison of the different approaches in this notebook, and compare the accuracy of them in this notebook.
Progressive resizing, dynamic batch sizes, and more
One of our main advances in DAWNBench was to introduce progressive image resizing for classification – using small images at the start of training, and gradually increasing size as training progresses. That way, when the model is very inaccurate early on, it can quickly see lots of images and make rapid progress, and later in training it can see larger images to learn about more fine-grained distinctions.
In this new work, we additionally used larger batch sizes for some of the intermediate epochs – this allowed us to better utilize the GPU RAM and avoid network latency.
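Progressive resizing and a larger batch size for some intermediate epochs can both be expressed as a simple per-epoch schedule. The sketch below is only an illustration: the Phase class, the phase_for_epoch helper, and every epoch boundary, image size, and batch size in SCHEDULE are hypothetical values, not the settings used for the results described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    start_epoch: int  # first epoch at which this phase applies
    image_size: int   # side length in pixels of the training images
    batch_size: int   # images per training step


# Hypothetical schedule: small images first, a larger batch in the middle of the
# small-image phase, then bigger images (with smaller batches so they fit in GPU RAM).
SCHEDULE = [
    Phase(start_epoch=0, image_size=128, batch_size=256),
    Phase(start_epoch=6, image_size=128, batch_size=512),
    Phase(start_epoch=14, image_size=224, batch_size=224),
    Phase(start_epoch=32, image_size=288, batch_size=128),
]


def phase_for_epoch(epoch, schedule=SCHEDULE):
    """Return the last phase whose start_epoch is <= epoch (schedule must be sorted by start_epoch)."""
    current = schedule[0]
    for phase in schedule:
        if phase.start_epoch <= epoch:
            current = phase
    return current


for epoch in (0, 7, 14, 35):
    print(epoch, phase_for_epoch(epoch))
```

A training loop would call phase_for_epoch at the start of each epoch and rebuild its data loader whenever the returned phase changes.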
Recently, Tencent published a very nice paper showing <7 minute training of Imagenet on 2,048 GPUs. They mentioned a trick we hadn’t tried before, but which makes perfect sense: removing weight decay from batchnorm layers. That allowed us to trim another couple of epochs from our training time. (The Tencent paper also used a dynamic learning rate approach developed by NVIDIA research, called LARS, which we’ve also been developing for fastai, but is not included yet in these results.)
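One common way to remove weight decay from batchnorm layers is to split the parameters into two optimizer groups. The sketch below is library-agnostic and makes several assumptions: the helper name build_param_groups is made up, batchnorm parameters are recognised purely by a "bn" substring in their name (checking the layer type is more robust, since not every batchnorm layer has "bn" in its name), and the weight decay value is arbitrary. The returned list of dicts has the same shape as the per-parameter-group options that PyTorch optimizers accept.

```python
def build_param_groups(named_params, weight_decay, no_decay_markers=("bn",)):
    """Split (name, param) pairs into a decayed group and a group with weight decay switched off."""
    decay, no_decay = [], []
    for name, param in named_params:
        if any(marker in name for marker in no_decay_markers):
            no_decay.append(param)  # batchnorm scale and shift: no weight decay
        else:
            decay.append(param)     # everything else keeps the usual weight decay
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Strings stand in for real parameter tensors in this toy example.
toy_params = [
    ("conv1.weight", "w_conv"),
    ("bn1.weight", "w_bn"),
    ("bn1.bias", "b_bn"),
    ("fc.weight", "w_fc"),
]
groups = build_param_groups(toy_params, weight_decay=1e-4)
print(groups)
```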
When we put all this together, we got a training time of 18 minutes on 16 AWS instances, at a total compute cost (including the cost of machine setup time) of around $40. The benefits of being able to train on datasets of >1 million images are significant, such as:
Organizations with large image libraries, such as radiology centers, car insurance companies, real estate listing services, and e-commerce sites, can now create their own customized models. Whilst using so many images is often overkill with transfer learning, for highly specialized image types or fine-grained classification (as is common in medical imaging) using larger volumes of data may give even better results
Smaller research labs can experiment with different architectures, loss functions, optimizers, and so forth, and test on Imagenet, which many reviewers expect to see in published papers
By allowing the use of standard public cloud infrastructure, no up-front capital expense is required to get started on cutting-edge deep learning research.
Unfortunately, big companies using big compute tend to get far more than their fair share of publicity. This can lead to AI commentators coming to the conclusion that only big companies can compete in the most important AI research. For instance, following Tencent’s recent paper, OpenAI’s Jack Clark claimed “for all the industries talk about democratization it’s really the case that advantages accrue to people with big computers”. OpenAI (and Jack Clark) are working to democratize AI, such as through the excellent OpenAI Scholars program and Jack’s informative Import AI newsletter; however, this misperception that success in AI comes down to having bigger computers can distort the research agenda.
I’ve seen variants of the “big results need big compute” claim continuously over the last 25 years. It’s never been true, and there’s no reason that will change. Very few of the interesting ideas we use today were created thanks to people with the biggest computers. Ideas like batchnorm, ReLU, dropout, adam/adamw, and LSTM were all created without any need for large compute infrastructure. And today, anyone can access massive compute infrastructure on demand, and pay for just what they need. Making deep learning more accessible has a far higher impact than focusing on enabling the largest organizations - because then we can use the combined smarts of millions of people all over the world, rather than being limited to a small homogeneous group clustered in a couple of geographic centers.
We’re in the process of incorporating all of the best practices used in this project directly into the fastai library, including automating the selection of hyper-parameters for fast and accurate training.
And we’re not even done yet - we have some ideas for further simple optimizations which we’ll be trying out. We don’t know if we’ll hit the sub-seven-minute mark of Tencent, since they used a faster network than AWS provides, but there’s certainly plenty of room to go faster still.
Special thanks to the AWS and PyTorch teams who helped us by patiently answering our questions throughout this project, and for the wonderfully pragmatic products that they’ve made available for everyone to use!
The Harvard Business Review recently published an article, Want Less-Biased Decisions? Use Algorithms. by Alex P. Miller. The article focuses on the fact that humans make very biased decisions (which is true), yet ignores many important related issues, including:
algorithms are often implemented without any appeals method in place (due to the misconception that algorithms are objective, accurate, and won’t make mistakes)
algorithms are often used at a much larger scale than human decision makers, in many cases, replicating an identical bias at scale (part of the appeal of algorithms is how cheap they are to use)
users of algorithms may not understand probabilities or confidence intervals (even if these are provided), and may not feel comfortable overriding the algorithm in practice (even if this is technically an option)
instead of just focusing on the least-terrible existing option, it is more valuable to ask how we can create better, less biased decision-making tools by leveraging the strengths of humans and machines working together
Miller acknowledges that critics of the “algorithmic revolution” are “concerned that algorithms are often opaque, biased, and unaccountable tools being wielded in the interests of institutional power”, although he then focuses exclusively on the biased part for the remainder of the article, without addressing the opaque or unaccountable charges (as well as how these interact with bias).
Humans vs. machines is not a helpful framing
The media often frames advances in AI through a lens of humans vs. machines: who is the champion at X task. This framing is both inaccurate as to how most algorithms are used and a very limited way to think about AI. In all cases, algorithms have a human component, in terms of who gathers the data (and what biases they have), which design decisions are made, how they are implemented, how results are used to make decisions, the understanding various stakeholders have of correct uses and limitations of the algorithm, and so on.
Most people working on medical applications of AI are not trying to replace doctors; they are trying to create tools that will allow doctors to be more accurate and more efficient, improving quality of care. The best chess “players” are neither humans nor computers, but rather, teams of humans and computers working together.
Miller’s HBR article points out (correctly) that humans are very biased, and then compares our current not-so-great approaches to see which is less terrible. The article does not ask a far more interesting and important question: how can we develop less biased ways to make decisions (perhaps using some combination of humans and algorithms)?
Algorithms are often used differently than human decision makers
Algorithms are often used at a larger scale, mass-producing identical biases, and are often assumed to be error-proof or objective. The studies that Miller shares compare algorithms and humans in an apples-to-apples way, which doesn’t acknowledge how differently they are often used in practice.
Cathy O’Neil writes in Weapons of Math Destruction that the algorithms she is critiquing tend to punish the poor. They specialize in bulk, and they’re cheap. That’s part of their appeal. The wealthy, by contrast, often benefit from personal input. A white-shoe law firm or an exclusive prep school will lean far more on recommendations and face-to-face interviews than will a fast-food chain or a cash-strapped urban school district. The privileged, we’ll see time and again, are processed more by people, the masses by machines. (emphasis mine)
One example from O’Neil’s book is that of a college student with bipolar disorder who wanted to get a summer job bagging groceries. Every store he applied to was using the same psychometric evaluation software to screen candidates, and he was rejected from every store. This captures another danger of algorithms: even though humans often have similar biases, not all humans will make the exact same decisions (e.g. perhaps that college student would have been able to find one place to hire him, even if some of the people making decisions had biases about mental health).
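The difference between many independent human screeners and one shared piece of software can be illustrated with a little arithmetic. The sketch below is not from O’Neil’s book: the rejection probability, the number of stores, and the function name are all hypothetical, and it assumes each human screener decides independently of the others.

```python
def prob_rejected_everywhere(p_reject: float, n_stores: int, shared_screen: bool) -> float:
    """Probability that one applicant is rejected by every store.

    p_reject      -- assumed chance that a single screening rejects the applicant
    n_stores      -- number of stores the applicant applies to
    shared_screen -- True if every store uses the same software (one decision,
                     copied n_stores times); False if each store decides independently
    """
    if not 0.0 <= p_reject <= 1.0:
        raise ValueError("p_reject must be between 0 and 1")
    if n_stores < 1:
        raise ValueError("n_stores must be at least 1")
    if shared_screen:
        # One decision is replicated at every store, so applying to more
        # stores does not give the applicant any extra chances.
        return p_reject
    # Independent decisions: the applicant is shut out only if all of them reject.
    return p_reject ** n_stores


P_REJECT = 0.7  # hypothetical: each screening is quite biased against the applicant
N_STORES = 5    # hypothetical number of applications

independent = prob_rejected_everywhere(P_REJECT, N_STORES, shared_screen=False)
shared = prob_rejected_everywhere(P_REJECT, N_STORES, shared_screen=True)
print(f"independent human screeners: {independent:.3f}")
print(f"one shared algorithm:        {shared:.3f}")
```

Under these assumed numbers, an applicant facing five equally biased but independent screeners is shut out everywhere about 17% of the time, while with a single shared screen the figure stays at 70% no matter how many stores he applies to.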
Many people will put more trust in algorithmic decisions than they might in human decisions. While the researchers designing the algorithms may have a good grasp on probability and confidence intervals, often the general public using them will not. Even if people are given the power to override algorithmic decisions, it is crucial to understand if they will feel comfortable doing so in practice.
The need for meaningful appeals and explanations
Many of the most chilling stories of algorithmic bias don’t involve meaningful explanations or a meaningful appeals process. This seems to be a particular trend amongst algorithmic decision making systems, perhaps because people who mistakenly assume algorithms are objective believe there is no need for appeals. Also, as explained above, algorithmic decision making systems are often used as a cost-cutting device, and allowing appeals would be more expensive.
Cathy O’Neil writes the account of a teacher who is beloved by her students, their parents, and the principal, yet is inexplicably fired by an algorithm. She is never able to get an answer as to why she was fired. Stories like this would be somewhat less disturbing if there had been a relatively quick and simple way for her to appeal the decision, or even to know for sure what factors it was related to.
The Verge investigated software used in over half of U.S. states to determine how much healthcare people receive. After its implementation in Arkansas, people (many with severe disabilities) had their healthcare drastically cut. For instance, Tammy Dobbs, a woman with cerebral palsy who needs an aide to help her to get out of bed, to go to the bathroom, to get food, and more, had her hours of help suddenly reduced by 20 hours a week. She couldn’t get any explanation for why her healthcare was cut. Eventually, a court case revealed that there were mistakes in the software implementation of the algorithm, negatively impacting people with diabetes or cerebral palsy. However, Dobbs and many other people reliant on these health care benefits live in fear that their benefits could again be cut suddenly and inexplicably.
The creator of the algorithm, who is a professor and earns royalties from this software, was asked whether there should be a way to communicate decisions. He replied, “It’s probably something we should do. I should also probably dust under my bed.” He later clarified that he thought it was someone else’s responsibility. We cannot keep claiming that the problems caused by our technology are someone else’s responsibility.
For a separate computer system used in Colorado to determine public benefits in the mid-2000s, it was discovered that more than 900 incorrect rules had been coded into the system, resulting in problems like pregnant women being denied Medicaid. It is often hard for lawyers to even discover these flaws, since the inner workings of the algorithms are typically protected as trade secrets. Systems used to make decisions related to healthcare, hiring/firing, criminal justice, and other life-altering areas should include some sort of human appeals process that is relatively fast and easy to navigate. Many of the most chilling stories of algorithmic decision making would not be nearly as concerning if there had been an easy way to appeal and correct faulty decisions. Mistakes are possible in anything we do, so it’s important to have a tight loop in which we make it easy to discover and correct mistakes.
Complicated, real-world systems
When we think about AI, we need to think about complicated, real-world systems. The studies in the HBR article treat decision making as an isolated action, without taking into account that this decision-making happens within complicated real-world systems. A decision about whether someone is likely to commit another crime is not an isolated decision: it lives within the complicated system of our criminal justice system. We have a responsibility to understand the real-world systems with which our work will interact, and to not lose sight of the actual people who will be impacted.
The COMPAS recidivism algorithm is used in some US courtrooms for decisions related to pre-trial bail, sentencing, and parole. It was the subject of a ProPublica investigation finding that the false positive rate (people that were labeled “high risk” but were not re-arrested) for white defendants was 24% and for Black defendants was 45%. Later research found that COMPAS (which uses 137 inputs in a black-box algorithm) was no more accurate than a simple linear equation on two variables. COMPAS was also not more accurate than untrained Mechanical Turk workers. (You can find out more about various definitions of fairness in Princeton CS Professor Arvind Narayanan’s excellent 21 Definitions of Fairness talk).
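For readers unfamiliar with the metric, here is a minimal sketch of how a false positive rate can be computed separately for each group, following the definition in the paragraph above (labeled “high risk” but not re-arrested). The records are invented toy data, not the ProPublica dataset, and the function and group names are hypothetical.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Return {group: FPR}, where FPR = false positives / all who were not re-arrested.

    Each record is a tuple (group, labeled_high_risk, was_rearrested).
    A false positive is a person labeled high risk who was not re-arrested.
    """
    false_pos = defaultdict(int)
    not_rearrested = defaultdict(int)
    for group, high_risk, rearrested in records:
        if not rearrested:
            not_rearrested[group] += 1
            if high_risk:
                false_pos[group] += 1
    # Groups with no one in the "not re-arrested" category are left out,
    # because the rate is undefined for them.
    return {g: false_pos[g] / not_rearrested[g] for g in not_rearrested}


# Hypothetical toy records: (group, labeled_high_risk, was_rearrested)
toy_records = (
    [("A", True, False)] * 1 + [("A", False, False)] * 3 + [("A", True, True)] * 2
    + [("B", True, False)] * 2 + [("B", False, False)] * 2 + [("B", True, True)] * 2
)

rates = false_positive_rates(toy_records)
for group in sorted(rates):
    print(f"group {group}: false positive rate = {rates[group]:.0%}")
```

In this toy data both groups have the same number of people who were not re-arrested, yet group B’s false positive rate is twice group A’s (50% versus 25%). A single overall accuracy number would hide that kind of disparity.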
Kristian Lum, statistics PhD and lead data scientist at the Human Rights Digital Analysis Group, organized a workshop together with Elizabeth Bender, a staff attorney for the NY Legal Aid Society and former public defender, and Terrence Wilkerson, an innocent man who had been arrested and could not afford bail. Together, they shared firsthand experience about the obstacles and inefficiencies that occur in the legal system, providing valuable context to the debate around COMPAS. Bender shared that for public defenders, meeting with defendants at Rikers Island (where many pre-trial detainees in NYC who can’t afford bail are held) involves a bus ride that is two hours each way, and they then only get 30 minutes to see the defendant, assuming the guards are on time (which is not always the case). Wilkerson explained how frequently innocent defendants who can’t afford bail accept guilty plea bargains just so they can get out of jail faster. Again, all this is for people that have not even faced a trial yet! This panel was an excellent way to illuminate the real-world systems and educate about the firsthand impact. I hope more statisticians and computer scientists will follow this example.
As this example shows, algorithms can often exacerbate underlying societal problems. There are deep, structural problems with the US courts and prison systems, including racial bias, the use of cash bail (nearly half a million people in the USA are languishing in jail before even facing a trial, because they are too poor to afford bail), predatory for-profit prisons, and extreme over-use of prisons (the US is home to 4% of the world’s population and 22% of the world’s prisoners). We have a responsibility to understand the systems and underlying problems our algorithms may interact with.
Most critics of unjust bias aren't anti-algorithm
Most critics of biased algorithms are opposed to unjust bias; they are not people who hate algorithms. Miller says that critics of biased algorithms “rarely ask how well the systems they analyze would operate without algorithms,” suggesting that those speaking out against biased algorithms are perhaps unaware of how biased humans are or perhaps just don’t like algorithms. I spent a great deal of time researching and writing about studies of human bias (particularly as to how they pertain to the tech industry), long before I began writing about bias in machine learning.
To announce Google’s AutoML, Google CEO Sundar Pichai wrote, “Today, designing neural nets is extremely time intensive, and requires an expertise that limits its use to a smaller community of scientists and engineers. That’s why we’ve created an approach called AutoML, showing that it’s possible for neural nets to design neural nets. We hope AutoML will take an ability that a few PhDs have today and will make it possible in three to five years for hundreds of thousands of developers to design new neural nets for their particular needs.” (emphasis mine)
When Google’s Head of AI, Jeff Dean, suggested that 100x computational power could replace the need for machine learning expertise, computationally expensive neural architecture search was the only example he gave to illustrate this point. (around 23:50 in his TensorFlow DevSummit keynote)
This raises a number of questions: do hundreds of thousands of developers need to “design new neural nets for their particular needs” (to quote Pichai’s vision), or is there an effective way for neural nets to generalize to similar problems? Can large amounts of computational power really replace machine learning expertise?
In evaluating Google’s claims, it’s valuable to keep in mind Google has a vested financial interest in convincing us that the key to effective use of deep learning is more computational power, because this is an area where they clearly beat the rest of us. If true, we may all need to purchase Google products. On its own, this doesn’t mean that Google’s claims are false, but it’s good to be aware of what financial motivations could underlie their statements.
In my previous posts, I shared an introduction to the history of AutoML, defined what neural architecture search is, and pointed out that for many machine learning projects, designing/choosing an architecture is nowhere near the hardest, most time-consuming, or most painful part of the problem. In today’s post, I want to look specifically at Google’s AutoML, a product which has received a lot of media attention, and address the following:
Google’s Cloud AutoML was announced in January 2018 as a suite of machine learning products. So far it consists of one publicly available product, AutoML Vision, an API that identifies or classifies objects in pictures. According to the product page, Cloud AutoML Vision relies on two core techniques: transfer learning and neural architecture search. Since we’ve already explained neural architecture search, let’s now take a look at transfer learning, and see how it relates to neural architecture search.
Note: Google Cloud AutoML also has a drag-and-drop ML product that is still in alpha. I applied for access to it over 2 months ago, but I have not heard back from Google yet. I plan to write a post once it’s released.
What is transfer learning?
Transfer learning is a powerful technique that lets people with smaller datasets or less computational power achieve state-of-the-art results, by taking advantage of pre-trained models that have been trained on similar, larger data sets. Because the model learned via transfer learning doesn’t have to learn from scratch, it can generally reach higher accuracy with much less data and computation time than models that don’t use transfer learning.
Transfer learning is a core technique that we use throughout our free Practical Deep Learning for Coders course, and that our students have been applying in production in everything from their own startups to Fortune 500 companies. Although transfer learning seems to be considered “less sexy” than neural architecture search, it is being used to achieve ground-breaking academic results, such as in Jeremy Howard and Sebastian Ruder’s application of transfer learning to NLP, which achieved state-of-the-art classification on 6 datasets and is serving as a basis for further research in this area at OpenAI.
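To make the idea concrete, here is a deliberately tiny sketch in plain Python. It is not how the fastai library or any real framework implements transfer learning: the “pretrained” weights are made up, the dataset has four points, and every name is hypothetical. What it does show is the core move, which is to keep the existing body frozen and train only a small new head on the new data.

```python
import math

# Pretend these weights were learned on a large, related dataset.
# They are hypothetical, and they are never updated below (the body is frozen).
PRETRAINED_BODY = [[1.0, -1.0], [-1.0, 1.0]]


def body(x):
    """Frozen feature extractor: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_BODY]


def train_head(data, epochs=200, lr=0.5):
    """Fit a new logistic-regression head on top of the frozen body with SGD."""
    weights = [0.0] * len(PRETRAINED_BODY)
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            feats = body(x)
            z = sum(w * f for w, f in zip(weights, feats)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            err = pred - y  # gradient of the log loss with respect to z
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
            bias -= lr * err
    return weights, bias


def predict(x, head):
    """Classify x as 1 or 0 using the frozen body plus the trained head."""
    weights, bias = head
    z = sum(w * f for w, f in zip(weights, body(x))) + bias
    return 1 if z > 0 else 0


# A small new dataset for the target task: (input, label).
small_dataset = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 2.0], 0), ([1.0, 3.0], 0)]

head = train_head(small_dataset)
predictions = [predict(x, head) for x, _ in small_dataset]
print(predictions)
```

Only the head’s two weights and one bias are learned here. In a real project the body would be a deep network pretrained on something like ImageNet, and it is common to unfreeze and fine-tune some of its layers later, but the division of labor is the same.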
Neural architecture search vs. transfer learning: two opposing approaches
The underlying idea of transfer learning is that neural net architectures will generalize for similar types of problems: for example, that many images have underlying features (such as corners, circles, dog faces, or wheels) that show up in a variety of different types of images. In contrast, the underlying idea of promoting neural architecture search for every problem is the opposite: that each dataset has a unique, highly specialized architecture it will perform best with.
When neural architecture search discovers a new architecture, you must learn weights for that architecture from scratch, while with transfer learning, you begin with existing weights from a pre-trained model. In this sense, you can’t use neural architecture search and transfer learning on the same problem: if you’re learning a new architecture, you would need to train new weights for it; whereas if you are using transfer learning on a pretrained model, you can’t make substantial changes to the architecture.
Of course, you can apply transfer learning to an architecture learned through neural architecture search (which I think is a good idea!). This requires only that a few researchers use neural architecture search and open-source the models that they find. It is not necessary for all machine learning practitioners to be using neural architecture search themselves on all problems when they can instead use transfer learning.
However, Jeff Dean’s keynote, Sundar Pichai’s blog post, Google Cloud’s promotional materials, and the media coverage all suggest the opposite: that everybody needs to be able to use neural architecture search directly.
What Neural Architecture Search is good for
Neural architecture search is good for finding new architectures! Google’s AmoebaNet was learned via neural architecture search, and (with the inclusion of fast.ai advances such as an aggressive learning schedule and changing the image size as training progresses) is now the cheapest way to train ImageNet on a single machine!
AmoebaNet was not designed with a reward function that involved the ability to scale, and so it didn’t scale as well as ResNet to multiple machines, but a neural net that scales well could potentially be learned in the future, optimized for different qualities.
In need of more evidence
We haven’t seen evidence that every dataset would be best modeled with its own custom model, as opposed to instead fine-tuning an existing model. Since neural architecture search requires a larger training set, this would particularly be an issue for smaller data sets. Even some of Google’s own research uses transferable techniques instead of finding a new architecture for each data set, such as NASNet (blog post here), which learned an architectural building block on Cifar10 and then used that building block to create an architecture for ImageNet. I don’t know of any widely-entered machine learning competitions that have been won using neural architecture search yet.
Furthermore, we don’t know that the mega-computationally expensive approach to neural architecture search that Google touts is the superior approach. For instance, more recent papers such as Efficient Neural Architecture Search (ENAS) and Differentiable architecture search (DARTS) propose significantly more efficient algorithms. DARTS takes just 4 GPU days, compared to 1800 GPU days for NASNet and 3150 GPU days for AmoebaNet (all learned to the same accuracy on Cifar-10). Jeff Dean is an author on the ENAS paper, which proposed a technique that is 1000x less computationally expensive, which seems inconsistent with his emphasis at the TF DevSummit one month later on using approaches that are 100x more computationally expensive.
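The size of the gap is easier to appreciate as ratios. The snippet below only does arithmetic on the GPU-day figures quoted in this post (including the 450 GPUs for 7 days cited for AmoebaNet); the helper names are hypothetical.

```python
# GPU-day figures as quoted in the post (all for the same accuracy on Cifar-10).
GPU_DAYS = {"DARTS": 4, "NASNet": 1800, "AmoebaNet": 3150}


def gpu_days_for(num_gpus: int, days: int) -> int:
    """GPU days are simply the number of GPUs multiplied by the days they run."""
    return num_gpus * days


def cost_ratio(method: str, baseline: str = "DARTS") -> float:
    """How many times more GPU days `method` needs than `baseline`."""
    return GPU_DAYS[method] / GPU_DAYS[baseline]


# The AmoebaNet figure matches 450 GPUs running for 7 days.
assert gpu_days_for(450, 7) == GPU_DAYS["AmoebaNet"]

for name in ("NASNet", "AmoebaNet"):
    print(f"{name} used {cost_ratio(name):g}x the GPU days of DARTS")
```

By these figures NASNet needed 450 times, and AmoebaNet 787.5 times, the compute that DARTS did.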
Then why all the hype about Google's AutoML?
Given the above limitations, why has Google AutoML’s hype been so disproportionate to its proven usefulness (at least so far)? I think there are a few explanations:
Google’s AutoML highlights some of the dangers of having an academic research lab embedded in a for-profit corporation. There is a temptation to try to build products around interesting academic research, without assessing if they fulfill an actual need. This is also the story of many AI start-ups, such as MetaMind or Geometric Intelligence, that end up as acquihires without ever having produced a product. My advice for startup founders is to avoid productionizing your PhD thesis and to avoid hiring only academic researchers.
Google excels at marketing. Artificial intelligence is seen as an inaccessible and intimidating field by many outsiders, who don’t feel that they have a way to evaluate claims, particularly from lionized companies like Google. Many journalists fall prey to this as well, and uncritically channel Google’s hype into glowing articles. I periodically talk to people that do not work in machine learning, yet are excited about various Google ML products that they’ve never used and can’t explain anything about.
One example of Google’s misleading coverage of its own achievements occurred when Google AI researchers released “a deep learning technology to reconstruct the true human genome”, compared their own work to Nobel prize-winning discoveries (the hubris!), and the story was picked up by Wired. However, Steven Salzberg, a distinguished professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University debunked Google’s post. Salzberg pointed out that the research didn’t actually reconstruct the human genome and was “little more than an incremental improvement over existing software, and it might be even less than that.” A number of other genomics researchers chimed in to agree with Salzberg.
There is some great work happening at Google, but it would be easier to appreciate if we didn’t have to sift through so much misleading hype to figure out what is legitimate.
Google’s DeepVariant “is little more than an incremental improvement over existing software, and it might be even less than that.” @StevenSalzberg1
Google has a vested interest in convincing us that the key to effective use of deep learning is more computational power, because this is an area where they clearly beat the rest of us. AutoML is often very computationally expensive, such as in the examples of Google using 450 K40 GPUs for 7 days (the equivalent of 3150 GPU days) to learn AmoebaNet.
While engineers and the media often drool over bare-metal power and anything bigger, history has shown that innovation is often birthed instead by constraint and creativity. Google works on the biggest data possible using the most expensive computers possible; how well can this really generalize to the problems that the rest of us face living in a constrained world of limited resources?
Innovation comes from doing things differently, not from doing things bigger. The recent success of fast.ai in the Stanford DAWNBench competition is one example of this.
How can we address the shortage of machine learning expertise?
To return to the issue that Jeff Dean raised in his TensorFlow DevSummit keynote about the global shortage of machine learning practitioners, a different approach is possible. We can remove the biggest obstacles to using deep learning in several ways by:
“Custom heads” for existing architectures (e.g. modifying ResNet, which was initially designed for classification, so that it can be used to find bounding boxes or perform style transfer) allow for easier architecture reuse across a range of problems.
None of the above discoveries involve bare-metal power; instead, all of them were creative ideas of ways to do things differently.
Address Myths About What it Takes to Do Deep Learning
Another obstacle is the many myths that cause people to believe that deep learning isn’t for them: falsely believing that their data is too small, that they don’t have the right education or background, or that their computers aren’t big enough. One such myth says that only machine learning PhDs are capable of using deep learning, and many companies that can’t afford to hire expensive experts don’t even bother trying. However, it’s not only possible for companies to train the employees they already have to become machine learning experts, it’s even preferable, because your current employees already have domain expertise for the area you work in!
In my talk at the MIT Technology Review Conference, I addressed 6 Myths that lead people to incorrectly believe that using deep learning is harder than it is.
Although the cost of cloud GPUs (around 50 cents per hour) is within the budgets of many of us, I’m periodically contacted by students from around the world who can’t afford any GPU use at all. In some countries, rules about banking and credit cards can make it difficult for students to use services like AWS, even when they have the money. Google Colab notebooks are a solution! Colab notebooks provide a Jupyter notebook environment that requires no setup to use, runs entirely in the cloud, and gives users access to a free GPU (although long-running GPU use is not allowed). They can also be used to create documentation that contains working code samples running in an interactive environment. Google Colab notebooks will do much more to democratize deep learning than Google’s AutoML will; perhaps this would be a better target for Google’s marketing machine in the future.