There’s been a lot of discussion in the last couple of days about OpenAI’s new language model. OpenAI made the unusual decision to not release their trained model (the AI community is usually extremely open about sharing them). On the whole, the reaction has been one of both amazement and concern, and has been widely discussed in the media, such as this thoughtful and thorough coverage in The Verge. The reaction from the academic NLP community, on the other hand, has been largely (but not exclusively) negative, claiming that:
1. This shouldn’t be covered in the media, because it’s nothing special.
2. OpenAI had no reason to keep the model to themselves, other than to try to generate media hype through claiming their model is so special it has to be kept secret.
On (1), whilst it’s true that there’s no real algorithmic leap being done here (the model is mainly a larger version of something that was published by the same team months ago), the academic “nothing to see here” reaction misses the point entirely. Whilst academic publishing is (at least in this field) largely driven by specific technical innovations, broader community interest is driven by societal impact, surprise, narrative, and other non-technical issues. Every layperson I’ve spoken to about this new work has reacted with stunned amazement. And there’s clearly a discussion to be had about potential societal impacts of a tool that may be able to scale up disinformation campaigns by orders of magnitude, especially in our current environment where such campaigns have damaged democracy even without access to such tools.
In addition, the history of technology has repeatedly shown that the hard thing is not, generally, solving a specific engineering problem, but showing that a problem can be solved. So showing what is possible is, perhaps, the most important step in technology development. I’ve been warning about potential misuse of pre-trained language models for a while, and even helped develop some of the approaches people are using now to build this tech; but it wasn’t until OpenAI actually showed what can be done in practice that the broader community woke up to some of the concerns.
But what about the second issue: should OpenAI release their pretrained model? This one seems much more complex. We’ve already heard from the “anti-model-release” view, since that’s what OpenAI has published and also discussed with the media. Catherine Olsson (who previously worked at OpenAI) asked on Twitter if anyone has yet seen a compelling explanation of the alternative view:
What have been your favorite *on-the-merits* *pro-release* OpenAI GPT-2 takes (on twitter or elsewhere)?
I'm looking for clear good-faith explanation of the pro-release (or anti-media-attention?) position right now, not clever snark.
I’ve read a lot of the takes on this, and haven’t yet found one that really qualifies. A good-faith explanation would need to engage with what OpenAI’s researchers actually said, which takes a lot of work, since their team have written a lot of research on the societal implications of AI (both at OpenAI, and elsewhere). The most in-depth analysis of this topic is the paper The Malicious Use of Artificial Intelligence. The lead author of this paper now works at OpenAI, and was heavily involved in the decision around the model release. Let’s take a look at the recommendations of that paper:
Policymakers should collaborate closely with technical researchers to investigate, prevent, and mitigate potential malicious uses of AI
Researchers and engineers in artificial intelligence should take the dual-use nature of their work seriously, allowing misuse-related considerations to influence research priorities and norms, and proactively reaching out to relevant actors when harmful applications are foreseeable.
Best practices should be identified in research areas with more mature methods for addressing dual-use concerns, such as computer security, and imported where applicable to the case of AI.
Actively seek to expand the range of stakeholders and domain experts involved in discussions of these challenges.
An important point here is that an appropriate analysis of potential malicious use of AI requires a cross-functional team and deep understanding of history in related fields. I agree. So what follows is just my one little input to this discussion. I’m not ready to claim that I have the answer to the question “should OpenAI have released the model”. I will also try to focus on the “pro-release” side, since that’s the piece that hasn’t had much thoughtful input yet.
A case for releasing the model
OpenAI said that their release strategy is:
Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code.
So specifically we need to be discussing scale. Their claim is that a larger scale model may cause significant harm without time for the broader community to consider it. Interestingly, even they don’t claim to be confident of this concern:
This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas.
Let’s get specific. How much scale are we actually talking about? I don’t see this explicitly mentioned in their paper or blog post, but we can make a reasonable guess. The new GPT-2 model has (according to the paper) about ten times as many parameters as their previous GPT model. Their previous model took 8 GPUs 1 month to train. One would expect that they can train their model faster by now, since they’ve had plenty of time to improve their algorithms, but on the other hand, their new model probably takes more epochs to train. Let’s assume that these two balance out, so we’re left with the difference of 10x in parameters.
If you’re in a hurry and you want to get this done in a month, then you’re going to need 80 GPUs. You can grab a server with 8 GPUs from the AWS spot market for $7.34/hour. That’s around $5300 for a month. You’ll need ten of these servers, so that’s around $50k to train the model in a month. OpenAI have made their code available, and described how to create the necessary dataset, but in practice there’s still going to be plenty of trial and error, so in practice it might cost twice as much.
If you’re in less of a hurry, you could just buy 8 GPUs. With some careful memory handling (e.g. using Gradient checkpointing) you might be able to get away with buying RTX 2070 cards at $500 each, otherwise you’ll be wanting the RTX 2080 ti at $1300 each. So for 8 cards, that’s somewhere between $4k and $10k for the GPUs, plus probably another $10k or so for a box to put them in (with CPUs, HDDs, etc). So that’s around $20k to train the model in 10 months (again, you’ll need some extra time and money for the data collection, and some trial and error).
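The back-of-the-envelope arithmetic above can be written out explicitly. Every figure here is an assumption from the estimates in this post (the spot price, the GPU prices, the 10x scaling guess), not something OpenAI has published:

```python
HOURS_PER_MONTH = 24 * 30

# Option 1: rent ten 8-GPU servers from the AWS spot market for one month
spot_price = 7.34                          # $/hour for an 8-GPU server
per_server = spot_price * HOURS_PER_MONTH  # ~$5,300 for a month
cloud_cost = per_server * 10               # 80 GPUs -> ~$53k

# Option 2: buy 8 consumer GPUs and train for ~10 months
budget_build = 8 * 500 + 10_000            # RTX 2070s plus a box: $14,000
premium_build = 8 * 1300 + 10_000          # RTX 2080 Tis plus a box: $20,400

print(f"Cloud, 1 month:     ${cloud_cost:,.0f}")
print(f"Own box, 10 months: ${budget_build:,} to ${premium_build:,}")
```

Doubling the cloud figure to allow for trial and error, as suggested above, lands at roughly the $100k figure used later in this post.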
Most organizations doing AI already have 8 or more GPUs available, and can often get access to far more (e.g. AWS provides up to $100k credits to startups in its AWS Activate program, and Google provides dozens of TPUs to any research organization that qualifies for their research program).
So in practice, the decision not to release the model has a couple of outcomes:
1. It’ll probably take at least a couple of months before another organization has successfully replicated it, so we have some breathing room to discuss what to do when this is more widely available.
2. Small organizations that can’t afford to spend $100k or so are not able to use this technology at the scale being demonstrated.
Point (1) seems like a good thing. If suddenly this tech is thrown out there for anyone to use without any warning, then no-one can be prepared at all. (In theory, people could have been prepared because those within the language modeling community have been warning of such a potential issue, but in practice people don’t tend to take it seriously until they can actually see it happening.) This is what happens, for instance, in the computer security community, where if you find a flaw the expectation is that you help the community prepare for it, and only then do you release full details (and perhaps an exploit). When this doesn’t happen, it’s called a zero day attack or exploit, and it can cause enormous damage.
I’m not sure I want to promote a norm that zero-day threats are OK in AI.
On the other hand, point (2) is a problem. The most serious threats are most likely to come from folks with resources to spend $100k or so on (for example) a disinformation campaign to attempt to change the outcome of a democratic election. In practice, the most likely exploit is (in my opinion) a foreign power spending that money to dramatically escalate existing disinformation campaigns, such as those that have been extensively documented by the US intelligence community.
The only practical defense against such an attack is (as far as I can tell) to use the same tools to both identify, and push back against, such disinformation. These kinds of defenses are likely to be much more powerful when wielded by the broader community of those impacted. Large groups of individuals have repeatedly proven more powerful at creation than at destruction, as we see in projects such as Wikipedia and open source software.
In addition, if these tools are only available to those with large compute resources, then to everyone else they remain abstract and mysterious. What can they actually do? What are their constraints? For people to make informed decisions, they need a real understanding of these issues.
So, should OpenAI release their trained model? Frankly, I don’t know. There’s no question in my mind that they’ve demonstrated something qualitatively different from what’s been demonstrated before (despite not showing any significant algorithmic or theoretical breakthroughs). And I’m sure it will be used maliciously; it will be a powerful tool for disinformation and for influencing discourse at massive scale, and probably only costs about $100k to create.
By releasing the model, this malicious use will happen sooner. But by not releasing the model, there will be fewer defenses available and less real understanding of the issues from those that are impacted. Those both sound like bad outcomes to me.
AI is being increasingly used to make important decisions. Many AI experts (including Jeff Dean, head of AI at Google, and Andrew Ng, founder of Coursera and deeplearning.ai) say that warnings about sentient robots are overblown, but other harms are not getting enough attention. I agree. I am an AI researcher, and I’m worried about some of the societal impacts that we’re already seeing. In particular, these 5 things scare me about AI:
Before we dive in, I need to clarify one important point: algorithms (and the complex systems they are a part of) can make mistakes. These mistakes come from a variety of sources: bugs in the code; inaccurate or biased data; the approximations we have to make (e.g. using hospital readmissions as a proxy for health, or arrests as a proxy for crime; these are related, but not the same); misunderstandings between different stakeholders (policy makers, those collecting the data, those coding the algorithm, those deploying it); how computer systems interact with human systems; and more.
This article discusses a variety of algorithmic systems. I don’t find debates about definitions particularly interesting, including what counts as “AI” or if a particular algorithm qualifies as “intelligent” or not. Please note that the dynamics described in this post hold true both for simpler algorithms, as well as more complex ones.
1. Algorithms are often implemented without ways to address mistakes.
After the state of Arkansas implemented software to determine people’s healthcare benefits, many people saw a drastic reduction in the amount of care they received, but were given no explanation and no way to appeal. Tammy Dobbs, a woman with cerebral palsy who needs an aide to help her to get out of bed, to go to the bathroom, to get food, and more, had her hours of help suddenly reduced by 20 hours a week, transforming her life for the worse. Eventually, a lengthy court case uncovered errors in the software implementation, and Tammy’s hours were restored (along with those of many others who were impacted by the errors).
Observations of 5th grade teacher Sarah Wysocki’s classroom yielded positive reviews. Her assistant principal wrote, “It is a pleasure to visit a classroom in which the elements of sound teaching, motivated students and a positive learning environment are so effectively combined.” Two months later, she was fired by an opaque algorithm, along with over 200 other teachers. The head of the PTA and a parent of one of Wysocki’s students described her as “One of the best teachers I’ve ever come in contact with. Every time I saw her, she was attentive to the children, went over their schoolwork, she took time with them and made sure.” That people are losing needed healthcare or being fired, in both cases without explanation, is truly dystopian!
As I covered in a previous post, people use outputs from algorithms differently than they use decisions made by humans:
Algorithms are more likely to be implemented with no appeals process in place.
Algorithms are often used at scale.
Algorithmic systems are cheap.
People are more likely to assume algorithms are objective or error-free. As Peter Haas said, “In AI, we have Milgram’s ultimate authority figure,” referring to Stanley Milgram’s famous experiments showing that most people will obey orders from authority figures, even to the point of harming or killing other humans. How much more likely will people be to trust algorithms perceived as objective and correct?
There is a lot of overlap between these factors. If the main motivation for implementing an algorithm is cost-cutting, adding an appeals process (or even diligently checking for errors) may be considered an “unnecessary” expense. Cathy O’Neil, who earned her math PhD at Harvard, wrote the book Weapons of Math Destruction, in which she covers how algorithms are disproportionately impacting poor people, whereas the privileged are more likely to still have access to human attention (in hiring, education, and more).
2. AI makes it easier to not feel responsible.
Let’s return to the case of the buggy software used to determine health benefits in Arkansas. How could this have been prevented? In order to prevent severely disabled people from mistakenly losing access to needed healthcare, we need to talk about responsibility. Unfortunately, complex systems lend themselves to a dynamic in which nobody feels responsible for the outcome.
The creator of the algorithm for healthcare benefits, Brant Fries (who has been earning royalties off this algorithm, which is in use in over half the 50 states), blamed state policy makers. I’m sure the state policy makers could blame the implementers of the software. When asked if there should be a way to communicate how the algorithm works to the disabled people losing their healthcare, Fries callously said, “It’s probably something we should do. Yeah, I also should probably dust under my bed,” and then later clarified that he thought it was someone else’s responsibility.
This passing of the buck and failure to take responsibility is common in many bureaucracies. As danah boyd observed, “Bureaucracy has often been used to shift or evade responsibility. Who do you hold responsible in a complex system?” boyd gives the examples of high-ranking bureaucrats in Nazi Germany, who did not see themselves as responsible for the Holocaust. boyd continues, “Today’s algorithmic systems are extending bureaucracy.”
Another example of nobody feeling responsible comes from the case of research to classify gang crime. A database of gang members assembled by the Los Angeles Police Department (and 3 other California law enforcement agencies) was found to have 42 babies who were under the age of 1 when added to the gang database (28 were said to have admitted to being gang members). Keep in mind these are just some of the most obvious errors; we don’t know how many other people were falsely included. When researchers presented work on using machine learning on this data to classify gang crimes, an audience member asked about ethical concerns. “I’m just an engineer,” responded one of the authors.
I don’t bring this up for the primary purpose of pointing fingers or casting blame. However, a world of complex systems in which nobody feels responsible for the outcomes (which can include severely disabled people losing access to the healthcare they need, or innocent people being labeled as gang members) is not a pleasant place. Our work is almost always a small piece of a larger whole, yet a sense of responsibility is necessary to try to address and prevent negative outcomes.
3. AI encodes & magnifies bias.
But isn’t algorithmic bias just a reflection of how the world is? I get asked a variation of this question every time I give a talk about bias. To which my answer is: No, our algorithms and products impact the world and are part of feedback loops. Consider an algorithm to predict crime and determine where to send police officers: sending more police to a particular neighborhood is not just an effect, but also a cause. More police officers can lead to more arrests in a given neighborhood, which could cause the algorithm to send even more police to that neighborhood (a mechanism described in this paper on runaway feedback loops).
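The feedback loop is easy to see in a toy simulation (a deliberately simplified sketch, not the model from the cited paper): two neighborhoods with identical true crime rates, where all patrols are sent wherever past arrest counts are highest, and arrests can only be made where police are present.

```python
import random

random.seed(0)
crime_rate = [0.5, 0.5]   # both neighborhoods have identical true crime rates
arrests = [11, 10]        # a tiny initial imbalance in the historical data
patrol_log = []

for year in range(20):
    # send all 100 patrols to the neighborhood with more recorded arrests
    target = 0 if arrests[0] >= arrests[1] else 1
    patrol_log.append(target)
    # arrests are only made where police are present
    arrests[target] += sum(random.random() < crime_rate[target]
                           for _ in range(100))

print(arrests)      # neighborhood 0's count explodes; neighborhood 1 stays at 10
print(patrol_log)   # every single patrol went to neighborhood 0
```

Despite identical crime rates, the initial one-arrest difference means neighborhood 0 receives every patrol, generates all future arrests, and looks ever more “high crime” in the data.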
Bias is being encoded and even magnified in a variety of applications:
Software used to decide prison sentences that has twice as high a false positive rate for Black defendants as for white defendants.
Word embeddings, which are a building block for language tools like Gmail’s SmartReply and Google Translate, generate useful analogies such as Rome:Italy :: Madrid:Spain, as well as biased analogies such as man:computer programmer :: woman: homemaker.
Machine learning used in recruiting software developed at Amazon penalized applicants who attended all-women’s colleges, as well as any resumes that contained the word “women’s.”
Over 2/3 of the images in ImageNet, the most studied image data set in the world, are from the Western world (USA, England, Spain, Italy, Australia).
Since a Cambrian explosion of machine learning products is occurring, the biases that are calcified now and in the next few years may have a disproportionately huge impact for ages to come (and will be much harder to undo decades from now).
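The analogy arithmetic behind word embeddings can be illustrated with hand-made 2-D vectors (a toy sketch; real embeddings have hundreds of dimensions and are learned from text corpora, which is exactly where the biased analogies come from):

```python
# Toy, hand-picked vectors; real embeddings are learned from text, not hand-written
vec = {
    "Rome":   (1.0, 5.0),
    "Italy":  (1.0, 1.0),
    "Madrid": (4.0, 5.0),
    "Spain":  (4.0, 1.0),
}

def add(a, b): return tuple(x + y for x, y in zip(a, b))
def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def dist(a, b): return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# The "capital of" relation appears as a consistent vector offset in both pairs
assert sub(vec["Rome"], vec["Italy"]) == sub(vec["Madrid"], vec["Spain"])

# Solve the analogy Rome : Italy :: Madrid : ?
target = add(sub(vec["Madrid"], vec["Rome"]), vec["Italy"])
print(min(vec, key=lambda w: dist(vec[w], target)))  # Spain
```

The same offset arithmetic that solves Rome:Italy :: Madrid:Spain is what produces man:computer programmer :: woman:homemaker when the learned offsets encode gender stereotypes from the training text.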
4. Optimizing metrics above all else leads to negative outcomes.
Worldwide, people watch 1 billion hours of YouTube per day (yes, that says PER DAY). A large part of YouTube’s success has been due to its recommendation system, in which a video selected by an algorithm automatically begins playing once the previous video is over. Unfortunately, these recommendations are disproportionately for conspiracy theories promoting white supremacy, climate change denial, and denial of the mass shootings that plague the USA. What is going on? YouTube’s algorithm is trying to maximize how much time people spend watching YouTube, and conspiracy theorists watch significantly more YouTube than people who trust a variety of media sources. Unfortunately, a recommendation system trying only to maximize time spent on its own platform will incentivize content that tells you the rest of the media is lying.
YouTube is owned by Google, which is earning billions of dollars by aggressively introducing vulnerable people to conspiracy theories, while the rest of society bears the externalized costs of rising authoritarian governments, a resurgence in white supremacist movements, failure to act on climate change (even as extreme weather is creating increasing numbers of refugees), growing distrust of mainstream news sources, and a failure to pass sensible gun laws.
This problem is an example of the tyranny of metrics: metrics are just a proxy for what you really care about, and unthinkingly optimizing a metric can lead to unexpected, negative results. One analogous example is that when the UK began publishing the success rates of surgeons, heart surgeons began turning down risky (but necessary) surgeries to try to keep their scores as high as possible.
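The surgeon example can be made concrete with assumed toy numbers: if the published metric is the success rate per surgery performed, a surgeon who declines risky cases looks better even while saving fewer lives.

```python
# Assumed numbers for illustration: 90 routine cases (98% success)
# and 10 risky but necessary cases (60% success)
def surgeon(takes_risky_cases):
    performed = 90 + (10 if takes_risky_cases else 0)
    saved = 90 * 0.98 + (10 * 0.60 if takes_risky_cases else 0)
    return saved / performed, saved   # (published metric, actual outcome)

rate_all, saved_all = surgeon(takes_risky_cases=True)
rate_gamed, saved_gamed = surgeon(takes_risky_cases=False)

print(f"Takes every case:     {rate_all:.0%} success rate, {saved_all:.1f} lives saved")
print(f"Declines risky cases: {rate_gamed:.0%} success rate, {saved_gamed:.1f} lives saved")
```

The gamed metric reads 98% versus 94%, yet six fewer lives are saved: the proxy improved while the thing it was meant to measure got worse.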
Returning to the account of the popular 5th grade teacher who was fired by an algorithm, she suspects that the underlying reason she was fired was that her incoming students had unusually high test scores the previous year (making it seem like their scores had dropped to a more average level after her teaching), and that their former teachers may have cheated. As USA education policy began over-emphasizing student test scores as the primary way to evaluate teachers, there have been widespread scandals of teachers and principals cheating by altering students’ scores, in Georgia, Indiana, Massachusetts, Nevada, Virginia, Texas, and elsewhere. When metrics are given undue importance, attempts to game those metrics become common.
5. There is no accountability for big tech companies.
Major tech companies are the primary ones driving AI advances, and their algorithms impact billions of people. Unfortunately, these companies have zero accountability. YouTube (owned by Google) is helping to radicalize people into white supremacy. Google allowed advertisers to target people who search racist phrases like “black people ruin neighborhoods” and Facebook allowed advertisers to target groups like “jew haters”. Amazon’s facial recognition technology misidentified 28 members of Congress as criminals, yet it is already in use by police departments. Palantir’s predictive policing technology was used for 6 years in New Orleans, with city council members not even knowing about the program, much less having any oversight. The newsfeed/timeline/recommendation algorithms of all the major platforms tend to reward incendiary content, prioritizing it for users.
In early 2018, the UN ruled that Facebook had played a “determining role” in the ongoing genocide in Myanmar. “I’m afraid that Facebook has now turned into a beast,” said the UN investigator. This result was not a surprise to anyone who had been following the situation in Myanmar. People warned Facebook executives about how the platform was being used to spread dehumanizing hate speech and incite violence against an ethnic minority as early as 2013, and again in 2014 and 2015. As early as 2014, news outlets such as Al Jazeera were covering Facebook’s role in inciting ethnic violence in Myanmar.
Contrast Facebook’s inaction in Myanmar with their swift action in Germany after the passage of a new law, which could have resulted in penalties of up to 50 million euros. Facebook hired 1,200 German contractors in under a year. In 2018, five years after Facebook was first warned about how they were being used to incite violence in Myanmar, they hired “dozens” of Burmese contractors, a fraction of their response in Germany. The credible threat of a large financial penalty may be the only thing Facebook responds to.
While it can be easy to focus on regulations that are misguided or ineffective, we often take for granted safety standards and regulations that have largely worked well. One major success story comes from automobile safety. Early cars had sharp metal knobs on the dashboard that lodged in people’s skulls during crashes, plate glass windows that shattered dangerously, and non-collapsible steering columns that would frequently impale drivers. Beyond that, there was a widespread belief that the only issue with cars was the people driving them, and car manufacturers did not want data on car safety to be collected. It took consumer safety advocates decades to push the conversation to how cars could be designed with greater safety, and to pass laws regarding seat belts, driver licenses, crash tests, and the collection of car crash data. For more on this topic, Datasheets for Datasets covers case studies of how standardization came to the electronics, pharmaceutical, and automobile industries, and 99% Invisible has a deep dive on the history of car safety (with parallels and contrasts to the gun industry).
How We Can Do Better
The good news: none of the problems listed here are inherent to algorithms! There are ways we can do better:
Make sure there is a meaningful, human appeals process. Plan for how to catch and address mistakes in advance.
Take responsibility, even when our work is just one part of the system.
Be on the lookout for bias. Create datasheets for data sets.
Choose not to just optimize metrics.
Push for thoughtful regulations and standards for the tech industry.
The problems we are facing can feel scary and complex. However, it is still very early on in this age of AI and increasing algorithmic automation. Now is a great time to take action: we can change our culture, cultivate a greater sense of responsibility for our work, seek out thoughtful accountability to counterbalance the inordinate power that major tech companies have, and choose to create more humane products and systems. Technology is just a tool, and it can be used for good or bad. Let’s work to use it for good, to improve the lives of many, rather than just generate wealth for a small number of people.
This post was originally published on 2018-08-16, but has been updated for the newest, upcoming course.
At fast.ai, we want to do our part to increase diversity in deep learning and to lower the unnecessary barriers to entry for everyone. We are providing diversity scholarships for our updated part-time, in-person Deep Learning for Coders part 2 course presented in partnership with the University of San Francisco Data Institute, to be offered one evening per week for 7 weeks, starting March 18, 2019, in downtown San Francisco. Women, people of Color, LGBTQ people, people with disabilities, and/or veterans are eligible to apply. We are still looking for additional financial sponsors, so please contact email@example.com if your company is interested in donating.
The deadline to apply is February 14, 2019. Details on how to apply, and FAQ, are at the end of this post.
However, there is also great potential for harm. We are worried about unethical uses of data science, and about the ways that society’s racial and gender biases (summary here) are being encoded into our machine learning systems. We are concerned that an extremely homogeneous group is building technology that impacts everyone. People can’t address problems that they’re not aware of, and with more diverse practitioners, a wider variety of important societal problems will be tackled.
We want to get deep learning into the hands of as many people as possible, from as many diverse backgrounds as possible. People with different backgrounds have different problems they’re interested in solving. The traditional approach is to start with an AI expert and then give them a problem to work on; at fast.ai we want people who are knowledgeable and passionate about the problems they are working on, and we’ll teach them the deep learning they need. In my TEDx talk, I shared how my unlikely background led me to the work I do now and why we need more people with unlikely backgrounds in the field, both to address misuses of AI, as well as to take full advantage of the positive opportunities.
While some people worry that it’s risky for more people to have access to AI, I believe the opposite. We’ve already seen the harm wreaked by elite and exclusive companies such as Facebook, Palantir, and YouTube/Google. Getting people from a wider range of backgrounds involved can help us address these problems.
The fast.ai approach
We began fast.ai with an experiment: to see if we could teach deep learning to coders, with no math pre-requisites beyond high school math, and get them to state-of-the-art results in just 7 weeks. This was very different from other deep learning materials, many of which assume a graduate level math background, focus on theory, only work on toy problems, and don’t even include the practical tips. We didn’t even know if what we were attempting was possible, but the fast.ai course has been a huge success!
fast.ai is not just an educational resource; we also do cutting-edge research and have achieved state-of-the-art results. Our wins in Stanford’s DAWNBench competition against much better funded teams from Google and Intel were covered in the MIT Tech Review and The Verge. Jeremy’s work with Sebastian Ruder achieving state-of-the-art results on 6 language classification datasets was accepted by ACL, has been built upon by OpenAI and Google Brain, and was featured in the New York Times. All this research is incorporated into our course, teaching students state-of-the-art techniques.
How to Sponsor
We are looking for additional companies to sponsor diversity fellowships. Please contact Mindi firstname.lastname@example.org if your company might be interested!
Who is eligible for a diversity fellowship?
Wondering if you’re qualified? The requirements are:
Familiarity with Python, git, and bash
Familiarity with the content covered in Deep Learning Part 1, v3 (available for free online), including the fastai library, a high-level wrapper for PyTorch (it’s OK to start studying this material now, as long as you complete it by the start of the course)
Curiosity and a willingness to work hard
Able to commit 10 hours a week of study to the course (includes time for homework).
Identify as a woman, person of Color, LGBTQ person, person with a disability, and/or veteran
Be available to attend in-person 6:30-9pm in downtown San Francisco, one evening per week (exact schedule found here under details, day of the week varies)
You can fulfill the requirement to be familiar with deep learning, the fastai library, and PyTorch by doing any 1 of the following:
You took the updated, in-person deep learning part 1 course during fall 2018
You have watched the first 2 videos of the online course before you apply, and commit to working through all 7 lessons before the start of the course. We estimate that each lesson takes approximately 10 hours of study (so you would need to study for 10 hours each week during the 7 weeks prior to the course starting on March 18).
You have previously taken the older version of the course (released last year) AND watch the first 4 lessons of the new course to get familiar with the fastai library and PyTorch.
Deep Learning Part 1 covers the use of deep learning for image recognition, recommendation systems, sentiment analysis, and time-series prediction. Part 2 will take this further by teaching you how to read and implement cutting edge research papers, generative models and other advanced architectures, and more in-depth natural language processing. As with all fast.ai courses, it will be practical, state-of-the-art, and geared towards coders.
How to Apply for a Fellowship
The number of scholarships we are able to offer depends on how much funding we receive (if your organization may be able to sponsor one or more places, please let us know). To apply for the fellowship, you will need to submit a resume and statement of purpose. The statement of purpose will include the following:
1 paragraph describing one or more problems you’d like to apply deep learning to
I’m not eligible for the diversity scholarship, but I’m still interested. Can I take the course? Absolutely! You can register here.
I don’t live in the San Francisco Bay Area; can I participate remotely? Yes! Stay tuned for details to be released in a blog post in the next few weeks.
Will this course be made available online later? Yes, this course will be made freely available online afterwards. Benefits of taking the in-person course include earlier access, community and in-person interaction, and more structure (for those that struggle with motivation when taking online courses).
Is fast.ai able to sponsor visas or provide stipends for living expenses? No, we are not able to sponsor visas nor to cover living expenses.
How will this course differ from the previous fast.ai courses? Our goal at fast.ai is to push the state-of-the-art. Each year, we want to make deep learning increasingly intuitive to use while giving better results. With our fastai library, we are beating our own state-of-the-art results from last year.
What language is the course taught in? The course is taught in Python, using the fastai library and PyTorch. Some of our students have gone on to use the fastai library in production at Fortune 500 companies.
Launching today, the 2019 edition of Practical Deep Learning for Coders, the third iteration of the course, is 100% new material, including applications that have never been covered by an introductory deep learning course before (with some techniques that haven’t even been published in academic papers yet). There are seven lessons, each around 2 hours long, and you should plan to spend about 10 hours on assignments for each lesson. Google Cloud and Microsoft Azure have integrated all you need for the courses into their GPU-based platforms, and there are “one-click” platforms available too, such as Crestle and Gradient.
The course assumes you have at least a year of coding experience (preferably in Python, although experienced coders will be able to pick Python up as they go; we have a list of Python learning resources available), and have completed high-school math (some university-level math is introduced as needed during the course). Many people who have completed the course tell us it takes a lot of work, but it’s one of the most rewarding things they’ve done; we strongly suggest you get involved with the course’s active online community to help you complete your journey.
After the first lesson you’ll be able to train a state-of-the-art image classification model on your own data. After completing this lesson, some students from the in-person version of this course (where this material was recorded) published new state-of-the-art results in various domains! The focus for the first half of the course is on practical techniques, showing only the theory required to actually use these techniques in practice. Then, in the second half of the course, we dig deeper and deeper into the theory, until by the final lesson we will build and train a “resnet” neural network from scratch which approaches state-of-the-art accuracy.
The key applications covered are:
Computer vision (e.g. classify pet photos by breed)
Image localization (segmentation and activation maps)
NLP (e.g. movie review sentiment analysis)
Tabular data (e.g. sales prediction)
Collaborative filtering (e.g. movie recommendation)
We also cover all the necessary foundations for these applications.
We teach using the PyTorch library, which is the most modern and flexible widely-used library available, and we’ll also use the fastai wrapper for PyTorch, which makes it easier to access recommended best practices for training deep learning models (whilst making all the underlying PyTorch functionality directly available too). We think fastai is great, but we’re biased because we made it. Still, it’s the only general deep learning toolkit featured on pytorch.org, has over 10,000 GitHub stars, and is used in many competition victories, academic papers, and top university courses, so it’s not just us that like it! Note that the concepts you learn will apply equally well to any work you want to do with TensorFlow/Keras, CNTK, MXNet, or any other deep learning library; it’s the concepts which matter. Learning a new library just takes a few days if you understand the concepts well.
One particularly useful addition this year is that we now have a super-charged video player, thanks to the great work of Zach Caceres. It allows you to search the lesson transcripts, and jump straight to the section of the video that you’re looking for. It also shows links to other lessons, and the lesson summary and resources, in collapsible panes (it doesn’t work well on mobile yet however, so if you want to watch on mobile you can use this YouTube playlist). And an extra big thanks to Sylvain Gugger, who has been instrumental in the development of both the course and the fastai library—we’re very grateful to Amazon Web Services for sponsoring Sylvain’s work.
If you’re interested in giving it a go, click here to go to the course web site. Now let’s look at each lesson in more detail.
Lesson 1: Image classification
The most important outcome of lesson 1 is that we’ll have trained an image classifier which can recognize pet breeds at state-of-the-art accuracy. The key to this success is the use of transfer learning, which will be a fundamental platform for much of this course. We’ll also see how to analyze the model to understand its failure modes. In this case, we’ll see that the places where the model is making mistakes are in the same areas that even breeding experts can make mistakes.
We’ll discuss the overall approach of the course, which is somewhat unusual in being top-down rather than bottom-up. So rather than starting with theory, and only getting to practical applications later, we start instead with practical applications, and then gradually dig deeper and deeper into them, learning the theory as needed. This approach takes more work for teachers to develop, but it’s been shown to help students a lot, for example in education research at Harvard by David Perkins.
We also discuss how to set the most important hyper-parameter when training neural networks: the learning rate, using Leslie Smith’s fantastic learning rate finder method. Finally, we’ll look at the important but rarely discussed topic of labeling, and learn about some of the features that fastai provides for allowing you to easily add labels to your images.
Note that to follow along with the lessons, you’ll need to connect to a cloud GPU provider which has the fastai library installed (recommended; it should take only 5 minutes or so, and cost under $0.50/hour), or set up a computer with a suitable GPU yourself (which can take days to get working if you’re not familiar with the process, so we don’t recommend it until later). You’ll also need to be familiar with the basics of the Jupyter Notebook environment we use for running deep learning experiments. Up to date tutorials and recommendations for these are available from the course website.
Lesson 2: Data cleaning and production; SGD from scratch
We start today’s lesson by learning how to build your own image classification model using your own data, including topics such as:
Creating a validation set, and
Data cleaning, using the model to help us find data problems.
I’ll demonstrate all these steps as I create a model that can take on the vital task of differentiating teddy bears from grizzly bears. Once we’ve got our data set in order, we’ll then learn how to productionize our teddy-finder, and make it available online.
We’ve had some great additions since this lesson was recorded, so be sure to check out:
The production starter kits on the course web site, such as this one for deploying to Render.com
The new interactive GUI in the lesson notebook for using the model to find and fix mislabeled or incorrectly-collected images.
In the second half of the lesson we’ll train a simple model from scratch, creating our own gradient descent loop. In the process, we’ll be learning lots of new jargon, so be sure you’ve got a good place to take notes, since we’ll be referring to this new terminology throughout the course (and there will be lots more introduced in every lesson from here on).
Lesson 3: Data blocks; Multi-label classification; Segmentation
Lots to cover today! We start lesson 3 looking at an interesting dataset: Planet’s Understanding the Amazon from Space. In order to get this data into the shape we need it for modeling, we’ll use one of fastai’s most powerful (and unique!) tools: the data block API. We’ll be coming back to this API many times over the coming lessons, and mastery of it will make you a real fastai superstar! Once you’ve finished this lesson, if you’re ready to learn more about the data block API, have a look at this great article: Finding Data Block Nirvana, by Wayde Gilliam.
One important feature of the Planet dataset is that it is a multi-label dataset. That is: each satellite image can contain multiple labels, whereas previous datasets we’ve looked at have had exactly one label per image. We’ll look at what changes we need to make to work with multi-label datasets.
Next, we will look at image segmentation, which is the process of labeling every pixel in an image with a category that shows what kind of object is portrayed by that pixel. We will use similar techniques to the earlier image classification models, with a few tweaks. fastai makes image segmentation modeling and interpretation just as easy as image classification, so there won’t be too many tweaks required.
We will be using the popular CamVid dataset for this part of the lesson. In future lessons, we will come back to it and show a few extra tricks. Our final CamVid model will have dramatically lower error than any model we’ve been able to find in the academic literature!
What if your dependent variable is a continuous value, instead of a category? We answer that question next, looking at a keypoint dataset, and building a model that predicts face keypoints with precision.
In lesson 4 we’ll dive into natural language processing (NLP), using the IMDb movie review dataset. In this task, our goal is to predict whether a movie review is positive or negative; this is called sentiment analysis. We’ll be using the ULMFiT algorithm, which was originally developed during the fast.ai 2018 course, and became part of a revolution in NLP during 2018 which led the New York Times to declare that new systems are starting to crack the code of natural language. ULMFiT is today the most accurate known sentiment analysis algorithm.
The basic steps are:
Create (or, preferably, download a pre-trained) language model trained on a large corpus such as Wikipedia (a “language model” is any model that learns to predict the next word of a sentence)
Fine-tune this language model using your target corpus (in this case, IMDb movie reviews)
Remove the decoder from this fine-tuned language model, keeping its encoder, and replace it with a classifier head. Then fine-tune this model for the final classification task (in this case, sentiment analysis).
After our journey into NLP, we’ll complete our practical applications for Practical Deep Learning for Coders by covering tabular data (such as spreadsheets and database tables), and collaborative filtering (recommendation systems).
For tabular data, we’ll see how to use categorical and continuous variables, and how to work with the fastai.tabular module to set up and train a model.
Then we’ll see how collaborative filtering models can be built using similar ideas to those for tabular data, but with some special tricks to get both higher accuracy and more informative model interpretation.
This brings us to the half-way point of the course, where we have looked at how to build and interpret models in each of these key application areas: computer vision, NLP, tabular data, and collaborative filtering.
For the second half of the course, we’ll learn about how these models really work, and how to create them ourselves from scratch. For this lesson, we’ll put together some of the key pieces we’ve touched on so far:
Layers (affine and non-linear)
We’ll be coming back to each of these in lots more detail during the remaining lessons. We’ll also learn about a type of layer that is important for NLP, collaborative filtering, and tabular models: the embedding layer. As we’ll discover, an “embedding” is simply a computational shortcut for a particular type of matrix multiplication (a multiplication by a one-hot encoded matrix).
Lesson 5: Back propagation; Accelerated SGD; Neural net from scratch
In lesson 5 we put all the pieces of training together to understand exactly what is going on when we talk about back propagation. We’ll use this knowledge to create and train a simple neural network from scratch.
We’ll also see how we can look inside the weights of an embedding layer, to find out what our model has learned about our categorical variables. This will let us get some insights into which movies we should probably avoid at all costs…
Although embeddings are most widely known in the context of word embeddings for NLP, they are at least as important for categorical variables in general, such as for tabular data or collaborative filtering. They can even be used with non-neural models with great success.
Lesson 6: Regularization; Convolutions; Data ethics
Today we discuss some powerful techniques for improving training and avoiding over-fitting:
Dropout: remove activations at random during training in order to regularize the model
Data augmentation: modify model inputs during training in order to effectively increase data size
Batch normalization: adjust the parameterization of a model in order to make the loss surface smoother.
Next up, we’ll learn all about convolutions, which can be thought of as a variant of matrix multiplication with tied weights, and are the operation at the heart of modern computer vision models (and, increasingly, other types of models too).
We’ll use this knowledge to create a class activated map, which is a heat-map that shows which parts of an image were most important in making a prediction.
Finally, we’ll cover a topic that many students have told us is the most interesting and surprising part of the course: data ethics. We’ll learn about some of the ways in which models can go wrong, with a particular focus on feedback loops, why they cause problems, and how to avoid them. We’ll also look at ways in which bias in data can lead to biased algorithms, and discuss questions that data scientists can and should be asking to help ensure that their work doesn’t lead to unexpected negative outcomes.
Lesson 7: Resnets from scratch; U-net; Generative (adversarial) networks
In the final lesson of Practical Deep Learning for Coders we’ll study one of the most important techniques in modern architectures: the skip connection. This is most famously used in the resnet, which is the architecture we’ve used throughout this course for image classification, and appears in many cutting-edge results. We’ll also look at the U-net architecture, which uses a different type of skip connection to greatly improve segmentation results (and also for similar tasks where the output structure is similar to the input).
We’ll then use the U-net architecture to train a super-resolution model. This is a model which can increase the resolution of a low-quality image. Our model won’t only increase resolution—it will also remove jpeg artifacts and unwanted text watermarks.
In order to make our model produce high quality results, we will need to create a custom loss function which incorporates feature loss (also known as perceptual loss), along with gram loss. These techniques can be used for many other types of image generation task, such as image colorization.
We’ll learn about a recent loss function known as generative adversarial loss (used in generative adversarial networks, or GANs), which can improve the quality of generative models in some contexts, at the cost of speed.
The techniques we show in this lesson include some unpublished research that:
Lets us train GANs more quickly and reliably than standard approaches, by leveraging transfer learning
Combines architectural innovations and loss function approaches that haven’t been used in this way before.
The results are stunning, and train in just a couple of hours (compared to previous approaches that take a couple of days).
Finally, we’ll learn how to create a recurrent neural net (RNN) from scratch. This is the foundation of the models we have been using for NLP throughout the course, and it turns out they are a simple refactoring of a regular multi-layer network.
Thanks for reading! If you’ve gotten this far, then you should probably head over to course.fast.ai and start watching the first video!
Generating numbers from random distributions is a practically useful tool that any coder is likely to need at some point. C++11 added a rich set of random distribution generation capabilities. This makes it easy and fast to use random distributions, not only if you’re using C++11, but also if you’re using any language that lets you interop with C++.
In this article, we’ll learn what random distributions are useful for, how they are generated, and how to use them in C++11. I’ll also show how I created new random distribution functionality for Swift by wrapping C++11’s classes as Swift classes. Whilst Swift doesn’t provide direct support for C++, I’ll show how to work around that by creating pure C wrappers for C++ classes.
Random distributions, and why they matter
If names like negative binomial and Poisson are mere shadows of memories of something learned long ago, please give me a moment to try to convince you that the world of random distributions is something that deserves your time and attention.
Coders are already aware of the idea of random numbers. But for many, our toolbox is limited to uniform real and integer random numbers, and perhaps some Gaussian (normal) random numbers thrown in occasionally as well. There are so many other ways of generating random numbers! In fact, you may even find yourself recreating standard random distributions without being aware of it…
For instance, let’s say you’re writing a music player, and your users have rated various songs from one star to five stars. You want to implement a “shuffle play” function, which will select songs at random, but choosing higher rated songs more often. How would you go about implementing that? The answer is: with a random distribution! More specifically, you want random numbers from a discrete distribution; that is, generate a random integer, using a set of weights where the higher weighted numbers are chosen proportionally more often.
Or perhaps you are trying to simulate the predicted queue length after adding more resources to a system. You simulate the process, because you want to know not just the average queue length, but how often it will be bigger than some size, what the 95th percentile size will be, and so forth. You’re not sure what some of the inputs to your system might be, but you know the range of possible values, and you have a guess as to what you think is most likely. In this situation, you want random numbers from a triangular distribution; that is, generate a random float which is normally close to your guess, and is linearly less likely further away, reducing to a probability of zero outside of the range of possible values. (This kind of simulation forms the backbone of probabilistic programming.)
There are dozens of other random distributions, including:
Empirical distribution: pick a number at random from historical data
Negative binomial distribution: the number of successes before a specified number of failures occurs
Poisson distribution: the number of independent events of a given average frequency that happen in a fixed time period
How to generate random distributions
In general, the steps to generate a number from some random distribution are:
Seed your generator
Generate the next bunch of random bits using the generator
Transform those random bits into your chosen distribution
If you need more random numbers, return to (2)
What we normally refer to as “random number generation” is really step (2): the use of a pseudorandom generator which deterministically generates a series of numbers that are as “random looking” as possible (i.e. not correlated with each other, well spread out, and so forth). The pseudorandom generator is some function with these properties, such as the Mersenne Twister. To start off the series, you need some seed; that is, the first number to pass to the generator. Most operating systems have some way of generating a random seed, such as /dev/random on Linux and Mac, which uses environmental input such as noise from device drivers to get a number that should be truly random.
Then in step (3) we transform the random bits created by our pseudorandom generator into something that has the distribution we need. There are universally applicable methods for this, such as inverse transform sampling, which transforms a uniform random number into any given distribution. There are also faster methods specific to a distribution, such as the Box-Muller transform, which creates Gaussian (normal) random numbers from a uniform generator.
To create more random numbers, you don’t need to go back to /dev/random, since you already have a pseudorandom generator set up now. Instead, you just grab the next number from your generator (step (2)), and pass that to your distribution generating transform (step (3)).
How this works in C++
C++11 includes, in the <random> standard library header, functionality for each of the steps above. Step (1) is achieved by simply creating a random_device (I’m not including the std:: prefix in this article; you would either type std::random_device or add using namespace std to the top of your C++ file). You then pass this to the constructor of one of the various pseudorandom generators provided, such as mt19937, the Mersenne Twister generator; that’s step (2). Then you construct a random distribution object using an appropriate class, such as discrete_distribution, passing in whatever arguments are needed by that distribution (e.g. for discrete_distribution you pass in a list of weights for each possible value). Finally, call that object (it supports the () operator, so it’s a functor, or what Python calls a callable), passing in the pseudorandom generator you created. Here’s a complete example from the excellent cppreference.com.
If you thought that C++ code had to be verbose and complicated, this example might just make you rethink your assumptions! As you can see, each step maps nicely to the overview of the random distribution process described in the previous section. (BTW: if you’re interested in learning modern C++, cppreference.com has an extraordinary collection of carefully designed examples for every part of the C++ standard library; it’s perhaps the best place to learn how to use the language effectively in a hands on way. You can even edit the examples and run them online!)
Although Swift 4 now provides some basic random number support, it still doesn’t provide any non-uniform distributions. Therefore, I’ve made all of the C++11 random distributions available to Swift, as part of my BaseMath library. For more information on why and how I created this library, see High Performance Numeric Programming with Swift. I’ll show how I built this wrapper in a moment, but first let’s see how to use it. Here’s the same function that we saw in C++, converted to Swift+BaseMath:
As you see, the generation of the random numbers in Swift can be boiled down using BaseMath to just: Int.discrete_distribution([40, 10, 10, 40]). We can do this more concisely than C++11 because we don’t surface as many options and details. BaseMath simply assumes you want to use the standard seeding method, and use the mt19937 mersenne twister generator.
The names of the distributions in BaseMath are exactly the same as in C++11, and you simply prefix each name with the type you wish to generate (either Int or Int32 for integer distributions, or Double or Float for real distributions). Each distribution has an init which matches the same name and types as the C++11 distribution constructor. This returns a Swift object with a number of methods. The C++11 objects, as discussed, provide the () (functor) operator, but unfortunately that operator cannot be overloaded in Swift. Therefore we instead borrow Swift’s subscript special method to give us the equivalent behavior. The only difference is that we have to use [] instead of (). If you use the empty subscript [], then BaseMath will return a single random number; if you pass an Int to the subscript, then BaseMath will return an array of that many numbers. (There are also methods to generate buffer pointers and aligned storage.) Using subscript instead of a functor may feel a bit odd at first, but it’s a perfectly adequate way to get around Swift’s limitation. I’m going to call this a quasi-functor; that is, something that behaves like a functor, but is called using [...].
Wrapping C++ with Swift
In order to make the C++11 random distributions available to Swift, I needed to do two things:
Wrap the C++ classes in a C API, since Swift can’t interop directly with C++
Wrap the C API with Swift classes
We’ll look at each step in turn:
Wrap C++ classes in C
The C wrapper code is in CBaseMath.cpp. The mt19937 mersenne twister generator will be wrapped in a Swift class called RandGen, so the C functions wrapping this class all have the RandGen_ prefix. Here’s the code for the C wrappers (note that the wrappers can use C++ features internally, as long as the interface in the header file is plain C):
The pattern for each class we wrap will be similar to this. We’ll have at least:
A struct which derives from the C++ class to wrap (struct RandGenC), along with a typedef to allow us to use this name directly. By using a struct instead of void* we can call methods directly in C++, but can hide templates and other C++ internals from our pure C header file
A _create function which constructs an object of this class and returns a pointer to it, cast to our struct type
A _destroy function that deletes that object
In our header, we’ll have each of the functions listed, along with the typedef. Anything importing this C API, including our Swift code, won’t actually know anything about what the struct actually contains, so we won’t be able to use the type directly (since its size and layout isn’t provided in the header). Instead, we’ll simply use opaque pointers in code that uses this.
Here’s the code for the wrappers for a distribution; it looks nearly the same:
The main difference is the addition of a _call function to allow us to actually call the method. Also, because the type is templated, we have to create a separate set of wrappers for each template type we want to support; the above shows an example for <int>. Note that this type needs to be included in the name of each function, since C doesn’t support overloading.
Of course, this all looks rather verbose, and we wouldn’t want to write this all out by hand for every distribution. So we don’t! Instead we use gyb templates to create them for us, and also to auto-generate the header file. Time permitting, we’ll look at that in more detail in the future. But for now, you can check the template’s source code.
Wrap C API in Swift
Now that we’ve got our C API, we can recreate the original C++ class easily in Swift, e.g.:
As you see, we simply call our _create function in init, and _destroy in deinit. As discussed in the previous section, our C API users don’t know anything about the internals of our struct, so Swift simply gives us an OpaquePointer.
We create similar wrappers for each distribution (which also define subscript, which will call our _call function), plus extending the numeric type with an appropriate static wrapper, e.g.:
Having to construct and pass in a RandGen object isn’t convenient, particularly when we have to deal with the complexities of thread safety. C++ libraries are not, in general, thread safe; this includes C++11 random generators. So we have to be careful that we don’t share a RandGen object across threads. As discussed in my previous High Performance Numeric Programming with Swift article, we can easily get thread-safe objects by using Thread Local Storage. I added this property to RandGen:
The above steps give us the functionality of generating a single random number at a time. In order to generate a collection, we can add a Distribution protocol which each distribution conforms to, and extend it as follows:
As you see, we leverage the BaseMath method fill, which calls a function or quasi-functor n times and returns a new BaseVector (in this case, an Array) with the results of each call.
You might be wondering about the protocol Nullary that’s mentioned above. Perhaps you’ve already heard of unary (a function or operator with one argument), binary (two arguments), and ternary (three arguments); less known, but equally useful, is the term nullary, which is simply a function or operator with no arguments. As discussed earlier, Swift doesn’t support overloading the () operator, so we add a Nullary protocol using subscript:
If you’re a C++ or Swift programmer, try out some of these random distributions—perhaps you could even experiment with creating some simulations and entering the world of probabilistic programming! Or if you’re a Swift programmer that want to use functionaity in a C++ library, try wrapping it with an idiomatic Swift API and make it available as a Swift package for anyone to use.