Today we are launching the 2018 edition of Cutting Edge Deep Learning for Coders, part 2 of fast.ai’s free deep learning course. Just as with our part 1 Practical Deep Learning for Coders, there are no pre-requisites beyond high school math and 1 year of coding experience—we teach you everything else you need along the way. This course contains all new material, including new state of the art results in NLP classification (up to 20% better than previously known approaches), and shows how to replicate recent record-breaking performance results on Imagenet and CIFAR10. The main libraries used are PyTorch and fastai (we explain why we use PyTorch and why we created the fastai library in this article).
Each of the seven lessons includes a video that’s around two hours long, an interactive Jupyter notebook, and a dedicated discussion thread on the fast.ai forums. The lessons cover many topics, including: multi-object detection with SSD and YOLOv3; how to read academic papers; customizing a pre-trained model with a custom head; more complex data augmentation (for coordinate variables, per-pixel classification, etc); NLP transfer learning; handling very large (billion+ token) text corpora with the new fastai.text library; running and interpreting ablation studies; state of the art NLP classification; multi-modal learning; multi-task learning; bidirectional LSTM with attention for seq2seq; neural translation; customizing resnet architectures; GANs, WGAN, and CycleGAN; data ethics; super resolution; image segmentation with u-net.
Lesson 8 starts with a quick recap of what we learned in part 1, and introduces the new focus of this part of the course: cutting edge research. We talk about how to read papers, and what you’ll need to build your own deep learning box to run your experiments. Even if you’ve never read an academic paper before, we’ll show you how to do so in a way that you don’t get overwhelmed by the notation and writing style. Another difference in this part is that we’ll be digging deeply into the source code of the fastai and PyTorch libraries: in this lesson we’ll show you how to quickly navigate and build an understanding of the code. And we’ll see how to use Python’s debugger to deepen your understanding of what’s going on, as well as to fix bugs.
The main topic of this lesson is object detection, which means getting a model to draw a box around every key object in an image, and label each one correctly. You may be surprised to discover that we can use transfer learning from an Imagenet classifier that was never even trained to do detection! There are two main tasks: find and localize the objects, and classify them; we’ll use a single model to do both these at the same time. Such multi-task learning generally works better than creating different models for each task—which many people find rather counter-intuitive. To create this custom network whilst leveraging a pre-trained model, we’ll use fastai’s flexible custom head architecture.
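As a rough illustration (not the fastai custom-head API itself), a custom head is just a set of new layers attached to the pooled features of a pre-trained backbone, with outputs for both tasks at once:

```python
import numpy as np

# Illustrative sketch, not the fastai custom-head API: new layers on top of
# pooled backbone features, with outputs for localization and classification.
rng = np.random.default_rng(0)

def custom_head(features, n_classes, rng):
    """features: (batch, d) pooled activations from a pre-trained backbone."""
    d = features.shape[1]
    w = rng.normal(0, 0.01, size=(d, 4 + n_classes))  # randomly initialised head
    out = features @ w
    bbox, scores = out[:, :4], out[:, 4:]             # box coords + class scores
    return bbox, scores

feats = rng.normal(size=(2, 512))   # pretend these came from a frozen ResNet
bbox, scores = custom_head(feats, n_classes=20, rng=rng)
print(bbox.shape, scores.shape)     # (2, 4) (2, 20)
```

In practice the head would be trained (and the backbone fine-tuned) by backprop; the weights here are random purely to show the shapes involved.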
In this lesson we’ll move from single object to multi-object detection. It turns out that this slight difference makes things much more challenging. In fact, most students found this the most challenging lesson in the whole course. Not because any one piece is highly complex, but because there’s a lot of pieces, so it really tests your understanding of the foundations we’ve learnt so far. So don’t worry if a lot of details are unclear on first viewing – come back to this lesson from time to time as you complete the rest of the course, and you should find more and more of it making sense!
Our focus is on the single shot multibox detector (SSD), and the related YOLOv3 detector. These are ways to handle multi-object detection by using a loss function that can combine losses from multiple objects, across both localization and classification. They also use a custom architecture that takes advantage of the different receptive fields of different layers of a CNN. And we’ll see how to handle data augmentation in situations like this one where the dependent variable requires augmentation too. Finally, we discuss a simple but powerful trick called focal loss which is used to get state of the art results in this field.
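The idea behind focal loss is simple enough to sketch in a few lines: scale the standard cross-entropy loss by a factor that shrinks as the prediction becomes more confident, so that the many easy background examples stop drowning out the hard ones. A minimal numpy sketch of the binary form (alpha and gamma values here are illustrative defaults):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """p: predicted probabilities, y: 0/1 targets."""
    p_t = np.where(y == 1, p, 1 - p)               # probability of the true class
    return -alpha * (1 - p_t) ** gamma * np.log(p_t + eps)

p = np.array([0.9, 0.6, 0.1])   # confident-correct, uncertain, confident-wrong
y = np.array([1, 1, 1])
losses = focal_loss(p, y)
print(losses)   # the easy example contributes far less than the hard one
```

The `(1 - p_t) ** gamma` factor is the whole trick: at p_t = 0.9 it shrinks the loss 100-fold, while at p_t = 0.1 it barely changes it.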
After reviewing what we’ve learned about object detection, in lesson 10 we jump into NLP, starting with an introduction to the new fastai.text library. This is a replacement for torchtext which is faster and more flexible in many situations. A lot of this class will be very familiar—we’re covering a lot of the same ground as lesson 4. But this lesson will show you how to get much more accurate results, by using transfer learning for NLP.
Transfer learning has revolutionized computer vision, but until now it largely has failed to make much of an impact in NLP (and to some extent has been simply ignored). In this class we’ll show how pre-training a full language model can greatly surpass previous approaches based on simple word vectors. We’ll use this language model to show a new state of the art result in text classification.
In lesson 11 we’re going to learn to translate French into English! To do so, we’ll learn how to add attention to an LSTM in order to build a sequence to sequence (seq2seq) model. But before we do, we’ll do a review of some key RNN foundations, since a solid understanding of those will be critical to understanding the rest of this lesson.
A seq2seq model is one where both the input and the output are sequences, and can be of different lengths. Translation is a good example of a seq2seq task. Because each translated word can correspond to one or more words that could be anywhere in the source sentence, we learn an attention mechanism to figure out which words to focus on at each time step. We’ll also learn about some other tricks to improve seq2seq results, including teacher forcing and bidirectional models.
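As a toy sketch of the core idea (plain dot-product attention, not the course’s exact LSTM-with-attention implementation): score every encoder state against the current decoder state, softmax the scores, and take the weighted sum as the context vector.

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """encoder_states: (seq_len, d); decoder_state: (d,)."""
    scores = encoder_states @ decoder_state        # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over source positions
    context = weights @ encoder_states             # weighted sum of encoder states
    return context, weights

enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 source positions, dim 2
dec = np.array([10.0, 0.0])                           # decoder "query"
context, w = attention(dec, enc)
print(w)   # almost all weight goes to the two positions that match the query
```

Because the weights are differentiable, the network learns where to look as part of ordinary training.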
We finish the lesson by discussing the amazing DeVISE paper, which shows how we can bridge the divide between text and images, using them both in the same model!
We start this lesson with a deep dive into the DarkNet architecture used in YOLOv3, and use it to better understand all the details and choices that you can make when implementing a resnet-ish architecture. The basic approach discussed here is what we used to win the DAWNBench competition!
Then we’ll learn about Generative Adversarial Networks (GANs). This is, at its heart, a different kind of loss function. GANs have a generator and a discriminator that battle it out, and in the process combine to create a generative model that can create highly realistic outputs. We’ll be looking at the Wasserstein GAN variant, since it’s easier to train and more resilient to a range of hyperparameters.
For the start of lesson 13 we’ll cover the CycleGAN, which is a breakthrough idea in GANs that allows us to generate images even where we don’t have direct (paired) training data. We’ll use it to turn horses into zebras, and vice versa; this may not be an application you need right now… but the basic idea is likely to be transferable to a wide range of very valuable applications. One of our students is already using it to create a new form of visual art.
But generative models (and many other techniques we’ve discussed) can cause harm just as easily as they can benefit society. So we spend some time talking about data ethics. It’s a topic that really deserves its own whole course; whilst we can’t go into the detail we’d like in the time available, hopefully you’ll get a taste of some of the key issues, and ideas for where to learn more.
We finish the lesson by looking at style transfer, an interesting approach that allows us to change the style of images in whatever way we like. The approach requires us to optimize pixels, instead of weights, which is an interesting different way of looking at optimization.
In this final lesson, we do a deep dive into super resolution, an amazing technique that allows us to restore high resolution detail in our images, based on a convolutional neural network. In the process, we’ll look at a few modern techniques for faster and more reliable training of generative convnets.
We close with a look at image segmentation, in particular using the Unet architecture, a state of the art technique that has won many Kaggle competitions and is widely used in industry. Image segmentation models allow us to precisely classify every part of an image, right down to pixel level.
DAWNBench is a Stanford University project designed to allow different deep learning methods to be compared by running a number of competitions. Two parts of the DAWNBench competition attracted our attention: the CIFAR 10 and Imagenet competitions. Their goal was simply to deliver the fastest image classifier as well as the cheapest one to achieve a certain accuracy (93% for Imagenet, 94% for CIFAR 10).
In the CIFAR 10 competition our entries won both training sections: fastest, and cheapest. Another fast.ai student working independently, Ben Johnson, who works on the DARPA D3M program, came a close second in both sections.
In the Imagenet competition, our results were:
Fastest on publicly available infrastructure, fastest on GPUs, and fastest on a single machine (and faster than Intel’s entry that used a cluster of 128 machines!)
Lowest actual cost (although DAWNBench’s official results didn’t use our actual cost, as discussed below).
Overall, our findings were:
Algorithmic creativity is more important than bare-metal performance
Pytorch, developed by Facebook AI Research and a team of collaborators, allows for rapid iteration and debugging to support this kind of creativity
AWS spot instances are an excellent platform for rapidly and inexpensively running many experiments.
In this post we’ll discuss our approach to each competition. All of the methods discussed here are either already incorporated into the fastai library, or are in the process of being merged into the library.
fast.ai is a research lab dedicated to making deep learning more accessible, both through education and by developing software that simplifies access to current best practices. We do not believe that the key to success is having the newest computer or the largest cluster, but rather utilizing modern techniques and the latest research with a clear understanding of the problem we are trying to solve. As part of this research we recently developed a new library for training deep learning models, based on PyTorch, called fastai.
Over time we’ve been incorporating into fastai algorithms from a number of research papers which we believe have been largely overlooked by the deep learning community. In particular, we’ve noticed a tendency of the community to over-emphasize results from high-profile organizations like Stanford, DeepMind, and OpenAI, whilst ignoring results from less high-status places. One particular example is Leslie Smith from the Naval Research Laboratory, and his recent discovery of an extraordinary phenomenon he calls super convergence. He showed that it is possible to train deep neural networks 5-10x faster than previously known methods, which has the potential to revolutionize the field. However, his paper was not accepted to an academic publishing venue, nor was it implemented in any major software.
Within 24 hours of discussing this paper in class, a fast.ai student named Sylvain Gugger had completed an implementation of the method, which was incorporated into fastai; he also developed an interactive notebook showing how to experiment with other related methods too. In essence, Smith showed that if we very slowly increase the learning rate during training, whilst at the same time decreasing momentum, we can train at extremely high learning rates, thus avoiding over-fitting and training in far fewer epochs.
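That schedule idea can be sketched in a few lines (the values below are illustrative, not fastai’s defaults): the learning rate ramps up to a high peak and back down, while momentum moves in the opposite direction.

```python
# A minimal sketch of the super-convergence-style schedule: learning rate ramps
# linearly up to a high peak then back down, while momentum does the inverse.
def one_cycle(step, total_steps, lr_max=3.0, lr_min=0.1, mom_max=0.95, mom_min=0.85):
    half = total_steps / 2
    if step <= half:
        frac = step / half                    # warm-up: lr rises, momentum falls
    else:
        frac = (total_steps - step) / half    # cool-down: lr falls, momentum rises
    lr = lr_min + frac * (lr_max - lr_min)
    mom = mom_max - frac * (mom_max - mom_min)
    return lr, mom

print(one_cycle(0, 100))    # (0.1, 0.95)  start: low lr, high momentum
print(one_cycle(50, 100))   # midpoint: lr peaks near 3.0, momentum bottoms near 0.85
```

The high middle portion of the cycle acts as a regularizer, which is why the method can train in far fewer epochs without over-fitting.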
Such rapid turnaround of new algorithmic ideas is exactly where Pytorch and fastai shine. Pytorch allows for interactive debugging, and the use of standard Python coding methods, whilst fastai provides many building blocks and hooks (such as, in this case, callbacks to allow customization of training, and fastai.sgdr for building new learning rate annealing methods). Pytorch’s tensor library and CUDA allow for fast implementation of new algorithms for exploration.
We have an informal deep learning study group (free for anyone to join) that meets each day to work on projects together during the course, and we thought it would be interesting to see whether this newly contributed code would work as well as Smith claimed. We had heard that Stanford University was running a competition called DAWNBench, which we thought would be an interesting opportunity to test it out. The competition finished just 10 days from when we decided to enter, so timing was tight!
Both CIFAR 10 and Imagenet are image recognition tasks. For instance, imagine that we have a set of pictures of cats and dogs, and we want to build a tool to separate them automatically. We build a model and then train it on many pictures so that afterwards we can classify dog and cat pictures we haven’t seen before. Next, we can take our model and apply it to larger datasets like CIFAR 10, a collection of pictures in ten classes: cats and dogs again, as well as other animals and vehicles such as frogs and airplanes. The images are small (32 pixels by 32 pixels), so the dataset is small (160MB) and easy to work with. It is, nowadays, a rather under-appreciated dataset, simply because it’s older and smaller than the datasets that are fashionable today. However, it is very representative of the amount of data most organizations have in the real world, and the small image size makes it both challenging and accessible.
When we decided to enter the competition, the current leader had achieved a result of 94% accuracy in a little over an hour. We quickly discovered that we were able to train a Resnet 50 model with super-convergence in around 15 minutes, which was an exciting moment! Then we tried some different architectures, and found that Resnet 18 (in its preactivation variant) achieved the same result in 10 minutes. We discussed this in class, and Ben Johnson independently further developed this by adding a method fast.ai developed called “concat pooling” (which concatenates max pooling and average pooling in the penultimate layer of the network) and got down to an extraordinary 6 minutes on a single NVIDIA GPU.
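Concat pooling itself is nearly a one-liner; here is a numpy sketch of the idea (an illustration, not the fastai implementation):

```python
import numpy as np

# "Concat pooling": rather than choosing between max pooling and average
# pooling before the final layers, concatenate both, doubling the feature
# vector the classifier sees.
def concat_pool(activations):
    """activations: (batch, channels, h, w) feature maps."""
    avg = activations.mean(axis=(2, 3))
    mx = activations.max(axis=(2, 3))
    return np.concatenate([mx, avg], axis=1)   # (batch, 2 * channels)

x = np.random.rand(8, 512, 7, 7)
print(concat_pool(x).shape)   # (8, 1024)
```

Max pooling and average pooling throw away different information, so giving the classifier both often beats either one alone.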
In the study group we decided to focus on multi-GPU training, in order to get the fastest result we could on a single machine. In general, our view is that training models on multiple machines adds engineering and sysadmin complexity that should be avoided where possible, so we focus on methods that work well on a single machine. We used a library from NVIDIA called NCCL that works well with Pytorch to take advantage of multiple GPUs with minimal overhead.
Most papers and discussions of multi-GPU training focus on the number of operations completed per second, rather than actually reporting how long it takes to train a network. However, we found that when training on multiple GPUs, our architectures showed very different results. There is clearly still much work to be done by the research community to really understand how to leverage multiple GPUs to get better end-to-end training results in practice. For instance, we found that training settings that worked well on single GPUs tended to lead to gradients blowing up on multiple GPUs. We incorporated all the recommendations from previous academic papers (which we’ll discuss in a future paper) and got some reasonable results, but we still weren’t really leveraging the full power of the machine.
In the end, we found that to really leverage the 8 GPUs we had in the machine, we actually needed to give it more work to do in each batch—that is, we increased the number of activations in each layer. We leveraged another of those under-appreciated papers from less well-known institutions: Wide Residual Networks, from Université Paris-Est, École des Ponts. This paper does an extensive analysis of many different approaches to building residual networks, and provides a rich understanding of the necessary building blocks of these architectures.
Another of our study group members, Brett Koonce, started running experiments with lots of different parameter settings to try to find something that really worked well. We ended up creating a “wide-ish” version of the resnet-34 architecture which, using Brett’s carefully selected hyper-parameters, was able to reach the 94% accuracy with multi-GPU training in under 3 minutes!
AWS and spot instances
We were lucky enough to have some AWS credits to use for this project (thanks Amazon!) We wanted to be able to run many experiments in parallel, without spending more credits than we had to, so study group member Andrew Shaw built out a python library which would allow us to automatically spin up a spot instance, set it up, train a model, save the results, and shut the instance down again, all automatically. Andrew even set things up so that all training occurred automatically in a tmux session so that we could log in to any instance and view training progress at any time.
Based on our experience with this competition, our recommendation is that for most data scientists, AWS spot instances are the best approach for training a large number of models, or for training very large models. They are generally about a third of the cost of on-demand instances. Unfortunately, the official DAWNBench results do not report the actual cost of training, but instead report the cost based on an assumption of on-demand pricing. We do not agree that this is the most useful approach, since in practice spot instance pricing is quite stable, and is the recommended approach for training models of this type.
Google’s TPU instances (now in beta) may also be a good approach, as the results of this competition show, but be aware that the only way to use TPUs is if you accept lock-in to all of:
Google’s hardware (TPU)
Google’s software (Tensorflow)
Google’s cloud platform (GCP).
More problematically, there is no ability to code directly for the TPU, which severely limits algorithmic creativity (which as we have seen, is the most important part of performance). Given the limited neural network and algorithm support on TPU (e.g. no support for recurrent neural nets, which are vital for many applications, including Google’s own language translation systems), this limits both what problems you can solve, and how you can solve them.
AWS, on the other hand, allows you to run any software, architecture, and algorithm, and you can then take the results of that code and run them on your own computers, or use a different cloud platform. The ability to use spot instances also meant we were able to save quite a bit of money compared to Google’s platform (Google has something similar in beta called “preemptible instances”, but they don’t seem to support TPUs, and automatically kill your job after 24 hours).
For single GPU training, another great option is Paperspace, which is the platform we use for our new courses. They are significantly less complex to set up than AWS instances, and have the whole fastai infrastructure pre-installed. On the other hand, they don’t have the features and flexibility of AWS. They are more expensive than AWS spot instances, but cheaper than AWS on-demand instances. We used a Paperspace instance to win the cost category of this competition, with a cost of just $0.26.
Half precision arithmetic
Another key to fast training was the use of half precision floating point. NVIDIA’s most recent Volta architecture contains tensor cores that only work with half-precision floating point data. However, successfully training with this kind of data has always been complex, and very few people have shown successful implementations of models trained with this data.
NVIDIA was kind enough to provide an open-source demonstration of training Imagenet using half-precision floating point, and Andrew Shaw worked to incorporate these ideas directly into fastai. We’ve now gotten it to a point where you simply write learn.half() in your code, and from there on all the necessary steps to train quickly and correctly with half-precision floating point are automatically done for you.
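As an illustration of why this matters (not the fastai internals): float16 values take half the memory of float32, and are the format Volta’s tensor cores operate on natively. Real mixed-precision training additionally keeps float32 “master” weights and scales the loss to avoid gradient underflow.

```python
import numpy as np

# Half precision halves memory (and bandwidth) per tensor, at the cost of a
# much narrower exponent and mantissa -- which is why naive fp16 training
# tends to lose small gradient values without the extra tricks noted above.
w32 = np.ones((1000, 1000), dtype=np.float32)
w16 = w32.astype(np.float16)
print(w32.nbytes, w16.nbytes)   # 4000000 2000000
```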
Imagenet is a different version of the same problem as CIFAR 10, but with larger images (224 pixels, 160GB) and more categories (1000). Smith showed super convergence on Imagenet in his paper, but he didn’t reach the same level of accuracy as other researchers had on this dataset. We had the same problem, and found that when training with really high learning rates we couldn’t achieve the required 93% accuracy.
Instead, we turned to a method we’d developed at fast.ai, and teach in lessons 1 & 2 of our deep learning course: progressive resizing. Variations of this technique have shown up in the academic literature before (Progressive Growing of GANs and Enhanced Deep Residual Networks) but have never to our knowledge been applied to image classification. The technique is very simple: train on smaller images at the start of training, and gradually increase image size as you train further. It makes intuitive sense that you don’t need large images to learn the general sense of what cats and dogs look like (for instance), but later on when you’re trying to learn the difference between every breed of dog, you’ll often need larger images.
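Sketched as a schedule (the sizes and epoch counts below are illustrative, not our exact competition settings):

```python
# Progressive resizing as a training schedule: start small, finish large.
schedule = [(128, 10), (224, 10), (288, 5)]   # (image size, epochs)

def fit_at_size(size, epochs):
    # placeholder for: rebuild the data loaders at `size`, keep the same model
    # weights (adaptive pooling makes the network size-independent), and train
    print(f"training {epochs} epochs at {size}x{size}")

for size, epochs in schedule:
    fit_at_size(size, epochs)
```

The same model weights carry over between stages; only the data pipeline changes.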
Many people incorrectly believe that networks trained on one size of images can’t be used for other sizes. That was true back in 2013 when the VGG architecture was tied to one specific size of image, but hasn’t been true since then, on the whole. One problem is that many implementations incorrectly used a fixed-size pooling layer at the end of the network instead of a global/adaptive pooling layer. For instance, none of the official PyTorch torchvision models use the correct adaptive pooling layer. This kind of issue is exactly why libraries like fastai and Keras are important—libraries built by people who are committed to ensuring that everything works out-of-the-box and incorporates all relevant best practices. The engineers building libraries like PyTorch and TensorFlow are (quite rightly) focused on the underlying foundations, not on the end-user experience.
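The reason global/adaptive pooling makes a network size-independent is easy to see in a short sketch: whatever the spatial size of the final feature map, pooling collapses it to a fixed shape, so the same fully connected layer works for any input image size.

```python
import numpy as np

def global_avg_pool(fmap):
    """fmap: (channels, h, w) -> (channels,), whatever h and w are."""
    return fmap.mean(axis=(1, 2))

small = np.random.rand(512, 4, 4)   # final feature map from a small input image
large = np.random.rand(512, 9, 9)   # final feature map from a larger input image
print(global_avg_pool(small).shape, global_avg_pool(large).shape)   # (512,) (512,)
```

A fixed-size pooling layer, by contrast, only produces the shape the classifier expects for one particular input size.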
By using progressive resizing we were able both to make the initial epochs much faster than usual (using 128x128 images instead of the usual 224x224), and to make the final epochs more accurate (using 288x288 images for even higher accuracy). But performance was only half of the reason for this success; the other impact is better generalization performance. By showing the network a wider variety of image sizes, it helps it to avoid over-fitting.
A word on innovation and creativity
I’ve been working with machine learning for 25 years now, and throughout that time I’ve noticed that engineers are drawn to using the biggest datasets they can get, on the biggest machines they can access, like moths flitting around a bright light. And indeed, the media loves covering stories about anything that’s “biggest”. The truth though is that throughout this time the genuine advances have consistently come from doing things differently, not doing things bigger. For instance, dropout allows us to train on smaller datasets without over-fitting, batch normalization lets us train faster, and rectified linear units avoid gradient explosions during training; these are all examples of thoughtful researchers thinking about doing things differently, and allowing the rest of us to train better networks, faster.
I worry when I talk to my friends at Google, OpenAI, and other well-funded institutions that their easy access to massive resources is stifling their creativity. Why do things smart when you can just throw more resources at them? But the world is a resource-constrained place, and ignoring that fact means that you will fail to build things that really help society more widely. It is hardly a new observation to point out that throughout history, constraints have been drivers of innovation and creativity. But it’s a lesson that few researchers today seem to appreciate.
Worse still are the people I speak to that don’t have access to such immense resources, and tell me they haven’t bothered trying to do cutting edge research because they assume that without a room full of GPUs, they’ll never be able to do anything of value. To me, they are thinking about the problem all wrong: a good experimenter with a slow computer should always be able to overtake a poor experimenter with a fast one.
We’re lucky that there are folks like the PyTorch team building the tools that creative practitioners need to rapidly iterate and experiment. I hope that seeing that a small non-profit self-funded research lab and some part-time students can achieve these kinds of top-level results can help bring this harmful myth to an end.
Despite what you may have heard, you can use deep learning for the type of data you might keep in a SQL database, a Pandas DataFrame, or an Excel spreadsheet (including time-series data). I will refer to this as tabular data, although it can also be known as relational data, structured data, or other terms (see my twitter poll and comments for more discussion).
Tabular data is the most commonly used type of data in industry, but deep learning on tabular data receives far less attention than deep learning for computer vision and natural language processing. This post covers some key concepts from applying neural networks to tabular data, in particular the idea of creating embeddings for categorical variables, and highlights 2 relevant modules of the fastai library:
fastai.structured: this module works with Pandas DataFrames, is not dependent on PyTorch, and can be used separately from the rest of the fastai library to process and work with tabular data.
fastai.column_data: this module also works with Pandas DataFrames, and provides methods to convert DataFrames (with both continuous and categorical variables) into ModelData objects that can easily be used when training neural networks. It also includes an implementation for creating embeddings of categorical variables, a powerful technique I will explain below.
A key technique to making the most of deep learning for tabular data is to use embeddings for your categorical variables. This approach allows for relationships between categories to be captured. Perhaps Saturday and Sunday have similar behavior, and maybe Friday behaves like an average of a weekend and a weekday. Similarly, for zip codes, there may be patterns for zip codes that are geographically near each other, and for zip codes that are of similar socio-economic status.
Taking Inspiration from Word Embeddings
A way to capture these multi-dimensional relationships between categories is to use embeddings. This is the same idea as is used with word embeddings, such as Word2Vec. For instance, a 3-dimensional version of a word embedding might look like:
[0.9, 1.0, 0.0]
[1.0, 0.2, 0.0]
[0.0, 1.0, 0.9]
[0.0, 0.2, 1.0]
Notice that the first dimension is capturing something related to being a dog, and the second dimension captures youthfulness. This example was made up by hand, but in practice you would use machine learning to find the best representations (while semantic values such as dogginess and youth would be captured, they might not line up with a single dimension so cleanly). You can check out my workshop on word embeddings for more details about how word embeddings work.
Applying Embeddings for Categorical Variables
Similarly, when working with categorical variables, we will represent each category by a vector of floating point numbers (the values of this representation are learned as the network is trained).
For instance, a 4-dimensional version of an embedding for day of week could look like:
[.8, .2, .1, .1]
[.1, .2, .9, .9]
[.2, .1, .9, .8]
Here, Monday and Tuesday are fairly similar, yet they are both quite different from Sunday. Again, this is a toy example. In practice, our neural network would learn the best representations for each category while it is training, and each dimension (or direction, which doesn’t necessarily line up with ordinal dimensions) could have multiple meanings. Rich relationships can be captured in these distributed representations.
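In code, an embedding is nothing more than a lookup table indexed by category id. A numpy sketch of the idea (in a real model, such as one using PyTorch’s nn.Embedding, these vectors would be updated by backprop during training rather than left at their random initial values):

```python
import numpy as np

n_categories, emb_dim = 7, 4           # e.g. day of week -> 4-dimensional vectors
rng = np.random.default_rng(42)
emb = rng.normal(0, 0.1, size=(n_categories, emb_dim))   # the learnable table

days = np.array([0, 1, 6])             # Monday, Tuesday, Sunday
vectors = emb[days]                    # an embedding lookup is just indexing
print(vectors.shape)                   # (3, 4)
```

Because the lookup is differentiable with respect to the table entries, the network can move categories with similar behavior close together in this space as it trains.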
Reusing Pretrained Categorical Embeddings
Embeddings capture richer relationships and complexities than the raw categories. Once you have learned embeddings for a category which you commonly use in your business (e.g. product, store id, or zip code), you can use these pre-trained embeddings for other models. For instance, Pinterest has created 128-dimensional embeddings for its pins in a library called Pin2Vec, and Instacart has embeddings for its grocery items, stores, and customers.
The fastai library contains an implementation of embeddings for categorical variables, which works with PyTorch’s nn.Embedding module, so this is not something you need to code by hand each time you want to use it.
Treating some Continuous Variables as Categorical
We generally recommend treating month, year, day of week, and some other variables as categorical, even though they could be treated as continuous. Often for variables with a relatively small number of categories, this results in better performance. This is a modeling decision that the data scientist makes. Generally, we want to keep continuous variables represented by floating point numbers as continuous.
Although we can choose to treat continuous variables as categorical, the reverse is not true: any variables that are categorical must be treated as categorical.
Time Series Data
The approach of using neural networks together with categorical embeddings can be applied to time series data as well. In fact, this was the model used by students of Yoshua Bengio to win 1st place in the Kaggle Taxi competition (paper here), using a trajectory of GPS points and timestamps to predict the length of a taxi ride. It was also used by the 3rd place winners in the Kaggle Rossmann Competition, which involved using time series data from a chain of stores to predict future sales. The 1st and 2nd place winners of this competition used complicated ensembles that relied on specialist knowledge, while the 3rd place entry was a single model with no domain-specific feature engineering.
In our Lesson 3 jupyter notebook we walk through a solution for the Kaggle Rossmann Competition. This data set (like many data sets) includes both categorical data (such as the state the store is located in, or being one of 3 different store types) and continuous data (such as the distance to the nearest competitor or the temperature of the local weather). The fastai library lets you enter both categorical and continuous variables as input to a neural network.
When applying machine learning to time-series data, you nearly always want to choose a validation set that is a continuous selection with the latest available dates that you have data for. As I wrote in a previous post, “If your data is a time series, choosing a random subset of the data will be both too easy (you can look at the data both before and after the dates you are trying to predict) and not representative of most business use cases (where you are using historical data to build a model for use in the future).”
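A minimal pandas sketch of such a split (column names invented for illustration): hold out the most recent dates as validation, rather than sampling rows at random.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2015-01-01", periods=10, freq="D"),
    "sales": range(10),
})
cutoff = df["date"].max() - pd.Timedelta(days=2)
train = df[df["date"] <= cutoff]   # everything up to the cutoff
valid = df[df["date"] > cutoff]    # the most recent dates only
print(len(train), len(valid))      # 8 2
```

This mirrors how the model will actually be used: trained on the past, evaluated on the future.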
One key to successfully using deep learning with time series data is to split the date into multiple categorical variables (year, month, week, day of week, day of month, and Booleans for whether it’s the start/end of a month/quarter/year). The fastai library has implemented a method to handle this for you, as described below.
Modules to Know in the Fastai Library
We will be releasing more documentation for the fastai library in coming months, but it is already available on pip and on github, and it is used in the Practical Deep Learning for Coders course. The fastai library is built on top of Pytorch and encodes best practices and helpful high-level abstractions for using neural networks. The fastai library achieves state-of-the-art results and was recently used to win the Stanford DAWNBench competition (fastest CIFAR10 training).
fastai.column_data.ColumnarModelData takes a Pandas DataFrame as input and creates a type of ModelData object (an object which contains data loaders for the training, validation, and test sets, and which is the fundamental way of keeping track of your data while training models).
The fastai.structured module of the fastai library is built on top of Pandas, and includes methods to transform DataFrames in a number of ways, improving the performance of machine learning models by pre-processing the data appropriately and creating the right types of variables.
For instance, fastai.structured.add_datepart converts dates (e.g. 2000-03-11) into a number of variables (year, month, week, day of week, day of month, and booleans for whether it’s the start/end of a month/quarter/year.)
Other useful methods in the module allow you to:
Fill in missing values with the median whilst adding a boolean indicator variable (fix_missing)
Change any columns of strings in a Pandas DataFrame to a column of categorical values (train_cats)
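As a rough sketch of what these transformations do, here is the equivalent in plain pandas (hypothetical column names; the actual fastai implementations handle more edge cases and operate on whole DataFrames):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2000-03-11", "2000-03-31", "2000-04-01"]),
    "StoreType": ["a", "b", "a"],
    "CompetitionDistance": [120.0, np.nan, 350.0],
})

# In the spirit of add_datepart: expand the date into categorical parts.
d = df["Date"].dt
df["Year"], df["Month"], df["Day"] = d.year, d.month, d.day
df["Dayofweek"] = d.dayofweek
df["Is_month_end"] = d.is_month_end
df["Is_quarter_end"] = d.is_quarter_end

# In the spirit of fix_missing: fill with the median, plus a boolean flag.
df["CompetitionDistance_na"] = df["CompetitionDistance"].isna()
df["CompetitionDistance"] = df["CompetitionDistance"].fillna(
    df["CompetitionDistance"].median())

# In the spirit of train_cats: turn string columns into pandas categoricals.
df["StoreType"] = df["StoreType"].astype("category")
```

After this kind of pre-processing, the categorical columns can be fed through embeddings and the continuous columns fed in directly, as in the Rossmann notebook.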
While many news outlets have covered what Zuckerberg wore for his testimony before Congress last week, I wish that several more substantial issues were getting greater coverage, such as the lackluster response of Facebook to its role in the genocide in Myanmar, what concrete regulations could help us, the continued push of Facebook lobbyists to gut the few privacy laws we have, and the role that Free Basics (aka Internet.org– remember that?) has played in global hate speech and violence. So let’s discuss them now.
Genocide in Myanmar vs. a Monetary Fine in Germany
The impact of Facebook is much larger than just the USA and UK: there is a genocide happening in Myanmar, with hundreds of thousands of people from an ethnic minority being systematically driven from their homes, raped, and/or killed, and with entire villages being burnt and bulldozed to the ground. Facebook is the primary source of news in Myanmar (in part due to the Facebook program “Free Basics”, explained more below), and it has been used for years to spread dehumanizing propaganda about the ethnic minority (as well as to censor news reports about violence against the minority). Local activists have been asking Facebook to address this issue since 2013 and news outlets have been covering Facebook’s role in the genocide since 2014, yet Facebook has been beyond sluggish to respond.
(I previously wrote about Facebook’s role in Myanmar here.)
Even now, Zuckerberg promised during the congressional hearings to hire “dozens” to address the genocide in Myanmar (in 2018, years after the genocide had begun), which stands in stark contrast to Facebook quickly hiring 1,200 people in Germany to try to avoid expensive penalties under a new German law against hate speech. Clearly, Facebook is more reactive to the threat of a financial penalty than to the systematic destruction of an ethnic minority.
Can AI solve hate speech?
Facebook’s actions in Germany of hiring 1,200 human content moderators also stand in contrast to Zuckerberg’s vague assurance that Facebook will use AI to address hate speech. It’s valuable to look at how Facebook actually behaves when facing a potential financial penalty, and not what sort of vague promises are made to Congress. As Cornell Law Tech professor James Grimmelmann said, “[AI] won’t solve Facebook’s problems, but it will solve Zuckerberg’s: getting someone else to take responsibility.”
As two concrete examples of just how far the tech giants are from effectively using AI on these problems, let’s consider Google. In 2017, Google released a tool called Perspective to automatically detect online harassment and abusive speech, with high-profile launch partners including The New York Times, The Economist, and Wikipedia. Librarian Jessamyn West found a number of sexist, racist, and ableist biases (confirmed and covered in more detail here). For instance, the statement “I am a gay Black woman” got an 87 percent toxicity score (out of 100), while “I am a man” was one of the least toxic phrases tested (just 20 percent). Also in 2017, Google boasted that they’d implemented cutting-edge machine learning to identify and remove YouTube videos promoting violent extremism or terrorism. The removed videos included those of activist groups documenting Syrian war crimes, evidence from the Chelsea Manning court case, and documentation of the destruction of ancient artifacts by Islamic State (many of these videos were later put back and Google conceded it had made mistakes). These examples demonstrate a few key points:
Using AI to address harassment and abusive speech is incredibly hard.
You can’t just throw AI at a problem without deeply understanding the problem.
Although Facebook is in the spotlight right now, Google has the same business model (targeted ads) and same monopoly status, and Google is causing huge societal harms as well. For instance, YouTube has been well-documented to recommend white supremacist videos and conspiracy theories from a wide range of starting points, and it is playing a role in radicalizing users into dangerous viewpoints.
You can’t just throw AI at a problem
Expertise in history, sociology, and psychology is necessary. As I said in my talk at the MIT Tech Review Conference, domain expertise is incredibly important. You can’t just have techies throwing AI at a problem without really understanding it, or the domain. For instance, radiologists who have also become deep learning practitioners have been able to catch some errors in deep learning work on radiology that deep learning experts missed. To this end, I think it is crucial that Facebook work with sociologists, historians, psychologists, and experts on topics like authoritarian governments, propaganda, etc. to better understand the problems they’ve created and plausible solutions.
Data privacy is a public good
Hopefully the above examples have made clear that data privacy is not just an individual choice, but it has profound societal impacts. As UNC Professor and techno-sociology expert Zeynep Tufekci wrote, “Data privacy is not like a consumer good, where you click ‘I accept’ and all is well. Data privacy is more like air quality or safe drinking water, a public good that cannot be effectively regulated by trusting in the wisdom of millions of individual choices. A more collective response is needed.”
Data collection should only be through a clear, concise, & transparent opt-in process.
People should have access to all data collected on them.
Data collection should be limited to specifically enumerated purposes, for a designated period of time.
Aggregate use of data should be regulated.
Another area of potential regulation is to address the monopoly status of tech giants like Facebook and force interoperability. Professor Jonathan Zittrain of Harvard has suggested, “The key is to have the companies actually be the ‘platforms’ they claim they are. Facebook should allow anyone to write an algorithm to populate someone’s feed.”
Professor Tufekci also proposed, “If anything, we should all be thinking of ways to reintroduce competition into the digital economy. Imagine, for example, requiring that any personal data you consent to share be offered back to you in an ‘interoperable’ format, so that you could choose to work with companies you thought would provide you better service, rather than being locked in to working with one of only a few.”
The power of tech lobbies
While regulation is necessary for items we consider public goods, in the USA we face a significant obstacle due to the amount the tech industry spends on lobbying and the outsize impact that corporate lobbies have on our policies. As Alvaro Bedoya, Executive Director at the Georgetown Law Center on Privacy and Technology and a former senate staffer who worked on privacy laws points out, the last major consumer privacy law was passed in 2009, before Google, Facebook, and Amazon had really ramped up their lobbying spending. “I’m sitting here watching Mark Zuckerberg say he’s sorry and that Facebook will do better on privacy, yet literally as he testifies lobbyists paid by Facebook in Illinois and California are working to stop or gut privacy laws,” said Bedoya.
In the USA, we also face the issue of whether our laws will actually be enforced when tech giants are found to violate them. For instance, in 2016 Propublica discovered that Facebook allowed the placement of housing ads that would not be shown to African-American, Hispanic, or Asian-American people, which is a violation of the Fair Housing Act of 1968. “This is horrifying. This is massively illegal. This is about as blatant a violation of the federal Fair Housing Act as one can find,” said prominent civil rights lawyer John Relman. Facebook apologized (although it did not face any penalty); yet over a year later in 2017, Facebook was still found to let people place housing ads that would not be shown to certain groups of people, such as African Americans, people interested in wheelchair ramps, Jews, and Spanish speakers. Again, please contrast this with how swiftly and strongly Facebook responded in Germany to the threat of a penalty. As a further example, Facebook has been found to be violating the Age Discrimination in Employment Act of 1967 by allowing companies, including Verizon, Amazon, Goldman Sachs, Target, and Facebook itself, to place job advertisements targeted solely at young people.
Free Basics, aka Internet.org
Zooming back out to Facebook’s global impact, it is important to understand the role that Facebook’s program Free Basics has played in the genocide in Myanmar, the election of a violent strongman in the Philippines who is notorious for his use of extrajudicial killings, and elsewhere. Free Basics (formerly called Internet.org) was initially touted as a charitable effort in which Facebook provides free access to Facebook and a small number of other sites, but not to the internet as a whole, in countries including Myanmar, Somalia, the Philippines, Nigeria, Iraq, Pakistan, and others. For users in these countries, there is no charge for using the Facebook app on a smartphone, while the data rates for other apps are often prohibitively expensive. Despite its original name and description of benevolent aims, Internet.org was not a non-profit, but rather a business development group within Facebook aimed at increasing users and revenue.
Free Basics has led to large numbers of Facebook users who say they don’t use the internet and never follow links outside of Facebook (56% of Facebook users in Indonesia and 61% in Nigeria). In some countries, people use the words internet and Facebook interchangeably, and Facebook is the primary way that people access news. Free Basics violates net neutrality and was banned in India in 2016 for this reason. While Facebook may have had good intentions with this program, it should have been accompanied with a huge sense of responsibility, and it needs to be re-evaluated in light of its role in inciting violence. Facebook needs to put significant resources into analyzing the impact Free Basics is having in the countries where it is offered, and to work closely with local experts from those countries.
I want to thank Omoju Miller, a machine learning engineer at Github who has her PhD from Berkeley, for recently reminding me about Free Basics and asking why more journalists aren’t writing about the role it’s played in global events.
Rethinking the Web
In a recent interview about AI, French President Emmanuel Macron said: In the US, [AI] is entirely driven by the private sector, large corporations, and some startups dealing with them. All the choices they make are private choices that deal with collective values. That’s exactly the problem you have with Facebook and Cambridge Analytica or autonomous driving… The key driver should not only be technological progress, but human progress. This is a huge issue. I do believe that Europe is a place where we are able to assert collective preferences and articulate them with universal values. Perhaps France will provide a template of laws that support technical progress while protecting human values.
To end on a note of hope, Tim Berners-Lee, inventor of the web, wrote an op-ed saying, two myths currently limit our collective imagination: the myth that advertising is the only possible business model for online companies, and the myth that it’s too late to change the way platforms operate. On both points, we need to be a little more creative. While the problems facing the web are complex and large, I think we should see them as bugs: problems with existing code and software systems that have been created by people – and can be fixed by people. Create a new set of incentives and changes in the code will follow.
It’s easy to jump to simplistic conclusions: many commenters assume that Facebook must be supported at all times on the assumption that technology providers have the power to help society (or by disregarding the role that regulation plays in maintaining healthy markets and a healthy society); and many assume that Facebook and all involved must be evil and everyone needs to stop using their services right away. The reality is more complex. Technology can be greatly beneficial for society, and it can also have a disastrous impact. The complex problems we are facing require creative and nuanced solutions.
I recently was a guest speaker at the Stanford AI Salon on the topic of accessibility in AI, which included a free-ranging discussion among assembled members of the Stanford AI Lab. There were a number of interesting questions and topics, so I thought I would share a few of my answers here.
Q: What 3 things would you most like the general public to know about AI?
AI is easier to use than the hype would lead you to believe. In my recent talk at the MIT Technology Review conference, I debunked several common myths that you must have a PhD, a giant data set, or expensive computational power to use AI.
Most AI researchers are not working on getting computers to achieve human consciousness. Artificial intelligence is an incredibly broad field, and a somewhat misleading name for a lot of the stuff included in the field (although since this is the terminology everyone uses, I go along with it). Last week I was 20 minutes into a conversation with an NPR reporter before I realized that he thought we were talking about computer systems achieving human consciousness, and I thought we were talking about the algorithm Facebook uses to decide which advertisements to show you. Things like this happen all too often.
95% of the time when people within AI talk about AI, they are referring to algorithms that do a specific task (e.g. that sort photos, translate language, or win Atari games) and 95% of the time when people outside of AI hear something about AI they think of humanoid robots achieving super-intelligence (numbers made up based on my experience). I believe that this leads to a lot of unnecessary confusion and fear.
There are several problems that are far more urgent than the threat of evil super-intelligence, such as how we are encoding racial and gender biases into algorithms (that are increasingly used to make hiring, firing, healthcare benefits, criminal justice, and other life-impacting decisions) or increasing inequality (and the role that algorithms play in perpetuating & accelerating this).
Q: What can AI researchers do to improve accessibility?
A: My wish-list for AI researchers:
Write a blog post to accompany your paper (some of the advice we give fast.ai students about reading academic papers is to first search if someone has written a blog post version)
Share your code.
Try running your code on a single GPU (which is what most people have access to). Jeremy and I sometimes come across code that was clearly never run on just one GPU.
Use concrete examples in your paper. I was recently reading a paper on fairness that was all math equations, and even as someone with a math PhD, I was having trouble mapping what this meant into the real world.
Q: I recently taught a course on deep learning and had all the students do their own projects. It was so hard. The students couldn’t get their models to train, and we were like “that’s deep learning”. How are you able to teach this with fast.ai?
A: Yes, deep learning models can be notoriously finicky to train. We’ve had a lot of successful student projects come out of fast.ai, and I believe some of the factors for this are:
The fast.ai course is built around a number of practical, hands-on case studies (spanning computer vision, natural language processing, recommendation systems, and time series problems), and I think this gives students a good starting point for many of their projects.
We’ve developed the fastai library with the primary goal of having it be easy for students to use and to apply to new problems. This includes innovations like a learning rate finder, setting good defaults, and encoding best practices (such as cyclical learning rates).
fast.ai is not an education company; we are a research lab, and our research is primarily on how to make deep learning easier to use (which closely aligns with the goals of making deep learning easier to learn).
All this said, I do want to acknowledge that deep learning models can still be frustrating and finicky to train! I think everyone working in the field does routinely experience these frustrations (and I look forward to this improving as the field matures and advances).
Q: What do you mean by accessibility in AI? And why is it important?
A: By accessibility, I mean that people from all sorts of backgrounds: education, location where they live, domain of expertise, race, gender, age, and more should all be able to apply AI to problems that they care about.
I think this is important for two key reasons:
On the positive side, people from different backgrounds will know and care about problems that nobody else knows about. For instance, we had a student this fall who is a Canadian dairy farmer using deep learning to improve the health of his goats’ udders. This is an application that I never would have thought about.
People from different backgrounds are necessary to try to catch biases and negative impacts of AI. I think people from under-represented backgrounds are most likely to identify the ways that AI may be harmful, so their inclusion is vital.
I also want people from a variety of backgrounds to know enough about AI to be able to identify when people are selling snake oil (or just over-promising on what they could reasonably deliver).
Currently, for our Practical Deep Learning for Coders course, you need 1 year of coding experience to be able to learn deep learning, although we are working to lower that barrier to entry. I have not checked out the newest versions of deep learning SaaS APIs (it’s on my to-do list) that purportedly let people use deep learning without knowing how to code, but the last time I checked, these APIs suffered from several shortcomings:
they did not give truly state-of-the-art results, which is what most people want (for that, you needed to write your own code);
they only worked on a fairly limited set of problems; or
using the API effectively required knowing about as much as it takes to write your own code (in which case people prefer to write their own, since it gives them more control and customization).
While I support the long-term vision of a machine learning API that anyone can use, I’m skeptical that the technology is there to truly provide something robust and performant enough to eliminate code yet.
Q: Do we really need everyone to understand AI? If the goal is to make an interface as user-friendly as a car, should we just be focusing on that? Cars have a nice UI, and we can all drive cars without understanding how the engine works. Does it really matter who developed cars initially?
A: While I agree with the long-term goal of making deep learning as easy to use as a car, I think we are a long way from that, and it very much matters who is involved in the meantime. It is dangerous to have a homogeneous group developing technology that impacts us all.
For instance, it wasn’t until 2011 that car manufacturers were required to use crash test dummies with prototypical female anatomy, in addition to the “standard” male test dummies (a fact I learned from Datasheets for Datasets by Timnit Gebru, et al., which includes fascinating case studies of how standardization came to the electronics, automobile, and pharmaceutical industries). As described in the paper, “a safety study of automobiles manufactured between 1998 and 2008 concluded that women wearing seat belts were 47% more likely to be seriously injured than males in similar accidents”.
Extending the car analogy further, there are many things that are sub-optimal about our system of cars and how they developed: the USA’s under-investment in public transportation, how free parking led to sprawl and congestion, and the negative impacts of going with fossil fuels over electric power. Could these things have been different if those developing cars had come from a wider variety of backgrounds and fields?
And finally, returning to the question of AI usability and access, it matters both:
Who helps create the abstractions and usability interfaces
Who is able to use AI in the meantime.
Q: Is it really possible for people to learn the math they need on their own? Isn’t it easier to learn things through an in-person class– so as a grad student at Stanford, should I take advantage of the math courses that I have access to now?
A: At fast.ai, we highly recommend that people learn math on an as-needed basis (and here and here). Feeling like you must learn all the math you might ever need before getting to work on the topic you’re excited about is a recipe for discouragement, and most people struggle to maintain motivation. If your goal is to apply deep learning to practical problems, much of that math turns out to be unnecessary.
As for learning better through in-person courses, this is something that I think about. Even though we put all our course materials online, for free, we’ve had several students travel from other countries (including Australia, Argentina, England, and India) to attend our in-person course. If you are taking a course online or remotely, I definitely recommend trying to organize an in-person study group around it with others in your area. And no matter what, seek out community online.
I have a PhD in math, so I took a lot of math in graduate school. And sadly, I have forgotten much of the stuff that I didn’t use (most of my graduate courses were taken over a decade ago). It wasn’t a good use of time, given what I’ve ended up doing now. So I don’t know how useful it is to take a course that you likely won’t need (unless you are purely taking it for the enjoyment, or you are planning to stay in academia).
And finally, in talking with students, I think that people’s anxiety and negative emotions around math are a much bigger problem than their failure to understand concepts. There is a lot broken with how math is taught and the culture around it in the USA, and so it’s understandable that many people feel this way.
This is just a subset of the many topics we discussed at the AI Salon. All questions are paraphrased as I remember them and are not exact quotes. I should also note that my co-presenter, Stanford post-doc Mark Whiting, made a number of interesting points around HCI and symbolic systems. I enjoyed the event and want to thank the organizers for having me!