The next fast.ai courses will be based nearly entirely on a new framework we have developed, built on Pytorch. Pytorch is a different kind of deep learning library (dynamic, rather than static), which has been adopted by many (if not most) of the researchers that we most respect, and in a recent Kaggle competition was used by nearly all of the top 10 finishers.
We have spent around a thousand hours this year working with Pytorch to get to this point, and we are very excited about what it is allowing us to do. We will be writing a number of articles in the coming weeks talking about each aspect of this. First, we will start with a quick summary of the background to, and implications of, this decision. Perhaps the best summary, however, is this snippet from the start of our first lesson:
fast.ai’s teaching goal
Our goal at fast.ai is for there to be nothing to teach. We believe that the fact that we currently require high school math, one year of coding experience, and seven weeks of study to become a world-class deep learning practitioner, is not an acceptable state of affairs (even though this is fewer prerequisites than for any other course of a similar level). Everybody should be able to use deep learning to solve their problems with no more education than it takes to use a smart phone. Therefore, each year our main research goal is to be able to teach a wider range of deep learning applications, that run faster, and are more accurate, to people with fewer prerequisites.
We want our students to be able to solve their most challenging and important problems, to transform their industries and organisations, which we believe is the potential of deep learning. We are not just trying to teach people how to get existing jobs in the field — but to go far beyond that.
Therefore, since we first ran our deep learning course, we have been constantly curating best practices, and benchmarking and developing many techniques, trialling them against Kaggle leaderboards and academic state-of-the-art results.
Why we tried Pytorch
As we developed our second course, Cutting-Edge Deep Learning for Coders, we started to hit the limits of the libraries we had chosen: Keras and Tensorflow. For example, perhaps the most important technique in natural language processing today is the use of attentional models. We discovered that there was no effective implementation of attentional models for Keras at the time, and the Tensorflow implementations were undocumented, rapidly changing, and unnecessarily complex. We ended up writing our own in Keras, which turned out to take a long time, and be very hard to debug. We then turned our attention to implementing dynamic teacher forcing, a critical technique for accurate neural translation models for which we could find no implementation in either Keras or Tensorflow. Again, we tried to write our own, but this time we just weren’t able to make anything work.
At that point the first pre-release of Pytorch had just been released. The promise of Pytorch was that it was built as a dynamic, rather than static computation graph, framework (more on this in a later post). Dynamic frameworks, it was claimed, would allow us to write regular Python code, and use regular python debugging, to develop our neural network logic. The claims, it turned out, were totally accurate. We had implemented attentional models and dynamic teacher forcing from scratch in Pytorch within a few hours of first using it.
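To give a flavour of what "regular Python code" means here, the following is a minimal sketch of a teacher-forcing decoding loop in plain Python. It is not our course code: `step_fn`, `force_prob`, and `toy_step` are hypothetical names, and we are assuming that "dynamic" means the choice between the ground-truth token and the model's own prediction is made at each step, at run time.

```python
import random


def decode_with_teacher_forcing(step_fn, targets, start_token, force_prob, rng):
    """Decode one token at a time, choosing each next input dynamically.

    step_fn is a hypothetical one-step decoder: previous token -> predicted token.
    force_prob is the (assumed) probability of feeding the ground truth back in.
    """
    prev_token = start_token
    predictions = []
    for target in targets:
        predicted = step_fn(prev_token)
        predictions.append(predicted)
        # An ordinary Python branch, decided per step while the model runs.
        if rng.random() < force_prob:
            prev_token = target     # teacher forcing: feed the ground truth
        else:
            prev_token = predicted  # feed back the model's own output
    return predictions


def toy_step(token):
    # Stand-in for a real decoder step; deliberately imperfect.
    return (token + 2) % 10


targets = [1, 2, 3, 4]
always_forced = decode_with_teacher_forcing(toy_step, targets, 0, 1.0, random.Random(0))
never_forced = decode_with_teacher_forcing(toy_step, targets, 0, 0.0, random.Random(0))
print(always_forced, never_forced)
```

Because the loop is plain Python, you can set a breakpoint inside it or print `prev_token` at any step, which is the debugging experience described above.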
Some pytorch benefits for us and our students
The focus of our second course is to allow students to be able to read and implement recent research papers. This is important because the range of deep learning applications studied so far has been extremely limited, in a few areas that the academic community happens to be interested in. Therefore, solving many real-world problems with deep learning requires an understanding of the underlying techniques in depth, and the ability to implement customised versions of them appropriate for your particular problem, and data. Because Pytorch allowed us, and our students, to use all of the flexibility and capability of regular python code to build and train neural networks, we were able to tackle a much wider range of problems.
An additional benefit of Pytorch is that it allowed us to give our students a much more in-depth understanding of what was going on in each algorithm that we covered. With a static computation graph library like Tensorflow, once you have declaratively expressed your computation, you send it off to the GPU where it gets handled like a black box. But with a dynamic approach, you can fully dive into every level of the computation, and see exactly what is going on. We believe that the best way to learn deep learning is through coding and experiments, so the dynamic approach is exactly what we need for our students.
Much to our surprise, we also found that many models trained quite a lot faster on Pytorch than they had on Tensorflow. This was quite against the prevailing wisdom, which said that static computation graphs should allow for more optimization to be done, which should have resulted in higher performance in Tensorflow. In practice, we’re seeing some models are a bit faster, some a bit slower, and things change in this respect every month. The key issues seem to be that:
Improved developer productivity and debugging experience in Pytorch can lead to more rapid development iterations, and therefore better implementations
Smaller, more focussed development community in Pytorch looks for “big wins” rather than investing in micro-optimization of every function.
Why we built a new framework on top of Pytorch
Unfortunately, Pytorch was a long way from being a good option for part one of the course, which is designed to be accessible to people with no machine learning background. It did not have anything like the clear simple API of Keras for training models. Every project required dozens of lines of code just to implement the basics of training a neural network. Unlike Keras, where the defaults are thoughtfully chosen to be as useful as possible, Pytorch required everything to be specified in detail. However, we also realised that Keras could be even better. We noticed that we kept on making the same mistakes in Keras, such as failing to shuffle our data when we needed to, or vice versa. Also, many recent best practices were not being incorporated into Keras, particularly in the rapidly developing field of natural language processing. We wondered if we could build something that could be even better than Keras for rapidly training world-class deep learning models.
After a lot of research and development it turned out that the answer was yes, we could (in our biased opinion). We built models that are faster, more accurate, and more complex than those using Keras, yet were written with much less code. We’ve implemented recent papers that allow much more reliable training of more accurate models, across a number of fields.
The key was to create an OO class which encapsulated all of the important data choices (such as preprocessing, augmentation, test, training, and validation sets, multiclass versus single class classification versus regression, et cetera) along with the choice of model architecture. Once we did that, we were able to largely automatically figure out the best architecture, preprocessing, and training parameters for that model, for that data. Suddenly, we were dramatically more productive, and made far fewer errors, because everything that could be automated, was automated. But we also provided the ability to customise every stage, so we could easily experiment with different approaches.
With the increased productivity this enabled, we were able to try far more techniques, and in the process we discovered a number of current standard practices that are actually extremely poor approaches. For example, we found that the combination of batch normalisation (which nearly all modern CNN architectures use) and model pretraining and fine-tuning (which you should use in every project if possible) can result in a 500% decrease in accuracy using standard training approaches. (We will be discussing this issue in-depth in a future post.) The results of this research are being incorporated directly into our framework.
There will be a limited release for our in person students at USF first, at the end of October, and a public release towards the end of the year. (By which time we’ll need to pick a name! Suggestions welcome…) (If you want to join the in-person course, there’s still room in the International Fellowship program.)
What should you be learning?
If it feels like new deep learning libraries are appearing at a rapid pace nowadays, then you need to be prepared for a much faster rate of change in the coming months and years. As more people enter the field, they will bring more skills and ideas, and try more things. You should assume that whatever specific libraries and software you learn today will be obsolete in a year or two. Just think about the number of changes of libraries and technology stacks that occur all the time in the world of web programming — and yet this is a much more mature and slow-growing area than deep learning. So we strongly believe that the focus in learning needs to be on understanding the underlying techniques and how to apply them in practice, and how to quickly build expertise in new tools and techniques as they are released.
By the end of the course, you’ll understand nearly all of the code that’s inside the framework, because each lesson we’ll be digging a level deeper to understand exactly what’s going on as we build and train our models. This means that you’ll have learnt the most important best practices used in modern deep learning: not just how to use them, but how they really work and are implemented. If you want to use those approaches in another framework, you’ll have the knowledge you need to implement them there.
To help students learn new frameworks as they need them, we will be spending one lesson learning to use Tensorflow, MXNet, CNTK, and Keras. We will work with our students to port our framework to these other libraries, which will make for some great class projects.
We will also spend some time looking at how to productionize deep learning models. Unless you are working at Google-scale, your best approach will probably be to create a simple REST interface on top of your Pytorch model, running inference on the CPU. If you need to scale up to very high volume, you can export your model (as long as it does not use certain kinds of customisations) to Caffe2 or CNTK. If you need computation to be done on a mobile device, you can either export as mentioned above, or use an on-device library.
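As a rough sketch of what "a simple REST interface" could look like, here is a version using only Python's standard library. The `predict` function is a hypothetical placeholder for a call into a Pytorch model running on the CPU, and the JSON field names, address, and port are illustrative assumptions rather than anything from our framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical placeholder for the model call.

    A real service would run a Pytorch model on the CPU here.
    """
    return sum(features) / len(features)


def handle_body(body):
    """Turn a JSON request body into a JSON response body (as bytes)."""
    payload = json.loads(body)
    result = predict(payload["features"])
    return json.dumps({"prediction": result}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client says it sent.
        length = int(self.headers.get("Content-Length", 0))
        response = handle_body(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server, after which a POST with a body such as `{"features": [1, 2, 3]}` returns a JSON prediction. Keeping `handle_body` separate from the handler class makes the request logic easy to check without opening a socket.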
How we feel about Keras
We still really like Keras. It’s a great library and is far better for fairly simple models than anything that came before. It’s very easy to move between Keras and our new framework, at least for the subset of tasks and architectures that Keras supports. Keras supports lots of backend libraries which means you can run Keras code in many places.
It has a unique (to our knowledge) approach to defining architectures where authors of custom layers are required to implement methods (such as build() and compute_output_shape()) which tell Keras what shapes a layer’s weights and output have for a given input. This allows users to more easily create simple architectures because they almost never have to specify the number of input channels for a layer. For architectures like Densenet which concatenate layers it can make the code quite a bit simpler.
On the other hand, it tends to make it harder to customize models, especially during training. More importantly, the static computation graph on the backend, along with Keras’ need for an extra compile() phase, means that it’s hard to customize a model’s behaviour once it’s built.
What’s next for fast.ai and Pytorch
We expect to see our framework and how we teach Pytorch develop a lot as we teach the course and get feedback and ideas from our students. In past courses students have developed a lot of interesting projects, many of which have helped other students, and we expect that to continue. Given the accelerating progress in deep learning, it’s quite possible that by this time next year, there will be very different hardware or software options that will make today’s technology quite obsolete. That said, based on the quick adoption of new technologies we’ve seen from the Pytorch developers, we suspect that they might stay ahead of the curve for a while at least…
In our next post, we’ll be talking more about some of the standout features of Pytorch, and dynamic libraries more generally.
In both our previous deep learning courses at USF (which were recorded and formed the basis of our MOOCs), we allowed students that could not participate in person to attend via video and text chat through our International Fellowship. As Rachel described in discussing our last course, this (along with our Diversity Fellowship program) was an important part of our mission:
“…we worked hard to curate a diverse group of participants, because we’d observed that artificial intelligence is missing out because of its lack of diversity. A study of 366 companies found that ethnically diverse companies are 35% more likely to perform well financially, and teams with more women perform better on collective intelligence tests. Scientific papers written by diverse teams receive more citations and have higher impact factors.”
In fact, many of our strongest students and most effective projects have come from the International Fellowship. By opening up the opportunity to learn deep learning in a collaborative environment, students have been able to apply this powerful technology to local problems in their area. For instance, past International Fellows are working to:
This year, we’re presenting an entirely new version of part 1 of our deep learning course, and today we’re launching the International Fellowship for it. The program allows those who cannot get to San Francisco to attend virtual classes for free during the same time period as the in-person class and provides access to all the same online resources. (Note that International Fellowships do not provide an official completion certificate through USF.) International fellows can come from anywhere on the planet other than San Francisco (including from the US), but need to be able to attend each class via Youtube Live at 6.30pm-9pm Pacific Time each Monday for 7 weeks from Oct 30, 2017 onwards. For many people that means getting up in the middle of the night, but our past students tell us it’s worth it!
Title your email “International Fellowship Application”
Include your resume
Write 1 paragraph describing one or more problems you’d like to apply deep learning to
State where you are located
Confirm that you can commit 8 hours a week to working on the course and that you are able to participate in each class via Youtube Live at 6.30pm-9pm Pacific Time each Monday for 7 weeks from Oct 30, 2017 onwards.
The deadline to apply is 10/6.
Frequently Asked Questions
Will the updated course be released online? Yes, the course will be recorded and released after the in-person version ends.
Can I apply again if I was an international fellow last year? Yes, you are welcome to apply again.
Do I get a certificate of completion for the international fellowship? No, the USF Data Institute only awards certificates to graduates of the in-person course.
Edit one day later… Much to my surprise a lot of people shared this on twitter, and much to my delight there were some very helpful and interesting comments from people I respect—so check out the thread here.
I cleverly trapped Smerity in a Twitter DM conversation while he was trapped on a train with nothing better to do than answer my dumb questions, and I managed to get a download of ~0.001% of what he knows about language modeling. It should be enough to keep me busy for a few months… The background of this conversation is that for “version 2” of our deep learning course at USF we’re curating and implementing in a consistent API the most important best practices in a range of deep learning applications, including computer vision, text, and recommendation systems. Unfortunately, for text applications the best practices are not really collected anywhere, hence the need for the Smerity-brain-dump.
I figured I’d make my notes on the conversation into a little blog post in case other people find this useful too. I’m assuming people are familiar with the topics covered in parts 1 & 2 of our MOOC, such as RNNs, dropout, attentional models, and neural translation. I’ve spent the day scouring the internet for other resources too and I’m incorporating some of my own research here, so if you see anything dumb it’ll almost certainly be my fault, not Smerity’s.
Pytorch code examples
Smerity pointed to two excellent repositories that seemed to contain examples of all the techniques we discussed:
AWD-LSTM Language Model, which is a very recent release that shows substantial improvements in state of the art for language modeling, using techniques that are likely to be useful across a range of NLP problems. Doesn’t work on the latest Pytorch, although it might not need too much tweaking to fix.
Two other interesting libraries to be aware of are:
Practical pytorch has some nice simple tutorial examples, although there are some significant problems around both approach (e.g. no test sets!), performance, and occasionally clunky code. Also it doesn’t take advantage of the torchtext library, which makes for some redundant code. But really nicely chosen problems and clear descriptions.
torchtext is a small but convenient library for some basic text processing tasks, and also provides convenient access to a few datasets.
Techniques to get state of the art (SotA) results
In part 2 of the course we got pretty close to SotA in neural translation by showing how to use attentional models, dynamic teacher forcing, and of course stacked bidirectional LSTMs. But there have been some interesting approaches that have come to the fore since we developed that course, and Smerity suggested that combining the following should get the biggest wins across a range of NLP tasks without much additional complexity:
Perhaps the most important addition in this paper is the use of pointer networks, which both Smerity et al (pointer sentinel) and Grave et al (continuous cache pointer) have extended to language modeling. Smerity uses the continuous cache for his most recent work and describes it as follows: “The pointer is really really really simple! You could apply that to any LM model output and get similar gains. It’s more engineering though. All it is: store hidden state of LM as history (2000 timesteps), when generating a word calculate attention over that history using current LM hidden state, your word distribution is based on the next word according to where your hidden state is most similar in the past”.
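To make that description concrete, here is a minimal sketch of the cache idea in plain Python (no Pytorch, so it runs anywhere). It follows the quote: keep (hidden state, next word) pairs as history, score the current hidden state against each stored one, and put the softmax weight on the word that followed. The function names, the `theta` scaling factor, and the mixing weight `lam` are assumptions for illustration, not taken from Smerity's code.

```python
import math


def cache_pointer_distribution(history, query, vocab_size, theta=1.0):
    """Word distribution from a cache of (hidden_state, next_word_id) pairs.

    theta is an assumed scaling factor applied to the dot-product scores.
    """
    scores = [theta * sum(h * q for h, q in zip(hidden, query))
              for hidden, _ in history]
    max_score = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    probs = [0.0] * vocab_size
    for (_, next_word), weight in zip(history, weights):
        probs[next_word] += weight / total
    return probs


def interpolate(cache_probs, lm_probs, lam):
    """Mix cache and language-model distributions; lam is an assumed weight."""
    return [lam * c + (1.0 - lam) * p for c, p in zip(cache_probs, lm_probs)]


history = [([1.0, 0.0], 2), ([0.0, 1.0], 1), ([1.0, 0.0], 2)]
cache_probs = cache_pointer_distribution(history, [1.0, 0.0], vocab_size=3)
mixed = interpolate(cache_probs, [0.5, 0.25, 0.25], lam=0.5)
print(cache_probs, mixed)
```

In this toy history, word 2 followed the two stored states most similar to the query, so it gets most of the cache probability. A real implementation would cap the history (the quote mentions 2000 timesteps), for example with a bounded queue.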
Also shown as important in both the Melis and Smerity summary papers above is thoughtful use of regularization. Smerity found that DropConnect (which doesn’t really help CNNs much over regular dropout) actually works very nicely for RNNs, which makes intuitive sense when you think about it… In his words “Given an RNN, h = Wx + Uh_t-1 + b, you apply dropout to the recurrent weight matrix U and have it the same over the entire forward + backward. Simple, fast, and prevents overfitting from h to h to h to …” He calls this the weight-dropped LSTM, and he implemented it in a very brief class called WeightDrop in the recent language modeling project.
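Here is a small plain-Python sketch of that recipe, using the simple recurrence from the quote (no nonlinearity, no gates) rather than a full LSTM. The key point is that the mask on U is sampled once and reused for every timestep of the sequence. The helper names are made up for illustration, and rescaling kept weights by 1/(1 - p) is an assumption (the usual "inverted dropout" convention), not something stated in the quote.

```python
import random


def sample_weight_mask(rows, cols, drop_prob, rng):
    """Sample one mask for the whole sequence (assumes 0 <= drop_prob < 1)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep = 1.0 - drop_prob
    # Kept weights are rescaled by 1/keep (assumed inverted-dropout scaling).
    return [[1.0 / keep if rng.random() < keep else 0.0 for _ in range(cols)]
            for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def run_weight_dropped_rnn(W, U, b, inputs, drop_prob, rng):
    """Apply h = Wx + U_dropped h_prev + b over a sequence of inputs."""
    mask = sample_weight_mask(len(U), len(U[0]), drop_prob, rng)
    # Drop entries of the recurrent matrix once, before the loop over time.
    U_dropped = [[u * m for u, m in zip(u_row, m_row)]
                 for u_row, m_row in zip(U, mask)]
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        wx = matvec(W, x)
        uh = matvec(U_dropped, h)
        h = [a + c + d for a, c, d in zip(wx, uh, b)]
        states.append(h)
    return states, U_dropped
```

Because the same `U_dropped` is used at every step, a dropped recurrent connection stays dropped for the whole sequence, which is what distinguishes this from applying ordinary dropout to the hidden state at each step.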
Something that seems kinda obvious but actually only got written up a year ago is weight-tying, described in Using the Output Embedding to Improve Language Models. It simply means that the output weights and the input embedding weights are the same matrix, and it’s handled in the basic language model example above with a single line of code (self.decoder.weight = self.encoder.weight)
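For anyone who wants to see the idea without Pytorch, here is a tiny plain-Python sketch. The class and method names are hypothetical; the only thing taken from the text above is that the decoder weight and the encoder (embedding) weight are one and the same matrix.

```python
class TiedLanguageModelHead:
    """Toy input embedding and output projection sharing one matrix."""

    def __init__(self, embedding):
        # One row per word in the vocabulary.
        self.encoder_weight = embedding
        # Weight tying: the decoder reuses the same object, not a copy.
        self.decoder_weight = self.encoder_weight

    def embed(self, word_id):
        """Input side: look up the embedding row for a word."""
        return self.encoder_weight[word_id]

    def logits(self, hidden):
        """Output side: score every word by its dot product with the hidden state."""
        return [sum(w * h for w, h in zip(row, hidden))
                for row in self.decoder_weight]


model = TiedLanguageModelHead([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(model.logits([2.0, 3.0]))
```

Since both sides point at the same matrix, any update to an embedding row also changes the output scores for that word, and the model stores one vocabulary-sized matrix instead of two.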
Finally, Smerity suggested adding “activation regularization (add L2 of LSTM output as loss) and temporal activation regularization (add L2 of ht - h_t-1 to loss). That is like three lines of change and gets you near old SotA with weight tying.” The approach is described in this paper.
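Those two penalties are short enough to sketch in plain Python. This is an illustration under assumptions rather than the paper's exact formulation: it takes the penalties to be the mean of squared values, and the coefficient names `alpha` and `beta` and their default values are chosen here for the example.

```python
def mean_square(values):
    values = list(values)
    return sum(v * v for v in values) / len(values)


def activation_regularization(states, alpha):
    """AR: penalise large RNN outputs (assumed form: alpha * mean of squares)."""
    return alpha * mean_square(v for h in states for v in h)


def temporal_activation_regularization(states, beta):
    """TAR: penalise large changes between consecutive hidden states."""
    diffs = [cur - prev
             for h_prev, h_cur in zip(states, states[1:])
             for prev, cur in zip(h_prev, h_cur)]
    if not diffs:  # a single timestep has no consecutive pair
        return 0.0
    return beta * mean_square(diffs)


def regularized_loss(base_loss, states, alpha=2.0, beta=1.0):
    """Add both penalties to the usual loss; alpha and beta are illustrative."""
    return (base_loss
            + activation_regularization(states, alpha)
            + temporal_activation_regularization(states, beta))


states = [[1.0, 2.0], [3.0, 4.0]]  # toy RNN outputs for two timesteps
print(regularized_loss(1.0, states))
```

In a real training loop `states` would be the LSTM outputs for the batch and `base_loss` the cross-entropy, so the change really is only a few lines.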
Problems to solve
Interesting problems to solve in the NLP world, with meaningful benchmarks, include:
Sentiment analysis, or similar standard classification or regression tasks, are great for learning since they’re very similar to other classification and regression problems, so you can focus on the text analysis issues rather than the problem description. And they’re widely useful. There’s a huge dataset of Amazon reviews that can be used, although I’m not aware of good benchmarks. The IMDB review dataset has been widely studied so there’s plenty of good benchmarks (we already use this dataset in our MOOC). Any group of documents that have some kind of categories can be used too, such as the 20 newsgroup dataset that we used in our computational linear algebra course. (BTW I don’t recommend using Stanford’s sentiment dataset, since it contains phrase annotations as well as sentence annotations, which seems rather artificial and designed to make the researchers’ tree-based techniques look more impressive.)
Language modeling, which basically means trying to predict the next word given the previous words in a sentence. The standard benchmark for this is ‘PTB’ (Penn Treebank), which is available in the Pytorch language modeling example repo.
Language translation, for which there are many datasets and tutorials around nowadays, although I like to think that our dataset and approach from part 2 of our course isn’t too bad…
Textual entailment, which is best explained by the examples on Stanford’s Natural Language Inference (SNLI) Corpus page. You get pairs of sentences, and have to say if they’re in agreement or they contradict. E.g. “A man inspects the uniform of a figure in some East Asian country” and “The man is sleeping” would be labeled as a contradiction. As well as the above corpus, the excellent Sam Bowman and friends have also recently released The Multi-Genre NLI Corpus; the two datasets can be combined. It’s a cool demonstration of the power of deep learning, although it’s not entirely clear (to me at least) whether it’s actually useful…
Text summarization is an area of active research with some recent significant advances, although I’m not yet clear on whether it’s useful in practice.
A lot of language workers spend their time on sub-tasks, such as part-of-speech tagging and entity recognition, which can then be used in larger applications. These particular tasks are part of the area of sequence labeling, about which NLP researcher Yoav Goldberg tweeted: “tons of information extraction tasks can be modeled as seq labeling. It’s a killer app.”
Q: I’m a medical doctor (MD-PhD). I do a mix of clinical work and basic science research. My research primarily involves small scale animal studies for hypothesis testing, although others in my lab do some statistical clinical studies, such as paired cohort analysis. I’m interested in AI and wondering if and how it can be applied to my field?
A: AI is being applied to several fields of medicine, including:
Diabetic retinopathy is the fastest growing cause of blindness. The first step in the screening process is for ophthalmologists to examine a picture of the back of the eye, yet in many parts of the world there are not enough specialists available to do so. Researchers at Google and Stanford have used deep learning to create computer models that are as accurate as human ophthalmologists. This technology could help doctors screen a larger number of patients faster, helping to alleviate the global shortage of doctors.
In 2012, Merck sponsored a drug discovery competition where participants were given a dataset describing the chemical structure of thousands of molecules and asked to predict which were most likely to make for effective drugs. Remarkably, the winning team had only decided to enter the competition at the last minute and had no specific knowledge of biochemistry. They used deep learning.
In a 2016 New York Times article, it was shared that medical start-up Enlitic, founded by fast.ai’s Jeremy Howard, was 50 percent more accurate than human radiologists in making lung cancer diagnoses.
Fast.ai remote fellow Xinxin Li is working with Ikaishe and Xeed to develop wearable devices for patients with Parkinson’s Disease. Traditionally, doctors observe the patient walking to assess disease progression, and wearable devices will allow for much more data and more precise data to be collected.
Deep learning can classify skin cancer with dermatologist-level accuracy, as published in Nature earlier this year.
Cardiogram is an app for Apple Watch that screens users’ cardio health and is able to detect atrial fibrillation, a common form of heart irregularity, with 97 percent accuracy.
Does this mean I need “big data”? No.
Currently, when news articles talk about “AI”, they are often referring to deep learning, one particular family of algorithms.
Although the above examples involve relatively large datasets, deep learning is being effectively applied to smaller and smaller datasets all the time. Here are a few examples that I listed in a previous blog post:
Francois Chollet, creator of the popular deep learning library Keras and now at Google Brain, has an excellent tutorial entitled Building powerful image classification models using very little data in which he trains an image classifier on only 2,000 training examples.
At Enlitic, Jeremy Howard led a team that used just 1,000 examples of lung CT scans with cancer to build an algorithm that was more accurate at diagnosing lung cancer than a panel of 4 expert radiologists.
The C++ library Dlib has an example in which a face detector is accurately trained using only 4 images, containing just 18 faces!
Transfer learning is a powerful technique in which models trained on larger data sets (by teams with more computational resources) can be fine-tuned to fit other problems, with less data and less computation. For instance, models originally trained on ImageNet (images in a set of 1,000 categories) are a good starting point for other computer vision problems (such as analyzing a CT scan for lung cancer). Transfer learning is a major focus of our Practical Deep Learning for Coders course.
How does machine learning compare to hypothesis testing or paired cohort analysis?
Machine learning shines at handling messy data. Techniques such as controlled studies or paired cohort analysis rely on carefully controlling for different variables in your experiment set-up (or in finding the pairings), whereas machine learning is an excellent choice when this isn’t possible.
Deep learning is just one kind of machine learning. Another machine learning algorithm is the random forest, which is great for observational studies.
If you are at a university, seek out collaborators or interns. Note, if you have a student who knows how to code, they can learn deep learning. If you are looking for a collaborator, you do not need to find a deep learning expert. All you need is someone with a year or two of coding experience who is interested in your project and wants to learn deep learning. It’s even better if they are familiar with your research (for instance, perhaps a student who may already be working in your lab and knows how to code).
I recommend that you learn to code. Even if it’s not your area of focus and you will be collaborating with programmers, knowing some code will help you better understand what’s possible and have a better sense of what the programmers you’re collaborating with are doing.
Last year’s diversity fellowships (funded by University of San Francisco and fast.ai), open to women, people of color, LGBTQ people, and vets, played a role in helping us create a diverse community. However, we need your help to be able to offer additional scholarships this year. If your company, firm, or organization would be willing to sponsor diversity scholarships ($1,500 each), please email firstname.lastname@example.org.
Deep learning is incredibly powerful and is being used to diagnose cancer, stop deforestation of endangered rainforests, provide better crop insurance to farmers in India, provide better language translation than humans, improve energy efficiency, and more. To find out why diversity in AI is a crucial issue, read this post on the AI diversity crisis.
While many in tech are bemoaning the “skills gap” or “talent shortage” in trying to hire AI practitioners, we at fast.ai set out 1 year ago with a novel experiment: could we teach deep learning to coders with no pre-requisites beyond just 1 year of coding experience? Other deep learning materials often assume an advanced math background, yet we were able to get our students to the state of the art, through a practical, hands-on approach in our part-time course, without the advanced math requirements. Our students have been incredibly successful and their stories include the following: