International Fellowship applications for Part 1 now open

This post is from 2017. To apply for the 2018 version, please read this post.

In both our previous deep learning courses at USF (which were recorded and formed the basis of our MOOCs), we allowed students that could not participate in person to attend via video and text chat through our International Fellowship. As Rachel described in discussing our last course, this (along with our Diversity Fellowship program) was an important part of our mission:

“…we worked hard to curate a diverse group of participants, because we’d observed that artificial intelligence is missing out because of its lack of diversity. A study of 366 companies found that ethnically diverse companies are 35% more likely to perform well financially, and teams with more women perform better on collective intelligence tests. Scientific papers written by diverse teams receive more citations and have higher impact factors.”

In fact, many of our strongest students and most effective projects have come from the International Fellowship. By opening up the opportunity to learn deep learning in a collaborative environment, students have been able to apply this powerful technology to local problems in their area. For instance, past International Fellows are working to:

This year, we’re presenting an entirely new version of part 1 of our deep learning course, and today we’re launching the International Fellowship for it. The program allows those who cannot get to San Francisco to attend virtual classes for free during the same time period as the in-person class, and provides access to all the same online resources. (Note that International Fellowships do not provide an official completion certificate through USF.) International Fellows can come from anywhere on the planet other than San Francisco (including from the US), but need to be able to attend each class via YouTube Live from 6:30pm to 9pm Pacific Time each Monday for 7 weeks from Oct 30, 2017 onwards. For many people that means getting up in the middle of the night, but our past students tell us it’s worth it!

Updated 10/7/2018: We are removing last year’s instructions to avoid confusion. Please see this post for the 2018 instructions.

Frequently Asked Questions

  • Will the updated course be released online? Yes, the course will be recorded and released after the in-person version ends.
  • Can I apply again if I was an international fellow last year? Yes, you are welcome to apply again.
  • Do I get a certificate of completion for the international fellowship? No, the USF Data Institute only awards certificates to graduates of the in-person course.

Notes on state of the art techniques for language modeling

Edit one day later… Much to my surprise a lot of people shared this on Twitter, and much to my delight there were some very helpful and interesting comments from people I respect, so check out the thread here.

I cleverly trapped Smerity in a Twitter DM conversation while he was trapped on a train with nothing better to do than answer my dumb questions, and I managed to get a download of ~0.001% of what he knows about language modeling. It should be enough to keep me busy for a few months… The background of this conversation is that for “version 2” of our deep learning course at USF we’re curating and implementing in a consistent API the most important best practices in a range of deep learning applications, including computer vision, text, and recommendation systems. Unfortunately, for text applications the best practices are not really collected anywhere, hence the need for the Smerity-brain-dump.

I figured I’d make my notes on the conversation into a little blog post in case other people find this useful too. I’m assuming people are familiar with the topics covered in parts 1 & 2 of our MOOC, such as RNNs, dropout, attentional models, and neural translation. I’ve spent the day scouring the internet for other resources too and I’m incorporating some of my own research here, so if you see anything dumb it’ll almost certainly be my fault, not Smerity’s.

PyTorch code examples

Smerity pointed to two excellent repositories that seemed to contain examples of all the techniques we discussed:

  • AWD-LSTM Language Model, which is a very recent release that shows substantial improvements in state of the art for language modeling, using techniques that are likely to be useful across a range of NLP problems. It doesn’t work on the latest PyTorch, although it might not need too much tweaking to fix.
  • Word-level language modeling RNN, a simpler and really ancient (over 6 months old!) language modeling example, but still some useful example code

Two other interesting libraries to be aware of are:

  • Practical pytorch has some nice simple tutorial examples, although there are some significant problems around approach (e.g. no test sets!), performance, and occasionally clunky code. It also doesn’t take advantage of the torchtext library, which makes for some redundant code. But the problems are really nicely chosen and the descriptions are clear.
  • torchtext is a small but convenient library for some basic text processing tasks, and also provides convenient access to a few datasets.

Techniques to get state of the art (SotA) results

In part 2 of the course we got pretty close to SotA in neural translation by showing how to use attentional models, dynamic teacher forcing, and of course stacked bidirectional LSTMs. But there have been some interesting approaches that have come to the fore since we developed that course, and Smerity suggested that combining the following should get the biggest wins across a range of NLP tasks without much additional complexity:

  • Two of my favorite researchers, Dyer and Blunsom, and the extraordinary Kaggle, computer olympiad, and Google AI winner Gabor Melis, published On the State of the Art of Evaluation in Neural Language Models, which curates a few best practices and uses hyper-parameter optimization to show great results from LSTMs
  • Smerity et al also recently released a paper curating and combining recent NLP techniques and got a big jump in SotA on language modeling, in Regularizing and Optimizing LSTM Language Models
  • Perhaps the most important addition in this paper is the use of pointer networks, which both Smerity et al (pointer sentinel) and Grave et al (continuous cache pointer) have extended to language modeling. Smerity uses the continuous cache for his most recent work and describes it as “The pointer is really really really simple! You could apply that to any LM model output and get similar gains. It’s more engineering though. All it is: store hidden state of LM as history (2000 timesteps), when generating a word calculate attention over that history using current LM hidden state, your word distribution is based on the next word according to where your hidden state is most similar in the past”.
  • Also shown as important in both the Melis and Smerity summary papers above is thoughtful use of regularization. Smerity found that Dropconnect (which doesn’t really help CNNs much over regular dropout) actually works very nicely for RNNs, which makes intuitive sense when you think about it… In his words “Given an RNN, h = Wx + Uh_t-1 + b, you apply dropout to the recurrent weight matrix U and have it the same over the entire forward + backward. Simple, fast, and prevents overfitting from h to h to h to …” He calls this the weight-dropped LSTM, and he implemented it in a very brief class called WeightDrop in the recent language modeling project.
  • Something that seems kinda obvious but actually only got written up a year ago is weight-tying, described in Using the Output Embedding to Improve Language Models. It simply means that the output weights and the input embedding weights are the same matrix, and it’s handled in the basic language model example above with a single line of code (self.decoder.weight = self.encoder.weight).
  • Finally, Smerity suggested adding “activation regularization (add L2 of LSTM output as loss) and temporal activation regularization (add L2 of h_t - h_t-1 to loss). That is like three lines of change and gets you near old SotA with weight tying.” The approach is described in this paper.
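
To make two of the ideas in the list above more concrete, here is a minimal sketch in plain Python (deliberately without PyTorch) of the cache pointer as Smerity describes it, along with the two activation regularization terms. It is an illustration under stated assumptions, not the code from either repository: the names `cache_distribution`, `mix`, `ar_tar_penalty`, `theta`, `lam`, `alpha` and `beta` are all made up for this example, attention is assumed to be a softmax over dot products, the cache is assumed to be blended with the model's output by simple linear interpolation, and “L2” is taken to mean the mean of squared values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cache_distribution(history, h, vocab_size, theta=1.0):
    """Build a word distribution from past hidden states.

    history: list of (hidden_state, next_word_id) pairs, i.e. each stored
             hidden state together with the word that followed it.
    h:       the current hidden state.
    theta:   hypothetical sharpness parameter for the attention.
    """
    # Attention over the history: how similar is each past state to now?
    attn = softmax([theta * dot(h_past, h) for h_past, _ in history])
    # Each past position "votes" for the word that came next at that point.
    p_cache = [0.0] * vocab_size
    for weight, (_, next_word) in zip(attn, history):
        p_cache[next_word] += weight
    return p_cache


def mix(p_model, p_cache, lam=0.1):
    """Linearly interpolate the model's distribution with the cache's."""
    return [(1 - lam) * pm + lam * pc for pm, pc in zip(p_model, p_cache)]


def ar_tar_penalty(hidden_states, alpha=1.0, beta=1.0):
    """Extra loss terms computed from a list of T hidden-state vectors.

    AR:  alpha * mean of squared activations.
    TAR: beta * mean of squared differences between consecutive states.
    alpha and beta are illustrative defaults, not tuned values.
    """
    flat = [x for h in hidden_states for x in h]
    ar = alpha * sum(x * x for x in flat) / len(flat)
    diffs = [
        cur - prev
        for h_prev, h_cur in zip(hidden_states, hidden_states[1:])
        for prev, cur in zip(h_prev, h_cur)
    ]
    tar = beta * sum(d * d for d in diffs) / len(diffs) if diffs else 0.0
    return ar + tar


# Usage: two remembered states; the current state resembles the first one,
# so the word that followed it (id 2) gets most of the cache's probability.
history = [([1.0, 0.0], 2), ([0.0, 1.0], 0)]
p_cache = cache_distribution(history, [5.0, 0.0], vocab_size=3)
p_final = mix([1 / 3, 1 / 3, 1 / 3], p_cache, lam=0.5)
penalty = ar_tar_penalty([[0.0], [2.0]])
```

A real implementation would work on batched tensors, cap the history at a fixed number of timesteps, and add the penalty to the training loss; the point here is only that both ideas come down to a few lines of arithmetic.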

Problems to solve

Interesting problems to solve in the NLP world, with meaningful benchmarks, include:

  • Sentiment analysis, or similar standard classification or regression tasks, are great for learning since they’re very similar to other classification and regression problems, so you can focus on the text analysis issues rather than the problem description. And they’re widely useful. There’s a huge dataset of Amazon reviews that can be used, although I’m not aware of good benchmarks. The IMDB review dataset has been widely studied so there are plenty of good benchmarks (we already use this dataset in our MOOC). Any group of documents that have some kind of categories can be used too, such as the 20 newsgroup dataset that we used in our computational linear algebra course. (BTW I don’t recommend using Stanford’s sentiment dataset, since it contains phrase annotations as well as sentence annotations, which seems rather artificial and designed to make the researchers’ tree-based techniques look more impressive.)
  • Language modeling, which basically means trying to predict the next word given the previous words in a sentence. The standard benchmark for this is ‘PTB’ (Penn Treebank), which is available in the PyTorch language modeling example repo.
  • Language translation, for which there are many datasets and tutorials around nowadays, although I like to think that our dataset and approach from part 2 of our course isn’t too bad…
  • Textual entailment, which is best explained by the examples on Stanford’s Natural Language Inference (SNLI) Corpus page. You get pairs of sentences, and have to say if they’re in agreement or they contradict. E.g. “A man inspects the uniform of a figure in some East Asian country” and “The man is sleeping” would be labeled as a contradiction. As well as the above corpus, the excellent Sam Bowman and friends have also recently released The Multi-Genre NLI Corpus; the two datasets can be combined. It’s a cool demonstration of the power of deep learning, although it’s not entirely clear (to me at least) whether it’s actually useful…
  • Text summarization is an area of active research with some recent significant advances, although I’m not yet clear on whether it’s useful in practice.
  • A lot of language workers spend their time on sub-tasks, such as part-of-speech tagging and entity recognition, which can then be used in larger applications. These particular tasks are part of the area of sequence labeling, about which NLP researcher Yoav Goldberg tweeted: “tons of information extraction tasks can be modeled as seq labeling. It’s a killer app.”

Advice to Medical Experts Interested in AI

This week’s Ask-A-Data-Scientist column is from a medical doctor. Email your data science advice questions to rachel@fast.ai. Previous posts include:

Q: I’m a medical doctor (MD-PhD). I do a mix of clinical work and basic science research. My research primarily involves small scale animal studies for hypothesis testing, although others in my lab do some statistical clinical studies, such as paired cohort analysis. I’m interested in AI and wondering if and how it can be applied to my field?

A: AI is being applied to several fields of medicine, including:

  • Diabetic retinopathy is the fastest growing cause of blindness. The first step in the screening process is for ophthalmologists to examine a picture of the back of the eye, yet in many parts of the world there are not enough specialists available to do so. Researchers at Google and Stanford have used deep learning to create computer models that are as accurate as human ophthalmologists. This technology could help doctors screen a larger number of patients faster, helping to alleviate the global shortage of doctors.

  • In 2012, Merck sponsored a drug discovery competition where participants were given a dataset describing the chemical structure of thousands of molecules and asked to predict which were most likely to make for effective drugs. Remarkably, the winning team had only decided to enter the competition at the last minute and had no specific knowledge of biochemistry. They used deep learning.

  • A 2016 New York Times article reported that medical start-up Enlitic, founded by fast.ai’s Jeremy Howard, was 50 percent more accurate than human radiologists in making lung cancer diagnoses.

Jeremy Howard, photographed by Jason Henry for the New York Times

  • Fast.ai remote fellow Xinxin Li is working with Ikaishe and Xeed to develop wearable devices for patients with Parkinson’s Disease. Traditionally, doctors observe the patient walking to assess disease progression, and wearable devices will allow for much more data and more precise data to be collected.

  • Deep learning can classify skin cancer with dermatologist-level accuracy, as published in Nature earlier this year.

  • Cardiogram is an app for Apple Watch that screens users’ cardio health and is able to detect atrial fibrillation, a common form of heart irregularity, with 97 percent accuracy.

Does this mean I need “big data”? No.

Currently, when news articles talk about “AI”, they are often referring to deep learning, one particular family of algorithms.

Although the above examples involve relatively large datasets, deep learning is being effectively applied to smaller and smaller datasets all the time. Here are a few examples that I listed in a previous blog post: Francois Chollet, creator of the popular deep learning library Keras and now at Google Brain, has an excellent tutorial entitled Building powerful image classification models using very little data in which he trains an image classifier on only 2,000 training examples. At Enlitic, Jeremy Howard led a team that used just 1,000 examples of lung CT scans with cancer to build an algorithm that was more accurate at diagnosing lung cancer than a panel of 4 expert radiologists. The C++ library Dlib has an example in which a face detector is accurately trained using only 4 images, containing just 18 faces!

Face Recognition with Dlib

Fast.ai student Ben Bowles wrote a post on how data platform Quid uses deep learning with small data to improve the quality of one of their data sets.

Transfer learning is a powerful technique in which models trained on larger data sets (by teams with more computational resources) can be fine-tuned to fit other problems, with less data and less computation. For instance, models originally trained on ImageNet (images in a set of 1,000 categories) are a good starting point for other computer vision problems (such as analyzing a CT scan for lung cancer). Transfer learning is a major focus of our Practical Deep Learning for Coders course.

How does machine learning compare to hypothesis testing or paired cohort analysis?

Machine learning shines at handling messy data. Techniques such as controlled studies or paired cohort analysis rely on carefully controlling for different variables in your experiment set-up (or in finding the pairings), whereas machine learning is an excellent choice when this isn’t possible.

Random Forests

Deep learning is just one kind of machine learning. Another machine learning algorithm is the random forest, which is great for observational studies.

The original random forest paper successfully tested the algorithm on a number of small data sets, including 569 images of a fine needle aspirate of a breast cancer mass, data from 345 men with liver disorders, and 336 E. coli samples.

Getting started

If you are at a university, seek out collaborators or interns. Note, if you have a student who knows how to code, they can learn deep learning. If you are looking for a collaborator, you do not need to find a deep learning expert. All you need is someone with a year or two of coding experience who is interested in your project and wants to learn deep learning. It’s even better if they are familiar with your research (for instance, perhaps a student who may already be working in your lab and knows how to code).

I recommend that you learn to code. Even if it’s not your area of focus and you will be collaborating with programmers, knowing some code will help you better understand what’s possible and have a better sense of what the programmers you’re collaborating with are doing.

For doctors, I think the best way to start coding is by learning R: R has the easiest-to-use implementation of random forests (which is a great general purpose machine learning algorithm) and R is commonly used by statisticians, so you will most likely meet biostatisticians who use it. RStudio is a relatively user-friendly, free environment in which to use R (although it still requires writing code). This free Coursera class, taught by biostatisticians from Johns Hopkins University, is one good way to get started. Folks that know me may be surprised by this recommendation: in general, I recommend that people interested in becoming data scientists learn Python; I recommend that teenagers or anyone who likes art or games learn JavaScript; and now I’m recommending that doctors learn R. You will need to learn Python if you start learning deep learning, but random forests in R are a great place to get started with machine learning (and random forests still produce top quality results in many areas; they are not just for beginners!).

Sponsor a Deep Learning Diversity Scholarship

Last year’s diversity fellowships (funded by University of San Francisco and fast.ai), open to women, people of Color, LGBTQ people, and vets, played a role in helping us create a diverse community. However, we need your help to be able to offer additional scholarships this year. If your company, firm, or organization would be willing to sponsor diversity scholarships ($1,500 each), please email rachel@fast.ai.

Deep learning is incredibly powerful and is being used to diagnose cancer, stop deforestation of endangered rainforests, provide better crop insurance to farmers in India, provide better language translation than humans, improve energy efficiency, and more. To find out why diversity in AI is a crucial issue, read this post on the AI diversity crisis.

While many in tech are bemoaning the “skills gap” or “talent shortage” in trying to hire AI practitioners, we at fast.ai set out 1 year ago with a novel experiment: could we teach deep learning to coders with no pre-requisites beyond just 1 year of coding experience? Other deep learning materials often assume an advanced math background, yet we were able to get our students to the state of the art, through a practical, hands-on approach in our part-time course, without the advanced math requirements. Our students have been incredibly successful and their stories include the following:

  • Sara Hooker, who only started coding 2 years ago, and is now part of the elite Google Brain Residency
  • Tim Anglade, who used TensorFlow to create the Not Hot Dog app for HBO’s Silicon Valley, leading Google’s CEO to tweet “our work here is done”
  • Gleb Esman, who created a new fraud product for Splunk using the tools he learnt in the course, and was featured on Splunk’s blog
  • Jacques Mattheij, who built a robotic system to sort two tons of LEGO
  • Karthik Kannan, founder of letsenvision.com, who told us “Today I’ve picked up steam enough to confidently work on my own CV startup and the seed for it was sowed by fast.ai with Pt1. and Pt.2”
  • Matthew Kleinsmith and Brendon Fortuner, who in 24 hours built a system to add filters to the background and foreground of videos, giving them victory in the 2017 Deep Learning Hackathon.

For those interested in applying for our diversity fellowships (to take our course), read this post for details.

Diversity Crisis in AI, 2017 edition

Deep learning has great potential, but currently the people using this technology are overwhelmingly white and male. We’re already seeing society’s racial and gender biases being encoded into software that uses AI when built by such a homogeneous group. Additionally, people can’t address problems that they’re not aware of, and with more diverse practitioners, a wider variety of important societal problems will be tackled.

Deep Learning has great potential

Deep learning is being used by fast.ai students and teachers to diagnose cancer, stop deforestation of endangered rainforests, provide better crop insurance to farmers in India (who otherwise have to take predatory loans from thugs, which have led to high suicide rates), help Urdu speakers in Pakistan, develop wearable devices for patients with Parkinson’s disease, and much more. Deep learning offers hope of a way for us to fill the global shortage of doctors, providing more accurate medical diagnoses and potentially saving millions of lives. It could improve energy efficiency, increase farm yields, reduce pesticide use, and more.

We want to get deep learning into the hands of as many people as possible, from as many diverse backgrounds as possible. People with different backgrounds have different problems they’re interested in solving. The traditional approach is to start with an AI expert and then give them a problem to work on; at fast.ai we want people who are knowledgeable and passionate about the problems they are working on, and we’ll teach them the deep learning needed to address them.

Deep Learning can be misused

Deep learning isn’t “more biased” than simpler models such as regression; however, the amazing effectiveness of deep learning suggests that it will be used in far more applications. As a society, we risk encoding our existing gender and racial biases into algorithms that determine medical care, employment decisions, criminal justice decisions, and more. This is already happening with simple models, but the widespread adoption of deep learning will rapidly accelerate this trend. The next 5 to 10 years are a particularly crucial time. We must get more women and people of Color building this technology in order to recognize, prevent, or address these biases.

Earlier this year, Taser (now rebranded Axon), the maker of the electronic stun guns, acquired two AI companies. Taser/Axon owns 80% of the police body camera market in the US, keeps this footage from police body cams in private databases, and is now advertising that they are developing technology for “predictive policing”. As a private company they are not subject to the same public records laws or oversight that police departments are. Given that racial bias in policing has been well-documented and shown to create negative feedback loops, this is terrifying. What kind of biases may be in their datasets or algorithms?

Google’s popular Word2Vec language library (covered in Lesson 5 of our course and in a workshop I gave this summer) has learned meaningful analogies, such as man is to king as woman is to queen. However, it also creates sexist analogies such as man is to computer programmer as woman is to homemaker. This is concerning as Word2Vec has become a commonly used building block in a wide variety of applications. This is not the first (or even second) time Google’s use of deep learning has shown troubling biases. In 2015, Google Photos labeled Black people as “gorillas” while automatically labeling photos. Google Translate continues to provide sexist translations such as translating “O bir doktor. O bir hemşire” to “He is a doctor. She is a nurse” even though the original Turkish did not specify gender.
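
For readers who haven’t seen how such analogies are computed, here is a minimal sketch in plain Python. It assumes the usual vector-arithmetic approach: take the vector for “king”, subtract “man”, add “woman”, and return the nearest remaining word by cosine similarity. The three-dimensional vectors below are hand-made toy values for illustration only (real Word2Vec embeddings have hundreds of dimensions and are learned from text), and the function names are hypothetical.

```python
import math

# Hand-made toy vectors, NOT real Word2Vec embeddings.
toy_vectors = {
    "man": [1.0, 0.0, 0.1],
    "woman": [0.0, 1.0, 0.1],
    "king": [1.0, 0.0, 0.9],
    "queen": [0.0, 1.0, 0.9],
    "apple": [0.5, 0.5, -0.8],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # The target point is b - a + c in embedding space.
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # Exclude the three query words, then pick the closest remaining word.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("man", "king", "woman", toy_vectors)
print(result)  # prints "queen" for these toy vectors
```

The arithmetic itself has no notion of which associations are desirable: it returns whatever lies nearest in the learned space, so any association present in the training text can surface in the same way.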

The state of diversity in AI

A year after prominent Google AI leader Jeff Dean said he is deeply worried about the lack of diversity in AI, guess what the diversity stats of the Google Brain team are? It is ~94% male, with 44 men and just 3 women, and over 70% White. OpenAI’s openness does not extend to sharing diversity stats or who works there, and from photos, the OpenAI team looks extremely homogeneous. I’d guess that it’s even less diverse than Google Brain. Earlier this year Vanity Fair ran an article about AI that featured 60 men, without quoting a single woman who works in AI.

Google Brain, OpenAI, and the media can’t solely blame the pipeline for this lack of diversity, given that there are over 1,000 women active in machine learning. Furthermore, Google has a training program to bring engineers in other areas up to speed on AI, which could be a great way to increase diversity. However, this program is only available to Google engineers, and just 3% of Google’s technical employees are Black or Latino (despite the fact that 90,000 Black and Latino students have graduated with computer science majors in the US in the last decade); thus, this training program is not going to have much impact on diversity.

fast.ai diversity scholarships

At fast.ai, we want to do our part to increase diversity in this powerful field. Therefore, we are providing diversity scholarships for our updated in-person Practical Deep Learning for Coders course presented in conjunction with the University of San Francisco Data Institute, to be offered on Monday evenings in downtown San Francisco and beginning on Oct 30. The only requirements are:

  • At least 1 year of coding experience
  • At least 8 hours a week to commit to the course (includes time for homework)
  • Curiosity and a willingness to work hard
  • Identify as a woman, person of Color, LGBTQ person, and/or veteran
  • Be available to attend in-person 6:30-9pm, Monday evenings, in downtown San Francisco (SOMA)

You can find more details about how to apply in this post.