There is a lot of misleading and even false information about AI out there, ranging from apallingly bad journalism to overhyped marketing materials to quotes from misinformed celebrities. Last month, it even got so bad that Snopes had to debunk a story about Facebook research that was inaccurately covered by a number of outlets.
AI is a complex topic moving at an overwhelming pace (even as someone working in the field, I find it impossible to keep up with everything that is happening). Beyond that, there are those who stand to profit off overhyping advances or drumming up fear.
I want to recommend several credible sources of accurate information. Most of the writing on this list is intended to be accessible to anyone—even if you aren’t a programmer or don’t work in tech:
Jack Clark’s email newsletter, Import AI, provides highlights and summaries of a selection of AI news and research from the previous week. You can check out previous issues (or sign up) here. Jack Clark is Strategy & Communications Director at OpenAI.
Mariya Yao’s writing on Topbots and on Forbes. Mariya is CTO and head of R&D for Topbots, a strategy and research firm for applied artificial intelligence and machine learning. Fun fact: Mariya worked on the LIDAR system for the 2nd place winner in the DARPA grand challenge for autonomous vehicles.
Zeynep Tufekci, a professor at UNC-Chapel Hill, is an expert on the interactions between technology and society. She shares a lot of important ideas on twitter, or read her New York Times op-eds here.
Kate Crawford is a professor at NYU, principal researcher at Microsoft, and co-founder of the AI Now Research institute, dedicated to studying the social impacts of AI. You can follow her on twitter here.
I also want to highlight a few great examples of AI researchers thoughtfully deconstructing the hype around some high-profile stories in the past few months, in an accessible way:
Denny Britz provides a balanced perspective on OpenAIs bot for Defense of the Ancients (a popular computer game).
Twitter is quite useful for keeping up on machine learning news and many people share surprisingly deep insights (that I often can’t find elsewhere). I was skeptical of Twitter before I started using it. The whole idea seemed weird: you can only write a 140 characters at a time? I already had Facebook and Linkedin, did I really need another social media account? It now occupies a useful and distinct niche for me. The hardest part is getting started; feel free to take a look at my twitter or Jeremy’s favorites to look for interesting accounts. Whenever I read an article I like or hear a talk I like, I always look up the author/speaker on twitter and see if I find their tweets interesting. If so, I follow them.
I will update this post based on constructive feedback as I receive it (with attribution)—although I’ve tried to largely stick to my area of specialty (data science) I’ve had to touch on various areas that I’m not an expert in, so please do let me know if you notice any issues. You can reach me on Twitter at @jeremyphoward.
When I first read about this study, I had a strong negative emotional response. The topic is of great personal interest to me—Rachel Thomas and I started fast.ai explicitly for the purpose of increasing diversity in the field of deep learning (including the deep neural networks used in this study), and we even personally pay for scholarships for diverse students, including LGBTQ students. In addition, we want to support the use of deep learning in a wider range of fields, because we believe that it can both positively and negatively impact many people’s lives, so we want to show how to use the technology appropriately and correctly.
Like many commentators, I had many basic concerns about this study. Should it have been done at all? Was the data collection an invasion of privacy? Were the right people involved in the work? Were the results communicated in a thoughtful and sensitive way? These are important issues, and can’t be answered by any single individual. Because deep learning is making it possible for computers to do things that weren’t possible before, we’re going to see more and more areas where these questions are going to arise. Therefore, we need to see more cross-disciplinary studies being done by more cross-disciplinary teams. In this case, the researchers are data scientists and psychologists, but the paper covers topics (and claims to reach conclusions) in fields from sociology to biology.
So, what does the paper actually show - can neural nets do what is claimed, or not? We will analyze this question as a data scientist - by looking at the data.
The key conclusions of both the paper (“Deep neural networks can detect sexual orientation from faces”) and the response (“AI can’t tell if you’re gay”) are not supported by the research shown. What is supported is a weaker claim: in some situations deep neural networks can recognize some photos of gay dating website users from some photos of heterosexual dating website users. We definitely can’t say “AI can’t tell if you’re gay”, and indeed to make this claim is irresponsible: the paper at least shows some sign that the opposite might well be true, and that such technology is readily available and easily used by any government or organization.
The senior researcher on this paper, Michael Kosinski, has successfully warned us of similar problems in the past: his paper Private traits and attributes are predictable from digital records of human behavior is one of the most cited of all time, and was at least partly responsible for getting Facebook to change their policy of having ‘likes’ be public by default. If the key result in this new study does turn out to be correct, then we should certainly be having a discussion about what policy implications it has. If you live in a country where homosexuality is punishable by death, then you need to be open to the possibility that you could be profiled for extra surveillance based on your social media pictures. If you are in a situation where you can’t be open about your sexual preference, you should be aware that a machine learning recommendation system could (perhaps even accidentally) target you with merchandise targeted to a gay demographic.
However, the paper comes to many other conclusions that are not directly related to this key question, are not clearly supported by the research, and are overstated and poorly communicated. In particular, the paper claims that the research supports the “widely accepted” prenatal hormone theory (PHT) that “same-gender sexual orientation stems from the under-exposure of male fetuses or overexposure of female fetuses to androgens that are responsible for sexual differentiation”. The support for this in the paper is far from rigorous, and should be considered inconclusive. Furthermore, the sociologist Greggor Mattson says that not only is the theory not widely accepted, but that “literally the first sentence of a decade-old review of the field is ‘Public perceptions of the effect of testosterone on ‘manly’ behavior are inaccurate’”.
How was the research was done?
A number of studies are presented in the paper, but the key one is ‘study 1a’. In this study, the researchers downloaded on average 5 images of 70,000 people from a dating website. None of the data collected in the study has been made available, although nearly any programmer could easily replicate this (and indeed many coders have created similar datasets in the past). Because the study’s focus is on detecting sexual orientation from faces, they cropped the photos to the area of the face. They also removed photos with multiple people, or where the face wasn’t clear, or wasn’t looking straight at the camera. The technical approach here is very standard and reliable, using widely used open source software called Face++.
They then removed any images that a group of non-expert workers considered either not adult, or not caucasian (using Amazon’s Mechanical Turk system). It wasn’t entirely clear why they did this; most likely they were assuming that handling more types of face would make it harder to train their model.
It’s important to be aware that steps like this used to ‘clean’ a dataset are necessary for nearly all data science projects, but they are rarely if ever perfect - and those imperfections are generally not important in understanding the accuracy of a study. What’s important for evaluation is to be confident that the final metrics reported are evaluated appropriately. More on this shortly…
They labeled each as gay or not based on each dating profile’s listed sexual preference.
The researchers then used a deep neural network (VGG-Face) to create features. Specifically, each image was turned into 4096 numbers, each of which had been trained by University of Oxford researchers to be as good as possible for recognizing humans from their faces. They compressed those 4096 numbers down 500 using a simple statistical technique called SVD, and they then used a simple regression model to map these 500 numbers to the label (gay or not).
They repeated the regression 10 times. Each time they used a different 90% subset of the data, and tested the model using the remaining 10% (this is known as cross-validation). The ten models were scored using a metric called AUC which is a standard approach for evaluating classification models like this one. The AUC for people marked as males in the dataset was 0.91.
How accurate is this model?
The researchers describe their model as “91% accurate”. This is based on the AUC score of 0.91. However, it is very unusual and quite misleading to use the word “accuracy” to describe AUC. The researchers have clarified that the actual accuracy of the model can be understood as follows: if you pick the 10% of people in the study with the highest scores on the model, about half will actually be gay based on the collected labels. If the actual percentage of gay males is 7%, then this shows that the model is a lot better than random. However it is probably not as accurate as most people would imagine if they heard something was “91% accurate”.
It is also important to note that based on this study (study 1a) we can only really say that the model can recognize gay dating profiles from one web-site of people that non-experts label as adult caucasian, not that it can recognize gay photos in general. It’s quite likely that this model will generalize to other similar populations, but we don’t know from this research how similar those populations would need to be, and how accurate it will be.
Have the researchers created a new technology here?
The approach used in this study is literally the very first technique that we teach in our introductory deep learning course. Our course does not require an advanced math background - only high school math is need. So the approach used here is literally something anyone can do with high school math, an hour of free online study, and a basic knowledge of programming.
A model trained in this way takes under 20 seconds to run on a commodity server that can be rented for $0.90/hour. So it does not require any special or expensive resources. The data can be easily downloaded from dating websites by anyone with basic coding skills.
The researchers say that their study shows a potential privacy problem. Since the technology they used is very accessible, then if you believe that the capabilities shown are of concern, then this claim seems reasonable.
It is probably reasonably to assume that many organizations have already completed similar projects, but without publishing them in the academic literature. This paper showing what can already be easily done—it is not creating a new technology. It is becoming increasingly common for marketers to use social media data to help push their products; in these cases the models simply look for correlations between product sales and and social media data that is available. In this case it would be very easy for a model in implicitly find a relationship between certain photos and products targeted to a gay market, without the developers even realizing it had made that connection. Indeed, we have seen somewhat similar issues before such as that described in the article How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did.
Did the model show that gay faces are physically different?
In study 1b, the researchers cover up different parts of each image, to see which parts when covered up cause the prediction to change. This is a common technique for understanding the relative importance of different parts of an input to a neural network.
The results this analysis are shown in this picture from the paper:
The red areas are relatively more important to the model than the blue areas. However, this analysis does not show how much more important it is, or why or in what way the red areas were more important.
In study 1c they try to create an “average face” for each of male and female and for each of gay and heterosexual. This part of the study has no rigorous analysis and relies entirely on an intuitive view of the images shown. From a data science point of view, there is no additional information that can be gained from this section.
The researchers claim that these studies show support for the prenatal hormone theory. However, no data is presented that shows how this theory is supported or what level of support is provided, nor investigates possible alternative theories for the observations.
Is the model more accurate than humans?
The researchers claim in the first sentence of the abstract that “faces contain much more information about sexual orientation than can be perceived and interpreted by the human brain”. They base this claim on Study 4, in which they ask humans to classify images from the same dataset as study 1a. However, this study completely fails to provide an adequate methodology to support the claim. Stanford researcher Andrej Karpathy (now at Tesla) showed a fairly rigorous approach to how human image classification can be compared to a neural net. The key piece is to give the human the same opportunity to study the training data as the computer received. In this case, that would mean letting each human judge study many examples of the faces and labels collected in the dataset, before being asked to classify faces themselves.
By failing to provide this “human training” step, the humans and computers had very different information with which to complete the task. Even if the methodology was better, there would still be many possible explanations other than the very strong and unsupported claim that they decided to open their paper with.
As a rule, academic claims should be made with care and rigor and communicated thoughtfully. Especially when it opens a paper. Especially especially when it is in such a sensitive area. Double especially especially when it covers an area outside the researchers’ area of specialty. This issue is thoughtfully discussed with relation to this paper in a Calling Bullshit case study.
Is the classifier effective for images other than dating pics from one website?
In short: we don’t know. Study 5 in the paper asserts that it is, but it does not provide strong support for this claim, and is set up in a oddly convoluted way. The method used for study 5 was to find facebook pics from some extra-super-gay facebook users: people who listed a same-sex partner, and liked at least two pages such as “Manhunt” and “I love being gay”. It then tried to see if they could train a classifier to separate these pics from heterosexual dating web-site users. The claimed accuracy of this classifier was 74%, although the exact meaning of this 74% figure is not listed. If it means an AUC of 0.74 (which is how the researchers referred to AUC earlier in the paper), this is not a strong result. It’s also comparing across datasets (facebook vs dating website), and using a very particular type of facebook profile to do their test.
The researchers state that they didn’t compare to heterosexual profile pics because they didn’t know how to find them.
Are their conclusions supported by their studies?
In the General Discussion section the researchers come to a number of conclusions. All of the conclusions as stated are stronger than what can be concluded from the research results shown. However, we can at least say (assuming that their data analysis was completed correctly, which we can’t confirm since we don’t have access to their data or code) that sexual preference of people in some photos can be identified much better than randomly in some situations.
They conclude that their model does not simply find differences in presentation between the two groups, but actually shows differences in underlying facial structure. This claim is based partly on an assertion that the VGG-Face model they use is trained to identify non-transient facial features. However, some simple data analysis readily shows that this assertion is incorrect. Victoria University researcher Tom White shared an analysis that showed this exact model, for instance, can recognize happy from neutral faces with a higher AUC (0.92) than the model shown in this paper (and can recognize happy from sad faces even better, with an AUC of 0.96).
Did the paper confuse correlation with causation?
Any time that a paper from social scientists comes to the attention of groups of programmers (such as when shared on coding forums), inevitably we hear the cry “correlation is not causation”. This happened for this paper too. What does this mean, and is it a problem here? Naturally, XKCD has us covered:
Correlation refers to an observation that two things happen at the same time. For instance, you may notice that on days when people buy more ice-cream, they also buy more sun-screen. Sometimes people incorrectly assume that such an observation implies causation—in this case, that eating ice-cream causes people to want sunscreen. When a correlation is observed between some event x (buying ice-cream) and event y (buying sunscreen), there are three main possibilities:
x causes y
y causes x
something else causes both x and y (possibly indirectly)
pure chance (we can measure the probability of this happening - in this study it is vanishingly low)
In this case, of course, a warm and sunny day causes both the desire for ice-cream, and the need for sunscreen.
Much of the social sciences deals with this issue. Researchers in these fields often have to try to reach conclusions from observational studies in the presence of many confounding factors. This is a complex and challenging task, and often results in imperfect results. For mathematicians and computer scientists, results from the social sciences can seem infuriatingly poorly founded. In math, if you want to claim that, for example, no three positive integers a, b, and c satisfy the equation an + bn = cn for any integer value of n greater than 2, then it doesn’t matter if you try millions of values of a, b, and c and show none have this relationship—you have to prove it for all possible integers. But in the social sciences this kind of result is generally not possible. So we must try to weigh the balance of evidence versus our a priori expectations regarding the results.
The Stanford paper tries to separate out correlation from causation by using various studies, as discussed above. And in the end, they don’t do a great job of it. But the simple claim that “correlation is not causation” is a sloppy response. Instead, alternative theories need to be provided, preferably with evidence: that is, can you make a claim that y causes x, or that something else causes both x and y, and show that your alternative theory is supported by the research shown in the paper?
In addition, we need to consider the simple question: does it actually matter? E.g. if it is possible for a government (or an over-zealous marketer) to classify a photo of a face by sexual orientation, mightn’t this be an important result regardless of whether the cause of the identified differences are grooming, facial expression, or facial structure?
Should we worry about privacy?
The paper concludes with a warning that governments are already using sophisticated technology to infer intimate traits of citizens, and that it is only through research like this that we can guess what kind of capabilities they have. They state:
Delaying or abandoning the publication of these findings could deprive individuals of the chance to take preventive measures and policymakers the ability to introduce legislation to protect people. Moreover, this work does not offer any advantage to those who may be developing or deploying classification algorithms, apart from emphasizing the ethical implications of their work. We used widely available off-the-shelf tools, publicly available data, and methods well known to computer vision practitioners. We did not create a privacy-invading tool, but rather showed that basic and widely used methods pose serious privacy threats.
These are genuine concerns and it is clearly a good thing for us all to understand the kinds of tools that could be used to reduce privacy. It is a shame that the overstated claims, weak cross-disciplinary research, and methodological problems clouded this important issue.
The next fast.ai courses will be based nearly entirely on a new framework we have developed, built on Pytorch. Pytorch is a different kind of deep learning library (dynamic, rather than static), which has been adopted by many (if not most) of the researchers that we most respect, and in a recent Kaggle competition was used by nearly all of the top 10 finishers.
We have spent around a thousand hours this year working with Pytorch to get to this point, and we are very excited about what it is allowing us to do. We will be writing a number of articles in the coming weeks talking about each aspect of this. First, we will start with a quick summary of the background to, and implications of, this decision. Perhaps the best summary, however, is this snippet from the start of our first lesson:
fast.ai’s teaching goal
Our goal at fast.ai is for there to be nothing to teach. We believe that the fact that we currently require high school math, one year of coding experience, and seven weeks of study to become a world-class deep learning practitioner, is not an acceptable state of affairs (even although this is less prerequisites for any other course of a similar level). Everybody should be able to use deep learning to solve their problems with no more education than it takes to use a smart phone. Therefore, each year our main research goal is to be able to teach a wider range of deep learning applications, that run faster, and are more accurate, to people with less prerequisites.
We want our students to be able to solve their most challenging and important problems, to transform their industries and organisations, which we believe is the potential of deep learning. We are not just trying to teach people how to get existing jobs in the field — but to go far beyond that.
Therefore, since we first ran our deep learning course, we have been constantly curating best practices, and benchmarking and developing many techniques, trialling them against Kaggle leaderboards and academic state-of-the-art results.
Why we tried Pytorch
As we developed our second course, Cutting-Edge Deep Learning for Coders, we started to hit the limits of the libraries we had chosen: Keras and Tensorflow. For example, perhaps the most important technique in natural language processing today is the use of attentional models. We discovered that there was no effective implementation of attentional models for Keras at the time, and the Tensorflow implementations were not documented, rapidly changing, and unnecessarily complex. We ended up writing our own in Keras, which turned out to take a long time, and be very hard to debug. We then turned our attention to implementing dynamic teacher forcing, for which we could find no implementation in either Keras or Tensorflow, but is a critical technique for accurate neural translation models. Again, we tried to write our own, but this time we just weren’t able to make anything work.
At that point the first pre-release of Pytorch had just been released. The promise of Pytorch was that it was built as a dynamic, rather than static computation graph, framework (more on this in a later post). Dynamic frameworks, it was claimed, would allow us to write regular Python code, and use regular python debugging, to develop our neural network logic. The claims, it turned out, were totally accurate. We had implemented attentional models and dynamic teacher forcing from scratch in Pytorch within a few hours of first using it.
Some pytorch benefits for us and our students
The focus of our second course is to allow students to be able to read and implement recent research papers. This is important because the range of deep learning applications studied so far has been extremely limited, in a few areas that the academic community happens to be interested in. Therefore, solving many real-world problems with deep learning requires an understanding of the underlying techniques in depth, and the ability to implement customised versions of them appropriate for your particular problem, and data. Because Pytorch allowed us, and our students, to use all of the flexibility and capability of regular python code to build and train neural networks, we were able to tackle a much wider range of problems.
An additional benefit of Pytorch is that it allowed us to give our students a much more in-depth understanding of what was going on in each algorithm that we covered. With a static computation graph library like Tensorflow, once you have declaratively expressed your computation, you send it off to the GPU where it gets handled like a black box. But with a dynamic approach, you can fully dive into every level of the computation, and see exactly what is going on. We believe that the best way to learn deep learning is through coding and experiments, so the dynamic approach is exactly what we need for our students.
Much to our surprise, we also found that many models trained quite a lot faster on pytorch than they had on Tensorflow. This was quite against the prevailing wisdom, that said that static computation graphs should allow for more optimization to be done, which should have resulted in higher performance in Tensorflow. In practice, we’re seeing some models are a bit faster, some a bit slower, and things change in this respect every month. The key issues seem to be that:
Improved developer productivity and debugging experience in Pytorch can lead to more rapid development iterations, and therefore better implementations
Smaller, more focussed development community in Pytorch looks for “big wins” rather than investing in micro-optimization of every function.
Why we built a new framework on top of Pytorch
Unfortunately, Pytorch was a long way from being a good option for part one of the course, which is designed to be accessible to people with no machine learning background. It did not have anything like the clear simple API of Keras for training models. Every project required dozens of lines of code just to implement the basics of training a neural network. Unlike Keras, where the defaults are thoughtfully chosen to be as useful as possible, Pytorch required everything to be specified in detail. However, we also realised that Keras could be even better. We noticed that we kept on making the same mistakes in Keras, such as failing to shuffle our data when we needed to, or vice versa. Also, many recent best practices were not being incorporated into Keras, particularly in the rapidly developing field of natural language processing. We wondered if we could build something that could be even better than Keras for rapidly training world-class deep learning models.
After a lot of research and development it turned out that the answer was yes, we could (in our biased opinion). We built models that are faster, more accurate, and more complex than those using Keras, yet were written with much less code. We’ve implemented recent papers that allow much more reliable training of more accurate models, across a number of fields.
The key was to create an OO class which encapsulated all of the important data choices (such as preprocessing, augmentation, test, training, and validation sets, multiclass versus single class classification versus regression, et cetera) along with the choice of model architecture. Once we did that, we were able to largely automatically figure out the best architecture, preprocessing, and training parameters for that model, for that data. Suddenly, we were dramatically more productive, and made far less errors, because everything that could be automated, was automated. But we also provided the ability to customise every stage, so we could easily experiment with different approaches.
With the increased productivity this enabled, we were able to try far more techniques, and in the process we discovered a number of current standard practices that are actually extremely poor approaches. For example, we found that the combination of batch normalisation (which nearly all modern CNN architectures use) and model pretraining and fine-tuning (which you should use in every project if possible) can result in a 500% decrease in accuracy using standard training approaches. (We will be discussing this issue in-depth in a future post.) The results of this research are being incorporated directly into our framework.
There will be a limited release for our in person students at USF first, at the end of October, and a public release towards the end of the year. (By which time we’ll need to pick a name! Suggestions welcome…) (If you want to join the in-person course, there’s still room in the International Fellowship program.)
What should you be learning?
If it feels like new deep learning libraries are appearing at a rapid pace nowadays, then you need to be prepared for a much faster rate of change in the coming months and years. As more people enter the field, they will bring more skills and ideas, and try more things. You should assume that whatever specific libraries and software you learn today will be obsolete in a year or two. Just think about the number of changes of libraries and technology stacks that occur all the time in the world of web programming — and yet this is a much more mature and slow-growing area than deep learning. So we strongly believe that the focus in learning needs to be on understanding the underlying techniques and how to apply them in practice, and how to quickly build expertise in new tools and techniques as they are released.
By the end of the course, you’ll understand nearly all of the code that’s inside the framework, because each lesson we’ll be digging a level deeper to understand exactly what’s going on as we build and train our models. This means that you’ll have learnt the most important best practices used in modern deep learning—not just how to use them, but how they really work and are implemented. If you want to use those approaches in another framework, you’ll have the knowledge you need to develop it if needed.
To help students learn new frameworks as they need them, we will be spending one lesson learning to use Tensorflow, MXNet, CNTK, and Keras. We will work with our students to port our framework to these other libraries, which will make for some great class projects.
We will also spend some time looking at how to productionize deep learning models. Unless you are working at Google-scale, your best approach will probably be to create a simple REST interface on top of your Pytorch model, running inference on the CPU. If you need to scale up to very high volume, you can export your model (as long as it does not use certain kinds of customisations) to Caffe2 or CNTK. If you need computation to be done on a mobile device, you can either export as mentioned above, or use an on-device library.
How we feel about Keras
We still really like Keras. It’s a great library and is far better for fairly simple models than anything that came before. It’s very easy to move between Keras and our new framework, at least for the subset of tasks and architectures that Keras supports. Keras supports lots of backend libraries which means you can run Keras code in many places.
It has a unique (to our knowledge) approach to defining architectures where authors of custom layers are required to create a build() method which tells Keras what shape output it creates for a given input. This allows users to more easily create simple architectures because they almost never have to specify the number of input channels for a layer. For architectures like Densenet which concatenate layers it can make the code quite a bit simpler.
On the other hand, it tends to make it harder to customize models, especially during training. More importantly, the static computation graph on the backend, along with Keras’ need for an extra compile() phase, means that it’s hard to customize a model’s behaviour once it’s built.
What’s next for fast.ai and Pytorch
We expect to see our framework and how we teach Pytorch develop a lot as we teach the course and get feedback and ideas from our students. In past courses students have developed a lot of interesting projects, many of which have helped other students—we expect that to continue. Given the accelerating progress in deep learning, it’s quite possible that by this time next year, there will be very different hardware or software options that will make todays’ technology quite obsolete. Although based on the quick adoption of new technologies we’ve seen from the Pytorch developers, we suspect that they might stay ahead of the curve for a while at least…
In our next post, we’ll be talking more about some of the standout features of Pytorch, and dynamic libraries more generally.
In both our previous deep learning courses at USF (which were recorded and formed the basis of our MOOCs), we allowed students that could not participate in person to attend via video and text chat through our International Fellowship. As Rachel described in discussing our last course, this (along with our Diversity Fellowship program) was an important part of our mission:
“…we worked hard to curate a diverse group of participants, because we’d observed that artificial intelligence is missing out because of its lack of diversity. A study of 366 companies found that ethnically diverse companies are 35% more likely to perform well financially, and teams with more women perform better on collective intelligence tests. Scientific papers written by diverse teams receive more citations and have higher impact factors.”
In fact, many of our strongest students and most effective projects have come from the International Fellowship. By opening up the opportunity to learn deep learning in a collaborative environment, students have been able to apply this powerful technology to local problems in their area. For instance, past International Fellows are working to:
This year, we’re presenting an entirely new version of part 1 of our deep learning course, and today we’re launching the International Fellowship for it. The program allows those who can not get to San Francisco to attend virtual classes for free during the same time period as the in-person class and provides access to all the same online resources. (Note that International Fellowships do not provide an official completion certificate through USF). International fellows can come from anywhere on the planet other than San Francisco (including from the US), but need to be able to attend each class via Youtube Live at 6.30pm-9pm Pacific Time each Monday for 7 weeks from Oct 30, 2017 onwards. For many people that means getting up in the middle of the night—but our past students tell us it’s worth it!
Title your email “International Fellowship Application”
Include your resume
Write 1 paragraph describing one or more problems you’d like to apply deep learning to
State where you are located
Confirm that you that you can commit 8 hours a week to working on the course and that you are able to participate in each class via Youtube Live at 6.30pm-9pm Pacific Time each Monday for 7 weeks from Oct 30, 2017 onwards.
The deadline to apply is 10/6.
Frequently Asked Questions
Will the updated course be released online? Yes, the course will be recorded and released after the in-person version ends.
Can I apply again if I was an international fellow last year? Yes, you are welcome to apply again.
Do I get a certificate of completion for the international fellowship? No, the USF Data Institute only awards certificates to graduates of the in-person course.
Edit one day later… Much to my surprise a lot of people shared this on twitter, and much to my delight there were some very helpful and interesting comments from people I respect—so check out the thread here.
I cleverly trapped Smerity in a Twitter DM conversation while he was trapped on a train with nothing better to do than answer my dumb questions, and I managed to get a download of ~0.001% of what he knows about language modeling. It should be enough to keep me busy for a few months… The background of this conversation is that for “version 2” of our deep learning course at USF we’re curating and implementing in a consistent API the most important best practices in a range of deep learning applications, including computer vision, text, and recommendation systems. Unfortunately, for text applications the best practices are not really collected anywhere, hence the need for the Smerity-brain-dump.
I figured I’d make my notes on the conversation into a little blog post in case other people find this useful too. I’m assuming people are familiar with the topics covered in parts 1 & 2 of our MOOC, such as RNNs, dropout, attentional models, and neural translation. I’ve spent the day scouring the internet for other resources too and I’m incorporating some of my own research here, so if you see anything dumb it’ll almost certainly be my fault, not Smerity’s.
Pytorch code examples
Smerity pointed to two excellent repositories that seemed to contain examples of all the techniques we discussed:
AWD-LSTM Language Model, which is a very recent release that shows substantial improvements in state of the art for language modeling, using techniques that are likely to be useful across a range of NLP problems. Doesn’t work on the latest Pytorch, although might not need too much tweaking to fix
Two other interesting libraries to be aware of are:
Practical pytorch has some nice simple tutorial examples, although there are some significant problems around both approach (e.g. no test sets!), performance, and occassionally clunky code. Also it doesn’t take advantage of the torchtext library, which makes for some redudent code. But really nicely chosen problems and clear descriptions.
torchtext is a small but convenient library for some basic text processing tasks, and also provides convenient access to a few datasets.
Techniques to get state of the art (SotA) results
In part 2 of the course we got pretty close to SotA in neural translation by showing how to use attentional models, dynamic teacher forcing, and of course stacked bidirectional LSTMs. But there have been some interesting approaches that have come to the fore since we developed that course, and Smerity suggested that combining the following should get the biggest wins across a range of NLP tasks without much additional complexity:
Perhaps the most important addition in this paper is through using Pointer networks, which both Smerity et al (pointer sentinel) and Grave et al (continuous cache pointer) have extended to language modeling. Smerity uses the continuous cache for his most recent work and describes it as “The pointer is really really really simple! You could apply that to any LM model output and get similar gains. It’s more engineering though. All it is: store hidden state of LM as history (2000 timesteps), when generating a word calculate attention over that history using current LM hidden state, your word distribution is based on the next word according to where your hidden state is most similar in the past”.
Also shown as important in both the Melis and Smerity summary papers above is thoughtful use of regularization. Smerity found that Dropconnect (which doesn’t really help CNNs much over regular dropout) actually works very nicely for RNNs, which makes intuitive sense when you think about it… In his words “Given an RNN, h = Wx + Uh_t-1 + b, you apply dropout to the recurrent weight matrix U and have it the same over the entire forward + backward. Simple, fast, and prevents overfitting from h to h to h to …” He calls this the weight-dropped LSTM, and he implemented it in a very brief class called WeightDrop in the recent language modeling project..
Something that seems kinda obvious but actually only got written up a year ago is weight-tying, described in Using the Output Embedding to Improve Language Models. It simply means that the output weights and the input embedding weights are the same matrix, and it’s handled in the basic language model example above with a single line of code (self.decoder.weight = self.encoder.weight)
Finally, Smerity suggested adding “activation regularization (add L2 of LSTM output as loss) and temporal activation regularization (add L2 of ht - h_t-1 to loss). That is like three lines of change and gets you near old SotA with weight tying.” The approach is described in this paper.
Problems to solve
Interesting problems to solve in the NLP world, with meaningful benchmarks, include:
Sentiment analysis, or similar standard classification or regression tasks, are great for learning since they’re very similar to other classification and regression problems, so you can focus on the text analysis issues rather than the problem description. And they’re widely useful. There’s a huge dataset of Amazon reviews that can be used, although I’m not aware of good benchmarks. The IMDB review dataset has been widely studied so there’s plenty of good benchmarks (we already use this dataset in our MOOC). Any group of documents that have some kind of categories can be used too, such as the 20 newsgroup dataset that we used in our compuational linear algebra course. (BTW I don’t recommend using Stanford’s sentiment dataset, since it contains phrase annotations as well as sentence annotations, which seems rather artifical and designed to make the researchers’ tree-based techniques look more impressive.)
Language modeling, which basically means trying to predict the next word given the previous words in a sentence. The standard benchmark for this is ‘PTB’ (Penn Treebank), which is available in the Pytorch language modeling example repo.
Language translation, for which there are many datasets and tutorials around nowadays, although I like to think that our dataset and approach from part 2 of our course isn’t too bad…
Textual entailment, which is best explained by the examples on Stanford’s Natural Language Inference (SNLI) Corpus page. You get pairs of sentences, and have to say if they’re in agreement or they contradict. E.g. “A man inspects the uniform of a figure in some East Asian country” and “The man is sleeping” would be labeled as a contradiction. As well as the above corpus, the excellent Sam Bowman and friends have also recently released The Multi-Genre NLI Corpus; the two datasets can be combined. It’s a cool demonstration of the power of deep learning, although it’s not entirely clear (to me at least) whether it’s actually useful…
Text summarization is an area of active with some recent significant advances, although I’m not yet clear on whether it’s useful in practice
A lot of language workers spend their time on sub-tasks, such as part-of-speech tagging and entity recognition, which can then be used in larger applications. These particular tasks are part of the area of sequence labeling, about which NLP research Yoav Goldberg tweeted: “tons of information extraction tasks can be modeled as seq labeling. It’s a killer app.”