Notes on state of the art techniques for language modeling

Author: Jeremy Howard
Published: August 25, 2017

Edit, one day later… Much to my surprise a lot of people shared this on Twitter, and much to my delight there were some very helpful and interesting comments from people I respect, so check out the thread here.

I cleverly trapped Smerity in a Twitter DM conversation while he was trapped on a train with nothing better to do than answer my dumb questions, and I managed to get a download of ~0.001% of what he knows about language modeling. It should be enough to keep me busy for a few months… The background of this conversation is that for “version 2” of our deep learning course at USF we’re curating and implementing in a consistent API the most important best practices in a range of deep learning applications, including computer vision, text, and recommendation systems. Unfortunately, for text applications the best practices are not really collected anywhere, hence the need for the Smerity-brain-dump.

I figured I’d make my notes on the conversation into a little blog post in case other people find this useful too. I’m assuming people are familiar with the topics covered in parts 1 & 2 of our MOOC, such as RNNs, dropout, attentional models, and neural translation. I’ve spent the day scouring the internet for other resources too and I’m incorporating some of my own research here, so if you see anything dumb it’ll almost certainly be my fault, not Smerity’s.

Pytorch code examples

Smerity pointed to two excellent repositories that seemed to contain examples of all the techniques we discussed:

  • AWD-LSTM Language Model, which is a very recent release that shows substantial improvements in the state of the art for language modeling, using techniques that are likely to be useful across a range of NLP problems. It doesn’t work on the latest Pytorch, although it might not need too much tweaking to fix.
  • Word-level language modeling RNN, a simpler and really ancient (over 6 months old!) language modeling example, but one that still has some useful example code.

Two other interesting libraries to be aware of are:

  • Practical pytorch has some nice simple tutorial examples, although there are significant problems around approach (e.g. no test sets!), performance, and occasionally clunky code. It also doesn’t take advantage of the torchtext library, which leads to some redundant code. But the problems are really nicely chosen and the descriptions are clear.
  • torchtext is a small but convenient library for some basic text processing tasks, and also provides convenient access to a few datasets.

Techniques to get state of the art (SotA) results

In part 2 of the course we got pretty close to SotA in neural translation by showing how to use attentional models, dynamic teacher forcing, and of course stacked bidirectional LSTMs. But there have been some interesting approaches that have come to the fore since we developed that course, and Smerity suggested that combining the following should get the biggest wins across a range of NLP tasks without much additional complexity:

  • Two of my favorite researchers, Dyer and Blunsom, and the extraordinary Kaggle, computer olympiad, and Google AI winner Gabor Melis, published On the State of the Art of Evaluation in Neural Language Models, which curates a few best practices and uses hyper-parameter optimization to show great results from LSTMs
  • Smerity et al also recently released a paper curating and combining recent NLP techniques and got a big jump in SotA on language modeling, in Regularizing and Optimizing LSTM Language Models
  • Perhaps the most important addition in this paper is the use of pointer networks, which both Smerity et al (pointer sentinel) and Grave et al (continuous cache pointer) have extended to language modeling. Smerity uses the continuous cache for his most recent work and describes it as “The pointer is really really really simple! You could apply that to any LM model output and get similar gains. It’s more engineering though. All it is: store hidden state of LM as history (2000 timesteps), when generating a word calculate attention over that history using current LM hidden state, your word distribution is based on the next word according to where your hidden state is most similar in the past”. (A rough sketch of this cache mechanism appears after this list.)
  • Also shown as important in both the Melis and Smerity summary papers above is thoughtful use of regularization. Smerity found that DropConnect (which doesn’t really help CNNs much over regular dropout) actually works very nicely for RNNs, which makes intuitive sense when you think about it… In his words: “Given an RNN, h = Wx + Uh_t-1 + b, you apply dropout to the recurrent weight matrix U and have it the same over the entire forward + backward. Simple, fast, and prevents overfitting from h to h to h to …” He calls this the weight-dropped LSTM, and he implemented it in a very brief class called WeightDrop in the recent language modeling project. (A sketch of the idea appears after this list.)
  • Something that seems kinda obvious but actually only got written up a year ago is weight tying, described in Using the Output Embedding to Improve Language Models. It simply means that the output weights and the input embedding weights are the same matrix, and it’s handled in the basic language model example above with a single line of code (self.decoder.weight = self.encoder.weight). A minimal model using this trick is sketched after this list.
  • Finally, Smerity suggested adding “activation regularization (add L2 of LSTM output as loss) and temporal activation regularization (add L2 of ht - h_t-1 to loss). That is like three lines of change and gets you near old SotA with weight tying.” The approach is described in this paper. (A sketch of both terms appears after this list.)

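To make the continuous cache pointer concrete, here’s a minimal sketch of the mechanism Smerity describes above. The function name, shapes, and the theta/lam hyper-parameters are my own illustrative choices, not code from either paper; it just shows the attend-over-history-then-mix idea.

```python
import torch
import torch.nn.functional as F

def cache_pointer_probs(p_vocab, hidden, history_h, history_ids, theta=0.3, lam=0.1):
    """Sketch of a continuous cache pointer (in the spirit of Grave et al).
    p_vocab:     (vocab_size,) the LM's softmax distribution over the next word
    hidden:      (hidden_size,) the LM's current hidden state
    history_h:   (window, hidden_size) stored past hidden states, e.g. last 2000 steps
    history_ids: (window,) the word that actually followed each stored state
    theta, lam:  illustrative hyper-parameters (attention sharpness, mixture weight)
    """
    # Attention over the history: how similar is the current state to each past state?
    attn = F.softmax(theta * (history_h @ hidden), dim=0)            # (window,)
    # Turn that attention into a word distribution by scattering mass onto word ids
    p_cache = torch.zeros_like(p_vocab).index_add_(0, history_ids, attn)
    # Mix the cache distribution with the ordinary softmax
    return (1 - lam) * p_vocab + lam * p_cache
```
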
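Here’s a minimal sketch of the weight-dropped idea, using a plain tanh RNN cell rather than an LSTM to keep it short (the class name and initialization are mine). The one thing that matters is that the dropout mask on the recurrent matrix U is sampled once per forward pass and reused at every timestep.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDroppedRNN(nn.Module):
    """Sketch: h_t = tanh(W x_t + U h_{t-1} + b), with DropConnect on U."""
    def __init__(self, input_size, hidden_size, dropout=0.5):
        super().__init__()
        self.W = nn.Linear(input_size, hidden_size)   # input-to-hidden (includes b)
        self.U = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.01)
        self.dropout = dropout

    def forward(self, x, h):
        # x: (seq_len, batch, input_size), h: (batch, hidden_size)
        # One dropout mask on U, shared over the entire forward + backward pass
        U = F.dropout(self.U, p=self.dropout, training=self.training)
        outputs = []
        for x_t in x:                                 # same dropped U at every step
            h = torch.tanh(self.W(x_t) + h @ U.t())
            outputs.append(h)
        return torch.stack(outputs), h
```
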
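Weight tying really is one line. A minimal sketch (the model name and sizes are mine; note the embedding size has to match the LSTM’s hidden size for the shapes to line up):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Sketch of a language model with tied input/output embeddings."""
    def __init__(self, vocab_size, emb_size, num_layers=1):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, emb_size)   # weight: (vocab, emb)
        self.rnn = nn.LSTM(emb_size, emb_size, num_layers)
        self.decoder = nn.Linear(emb_size, vocab_size)      # weight: (vocab, emb)
        self.decoder.weight = self.encoder.weight           # the single line of weight tying

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(self.encoder(x), hidden)
        return self.decoder(out), hidden
```
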
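And the AR and TAR terms really are a few lines. A sketch (the function and the alpha/beta values are mine for illustration; the paper tunes these coefficients):

```python
def ar_tar_loss(loss, raw_output, alpha=2.0, beta=1.0):
    """Sketch: add activation regularization (AR) and temporal activation
    regularization (TAR) to an existing language-model loss.
    raw_output: LSTM output before any output dropout, (seq_len, batch, hidden)."""
    loss = loss + alpha * raw_output.pow(2).mean()                          # AR: L2 of LSTM output
    loss = loss + beta * (raw_output[1:] - raw_output[:-1]).pow(2).mean()   # TAR: L2 of h_t - h_{t-1}
    return loss
```
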
Problems to solve

Interesting problems to solve in the NLP world, with meaningful benchmarks, include:

  • Sentiment analysis, or similar standard classification or regression tasks, are great for learning since they’re very similar to other classification and regression problems, so you can focus on the text analysis issues rather than the problem description. And they’re widely useful. There’s a huge dataset of Amazon reviews that can be used, although I’m not aware of good benchmarks. The IMDB review dataset has been widely studied so there are plenty of good benchmarks (we already use this dataset in our MOOC). Any group of documents that have some kind of categories can be used too, such as the 20 newsgroup dataset that we used in our computational linear algebra course. (BTW I don’t recommend using Stanford’s sentiment dataset, since it contains phrase annotations as well as sentence annotations, which seems rather artificial and designed to make the researchers’ tree-based techniques look more impressive.)
  • Language modeling, which basically means trying to predict the next word given the previous words in a sentence. The standard benchmark for this is ‘PTB’ (Penn Treebank), which is available in the Pytorch language modeling example repo.
  • Language translation, for which there are many datasets and tutorials around nowadays, although I like to think that our dataset and approach from part 2 of our course isn’t too bad…
  • Textual entailment, which is best explained by the examples on Stanford’s Natural Language Inference (SNLI) Corpus page. You get pairs of sentences, and have to say if they’re in agreement or they contradict. E.g. “A man inspects the uniform of a figure in some East Asian country” and “The man is sleeping” would be labeled as a contradiction. As well as the above corpus, the excellent Sam Bowman and friends have also recently released The Multi-Genre NLI Corpus; the two datasets can be combined. It’s a cool demonstration of the power of deep learning, although it’s not entirely clear (to me at least) whether it’s actually useful…
  • Text summarization is an area of active research with some recent significant advances, although I’m not yet clear on whether it’s useful in practice.
  • A lot of people working in NLP spend their time on sub-tasks, such as part-of-speech tagging and entity recognition, which can then be used in larger applications. These particular tasks are part of the area of sequence labeling, about which NLP researcher Yoav Goldberg tweeted: “tons of information extraction tasks can be modeled as seq labeling. It’s a killer app.”