The new fast.ai research datasets collection, on AWS Open Data

In machine learning and deep learning we can’t do anything without data. So the people that create datasets for us to train our models are the (often under-appreciated) heroes. Some of the most useful and important datasets are those that become important “academic baselines”; that is, datasets that are widely studied by researchers and used to compare algorithmic changes. Some of these become household names (at least, among households that train models!), such as MNIST, CIFAR 10, and Imagenet.

We all owe a debt of gratitude to those kind folks who have made datasets available for the research community. So fast.ai and the AWS Public Dataset Program have teamed up to try to give back a little: we’ve made some of the most important of these datasets available in a single place, using standard formats, on reliable and fast infrastructure. For a full list and links see the fast.ai datasets page.

fast.ai uses these datasets in the Deep Learning for Coders courses, because they provide great examples of the kind of data that students are likely to encounter, and the academic literature has many examples of model results using these datasets which students can compare their work to. If you use any of these datasets in your research, please show your gratitude by citing the original paper (we’ve provided the appropriate citation link below for each), and if you use them as part of a commercial or educational project, consider adding a note of thanks and a link to the dataset.

Dataset example: the French/English parallel corpus

One of the lessons that gets the most “wow” feedback from fast.ai students is when we study neural machine translation. It seems like magic when we can teach a model to translate from French to English, even if we can’t speak both languages ourselves!

But it’s not magic; the key is the wonderful dataset that we leverage in this lesson: the French/English parallel text corpus prepared back in 2009 by Professor Chris Callison-Burch of the University of Pennsylvania. This dataset contains over 20 million sentence pairs in French and English. He built the dataset in a really clever way: by crawling millions of Canadian web pages (which are often multi-lingual) and then using a set of simple heuristics to transform French URLs onto English URLs. The dataset is particularly important for researchers since it is used in the most important annual competition for benchmarking machine translation models.

Here’s some examples of the sentence pairs that our translation models can learn from:

Often considered the oldest science, it was born of our amazement at the sky and our need to question Astronomy is the science of space beyond Earth’s atmosphere. | Souvent considérée comme la plus ancienne des sciences, elle découle de notre étonnement et de nos questionnements envers le ciel L’astronomie est la science qui étudie l’Univers au-delà de l’atmosphère terrestre. |
The name is derived from the Greek root astron for star, and nomos for arrangement or law. | Son nom vient du grec astron, qui veut dire étoile et nomos, qui veut dire loi. |
Astronomy is concerned with celestial objects and phenomena – like stars, planets, comets and galaxies – as well as the large-scale properties of the Universe, also known as “The Big Picture”. | Elle s’intéresse à des objets et des phénomènes tels que les étoiles, les planètes, les comètes, les galaxies et les propriétés de l’Univers à grande échelle. |

So what’s Professor Callison-Burch doing now? When we reached out to him to check some details for his dataset, he told us he’s now preparing the University of Pennsylvania’s new AI course; and part of his preparation: watching the videos at course.fast.ai! It’s a small world indeed…

The dataset collection

The following categories are currently included in the collection:

Image classification, with a focus on fine-grained classification and transfer learning
Image localization, including the 2017 COCO datasets
NLP, including datasets for language modeling, translation, and text classification.

The datasets are all stored in the same tgz format, and (where appropriate) the contents have been converted into standard formats, suitable for import into most machine learning and deep learning software. For examples of using the datasets to build practical deep learning models, keep an eye on the fast.ai blog where many tutorials will be posted soon.