fastdownload: the magic behind one of the famous 4 lines of code

technical
Author: Jeremy Howard

Published: August 2, 2021

Summary: Today we’re launching fastdownload, a new library that makes it easy for your users to download, verify, and extract archives.

Background

At fast.ai we focus on making important technical topics more accessible. That means the libraries we create do as much as possible for the user, without limiting what’s possible.

fastai is famous for needing just four lines of code to get world-class deep learning results with vision, text, tabular, or recommendation system data:

path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_func(path, get_image_files(path/"images"),
    label_func, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)

Many pages have been written about most of these: the flexibility of the Data Block API, the power of cnn_learner, and the state-of-the-art transfer learning provided by fine_tune.

But what about untar_data? This first line of code, although rarely discussed, is actually a critical part of the puzzle. Here’s what it does:

  1. If required, download the URL to a special folder (by default, ~/.fastai/archive). If it was already downloaded earlier, skip this step.
  2. Check whether the size and hash of the downloaded (or cached) archive match what fastai expects. If they don’t, try downloading again.
  3. If required, extract the downloaded file to another special folder (by default, ~/.fastai/data). If it was already extracted earlier, skip this step.
  4. Return a Path object pointing at the location of the extracted archive.
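
The four steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not fastai's actual implementation: the function name fetch_and_extract, its parameters, the use of MD5 as the hash, and the assumption that the archive contains a single top-level folder named after the file are all hypothetical choices made for the example.

```python
import hashlib
import tarfile
import urllib.request
from pathlib import Path


def fetch_and_extract(url, archive_dir, data_dir, expected_size=None, expected_hash=None):
    """Illustrative download -> verify -> extract flow (hypothetical helper)."""
    archive_dir, data_dir = Path(archive_dir), Path(data_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = archive_dir / url.rsplit("/", 1)[-1]

    def is_valid(path):
        # Compare size and MD5 digest against the expected values, when given.
        if expected_size is not None and path.stat().st_size != expected_size:
            return False
        if expected_hash is not None:
            return hashlib.md5(path.read_bytes()).hexdigest() == expected_hash
        return True

    # Step 1: download only if there is no cached archive.
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)

    # Step 2: if the archive fails the check, download it once more.
    if not is_valid(archive):
        urllib.request.urlretrieve(url, archive)
        if not is_valid(archive):
            raise ValueError(f"{archive} does not match the expected size or hash")

    # Step 3: extract only if the target folder does not exist yet.
    # Assumes the archive holds one top-level folder named after the file.
    dest = data_dir / archive.name.split(".")[0]
    if not dest.exists():
        with tarfile.open(archive) as tar:
            tar.extractall(data_dir)

    # Step 4: return the location of the extracted data.
    return dest
```

Calling the function a second time with the same arguments skips both the download and the extraction, which is the caching behaviour described in steps 1 and 3.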

Thanks to this, users don’t have to worry about where their archives and data can be stored, whether they’ve downloaded a URL before or not, and whether their downloaded file is the correct version. fastai handles all this for the user, letting them spend more of their time on the actual modeling process.

fastdownload

fastdownload, launched today, allows you to provide this same convenience for your users. It helps you make datasets or other archives available to your users while ensuring that they always get a correctly downloaded copy of the latest version.

Your user just calls a single method, FastDownload.get, passing the required URL, and the URL will be downloaded and extracted to the directories you choose. The path to the extracted file is returned. If that URL has already been downloaded, then the cached archive or contents will be used automatically. However, if the size or hash of the archive is different to what it should be, then the user will be informed, and a new version will be downloaded.

In the future, you may want to update one or more of your archives. When you do so, fastdownload will ensure your users have the latest version, by checking their downloaded archives against your updated file size and hash information.

fastdownload will add a file download_checks.py to your Python module which contains file sizes and hashes for your archives. Because it’s a regular Python file, it will be automatically included in your package if you upload it to PyPI or a conda channel.
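
To give a feel for the kind of information such a checks file holds, here is a rough sketch of how a size and hash pair could be computed for an archive. The helper size_and_hash is hypothetical, and MD5 is assumed purely for illustration; the exact format and hash used by download_checks.py are defined by fastdownload itself, so refer to its docs for the real details.

```python
import hashlib
from pathlib import Path


def size_and_hash(path, chunk_size=1 << 20):
    """Return (size in bytes, MD5 hex digest) for a file (hypothetical helper)."""
    path = Path(path)
    md5 = hashlib.md5()  # assumption: MD5 chosen only for this example
    with path.open("rb") as f:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return path.stat().st_size, md5.hexdigest()
```

When an archive is updated, either value changing is enough to tell a stale cached copy apart from the new version.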

Here’s all you need to provide a function that works just like untar_data:

from fastdownload import FastDownload
def untar_data(url): return FastDownload(base='~/.myapp').get(url)

You can modify the locations that files are downloaded to by creating a config file ~/.myapp/config.ini (if you don’t have one, it will be created for you). The values in this file can be absolute or relative paths (relative paths are resolved relative to the location of the ini file).
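
The rule for relative paths can be illustrated with a short sketch using the standard configparser module. It is not fastdownload's own code: the helper resolve_locations is hypothetical, and the key names archive and data, as well as the use of the DEFAULT section, are assumptions made for the example.

```python
import configparser
from pathlib import Path


def resolve_locations(ini_path, keys=("archive", "data")):
    """Resolve path values from an ini file relative to the file's own folder."""
    ini_path = Path(ini_path).expanduser()
    cfg = configparser.ConfigParser()
    cfg.read(ini_path)
    section = cfg["DEFAULT"]  # assumption: values live in the DEFAULT section
    # Joining onto an absolute path discards the left-hand side, so absolute
    # values pass through unchanged while relative ones land next to the ini file.
    return {k: ini_path.parent / Path(section[k]).expanduser() for k in keys}
```

With an ini file at ~/.myapp/config.ini containing archive = downloads, this sketch would resolve the archive location to ~/.myapp/downloads, while an absolute value would be used unchanged.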

If you want to give fastdownload a try, then head over to the docs and follow along with the walk-thru.