
Practical Deep Learning for Coders 2019

Launching today, the 2019 edition of Practical Deep Learning for Coders, the third iteration of the course, is 100% new material, including applications that have never been covered by an introductory deep learning course before (with some techniques that haven’t even been published in academic papers yet). There are seven lessons, each around 2 hours long, and you should plan to spend about 10 hours on assignments for each lesson. Google Cloud and Microsoft Azure have integrated all you need for the courses into their GPU-based platforms, and there are “one-click” platforms available too, such as Crestle and Gradient.

The course assumes you have at least a year of coding experience (preferably in Python, although experienced coders will be able to pick Python up as they go; we have a list of Python learning resources available), and have completed high-school math (some university-level math is introduced as needed during the course). Many people who have completed the course tell us it takes a lot of work, but it’s one of the most rewarding things they’ve done; we strongly suggest you get involved with the course’s active online community to help you complete your journey.

After the first lesson you’ll be able to train a state-of-the-art image classification model on your own data. After completing this lesson, some students from the in-person version of this course (where this material was recorded) published new state-of-the-art results in various domains! The focus for the first half of the course is on practical techniques, showing only the theory required to actually use these techniques in practice. Then, in the second half of the course, we dig deeper and deeper into the theory, until by the final lesson we will build and train a “resnet” neural network from scratch which approaches state-of-the-art accuracy.

Some application examples from the course

The key applications covered are:

  • Computer vision (e.g. classify pet photos by breed)
    • Image classification
    • Image localization (segmentation and activation maps)
    • Image key-points
  • NLP (e.g. movie review sentiment analysis)
    • Language modeling
    • Document classification
  • Tabular data (e.g. sales prediction)
    • Categorical data
    • Continuous data
  • Collaborative filtering (e.g. movie recommendation)

We also cover all the necessary foundations for these applications.

Foundations covered in the course

We teach using the PyTorch library, which is the most modern and flexible of the widely used deep learning libraries, and we’ll also use the fastai wrapper for PyTorch, which makes it easier to access recommended best practices for training deep learning models (whilst making all the underlying PyTorch functionality directly available too). We think fastai is great, but we’re biased because we made it… still, it’s the only general deep learning toolkit featured on pytorch.org, has over 10,000 GitHub stars, and is used in many competition victories, academic papers, and top university courses, so it’s not just us who like it! Note that the concepts you learn will apply equally well to any work you want to do with TensorFlow/Keras, CNTK, MXNet, or any other deep learning library; it’s the concepts which matter. Learning a new library just takes a few days if you understand the concepts well.

One particularly useful addition this year is that we now have a super-charged video player, thanks to the great work of Zach Caceres. It allows you to search the lesson transcripts, and jump straight to the section of the video that you find. It also shows links to other lessons, and the lesson summary and resources, in collapsible panes (it doesn’t work well on mobile yet, however, so if you want to watch on mobile you can use this YouTube playlist). And an extra big thanks to Sylvain Gugger, who has been instrumental in the development of both the course and the fastai library—we’re very grateful to Amazon Web Services for sponsoring Sylvain’s work.

fast.ai's video player with searchable timeline

If you’re interested in giving it a go, click here to go to the course web site. Now let’s look at each lesson in more detail.

Lesson 1: Image classification

The most important outcome of lesson 1 is that we’ll have trained an image classifier which can recognize pet breeds at state-of-the-art accuracy. The key to this success is the use of transfer learning, which will be a fundamental platform for much of this course. We’ll also see how to analyze the model to understand its failure modes. In this case, we’ll see that the places where the model is making mistakes are in the same areas that even breeding experts can make mistakes.

Training and analyzing a pet breed classifier

We’ll discuss the overall approach of the course, which is somewhat unusual in being top-down rather than bottom-up. So rather than starting with theory, and only getting to practical applications later, we start instead with practical applications, and then gradually dig deeper and deeper into them, learning the theory as needed. This approach takes more work for teachers to develop, but it’s been shown to help students a lot, for example in education research at Harvard by David Perkins.

We also discuss how to set the most important hyper-parameter when training neural networks: the learning rate, using Leslie Smith’s fantastic learning rate finder method. Finally, we’ll look at the important but rarely discussed topic of labeling, and learn about some of the features that fastai provides for allowing you to easily add labels to your images.

Note that to follow along with the lessons, you’ll need to connect to a cloud GPU provider which has the fastai library installed (recommended; it should take only 5 minutes or so, and cost under $0.50/hour), or set up a computer with a suitable GPU yourself (which can take days to get working if you’re not familiar with the process, so we don’t recommend it until later). You’ll also need to be familiar with the basics of the Jupyter Notebook environment we use for running deep learning experiments. Up to date tutorials and recommendations for these are available from the course website.

Lesson 2: Data cleaning and production; SGD from scratch

We start today’s lesson by learning how to build your own image classification model using your own data, including topics such as:

  • Image collection
  • Parallel downloading
  • Creating a validation set, and
  • Data cleaning, using the model to help us find data problems.

I’ll demonstrate all these steps as I create a model that can take on the vital task of differentiating teddy bears from grizzly bears. Once we’ve got our data set in order, we’ll then learn how to productionize our teddy-finder, and make it available online.

Putting your model in production

We’ve had some great additions since this lesson was recorded, so be sure to check out:

  • The production starter kits on the course web site, such as this one for deploying to Render.com
  • The new interactive GUI in the lesson notebook for using the model to find and fix mislabeled or incorrectly-collected images.

In the second half of the lesson we’ll train a simple model from scratch, creating our own gradient descent loop. In the process, we’ll be learning lots of new jargon, so be sure you’ve got a good place to take notes, since we’ll be referring to this new terminology throughout the course (and there will be lots more introduced in every lesson from here on).

Gradient descent in action
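
To make the new jargon concrete, here is a bare-bones sketch of the kind of gradient descent loop the lesson builds. The lesson itself does this in Python with PyTorch; the snippet below is just an illustration (written in Swift), fitting a line y = a*x + b by repeatedly stepping the parameters against the gradient of the mean squared error:

// make some noisy data from a known line: a=3, b=2
let n = 100
let xs = (0..<n).map { Double($0) / Double(n) }
let ys = xs.map { 3.0 * $0 + 2.0 + Double.random(in: -0.1...0.1) }

var (a, b) = (0.0, 0.0)   // the parameters we want to learn
let lr = 0.5              // the learning rate hyper-parameter from lesson 1

for _ in 0..<1000 {
  var gradA = 0.0, gradB = 0.0
  for i in 0..<n {
    let err = (a * xs[i] + b) - ys[i]      // prediction error for one point
    gradA += 2 * err * xs[i] / Double(n)   // d(MSE)/da
    gradB += 2 * err / Double(n)           // d(MSE)/db
  }
  a -= lr * gradA                          // the gradient descent step
  b -= lr * gradB
}
print(a, b)   // should end up close to 3 and 2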

Lesson 3: Data blocks; Multi-label classification; Segmentation

Lots to cover today! We start lesson 3 looking at an interesting dataset: Planet’s Understanding the Amazon from Space. In order to get this data into the shape we need it for modeling, we’ll use one of fastai’s most powerful (and unique!) tools: the data block API. We’ll be coming back to this API many times over the coming lessons, and mastery of it will make you a real fastai superstar! Once you’ve finished this lesson, if you’re ready to learn more about the data block API, have a look at this great article: Finding Data Block Nirvana, by Wayde Gilliam.

One important feature of the Planet dataset is that it is a multi-label dataset. That is: each satellite image can contain multiple labels, whereas previous datasets we’ve looked at have had exactly one label per image. We’ll look at what changes we need to make to work with multi-label datasets.

The result of our image segmentation model

Next, we will look at image segmentation, which is the process of labeling every pixel in an image with a category that shows what kind of object is portrayed by that pixel. We will use similar techniques to the earlier image classification models, with a few tweaks. fastai makes image segmentation modeling and interpretation just as easy as image classification, so there won’t be too many tweaks required.

We will be using the popular CamVid dataset for this part of the lesson. In future lessons, we will come back to it and show a few extra tricks. Our final CamVid model will have dramatically lower error than any model we’ve been able to find in the academic literature!

What if your dependent variable is a continuous value, instead of a category? We answer that question next, looking at a keypoint dataset, and building a model that predicts face keypoints with precision.

Lesson 4: NLP; Tabular data; Collaborative filtering; Embeddings

In lesson 4 we’ll dive into natural language processing (NLP), using the IMDb movie review dataset. In this task, our goal is to predict whether a movie review is positive or negative; this is called sentiment analysis. We’ll be using the ULMFiT algorithm, which was originally developed during the fast.ai 2018 course, and became part of a revolution in NLP during 2018 which led the New York Times to declare that new systems are starting to crack the code of natural language. ULMFiT is today the most accurate known sentiment analysis algorithm.

Overview of ULMFiT

The basic steps are:

  1. Create (or, preferably, download a pre-trained) language model trained on a large corpus such as Wikipedia (a “language model” is any model that learns to predict the next word of a sentence)
  2. Fine-tune this language model using your target corpus (in this case, IMDb movie reviews)
  3. Remove the decoder from this fine-tuned language model and replace it with a classifier head, keeping the encoder; then fine-tune this model for the final classification task (in this case, sentiment analysis).

After our journey into NLP, we’ll complete our practical applications for Practical Deep Learning for Coders by covering tabular data (such as spreadsheets and database tables), and collaborative filtering (recommendation systems).

For tabular data, we’ll see how to use categorical and continuous variables, and how to work with the fastai.tabular module to set up and train a model.

Then we’ll see how collaborative filtering models can be built using similar ideas to those for tabular data, but with some special tricks to get both higher accuracy and more informative model interpretation.

This brings us to the half-way point of the course, where we have looked at how to build and interpret models in each of these key application areas:

  • Computer vision
  • NLP
  • Tabular
  • Collaborative filtering

For the second half of the course, we’ll learn about how these models really work, and how to create them ourselves from scratch. For this lesson, we’ll put together some of the key pieces we’ve touched on so far:

  • Activations
  • Parameters
  • Layers (affine and non-linear)
  • Loss function.

We’ll be coming back to each of these in lots more detail during the remaining lessons. We’ll also learn about a type of layer that is important for NLP, collaborative filtering, and tabular models: the embedding layer. As we’ll discover, an “embedding” is simply a computational shortcut for a particular type of matrix multiplication (a multiplication by a one-hot encoded matrix).
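
To see why that last claim is true, here is a tiny illustration (written in Swift, though the idea is language-agnostic and the course itself uses PyTorch): multiplying a one-hot row vector by a weight matrix just selects one row of that matrix, which is exactly what an embedding lookup does.

// a hypothetical embedding matrix: 4 categories, 3 dimensions per category
let weights = [[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6],
               [0.7, 0.8, 0.9],
               [1.0, 1.1, 1.2]]

let category = 2
let oneHot = (0..<weights.count).map { $0 == category ? 1.0 : 0.0 }   // [0, 0, 1, 0]

// one-hot vector times matrix...
let viaMatmul = (0..<weights[0].count).map { j in
  (0..<weights.count).map { i in oneHot[i] * weights[i][j] }.reduce(0, +)
}
// ...gives the same answer as simply indexing into the matrix: the "computational shortcut"
let viaLookup = weights[category]
print(viaMatmul, viaLookup)   // both are [0.7, 0.8, 0.9]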

Lesson 5: Back propagation; Accelerated SGD; Neural net from scratch

In lesson 5 we put all the pieces of training together to understand exactly what is going on when we talk about back propagation. We’ll use this knowledge to create and train a simple neural network from scratch.

Neural net from scratch

We’ll also see how we can look inside the weights of an embedding layer, to find out what our model has learned about our categorical variables. This will let us get some insights into which movies we should probably avoid at all costs…

Interpreting movie review embeddings

Although embeddings are most widely known in the context of word embeddings for NLP, they are at least as important for categorical variables in general, such as for tabular data or collaborative filtering. They can even be used with non-neural models with great success.

Comparative performance of common models with vs without embeddings

Lesson 6: Regularization; Convolutions; Data ethics

Today we discuss some powerful techniques for improving training and avoiding over-fitting:

  • Dropout: remove activations at random during training in order to regularize the model
  • Data augmentation: modify model inputs during training in order to effectively increase data size
  • Batch normalization: adjust the parameterization of a model in order to make the loss surface smoother.

Data augmentation examples for a single image

Next up, we’ll learn all about convolutions, which can be thought of as a variant of matrix multiplication with tied weights, and are the operation at the heart of modern computer vision models (and, increasingly, other types of models too).
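
As a rough sketch of what that means (in Swift here, purely for illustration; the course works in PyTorch), a convolution slides a small kernel over the image and takes a weighted sum at each position, reusing the same kernel weights everywhere; those reused weights are the “tied weights”:

func conv2d(_ image: [[Double]], _ kernel: [[Double]]) -> [[Double]] {
  let kh = kernel.count, kw = kernel[0].count
  let oh = image.count - kh + 1, ow = image[0].count - kw + 1
  var out = Array(repeating: Array(repeating: 0.0, count: ow), count: oh)
  for y in 0..<oh { for x in 0..<ow {
    var sum = 0.0
    // the same kernel weights are applied at every (y, x) location
    for i in 0..<kh { for j in 0..<kw { sum += kernel[i][j] * image[y+i][x+j] } }
    out[y][x] = sum
  }}
  return out
}

// a 3x3 horizontal edge detector applied to a tiny 4x4 "image"
let img: [[Double]] = [[1,1,1,1],[1,1,1,1],[0,0,0,0],[0,0,0,0]]
let k:   [[Double]] = [[1,1,1],[0,0,0],[-1,-1,-1]]
print(conv2d(img, k))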

We’ll use this knowledge to create a class activation map (CAM), which is a heat-map that shows which parts of an image were most important in making a prediction.

How a convolution works

Finally, we’ll cover a topic that many students have told us is the most interesting and surprising part of the course: data ethics. We’ll learn about some of the ways in which models can go wrong, with a particular focus on feedback loops, why they cause problems, and how to avoid them. We’ll also look at ways in which bias in data can lead to biased algorithms, and discuss questions that data scientists can and should be asking to help ensure that their work doesn’t lead to unexpected negative outcomes.

Example of algorithmic bias in the US justice system

Lesson 7: Resnets from scratch; U-net; Generative (adversarial) networks

In the final lesson of Practical Deep Learning for Coders we’ll study one of the most important techniques in modern architectures: the skip connection. This is most famously used in the resnet, which is the architecture we’ve used throughout this course for image classification, and appears in many cutting-edge results. We’ll also look at the U-net architecture, which uses a different type of skip connection to greatly improve segmentation results (and also for similar tasks where the output structure is similar to the input).

Impact on loss surface of resnet skip connections

We’ll then use the U-net architecture to train a super-resolution model. This is a model which can increase the resolution of a low-quality image. Our model won’t only increase resolution—it will also remove jpeg artifacts and unwanted text watermarks.

In order to make our model produce high quality results, we will need to create a custom loss function which incorporates feature loss (also known as perceptual loss), along with gram loss. These techniques can be used for many other types of image generation task, such as image colorization.

Super-resolution results using feature loss and gram loss

We’ll learn about a recent loss function known as generative adversarial loss (used in generative adversarial networks, or GANs), which can improve the quality of generative models in some contexts, at the cost of speed.

The techniques we show in this lesson include some unpublished research that:

  • Lets us train GANs more quickly and reliably than standard approaches, by leveraging transfer learning
  • Combines architectural innovations and loss function approaches that haven’t been used in this way before.

The results are stunning, and train in just a couple of hours (compared to previous approaches that take a couple of days).

A recurrent neural net

Finally, we’ll learn how to create a recurrent neural net (RNN) from scratch. This is the foundation of the models we have been using for NLP throughout the course, and it turns out they are a simple refactoring of a regular multi-layer network.

Thanks for reading! If you’ve gotten this far, then you should probably head over to course.fast.ai and start watching the first video!

C++11, random distributions, and Swift

Overview

Generating numbers from random distributions is a practically useful tool that any coder is likely to need at some point. C++11 added a rich set of random distribution generation capabilities. This makes it easy and fast to use random distributions, not only if you’re using C++11, but if you’re using any language that lets you interop with C++.

In this article, we’ll learn what random distributions are useful for, how they are generated, and how to use them in C++11. I’ll also show how I created new random distribution functionality for Swift by wrapping C++11’s classes as Swift classes. Whilst Swift doesn’t provide direct support for C++, I’ll show how to work around that by creating pure C wrappers for C++ classes.

Random distributions, and why they matter

If names like negative binomial and Poisson are mere shadows of memories of something learned long ago, please give me a moment to try to convince you that the world of random distributions is something that deserves your time and attention.

Coders are already aware of the idea of random numbers. But for many, our toolbox is limited to uniform real and integer random numbers, and perhaps some Gaussian (normal) random numbers thrown in occasionally as well. There are so many other ways of generating random numbers! In fact, you may even find yourself recreating standard random distributions without being aware of it…

For instance, let’s say you’re writing a music player, and your users have rated various songs from one star to five stars. You want to implement a “shuffle play” function, which will select songs at random, but choosing higher rated songs more often. How would you go about implementing that? The answer is: with a random distribution! More specifically, you want random numbers from a discrete distribution; that is, generate a random integer, using a set of weights where the higher weighted numbers are chosen proportionally more often.
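
If you want to see how simple this can be, here is a minimal hand-rolled version in Swift (just a sketch using the standard library’s uniform generator, not the C++11 or BaseMath APIs discussed below):

// pick an index with probability proportional to its weight
func weightedChoice(_ weights: [Double]) -> Int {
  let total = weights.reduce(0, +)
  var r = Double.random(in: 0..<total)
  for (i, w) in weights.enumerated() {
    if r < w { return i }
    r -= w
  }
  return weights.count - 1   // guard against floating point edge cases
}

// star ratings used as weights: the five-star song plays five times as often as the one-star one
let ratings = [3.0, 5.0, 1.0, 4.0]
print("play song #\(weightedChoice(ratings))")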

Or perhaps you are trying to simulate the predicted queue length after adding more resources to a system. You simulate the process, because you want to know not just the average queue length, but how often it will be bigger than some size, what the 95th percentile size will be, and so forth. You’re not sure what some of the inputs to your system might be, but you know the range of possible values, and you have a guess as to what you think is most likely. In this situation, you want random numbers from a triangular distribution; that is, generate a random float, which is normally close to your guess, and is linearly less likely further away, reducing to a probability of zero outside of the range of possible values. (This kind of simulation forms the backbone of probabilistic programming.)

There are dozens of other random distributions, including:

  • Empirical distribution: pick a number at random from historical data
  • Negative binomial distribution: the number of successes before a specified number of failures occurs
  • Poisson distribution: the number of independent events of a regular frequency that happen in a fixed time period

How to generate random distributions

In general, the steps to generate a number from some random distribution are:

  1. Seed your generator
  2. Generate the next bunch of random bits using the generator
  3. Transform those random bits into your chosen distribution
  4. If you need more random numbers, return to (2)

What we normally refer to as “random number generation” is really step (2): the use of a pseudorandom generator which deterministically generates a series of numbers that are as “random looking” as possible (i.e. not correlated with each other, well spread out, and so forth). The pseudorandom generator is some function with these properties, such as the Mersenne Twister. To start off the series, you need some seed; that is, the first number to pass to the generator. Most operating systems have some way of generating a random seed, such as /dev/random on Linux and Mac, which uses environmental input such as noise from device drivers to get a number that should be truly random.

Then in step (3) we transform the random bits created by our pseudorandom generator into something that has the distribution we need. There are universally applicable methods for this, such as inverse transform sampling, which transform a uniform random number into any given distribution. There are also faster methods specific to a distribution, such as the Box-Muller transform, which creates Gaussian (normal) random numbers from a uniform generator.
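
Here is what that pipeline looks like end to end in Swift, as a sketch for illustration: the built-in SystemRandomNumberGenerator handles steps (1) and (2) for us (it is seeded and driven by the operating system), and the Box-Muller transform is step (3). The C++11 and BaseMath versions below do all of this for you.

import Foundation   // for log, cos, sin

// steps (1) and (2): a generator that the OS seeds and drives for us
var gen = SystemRandomNumberGenerator()

// step (3): the Box-Muller transform turns two uniform numbers into two independent Gaussians
func boxMuller<G: RandomNumberGenerator>(using g: inout G) -> (Double, Double) {
  let u1 = Double.random(in: Double.leastNonzeroMagnitude..<1, using: &g)   // avoid log(0)
  let u2 = Double.random(in: 0..<1, using: &g)
  let r = (-2 * log(u1)).squareRoot()
  return (r * cos(2 * .pi * u2), r * sin(2 * .pi * u2))
}

let (z0, z1) = boxMuller(using: &gen)
print(z0, z1)   // two samples from a standard normal distribution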

To create more random numbers, you don’t need to go back to /dev/random, since you already have a pseudorandom generator set up now. Instead, you just grab the next number from your generator (step (2)), and pass that to your distribution generating transform (step (3)).

How this works in C++

C++11 includes, in the <random> standard library header, functionality for each of the steps above. Step (1) is achieved by simply creating a random_device (I’m not including the std:: prefix in this article; you would either type std::random_device or add using namespace std to the top of your C++ file). You then pass this to the constructor of one of the various pseudorandom generators provided, such as mt19937, which is the Mersenne Twister generator; that’s step (2). Then you construct a random distribution object using an appropriate class, such as discrete_distribution, passing in whatever arguments are needed by that distribution (e.g. for discrete_distribution you pass in a list of weights for each possible value). Finally, call that object (it supports the () operator, so it’s a functor, known as a callable in Python), passing in the pseudorandom generator you created. Here’s a complete example from the excellent cppreference.com.

Example of C++ discrete_distribution() from cppreference.com

If you thought that C++ code had to be verbose and complicated, this example might just make you rethink your assumptions! As you can see, each step maps nicely to the overview of the random distribution process described in the previous section. (BTW: if you’re interested in learning modern C++, cppreference.com has an extraordinary collection of carefully designed examples for every part of the C++ standard library; it’s perhaps the best place to learn how to use the language effectively in a hands-on way. You can even edit the examples and run them online!)

The distributions provided by C++11 are:

  • Integer generation: uniform_int_distribution, binomial_distribution, negative_binomial_distribution, geometric_distribution, poisson_distribution
  • Real generation: uniform_real_distribution, exponential_distribution, gamma_distribution, weibull_distribution, normal_distribution, lognormal_distribution, chi_squared_distribution, cauchy_distribution, fisher_f_distribution, student_t_distribution
  • Boolean generation: bernoulli_distribution

How this works in Swift

Although Swift 4 now provides some basic random number support, it still doesn’t provide any non-uniform distributions. Therefore, I’ve made all of the C++11 random distributions available to Swift, as part of my BaseMath library. For more information on why and how I created this library, see High Performance Numeric Programming with Swift. I’ll show how I built this wrapper in a moment, but first let’s see how to use it. Here’s the same function that we saw in C++, converted to Swift+BaseMath:

let arr = Int.discrete_distribution([40, 10, 10, 40])[10000]
let counts = arr.reduce(into: [:]) { $0[$1, default:0] += 1 }
counts.sorted(by:<).forEach { print("\($0) generated \($1) times") }

As you see, the generation of the random numbers in Swift can be boiled down, using BaseMath, to just: Int.discrete_distribution([40, 10, 10, 40])[10000]. We can do this more concisely than C++11 because we don’t surface as many options and details. BaseMath simply assumes you want to use the standard seeding method and the mt19937 Mersenne Twister generator.

The names of the distributions in BaseMath are exactly the same as in C++11, and you simply prefix each name with the type you wish to generate (either Int or Int32 for integer distributions, or Double or Float for real distributions). Each distribution has an init which matches the same name and types as the C++11 distribution constructor. This returns a Swift object with a number of methods. The C++11 objects, as discussed, provide the () (functor) operator, but unfortunately that operator can not be overloaded in Swift. Therefore instead we borrow Swift’s subscript special method to give us the equivalent behavior. The only difference is we have to use [] instead of (). If you just use the empty subscript [] then BaseMath will return a single random number; if you use an Int, such as [10000], then BaseMath will return an array. (There are also methods to generate buffer pointers and aligned storage.) Using subscript instead of a functor may feel a bit odd at first, but it’s a perfectly adequate way to get around Swift’s limitation. I’m going to call this a quasi-functor; that is, something that behaves like a functor, but is called using [...].

Wrapping C++ with Swift

In order to make the C++11 random distributions available to Swift, I needed to do two things:

  1. Wrap the C++ classes in a C API, since Swift can’t interop directly with C++
  2. Wrap the C API with Swift classes

We’ll look at each step in turn:

Wrap C++ classes in C

The C wrapper code is in CBaseMath.cpp. The mt19937 mersenne twister generator will be wrapped in a Swift class called RandGen, so the C functions wrapping this class all have the RandGen_ prefix. Here’s the code for the C wrappers (note that the wrappers can use C++ features internally, as long as the interface in the header file is plain C):

struct RandGenC:mt19937 {}; 
typedef struct RandGenC RandGenC;
RandGenC* RandGen_create() {
  return (RandGenC*)(new mt19937(random_device()()));
}
void RandGen_destroy(RandGenC* v) {
  delete(v);
}

The pattern for each class we wrap will be similar to this. We’ll have at least:

  1. A struct which derives from the C++ class to wrap (struct RandGenC), along with a typedef to allow us to use this name directly. By using a struct instead of void* we can call methods directly in C++, but can hide templates and other C++ internals from our pure C header file
  2. A _create function which constructs an object of this class and returns a pointer to it, cast to our struct type
  3. A _destroy function that deletes that object

In our header, we’ll have each of the functions listed, along with the typedef. Anything importing this C API, including our Swift code, won’t know anything about what the struct actually contains, so we won’t be able to use the type directly (since its size and layout isn’t provided in the header). Instead, we’ll simply use opaque pointers in code that uses this.

Here’s the code for the wrappers for a distribution; it looks nearly the same:

struct uniform_int_distribution_intC:uniform_int_distribution<int> {};
typedef struct uniform_int_distribution_intC uniform_int_distribution_intC;
uniform_int_distribution_intC* uniform_int_distribution_int_create(int a,int b) {
  return (uniform_int_distribution_intC*)(new uniform_int_distribution<int>(a,b));
}
void uniform_int_distribution_int_destroy(uniform_int_distribution_intC* v) {
  delete(v);
}
int uniform_int_distribution_int_call(uniform_int_distribution_intC* p, RandGenC* g) {
  return (*p)(*g);
}

The main difference is the addition of a _call function to allow us to actually call the method. Also, because the type is templated, we have to create a separate set of wrappers for each template type we want to support; the above shows an example for <int>. Note that this type needs to be included in the name of each function, since C doesn’t support overloading.

Of course, this all looks rather verbose, and we wouldn’t want to write this all out by hand for every distribution. So we don’t! Instead we use gyb templates to create them for us, and also to auto-generate the header file. Time permitting, we’ll look at that in more detail in the future. But for now, you can check the template’s source code.

Wrap C API in Swift

Now that we’ve got our C API, we can recreate the original C++ class easily in Swift, e.g.:

public class RandGen {
  public let ptr:OpaquePointer?
  public init() { ptr=RandGen_create() }
  deinit { RandGen_destroy(ptr) }
}

As you see, we simply call our _create function in init, and _destroy in deinit. As discussed in the previous section, our C API users don’t know anything about the internals of our struct, so Swift simply gives us an OpaquePointer.

We create similar wrappers for each distribution (which also define subscript, which will call our _call function), plus extending the numeric type with an appropriate static wrapper, e.g.:

extension Int32 {
  public static func uniform_int_distribution(_ g_:RandGen, _ a:Int32, _ b:Int32)
      -> uniform_int_distribution_Int32 {
    return uniform_int_distribution_Int32(g_, a,b)
  }
}
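
For completeness, here is a sketch of roughly what such a distribution wrapper might look like (the real BaseMath class has more to it, so treat this as illustrative only): it holds an opaque pointer to the C++ object, keeps a reference to the generator, and uses subscript as the quasi-functor call.

public class uniform_int_distribution_Int32 {
  let ptr: OpaquePointer?
  let gen: RandGen
  public init(_ g: RandGen, _ a: Int32, _ b: Int32) {
    gen = g
    ptr = uniform_int_distribution_int_create(a, b)   // the C wrapper from the previous section
  }
  deinit { uniform_int_distribution_int_destroy(ptr) }
  // Swift can't overload the () operator, so subscript stands in for C++'s functor call
  public subscript() -> Int32 {
    return uniform_int_distribution_int_call(ptr, gen.ptr)
  }
}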

Thread-safe generators

Having to construct and pass in a RandGen object isn’t convenient, particularly when we have to deal with the complexities of thread safety. C++ libraries are not, in general, thread safe; this includes C++11 random generators. So we have to be careful that we don’t share a RandGen object across threads. As discussed in my previous High Performance Numeric Programming with Swift article, we can easily get thread-safe objects by using Thread Local Storage. I added this property to RandGen:

static var stored:RandGen {
  if let r = Thread.current.threadDictionary["RandGen"] as? RandGen { return r }
  return Thread.setToTLS(RandGen(), "RandGen")
}

This lets us add versions of the following to each distribution class, so users never have to think about creating or using the RandGen class themselves.

public convenience init(_ a:Int32, _ b:Int32) { 
  self.init(RandGen.stored, a,b)
} 
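
The Thread.setToTLS helper used above isn’t shown in this post; it could be as simple as the following sketch (the real BaseMath helper may differ), which stores the value in the current thread’s dictionary and hands it back:

import Foundation

extension Thread {
  static func setToTLS<T>(_ v: T, _ key: String) -> T {
    Thread.current.threadDictionary[key] = v   // thread-local storage, so nothing is shared across threads
    return v
  }
}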

Extensions using protocols and BaseMath

The above steps give us the functionality of generating a single random number at a time. In order to generate a collection, we can add a Distribution protocol which each distribution conforms to, and extend it as follows:

public protocol Distribution:Nullary {
  subscript()->Element {get}
  subscript(n:Int)->[Element] {get}
}
extension Distribution {
  public subscript(n:Int)->[Element] {
    return [Element].fill(self, n)
  }
}

As you see, we leverage the BaseMath method fill, which calls a function or quasi-functor n times and returns a new BaseVector (in this case, an Array) with the results of each call.

You might be wondering about the protocol Nullary that’s mentioned above. Perhaps you’ve already heard of unary (a function or operator with one argument), binary (two arguments), and ternary (three arguments); less known, but equally useful, is the term nullary, which is simply a function or operator with no arguments. As discussed earlier, Swift doesn’t support overloading the () operator, so we add a Nullary protocol using subscript:

public protocol Nullary {
  associatedtype Element:SignedNumeric
  subscript()->Element {get}
}

Try it out!

If you’re a C++ or Swift programmer, try out some of these random distributions—perhaps you could even experiment with creating some simulations and entering the world of probabilistic programming! Or if you’re a Swift programmer who wants to use functionality in a C++ library, try wrapping it with an idiomatic Swift API and making it available as a Swift package for anyone to use.

High Performance Numeric Programming with Swift: Explorations and Reflections

Over the past few weeks I’ve been working on building some numeric programming libraries for Swift. But wait, isn’t Swift just what iOS programmers use for building apps? Not any more! Nowadays Swift runs on Linux and Mac, and can be used for web applications, command line tools, and nearly anything else you can think of.

Using Swift for numeric programming, such as training machine learning models, is not an area that many people are working on. There’s very little information around on the topic. But after a few weeks of research and experimentation I’ve managed to create a couple of libraries that can achieve the same speed as carefully optimized vectorized C code, whilst being concise and easy to use. In this article, I’ll take you through this journey and show you what I’ve learned about how to use Swift effectively for numeric programming. I will include examples mainly from my BaseMath library, which provides generic math functions for Float and Double, and optimized versions for various collections of them. (Along the way, I’ll have plenty to say, both positive and negative, about both Swift and other languages; if you’re someone who has a deep emotional connection to your favorite programming language and doesn’t like to see any criticism of it, you might want to skip this post!)

In a future post I’ll also show how to get additional speed and functionality by interfacing with Intel’s Performance Libraries for C.

Background

Generally around the new year I try to experiment with a new language or framework. One approach that’s worked particularly well for me is to look at what the people that built some of my favorite languages, books, and libraries are doing now. This approach led me to being a very early user of Delphi, Typescript, and C# (Anders Hejlsberg, after I used his Turbo Pascal), Perl (Larry Wall, after I used rn), JQuery (John Resig, after I read Modern Javascript), and more. So when I learnt that Chris Lattner (who wrote the wonderful LLVM) is creating a new deep learning framework called Swift for Tensorflow (which I’ll shorten to S4TF from here), I decided that I should take a look.

Note that S4TF is not just a boring Swift wrapper for Tensorflow! It’s the first serious effort I’ve seen to incorporate differentiable programming deep into the heart of a widely used language. I’m hoping that S4TF will give us a language and framework that, for the first time, treats differentiable programming as a first-class citizen of the programming world, and will allow us to do things like:

  • Write custom GPU kernels in Swift
  • Provide compile-time checks that named tensor axis names and sizes match
  • Differentiate any arbitrary code, whilst also providing vectorized and fused implementations automatically.

These things are not available in S4TF, at least as yet (in fact, it’s such early days for the project that nearly none of the deep learning functionality works yet). But I fully expect them to happen eventually, and when that happens, I’m confident that using differentiable programming in Swift will be a far better experience than in any other language.

I was lucky enough to bump into Chris at a recent conference, and when I told him about my interest in S4TF, he was kind enough to offer to help me get started with Swift. I’ve always found that who I work with matters much more to my productivity and happiness than what I work on, so that was another excellent reason to spend time on this project. Chris has been terrifically helpful, and he’s super-nice as well—so thanks, Chris!

About Swift

Swift is a general-purpose, multi-paradigm, compiled programming language. It was started by Chris Lattner while he was at Apple, and supported many concepts from Objective-C (the main language used for programming for Apple devices). Chris described the language to me as “syntax sugar for LLVM”, since it maps so closely to many of the ideas in that compiler framework.

I’ve been coding for around 30 years, and in that time have used dozens of languages (and have even contributed to some). I always hope that when I start looking at a new language there will be some mind-opening new ideas to find, and Swift definitely doesn’t disappoint. Swift tries to be expressive, flexible, concise, safe, easy to use, and fast. Most languages compromise significantly in at least one of these areas. Here’s my personal view of some languages that I’ve used and enjoyed, but all of which have limitations I’ve found frustrating at times:

  • Python: Slow at runtime, poor support for parallel processing (but very easy to use)
  • C, C++: hard to use (and C++ is slow at compile time), but fast and (for C++) expressive
  • Javascript: Unsafe (unless you use Typescript); somewhat slow (but easy to use and flexible)
  • Julia: Poor support for general purpose programming, but fast and expressive for numeric programming. (Edit: this may be a bit unfair to Julia; it’s come a long way since I last looked at it!)
  • Java: verbose (but getting better, particularly if you use Kotlin), less flexible (due to JVM issues), somewhat slow (but overall a language that has many useful application areas)
  • C# and F#: perhaps the fewest compromises of any major programming language, but still requires installation of a runtime, limited flexibility due to garbage collection, and difficulties making code really fast (except on Windows, where you can interface via C++/CLI)

I’d say that Swift actually does a pretty good job of avoiding any major compromises (possibly Rust does too; I haven’t used it seriously so can’t make an informed comment). It’s not the best at any of the areas I’ve mentioned, but it’s not too far off either. I don’t know of another single language that can make that claim (but note that it also has its downsides, which I’ll address in the last section of this article). I’ll look briefly at each in turn:

  • Concise: Here’s how to create a new array b that adds 2 to every element of a: let b=a.map {$0+2}. Here, {$0+2} is an anonymous function, with $0 being the automatic name for the first parameter (you can optionally add names and types if you like). The type of b is inferred automatically. As you can see, there’s a lot we’ve done with just a small amount of code!
  • Expressive: The above line of code works not just for arrays, but for any object that supports certain operations (as defined by Sequence in the Swift standard library). You can also add support for Sequence to any of your objects, and even add it to existing Swift types or types in other libraries. As soon as you do so, those objects will get this functionality for free.
  • Flexible: There’s not much that Swift can’t do. You can use it for mobile apps, desktop apps, server code, and even systems programming. It works well for parallel computing, and also can handle (somewhat) small-memory devices.
  • Safe: Swift has a fairly strong type system, which does a good job of noticing when you’ve done something that’s not going to work. It has good support for optional values, without making your code verbose. But when you need extra speed or flexibility, there’s generally ways to bypass Swift’s checks.
  • Fast: Swift avoids the things that can make a language slow; e.g. it doesn’t use garbage collection, allows you to use value types nearly everywhere, and minimizes the need for locking. It uses LLVM behind the scenes, which has excellent support for creating optimized machine code. Swift also makes it easy for the compiler to know when things are immutable, and avoids aliasing, which also helps the compiler optimize. As you’ll see, you can often get the same performance as carefully optimized C code.
  • Easy to use: This is the one area where there is, perhaps, a bit of a compromise. It’s quite easy to write basic Swift programs, but there can be obscure type issues that pop up, mysterious error messages a long way from the real site where the problem occurred, and installing and distributing applications and libraries can be challenging. Also, the language has been changing a lot (for the better!) so most information online is out of date and requires changes to make it work. Having said all that, it’s far easier to use than something like C++.

Protocol-oriented programming

The main trick that lets Swift avoid compromises is its use of Protocol-oriented programming. The basic idea is that we try to use value types as much as possible. In most languages where ease-of-use is important, reference types are widely used since they allow the use of garbage collection, virtual functions, overriding super-class behavior, and so forth. Protocol-oriented programming is Swift’s way of getting many of these benefits, whilst avoiding the overhead of reference types. In addition, by avoiding reference types, we avoid all the complex bugs introduced when we have two variables pointing at the same thing.

Value types are also a great match for functional styles of programming, since they allow for better support of immutability and related functional concerns. Many programmers, particularly in the Javascript world, have recently developed an understanding of how code can be more concise, understandable, and correct, by leveraging a functional style.

If you’ve used a language like C#, you’ll already be familiar with the idea that defining something with struct gives you a value type, and using class gives you a reference type. This is exactly how Swift handles things too.
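
If you haven’t seen this before, here’s the difference in a few lines of Swift: assigning a struct copies the value, while assigning a class instance just copies a reference to the same object.

struct PointS { var x = 0 }
class  PointC { var x = 0 }

var s1 = PointS()
var s2 = s1          // a copy of the value
s2.x = 9
print(s1.x, s2.x)    // 0 9: s1 is unaffected

let c1 = PointC()
let c2 = c1          // another reference to the same object
c2.x = 9
print(c1.x, c2.x)    // 9 9: there is only one object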

Before we get to protocols, let’s mention a couple of other fundamentals: Automatic Reference Counting (ARC), and copy-on-write.

Automatic Reference Counting (ARC)

From the docs: “Swift uses Automatic Reference Counting (ARC) to track and manage your app’s memory usage. In most cases, this means that memory management “just works” in Swift, and you do not need to think about memory management yourself. ARC automatically frees up the memory used by class instances when those instances are no longer needed.” Reference counting has traditionally been used by dynamic languages like Perl and Python. Seeing it in a modern compiled language is quite unusual. However, Swift’s compiler works hard to track references carefully, without introducing overhead.

ARC is important both for handling Swift’s reference types (which we still need to use sometimes), and also to handle memory use in value type objects sharing memory with copy-on-write semantics, or which are embedded in a reference type. Chris also mentioned to me a number of other benefits: it provides deterministic destruction, eliminates the common problems with GC finalizers, allows scaling down to systems that can’t or don’t want a GC, and eliminates unpredictable/unreproducible pauses.

Copy-on-write

One major problem with value types in most languages is that if you have something like a big array, you wouldn’t want to pass the whole thing to a function, since that would require a lot of slow memory allocation and copying. So most languages use a pointer or reference in this situation. Swift, however, passes a reference to the original memory, but if the reference mutates the object, only then does it get copied (this is done behind the scenes automatically). So we get the best performance characteristics of value and reference types combined! This is referred to as “copy-on-write”, which is rather delightfully referred to in some S4TF docs as “COW 🐮” (yes, with the cow face emoji too!)

COW also helps with programming in a functional style, yet still allowing for mutation when needed—but without the overhead of unnecessary copying or verbosity of manual references.
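
You can watch copy-on-write happen with a few lines of Swift (assuming a recent toolchain): the copy shares the original array’s buffer right up until the first mutation.

func storageAddress(_ xs: [Double]) -> UnsafeRawPointer? {
  return xs.withUnsafeBufferPointer { buf in buf.baseAddress.map { UnsafeRawPointer($0) } }
}

let a = [1.0, 2.0, 3.0]
var b = a                                         // no copying happens here
print(storageAddress(a) == storageAddress(b))     // true: both share one buffer

b[0] = 99                                         // the first write triggers the actual copy
print(storageAddress(a) == storageAddress(b))     // false: b now has its own buffer
print(a)                                          // [1.0, 2.0, 3.0]: a was never touched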

Protocols

With value types, we can’t use inheritance hierarchies to get the benefits of object-oriented programming (although you can still use these if you use reference types, which are also supported by Swift). So instead, Swift gives us protocols. Many languages, such as Typescript, C#, and Java, have the idea of interfaces—metadata which describes what properties and methods an object can contain. At first glance, protocols look a lot like interfaces. For instance, here’s the definition from my BaseMath library of ComposedStorage, which is a protocol describing a collection that wraps some other collection. It defines two properties, data and endIndex, and one method, subscript (which is a special method in Swift, and provides indexing functionality, like an array). This protocol definition simply says that anything that conforms to this protocol must provide implementations of these three things.

public protocol ComposedStorage {
  associatedtype Storage:MutableCollection where Storage.Index==Int
  typealias Index=Int

  var data: Storage {get set}
  var endIndex: Int {get}
  subscript(i: Int)->Storage.Element {get set}
}

This is a generic protocol. Generic protocols don’t use <Type> markers like generic classes, but instead use the associatedtype keyword. So in this case, ComposedStorage is saying that the data attribute contains something of a generic type called Storage which conforms to the MutableCollection protocol, and that type in turn has an associatedtype called Index which must be of type Int in order to conform to ComposedStorage. It also says that the subscript method returns whatever type the Storage’s Element associatedtype contains. As you can see, protocols provide quite an expressive type system.

Now look further, and you’ll see something else… there are also implementations provided for this protocol!

public extension ComposedStorage {
  subscript(i: Int)->Storage.Element {
    get { return data[i]     }
    set { data[i] = newValue }
  }
  var endIndex: Int {
    return data.count
  }
}

This is where things get really interesting. By providing implementations, we’re automatically adding functionality to any object that conforms to this protocol. For instance, here is the entire definition from BaseMath of AlignedStorage, a class that provides array-like functionality but internally uses aligned memory, which is often required for fast vectorized code:

public class AlignedStorage<T:SupportsBasicMath>: BaseVector, ComposedStorage {
  public typealias Element=T
  public var data: UnsafeMutableBufferPointer<T>

  public required init(_ data: UnsafeMutableBufferPointer<T>) {self.data=data}
  public required convenience init(_ count: Int)      { self.init(UnsafeMutableBufferPointer(count)) }
  public required convenience init(_ array: Array<T>) { self.init(UnsafeMutableBufferPointer(array)) }

  deinit { UnsafeMutableRawBufferPointer(data).deallocate() }

  public var p: MutPtrT {get {return data.p}}
  public func copy()->Self { return .init(data.copy()) }
}

As you can see, there’s not much code at all. And yet this class provides all of the functionality of the protocols RandomAccessCollection, MutableCollection, ExpressibleByArrayLiteral, Equatable, and BaseVector (which together include hundreds of methods such as map, find, dropLast, and distance). This is possible because the protocols that this class conforms to, BaseVector and ComposedStorage, provide this functionality through protocol extensions (either directly, or by other protocols that they in turn conform to).

Incidentally, you may have noticed that I defined AlignedStorage as class, not struct, despite all my earlier hype about value types! It’s important to realize that there are still some situations where classes are required. Apple’s documentation provides some helpful guidance on this topic. One thing that structs don’t (yet) support is deinit; that is, the ability to run some code when an object is destroyed. In this case, we need to deallocate our memory when we’re all done with our object, so we need deinit, which means we need a class.

One common situation where you’ll find you really need to use protocols is where you want the behavior of abstract classes. Swift doesn’t support abstract classes at all, but you can get the same effect by using protocols (e.g. in the above code ComposedStorage defines data but doesn’t implement it in the protocol extension, therefore it acts like an abstract property). The same is true of multiple inheritance: it’s not supported by Swift classes, but you can conform to multiple protocols, each of which can have extensions (this is sometimes referred to as mixins in Swift). Protocol extensions share a lot of ideas with traits in Rust and typeclasses in Haskell.

Generics over Float and Double

For numeric programming, if you’re creating a library then you probably want it to transparently support at least Float and Double. However, Swift doesn’t make this easy. There is a protocol called BinaryFloatingPoint which in theory supports these types, but unfortunately only three math functions in Swift are defined for this protocol (abs, max, and min, plus the standard math operators +, -, *, and /).

You could, of course, simply provide separate functionality for each type, but then you’ve got to deal with creating two versions of everything, and your users have to deal with the same problem too. Interestingly enough, I’ve found no discussions of this issue online, and Swift’s own libraries suffer from this issue in multiple places. As discussed below, Swift hasn’t been used much at all for numeric programming, and these are the kinds of issues we have to deal with. BTW, if you search for numerical programming code online, you will often see the use of the CGFloat type (which suffers from Objective-C’s naming conventions and limitations, which we’ll learn more about later), but that only provides support for one of Float or Double (depending on the system you’re running on). The fact that CGFloat exists at all in the Linux port of Swift is rather odd, because it was only created for Apple-specific compatibility reasons; it is almost certainly not something you’ll be wanting to use.

Resolving this problem is actually fairly straightforward, and is a good example of how to use protocols. In BaseMath I’ve created the SupportsBasicMath protocol, which is extracted below:

public protocol SupportsBasicMath:BinaryFloatingPoint {
  func log2() -> Self
  func logb() -> Self
  func nearbyint() -> Self
  func rint() -> Self
  func sin() -> Self
   
}

Then we tell Swift that Float conforms to this protocol, and we also provide implementations for the methods:

extension Float : SupportsBasicMath {
  @inlinable public func log2() -> Float {return Foundation.log2(self)}
  @inlinable public func logb() -> Float {return Foundation.logb(self)}
  @inlinable public func nearbyint() -> Float {return Foundation.nearbyint(self)}
  @inlinable public func rint() -> Float {return Foundation.rint(self)}
  @inlinable public func sin() -> Float {return Foundation.sin(self)}
   
}

Now in our library code we can simply use SupportsBasicMath as a constraint on a generic type, and we can call all the usual math functions directly. (Swift already provides support for the basic math operators in a transparent way, so we don’t have to do anything to make that work.)
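
For example, a generic function like this one (a made-up helper, not part of BaseMath) now works for both Float and Double arrays, calling the sin method that the protocol guarantees:

func sumOfSines<T: SupportsBasicMath>(_ xs: [T]) -> T {
  return xs.reduce(0) { $0 + $1.sin() }
}

let f: Float  = sumOfSines([0.1, 0.2, 0.3])   // uses the Float conformance shown above
let d: Double = sumOfSines([0.1, 0.2, 0.3])   // the Double conformance is generated the same way
print(f, d)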

If you’re thinking that it must have been a major pain to write all those wrapper functions, then don’t worry—there’s a handy trick I used that meant the computer did it for me. The trick is to use gyb templates to auto-generate the methods using python code, like so:

% for f in binfs:
  func ${f}(_ b: Self) -> Self
% end # f

If you look at the Swift code base itself, you’ll see that this trick is used liberally, for example to define the basic math functions themselves. Hopefully in some future version we’ll see generic math functions in the standard library. In the meantime, just use SupportsBasicMath from BaseMath.

Performance tricks and results

One of the really cool things about Swift is that wrappers like the above have no run-time overhead. As you see, I’ve marked them with the inlinable attribute, which tells LLVM that it’s OK to replace calls to this function with the actual function body. This kind of zero-overhead abstraction is one of the most important features of C++; it’s really amazing to see it in such a concise and expressive language as Swift.

Let’s do some experiments to see how this works, by running a simple benchmark: adding 2.0 to every element of an array of 1,000,000 floats in Swift. Assuming we’ve already allocated an array of appropriate size, we can use this code (note: benchmark is a simple function in BaseMath that times a block of code):

benchmark(title:"swift add") { for i in 0..<ar1.count {ar2[i]=ar1[i]+2.0} }
> swift add: .963 ms

Doing a million floating point additions in a millisecond is pretty impressive! But look what happens if we try one minor tweak:

benchmark(title:"swift ptr add") {
  let (p1,p2) = (ar1.p,ar2.p)
  for i in 0..<ar1.count {p2[i]=p1[i]+2.0}
}
> swift ptr add: .487 ms

It’s nearly the same code, yet twice as fast - so what happened there? BaseMath adds the p property to Array, which returns a pointer to the array’s memory; so the above code is using a pointer, instead of the array object itself. Normally, because Swift has to handle the complexities of COW, it can’t fully optimize a loop like this. But by using a pointer instead, we skip those checks, and Swift can run the code at full speed. Note that due to copy-on-write it’s possible for the array to move if you assign to it, and it can also move if you do things such as resize it; therefore, you should only grab the pointer at the time you need it.

The above code is still pretty clunky, but Swift makes it easy for us to provide an elegant and idiomatic interface. I added a new map method to Array, which puts the result into a preallocated array, instead of creating a new array. Here’s the definition of map (it uses some typealiases from BaseMath to make it a bit more concise):

@inlinable public func map<T:BaseVector>(_ f: UnaryF, _ dest: T) where Self.Element==T.Element {
  let pSrc = p; let pDest = dest.p; let n = count
  for i in 0..<n {pDest[i] = f(pSrc[i])}
}

As you can see, it’s plain Swift code. The cool thing is that this lets us now use this clear and concise code, and still get the same performance we saw before:

benchmark(title:"map add") { ar1.map({$0+2.0}, ar2) }
> map add: .518 ms

I think this is quite remarkable; we’ve been able to create a simple API which is just as fast as the pointer code, but to the class user that complexity is entirely hidden away. Of course, we don’t really know how fast this is yet, because we haven’t compared to C. So let’s do that next.

Using C

One of the really nice things about Swift is how easy it is to add C code that you write, or use external C libraries. To use our own C code, we simply create a new package with Swift Package Manager (SPM), pop a .c file in its Sources directory, and a .h file in its Sources/include directory. (Oh and BTW, in BaseMath that .h file is entirely auto-generated from the .c file using gyb!) This level of C integration is extremely rare, and the implications are huge. It means that every C library out there, including all the functionality built in to your operating system, optimized math libraries, Tensorflow’s underlying C API, and so forth can all be accessed from Swift directly. And if you for any reason need to drop in to C yourself, then you can, without any manual interfacing code or any extra build step.
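
If you haven’t set up a mixed C and Swift build before, the manifest can be very small. Here’s a sketch (the names are made up, and BaseMath itself puts the C code in its own package, but the idea is the same): a C target whose public headers live in Sources/CMath/include, and a Swift target that depends on it and can then simply import CMath.

// swift-tools-version:4.2
import PackageDescription

let package = Package(
    name: "MyMath",
    targets: [
        .target(name: "CMath"),                           // C sources in Sources/CMath
        .target(name: "MyMath", dependencies: ["CMath"])  // Swift sources in Sources/MyMath
    ]
)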

Here’s our add function in C (this is the float version—the double version is similar, and the two are generated from a single gyb template):

void smAdd_float(const float* pSrc, const float val, float* pDst, const int len) {
  for (int i=0; i<len; ++i) { pDst[i] = pSrc[i]+val; }
}

To call this, we need to pass in the count as an Int32; BaseMath adds the c property to arrays for this purpose (alternatively, you could simply use numericCast(ar1.count)). Here’s the result:

benchmark(title:"C add") {smAdd_float(ar1.p, 2.0, ar2.p, ar1.c)}
> C add: .488 ms

It’s basically the same speed as Swift. This is a very encouraging result, because it shows that we can get the same performance as optimized C using Swift. And not just any Swift, but idiomatic and concise Swift, which (thanks to methods like reduce and map) can look much closer to math equations than most languages that are this fast.
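
As an aside, the c helper used in that call could be as simple as the following sketch (hypothetical; see BaseMath for its actual definition):

extension Array {
  // Expose the element count as a C-friendly Int32
  var c: Int32 { return numericCast(count) }
}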

Reductions

Let’s now try a different experiment: taking the sum of our array. Here’s the most idiomatic Swift code:

benchmark(title:"reduce sum") {a1 = ar1.reduce(0.0, +)}
> reduce sum: 1.308 ms

…and here’s the same thing with a loop:

benchmark(title:"loop sum") { a1 = 0; for i in 0..<size {a1+=ar1[i]} }
> loop sum: 1.331 ms

Let’s see if our earlier pointer trick helps this time too:

benchmark(title:"pointer sum") {
  let p1 = ar1.p
  a1 = 0; for i in 0..<size {a1+=p1[i]}
}
> pointer sum: 1.379 ms

Well that’s odd. It’s not any faster, which suggests that it isn’t getting the best possible performance. Let’s again switch to C and see how it performs there:

float smSum_float(const float* pSrc, const int len) {
  float r = 0;
  for (int i=0; i<len; ++i) { r += pSrc[i]; }
  return r;
}

Here’s the result:

benchmark(title:"C sum") {a1 = smSum_float(ar1.p, ar1.c)}
> C sum: .230 ms

I compared this performance to Intel’s optimized Performance Libraries version of sum and found this is even faster than their hand-optimized assembler! To get this to perform better than Swift, I did however need to know a little trick (provided by LLVM’s vectorization docs), which is to compile with the -ffast-math flag. For numeric programming like this, I recommend you always use at least these flags (this is all I’ve used for these experiments, although you can also add -march=native, and change the optimization level from O2 to Ofast):

-Xswiftc -Ounchecked -Xcc -ffast-math -Xcc -O2
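
With Swift Package Manager, these flags are passed straight through to the Swift and C compilers; for example (a usage sketch):

swift build -Xswiftc -Ounchecked -Xcc -ffast-math -Xcc -O2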

Why do we need this flag? Because strictly speaking, addition is not associative, due to the quirks of floating point. But this is, in practice, very unlikely to be something that most people will care about! By default, clang will use the “strictly correct” behavior, which means it can’t vectorize the loop with SIMD. But with -ffast-math we’re telling the compiler that we don’t mind treating addition as associative (amongst other things), so it will vectorize the loop, giving us a 4x improvement in speed.
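
Here’s a small illustration of that non-associativity in Float: regrouping the same three numbers changes the answer.

let a: Float = 1e8, b: Float = -1e8, c: Float = 1
print((a + b) + c)   // 1.0
print(a + (b + c))   // 0.0: b + c rounds back to -1e8 at Float precision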

The other important thing to remember for good performance in C code like this is to ensure you have const marked on everything that won’t change, as I’ve done in the code above.

Unfortunately, there doesn’t seem to currently be a way to get Swift to vectorize any reduction. So for now at least, we have to use C to get good performance here. This is not a limitation of the language itself, it’s just an optimization that the Swift team hasn’t gotten around to implementing yet.

The good news is: BaseMath adds the sum method to Array, which uses this optimized C version, so if you use BaseMath, you get this performance automatically. So the result of test #1 is: failure. We didn’t manage to get pure Swift to reach the same performance as C. But at least we got a nice C version we can call from Swift. Let’s move on to another test and see if we can get better performance by avoiding doing any reductions.

Temporary storage

So what if we want to do a function reduction, such as sum-of-squares? Ideally, we’d like to be able to combine our map style above with sum, but without getting the performance penalty of Swift’s unoptimized reductions. To make this work, the trick is to use temporary storage. If we use our map function above to store the result in preallocated memory, we can then pass that to our C sum implementation. We want something like a static variable for storing our preallocated memory, but then we’d have to deal with locking to handle contention between threads. To avoid that, we can use thread-local storage (TLS). Like most languages, Swift provides TLS functionality; however, rather than make it part of the core language (like, say, C#), it provides a class, which we can access through Thread.current.threadDictionary. BaseMath adds the preallocated memory to this dictionary, and makes it available internally as tempStore; this is then the internal implementation of unary function reduction (there are also binary and ternary versions available):

@inlinable public func sum(_ f: UnaryF)->Element {
  self.map(f, tempStore)
  return tempStore.sum()
}
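
For reference, a per-thread scratch buffer like tempStore could be set up with something like this sketch (hypothetical names, not BaseMath’s actual code); using a reference type means each thread keeps reusing one allocation:

import Foundation

final class ScratchBuffer {
  var data: [Float]
  init(count: Int) { data = [Float](repeating: 0, count: count) }
}

func threadLocalScratch(count: Int) -> ScratchBuffer {
  let key = "tempStore"   // hypothetical dictionary key
  let dict = Thread.current.threadDictionary
  // Reuse the buffer stored for this thread if it's big enough
  if let buf = dict[key] as? ScratchBuffer, buf.data.count >= count { return buf }
  let buf = ScratchBuffer(count: count)
  dict[key] = buf
  return buf
}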

We can then use the sum method as follows:

benchmark(title:"lib sum(sqr)") {a1 = ar1.sum(Float.sqr)}
> lib sum(sqr): .786 ms

This provides a nice speedup over the regular Swift reduce version:

benchmark(title:"reduce sumsqr") {a1 = ar1.reduce(0.0, {$0+Float.sqr($1)})}
> reduce sumsqr: 1.459 ms

Here’s the C version:

float smSum_sqr_float(const float* restrict pSrc, const int len) {
  float r = 0;
  #pragma clang loop interleave_count(8)
  for (int i=0; i<len; ++i) { r += sqrf(pSrc[i]); }
  return r;
}

Let’s try it out:

benchmark(title:"c sumsqr") {a1 = smSum_sqr_float(ar1.p, ar1.c)}
> c sumsqr: .229 ms

C implementations of sum for all standard unary math functions are made available by BaseMath, so you can call the above implementation by simply using:

benchmark(title:"lib sumsqr") {a1 = ar1.sumsqr()}
> lib sumsqr: .219 ms

In summary: whilst the Swift version using temporary storage (and calling C for just the final sum) was twice as fast as just using reduce, using C is another 3 or more times faster.

The warts

As you can see, there’s a lot to like about numeric programming in Swift. You can get both the performance of optimized C with the convenience of automatic memory management and elegant syntax.

The most concise and flexible language I’ve used is Python. And the fastest I’ve used is C (well… actually it’s FORTRAN, but let’s not go there). So how does Swift stack up against these high bars? The very idea that we could compare a single language to both the flexibility of Python and the speed of C is an amazing achievement in itself!

Overall, my view is that in general it takes a bit more code in Swift than Python to write the stuff I want to write, and there are fewer ways to abstract common code out. For instance, I use decorators a lot in Python, and use them to write loads of code for me. I use *args and **kwargs a lot (the new dynamic features in Swift can provide some of that functionality, but they don’t go as far). I zip multiple variables together at once (in Swift you have to zip pairs of pairs for multiple variables, and then use nested parens to destructure them). And then there’s the code you have to write to get your types all lined up nicely.

I also find Swift’s performance is harder to reason about and optimize than C. C has its own quirks around performance (such as the need to use const and sometimes even requiring restrict to help the compiler) but they’re generally better documented, better understood, and more consistent. Also, C compilers such as clang and gcc provide powerful additional capabilities using pragmas such as omp and loop which can even auto-parallelize code.

Having said that, Swift is closer to achieving the combination of Python’s expressiveness and C’s speed than any other language I’ve used.

There are some issues still to be aware of. One thing to consider is that protocol-oriented programming requires a very different way of doing things than what you’re probably used to. In the long run, that’s probably a good thing, since learning new programming styles can help you become a better programmer; but it could lead to some frustrations for the first few weeks.

This issue is particularly challenging because Swift’s compiler often has no idea where the source of a protocol type problem really is, and its ability to guess types is still pretty flaky. So extremely minor changes, such as changing the name of a type, or changing where a type constraint is defined, can turn something that used to work into something that spits out four screens of error messages. My advice is to try to create minimal versions of your type structures in a standalone test file, and get things working there first.

Note, however, that ease of use generally requires compromises. Python is particularly easy, because it’s perfectly happy for you to shoot yourself in the foot. Swift at least makes sure you first know how to untie your shoelaces. Chris told me that the pitch when building Swift in the first place was that the important thing to optimize for is the “end to end time to get to a correct implementation of whatever you’re trying to do”. This includes the time to pound out code, the time to debug it, and the time to refactor/maintain it if you’re making a change to an existing codebase. I don’t have enough experience yet, but I suspect that on this metric Swift will turn out to be a great performer.

There are some parts of Swift which I’m not a fan of: the compromises made due to Apple’s history with Objective-C, its packaging system, its community, and the lack of C++ support. Or to be more precise: it is largely parts of the Swift ecosystem that I’m not a fan of. The language itself is quite delightful. And the ecosystem can be fixed. But, for now, this is the situation that Swift programmers have to deal with, so let’s look at each of these issues in turn.

Objective-C

Objective-C is a language developed in the 1980s, designed to bring some of the object-oriented features of Smalltalk to C. It was a very successful project, and was picked by NeXT as the basis for programming NeXTSTEP in 1988. With NeXT’s acquisition by Apple, it became the primary language for coding for Apple devices. Today, it’s showing its age, as well as the constraints imposed by the decision to make it a strict superset of C. For instance, Objective-C doesn’t support true function overloading. Instead, it uses something called selectors, which are simply required keyword arguments. Each function’s full name is defined by the concatenation of the function name with all the selector names. This idea is also used by AppleScript, which provides something very similar to allow the name print to mean different things in different contexts:

print page 1
print document 2
print pages 1 thru 5 of document 2

AppleScript in turn inherited this idea from HyperTalk, a language created in 1987 for Apple’s much-loved (and missed) HyperCard program. Given all this history, it’s not surprising that today the idea of required named arguments is something that most folks at Apple have quite an attachment to. Perhaps more importantly, it provided a useful compromise for the designers of Objective-C, since they were able to avoid adding true function overloading to the language, keeping close compatibility with C.

Unfortunately, this constraint impacts Swift today, decades after the decisions that led to its introduction in Objective-C. Swift does provide true function overloading, which is particularly important in numeric programming, where you really don’t want to have to create whole separate functions for floats, doubles, and complex numbers (and quaternions, etc…). But by default all keyword names are still required, which can lead to verbose and visually cluttered code. And Apple’s style guide strongly promotes this style of coding; their Objective-C and Swift style guides closely mirror each other, rather than allowing programmers to really leverage Swift’s unique capabilities. You can opt out of requiring named arguments by prefixing a parameter name with _, which BaseMath uses everywhere that optional arguments are not needed.
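
Here’s a tiny illustration (not from BaseMath) of the difference that _ makes at the call site:

func add(x: Float, y: Float) -> Float { return x + y }    // labels required by default
func add(_ x: Float, _ y: Float) -> Float { return x + y }  // labels opted out with _

let a = add(x: 1.0, y: 2.0)
let b = add(1.0, 2.0)        // reads much closer to the math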

Another area where things get rather verbose is when it comes to working with Foundation, Apple’s main class library, which is also used by Objective-C. Swift’s standard library is missing a lot of functionality that you’ll need, so you’ll often need to turn to Foundation to get stuff done. But you won’t enjoy it. After the pleasure of using such an elegantly designed language as Swift, there’s something particularly sad about using it to access as unwieldy a library as Foundation. For instance, Swift’s standard library doesn’t provide a built-in way to format floats with fixed precision, so I decided to add that functionality to my SupportsBasicMath protocol. Here’s the code:

extension SupportsBasicMath {
  public func string(_ digits:Int) -> String {
    let fmt = NumberFormatter()
    fmt.minimumFractionDigits = digits
    fmt.maximumFractionDigits = digits
    return fmt.string(from: self.nsNumber) ?? "\(self)"
  }
}

The fact that we can add this functionality to Float and Double by writing such an extension is really cool, as is the ability to handle failed conversions with Swift’s ?? operator. But look at the verbosity of the code to actually use the NumberFormatter class from Foundation! And it doesn’t even accept Float or Double, but the awkward NSNumber type from Objective-C (which is itself a clunky workaround for the lack of generics in Objective-C). So I had to add an nsNumber property to SupportsBasicMath to do the casting.
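
For completeness, the bridging property could be as simple as this sketch (hypothetical; check BaseMath for the real definition):

import Foundation

extension Float  { public var nsNumber: NSNumber { return NSNumber(value: self) } }
extension Double { public var nsNumber: NSNumber { return NSNumber(value: self) } }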

The Swift language itself does help support more concise styles, such as the {f($0)} style of closures. Concision is important for numeric programming, since it lets us write our code to reflect more closely the math that we’re implementing, and to understand the whole equation at a glance. For a masterful exposition of this (and much more), see Iverson’s Turing Award lecture Notation as a Tool for Thought.

Objective-C also doesn’t have namespaces, which means that each project picks some 2 or 3 letter prefix which it adds to all symbols. Most of the Foundation library still uses names inherited from Objective-C, so you’ll find yourself using types like CGFloat and functions like CFAbsoluteTimeGetCurrent. (Every time I type one of these symbols I’m sure a baby unicorn cries out in pain…)

The Swift team made the surprising decision to use an Objective-C implementation of Foundation and other libraries when running Swift on Apple devices, but to use native Swift libraries on Linux. As a result, you will sometimes see different behavior on each platform. For instance, the unit test framework on Apple devices is unable to find and run tests that are written as protocol extensions, but they work fine under Linux.

Overall, I feel like the constraints and history of Objective-C seem to bleed into Swift programming too often, and each time that happens, there’s a real friction that pops up. Over time, however, these issues seem to be reducing, and I hope that in the future we’ll see Swift break out from the Objective-C shackles more and more. For instance, perhaps we’ll see a real effort to create idiomatic Swift replacements for some of the Objective-C class libraries.

Community

I’ve been using Python a lot over the last couple of years, and one thing that always bothered me is that too many people in the Python community have only ever used that one language (since it’s a great beginners’ language and is widely taught to undergraduates). As a result, there’s a lack of awareness that different languages can do things in different ways, and each choice has its own pros and cons. Instead, in the Python world, there’s a tendency for folks to think that the Python way is the one true way.

I’m seeing something similar in Swift, but in some ways it’s even worse: most Swift programmers got their start as Objective-C programmers. So a lot of the discussion you see online is from Objective-C programmers writing Swift in a style that closely parallels how things are done in Objective-C. And nearly all of them do nearly all of their programming in Xcode (which is almost certainly my least favorite IDE, except for its wonderful Swift Playgrounds feature), so a lot of advice you’ll find online shows how to solve Swift problems by getting Xcode to do things for you, rather than writing the code yourself.

Most Swift programmers are writing iOS apps, so you’ll also find a lot of guidance on how to lay out a mobile GUI, but there’s almost no information about things like how to distribute command line programs for Linux, or how to compile static libraries. In general, because the Linux support for Swift is still so new, there’s not much information available about how to use it, and many libraries and tools don’t work under Linux.

Most of the time when I was tracking down problems with my protocol conformance, or trying to figure out how to optimize some piece of code, the only information I could find would be a mailing list discussion amongst Apple’s Swift language team. These discussions tend to focus on the innards of the compiler and libraries, rather than how to use them. So there’s a big missing middle ground between app developers discussing how to use Xcode and Swift language implementers discussing how to modify the compiler. There is a good community forming now around the Discourse forums at https://forums.swift.org/, which hopefully over time will turn into a useful knowledge base for Swift programmers.

Packaging and installation

Swift has an officially sanctioned package system, called Swift Package Manager (SPM). Unfortunately, it’s one of the worst packaging systems I’ve ever used. I’ve noticed that nearly every language, when creating a package manager, reinvents everything from scratch, and fails to leverage all the successes and failures of previous attempts. Swift follows this unfortunate pattern.

There are some truly excellent packaging systems out there. The best, perhaps, was and still is Perl’s CPAN, which includes an international automated testing service that tests all packages on a wide range of systems, deeply integrates documentation, has excellent tutorials, and much more. Another terrific (and more modern) system is conda, which not only handles language-specific libraries (with a focus on Python) but also handles automatically installing compatible system libraries and binaries too—and manages to do everything in your home directory so you don’t even need root access. And it works well on Linux, Mac, and Windows. It can handle distribution of both compiled modules, or source.

SPM, on the other hand, has none of the benefits of any of these systems. Even though Swift is a compiled language, it doesn’t provide a way to create or distribute compiled packages, which means users of your package will have to install all the pre-requisites for building it. And SPM doesn’t let you describe how to build your package, so (for instance) if you use BaseMath it’s up to you to remember to add the flags required for good performance when you build something that uses it.

The way dependencies are handled is really awkward. Git tags or branches are used for dependencies, and there’s no easy way to switch between a local dev build and the packaged version (like, for instance, the -e flag to pip or the conda develop command). Instead, you have to modify the package file to change the location of the dependency, and remember to switch it back before you commit.
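
Concretely, the switch described above means editing the manifest by hand, something like this sketch (the package name and URL are hypothetical):

// swift-tools-version:4.2
import PackageDescription

let package = Package(
  name: "MyApp",
  dependencies: [
    .package(url: "https://github.com/example/SomeLib", from: "1.0.0"),  // released tag
    // .package(path: "../SomeLib"),                                     // local dev build: swap by hand, remember to switch back
  ],
  targets: [
    .target(name: "MyApp", dependencies: ["SomeLib"]),
  ]
)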

It would take far too long to document all the deficiencies of SPM; instead, you can work on the assumption that any useful feature you’ve appreciated from whatever packaging system you’re using now probably won’t be in SPM. Hopefully someone will get around to setting up a conda-based system for Swift and we can all just start using that instead…

Also, installation of Swift is a mess. On Linux, for instance, only Ubuntu is supported, and different versions require different installers. On Mac, Swift versions are tied to Xcode versions in a confusing and awkward way, and command line and Xcode versions are somewhat separate, yet somewhat linked, in ways that make my brain hurt. Again, conda seems like it could provide the best option to avoid this, since a single conda package can be used to support any flavor of Linux, and Mac can also be supported in the same way. If the work was done to get Swift on to conda, then it would be possible to say just conda install swift on any system, and everything would just work. This would also provide a solution for versioning, isolated environments, and complex dependency tracking.

(If you’re on Windows, you are, for now, out of luck. There’s an old unofficial port to Cygwin. And Swift runs fine on the Windows Subsystem for Linux. But no official native Windows Swift as yet, sadly. There is, however, some excellent news on this front: a hero named Saleem Abdulrasool has made great strides towards making a complete native port entirely independently, and in the last few days it has gotten to a point where the vast majority of the Swift test suite passes.)

C++

Whilst Apple went with Objective-C for their “C with objects” solution, the rest of the world went with C++. Eventually, the Objective-C extensions were also added to C++, to create “Objective-C++”, but there was no attempt to unify the concepts across the languages, so the resulting language is a mutant with many significant restrictions. However, there is a nice subset of the language that gets around some of the biggest limitations of C; for instance you can use function overloading, and have access to a rich standard library.

Unfortunately, Swift can’t interface with C++ at all. Even something as simple as a header file containing overloaded functions will cause Swift language interop to fail.

This is a big problem for numeric programmers, because many of the most useful numeric libraries today are written in C++. For instance, the ATen library at the heart of PyTorch is C++. There are good reasons that numeric programmers lean towards C++: it provides the features that are needed for concise and expressive solutions to numeric programming problems. For example, Julia programmers are (rightly) proud of how easy it is to support the critical broadcasting functionality in their language, which they have documented in the Julia challenge. In C++ this challenge has an elegant and fast solution. You won’t find something like this in pure C, however.

So this means that a large and increasing number of the most important building blocks for numeric programming are out of bounds for Swift programmers. This is a serious problem. (You can write plain C wrappers for a C++ class, and then create a Swift class that uses those wrappers, but that’s a very big and very boring job which I’m not sure many people are likely to embark on.)

Other languages have shown potential ways around this. For instance, C# on Windows provides “It Just Works” (IJW) interfacing with C++/CLI, a superset of C++ with support for .Net. Even more interestingly, the CPPSharp project leverages LLVM to auto-generate a C# wrapper for C++ code with no calling overhead.

Solving this problem will not be easy for Swift. But because Swift uses LLVM, and already interfaces with C (and Objective-C) it is perhaps better placed to come up with a great solution than nearly any other language. Except, perhaps, for Julia, since they’ve already done this. Twice.

Conclusion

Swift is a really interesting language which can support fast, concise, expressive numeric programming. The Swift for Tensorflow project may be the best opportunity for creating a programming language where differentiable programming is a first class citizen. Swift also lets us easily interface with C code and libraries.

However, Swift on Linux is still immature, the packaging system is weak, installation is clunky, and the libraries suffer from some rough spots due to the historical ties to Objective-C.

So how does it stack up? In the data science world, we’re mainly stuck using either R (which is the least pleasant language I’ve ever used, but with the most beautifully designed data munging and plotting libraries anywhere) or Python (which is painfully slow, very hard to parallelize, but is extremely expressive and has the best deep learning libraries available). We really need another option. Something that is fast, flexible, and provides good interop with existing libraries.

Overall, the Swift language itself looks to be exactly what we need, but much of the ecosystem needs to be replaced or at least dramatically leveled up. There is no data science ecosystem to speak of, although the S4TF project seems likely to create some important pieces. This is a really good place to be spending time if you’re interested in being part of something that has a huge amount of potential and has some really great people working to make that happen, and if you’re OK with helping smooth out the warts along the way.

One year of deep learning

My resolution for 2018 was to get into deep learning. I had stumbled upon a website called fast.ai in October 2017 after reading an article from the New York Times describing the shortage of people capable of training a deep learning model. It sounds a bit clichéd to say it changed my life, but I could hardly have imagined that, one year later, I would be helping prepare the next version of this course from behind the scenes. So in this article, I’ll tell you a little bit about my personal journey into Deep Learning, and I will share some advice which I feel could have been useful to me six months ago.

Example of adding the Eiffel Tower to a painting using a neural network

Who am I and where do I come from?

My background is mostly in Math. I have a Master’s degree from a French University; I started a PhD but stopped after six months because I found it too depressing, and went on to teach undergrads for seven years in Paris. I’m a self-taught coder, my father having had the good idea to put an introduction to ‘Basic’ in my hands when I was 13.

My husband got relocated to New York City, so I moved there three and a half years ago, and became half stay-at-home dad, half textbook writer for a publisher in France. When I first looked at fast.ai, I was curious about what the hype around Artificial Intelligence was all about, and I wanted to know if I could understand what seemed to only be accessible to a few geniuses.

I have to confess that I almost didn’t start the course; the claim that it could explain Deep Learning to anyone with just one year of coding experience and high school math sounded very suspicious to me, and I was wondering if it wasn’t all bogus (spoiler alert: it’s not). I did decide to go with it though; I had finished my latest books, and finding seven hours a week to work on the course while my little boys napped didn’t seem like too much.

Although I started the first version of the MOOC with a clear advantage in math, I struggled a lot with what I call the geeky stuff. I’m a Windows user and I had never launched a terminal before. The setup took me the better part of a week before I was finally able to train my own dogs and cats classifier. It felt like some form of torture every time I had to run some commands in a terminal (that bit still hasn’t changed much!)

If you’re new to the field and struggling with a part (or all) of it, remember no one had it easy. There’s always something you won’t know and that will be a challenge, but you will overcome it if you persevere. And with time, it will become easier, at least a little bit… I still need help with half my bash commands, and broke the docs and course website twice during the first lesson. Fortunately, everyone was too busy watching Jeremy to notice.

Do you need advanced math to do deep learning?

The short answer is no. The long answer is no, and anyone telling you the opposite just wants to scare you. You might need advanced math in some areas of theoretical research in Deep Learning, but there’s room for everyone at the table.

To be able to train a model in practice, you only need three things: have a sense of what a derivative is, know log and exp for the error functions, and know what a matrix product is. And you can learn all of this in a very short amount of time, with multiple resources online. In the course, Jeremy recommends Khan Academy for learning about derivatives, log, and exp, and 3 Blue 1 Brown for learning about matrix products.

In my opinion, the one mathy (it’s a bit at the intersection of math and code) thing you’ll really need to master (or at least get as comfortable with as you can) is broadcasting.

Make your own boot camp, if you’re serious about it

After I finished the first part of the course, it was clear to me that I wanted to work in this field (hence the New Year’s resolution). I contemplated various boot camps that promised to turn me into a Data Scientist in exchange for a substantial tuition fee, but I found enough testimonials online to scare me off that idea, so fortunately I quickly gave up on it.

There are enough free (or cheap) resources online to teach you all you need, so as long as you have the self-discipline, you can make your own boot camp. The best of all are the courses from fast.ai, of course (but I’m a bit biased since I work there now ;) ).

I thought I’d never be selected to join the International Fellowship for the second version of the second part of the course so I was a tad unprepared when the acceptance email came in. I booked a coworking space to distance myself from the craziness of a baby and a toddler at home, hired an army of sitters in a rush until we found a nanny, then worked from 9 to 5 each day, plus the evenings, on the course materials. I thought I’d follow other MOOCs, but with all the challenges Jeremy left on the forum, and the vibrant community there, I never got the time to look elsewhere.

Even though the course is intended for people to spend seven hours a week on homework, there is definitely enough to keep you busy for far longer, especially in the second part. If you’re serious about switching careers to Deep Learning, you should spend the seven weeks working your ass off on this course. And if you can afford it money-wise/family-wise, fly yourself to San Francisco to attend in person and go every day to the study group at USF. If you can’t, find other people in your city following the course (or start your own group). In any case, be active on the forum, not just to ask questions when you have a bug, but also to help other people with their code.

Show what you can do

I’m shy and I hate networking. Those who have met me in person know I can’t chitchat for the love of God. Fortunately, there are plenty of ways you can sell yourself to potential employers behind the safety of your computer. Here are a few things that can help:

  1. Make your own project to show what you learned. Be sure to completely polish one project before moving to another one. In my case, it was reproducing super-convergence from Leslie Smith’s paper, and then the Deep painterly harmonization paper.
  2. Write a blog to explain what you learned. It doesn’t have to be complex new research articles, start with the basics of a model even if you think there are thousands of articles like this already. You’ll learn a lot just by trying to explain things you thought you understood.
  3. Contribute to a Deep Learning related open source project (like the fastai library).
  4. Get into Kaggle competitions (still on my to-do list, maybe it’s going to be my 2019 resolution ;) ).
  5. Get a Twitter account to tell people about all of the above.

I was amazed and surprised to get several job offers before the course even ended. Then, Jeremy mentioned he was going to rebuild the library entirely and I offered to help. One thing led to another, and he managed to get a sponsorship from AWS for me to be a Research Scientist at fast.ai.

Behind the mirror

In my opinion, there are always three stages in learning. First you understand something abstractly, then you can explain it, and eventually you manage to actually do it. This is why it’s so important to see if you can redo the code you see in the courses by yourself.

As far as Deep Learning is concerned, following the course was the first stage for me; writing blog posts, notebooks or answering questions on the forum was the second stage; re-building the library from scratch with Jeremy was the third one.

I have learned even more in these past few months than during the time when I was following the courses. Some of it was in areas I had discarded a bit too quickly… and a lot of it came from refactoring pieces of code under Jeremy’s guidance until we got to the outcome you can see today. Building a fully integrated framework means you have to implement everything, so you need to master every part of the process.

All in all, this has been one of the years in which I’ve learned the most in my life. I’ll be forever thankful to Rachel and Jeremy for creating this amazing course, and I’m very proud to add my little stone to it.

The new fast.ai research datasets collection, on AWS Open Data

In machine learning and deep learning we can’t do anything without data. So the people that create datasets for us to train our models are the (often under-appreciated) heroes. Some of the most useful and important datasets are those that become important “academic baselines”; that is, datasets that are widely studied by researchers and used to compare algorithmic changes. Some of these become household names (at least, among households that train models!), such as MNIST, CIFAR 10, and Imagenet.

We all owe a debt of gratitude to those kind folks who have made datasets available for the research community. So fast.ai and the AWS Public Dataset Program have teamed up to try to give back a little: we’ve made some of the most important of these datasets available in a single place, using standard formats, on reliable and fast infrastructure. For a full list and links see the fast.ai datasets page.

fast.ai uses these datasets in the Deep Learning for Coders courses, because they provide great examples of the kind of data that students are likely to encounter, and the academic literature has many examples of model results using these datasets which students can compare their work to. If you use any of these datasets in your research, please show your gratitude by citing the original paper (we’ve provided the appropriate citation link below for each), and if you use them as part of a commercial or educational project, consider adding a note of thanks and a link to the dataset.

Dataset example: the French/English parallel corpus

One of the lessons that gets the most “wow” feedback from fast.ai students is when we study neural machine translation. It seems like magic when we can teach a model to translate from French to English, even if we can’t speak both languages ourselves!

But it’s not magic; the key is the wonderful dataset that we leverage in this lesson: the French/English parallel text corpus prepared back in 2009 by Professor Chris Callison-Burch of the University of Pennsylvania. This dataset contains over 20 million sentence pairs in French and English. He built the dataset in a really clever way: by crawling millions of Canadian web pages (which are often multi-lingual) and then using a set of simple heuristics to transform French URLs into English URLs. The dataset is particularly important for researchers since it is used in the most important annual competition for benchmarking machine translation models.

Here’s some examples of the sentence pairs that our translation models can learn from:

English: Often considered the oldest science, it was born of our amazement at the sky and our need to question
French: Souvent considérée comme la plus ancienne des sciences, elle découle de notre étonnement et de nos questionnements envers le ciel

English: Astronomy is the science of space beyond Earth’s atmosphere.
French: L’astronomie est la science qui étudie l’Univers au-delà de l’atmosphère terrestre.

English: The name is derived from the Greek root astron for star, and nomos for arrangement or law.
French: Son nom vient du grec astron, qui veut dire étoile et nomos, qui veut dire loi.

English: Astronomy is concerned with celestial objects and phenomena – like stars, planets, comets and galaxies – as well as the large-scale properties of the Universe, also known as “The Big Picture”.
French: Elle s’intéresse à des objets et des phénomènes tels que les étoiles, les planètes, les comètes, les galaxies et les propriétés de l’Univers à grande échelle.

So what’s Professor Callison-Burch doing now? When we reached out to him to check some details for his dataset, he told us he’s now preparing the University of Pennsylvania’s new AI course; and part of his preparation: watching the videos at course.fast.ai! It’s a small world indeed…

The dataset collection

The following categories are currently included in the collection:

The datasets are all stored in the same tgz format, and (where appropriate) the contents have been converted into standard formats, suitable for import into most machine learning and deep learning software. For examples of using the datasets to build practical deep learning models, keep an eye on the fast.ai blog where many tutorials will be posted soon.