03 May 2019 Jason Antic (Deoldify), Jeremy Howard (fast.ai), and Uri Manor (Salk Institute)
We presented this work at the Facebook f8 conference. You can see this video of our talk here, or read on for more details and examples.
Decrappification, DeOldification, and Super Resolution
In this article we will introduce the idea of “decrappification”, a deep learning method implemented in fastai on PyTorch that can do some pretty amazing things, like… colorize classic black and white movies—even ones from back in the days of silent movies, like this:
The same approach can make your old family photos look like they were taken on a modern camera, and even improve the clarity of microscopy images taken with state of the art equipment at the Salk Institute, resulting in 300% more accurate cellular analysis.
The genesis of decrappify
Generative models are models that generate music, images, text, and other complex data types. In recent years generative models have advanced at an astonishing rate, largely due to deep learning, and particularly due to generative adversarial networks (GANs). However, GANs are notoriously difficult to train: they require a large amount of data, need many GPUs and a lot of time to train, and are highly sensitive to minor hyperparameter changes.
fast.ai has been working in recent years towards making a range of models easier and faster to train, with a particular focus on using transfer learning. Transfer learning refers to pre-training a model using readily available data and quick and easy to calculate loss functions, and then fine-tuning that model for a task that may have fewer labels, or be more expensive to compute. This seemed like a potential solution to the GAN training problem, so in late 2018 fast.ai worked on a transfer learning technique for generative modeling.
The pre-training task that fast.ai selected was this: Start with an image dataset and “crappify” the images, such as reducing the resolution, adding JPEG artifacts, and obscuring parts with random text. Then train a model to “decrappify” those images to return them to their original state. fast.ai started with a model that was pre-trained for ImageNet classification, and added a U-Net upsampling network, adding various modern tweaks to the regular U-Net. A simple fast loss function was initially used: mean squared pixel error. This U-Net could be trained in just a few minutes. Then, the loss function was replaced with a combination of other loss functions used in the generative modeling literature (more details in the f8 video) and the model was trained for another couple of hours. The plan was then to finally add a GAN for the last few epochs - however it turned out that the results were so good that fast.ai ended up not using a GAN for the final models.
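To make the idea concrete, here is a minimal sketch of a crappifier covering the three degradations named above. It assumes the Pillow library is available; the function name `crappify`, its parameters, and their default values are hypothetical choices for illustration, not the settings fast.ai used.

```python
import io
import random

from PIL import Image, ImageDraw


def crappify(img, scale=4, jpeg_quality=20, text="sample text", seed=None):
    """Return a degraded copy of `img` with the same size (hypothetical helper)."""
    rng = random.Random(seed)
    w, h = img.size
    # 1. Reduce resolution, then scale back up so input and target sizes match.
    small = img.resize((max(1, w // scale), max(1, h // scale)), Image.BILINEAR)
    out = small.resize((w, h), Image.BILINEAR)
    # 2. Obscure part of the image with text at a random position.
    draw = ImageDraw.Draw(out)
    pos = (rng.randint(0, w // 2), rng.randint(0, h // 2))
    draw.text(pos, text, fill=(255, 255, 255))
    # 3. Add JPEG artifacts by round-tripping through a low-quality encode.
    buf = io.BytesIO()
    out.save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


# A flat-colored stand-in for a real photo; the (degraded, original) pair
# would form one (input, target) training example.
original = Image.new("RGB", (64, 64), (40, 120, 200))
degraded = crappify(original, seed=0)
```

Because the degraded image keeps the original size, the pair can be fed directly to an image-to-image model without any extra alignment step.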
The genesis of DeOldify
DeOldify was developed at around the same time that fast.ai started looking at decrappification, and was designed to colorize black and white photos. Jason Antic watched the Spring 2018 fast.ai course that introduced GANs, U-Nets, and other techniques, and wondered about what would happen if they were combined for the purpose of colorization. Jason’s initial experiments with GANs were largely a failure, so he tried something else - the self-attention GAN (SAGAN). His ambition was to be able to successfully colorize real world old images with the noise, contrast, and brightness problems caused by film degradation. The model needed to be trained on photos with these problems simulated. To do this, he started with the images in the ImageNet dataset, converted them to b&w, and then added random contrast, brightness, and other changes. In other words, he was “crappifying” images too!
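The colorization flavor of crappifying can be sketched in a few lines of NumPy. This is only an illustration under assumed ranges: the function name and the `max_contrast` and `max_brightness` values are hypothetical, and the article does not specify which grayscale conversion DeOldify used (the sketch uses the common ITU-R BT.601 luminance weights).

```python
import numpy as np


def crappify_for_colorization(rgb, rng, max_contrast=0.5, max_brightness=0.3):
    """rgb: float array (H, W, 3) in [0, 1]. Returns a degraded grayscale (H, W) image."""
    # Luminance-weighted grayscale conversion (ITU-R BT.601 weights).
    gray = rgb @ np.array([0.299, 0.587, 0.114])
    # Random contrast: stretch or squeeze values around the mean.
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    # Random brightness: shift every pixel by the same offset.
    brightness = rng.uniform(-max_brightness, max_brightness)
    out = (gray - gray.mean()) * contrast + gray.mean() + brightness
    return np.clip(out, 0.0, 1.0)


rng = np.random.default_rng(0)
color = rng.random((32, 32, 3))      # stand-in for an ImageNet photo
model_input = crappify_for_colorization(color, rng)
```

The original color image serves as the training target, and the degraded grayscale image as the input.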
The results were amazing, and people all over the internet were talking about Jason’s new “DeOldify” program. Jeremy saw some of the early results and was excited to see that someone else was getting great results in image generation. He reached out to Jason to learn more. Jeremy and Jason soon realized that they were both using very similar techniques, but had both developed in some different directions too. So they decided to join forces and develop a decrappification process that included all of their best ideas.
The result of joining forces was a process that allowed GANs to be skipped entirely, and which can be trained on a gaming pc. All of Jason’s development was done on a Linux box in a dining room, and each experiment used only a single consumer GPU (a GeForce 1080Ti). The lack of impressive hardware and industrial resources didn’t prevent highly tangible progress. In fact, it probably encouraged it.
Jason then took the process even further, moving from still images to movies. He discovered that just a tiny bit of GAN fine-tuning on top of the process developed with fast.ai could create colorized movies in just a couple of hours, at a quality beyond any automated process that had been built before.
The genesis of microscopy super-resolution
Meanwhile, Uri Manor, Director of the Waitt Advanced Biophotonics Core (WABC) at the Salk Institute, was looking for ways to simultaneously improve the resolution, speed, and signal-to-noise of the images taken by the WABC’s state of the art ZEISS scanning electron and laser scanning confocal microscopes. These three parameters are notably in tension with one another - a variant of the so-called “triangle of compromise”, the bane of existence for all photographers and imaging scientists alike. The advanced microscopes at the WABC are heavily used by researchers at the Salk (as well as several neighboring institutions including Scripps and UCSD) to investigate the ultrastructural organization and dynamics of life, ranging anywhere from carbon capturing machines in plant tissues to synaptic connections in brain circuits to energy generating mitochondria in cancer cells and neurons. The scanning electron microscope is distinguished by its ability to serially slice and image an entire block of tissue, resulting in a 3-dimensional volumetric dataset at nearly nanometer resolution. The so-called “Airyscan” scanning confocal microscopes at the WABC boast a cutting-edge array of 32 hexagonally packed detectors that facilitate fluorescence imaging at nearly double the resolution of a normal microscope while also providing 8-fold sensitivity and speed.
Thanks to the Wicklow AI Medical Research Initiative (WAMRI), Jeremy Howard and Fred Monroe were able to visit Salk to see some of the amazing work done there, and discuss opportunities to use deep learning to help with some of Salk’s projects. Upon meeting Uri, it was immediately clear that fast.ai’s techniques would be a great fit for Uri’s needs for higher resolution microscopy. Fred, Uri, and a Salk-led team of scientists ranging from UCSD to UT-Austin, worked together to bring the methods into the microscopy domain, and the results were stunning. Using carefully acquired high resolution images for training, the group validated “generalized” models for super-resolution processing of electron and fluorescence microscope images, enabling faster imaging with higher throughput, lower sample damage, and smaller file sizes than ever reported. Since the models are able to restore images acquired on relatively low-cost microscopes, this model also presents an opportunity to “democratize” high resolution imaging to those not working at elite institutions that can afford the latest cutting edge instrumentation.
For creating microscopy movies, Fred used a different approach to the one Jason used for classic Hollywood movies. Taking inspiration from this blog post about stabilizing neural style transfer in video, he was able to add a “stability” measure to the loss function being used for creating single image super-resolution. This stability loss measure encourages images with stable features in the face of small amounts of random noise. Noise injection is already part of the process to create training sets at Salk anyways - so this was an easy modification. This stability when combined with information about the preceding and following frames of video significantly reduces flicker and improves the quality of output when processing low resolution movies. See more details about the process in the section below - “Notes on Creating Super-resolution Microscopy Videos”.
A deep dive into DeOldify
Let’s look at what’s going on behind the scenes of DeOldify in some detail. But first, here’s how you can use DeOldify yourself! The easiest way is to use these free Colab notebooks, which walk you through the whole process:
Or you can download the code and run it locally from the GitHub repo.
Advances in the state of the art
The Zhang et al “Colorful Image Colorization” model is currently popular, widely used, and was previously state of the art. What follows are original black and white photos (left), along with comparisons between the “Colorful Image Colorization” model (middle), and the latest version of DeOldify (right). Notice that the people and objects in the DeOldify photos are colorized more consistently, accurately, and in greater detail. Often, the images that DeOldify produces can be considered nearly photorealistic.
Additionally, high quality video colorization in DeOldify is made possible by advances in training that greatly increase stability and quality of rendering, largely brought about by employing NoGAN training. The following clips illustrate just how well DeOldify not only colorizes video (even special effects!), but also maintains temporal consistency across frames.
The Design of DeOldify
There are a few key design decisions in DeOldify that make a significant impact on the quality of rendered images.
“Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other.”
This same approach was adapted to the critic and U-Net based generators used in DeOldify. The motivation is simple: You want to have maximal continuity, consistency, and completeness in colorization, and self-attention is vital for this. This becomes a particularly apparent problem in models without self-attention when you attempt to colorize images containing large features such as ocean water. Often, you’ll see these large areas filled in with inconsistent coloration in such models.
Zhang et al “Colorful Image Colorization” model output is on the left, and DeOldify output is on the right. Notice that the water in the background, as well as the clothing of the man on the left, is much more consistently, completely, and correctly colorized in the DeOldify model. This is thanks in large part to self-attention, by which colorization decisions can easily take into account a more global context of features.
Self-attention also helps drive fantastic levels of detail in the colorizations.
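For readers who want to see the mechanism, below is a simplified NumPy sketch of a SAGAN-style self-attention layer. It is an assumption-laden illustration rather than DeOldify's actual layer: the 1x1 convolutions are written as plain matrix multiplies, the final output projection from the paper is omitted, and all names are hypothetical. The key point is that every output location is a weighted sum over all locations in the feature map.

```python
import numpy as np


def self_attention(x, w_f, w_g, w_h, gamma):
    """x: feature map (C, H, W). w_f, w_g: (C_k, C); w_h: (C, C)."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                    # (C, N), one column per location
    f = w_f @ flat                                # keys    (C_k, N)
    g = w_g @ flat                                # queries (C_k, N)
    v = w_h @ flat                                # values  (C, N)
    scores = g.T @ f                              # (N, N): scores[j, i] = g_j . f_i
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax: each row sums to 1
    out = v @ attn.T                              # out[:, j] = sum_i attn[j, i] * v[:, i]
    # gamma is a learned scalar; at 0 the layer passes its input through unchanged.
    return (gamma * out + flat).reshape(c, h, w), attn


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w_f = rng.standard_normal((1, 8))
w_g = rng.standard_normal((1, 8))
w_h = rng.standard_normal((8, 8))
y0, attn = self_attention(x, w_f, w_g, w_h, gamma=0.0)
```

With `gamma=0.0` the output equals the input, which lets a network start from purely local behavior and learn how much global context to mix in.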
Reliable Feature Detection
In contrast to other colorization models, DeOldify uses custom U-Nets with pretrained resnet backbones for each of its generator models. These are based on Fast.AI’s well-designed DynamicUnet, with a few minor modifications. The deviations from standard U-Nets include the aforementioned self-attention, as well as the addition of spectral normalization. These changes were modeled after the work done in the “Self-Attention Generative Adversarial Networks” paper.
The “video” and “stable” models use a resnet101 backbone and the decoder side emphasizes width (number of filters) over depth (number of layers). This configuration has proven to support the most stable and reliable renderings seen so far. In contrast, the “artistic” model has a resnet34 backbone and the decoder side emphasizes depth over width. This configuration is great for creating interesting colorizations and highly detailed renders, but at the cost of being more inconsistent in rendering than the “stable” and “video” models.
There are two primary underlying motivations for using a pretrained U-Net. First, it saves training time, because a large part of the colorization task, object recognition, comes already trained for free. That’s ImageNet-based object recognition, which on a single GPU takes at least a few days to train from scratch. Instead, we’re just fine-tuning that pretrained network to fit our task, which is much less work. Second, the U-Net architecture, especially Fast.AI’s DynamicUnet, is simply superior in image generation applications. This is due to key detail preserving and enhancing features like cross connections from encoder to decoder, learnable blur, and pixel shuffle. The resnet backbone itself is well-suited for the task of scene feature recognition.
To further encourage robustness in dealing with old and low quality images and film, we train with fairly extreme brightness and contrast augmentations. We’ve also employed gaussian noise augmentation in video model training in order to reduce model sensitivity to meaningless noise (grain) in film.
When feature recognition fails, jarring render failures such as “zombie hands” can result.
NoGAN is a new and exciting technique in GAN training that we developed, in pursuit of higher quality and more stable renders. How, and how well, it works is a bit surprising.
Here is the NoGAN training process:
Pretrain the Generator. The generator is first trained in a more conventional and easier to control manner - with Perceptual Loss (aka Feature Loss) by itself. GAN training is not introduced yet. At this point you’re training the generator as best as you can in the easiest way possible. This takes up most of the time in NoGAN training. Keep in mind: this pretraining by itself will get the generator model far. Colorization will be well-trained as a task, albeit the colors will tend toward dull tones. Self-Attention will also be well-trained at this stage, which is very important.
Save Generated Images From Pretrained Generator.
Pretrain the Critic as a Binary Classifier. Much like in pretraining the generator, what we aim to achieve in this step is to get as much training as possible for the critic in a more “conventional” manner which is easier to control. And there’s nothing easier than a binary classifier! Here we’re training the critic as a binary classifier of real and fake images, with the fake images being those saved in the previous step. A helpful thing to keep in mind here is that you can simply use a pre-trained critic used for another image-to-image task and refine it. This has already been done for super-resolution, where the critic’s pretrained weights were loaded from that of a critic trained for colorization. All that is needed to make use of the pre-trained critic in this case is a little fine-tuning.
Train Generator and Critic in (Almost) Normal GAN Setting. Quickly! This is the surprising part. It turns out that in this pretraining scenario, the critic will rapidly drive adjustments in the generator during GAN training. This happens during a narrow window of time before an “inflection point” of sorts is hit. After this point, there seems to be little to no benefit in training any further in this manner. In fact, if training is continued after this point, you’ll start seeing artifacts and glitches introduced in renderings.
In the case of DeOldify, training to this point requires iterating through only about 1% to 3% of ImageNet data (or roughly 2600 to 7800 iterations on a batch size of five). This amounts to just around 30-90 minutes of GAN training, which is in stark contrast to the three to five days of progressively-sized GAN training that was done previously. Surprisingly, during that short amount of training, the change in the quality of the renderings is dramatic. In fact, this makes up the entirety of GAN training for the video model. The “artistic” and “stable” models go one step further and repeat the NoGAN training process steps 2-4 until there’s no more apparent benefit (around five repeats).
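The iteration counts quoted above follow from simple arithmetic. The sketch below assumes the ILSVRC-2012 training split of 1,281,167 images; the article does not say exactly which ImageNet subset was used, so treat the dataset size as an assumption.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167   # ILSVRC-2012 training set size (assumption)
BATCH_SIZE = 5                      # batch size stated in the article


def iterations_for(fraction):
    """Number of batches needed to see `fraction` of the dataset once."""
    return round(IMAGENET_TRAIN_IMAGES * fraction / BATCH_SIZE)


low = iterations_for(0.01)    # 1% of the data
high = iterations_for(0.03)   # 3% of the data
```

Under that assumption, 1% of the data works out to about 2,560 iterations and 3% to about 7,690, which matches the rounded figures of 2600 to 7800 given in the text.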
Note: a small but significant change to this GAN training that deviates from conventional GANs is the use of a loss threshold that must be met by the critic before generator training commences. Until then, the critic continues training to “catch up” in order to be able to provide the generator with constructive gradients. This catch up chiefly takes place at the beginning of GAN training which immediately follows generator and critic pretraining.
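The threshold rule in this note can be sketched as a tiny scheduling function. This is a hypothetical illustration: the function name, the threshold of 0.65, and the loss values are all made up, and a real training loop would recompute the critic loss after every step rather than read it from a list.

```python
def next_phase(critic_loss, critic_threshold):
    """Decide which network to train next (hypothetical helper)."""
    # The critic keeps training until its loss drops below the threshold;
    # only then does the generator get a turn.
    return "generator" if critic_loss < critic_threshold else "critic"


# Simulated critic losses over successive steps (made-up numbers).
losses = [0.90, 0.80, 0.70, 0.60, 0.72, 0.55]
schedule = [next_phase(loss, critic_threshold=0.65) for loss in losses]
```

In this toy run the critic trains for the first three steps to catch up, and the generator only trains on the steps where the critic loss is below the threshold.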
Note that the frame at 1.4% is considered to be just before the “inflection point” - after this point, artifacts and incorrect colorization start to be introduced. In this case, the actors’ skin becomes increasingly orange and oversaturated, which is undesirable. These were generated using a learning rate of 1e-5. The current video model of DeOldify was trained at a learning rate of 5e-6 to make it easier to find the “inflection point”.
Research on NoGAN training is still ongoing, so there are still quite a few questions to investigate. First, the technique seems to accommodate small batch sizes well. The video model was trained using a batch size of 5 (and the model uses batch normalization). However, there’s still the issue of artifacts and discoloration being introduced after the “inflection point”, and it’s suspected that this could be reduced or eliminated by mitigating batch size related issues. This could be done either by increasing batch size, or perhaps by using normalization that isn’t affected by batch size. It may very well be that once these issues are addressed, additional productive training can be accomplished before hitting a point of diminishing returns.
Another open question with NoGAN is how broadly applicable the approach is. It stands to reason that this should work well for most image-to-image tasks, and even perhaps non-image related training. However, this simply hasn’t yet been explored enough to draw strong conclusions. We did get interesting and impressive results on applying NoGAN to superresolution. In just fifteen minutes of direct GAN training (or 1500 iterations), the output of Fast.AI’s Part 1 Lesson 7 Feature Loss based super resolution training is noticeably sharpened (original lesson Jupyter notebook here).
Finally, the best practices for NoGAN training haven’t yet been fully explored. It’s worth mentioning again that the “artistic” and “stable” models were trained on not just one, but repeated cycles of NoGAN. What’s still unknown is just how many repeats are possible to still get a benefit, and how to make the training process less tedious by automatically detecting an early stopping point just before the “inflection point”. Right now, the determination of the inflection point is a manual process, and consists of a person visually assessing the generated images at model checkpoints. These checkpoints need to be saved at least every 0.1% of total data; otherwise, the inflection point is easily missed. This is definitely tedious, and prone to error.
How Stable Video is Achieved
The Problem – A Flickering Mess
Just a few months ago, the technology to create well-colorized video seemed to be out of reach. If you took the original DeOldify model and merely colorized each frame just as you would any other image, the result was this—a flickering mess:
The immediate solution that usually springs to mind is temporal modeling of some sort, whereby you enforce constraints on the model during training over a window of frames to keep colorization decisions more consistent. This would seem to make sense, but it does add a significant amount of complication, and the not-so-rare cases of changing scenes raise further questions about how to handle continuity. The rabbit hole deepens, and quickly. With these assumptions of needing temporal coherence enforced, the prospect of making seamless and flicker-free colorized video seemed quite far off. Luckily, it turns out these assumptions were wrong.
The Problems Melt Away with NoGAN
A surprising observation that we’ve made while developing DeOldify is that the colorization decisions are not at all arbitrary, even after GAN training and across different models. Rather, different models and training regimes keep arriving at almost the same solution, with only very minor variations in colorization. This even happens with the colorization of things you might expect to be unknowable and unconstrained by the luminance information in black and white photos: Clothing, cars, special effects in movies, etc. We’re not sure yet what exactly the model is learning to be able to more or less deterministically colorize images. But the bottom line is that temporal coherence concerns simply go away when you realize that there’s nothing to track—objects in frames will keep rendering the same regardless.
Additionally, it turns out that in NoGAN training the learning that takes place before the “inflection point” trains the generator in a very effective way. This is not only in terms of quickly achieving good colorization, but also without introducing artifacts, discoloration and inconsistency in generator renders. In other words, those artifacts and glitches in that “flickering mess” render of Metropolis above are coming from too much GAN training, and we can mitigate that pretty much completely with NoGAN!
NoGAN is the most significant element in achieving video render stability, but there’s a few additional design choices that also make an impact. For example, a larger resnet backbone (resnet101) makes a noticeable difference in how accurately and robustly features are detected, and therefore how consistently the frames are rendered as objects and scenes move. Another consideration is render resolution—increasing this makes a positive difference in some cases, but not nearly as big as one may expect. Most of the videos we’ve rendered have been done at resolutions ranging from 224px to 360px, and this tends to work just fine.
The end result of all this is that flicker-free, temporally consistent, and colorful video is achieved simply by rendering individual frames as if they were any other image! There is zero temporal modeling involved.
The DeOldify Graveyard: Things That Didn’t Work
For every design experiment that actually worked, there were at least ten that didn’t. This list is not exhaustive by any stretch but it includes what we consider to be particularly helpful to know.
Wasserstein GAN (WGAN) and Its Variants
The original approach attempted in the development of DeOldify was to base the architecture on Wasserstein GAN (WGAN). It turns out the stability of the WGAN and its subsequent improvements were not sufficient for practical application in colorization. Training would work somewhat for a while, but would invariably diverge from a stable solution as GANs are known to do. To an extent, the results were entertaining. What actually did wind up working extremely well (the first time even) was modeling DeOldify after Self-Attention Generative Adversarial Networks instead.
Various Other Normalization Schemes
The following normalization variations were attempted. None of them worked nearly as well as having batch normalization and spectral normalization combined in the generator, and just spectral normalization in the critic.
Spectral Normalization Only in Generator. This trained more slowly and was generally more unstable.
Batchnorm at Output of Generator. This slowed down training significantly and didn’t seem to provide any real benefit.
Weight Normalization in Generator. Ditto on the slowed training, and images didn’t turn out looking as good either. Interestingly however, it seems like weight normalization works the best when doing NoGAN training for super resolution.
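Since spectral normalization features in every variation above, a short NumPy sketch of the idea may help. It rescales a weight matrix so that its largest singular value is about 1, estimating that value by power iteration. This is a standalone illustration with hypothetical names; to my understanding, framework implementations typically run a single power-iteration step per training update and reuse the vector between updates, rather than iterating to convergence as done here.

```python
import numpy as np


def spectral_normalize(weight, n_iters=100, seed=0):
    """Divide `weight` (2-D) by an estimate of its largest singular value."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(weight.shape[0])
    for _ in range(n_iters):
        # Alternate between right and left singular vector estimates.
        v = weight.T @ u
        v /= np.linalg.norm(v)
        u = weight @ v
        u /= np.linalg.norm(u)
    sigma = u @ weight @ v            # estimated largest singular value
    return weight / sigma, sigma


w = np.random.default_rng(1).standard_normal((16, 8))
w_sn, sigma = spectral_normalize(w)
```

Bounding the largest singular value limits how much any layer can amplify its input, which is the property that tends to stabilize GAN training.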
Other Loss Functions
The interaction between the conventional (non-GAN) loss function and the GAN loss/training turns out to be crucial, yet tricky. For one, you have to weigh the non-GAN and GAN losses carefully, and it seems this can only come out of experimentation. The nice thing about NoGAN is that the iterations on this process are very quick relative to other GAN training regimes—it’s a matter of minutes instead of days to see the end result.
Another tricky aspect of the loss interaction is that you can’t just select any non-GAN loss function to work side by side with GAN loss. It turns out that Perceptual Loss (aka Feature Loss) empirically works best for this, especially in comparison to L1 and Mean Squared Error (MSE) loss.
It seems that since most of the emphasis in NoGAN training is in pretraining, it would be especially important to make sure that pretraining is taken as far as possible in rendering quality before making the switch to GAN. Perceptual Loss does just that—by itself, it produces the most colorful results of all the non-GAN losses attempted. In contrast, simpler loss functions such as MSE and L1 loss tend to produce dull colorizations as they encourage the networks to “play it safe” and bet on gray and brown by default.
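A toy calculation shows why pixel-wise losses reward "playing it safe". Suppose a shirt that looks the same in black and white is red in half of the training photos and blue in the other half. The colors and probabilities below are made up for illustration.

```python
import numpy as np

# Two equally plausible "true" colors for the same gray-looking shirt (RGB in [0, 1]).
red = np.array([0.8, 0.1, 0.1])
blue = np.array([0.1, 0.1, 0.8])


def expected_mse(pred):
    # Average squared error when the truth is red half the time, blue the other half.
    return 0.5 * np.mean((pred - red) ** 2) + 0.5 * np.mean((pred - blue) ** 2)


hedge = (red + blue) / 2          # muted in-between color
mse_hedge = expected_mse(hedge)
mse_commit = expected_mse(red)    # committing to one vivid answer
```

The muted average scores a lower expected MSE than committing to either vivid color, so a network trained with MSE alone is pushed toward less saturated output.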
Additions to perceptual loss were also attempted. Most notable were gram style loss and wasserstein distance. While the two cannot be ruled out and will be revisited in the future, the losses wound up encouraging strange orange and yellow discolorations when present in conjunction with GAN training. It’s suspected that the losses were simply not used effectively.
Reduced Number of Model Parameters
Something that tends to surprise people about DeOldify is the large model size. In the latest iteration, the “video” and “stable” models are set to a width of 1000 filters on the decoder side for most of the layers. The “artistic” model has the number of filters multiplied by 1.5 over the standard DynamicUnet configuration. Similarly, the critic is also rather hefty, with a starting width of 256 as opposed to the more conventional 64 or 128. Many experiments were done attempting to reduce the number of parameters, but they all generally ran into the same problem: The resulting renders were significantly less colorful.
Creating Super-resolution Microscopy Videos
Finally, we’ll discuss some of the details of the approach used at the Salk Institute for creating high resolution microscopy videos. The high level overview of the steps involved in creating a model to produce high resolution microscopy videos from low resolution sequences is:
Acquire high resolution source material for training and low resolution material to be improved.
Develop a crappifier function
Create low res training dataset of image-tuples (groups of 3 images)
Create two training sets, A and B, by applying the crappifier to each image-tuple twice
Train the model on both training sets simultaneously with “stability” loss
Use the trained model to generate high resolution videos by running it across real low resolution source material
Acquisition of Source Material
At Salk we have been fortunate because we have produced decent results using only synthetic low resolution data. This is important because it is time consuming and rare to have perfectly aligned pairs of high resolution and low resolution images - and that would be even harder or impossible with video (live cells). In this case the files were acquired in proprietary czi format. Fortunately there is a python based tool for reading this format here.
Developing a Crappifier
In order to produce synthetic training data we need a “crappifier” function. This is a function that transforms a high resolution image into a low resolution image that approximates the real low resolution images we will be working with once our model is trained. The crappifier injects some randomness in the forms of both gaussian and poisson noise. These are both present in microscopy images. We were influenced in this design by the excellent work done by the CSBDeep team.
The crappifier can be simple but does materially impact both the quality and characteristics of output. For example, we found that if our crappifier injected too much high frequency noise into the training data, the trained model would have a tendency to eliminate thin and fine structures like those of neurons.
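As a rough sketch of what such a crappifier might look like, the NumPy function below downsamples an image and then injects Poisson and Gaussian noise. Every name and parameter value here (`scale`, `photons`, `read_noise`) is a hypothetical choice for illustration and not the configuration used at Salk.

```python
import numpy as np


def microscopy_crappifier(hi_res, rng, scale=4, photons=30.0, read_noise=0.02):
    """hi_res: float array (H, W) in [0, 1], with H and W divisible by `scale`."""
    h, w = hi_res.shape
    # Downsample by averaging scale x scale blocks.
    low = hi_res.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))
    # Poisson (shot) noise: signal-dependent, relatively stronger at low photon counts.
    low = rng.poisson(low * photons) / photons
    # Gaussian (read) noise: signal-independent, added to every pixel.
    low = low + rng.normal(0.0, read_noise, size=low.shape)
    return np.clip(low, 0.0, 1.0)


rng = np.random.default_rng(0)
hi_res = rng.random((64, 64))             # stand-in for a real acquisition
x1 = microscopy_crappifier(hi_res, rng)
x2 = microscopy_crappifier(hi_res, rng)   # same source, different random noise
```

Because the noise is random, two calls on the same source image give slightly different low resolution results that share one high resolution target.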
Generating the Synthetic Low Resolution Data for Training
The next step is to bundle sequences of images and their target together for training like this:
This image shows an example from a training run where we are using 5 sequential images (t-2, t-1, t0, t+1, t+2) to predict a single super-resolution output image (also at time t0).
For the movies we used bundles of 3 images and predicted the high resolution image at the corresponding middle time. In other words, we predicted super-resolution at time t0 with low resolution images from times t-1, t0 and t+1.
We chose 3 images because that conveniently allowed us to easily use pre-existing super-resolution network architectures, data loaders and loss functions that were written for 3 channels of input.
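The bundling described in this section can be sketched with NumPy as follows. The function name and array layout are assumptions for illustration; the idea is simply that three consecutive low resolution frames take the place of the three color channels of an ordinary image.

```python
import numpy as np


def make_bundles(low_res_frames, hi_res_frames):
    """low_res_frames: (T, h, w); hi_res_frames: (T, H, W).

    Returns inputs of shape (T-2, 3, h, w) and targets of shape (T-2, H, W).
    """
    t = low_res_frames.shape[0]
    # Stack frames t-1, t0, t+1 along a channel axis, like the planes of a color image.
    inputs = np.stack([low_res_frames[i - 1:i + 2] for i in range(1, t - 1)])
    targets = hi_res_frames[1:t - 1]      # high resolution frame at the middle time t0
    return inputs, targets


low = np.arange(6 * 4 * 4, dtype=float).reshape(6, 4, 4)
high = np.arange(6 * 16 * 16, dtype=float).reshape(6, 16, 16)
inputs, targets = make_bundles(low, high)
```

The first and last frames of a sequence have no complete neighborhood, so a sequence of T frames yields T-2 training examples.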
Creating a Second Set of Low Resolution Data
To use stability loss we actually have to apply the crappifier function to the source material twice in order to create two parallel datasets. Because the crappifier injects random noise - the two datasets will differ from each other slightly - but have the same high resolution targets. A perfectly stable model would predict identical high resolution output images from both training datasets - while ignoring the random noise.
Training the model with Stability Loss
In addition to the normal loss functions we would use for super-resolution, we need to choose a measure of stability loss. This is a measure of how similar the outputs are when we apply the model to each of the two training sets, which, as explained previously, differ only in their randomly applied noise.
Given the low resolution image sequence X that we will use to predict the true high resolution image T, we create X1 and X2, which result from two separate applications of the random noise generating crappifier function.
X1 = crappifier(X) and X2 = crappifier(X)
Given our trained model M, we then predict Y1 and Y2 as follows:
Y1 = M(X1) and Y2 = M(X2)
This gives us the super-resolution losses L1 = loss(Y1, T) and L2 = loss(Y2, T). Our stability loss is the difference between the two predicted images. We used L1 loss (mean absolute error, not to be confused with the variable L1 above) but you could also use a feature loss or some other approach to measure the difference:
LossStable = loss(Y1,Y2)
Our final training loss is therefore: loss = L1 + L2 + LossStable
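Putting the pieces together, here is a minimal NumPy sketch of the combined loss. It uses mean absolute error for all three terms and a toy callable in place of a real network; the function names and the test data are hypothetical.

```python
import numpy as np


def l1(a, b):
    """Mean absolute error between two arrays."""
    return np.mean(np.abs(a - b))


def total_loss(model, x1, x2, target):
    y1, y2 = model(x1), model(x2)
    loss_1 = l1(y1, target)       # super-resolution loss for the first noisy copy
    loss_2 = l1(y2, target)       # super-resolution loss for the second noisy copy
    loss_stable = l1(y1, y2)      # penalize disagreement between the two predictions
    return loss_1 + loss_2 + loss_stable


# Toy stand-ins: noisy copies of a flat target, and two trivial "models".
rng = np.random.default_rng(0)
target = np.full((8, 8), 0.5)
x1 = target + rng.normal(0, 0.1, target.shape)
x2 = target + rng.normal(0, 0.1, target.shape)
noisy = total_loss(lambda x: x, x1, x2, target)                      # passes noise through
stable = total_loss(lambda x: np.full_like(x, 0.5), x1, x2, target)  # ignores noise
```

The model that ignores the noise gets a total loss of zero, while the one that passes noise through is penalized by all three terms.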
Now that we have a trained model, generating high resolution output from low resolution input is simply a matter of running the model across a sliding window of, in this case, three low resolution input images at a time. Imageio is one convenient way to write out multi-image TIFF files or mp4 files.
In this series, I want to share actions you can take to have a practical, positive impact on making tech more ethical, and to highlight some real world examples. Some are big; some are small; not all of them will be relevant to your situation. Today’s post covers items 12-16 (see part 1 for 1-5 and part 2 for 6-11).
Decide in advance what your personal values and limits are
Humans are excellent at post-hoc justifications. We make a decision or find ourselves in a situation, and we are masters at going back and constructing a justification for it. For example, in a study where participants were asked to select between two applicants for police chief, if the male applicant had more street smarts and the female applicant had more formal education, evaluators decided that street smarts were the most important trait. If the genders were reversed, evaluators decided that formal education was the most important trait.
This propensity for post-hoc justifications can make it difficult for people to recognize that something they are profiting from might be unethical. One approach is to decide in advance what your personal values and limits are.
Ask your company to sign the SafeFace Pledge
Joy Buolamwini is founder of the Algorithmic Justice League and an AI researcher at MIT Media Lab. Her research has shown that commercial computer vision products from IBM, Microsoft, and Amazon have much higher error rates on women with dark skin, compared to men with light skin. Joy’s work has been covered in over 230 news articles in 37 different countries. She created the SafeFace Pledge, together with the Georgetown Law Center on Privacy and Technology, for companies to commit to showing value for human life, dignity, and rights; addressing harmful bias; facilitating transparency; and embedding these commitments into business practices. Please note this pledge does not remove the need for thoughtful regulation and human rights protections concerning the use of facial recognition, but it is a great way for companies to commit to a set of healthy principles.
I want to commend Robbie.AI, Yoti, and Simprints for already having signed the SafeFace Pledge. If your company is working on facial recognition technology, I hope that they will consider signing it as well. If you work at a larger company, you may need to organize together with co-workers who share your values and concerns, to strategize about what steps would make it more likely for your company to sign.
Increase diversity by ensuring that employees from underrepresented groups are set up for success and prepared for promotions
While it may feel easier to focus on teaching little girls how to code, this will not result in change if qualified women continue to leave the tech industry at twice the rate that men do (in large part because of how they are treated, including a lack of advancement opportunities). As I wrote in a previous post, the first step, and the most important step, to improving diversity is to make sure that you are treating the women and people of color who already work at your company very well. This includes: appreciating their contributions, assigning them to high impact projects, bringing up their accomplishments in high level meetings, paying them equitably, providing chances to grow their skillset, listening to them, helping them prepare for promotions, giving them good managers, believing them about their experiences, and generally supporting them.
Increase diversity by overhauling your interviewing and onboarding processes
The interview process is broken in tech, and people from underrepresented groups are disproportionately impacted by this dysfunction (this also means that your company is missing out on many great candidates!). A study involving technical interviews with over 300 candidates and comparisons of where those candidates got offers/rejections concluded that instead of hiring programmers that have the skills the company needs, founders hire people that remind them of themselves. Since only 3% of VC funding goes to women and less than 1% goes to Black founders, how rare is it for a founder to think that a Black woman candidate reminds him of himself? This approach is frustrating for candidates, and inefficient for companies that end up not even hiring the people they most need. I share research on this topic, together with action items on how to improve, in this post.
Good onboarding is another necessary component for ensuring that people from diverse backgrounds are able to succeed. Engineer Kate Heddleston noticed that for employees starting with the same experience level, again and again men were getting promoted much faster than women. Lack of onboarding was the source of the difference. Valuable information is shared through informal social networks, and people who differ from the majority group (such as women, people of color, LGBTQ people, parents, and older employees) will have the most trouble integrating into these networks. Comprehensive onboarding is necessary to make sure that everyone has the information they need to succeed at their jobs.
Share your success stories!
News is biased towards the negative and the outrageous. If you achieve success, no matter the size of your win, please share it to inspire and encourage others. If there are lessons you learn that could be useful to others, please share them! Here are some tips on how to get started with blogging or public speaking, as two possible ways to share your success with a broader audience.
This list may be overwhelming, so please just choose one concrete action you can take to get started. I am still in the early stages of developing further ideas and plans of what we can do to address scary applications of AI that encode bias, lack ways to correct mistakes, contribute to surveillance, promote extremist content, and more. If you are working on projects to address these, please let me know, and please stay tuned for additional updates.
This post is part 2 in a 3-part series. Please check out part 1 and part 3 as well.
We’ve seen the litany of moral failure from tech company executives: paying out tens of millions of dollars to executives accused of sexual harassment (while sending victims away with nothing); firing women directly after they’ve had cancer treatments, major surgeries, or stillbirths; being told they were contributing to genocide and not responding to mitigate this; allowing top executives to evade responsibility and make deeply misleading statements in the New York Times; and more. Clearly, these executives are not going to lead us in AI ethics, when they are failing at “regular” ethics. The people who created our current problems will not be the ones to solve them, and it is up to the rest of us to act. In this series, I want to share actions you can take to have a practical, positive impact, and to highlight some real world examples. Some are big; some are small; not all of them will be relevant to your situation. Today’s post covers items 6-11 (see part 1 for 1-5 and part 3 for 12-16).
Liz Fong-Jones was an engineer at Google for 11 years and remains a leader in the field of site reliability engineering. In a post about her recent decision to leave Google, she shares some of the strategies she used over the years to advocate for change from within the company. For instance, when Google announced in 2010 a real-name policy for Google+, which would be harmful for some teachers, therapists, LGBT+ people, and other vulnerable people, Liz put together a list of ways that the policy was misguided and could encourage abuse. Many of her colleagues joined her, and a group of employees was successful in gaining a seat at the negotiating table. In response to negative public feedback about the real-name policy, Google executives sought increased feedback from these employees and later removed the policy.
Employee Organizing (both internally and externally)
Much of the work Liz did falls under the category of employee organizing. Labor movements are most effective when they have clear goals and overarching principles, as opposed to simply being reactionary. There have been numerous recent employee movements and protests which have been successful:
Last year, Survey Monkey began offering full benefits to contract workers, in response to pressure from its full-time employees (contract workers are often treated like a lower caste by tech companies).
Google ended forced arbitration (which often prevents victims of sexual harassment or discrimination from seeking justice) after 20,000 employees participated in a protest of how the company has mishandled harassment cases and supported abusers. (See below for a note about how Google is now retaliating against the organizers).
After Google appointed the anti-trans, anti-LGBTQ, and anti-immigrant president of the Heritage Foundation to its AI Ethics board, over 2,000 Google employees signed a petition calling for her removal. Google ended up canceling the entire board, less than a week after they’d first announced it.
Just this week, news broke that Google has retaliated against the organizers of the Google Walkout, telling organizer Meredith Whittaker that she would have to “abandon” her work on AI ethics and her role at the AI Now Institute (which she co-founded), and another organizer, Claire Stapleton, received a demotion, lost half her reports, and was told to go on sick leave even though she isn’t sick. This is a common strategy by companies to attempt to intimidate employees in order to discourage them from organizing (and it highlights that companies do see such collective organizing as a threat). It is important that we show support and solidarity for organizing employees.
Leave your company when internal advocacy no longer works
Liz Fong-Jones writes that such tactics have proved less effective at Google more recently, and that she has been deeply troubled by the direction Google is heading in, with the harassment and doxxing of LGBT+ employees to white supremacist sites, putting profits above ethics in its business in China and the Middle East, and the huge financial payouts given to executives accused of sexually harassing subordinates. “I can no longer bail out a raft with a teaspoon while those steering punch holes in it,” Liz wrote about her decision to leave Google after 11 years.
For parents who are supporting children, non-US residents who are reliant on work visas, people with chronic health conditions, and many others, quitting a job without another one lined up is often not an option. Don’t worry, nobody is requiring that you do that. In the tech industry, a large number of companies are hiring and there is almost no stigma for switching jobs. As I wrote in a previous post about toxic jobs, people in the tech industry consistently underestimate their employability, how in-demand their skills are, and how many options they have.
Avoid non-disparagement agreements (if needed and when possible)
When the DataCamp CEO was accused of sexually assaulting one of his employees, the only repercussion for the assailant was a single day of sensitivity training (for more background on this, please read the posts by Noam Ross and Julia Silge on how instructors spent months collectively organizing for increased accountability and transparency, but DataCamp failed to engage in good faith). DataCamp employees Dhavide Aruliah and Greg Wilson brought up concerns internally about the mishandling of this case and how it set a precedent that executives could do what they want with impunity. Both Dhavide and Greg were fired within days of raising their concerns. They were offered severance packages which would have required them to sign agreements silencing them about what happened. Both declined this severance pay, which is what makes it possible now for them to speak out publicly (especially as DataCamp continues to handle the case poorly, by failing to engage with concerned DataCamp instructors, and by writing a victim-blaming “apology” with settings so that search engines won’t index it).
I appreciate all the DataCamp instructors who are organizing and calling for boycotts of their own courses, and I admire Dhavide and Greg for turning down a month’s salary so that they could continue to speak up. Please read their posts here and here. Not everyone can afford to turn down a severance package. That’s okay; not every item on this list applies to every situation, so do what you need to take care of yourself. However, if you can afford it, this can be a powerful tool.
Speak with reporters
Last year, the Verge reported on a secret program in New Orleans where Palantir had been testing out its predictive policing technology for the previous 6 years. The program was so secretive that even city council members weren’t aware of it prior to the article. There was a public outcry in response to the Verge’s article, and two weeks later, New Orleans chose not to renew its contract with Palantir. This is a win! I’m grateful to the reporters at the Verge who investigated this, and to the sources who bravely tipped them off and spoke out about this (including one anonymous law enforcement official). Speaking to a journalist about your employer can be scary, but it can also be an effective strategy for enacting change.
I’m also grateful to the 20 current and former YouTube employees who spoke to reporters about the failure of YouTube (part of Google) leadership to act on toxic, extremist, & false videos, even as numerous employees raised alarms. I hope that you are financially supporting high quality journalism (through paid subscriptions or donations, depending on the outlet) if it is within your means to do so.
Support Thoughtful Regulations & Legislation
For years, I’ve had others in the tech industry look at me with shock and disgust when I say that I support thoughtful regulation. Well-enforced regulations are crucial and necessary to protect human rights and to ensure the well-being of society. Furthermore, regulations can even encourage innovation by ensuring stability and fair competition.
I’m so grateful for all the regulations that make our lives better in the USA, including the FDA, EPA, Civil Rights Act, Fair Housing Act, Pregnancy Discrimination Act, Americans with Disabilities Act, Age Discrimination in Employment Act, National Research Act, Family Medical Leave Act, and Freedom of Information Act. The Voting Rights Act was crucial, and I’m angry that it was gutted in 2013. I’m grateful that California passed a stricter vaccination law in 2015. I’m grateful for car safety standards. These laws did not occur in a vacuum, and I am grateful to the activists and advocates that worked for years (and in many cases, decades) to get these regulations passed.
The regulations and acts I listed above could all be improved. Some of them are not enforced well enough, and some are currently under attack. Some of them don’t go far enough. My point is not that they are perfect; my point is that regulations can be good and helpful. Yes, there are plenty of regulations that are stupid or harmful, but I talk to far too many people in the tech industry who have concluded that ALL regulations are bad or destined to fail.
Getting thoughtful regulation passed will be challenging, but we need to protect human rights, and to counterbalance the huge power that corporations currently have. We also need to be skeptical of how corporate leaders often say one thing while the lobbyists they employ work towards the opposite. The tech giants are currently earning a ton of money (while externalizing many costs to society) and we can not underestimate how hard they will fight against meaningful changes that would impact their profits.
If you want to know where tech companies stand on an issue, look at the actions of their lobbyists, not the statements from their leaders. https://t.co/o7TZtMDAjU
This post is the second in a series. You can find part 1 here and part 3 here. The problems we are facing can be overwhelming, so it may help to choose a single item from this list as a good way to start. I encourage you to choose one thing you will do, and to make a commitment now.
AI ethics is not separate from other ethics, siloed off into its own much sexier space. Ethics is ethics, and even AI ethics is ultimately about how we treat others and how we protect human rights, particularly of the most vulnerable. The people who created our current crises will not be the ones to solve them, and it is up to all of us to act. In this series, I want to share a few actions you can take to have a practical, positive impact, and to highlight some real world examples. Some are big; some are small; not all of them will be relevant to your situation, but I hope this will inspire you around concrete ways you can make a difference. Each post in this series will cover 5 or 6 different action steps.
Ethics and Data Science (available for free online) argues that just as checklists have helped doctors make fewer errors, they can also help those working in tech make fewer ethical mistakes. The authors propose a checklist for people who are working on data projects. Here are items to include on the checklist:
Have we listed how this technology can be attacked or abused?
Have we tested our training data to ensure that it is fair and representative?
Have we studied and understood possible sources of bias in our data?
Does our team reflect diversity of opinions, backgrounds, and kinds of thought?
What kinds of user consent do we need to collect or use the data?
Do we have a mechanism for gathering consent from users?
Have we explained clearly what users are consenting to?
Do we have a mechanism for redress if people are harmed by the results?
Can we shut down this software in production if it is behaving badly?
Have we tested for fairness with respect to different user groups?
Have we tested for disparate error rates among different user groups?
Do we test and monitor for model drift to ensure our software remains fair over time?
Conduct Ethical Risks Sweeps
Even when we have good intentions, our systems can be manipulated or exploited (or otherwise fail), leading to widespread harm. Just consider the role of Facebook in the genocide in Myanmar and how conspiracy theories spread via recommendation systems led to huge decreases in vaccination rates, leading to rising death rates from preventable diseases (Nature has referred to viral misinformation as “the biggest pandemic risk”).
Similar to the way that penetration testing is standard in the area of info-security, we need to proactively search for potential failures, manipulation, and exploitation of our systems, before they occur. As I wrote in a previous post, we need to ask:
How could trolls use your service to harass vulnerable people?
How could an authoritarian government use your work for surveillance? (There have even been multiple times when mass surveillance and data collection have played key roles in humanitarian crises or genocides.)
How could your work be used to spread harmful misinformation or propaganda?
What safeguards could be put in place to mitigate the above?
Ethicists at the Markkula Center have developed an ethical toolkit for engineering design & practice. I encourage you to read it all, but I particularly wanted to highlight tool 1, Ethical Risk Sweeps, which formalizes my questions from above in a process, including by instituting regularly scheduled ethical risk-sweeping and rewarding team members for spotting new ethical risks.
Fortunately, we don’t have to just unthinkingly optimize metrics! Chapter 4 of Technically Wrong documents how in response to widespread racial profiling on its platform, NextDoor made a set of comprehensive changes. NextDoor overhauled the design of how users report suspicious activity, including a new requirement that users describe several features of the person’s appearance other than just race to be allowed to post. These changes had the side effect of reducing engagement, but not all user engagement is good.
Evan Estola, lead machine learning engineer at Meetup, discussed the example of men expressing more interest than women in tech meetups. Meetup’s algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop.
Choose a revenue model other than advertising
There is a fundamental misalignment between personalized ad-targeting and the well-being of society. This shows up in myriad ways:
These issues could and should be addressed in many ways, including stricter regulations and enforcement around political ads, housing discrimination, and employment discrimination; revising section 230 to only cover content and not recommendations; and taxing personalized ad-targeting in the way that we tax tobacco (as something that externalizes huge costs to society).
Companies such as Google/YouTube, Facebook, and Twitter primarily rely on advertising revenue, which creates perverse incentives. Even when people genuinely want to do the right thing, it is tough to constantly be struggling against misaligned incentives. It is like filling your kitchen with donuts just as you are embarking on a diet. One way to head this off is to seek out revenue models other than advertising. Build a product that people will pay for. There are many companies out there with other business models, such as Medium, Eventbrite, Patreon, GitHub, Atlassian, Basecamp, and others.
Personalized ad targeting is so entrenched in online platforms that the idea of trying to move away from it can feel daunting. However, Tim Berners-Lee, founder of the World Wide Web, wrote last year: “Two myths currently limit our collective imagination: the myth that advertising is the only possible business model for online companies, and the myth that it’s too late to change the way platforms operate. On both points, we need to be a little more creative.”
Have Product & Engineering sit with Trust & Safety
Medium’s head of legal, Alex Feerst, wrote a post in which he interviewed 15 people who work on Trust & Safety (which includes content moderation) at a variety of major tech companies, roles that Feerst refers to as being simultaneously “the judges and janitors of the internet.” As one Trust & Safety (T&S) employee observed, “Creators and product people want to live in optimism, in an idealized vision of how people will use the product, not the ways that people will predictably break it… The separation of product people and trust people worries me, because in a world where product managers and engineers and visionaries cared about this stuff, it would be baked into how things get built. If things stay this way—that product and engineering are Mozart and everyone else is Alfred the butler—the big stuff is not going to change.”
Having product and engineering shadow or even sit with T&S agents (including content moderators) is one way to address this. It makes the ways that tech platforms are misused more apparent to the creators of these platforms. A T&S employee at another company describes this approach, “We have executives and product managers shadow trust and safety agents during calls with users. Sitting with an agent talking to a sexual assault victim helps build some empathy, so when they go back to their teams, it’s running in the back of their brain when they’re thinking about building things.”
To be continued...
This post is the first in a series. Please check out part 2 and part 3, with further steps you can take to improve the tech industry. The problems we are facing can be overwhelming, so it may help to choose a single item from this list as a good way to start. I encourage you to choose one thing you will do, and to make a commitment now.
Note from Jeremy: If you want to join the next deep learning course at the University of San Francisco, discussed below, please apply as soon as possible because it’s under 2 weeks away! You can apply here. At least a year of coding experience, and deep learning experience equivalent to completing Practical Deep Learning for Coders is required.
Today at the TensorFlow Dev Summit we announced that two lessons in our next course will cover Swift for TensorFlow. These lessons will be co-taught with the inventor of Swift, Chris Lattner; together, we’ll show in the class how to take the first steps towards implementing an equivalent of the fastai library in Swift for TensorFlow. We’ll be showing how to get started programming in Swift, and explain how to use and extend Swift for TensorFlow.
Last month I showed that Swift can be used for high performance numeric computing (that post also has some background on what Swift is, and why it’s a great language, so take a look at that if you haven’t read it before). In my research on this topic, I even discovered that Swift can match the performance of hand-tuned assembly code from numerical library vendors. But I warned that: “Using Swift for numeric programming, such as training machine learning models, is not an area that many people are working on. There’s very little information around on the topic”.
So, why are we embracing Swift at this time? Because Swift for TensorFlow is the first serious effort I’ve seen to incorporate differentiable programming deep into the heart of a widely used language that is designed from the ground up for performance.
Our plans for Swift at fast.ai
The combination of Python, PyTorch, and fastai is working really well for us, and for our community. We have many ongoing projects using fastai for PyTorch, including a forthcoming new book, many new software features, and the majority of the content in the upcoming courses. This stack will remain the main focus of our teaching and development.
It is very early days for Swift for TensorFlow. We definitely don’t recommend anyone tries to switch all their deep learning projects over to Swift just yet! Right now, most things don’t work. Most plans haven’t even been started. For many, this is a good reason to skip the project entirely.
But for me, it’s a reason to jump in! I love getting involved in the earliest days of projects that I’m confident will be successful, and helping our community to get involved too. Indeed, that’s what we did with PyTorch, including it in our course within a few weeks of its first pre-release version. People who are involved early in a project like this can have a big influence on its development, and soon enough they find themselves the “insiders” in something that’s getting big and popular!
I’ve been looking for a truly great numerical programming language for over 20 years now, so for me the possibility that Swift could be that language is hugely exciting. There are many project opportunities for students to pick something that’s not yet implemented in Swift for TensorFlow, and submit a PR implementing and testing that functionality.
Python: What’s missing
In the last three years, we’ve switched between many different deep learning libraries in our courses: Theano, TensorFlow, Keras, PyTorch, and of course our own fastai library. But they’ve all had one thing in common: they are Python libraries. This is because Python is today the language that’s used in nearly all research, teaching, and commercial applications of deep learning. To be a deep learning practitioner and use a language other than Python means giving up a vast ecosystem of interconnected libraries, or else using Python’s libraries through clunky inter-language communication mechanisms.
But Python is not designed to be fast, and it is not designed to be safe. Instead, it is designed to be easy, and flexible. To work around the performance problems of using “pure Python” code, we instead have to use libraries written in other languages (generally C and C++), like numpy, PyTorch, and TensorFlow, which provide Python wrappers. To work around the problem of a lack of type safety, recent versions of Python have added type annotations that optionally allow the programmer to specify the types used in a program. However, Python’s type system is not capable of expressing many types and type relationships, does not do any automated typing, and can not reliably check all types at compile time. Therefore, using types in Python requires a lot of extra code, but falls far short of the level of type safety that other languages can provide.
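A tiny illustration of the point about optional annotations (the function below is hypothetical, written only for this example): the annotations document what the programmer intended, but the interpreter does not enforce them when the code runs.

```python
def scale(x: float, factor: int) -> float:
    """Multiply x by factor; the annotations describe the intended types."""
    return x * factor

# A string is passed where a float was declared. Python runs this without
# complaint, because annotations are not checked at run time.
result = scale("ab", 2)
print(result)  # prints "abab"
```

An external static checker such as mypy would report the mismatched argument, but only if it is run as a separate step; nothing in the language itself requires that.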
The C/C++ libraries that are at the heart of nearly all Python numeric programming are also a problem for both researchers, and for educators. Researchers can not easily modify the underlying code, or inspect it, since it requires a whole different toolbox—and in the case of libraries like MKL and cudnn the underlying code is optimized machine language. Educators cannot easily show students what’s really going on in a piece of code, because the normal Python-based debugging and inspection approaches can not handle libraries in other languages. Developers struggle to profile and optimize code where it crosses language boundaries, and Python itself can not properly optimize code that crosses language or library boundaries.
For instance, we’ve been doing lots of research into different types of recurrent neural network architectures and normalization layers. In both cases, we haven’t been able to get the same level of performance that we see in pure CUDA C implementations, even when using PyTorch’s fantastic new JIT compiler.
At the PyTorch Dev Summit last year I participated in a panel with Soumith Chintala, Yangqing Jia, Noah Goodman, and Chris Lattner. In the panel discussion, I said that: “I love everything about PyTorch, except Python.” I even asked Soumith “Do you think we might see a ‘SwifTorch’ one day?” At the time, I didn’t know that we might be working with Swift ourselves so soon!
So what now?
In the end, anything written in Python has to deal with one or more of the following:
Being run as pure Python code, which means it’s slow
Being a wrapper around some C library, which means it’s hard to extend, can’t be optimized across library boundaries, and is hard to profile and debug
Being converted into some different language (such as PyTorch using TorchScript, or TensorFlow using XLA), which means you’re not actually writing in the final target language, and have to deal with the mismatch between the language you think you’re writing, and the actual language that’s really being used (with at least the same debugging and profiling challenges of using a C library).
On the other hand, Swift is very closely linked with its underlying compiler infrastructure, LLVM. In fact, Chris Lattner has described it before as “syntactic sugar for LLVM”. This means that code written in Swift can take full advantage of all of the performance optimization infrastructure provided by LLVM. Furthermore, Chris Lattner and Jacques Pienaar recently launched the MLIR compiler infrastructure project, which has the potential to significantly improve the capabilities of Swift for TensorFlow.
Our hope is that we’ll be able to use Swift to write every layer of the deep learning stack, from the highest level network abstractions all the way down to the lowest level RNN cell implementation. There would be many benefits to doing this:
For education, nothing is mysterious any more; you can see exactly what’s going on in every bit of code you use
For research, nothing is out of bounds; whatever you can conceive of, you can implement, and have it run at full speed
For development, the language helps you; your editor will deeply understand your code, doing intelligent completions and warning you about problems like tensor mismatches, your profiler will show you all the steps going on so you can find and fix performance problems, and your debugger will let you step all the way to the bottom of your call stack
For deployment, you can deploy the exact same code that you developed on your laptop. No need to convert it to some arcane format that only your deep learning server understands!
For education, our focus has always been on explaining the concepts of deep learning, and the practicalities of actually using this tool. We’ve found that our students can very easily (within a couple of days) switch to being productive in a different library, as long as they understand the foundations well, and have practiced applying them to solve real problems.
Our Python fastai library will remain the focus of our development and teaching. We will, however, be doing lots of research using Swift for TensorFlow, and if it reaches the potential we think it has, expect to see it appearing more and more in future courses! We will be working to make practical, world-class, deep learning in Swift as accessible as possible—and that probably means bringing our fastai library (or something even better!) to Swift too. It’s too early to say exactly what that will look like; if you want to be part of making this happen, be sure to join the upcoming class, either in person at the University of San Francisco, or in the next part 2 MOOC (coming out June 2019).