Available on Monday evenings to attend the in-person course in SOMA, from March 19 to April 30.
Able to commit 10 hours a week of study to the course.
You can fulfill the requirement to be familiar with deep learning, the fastai library, and PyTorch by doing any 1 of the following:
You took the updated, in-person deep learning part 1 course during fall 2017
You have watched the first 2 videos of the online course before you apply, and you commit to working through all 7 lessons before the start of the course. We estimate that each lesson takes approximately 10 hours of study (so you would need to study for the 7 weeks prior to the course starting on March 19, for 10 hours each week).
You have previously taken the older version of the course (released last year) AND have watched the first 4 lessons of the new course to get familiar with the fastai library and PyTorch.
Deep Learning Part 1 covers the use of deep learning for image recognition, recommendation systems, sentiment analysis, and time-series prediction. Part 2 will take this further by teaching you how to read and implement cutting edge research papers, generative models and other advanced architectures, and more in-depth natural language processing. As with all fast.ai courses, it will be practical, state-of-the-art, and geared towards coders.
Increasing diversity in AI is a core part of our mission at fast.ai to make deep learning more accessible. We want to get deep learning into the hands of as many people as possible, from as many diverse backgrounds as possible. People with different backgrounds have different problems they’re interested in solving. We are horrified by unethical uses of AI, widespread bias, and how overwhelmingly white and male most deep learning teams are. Increasing diversity won’t solve the problem of ethics and bias alone, but it is a necessary step.
How to Apply
If you are a woman, person of Color, LGBTQ person, person with a disability, or veteran in the Bay Area, have at least one year of coding experience, can fulfill the deep learning pre-requisite (described above), and can commit 8 hours a week to working on the course, we encourage you to apply for a diversity scholarship. The number of scholarships we are able to offer depends on how much funding we receive (if your organization may be able to sponsor one or more places, please let us know).
To apply for the fellowship, you will need to submit a resume and statement of purpose. The statement of purpose will include the following:
1 paragraph describing one or more problems you’d like to apply deep learning to
1 paragraph describing previous machine learning education (e.g. fast.ai courses, coursera, deeplearning.ai,…)
Confirm that you fulfill the deep learning part 1 pre-requisite (or that you have already completed the first 2 lessons and plan to complete the rest before the course starts)
Confirm that you are available to attend the course on Monday evenings in SOMA (for 7 weeks, beginning March 19), and that you can commit 8 hours a week to working on the course
Which under-indexed group(s) you are a part of (gender, race, sexual identity, veteran)
I’m not eligible for the diversity scholarship, but I’m still interested. Can I take the course? Absolutely! You can register here.
I don’t live in the San Francisco Bay Area; can I participate remotely? Yes! Once again, we will be offering remote international fellowships. Stay tuned for details to be released in a blog post in the next few weeks.
Will this course be made available online later? Yes, this course will be made freely available online afterwards. Benefits of taking the in-person course include earlier access, community and in-person interaction, and more structure (for those who struggle with motivation when taking online courses).
Is fast.ai able to sponsor visas or provide stipends for living expenses? No, we are not able to sponsor visas nor to cover living expenses.
How will this course differ from the fast.ai Deep Learning part 2 course taught in spring 2017? Our goal at fast.ai is to push the state-of-the-art. Each year, we want to make deep learning increasingly intuitive to use while giving better results. With our fastai library, we are beating our own state-of-the-art results from last year. Also, last year’s course was taught primarily in TensorFlow, while this year’s is taught in PyTorch.
As a child, I was nerdy and shy. At my elementary and middle schools, we had to present our science projects to judges in the school science fair each year, and I noticed that students who were outgoing and good at presenting were more likely to win. I remember feeling indignant: shouldn’t we just be judged on scientific merit? Why should things like smiling, making eye contact, and showing enthusiasm with the judges (all things I didn’t do) have any impact?
But it turns out that those other skills are actually useful! Personal branding is similar: we may want our professional work to stand on its own merit, but how we present and share it is important. And so, two weeks ago I found myself mentoring on the cringe-inducing topic of personal branding at the Women in Machine Learning Workshop, co-located with the deep learning conference NIPS. Part of me felt embarrassed to be talking about something as seemingly shallow as personal branding, while just a few tables away deep learning star Yoshua Bengio mentored on the more serious topic of deep learning. However, I’ve worked hard to make peace with the concept and wanted to share what I’ve discovered.
What is “personal branding” and why is it useful?
Over the past two years, I’ve consistently put time into twitter and blog posts. Here are a few ways this has been helpful to me:
Being able to raise money for 18 AI diversity scholarships and $250,000 of AWS credits to give to fast.ai students
I talked to a grad student who was giving an oral presentation at NIPS, and she noted how a classmate of hers with a much larger twitter following got significantly more retweets and attendees for his talk. This struck her as unfair since a larger twitter following doesn’t equate with better research, but it also convinced her that building a personal brand would be useful.
I think of personal branding as anything that helps people find out about you and your work. This includes blogging, using twitter, and public speaking. Personal branding is a bit like a web: your blog post may lead to a job interview; you may get a speaking engagement from someone who follows you on twitter; and your conference talk may lead some of the audience members to read your blog or follow you on twitter, continuing the cycle.
Personal branding is no substitute for doing high-quality technical work; it’s just the means by which you can share this work with a broader audience.
Making peace with personal branding
Here are a few things that helped me get okay with the idea of personal branding:
Personal branding sounds icky if you think of it as a shallow popularity contest, or as trying to trick people into clicking on links they don’t really want to click on. However, I now think of it as wanting people to know about high-quality work that I’m proud of and care about.
Realizing that social skills and communication skills are things I could get better at with practice (I consider personal branding to be a subset of communication skills). As a kid, I felt resentful because I thought people were either born outgoing or not, and that there wasn’t anything I could do; once I started working on those skills and saw them improve, I was encouraged.
People have a bias towards thinking the most valuable skills are the ones we are already good at and have already put a lot of time into (whether that’s a particular academic subject, programming language, or sport). I still catch myself feeling a particular affinity towards other mathematicians (I know firsthand that getting a math PhD was hard!), but a lot of other things are hard and valuable too.
Learn by observation
I recommend finding people who are doing personal branding well, and observing what they do and what works. What is it about that conference talk that made it so good? Why do you enjoy following X on twitter? What keeps you returning to Y’s blog?
In defense of twitter
Twitter seems really weird at first (I was a twitter skeptic for years, not starting to actively use it until 2014), but it’s actually really useful. I’ve met new people through twitter. I know people who have gotten jobs through twitter. There are some really interesting conversations that I see on twitter that I don’t see elsewhere, such as this discussion about what it means to do “more rigorous” deep learning experiments, or here where several genomics researchers responded to my question about whether Google’s DeepVariant is overhyped.
Apart from the “personal branding” aspects, twitter helps me practice being more concise. It’s been good for my writing skills. I also use it as a way of bookmarking blog posts I like and highlights from talks and conferences I attend, so sometimes I refer back to it as a reference.
A tweet about a talk by Sandya Sankarram that I really enjoyed.
Twitter for beginners
Your enjoyment of twitter will vary greatly depending on who you follow. It will take some experimenting to get this right. Feel free to unfollow people if you realize you’re not getting anything out of their tweets. Whenever I read an article I like or hear a talk I like, I always look up the author/speaker on twitter and see if I find their tweets interesting. If so, I follow them. Also, there are people whose writing/talks/other work you may love, but whose tweets you don’t really enjoy. You don’t have to follow them. Twitter is its own distinct medium, and being good at something else doesn’t necessarily translate. If you are particularly looking for deep learning tweets, you can check out Jeremy Howard’s likes, and follow some of the accounts shared there.
People use twitter in a variety of ways: as a social network, for political activism, for self-expression, and more. I use twitter primarily as a professional tool (I think of it as a more dynamic version of LinkedIn), so I try to keep most of my tweets related to data science. If your goal is personal branding or finding a job, I recommend keeping your tweets mostly focused on your field. Some people deal with this by having separate personal and professional twitter accounts (for instance, Data Science Renee does this).
New post: Please Don't Say "It used to be called Big Data and now it's called Deep Learning" https://t.co/4mm47R2uXE
Above: Sharing one of my own blog posts on Twitter.
Feel free to mute topics you don’t want to hear about (you can mute particular words), and mute people who bring you down. You are allowed to use twitter however you like, and you aren’t required to argue with anyone you don’t want to.
Twitter can be a low time commitment. You don’t need to check it every day. It’s fine to just tweet once a week. When I started, I primarily used it as a way to bookmark blog posts or articles I liked. Building up followers can be a long, slow process. Be patient.
Observe successful twitter accounts, of people who aren’t “famous” (famous people will have a ton of followers regardless of the quality of their tweets), to see what works. A few accounts you might want to check out for inspiration are: Mariya Yao, Julia Evans, Data Science Renee, and Stephanie Hurlburt. They each have built up over 20k followers, by providing thoughtful and interesting tweets, and generously promoting the work of others.
Speaking at Meetups or Conferences
Most people (including experts with tons of experience) are terrified and intimidated by public speaking, yet it is such a great way to share your work that it’s worth it.
Two years ago I decided I wanted to do more public speaking after not having done much for many years (my previous experience was primarily academic and from before I switched into the tech industry). I was nervous and also uncertain if I had anything of value to say. I started small, giving a 5 minute lightning talk at a PyLadies meetup to a particularly supportive audience, gradually working up through events with 50-100 people, to eventually presenting to 700 people at JupyterCon.
I prepare a ton for talks, since it both helps me feel less anxious and results in stronger talks. I prepare for short talks and small audiences, as well as big talks, because I want to be respectful of the audience. I think it’s particularly important to go through your timing to make sure that you’ll be able to cover what you plan (I’ve seen some talks get cut off before the speaker even reached their main point).
Nothing is more irritating to me as an audience member than having to sit through an infomercial. It’s important to offer useful information to your audience, and not just advertise your product or company. My goal with all my talks is to have some information that will be useful or thought-provoking, even if the listeners never take a fast.ai course.
For every talk I give, I ask if the venue will be able to do a video-recording (here are professional recordings of me speaking at an ML meetup at AWS and at PyBay). If not, I will often do my own recording. I use the software Camtasia to capture my screen and video, and have my own microphone that plugs into my computer via usb. For instance, this is how I created the below tutorial on Word Embeddings. About 80 people attended the live workshop and now 2,400 have watched the recording online! Getting or making recordings allows you to reach a broader audience, and it will make it easier for you to get future speaking engagements as you build up a portfolio of your past talks.
If my talk involves code, I try to create a demo on github (like this or this) that has enough documentation to stand alone as a tutorial or guide. Even if I don’t plan to cover all the set-up or background in my talk, I want to give people a resource that they can use later. You don’t need to create a recording or a demo to give a talk (particularly if it will stress you out), but it’s worth considering.
Public Speaking Resources
Technically Speaking was an excellent newsletter sharing links to blog posts and videos with public speaking advice for those in tech, created by Cate Huston and Chiu-Ki Chan, senior developers with a ton of speaking experience. Although it is no longer active, you can still check out the archives here.
If you are a woman or non-binary person living in Atlanta, NYC, SF, Chicago, or LA, I highly recommend Write Speak Code meetups or workshops as a great place to practice technical talks and receive constructive feedback.
I was scheduled to speak to an audience of 1,000 people at TEDx San Francisco in October (unfortunately, I ended up in the ICU with a life-threatening illness at the last minute and couldn’t attend, but I’d already gone through months of preparation and was completely ready). I was terrified, so I started working with a public speaking coach in preparation, and it was super helpful. I searched for coaches on yelp, and met with a few to find one that I particularly liked. From asking around, I’ve learned that a lot of excellent and famous speakers have worked with speech coaches. They can help with anything: from your voice and body language, to crafting engaging intros and conclusions. In hindsight, I probably should’ve met with a speech coach even earlier in my public speaking journey; you certainly don’t need to be preparing for an audience of 1,000 to hire one.
Many years ago I participated in a chapter of Toastmasters, and I enjoyed that. When I asked about speech coaches on twitter, several people told me that training in improv, theater, or singing had been helpful to them in the realm of public speaking.
Blogging

It’s like a resume, only better. I know of a few people who have had blog posts lead to job offers!
Helps you learn. Organizing knowledge always helps me synthesize my own ideas. One of the tests of whether you understand something is whether you can explain it to someone else. A blog post is a great way to do that.
Leads to invitations. I’ve gotten invitations to conferences and speaking engagements from my blog posts. I was invited to the TensorFlow Dev Summit (which was awesome!) for writing a blog post about how I don’t like TensorFlow.
Meet new people. I’ve met several people who have responded to blog posts I wrote.
Saves time. Any time you answer a question multiple times through email, you should turn it into a blog post, which makes it easier for you to share the next time someone asks.
It can be intimidating to start blogging, but remember that your target audience is you-6-months-ago, not Geoffrey Hinton. What would have been most helpful to your slightly younger self? You are best positioned to help people one step behind you. The material is still fresh in your mind. Many experts have forgotten what it was like to be a beginner (or an intermediate). The context of your particular background, your particular style, and your knowledge level will give a different twist to what you’re writing about.
And as inspiration, here are links to a few blogs that I consistently enjoy:
Sharing high quality work (both your own and that of others) will help you develop a platform to further your goals. Observe people who are doing this well: who are writing blog posts you enjoy, producing tweets you like to follow, or giving engaging talks. While we may want technical expertise to stand on its own, communication skills are vital in helping your work reach an audience. Starting to work on any new skillset is often intimidating, but with small steps and practice you will improve. As a meta-exercise, I did some personal branding in this post by linking to my own work. It felt uncomfortable, but I followed my own advice and did it anyway!
I want to answer some questions that I’m commonly asked: What kind of computer do I need to do deep learning? Why does fast.ai recommend Nvidia GPUs? What deep learning library do you recommend for beginners? How do you put deep learning into production? I think these questions all fall under a general theme of What do you need (in terms of hardware, software, background, and data) to do deep learning? This post is geared towards those new to the field and curious about getting started.
The hardware you need
We are indebted to the gaming industry
The video game industry is larger (in terms of revenue) than the film and music industries combined. In the last 20 years, the video game industry drove forward huge advances in GPUs (graphics processing units), which are used to do the matrix math needed for rendering graphics. Fortunately, these are exactly the type of computations needed for deep learning. These advances in GPU technology are a key part of why neural networks are proving so much more powerful now than they did a few decades ago. Training a deep learning model without a GPU would be painfully slow in most cases.
Not all GPUs are the same
Most deep learning practitioners are not programming GPUs directly; we are using software libraries (such as PyTorch or TensorFlow) that handle this. However, to effectively use these libraries, you need access to the right type of GPU. In almost all cases, this means having access to a GPU from the company Nvidia.
CUDA and OpenCL are the two main ways for programming GPUs. CUDA is by far the most developed, has the most extensive ecosystem, and is the most robustly supported by deep learning libraries. CUDA is a proprietary language created by Nvidia, so it can’t be used by GPUs from other companies. When fast.ai recommends Nvidia GPUs, it is not out of any special affinity or loyalty to Nvidia on our part, but that this is by far the best option for deep learning.
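As a quick sanity check, most deep learning libraries let you verify that a CUDA-capable Nvidia GPU is visible. Here is a minimal sketch using PyTorch (assuming it is installed):

```python
import torch

# Check whether PyTorch can see a CUDA-capable (Nvidia) GPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("No CUDA GPU found; using CPU")

# Tensors (and models) are moved to the chosen device explicitly
x = torch.randn(3, 3).to(device)
```

If this reports no GPU on a machine that has an Nvidia card, the usual culprit is a missing or mismatched CUDA driver.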
Nvidia dominates the market for GPUs, with the next closest competitor being the company AMD. This summer, AMD announced the release of a platform called ROCm to provide more support for deep learning. Support for ROCm in major deep learning libraries such as PyTorch, TensorFlow, MxNet, and CNTK is still under development. While I would love to see an open source alternative succeed, I have to admit that I find the documentation for ROCm hard to understand. I just read the Overview, Getting Started, and Deep Learning pages of the ROCm website and still can’t explain what ROCm is in my own words, although I want to include it here for completeness. (I admittedly don’t have a background in hardware, but I think that data scientists like me should be part of the intended audience for this project.)
If you don’t have a GPU…
If your computer doesn’t have a GPU or has a non-Nvidia GPU, you have several great options:
Use Crestle, through your browser: Crestle is a service (developed by fast.ai student Anurag Goel) that gives you a cloud environment with all the popular scientific and deep learning frameworks pre-installed and configured to run on a GPU, easily accessed through your browser. New users get 10 hours and 1 GB of storage for free. After this, GPU usage is 59 cents per hour. I recommend this option to those who are new to AWS or new to using the console.
Set up an AWS cloud instance through your console: You can create an AWS instance (which remotely provides you with Nvidia GPUs) by following the steps in this fast.ai setup lesson. AWS charges 90 cents per hour for this. Although our set-up materials are about AWS (and you’ll find the most forum support for AWS), one fast.ai student created a guide for Setting up an Azure Virtual Machine for Deep learning. And I’m happy to share and add a link if anyone writes a blog post about doing this with Google Cloud Engine.
Build your own box. Here’s a lengthy thread from our fast.ai forums where people ask questions, share what components they are using, and post other useful links and tips. The cheapest new Nvidia GPUs are around $300, with some students finding even cheaper used ones on eBay or Craigslist, and others paying more for more powerful GPUs. A few of our students wrote blog posts documenting how they built their machines:
Deep learning is a relatively young field, and the libraries and tools are changing quickly. For instance, Theano, which we chose to use for part 1 of our course in 2016, was just retired. PyTorch, which we are using currently, was only released earlier this year (2017). As Jeremy wrote previously, you should assume that whatever specific libraries and software you learn today will be obsolete in a year or two. The most important thing is to understand the underlying concepts, and towards that end, we are creating our own library on top of Pytorch that we believe makes deep learning concepts clearer, as well as encoding best practices as defaults.
Python is by far the most commonly used language for deep learning. There are a number of deep learning libraries available, with almost every major tech company backing a different library, although employees at those companies often use a mix of tools. Deep learning libraries include TensorFlow (Google), PyTorch (Facebook), MxNet (University of Washington, adopted by Amazon), CNTK (Microsoft), DeepLearning4j (Skymind), Caffe2 (also Facebook), Nnabla (Sony), PaddlePaddle (Baidu), and Keras (a high-level API that runs on top of several libraries in this list). All of these have Python options available.
Dynamic vs. Static Graph Computation
At fast.ai, we value the speed at which programmers can experiment and iterate (through easier debugging and more intuitive design) more highly than theoretical performance speed-ups. This is the reason we use PyTorch, a flexible deep learning library with dynamic computation.
One distinction amongst deep learning libraries is whether they use dynamic or static computation (some libraries, such as MxNet and now TensorFlow, allow for both). Dynamic computation means that the program is executed in the order you wrote it. This typically makes debugging easier, and makes it more straightforward to translate ideas from your head into code. Static computation means that you build a structure for your neural network in advance, and then execute operations on it. Theoretically, this allows the compiler to do greater optimizations, although it also means there may be more of a disconnect between what you intended your program to be and what the compiler executes. It also means that bugs can seem more removed from the code that caused them (for instance, if there is an error in how you constructed your graph, you may not realize it until you perform an operation on it later). Even though there are theoretical arguments that languages with static computation graphs are capable of better performance than languages with dynamic computation, we often find that this is not the case for us in practice.
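To make this concrete, here is a small sketch in PyTorch (not from the original post) of what dynamic computation looks like in practice: ordinary Python control flow can appear in the middle of the computation, and gradients flow through whichever branch actually ran.

```python
import torch

def forward(x):
    # Ordinary Python code builds the graph as it executes
    h = x * 2
    if h.sum() > 0:      # this branch is decided at run time, per input
        h = torch.relu(h)
    return h.mean()

x = torch.randn(5, requires_grad=True)
loss = forward(x)
loss.backward()          # gradients flow through whichever branch executed
print(x.grad)            # inspectable immediately, like any Python value
```

In a static-graph library, equivalent branching requires special graph-level operations (such as a conditional node) rather than a plain Python `if`, which is part of why debugging tends to be harder there.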
Google’s TensorFlow mostly uses a static computation graph, whereas Facebook’s PyTorch uses dynamic computation. (Note: TensorFlow announced a dynamic computation option, Eager Execution, just two weeks ago, although it is still quite early and most TensorFlow documentation and projects use the static option). In September, fast.ai announced that we had chosen PyTorch over TensorFlow to use in our course this year and to use for the development of our own library (a higher-level wrapper for PyTorch that encodes best practices). Briefly, here are a few of our reasons for choosing PyTorch (explained in much greater detail here):
easier to debug
dynamic computation is much better suited for natural language processing
traditional Object Oriented Programming style (which feels more natural to us)
TensorFlow’s use of unusual conventions, like scope and sessions, is confusing and adds more to learn
Google has put far more resources into marketing TensorFlow than anyone else, and I think this is one of the reasons that TensorFlow is so well known (for many people outside deep learning, TensorFlow is the only DL framework that they’ve heard of). As mentioned above, TensorFlow released a dynamic computation option a few weeks ago, which addresses some of the above issues. Many people have asked fast.ai if we are going to switch back to TensorFlow. The dynamic option is still quite new and far less developed, so we will happily continue with PyTorch for now. However, the TensorFlow team has been very receptive to our ideas, and we would love to see our fastai library ported to TensorFlow.
Note: The in-person version of our updated course, which uses PyTorch as well as our own fastai library, is happening currently. It will be released online for free after the course ends (estimated release: January).
What you need for production: not a GPU
Many people overcomplicate the idea of using deep learning in production and believe that they need much more complex systems than they actually do. You can use deep learning in production with a CPU and the webserver of your choice, and in fact, this is what we recommend for most use cases. Here are a few key points:
It is incredibly rare to need to train in production. Even if you want to update your model weights daily, you don’t need to train in production. Good news! This means that you are just doing inference (a forward pass through your model) in production, which is much quicker and easier than training.
You can use whatever webserver you like (e.g. Flask) and set up inference as a simple API call.
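As an illustration (a hypothetical sketch, assuming Flask is installed; the model here is a stand-in function, not a real trained model), such an inference API can be as simple as:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In a real service, load your trained model once at startup, e.g.:
#   model = load_model("model.pkl")
# Here a stand-in function plays that role: inference is just a forward pass.
def predict(features):
    return sum(features) / len(features)

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    features = request.get_json()["features"]
    return jsonify({"prediction": predict(features)})

if __name__ == "__main__":
    app.run(port=5000)  # a CPU-only server is fine for most inference workloads
```

A client then POSTs JSON to `/predict` and gets a prediction back; no GPU, queueing system, or special serving framework is required.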
GPUs only provide a speed-up if you are effectively able to batch your data. Even if you are getting 32 requests per second, using a GPU would most likely slow you down, because you’d have to wait a second from when the 1st arrived to collect all 32, then perform the computation, and then return the results. We recommend using a CPU in production, and you can always add more CPUs (easier than using multiple GPUs) as needed.
For big companies, it may make sense to use GPUs in production for serving; however, it will be clear when you reach this size. Prematurely trying to scale before it’s needed will only add needless complexity and slow you down.
The background you need: 1 year of coding
One of the frustrations that inspired Jeremy and me to create Practical Deep Learning for Coders was (and is) that most deep learning materials fall into one of two categories:
so shallow and high-level as to not give you the information or skills needed to actually use deep learning in the workplace or create state-of-the-art models. This is fine if you just want a high-level overview, but disappointing if you want to become a working practitioner.
highly theoretical and assume a graduate-level math background. This is a prohibitive barrier for many folks, and even as someone who has a math PhD, I found that the theory wasn’t particularly useful in learning how to code practical solutions. It’s not surprising that many materials have this slant. Until quite recently, deep learning was almost entirely an academic discipline, largely driven by questions of what could be published in top academic journals.
Our free course Practical Deep Learning for Coders is unique in that the only pre-requisite is 1 year of programming experience, yet it still teaches you how to create state-of-the-art models. Your background can be in any language, although you might want to learn some Python before starting the course, since that is what we use. We introduce math concepts as needed, and we don’t recommend that you try to front-load studying math theory in advance.
If you don’t know how to code, I highly recommend learning, and Python is a great language to start with if you are interested in data science.
The data you need: far less than you think
Although many have claimed that you need Google-size datasets to do deep learning, this is false. The power of transfer learning (combined with techniques like data augmentation) makes it possible to apply pre-trained models to much smaller datasets. As we’ve talked about elsewhere, at medical start-up Enlitic, Jeremy Howard led a team that used just 1,000 examples of lung CT scans with cancer to build an algorithm that was more accurate at diagnosing lung cancer than a panel of 4 expert radiologists. The C++ library Dlib has an example in which a face detector is accurately trained using only 4 images, containing just 18 faces!
A note about access
For the vast majority of people I talk with, the barriers to entry for deep learning are far lower than they expected and the costs are well within their budgets. However, I realize this is not the case universally. I’m periodically contacted by students who want to take our online course but can’t afford the costs of AWS. Unfortunately, I don’t have a solution. There are other barriers as well. Bruno Sánchez-A Nuño has written about the challenges of doing data science in places that don’t have reliable internet access, and fast.ai international fellow Tahsin Mayeesha describes hidden barriers to MOOC access in countries such as Bangladesh. I care about these issues of access, and it is dissatisfying to not have solutions.
An all-too-common scenario: a seemingly impressive machine learning model is a complete failure when implemented in production. The fallout includes leaders who are now skeptical of machine learning and reluctant to try it again. How can this happen?
One of the most likely culprits for this disconnect between results in development vs results in production is a poorly chosen validation set (or even worse, no validation set at all). Depending on the nature of your data, choosing a validation set can be the most important step. Although sklearn offers a train_test_split method, this method takes a random subset of the data, which is a poor choice for many real-world problems.
The definitions of training, validation, and test sets can be fairly nuanced, and the terms are sometimes inconsistently used. In the deep learning community, “test-time inference” is often used to refer to evaluating on data in production, which is not the technical definition of a test set. As mentioned above, sklearn has a train_test_split method, but no train_validation_test_split. Kaggle only provides training and test sets, yet to do well, you will need to split their training set into your own validation and training sets. Also, it turns out that Kaggle’s test set is actually sub-divided into two sets. It’s no surprise that many beginners may be confused! I will address these subtleties below.
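Since sklearn offers no train_validation_test_split, a common workaround (a sketch; the 60/20/20 proportions are just for illustration, and a purely random split like this is often the wrong choice, as discussed later) is to call train_test_split twice:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in range(100)]

# First carve off the test set (20% of the data)...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then split the remainder into training (60%) and validation (20%).
# 0.25 of the remaining 80% equals 20% of the original data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_valid), len(X_test))  # 60 20 20
```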
First, what is a “validation set”?
When creating a machine learning model, the ultimate goal is for it to be accurate on new data, not just the data you are using to build it. Consider the example below of 3 different models for a set of data:
The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it’s not the best choice. Why is that? If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.
The underlying idea is that:
the training set is used to train a given model
the validation set is used to choose between models (for instance, does a random forest or a neural net work better for your problem? do you want a random forest with 40 trees or 50 trees?)
the test set tells you how you’ve done. If you’ve tried out a lot of different models, you may get one that does well on your validation set just by chance, and having a test set helps make sure that is not the case.
A key property of the validation and test sets is that they must be representative of the new data you will see in the future. This may sound like an impossible order! By definition, you haven’t seen this data yet. But there are still a few things you know about it.
When is a random subset not good enough?
It’s instructive to look at a few examples. Although many of these examples come from Kaggle competitions, they are representative of problems you would see in the workplace.
If your data is a time series, choosing a random subset of the data will be both too easy (you can look at the data both before and after the dates you are trying to predict) and not representative of most business use cases (where you are using historical data to build a model for use in the future). If your data includes the date and you are building a model to use in the future, you will want to choose a continuous section with the latest dates as your validation set (for instance, the last two weeks or last month of the available data).
Suppose you want to split the time series data below into training and validation sets:
A random subset is a poor choice (too easy to fill in the gaps, and not indicative of what you’ll need in production):
Use the earlier data as your training set (and the later data for the validation set):
Kaggle currently has a competition to predict the sales in a chain of Ecuadorian grocery stores. Kaggle’s “training data” runs from Jan 1 2013 to Aug 15 2017 and the test data spans Aug 16 2017 to Aug 31 2017. A good approach would be to use Aug 1 to Aug 15 2017 as your validation set, and all the earlier data as your training set.
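A date-based split like this needs no special library support: pick a cutoff date and partition on it. A minimal sketch, where the rows are made-up stand-ins for the grocery-sales data and the Aug 1 cutoff follows the suggestion above:

```python
from datetime import date

# Hypothetical (sale_date, units_sold) rows standing in for the real data.
rows = [(date(2017, 7, 28), 10), (date(2017, 8, 3), 12),
        (date(2017, 8, 14), 9), (date(2017, 8, 15), 11)]

# Everything strictly before the cutoff trains the model;
# the final two weeks (Aug 1-15) become the validation set.
cutoff = date(2017, 8, 1)
train = [r for r in rows if r[0] < cutoff]
valid = [r for r in rows if r[0] >= cutoff]

print(len(train), len(valid))  # 1 3
```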
New people, new boats, new…
You also need to think about the ways in which the data you will be making predictions on in production may be qualitatively different from the data you have to train your model with.
In the Kaggle distracted driver competition, the independent data are pictures of drivers at the wheel of a car, and the dependent variable is a category such as texting, eating, or safely looking ahead. If you were the insurance company building a model from this data, note that you would be most interested in how the model performs on drivers you haven’t seen before (since you would likely have training data only for a small group of people). This is true of the Kaggle competition as well: the test data consists of people that weren’t used in the training set.
If you put one of the above images in your training set and one in the validation set, your model will seem to be performing better than it would on new people. Another perspective is that if you used all the people in training your model, your model may be overfitting to particularities of those specific people, and not just learning the states (texting, eating, etc).
A similar dynamic was at work in the Kaggle fisheries competition to identify the species of fish caught by fishing boats in order to reduce illegal fishing of endangered populations. The test set consisted of boats that didn’t appear in the training data. This means that you’d want your validation set to include boats that are not in the training set.
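One way to get this kind of person- or boat-aware split is sklearn’s GroupShuffleSplit, which keeps all rows sharing a group id on the same side of the split. A minimal sketch with made-up driver ids:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical features: one row per image, with a driver (or boat) id.
X = np.arange(12).reshape(12, 1)
y = np.array([0, 1] * 6)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])  # 4 drivers, 3 images each

# GroupShuffleSplit guarantees each driver's images land entirely on one
# side of the split, so validation drivers never appear in training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y, groups=groups))

assert set(groups[train_idx]).isdisjoint(set(groups[valid_idx]))
```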
Sometimes it may not be clear how your test data will differ. For instance, for a problem using satellite imagery, you’d need to gather more information on whether the training set just contained certain geographic locations, or if it came from geographically scattered data.
The dangers of cross-validation
The reason that sklearn doesn’t have a train_validation_test_split is that it is assumed you will often be using cross-validation, in which different subsets of the training set serve as the validation set. For example, for a 3-fold cross validation, the data is divided into 3 sets: A, B, and C. A model is first trained on A and B combined as the training set, and evaluated on the validation set C. Next, a model is trained on A and C combined as the training set, and evaluated on validation set B. Finally, a model is trained on B and C combined and evaluated on validation set A, with the model performance from the 3 folds being averaged in the end.
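This 3-fold rotation can be reproduced with sklearn’s KFold. The nine samples below are toy data; without shuffle=True, the folds are simply consecutive blocks, each taking a turn as the validation set:

```python
import numpy as np
from sklearn.model_selection import KFold

# Nine samples split into three folds (A, B, C): each fold serves once
# as the validation set while the other two form the training set.
X = np.arange(9).reshape(9, 1)
for train_idx, valid_idx in KFold(n_splits=3).split(X):
    print("train:", train_idx, "validate:", valid_idx)
```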
However, the problem with cross-validation is that it is rarely applicable to real-world problems, for all the reasons described in the above sections. Cross-validation only works in the same cases where you can randomly shuffle your data to choose a validation set.
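If you do want cross-validation on time-ordered data, sklearn also provides a TimeSeriesSplit variant that never shuffles: each fold validates on data that comes strictly after its training data. A small sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Each successive fold trains on a growing prefix of the series and
# validates on the block that immediately follows it, so the model
# never peeks at the future.
X = np.arange(10).reshape(10, 1)
for train_idx, valid_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < valid_idx.min()  # train always precedes validation
    print("train:", train_idx, "validate:", valid_idx)
```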
Kaggle’s “training set” = your training + validation sets
One great thing about Kaggle competitions is that they force you to think about validation sets more rigorously (in order to do well). For those who are new to Kaggle, it is a platform that hosts machine learning competitions. Kaggle typically breaks the data into two sets you can download:
a training set, which includes the independent variables, as well as the dependent variable (what you are trying to predict). For the example of an Ecuadorian grocery store trying to predict sales, the independent variables include the store id, item id, and date; the dependent variable is the number sold. For the example of trying to determine whether a driver is engaging in dangerous behaviors behind the wheel, the independent variable could be a picture of the driver, and the dependent variable is a category (such as texting, eating, or safely looking forward).
a test set, which just has the independent variables. You will make predictions for the test set, which you can submit to Kaggle and get back a score of how well you did.
This is the basic idea needed to get started with machine learning, but to do well, there is a bit more complexity to understand. You will want to create your own training and validation sets (by splitting the Kaggle “training” data). You will just use your smaller training set (a subset of Kaggle’s training data) for building your model, and you can evaluate it on your validation set (also a subset of Kaggle’s training data) before you submit to Kaggle.
The most important reason for this is that Kaggle has split the test data into two sets: for the public and private leaderboards. The score you see on the public leaderboard is just for a subset of your predictions (and you don’t know which subset!). How your predictions fare on the private leaderboard won’t be revealed until the end of the competition. The reason this is important is that you could end up overfitting to the public leaderboard and you wouldn’t realize it until the very end when you did poorly on the private leaderboard. Using a good validation set can prevent this. You can check if your validation set is any good by seeing whether your model gets similar scores on it and on the Kaggle test set.
Another reason it’s important to create your own validation set is that Kaggle limits you to two submissions per day, and you will likely want to experiment more than that. Thirdly, it can be instructive to see exactly what you’re getting wrong on the validation set, and Kaggle doesn’t tell you the right answers for the test set or even which data points you’re getting wrong, just your overall score.
Understanding these distinctions is not just useful for Kaggle. In any predictive machine learning project, you want your model to be able to perform well on new data.
What is the ethical responsibility of data scientists?
“What we’re talking about is a cataclysmic change… What we’re talking about is a major foreign power with sophistication and ability to involve themselves in a presidential election and sow conflict and discontent all over this country… You bear this responsibility. You’ve created these platforms. And now they are being misused,” Senator Feinstein said this week in a senate hearing. Who has created a cataclysmic change? Who bears this large responsibility? She was talking to executives at tech companies and referring to the work of data scientists.
As we data scientists sit behind computer screens coding, we may not give much thought to the people whose lives may be changed by our algorithms. However, we have a moral responsibility to our world and to those whose lives will be impacted by our work. Technology is inherently about humans, and it is perilous to ignore human psychology, sociology, and history while creating tech. Even aside from our ethical responsibility, you could serve time in prison for the code you write, like the Volkswagen engineer who was sentenced to 3.5 years in prison for helping develop software to cheat on federal emissions tests. This is what his employer asked him to do, but following your boss’s orders doesn’t absolve you of responsibility and is not an excuse that will protect you in court.
As a data scientist, you may not have too much say in product decisions, but you can ask questions and raise issues. While it can be uncomfortable to stand up for what is right, you are in a fortunate position as part of only 0.3-0.5% of the global population who knows how to code. With this knowledge comes a responsibility to use it for good. There are many reasons why you may feel trapped in your job (needing a visa, supporting a family, being new to the industry); however, I have found that people in unethical or toxic work environments (my past self included) consistently underestimate their options. If you find yourself in an unethical environment, please at least attempt applying for other jobs. The demand for data scientists is high and if you are currently working as a data scientist, there are most likely other companies that would like to hire you.
One thing we should all be doing is thinking about how bad actors could misuse our technology. Here are a few key areas to consider:
How could trolls use your service to harass vulnerable people?
How could your work be used to spread harmful misinformation or propaganda?
What safeguards could be put in place to mitigate the above?
Data Science Impacts the World
The consequences of algorithms can be not only dangerous, but even deadly. Facebook is currently being used to spread dehumanizing misinformation about the Rohingya, an ethnic minority in Myanmar. As described above, over half a million Rohingya have been driven from their homes due to systematic murder, rape, and burning. For many in Myanmar, Facebook is their only news source. As quoted in the New York Times, one local official of a village with numerous restrictions prohibiting Muslims (the Rohingya are Muslim while the majority of the country is Buddhist) admits that he has never met a Muslim, but says “[they] are not welcome here because they are violent and they multiply like crazy with so many wives and children. I have to thank Facebook, because it is giving me the true information in Myanmar.”
Abe Gong, CEO of Superconductive Health, discusses a criminal recidivism algorithm used in U.S. courtrooms that included data about whether a person’s parents separated and if their father had ever been arrested. To be clear, this means that people’s prison sentences were longer or shorter depending on things their parents had done. Even if this increased the accuracy of the model, it is unethical to include this information, as it is completely beyond the control of the defendants. This is an example of why data scientists shouldn’t just unthinkingly optimize for a simple metric, but that we must also think about what type of society we want to live in.
Runaway Feedback Loops
Evan Estola, lead machine learning engineer at Meetup, discussed the example of men expressing more interest than women in tech meetups. Meetup’s algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop. It is encouraging to see a company not just unthinkingly optimize a metric, but to consider their impact.
While Meetup chose to avoid such an outcome, Facebook provides an example of allowing a runaway feedback loop to run wild. Facebook radicalizes users interested in one conspiracy theory by introducing them to more. As Renee DiResta, a researcher on the proliferation of disinformation, writes, “once people join a single conspiracy-minded [Facebook] group, they are algorithmically routed to a plethora of others. Join an anti-vaccine group, and your suggestions will include anti-GMO, chemtrail watch, flat Earther (yes, really), and ‘curing cancer naturally’ groups. Rather than pulling a user out of the rabbit hole, the recommendation engine pushes them further in.”
Yet another example is a predictive policing algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crime being recorded in those neighborhoods, and so on. Computer science research on Runaway Feedback Loops in Predictive Policing illustrates how this phenomenon arises and how it can be prevented.
Myths: “This is a neutral platform”, “How users use my tech isn’t my fault”, “Algorithms are impartial”
As someone outside the tech industry but who sees a lot of brand new tech, actor Kumail Nanjiani of the show Silicon Valley provides a helpful perspective. He recently tweeted that he and other cast members are often shown tech that scares them with its potential for misuse. Nanjiani writes, And we’ll bring up our concerns to them. We are realizing that ZERO consideration seems to be given to the ethical implications of tech. They don’t even have a pat rehearsed answer. They are shocked at being asked. Which means nobody is asking those questions. “We’re not making it for that reason but the way ppl choose to use it isn’t our fault. Safeguards will develop.” But tech is moving so fast. That there is no way humanity or laws can keep up… Only “Can we do this?” Never “should we do this?”
A common defense in response to calls for stronger ethics or accountability is for technologists such as Mark Zuckerberg to say that they are building neutral platforms. This defense doesn’t hold up, because any technology requires a number of decisions to be made. In the case of Facebook, decisions such as what to prioritize in the newsfeed, what metrics (such as ad revenue) to optimize for, what tools and filters to make available to advertisers vs users, and the firing of human editors have all influenced the product (as well as the political situation of many countries). Sociology professor Zeynep Tufecki argued in the New York Times that Facebook selling ads targeted to “Jew haters” was not a one-off failure, but rather an unsurprising outcome from how the platform is structured.
Others claim that they can not act to curb online harassment or hate speech as that would contradict the principle of free speech. Anil Dash, CEO of Fog Creek Software, writes, “the net effect of online abuse is to silence members of [under-represented] communities. Allowing abuse hurts free speech. Communities that allow abusers to dominate conversation don’t just silence marginalized people, they also drive away any reasonable or thoughtful person who’s put off by that hostile environment.” All tech companies are making decisions about who to include in their communities, whether it is through action or implicitly through inaction. Valerie Aurora debunks similar arguments in a post on the paradox of tolerance, explaining how free speech can be reduced overall when certain groups are silenced and intimidated. Choosing not to take action about abuse and harassment is still a decision, and it’s a decision that will have a large influence on who uses your platform.
Some data scientists may see themselves as impartially analyzing data. However, as iRobot director of data science Angela Bassa said, “It’s not that data can be biased. Data is biased.” Know how your data was generated and what biases it may contain. We are encoding and even amplifying societal biases in the algorithms we create. In a recent interview with Wired, Kate Crawford, co-founder of the AI Now Institute and principal researcher at Microsoft, explains that data is not neutral, data can not be neutralized, and “data will always bear the marks of its history.” We need to understand that history and what it means for the systems we build.
These biased outcomes arise for a number of reasons, including biased data sets and lack of diversity in the teams building the products. Using a held-out test set and avoiding overfitting is not just good practice, but also an ethical imperative. Overfitting often means that the error rates are higher on types of data that are not well-represented in the training set, quite literally under-represented or minority data.
I’ve done extensive research on retaining women at your company and on bias in interviews, including practical tips to address both. Stripe engineer Julia Evans thought she could do a better job at conducting phone interviews, so she created a rubric for evaluating candidates for herself, which was eventually adopted as a company-wide standard. She wrote an excellent post about Making Small Culture Changes that should be helpful regardless of what role you are in.
Systemic & Regulatory Response
This blog post is written with an audience of individual data scientists in mind, but systemic and regulatory responses are necessary as well. Renee DiResta draws an analogy between the advent of high frequency trading in the financial markets and the rise of bots and misinformation campaigns on social networks. She argues that just as regulations were needed for the financial markets to combat increasing fragility and bad actors, regulations are needed for social networks to combat increasing fragility and bad actors. Kate Crawford points out that there is a large gap between proposed ethical guidelines and what is happening in practice, because we don’t have accountability mechanisms in place.
The topic of ethics in data science is too huge and complicated to be thoroughly covered in a single blog post. I encourage you to do more reading on this topic and to discuss it with your co-workers and peers. Here are some resources to learn more: