I want to answer some questions that I’m commonly asked: What kind of computer do I need to do deep learning? Why does fast.ai recommend Nvidia GPUs? What deep learning library do you recommend for beginners? How do you put deep learning into production? I think these questions all fall under a general theme of What do you need (in terms of hardware, software, background, and data) to do deep learning? This post is geared towards those new to the field and curious about getting started.
The hardware you need
We are indebted to the gaming industry
The video game industry is larger (in terms of revenue) than the film and music industries combined. In the last 20 years, the video gaming industry drove forward huge advances in GPUs (graphical processing units), used to do the matrix math needed for rendering graphics. Fortunately, these are exactly the type of computations needed for deep learning. These advances in GPU technology are a key part of why neural networks are proving so much more powerful now than they did a few decades ago. Training a deep learning model without a GPU would be painfully slow in most cases.
Not all GPUs are the same
Most deep learning practitioners are not programming GPUs directly; we are using software libraries (such as PyTorch or TensorFlow) that handle this. However, to effectively use these libraries, you need access to the right type of GPU. In almost all cases, this means having access to a GPU from the company Nvidia.
CUDA and OpenCL are the two main ways for programming GPUs. CUDA is by far the most developed, has the most extensive ecosystem, and is the most robustly supported by deep learning libraries. CUDA is a proprietary language created by Nvidia, so it can’t be used by GPUs from other companies. When fast.ai recommends Nvidia GPUs, it is not out of any special affinity or loyalty to Nvidia on our part, but that this is by far the best option for deep learning.
Nvidia dominates the market for GPUs, with the next closest competitor being the company AMD. This summer, AMD announced the release of a platform called ROCm to provide more support for deep learning. The status of ROCm for major deep learning libraries such as PyTorch, TensorFlow, MxNet, and CNTK is still under development. While I would love to see an open source alternative succeed, I have to admit that I find the documentation for ROCm hard to understand. I just read the Overview, Getting Started, and Deep Learning pages of the ROCm website and still can’t explain what ROCm is in my own words, although I want to include it here for completeness. (I admittedly don’t have a background in hardware, but I think that data scientists like me should be part of the intended audience for this project.)
If you don’t have a GPU…
If your computer doesn’t have a GPU or has a non-Nvidia GPU, you have several great options:
Use Crestle, through your browser: Crestle is a service (developed by fast.ai student Anurag Goel) that gives you an already set up cloud service with all the popular scientific and deep learning frameworks already pre-installed and configured to run on a GPU in the cloud. It is easily accessed through your browser. New users get 10 hours and 1 GB of storage for free. After this, GPU usage is 59 cents per hour. I recommend this option to those who are new to AWS or new to using the console.
Set up an AWS cloud instance through your console: You can create an AWS instance (which remotely provides you with Nvidia GPUs) by following the steps in this fast.ai setup lesson. AWS charges 90 cents per hour for this. Although our set-up materials are about AWS (and you’ll find the most forum support for AWS), one fast.ai student created a guide for Setting up an Azure Virtual Machine for Deep learning. And I’m happy to share and add a link if anyone writes a blog post about doing this with Google Cloud Engine.
Build your own box. Here’s a lengthy thread from our fast.ai forums where people ask questions, share what components they are using, and post other useful links and tips. The cheapest new Nvidia GPUs are around $300, with some students finding even cheaper used ones on eBay or Craigslist, and others paying more for more powerful GPUs. A few of our students wrote blog posts documenting how they built their machines:
Deep learning is a relatively young field, and the libraries and tools are changing quickly. For instance, Theano, which we chose to use for part 1 of our course in 2016, was just retired. PyTorch, which we are using currently, was only released earlier this year (2017). As Jeremy wrote previously, you should assume that whatever specific libraries and software you learn today will be obsolete in a year or two. The most important thing is to understand the underlying concepts, and towards that end, we are creating our own library on top of Pytorch that we believe makes deep learning concepts clearer, as well as encoding best practices as defaults.
Python is by far the most commonly used language for deep learning. There are a number of deep learning libraries available, with almost every major tech company backing a different library, although employees at those companies often use a mix of tools. Deep learning libraries include TensorFlow (Google), PyTorch (Facebook), MxNet (University of Washington, adapted by Amazon), CNTK (Microsoft), DeepLearning4j (Skymind), Caffe2 (also Facebook), Nnabla (Sony), PaddlePaddle (Baidu), and Keras (a high-level API that runs on top of several other libraries in this list). All of these have Python options available.
Dynamic vs. Static Graph Computation
At fast.ai, we prioritize the speed at which programmers can experiment and iterate (through easier debugging and more intutive design) as more important than theoretical performance speed-ups. This is the reason we use PyTorch, a flexible deep learning library with dynamic computation.
One distinction amongst deep learning libraries is whether they use dynamic or static computations (some libraries, such as MxNet and now TensorFlow, allow for both). Dynamic computation mean that the program is executed in the order you wrote it. This typically makes debugging easier, and makes it more straightforward to translate ideas from your head into code. Static computation means that you build a structure for your neural network in advance, and then execute operations on it. Theoretically, this allows the compiler to do greater optimizations, although it also means there may be more of a disconnect between what you intended your program to be and what the compiler executes. It also means that bugs can seem more removed from the code that caused them (for instance, if there is an error in how you constructed your graph, you may not realize until you perform an operation on it later). Even though there are theoretical arguments that languages with static computation graphs are capable of better performance than languages with dynamic computation, we often find that is not the case for us in practice.
Google’s TensorFlow mostly uses a static computation graph, whereas Facebook’s PyTorch uses dynamic computation. (Note: TensorFlow announced a dynamic computation option, Eager Execution, just two weeks ago, although it is still quite early and most TensorFlow documentation and projects use the static option). In September, fast.ai announced that we had chosen PyTorch over TensorFlow to use in our course this year and to use for the development of our own library (a higher-level wrapper for PyTorch that encodes best practices). Briefly, here are a few of our reasons for choosing PyTorch (explained in much greater detail here):
easier to debug
dynamic computation is much better suited for natural language processing
traditional Object Oriented Programming style (which feels more natural to us)
TensorFlow’s use of unusual conventions like scope and sessions can be confusing and are more to learn
Google has put far more resources into marketing TensorFlow than anyone else, and I think this is one of the reasons that TensorFlow is so well known (for many people outside deep learning, TensorFlow is the only DL framework that they’ve heard of). As mentioned above, TensorFlow released a dynamic computation option a few weeks ago, which addresses some of the above issues. Many people have asked fast.ai if we are going to switch back to TensorFlow. The dynamic option is still quite new and far less developed, so we will happily continue with PyTorch for now. However, the TensorFlow team has been very receptive to our ideas, and we would love to see our fastai library ported to TensorFlow.
Note: The in-person version of our updated course, which uses PyTorch as well as our own fastai library, is happening currently. It will be released online for free after the course ends (estimated release: January).
What you need for production: not a GPU
Many people overcomplicate the idea of using deep learning in production and believe that they need much more complex systems than they actually do. You can use deep learning in production with a CPU and the webserver of your choice, and in fact, this is what we recommend for most use cases. Here are a few key points:
It is incredibly rare to need to train in production. Even if you want to update your model weights daily, you don’t need to train in production. Good news! This means that you are just doing inference (a forward pass through your model) in production, which is much quicker and easier than training.
You can use whatever webserver you like (e.g. Flask) and set up inference as a simple API call.
GPUs only provide a speed-up if you are effectively able to batch your data. Even if you are getting 32 requests per second, using a GPU would most likely slow you down, because you’d have to wait a second from when the 1st arrived to collect all 32, then perform the computation, and then return the results. We recommend using a CPU in production, and you can always add more CPUs (easier than using multiple GPUs) as needed.
For big companies, it may make sense to use GPUs in production for serving– however, it will be clear when your reach this size. Prematurely trying to scale before it’s needed will only add needless complexity and slow you down.
The background you need: 1 year of coding
One of the frustrations that inspired Jeremy and I to create Practical Deep Learning for Coders was (is) that most deep learning materials fall into one of two categories:
so shallow and high-level as to not give you the information or skills needed to actually use deep learning in the workplace or create state-of-the-art models. This is fine if you just want a high-level overview, but disappointing if you want to become a working practitioner.
highly theoretical and assume a graduate level math background. This is a prohibitive barrier for many folks, and even as someone who has a math PhD, I found that the theory wasn’t particularly useful in learning how to code practical solutions. It’s not surprising that many materials have this slant. Until quite recently, deep learning was almost entirely an academic discipline and largely driven by questions of what would publish in top academic journals.
Our free course Practical Deep Learning for Coders is unique in that the only pre-requisite is 1 year of programming experience, yet it still teaches you how to create state-of-the-art models. Your background can be in any language, although you might want to learn some Python before starting the course, since that is what we use. We introduce math concepts as needed, and we don’t recommend that you try to front-load studying math theory in advance.
If you don’t know how to code, I highly recommend learning, and Python is a great language to start with if you are interested in data science.
The data you need: far less than you think
Although many have claimed that you need Google-size data sets to do deep learning, this is false. The power of transfer learning (combined with techniques like data augmentation) make it possible for people to apply pre-trained models to much smaller datasets. As we’ve talked about elsewhere, at medical start-up Enlitic, Jeremy Howard led a team that used just 1,000 examples of lung CT scans with cancer to build an algorithm that was more accurate at diagnosing lung cancer than a panel of 4 expert radiologists. The C++ library Dlib has an example in which a face detector is accurately trained using only 4 images, containing just 18 faces!
A note about access
For the vast majority of people I talk with, the barriers to entry for deep learning are far lower than they expected and the costs are well within their budgets. However, I realize this is not the case universally. I’m periodically contacted by students that want to take our online course but can’t afford the costs of AWS. Unfortunately, I don’t have a solution. There are other barriers as well. Bruno Sánchez-A Nuño has written about the challenges of doing data science in places that don’t have reliable internet access, and fast.ai international fellow Tahsin Mayeesha describes hidden barriers to MOOC access in countries such as Bangladesh. I care about these issues of access, and it is disatisifying to not have solutions.
An all-too-common scenario: a seemingly impressive machine learning model is a complete failure when implemented in production. The fallout includes leaders who are now skeptical of machine learning and reluctant to try it again. How can this happen?
One of the most likely culprits for this disconnect between results in development vs results in production is a poorly chosen validation set (or even worse, no validation set at all). Depending on the nature of your data, choosing a validation set can be the most important step. Although sklearn offers a train_test_split method, this method takes a random subset of the data, which is a poor choice for many real-world problems.
The definitions of training, validation, and test sets can be fairly nuanced, and the terms are sometimes inconsistently used. In the deep learning community, “test-time inference” is often used to refer to evaluating on data in production, which is not the technical definition of a test set. As mentioned above, sklearn has a train_test_split method, but no train_validation_test_split. Kaggle only provides training and test sets, yet to do well, you will need to split their training set into your own validation and training sets. Also, it turns out that Kaggle’s test set is actually sub-divided into two sets. It’s no suprise that many beginners may be confused! I will address these subtleties below.
First, what is a “validation set”?
When creating a machine learning model, the ultimate goal is for it to be accurate on new data, not just the data you are using to build it. Consider the below example of 3 different models for a set of data:
The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it’s not the best choice. Why is that? If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.
The underlying idea is that:
the training set is used to train a given model
the validation set is used to choose between models (for instance, does a random forest or a neural net work better for your problem? do you want a random forest with 40 trees or 50 trees?)
the test set tells you how you’ve done. If you’ve tried out a lot of different models, you may get one that does well on your validation set just by chance, and having a test set helps make sure that is not the case.
A key property of the validation and test sets is that they must be representative of the new data you will see in the future. This may sound like an impossible order! By definition, you haven’t seen this data yet. But there are still a few things you know about it.
When is a random subset not good enough?
It’s instructive to look at a few examples. Although many of these examples come from Kaggle competitions, they are representative of problems you would see in the workplace.
If your data is a time series, choosing a random subset of the data will be both too easy (you can look at the data both before and after the dates your are trying to predict) and not representative of most business use cases (where you are using historical data to build a model for use in the future). If your data includes the date and you are building a model to use in the future, you will want to choose a continuous section with the latest dates as your validation set (for instance, the last two weeks or last month of the available data).
Suppose you want to split the time series data below into training and validation sets:
A random subset is a poor choice (too easy to fill in the gaps, and not indicative of what you’ll need in production):
Use the earlier data as your training set (and the later data for the validation set):
Kaggle currently has a competition to predict the sales in a chain of Ecuadorian grocery stores. Kaggle’s “training data” runs from Jan 1 2013 to Aug 15 2017 and the test data spans Aug 16 2017 to Aug 31 2017. A good approach would be to use Aug 1 to Aug 15 2017 as your validation set, and all the earlier data as your training set.
New people, new boats, new…
You also need to think about what ways the data you will be making predictions for in production may be qualitatively different from the data you have to train your model with.
In the Kaggle distracted driver competition, the independent data are pictures of drivers at the wheel of a car, and the dependent variable is a category such as texting, eating, or safely looking ahead. If you were the insurance company building a model from this data, note that you would be most interested in how the model performs on drivers you haven’t seen before (since you would likely have training data only for a small group of people). This is true of the Kaggle competition as well: the test data consists of people that weren’t used in the training set.
If you put one of the above images in your training set and one in the validation set, your model will seem to be performing better than it would on new people. Another perspective is that if you used all the people in training your model, your model may be overfitting to particularities of those specific people, and not just learning the states (texting, eating, etc).
A similar dynamic was at work in the Kaggle fisheries competition to identify the species of fish caught by fishing boats in order to reduce illegal fishing of endangered populations. The test set consisted of boats that didn’t appear in the training data. This means that you’d want your validation set to include boats that are not in the training set.
Sometimes it may not be clear how your test data will differ. For instance, for a problem using satellite imagery, you’d need to gather more information on whether the training set just contained certain geographic locations, or if it came from geographically scattered data.
The dangers of cross-validation
The reason that sklearn doesn’t have a train_validation_test split is that it is assumed you will often be using cross-validation, in which different subsets of the training set serve as the validation set. For example, for a 3-fold cross validation, the data is divided into 3 sets: A, B, and C. A model is first trained on A and B combined as the training set, and evaluated on the validation set C. Next, a model is trained on A and C combined as the training set, and evaluated on validation set B. And so on, with the model performance from the 3 folds being averaged in the end.
However, the problem with cross-validation is that it is rarely applicable to real world problems, for all the reasons describedin the above sections. Cross-validation only works in the same cases where you can randomly shuffle your data to choose a validation set.
Kaggle’s “training set” = your training + validation sets
One great thing about Kaggle competitions is that they force you to think about validation sets more rigorously (in order to do well). For those who are new to Kaggle, it is a platform that hosts machine learning competitions. Kaggle typically breaks the data into two sets you can download:
a training set, which includes the independent variables, as well as the dependent variable (what you are trying to predict). For the example of an Ecuadorian grocery store trying to predict sales, the independent variables include the store id, item id, and date; the dependent variable is the number sold. For the example of trying to determine whether a driver is engaging in dangerous behaviors behind the wheel, the independent variable could be a picture of the driver, and the dependent variable is a category (such as texting, eating, or safely looking forward).
a test set, which just has the independent variables. You will make predictions for the test set, which you can submit to Kaggle and get back a score of how well you did.
This is the basic idea needed to get started with machine learning, but to do well, there is a bit more complexity to understand. You will want to create your own training and validation sets (by splitting the Kaggle “training” data). You will just use your smaller training set (a subset of Kaggle’s training data) for building your model, and you can evaluate it on your validation set (also a subset of Kaggle’s training data) before you submit to Kaggle.
The most important reason for this is that Kaggle has split the test data into two sets: for the public and private leaderboards. The score you see on the public leaderboard is just for a subset of your predictions (and you don’t know which subset!). How your predictions fare on the private leaderboard won’t be revealed until the end of the competition. The reason this is important is that you could end up overfitting to the public leaderboard and you wouldn’t realize it until the very end when you did poorly on the private leaderboard. Using a good validation set can prevent this. You can check if your validation set is any good by seeing if your model has similar scores on it to compared with on the Kaggle test set.
Another reason it’s important to create your own validation set is that Kaggle limits you to two submissions per day, and you will likely want to experiment more than that. Thirdly, it can be instructive to see exactly what you’re getting wrong on the validation set, and Kaggle doesn’t tell you the right answers for the test set or even which data points you’re getting wrong, just your overall score.
Understanding these distinctions is not just useful for Kaggle. In any predictive machine learning project, you want your model to be able to perform well on new data.
What is the ethical responsibility of data scientists?
What we’re talking about is a cataclysmic change… What we’re talking about is a major foreign power with sophistication and ability to involve themselves in a presidential election and sow conflict and discontent all over this country… You bear this responsibility. You’ve created these platforms. And now they are being misused, Senator Feinstein said this week in a senate hearing. Who has created a cataclysmic change? Who bears this large responsibility? She was talking to executives at tech companies and referring to the work of data scientists.
As we data scientists sit behind computer screens coding, we may not give much thought to the people whose lives may be changed by our algorithms. However, we have a moral responsiblity to our world and to those whose lives will be impacted by our work. Technology is inherently about humans, and it is perilous to ignore human psychology, sociology, and history while creating tech. Even aside from our ethical responsibility, you could serve time in prison for the code you write, like the Volkswagon engineer who was sentenced to 3.5 years in prison for helping develop software to cheat on federal emissions tests. This is what his employer asked him to do, but following your boss’s orders doesn’t absolve you of responsibility and is not an excuse that will protect you in court.
As a data scientist, you may not have too much say in product decisions, but you can ask questions and raise issues. While it can be uncomfortable to stand up for what is right, you are in a fortunate position as part of only 0.3-0.5% of the global population who knows how to code. With this knowledge comes a responsibility to use it for good. There are many reasons why you may feel trapped in your job (needing a visa, supporting a family, being new to the industry); however, I have found that people in unethical or toxic work environments (my past self included) consistently underestimate their options. If you find yourself in an unethical environment, please at least attempt applying for other jobs. The demand for data scientists is high and if you are currently working as a data scientist, there are most likely other companies that would like to hire you.
One thing we should all be doing is thinking about how bad actors could misuse our technology. Here are a few key areas to consider:
How could trolls use your service to harass vulnerable people?
How could your work be used to spread harmful misinformation or propaganda?
What safeguards could be put in place to mitigate the above?
Data Science Impacts the World
The consequences of algorithms can be not only dangerous, but even deadly. Facebook is currently being used to spread dehumanizing misinformation about the Rohingya, an ethnic minority in Myanmar. As described above, over half a million Rohinyga have been driven from their homes due to systematic murder, rape, and burning. For many in Myanmar, Facebook is their only news source. As quoted in the New York Times, one local official of a village with numerous restrictions prohibiting Muslims (the Rohingya are Muslim while the majority of the country is Buddhist) admits that he has never met a Muslim, but says [they] are not welcome here because they are violent and they multiply like crazy with so many wives and children.I have to thank Facebook, because it is giving me the true information in Myanmar.
Abe Gong, CEO of Superconductive Health, discusses a criminal recidivism algorithm used in U.S. courtrooms that included data about whether a person’s parents separated and if their father had ever been arrested. To be clear, this means that people’s prisons sentences were longer or shorter depending on things their parents had done. Even if this increased the accuracy of the model, it is unethical to include this information, as it is completely beyond the control of the defendants. This is an example of why data scientists shouldn’t just unthinkingly optimize for a simple metric, but that we must also think about what type of society we want to live in.
Runaway Feedback Loops
Evan Estola, lead machine learning engineer at Meetup, discussed the example of men expressing more interest than women in tech meetups. Meetup’s algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop. It is encouraging to see a company not just unthinkingly optimize a metric, but to consider their impact.
While Meetup chose to avoid such an outcome, Facebook provides an example of allowing a runaway feedback loop to run wild. Facebook radicalizes users interested in one conspiracy theory by introducing them to more. As Renee DiResta, a researcher on proliferation of disinformation, writes, once people join a single conspiracy-minded [Facebook] group, they are algorithmically routed to a plethora of others. Join an anti-vaccine group, and your suggestions will include anti-GMO, chemtrail watch, flat Earther (yes, really), and ‘curing cancer naturally’ groups. Rather than pulling a user out of the rabbit hole, the recommendation engine pushes them further in.
Yet another example is a predictive policing algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crime being recorded in those neighborhoods, and so on. Computer science research on Runaway Feedback Loops in Predictive Policing illustrates how this phenomenon arises and how it can be prevented.
Myths: “This is a neutral platform”, “How users use my tech isn’t my fault”, “Algorithms are impartial”
As someone outside the tech industry but who sees a lot of brand new tech, actor Kumail Nanjiani of the show Silicon Valley provides a helpful perspective. He recently tweeted that he and other cast members are often shown tech that scares them with its potential for misuse. Nanjiani writes, And we’ll bring up our concerns to them. We are realizing that ZERO consideration seems to be given to the ethical implications of tech. They don’t even have a pat rehearsed answer. They are shocked at being asked. Which means nobody is asking those questions. “We’re not making it for that reason but the way ppl choose to use it isn’t our fault. Safeguards will develop.” But tech is moving so fast. That there is no way humanity or laws can keep up… Only “Can we do this?” Never “should we do this?”
A common defense in response to calls for stronger ethics or accountability is for technologists such as Mark Zuckerberg to say that they are building neutral platforms. This defense doesn’t hold up, because any technology requires a number of decisions to be made. In the case of Facebook, decisions such as what to prioritize in the newsfeed, what metrics (such as ad revenue) to optimize for, what tools and filters to make available to advertisers vs users, and the firing of human editors have all influenced the product (as well as the political situation of many countries). Sociology professor Zeynep Tufecki argued in the New York Times that Facebook selling ads targeted to “Jew haters” was not a one-off failure, but rather an unsurprising outcome from how the platform is structured.
Others claim that they can not act to curb online harassment or hate speech as that would contradict the principle of free speech. Anil Dash, CEO of Fog Creek software, writes, “the net effect of online abuse is to silence members of [under-represented] communities. Allowing abuse hurts free speech. Communities that allow abusers to dominate conversation don’t just silence marginalized people, they also drive away any reasonable or thoughtful person who’s put off by that hostile environment.” All tech companies are making decisions about who to include their communities, whether it is through action or implicitly through inaction. Valerie Aurora debunks similar arguments in a post on the paradox of tolerance explaining how free speech can be reduced overall when certain groups are silenced and intimidated. Choosing not to take action about abuse and harassment is still a decision, and it’s a decision that will have a large influence on who uses your platform.
Some data scientists may see themselves as impartially analyzing data. However, as iRobot director of data science Angela Bassett said, “It’s not that data can be biased. Data is biased.” Know how your data was generated and what biases it may contain. We are encoding and even amplifying societal biases in the algorithms we create. In a recent interview with Wired, Kate Crawford, co-founder of the AI Now Institute and principal researcher at Microsoft, explains that data is not neutral, data can not be neutralized, and “data will always bear the marks of its history.” We need to understand that history and what it means for the systems we build.
These biased outcomes arise for a number of reasons, including biased data sets and lack of diversity in the teams building the products. Using a held-out test set and avoiding overfitting is not just good practice, but also an ethical imperative. Overfitting often means that the error rates are higher on types of data that are not well-represented in the training set, quite literally under-represented or minority data.
I’ve done extensive research on retaining women at your company and on bias in interviews, including practical tips to address both. Stripe Engineer Julia Evans thought she could do a better job at conducting phone interviews, so she created a rubric for evaluating candidates for herself, which was eventually adopted as a company-wide standard. She wrote an excellent post about Making Small Culture Changes that should be helpful regardless of what role you are in.
Systemic & Regulatory Response
This blog post is written with an audience of individual data scientists in mind, but systemic and regulatory responses are necessary as well. Renee DiResta draws an analogy between the advent of high frequency trading in the financial markets and the rise of bots and misinformation campaigns on social networks. She argues that just as regulations were needed for the financial markets to combat increasing fragility and bad actors, regulations are needed for social networks to combat increasing fragility and bad actors. Kate Crawford points out that there is a large gap between proposed ethical guidelines and what is happening in practice, because we don’t have accountability mechanisms in place.
The topic of ethics in data science is too huge and complicated to be thoroughly covered in a single blog post. I encourage you to do more reading on this topic and to discuss it with your co-workers and peers. Here are some resources to learn more:
There is a lot of misleading and even false information about AI out there, ranging from apallingly bad journalism to overhyped marketing materials to quotes from misinformed celebrities. Last month, it even got so bad that Snopes had to debunk a story about Facebook research that was inaccurately covered by a number of outlets.
AI is a complex topic moving at an overwhelming pace (even as someone working in the field, I find it impossible to keep up with everything that is happening). Beyond that, there are those who stand to profit off overhyping advances or drumming up fear.
I want to recommend several credible sources of accurate information. Most of the writing on this list is intended to be accessible to anyone—even if you aren’t a programmer or don’t work in tech:
Jack Clark’s email newsletter, Import AI, provides highlights and summaries of a selection of AI news and research from the previous week. You can check out previous issues (or sign up) here. Jack Clark is Strategy & Communications Director at OpenAI.
Mariya Yao’s writing on Topbots and on Forbes. Mariya is CTO and head of R&D for Topbots, a strategy and research firm for applied artificial intelligence and machine learning. Fun fact: Mariya worked on the LIDAR system for the 2nd place winner in the DARPA grand challenge for autonomous vehicles.
Zeynep Tufekci, a professor at UNC-Chapel Hill, is an expert on the interactions between technology and society. She shares a lot of important ideas on twitter, or read her New York Times op-eds here.
Kate Crawford is a professor at NYU, principal researcher at Microsoft, and co-founder of the AI Now Research institute, dedicated to studying the social impacts of AI. You can follow her on twitter here.
I also want to highlight a few great examples of AI researchers thoughtfully deconstructing the hype around some high-profile stories in the past few months, in an accessible way:
Denny Britz provides a balanced perspective on OpenAIs bot for Defense of the Ancients (a popular computer game).
Twitter is quite useful for keeping up on machine learning news and many people share surprisingly deep insights (that I often can’t find elsewhere). I was skeptical of Twitter before I started using it. The whole idea seemed weird: you can only write a 140 characters at a time? I already had Facebook and Linkedin, did I really need another social media account? It now occupies a useful and distinct niche for me. The hardest part is getting started; feel free to take a look at my twitter or Jeremy’s favorites to look for interesting accounts. Whenever I read an article I like or hear a talk I like, I always look up the author/speaker on twitter and see if I find their tweets interesting. If so, I follow them.
I will update this post based on constructive feedback as I receive it (with attribution)—although I’ve tried to largely stick to my area of specialty (data science) I’ve had to touch on various areas that I’m not an expert in, so please do let me know if you notice any issues. You can reach me on Twitter at @jeremyphoward.
When I first read about this study, I had a strong negative emotional response. The topic is of great personal interest to me—Rachel Thomas and I started fast.ai explicitly for the purpose of increasing diversity in the field of deep learning (including the deep neural networks used in this study), and we even personally pay for scholarships for diverse students, including LGBTQ students. In addition, we want to support the use of deep learning in a wider range of fields, because we believe that it can both positively and negatively impact many people’s lives, so we want to show how to use the technology appropriately and correctly.
Like many commentators, I had many basic concerns about this study. Should it have been done at all? Was the data collection an invasion of privacy? Were the right people involved in the work? Were the results communicated in a thoughtful and sensitive way? These are important issues, and can’t be answered by any single individual. Because deep learning is making it possible for computers to do things that weren’t possible before, we’re going to see more and more areas where these questions are going to arise. Therefore, we need to see more cross-disciplinary studies being done by more cross-disciplinary teams. In this case, the researchers are data scientists and psychologists, but the paper covers topics (and claims to reach conclusions) in fields from sociology to biology.
So, what does the paper actually show - can neural nets do what is claimed, or not? We will analyze this question as a data scientist - by looking at the data.
The key conclusions of both the paper (“Deep neural networks can detect sexual orientation from faces”) and the response (“AI can’t tell if you’re gay”) are not supported by the research shown. What is supported is a weaker claim: in some situations deep neural networks can recognize some photos of gay dating website users from some photos of heterosexual dating website users. We definitely can’t say “AI can’t tell if you’re gay”, and indeed to make this claim is irresponsible: the paper at least shows some sign that the opposite might well be true, and that such technology is readily available and easily used by any government or organization.
The senior researcher on this paper, Michael Kosinski, has successfully warned us of similar problems in the past: his paper Private traits and attributes are predictable from digital records of human behavior is one of the most cited of all time, and was at least partly responsible for getting Facebook to change their policy of having ‘likes’ be public by default. If the key result in this new study does turn out to be correct, then we should certainly be having a discussion about what policy implications it has. If you live in a country where homosexuality is punishable by death, then you need to be open to the possibility that you could be profiled for extra surveillance based on your social media pictures. If you are in a situation where you can’t be open about your sexual preference, you should be aware that a machine learning recommendation system could (perhaps even accidentally) target you with merchandise targeted to a gay demographic.
However, the paper comes to many other conclusions that are not directly related to this key question, are not clearly supported by the research, and are overstated and poorly communicated. In particular, the paper claims that the research supports the “widely accepted” prenatal hormone theory (PHT) that “same-gender sexual orientation stems from the under-exposure of male fetuses or overexposure of female fetuses to androgens that are responsible for sexual differentiation”. The support for this in the paper is far from rigorous, and should be considered inconclusive. Furthermore, the sociologist Greggor Mattson says that not only is the theory not widely accepted, but that “literally the first sentence of a decade-old review of the field is ‘Public perceptions of the effect of testosterone on ‘manly’ behavior are inaccurate’”.
How was the research was done?
A number of studies are presented in the paper, but the key one is ‘study 1a’. In this study, the researchers downloaded on average 5 images of 70,000 people from a dating website. None of the data collected in the study has been made available, although nearly any programmer could easily replicate this (and indeed many coders have created similar datasets in the past). Because the study’s focus is on detecting sexual orientation from faces, they cropped the photos to the area of the face. They also removed photos with multiple people, or where the face wasn’t clear, or wasn’t looking straight at the camera. The technical approach here is very standard and reliable, using widely used open source software called Face++.
They then removed any images that a group of non-expert workers considered either not adult, or not caucasian (using Amazon’s Mechanical Turk system). It wasn’t entirely clear why they did this; most likely they were assuming that handling more types of face would make it harder to train their model.
It’s important to be aware that steps like this used to ‘clean’ a dataset are necessary for nearly all data science projects, but they are rarely if ever perfect - and those imperfections are generally not important in understanding the accuracy of a study. What’s important for evaluation is to be confident that the final metrics reported are evaluated appropriately. More on this shortly…
They labeled each as gay or not based on each dating profile’s listed sexual preference.
The researchers then used a deep neural network (VGG-Face) to create features. Specifically, each image was turned into 4096 numbers, each of which had been trained by University of Oxford researchers to be as good as possible for recognizing humans from their faces. They compressed those 4096 numbers down 500 using a simple statistical technique called SVD, and they then used a simple regression model to map these 500 numbers to the label (gay or not).
They repeated the regression 10 times. Each time they used a different 90% subset of the data, and tested the model using the remaining 10% (this is known as cross-validation). The ten models were scored using a metric called AUC which is a standard approach for evaluating classification models like this one. The AUC for people marked as males in the dataset was 0.91.
How accurate is this model?
The researchers describe their model as “91% accurate”. This is based on the AUC score of 0.91. However, it is very unusual and quite misleading to use the word “accuracy” to describe AUC. The researchers have clarified that the actual accuracy of the model can be understood as follows: if you pick the 10% of people in the study with the highest scores on the model, about half will actually be gay based on the collected labels. If the actual percentage of gay males is 7%, then this shows that the model is a lot better than random. However it is probably not as accurate as most people would imagine if they heard something was “91% accurate”.
It is also important to note that based on this study (study 1a) we can only really say that the model can recognize gay dating profiles from one web-site of people that non-experts label as adult caucasian, not that it can recognize gay photos in general. It’s quite likely that this model will generalize to other similar populations, but we don’t know from this research how similar those populations would need to be, and how accurate it will be.
Have the researchers created a new technology here?
The approach used in this study is literally the very first technique that we teach in our introductory deep learning course. Our course does not require an advanced math background - only high school math is need. So the approach used here is literally something anyone can do with high school math, an hour of free online study, and a basic knowledge of programming.
A model trained in this way takes under 20 seconds to run on a commodity server that can be rented for $0.90/hour. So it does not require any special or expensive resources. The data can be easily downloaded from dating websites by anyone with basic coding skills.
The researchers say that their study shows a potential privacy problem. Since the technology they used is very accessible, then if you believe that the capabilities shown are of concern, then this claim seems reasonable.
It is probably reasonably to assume that many organizations have already completed similar projects, but without publishing them in the academic literature. This paper showing what can already be easily done—it is not creating a new technology. It is becoming increasingly common for marketers to use social media data to help push their products; in these cases the models simply look for correlations between product sales and and social media data that is available. In this case it would be very easy for a model in implicitly find a relationship between certain photos and products targeted to a gay market, without the developers even realizing it had made that connection. Indeed, we have seen somewhat similar issues before such as that described in the article How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did.
Did the model show that gay faces are physically different?
In study 1b, the researchers cover up different parts of each image, to see which parts when covered up cause the prediction to change. This is a common technique for understanding the relative importance of different parts of an input to a neural network.
The results this analysis are shown in this picture from the paper:
The red areas are relatively more important to the model than the blue areas. However, this analysis does not show how much more important it is, or why or in what way the red areas were more important.
In study 1c they try to create an “average face” for each of male and female and for each of gay and heterosexual. This part of the study has no rigorous analysis and relies entirely on an intuitive view of the images shown. From a data science point of view, there is no additional information that can be gained from this section.
The researchers claim that these studies show support for the prenatal hormone theory. However, no data is presented that shows how this theory is supported or what level of support is provided, nor investigates possible alternative theories for the observations.
Is the model more accurate than humans?
The researchers claim in the first sentence of the abstract that “faces contain much more information about sexual orientation than can be perceived and interpreted by the human brain”. They base this claim on Study 4, in which they ask humans to classify images from the same dataset as study 1a. However, this study completely fails to provide an adequate methodology to support the claim. Stanford researcher Andrej Karpathy (now at Tesla) showed a fairly rigorous approach to how human image classification can be compared to a neural net. The key piece is to give the human the same opportunity to study the training data as the computer received. In this case, that would mean letting each human judge study many examples of the faces and labels collected in the dataset, before being asked to classify faces themselves.
By failing to provide this “human training” step, the humans and computers had very different information with which to complete the task. Even if the methodology was better, there would still be many possible explanations other than the very strong and unsupported claim that they decided to open their paper with.
As a rule, academic claims should be made with care and rigor and communicated thoughtfully. Especially when it opens a paper. Especially especially when it is in such a sensitive area. Double especially especially when it covers an area outside the researchers’ area of specialty. This issue is thoughtfully discussed with relation to this paper in a Calling Bullshit case study.
Is the classifier effective for images other than dating pics from one website?
In short: we don’t know. Study 5 in the paper asserts that it is, but it does not provide strong support for this claim, and is set up in a oddly convoluted way. The method used for study 5 was to find facebook pics from some extra-super-gay facebook users: people who listed a same-sex partner, and liked at least two pages such as “Manhunt” and “I love being gay”. It then tried to see if they could train a classifier to separate these pics from heterosexual dating web-site users. The claimed accuracy of this classifier was 74%, although the exact meaning of this 74% figure is not listed. If it means an AUC of 0.74 (which is how the researchers referred to AUC earlier in the paper), this is not a strong result. It’s also comparing across datasets (facebook vs dating website), and using a very particular type of facebook profile to do their test.
The researchers state that they didn’t compare to heterosexual profile pics because they didn’t know how to find them.
Are their conclusions supported by their studies?
In the General Discussion section the researchers come to a number of conclusions. All of the conclusions as stated are stronger than what can be concluded from the research results shown. However, we can at least say (assuming that their data analysis was completed correctly, which we can’t confirm since we don’t have access to their data or code) that sexual preference of people in some photos can be identified much better than randomly in some situations.
They conclude that their model does not simply find differences in presentation between the two groups, but actually shows differences in underlying facial structure. This claim is based partly on an assertion that the VGG-Face model they use is trained to identify non-transient facial features. However, some simple data analysis readily shows that this assertion is incorrect. Victoria University researcher Tom White shared an analysis that showed this exact model, for instance, can recognize happy from neutral faces with a higher AUC (0.92) than the model shown in this paper (and can recognize happy from sad faces even better, with an AUC of 0.96).
Did the paper confuse correlation with causation?
Any time that a paper from social scientists comes to the attention of groups of programmers (such as when shared on coding forums), inevitably we hear the cry “correlation is not causation”. This happened for this paper too. What does this mean, and is it a problem here? Naturally, XKCD has us covered:
Correlation refers to an observation that two things happen at the same time. For instance, you may notice that on days when people buy more ice-cream, they also buy more sun-screen. Sometimes people incorrectly assume that such an observation implies causation—in this case, that eating ice-cream causes people to want sunscreen. When a correlation is observed between some event x (buying ice-cream) and event y (buying sunscreen), there are three main possibilities:
x causes y
y causes x
something else causes both x and y (possibly indirectly)
pure chance (we can measure the probability of this happening - in this study it is vanishingly low)
In this case, of course, a warm and sunny day causes both the desire for ice-cream, and the need for sunscreen.
Much of the social sciences deals with this issue. Researchers in these fields often have to try to reach conclusions from observational studies in the presence of many confounding factors. This is a complex and challenging task, and often results in imperfect results. For mathematicians and computer scientists, results from the social sciences can seem infuriatingly poorly founded. In math, if you want to claim that, for example, no three positive integers a, b, and c satisfy the equation an + bn = cn for any integer value of n greater than 2, then it doesn’t matter if you try millions of values of a, b, and c and show none have this relationship—you have to prove it for all possible integers. But in the social sciences this kind of result is generally not possible. So we must try to weigh the balance of evidence versus our a priori expectations regarding the results.
The Stanford paper tries to separate out correlation from causation by using various studies, as discussed above. And in the end, they don’t do a great job of it. But the simple claim that “correlation is not causation” is a sloppy response. Instead, alternative theories need to be provided, preferably with evidence: that is, can you make a claim that y causes x, or that something else causes both x and y, and show that your alternative theory is supported by the research shown in the paper?
In addition, we need to consider the simple question: does it actually matter? E.g. if it is possible for a government (or an over-zealous marketer) to classify a photo of a face by sexual orientation, mightn’t this be an important result regardless of whether the cause of the identified differences are grooming, facial expression, or facial structure?
Should we worry about privacy?
The paper concludes with a warning that governments are already using sophisticated technology to infer intimate traits of citizens, and that it is only through research like this that we can guess what kind of capabilities they have. They state:
Delaying or abandoning the publication of these findings could deprive individuals of the chance to take preventive measures and policymakers the ability to introduce legislation to protect people. Moreover, this work does not offer any advantage to those who may be developing or deploying classification algorithms, apart from emphasizing the ethical implications of their work. We used widely available off-the-shelf tools, publicly available data, and methods well known to computer vision practitioners. We did not create a privacy-invading tool, but rather showed that basic and widely used methods pose serious privacy threats.
These are genuine concerns and it is clearly a good thing for us all to understand the kinds of tools that could be used to reduce privacy. It is a shame that the overstated claims, weak cross-disciplinary research, and methodological problems clouded this important issue.