An all-too-common scenario: a seemingly impressive machine learning model is a complete failure when implemented in production. The fallout includes leaders who are now skeptical of machine learning and reluctant to try it again. How can this happen?
One of the most likely culprits for this disconnect between results in development vs results in production is a poorly chosen validation set (or even worse, no validation set at all). Depending on the nature of your data, choosing a validation set can be the most important step. Although sklearn offers a train_test_split method, this method takes a random subset of the data, which is a poor choice for many real-world problems.
The definitions of training, validation, and test sets can be fairly nuanced, and the terms are sometimes inconsistently used. In the deep learning community, “test-time inference” is often used to refer to evaluating on data in production, which is not the technical definition of a test set. As mentioned above, sklearn has a train_test_split method, but no train_validation_test_split. Kaggle only provides training and test sets, yet to do well, you will need to split their training set into your own validation and training sets. Also, it turns out that Kaggle’s test set is actually sub-divided into two sets. It’s no suprise that many beginners may be confused! I will address these subtleties below.
First, what is a “validation set”?
When creating a machine learning model, the ultimate goal is for it to be accurate on new data, not just the data you are using to build it. Consider the below example of 3 different models for a set of data:
The error for the pictured data points is lowest for the model on the far right (the blue curve passes through the red points almost perfectly), yet it’s not the best choice. Why is that? If you were to gather some new data points, they most likely would not be on that curve in the graph on the right, but would be closer to the curve in the middle graph.
The underlying idea is that:
the training set is used to train a given model
the validation set is used to choose between models (for instance, does a random forest or a neural net work better for your problem? do you want a random forest with 40 trees or 50 trees?)
the test set tells you how you’ve done. If you’ve tried out a lot of different models, you may get one that does well on your validation set just by chance, and having a test set helps make sure that is not the case.
A key property of the validation and test sets is that they must be representative of the new data you will see in the future. This may sound like an impossible order! By definition, you haven’t seen this data yet. But there are still a few things you know about it.
When is a random subset not good enough?
It’s instructive to look at a few examples. Although many of these examples come from Kaggle competitions, they are representative of problems you would see in the workplace.
If your data is a time series, choosing a random subset of the data will be both too easy (you can look at the data both before and after the dates your are trying to predict) and not representative of most business use cases (where you are using historical data to build a model for use in the future). If your data includes the date and you are building a model to use in the future, you will want to choose a continuous section with the latest dates as your validation set (for instance, the last two weeks or last month of the available data).
Suppose you want to split the time series data below into training and validation sets:
A random subset is a poor choice (too easy to fill in the gaps, and not indicative of what you’ll need in production):
Use the earlier data as your training set (and the later data for the validation set):
Kaggle currently has a competition to predict the sales in a chain of Ecuadorian grocery stores. Kaggle’s “training data” runs from Jan 1 2013 to Aug 15 2017 and the test data spans Aug 16 2017 to Aug 31 2017. A good approach would be to use Aug 1 to Aug 15 2017 as your validation set, and all the earlier data as your training set.
New people, new boats, new…
You also need to think about what ways the data you will be making predictions for in production may be qualitatively different from the data you have to train your model with.
In the Kaggle distracted driver competition, the independent data are pictures of drivers at the wheel of a car, and the dependent variable is a category such as texting, eating, or safely looking ahead. If you were the insurance company building a model from this data, note that you would be most interested in how the model performs on drivers you haven’t seen before (since you would likely have training data only for a small group of people). This is true of the Kaggle competition as well: the test data consists of people that weren’t used in the training set.
If you put one of the above images in your training set and one in the validation set, your model will seem to be performing better than it would on new people. Another perspective is that if you used all the people in training your model, your model may be overfitting to particularities of those specific people, and not just learning the states (texting, eating, etc).
A similar dynamic was at work in the Kaggle fisheries competition to identify the species of fish caught by fishing boats in order to reduce illegal fishing of endangered populations. The test set consisted of boats that didn’t appear in the training data. This means that you’d want your validation set to include boats that are not in the training set.
Sometimes it may not be clear how your test data will differ. For instance, for a problem using satellite imagery, you’d need to gather more information on whether the training set just contained certain geographic locations, or if it came from geographically scattered data.
The dangers of cross-validation
The reason that sklearn doesn’t have a train_validation_test split is that it is assumed you will often be using cross-validation, in which different subsets of the training set serve as the validation set. For example, for a 3-fold cross validation, the data is divided into 3 sets: A, B, and C. A model is first trained on A and B combined as the training set, and evaluated on the validation set C. Next, a model is trained on A and C combined as the training set, and evaluated on validation set B. And so on, with the model performance from the 3 folds being averaged in the end.
However, the problem with cross-validation is that it is rarely applicable to real world problems, for all the reasons describedin the above sections. Cross-validation only works in the same cases where you can randomly shuffle your data to choose a validation set.
Kaggle’s “training set” = your training + validation sets
One great thing about Kaggle competitions is that they force you to think about validation sets more rigorously (in order to do well). For those who are new to Kaggle, it is a platform that hosts machine learning competitions. Kaggle typically breaks the data into two sets you can download:
a training set, which includes the independent variables, as well as the dependent variable (what you are trying to predict). For the example of an Ecuadorian grocery store trying to predict sales, the independent variables include the store id, item id, and date; the dependent variable is the number sold. For the example of trying to determine whether a driver is engaging in dangerous behaviors behind the wheel, the independent variable could be a picture of the driver, and the dependent variable is a category (such as texting, eating, or safely looking forward).
a test set, which just has the independent variables. You will make predictions for the test set, which you can submit to Kaggle and get back a score of how well you did.
This is the basic idea needed to get started with machine learning, but to do well, there is a bit more complexity to understand. You will want to create your own training and validation sets (by splitting the Kaggle “training” data). You will just use your smaller training set (a subset of Kaggle’s training data) for building your model, and you can evaluate it on your validation set (also a subset of Kaggle’s training data) before you submit to Kaggle.
The most important reason for this is that Kaggle has split the test data into two sets: for the public and private leaderboards. The score you see on the public leaderboard is just for a subset of your predictions (and you don’t know which subset!). How your predictions fare on the private leaderboard won’t be revealed until the end of the competition. The reason this is important is that you could end up overfitting to the public leaderboard and you wouldn’t realize it until the very end when you did poorly on the private leaderboard. Using a good validation set can prevent this. You can check if your validation set is any good by seeing if your model has similar scores on it to compared with on the Kaggle test set.
Another reason it’s important to create your own validation set is that Kaggle limits you to two submissions per day, and you will likely want to experiment more than that. Thirdly, it can be instructive to see exactly what you’re getting wrong on the validation set, and Kaggle doesn’t tell you the right answers for the test set or even which data points you’re getting wrong, just your overall score.
Understanding these distinctions is not just useful for Kaggle. In any predictive machine learning project, you want your model to be able to perform well on new data.
What is the ethical responsibility of data scientists?
What we’re talking about is a cataclysmic change… What we’re talking about is a major foreign power with sophistication and ability to involve themselves in a presidential election and sow conflict and discontent all over this country… You bear this responsibility. You’ve created these platforms. And now they are being misused, Senator Feinstein said this week in a senate hearing. Who has created a cataclysmic change? Who bears this large responsibility? She was talking to executives at tech companies and referring to the work of data scientists.
As we data scientists sit behind computer screens coding, we may not give much thought to the people whose lives may be changed by our algorithms. However, we have a moral responsiblity to our world and to those whose lives will be impacted by our work. Technology is inherently about humans, and it is perilous to ignore human psychology, sociology, and history while creating tech. Even aside from our ethical responsibility, you could serve time in prison for the code you write, like the Volkswagon engineer who was sentenced to 3.5 years in prison for helping develop software to cheat on federal emissions tests. This is what his employer asked him to do, but following your boss’s orders doesn’t absolve you of responsibility and is not an excuse that will protect you in court.
As a data scientist, you may not have too much say in product decisions, but you can ask questions and raise issues. While it can be uncomfortable to stand up for what is right, you are in a fortunate position as part of only 0.3-0.5% of the global population who knows how to code. With this knowledge comes a responsibility to use it for good. There are many reasons why you may feel trapped in your job (needing a visa, supporting a family, being new to the industry); however, I have found that people in unethical or toxic work environments (my past self included) consistently underestimate their options. If you find yourself in an unethical environment, please at least attempt applying for other jobs. The demand for data scientists is high and if you are currently working as a data scientist, there are most likely other companies that would like to hire you.
One thing we should all be doing is thinking about how bad actors could misuse our technology. Here are a few key areas to consider:
How could trolls use your service to harass vulnerable people?
How could your work be used to spread harmful misinformation or propaganda?
What safeguards could be put in place to mitigate the above?
Data Science Impacts the World
The consequences of algorithms can be not only dangerous, but even deadly. Facebook is currently being used to spread dehumanizing misinformation about the Rohingya, an ethnic minority in Myanmar. As described above, over half a million Rohinyga have been driven from their homes due to systematic murder, rape, and burning. For many in Myanmar, Facebook is their only news source. As quoted in the New York Times, one local official of a village with numerous restrictions prohibiting Muslims (the Rohingya are Muslim while the majority of the country is Buddhist) admits that he has never met a Muslim, but says [they] are not welcome here because they are violent and they multiply like crazy with so many wives and children.I have to thank Facebook, because it is giving me the true information in Myanmar.
Abe Gong, CEO of Superconductive Health, discusses a criminal recidivism algorithm used in U.S. courtrooms that included data about whether a person’s parents separated and if their father had ever been arrested. To be clear, this means that people’s prisons sentences were longer or shorter depending on things their parents had done. Even if this increased the accuracy of the model, it is unethical to include this information, as it is completely beyond the control of the defendants. This is an example of why data scientists shouldn’t just unthinkingly optimize for a simple metric, but that we must also think about what type of society we want to live in.
Runaway Feedback Loops
Evan Estola, lead machine learning engineer at Meetup, discussed the example of men expressing more interest than women in tech meetups. Meetup’s algorithm could recommend fewer tech meetups to women, and as a result, fewer women would find out about and attend tech meetups, which could cause the algorithm to suggest even fewer tech meetups to women, and so on in a self-reinforcing feedback loop. Evan and his team made the ethical decision for their recommendation algorithm to not create such a feedback loop. It is encouraging to see a company not just unthinkingly optimize a metric, but to consider their impact.
While Meetup chose to avoid such an outcome, Facebook provides an example of allowing a runaway feedback loop to run wild. Facebook radicalizes users interested in one conspiracy theory by introducing them to more. As Renee DiResta, a researcher on proliferation of disinformation, writes, once people join a single conspiracy-minded [Facebook] group, they are algorithmically routed to a plethora of others. Join an anti-vaccine group, and your suggestions will include anti-GMO, chemtrail watch, flat Earther (yes, really), and ‘curing cancer naturally’ groups. Rather than pulling a user out of the rabbit hole, the recommendation engine pushes them further in.
Yet another example is a predictive policing algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crime being recorded in those neighborhoods, and so on. Computer science research on Runaway Feedback Loops in Predictive Policing illustrates how this phenomenon arises and how it can be prevented.
Myths: “This is a neutral platform”, “How users use my tech isn’t my fault”, “Algorithms are impartial”
As someone outside the tech industry but who sees a lot of brand new tech, actor Kumail Nanjiani of the show Silicon Valley provides a helpful perspective. He recently tweeted that he and other cast members are often shown tech that scares them with its potential for misuse. Nanjiani writes, And we’ll bring up our concerns to them. We are realizing that ZERO consideration seems to be given to the ethical implications of tech. They don’t even have a pat rehearsed answer. They are shocked at being asked. Which means nobody is asking those questions. “We’re not making it for that reason but the way ppl choose to use it isn’t our fault. Safeguards will develop.” But tech is moving so fast. That there is no way humanity or laws can keep up… Only “Can we do this?” Never “should we do this?”
A common defense in response to calls for stronger ethics or accountability is for technologists such as Mark Zuckerberg to say that they are building neutral platforms. This defense doesn’t hold up, because any technology requires a number of decisions to be made. In the case of Facebook, decisions such as what to prioritize in the newsfeed, what metrics (such as ad revenue) to optimize for, what tools and filters to make available to advertisers vs users, and the firing of human editors have all influenced the product (as well as the political situation of many countries). Sociology professor Zeynep Tufecki argued in the New York Times that Facebook selling ads targeted to “Jew haters” was not a one-off failure, but rather an unsurprising outcome from how the platform is structured.
Others claim that they can not act to curb online harassment or hate speech as that would contradict the principle of free speech. Anil Dash, CEO of Fog Creek software, writes, “the net effect of online abuse is to silence members of [under-represented] communities. Allowing abuse hurts free speech. Communities that allow abusers to dominate conversation don’t just silence marginalized people, they also drive away any reasonable or thoughtful person who’s put off by that hostile environment.” All tech companies are making decisions about who to include their communities, whether it is through action or implicitly through inaction. Valerie Aurora debunks similar arguments in a post on the paradox of tolerance explaining how free speech can be reduced overall when certain groups are silenced and intimidated. Choosing not to take action about abuse and harassment is still a decision, and it’s a decision that will have a large influence on who uses your platform.
Some data scientists may see themselves as impartially analyzing data. However, as iRobot director of data science Angela Bassett said, “It’s not that data can be biased. Data is biased.” Know how your data was generated and what biases it may contain. We are encoding and even amplifying societal biases in the algorithms we create. In a recent interview with Wired, Kate Crawford, co-founder of the AI Now Institute and principal researcher at Microsoft, explains that data is not neutral, data can not be neutralized, and “data will always bear the marks of its history.” We need to understand that history and what it means for the systems we build.
These biased outcomes arise for a number of reasons, including biased data sets and lack of diversity in the teams building the products. Using a held-out test set and avoiding overfitting is not just good practice, but also an ethical imperative. Overfitting often means that the error rates are higher on types of data that are not well-represented in the training set, quite literally under-represented or minority data.
I’ve done extensive research on retaining women at your company and on bias in interviews, including practical tips to address both. Stripe Engineer Julia Evans thought she could do a better job at conducting phone interviews, so she created a rubric for evaluating candidates for herself, which was eventually adopted as a company-wide standard. She wrote an excellent post about Making Small Culture Changes that should be helpful regardless of what role you are in.
Systemic & Regulatory Response
This blog post is written with an audience of individual data scientists in mind, but systemic and regulatory responses are necessary as well. Renee DiResta draws an analogy between the advent of high frequency trading in the financial markets and the rise of bots and misinformation campaigns on social networks. She argues that just as regulations were needed for the financial markets to combat increasing fragility and bad actors, regulations are needed for social networks to combat increasing fragility and bad actors. Kate Crawford points out that there is a large gap between proposed ethical guidelines and what is happening in practice, because we don’t have accountability mechanisms in place.
The topic of ethics in data science is too huge and complicated to be thoroughly covered in a single blog post. I encourage you to do more reading on this topic and to discuss it with your co-workers and peers. Here are some resources to learn more:
My post on Facebook’s lackluster response to its role in the genocide in Myanmar (especially compared to its response to a potential financial penalty in Germany), what concrete regulations could help us, the continued push of Facebook lobbyists to gut the few privacy laws we have, and the role that Free Basics (aka Internet.org) has played in global hate speech and violence.
There is a lot of misleading and even false information about AI out there, ranging from apallingly bad journalism to overhyped marketing materials to quotes from misinformed celebrities. Last month, it even got so bad that Snopes had to debunk a story about Facebook research that was inaccurately covered by a number of outlets.
AI is a complex topic moving at an overwhelming pace (even as someone working in the field, I find it impossible to keep up with everything that is happening). Beyond that, there are those who stand to profit off overhyping advances or drumming up fear.
I want to recommend several credible sources of accurate information. Most of the writing on this list is intended to be accessible to anyone—even if you aren’t a programmer or don’t work in tech:
Jack Clark’s email newsletter, Import AI, provides highlights and summaries of a selection of AI news and research from the previous week. You can check out previous issues (or sign up) here. Jack Clark is Strategy & Communications Director at OpenAI.
Mariya Yao’s writing on Topbots and on Forbes. Mariya is CTO and head of R&D for Topbots, a strategy and research firm for applied artificial intelligence and machine learning. Fun fact: Mariya worked on the LIDAR system for the 2nd place winner in the DARPA grand challenge for autonomous vehicles.
Zeynep Tufekci, a professor at UNC-Chapel Hill, is an expert on the interactions between technology and society. She shares a lot of important ideas on twitter, or read her New York Times op-eds here.
Kate Crawford is a professor at NYU, principal researcher at Microsoft, and co-founder of the AI Now Research institute, dedicated to studying the social impacts of AI. You can follow her on twitter here.
I also want to highlight a few great examples of AI researchers thoughtfully deconstructing the hype around some high-profile stories in the past few months, in an accessible way:
Denny Britz provides a balanced perspective on OpenAIs bot for Defense of the Ancients (a popular computer game).
Twitter is quite useful for keeping up on machine learning news and many people share surprisingly deep insights (that I often can’t find elsewhere). I was skeptical of Twitter before I started using it. The whole idea seemed weird: you can only write a 140 characters at a time? I already had Facebook and Linkedin, did I really need another social media account? It now occupies a useful and distinct niche for me. The hardest part is getting started; feel free to take a look at my twitter or Jeremy’s favorites to look for interesting accounts. Whenever I read an article I like or hear a talk I like, I always look up the author/speaker on twitter and see if I find their tweets interesting. If so, I follow them.
I will update this post based on constructive feedback as I receive it (with attribution)—although I’ve tried to largely stick to my area of specialty (data science) I’ve had to touch on various areas that I’m not an expert in, so please do let me know if you notice any issues. You can reach me on Twitter at @jeremyphoward.
When I first read about this study, I had a strong negative emotional response. The topic is of great personal interest to me—Rachel Thomas and I started fast.ai explicitly for the purpose of increasing diversity in the field of deep learning (including the deep neural networks used in this study), and we even personally pay for scholarships for diverse students, including LGBTQ students. In addition, we want to support the use of deep learning in a wider range of fields, because we believe that it can both positively and negatively impact many people’s lives, so we want to show how to use the technology appropriately and correctly.
Like many commentators, I had many basic concerns about this study. Should it have been done at all? Was the data collection an invasion of privacy? Were the right people involved in the work? Were the results communicated in a thoughtful and sensitive way? These are important issues, and can’t be answered by any single individual. Because deep learning is making it possible for computers to do things that weren’t possible before, we’re going to see more and more areas where these questions are going to arise. Therefore, we need to see more cross-disciplinary studies being done by more cross-disciplinary teams. In this case, the researchers are data scientists and psychologists, but the paper covers topics (and claims to reach conclusions) in fields from sociology to biology.
So, what does the paper actually show - can neural nets do what is claimed, or not? We will analyze this question as a data scientist - by looking at the data.
The key conclusions of both the paper (“Deep neural networks can detect sexual orientation from faces”) and the response (“AI can’t tell if you’re gay”) are not supported by the research shown. What is supported is a weaker claim: in some situations deep neural networks can recognize some photos of gay dating website users from some photos of heterosexual dating website users. We definitely can’t say “AI can’t tell if you’re gay”, and indeed to make this claim is irresponsible: the paper at least shows some sign that the opposite might well be true, and that such technology is readily available and easily used by any government or organization.
The senior researcher on this paper, Michael Kosinski, has successfully warned us of similar problems in the past: his paper Private traits and attributes are predictable from digital records of human behavior is one of the most cited of all time, and was at least partly responsible for getting Facebook to change their policy of having ‘likes’ be public by default. If the key result in this new study does turn out to be correct, then we should certainly be having a discussion about what policy implications it has. If you live in a country where homosexuality is punishable by death, then you need to be open to the possibility that you could be profiled for extra surveillance based on your social media pictures. If you are in a situation where you can’t be open about your sexual preference, you should be aware that a machine learning recommendation system could (perhaps even accidentally) target you with merchandise targeted to a gay demographic.
However, the paper comes to many other conclusions that are not directly related to this key question, are not clearly supported by the research, and are overstated and poorly communicated. In particular, the paper claims that the research supports the “widely accepted” prenatal hormone theory (PHT) that “same-gender sexual orientation stems from the under-exposure of male fetuses or overexposure of female fetuses to androgens that are responsible for sexual differentiation”. The support for this in the paper is far from rigorous, and should be considered inconclusive. Furthermore, the sociologist Greggor Mattson says that not only is the theory not widely accepted, but that “literally the first sentence of a decade-old review of the field is ‘Public perceptions of the effect of testosterone on ‘manly’ behavior are inaccurate’”.
How was the research was done?
A number of studies are presented in the paper, but the key one is ‘study 1a’. In this study, the researchers downloaded on average 5 images of 70,000 people from a dating website. None of the data collected in the study has been made available, although nearly any programmer could easily replicate this (and indeed many coders have created similar datasets in the past). Because the study’s focus is on detecting sexual orientation from faces, they cropped the photos to the area of the face. They also removed photos with multiple people, or where the face wasn’t clear, or wasn’t looking straight at the camera. The technical approach here is very standard and reliable, using widely used open source software called Face++.
They then removed any images that a group of non-expert workers considered either not adult, or not caucasian (using Amazon’s Mechanical Turk system). It wasn’t entirely clear why they did this; most likely they were assuming that handling more types of face would make it harder to train their model.
It’s important to be aware that steps like this used to ‘clean’ a dataset are necessary for nearly all data science projects, but they are rarely if ever perfect - and those imperfections are generally not important in understanding the accuracy of a study. What’s important for evaluation is to be confident that the final metrics reported are evaluated appropriately. More on this shortly…
They labeled each as gay or not based on each dating profile’s listed sexual preference.
The researchers then used a deep neural network (VGG-Face) to create features. Specifically, each image was turned into 4096 numbers, each of which had been trained by University of Oxford researchers to be as good as possible for recognizing humans from their faces. They compressed those 4096 numbers down 500 using a simple statistical technique called SVD, and they then used a simple regression model to map these 500 numbers to the label (gay or not).
They repeated the regression 10 times. Each time they used a different 90% subset of the data, and tested the model using the remaining 10% (this is known as cross-validation). The ten models were scored using a metric called AUC which is a standard approach for evaluating classification models like this one. The AUC for people marked as males in the dataset was 0.91.
How accurate is this model?
The researchers describe their model as “91% accurate”. This is based on the AUC score of 0.91. However, it is very unusual and quite misleading to use the word “accuracy” to describe AUC. The researchers have clarified that the actual accuracy of the model can be understood as follows: if you pick the 10% of people in the study with the highest scores on the model, about half will actually be gay based on the collected labels. If the actual percentage of gay males is 7%, then this shows that the model is a lot better than random. However it is probably not as accurate as most people would imagine if they heard something was “91% accurate”.
It is also important to note that based on this study (study 1a) we can only really say that the model can recognize gay dating profiles from one web-site of people that non-experts label as adult caucasian, not that it can recognize gay photos in general. It’s quite likely that this model will generalize to other similar populations, but we don’t know from this research how similar those populations would need to be, and how accurate it will be.
Have the researchers created a new technology here?
The approach used in this study is literally the very first technique that we teach in our introductory deep learning course. Our course does not require an advanced math background - only high school math is need. So the approach used here is literally something anyone can do with high school math, an hour of free online study, and a basic knowledge of programming.
A model trained in this way takes under 20 seconds to run on a commodity server that can be rented for $0.90/hour. So it does not require any special or expensive resources. The data can be easily downloaded from dating websites by anyone with basic coding skills.
The researchers say that their study shows a potential privacy problem. Since the technology they used is very accessible, then if you believe that the capabilities shown are of concern, then this claim seems reasonable.
It is probably reasonably to assume that many organizations have already completed similar projects, but without publishing them in the academic literature. This paper showing what can already be easily done—it is not creating a new technology. It is becoming increasingly common for marketers to use social media data to help push their products; in these cases the models simply look for correlations between product sales and and social media data that is available. In this case it would be very easy for a model in implicitly find a relationship between certain photos and products targeted to a gay market, without the developers even realizing it had made that connection. Indeed, we have seen somewhat similar issues before such as that described in the article How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did.
Did the model show that gay faces are physically different?
In study 1b, the researchers cover up different parts of each image, to see which parts when covered up cause the prediction to change. This is a common technique for understanding the relative importance of different parts of an input to a neural network.
The results this analysis are shown in this picture from the paper:
The red areas are relatively more important to the model than the blue areas. However, this analysis does not show how much more important it is, or why or in what way the red areas were more important.
In study 1c they try to create an “average face” for each of male and female and for each of gay and heterosexual. This part of the study has no rigorous analysis and relies entirely on an intuitive view of the images shown. From a data science point of view, there is no additional information that can be gained from this section.
The researchers claim that these studies show support for the prenatal hormone theory. However, no data is presented that shows how this theory is supported or what level of support is provided, nor investigates possible alternative theories for the observations.
Is the model more accurate than humans?
The researchers claim in the first sentence of the abstract that “faces contain much more information about sexual orientation than can be perceived and interpreted by the human brain”. They base this claim on Study 4, in which they ask humans to classify images from the same dataset as study 1a. However, this study completely fails to provide an adequate methodology to support the claim. Stanford researcher Andrej Karpathy (now at Tesla) showed a fairly rigorous approach to how human image classification can be compared to a neural net. The key piece is to give the human the same opportunity to study the training data as the computer received. In this case, that would mean letting each human judge study many examples of the faces and labels collected in the dataset, before being asked to classify faces themselves.
By failing to provide this “human training” step, the humans and computers had very different information with which to complete the task. Even if the methodology was better, there would still be many possible explanations other than the very strong and unsupported claim that they decided to open their paper with.
As a rule, academic claims should be made with care and rigor and communicated thoughtfully. Especially when it opens a paper. Especially especially when it is in such a sensitive area. Double especially especially when it covers an area outside the researchers’ area of specialty. This issue is thoughtfully discussed with relation to this paper in a Calling Bullshit case study.
Is the classifier effective for images other than dating pics from one website?
In short: we don’t know. Study 5 in the paper asserts that it is, but it does not provide strong support for this claim, and is set up in a oddly convoluted way. The method used for study 5 was to find facebook pics from some extra-super-gay facebook users: people who listed a same-sex partner, and liked at least two pages such as “Manhunt” and “I love being gay”. It then tried to see if they could train a classifier to separate these pics from heterosexual dating web-site users. The claimed accuracy of this classifier was 74%, although the exact meaning of this 74% figure is not listed. If it means an AUC of 0.74 (which is how the researchers referred to AUC earlier in the paper), this is not a strong result. It’s also comparing across datasets (facebook vs dating website), and using a very particular type of facebook profile to do their test.
The researchers state that they didn’t compare to heterosexual profile pics because they didn’t know how to find them.
Are their conclusions supported by their studies?
In the General Discussion section the researchers come to a number of conclusions. All of the conclusions as stated are stronger than what can be concluded from the research results shown. However, we can at least say (assuming that their data analysis was completed correctly, which we can’t confirm since we don’t have access to their data or code) that sexual preference of people in some photos can be identified much better than randomly in some situations.
They conclude that their model does not simply find differences in presentation between the two groups, but actually shows differences in underlying facial structure. This claim is based partly on an assertion that the VGG-Face model they use is trained to identify non-transient facial features. However, some simple data analysis readily shows that this assertion is incorrect. Victoria University researcher Tom White shared an analysis that showed this exact model, for instance, can recognize happy from neutral faces with a higher AUC (0.92) than the model shown in this paper (and can recognize happy from sad faces even better, with an AUC of 0.96).
Did the paper confuse correlation with causation?
Any time that a paper from social scientists comes to the attention of groups of programmers (such as when shared on coding forums), inevitably we hear the cry “correlation is not causation”. This happened for this paper too. What does this mean, and is it a problem here? Naturally, XKCD has us covered:
Correlation refers to an observation that two things happen at the same time. For instance, you may notice that on days when people buy more ice-cream, they also buy more sun-screen. Sometimes people incorrectly assume that such an observation implies causation—in this case, that eating ice-cream causes people to want sunscreen. When a correlation is observed between some event x (buying ice-cream) and event y (buying sunscreen), there are three main possibilities:
x causes y
y causes x
something else causes both x and y (possibly indirectly)
pure chance (we can measure the probability of this happening - in this study it is vanishingly low)
In this case, of course, a warm and sunny day causes both the desire for ice-cream, and the need for sunscreen.
Much of the social sciences deals with this issue. Researchers in these fields often have to try to reach conclusions from observational studies in the presence of many confounding factors. This is a complex and challenging task, and often results in imperfect results. For mathematicians and computer scientists, results from the social sciences can seem infuriatingly poorly founded. In math, if you want to claim that, for example, no three positive integers a, b, and c satisfy the equation an + bn = cn for any integer value of n greater than 2, then it doesn’t matter if you try millions of values of a, b, and c and show none have this relationship—you have to prove it for all possible integers. But in the social sciences this kind of result is generally not possible. So we must try to weigh the balance of evidence versus our a priori expectations regarding the results.
The Stanford paper tries to separate out correlation from causation by using various studies, as discussed above. And in the end, they don’t do a great job of it. But the simple claim that “correlation is not causation” is a sloppy response. Instead, alternative theories need to be provided, preferably with evidence: that is, can you make a claim that y causes x, or that something else causes both x and y, and show that your alternative theory is supported by the research shown in the paper?
In addition, we need to consider the simple question: does it actually matter? E.g. if it is possible for a government (or an over-zealous marketer) to classify a photo of a face by sexual orientation, mightn’t this be an important result regardless of whether the cause of the identified differences are grooming, facial expression, or facial structure?
Should we worry about privacy?
The paper concludes with a warning that governments are already using sophisticated technology to infer intimate traits of citizens, and that it is only through research like this that we can guess what kind of capabilities they have. They state:
Delaying or abandoning the publication of these findings could deprive individuals of the chance to take preventive measures and policymakers the ability to introduce legislation to protect people. Moreover, this work does not offer any advantage to those who may be developing or deploying classification algorithms, apart from emphasizing the ethical implications of their work. We used widely available off-the-shelf tools, publicly available data, and methods well known to computer vision practitioners. We did not create a privacy-invading tool, but rather showed that basic and widely used methods pose serious privacy threats.
These are genuine concerns and it is clearly a good thing for us all to understand the kinds of tools that could be used to reduce privacy. It is a shame that the overstated claims, weak cross-disciplinary research, and methodological problems clouded this important issue.
The next fast.ai courses will be based nearly entirely on a new framework we have developed, built on Pytorch. Pytorch is a different kind of deep learning library (dynamic, rather than static), which has been adopted by many (if not most) of the researchers that we most respect, and in a recent Kaggle competition was used by nearly all of the top 10 finishers.
We have spent around a thousand hours this year working with Pytorch to get to this point, and we are very excited about what it is allowing us to do. We will be writing a number of articles in the coming weeks talking about each aspect of this. First, we will start with a quick summary of the background to, and implications of, this decision. Perhaps the best summary, however, is this snippet from the start of our first lesson:
fast.ai’s teaching goal
Our goal at fast.ai is for there to be nothing to teach. We believe that the fact that we currently require high school math, one year of coding experience, and seven weeks of study to become a world-class deep learning practitioner, is not an acceptable state of affairs (even although this is less prerequisites for any other course of a similar level). Everybody should be able to use deep learning to solve their problems with no more education than it takes to use a smart phone. Therefore, each year our main research goal is to be able to teach a wider range of deep learning applications, that run faster, and are more accurate, to people with less prerequisites.
We want our students to be able to solve their most challenging and important problems, to transform their industries and organisations, which we believe is the potential of deep learning. We are not just trying to teach people how to get existing jobs in the field — but to go far beyond that.
Therefore, since we first ran our deep learning course, we have been constantly curating best practices, and benchmarking and developing many techniques, trialling them against Kaggle leaderboards and academic state-of-the-art results.
Why we tried Pytorch
As we developed our second course, Cutting-Edge Deep Learning for Coders, we started to hit the limits of the libraries we had chosen: Keras and Tensorflow. For example, perhaps the most important technique in natural language processing today is the use of attentional models. We discovered that there was no effective implementation of attentional models for Keras at the time, and the Tensorflow implementations were not documented, rapidly changing, and unnecessarily complex. We ended up writing our own in Keras, which turned out to take a long time, and be very hard to debug. We then turned our attention to implementing dynamic teacher forcing, for which we could find no implementation in either Keras or Tensorflow, but is a critical technique for accurate neural translation models. Again, we tried to write our own, but this time we just weren’t able to make anything work.
At that point the first pre-release of Pytorch had just been released. The promise of Pytorch was that it was built as a dynamic, rather than static computation graph, framework (more on this in a later post). Dynamic frameworks, it was claimed, would allow us to write regular Python code, and use regular python debugging, to develop our neural network logic. The claims, it turned out, were totally accurate. We had implemented attentional models and dynamic teacher forcing from scratch in Pytorch within a few hours of first using it.
Some pytorch benefits for us and our students
The focus of our second course is to allow students to be able to read and implement recent research papers. This is important because the range of deep learning applications studied so far has been extremely limited, in a few areas that the academic community happens to be interested in. Therefore, solving many real-world problems with deep learning requires an understanding of the underlying techniques in depth, and the ability to implement customised versions of them appropriate for your particular problem, and data. Because Pytorch allowed us, and our students, to use all of the flexibility and capability of regular python code to build and train neural networks, we were able to tackle a much wider range of problems.
An additional benefit of Pytorch is that it allowed us to give our students a much more in-depth understanding of what was going on in each algorithm that we covered. With a static computation graph library like Tensorflow, once you have declaratively expressed your computation, you send it off to the GPU where it gets handled like a black box. But with a dynamic approach, you can fully dive into every level of the computation, and see exactly what is going on. We believe that the best way to learn deep learning is through coding and experiments, so the dynamic approach is exactly what we need for our students.
Much to our surprise, we also found that many models trained quite a lot faster on pytorch than they had on Tensorflow. This was quite against the prevailing wisdom, that said that static computation graphs should allow for more optimization to be done, which should have resulted in higher performance in Tensorflow. In practice, we’re seeing some models are a bit faster, some a bit slower, and things change in this respect every month. The key issues seem to be that:
Improved developer productivity and debugging experience in Pytorch can lead to more rapid development iterations, and therefore better implementations
Smaller, more focussed development community in Pytorch looks for “big wins” rather than investing in micro-optimization of every function.
Why we built a new framework on top of Pytorch
Unfortunately, Pytorch was a long way from being a good option for part one of the course, which is designed to be accessible to people with no machine learning background. It did not have anything like the clear simple API of Keras for training models. Every project required dozens of lines of code just to implement the basics of training a neural network. Unlike Keras, where the defaults are thoughtfully chosen to be as useful as possible, Pytorch required everything to be specified in detail. However, we also realised that Keras could be even better. We noticed that we kept on making the same mistakes in Keras, such as failing to shuffle our data when we needed to, or vice versa. Also, many recent best practices were not being incorporated into Keras, particularly in the rapidly developing field of natural language processing. We wondered if we could build something that could be even better than Keras for rapidly training world-class deep learning models.
After a lot of research and development it turned out that the answer was yes, we could (in our biased opinion). We built models that are faster, more accurate, and more complex than those using Keras, yet were written with much less code. We’ve implemented recent papers that allow much more reliable training of more accurate models, across a number of fields.
The key was to create an OO class which encapsulated all of the important data choices (such as preprocessing, augmentation, test, training, and validation sets, multiclass versus single class classification versus regression, et cetera) along with the choice of model architecture. Once we did that, we were able to largely automatically figure out the best architecture, preprocessing, and training parameters for that model, for that data. Suddenly, we were dramatically more productive, and made far less errors, because everything that could be automated, was automated. But we also provided the ability to customise every stage, so we could easily experiment with different approaches.
With the increased productivity this enabled, we were able to try far more techniques, and in the process we discovered a number of current standard practices that are actually extremely poor approaches. For example, we found that the combination of batch normalisation (which nearly all modern CNN architectures use) and model pretraining and fine-tuning (which you should use in every project if possible) can result in a 500% decrease in accuracy using standard training approaches. (We will be discussing this issue in-depth in a future post.) The results of this research are being incorporated directly into our framework.
There will be a limited release for our in person students at USF first, at the end of October, and a public release towards the end of the year. (By which time we’ll need to pick a name! Suggestions welcome…) (If you want to join the in-person course, there’s still room in the International Fellowship program.)
What should you be learning?
If it feels like new deep learning libraries are appearing at a rapid pace nowadays, then you need to be prepared for a much faster rate of change in the coming months and years. As more people enter the field, they will bring more skills and ideas, and try more things. You should assume that whatever specific libraries and software you learn today will be obsolete in a year or two. Just think about the number of changes of libraries and technology stacks that occur all the time in the world of web programming — and yet this is a much more mature and slow-growing area than deep learning. So we strongly believe that the focus in learning needs to be on understanding the underlying techniques and how to apply them in practice, and how to quickly build expertise in new tools and techniques as they are released.
By the end of the course, you’ll understand nearly all of the code that’s inside the framework, because each lesson we’ll be digging a level deeper to understand exactly what’s going on as we build and train our models. This means that you’ll have learnt the most important best practices used in modern deep learning—not just how to use them, but how they really work and are implemented. If you want to use those approaches in another framework, you’ll have the knowledge you need to develop it if needed.
To help students learn new frameworks as they need them, we will be spending one lesson learning to use Tensorflow, MXNet, CNTK, and Keras. We will work with our students to port our framework to these other libraries, which will make for some great class projects.
We will also spend some time looking at how to productionize deep learning models. Unless you are working at Google-scale, your best approach will probably be to create a simple REST interface on top of your Pytorch model, running inference on the CPU. If you need to scale up to very high volume, you can export your model (as long as it does not use certain kinds of customisations) to Caffe2 or CNTK. If you need computation to be done on a mobile device, you can either export as mentioned above, or use an on-device library.
How we feel about Keras
We still really like Keras. It’s a great library and is far better for fairly simple models than anything that came before. It’s very easy to move between Keras and our new framework, at least for the subset of tasks and architectures that Keras supports. Keras supports lots of backend libraries which means you can run Keras code in many places.
It has a unique (to our knowledge) approach to defining architectures where authors of custom layers are required to create a build() method which tells Keras what shape output it creates for a given input. This allows users to more easily create simple architectures because they almost never have to specify the number of input channels for a layer. For architectures like Densenet which concatenate layers it can make the code quite a bit simpler.
On the other hand, it tends to make it harder to customize models, especially during training. More importantly, the static computation graph on the backend, along with Keras’ need for an extra compile() phase, means that it’s hard to customize a model’s behaviour once it’s built.
What’s next for fast.ai and Pytorch
We expect to see our framework and how we teach Pytorch develop a lot as we teach the course and get feedback and ideas from our students. In past courses students have developed a lot of interesting projects, many of which have helped other students—we expect that to continue. Given the accelerating progress in deep learning, it’s quite possible that by this time next year, there will be very different hardware or software options that will make todays’ technology quite obsolete. Although based on the quick adoption of new technologies we’ve seen from the Pytorch developers, we suspect that they might stay ahead of the curve for a while at least…
In our next post, we’ll be talking more about some of the standout features of Pytorch, and dynamic libraries more generally.