Despite what you may have heard, you can use deep learning for the type of data you might keep in a SQL database, a Pandas DataFrame, or an Excel spreadsheet (including time-series data). I will refer to this as tabular data, although it can also be known as relational data, structured data, or other terms (see my Twitter poll and comments for more discussion).
Tabular data is the most commonly used type of data in industry, but deep learning on tabular data receives far less attention than deep learning for computer vision and natural language processing. This post covers some key concepts from applying neural networks to tabular data, in particular the idea of creating embeddings for categorical variables, and highlights 2 relevant modules of the fastai library:
fastai.structured: this module works with Pandas DataFrames, is not dependent on PyTorch, and can be used separately from the rest of the fastai library to process and work with tabular data.
fastai.column_data: this module also works with Pandas DataFrames, and provides methods to convert DataFrames (with both continuous and categorical variables) into ModelData objects that can easily be used when training neural networks. It also includes an implementation for creating embeddings of categorical variables, a powerful technique I will explain below.
A key technique to making the most of deep learning for tabular data is to use embeddings for your categorical variables. This approach allows for relationships between categories to be captured. Perhaps Saturday and Sunday have similar behavior, and maybe Friday behaves like an average of a weekend and a weekday. Similarly, for zip codes, there may be patterns for zip codes that are geographically near each other, and for zip codes that are of similar socio-economic status.
Taking Inspiration from Word Embeddings
A way to capture these multi-dimensional relationships between categories is to use embeddings. This is the same idea as is used with word embeddings, such as Word2Vec. For instance, a 3-dimensional version of a word embedding might look like:
puppy: [0.9, 1.0, 0.0]
dog: [1.0, 0.2, 0.0]
kitten: [0.0, 1.0, 0.9]
cat: [0.0, 0.2, 1.0]
Notice that the first dimension is capturing something related to being a dog, and the second dimension captures youthfulness. This example was made up by hand, but in practice you would use machine learning to find the best representations (while semantic values such as dogginess and youth would be captured, they might not line up with a single dimension so cleanly). You can check out my workshop on word embeddings for more details about how word embeddings work.
Applying Embeddings for Categorical Variables
Similarly, when working with categorical variables, we will represent each category by a vector of floating point numbers (the values of this representation are learned as the network is trained).
For instance, a 4-dimensional version of an embedding for day of week could look like:
Sunday: [.8, .2, .1, .1]
Monday: [.1, .2, .9, .9]
Tuesday: [.2, .1, .9, .8]
Here, Monday and Tuesday are fairly similar, yet they are both quite different from Sunday. Again, this is a toy example. In practice, our neural network would learn the best representations for each category while it is training, and each dimension (or direction, which doesn’t necessarily line up with ordinal dimensions) could have multiple meanings. Rich relationships can be captured in these distributed representations.
Reusing Pretrained Categorical Embeddings
Embeddings capture richer relationships and complexities than the raw categories. Once you have learned embeddings for a category which you commonly use in your business (e.g. product, store id, or zip code), you can use these pre-trained embeddings for other models. For instance, Pinterest has created 128-dimensional embeddings for its pins in a library called Pin2Vec, and Instacart has embeddings for its grocery items, stores, and customers.
The fastai library contains an implementation of embeddings for categorical variables, built on top of Pytorch’s nn.Embedding module, so this is not something you need to code by hand each time you want to use it.
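To make the underlying idea concrete, here is a minimal Pytorch sketch (the variable names and the index-to-day mapping are made up for illustration):

```python
import torch
import torch.nn as nn

# A 4-dimensional embedding for day of week: a learned lookup table
# with one row of floats per category (7 rows here).
day_emb = nn.Embedding(num_embeddings=7, embedding_dim=4)

# Categories are encoded as integer indices before lookup
# (the mapping of days to indices is arbitrary).
days = torch.tensor([0, 1, 6])   # say, Monday, Tuesday, Sunday
vectors = day_emb(days)
print(vectors.shape)             # torch.Size([3, 4])

# These rows are ordinary model weights: they receive gradients during
# training, so similar categories can end up with similar vectors.
```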
Treating some Continuous Variables as Categorical
We generally recommend treating month, year, day of week, and some other variables as categorical, even though they could be treated as continuous. Often for variables with a relatively small number of categories, this results in better performance. This is a modeling decision that the data scientist makes. Generally, we want to keep continuous variables represented by floating point numbers as continuous.
Although we can choose to treat continuous variables as categorical, the reverse is not true: any variables that are categorical must be treated as categorical.
Time Series Data
The approach of using neural networks together with categorical embeddings can be applied to time series data as well. In fact, this was the model used by students of Yoshua Bengio to win 1st place in the Kaggle Taxi competition (paper here), using a trajectory of GPS points and timestamps to predict the length of a taxi ride. It was also used by the 3rd place winners in the Kaggle Rossmann Competition, which involved using time series data from a chain of stores to predict future sales. The 1st and 2nd place winners of this competition used complicated ensembles that relied on specialist knowledge, while the 3rd place entry was a single model with no domain-specific feature engineering.
In our Lesson 3 jupyter notebook we walk through a solution for the Kaggle Rossmann Competition. This data set (like many data sets) includes both categorical data (such as the state the store is located in, or being one of 3 different store types) and continuous data (such as the distance to the nearest competitor or the temperature of the local weather). The fastai library lets you enter both categorical and continuous variables as input to a neural network.
When applying machine learning to time-series data, you nearly always want to choose a validation set that is a continuous selection with the latest available dates that you have data for. As I wrote in a previous post, “If your data is a time series, choosing a random subset of the data will be both too easy (you can look at the data both before and after the dates you are trying to predict) and not representative of most business use cases (where you are using historical data to build a model for use in the future).”
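For example, a time-based split in Pandas might look like this (a toy sketch; the column names and cutoff are illustrative):

```python
import pandas as pd

# Toy data: one row per day
df = pd.DataFrame({
    'Date': pd.date_range('2014-01-01', periods=500, freq='D'),
    'Sales': range(500),
})

# Hold out the latest six weeks as a continuous validation set
cutoff = df['Date'].max() - pd.Timedelta(days=42)
train = df[df['Date'] <= cutoff]
valid = df[df['Date'] > cutoff]   # latest dates only, never a random sample
```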
One key to successfully using deep learning with time series data is to split the date into multiple categorical variables (year, month, week, day of week, day of month, and Booleans for whether it’s the start/end of a month/quarter/year). The fastai library has implemented a method to handle this for you, as described below.
Modules to Know in the Fastai Library
We will be releasing more documentation for the fastai library in coming months, but it is already available on pip and on github, and it is used in the Practical Deep Learning for Coders course. The fastai library is built on top of Pytorch and encodes best practices and helpful high-level abstractions for using neural networks. The fastai library achieves state-of-the-art results and was recently used to win the Stanford DAWNBench competition (fastest CIFAR10 training).
fastai.column_data.ColumnarModelData takes a Pandas DataFrame as input and creates a type of ModelData object (an object which contains data loaders for the training, validation, and test sets, and which is the fundamental way of keeping track of your data while training models).
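In rough outline (following my recollection of the fastai 0.7-era API from the lesson notebooks, so treat argument names and order as illustrative), usage looks like:

```python
from fastai.column_data import ColumnarModelData

# Assumed already defined: PATH (working directory), val_idx (indices of a
# time-based validation set), df (processed DataFrame), y (target values),
# cat_vars (names of the categorical columns), and emb_szs (a list pairing
# each categorical variable's cardinality with a chosen embedding size).
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, y,
                                       cat_flds=cat_vars, bs=128)

m = md.get_learner(emb_szs,
                   len(df.columns) - len(cat_vars),  # number of continuous vars
                   0.04,                             # embedding dropout
                   1,                                # output size (regression)
                   [1000, 500],                      # hidden layer sizes
                   [0.001, 0.01])                    # hidden layer dropouts
m.fit(1e-3, 3)
```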
The fastai.structured module of the fastai library is built on top of Pandas, and includes methods to transform DataFrames in a number of ways, improving the performance of machine learning models by pre-processing the data appropriately and creating the right types of variables.
For instance, fastai.structured.add_datepart converts dates (e.g. 2000-03-11) into a number of variables (year, month, week, day of week, day of month, and booleans for whether it’s the start/end of a month/quarter/year.)
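A sketch of what this looks like (the output column names follow my recollection of the library and may differ slightly):

```python
import pandas as pd
from fastai.structured import add_datepart

df = pd.DataFrame({'Date': pd.to_datetime(['2000-03-11', '2000-03-12'])})
add_datepart(df, 'Date')   # expands (and by default drops) the Date column

print(df.columns)
# e.g. Year, Month, Week, Day, Dayofweek, Is_month_start, Is_month_end,
#      Is_quarter_start, ..., Elapsed
```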
Other useful methods in the module allow you to:
Fill in missing values with the median whilst adding a boolean indicator variable (fix_missing)
Change any columns of strings in a Pandas DataFrame to a column of categorical values (train_cats)
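Put together, a typical preprocessing flow might look like this toy sketch (if I recall the 0.7-era behavior correctly, fix_missing is invoked for you inside proc_df):

```python
import numpy as np
import pandas as pd
from fastai.structured import train_cats, proc_df

df = pd.DataFrame({
    'StoreType': ['a', 'b', 'a', 'c'],
    'CompetitionDistance': [570.0, np.nan, 14130.0, 620.0],
    'Sales': [5263, 6064, 8314, 13995],
})

train_cats(df)   # string columns -> Pandas categorical columns

# Numericalize the categories, fill missing values with the median, and add
# a boolean CompetitionDistance_na indicator; returns the processed
# DataFrame, the target column, and a dict recording the fill values.
df_proc, y, nas = proc_df(df, 'Sales')
```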
While many news outlets have covered what Zuckerberg wore for his testimony before Congress last week, I wish that several more substantial issues were getting greater coverage, such as the lackluster response of Facebook to its role in the genocide in Myanmar, what concrete regulations could help us, the continued push of Facebook lobbyists to gut the few privacy laws we have, and the role that Free Basics (aka Internet.org – remember that?) has played in global hate speech and violence. So let’s discuss them now.
Genocide in Myanmar vs. a Monetary Fine in Germany
The impact of Facebook is much larger than just the USA and UK: there is a genocide happening in Myanmar, with hundreds of thousands of people from an ethnic minority being systematically driven from their homes, raped, and/or killed, and with entire villages being burnt and bulldozed to the ground. Facebook is the primary source of news in Myanmar (in part due to the Facebook program “Free Basics”, explained more below), and it has been used for years to spread dehumanizing propaganda about the ethnic minority (as well as to censor news reports about violence against the minority). Local activists have been asking Facebook to address this issue since 2013 and news outlets have been covering Facebook’s role in the genocide since 2014, yet Facebook has been beyond sluggish to respond.
(I previously wrote about Facebook’s role in Myanmar here.)
Even now, Zuckerberg promised during the congressional hearings to hire “dozens” to address the genocide in Myanmar (in 2018, years after the genocide had begun), which stands in stark contrast to Facebook quickly hiring 1,200 people in Germany to try to avoid expensive penalties under a new German law against hate speech. Clearly, Facebook is more reactive to the threat of a financial penalty than to the systematic destruction of an ethnic minority.
Can AI solve hate speech?
Facebook’s actions in Germany of hiring 1,200 human content moderators also stand in contrast to Zuckerberg’s vague assurance that Facebook will use AI to address hate speech. It’s valuable to look at how Facebook actually behaves when facing a potential financial penalty, and not what sort of vague promises are made to Congress. As Cornell Tech law professor James Grimmelmann said, “[AI] won’t solve Facebook’s problems, but it will solve Zuckerberg’s: getting someone else to take responsibility.”
As two concrete examples of just how far the tech giants are from effectively using AI on these problems, let’s consider Google. In 2017, Google released a tool called Perspective to automatically detect online harassment and abusive speech, with high-profile launch partners including The New York Times, The Economist, and Wikipedia. Librarian Jessamyn West found a number of sexist, racist, and ableist biases (confirmed and covered in more detail here). For instance, the statement “I am a gay Black woman” got an 87 percent toxicity score (out of 100), while “I am a man” was one of the least toxic phrases tested (just 20 percent). Also in 2017, Google boasted that they’d implemented cutting-edge machine learning to identify and remove YouTube videos promoting violent extremism or terrorism. The removed videos included those of activist groups documenting Syrian war crimes, evidence from the Chelsea Manning court case, and documentation of the destruction of ancient artifacts by Islamic State (many of these videos were later put back and Google conceded it had made mistakes). These examples demonstrate a few key points:
Using AI to address harassment and abusive speech is incredibly hard.
You can’t just throw AI at a problem without deeply understanding the problem.
Although Facebook is in the spotlight right now, Google has the same business model (targeted ads) and same monopoly status, and Google is causing huge societal harms as well. For instance, YouTube has been well-documented to recommend white supremacist videos and conspiracy theories from a wide range of starting points, and it is playing a role in radicalizing users into dangerous viewpoints.
You can’t just throw AI at a problem
Expertise in history, sociology, and psychology is necessary. As I said in my talk at the MIT Tech Review Conference, domain expertise is incredibly important. You can’t just have techies throwing AI at a problem without really understanding it, or the domain. For instance, radiologists who have also become deep learning practitioners have been able to catch some errors in deep learning work on radiology that deep learning experts missed. To this end, I think it is crucial that Facebook work with sociologists, historians, psychologists, and experts on topics like authoritarian governments, propaganda, etc. to better understand the problems they’ve created and plausible solutions.
Data privacy is a public good
Hopefully the above examples have made clear that data privacy is not just an individual choice, but it has profound societal impacts. As UNC Professor and techno-sociology expert Zeynep Tufekci wrote, “Data privacy is not like a consumer good, where you click ‘I accept’ and all is well. Data privacy is more like air quality or safe drinking water, a public good that cannot be effectively regulated by trusting in the wisdom of millions of individual choices. A more collective response is needed.” What might that collective response look like? A few principles for regulation:
Data collection should only be through a clear, concise, & transparent opt-in process.
People should have access to all data collected on them.
Data collection should be limited to specifically enumerated purposes, for a designated period of time.
Aggregate use of data should be regulated.
Another area of potential regulation is to address the monopoly status of tech giants like Facebook and force interoperability. Professor Jonathan Zittrain of Harvard has suggested, “The key is to have the companies actually be the ‘platforms’ they claim they are. Facebook should allow anyone to write an algorithm to populate someone’s feed.”
Professor Tufekci also proposed, “If anything, we should all be thinking of ways to reintroduce competition into the digital economy. Imagine, for example, requiring that any personal data you consent to share be offered back to you in an ‘interoperable’ format, so that you could choose to work with companies you thought would provide you better service, rather than being locked in to working with one of only a few.”
The power of tech lobbies
While regulation is necessary for items we consider public goods, in the USA we face a significant obstacle due to the amount the tech industry spends on lobbying and the outsize impact that corporate lobbies have on our policies. As Alvaro Bedoya, Executive Director at the Georgetown Law Center on Privacy and Technology and a former Senate staffer who worked on privacy laws, points out, the last major consumer privacy law was passed in 2009, before Google, Facebook, and Amazon had really ramped up their lobbying spending. “I’m sitting here watching Mark Zuckerberg say he’s sorry and that Facebook will do better on privacy, yet literally as he testifies lobbyists paid by Facebook in Illinois and California are working to stop or gut privacy laws,” said Bedoya.
In the USA, we also face the issue of whether our laws will actually be enforced when tech giants are found to violate them. For instance, in 2016 ProPublica discovered that Facebook allowed the placement of housing ads that would not be shown to African-American, Hispanic, or Asian-American people, which is a violation of the Fair Housing Act of 1968. “This is horrifying. This is massively illegal. This is about as blatant a violation of the federal Fair Housing Act as one can find,” said prominent civil rights lawyer John Relman. Facebook apologized (although it did not face any penalty); yet over a year later in 2017, Facebook was still found to let people place housing ads that would not be shown to certain groups of people, such as African Americans, people interested in wheelchair ramps, Jews, and Spanish speakers. Again, please contrast this with how swiftly and strongly Facebook responded in Germany to the threat of a penalty. As a further example, Facebook has been found to be violating the Age Discrimination in Employment Act of 1967 by allowing companies, including Verizon, Amazon, Goldman Sachs, Target, and Facebook itself, to place job advertisements targeted solely at young people.
Free Basics, aka Internet.org
Zooming back out to Facebook’s global impact, it is important to understand the role that Facebook’s program Free Basics has played in the genocide in Myanmar, the election of a violent strongman in the Philippines who is notorious for his use of extrajudicial killings, and elsewhere. Free Basics (formerly called Internet.org) was initially touted as a charitable effort in which Facebook provides free access to Facebook and a small number of other sites, but not to the internet as a whole, in countries including Myanmar, Somalia, the Philippines, Nigeria, Iraq, Pakistan, and others. For users in these countries, there is no charge for using the Facebook app on a smartphone, while the data rates for other apps are often prohibitively expensive. Despite its original name and description of benevolent aims, Internet.org was not a non-profit, but rather a business development group within Facebook aimed at increasing users and revenue.
Free Basics has led to large numbers of Facebook users who say they don’t use the internet and never follow links outside of Facebook (56% of Facebook users in Indonesia and 61% in Nigeria). In some countries, people use the terms “internet” and “Facebook” interchangeably, and Facebook is the primary way that people access news. Free Basics violates net neutrality and was banned in India in 2016 for this reason. While Facebook may have had good intentions with this program, it should have been accompanied by a huge sense of responsibility, and it needs to be re-evaluated in light of its role in inciting violence. Facebook needs to put significant resources into analyzing the impact Free Basics is having in the countries where it is offered, and to work closely with local experts from those countries.
I want to thank Omoju Miller, a machine learning engineer at Github who has her PhD from Berkeley, for recently reminding me about Free Basics and asking why more journalists aren’t writing about the role it’s played in global events.
Rethinking the Web
In a recent interview about AI, French President Emmanuel Macron said: “In the US, [AI] is entirely driven by the private sector, large corporations, and some startups dealing with them. All the choices they make are private choices that deal with collective values. That’s exactly the problem you have with Facebook and Cambridge Analytica or autonomous driving… The key driver should not only be technological progress, but human progress. This is a huge issue. I do believe that Europe is a place where we are able to assert collective preferences and articulate them with universal values.” Perhaps France will provide a template of laws that support technical progress while protecting human values.
To end on a note of hope, Tim Berners-Lee, inventor of the web, wrote an op-ed saying, “Two myths currently limit our collective imagination: the myth that advertising is the only possible business model for online companies, and the myth that it’s too late to change the way platforms operate. On both points, we need to be a little more creative. While the problems facing the web are complex and large, I think we should see them as bugs: problems with existing code and software systems that have been created by people – and can be fixed by people. Create a new set of incentives and changes in the code will follow.”
It’s easy to jump to simplistic conclusions: many commenters assume that Facebook must be supported at all times because technology providers have the power to help society (disregarding the role that regulation plays in maintaining healthy markets and a healthy society); and many others assume that Facebook and all involved must be evil and everyone needs to stop using their services right away. The reality is more complex. Technology can be greatly beneficial for society, and it can also have a disastrous impact. The complex problems we are facing require creative and nuanced solutions.
I was recently a guest speaker at the Stanford AI Salon on the topic of accessibility in AI, which included a free-ranging discussion among assembled members of the Stanford AI Lab. There were a number of interesting questions and topics, so I thought I would share a few of my answers here.
Q: What 3 things would you most like the general public to know about AI?
AI is easier to use than the hype would lead you to believe. In my recent talk at the MIT Technology Review conference, I debunked several common myths that you must have a PhD, a giant data set, or expensive computational power to use AI.
Most AI researchers are not working on getting computers to achieve human consciousness. Artificial intelligence is an incredibly broad field, and a somewhat misleading name for a lot of the stuff included in the field (although since this is the terminology everyone uses, I go along with it). Last week I was 20 minutes into a conversation with an NPR reporter before I realized that he thought we were talking about computer systems achieving human consciousness, and I thought we were talking about the algorithm Facebook uses to decide which advertisements to show you. Things like this happen all too often.
95% of the time when people within AI talk about AI, they are referring to algorithms that do a specific task (e.g. that sort photos, translate language, or win Atari games) and 95% of the time when people outside of AI hear something about AI they think of humanoid robots achieving super-intelligence (numbers made up based on my experience). I believe that this leads to a lot of unnecessary confusion and fear.
There are several problems that are far more urgent than the threat of evil super-intelligence, such as how we are encoding racial and gender biases into algorithms (that are increasingly used to make hiring, firing, healthcare benefits, criminal justice, and other life-impacting decisions) or increasing inequality (and the role that algorithms play in perpetuating & accelerating this).
Q: What can AI researchers do to improve accessibility?
A: My wish-list for AI researchers:
Write a blog post to accompany your paper (some of the advice we give fast.ai students about reading academic papers is to first search if someone has written a blog post version)
Share your code.
Try running your code on a single GPU (which is what most people have access to). Jeremy and I sometimes come across code that was clearly never run on just one GPU.
Use concrete examples in your paper. I was recently reading a paper on fairness that was all math equations, and even as someone with a math PhD, I was having trouble mapping what this meant into the real world.
Q: I recently taught a course on deep learning and had all the students do their own projects. It was so hard. The students couldn’t get their models to train, and we were like “that’s deep learning”. How are you able to teach this with fast.ai?
A: Yes, deep learning models can be notoriously finicky to train. We’ve had a lot of successful student projects come out of fast.ai, and I believe some of the factors for this are:
The fast.ai course is built around a number of practical, hands-on case studies (spanning computer vision, natural language processing, recommendation systems, and time series problems), and I think this gives students a good starting point for many of their projects.
We’ve developed the fastai library with the primary goal of having it be easy for students to use and to apply to new problems. This includes innovations like a learning rate finder, setting good defaults, and encoding best practices (such as cyclical learning rates).
fast.ai is not an education company; we are a research lab, and our research is primarily on how to make deep learning easier to use (which closely aligns with the goals of making deep learning easier to learn).
All this said, I do want to acknowledge that deep learning models can still be frustrating and finicky to train! I think everyone working in the field does routinely experience these frustrations (and I look forward to this improving as the field matures and advances).
Q: What do you mean by accessibility in AI? And why is it important?
A: By accessibility, I mean that people from all sorts of backgrounds (education, location, domain of expertise, race, gender, age, and more) should be able to apply AI to problems that they care about.
I think this is important for two key reasons:
On the positive side, people from different backgrounds will know and care about problems that nobody else knows about. For instance, we had a student this fall who is a Canadian dairy farmer using deep learning to improve the health of his goats’ udders. This is an application that I never would have thought about.
People from different backgrounds are necessary to try to catch biases and negative impacts of AI. I think people from under-represented backgrounds are most likely to identify the ways that AI may be harmful, so their inclusion is vital.
I also want people from a variety of backgrounds to know enough about AI to be able to identify when people are selling snake oil (or just over-promising on what they could reasonably deliver).
Currently, for our Practical Deep Learning for Coders course, you need 1 year of coding experience to be able to learn deep learning, although we are working to lower that barrier to entry. I have not checked out the newest versions of deep learning SaaS APIs (it’s on my to-do list) that purportedly let people use deep learning without knowing how to code, but the last time I checked, these APIs suffered from several shortcomings:
not truly state-of-the-art results (for this, you needed to write your own code), which is what most people want
only worked on fairly limited problems, or
in order to use the API effectively, you needed to know as much as it takes to write your own code (in which case, people prefer to write their own, since it gives them more control and customization).
While I support the long-term vision of a machine learning API that anyone can use, I’m skeptical that the technology is there to truly provide something robust and performant enough to eliminate code yet.
Q: Do we really need everyone to understand AI? If the goal is to make an interface as user-friendly as a car, should we just be focusing on that? Cars have a nice UI, and we can all drive cars without understanding how the engine works. Does it really matter who developed cars initially?
A: While I agree with the long-term goal of making deep learning as easy to use as a car, I think we are a long way from that, and it very much matters who is involved in the meantime. It is dangerous to have a homogeneous group developing technology that impacts us all.
For instance, it wasn’t until 2011 that car manufacturers were required to use crash test dummies with prototypical female anatomy, in addition to the “standard” male test dummies (a fact I learned from Datasheets for Datasets by Timnit Gebru, et al., which includes fascinating case studies of how standardization came to the electronics, automobile, and pharmaceutical industries). As described in the paper, “a safety study of automobiles manufactured between 1998 and 2008 concluded that women wearing seat belts were 47% more likely to be seriously injured than males in similar accidents”.
Extending the car analogy further, there are many things that are sub-optimal about our system of cars and how it developed: the USA’s under-investment in public transportation, how free parking led to sprawl and congestion, and the negative impacts of choosing fossil fuels over electric power. Could these things have been different if those developing cars had come from a wider variety of backgrounds and fields?
And finally, returning to the question of AI usability and access, it matters both:
Who helps create the abstractions and usability interfaces
Who is able to use AI in the meantime.
Q: Is it really possible for people to learn the math they need on their own? Isn’t it easier to learn things through an in-person class – so as a grad student at Stanford, should I take advantage of the math courses that I have access to now?
A: At fast.ai, we highly recommend that people learn math on an as-needed basis (see here and here). Feeling like you need to learn all the math you might ever need before getting to work on the topic you’re excited about is a recipe for discouragement, and most people struggle to maintain motivation. If your goal is to apply deep learning to practical problems, much of that math turns out to be unnecessary.
As for learning better through in-person courses, this is something that I think about. Even though we put all our course materials online, for free, we’ve had several students travel from other countries (including Australia, Argentina, England, and India) to attend our in-person course. If you are taking a course online or remotely, I definitely recommend trying to organize an in-person study group around it with others in your area. And no matter what, seek out community online.
I have a PhD in math, so I took a lot of math in graduate school. And sadly, I have forgotten much of the stuff that I didn’t use (most of my graduate courses were taken over a decade ago). It wasn’t a good use of time, given what I’ve ended up doing now. So I don’t know how useful it is to take a course that you likely won’t need (unless you are purely taking it for the enjoyment, or you are planning to stay in academia).
And finally, in talking with students, I think that people’s anxiety and negative emotions around math are a much bigger problem than their failure to understand concepts. There is a lot broken with how math is taught and the culture around it in the USA, and so it’s understandable that many people feel this way.
This is just a subset of the many topics we discussed at the AI Salon. All questions are paraphrased as I remember them and are not exact quotes. I should also note that my co-presenter, Stanford post-doc Mark Whiting, made a number of interesting points around HCI and symbolic systems. I enjoyed the event and want to thank the organizers for having me!
We keep hearing that data scientist is the hottest job of the 21st century and that there is a cross-industry shortage of employees with enough data skills, yet the idea of studying data science in school is still very new. What should colleges and universities teach about the topic? Is it just an adaptation of existing math, statistics, or computer science courses? Would these classes be of interest to any non-math majors? Is there enough material to create a minor?
A math professor at a small university (who is also an old friend from grad school) recently asked me these questions, and I thought I would tackle them here for my latest advice column.
What is data science?
Data science refers to a wide collection of skills needed to ask and answer questions using data. The title data scientist is actually used to refer to several separate roles: business analyst, data analyst, machine learning engineer, data pipeline engineer, modeling researcher, etc.
Data scientists need to know how to:
load, clean, and inspect data
plot and do exploratory analysis
formulate questions and test hypotheses
communicate their results to people who are not data scientists.
In addition, some data scientists do more specialized tasks, such as building machine learning models.
Of the schools that are starting to create data science programs, many use a mix of existing mathematics, statistics, and computer science courses. However, I didn’t learn the most useful data science skills in these fields when I was a student. (I was a math major and CS minor, and I earned my PhD in a field related to probability. My goal had been to become a theoretical math professor, and I did not start learning any practical skills until the end of my PhD.) Looking through the selection of math courses offered at my friend’s university, none jumps out as particularly useful for data science.
Although data science is related to math, computer science, and statistics, I definitely recommend designing new courses (or at least new units of material), and not trying to shoehorn existing courses into this role.
Most Useful Things to Learn
Python or R. I’m inclined towards Python since it is a very general language with libraries for many different purposes (e.g. if a student decided to become a software engineer instead, Python would be much more useful than R). However, R is nice too and is widely used in the academic statistics community. When learning Python for data science, you should learn at least:
Pandas: A library for working with tables of data.
Numpy: Used for nearly all data crunching in Python.
SQL is a language used for interacting with tabular data (data that appears in tables with rows and columns), and particularly relational data (data in multiple related tables, such as customers and orders). It is widely used, and because it is highly specialized, it is quicker to learn than most programming languages. SQL is a highly employable skill. The needed skills include writing queries and joins, understanding keys, and designing database schemas. SQL should be learned regardless of whether you choose R or Python.
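As a toy illustration of queries, joins, and keys (using Python’s built-in sqlite3 so it runs anywhere; the schema here is made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),  -- foreign key
        total REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 24.50), (3, 2, 5.00);
""")

# A join across the two related tables, linked by the customer_id key
query = """
    SELECT c.name, SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""
for row in conn.execute(query):
    print(row)   # ('Ada', 34.49) then ('Grace', 5.0)
```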
Jupyter Notebooks provide an interactive environment that can include code, data, plots, text, and LaTeX equations. They are a great tool for both teaching and for doing data science in the workplace. Many textbooks are now being released as Jupyter Notebooks, such as the ones in this interesting gallery. I typically run Python within Jupyter notebooks.
Exploratory data analysis includes loading and inspecting data, creating plots, checking what type different variables are, and dealing with missing values.
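A few Pandas one-liners cover a surprising amount of this (the file and column names here are hypothetical):

```python
import pandas as pd

df = pd.read_csv('sales.csv')   # hypothetical data file

df.head()            # inspect the first few rows
df.dtypes            # check what type each variable is
df.describe()        # summary statistics for the numeric columns
df.isnull().sum()    # count missing values per column
df['price'].hist()   # quick look at one column's distribution
```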
Machine learning is about using data to make predictions (whether that is predicting sales, identifying cancer on a CT scan, or Google Maps identifying house numbers from photographs). The most vital concept is the idea of having a held-out test set. A great algorithm to start with is ensembles of decision trees.
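For instance, a minimal scikit-learn sketch of both ideas, a held-out test set and an ensemble of decision trees, might look like:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100)  # an ensemble of decision trees
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy measured on the held-out set
```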
Ethics should be included as an integral part of all data science courses, and not as a separate course. Case studies are particularly useful and I touch on several in this post, as well as linking to a number of course syllabi and other resources.
Working on a project from start to finish: designing a problem, running experiments, and writing them up. One resource is Jeremy’s article on designing great data products. Thinking about data quality and verification is part of this process. Microsoft’s racist chatbot Tay, which had to be discontinued less than a day after it was released when it began spouting Nazi rhetoric, provides a case study of not giving enough thought to the input data. Working on a project could also include productionizing it by building a simple web app (such as Python’s Flask).
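A simple prediction endpoint in Flask can be only a few lines; in this sketch the “model” is a made-up formula, just to show the shape of such an app:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # A real app would load a trained model once at startup and call its
    # predict method here; this toy formula just stands in for one.
    features = request.get_json()   # e.g. {"sqft": 1200, "rooms": 3}
    price = 100 * features['sqft'] + 5000 * features['rooms']
    return jsonify(prediction=price)

if __name__ == '__main__':
    app.run()
```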
My friend’s question used the term big data, but I chose to interpret this as being a question about data science. The marketing blitz around big data has been harmful, in that it misleadingly suggests that it is the size of your data set that matters. In many cases, folks with big data solutions are left searching for a problem to apply their technology to.
In most of data science (including artificial intelligence) far less data is needed than many people realize. One of our students created a model to distinguish pictures of cricket from pictures of baseball using just 30 training images! Even when you have a large data set, I recommend working on a smaller subset (until you are almost finished), since that will allow you to iterate much more quickly as you experiment. Also, what was considered “big data” a few years ago is now considered normal, and this trend is continuing all the time as technology advances.
Not just for math majors
A data science minor would be valuable across a range of disciplines: pre-med, economics, sociology, business, biology, and more. People are using data analysis to study everything from art curation to Japanese calligraphy.
Foundations of data science is the fastest growing course ever at UC Berkeley with 1,500 students from 60 different majors taking it in fall 2017. I can see a future in which college students from all majors take at least 1 or 2 data science courses (or in which it becomes mandatory, just like basic reading and writing literacy). Data literacy will continue to increase in importance both in the workplace and in society at large. I am excited to hear that more universities are beginning to add data science to their curriculums!
This post is part of my ask-a-data-scientist advice column. Here are some of the previous posts:
Last year we announced that we were developing a new deep learning course based on Pytorch (and a new library we have built, called fastai), with the goal of allowing more students to be able to achieve world-class results with deep learning. Today, we are making this course, Practical Deep Learning for Coders 2018, generally available for the first time, following the completion of a preview version of the course by 600 students through our diversity fellowship, international fellowship, and Data Institute in-person programs. The only prerequisites are a year of coding experience and high school math (math required for understanding the material is introduced as needed during the course).
The course includes around 15 hours of lessons and a number of interactive notebooks, and is now available for free (with no ads) at course.fast.ai. About 80% of the material is new this year, including:
All models train much faster than last year’s equivalents, are much more accurate, and require fewer lines of code
Shows how to surpass all previous academic benchmarks in text classification, and how to match the state of the art in collaborative filtering, time series, and structured data analysis
Leverages the dynamic computation features of Pytorch to provide a deeper understanding of the internals of designing and training models
Covers recent network architectures such as ResNet and ResNeXt, including building a ResNet with batch normalization from scratch.
Combining research and education
fast.ai is first and foremost a research lab. Our research focuses on how to make practically useful deep learning more widely accessible. Often we’ve found that the current state of the art (SoTA) approaches aren’t good enough to be used in practice, so we have to figure out how to improve them. This means that our course is unusual in that although it’s designed to be accessible with minimal prerequisites (just high school math and a year of coding experience), we show how to match or better SoTA approaches in computer vision, natural language processing (NLP), time series and structured data analysis, and collaborative filtering. Therefore, we find our students have wildly varying backgrounds, including 14-year-old high school students, tenured professors of statistics, and successful Silicon Valley startup CEOs (and dairy farmers, accountants, Buddhist monks, and product managers).
We want our students to be able to solve their most challenging and important problems, to transform their industries and organizations, which we believe is the potential of deep learning. We are not just trying to teach people how to get existing jobs in the field — but to go far beyond that. Therefore, since we first ran our deep learning course, we have been constantly curating best practices, and benchmarking and developing many techniques, trialing them against Kaggle leaderboards and academic state-of-the-art results.
The 2018 course shows how to (all with no more than a single GPU and a few seconds to a few hours of computation):
Build world-class image classifiers with as little as 3 lines of code (in a Kaggle competition running during the preview class, 17 of the top 20 ranked competitors were fast.ai students!)
Greatly surpass the academic SoTA in NLP sentiment analysis
Build a movie recommendation system that is competitive with the best highly specialized models
Replicate top Kaggle solutions in structured data analysis problems
Build the 2015 ImageNet-winning ResNet architecture and batch normalization layer from scratch
A unique approach
Our earlier 2017 course was very successful, with many students developing deep learning skills that let them do things like:
Karthik Kannan, founder of letsenvision.com, who told us “Today I’ve picked up steam enough to confidently work on my own CV startup and the seed for it was sowed by fast.ai with Pt.1 and Pt.2”
Matthew Kleinsmith and Brendon Fortuner, who in 24 hours built a system to add filters to the background and foreground of videos, giving them victory in the 2017 Deep Learning Hackathon.
The new course is 80% new material, reflecting the major developments that have occurred in the last 12 months in the field. However, the key educational approach is unchanged: a top-down, code-first approach to teaching deep learning that is unique. We teach “the whole game” – starting off by showing how to use a complete, working, very usable, state of the art deep learning network to solve real world problems, by using simple, expressive tools. And then gradually digging deeper and deeper into understanding how those tools are made, and how the tools that make those tools are made, and so on… We always teach through examples: ensuring that there is a context and a purpose that you can understand intuitively, rather than starting with algebraic symbol manipulation. For more information, have a look at Providing a Good Education in Deep Learning, one of the first things we wrote after fast.ai was launched.
A good example of this approach is shown in the article Fun with small image data-sets, where a fast.ai student shows how to use the teachings from lesson 1 to build near perfect classifiers with <20 training images.
A special community
Perhaps the most valuable resources of all are the flourishing forums where thousands of students and alumni are discussing the lessons, their projects, the latest research papers, and more. This community has built many valuable resources, such as the helpful notes and guides provided by Reshama Shaikh, including:
Some forum threads have become important sources of information, such as Making your own server which now has over 500 posts full of valuable information, and Implementing Mask R-CNN which has been viewed by thousands of people who are interested in this cutting edge segmentation approach.
A new deep learning library
The new course is built on top of Pytorch, and uses a new library (called fastai) that we developed to make Pytorch more powerful and easy to use. I won’t go into too much detail about this now, since we discussed the motivation in detail in Introducing Pytorch for fast.ai and will be providing much more information about the implementation and API shortly. But to give a sense of what it means for students, here’s some examples of fastai in action:
fastai is the first library to implement the Learning Rate Finder (Smith 2015) which solves the challenging problem of selecting a good learning rate for your model
Having found a good learning rate, you can train your model much faster and more reliably by utilizing Stochastic Gradient Descent with Restarts (SGDR) - again fastai is the first library to provide this feature
Once you’ve trained your model, you’ll want to view images that the model is getting wrong - fastai provides this feature in a single line of code
The same API can be used to train any kind of model with only minor changes - here are examples of training an image classifier and a recommendation system:
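(The examples themselves appear to have been images in the original post; the sketch below reconstructs the pattern from my recollection of the fastai 0.7-era lesson notebooks, so treat the exact names and arguments as illustrative. PATH, sz, and val_idxs are assumed to be defined.)

```python
from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.dataset import *
from fastai.column_data import *

# Image classifier: a few lines from a folder of images to a trained model
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(resnet34, sz))
learn = ConvLearner.pretrained(resnet34, data, precompute=True)
learn.lr_find()                  # the learning rate finder described above
learn.fit(1e-2, 3, cycle_len=1)  # cycle_len enables SGDR restarts

# Recommendation system on a MovieLens-style ratings file
cf = CollabFilterDataset.from_csv(PATH, 'ratings.csv',
                                  'userId', 'movieId', 'rating')
learn = cf.get_learner(50, val_idxs, 64)  # n_factors, val indices, batch size
learn.fit(1e-2, 2)
```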
The course also shows how to use Keras with Tensorflow, although it takes a lot more code and compute time to get a much lower accuracy in this case. The basic steps are very similar however, so students can rapidly switch between libraries by using the conceptual understanding developed in the lessons.
Taking your learning further
Part 2 of the 2018 course will be run in San Francisco for 7 weeks from March 19. If you’re interested in attending, please see the details on the in-person course web site.