We keep hearing that data scientist is the hottest job of the 21st century and that there is a cross-industry shortage of employees with enough data skills, yet the idea of studying data science in school is still very new. What should colleges and universities teach about the topic? Is it just an adaptation of existing math, statistics, or computer science courses? Would these classes be of interest to any non-math majors? Is there enough material to create a minor?
A math professor at a small university (who is also an old friend from grad school) recently asked me these questions, and I thought I would tackle them here for my latest advice column.
What is data science?
Data science refers to a wide collection of skills needed to ask and answer questions using data. The role of data scientist is actually used to refer to several separate roles: business analyst, data analyst, machine learning engineer, data pipeline engineer, modeling researcher, etc.
Data scientists need to know how to:
load, clean, and inspect data
plot and do exploratory analysis
formulate questions and test hypotheses
communicate their results to people who are not data scientists.
In addition, some data scientists do more specialized tasks, such as building machine learning models.
Of the schools that are starting to create data science programs, many use a mix of existing mathematics, statistics, and computer science courses. However, I didn’t learn the most useful data science skills in these fields when I was a student. (I was a math major and CS minor, and I earned my PhD in a field related to probability. My goal had been to become a theoretical math professor, and I did not start learning any practical skills until the end of my PhD.) Looking through the selection of math courses offered at my friend’s university, none jumps out as particularly useful for data science.
Although data science is related to math, computer science, and statistics, I definitely recommend designing new courses (or at least new units of material), and not trying to shoehorn existing courses into this role.
Most Useful Things to Learn
Python or R. I’m inclined towards Python since it is a very general language with libraries for many different purposes (e.g. if a student decided to become a software engineer instead, Python would be much more useful than R). However, R is nice too and is widely used in the academic statistics community. When learning Python for data science, you should learn at least:
Pandas: A library for working with tables of data.
Numpy: Used for nearly all data crunching in Python
SQL is a language used for interacting with tabular data (data that appears in tables with rows and columns), and particularly relational data (data in multiple related tables, such as customers and orders). It is widely used, and because it is highly specialized, it is quicker to learn than most programming languages. SQL is a highly employable skill. The skills you need are how to write queries and joins, what keys are, and how to design database schemas. SQL should be learned regardless of whether you choose R or Python.
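To make the customers-and-orders example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are invented for illustration; the point is only to show a primary key, a foreign key, and a join in one place.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: each order points at a customer via a foreign key.
conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount      REAL NOT NULL
);
INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO orders (customer_id, amount) VALUES (1, 20.0), (1, 35.5), (2, 12.0);
""")

# LEFT JOIN keeps customers with no orders; COALESCE turns their NULL sum into 0.
query = """
SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id
ORDER BY c.id
"""
results = conn.execute(query).fetchall()
for name, n_orders, total in results:
    print(name, n_orders, total)

conn.close()
```

A LEFT JOIN is used rather than an inner join so that the customer with no orders still appears in the result, which is a common source of confusion when first learning joins.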
Jupyter Notebooks provide an interactive environment that can include code, data, plots, text, and LaTeX equations. They are a great tool for both teaching and for doing data science in the workplace. Many textbooks are now being released as Jupyter Notebooks, such as the ones in this interesting gallery. I typically run Python within Jupyter notebooks.
Exploratory data analysis includes loading and inspecting data, creating plots, checking what type different variables are, and dealing with missing values.
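As a rough illustration of those steps, the sketch below loads a tiny table, guesses each column's type, counts missing values, and fills one column's gaps with its mean. It uses only the standard library so that it runs anywhere; in practice you would more likely do this with Pandas. The data and the helper names (infer_type, fill_with_mean) are made up for the example.

```python
import csv
import io

# Hypothetical inline CSV standing in for a real file on disk.
RAW = """age,city,income
34,Austin,52000
,Boston,61000
29,,48000
41,Austin,
"""

# Load: each row becomes a dict keyed by column name; empty cells are "".
rows = list(csv.DictReader(io.StringIO(RAW)))

def infer_type(values):
    """Guess whether a column is numeric or text from its non-missing values."""
    present = [v for v in values if v != ""]
    try:
        for v in present:
            float(v)
        return "numeric"
    except ValueError:
        return "text"

columns = list(rows[0].keys())
column_types = {c: infer_type([r[c] for r in rows]) for c in columns}
missing_counts = {c: sum(r[c] == "" for r in rows) for c in columns}

def fill_with_mean(rows, column):
    """One simple way to handle missing numeric values: use the column mean."""
    present = [float(r[column]) for r in rows if r[column] != ""]
    mean = sum(present) / len(present)
    return [float(r[column]) if r[column] != "" else mean for r in rows]

ages = fill_with_mean(rows, "age")

print(column_types)
print(missing_counts)
print(ages)
```

Filling with the mean is only one option, and not always a good one; the more important habit is counting the missing values before deciding what to do about them.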
Machine learning is about using data to make predictions (whether that is predicting sales, identifying cancer on a CT scan, or Google Maps identifying house numbers from photographs). The most vital concept is the idea of having a held-out test set. A great algorithm to start with is ensembles of decision trees.
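The held-out test set idea can be shown in a few lines. The sketch below is not an ensemble of decision trees; to stay dependency-free it swaps in a one-nearest-neighbor classifier on synthetic, deliberately noisy data. The function names and the data are assumptions made for illustration. What matters is that the model is scored twice: once on the data it has already seen, and once on data it has not.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out the last test_fraction as a test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]

def predict_1nn(train, x):
    """Predict the label of the training point closest to x."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(train, rows):
    """Fraction of rows whose label matches the 1-NN prediction."""
    correct = sum(predict_1nn(train, x) == y for x, y in rows)
    return correct / len(rows)

# Synthetic data: the label is 1 when x > 0.5, with about 20% of labels flipped.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    data.append((x, y))

train, test = train_test_split(data)
train_acc = accuracy(train, train)  # the model has memorized these points
test_acc = accuracy(train, test)    # estimate on points it has never seen
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because a nearest-neighbor model memorizes its training points, its training accuracy says nothing about how it will do on new data; only the held-out score does.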
Ethics should be included as an integral part of all data science courses, and not as a separate course. Case studies are particularly useful, and I touch on several in this post, as well as linking to a number of course syllabi and other resources.
Working on a project from start to finish: designing a problem, running experiments, and writing them up. One resource is Jeremy’s article on designing great data products. Thinking about data quality and verification is part of this process. Microsoft’s racist chatbot Tay, which had to be discontinued less than a day after it was released when it began spouting Nazi rhetoric, provides a case study of not giving enough thought to the input data. Working on a project could also include productionizing it by building a simple web app (for example, with Python’s Flask framework).
My friend’s question used the term big data, but I chose to interpret this as being a question about data science. The marketing blitz around big data has been harmful, in that it misleadingly suggests that it is the size of your data set that matters. In many cases, folks with big data solutions are left searching for a problem to apply their technology to.
In most of data science (including artificial intelligence) far less data is needed than many people realize. One of our students created a model to distinguish pictures of cricket from pictures of baseball using just 30 training images! Even when you have a large data set, I recommend working on a smaller subset (until you are almost finished), since that will allow you to iterate much more quickly as you experiment. Also, what was considered “big data” a few years ago is now considered normal, and this trend is continuing all the time as technology advances.
Not just for math majors
A data science minor would be valuable across a range of disciplines: pre-med, economics, sociology, business, biology, and more. People are using data analysis to study everything from art curation to Japanese calligraphy.
Foundations of data science is the fastest growing course ever at UC Berkeley with 1,500 students from 60 different majors taking it in fall 2017. I can see a future in which college students from all majors take at least 1 or 2 data science courses (or in which it becomes mandatory, just like basic reading and writing literacy). Data literacy will continue to increase in importance both in the workplace and in society at large. I am excited to hear that more universities are beginning to add data science to their curriculums!
This post is part of my ask-a-data-scientist advice column. Here are some of the previous posts:
Last year we announced that we were developing a new deep learning course based on Pytorch (and a new library we have built, called fastai), with the goal of allowing more students to be able to achieve world-class results with deep learning. Today, we are making this course, Practical Deep Learning for Coders 2018, generally available for the first time, following the completion of a preview version of the course by 600 students through our diversity fellowship, international fellowship, and Data Institute in-person programs. The only prerequisites are a year of coding experience, and high school math (math required for understanding the material is introduced as required during the course).
The course includes around 15 hours of lessons and a number of interactive notebooks, and is now available for free (with no ads) at course.fast.ai. About 80% of the material is new this year, including:
All models train much faster than last year’s equivalents, are much more accurate, and require fewer lines of code
Shows how to surpass all previous academic benchmarks in text classification, and how to match the state of the art in collaborative filtering, and time series and structured data analysis
Leverages the dynamic compilation features of Pytorch to provide deeper understanding of the internals of designing and training models
Covers recent network architectures such as Resnet and ResNeXt, including building a Resnet with batch normalization from scratch.
Combining research and education
fast.ai is first and foremost a research lab. Our research focuses on how to make practically useful deep learning more widely accessible. Often we’ve found that the current state of the art (SoTA) approaches aren’t good enough to be used in practice, so we have to figure out how to improve them. This means that our course is unusual in that although it’s designed to be accessible with minimal prerequisites (just high school math and a year of coding experience) we show how to match or better SoTA approaches in computer vision, natural language processing (NLP), time series and structured data analysis, and collaborative filtering. Therefore, we find our students have wildly varying backgrounds, including 14-year-old high school students, tenured professors of statistics, and successful Silicon Valley startup CEOs (as well as dairy farmers, accountants, Buddhist monks, and product managers).
We want our students to be able to solve their most challenging and important problems, to transform their industries and organizations, which we believe is the potential of deep learning. We are not just trying to teach people how to get existing jobs in the field — but to go far beyond that. Therefore, since we first ran our deep learning course, we have been constantly curating best practices, and benchmarking and developing many techniques, trialing them against Kaggle leaderboards and academic state-of-the-art results.
The 2018 course shows how to (all with no more than a single GPU and a few seconds to a few hours of computation):
Build world-class image classifiers with as little as 3 lines of code (in a Kaggle competition running during the preview class, 17 of the top 20 ranked competitors were fast.ai students!)
Greatly surpass the academic SoTA in NLP sentiment analysis
Build a movie recommendation system that is competitive with the best highly specialized models
Replicate top Kaggle solutions in structured data analysis problems
Build the 2015 imagenet winning resnet architecture and batch normalization layer from scratch
A unique approach
Our earlier 2017 course was very successful, with many students going on to put their new deep learning skills to work, including:
Karthik Kannan, founder of letsenvision.com, who told us “Today I’ve picked up steam enough to confidently work on my own CV startup and the seed for it was sowed by fast.ai with Pt.1 and Pt.2”
Matthew Kleinsmith and Brendon Fortuner, who in 24 hours built a system to add filters to the background and foreground of videos, giving them victory in the 2017 Deep Learning Hackathon.
The new course is 80% new material, reflecting the major developments that have occurred in the last 12 months in the field. However, the key educational approach is unchanged: a top-down, code-first approach to teaching deep learning that is unique. We teach “the whole game”, starting off by showing how to use a complete, working, very usable, state of the art deep learning network to solve real world problems, by using simple, expressive tools. We then gradually dig deeper and deeper into understanding how those tools are made, and how the tools that make those tools are made, and so on. We always teach through examples: ensuring that there is a context and a purpose that you can understand intuitively, rather than starting with algebraic symbol manipulation. For more information, have a look at Providing a Good Education in Deep Learning, one of the first things we wrote after fast.ai was launched.
A good example of this approach is shown in the article Fun with small image data-sets, where a fast.ai student shows how to use the teachings from lesson 1 to build near perfect classifiers with <20 training images.
A special community
Perhaps the most valuable resource of all is the flourishing forums, where thousands of students and alumni are discussing the lessons, their projects, the latest research papers, and more. This community has built many valuable resources, such as the helpful notes and guides provided by Reshama Shaikh, including:
Some forum threads have become important sources of information, such as Making your own server which now has over 500 posts full of valuable information, and Implementing Mask R-CNN which has been viewed by thousands of people that are interested in this cutting edge segmentation approach.
A new deep learning library
The new course is built on top of Pytorch, and uses a new library (called fastai) that we developed to make Pytorch more powerful and easy to use. I won’t go into too much detail about this now, since we discussed the motivation in detail in Introducing Pytorch for fast.ai and will be providing much more information about the implementation and API shortly. But to give a sense of what it means for students, here are some examples of fastai in action:
fastai is the first library to implement the Learning Rate Finder (Smith 2015) which solves the challenging problem of selecting a good learning rate for your model
Having found a good learning rate, you can train your model much faster and more reliably by utilizing Stochastic Gradient Descent with Restarts (SGDR) - again fastai is the first library to provide this feature
Once you’ve trained your model, you’ll want to view images that the model is getting wrong - fastai provides this feature in a single line of code
The same API can be used to train any kind of model with only minor changes - here are examples of training an image classifier, and a recommendation system:
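To make the restart idea behind SGDR concrete, here is a minimal sketch in plain Python of a cosine-annealed learning rate with warm restarts. It is not fastai's implementation or API; the function name and its parameters (cycle_len, cycle_mult, lr_max, lr_min) are chosen for this illustration. Within each cycle the learning rate falls from lr_max toward lr_min along half a cosine wave, then jumps back up to lr_max when the next cycle begins.

```python
import math

def sgdr_learning_rate(iteration, cycle_len, lr_max, lr_min=0.0, cycle_mult=1):
    """Cosine-annealed learning rate with warm restarts.

    iteration  -- zero-based training iteration
    cycle_len  -- number of iterations in the first cycle
    cycle_mult -- each cycle is this many times longer than the previous one
    """
    # Walk forward through the cycles to find the position within the current one.
    length = cycle_len
    pos = iteration
    while pos >= length:
        pos -= length
        length *= cycle_mult
    # Half a cosine wave: 1.0 at the start of the cycle, approaching 0.0 at the end.
    cosine = 0.5 * (1 + math.cos(math.pi * pos / length))
    return lr_min + (lr_max - lr_min) * cosine

# Two cycles of four iterations each: the rate decays, then restarts at lr_max.
schedule = [sgdr_learning_rate(i, cycle_len=4, lr_max=0.1) for i in range(8)]
for i, lr in enumerate(schedule):
    print(i, round(lr, 4))
```

Setting cycle_mult to 2 makes each cycle twice as long as the one before, so the restarts become less frequent as training goes on.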
The course also shows how to use Keras with Tensorflow, although it takes a lot more code and compute time to get a much lower accuracy in this case. The basic steps are very similar however, so students can rapidly switch between libraries by using the conceptual understanding developed in the lessons.
Taking your learning further
Part 2 of the 2018 course will be run in San Francisco for 7 weeks from March 19. If you’re interested in attending, please see the details on the in-person course web site.
From the time of our very first deep learning course at the USF Data Institute (which was recorded and formed the basis of our MOOC), we have allowed selected students that could not participate in person to attend via video and text chat through our International Fellowship. This International Fellowship, along with our Diversity Fellowship, has been an important part of our mission:
We want to get deep learning into the hands of as many people as possible, from as many diverse backgrounds as possible. People with different backgrounds have different problems they’re interested in solving. We have seen and experienced some of the obstacles facing outsiders: inequality, discrimination, and lack of access. We’ve also observed that the field of artificial intelligence is missing out because of its lack of diversity.
In fact, many of our strongest students and most effective projects have come from the International Fellowship. By opening up the opportunity to learn deep learning in a collaborative environment, students have been able to apply this powerful technology to local problems in their area. For instance, past International Fellows have worked to:
This year, we’re presenting an entirely new version of part 2 of our deep learning course, and today we’re launching the International Fellowship for it. The program allows those who cannot get to San Francisco to attend virtual classes for free during the same time period as the in-person class and provides access to all the same online resources. (Note that International Fellowships do not provide an official completion certificate through USF.) International fellows can come from anywhere on the planet (including from the USA) other than San Francisco, but need to be able to attend each class via Youtube Live at 6.30pm-9pm Pacific Time each Monday for 7 weeks from March 19, 2018 onwards. For many people that means getting up in the middle of the night, but our past students tell us it’s worth it!
The prerequisites are:
Familiarity with Python (including numpy), git, and bash
Available on Monday evenings (Pacific Standard Time) to attend via Youtube Live, from March 19 to April 30
Able to commit 10 hours a week of study to the course.
You can fulfill the requirement to be familiar with deep learning, the fastai library, and PyTorch by doing any 1 of the following:
You took the updated, in-person deep learning part 1 course during fall 2017
You are comfortable solving problems using Python and numpy, have watched the first 2 videos of the online course before you apply, and commit to working through all 7 lessons before the start of the course. We estimate that each lesson takes approximately 10 hours of study (so you would need to study for the 7 weeks prior to the course starting on March 19, for 10 hours each week).
You have previously taken the older version of the course (released last year) AND have watched the first 4 lessons of the new course to get familiar with the fastai library and PyTorch.
Title your email “International Fellowship Application”
Include your resume
Write 1 paragraph describing one or more problems you’d like to apply deep learning to
Describe how you fulfill the deep learning prerequisite (e.g. have already taken deep learning part 1, or have started part 1 and plan to finish it before part 2 starts)
State where you are located
Confirm that you can commit 8 hours a week to working on the course and that you are able to participate in each class via Youtube Live at 6.30pm-9pm Pacific Time each Monday for 7 weeks from March 19, 2018 onwards.
The deadline to apply is February 28, 2018. You will not be notified if you have been selected until AFTER the deadline.
Frequently Asked Questions
Will the updated course be released online? Yes, the course will be recorded and released after the in-person version ends.
Can I apply again if I was an international fellow last year? Yes, you are welcome to apply again.
Do I get a certificate of completion for the international fellowship? No, the USF Data Institute only awards certificates to graduates of the in-person course.
Dawit Haile fought against the odds when he decided to study computer science in Eritrea, East Africa, despite having no internet connectivity. His perseverance paid off, first landing a job with the Eritrean government department of education, and later as an engineer in Lithuania. Today, Dawit is a data scientist in the San Francisco Bay Area, and he credits this new job to the knowledge and experience he gained from fast.ai. On the side, he’s building an algorithm to translate between English and his native language of Tigrinya.
We had over 70 incredibly qualified applicants for the diversity scholarships, including senior software engineers, several start-up founders, a researcher who had published in Nature, and many who are active in teaching, volunteer, and community organizations. It was a delight to be able to offer as many scholarships as we could with the support of our sponsors. Here are the stories of some of our fellows.
Adriana Fuentes is co-founder and technical lead at a stealth startup and president of the Society of Hispanic Professional Engineers at Silicon Valley (SHPE). She is applying knowledge gained from fast.ai to building a small autonomous vehicle which she will use to engage students from low socioeconomic backgrounds with the field of AI, as part of her volunteer work with SHPE. Previously, she built large scale distributed systems and databases at Hewlett Packard and was an engineer for hybrid vehicles, navigation systems, and infotainment at Ford Motor Company.
Sarada Lee was an accountant with no programming experience when she first encountered machine learning at a hackathon in 2016 and came away fascinated. She taught herself to code and founded the Perth Machine Learning Group in Perth, Australia. What began as a small group of friends meeting in Sarada’s living room grew to a community of 280 members within a year. The group has worked through the online fast.ai course, won hackathons, attracted corporate sponsors, and hosted a number of speakers. Members have used image classification techniques on a utility project to potentially save millions of dollars. Sarada is now working on a new algorithm to read and understand large corpuses of documents, as well as developing new initiatives to help increase diversity in AI.
Tiffany Liu, a bioinformatics scientist researching brain tumor treatment, told us that the course provided hands-on help in her work building a multi-task neural network that simultaneously predicts both the tumor region and its associated clinical information.
We’re in awe of Dawit, Adriana, Tiffany, Sarada, Nahid, and all our diversity fellows, and we are so grateful to our sponsors for making this possible. While many are bemoaning a supposed “talent shortage” in AI, it is encouraging to see these companies and individuals take concrete action.
When I first moved to San Francisco in 2012, I was thrilled by how many startups there are here; the culture seemed so creative! Then I realized that most of the startups were indistinguishable from one another: nearly everyone was following the same destructive trends which are bad for employees and bad for products.
If you are working on a startup, I want you to know that there are options in how to do things. After working at several startups and watching friends found start-ups, I took the leap and started fast.ai, together with Jeremy Howard. We are unusual in many ways: we have no interest in growing our tiny team; we are allergic to traditional venture capital; and we don’t plan to hire any deep learning PhDs. Yet we are still having a big impact!
If you are going to avoid making the same mistakes that so many entrepreneurs have made, the first step is to be able to recognize them. I’ve identified 5 dominant narratives in Bay Area Tech start-ups that not only harm employees, but lead to weaker companies and worse products. This post offers a high-level overview, and I’ll dig into the trends in greater detail in future posts (adding links as I do so):
Venture Capital often pushes what could’ve been a successful small business to over-expand and ultimately fail; prevents companies from focusing their priorities; distracts from finding a monetization plan; causes conflict due to the misalignment of incentives between VCs and founders; and is full of far too many unethical bullies and thugs.
Hypergrowth is nearly impossible to manage and leads to communication failures, redundant work, burnout, and high employee attrition.
Trying to be “like a family” severely limits your pool of potential employees, leaves you unprepared for conflict or HR incidents, and sets employees up to feel betrayed.
Attempting to productionize a PhD thesis is rarely a good business plan. The priorities and values of academia and business are drastically different.
Hiring a bunch of academic researchers will not improve your product and harms your company by diverting so many resources (unless your goal is an acquihire).
I recognize that there are many startups following these trends that have high-valuations on paper. However, that does not mean that these companies will succeed in the long-term (we’ve already seen many highly valued, high profile startups fail in recent years).
Negative trend 1: Venture Capital
Imagine you were to create a business where you could profitably support yourself and 10 employees selling a product your customers liked, and after running it for 10 years you sold it for $10 million, of which half ended up in your pocket and half with your employees. Most VCs would consider that an abject failure. They are looking for at least 100x returns, because all of their profits come from the one or two best performers in their portfolio.
Therefore, VCs often push companies to grow too quickly, before they’ve nailed down product-market fit and monetization. Growing at a slow, sustainable rate helps keep your priorities in order. Funding yourself (through part-time consulting, saving up money in advance, and/or getting a simple product to market quickly) will force you to stay smaller and grow more slowly than VC funded businesses, but this is good. Staying small keeps you focused on a small number of high-impact features.
I have seen a lot of deeply unethical, bullying, and downright illegal behavior by venture capitalists against close friends of mine. This is not just a few bad actors: the behavior is widespread, including by many well-known and ultra-wealthy investors (although founders often don’t speak out about it because of fear of professional repercussions).
Negative trend 2: Hypergrowth
Hypergrowth typically involves chaos, inefficiency, and severe burn-out (none of which is good for your business). I’ve worked at several companies that have doubled in size in just a year. It was always painful and chaotic. Communication broke down. There was duplicate and redundant work. Company politics became increasingly destructive. Burnout was endemic and many people quit. In all cases, the quality of the product suffered.
Management is hard, and management of hypergrowth is an order of magnitude harder. So many start-ups work their employees into the ground for the sake of short-term growth. Burnout is a very real and expensive problem in the tech industry, and hypergrowth routinely leads to burnout.
Negative trend 3: “Our startup is like a family”
Many startups claim that they’re creating a family-like culture amongst their employees: they don’t just work together, they go out after work, share the same hobbies, and are best friends. Doing this severely limits your pool of potential employees. Employees with health problems, long commutes, families, outside hobbies, outside friendships, or from under-represented groups may all struggle to thrive in such a culture.
Secondly, you are making a promise you can’t keep, which sets people up for feeling betrayed. You’re not actually a family; you are a company. You will need to make hard decisions for the sake of the business. You can’t actually offer people anything remotely close to lifelong loyalty or security, and it’s dishonest to implicitly do so.
Negative trend 4 (AI specific): Productionizing your PhD thesis
The best approach to starting a start-up is to address a problem that people in the business world have. Your PhD thesis is not doing this, and it is highly unlikely that it will give you a competitive edge. You and your adviser picked your thesis topic because it’s an interesting technical problem with good opportunities to publish, not because it has a large opportunity for impact in an underserved market with few barriers to entry.
In the business world, products are not evaluated on underlying theoretical novelty, but on implementation, ease-of-use, effectiveness, and how they relate to revenues.
Negative trend 5 (AI specific): Hiring a bunch of PhDs
You almost certainly do not need a bunch of PhDs. There are so many things that go into a successful product beyond the algorithm: the product-market fit, software engineering that productionizes and deploys it, the act of selling it, supporting your users, etc. And even for highly technical aspects like deep learning, fast.ai has shown that people with 1-year of coding experience can become world-class deep learning practitioners; you don’t need to hire Stanford PhDs. By diverting valuable resources into academic research at your startup, you are hurting the product.
Fast.ai is solving a problem that I experienced first-hand: how hard it can be to break into deep learning and gain practical AI knowledge if you don’t have the “right” background and didn’t train with the academic stars of the field. I have seen and experienced some of the obstacles facing outsiders: inequality, discrimination, and lack of access.
I grew up in Texas (not in a major city) and attended a poor, predominantly Black public high school that was later ranked in the bottom 2% of Texas schools. We had far fewer resources and opportunities compared to the wealthier, predominantly White schools around us. In graduate school, the sexism and harassment I experienced led me to abandon my dreams of becoming a math professor, although I then experienced similar problems in the tech industry. When I first became interested in deep learning in 2013, I found that experts weren’t writing down the practical methods they used to actually get their research to work, instead just publishing the theory. I believe deep learning will have a huge impact across all industries, and I want the creators of this technology to be a more diverse and less exclusive group.
With fast.ai, I’m finally able to do work completely in line with my values, on a tiny team characterized by trust and respect. Having a small team forces us to prioritize ruthlessly, and to focus only on what we value most or think will be highest impact. Something that has surprised me with fast.ai is how much I’ve been able to invest in my own career and own skills, in ways that I never could in previous jobs. Jeremy and I are committed to fast.ai for the long term, so neither of us has any interest in burning out. We believe you can have an impact with your work, without destroying your health and relationships.
I’d love to see more small companies building useful products in a healthy and sustainable way.