Adding Data Science to your College Curriculum
27 Feb 2018 Rachel Thomas
We keep hearing that data scientist is the hottest job of the 21st century and that there is a cross-industry shortage of employees with enough data skills, yet the idea of studying data science in school is still very new. What should colleges and universities teach about the topic? Is it just an adaptation of existing math, statistics, or computer science courses? Would these classes be of interest to any non-math majors? Is there enough material to create a minor?
A math professor at a small university (who is also an old friend from grad school) recently asked me these questions, and I thought I would tackle them here for my latest advice column.
What is data science?
Data science refers to a wide collection of skills needed to ask and answer questions using data. The title of data scientist actually covers several distinct roles: business analyst, data analyst, machine learning engineer, data pipeline engineer, modeling researcher, etc.
Data scientists need to know how to:
- load, clean, and inspect data
- plot and do exploratory analysis
- formulate questions and test hypotheses
- write code
- communicate their results to people who are not data scientists.
In addition, some data scientists do more specialized tasks, such as building machine learning models.
Of the schools that are starting to create data science programs, many use a mix of existing mathematics, statistics, and computer science courses. However, I didn’t learn the most useful data science skills in these fields when I was a student. (I was a math major and CS minor, and I earned my PhD in a field related to probability. My goal had been to become a theoretical math professor, and I did not start learning any practical skills until the end of my PhD.) Looking through the selection of math courses offered at my friend’s university, none jumps out as particularly useful for data science.
Although data science is related to math, computer science, and statistics, I definitely recommend designing new courses (or at least new units of material), and not trying to shoehorn existing courses into this role.
Most Useful Things to Learn
Python or R. I’m inclined towards Python since it is a very general language with libraries for many different purposes (e.g. if a student decided to become a software engineer instead, Python would be much more useful than R). However, R is nice too and is widely used in the academic statistics community. When learning Python for data science, you should learn at least:
- Pandas: A library for working with tables of data.
- Matplotlib: A library for plotting data.
- NumPy: A library used for nearly all data crunching in Python.
SQL is a language used for interacting with tabular data (data that appears in tables with rows and columns), and particularly relational data (data in multiple related tables, such as customers and orders). It is widely used, and because it is highly specialized, it is quicker to learn than most programming languages. SQL is a highly employable skill. The skills to learn are how to write queries and joins, what keys are, and how to design database schemas. SQL should be learned regardless of whether you choose R or Python.
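To make the customers-and-orders example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and rows are all made up for illustration; the point is only to show a primary key, a foreign key, and a join in one place.

```python
import sqlite3


def total_spent_per_customer():
    """Build a tiny in-memory database and join two related tables."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Schema: customers.id is the primary key; orders.customer_id is a
    # foreign key that points back at it.
    cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    cur.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "customer_id INTEGER NOT NULL REFERENCES customers(id), "
        "amount REAL NOT NULL)"
    )

    # Hypothetical sample rows.
    cur.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace"), (3, "Edsger")],
    )
    cur.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(1, 20.0), (1, 5.0), (2, 12.5)],
    )

    # LEFT JOIN keeps customers who have no orders; COALESCE turns their
    # NULL total into 0.
    rows = cur.execute(
        "SELECT c.name, COALESCE(SUM(o.amount), 0) AS total "
        "FROM customers AS c "
        "LEFT JOIN orders AS o ON o.customer_id = c.id "
        "GROUP BY c.id, c.name "
        "ORDER BY c.id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, total in total_spent_per_customer():
        print(name, total)
```

Swapping LEFT JOIN for a plain JOIN would drop the customer with no orders, which is a useful first experiment for students learning how joins behave.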
Jupyter Notebooks provide an interactive environment that can include code, data, plots, text, and LaTeX equations. They are a great tool for both teaching and for doing data science in the workplace. Many textbooks are now being released as Jupyter Notebooks, such as the ones in this interesting gallery. I typically run Python within Jupyter notebooks.
Exploratory data analysis includes loading and inspecting data, creating plots, checking what type different variables are, and dealing with missing values.
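As a rough illustration of those steps (minus the plotting), here is a sketch that uses only Python's standard library so it runs anywhere. In practice you would more likely reach for pandas. The sample data, column names, and the choice to fill missing values with the median are all assumptions made for the example.

```python
import csv
import io
import statistics

# Made-up sample data; the empty fields stand in for missing values.
SAMPLE_CSV = """name,age,city
Ana,34,Lisbon
Ben,,Oslo
Chen,29,
Dee,41,Lima
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_counts(rows):
    """Count empty fields in each column."""
    return {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}


def looks_numeric(rows, col):
    """Guess whether a column is numeric by parsing its non-missing values."""
    values = [r[col] for r in rows if r[col] != ""]
    try:
        for v in values:
            float(v)
    except ValueError:
        return False
    return bool(values)


def fill_missing_with_median(rows, col):
    """Return the column as floats, with missing entries set to the median."""
    present = [float(r[col]) for r in rows if r[col] != ""]
    median = statistics.median(present)
    return [float(r[col]) if r[col] != "" else median for r in rows]


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    print(missing_counts(rows))
    print({col: looks_numeric(rows, col) for col in rows[0]})
    print(fill_missing_with_median(rows, "age"))
```

Filling with the median is only one option; dropping the rows or flagging them in a separate column can be better choices depending on why the values are missing.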
Machine learning is about using data to make predictions (whether that is predicting sales, identifying cancer on a CT scan, or Google Maps identifying house numbers from photographs). The most vital concept is the idea of having a held-out test set. A great algorithm to start with is ensembles of decision trees.
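The held-out test set idea can be sketched in a few lines of plain Python. The data below is synthetic, and the model is a single one-split decision tree (a "stump"), not an ensemble; it is only there so that there is something to evaluate. The function names are hypothetical.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_stump(train):
    """Pick the threshold on x with the best accuracy on the training set.

    The stump predicts label 1 when x > threshold, otherwise 0.
    """
    best_threshold, best_correct = None, -1
    for threshold, _ in train:
        correct = sum((x > threshold) == bool(y) for x, y in train)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, rows):
    """Fraction of rows the stump labels correctly."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)


def make_toy_data(n=200, seed=1):
    """Synthetic data: label is 1 when x > 5, with roughly 10% of labels flipped."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = int(x > 5)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data


if __name__ == "__main__":
    train, test = train_test_split(make_toy_data())
    threshold = fit_stump(train)  # the model never sees the test rows
    print("train accuracy:", accuracy(threshold, train))
    print("test accuracy:", accuracy(threshold, test))
```

The number to report is the test accuracy, since the training accuracy was used to choose the threshold and so tends to look better than the model really is.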
Ethics should be included as an integral part of all data science courses, and not as a separate course. Case studies are particularly useful and I touch on several in this post, as well as linking to a number of course syllabi and other resources.
Working on a project from start to finish: designing a problem, running experiments, and writing them up. One resource is Jeremy’s article on designing great data products. Thinking about data quality and verification is part of this process. Microsoft’s racist chatbot Tay, which had to be discontinued less than a day after it was released when it began spouting Nazi rhetoric, provides a case study of not giving enough thought to the input data. Working on a project could also include productionizing it by building a simple web app (for example, with a framework such as Python’s Flask).
Curriculum to check out:
- UC Berkeley’s Foundations of Data Science free online textbook
- Python for Data Analysis by Wes McKinney
- Python Data Science Handbook by Jake VanderPlas
I asked on Twitter what people’s favorite introductory data science resources are, and I was overwhelmed (in a good way!) by the responses. There are too many to list, but feel free to check them out:
Question: what are your favorite "intro to data science" courses/blogs/websites? (Rachel Thomas, @math_rachel, February 25, 2018)
What about “big data”?
My friend’s question used the term big data, but I chose to interpret this as being a question about data science. The marketing blitz around big data has been harmful, in that it misleadingly suggests that it is the size of your data set that matters. In many cases, folks with big data solutions are left searching for a problem to apply their technology to.
In most of data science (including artificial intelligence), far less data is needed than many people realize. One of our students created a model to distinguish pictures of cricket from pictures of baseball using just 30 training images! Even when you have a large data set, I recommend working on a smaller subset (until you are almost finished), since that will allow you to iterate much more quickly as you experiment. Also, what was considered “big data” a few years ago is now considered normal, and this trend is continuing all the time as technology advances.
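Working on a subset can be as simple as drawing a fixed random sample once and reusing it while you experiment. Here is a small sketch; the function name and the fixed seed are my own choices for illustration.

```python
import random


def sample_subset(rows, k, seed=0):
    """Return k rows chosen uniformly at random, without replacement.

    A fixed seed gives the same subset on every run, so experiments
    stay comparable while you iterate.
    """
    if k >= len(rows):
        return list(rows)
    rng = random.Random(seed)
    return rng.sample(rows, k)


if __name__ == "__main__":
    full_data = list(range(100_000))  # stand-in for a large data set
    small = sample_subset(full_data, 1_000)
    print(len(small), "of", len(full_data), "rows")
```

Once the approach works on the subset, the same code can be rerun on the full data by passing the whole data set instead of the sample.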
Not just for math majors
A data science minor would be valuable across a range of disciplines: pre-med, economics, sociology, business, biology, and more. People are using data analysis to study everything from art curation to Japanese calligraphy.
Foundations of Data Science is the fastest-growing course ever at UC Berkeley, with 1,500 students from 60 different majors taking it in fall 2017. I can see a future in which college students from all majors take at least one or two data science courses (or in which it becomes mandatory, just like basic reading and writing literacy). Data literacy will continue to increase in importance both in the workplace and in society at large. I am excited to hear that more universities are beginning to add data science to their curriculums!
This post is part of my ask-a-data-scientist advice column. Here are some of the previous posts:
- How to change careers and become a data scientist
- Advice to parents to encourage your child in STEM
- How to structure your data science and engineering teams
- Does Machine Learning as a Service (MLaaS) work? Do you need a PhD?
- How to make peace with personal branding
- Advice to medical experts interested in AI