Advice to Medical Experts Interested in AI


Rachel Thomas


August 21, 2017

This week’s Ask-A-Data-Scientist column is from a medical doctor. Email your data science advice questions to [email protected].

Q: I’m a medical doctor (MD-PhD). I do a mix of clinical work and basic science research. My research primarily involves small scale animal studies for hypothesis testing, although others in my lab do some statistical clinical studies, such as paired cohort analysis. I’m interested in AI and wondering if and how it can be applied to my field?

A: AI is being applied to several fields of medicine, including:

Jeremy Howard, photographed by Jason Henry for the New York Times

Does this mean I need “big data”? No.

Currently, when news articles talk about “AI”, they are often referring to deep learning, one particular family of algorithms.

Although the above examples involve relatively large datasets, deep learning is being effectively applied to smaller and smaller datasets all the time. Here are a few examples that I listed in a previous blog post:

Francois Chollet, creator of the popular deep learning library Keras and now at Google Brain, has an excellent tutorial entitled “Building powerful image classification models using very little data”, in which he trains an image classifier on only 2,000 training examples.

At Enlitic, Jeremy Howard led a team that used just 1,000 examples of lung CT scans with cancer to build an algorithm that was more accurate at diagnosing lung cancer than a panel of 4 expert radiologists.

The C++ library Dlib has an example in which a face detector is accurately trained using only 4 images, containing just 18 faces!

Face Recognition with Dlib

Student Ben Bowles wrote a post on how data platform Quid uses deep learning with small data to improve the quality of one of their data sets.

Transfer learning is a powerful technique in which models trained on larger data sets (by teams with more computational resources) can be fine-tuned to fit other problems, with less data and less computation. For instance, models originally trained on ImageNet (images in a set of 1,000 categories) are a good starting point for other computer vision problems (such as analyzing a CT scan for lung cancer). Transfer learning is a major focus of our Practical Deep Learning for Coders course.
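To make the mechanics concrete, here is a toy sketch of the idea in plain NumPy. Everything in it is an assumption made for illustration: the data is synthetic, the network has a single small hidden layer, and the helper names (`make_task`, `train`, `accuracy`) are hypothetical. In practice you would start from a real pretrained model (for instance, ImageNet weights loaded through a library such as Keras) rather than pretraining anything yourself. The sketch only shows the two steps: pretrain on a large "source" dataset, then freeze the learned layer and train a new output layer on a small "target" dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_task(n, w_true, mixing):
    """Synthetic binary task whose labels depend on features tanh(X @ mixing)."""
    X = rng.normal(size=(n, mixing.shape[0]))
    y = (np.tanh(X @ mixing) @ w_true > 0).astype(float)
    return X, y

def train(X, y, W1, w2, lr=0.5, epochs=300, freeze_base=False):
    """Full-batch gradient descent on a one-hidden-layer net with log-loss.

    If freeze_base is True, the hidden layer W1 is left untouched and only
    the output layer w2 is updated.
    """
    W1, w2 = W1.copy(), w2.copy()
    for _ in range(epochs):
        H = np.tanh(X @ W1)                 # hidden features, shape (n, h)
        p = sigmoid(H @ w2)                 # predicted probabilities, shape (n,)
        err = (p - y) / len(y)              # gradient of mean log-loss w.r.t. logits
        grad_w2 = H.T @ err
        if not freeze_base:
            grad_H = np.outer(err, w2) * (1.0 - H ** 2)
            W1 -= lr * (X.T @ grad_H)
        w2 -= lr * grad_w2
    return W1, w2

def accuracy(X, y, W1, w2):
    preds = sigmoid(np.tanh(X @ W1) @ w2) > 0.5
    return float(np.mean(preds == (y > 0.5)))

d, h = 10, 8
mixing = rng.normal(size=(d, h))            # structure shared by both tasks
w_source = rng.normal(size=h)
w_target = rng.normal(size=h)

# Step 1: pretrain on a large source dataset (stand-in for ImageNet).
X_src, y_src = make_task(5000, w_source, mixing)
W1_init = rng.normal(scale=0.1, size=(d, h))
w2_init = rng.normal(scale=0.1, size=h)
W1_pre, _ = train(X_src, y_src, W1_init, w2_init)

# Step 2: reuse the pretrained layer, frozen, and fit a fresh output layer
# on only 40 labeled target examples.
X_tgt, y_tgt = make_task(40, w_target, mixing)
W1_frozen, w2_new = train(X_tgt, y_tgt, W1_pre, np.zeros(h), freeze_base=True)

# Evaluate on held-out target data.
X_test, y_test = make_task(1000, w_target, mixing)
target_acc = accuracy(X_test, y_test, W1_frozen, w2_new)
print(f"target accuracy with a frozen pretrained layer: {target_acc:.3f}")
```

Freezing the pretrained layer means the small dataset only has to fit a handful of output weights, which is why this approach needs less data and less computation. Fine-tuning in the fuller sense would also update the pretrained layer, usually with a small learning rate.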

How does machine learning compare to hypothesis testing or paired cohort analysis?

Machine learning shines at handling messy data. Techniques such as controlled studies or paired cohort analysis rely on carefully controlling for different variables in your experiment set-up (or in finding the pairings), whereas machine learning is an excellent choice when this isn’t possible.

Random Forests

Deep learning is just one kind of machine learning. Another machine learning algorithm is the random forest, which is great for observational studies.

The original random forest paper successfully tested the algorithm on a number of small data sets, including 569 images of a fine needle aspirate of a breast cancer mass, data from 345 men with liver disorders, and 336 E. coli samples.
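To give a feel for how little code this takes, here is a minimal sketch of fitting a random forest to a small medical dataset. The choices are mine, made for illustration: it uses Python with scikit-learn and the breast cancer dataset bundled with that library (569 samples with features computed from fine needle aspirate images), which is not necessarily the exact data used in the original paper. The same experiment can be run in R.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small tabular dataset that ships with scikit-learn (no download needed).
data = load_breast_cancer()
X, y = data.data, data.target

# Hold out a quarter of the samples to check the model on unseen cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# oob_score=True also estimates accuracy from the samples each tree did not
# see during training, which is handy when data is scarce.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_accuracy = forest.oob_score_
test_accuracy = forest.score(X_test, y_test)
print(f"out-of-bag accuracy: {oob_accuracy:.3f}")
print(f"held-out accuracy:   {test_accuracy:.3f}")

# Which measurements did the forest rely on most?
ranked = sorted(zip(forest.feature_importances_, data.feature_names), reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```

Note that there is no feature scaling or careful variable selection here. A random forest copes with raw, mixed-scale measurements, which is part of what makes it a good fit for observational data.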

Getting started

If you are at a university, seek out collaborators or interns. Note that a student who knows how to code can learn deep learning, so you do not need to find a deep learning expert. All you need is someone with a year or two of coding experience who is interested in your project and wants to learn deep learning. It’s even better if they are familiar with your research (for instance, a student who is already working in your lab and knows how to code).

I recommend that you learn to code. Even if it’s not your area of focus and you will be collaborating with programmers, knowing some code will help you better understand what’s possible and have a better sense of what the programmers you’re collaborating with are doing.

For doctors, I think the best way to start coding is by learning R. R has the easiest-to-use implementation of random forests (a great general-purpose machine learning algorithm), and it is commonly used by statisticians, so you will most likely meet biostatisticians who use it. RStudio is a relatively user-friendly, free environment in which to use R (although it still requires writing code). This free Coursera class, taught by biostatisticians from Johns Hopkins University, is one good way to get started. Folks who know me may be surprised by this recommendation: in general, I recommend that people interested in becoming data scientists learn Python; I recommend that teenagers or anyone who likes art or games learn JavaScript; and now I’m recommending that doctors learn R. You will need to learn Python if you start learning deep learning, but random forests in R are a great place to get started with machine learning (and random forests still produce top-quality results in many areas; they are not just for beginners!).