fast.ai Embracing Swift for Deep Learning

Note from Jeremy: If you want to join the next deep learning course at the University of San Francisco, discussed below, please apply as soon as possible because it’s under 2 weeks away! You can apply here. At least a year of coding experience, and deep learning experience equivalent to completing Practical Deep Learning for Coders is required.

Today at the TensorFlow Dev Summit we announced that two lessons in our next course will cover Swift for TensorFlow. These lessons will be co-taught with the inventor of Swift, Chris Lattner; together, we’ll show in the class how to take the first steps towards implementing an equivalent of the fastai library in Swift for TensorFlow. We’ll be showing how to get started programming in Swift, and explain how to use and extend Swift for TensorFlow.

Last month I showed that Swift can be used for high performance numeric computing (that post also has some background on what Swift is, and why it’s a great language, so take a look at that if you haven’t read it before). In my research on this topic, I even discovered that Swift can match the performance of hand-tuned assembly code from numerical library vendors. But I warned that: “Using Swift for numeric programming, such as training machine learning models, is not an area that many people are working on. There’s very little information around on the topic”.

So, why are we embracing Swift at this time? Because Swift for TensorFlow is the first serious effort I’ve seen to incorporate differentiable programming deep into the heart of a widely used language that is designed from the ground up for performance.

Our plans for Swift at fast.ai

The combination of Python, PyTorch, and fastai is working really well for us, and for our community. We have many ongoing projects using fastai for PyTorch, including a forthcoming new book, many new software features, and the majority of the content in the upcoming courses. This stack will remain the main focus of our teaching and development.

It is very early days for Swift for TensorFlow. We definitely don’t recommend anyone tries to switch all their deep learning projects over to Swift just yet! Right now, most things don’t work. Most plans haven’t even been started. For many, this is a good reason to skip the project entirely.

But for me, it’s a reason to jump in! I love getting involved in the earliest days of projects that I’m confident will be successful, and helping our community to get involved too. Indeed, that’s what we did with PyTorch, including it in our course within a few weeks of its first pre-release version. People who are involved early in a project like this can have a big influence on its development, and soon enough they find themselves the “insiders” in something that’s getting big and popular!

I’ve been looking for a truly great numerical programming language for over 20 years now, so for me the possibility that Swift could be that language is hugely exciting. There are many project opportunities for students to pick something that’s not yet implemented in Swift for TensorFlow, and submit a PR implementing and testing that functionality.

Python: What’s missing

In the last three years, we’ve switched between many different deep learning libraries in our courses: Theano, TensorFlow, Keras, PyTorch, and of course our own fastai library. But they’ve all had one thing in common: they are Python libraries. This is because Python is today the language that’s used in nearly all research, teaching, and commercial applications of deep learning. To be a deep learning practitioner and use a language other than Python means giving up a vast ecosystem of interconnected libraries, or else using Python’s libraries through clunky inter-language communication mechanisms.

But Python is not designed to be fast, and it is not designed to be safe. Instead, it is designed to be easy, and flexible. To work around the performance problems of using “pure Python” code, we instead have to use libraries written in other languages (generally C and C++), like numpy, PyTorch, and TensorFlow, which provide Python wrappers. To work around the problem of a lack of type safety, recent versions of Python have added type annotations that optionally allow the programmer to specify the types used in a program. However, Python’s type system is not capable of expressing many types and type relationships, does not do any automated typing, and can not reliably check all types at compile time. Therefore, using types in Python requires a lot of extra code, but falls far short of the level of type safety that other languages can provide.
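As a small, hypothetical illustration of those limits: the annotations below document the intent, but CPython never checks them, so an ill-typed call only surfaces (if at all) at runtime, or when you run an external checker such as mypy.

from typing import List

def scale(v: List[float], k: float) -> List[float]:
    # The annotations state the intent, but CPython never enforces them
    return [k * x for x in v]

print(scale(["a", "b"], 2))  # no compile-time error; silently returns ['aa', 'bb']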

The C/C++ libraries that are at the heart of nearly all Python numeric programming are also a problem for both researchers, and for educators. Researchers can not easily modify the underlying code, or inspect it, since it requires a whole different toolbox—and in the case of libraries like MKL and cudnn the underlying code is optimized machine language. Educators cannot easily show students what’s really going on in a piece of code, because the normal Python-based debugging and inspection approaches can not handle libraries in other languages. Developers struggle to profile and optimize code where it crosses language boundaries, and Python itself can not properly optimize code that crosses language or library boundaries.

For instance, we’ve been doing lots of research into different types of recurrent neural network architectures and normalization layers. In both cases, we haven’t been able to get the same level of performance that we see in pure CUDA C implementations, even when using PyTorch’s fantastic new JIT compiler.

At the PyTorch Dev Summit last year I participated in a panel with Soumith Chintala, Yangqing Jia, Noah Goodman, and Chris Lattner. In the panel discussion, I said that: “I love everything about PyTorch, except Python.” I even asked Soumith “Do you think we might see a ‘SwifTorch’ one day?” At the time, I didn’t know that we might be working with Swift ourselves so soon!

So what now?

In the end, anything written in Python has to deal with one or more of the following:

  • Being run as pure Python code, which means it’s slow
  • Being a wrapper around some C library, which means it’s hard to extend, can’t be optimized across library boundaries, and hard to profile and debug
  • Being converted into some different language (such as PyTorch using TorchScript, or TensorFlow using XLA), which means you’re not actually writing in the final target language, and have to deal with the mismatch between the language you think you’re writing and the language that’s really being used (with at least the same debugging and profiling challenges as using a C library). A minimal sketch of this conversion step is shown below.
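To make that last bullet concrete, here’s a minimal sketch (not fastai or course code) of the TorchScript conversion step: the decorated function is compiled to TorchScript, so the code that actually runs is no longer the Python you wrote.

import torch

@torch.jit.script
def fused_relu(x: torch.Tensor) -> torch.Tensor:
    # Compiled to TorchScript; this body is what gets converted, not what runs as Python
    return torch.clamp(x, min=0.0)

print(fused_relu(torch.randn(3)))
print(fused_relu.code)  # the TorchScript that is really executed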

On the other hand, Swift is very closely linked with its underlying compiler infrastructure, LLVM. In fact, Chris Lattner has described it before as “syntactic sugar for LLVM”. This means that code written in Swift can take full advantage of all of the performance optimization infrastructure provided by LLVM. Furthermore, Chris Lattner and Jacques Pienaar recently launched the MLIR compiler infrastructure project, which has the potential to significantly improve the capabilities of Swift for TensorFlow.

Our hope is that we’ll be able to use Swift to write every layer of the deep learning stack, from the highest level network abstractions all the way down to the lowest level RNN cell implementation. There would be many benefits to doing this:

  • For education, nothing is mysterious any more; you can see exactly what’s going on in every bit of code you use
  • For research, nothing is out of bounds; whatever you can conceive of, you can implement, and have it run at full speed
  • For development, the language helps you; your editor will deeply understand your code, doing intelligent completions and warning you about problems like tensor mismatches, your profiler will show you all the steps going on so you can find and fix performance problems, and your debugger will let you step all the way to the bottom of your call stack
  • For deployment, you can deploy the exact same code that you developed on your laptop. No need to convert it to some arcane format that only your deep learning server understands!

In conclusion

For education, our focus has always been on explaining the concepts of deep learning, and the practicalities of actually using this tool. We’ve found that our students can very easily (within a couple of days) switch to being productive in a different library, as long as they understand the foundations well, and have practiced applying them to solve real problems.

Our Python fastai library will remain the focus of our development and teaching. We will, however, be doing lots of research using Swift for TensorFlow, and if it reaches the potential we think it has, expect to see it appearing more and more in future courses! We will be working to make practical, world-class, deep learning in Swift as accessible as possible—and that probably means bringing our fastai library (or something even better!) to Swift too. It’s too early to say exactly what that will look like; if you want to be part of making this happen, be sure to join the upcoming class, either in person at the University of San Francisco, or in the next part 2 MOOC (coming out June 2019).

A Conversation about Tech Ethics with the New York Times Chief Data Scientist

Note from Rachel: Although I’m excited about the positive potential of tech, I’m also scared about the ways that tech is having a negative impact on society, and I’m interested in how we can push tech companies to do better. I was recently in a discussion during which New York Times chief data scientist Chris Wiggins shared a helpful framework for thinking about the different forces we can use to influence tech companies towards responsibility and ethics. I interviewed Chris on the topic and have summarized that interview here.

In addition to having been Chief Data Scientist at the New York Times since January 2014, Chris Wiggins is professor of applied mathematics at Columbia University, a founding member of Columbia’s Data Science Institute, and co-founder of HackNY. He co-teaches a course at Columbia on the history and ethics of data.

Ways to Influence Tech Companies to be More Responsible and More Ethical

Chris has developed a framework showing the different forces acting on and within tech companies:

  1. External Forces
    1. Government Power
      1. Regulation
      2. Litigation
      3. Fear of regulation and litigation
    2. People Power
      1. Consumer Boycott
      2. Data Boycott
      3. Talent Boycott
    3. Power of Other Companies
      1. Responsibility as a value-add
      2. Direct interactions, such as de-platforming
      3. The Press
  2. Internal Forces
    1. How we define ethics
      One key example: the Belmont Principles
      1. Respect for persons
      2. Beneficence
      3. Justice
    2. How we design for ethics
      1. Starts with leadership
      2. Includes the importance of monitoring user experience

The two big categories are internal forces and external forces. Chris shared that at the New York Times, he’s seen the internal process through the work of a data governance committee on responsible data stewardship. Preparing for GDPR helped focus those conversations, as well as wanting to be proactive in preparing for other data regulations that Chris and his team expect are coming in the future. The New York Times has standardized processes, including for data deletion and for protecting personally identifiable information (PII). For instance, any time you are storing aggregate information, e.g. about page views, you don’t need to keep the PII of the individuals who viewed the page.
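As a toy illustration of that last point (hypothetical data and column names, not the Times’ actual pipeline): once page views are aggregated, the identifier column never needs to be stored.

import pandas as pd

# Hypothetical raw page-view log containing PII (a user identifier)
views = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "page":    ["/home", "/home", "/about", "/home"],
})

# Keep only the aggregate; the user_id column is dropped by the aggregation
page_counts = views.groupby("page").size().rename("views").reset_index()
print(page_counts)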

How to Impact Tech Companies: External Forces

Chris cites Doing Capitalism in the Innovation Economy by Bill Janeway as having been influential on his thinking about external forces impacting companies. Janeway writes of an unstable game amongst three players: government, people, and companies. Government power can take the form of regulations and litigation, or even just the fear of regulation and litigation.

A second external force is people power. The most well-known example of exercising people power is a consumer boycott: not giving companies money when we disagree with their practices. There is also a data boycott, not giving companies access to our data, and a talent boycott, refusing to work for them. In the past year, we have seen engineers from Google (protesting the Project Maven military contract and a censored Chinese search engine), Microsoft (protesting the HoloLens military contract), and Amazon (protesting facial recognition technology sold to police) start to exercise this power to ask for change. However, most engineers and data scientists still do not realize the collective power that they have. Engineers and data scientists are in high demand, and they should be leveraging this to push companies to be more ethical.

A third type of external power is the power of other companies. For instance, companies can make responsibility part of the value-add that differentiates them from competitors. The search engine DuckDuckGo (motto: “The search engine that doesn’t track you”) has always made privacy a core part of their appeal. Apple has become known for championing user privacy in recent years. Consumer protection was a popular idea in the 70s and 80s, yet has somewhat fallen out of favor. More companies could make consumer protection and responsibility part of their products and what differentiates them from competitors.

Companies can also exert power in direct ways on one another, for instance, when Apple de-platformed Google and Facebook by revoking their developer certificates after they violated Apple’s privacy policies. And finally, the press counts as a company that influences other companies. There are many interconnections here: the press influences people as citizens, voters, and consumers, which then impacts the government and the companies directly.

Internal Forces: Defining Ethics vs Designing for Ethics

Chris says it is important to distinguish between how we define ethics and how we design for ethics. Conversations quickly get muddled when people jump between these two.

Defining ethics involves identifying your principles. There is a granularity to ethics: we need principles to be granular enough to be meaningful, but not so granular that they are context-dependent rules which change all the time. For example, “don’t be evil” is too broad to be meaningful.

Ethical principles are distinct from their implementation, and defining principles means being willing to commit to do the work to define more specific rules that follow from these principles, or to redefine them in a way more consistent with your principles as technology or context changes. For instance, many ethics principles were laid out in the U.S. Bill of Rights, and we’ve spent the centuries since working out what they mean in practice.

Designing for Ethics

In terms of designing for ethics, Chris notes that this needs to start from the top. Leaders set company goals, which are then translated into objectives; those objectives are translated into KPIs; and those KPIs are used in operations. Feedback from operations and KPIs should be used to continually reflect on whether the ethical principles are being defended or challenged, or to revisit ways that the system is falling short. One aspect of operations that most major tech companies have neglected is monitoring user experience, particularly deleterious user experiences. When companies use contractors for content moderation (as opposed to full-time employees), it says a lot about the low priority they place on negative user experiences. While content moderation addresses one component of negative user experiences, there are also many others.

Chris wrote about this topic in a blog post, Ethical Principles, OKRs, and KPIs: what YouTube and Facebook could learn from Tukey, saying that “Part of this monitoring will not be quantitative. Particularly since we can not know in advance every phenomenon users will experience, we can not know in advance what metrics will quantify these phenomena. To that end, data scientists and machine learning engineers must partner with or learn the skills of user experience research, giving users a voice.”

Learning from the Belmont Report and IRBs

The history of ethics is not discussed enough and many people aren’t very familiar with it. We can learn a lot from other fields, such as human subject research. In the wake of the horrifying and unethical Tuskegee Syphilis Study, the National Research Act of 1974 was passed and researchers spent much time identifying and formalizing ethical principles. These are captured in the Belmont Principles for ethical research on human subjects. This is the ethical framework that informed the later creation of institutional review boards (IRBs), which are used to review and approve research involving human subjects. Studying how ethics have been operationalized via the Belmont principles can be very informative, as they have for almost 40 years been stress-tested via real-world implementations, and there is copious literature about their utility and limitations.

The core tenets of the Belmont Principles can be summarized:

  1. Respect for Persons:
    1. informed consent;
    2. respect for individuals’ autonomy;
    3. respect individuals impacted;
    4. protection for individuals with diminished autonomy or decision making capability.
  2. Beneficence:
    1. Do not harm;
    2. assess risk.
  3. Justice:
    1. equal consideration;
    2. fair distribution of benefits of research;
    3. fair selection of subjects;
    4. equitable allocation of burdens.

Note that the principle of beneficence can be used to make arguments in which the ends justify the means, and the principle of respect for persons can be used to make arguments in which the means justify the ends, so there is a lot captured here.

This topic came up in the national news after Kramer et al., the 2014 paper in which Facebook researchers manipulated users’ moods, which received a lot of criticism and concern. There was a follow-up paper by two other Facebook researchers, Evolving the IRB: Building Robust Review for Industry Research, which suggests that a form of IRB has now been implemented at Facebook. I have learned a lot from studying the work of ethicists on research with human subjects, particularly the Belmont Principles.

For those interested in learning more about this topic, Chris recommends Chapter 6 of Matthew Salganik’s book, Bit by Bit: Social Research in the Digital Age, which Chris uses in the course on the history and ethics of data that he teaches at Columbia. Salganik is a Princeton professor who does computational social science research, and is also professor in residence at the New York Times.

Chris also says he has learned a lot from legal theorists. Most engineers may not have thought much about legal theorists, but they have a long history of addressing the balance between standards, principles, and rules.

High Impact Areas

Ethics is not an API call. It needs to happen at a high level. The greatest impact will come when leaders take it seriously. The level of the person in the org chart who speaks about ethics is very telling of how much the company values ethics (because for most companies, it is not the CEO, although it should be).

As stated above, engineers don’t understand their own power, and they need to start using that power more. Chris recalls listening to a group of data scientists saying that they wished their company had some ethical policy that another company had. But they can make it happen! They just need to decide to use their collective power.

Reading Recommendations from and by Chris

Other fast.ai posts on tech ethics

Dairy farming, solar panels, and diagnosing Parkinson's disease: what can you do with deep learning?

Many people incorrectly assume that AI is only for an elite few – a handful of Silicon Valley computer science prodigies with monthly budgets larger than most people’s lifetime earnings, turning out abstruse academic papers. This couldn’t be more wrong. Deep learning (a powerful type of AI) can be, and is, used by people with varied backgrounds all over the world. A small taste of that variety can be found in the stories shared here: a Canadian dairy farmer trying to identify udder infections in his goats, a Kenyan microbiologist seeking more efficiency in the lab, a former accountant expanding use of solar power in Australia, a 73-year-old embarking on a second career, a son of refugees who works in cybersecurity, and a researcher using genomics to improve cancer treatment. Hopefully this may inspire you to apply deep learning to a problem of your own!

Top row: Alena Harley, Benson Nyabuti Mainye, and Harold Nguyen. Bottom row: Dennis Graham, Sarada Lee, and Cory Spencer

Building Tools for Microbiologists in Kenya

Benson Nyabuti Mainye trained as a microbiologist in his home country of Kenya. He noticed that lab scientists can spend up to 5 hours studying a slide through a microscope to try to identify what cell types were in it, and he wanted a faster alternative. Benson created an immune cell classifier to distinguish various immune cells (eosinophils, basophils, monocytes, and lymphocytes) within an image of a blood smear. This fall, he traveled to San Francisco to attend part of the fast.ai course in person at the USF Data Institute (a new session starts next month), where another fast.ai classmate, Charlie Harrington, helped him deploy the immune cell classifier. Since malaria is one of the top 10 causes of death in Kenya, Benson is currently working with fellow Kenyan and fast.ai alum Gerald Muriuki on a classifier to distinguish different types of mosquitoes to isolate particular types that carry the Plasmodium species (the parasite which causes malaria).

Dairy Goat Farming

Cory Spencer is a dairy goat farmer on bucolic Vancouver Island, and together with his wife owns The Happy Goat Cheese Company. When one of his goats came down with mastitis (an udder infection), Cory was unable to detect it until after the goat had suffered permanent damage. Estimates suggest that mastitis costs the dairy industry billions of dollars each year. By combining a special camera that detects heat (temperatures are higher near an infection) together with deep learning, Cory developed a tool to identify infections far earlier (at a subclinical level) and for one-tenth the cost of existing methods. Next up: Cory is currently building a 3D model to track specific parts of udders in real time, towards the goal of creating an automatic goat milking robot, since as Cory says, “The cow guys already have the fancy robotic tech, but the goat folk are always neglected.”

Cory Spencer's goats

State-of-the-art Results in Cancer Genomics

Alena Harley is working to use genetic information to improve cancer treatment, in her role as head of machine learning at Human Longevity Institute. While taking the fast.ai course, she achieved state-of-the-art results for identifying the source of origin of metastasized cancer, which is relevant for treatment. She is currently working on accurately identifying somatic variants (genetic mutations that can contribute to cancer), automating what was previously a slow manual process.

One of Alena Harley's posts about her work on cancer metastasis

From Accountant to Deep Learning Practitioner working on Solar Energy

Sarada Lee was an accountant looking to transition careers when she began a machine learning meetup in her living room in Perth, Australia, as a way to study the topic. That informal group in Sarada’s living room has now grown into the Perth Machine Learning Meetup, which has over 1,400 members and hosts 6 events per month. Sarada traveled to San Francisco to take the Practical Deep Learning for Coders and Cutting Edge Deep Learning for Coders courses in person at the USF Data Institute, and shared what she learned when she returned to Perth. Sarada recently won a 5-week long hackathon on the topics of solar panel identification and installation size prediction from aerial images, using U-nets. As a result, she and her team have been pre-qualified to supply data science services to a major utility company, which is working on solar panel adoption for an area the size of the UK with over 1.5 million users. Other applications they are working on include electricity network capacity planning, predicting reverse energy flow and safety implications, and monitoring the rapid adoption of solar.

Part of the Perth Machine Learning team with their BitLit Booth at Fringe World (Sarada is 2nd from the left)

Sarada and the Perth Machine Learning Meetup are continuing their deep learning outreach efforts. Last month, a team led by Lauren Amos created an interactive creative display at the Fringe World Festival to make deep learning more accessible to the general public. This was a comprehensive team effort, and the display included:

  • artistic panel designs based on style transfer
  • GRU/RNN-generated poems
  • BERT-generated poems and short books
  • speech-to-text and text-to-speech APIs used to interact with a poetry-generating robot

Festival attendees were able to enjoy the elegant calligraphy of machine generated poems, read chapters of machine-generated books, and even request a robot to generate poems given a short seed sentence. Over 4,000 poems were generated during the course of the 2-week festival!

Cutting-edge Medical Research at Age 73

At age 73, Dennis Graham is using deep learning to diagnose Parkinson’s disease from magnetoencephalography (MEG), as part of a UCH-Anschutz Neurology Research center project. Dennis is painfully familiar with Parkinson’s, as his wife has been afflicted with it for the last 25 years. MEG has the advantages of being inexpensive, readily available, and non-intrusive, but previous techniques had not been analytically accurate when evaluating MEG data. For two years, the team struggled, unable to obtain acceptable results using traditional techniques, until Dennis switched to deep learning, applying techniques and code he learned in the fast.ai course. It turns out that the traditional pre-processing was removing essential data that a neural network classifier could effectively and easily use. With deep learning, Dennis is now achieving much higher accuracy on this problem. Despite his successes, it hasn’t all been easy, and Dennis has had to overcome the ageism of the tech industry as he embarked on his second career.

A First-Generation College Student Working in Cybersecurity

Harold Nguyen’s parents arrived in the United States as refugees during the Vietnam War. Harold is a first generation Vietnamese American and the first in his family to attend college. He loved college so much that he went on to obtain a PhD in Particle Physics and now works in cybersecurity. Harold is using deep learning to protect brands from bad actors on social media as part of his work in digital risk for Proofpoint. Based on work he did with fast.ai, he created a model with high accuracy that was deployed to production at his company last month. Earlier during the course, Harold created an audio model to distinguish between the voices of Ben Affleck, Elon Musk, and Joe Rogan.

What problem will you tackle with deep learning?

Are you facing a problem in your field that could be addressed by deep learning? You don’t have to be a math prodigy or have gone to the most prestigious school to become a deep learning practitioner. The only pre-requisite for the fast.ai course (available in-person or online) is one year of coding, yet it teaches you the hands-on practical techniques needed to achieve state-of-the-art results.

I am so proud of what fast.ai students and alums are achieving. As I shared in my TEDx talk, I consider myself an unlikely AI researcher, and my goal is to help as many unlikely people as possible find their way into the field.

Onstage during my talk 'AI needs all of us' at San Francisco TEDx

Further reading

You may be interested to read about some other fantastic projects from fast.ai students and alumni in these posts:

fastec2 script: Running and monitoring long-running tasks

This is part 2 of a series on fastec2. For an introduction to fastec2, see part 1.

Spot instances are particularly good for long-running tasks, since you can save a lot of money, and you can use more expensive instance types just for the period you’re actually doing heavy computation. fastec2 has some features to make this use case much more convenient. Let’s see an example. Here’s what we’ll be doing:

  1. Use an inexpensive on-demand monitoring instance for collecting results (and optionally for launching the task). We’ll call this od1 in this guide (but you can call it anything you like)
  2. Create a script to do the work required, and put any configuration files it needs in a specific folder. The script will need to be written to save its results to a specific folder, so they’ll be copied back as they’re created
  3. Test that the script works OK on a fresh instance
  4. Run the script under fastec2, which will cause it to be launched inside a tmux session on a new instance, with the required files copied over, and any results copied back to od1 as they’re created
  5. While the script is running, check its progress either by connecting to the tmux session it’s running in, or looking at the results being copied back to od1 as it runs
  6. When done, the instance will be terminated automatically, and we’ll review the results on od1.

Let’s look at the details of how this works, and how to use it. Later in this post, we’ll also see how to use fastec2’s volumes and snapshots functionality to make it easier to connect to large datasets.

Setting up your monitoring instance and script

First, create a script that completes the task you need. When running under fastec2, the script will be launched inside a directory called ~/fastec2, and this directory will also contain any extra files (that aren’t already in your AMI) needed for the script, and will be monitored for changes which are copied back to your on-demand instance (od1, in this guide). Here’s an example (we’ll call it myscript.sh) we can use for testing:

#!/usr/bin/env bash
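# FE2_DIR is set by fastec2 to this task's working directory; anything the
# script writes there is copied back to the monitoring instance as it changes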
echo starting >> $FE2_DIR/myscript.log
sleep 60
echo done >> $FE2_DIR/myscript.log

When running, the environment variable FE2_DIR will be set to the directory your script and files are in. Remember to give your script executable permissions:

$ chmod u+x myscript.sh

When testing it on a fresh instance, just set FE2_DIR and create that directory, then see if your script runs OK (it’s a good idea to have some parameter to your script that causes it to run a quick version for testing).

$ export FE2_DIR=~/fastec2/spot2
$ mkdir -p $FE2_DIR
$ ./myscript.sh

Running the script with fastec2

You need some computer running that can be used to collect the results of the long running script. You won’t want to use a spot instance for this, since it can be shut down at any time, causing you to lose your work. But it can be a cheap instance type; if you’ve had your AWS account for less than 1 year then you can use a t2.micro instance for free. Otherwise a t3.micro is a good choice—it should cost you around US$7/month (plus storage costs) if you leave it running.

To run your script under fastec2, you need to provide the following information:

  1. The name of the instance to use (first create it with launch)
  2. The name of your script
  3. Additional arguments ([--myip MYIP] [--user USER] [--keyfile KEYFILE]) to connect to the monitoring instance to copy results to. If no host is provided, it uses the IP of the computer where fe2 is running.

E.g. these commands will launch a spot instance called spot2, run myscript.sh on it, and copy results back to 18.188.162.203:

$ fe2 launch spot2 base 80 m5.large --spot
$ fe2 script myscript.sh spot2 18.188.162.203

Here’s what happens after you run the fe2 script line above:

  1. A directory called ~/fastec2/spot2 is created on the monitoring instance if it doesn’t already exist (it is always a subdirectory of ~/fastec2 and is given the same name as the instance you’re connecting to, which in this case is spot2)
  2. Your script is copied to this directory
  3. This directory is copied to the target instance (in this case, spot2)
  4. A file called ~/fastec2/current is created on the target instance, containing the name of this task (“spot2” in this case)
  5. lsyncd is run in the background on the target instance, which will continually copy any new/changed files from ~/fastec2/spot2 on the target instance, to the monitoring instance
  6. ~/fastec2/spot2/myscript.sh is run inside the tmux session

If you want the instance to terminate after the script completes, remember to include systemctl poweroff (for Ubuntu) or similar at the end of your script.

Creating a data volume

One issue with the above process is that if you have a bunch of different large datasets to work with, you either need to copy all of them to each AMI you want to use (which is expensive, and means recreating that AMI every time you add a dataset), or create a new AMI for each dataset (which means that as you change your configuration or add applications, you have to update all your AMIs).

An easier approach is to put your datasets on to a separate volume (that is, an AWS disk). fastec2 makes it easy to create a volume (formatted with ext4, which is the most common type of filesystem on Linux). To do so, it’s easiest to use the fastec2 REPL (see the last section of part 1 of this series for an introduction to the REPL), since we need an ssh object which can connect to an instance to mount and format our new volume. For instance, to create a volume using instance od1 (assuming it’s already running):

$ fe2 i
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: inst = e.get_instance('od1')

In [2]: ssh = e.ssh(inst)

In [3]: vol = e.create_volume(ssh, 20)

In [4]: vol
Out[4]: od1 (vol-0bf4a7b9a02d6f942 in-use): 20GB

In [5]: print(ssh.run('ls -l /mnt/fe2_disk'))
total 20
-rw-rw-r-- 1 ubuntu ubuntu     2 Feb 20 14:36 chk
drwx------ 2 ubuntu root   16384 Feb 20 14:36 lost+found

As you see, the new disk has been mounted on the requested instance under the directory /mnt/fe2_disk, and the new volume has been given the same name (od1) as the instance it was created with. You can now connect to your instance and copy your datasets to this directory, and when you’re done, unmount the volume (sudo umount /mnt/fe2_disk in your ssh session), and then you can detach the volume with fastec2. If you don’t have your previous REPL session open any more, you’ll need to get your volume object first, then you can detach it.

In [1]: vol = e.get_volume('od1')

In [2]: vol
Out[2]: od1 (vol-0bf4a7b9a02d6f942 in-use): 20GB

In [3]: e.detach_volume(vol)

In [4]: vol
Out[4]: od1 (vol-0bf4a7b9a02d6f942 available): 20GB

In the future, you can re-mount your volume through the REPL:

In [5]: e.mount_volume(ssh, vol)

Using snapshots

A significant downside of volumes is that you can only attach a volume to one instance at a time. That means you can’t use volumes to launch lots of tasks all connected to the same dataset. Instead, for this purpose you should create a snapshot. A snapshot is a template for a volume; any volumes created from this snapshot will have the same data that the original volume did. Note however that snapshots are not updated with any additional information added to volumes—the data originally included in the snapshot remains without any changes.

To create a snapshot from a volume (assuming you already have a volume object vol, as above, and you’ve detached it from the instance):

In [7]: snap = e.create_snapshot(vol, name="snap1")

You can now create a volume using this snapshot, which attaches to your instance automatically:

In [8]: vol = e.create_volume(ssh, name="vol1", snapshot="snap1")

Summary

Now we’ve got all the pieces of the puzzle. In a future post we’ll discuss best practices for running tasks using fastec2 using all these pieces—but here’s the quick summary of the process:

  1. Launch an instance and set it up with the software and configuration you’ll need
  2. Create a volume for your datasets if required, and make a snapshot from it
  3. Stop that instance, and create an AMI from it (optionally you can terminate the instance after that is done)
  4. Launch a monitoring instance using an inexpensive instance type
  5. Launch a spot instance for your long-running task
  6. Create a volume from your snapshot, attached to your spot instance
  7. Run your long running task on that instance, passing the IP of your monitoring instance
  8. Ensure that your long running task shuts down the instance when done, to avoid paying for the instance after it completes. (You may also want to delete the volume created from the snapshot at that time.)

To run additional tasks, you only need to repeat the last 4 steps. You can automate that process using the API calls shown in this guide.
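If you’d prefer to drive those last few steps from Python rather than the command line, here’s a rough sketch using the notebook API described in part 1. Treat the launch and script method names and signatures as assumptions; they simply mirror the fe2 CLI commands above (with hyphens replaced by underscores), and only ssh and create_volume appear verbatim in these posts.

# Rough sketch of steps 5-7; "assumed" methods mirror the fe2 CLI commands above
from fastec2 import EC2   # the fastec2.EC2 class described in part 1

e = EC2()
inst = e.launch('spot2', 'base', 80, 'm5.large', spot=True)  # assumed, mirrors `fe2 launch ... --spot`
ssh = e.ssh(inst)
vol = e.create_volume(ssh, name='vol1', snapshot='snap1')    # volume from your snapshot, as shown above
e.script('myscript.sh', 'spot2', '18.188.162.203')           # assumed, mirrors `fe2 script`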

fastec2: AWS computer management for regular folks

This is part 1 of a series on fastec2. To learn how to run and monitor long-running tasks with fastec2 check out part 2.

AWS EC2 is a wonderful system; it allows anyone to rent a computer for a few cents an hour, including a fast network connection and plenty of disk space. I’m particularly grateful to AWS, because thanks to their Activate program we’ve got lots of compute credits to use for our research and development at fast.ai.

But if you’ve spent any time working with AWS EC2, then you’ve probably found yourself stuck between the slow and complex AWS Console GUI, and the verbose and clunky command line interface (CLI). There are various tools available to streamline AWS management, but they tend towards the power user end of the spectrum, written for people that are deploying dozens of computers in complex architectures.

Where’s the tool for regular folks? Folks who just want to launch a computer or two for getting some work done, and shutting it down when it’s finished? Folks who aren’t really that keen to learn a whole bunch of AWS-specific jargon about VPCs and Security Groups and IAM Roles and oh god please just make it go away…

The delights of the AWS Console

Contents

  1. Overview
  2. Installation and configuration
  3. Creating your initial on-demand instance
  4. Creating your Amazon Machine Image (AMI)
  5. Launching and connecting to your instance
  6. Launching a spot instance
  7. Using the interactive REPL and ssh API

Since I’m an extremely regular folk myself, I figured I better go write that tool. So here it is: fastec2. Is it for you? Here’s a summary of what it is designed to make easy (‘instance’ here simply means ‘AWS computer’):

  • Launch a new on-demand or spot instance
  • See what instances are running
  • Start an instance
  • Connect to a named instance using ssh
  • Run a long-running script in a spot instance and monitor and save results
  • Create and use volumes and snapshots, including automatic formatting/mounting
  • Change the type of an instance (e.g. add or remove a GPU)
  • See pricing for on-demand and spot instance types
  • Access through either a standard command line or through a Jupyter Notebook API
  • Tab completion
  • IPython command line interactive REPL available for further exploration

I expect that this will be most useful to people who are doing data analysis, data collection, and machine learning model training. Note that fastec2 is not designed to make it easy to manage huge fleets of servers, set up complex network architectures, or help with deployment of applications. If you’re wanting to do that, you might want to check out Terraform or CloudFormation.

To see how it works, let’s do a complete walkthru of creating a new Amazon Machine Image (AMI), then launching an instance from this AMI, and connecting to it. We’ll also see how to launch a spot instance, run a long-running script on it, and collect the results of the script. I’m assuming you already have an AWS account, and know the basics of connecting to instances with ssh. (If you’re not sure about this bit, first you should follow this tutorial on DataCamp.) Note that much of the coolest functionality in fastec2 is being provided by the wonderful Fire, Paramiko, and boto3 libraries—so a big thanks to all the wonderful people that made these available!

Overview

The main use case that we’re looking to support with fastec2 is as follows: you want to interactively start and stop machines of various types, each time getting the same programs, data, and configuration automatically. Sometimes you’ll create an on-demand image and start and stop it as required. You may also want to change the instance type occasionally, such as adding a GPU, or increasing the RAM. (This can be done instantly with a single command!) Sometimes you’ll fire up a spot instance in order to run a script and save the results (such as for training a machine learning model, or completing a web scraping task).

The key to having this work well is to create an AMI that is set up just as you need it. You may think of an AMI as being something that only sysadmin geniuses at Amazon build for you, but as you’ll see it’s actually pretty quick and easy. By making it easy to create and use AMIs, you can then easily create the machines you need, when you need them.

Everything in fastec2 can also be done through the AWS Console, and through the official AWS CLI. Furthermore, there’s lots of things that fastec2 can’t do—it’s not meant to be complete, it’s meant to be convenient for the most commonly used functionality. But hopefully you’ll discover that for what it provides, it makes it easier and faster than anything else out there…

Installation and configuration

You’ll need Python 3.6 or later - we highly recommend installing Anaconda if you’re not already using Python 3.6. It lets you have as many different Python versions as you want, and different environments, and switch between them as needed. To install fastec2:

pip install git+https://github.com/fastai/fastec2.git

You can also save some time by installing tab-completion for your shell. See the readme for setup steps for this. Once installed, hit Tab at any point to complete a command, or hit Tab again to see possible alternatives.

fastec2 uses a Python interface to the AWS CLI to do its work, so you’ll need to configure this. The CLI uses region codes, instead of the region names you see in the console. To find out the region code for the region you wish to use, fastec2 can help. To run the fastec2 application, type fe2, along with a command name and any required arguments. The command region will show the first code that matches the (case-sensitive) substring you provide, e.g. (note that I’m using ‘$’ to indicate the lines you type, and other lines are the responses):

$ fe2 region Ohio
us-east-2

Now that you have your region code, you can configure AWS CLI:

$ aws configure
AWS Access Key ID: XXX
AWS Secret Access Key: XXX
Default region name: us-east-2

For information on setting this up, including getting your access keys for AWS, see Configuring the AWS CLI.
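A quick way to sanity-check that your credentials and region were picked up is to call AWS directly with boto3 (which fastec2 builds on):

import boto3

# Prints your 12-digit account id if the credentials from `aws configure` work
print(boto3.client("sts").get_caller_identity()["Account"])
print(boto3.Session().region_name)  # should match the region you configured, e.g. us-east-2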

Creating your initial on-demand instance

Life is much easier when you can rapidly create new instances which are all set up just how you like them, with the right software installed, data files downloaded, and configuration set up. You can do this by creating an AMI, which is simply a “frozen” version of a computer that you’ve set up, and can then recreate as many times as you like, nearly instantly.

Therefore, we will first set up an EC2 instance with whatever we’re going to need (we’ll call this your base instance). (You might already have an instance set up, in which case you can skip this step).

One thing that will make life a bit easier is to ensure you have a key pair on AWS called “default”. If you don’t, go ahead and upload or create one with that name now. Although fastec2 will happily use other named keys if you wish, you’ll need to specify the key name every time if you don’t use “default”. You don’t need to make your base instance disk very big, since you can always use a larger size later when you launch new instances using your AMI. Generally 60GB is a reasonable size to choose.
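If you don’t have a key pair called “default” yet, one way to upload an existing public key under that name is with boto3 directly (the key path below is just an example):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
with open("/home/me/.ssh/id_rsa.pub", "rb") as f:  # use the path to your own public key
    ec2.import_key_pair(KeyName="default", PublicKeyMaterial=f.read())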

To create our base image, we’ll need to start with some existing AMI that contains a Linux distribution. If you already have some preferred AMI that you use, feel free to use it; otherwise, we suggest using the latest stable Ubuntu image. To get the AMI id for the latest Ubuntu, type:

$ fe2 get-ami - id
ami-0c55b159cbfafe1f0

This shows a powerful feature of fastec2: all commands that start with “get-” return an AWS object, on which you can call any method or property (each of these commands also has a version without the get- prefix, which prints a brief summary of the object instead of returning it). Type your method or property name after a hyphen, as shown above. In this case, we’re getting the ‘id’ property of the AMI object returned by get-ami (which defaults to the latest stable Ubuntu image; see below for examples of other AMIs). To see the list of properties and methods, simply call the command without a property or method added:

$ fe2 get-ami -

Usage:           fe2 get-ami
                 fe2 get-ami architecture
                 fe2 get-ami block-device-mappings
                 fe2 get-ami create-tags
                 fe2 get-ami creation-date
                 ...

Now you can launch your instance—this creates a new “on-demand” Linux instance, and when complete (it’ll take a couple of minutes) it will print out the name, id, status, and IP address. The command will wait until ssh is accessible on your new instance before it returns:

$ fe2 launch base ami-0c55b159cbfafe1f0 50 m5.xlarge
base (i-00c7f2f81a841b525 running): 18.216.25.57

The fe2 launch command takes a minimum of 4 parameters: the name of the instance to create, the ami to use (either id or name—here we’re using the AMI id we retrieved earlier), the size of the disk to create (in GB), and the instance type. You can learn about the different instance types available from this AWS page. To see the pricing of different instances, you can use this command (replace m5 with whichever instance series you’re interested in; note that currently only US prices are displayed, and they may not be accurate or up to date—use the AWS web site for full price lists):

$ fe2 price-demand m5
["m5.large", 0.096]
["m5.metal", 4.608]
["m5.xlarge", 0.192]
["m5.2xlarge", 0.384]
["m5.4xlarge", 0.768]
["m5.12xlarge", 2.304]
["m5.24xlarge", 4.608]

With our instance running, we can now connect to it with ssh:

$ fe2 connect base
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-1032-aws x86_64)

Last login: Fri Feb 15 22:10:28 2019 from 4.78.240.2

ubuntu@ip-172-31-13-138:~$ |

Now you can configure your base instance as required, so go ahead and apt install any software you want, copy over data files you’ll need, and so forth. In order to use some features of fastec2 (discussed below) you’ll need tmux and lsyncd installed in your AMI, so go ahead and install them now (sudo apt install -y tmux lsyncd). Also, if you’ll be using the long-running script functionality in fastec2 you’ll need a private key in your ~/.ssh directory which has permission to connect to another instance to save results of the script. So copy your regular private key over (if it’s not too sensitive) or create a new one (type: ssh-keygen) and grab the ~/.ssh/id_dsa.pub file it creates.

Check: make sure you’ve done the following in your instance before you make it into an AMI: installed lsyncd and tmux; copied over your private key.

If you want to connect to jupyter notebook, or any other service on your instance, you can use ssh tunneling. To create ssh tunnels, add an extra argument to the above fe2 connect command, passing in either a single int (one port) or an array (multiple ports), e.g.:

# Tunnel to just jupyter notebook (running on port 8888)
fe2 connect od1 8888
# Two tunnels: jupyter notebook, and a server running on port 8008
fe2 connect od1 [8888,8008]

This doesn’t do any fancy forwarding between different machines on the network - it’s just a direct connection from the computer you run fe2 connect on to the computer you’re ssh’ing to. So generally you’ll run this on your own PC, and then access (for Jupyter) http://localhost:8888 in your browser.

Creating your Amazon Machine Image (AMI)

Once you’ve configured your base instance, you can create your own AMI:

$ fe2 freeze base
ami-01b7ceef9767a163a

Here ‘freeze’ is the command, and ‘base’ is the argument. Replace base with the name of the base instance that you wish to “freeze” into an AMI. Note that your instance will be rebooted during this process, so ensure that you’ve saved any open documents and it’s OK to shut down. It might take 15 mins or so for the process to complete (for very large disks of hundreds of GB it could take hours). To check on progress, either look in the AMIs section of the AWS console, or type this command (it will display ‘pending’ whilst it is still creating the image):

$ fe2 get-ami base - state
pending

(As you’ll see, this is using the method-calling functionality of fastec2 that we saw earlier.)
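For the curious, this is roughly what freeze corresponds to under the hood, sketched directly with boto3 (fastec2 handles all of this for you; the instance id below is just the example one from earlier):

import boto3

ec2 = boto3.resource("ec2", region_name="us-east-2")
inst = ec2.Instance("i-00c7f2f81a841b525")   # your base instance
image = inst.create_image(Name="base")       # reboots the instance by default
waiter = boto3.client("ec2", region_name="us-east-2").get_waiter("image_available")
waiter.wait(ImageIds=[image.id])             # blocks until the AMI leaves 'pending'
image.reload()
print(image.id, image.state)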

Launching and connecting to your instance

Now you’ve gotten your AMI, you can launch a new instance using that template. It only takes a couple of minutes for your new instance to be created, as follows:

$ fe2 launch inst1 base 80 m5.large
inst1 (i-0f5a3b544274c645f running): 18.191.111.211

We’re calling our new instance ‘inst1’, and using the ‘base’ AMI we created earlier. As you can see, the disk size and instance type need not be the same as you used when creating the AMI (although the disk size can’t be smaller than the size you created with). You can see all the options available for the launch command; we’ll see how to use the iops and spot parameters in the next section:

$ fe2 launch -- --help

Usage: fe2 launch NAME AMI DISKSIZE INSTANCETYPE [KEYNAME] [SECGROUPNAME] [IOPS] [SPOT]
       fe2 launch --name NAME --ami AMI --disksize DISKSIZE --instancetype INSTANCETYPE
         [--keyname KEYNAME] [--secgroupname SECGROUPNAME] [--iops IOPS] [--spot SPOT]

Congratulations, you’ve launched your first instance from your own AMI! You can repeat the previous fe2 launch command, just passing in a different name, to create more instances, and ssh to each with fe2 connect <name>. To shut down an instance, enter this in the terminal of your instance:

sudo shutdown -h now

…or alternatively enter in the terminal of your own computer (change inst1 to the name of your instance):

fe2 stop inst1

If you replace stop with terminate in the above command it will terminate your instance (i.e. it will destroy it, and by default will remove all of your data on the instance; when terminating the instance, fastec2 will also remove its name tag, so it’s immediately available to reuse). If you want to have fastec2 wait until the instance is stopped, use this command (otherwise it will happen automatically in the background):

$ fe2 get-instance inst1 - wait-until-stopped

Here’s a really handy feature: after you’ve stopped your instance, you can change it to a different type! This means that you can do your initial prototyping on a cheap instance type, and then run your big analysis on a super-fast machine when you’re ready.

$ fe2 change-type inst1 p3.8xlarge

Then you can re-start your instance and connect to it as before:

$ fe2 start inst1
inst1 (i-0f5a3b544274c645f running): 52.14.245.85

$ fe2 connect inst1
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-1032-aws x86_64)

With all this playing around with instances you may get lost as to what you’ve created and what’s actually running! To find out, just use the instances command:

$ fe2 instances
spot1 (i-0b39947b710d05337 running): 3.17.155.171
inst1 (i-0f5a3b544274c645f stopped): No public IP
base (i-00c7f2f81a841b525 running): 18.216.25.57
od1 (i-0a1b47f88993b2bba stopped): No public IP

The instances with “No public IP” will automatically get a public IP when you start them. Generally you won’t need to worry about what the IP is, since you can fe2 connect using just the name; however you can always grab the IP through fastec2 if needed:

$ fe2 get-instance base - public-ip-address
18.216.25.57

Launching a spot instance

Spot instances can be 70% (or more) cheaper than on-demand instances. However, they may be shut down at any time, may not always be available, and all data on their root volume is deleted when they are shut down (in fact, they can only be terminated; they can’t be shut down and restarted later). Spot instance prices vary over time, by instance type, and by region. To see the last 3 days’ pricing for instances in a group (in this case, for p3 types), enter:

$ fe2 price-hist p3
Timestamp      2019-02-13  2019-02-14  2019-02-15
InstanceType
p3.2xlarge         1.1166      1.1384      1.1547
p3.8xlarge         3.9462      3.8884      3.8699
p3.16xlarge        7.3440      7.4300      8.0867
p3dn.24xlarge         NaN         NaN         NaN

Let’s compare to on-demand pricing:

$ fe2 price-demand p3
["p3.2xlarge", 3.06]
["p3.8xlarge", 12.24]
["p3.16xlarge", 24.48]
["p3dn.24xlarge", 31.212]

That’s looking pretty good! To get more detailed price graphs, check out the spot pricing tool on the AWS console, or else try using the fastec2 jupyter notebook API. This API is identical to the fe2 command, except that you create an instance of the EC2 class (optionally passing a region to the constructor), and call methods on that class. (If you haven’t used Jupyter Notebook before, then you should definitely check it out, because it’s amazingly great! Here’s a helpful tutorial from the kind folks at DataQuest to get you started.) The price-demand method has an extra feature when used in a notebook that prints the last few weeks prices in a graph for you (note that hyphens must be replaced with underscores in the notebook API).

Example of spot pricing in the notebook API
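Here’s a minimal sketch of such a notebook session; the import path and the underscore method names are assumptions based on the description above (they mirror the corresponding fe2 commands):

from fastec2 import EC2   # assumed import for the fastec2.EC2 class

e = EC2("us-east-2")      # optionally pass a region to the constructor
e.price_demand("p3")      # underscore form of `fe2 price-demand`; also plots recent prices in a notebook
e.price_hist("p3")        # underscore form of `fe2 price-hist`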

To launch a spot instance, just add --spot to your launch command:

$ fe2 launch spot1 base 80 m5.large --spot
spot1 (i-0b39947b710d05337 running): 3.17.155.171

Note that this is only requesting a spot instance. It’s possible that no capacity will be available for your request. In that case, after a few minutes you’ll see an error from fastec2 telling you that the request failed. We can see that the above request was successful, because it’s printed out a message showing the new instance is “running”.

Remember: if you stop this spot instance it will be terminated and all data will be lost! And AWS can decide to shut it down at any time.

Using the interactive REPL and ssh API

How do you know what methods and properties are available? And how can you access them more conveniently? The answer is: use the interactive REPL! A picture tells a thousand words…

The fastec2 REPL

If you add -- -i to the end of a command which returns an object (which is currently the instance, get-ami, and ssh commands) then you’ll be popped into an IPython session with that object available in the special name result. So just type result. and hit Tab to see all the methods and properties available. This is a full Python interpreter, so you can use the full power of Python to interact with this object. When you’re done, hit Ctrl-d twice to exit.

One interesting use of this is to experiment with the ssh command, which provides an API to issue commands to the remote instance via ssh. The object returned by this command is a standard Paramiko SSHClient, with a couple of extra goodies. One of those goodies is send(cmd), which sends ‘cmd’ to a tmux session (that’s automatically started) on the instance. This is mainly designed for you to use from scripts, but you can experiment with it via the REPL, as shown below.

Communicating with remote tmux session via the REPL
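For example, after starting a session with fe2 ssh od1 -- -i, a quick experiment might look like this (just a sketch; result is the SSHClient object the ssh command returns):

In [1]: result.send('htop')             # runs htop inside the instance's tmux session

In [2]: print(result.run('uname -a'))   # run() returns the command's output, as used in part 2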

If you just want to explore the fastec2 API interactively, the easiest way is by launching the REPL using fe2 i (you can optionally append a region id or part of a region name). A fastec2.EC2 object called e will be automatically created for you. Type e. and hit Tab to see a list of options. IPython is started in smart autocall mode, which means that you often don’t even need to type parentheses to run methods. For instance:

$ fe2 i Ohio
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: e.instances
inst1 (i-0f5a3b544274c645f m5.large running): 18.222.175.103
base (i-00c7f2f81a841b525 m5.xlarge stopped): No public IP
od1 (i-0a1b47f88993b2bba t3.micro running): 18.188.162.203

In [2]: i=e.get_instance('od1')

In [3]: i.block_device_mappings
Out[3]:
[{'DeviceName': '/dev/sda1',
  'Ebs': {'AttachTime': datetime.datetime(2019, 2, 14, 9, 30, 16),
   'DeleteOnTermination': True,
   'Status': 'attached',
   'VolumeId': 'vol-0d1b1a47539d5bcaf'}}]

fastec2 provides many convenient methods for managing AWS EC2, and also adds functionality to make SSH and SFTP easier to use. We’ll look at these features of the fastec2 API in more detail in a future article.

If you want to learn how to run and monitor long-running tasks with fastec2 check out part 2 of this series, where we’ll also see how fastec2 helps to create and use volumes and snapshots, including automatic formatting/mounting.