A Conversation about Tech Ethics with the New York Times Chief Data Scientist

Note from Rachel: Although I’m excited about the positive potential of tech, I’m also scared about the ways that tech is having a negative impact on society, and I’m interested in how we can push tech companies to do better. I was recently in a discussion during which New York Times chief data scientist Chris Wiggins shared a helpful framework for thinking about the different forces we can use to influence tech companies towards responsibility and ethics. I interviewed Chris on the topic and have summarized that interview here.

In addition to having been Chief Data Scientist at the New York Times since January 2014, Chris Wiggins is professor of applied mathematics at Columbia University, a founding member of Columbia’s Data Science Institute, and co-founder of HackNY. He co-teaches a course at Columbia on the history and ethics of data.

Ways to Influence Tech Companies to be More Responsible and More Ethical

Chris has developed a framework showing the different forces acting on and within tech companies:

  1. External Forces
    1. Government Power
      1. Regulation
      2. Litigation
      3. Fear of regulation and litigation
    2. People Power
      1. Consumer Boycott
      2. Data Boycott
      3. Talent Boycott
    3. Power of Other Companies
      1. Responsibility as a value-add
      2. Direct interactions, such as de-platforming
      3. The Press
  2. Internal Forces
    1. How we define ethics
      One key example: the Belmont Principles
      1. Respect for persons
      2. Beneficence
      3. Justice
    2. How we design for ethics
      1. Starts with leadership
      2. Includes the importance of monitoring user experience

The two big categories are internal forces and external forces. Chris shared that at the New York Times, he’s seen the internal process through the work of a data governance committee on responsible data stewardship. Preparing for GDPR helped focus those conversations, as did wanting to be proactive about other data regulations that Chris and his team expect are coming in the future. The New York Times has standardized processes, including for data deletion and for protecting personally identifiable information (PII). For instance, any time you are storing aggregate information, e.g. about page views, you don’t need to keep the PII of the individuals who viewed the page.

How to Impact Tech Companies: External Forces

Chris cites Doing Capitalism in the Innovation Economy by Bill Janeway as having been influential on his thinking about external forces impacting companies. Janeway writes of an unstable game amongst three players: government, people, and companies. Government power can take the form of regulations and litigation, or even just the fear of regulation and litigation.

A second external force is people power. The most well-known example of exercising people power is a consumer boycott: not giving companies money when we disagree with their practices. There is also the data boycott, not giving companies access to our data, and the talent boycott, refusing to work for them. In the past year, we have seen engineers from Google (protesting the Project Maven military contract and a censored Chinese search engine), Microsoft (protesting the HoloLens military contract), and Amazon (protesting facial recognition technology sold to police) start to exercise this power to ask for change. However, most engineers and data scientists still do not realize the collective power they have. Engineers and data scientists are in high demand, and they should be leveraging that demand to push companies to be more ethical.

A third type of external power is the power of other companies. For instance, companies can make responsibility part of the value-add that differentiates them from competitors. The search engine DuckDuckGo (motto: “The search engine that doesn’t track you”) has always made privacy a core part of their appeal. Apple has become known for championing user privacy in recent years. Consumer protection was a popular idea in the 70s and 80s, yet has somewhat fallen out of favor. More companies could make consumer protection and responsibility part of their products and what differentiates them from competitors.

Companies can also exert power in direct ways on one another, for instance, when Apple de-platformed Google and Facebook by revoking their developer certificates after they violated Apple’s privacy policies. And finally, the press counts as a company that influences other companies. There are many interconnections here: the press influences people as citizens, voters, and consumers, which then impacts the government and the companies directly.

Internal Forces: Defining Ethics vs Designing for Ethics

Chris says it is important to distinguish between how we define ethics and how we design for ethics. Conversations quickly get muddled when people jump between the two.

Defining ethics involves identifying your principles. There is a granularity to ethics: we need principles to be granular enough to be meaningful, but not so granular that they are context-dependent rules which change all the time. For example, “don’t be evil” is too broad to be meaningful.

Ethical principles are distinct from their implementation, and defining principles means committing to do the work of defining more specific rules that follow from them, or of redefining those rules to stay consistent with your principles as technology or context changes. For instance, many ethical principles were laid out in the U.S. Bill of Rights, and we have spent the centuries since working out what they mean in practice.

Designing for Ethics

In terms of designing for ethics, Chris notes that this needs to start from the top. Leaders set company goals, which are then translated into objectives; those objectives are translated into KPIs; and those KPIs are used in operations. Feedback from operations and KPIs should be used to continually reflect on whether the ethical principles are being defended or challenged, or to revisit ways that the system is falling short. One aspect of operations that most major tech companies have neglected is monitoring user experience, particularly deleterious user experiences. When companies use contractors for content moderation (as opposed to full-time employees), it says a lot about the low priority they place on negative user experiences. While content moderation addresses one component of negative user experiences, there are also many others.

Chris wrote about this topic in a blog post, Ethical Principles, OKRs, and KPIs: what YouTube and Facebook could learn from Tukey, saying that “Part of this monitoring will not be quantitative. Particularly since we can not know in advance every phenomenon users will experience, we can not know in advance what metrics will quantify these phenomena. To that end, data scientists and machine learning engineers must partner with or learn the skills of user experience research, giving users a voice.”

Learning from the Belmont Report and IRBs

The history of ethics is not discussed enough and many people aren’t very familiar with it. We can learn a lot from other fields, such as human subject research. In the wake of the horrifying and unethical Tuskegee Syphilis Study, the National Research Act of 1974 was passed and researchers spent much time identifying and formalizing ethical principles. These are captured in the Belmont Principles for ethical research on human subjects. This is the ethical framework that informed the later creation of institutional review boards (IRBs), which are used to review and approve research involving human subjects. Studying how ethics have been operationalized via the Belmont principles can be very informative, as they have for almost 40 years been stress-tested via real-world implementations, and there is copious literature about their utility and limitations.

The core tenets of the Belmont Principles can be summarized:

  1. Respect for Persons:
    1. informed consent;
    2. respect for individuals’ autonomy;
    3. respect individuals impacted;
    4. protection for individuals with diminished autonomy or decision making capability.
  2. Beneficence:
    1. Do not harm;
    2. assess risk.
  3. Justice:
    1. equal consideration;
    2. fair distribution of benefits of research;
    3. fair selection of subjects;
    4. equitable allocation of burdens.

Note that the principle of beneficence can be used to make arguments in which the ends justify the means, and the principle of respect for persons can be used to make arguments in which the means justify the ends, so there is a lot captured here.

This topic came up in the national news after Kramer et al., the 2014 paper in which Facebook researchers manipulated users’ moods, which received a lot of criticism and concern. There was a follow-up paper by two other Facebook researchers, Evolving the IRB: Building Robust Review for Industry Research, which suggests that a form of IRB review has now been implemented at Facebook. I have learned a lot from studying the work of ethicists on research with human subjects, particularly the Belmont Principles.

For those interested in learning more about this topic, Chris recommends Chapter 6 of Matthew Salganik’s book, Bit by Bit: Social Research in the Digital Age, which Chris uses in the course on the history and ethics of data that he teaches at Columbia. Salganik is a Princeton professor who does computational social science research, and is also professor in residence at the New York Times.

Chris also says he has learned a lot from legal theorists. Most engineers have probably not thought much about legal theory, but it has a long history of addressing the balance between standards, principles, and rules.

High Impact Areas

Ethics is not an API call. It needs to happen at a high level. The greatest impact will come when leaders take it seriously. The level of the person in the org chart who speaks about ethics says a lot about how much the company values ethics (because for most companies, it is not the CEO, although it should be).

As stated above, engineers don’t understand their own power, and they need to start using that power more. Chris recalls listening to a group of data scientists saying that they wished their company had some ethical policy that another company had. But they can make it happen! They just need to decide to use their collective power.


Dairy farming, solar panels, and diagnosing Parkinson's disease: what can you do with deep learning?

Many people incorrectly assume that AI is only for an elite few: a handful of Silicon Valley computer science prodigies with monthly budgets larger than most people’s lifetime earnings, turning out abstruse academic papers. This couldn’t be more wrong. Deep learning (a powerful type of AI) can be, and is being, used by people with varied backgrounds all over the world. A small taste of that variety can be found in the stories shared here: a Canadian dairy farmer trying to identify udder infections in his goats, a Kenyan microbiologist seeking more efficiency in the lab, a former accountant expanding the use of solar power in Australia, a 73-year-old embarking on a second career, a son of refugees who works in cybersecurity, and a researcher using genomics to improve cancer treatment. Hopefully these stories will inspire you to apply deep learning to a problem of your own!

Top row: Alena Harley, Benson Nyabuti Mainye, and Harold Nguyen. Bottom row: Dennis Graham, Sarada Lee, and Cory Spencer

Building Tools for Microbiologists in Kenya

Benson Nyabuti Mainye trained as a microbiologist in his home country of Kenya. He noticed that lab scientists can spend up to 5 hours studying a slide through a microscope trying to identify which cell types are in it, and he wanted a faster alternative. Benson created an immune cell classifier to distinguish various immune cells (eosinophils, basophils, monocytes, and lymphocytes) within an image of a blood smear. This fall, he traveled to San Francisco to attend part of the fast.ai course in person at the USF Data Institute (a new session starts next month), where another fast.ai classmate, Charlie Harrington, helped him deploy the immune cell classifier. Since malaria is one of the top 10 causes of death in Kenya, Benson is currently working with fellow Kenyan and fast.ai alum Gerald Muriuki on a classifier to distinguish different types of mosquitoes, in order to isolate those that carry the Plasmodium species (the parasite which causes malaria).

Dairy Goat Farming

Cory Spencer is a dairy goat farmer on bucolic Vancouver Island, and together with his wife owns The Happy Goat Cheese Company. When one of his goats came down with mastitis (an udder infection), Cory was unable to detect it until after the goat had suffered permanent damage. Estimates suggest that mastitis costs the dairy industry billions of dollars each year. By combining a special camera that detects heat (temperatures are higher near an infection) together with deep learning, Cory developed a tool to identify infections far earlier (at a subclinical level) and for one-tenth the cost of existing methods. Next up: Cory is currently building a 3D model to track specific parts of udders in real time, towards the goal of creating an automatic goat milking robot, since as Cory says, “The cow guys already have the fancy robotic tech, but the goat folk are always neglected.”

Cory Spencer’s goats

State-of-the-art Results in Cancer Genomics

Alena Harley is working to use genetic information to improve cancer treatment, in her role as head of machine learning at Human Longevity Institute. While taking the fast.ai course, she achieved state-of-the-art results for identifying the source of origin of metastasized cancer, which is relevant for treatment. She is currently working on accurately identifying somatic variants (genetic mutations that can contribute to cancer), automating what was previously a slow manual process.

One of Alena Harley’s posts about her work on cancer metastasis

From Accountant to Deep Learning Practitioner working on Solar Energy

Sarada Lee is a former accountant who was looking to change careers when she began a machine learning meetup in her living room in Perth, Australia, as a way to study the topic. That informal group in Sarada’s living room has now grown into the Perth Machine Learning Meetup, which has over 1,400 members and hosts 6 events per month. Sarada traveled to San Francisco to take the Practical Deep Learning for Coders and Cutting Edge Deep Learning for Coders courses in person at the USF Data Institute, and shared what she learned when she returned to Perth. Sarada recently won a 5-week-long hackathon on solar panel identification and installation size prediction from aerial images, using U-nets. As a result, she and her team have been pre-qualified to supply data science services to a major utility company, which is working on solar panel adoption for an area the size of the UK with over 1.5 million users. Other applications they are working on include electricity network capacity planning, predicting reverse energy flow and its safety implications, and monitoring the rapid adoption of solar.

Part of the Perth Machine Learning team with their BitLit Booth at Fringe World (Sarada is 2nd from the left)

Sarada and the Perth Machine Learning Meetup are continuing their deep learning outreach efforts. Last month, a team led by Lauren Amos created an interactive creative display at the Fringe World Festival to make deep learning more accessible to the general public. This was a comprehensive team effort, and the display included:

  • artistic panel designs based on style transfer
  • poems generated with GRU/RNN models
  • poems and short books generated with BERT
  • speech-to-text and text-to-speech APIs used to interact with a poetry-generating robot

Festival attendees were able to enjoy the elegant calligraphy of machine generated poems, read chapters of machine-generated books, and even request a robot to generate poems given a short seed sentence. Over 4,000 poems were generated during the course of the 2-week festival!

Cutting-edge Medical Research at Age 73

At age 73, Dennis Graham is using deep learning to diagnose Parkinson’s disease from magnetoencephalography (MEG) data, as part of a UCH-Anschutz Neurology Research center project. Dennis is painfully familiar with Parkinson’s, as his wife has been afflicted with it for the last 25 years. MEG has the advantages of being inexpensive, readily available, and non-intrusive, but previous analytical techniques had not been accurate enough when evaluating MEG data. For two years the team struggled, unable to obtain acceptable results using traditional techniques, until Dennis switched to deep learning, applying techniques and code he learned in the fast.ai course. It turns out that the traditional pre-processing was removing essential data that a neural network classifier could use effectively and easily. With deep learning, Dennis is now achieving much higher accuracy on this problem. Despite his successes, it hasn’t all been easy, and Dennis has had to overcome the ageism of the tech industry as he embarked on his second career.

A First-Generation College Student Working in Cybersecurity

Harold Nguyen’s parents arrived in the United States as refugees during the Vietnam War. Harold is a first generation Vietnamese American and the first in his family to attend college. He loved college so much that he went on to obtain a PhD in Particle Physics and now works in cybersecurity. Harold is using deep learning to protect brands from bad actors on social media as part of his work in digital risk for Proofpoint. Based on work he did with fast.ai, he created a model with high accuracy that was deployed to production at his company last month. Earlier during the course, Harold created an audio model to distinguish between the voices of Ben Affleck, Elon Musk, and Joe Rogan.

What problem will you tackle with deep learning?

Are you facing a problem in your field that could be addressed by deep learning? You don’t have to be a math prodigy or have gone to the most prestigious school to become a deep learning practitioner. The only pre-requisite for the fast.ai course (available in-person or online) is one year of coding, yet it teaches you the hands-on practical techniques needed to achieve state-of-the-art results.

I am so proud of what fast.ai students and alums are achieving. As I shared in my TEDx talk, I consider myself an unlikely AI researcher, and my goal is to help as many unlikely people as possible find their way into the field.

Onstage during my talk 'AI needs all of us' at San Francisco TEDx


fastec2 script: Running and monitoring long-running tasks

This is part 2 of a series on fastec2. For an introduction to fastec2, see part 1.

Spot instances are particularly good for long-running tasks, since you can save a lot of money, and you can use more expensive instance types just for the period you’re actually doing heavy computation. fastec2 has some features to make this use case much more convenient. Let’s see an example. Here’s what we’ll be doing:

  1. Use an inexpensive on-demand monitoring instance for collecting results (and optionally for launching the task). We’ll call this od1 in this guide (but you can call it anything you like)
  2. Create a script to do the work required, and put any configuration files it needs in a specific folder. The script will need to be written to save its results to a specific folder so they can be copied back to the monitoring instance
  3. Test the script works OK in a fresh instance
  4. Run the script under fastec2, which will cause it to be launched inside a tmux session on a new instance, with the required files copied over, and any results copied back to od1 as they’re created
  5. While the script is running, check its progress either by connecting to the tmux session it’s running in, or looking at the results being copied back to od1 as it runs
  6. When done, the instance will be terminated automatically, and we’ll review the results on od1.

Let’s look at the details of how this works, and how to use it. Later in this post, we’ll also see how to use fastec2’s volumes and snapshots functionality to make it easier to connect to large datasets.

Setting up your monitoring instance and script

First, create a script that completes the task you need. When running under fastec2, the script will be launched inside a directory called ~/fastec2, and this directory will also contain any extra files (that aren’t already in your AMI) needed for the script, and will be monitored for changes, which are copied back to your on-demand instance (od1, in this guide). Here’s an example (we’ll call it myscript.sh) we can use for testing:

#!/usr/bin/env bash
echo starting >> $FE2_DIR/myscript.log
sleep 60
echo done >> $FE2_DIR/myscript.log

When running, the environment variable FE2_DIR will be set to the directory your script and files are in. Remember to give your script executable permissions:

$ chmod u+x myscript.sh

When testing it on a fresh instance, just set FE2_DIR and create that directory, then see if your script runs OK (it’s a good idea to have some parameter to your script that causes it to run a quick version for testing).

$ export FE2_DIR=~/fastec2/spot2
$ mkdir -p $FE2_DIR
$ ./myscript.sh

Running the script with fastec2

You need some computer running that can be used to collect the results of the long running script. You won’t want to use a spot instance for this, since it can be shut down at any time, causing you to lose your work. But it can be a cheap instance type; if you’ve had your AWS account for less than 1 year then you can use a t2.micro instance for free. Otherwise a t3.micro is a good choice—it should cost you around US$7/month (plus storage costs) if you leave it running.
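If you don’t already have such an instance, you can create one in the usual way. For example, here’s a sketch using the notebook/REPL API described in part 1; the method signature is assumed to mirror the fe2 launch command, and od1 and base are just the example names used in this guide:

# Sketch: launch a small on-demand monitoring instance with the fastec2 Python API.
# Assumes the notebook API mirrors `fe2 launch NAME AMI DISKSIZE INSTANCETYPE`.
import fastec2

e = fastec2.EC2()
e.launch('od1', 'base', 20, 't3.micro')   # cheap on-demand instance to collect results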

To run your script under fastec2, you need to provide the following information:

  1. The name of the instance to use (first create it with launch)
  2. The name of your script
  3. Additional arguments ([--myip MYIP] [--user USER] [--keyfile KEYFILE]) to connect to the monitoring instance to copy results to. If no host is provided, it uses the IP of the computer where fe2 is running.

E.g. this command will run myscript.sh on spot2 and copy results back to 18.188.162.203:

$ fe2 launch spot2 base 80 m5.large --spot
$ fe2 script myscript.sh spot2 18.188.162.203

Here’s what happens after you run the fe2 script line above:

  1. A directory called ~/fastec2/spot2 is created on the monitoring instance if it doesn’t already exist (it is always a subdirectory of ~/fastec2 and is given the same name as the instance you’re connecting to, which in this case is spot2)
  2. Your script is copied to this directory
  3. This directory is copied to the target instance (in this case, spot2)
  4. A file called ~/fastec2/current is created on the target instance, containing the name of this task (“spot2” in this case)
  5. lsyncd is run in the background on the target instance, which will continually copy any new/changed files from ~/fastec2/spot2 on the target instance, to the monitoring instance
  6. ~/fastec2/spot2/myscript.sh is run inside the tmux session

If you want the instance to terminate after the script completes, remember to include systemctl poweroff (for Ubuntu) or similar at the end of your script.

Creating a data volume

One issue with the above process is that if you have a bunch of different large datasets to work with, you either need to copy all of them to each AMI you want to use (which is expensive, and means recreating that AMI every time you add a dataset), or create a new AMI for each dataset (which means that as you change your configuration or add applications, you have to update all your AMIs).

An easier approach is to put your datasets on to a separate volume (that is, an AWS disk). fastec2 makes it easy to create a volume (formatted with ext4, which is the most common type of filesystem on Linux). To do so, it’s easiest to use the fastec2 REPL (see the last section of part 1 of this series for an introduction to the REPL), since we need an ssh object which can connect to an instance to mount and format our new volume. For instance, to create a volume using instance od1 (assuming it’s already running):

$ fe2 i
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: inst = e.get_instance('od1')

In [2]: ssh = e.ssh(inst)

In [3]: vol = e.create_volume(ssh, 20)

In [4]: vol
Out[4]: od1 (vol-0bf4a7b9a02d6f942 in-use): 20GB

In [5]: print(ssh.run('ls -l /mnt/fe2_disk'))
total 20
-rw-rw-r-- 1 ubuntu ubuntu     2 Feb 20 14:36 chk
drwx------ 2 ubuntu root   16384 Feb 20 14:36 lost+found

As you can see, the new disk has been mounted on the requested instance under the directory /mnt/fe2_disk, and the new volume has been given the same name (od1) as the instance it was created with. You can now connect to your instance and copy your datasets to this directory. When you’re done, unmount the volume (sudo umount /mnt/fe2_disk in your ssh session), and then you can detach the volume with fastec2. If you don’t have your previous REPL session open any more, you’ll need to get your volume object first, then you can detach it.

In [1]: vol = e.get_volume('od1')

In [2]: vol
Out[2]: od1 (vol-0bf4a7b9a02d6f942 in-use): 20GB

In [3]: e.detach_volume(vol)

In [4]: vol
Out[4]: od1 (vol-0bf4a7b9a02d6f942 available): 20GB

In the future, you can re-mount your volume through the REPL:

In [5]: e.mount_volume(ssh, vol)

Using snapshots

A significant downside of volumes is that you can only attach a volume to one instance at a time. That means you can’t use volumes to launch lots of tasks all connected to the same dataset. Instead, for this purpose you should create a snapshot. A snapshot is a template for a volume; any volumes created from this snapshot will have the same data that the original volume did. Note however that snapshots are not updated with any additional information added to volumes—the data originally included in the snapshot remains without any changes.

To create a snapshot from a volume (assuming you already have a volume object vol, as above, and you’ve detached it from the instance):

In [7]: snap = e.create_snapshot(vol, name="snap1")

You can now create a volume using this snapshot, which attaches to your instance automatically:

In [8]: vol = e.create_volume(ssh, name="vol1", snapshot="snap1")

Summary

Now we’ve got all the pieces of the puzzle. In a future post we’ll discuss best practices for running tasks using fastec2 using all these pieces—but here’s the quick summary of the process:

  1. Launch an instance and set it up with the software and configuration you’ll need
  2. Create a volume for your datasets if required, and make a snapshot from it
  3. Stop that instance, and create an AMI from it (optionally you can terminate the instance after that is done)
  4. Launch a monitoring instance using an inexpensive instance type
  5. Launch a spot instance for your long-running task
  6. Create a volume from your snapshot, attached to your spot instance
  7. Run your long running task on that instance, passing the IP of your monitoring instance
  8. Ensure that your long-running task shuts down the instance when done, to avoid paying for the instance after it completes. (You may also want to delete the volume created from the snapshot at that time.)

To run additional tasks, you only need to repeat the last 4 steps. You can automate that process using the API calls shown in this guide.
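For example, here is a rough sketch of what that automation could look like using the notebook/REPL API described in part 1. The method names and signatures are assumed to mirror the corresponding fe2 commands (with hyphens replaced by underscores), and the names used (base, snap1, od1, spot2, vol1) are just the examples from this guide:

# Sketch of automating steps 5-8; method names are assumed to mirror the fe2 CLI.
import fastec2

e = fastec2.EC2()

# 5. Launch a spot instance from the AMI we froze earlier
e.launch('spot2', 'base', 80, 'm5.large', spot=True)
inst = e.get_instance('spot2')

# 6. Create a volume from our dataset snapshot; it is attached and mounted automatically
ssh = e.ssh(inst)
vol = e.create_volume(ssh, name='vol1', snapshot='snap1')

# 7. Run the long-running task, passing the IP of the monitoring instance (od1)
e.script('myscript.sh', 'spot2', e.get_instance('od1').public_ip_address)

# 8. myscript.sh should end with `sudo systemctl poweroff` so the spot instance
#    terminates itself (and stops costing money) when the work is done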

fastec2: AWS computer management for regular folks

This is part 1 of a series on fastec2. To learn how to run and monitor long-running tasks with fastec2 check out part 2.

AWS EC2 is a wonderful system; it allows anyone to rent a computer for a few cents an hour, including a fast network connection and plenty of disk space. I’m particularly grateful to AWS, because thanks to their Activate program we’ve got lots of compute credits to use for our research and development at fast.ai.

But if you’ve spent any time setting up AWS EC2, you’ve probably found yourself stuck between the slow and complex AWS Console GUI and the verbose and clunky command line interface (CLI). There are various tools available to streamline AWS management, but they tend towards the power-user end of the spectrum, written for people that are deploying dozens of computers in complex architectures.

Where’s the tool for regular folks? Folks who just want to launch a computer or two for getting some work done, and shutting it down when it’s finished? Folks who aren’t really that keen to learn a whole bunch of AWS-specific jargon about VPCs and Security Groups and IAM Roles and oh god please just make it go away…

The delights of the AWS Console

Contents

  1. Overview
  2. Installation and configuration
  3. Creating your initial on-demand instance
  4. Creating your Amazon Machine Image (AMI)
  5. Launching and connecting to your instance
  6. Launching a spot instance
  7. Using the interactive REPL and ssh API

Since I’m an extremely regular folk myself, I figured I better go write that tool. So here it is: fastec2. Is it for you? Here’s a summary of what it is designed to make easy (‘instance’ here simply means ‘AWS computer’):

  • Launch a new on-demand or spot instance
  • See what instances are running
  • Start an instance
  • Connect to a named instance using ssh
  • Run a long-running script in a spot instance and monitor and save results
  • Create and use volumes and snapshots, including automatic formatting/mounting
  • Change the type of an instance (e.g. add or remove a GPU)
  • See pricing for on-demand and spot instance types
  • Access through either a standard command line or through a Jupyter Notebook API
  • Tab completion
  • IPython command line interactive REPL available for further exploration

I expect that this will be most useful to people who are doing data analysis, data collection, and machine learning model training. Note that fastec2 is not designed to make it easy to manage huge fleets of servers, to set up complex network architectures, or to help with deployment of applications. If you want to do that, you might want to check out Terraform or CloudFormation.

To see how it works, let’s do a complete walkthrough: creating a new Amazon Machine Image (AMI), launching an instance from that AMI, and connecting to it. We’ll also see how to launch a spot instance, run a long-running script on it, and collect the results of the script. I’m assuming you already have an AWS account, and know the basics of connecting to instances with ssh. (If you’re not sure about this bit, first you should follow this tutorial on DataCamp.) Note that much of the coolest functionality in fastec2 is being provided by the wonderful Fire, Paramiko, and boto3 libraries—so a big thanks to all the wonderful people that made these available!

Overview

The main use case that we’re looking to support with fastec2 is as follows: you want to interactively start and stop machines of various types, each time getting the same programs, data, and configuration automatically. Sometimes you’ll create an on-demand instance and start and stop it as required. You may also want to change the instance type occasionally, such as adding a GPU or increasing the RAM. (This can be done instantly with a single command!) Sometimes you’ll fire up a spot instance in order to run a script and save the results (such as for training a machine learning model, or completing a web scraping task).

The key to making this work well is to create an AMI that is set up just as you need it. You may think of an AMI as something that only sysadmin geniuses at Amazon build for you, but as you’ll see it’s actually pretty quick and easy. Because fastec2 makes it easy to create and use AMIs, you can easily create the machines you need, when you need them.

Everything in fastec2 can also be done through the AWS Console, and through the official AWS CLI. Furthermore, there’s lots of things that fastec2 can’t do—it’s not meant to be complete, it’s meant to be convenient for the most commonly used functionality. But hopefully you’ll discover that for what it provides, it makes it easier and faster than anything else out there…

Installation and configuration

You’ll need python 3.6 or later - we highly recommend installing Anaconda if you’re not already using python 3.6. It lets you have as many different python versions as you want, and different environments, and switch between them as needed. To install fastec2:

pip install git+https://github.com/fastai/fastec2.git

You can also save some time by installing tab-completion for your shell. See the readme for setup steps for this. Once installed, hit Tab at any point to complete a command, or hit Tab again to see possible alternatives.

fastec2 uses a python interface to the AWS CLI to do its work, so you’ll need to configure this. The CLI uses region codes instead of the region names you see in the console. To find out the region code for the region you wish to use, fastec2 can help. To run the fastec2 application, type fe2 along with a command name and any required arguments. The region command will show the first code that matches the (case-sensitive) substring you provide, e.g. (note that I’m using ‘$’ to indicate the lines you type; the other lines are the responses):

$ fe2 region Ohio
us-east-2

Now that you have your region code, you can configure AWS CLI:

$ aws configure
AWS Access Key ID: XXX
AWS Secret Access Key: XXX
Default region name: us-east-2

For information on setting this up, including getting your access keys for AWS, see Configuring the AWS CLI.

Creating your initial on-demand instance

Life is much easier when you can rapidly create new instances which are all set up just how you like them, with the right software installed, data files downloaded, and configuration set up. You can do this by creating an AMI, which is simply a “frozen” version of a computer that you’ve set up, and can then recreate as many times as you like, nearly instantly.

Therefore, we will first set up an EC2 instance with whatever we’re going to need (we’ll call this your base instance). (You might already have an instance set up, in which case you can skip this step).

One thing that will make things a bit easier is if you ensure you have a key pair on AWS called “default”. If you don’t, go ahead and upload or create one with that name now. Although fastec2 will happily use other named keys if you wish, you’ll need to specify the key name every time if you don’t use “default”. You don’t need to make your base instance disk very big, since you can always use a larger size later when you launch new instances using your AMI. Generally 60GB is a reasonable size to choose.

To create our base image, we’ll need to start with some existing AMI that contains a Linux distribution. If you already have some preferred AMI that you use, feel free to use it; otherwise, we suggest using the latest stable Ubuntu image. To get the AMI id for the latest Ubuntu, type:

$ fe2 get-ami - id
ami-0c55b159cbfafe1f0

This shows a powerful feature of fastec2: all commands that start with “get-” return an AWS object, on which you can call any method or property (each of these commands also has a version without the get- prefix, which prints a brief summary of the object instead of returning it). Type your method or property name after a hyphen, as shown above. In this case, we’re getting the ‘id’ property of the AMI object returned by get-ami (which defaults to the latest stable Ubuntu image; see below for examples of other AMIs). To see the list of properties and methods, simply call the command without a property or method added:

$ fe2 get-ami -

Usage:           fe2 get-ami
                 fe2 get-ami architecture
                 fe2 get-ami block-device-mappings
                 fe2 get-ami create-tags
                 fe2 get-ami creation-date
                 ...

Now you can launch your instance—this creates a new “on-demand” Linux instance, and when complete (it’ll take a couple of minutes) it will print out the name, id, status, and IP address. The command will wait until ssh is accessible on your new instance before it returns:

$ fe2 launch base ami-0c55b159cbfafe1f0 50 m5.xlarge
base (i-00c7f2f81a841b525 running): 18.216.25.57

The fe2 launch command takes a minimum of 4 parameters: the name of the instance to create, the ami to use (either id or name—here we’re using the AMI id we retrieved earlier), the size of the disk to create (in GB), and the instance type. You can learn about the different instance types available from this AWS page. To see the pricing of different instances, you can use this command (replace m5 with whichever instance series you’re interested in; note that currently only US prices are displayed, and they may not be accurate or up to date—use the AWS web site for full price lists):

$ fe2 price-demand m5
["m5.large", 0.096]
["m5.metal", 4.608]
["m5.xlarge", 0.192]
["m5.2xlarge", 0.384]
["m5.4xlarge", 0.768]
["m5.12xlarge", 2.304]
["m5.24xlarge", 4.608]

With our instance running, we can now connect to it with ssh:

$ fe2 connect base
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-1032-aws x86_64)

Last login: Fri Feb 15 22:10:28 2019 from 4.78.240.2

ubuntu@ip-172-31-13-138:~$ |

Now you can configure your base instance as required, so go ahead and apt install any software you want, copy over data files you’ll need, and so forth. In order to use some features of fastec2 (discussed below) you’ll need tmux and lsyncd installed in your AMI, so go ahead and install them now (sudo apt install -y tmux lsyncd). Also, if you’ll be using the long-running script functionality in fastec2 you’ll need a private key in your ~/.ssh directory which has permission to connect to another instance to save results of the script. So copy your regular private key over (if it’s not too sensitive) or create a new one (type: ssh-keygen) and grab the ~/.ssh/id_dsa.pub file it creates.

Check: make sure you’ve done the following in your instance before you make it into an AMI: installed lsyncd and tmux; copied over your private key.

If you want to connect to jupyter notebook, or any other service on your instance, you can use ssh tunneling. To create ssh tunnels, add an extra argument to the above fe2 connect command, passing in either a single int (one port) or an array (multiple ports), e.g.:

# Tunnel to just jupyter notebook (running on port 8888)
fe2 connect od1 8888
# Two tunnels: jupyter notebook, and a server running on port 8008
fe2 connect od1 [8888,8008]

This doesn’t do any fancy forwarding between different machines on the network - it’s just a direct connection from the computer you run fe2 connect on to the computer you’re ssh’ing to. So generally you’ll run this on your own PC, and then access (for Jupyter) http://localhost:8888 in your browser.

Creating your Amazon Machine Image (AMI)

Once you’ve configured your base instance, you can create your own AMI:

$ fe2 freeze base
ami-01b7ceef9767a163a

Here ‘freeze’ is the command, and ‘base’ is the argument; replace ‘base’ with the name of the instance that you wish to “freeze” into an AMI. Note that your instance will be rebooted during this process, so ensure that you’ve saved any open documents and it’s OK to shut down. It might take 15 mins or so for the process to complete (for very large disks of hundreds of GB it could take hours). To check on progress, either look in the AMIs section of the AWS console, or type this command (it will display ‘pending’ whilst it is still creating the image):

$ fe2 get-ami base - state
pending

(As you’ll see, this is using the method-calling functionality of fastec2 that we saw earlier.)

Launching and connecting to your instance

Now that you’ve got your AMI, you can launch a new instance using that template. It only takes a couple of minutes for your new instance to be created, as follows:

$ fe2 launch inst1 base 80 m5.large
inst1 (i-0f5a3b544274c645f running): 18.191.111.211

We’re calling our new instance ‘inst1’, and using the ‘base’ AMI we created earlier. As you can see, the disk size and instance type need not be the same as you used when creating the AMI (although the disk size can’t be smaller than the size you created with). You can see all the options available for the launch command; we’ll see how to use the iops and spot parameters in the next section:

$ fe2 launch -- --help

Usage: fe2 launch NAME AMI DISKSIZE INSTANCETYPE [KEYNAME] [SECGROUPNAME] [IOPS] [SPOT]
       fe2 launch --name NAME --ami AMI --disksize DISKSIZE --instancetype INSTANCETYPE
         [--keyname KEYNAME] [--secgroupname SECGROUPNAME] [--iops IOPS] [--spot SPOT]

Congratulations, you’ve launched your first instance from your own AMI! You can repeat the previous fe2 launch command, just passing in a different name, to create more instances, and ssh to each with fe2 connect <name>. To shutdown an instance, enter in the terminal of your instance:

sudo shutdown -h now

…or alternatively enter in the terminal of your own computer (change inst1 to the name of your instance):

fe2 stop inst1

If you replace stop with terminate in the above command it will terminate your instance (i.e. it will destroy it, and by default will remove all of your data on the instance; when terminating the instance, fastec2 will also remove its name tag, so it’s immediately available to reuse). If you want to have fastec2 wait until the instance is stopped, use this command (otherwise it will happen automatically in the background):

$ fe2 get-instance inst1 - wait-until-stopped

Here’s a really handy feature: after you’ve stopped your instance, you can change it to a different type! This means that you can do your initial prototyping on a cheap instance type, and then run your big analysis on a super-fast machine when you’re ready.

$ fe2 change-type inst1 p3.8xlarge

Then you can re-start your instance and connect to it as before:

$ fe2 start inst1
inst1 (i-0f5a3b544274c645f running): 52.14.245.85

$ fe2 connect inst1
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-1032-aws x86_64)

With all this playing around with instances you may get lost as to what you’ve created and what’s actually running! To find out, just use the instances command:

$ fe2 instances
spot1 (i-0b39947b710d05337 running): 3.17.155.171
inst1 (i-0f5a3b544274c645f stopped): No public IP
base (i-00c7f2f81a841b525 running): 18.216.25.57
od1 (i-0a1b47f88993b2bba stopped): No public IP

The instances with “No public IP” will automatically get a public IP when you start them. Generally you won’t need to worry about what the IP is, since you can fe2 connect using just the name; however you can always grab the IP through fastec2 if needed:

$ fe2 get-instance base - public-ip-address
18.216.25.57

Launching a spot instance

Spot instances can be 70% (or more) cheaper than on-demand instances. However, they may be shut down at any time, may not always be available, and all data on their root volume is deleted when they are shut down (in fact, they can only be terminated; they can’t be shut down and restarted later). Spot instance prices vary over time, by instance type, and by region. To see the last 3 days’ pricing for instances in a group (in this case, for p3 types), enter:

$ fe2 price-hist p3
Timestamp      2019-02-13  2019-02-14  2019-02-15
InstanceType
p3.2xlarge         1.1166      1.1384      1.1547
p3.8xlarge         3.9462      3.8884      3.8699
p3.16xlarge        7.3440      7.4300      8.0867
p3dn.24xlarge         NaN         NaN         NaN

Let’s compare to on-demand pricing:

$ fe2 price-demand p3
["p3.2xlarge", 3.06]
["p3.8xlarge", 12.24]
["p3.16xlarge", 24.48]
["p3dn.24xlarge", 31.212]

That’s looking pretty good! To get more detailed price graphs, check out the spot pricing tool on the AWS console, or else try using the fastec2 Jupyter notebook API. This API is identical to the fe2 command, except that you create an instance of the EC2 class (optionally passing a region to the constructor), and call methods on that class. (If you haven’t used Jupyter Notebook before, then you should definitely check it out, because it’s amazingly great! Here’s a helpful tutorial from the kind folks at DataQuest to get you started.) The price-demand method has an extra feature when used in a notebook: it prints the last few weeks’ prices in a graph for you (note that hyphens must be replaced with underscores in the notebook API).

Example of spot pricing in the notebook API
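In case it helps, here’s roughly what that looks like in a notebook. The import path and constructor argument are assumptions based on the description above; the method names are just the fe2 commands with hyphens replaced by underscores:

# Sketch of the notebook API; names assumed to mirror the fe2 CLI commands.
import fastec2

e = fastec2.EC2('us-east-2')   # optionally pass a region to the constructor
e.price_hist('p3')             # same as `fe2 price-hist p3`
e.price_demand('p3')           # same as `fe2 price-demand p3`; in a notebook this also
                               # plots the last few weeks of prices, as shown above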

To launch a spot instance, just add --spot to your launch command:

$ fe2 launch spot1 base 80 m5.large --spot
spot1 (i-0b39947b710d05337 running): 3.17.155.171

Note that this is only requesting a spot instance. It’s possible that no capacity will be available for your request. In that case, after a few minutes you’ll see an error from fastec2 telling you that the request failed. We can see that the above request was successful, because it’s printed out a message showing the new instance is “running”.

Remember: if you stop this spot instance it will be terminated and all data will be lost! And AWS can decide to shut it down at any time.

Using the interactive REPL and ssh API

How do you know what methods and properties are available? And how can you access them more conveniently? The answer is: use the interactive REPL! A picture is worth a thousand words:

The fastec2 REPL

If you add -- -i to the end of a command which returns an object (which is currently the instance, get-ami, and ssh commands) then you’ll be popped in to an IPython session with that object available in the special name result. So just type result. and hit Tab to see all the methods and properties available. This is a full python interpreter, so you can use the full power of python to interact with this object. When you’re done, hit Ctrl-d twice to exit.
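For example, here’s a hypothetical session (the AMI id is the one we retrieved earlier; the properties and values you see will depend on your account):

$ fe2 get-ami -- -i
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: result.id
Out[1]: 'ami-0c55b159cbfafe1f0'

In [2]: result.state
Out[2]: 'available'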

One interesting use of this is to experiment with the ssh command, which provides an API to issue commands to the remote instance via ssh. The object returned by this command is a standard Paramiko SSHClient, with a couple of extra goodies. One of those goodies is send(cmd), which sends ‘cmd’ to a tmux session (that’s automatically started) on the instance. This is mainly designed for you to use from scripts, but you can experiment with it via the REPL, as shown below.

Communicating with remote tmux session via the REPL
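For instance, you might try something like this (a sketch; the instance name argument and the commands sent are just illustrations):

$ fe2 ssh od1 -- -i
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: result.send('htop')           # runs htop inside the instance's tmux session

In [2]: print(result.run('uptime'))   # run() executes a one-off command and returns its output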

If you just want to explore the fastec2 API interactively, the easiest way is by launching the REPL using fe2 i (you can optionally append a region id or part of a region name). A fastec2.EC2 object called e will be automatically created for you. Type e. and hit Tab to see a list of options. IPython is started in smart autocall mode, which means that you often don’t even need to type parentheses to run methods. For instance:

$ fe2 i Ohio
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: e.instances
inst1 (i-0f5a3b544274c645f m5.large running): 18.222.175.103
base (i-00c7f2f81a841b525 m5.xlarge stopped): No public IP
od1 (i-0a1b47f88993b2bba t3.micro running): 18.188.162.203

In [2]: i=e.get_instance('od1')

In [3]: i.block_device_mappings
Out[3]:
[{'DeviceName': '/dev/sda1',
  'Ebs': {'AttachTime': datetime.datetime(2019, 2, 14, 9, 30, 16),
   'DeleteOnTermination': True,
   'Status': 'attached',
   'VolumeId': 'vol-0d1b1a47539d5bcaf'}}]

fastec2 provides many convenient methods for managing AWS EC2, and also adds functionality to make SSH and SFTP easier to use. We’ll look at these features of the fastec2 API in more detail in a future article.

If you want to learn how to run and monitor long-running tasks with fastec2 check out part 2 of this series, where we’ll also see how fastec2 helps to create and use volumes and snapshots, including automatic formatting/mounting.

Some thoughts on zero-day threats in AI, and OpenAI's GPT-2

There’s been a lot of discussion in the last couple of days about OpenAI’s new language model. OpenAI made the unusual decision not to release their trained model (the AI community is usually extremely open about sharing trained models). On the whole, the reaction has been one of both amazement and concern, and the work has been widely discussed in the media, including this thoughtful and thorough coverage in The Verge. The reaction from the academic NLP community, on the other hand, has been largely (but not exclusively) negative, claiming that:

  1. This shouldn’t be covered in the media, because it’s nothing special
  2. OpenAI had no reason to keep the model to themselves, other than to try to generate media hype through claiming their model is so special it has to be kept secret.

On (1), whilst it’s true that there’s no real algorithmic leap being done here (the model is mainly a larger version of something that was published by the same team months ago), the academic “nothing to see here” reaction misses the point entirely. Whilst academic publishing is (at least in this field) largely driven by specific technical innovations, broader community interest is driven by societal impact, surprise, narrative, and other non-technical issues. Every layperson I’ve spoken to about this new work has reacted with stunned amazement. And there’s clearly a discussion to be had about potential societal impacts of a tool that may be able to scale up disinformation campaigns by orders of magnitude, especially in our current environment where such campaigns have damaged democracy even without access to such tools.

In addition, the history of technology has repeatedly shown that the hard thing is not, generally, solving a specific engineering problem, but showing that a problem can be solved. So showing what is possible is, perhaps, the most important step in technology development. I’ve been warning about potential misuse of pre-trained language models for a while, and even helped develop some of the approaches people are using now to build this tech; but it’s not until OpenAI actually showed what can be done in practice that the broader community has woken up to some of the concerns.

But what about the second issue: should OpenAI release their pretrained model? This one seems much more complex. We’ve already heard the “anti-model-release” view, since that’s what OpenAI has published and discussed with the media. Catherine Olsson (who previously worked at OpenAI) asked on Twitter whether anyone has yet seen a compelling explanation of the alternative view.

I’ve read a lot of the takes on this, and haven’t yet found one that really qualifies. A good-faith explanation would need to engage with what OpenAI’s researchers actually said, which takes a lot of work, since their team have written a lot of research on the societal implications of AI (both at OpenAI, and elsewhere). The most in-depth analysis of this topic is the paper The Malicious Use of Artificial Intelligence. The lead author of this paper now works at OpenAI, and was heavily involved in the decision around the model release. Let’s take a look at the recommendations of that paper:

  1. Policymakers should collaborate closely with technical researchers to investigate, prevent, and mitigate potential malicious uses of AI
  2. Researchers and engineers in artificial intelligence should take the dual-use nature of their work seriously, allowing misuse-related considerations to influence research priorities and norms, and proactively reaching out to relevant actors when harmful applications are foreseeable.
  3. Best practices should be identified in research areas with more mature methods for addressing dual-use concerns, such as computer security, and imported where applicable to the case of AI.
  4. Actively seek to expand the range of stakeholders and domain experts involved in discussions of these challenges.

An important point here is that an appropriate analysis of potential malicious use of AI requires a cross-functional team and deep understanding of history in related fields. I agree. So what follows is just my one little input to this discussion. I’m not ready to claim that I have the answer to the question “should OpenAI have released the model”. I will also try to focus on the “pro-release” side, since that’s the piece that hasn’t had much thoughtful input yet.

A case for releasing the model

OpenAI said that their release strategy is:

Due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale, we are only releasing a much smaller version of GPT-2 along with sampling code.

So specifically we need to be discussing scale. Their claim is that a larger scale model may cause significant harm without time for the broader community to consider it. Interestingly, even they don’t claim to be confident of this concern:

This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas.

Let’s get specific. How much scale are we actually talking about? I don’t see this explicitly mentioned in their paper or blog post, but we can make a reasonable guess. The new GPT-2 model has (according to the paper) about ten times as many parameters as their previous GPT model. Their previous model took 8 GPUs 1 month to train. One would expect that they can train their model faster by now, since they’ve had plenty of time to improve their algorithms, but on the other hand, their new model probably takes more epochs to train. Let’s assume that these two balance out, so we’re left with the difference of 10x in parameters.

If you’re in a hurry and you want to get this done in a month, then you’re going to need 80 GPUs. You can grab a server with 8 GPUs from the AWS spot market for $7.34/hour. That’s around $5300 for a month. You’ll need ten of these servers, so that’s around $50k to train the model in a month. OpenAI have made their code available, and described how to create the necessary dataset, but in practice there’s still going to be plenty of trial and error, so in practice it might cost twice as much.

If you’re in less of a hurry, you could just buy 8 GPUs. With some careful memory handling (e.g. using Gradient checkpointing) you might be able to get away with buying RTX 2070 cards at $500 each, otherwise you’ll be wanting the RTX 2080 ti at $1300 each. So for 8 cards, that’s somewhere between $4k and $10k for the GPUs, plus probably another $10k or so for a box to put them in (with CPUs, HDDs, etc). So that’s around $20k to train the model in 10 months (again, you’ll need some extra time and money for the data collection, and some trial and error).
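To make that arithmetic concrete, here’s the back-of-the-envelope version in a few lines of Python (the dollar figures are the rough estimates quoted above, not current prices):

# Back-of-the-envelope cost estimates using the rough figures quoted above.
hours_per_month = 24 * 30

# Option 1: rent ten 8-GPU spot servers for roughly a month
spot_price_per_server = 7.34                        # US$/hour for an 8-GPU spot server
cloud_cost = 10 * spot_price_per_server * hours_per_month
print(f"Cloud, ~1 month: ${cloud_cost:,.0f}")       # ~$53k, i.e. "around $50k"

# Option 2: buy one 8-GPU box and train for ~10 months
gpus_low, gpus_high = 8 * 500, 8 * 1300             # RTX 2070 vs RTX 2080 ti
box = 10_000                                        # CPUs, storage, chassis, etc.
print(f"Own hardware: ${gpus_low + box:,} to ${gpus_high + box:,}")   # ~$14k to ~$20k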

Most organizations doing AI already have 8 or more GPUs available, and can often get access to far more (e.g. AWS provides up to $100k credits to startups in its AWS Activate program, and Google provides dozens of TPUs to any research organization that qualifies for their research program).

So in practice, the decision not to release the model has a couple of outcomes:

  1. It’ll probably take at least a couple of months before another organization has successfully replicated it, so we have some breathing room to discuss what to do when this is more widely available
  2. Small organizations that can’t afford to spend $100k or so are not able to use this technology at the scale being demonstrated.

Point (1) seems like a good thing. If suddenly this tech is thrown out there for anyone to use without any warning, then no-one can be prepared at all. (In theory, people could have been prepared because those within the language modeling community have been warning of such a potential issue, but in practice people don’t tend to take it seriously until they can actually see it happening.) This is what happens, for instance, in the computer security community, where if you find a flaw the expectation is that you help the community prepare for it, and only then do you release full details (and perhaps an exploit). When this doesn’t happen, it’s called a zero day attack or exploit, and it can cause enormous damage.

I’m not sure I want to promote a norm that zero-day threats are OK in AI.

On the other hand, point (2) is a problem. The most serious threats are most likely to come from folks with resources to spend $100k or so on (for example) a disinformation campaign to attempt to change the outcome of a democratic election. In practice, the most likely exploit is (in my opinion) a foreign power spending that money to dramatically escalate existing disinformation campaigns, such as those that have been extensively documented by the US intelligence community.

The only practical defense against such an attack is (as far as I can tell) to use the same tools to both attempt to identify, and push back against, such disinformation. These kinds of defenses are likely to be much more powerful when wielded by the broader community of those impacted. Large groups of individuals have repeatedly shown themselves to be more powerful at creation than at destruction, as we see in projects such as Wikipedia or open source software.

In addition, if these tools are kept out of the hands of people without access to large compute resources, then the tools remain abstract and mysterious to them. What can they actually do? What are their constraints? For people to make informed decisions, they need to have a real understanding of these issues.

Conclusion

So, should OpenAI release their trained model? Frankly, I don’t know. There’s no question in my mind that they’ve demonstrated something fundamentally qualitatively different to what’s been demonstrated before (despite not showing any significant algorithmic or theoretic breakthroughs). And I’m sure it will be used maliciously; it will be a powerful tool for disinformation and for influencing discourse at massive scale, and probably only costs about $100k to create.

By releasing the model, this malicious use will happen sooner. But by not releasing the model, there will be fewer defenses available and less real understanding of the issues from those that are impacted. Those both sound like bad outcomes to me.