fastec2 script: Running and monitoring long-running tasks

technical
Author: Jeremy Howard
Published: February 15, 2019

This is part 2 of a series on fastec2. For an introduction to fastec2, see part 1.

Spot instances are particularly good for long-running tasks, since you can save a lot of money, and you can use more expensive instance types just for the period you’re actually doing heavy computation. fastec2 has some features to make this use case much more convenient. Let’s see an example. Here’s what we’ll be doing:

  1. Use an inexpensive on-demand monitoring instance for collecting results (and optionally for launching the task). We’ll call this od1 in this guide (but you can call it anything you like)
  2. Create a script to do the work required, and put any configuration files it needs in a specific folder. The script must write its results to that same folder, since only files there are copied back
  3. Test the script works OK in a fresh instance
  4. Run the script under fastec2, which will cause it to be launched inside a tmux session on a new instance, with the required files copied over, and any results copied back to od1 as they’re created
  5. While the script is running, check its progress either by connecting to the tmux session it’s running in, or looking at the results being copied back to od1 as it runs
  6. When done, the instance will be terminated automatically, and we’ll review the results on od1.

Let’s look at the details of how this works, and how to use it. Later in this post, we’ll also see how to use fastec2’s volumes and snapshots functionality to make it easier to connect to large datasets.

Setting up your monitoring instance and script

First, create a script that completes the task you need. When running under fastec2, the script will be launched inside a subdirectory of ~/fastec2 named after your instance. This directory will also contain any extra files (that aren’t already in your AMI) needed for the script, and it will be monitored for changes, which are copied back to your on-demand instance (od1, in this guide). Here’s an example (we’ll call it myscript.sh) we can use for testing:

#!/usr/bin/env bash
echo starting >> "$FE2_DIR/myscript.log"
sleep 60
echo done >> "$FE2_DIR/myscript.log"

When running, the environment variable FE2_DIR will be set to the directory your script and files are in. Remember to give your script executable permissions:

$ chmod u+x myscript.sh

When testing it on a fresh instance, just set FE2_DIR and create that directory, then see if your script runs OK (it’s a good idea to have some parameter to your script that causes it to run a quick version for testing).

$ export FE2_DIR=~/fastec2/spot2
$ mkdir -p $FE2_DIR
$ ./myscript.sh
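
The quick-version parameter suggested above could look something like the sketch below. The `--quick` flag and the temporary working directory are assumptions made for illustration; neither is part of fastec2. The block writes a variant of myscript.sh, then runs it the same way as the manual test above.

```shell
# Write a variant of myscript.sh with a hypothetical --quick flag (not a fastec2 feature).
workdir=$(mktemp -d)
cat > "$workdir/myscript.sh" <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Fail early with a clear message if FE2_DIR has not been set.
: "${FE2_DIR:?FE2_DIR must point at the task directory}"

# --quick shortens the simulated work so the script can be smoke-tested.
duration=60
if [[ "${1:-}" == "--quick" ]]; then
    duration=1
fi

echo starting >> "$FE2_DIR/myscript.log"
sleep "$duration"
echo done >> "$FE2_DIR/myscript.log"
EOF
chmod u+x "$workdir/myscript.sh"

# Smoke test: stand in for the directory fastec2 would create, then run the quick version.
export FE2_DIR="$workdir/fastec2/spot2"
mkdir -p "$FE2_DIR"
"$workdir/myscript.sh" --quick
cat "$FE2_DIR/myscript.log"
```

Run without `--quick`, the script behaves like the original example and sleeps for the full 60 seconds.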

Running the script with fastec2

You need some computer running that can be used to collect the results of the long running script. You won’t want to use a spot instance for this, since it can be shut down at any time, causing you to lose your work. But it can be a cheap instance type; if you’ve had your AWS account for less than 1 year then you can use a t2.micro instance for free. Otherwise a t3.micro is a good choice—it should cost you around US$7/month (plus storage costs) if you leave it running.

To run your script under fastec2, you need to provide the following information:

  1. The name of the instance to use (first create it with launch)
  2. The name of your script
  3. Additional arguments ([--myip MYIP] [--user USER] [--keyfile KEYFILE]) to connect to the monitoring instance to copy results to. If no host is provided, it uses the IP of the computer where fe2 is running.

E.g. these commands will launch spot2, run myscript.sh on it, and copy results back to 18.188.162.203:

$ fe2 launch spot2 base 80 m5.large --spot
$ fe2 script myscript.sh spot2 18.188.162.203

Here’s what happens after you run the fe2 script line above:

  1. A directory called ~/fastec2/spot2 is created on the monitoring instance if it doesn’t already exist (it is always a subdirectory of ~/fastec2 and is given the same name as the instance you’re connecting to, which in this case is spot2)
  2. Your script is copied to this directory
  3. This directory is copied to the target instance (in this case, spot2)
  4. A file called ~/fastec2/current is created on the target instance, containing the name of this task (“spot2” in this case)
  5. lsyncd is run in the background on the target instance, which will continually copy any new/changed files from ~/fastec2/spot2 on the target instance, to the monitoring instance
  6. ~/fastec2/spot2/myscript.sh is run inside the tmux session
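
Since step 5 keeps copying files back, progress can be checked from the monitoring instance by looking at the task's directory. Here is a minimal sketch of a helper for that. The function name `show_progress` and the `FE2_ROOT` override are hypothetical (the override only exists so the helper can be tried against a scratch directory), and it assumes your script writes files ending in `.log`.

```shell
# Print the last few lines of every log copied back for a task.
# FE2_ROOT is a hypothetical override for testing; results normally live under ~/fastec2.
show_progress() {
    local task=$1
    local dir="${FE2_ROOT:-$HOME/fastec2}/$task"
    if [[ ! -d "$dir" ]]; then
        echo "no results directory yet for $task" >&2
        return 1
    fi
    local f
    for f in "$dir"/*.log; do
        [[ -e "$f" ]] || continue   # the glob matched nothing
        echo "== $f"
        tail -n 5 "$f"
    done
}
```

On od1 you would call it as `show_progress spot2`. For a live view of a single file, `tail -f ~/fastec2/spot2/myscript.log` does the same job.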

If you want the instance to terminate after the script completes, remember to include systemctl poweroff (for Ubuntu) or similar at the end of your script.
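
One way to make sure the shutdown happens even when the task fails partway through is a bash `EXIT` trap. The sketch below is an illustration rather than anything fastec2 provides: `SHUTDOWN_CMD` is a hypothetical override that lets you dry-run the script without powering anything off, and the default is the command mentioned above (add `sudo` if your user needs it).

```shell
workdir=$(mktemp -d)
cat > "$workdir/myscript.sh" <<'EOF'
#!/usr/bin/env bash
set -e   # stop at the first failing command; the EXIT trap below still runs

# SHUTDOWN_CMD is a hypothetical override for dry runs, not a fastec2 setting.
shutdown_cmd="${SHUTDOWN_CMD:-systemctl poweroff}"

finish() {
    status=$?
    echo "finished with exit status $status" >> "$FE2_DIR/myscript.log"
    $shutdown_cmd   # left unquoted on purpose so the command splits into words
}
trap finish EXIT

echo starting >> "$FE2_DIR/myscript.log"
sleep 1   # stand-in for the real work
echo done >> "$FE2_DIR/myscript.log"
EOF
chmod u+x "$workdir/myscript.sh"

# Dry run: replace the real shutdown with a harmless echo.
export FE2_DIR="$workdir"
output=$(SHUTDOWN_CMD="echo would power off" "$workdir/myscript.sh")
echo "$output"
```

Recording the exit status in the log means the reason for an early shutdown is copied back to the monitoring instance along with the other results, provided the file is synced before the instance goes down.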

Creating a data volume

One issue with the above process is that if you have a bunch of different large datasets to work with, you either need to copy all of them to each AMI you want to use (which is expensive, and means recreating that AMI every time you add a dataset), or create a new AMI for each dataset (which means that as you change your configuration or add applications, you have to change all your AMIs).

An easier approach is to put your datasets on to a separate volume (that is, an AWS disk). fastec2 makes it easy to create a volume (formatted with ext4, which is the most common type of filesystem on Linux). To do so, it’s easiest to use the fastec2 REPL (see the last section of part 1 of this series for an introduction to the REPL), since we need an ssh object which can connect to an instance to mount and format our new volume. For instance, to create a volume using instance od1 (assuming it’s already running):

$ fe2 i
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: inst = e.get_instance('od1')

In [2]: ssh = e.ssh(inst)

In [3]: vol = e.create_volume(ssh, 20)

In [4]: vol
Out[4]: od1 (vol-0bf4a7b9a02d6f942 in-use): 20GB

In [5]: print(ssh.run('ls -l /mnt/fe2_disk'))
total 20
-rw-rw-r-- 1 ubuntu ubuntu     2 Feb 20 14:36 chk
drwx------ 2 ubuntu root   16384 Feb 20 14:36 lost+found

As you see, the new disk has been mounted on the requested instance under the directory /mnt/fe2_disk, and the new volume has been given the same name (od1) as the instance it was created with. You can now connect to your instance and copy your datasets to this directory. When you’re done, unmount the volume (sudo umount /mnt/fe2_disk in your ssh session), and then detach it with fastec2. If you don’t have your previous REPL session open any more, you’ll need to get your volume object first, and then you can detach it:

In [1]: vol = e.get_volume('od1')

In [2]: vol
Out[2]: od1 (vol-0bf4a7b9a02d6f942 in-use): 20GB

In [3]: e.detach_volume(vol)

In [4]: vol
Out[4]: od1 (vol-0bf4a7b9a02d6f942 available): 20GB

In the future, you can re-mount your volume through the REPL:

In [5]: e.mount_volume(ssh, vol)
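
Whenever the volume is mounted, whether freshly created or re-mounted as above, datasets can be copied onto it from a shell on the instance. Here is a minimal sketch of that copy step. The function name `copy_datasets` and the example dataset path are hypothetical; the mount point default is the /mnt/fe2_disk directory shown earlier.

```shell
# Copy a dataset directory onto the mounted volume and verify the copy.
copy_datasets() {
    local src=$1
    local mount_point=${2:-/mnt/fe2_disk}
    if [[ ! -d "$mount_point" ]]; then
        echo "$mount_point not found; is the volume mounted?" >&2
        return 1
    fi
    local dest
    dest="$mount_point/$(basename "$src")"
    mkdir -p "$dest"
    cp -a "$src"/. "$dest"/   # -a preserves permissions and timestamps
    # Report success only if the two trees are identical.
    diff -r "$src" "$dest" > /dev/null && echo "copied $src to $dest"
}
```

For example, `copy_datasets ~/data/imagenet && sudo umount /mnt/fe2_disk` would copy a (hypothetical) dataset directory and then unmount the volume, ready for it to be detached.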

Using snapshots

A significant downside of volumes is that you can only attach a volume to one instance at a time. That means you can’t use volumes to launch lots of tasks all connected to the same dataset. Instead, for this purpose you should create a snapshot. A snapshot is a template for a volume; any volumes created from this snapshot will have the same data that the original volume did. Note however that snapshots are not updated with any additional information added to volumes—the data originally included in the snapshot remains without any changes.

To create a snapshot from a volume (assuming you already have a volume object vol, as above, and you’ve detached it from the instance):

In [7]: snap = e.create_snapshot(vol, name="snap1")

You can now create a volume using this snapshot, which attaches to your instance automatically:

In [8]: vol = e.create_volume(ssh, name="vol1", snapshot="snap1")

Summary

Now we’ve got all the pieces of the puzzle. In a future post we’ll discuss best practices for combining these pieces to run tasks with fastec2, but here’s a quick summary of the process:

  1. Launch an instance and set it up with the software and configuration you’ll need
  2. Create a volume for your datasets if required, and make a snapshot from it
  3. Stop that instance, and create an AMI from it (optionally you can terminate the instance after that is done)
  4. Launch a monitoring instance using an inexpensive instance type
  5. Launch a spot instance for your long-running task
  6. Create a volume from your snapshot, attached to your spot instance
  7. Run your long running task on that instance, passing the IP of your monitoring instance
  8. Ensure that your long running task shuts down the instance when done, to avoid paying for the instance after the task completes. (You may also want to delete the volume created from the snapshot at that time.)

To run additional tasks, you only need to repeat the last 4 steps. You can automate that process using the API calls shown in this guide.
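
As a rough illustration of repeating those steps from the command line, the sketch below wraps the two fe2 commands shown earlier in a function. The name `launch_task` is hypothetical, and the AMI name, disk size, and instance type are simply copied from the earlier example. Step 6 (creating a volume from your snapshot) is not covered here, since this guide only shows it through the REPL.

```shell
# Launch a spot instance and start a script on it, reporting back to a monitoring instance.
# Usage: launch_task NAME SCRIPT MONITOR_IP
launch_task() {
    local name=$1 script=$2 monitor_ip=$3

    # Step 5: launch the spot instance (AMI "base", 80GB disk, m5.large, as in the example above).
    fe2 launch "$name" base 80 m5.large --spot || return 1

    # Step 6 would go here: e.create_volume(ssh, name=..., snapshot=...) in the fastec2 REPL.

    # Step 7: run the script; results are copied back to the monitoring instance.
    fe2 script "$script" "$name" "$monitor_ip"
}
```

With this in place, something like `for i in 1 2 3; do launch_task "spot$i" myscript.sh 18.188.162.203; done` would start three tasks, each collecting its results in its own subdirectory of ~/fastec2 on the monitoring instance. Step 8 is left to the script itself.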