Blogging with screenshots

The joy of screenshots

One of the most useful tools for blogging is screenshots. You can use a screenshot for all kinds of things. I find it particularly useful for including stuff that I find on the Internet. For instance, did you know that the NY Times reported in 1969 that there were folks who thought the moon landing was faked? A screenshot of the NY Times website tells the story:

Generally, the most useful kind of screenshot is where you select a region of your screen which you wish to include. To do this on Windows, press Windows-Shift-S, then drag over the area you wish to create an image of; on Mac, it’s Command-Shift-4.

Combining this with writing your posts in Microsoft Word or Google Docs makes it particularly easy to include pictures in your posts. You don’t have to worry about finding special HTML syntax to embed a tweet, or downloading and resizing an appropriately sized image to include some photo that you found, or any other special approach for different kinds of content. If something appears on your screen, you can take a screenshot of it, so you can put it in your post!

The problem of giant screenshots

Unfortunately, you might notice that when you paste your screenshot into your blog post and then publish it, it appears way too big, and not as sharp as you would like. For instance, here is the “commit” example image from my blog post about using GitHub Desktop:

This is about twice as big as it actually appeared on my screen, and not nearly as sharp. To fix it, have a look at the markdown for the image:

![](/images/gitblog/commit.png)

And replace it instead with this special syntax (note that you don’t need to include the ‘/images/’ prefix in this form):

{% include screenshot url="gitblog/commit.png" %}

This appears the correct size, and with the same sharpness that it had on my screen (assuming you have a high-resolution screen too):

You can do this quite quickly by using your editor’s search and replace feature to replace \!\[\](/images/ with the appropriate syntax, and then replace the end of the lines with ‘ %}’ manually. (Of course, if you know how to use regular expressions in your editor, you can do it in one step!)
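
For instance, in an editor that supports regular expression search and replace (the exact syntax varies between editors; this sketch assumes one that uses $1 to refer to the captured group), you could search for the first pattern below and replace it with the second:

Search:  !\[\]\(/images/(.*)\)
Replace: {% include screenshot url="$1" %}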

(This cartoon is from the wonderful XKCD.)

The gory details

So, what’s going on here? Frankly, it doesn’t matter in the slightest. Feel free to stop reading right now and go and do something more interesting. But, if you have nothing better to do, let’s talk a little bit about the arcane world of high-resolution displays and web browsers. (Many thanks to Ross Wightman for getting me on the right track to figuring this out!)

The basic issue is that your web browser does not define a “pixel” as… well, as a pixel. It actually defines it as 1/96 of an inch. The reason for this is that for a long time computer displays were 96 dots per inch. When screens started getting higher resolutions, as we moved from CRT to LCD displays, this caused a problem: things started looking too small. When a designer created something that was 96 pixels across, they had expected it to be 1 inch wide. But on a 200 dpi display it’s less than half an inch wide! So, the powers that be decided that the definition of a pixel in a web browser would remain as 1/96 of an inch, regardless of the actual size of pixels on your monitor.

But when I take a screenshot it actually has a specific number of pixels in it. When I then insert it into a webpage, my web browser decides to make each of those pixels take up 1/96 of an inch. And so now I have a giant image! There isn’t really one great way to fix this. Web forums are full of discussions amongst designers on various workarounds. But it turns out there’s a neat hack that generally works pretty well. Here’s what the HTML for an image normally looks like:

<img src="image.png">

We can replace that with this slight variation:

<img srcset="image.png 2w" sizes="1px">

The ‘2w’ and ‘1px’ tell the browser how we want to map the width of the image to pixels. All that matters is the ratio between these numbers, which in this case is 2. That means that the image will be scaled down by a factor of 2, and it will be done in a way that fully uses the viewer’s high-resolution display (if they have one).
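
For instance, to scale an image down by a factor of 3 instead, the same logic suggests you could write:

<img srcset="image.png 3w" sizes="1px">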

This is a somewhat recent addition to web browsers, so if we also want this to work for people using older software, then we should include both the new and old approaches that we have seen, like so:

<img srcset="image.png 2w" sizes="1px" src="image.png">

It would be annoying to have to write that out by hand every time we wanted an image, but we can take advantage of something called Jekyll, which is the underlying templating approach that Github Pages uses. We can create our own template, that is, a small piece of text where we can fill in parameters later, like so:

<img srcset="/images/{{ include.url }} 2w" sizes="1px" src="/images/{{ include.url }}">

In fast_template this template is saved in a file called “screenshot”, which is why we can use the convenient syntax we saw earlier:

{% include screenshot url="gitblog/commit.png" %}

Syncing your blog with your PC, and using your word processor

You’ve already seen how to create your own hosted blog, the easy, free, open way, with the help of fast_template. Now I’ll show you how to make life even easier, by syncing your blog with your computer, and writing posts with MS Word or Google Docs (especially useful if you’re including lots of images in your posts).

Synchronizing GitHub and your computer

There’s lots of reasons you might want to copy your blog content from GitHub to your computer. Perhaps you want to read or edit your posts offline. Or maybe you’d like a backup in case something happens to your GitHub repository.

GitHub does more than just let you copy your repository to your computer; it lets you synchronize it with your computer. So, you can make changes on GitHub, and they’ll copy over to your computer, and you can make changes on your computer, and they’ll copy over to GitHub. You can even let other people access and modify your blog, and their changes and your changes will be automatically combined together next time you sync.

To make this work, you have to install an application called GitHub Desktop on your computer. It runs on Mac, Windows, and Linux. Follow the directions at the link to install it; when you run it, it’ll ask you to log in to GitHub and then to select the repository to sync; click “Clone a repository from the Internet”.

Once GitHub has finished syncing your repo, you’ll be able to click “View the files of your repository in Finder” (or Explorer), and you’ll see the local copy of your blog! Try editing one of the files on your computer. Then return to GitHub Desktop, and you’ll see the “Sync” button is waiting for you to press it. When you click it, your changes will be copied over to GitHub, where you’ll see them reflected on the web site.

Writing with Microsoft Word or Google Docs

One place this is particularly handy is for creating posts with lots of images, especially screenshots that you capture on your computer. I find these much easier to create in Microsoft Word, since I can paste my images directly into the document (Google Docs provides similar functionality). You can convert your Microsoft Word documents into markdown blog posts; in fact, I’m doing exactly that right now!

To do so, create your Word doc in the usual way. For headings, be sure to choose “heading 1”, “heading 2”, etc from the style ribbon—don’t manually format them. (A shortcut for “heading 1” is to press Ctrl-Alt-1, and so forth for each heading level).

When you want to insert an image, simply paste it directly into your document, or drag a file into your document. (On Windows, press Windows-Shift-S to create a screenshot, then drag over the area you wish to create an image of. On Mac, press Command-Shift-4.)

Once you’ve finished, save your work as usual, then we’ll need to convert it to markdown format. To do this, we use a program called Pandoc. Download and install Pandoc (by double clicking the file you download from the Pandoc website). Then navigate in Finder or Explorer to where you saved your Word doc and open a command line (Terminal, Command Prompt, or PowerShell) there. To do so:

  • In Windows: hold down Alt and press f to open the File menu, then click ‘Open Command Prompt’ or ‘Open Windows PowerShell’
  • In macOS: this requires an extra setup step. Follow these instructions and you’ll be up and running!

Now paste the following command into the command line window:

pandoc -o name.md --extract-media=name/ name.docx -w gfm --atx-headers --columns 9999

Replace “name” with the name of your file. To paste into the command line window:

  • In Windows: press Ctrl-v or right-click
  • In MacOS: press Command-Shift-v or middle-click.

After you press Enter, you’ll find you have a new file (name.md) containing your blog post in markdown format, and a new folder with the name of your doc, containing a “media” folder with all your images inside. Use Explorer or Finder to move those images into a folder called “name” inside your blog repo’s images folder, and to move the markdown file name.md into your _posts folder.

You just have one more step. Open the markdown file in an editor or word processor, and do a search and replace (Ctrl-H in Microsoft Word), searching for “name/media” and replacing it with “/images/name” (be careful to type the forward slash characters exactly as you see them). The lines containing your images should now look like this:

![](/images/name/image1.png)
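
If you’re comfortable with the command line, a one-line alternative to the search and replace is the following (this assumes perl is available, as it is by default on macOS and most Linux systems; replace “name” with the name of your file, as before):

perl -pi -e 's|name/media|/images/name|g' name.md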

You can now commit your changes (that is, save them to the repo) by switching to GitHub Desktop, filling in a summary of your changes in the bottom left corner, and clicking Commit to master. Finally, click Push origin at the top right of the window, to send your changes to the GitHub server.
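
(If you’re curious, the rough command line equivalent of those two GitHub Desktop buttons, run from inside your repository folder, would be the following; but you don’t need the command line for any of this.)

git add -A
git commit -m "Add new blog post"
git push origin master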

Instead of going to the command line and pasting the pandoc line every time you want to convert a file, there’s an easier way. If you’re on Windows, create a text file with the following exact contents:

pandoc -o %~n1.md --extract-media=%~n1 %1 -w gfm --atx-headers --columns 9999

Save the text file with the name “pandocblog.bat”. Now you can drag any MS Word file onto the pandocblog.bat icon, and it will convert it to markdown right away! (If you’re a Mac user and know how to do something similar for MacOS, please send me the details and I’ll add them to this post.)

Using your own domain name

One thing that can make your blog appear more professional is to use your own domain name, rather than a subdomain of github.io. This costs around $12/year, depending on the top-level domain you choose. To set this up, first go to www.domains.google, search for the domain name you want to register, and click “Get it”. If you get back the message that your domain is already registered, click on “All endings” to see alternative top level domains that are available, along with their prices.

Once you’ve found a domain you like, add it to your shopping basket and check out. You now need to do two things:

  1. Tell GitHub Pages to use this custom domain
  2. Tell Google Domains to direct connections to this custom domain to GitHub Pages.

There’s a great tutorial by Trent Yang on how to do this, so rather than repeat his excellent work, I’ll just suggest you head over there to complete this process: How to setup google domain for github pages.

Your own hosted blog, the easy, free, open way (even if you're not a computer expert)

This post introduces fast_template, the easiest way to create your own hosted blog. There are no ads or paywalls, and you have your own hosted blog using open standards and data that you own. It requires no coding, no use of the command line, and supports custom themes and even your own custom domain (which is entirely optional). Behind the scenes, you’ll be using powerful foundations like git and Jekyll. But you won’t have to learn anything about these underlying technologies; instead, I’ll show you how to do everything using a simple web-based interface.

  1. You should blog
  2. Creating the repository
  3. Setting up your homepage
  4. Creating posts
  5. Going further

You should blog

Rachel Thomas, co-founder of fast.ai, said it best in her article Why you (yes, you) should blog:

The top advice I would give my younger self would be to start blogging sooner. Here are some reasons to blog:

  • It’s like a resume, only better. I know of a few people who have had blog posts lead to job offers!
  • Helps you learn. Organizing knowledge always helps me synthesize my own ideas. One of the tests of whether you understand something is whether you can explain it to someone else. A blog post is a great way to do that.
  • I’ve gotten invitations to conferences and invitations to speak from my blog posts. I was invited to the TensorFlow Dev Summit (which was awesome!) for writing a blog post about how I don’t like TensorFlow.
  • Meet new people. I’ve met several people who have responded to blog posts I wrote.
  • Saves time. Any time you answer a question multiple times through email, you should turn it into a blog post, which makes it easier for you to share the next time someone asks.

Perhaps her most important tip is this: “You are best positioned to help people one step behind you. The material is still fresh in your mind. Many experts have forgotten what it was like to be a beginner (or an intermediate) and have forgotten why the topic is hard to understand when you first hear it. The context of your particular background, your particular style, and your knowledge level will give a different twist to what you’re writing about.”

Unfortunately, when it comes to blogging, it seems like you have to make a decision: either use a platform that makes it easy, but subjects you and your readers to advertisements, paywalls, and fees, or spend hours setting up your own hosting and weeks learning about all kinds of intricate details. Perhaps the biggest benefit of the “do-it-yourself” approach is that you really own your own posts, rather than being at the whim of a service provider, and their decisions about how to monetize your content in the future.

It turns out, however, that you can have the best of both worlds! You can host on a platform called GitHub Pages, which is free, has no ads or pay wall, and makes your data available in a standard way such that you can at any time move your blog to another host. But all the approaches I’ve seen to using GitHub Pages have required knowledge of the command line and arcane tools that only software developers are likely to be familiar with. For instance, GitHub’s own documentation on setting up a blog requires installing the Ruby programming language, using the git command line tool, copying over version numbers, and more. 17 steps in total!

We’ve curated an easy approach, which allows you to use an entirely browser-based interface for all your blogging needs. You will be up and running with your new blog within about five minutes. It doesn’t cost anything, and in the future, you can easily add your own custom domain to it if you wish to. Here’s how to do it, using a template we’ve created called fast_template.

Creating the repository

You’ll need an account on GitHub. So, head over there now, and create an account if you don’t have one already. Make sure that you are logged in. Normally, GitHub is used by software developers for writing code, and they use a sophisticated command line tool to work with it. But I’m going to show you an approach that doesn’t use the command line at all!

To get started, click on this link: https://github.com/fastai/fast_template/generate . This will allow you to create a place to store your blog, called a “repository”. You will see the following screen; you have to enter your repository name using the exact form you see below, that is, the username you used at GitHub followed by “.github.io”.

Important: Note that if you don't use username.github.io as the name, it won't work!

Once you’ve entered that, and any description you like, click on “create repository from template”. Note that, unless you pay, you need to make your repository “public”. But since you are creating a blog that you want other people to read, having the underlying files publicly available hopefully won’t be a problem for you.

Setting up your homepage

When readers first arrive at your blog, the first thing they will see is the content of a file called “index.md”. This is a markdown file. Markdown is a powerful yet simple way of creating formatted text, such as bullet points, italics, hyperlinks, and so forth. It is very widely used, including for all the formatting in Jupyter notebooks, nearly every part of the GitHub site, and many other places all over the Internet. To create markdown text, you just type plain regular English, and then add some special characters for special behavior. For instance, if you surround a word or phrase with * characters, it will appear in italics. Let’s try it now.
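
For instance, here are a few lines of markdown you could paste in to try it out (the link target is just an example):

*This* will appear in italics, and [this text](https://www.fast.ai) will become a hyperlink.

- This line will become a bullet point
- So will this one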

To open the file, click its file name in GitHub.

To edit it, click on the pencil icon at the far right hand side of the screen.

You can add, edit, or replace the text that you see. Click on the “preview changes” button to see how your markdown text will look on your blog. Lines that you have added or changed will appear with a green bar on the left-hand side.

To save your changes to your blog, you must scroll to the bottom and click on the “commit changes” green button. On GitHub, to “commit” something means to save it to the GitHub server.

Next, you should configure your blog’s settings. To do so, click on the file called “_config.yml”, and then click on the edit button like you did for the index file above. Change the title, description, and GitHub username values. You need to leave the names before the colons in place, and type your new values in after the colon and space on each line. You can also add your email address and Twitter username if you wish — but note that these will appear on your public blog if you do fill them in here.
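
For instance, after editing, the relevant lines might look something like this (the values below are just placeholders; keep whichever key names you see in your own file):

title: My Deep Learning Blog
description: Notes on what I'm learning
github_username: myusername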

After you’re done, commit your changes just like you did with the index file before. Then wait about a minute whilst GitHub processes your new blog, and you’ll be able to go to your blog in your web browser, by opening the URL: username.github.io (replace “username” with your GitHub username). You should see your blog!

Creating posts

Now you’re ready to create your first post. All your posts will go in the “_posts” folder. Click on that now, and then click on the “create file” button. You need to be careful to name your file in the following format: “year-month-day-name.md”, where year is a four-digit number, and month and day are two-digit numbers. “Name” can be anything you want that will help you remember what the post was about. The “md” extension is for markdown documents.

You can then type the contents of your first post. The only rule is that the first line of your post must be a markdown heading. This is created by putting “# ” at the start of a line (that creates a level 1 heading, which you should use just once, at the start of your document; you create level 2 headings using “## ”, level 3 with “### ”, and so forth).
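
For instance, a post saved in a file named something like “2020-01-20-my-first-post.md” (a made-up name) could contain:

# My first post

This is my *first* post on my new blog!

## A subheading

Some more text here.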

As before, you can click on the “preview” button to see how your markdown formatting will look.

And you will need to click the “commit new file” button to save it to GitHub.

Have a look at your blog homepage again, and you will see that this post has now appeared! (Remember that you will need to wait a minute or so for GitHub to process it.)

You’ll also see that we provided a sample blog post, which you can go ahead and delete now. Go to your posts folder, as before, and click on “2020-01-14-welcome.md”. Then click on the trash icon on the far right.

In GitHub, nothing actually changes until you commit, and that includes deleting a file! So, after you click the trash icon, scroll down to the bottom and commit your changes.

You can include images in your posts by adding a line of markdown like the following:

![Image description](/images/filename.jpg)

For this to work, you will need to put the image inside your “images” folder. To do this, click on the images folder to go into it in GitHub, and then click the “upload files” button.

Going further

If you want to add a table of contents to your post (like this one), then add these 2 lines to your post in the place you want your table of contents to appear:

1. TOC
{:toc}

Any headings that you’ve created (by starting a line with one or more # characters) will appear in the table of contents, with automatic links to the sections.

You can also add math equations using LaTeX within a paragraph by surrounding them with $ characters, like this: $\sum_n (x)$. Or you can put them in their own paragraph by surrounding them with $$ on a line by themselves, like this:

$$
\sum_n (x)
$$

This appears like so:

To make LaTeX math work in your blog, you have to change the line that reads use_math: in _config.yml so it reads:

use_math: true

Now you know how to create a blog! That just leaves the question of just what to write in it… Rachel Thomas has provided some helpful thoughts in her article Advice for Better Blog Posts.

Want more? Then be sure to check out my followup post, where I show some of the powerful features that GitHub Pages supports, such as custom domains, and the ability to synchronize a folder on your own computer with GitHub and use your own word processing software. I’ll also introduce you to the wonderful world of git, a powerful software tool that may just change your life for the better…

Self-supervised learning and computer vision

Update: Jan 20th, 2020: Thanks to Yann LeCun for suggesting two papers from Facebook AI, Self-Supervised Learning of Pretext-Invariant Representations and Momentum Contrast for Unsupervised Visual Representation Learning. I’ve added a section, “Consistency loss”, that discusses the approach used in these works (and similar ideas). Thanks to Phillip Isola for finding early uses of the term “self-supervised learning”, which I’ve added to the post.

  1. Introduction to self-supervised learning
  2. Self-supervised learning in computer vision
    1. Colorization
    2. Placing image patches in the right place
    3. Placing frames in the right order
    4. Inpainting
    5. Classify corrupted images
    6. Choosing a pretext task
  3. Fine tuning for your downstream tasks
  4. Consistency loss
  5. Further reading

Introduction to self-supervised learning

Wherever possible, you should aim to start your neural network training with a pre-trained model, and fine tune it. You really don’t want to be starting with random weights, because that means you’re starting with a model that doesn’t know how to do anything at all! With pretraining, you can use 1000x less data than starting from scratch.

So, what do you do if there are no pre-trained models in your domain? For instance, there are very few pre-trained models in the field of medical imaging. One interesting recent paper, Transfusion: Understanding Transfer Learning for Medical Imaging has looked at this question and identified that using even a few early layers from a pretrained ImageNet model can improve both the speed of training, and final accuracy, of medical imaging models. Therefore, you should use a general-purpose pre-trained model, even if it is not in the domain of the problem that you’re working in.

However, as this paper notes, the amount of improvement from an ImageNet pretrained model when applied to medical imaging is not that great. We would like something which works better, but doesn’t need a huge amount of data. The secret is “self-supervised learning”. This is where we train a model using labels that are naturally part of the input data, rather than requiring separate external labels.

This idea has a long history: it was discussed back in 1989 by Jürgen Schmidhuber in his (way ahead of its time!) paper Making the World Differentiable. By 1994, the term was also being used to cover a related approach, which is using one modality as labels for another, such as in the paper Learning Classification with Unlabeled Data, which uses audio data as labels and video data as predictors. The paper gives the example:

Hearing “mooing” and seeing cows tend to occur together

Self-supervised learning is the secret to ULMFiT, a natural language processing training approach that dramatically improved the state of the art in this important field. In ULMFiT we start by pretraining a “language model” — that is, a model that learns to predict the next word of a sentence. We are not necessarily interested in the language model itself, but it turns out that a model which can complete this task must learn about the nature of language, and even a bit about the world, in the process of its training. When we then take this pretrained language model and fine tune it for another task, such as sentiment analysis, it turns out that we can very quickly get state-of-the-art results with very little data. For more information about how this works, have a look at this introduction to ULMFiT and language model pretraining.

Self-supervised learning in computer vision

In self-supervised learning, the task that we use for pretraining is known as the “pretext task”. The tasks that we then use for fine tuning are known as the “downstream tasks”. Even though self-supervised learning is nearly universally used in natural language processing nowadays, it is used much less in computer vision models than we might expect, given how well it works. Perhaps this is because ImageNet pretraining has been so widely successful, so folks in communities such as medical imaging may be less familiar with the need for self-supervised learning. In the rest of this post I will endeavor to provide a brief introduction to the use of self-supervised learning in computer vision, in the hope that this might help more people take advantage of this very useful technique.

The most important question that needs to be answered in order to use self-supervised learning in computer vision is: “what pretext task should you use?” It turns out that there are many you can choose from. Here is a list of a few, and papers describing them, along with an image from a paper in each section showing the approach.

Colorization

(paper 1, paper 2, paper 3)

Placing image patches in the right place

(paper 1, paper 2)

Placing frames in the right order

(paper 1, paper 2)

Inpainting

(paper)

Classify corrupted images

(paper)

In this example, the green images are not corrupted, and the red images are corrupted. Note that an overly simple corruption scheme may result in a task that’s too easy, and doesn’t result in useful features. The paper above uses a clever approach that corrupts an autoencoder’s features, and then tries to reconstruct them, to make it a challenging task.

Choosing a pretext task

The task that you choose needs to be something that, if solved, would require an understanding of your data that would also be needed to solve your downstream task. For instance, practitioners have often used something called an “autoencoder” as a pretext task. This is a model which can take an input image, convert it into a greatly reduced form (using a bottleneck layer), and then convert it back into something as close as possible to the original image. It is effectively using compression as a pretext task. However, solving this task requires not just regenerating the original image content, but also regenerating any noise in the original image. Therefore, if your downstream task is something where you want to generate higher quality images, then this would be a poor choice of pretext task.
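
To make the idea concrete, here is a minimal sketch of an image autoencoder in PyTorch; the architecture and layer sizes are arbitrary placeholders, not a recommendation:

import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: compress the image into a small "bottleneck" representation
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 4, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: try to reconstruct the original image from the bottleneck
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(8, 3, 64, 64)                # a batch of fake images
loss = nn.functional.mse_loss(model(x), x)  # pretext task: reconstruct the input
loss.backward()

After pretraining like this, you would typically keep the encoder and discard the decoder before fine tuning on your downstream task.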

You should also ensure that the pretext task is something that a human could do. For instance, you might use as a pretext task the problem of generating a future frame of a video. But if the frame you try to generate is too far in the future then it may be part of a completely different scene, such that no model could hope to automatically generate it.

Fine tuning for your downstream tasks

Once you have pretrained your model with a pretext task, you can move on to fine tuning. At this point, you should treat this as a transfer learning problem, and therefore you should be careful not to hurt your pretrained weights. Use the things discussed in the ULMFiT paper to help you here, such as gradual unfreezing, discriminative learning rates, and one-cycle training. If you are using fastai2 then you can simply call the fine_tune method to have this all done for you.
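
For instance, a fine tuning sketch with fastai2 might look roughly like this. The dataset here is just fastai’s sample pets data, standing in for your own downstream task, and the commented-out weight-loading line is hypothetical, since it depends on how you saved your pretext model:

from fastai.vision.all import *
import torch

path = untar_data(URLs.PETS)/'images'
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2,
    label_func=lambda x: x[0].isupper(), item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=accuracy)

# Hypothetical: load the encoder ("body") weights saved from your pretext model.
# cnn_learner puts the body at index 0 of the model.
# learn.model[0].load_state_dict(torch.load('pretext_encoder.pth'))

# fine_tune handles freezing, one-cycle training, and discriminative
# learning rates for you:
learn.fine_tune(3)

# Or, more explicitly, with gradual unfreezing and discriminative learning rates:
# learn.freeze(); learn.fit_one_cycle(1)
# learn.unfreeze(); learn.fit_one_cycle(3, lr_max=slice(1e-6, 1e-4))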

Overall, I would suggest not spending too much time creating the perfect pretext model, but just build whatever you can that is reasonably fast and easy. Then you can find out whether it is good enough for your downstream task. Often, it turns out that you don’t need a particularly complex pretext task to get great results on your downstream task. Therefore, you could easily end up wasting time over engineering your pretext task.

Note also that you can do multiple rounds of self-supervised pretraining and regular pretraining. For instance, you could use one of the above approaches for initial pretraining, and then do segmentation for additional pretraining, and then finally train your downstream task. You could also do multiple tasks at once (multi-task learning) at either or both stages. But of course, do the simplest thing first, and then add complexity only if you determine you really need it!

Consistency loss

There’s a very useful trick you can add on top of your self-supervised training, which is known as “consistency loss” in NLP, or “noise contrastive estimation” in computer vision. The basic idea is this: your pretext task is something that messes with your data, such as obscuring sections, rotating it, moving patches, or (in NLP) changing words or translating a sentence to a foreign language and back again. In each case, you’d hope that the original item and the “messed up” item give the same predictions in a pretext task, and create the same features in intermediate representations. And you’d also hope that the same item, when “messed up” in two different ways (e.g. an image rotated by two different amounts) should also have the same consistent representations.

Therefore, we add to the loss function something that penalizes getting different answers for different versions of the same data. Here’s a pictorial representation, from Google’s post Advancing Semi-supervised Learning with Unsupervised Data Augmentation.
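
In code, a minimal sketch of the general idea might look like this (the model, the augmentations, and the choice of penalty are all placeholders):

import torch
from torch import nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))  # placeholder model

def consistency_loss(model, x):
    # Two differently "messed up" versions of the same batch -- here, two
    # different rotations; use whatever augmentations suit your pretext task
    x1 = torch.rot90(x, k=1, dims=(2, 3))
    x2 = torch.rot90(x, k=2, dims=(2, 3))
    p1, p2 = model(x1), model(x2)
    # Penalize the model for giving different answers for the two versions;
    # mean squared error is just one simple choice of penalty
    return F.mse_loss(p1, p2)

x = torch.rand(8, 3, 64, 64)        # a batch of fake images
loss = consistency_loss(model, x)   # add this term to your pretext loss
loss.backward()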

To say that this is “effective” would be a giant understatement… for instance, the approach discussed in the Google post above totally and absolutely smashed our previous state-of-the-art approach to text classification with ULMFiT. They used 1000x less labeled data than we did!

Facebook AI has recently released two papers using this idea in a computer vision setting: Self-Supervised Learning of Pretext-Invariant Representations and Momentum Contrast for Unsupervised Visual Representation Learning. Like the Google paper in NLP, these methods beat the previous state of the art approaches, and require less data.

It’s likely that you can add a consistency loss to your model, for nearly any pretext task that you pick. Since it’s so effective, I’d strongly recommend you give it a try!

Further reading

If you’re interested in learning more about self-supervised learning in computer vision, have a look at these recent works:

Data project checklist

This post is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license.

As we discussed in Designing Great Data Products, there’s a lot more to creating useful data projects than just training an accurate model! When I used to do consulting, I’d always seek to understand an organization’s context for developing data projects, based on these considerations:

  • Strategy: What is the organization trying to do (objective) and what can it change to do it better (levers)?
  • Data: Is the organization capturing necessary data and making it available?
  • Analytics: What kinds of insights would be useful to the organization?
  • Implementation: What organizational capabilities does it have?
  • Maintenance: What systems are in place to track changes in the operational environment?
  • Constraints: What constraints need to be considered in each of the above areas?
The analytics value chain

I developed a questionnaire that I had clients fill out before a project started, and then throughout the project I’d help them refine their answers. This questionnaire is based on decades of projects across many industries, including agriculture, mining, banking, brewing, telecoms, retail, and more. Here I am sharing it publicly for the first time.

Organizational

Data scientists

Data scientists should have a clear path to become senior executives, and there should also be hiring plans in place to bring data experts directly into senior executive roles. In a data-driven organization data scientists should be amongst the most well-paid employees. There should be systems in place to allow data scientists throughout the organization to collaborate and learn from each other.

  • What data science skills are currently in the organization?
  • How are data scientists being recruited?
  • How are people with data science skills being identified within the organization?
  • What skills are being looked for? How are they being judged? How were those skills selected as being important?
  • What data science consulting is being used? In which situations is data science outsourced? How is this work transferred to the organization?
  • How much are data scientists being paid? Who do they report to? How are their skills kept current?
  • What is the career path for data scientists?
  • How many executives have strong data analysis expertise?
  • How is work for data scientists selected and allocated?
  • What software and hardware do data scientists have access to?

Strategy

All data projects should be based on solving strategically important problems. Therefore, an understanding of business strategy must come first.

  • What are the 5 most important strategic issues at the organization today?
  • What data is available to help deal with these issues?
  • Is a data driven approach being used for these issues? Are data scientists working on these?
  • What are the profit drivers that the organization can most strongly impact?
Some of the kinds of things that may be important profit drivers at an organization
  • For each of the most important profit drivers listed above, what are the specific actions and decisions that the organization can take that can influence that driver, including both operational actions (e.g. call customer) and strategic decisions (e.g. release new product)?
  • For each of the most important actions and decisions above, what data could be available (either within the organization, or from a vendor, or that could be collected in the future) that may help to better target or optimize that decision?
  • Based on the above analysis, what are the biggest opportunities for data-driven analysis within the organization?
  • For each opportunity:
    • What value driver is it designed to influence?
    • What specific actions or decisions will it drive?
    • How will these actions and decisions be connected to the results of the project?
    • What is the estimated ROI of the impact of each project based on the above?
    • What time constraints and deadlines, if any, may impact it?

Data

Without data, we can’t train models! Data also needs to be available, integrated, and verifiable.

  • What data platforms does the organization have, including data marts, OLAP cubes, data warehouses, Hadoop clusters, OLTP systems, departmental spreadsheets, and so forth?
  • Provide any information that has been collated that provides an overview of data availability at the organization, and current work and future plans for building data platforms
  • What tools and processes are available to move data between systems and formats?
  • How are the data sources accessed by different groups of users and admins?
  • What data access tools (e.g. database clients, OLAP clients, in-house software, SAS, etc.) are available for the organization’s data scientists and for sysadmins? How many people use each of these tools, and what are their positions in the organization?
  • How are users informed of new systems, changes to systems, new and changed data elements, and so forth? Provide examples
  • How are decisions made regarding data access restrictions? How are requests to access secured data managed? By who? Based on what criteria? How long is the average time to respond? What % of requests are accepted? How is this tracked?
  • How does the organization decide when to collect additional data or purchase external data? Provide examples
  • What data has been used so far to analyze recent data-driven projects? What has been found to be most useful? What was not useful? How was this judged?
  • What additional internal data may provide insights useful for data-driven decision making for proposed projects? External data?
  • What are the possible constraints or challenges in accessing or incorporating this data?
  • What changes to data collection, coding, integration, etc. have occurred in the last 2 years that may impact the interpretation or availability of the collected data?

Analytics

Data scientists need to be able to access up to date tools, based on their own particular needs. New tools should be regularly assessed to see if they significantly improve over current approaches.

  • What analytics tools are used at the organization, by who? How are they selected, configured, and maintained?
  • What is the process to get additional analytical tools set up on a client machine? What is the average time to complete this? What % of requests are accepted?
  • How are analytical systems built by external consultants transferred to the organization? Are external contractors asked to restrict the systems used to ensure the results conform to internal infrastructure?
  • In what situations has cloud processing been used? What are the plans for using the cloud?
  • In what situations have external experts been used for specialist analytics? How has this been managed? How have the experts been identified and selected?
  • What analytic tools have been tried for recent projects?
  • What worked, and what didn’t? Why?
  • Provide any outputs that are available from work done to date for these projects
  • How have the results of this analysis been judged? What metrics? Compared to what benchmarks? How do you know whether a model is “good enough”?
  • In what situations does the organization use visualization, vs. tabular reporting, vs. predictive modelling (and similar machine learning tools)? For more advanced modelling approaches, how are the models calibrated and tested? Provide examples

Implementation

IT constraints are often the downfall of data projects. Be sure to consider them up front!

  • Provide some examples of past data-driven projects which have had successful, and unsuccessful implementations, and provide details on the IT integration and human capital challenges and how they were faced
  • How are the validity of analytical models confirmed prior to implementation? How are they benchmarked?
  • How are the performance requirements defined for analytical project implementations (in terms of speed and accuracy)?
  • For the proposed projects provide information about:
    • What IT systems will be used to support the data driven decisions and actions
    • How this IT integration will be done
    • What alternatives there are which may require less IT integration
    • What jobs will be impacted by the data driven approaches
    • How these staff will be trained, monitored, and supported
    • What implementation challenges may occur
    • Which stakeholders will be needed to ensure implementation success? How might they perceive these projects and their potential impact on them?

Maintenance

Unless you track your models carefully, you may find them leading you to disaster.

  • How are analytical systems built by third parties maintained? When are they transferred to internal teams?
  • How is the effectiveness of models tracked? When does the organization decide to rebuild models?
  • How are data changes communicated internally, and how are they managed?
  • How do data scientists work with software engineers to ensure algorithms are correctly implemented?
  • When are test cases developed, and how are they maintained?
  • When is refactoring performed on code? How is the correctness and performance of models maintained and validated during refactoring?
  • How are maintenance and support requirements logged? How are these logs used?

Constraints

For each project being considered enumerate potential constraints that may impact the success of the project, e.g.:

  • Will IT systems need to be modified or developed to use the results of the project? Are there simpler implementations that could avoid substantial IT changes? If so, would this simplified implementation result in a significant reduction in impact?
  • What regulatory constraints exist on data collection, analysis, or implementation? Have the specific legislation and precedents been examined recently? What workarounds might exist?
  • What organizational constraints exist, including culture, skills, or structure?
  • What management constraints are there?
  • Are there any past analytic projects which may impact how the organization resources would view data-driven approaches?