nbdev: use Jupyter Notebooks for everything

“I really do think [nbdev] is a huge step forward for programming environments”: Chris Lattner, inventor of Swift, LLVM, and Swift Playgrounds.

My fast.ai colleague Sylvain Gugger and I have been working on a labor of love for the last couple of years. It is a Python programming environment called nbdev, which allows you to create complete python packages, including tests and a rich documentation system, all in Jupyter Notebooks. We’ve already written a large programming library (fastai v2) using nbdev, as well as a range of smaller projects.

Nbdev is a system for something that we call exploratory programming. Exploratory programming is based on the observation that most of us spend most of our time as coders exploring and experimenting. We experiment with a new API that we haven’t used before, to understand exactly how it behaves. We explore the behavior of an algorithm that we are developing, to see how it works with various kinds of data. We try to debug our code, by exploring different combinations of inputs. And so forth…

  1. nbdev: exploratory programming
  2. Software development tools
  3. Interactive programming environments
  4. So what’s missing in Jupyter Notebook?
  5. Dynamic Python
  6. What now

nbdev: exploratory programming

We believe that the very process of exploration is valuable in itself, and that this process should be saved so that other programmers (including yourself in six months time) can see what happened and learn by example. Think of it as something like a scientist’s journal. You can use it to show the things that you tried, what worked and what didn’t, and what you did to develop your understanding of the system within which you are working. During this exploration, you will realize that some parts of this understanding are critical for the system to work. Therefore, the exploration should include the addition of tests and assertions to ensure this behavior.

This kind of “exploring” is easiest when you develop on the prompt (or REPL), or using a notebook-oriented development system like Jupyter Notebooks. But these systems are not as strong for the “programming” part. This is why people use such systems mainly for early exploring, and then switch to an IDE or text editor later in a project. They switch to get features like good doc lookup, good syntax highlighting, integration with unit tests, and (critically!) the ability to produce final, distributable source code files, as opposed to notebooks or REPL histories.

The point of nbdev is to bring the key benefits of IDE/editor development into the notebook system, so you can work in notebooks without compromise for the entire lifecycle. To support this kind of exploration, nbdev is built on top of Jupyter Notebook (which also means we get much better support for Python’s dynamic features than in a normal editor or IDE), and adds the following critically important tools for software development:

  • Python modules are automatically created for you, following best practices such as automatically defining __all__ (more details) with your exported functions, classes, and variables
  • Navigate and edit your code in a standard text editor or IDE, and export any changes automatically back into your notebooks
  • Automatically create searchable, hyperlinked documentation from your code; any word you surround in backticks will be hyperlinked to the appropriate documentation, a sidebar will be created for you in your documentation site with links to each of your modules, and more
  • Pip installers (uploaded to pypi for you)
  • Testing (defined directly in your notebooks, and run in parallel)
  • Continuous integration
  • Version control conflict handling

Here’s a snippet from our actual “source code” for nbdev, which is itself written in nbdev (of course!)

Exploring the notebook file format in the nbdev source code

As you can see, when you build software this way, everyone in your project team gets to benefit from the work you do in building an understanding of the problem domain, such as file formats, performance characteristics, API edge cases, and so forth. Since development occurs in a notebook, you can also add charts, text, links, images, videos, etc, that will be included automatically in the documentation of your library. The cells where your code is defined will be hidden and replaced by standardized documentation of your function, showing its name, arguments, docstring, and link to the source code on GitHub.

For more information about features, installation, and how to use nbdev, see its documentation (which is, naturally, automatically generated from its source code). I’ll be posting a step by step tutorial in the coming days. In the rest of this post, I’ll describe more of the history and context behind the why: why did we build it, and why did we design it the way we did. First, let’s talk about a little history… (And if you’re not interested in the history, you can skip ahead to What’s missing in Jupyter Notebook.)

Software development tools

Most software development tools are not built from the foundations of thinking about exploratory programming. When I began coding, around 30 years ago, waterfall software development was used nearly exclusively. It seemed to me at the time that this approach, where an entire software system would be defined in minute detail upfront, and then coded as closely to the specification as possible, did not fit at all well with how I actually got work done.

In the 1990s, however, things started to change. Agile development became popular. People started to understand the reality that most software development is an iterative process, and developed ways of working which respected this fact. However, we did not see major changes to the software development tools that we used, that matched the major changes to our ways of working. There were some pieces of tooling which got added to our arsenal, particularly around being able to do test driven development more easily. However this tooling tended to appear as minor extensions to existing editors and development environments, rather than truly rethinking what a development environment could look like.

In recent years we’ve also begun to see increasing interest in exploratory testing as an important part of the agile toolbox. We absolutely agree! But we also think this doesn’t go nearly far enough; we think in nearly every part of the software development process that exploration should be a central part of the story.

The legendary Donald Knuth was way ahead of his time. He wanted to see things done very differently. In 1983 he developed a methodology called literate programming. He describes it as “a methodology that combines a programming language with a documentation language, thereby making programs more robust, more portable, more easily maintained, and arguably more fun to write than programs that are written only in a high-level language. The main idea is to treat a program as a piece of literature, addressed to human beings rather than to a computer.” For a long time I was fascinated by this idea, but unfortunately it never really went anywhere. The tooling available for working this way resulted in software development taking much longer, and very few people decided that this compromise was worth it.

Nearly 30 years later another brilliant and revolutionary thinker, Bret Victor, expressed his deep discontent with the current generation of development tools, and described how to design “a programming system for understanding programs”. As he said in his groundbreaking speech “Inventing on Principle”: “Our current conception of what a computer program is — a list of textual definitions that you hand to a compiler — that’s derived straight from Fortran and ALGOL in the late ‘50’s. Those languages were designed for punchcards.”

He laid out, and illustrated with fully worked examples, a range of new principles for designing programming systems. Whilst nobody has as yet fully implemented all of his ideas, there have been some significant attempts to implement some parts of them. Perhaps the most well-known and complete implementation, including inline display of intermediate results, is Chris Lattner’s Swift and Xcode Playgrounds.

Demonstration of Playgrounds in Xcode

Whilst this is a big step forward, it is still very constrained by the basic limitations of sitting within a development environment which was not originally built with such explorations in mind. For instance, the exploration process is not captured by this at all, tests cannot be directly integrated into it, and the full rich vision of literate programming cannot be implemented.

Interactive programming environments

There has been another very different direction in software development, which is interactive programming (and the related live programming). It started many decades ago with the LISP and Forth REPLs, which allowed developers to interactively add and remove code in a running application. Smalltalk took things even further, providing a fully interactive visual workspace. In all these cases, the languages themselves were well suited to this kind of interactive work, for instance with LISP’s macro system and “code as data” foundations.

Live programming in Smalltalk (1980)

Although this approach isn’t how most regular software development is done today, it is the most popular approach in many areas of scientific, statistical, and other data-driven programming. (JavaScript front-end programming is however increasingly borrowing ideas from those approaches, such as hot reloading and in-browser live editing.) Matlab, for instance, started out as an entirely interactive tool back in the 1970s, and today is still widely used in engineering, biology, and various other areas (it also provides regular software development features nowadays). Similar approaches were used by S-PLUS, and its open-source cousin R, which is today extremely popular in the statistics and data visualization communities (amongst others).

I got particularly excited when I first used Mathematica about 25 years ago. Mathematica looked to me like the closest thing I’d seen to something that could support literate programming, without compromising on productivity. To do this, it used a “notebook” interface, which behaved a lot like a traditional REPL, but also allowed other types of information to be included, including charts, images, formatted text, outlining sections, and so forth. In fact, not only did it not compromise on productivity, but I found it actually allowed me to build things that previously were beyond me, because I could try algorithms out and immediately get feedback in a very visual way.

In the end though, Mathematica didn’t really help me build anything useful, because I couldn’t distribute my code or applications to colleagues (unless they spent thousands of dollars for a Mathematica license to use it), and I couldn’t easily create web applications for people to access from the browser. In addition, I found my Mathematica code would often end up much slower and more memory hungry than code I wrote in other languages.

So you can imagine my excitement when Jupyter Notebook appeared on the scene. This used the same basic notebook interface as Mathematica (although, at first, with a small subset of the functionality) but was open source, and allowed me to write in languages that were widely supported and freely available. I’ve been using Jupyter not just for exploring algorithms, APIs, and new research ideas, but also as a teaching tool at fast.ai. Many students have found that the ability to experiment with inputs and view intermediate results and outputs, as well as try out their own modifications, helped them to more fully and deeply understand the topics being discussed.

We are also writing a book entirely using Jupyter Notebooks, which has been an absolute pleasure, allowing us to combine prose, code examples, hierarchical structured headings, and so forth, whilst ensuring that our sample outputs (including charts, tables, and images) always correctly match up to the code examples.

In short: we have really enjoyed using Jupyter Notebook, we find that we do great work using it, and our students love it. But it just seemed like such a shame that we weren’t actually using it to build our software!

So what’s missing in Jupyter Notebook?

Whilst Jupyter Notebook is great at the “exploratory” part of “exploratory programming”, it’s not so great at the “programming” part. For instance, it doesn’t really provide a way to do things like:

  • Create modular reusable code, which can be run outside of Jupyter
  • Create hyperlinked searchable documentation
  • Test code (including automatically through continuous integration)
  • Navigate code
  • Handle version control

Because of this, people generally have to switch between a mix of poorly integrated tools, with significant friction as they move from tool to tool, to get the advantages of each:

IDE/Editor

  Pros:
  • Produces final distributable module
  • Integration with doc lookup
  • Integration with syntax highlighting and type-checking

  Cons:
  • Non-interactive, so hard to explore
  • Incomplete support of dynamic languages
  • Documentation is text-only
  • No facility for documenting a session of interaction, or explaining through example

REPL/shell

  Pros:
  • Good for small interactive explorations

  Cons:
  • Bad for everything else, including producing distributable modules

Traditional notebooks

  Pros:
  • Mixes code, rich text, and images
  • Explaining thru examples by recording a session of interaction
  • Accurate code navigation and auto-completion for dynamic languages

  Cons:
  • Same cons as REPL programming

We decided that the best way to handle these things was to leverage great tools that already exist, where possible, and build our own where needed. For instance, for handling pull requests and viewing diffs, there’s already a great tool: ReviewNB. When you look at graphical diffs in ReviewNB, you suddenly realize how much has been missing all this time in plain text diffs. For instance, what if a commit made your image generation blurry? Or made your charts appear without labels? You really know what’s going on when you have that visual diff.

Visual diff in ReviewNB, showing change to tabular output

Many merge conflicts are avoided with nbdev, because it installs git hooks for you which strip out much of the metadata that causes those conflicts in the first place. If you get a merge conflict when you pull from git, just run nbdev_fix_merge. With this command, nbdev will simply use your cell outputs where there are conflicts in outputs, and if there are conflicts in cell inputs, then both cells are included in the final notebook, along with conflict markers so you can easily find them and fix them directly in Jupyter.

An example of a cell-based nbdev merge conflict

nbdev creates modular, reusable code by exporting standard Python modules. It looks for special comments in code cells, such as #export, which indicates that a cell should be exported to a Python module. Each notebook is associated with a particular Python module by using a special comment at the start of the notebook. A documentation site (using Jekyll, so supported directly by GitHub Pages) is automatically built from the notebooks and these special comments. We wrote our own documentation system, since existing approaches such as Sphinx didn’t provide all the features we needed.
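
As a rough sketch of what this looks like in practice (the #export flag is nbdev’s; the module-association comment is shown here as # default_exp, which is an assumption about the exact spelling and may differ in your nbdev version):

# First cell of the notebook: associate it with a module
# (here assumed to export to yourlib/core.py)
# default_exp core

#export
def say_hello(to):
    "Say hello to somebody; exported to the module and added to `__all__`"
    return f'Hello {to}!'

# Cells without #export (like this one) stay in the notebook and the docs,
# serving as examples and tests, but are not exported to the Python module
assert say_hello("Sylvain") == 'Hello Sylvain!'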

For code navigation, there are already wonderful features built into most editors and IDEs, such as vim, Emacs, and vscode. And as a bonus, GitHub even supports code navigation directly in its web interface now (in beta, in selected projects, such as fastai)! So we’ve ensured that the code that nbdev exports can be navigated and edited directly in any of these systems - and that any edits can be automatically synchronized back with the notebooks.

For testing, we’ve written our own simple library and command line tool. Tests are written directly in notebooks, as part of the exploration and development (and documentation) process, and the command line tool runs the tests in all notebooks in parallel. The natural statefulness of notebooks turns out to be a really great way to develop both unit tests and integration tests. Rather than learning special syntax for creating test suites, you just use the regular collection and looping constructs in Python, so there are far fewer new concepts to learn. These tests can also be run in your normal continuous integration tools, and they provide clear information about the source of any test errors that come up. The default nbdev template includes integration with GitHub Actions for continuous integration and other features (PRs for other platforms are welcome).
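
For instance, a test cell can be nothing more than ordinary Python assertions and loops written alongside the exported code (a simple sketch, using plain assert rather than any nbdev-specific helpers):

#export
def double(x): return x * 2

# A test cell: run interactively while exploring, and run again later
# by the command line test runner and CI
for x, expected in [(0, 0), (3, 6), (-2, -4)]:
    assert double(x) == expected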

Dynamic Python

One of the challenges in fully supporting Python in a regular editor or IDE is that Python has particularly strong dynamic features. For instance, you can add methods to a class at any time, you can change the way classes are created and how they work by using the metaclass system, and you can change how functions and methods behave by using decorators. Microsoft developed the Language Server Protocol, which development environments can use to get the information about the current file and project needed for auto-completion, code navigation, and so forth. However, with a truly dynamic language like Python, such information will always be partly guesswork, since providing fully correct information would require running the Python code itself (which the editor can’t really do, for all kinds of reasons; for instance, while you’re writing it, the code may be in a state that deletes all your files!)
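
Here’s a small, contrived illustration of the kind of dynamism that defeats static analysis (the names are made up for the example): whether Config.validate exists below simply cannot be determined without running the code.

import random

class Config: pass

# A method added to the class at runtime: a static analyzer reading the source
# cannot know whether `Config.validate` exists without executing this code
if random.random() > 0.5:
    Config.validate = lambda self: True

def logged(f):
    "A decorator that wraps a function, changing its runtime behavior"
    def wrapper(*args, **kwargs):
        print(f"calling {f.__name__}")
        return f(*args, **kwargs)
    return wrapper

@logged
def train(model, lr=0.1): ...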

On the other hand, a notebook contains an actual running Python interpreter instance that you’re fully in control of. So Jupyter can provide auto-completions, parameter lists, and context-sensitive documentation based on the actual state of your code. For instance, when using Pandas we get tab completion of all the column names of our DataFrames. We’ve found that this feature of Jupyter Notebook makes exploratory programming significantly more productive. We haven’t needed to change anything to make it work well in nbdev; it’s just part of the great features of Jupyter that we get for free by building on that platform.
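
To make that concrete, here is a toy example: because the DataFrame below actually exists in the kernel, Jupyter can offer its real column names as completions.

import pandas as pd

df = pd.DataFrame({'price': [15.0], 'cost': [10.5]})
# In a live notebook, typing `df.pr` and hitting Tab offers `price`,
# because the kernel can inspect the actual DataFrame object
df.price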

What now

In conjunction with developing nbdev, we’ve been writing fastai v2 from scratch entirely in nbdev. fastai v2 provides a rich, well-structured API for building deep learning models. It will be released in the first half of 2020. It’s already feature complete, and early adopters are already building cool projects with the pre-release version. We’ve also written other projects in fastai v2, some of which will be released in the coming weeks.

We’ve found that we’re 2x-3x more productive using nbdev than using traditional programming tools. For me, this is a big surprise, since I have coded nearly every day for over 30 years, and in that time have tried dozens of tools, libraries, and systems for building programs. I didn’t expect there could still be room for such a big jump in productivity. It’s made me feel excited about the future, because I suspect there could still be a lot of room to develop other ideas for developer productivity, and because I’m looking forward to seeing what people build with nbdev.

If you decide to give it a go, please let us know how you get along! And of course feel free to ask any questions. The best place for these discussions is this forum thread that we’ve created for nbdev. PRs are of course welcome in the nbdev GitHub repo.

Thank you for taking an interest in our project!

Acknowledgements: Thanks to Alexis Gallagher and Viacheslav Kovalevskyi for their helpful feedback on drafts of this article. Thanks to Andrew Shaw for helping to build prototypes of show_doc, and to Stas Bekman for much of the git hooks functionality. Thanks to Hamel Husain for helpful ideas for using GitHub Actions.

Concerned about the impacts of data misuse? Ways to get involved with the USF Center for Applied Data Ethics

An algorithm applied to over 200 million patients is more likely to recommend extra health care for relatively healthy white patients over sicker black patients (paper in Science and news coverage). Russia was found to be running influence operations in 6 African countries via 73 Facebook pages, many of which purported to be local news sources, and which also spanned WhatsApp and Telegram (paper from the Stanford Internet Observatory and news coverage). An Indigenous elder revealed that the Indigenous consultation that Sidewalk Labs (an Alphabet/Google company) conducted was “hollow and tokenistic”, with zero of the 14 recommendations that arose from the consultation included in Sidewalk Labs’ 1,500-page report, even though the report mentions the Indigenous consultation many times. All these stories occurred just in the last week, the same week during which former chairman of the Alphabet board Eric Schmidt complained that people “don’t need to yell” about bias. Issues of data misuse, including bias, surveillance, and disinformation, continue to be urgent and pervasive. For those of you living in the SF Bay Area, the Tech Policy Workshop (register here) being hosted Nov 16-17 by the USF Center for Applied Data Ethics (CADE) will be an excellent opportunity to learn and engage around these issues (the sessions will be recorded and shared online later). And for those of you living elsewhere, we have several videos to watch and news articles to read now!

Events to attend

Exploratorium After Dark: Matters of Fact

I’m going to be speaking at the Exploratorium After Dark (an adults-only evening program) on Thurs, Nov 7, on Disinformation: The Threat We Are Facing is Bigger Than Just “Fake News”. The event is from 6-10pm (my talk is at 8:30pm) and a lot of the other exhibits sound fascinating. Details and tickets here.

Great speaker line-up for our Nov 16-17 Tech Policy Workshop in SF

Systemic problems, such as increasing surveillance, spread of disinformation, concerning uses of predictive policing, and the magnification of unjust bias, all require systemic solutions. We hope to facilitate collaborations between those in tech and in policy, as well as highlight the need for policy interventions in addressing ethical issues to those working in tech.

USF Center for Applied Data Ethics Tech Policy Workshop
Dates: Nov 16-17, 9am-5:30pm
Location: McLaren Conference Center, 2130 Fulton Street, San Francisco, CA 94117
Breakfast, lunch, and snack included
Details: https://www.sfdatainstitute.org/
Register here
Anyone interested in the impact of data misuse on society & the intersection with policy is welcome!

Some of the many great speakers lined up for our Tech Policy Workshop.

Info Session about our spring Data Ethics Course

I will be teaching an Intro to Data Ethics course downtown on Monday evenings, from Jan 27 to March 9 (with no class Feb 17). The course is intended for working professionals. Come find out more at an info session on Nov 12.

Videos of our Data Ethics Seminars

We had 3 fantastic speakers for our Data Ethics Seminar this fall: Deborah Raji, Ali Alkhatib, and Brian Brackeen.

Deborah Raji gave a powerful inaugural seminar, opening with the line “There is an urgency to AI ethics & accountability work, because there are currently real people being affected.” Unfortunately, what it means to do machine learning that matters in the real world is different than what academia incentivizes. Using her work with Joy Buolamwini on GenderShades as a case study, she shared how research can be designed with the specific goal of having a concrete impact. GenderShades has been cited in a number of lawsuits, bans, federal bills, and state bills around the use of facial recognition.

Ali Alkhatib gave an excellent seminar on using lenses and frameworks originating in the social sciences to understand problems situated in technology: “Everything we do is situated within cultural & historical backdrops. If we’re serious about ethics & justice, we need to be serious about understanding those histories.”

Brian Brackeen shared his experience founding a facial recognition start-up, the issues of racial bias in facial recognition, and his current work funding under-represented founders in tech. After his talk, we had a fireside chat and a lively Q&A session. Unfortunately, due to a mix-up with the videographer, we do not have a recording of his talk. I encourage you to follow Brian on Twitter, read his powerful TechCrunch op-ed on why he refused to sell facial recognition to law enforcement, and watch this previous panel he was on.

All of these events were open to the public, and we will be hosting more seminars in the spring. To keep up on events, please join our email list:

And finally, I want to share a recent talk I gave on “Getting Specific About Algorithmic Bias.” Through a series of case studies, I illustrate how different types of bias have different sources (and require different approaches to mitigate the bias), debunk several misconceptions about bias, and share some steps towards solutions.

Apply to our data ethics fellowships

We are offering full-time fellowships for those working on problems of applied data ethics, with a particular focus on work that has a direct, practical impact. Applications will be reviewed after November 1, 2019 with roles starting in January or June 2020. We welcome applicants from any discipline (including, but not limited to computer science, statistics, law, social sciences, history, media studies, political science, public policy, business, etc.). We are looking for people who have shown deep interest and expertise in areas related to data ethics, including disinformation, surveillance/privacy, unjust bias, or tech policy, with a particular focus on practical impact. Here are the job postings:


The problem with metrics is a big problem for AI

Goodhart’s Law states that “When a measure becomes a target, it ceases to be a good measure.” At their heart, what most current AI approaches do is to optimize metrics. The practice of optimizing metrics is not new nor unique to AI, yet AI can be particularly efficient (even too efficient!) at doing so.

This is important to understand, because any risks of optimizing metrics are heightened by AI. While metrics can be useful in their proper place, there are harms when they are unthinkingly applied. Some of the scariest instances of algorithms run amok (such as Google’s algorithm contributing to radicalizing people into white supremacy, teachers being fired by an algorithm, or essay grading software that rewards sophisticated garbage) all result from over-emphasizing metrics. We have to understand this dynamic in order to understand the urgent risks we are facing due to misuse of AI.

Headlines from HBR, Washington Post, and Vice on some of the outcomes of over-optimizing metrics: rewarding gibberish essays, promoting propaganda, massive fraud at Wells Fargo, and firing good teachers

The following principles will be illustrated through a series of case studies:

We can’t measure the things that matter most

Metrics are typically just a proxy for what we really care about. The paper Does Machine Learning Automate Moral Hazard and Error? covers an interesting example: the researchers investigate which factors in someone’s electronic medical record are most predictive of a future stroke. However, the researchers found that several of the most predictive factors (such as accidental injury, a benign breast lump, or colonoscopy) don’t make sense as risk factors for stroke. So, just what is going on? It turned out that the model was just identifying people who utilize health care a lot. They didn’t actually have data of who had a stroke (a physiological event in which regions of the brain are denied new oxygen); they had data about who had access to medical care, chose to go to a doctor, were given the needed tests, and had this billing code added to their chart. But a number of factors influence this process: who has health insurance or can afford their co-pay, who can take time off of work or find childcare, gender and racial biases that impact who gets accurate diagnoses, cultural factors, and more. As a result, the model was largely picking out people who utilized healthcare versus who did not.

This is an example of the common phenomenon of having to use proxies: you want to know what content users like, so you measure what they click on. You want to know which teachers are most effective, so you measure their students’ test scores. You want to know about crime, so you measure arrests. These things are not the same. Many things we care about cannot be measured. Metrics can be helpful, but we can’t forget that they are just proxies.

As another example, Google used hours spent watching YouTube as a proxy for how happy users were with the content, writing on the Google blog that “If viewers are watching more YouTube, it signals to us that they’re happier with the content they’ve found.” Guillaume Chaslot, an AI engineer who formerly worked at Google/YouTube, shares how this had the side effect of incentivizing conspiracy theories, since convincing users that the rest of the media is lying kept them watching more YouTube.

Metrics can, and will, be gamed

It is almost inevitable that metrics will be gamed, particularly when they are given too much power. One week this spring, Chaslot collected 84,695 videos from YouTube and analyzed the number of views and the number of channels from which they were recommended. This is what he found (also covered in the Washington Post):

Chart showing Russia Today's video on the Mueller Report as being an outlier in how many YouTube channels recommended it (source: https://twitter.com/gchaslot/status/1121603851675553793)

The state-owned media outlet Russia Today was an extreme outlier in how much YouTube’s algorithm had selected it to be recommended by a wide variety of other YouTube channels. Such algorithmic selections, which begin autoplaying as soon as your current video is done, account for 70% of the time that users spend on YouTube. This chart strongly suggests that Russia Today has in some way gamed YouTube’s algorithm. (More evidence about issues with YouTube’s recommendation system is detailed here.) Platforms are rife with attempts to game their algorithms, to show up higher in search results or recommended content, through fake clicks, fake reviews, fake followers, and more.

Automatic essay grading software focuses primarily on metrics like sentence length, vocabulary, spelling, and subject-verb agreement, but is unable to evaluate aspects of writing that are hard to quantify, such as creativity. As a result, gibberish essays randomly generated by computer programs to contain lots of sophisticated words score well. Essays from students in mainland China, which do well on essay length and sophisticated word choice, received higher scores from the algorithms than from expert human graders, suggesting that these students may be using chunks of pre-memorized text.

As US education policy began over-emphasizing student test scores as the primary way to evaluate teachers, widespread scandals followed, with teachers and principals cheating by altering students’ scores in Georgia, Indiana, Massachusetts, Nevada, Virginia, Texas, and elsewhere. One consequence of this is that teachers who don’t cheat may be penalized or even fired (when it appears student test scores have dropped to more average levels under their instruction). When metrics are given undue importance, attempts to game those metrics become common.

Metrics tend to overemphasize short-term concerns

It is much easier to measure short-term quantities: click through rates, month-over-month churn, quarterly earnings. Many long-term trends have a complex mix of factors and are tougher to quantify. What is the long-term impact on user trust of having your brand associated with promoting pedophilia, white supremacy, and flat-earth theories? What is the long-term impact on hiring to be the subject of years worth of privacy scandals, political manipulation, and facilitating genocide?

Simply measuring what users click on is a short-term concern, and does not take into account factors like the potential long-term impact of a long-form investigative article which may have taken months to research and which could help shape a reader’s understanding of a complex issue and even lead to significant societal changes.

A recent Harvard Business Review article looked at Wells Fargo as a case study of how letting metrics replace strategy can harm a business. After identifying cross-selling as a measure of long-term customer relationships, Wells Fargo went overboard emphasizing the cross-selling metric: intense pressure on employees combined with an unethical sales culture led to 3.5 million fraudulent deposit and credit card accounts being opened without customers’ consent. The metric of cross-selling is a much more short-term concern compared to the loftier goal of nurturing long-term customer relationships. Overemphasizing metrics removes our focus from long-term concerns such as our values, trust and reputation, and our impact on society and the environment, and myopically focuses on the short-term.

Many metrics gather data of what we do in highly addictive environments

It matters which metrics we gather and in what environment we do so. Metrics such as what users click on, how much time they spend on sites, and “engagement” are heavily relied on by tech companies as proxies for user preference, and are used to drive important business decisions. Unfortunately, these metrics are gathered in environments engineered to be highly addictive, laden with dark patterns, and where financial and design decisions have already greatly circumscribed the range of options.

Our online environment is a buffet of junk food

Zeynep Tufekci, a professor at UNC and regular contributor to the New York Times, compares recommendation algorithms (such as YouTube choosing which videos to auto-play for you and Facebook deciding what to put at the top of your newsfeed) to a cafeteria shoving junk food into children’s faces. “This is a bit like an autopilot cafeteria in a school that has figured out children have sweet teeth, and also like fatty and salty foods. So you make a line offering such food, automatically loading the next plate as soon as the bag of chips or candy in front of the young person has been consumed.” As those selections get normalized, the output becomes ever more extreme: “So the food gets higher and higher in sugar, fat and salt – natural human cravings – while the videos recommended and auto-played by YouTube get more and more bizarre or hateful.” Too many of our online environments are like this, with metrics capturing that we love sugar, fat, and salt, not taking into account that we are in the digital equivalent of a food desert and that companies haven’t been required to put nutrition labels on what they are offering. Such metrics are not indicative of what we would prefer in a healthier or more empowering environment.

When Metrics are Useful

All this is not to say that we should throw metrics out altogether. Data can be valuable in helping us understand the world, test hypotheses, and move beyond gut instincts or hunches. Metrics can be useful when they are in their proper context and place. One way to keep metrics in their place is to consider a slate of many metrics for a fuller picture (and resist the temptation to try to boil these down to a single score). For instance, knowing the rates at which tech companies hire people from under-indexed groups is a very limited data point. For evaluating diversity and inclusion at tech companies, we need to know comparative promotion rates, cap table ownership, retention rates (many tech companies are revolving doors driving people from under-indexed groups away with their toxic cultures), number of harassment victims silenced by NDAs, rates of under-leveling, and more. Even then, all this data should still be combined with listening to first-person experiences of those working at these companies.

Columbia professor and New York Times Chief Data Scientist Chris Wiggins wrote that quantitative measures should always be combined with qualitative information, “Since we can not know in advance every phenomenon users will experience, we can not know in advance what metrics will quantify these phenomena. To that end, data scientists and machine learning engineers must partner with or learn the skills of user experience research, giving users a voice.”

Another key to keeping metrics in their proper place is to keep domain experts and those who will be most impacted closely involved in their development and use. Surely most teachers could have foreseen that evaluating teachers primarily on the standardized test scores of their students would lead to a host of negative consequences.

I am not opposed to metrics; I am alarmed about the harms caused when metrics are overemphasized, a phenomenon that we see frequently with AI, and which is having a negative, real-world impact. AI running unchecked to optimize metrics has led to Google/YouTube’s heavy promotion of white supremacist material, essay grading software that rewards garbage, and more. By keeping the risks of metrics in mind, we can try to prevent these harms.

8 Things You Need to Know about Surveillance

Over 225 police departments have partnered with Amazon to have access to Amazon’s video footage obtained as part of the “smart” doorbell product Ring, and in many cases these partnerships are heavily subsidized with taxpayer money. Police departments are allowing Amazon to stream 911 call information directly in real-time, and Amazon requires police departments to read pre-approved scripts when talking about the program. If a homeowner doesn’t want to share data from their video camera doorbell with police, an officer for the Fresno County Sheriff’s Office said they can just go directly to Amazon to obtain it. This creation of an extensive surveillance network, the murky private-public partnership surrounding it, and a lack of any sort of regulations or oversight is frightening. And this is just one of many examples related to surveillance technology that have recently come to light.

I frequently talk with people who are not that concerned about surveillance, or who feel that the positives outweigh the risks. Here, I want to share some important truths about surveillance:

  1. Surveillance can facilitate human rights abuses and even genocide
  2. Data is often used for different purposes than why it was collected
  3. Data often contains errors
  4. Surveillance typically operates with no accountability
  5. Surveillance changes our behavior
  6. Surveillance disproportionately impacts the marginalized
  7. Data privacy is a public good
  8. We don’t have to accept invasive surveillance

While I was writing this post, a number of investigative articles came out with disturbing new developments related to surveillance. I decided that rather than attempt to include everything in one post (which would make it too long and too dense), I would go ahead and share the above facts about surveillance, as they are just as relevant as ever.

1. Surveillance can facilitate human rights abuses and even genocide

There is a long history of data about sensitive attributes being misused, including the use of the 1940 USA Census to intern Japanese Americans, a system of identity cards introduced by the Belgian colonial government that were later used during the 1994 Rwandan genocide (in which nearly a million people were murdered), and the role of IBM in helping Nazi Germany use punchcard computers to identify and track the mass killing of millions of Jewish people. More recently, the mass internment of over one million people who are part of an ethnic minority in Western China was facilitated through the use of a surveillance network of cameras, biometric data (including images of people’s faces, audio of their voices, and blood samples), and phone monitoring.

Adolf Hitler meeting with IBM CEO Tom Watson Sr. in 1937. Source: Computer History (https://www.computerhistory.org/revolution/punched-cards/2/15/109)

Pictured above is Adolf Hitler (far left) meeting with IBM CEO Tom Watson Sr. (2nd from left), shortly before Hitler awarded Watson a special “Service to the Reich” medal in 1937 (for a timeline of the Holocaust, see here). Watson returned the medal in 1940, although IBM continued to do business with the Nazis. IBM technology helped the Nazis conduct detailed censuses in countries they occupied, to thoroughly identify anyone of Jewish descent. Nazi concentration camps used IBM’s punchcard machines to tabulate prisoners, recording whether they were Jewish, gay, or Gypsies, and whether they died of “natural causes,” execution, suicide, or via “special treatment” in gas chambers. It is not the case that IBM sold the machines and then was done with it. Rather, IBM and its subsidiaries provided regular training and maintenance on-site at the concentration camps: printing off cards, configuring machines, and repairing them as they broke frequently.

2. Data is often used for different purposes than why it was collected

In the above examples, the data collection began before genocide was committed. IBM began selling to Nazi Germany well before the Holocaust (although continued for far too long), including helping with Germany’s 1933 census conducted by Adolf Hitler, which was effective at identifying far more Jewish people than had previously been recognized in Germany.

It is important to recognize how data and images gathered through surveillance can be weaponized later. Columbia professor Tim Wu wrote that “One [hard truth] is that data and surveillance networks created for one purpose can and will be used for others. You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”

Plenty of data collection is not involved with such extreme abuse as genocide; however, in a time of global resurgence of white supremacist, ethno-nationalist, and authoritarian movements, it would be deeply irresponsible to not consider how data & surveillance can and will be weaponized against already vulnerable groups.

3. Data often has errors (and no mechanism for correcting them)

A database of suspected gang members maintained by California law enforcement officials was found to be full of errors, including 42 babies who had been added to the database when they were less than 1 year old (28 of whom were marked as “admitting to being gang members”). Even worse, there was no process in place for correcting mistakes or removing people once they had been added.

An NPR reporter recounts his experience of trying to rent an apartment and discovering that TransUnion, one of the 3 major credit bureaus, incorrectly reported him as having two felony firearms convictions. TransUnion only removed the mistakes after a dozen phone calls and notification that the story would be reported on. This is not an unusual story: the FTC’s large-scale study of credit reports in 2012 found 26% of consumers had at least one mistake in their files and 5% had errors that could be devastating. An even more opaque, unregulated “4th bureau” exists: a collection of companies buying and selling personal information about people on the margins of the banking system (such as immigrants, students, and people with low incomes), with no standards on what types of data are included, no way to opt out, and no system for identifying or correcting mistakes.

4. Surveillance typically operates with no accountability

What makes the examples in the previous section disturbing is not just that errors occurred, but that there was no way to identify or correct them, and no accountability for those profiting off the error-laden data. Often, even the existence of the systems being used is not publicly known (much less details of how these systems work), unless discovered by journalists or revealed by whistleblowers. The Detroit Police Department used facial recognition technology for nearly two years without public input and in violation of a requirement that a policy be approved by the city’s Board of Police Commissioners, until a study from Georgetown Law’s Center on Privacy & Technology drew attention to the issue. Palantir, the defense startup founded by billionaire Peter Thiel, ran a program with the New Orleans Police Department for 6 years which city council members did not even know about, much less have any oversight over.

After two studies found that Amazon’s facial recognition software produced inaccurate and racially biased results, Amazon countered that the researchers should have changed the default parameters. However, it turned out that Amazon was not instructing police departments that use its software to do this either. Surveillance programs are operating with few regulations, no oversight, no accountability around accuracy or mistakes, and in many cases, no public knowledge of what is going on.

5. Surveillance changes our behavior

Hundreds of thousands of people in Hong Kong are protesting an unpopular new bill which would allow extradition to China. Typically, Hong Kong locals use their rechargeable smart cards to ride the subway. However, during the protests, long lines of people waited to use cash to buy paper tickets (usually something that only tourists do), concerned that they would be tracked for having attended the protests. Would fewer people protest if this were not an option?

In the United States, in 2015 the Baltimore Police Department used facial recognition technology to surveil people protesting the death of Freddie Gray, a young Black man who was killed in police custody, and arrested protesters with outstanding warrants. Mass surveillance could have a chilling impact on our rights to move about freely, to express ourselves, and to protest. “We act differently when we know we are ‘on the record.’ Mass privacy is the freedom to act without being watched and thus in a sense, to be who we really are,” Columbia professor Tim Wu wrote in the New York Times.

Flyer from the company Geofeedia. Source: https://www.aclunc.org/docs/20161011_geofeedia_baltimore_case_study.pdf

6. Surveillance disproportionately impacts those who are already marginalized

Surveillance is applied unevenly, causing the greatest harm to people who are already marginalized, including immigrants, people of color, and people living in poverty. These groups are more heavily policed and surveilled. The Perpetual Line-Up from the Georgetown Law Center on Privacy and Technology studied the unregulated use of facial recognition by police, with half of all Americans appearing in law enforcement databases, and the risks of errors, racial bias, misuses, and threats to civil liberties. The researchers pointed out that African Americans are more likely to appear in these databases (many of which are drawn from mug shots) since they are disproportionately likely to be stopped, interrogated, or arrested. For another example, consider the contrast between how easily people over 65 can apply for Medicare benefits by filling out an online form, and the invasive personal questions asked of a low-income mother on Medicaid about her lovers, hygiene, parental shortcomings, and personal habits.

In an article titled Trading privacy for survival is another tax on the poor, Ciara Byrne wrote, “Current public benefits programs ask applicants extremely detailed and personal questions and sometimes mandate home visits, drug tests, fingerprinting, and collection of biometric information… Employers of low-income workers listen to phone calls, conduct drug tests, monitor closed-circuit television, and require psychometric tests as conditions of employment. Prisoners in some states have to consent to be voiceprinted in order to make phone calls.”

7. Data privacy is a public good, like air quality or safe drinking water

Data is more revealing in aggregate. It can be nearly impossible to know what your individual data could reveal when combined with the data of others or with data from other sources, or when machine learning inference is performed on it. For instance, as Zeynep Tufekci wrote in the New York Times, individual Strava users could not have predicted how in aggregate their data could be used to identify the locations of US military bases. “Data privacy is not like a consumer good, where you click ‘I accept’ and all is well. Data privacy is more like air quality or safe drinking water, a public good that cannot be effectively regulated by trusting in the wisdom of millions of individual choices. A more collective response is needed.”

Unfortunately, this also means that you can’t fully safeguard your privacy on your own. You may choose not to purchase Amazon’s Ring doorbell, yet you can still show up in the video footage collected by others. You might strengthen your online privacy practices, yet conclusions will still be inferred about you based on the behavior of others. As Professor Tufekci wrote, we need a collective response.

8. We don’t have to accept invasive surveillance

Many people are uncomfortable with surveillance, but feel like they have no say in the matter. While the threats surveillance poses are large, it is not too late to act. We are seeing success: in response to community organizing and an audit, Los Angeles Police Department scrapped a controversial program to predict who is most likely to commit violent crimes. Citizens, researchers, and activists in Detroit have been effective at drawing attention to the Detroit Police Department’s unregulated use of facial recognition and a bill calling for a 5-year moratorium has been introduced to the state legislature. Local governments in San Francisco, Oakland, and Somerville have banned the use of facial recognition by police.

For further resources, please check out:

Make Delegation Work in Python

The Delegation Problem

Let’s look at a problem that all coders have faced; something that I call the delegation problem. To explain, I’ll use an example. Here’s a class you might see in a content management system:

class WebPage:
    def __init__(self, title, category="General", date=None, author="Jeremy"):
        self.title,self.category,self.author = title,category,author
        self.date = date or datetime.now()
        ...

Then, you want to add a subclass for certain types of page, such as a product page. It should have all the details of WebPage, plus some extra stuff. One way to do it would be with inheritance, like this:

class ProductPage(WebPage):
    def __init__(self, title, price, cost, category="General", date=None, author="Jeremy"):
        super().__init__(title, category=category, date=date, author=author)
        ...

But now we’re violating the Don’t Repeat Yourself (DRY) principle. We’ve duplicated both our list of parameter names, and the defaults. So later on, we might decide to change the default author to “Rachel”, so we change the definition in WebPage.__init__. But we forget to do the same in ProductPage, and now we have a bug 🐛! (When writing the fastai deep learning library I’ve created bugs many times in this way, and sometimes they’ve been extremely hard to track down, because differences in deep learning hyper-parameters can have very subtle and hard to test or detect implications.)

To avoid this, perhaps we could instead write it this way:

class ProductPage(WebPage):
    def __init__(self, title, price, cost, **kwargs):
        super().__init__(title, **kwargs)
        ...

The key to this approach is the use of **kwargs. In Python, **kwargs in a parameter list means “put any additional keyword arguments into a dict called kwargs”, and **kwargs in an argument list means “insert all key/value pairs in the kwargs dict as named arguments here”. This approach is used in many popular libraries, such as matplotlib, in which the main plot function simply has the signature plot(*args, **kwargs). The plot documentation says “The kwargs are Line2D properties” and then lists those properties.
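
Here’s a minimal demonstration of both directions:

def show(**kwargs):
    # Any keyword arguments not matched by named parameters land in the `kwargs` dict
    return kwargs

show(category='Bathroom', author='Sylvain')
# {'category': 'Bathroom', 'author': 'Sylvain'}

opts = {'category': 'Bathroom', 'author': 'Sylvain'}
show(**opts)  # expanding the dict back into keyword arguments gives the same result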

It’s not just Python that uses this approach. For instance, in the R language the equivalent to **kwargs is simply written ... (an ellipsis). The R documentation explains: “Another frequent requirement is to allow one function to pass on argument settings to another. For example many graphics functions use the function par() and functions like plot() allow the user to pass on graphical parameters to par() to control the graphical output. This can be done by including an extra argument, literally ‘…’, of the function, which may then be passed on”.

For more details on using **kwargs in python, Google will find you many nice tutorials, such as this one. The **kwargs solution appears to work quite nicely:

p = ProductPage('Soap', 15.0, 10.50, category='Bathroom', author="Sylvain")
p.author
'Sylvain'

However, this makes our API quite difficult to work with, because now the environment we’re using for editing our Python code (examples in this article assume we’re using Jupyter Notebook) doesn’t know what parameters are available, so things like tab-completion of parameter names and popup lists of signatures won’t work 😢. In addition, if we’re using an automatic tool for generating API documentation (such as fastai’s show_doc or Sphinx), our docs won’t include the full list of parameters, and we’ll need to manually add information about these delegated parameters (i.e. category, date, and author, in this case). In fact, we’ve seen this already, in matplotlib’s documentation for plot.
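
You can see the problem directly by inspecting the signature of the **kwargs version:

import inspect
print(inspect.signature(ProductPage))
# (title, price, cost, **kwargs) -- category, date, and author are nowhere to be seen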

Another alternative is to avoid inheritance, and instead use composition, like so:

class ProductPage:
    def __init__(self, page, price, cost):
        self.page,self.price,self.cost = page,price,cost
        ...

p = ProductPage(WebPage('Soap', category='Bathroom', author="Sylvain"), 15.0, 10.50)
p.page.author
'Sylvain'

This has a new problem, however, which is that the most basic attributes are now hidden underneath p.page, which is not a great experience for our class users (and the constructor is now rather clunky compared to our inheritance version).

Solving the problem with delegated inheritance

The solution to this that I’ve recently come up with is to create a decorator that is used like this:

@delegates()
class ProductPage(WebPage):
    def __init__(self, title, price, cost, **kwargs):
        super().__init__(title, **kwargs)
        ...

…which looks exactly like what we had before for our inheritance version with kwargs, but has this key difference:

print(inspect.signature(ProductPage))
(title, price, cost, category='General', date=None, author='Jeremy')

It turns out that this approach, which I call delegated inheritance, solves all of our problems; in Jupyter if I hit the standard “show parameters” key Shift-Tab while instantiating a ProductPage, I see the full list of parameters, including those from WebPage. And hitting Tab will show me a completion list including the WebPage parameters. In addition, documentation tools see the full, correct signature, including the WebPage parameter details.

To decorate delegating functions instead of class __init__ we use much the same syntax. The only difference is that we now need to pass the function we’re delegating to:

def create_web_page(title, category="General", date=None, author="Jeremy"):
    ...

@delegates(create_web_page)
def create_product_page(title, price, cost, **kwargs):
    ...

print(inspect.signature(create_product_page))
(title, price, cost, category='General', date=None, author='Jeremy')

I really can’t overstate how significant this little decorator is to my coding practice. In early versions of fastai we used kwargs frequently for delegation, because we wanted to ensure the code was as simple as possible to write (otherwise I tend to make a lot of mistakes!). We used it not just for delegating __init__ to the parent, but also for standard functions, similar to how it’s used in matplotlib’s plot function. However, as fastai got more popular, I heard more and more feedback along the lines of “I love everything about fastai, except I hate dealing with kwargs”! And I totally empathized; indeed, dealing with ... in R APIs and kwargs in Python APIs has been a regular pain-point for me too. But here I was, inflicting it on my users! 😯

I am, of course, not the first person to have dealt with this. The Use and Abuse of Keyword Arguments in Python is a thoughtful article which concludes “So it’s readability vs extensibility. I tend to argue for readability over extensibility, and that’s what I’ll do here: for the love of whatever deity/ies you believe in, use **kwargs sparingly and document their use when you do”. This is what we ended up doing in fastai too. Last year Sylvain spent a pleasant (!) afternoon removing every kwargs he could and replacing it with explicit parameter lists. And of course now we get the occasional bug resulting from one of us failing to update parameter defaults in all functions…

But now that’s all in the past. We can use **kwargs again, and have simpler, more reliable code thanks to DRY, along with a great experience for developers. 🎈 And the basic functionality of delegates() is just a few lines of code (source at the bottom of this article).

Solving the problem with delegated composition

For an alternative solution, let’s look again at the composition approach:

class ProductPage:
    def __init__(self, page, price, cost): self.page,self.price,self.cost = page,price,cost

page = WebPage('Soap', category='Bathroom', author="Sylvain")
p = ProductPage(page, 15.0, 10.50)
p.page.author
'Sylvain'

How do we make it so we can just write p.author, instead of p.page.author? It turns out that Python has a great solution to this: just override __getattr__, which is called automatically any time an unknown attribute is requested:

class ProductPage:
    def __init__(self, page, price, cost): self.page,self.price,self.cost = page,price,cost
    def __getattr__(self, k): return getattr(self.page,k)

p = ProductPage(page, 15.0, 10.50)
p.author
'Sylvain'

That’s a good start. But we have a couple of problems. The first is that we’ve lost our tab-completion again… But we can fix it! Python calls __dir__ to figure out what attributes are provided by an object, so we can override it and list the attributes in self.page as well.
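
For example, a minimal sketch of that fix (before we deal with the second problem below) might look like this:

class ProductPage:
    def __init__(self, page, price, cost): self.page,self.price,self.cost = page,price,cost
    def __getattr__(self, k): return getattr(self.page,k)
    def __dir__(self):
        # advertise the composed page's attributes too, so tab-completion can see them
        return list(super().__dir__()) + dir(self.page)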

The second problem is that we often want to control which attributes are forwarded to the composed object. Having anything and everything forwarded could lead to unexpected bugs. So we should consider providing a list of forwarded attributes, and use that in both __getattr__ and __dir__.

I’ve created a simple base class called GetAttr that fixes both of these issues. You just have to set the default attribute in your object to the object you wish to delegate to, and everything else is automatic! You can also optionally set the _xtra attribute to a list of strings containing the names of attributes you wish to forward (it defaults to every attribute in default, except those whose name starts with _).

class ProductPage(GetAttr):
    def __init__(self, page, price, cost):
        self.page,self.price,self.cost = page,price,cost
        self.default = page

p = ProductPage(page, 15.0, 10.50)
p.author
'Sylvain'

Here are the attributes you’ll see in tab completion:

[o for o in dir(p) if not o.startswith('_')]
['author', 'category', 'cost', 'date', 'default', 'page', 'price', 'title']
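
The example above relies on the default behaviour (forward everything public). If you only want a subset forwarded, one way that works with the GetAttr source shown at the end of this post (treat this as a sketch rather than one of the article’s own examples) is to shadow _xtra with a class-level list:

class ProductPage(GetAttr):
    _xtra = ['author', 'category']  # forward only these attributes of `default`
    def __init__(self, page, price, cost):
        self.page,self.price,self.cost = page,price,cost
        self.default = page

p = ProductPage(page, 15.0, 10.50)
p.author      # 'Sylvain'
p.date        # raises AttributeError, since 'date' is not forwarded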

So now we have two really nice ways of handling delegation; which you choose will depend on the details of the problem you’re solving. If you’ll be using the composed object in a few different places, the composition approach will probably be best. If you’re adding some functionality to an existing class, delegated inheritance might result in a cleaner API for your class users.

See the end of this post for the source of GetAttr and delegates.

Making delegation work for you

Now that you have this tool in your toolbox, how are you going to use it?

I’ve recently started using it in many of my classes and functions. Most of my classes build on the functionality of other classes, either my own, or from another module, so I often use composition or inheritance. When I do so, I normally like to make the full functionality of the original class available too. By using GetAttr and delegates I don’t need to make any compromises between maintainability, readability, and usability!

I’d love to hear if you try this, whether you find it helpful or not. I’d also be interested in hearing about other ways that people are solving the delegation problem. The best way to reach me is to mention me on Twitter, where I’m @jeremyphoward.

A brief note on coding style

PEP 8 shows the “coding conventions for the Python code comprising the standard library in the main Python distribution”. They are also widely used in many other Python projects. I do not use PEP 8 for data science work, or for teaching more generally, since the goals and context are very different to those of the Python standard library (and PEP 8’s very first point is “A Foolish Consistency is the Hobgoblin of Little Minds”). Generally my code tends to follow the fastai style guide, which was designed for data science and teaching. So please:

  • Don’t follow the coding conventions in this code if you work on projects that use PEP 8
  • Don’t complain to me that my code doesn’t use PEP 8.

Source code

Here’s the delegates() function; just copy it somewhere and use it… I don’t know that it’s worth creating a pip package for:

import inspect

def delegates(to=None, keep=False):
    "Decorator: replace `**kwargs` in signature with params from `to`"
    def _f(f):
        # With no target given, assume we're decorating a class: delegate from
        # its __init__ to the parent class's __init__
        if to is None: to_f,from_f = f.__base__.__init__,f.__init__
        else:          to_f,from_f = to,f
        sig = inspect.signature(from_f)
        sigd = dict(sig.parameters)
        k = sigd.pop('kwargs')
        # Pull in the target's keyword parameters (those with defaults) that
        # aren't already present in the decorated signature
        s2 = {k:v for k,v in inspect.signature(to_f).parameters.items()
              if v.default != inspect.Parameter.empty and k not in sigd}
        sigd.update(s2)
        if keep: sigd['kwargs'] = k
        from_f.__signature__ = sig.replace(parameters=sigd.values())
        return f
    return _f
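
The keep parameter isn’t used in the examples above. Reading the code, passing keep=True puts **kwargs back at the end of the displayed signature, so the function still advertises that it accepts arbitrary extra keyword arguments; a quick sketch, re-using the create_web_page example from earlier:

@delegates(create_web_page, keep=True)
def create_product_page(title, price, cost, **kwargs):
    ...

print(inspect.signature(create_product_page))
(title, price, cost, category='General', date=None, author='Jeremy', **kwargs)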

And here’s GetAttr. As you can see, there’s not much to it!

def custom_dir(c, add): return dir(type(c)) + list(c.__dict__.keys()) + add

class GetAttr:
    "Base class for attr accesses in `self._xtra` passed down to `self.default`"
    @property
    def _xtra(self): return [o for o in dir(self.default) if not o.startswith('_')]
    def __getattr__(self,k):
        # Only called for attributes not found the normal way; forward the
        # allowed names to the composed object in `self.default`
        if k in self._xtra: return getattr(self.default, k)
        raise AttributeError(k)
    def __dir__(self): return custom_dir(self, self._xtra)