
Concerned about the impacts of data misuse? Ways to get involved with the USF Center for Applied Data Ethics

An algorithm applied to over 200 million patients is more likely to recommend extra health care for relatively healthy white patients than for sicker black patients (paper in Science and news coverage). Russia was found to be running influence operations in 6 African countries via 73 Facebook pages, many of which purported to be local news sources, and which also spanned WhatsApp and Telegram (paper from the Stanford Internet Observatory and news coverage). An Indigenous elder revealed that the Indigenous consultation that Sidewalk Labs (an Alphabet/Google company) conducted was “hollow and tokenistic”, with zero of the 14 recommendations that arose from the consultation included in Sidewalk Labs’ 1,500-page report, even though the report mentions the consultation many times. All of these stories occurred just in the last week, the same week during which former Alphabet board chairman Eric Schmidt complained that people “don’t need to yell” about bias. Issues of data misuse, including bias, surveillance, and disinformation, continue to be urgent and pervasive. For those of you living in the SF Bay Area, the Tech Policy Workshop (register here) being hosted Nov 16-17 by the USF Center for Applied Data Ethics (CADE) will be an excellent opportunity to learn and engage around these issues (the sessions will be recorded and shared online later). And for those of you living elsewhere, we have several videos to watch and news articles to read now!

Events to attend

Exploratorium After Dark: Matters of Fact

I’m going to be speaking at the Exploratorium After Dark (an adults-only evening program) on Thurs, Nov 7, on Disinformation: The Threat We Are Facing is Bigger Than Just “Fake News”. The event is from 6-10pm (my talk is at 8:30pm) and a lot of the other exhibits sound fascinating. Details and tickets here.

Great speaker line-up for our Nov 16-17 Tech Policy Workshop in SF

Systemic problems, such as increasing surveillance, spread of disinformation, concerning uses of predictive policing, and the magnification of unjust bias, all require systemic solutions. We hope to facilitate collaborations between those in tech and in policy, as well as highlight the need for policy interventions in addressing ethical issues to those working in tech.

USF Center for Applied Data Ethics Tech Policy Workshop
Dates: Nov 16-17, 9am-5:30pm
Location: McLaren Conference Center, 2130 Fulton Street, San Francisco, CA 94117
Breakfast, lunch, and snack included
Details: https://www.sfdatainstitute.org/
Register here
Anyone interested in the impact of data misuse on society & the intersection with policy is welcome!

Some of the many great speakers lined up for our Tech Policy Workshop.

Info Session about our spring Data Ethics Course

I will be teaching an Intro to Data Ethics course downtown on Monday evenings, from Jan 27 to March 9 (with no class Feb 17). The course is intended for working professionals. Come find out more at an info session on Nov 12.

Videos of our Data Ethics Seminars

We had 3 fantastic speakers for our Data Ethics Seminar this fall: Deborah Raji, Ali Alkhatib, and Brian Brackeen.

Deborah Raji gave a powerful inaugural seminar, opening with the line “There is an urgency to AI ethics & accountability work, because there are currently real people being affected.” Unfortunately, what it means to do machine learning that matters in the real world is different than what academia incentivizes. Using her work with Joy Buolamwini on GenderShades as a case study, she shared how research can be designed with the specific goal of having a concrete impact. GenderShades has been cited in a number of lawsuits, bans, federal bills, and state bills around the use of facial recognition.

Ali Alkhatib gave an excellent seminar on using lenses and frameworks originating in the social sciences to understand problems situated in technology: “Everything we do is situated within cultural & historical backdrops. If we’re serious about ethics & justice, we need to be serious about understanding those histories.”

Brian Brackeen shared his experience founding a facial recognition start-up, the issues of racial bias in facial recognition, and his current work funding under-represented founders in tech. After his talk, we had a fireside chat and a lively Q&A session. Unfortunately, due to a mix-up with the videographer, we do not have a recording of his talk. I encourage you to follow Brian on Twitter, read his powerful TechCrunch op-ed on why he refused to sell facial recognition to law enforcement, and watch this previous panel he was on.

All of these events were open to the public, and we will be hosting more seminars in the spring. To keep up on events, please join our email list:

And finally, I want to share a recent talk I gave on “Getting Specific About Algorithmic Bias.” Through a series of case studies, I illustrate how different types of bias have different sources (and require different approaches to mitigate the bias), debunk several misconceptions about bias, and share some steps towards solutions.

Apply to our data ethics fellowships

We are offering full-time fellowships for those working on problems of applied data ethics, with a particular focus on work that has a direct, practical impact. Applications will be reviewed after November 1, 2019, with roles starting in January or June 2020. We welcome applicants from any discipline (including, but not limited to, computer science, statistics, law, the social sciences, history, media studies, political science, public policy, and business). We are looking for people who have shown deep interest and expertise in areas related to data ethics, including disinformation, surveillance/privacy, unjust bias, and tech policy. Here are the job postings:


The problem with metrics is a big problem for AI

Goodhart’s Law states that “When a measure becomes a target, it ceases to be a good measure.” At their heart, what most current AI approaches do is to optimize metrics. The practice of optimizing metrics is neither new nor unique to AI, yet AI can be particularly efficient (even too efficient!) at doing so.

This is important to understand, because any risks of optimizing metrics are heightened by AI. While metrics can be useful in their proper place, there are harms when they are unthinkingly applied. Some of the scariest instances of algorithms run amok (such as Google’s algorithm contributing to radicalizing people into white supremacy, teachers being fired by an algorithm, or essay grading software that rewards sophisticated garbage) all result from over-emphasizing metrics. We have to understand this dynamic in order to understand the urgent risks we are facing due to misuse of AI.

Headlines from HBR, Washington Post, and Vice on some of the outcomes of over-optimizing metrics: rewarding gibberish essays, promoting propaganda, massive fraud at Wells Fargo, and firing good teachers

The following principles will be illustrated through a series of case studies:

We can’t measure the things that matter most

Metrics are typically just a proxy for what we really care about. The paper Does Machine Learning Automate Moral Hazard and Error? covers an interesting example: the researchers investigated which factors in someone’s electronic medical record are most predictive of a future stroke. However, they found that several of the most predictive factors (such as accidental injury, a benign breast lump, or a colonoscopy) don’t make sense as risk factors for stroke. So what was going on? The model didn’t actually have data on who had a stroke (a physiological event in which regions of the brain are denied oxygen); it had data on who had access to medical care, chose to go to a doctor, was given the needed tests, and had this billing code added to their chart. A number of factors influence that process: who has health insurance or can afford their co-pay, who can take time off work or find childcare, gender and racial biases that impact who gets accurate diagnoses, cultural factors, and more. As a result, the model was largely picking out people who utilize health care a lot, not people most at risk of stroke.

This is an example of the common phenomenon of having to use proxies: you want to know what content users like, so you measure what they click on. You want to know which teachers are most effective, so you measure their students’ test scores. You want to know about crime, so you measure arrests. These things are not the same. Many things we care about cannot be measured. Metrics can be helpful, but we can’t forget that they are just proxies.

As another example, Google used hours spent watching YouTube as a proxy for how happy users were with the content, writing on the Google blog that “If viewers are watching more YouTube, it signals to us that they’re happier with the content they’ve found.” Guillaume Chaslot, an AI engineer who formerly worked at Google/YouTube, shares how this had the side effect of incentivizing conspiracy theories, since convincing users that the rest of the media is lying kept them watching more YouTube.

Metrics can, and will, be gamed

It is almost inevitable that metrics will be gamed, particularly when they are given too much power. One week this spring, Chaslot collected 84,695 videos from YouTube and analyzed the number of views and the number of channels from which they were recommended. This is what he found (also covered in the Washington Post):

Chart showing Russia Today's video on the Mueller Report as being an outlier in how many YouTube channels recommended it. (Source: https://twitter.com/gchaslot/status/1121603851675553793)

The state-owned media outlet Russia Today was an extreme outlier in how much YouTube’s algorithm had selected it to be recommended by a wide variety of other YouTube channels. Such algorithmic selections, which begin autoplaying as soon as your current video is done, account for 70% of the time that users spend on YouTube. This chart strongly suggests that Russia Today has in some way gamed YouTube’s algorithm. (More evidence about issues with YouTube’s recommendation system is detailed here.) Platforms are rife with attempts to game their algorithms, to show up higher in search results or recommended content, through fake clicks, fake reviews, fake followers, and more.

Automatic essay grading software focuses primarily on metrics like sentence length, vocabulary, spelling, and subject-verb agreement, but is unable to evaluate aspects of writing that are hard to quantify, such as creativity. As a result, gibberish essays randomly generated by computer programs to contain lots of sophisticated words score well. Essays from students in mainland China, which do well on essay length and sophisticated word choice, received higher scores from the algorithms than from expert human graders, suggesting that these students may be using chunks of pre-memorized text.

As USA education policy began over-emphasizing student test scores as the primary way to evaluate teachers, widespread cheating scandals ensued, with teachers and principals altering students’ scores in Georgia, Indiana, Massachusetts, Nevada, Virginia, Texas, and elsewhere. One consequence is that teachers who don’t cheat may be penalized or even fired (when it appears that student test scores have dropped to more average levels under their instruction). When metrics are given undue importance, attempts to game them become common.

Metrics tend to overemphasize short-term concerns

It is much easier to measure short-term quantities: click through rates, month-over-month churn, quarterly earnings. Many long-term trends have a complex mix of factors and are tougher to quantify. What is the long-term impact on user trust of having your brand associated with promoting pedophilia, white supremacy, and flat-earth theories? What is the long-term impact on hiring to be the subject of years worth of privacy scandals, political manipulation, and facilitating genocide?

Simply measuring what users click on is a short-term concern, and does not take into account factors like the potential long-term impact of a long-form investigative article which may have taken months to research and which could help shape a reader’s understanding of a complex issue and even lead to significant societal changes.

A recent Harvard Business Review article looked at Wells Fargo as a case study of how letting metrics replace strategy can harm a business. After identifying cross-selling as a measure of long-term customer relationships, Wells Fargo went overboard emphasizing the cross-selling metric: intense pressure on employees combined with an unethical sales culture led to 3.5 million fraudulent deposit and credit card accounts being opened without customers’ consent. The metric of cross-selling is a much more short-term concern compared to the loftier goal of nurturing long-term customer relationships. Overemphasizing metrics removes our focus from long-term concerns such as our values, trust and reputation, and our impact on society and the environment, and myopically focuses on the short-term.

Many metrics gather data of what we do in highly addictive environments

It matters which metrics we gather and in what environment we do so. Metrics such as what users click on, how much time they spend on sites, and “engagement” are heavily relied on by tech companies as proxies for user preference, and are used to drive important business decisions. Unfortunately, these metrics are gathered in environments engineered to be highly addictive, laden with dark patterns, and where financial and design decisions have already greatly circumscribed the range of options.

Our online environment is a buffet of junk food

Zeynep Tufekci, a professor at UNC and regular contributor to the New York Times, compares recommendation algorithms (such as YouTube choosing which videos to auto-play for you and Facebook deciding what to put at the top of your newsfeed) to a cafeteria shoving junk food into children’s faces. “This is a bit like an autopilot cafeteria in a school that has figured out children have sweet teeth, and also like fatty and salty foods. So you make a line offering such food, automatically loading the next plate as soon as the bag of chips or candy in front of the young person has been consumed.” As those selections get normalized, the output becomes ever more extreme: “So the food gets higher and higher in sugar, fat and salt – natural human cravings – while the videos recommended and auto-played by YouTube get more and more bizarre or hateful.” Too many of our online environments are like this, with metrics capturing that we love sugar, fat, and salt, not taking into account that we are in the digital equivalent of a food desert and that companies haven’t been required to put nutrition labels on what they are offering. Such metrics are not indicative of what we would prefer in a healthier or more empowering environment.

When Metrics are Useful

All this is not to say that we should throw metrics out altogether. Data can be valuable in helping us understand the world, test hypotheses, and move beyond gut instincts or hunches. Metrics can be useful when they are in their proper context and place. One way to keep metrics in their place is to consider a slate of many metrics for a fuller picture (and resist the temptation to try to boil these down to a single score). For instance, knowing the rates at which tech companies hire people from under-indexed groups is a very limited data point. For evaluating diversity and inclusion at tech companies, we need to know comparative promotion rates, cap table ownership, retention rates (many tech companies are revolving doors driving people from under-indexed groups away with their toxic cultures), number of harassment victims silenced by NDAs, rates of under-leveling, and more. Even then, all this data should still be combined with listening to first-person experiences of those working at these companies.

Columbia professor and New York Times Chief Data Scientist Chris Wiggins wrote that quantitative measures should always be combined with qualitative information, “Since we can not know in advance every phenomenon users will experience, we can not know in advance what metrics will quantify these phenomena. To that end, data scientists and machine learning engineers must partner with or learn the skills of user experience research, giving users a voice.”

Another key to keeping metrics in their proper place is to keep domain experts and those who will be most impacted closely involved in their development and use. Surely most teachers could have foreseen that evaluating teachers primarily on the standardized test scores of their students would lead to a host of negative consequences.

I am not opposed to metrics; I am alarmed about the harms caused when metrics are overemphasized, a phenomenon that we see frequently with AI, and which is having a negative, real-world impact. AI running unchecked to optimize metrics has led to Google/YouTube’s heavy promotion of white supremacist material, essay grading software that rewards garbage, and more. By keeping the risks of metrics in mind, we can try to prevent these harms.

8 Things You Need to Know about Surveillance

Over 225 police departments have partnered with Amazon to have access to Amazon’s video footage obtained as part of the “smart” doorbell product Ring, and in many cases these partnerships are heavily subsidized with taxpayer money. Police departments are allowing Amazon to stream 911 call information directly in real-time, and Amazon requires police departments to read pre-approved scripts when talking about the program. If a homeowner doesn’t want to share data from their video camera doorbell with police, an officer for the Fresno County Sheriff’s Office said they can just go directly to Amazon to obtain it. This creation of an extensive surveillance network, the murky private-public partnership surrounding it, and a lack of any sort of regulations or oversight is frightening. And this is just one of many examples related to surveillance technology that have recently come to light.

I frequently talk with people who are not that concerned about surveillance, or who feel that the positives outweigh the risks. Here, I want to share some important truths about surveillance:

  1. Surveillance can facilitate human rights abuses and even genocide
  2. Data is often used for different purposes than why it was collected
  3. Data often contains errors
  4. Surveillance typically operates with no accountability
  5. Surveillance changes our behavior
  6. Surveillance disproportionately impacts the marginalized
  7. Data privacy is a public good
  8. We don’t have to accept invasive surveillance

While I was writing this post, a number of investigative articles came out with disturbing new developments related to surveillance. I decided that rather than attempt to include everything in one post (which would make it too long and too dense), I would go ahead and share the above facts about surveillance, as they are just as relevant as ever.

1. Surveillance can facilitate human rights abuses and even genocide

There is a long history of data about sensitive attributes being misused, including the use of the 1940 USA Census to intern Japanese Americans, a system of identity cards introduced by the Belgian colonial government that were later used during the 1994 Rwandan genocide (in which nearly a million people were murdered), and the role of IBM in helping Nazi Germany use punchcard computers to identify and track the mass killing of millions of Jewish people. More recently, the mass internment of over one million people who are part of an ethnic minority in Western China was facilitated through the use of a surveillance network of cameras, biometric data (including images of people’s faces, audio of their voices, and blood samples), and phone monitoring.

Adolf Hitler meeting with IBM CEO Tom Watson Sr. in 1937. (Source: https://www.computerhistory.org/revolution/punched-cards/2/15/109)

Pictured above is Adolf Hitler (far left) meeting with IBM CEO Tom Watson Sr. (2nd from left), shortly before Hitler awarded Watson a special “Service to the Reich” medal in 1937 (for a timeline of the Holocaust, see here). Watson returned the medal in 1940, although IBM continued to do business with the Nazis. IBM technology helped the Nazis conduct detailed censuses in countries they occupied, to thoroughly identify anyone of Jewish descent. Nazi concentration camps used IBM’s punchcard machines to tabulate prisoners, recording whether they were Jewish, gay, or Gypsies, and whether they died of “natural causes,” execution, suicide, or via “special treatment” in gas chambers. It is not the case that IBM sold the machines and then was done with it. Rather, IBM and its subsidiaries provided regular training and maintenance on-site at the concentration camps: printing off cards, configuring machines, and repairing them as they broke frequently.

2. Data is often used for different purposes than why it was collected

In the above examples, the data collection began before genocide was committed. IBM began selling to Nazi Germany well before the Holocaust (although it continued for far too long), including helping with Germany’s 1933 census, conducted under Adolf Hitler, which was effective at identifying far more Jewish people than had previously been recognized in Germany.

It is important to recognize how data and images gathered through surveillance can be weaponized later. Columbia professor Tim Wu wrote that “One [hard truth] is that data and surveillance networks created for one purpose can and will be used for others. You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”

Plenty of data collection is not involved with such extreme abuse as genocide; however, in a time of global resurgence of white supremacist, ethno-nationalist, and authoritarian movements, it would be deeply irresponsible to not consider how data & surveillance can and will be weaponized against already vulnerable groups.

3. Data often has errors (and no mechanism for correcting them)

A database of suspected gang members maintained by California law enforcement officials was found to be full of errors, including 42 babies who had been added to the database when they were less than 1 year old (28 of whom were marked as “admitting to being gang members”). Even worse, there was no process in place for correcting mistakes or removing people once they had been added.

An NPR reporter recounts his experience of trying to rent an apartment and discovering that TransUnion, one of the 3 major credit bureaus, incorrectly reported him as having two felony firearms convictions. TransUnion only removed the mistakes after a dozen phone calls and notification that the story would be reported on. This is not an unusual story: the FTC’s large-scale study of credit reports in 2012 found 26% of consumers had at least one mistake in their files and 5% had errors that could be devastating. An even more opaque, unregulated “4th bureau” exists: a collection of companies buying and selling personal information about people on the margins of the banking system (such as immigrants, students, and people with low incomes), with no standards on what types of data are included, no way to opt out, and no system for identifying or correcting mistakes.

4. Surveillance typically operates with no accountability

What makes the examples in the previous section disturbing is not just that errors occurred, but that there was no way to identify or correct them, and no accountability for those profiting off the error-laden data. Often, even the existence of the systems being used is not publicly known (much less details of how these systems work), unless discovered by journalists or revealed by whistleblowers. The Detroit Police Dept used facial recognition technology for nearly two years without public input and in violation of a requirement that a policy be approved by the city’s Board of Police Commissioners, until a study from Georgetown Law’s Center for Privacy & Technology drew attention to the issue. Palantir, the defense startup founded by billionaire Peter Thiel, ran a program with New Orleans Police Department for 6 years which city council members did not even know about, much less have any oversight.

After two studies found that Amazon’s facial recognition software produced inaccurate and racially biased results, Amazon countered that the researchers should have changed the default parameters. However, it turned out that Amazon was not instructing police departments that use its software to do this either. Surveillance programs are operating with few regulations, no oversight, no accountability around accuracy or mistakes, and in many cases, no public knowledge of what is going on.

5. Surveillance changes our behavior

Hundreds of thousands of people in Hong Kong are protesting an unpopular new bill which would allow extradition to China. Typically, Hong Kong locals use their rechargeable smart cards to ride the subway. However, during the protests, long lines of people waited to use cash to buy paper tickets (usually something that only tourists do) concerned that they would be tracked for having attended the protests. Would fewer people protest if this was not an option?

In the United States, in 2015 the Baltimore Police Department used facial recognition technology to surveil people protesting the death of Freddie Gray, a young Black man who was killed in police custody, and arrested protesters with outstanding warrants. Mass surveillance could have a chilling impact on our rights to move about freely, to express ourselves, and to protest. “We act differently when we know we are ‘on the record.’ Mass privacy is the freedom to act without being watched and thus in a sense, to be who we really are,” Columbia professor Tim Wu wrote in the New York Times.

Flyer from the company Geofeedia. (Source: https://www.aclunc.org/docs/20161011_geofeedia_baltimore_case_study.pdf)

6. Surveillance disproportionately impacts those who are already marginalized

Surveillance is applied unevenly, causing the greatest harm to people who are already marginalized, including immigrants, people of color, and people living in poverty. These groups are more heavily policed and surveilled. The Perpetual Line-Up from the Georgetown Law Center on Privacy and Technology studied the unregulated use of facial recognition by police, with half of all Americans appearing in law enforcement databases, and the risks of errors, racial bias, misuses, and threats to civil liberties. The researchers pointed out that African Americans are more likely to appear in these databases (many of which are drawn from mug shots) since they are disproportionately likely to be stopped, interrogated, or arrested. For another example, consider the contrast between how easily people over 65 can apply for Medicare benefits by filling out an online form, and the invasive personal questions asked of a low-income mother on Medicaid about her lovers, hygiene, parental shortcomings, and personal habits.

In an article titled Trading privacy for survival is another tax on the poor, Ciara Byrne wrote, “Current public benefits programs ask applicants extremely detailed and personal questions and sometimes mandate home visits, drug tests, fingerprinting, and collection of biometric information… Employers of low-income workers listen to phone calls, conduct drug tests, monitor closed-circuit television, and require psychometric tests as conditions of employment. Prisoners in some states have to consent to be voiceprinted in order to make phone calls.”

7. Data privacy is a public good, like air quality or safe drinking water

Data is more revealing in aggregate. It can be nearly impossible to know what your individual data could reveal when combined with the data of others or with data from other sources, or when machine learning inference is performed on it. For instance, as Zeynep Tufekci wrote in the New York Times, individual Strava users could not have predicted how in aggregate their data could be used to identify the locations of US military bases. “Data privacy is not like a consumer good, where you click ‘I accept’ and all is well. Data privacy is more like air quality or safe drinking water, a public good that cannot be effectively regulated by trusting in the wisdom of millions of individual choices. A more collective response is needed.”

Unfortunately, this also means that you can’t fully safeguard your privacy on your own. You may choose not to purchase Amazon’s ring doorbell, yet you can still show up in the video footage collected by others. You might strengthen your online privacy practices, yet conclusions will still be inferred about you based on the behavior of others. As Professor Tufekci wrote, we need a collective response.

8. We don’t have to accept invasive surveillance

Many people are uncomfortable with surveillance, but feel like they have no say in the matter. While the threats surveillance poses are large, it is not too late to act. We are seeing success: in response to community organizing and an audit, Los Angeles Police Department scrapped a controversial program to predict who is most likely to commit violent crimes. Citizens, researchers, and activists in Detroit have been effective at drawing attention to the Detroit Police Department’s unregulated use of facial recognition and a bill calling for a 5-year moratorium has been introduced to the state legislature. Local governments in San Francisco, Oakland, and Somerville have banned the use of facial recognition by police.

For further resources, please check out:

Make Delegation Work in Python

The Delegation Problem

Let’s look at a problem that all coders have faced: something that I call the delegation problem. To explain, I’ll use an example class you might see in a content management system:

from datetime import datetime

class WebPage:
    def __init__(self, title, category="General", date=None, author="Jeremy"):
        self.title,self.category,self.author = title,category,author
        self.date = date or datetime.now()

Then, you want to add a subclass for certain types of page, such as a product page. It should have all the details of WebPage, plus some extra stuff. One way to do it would be with inheritance, like this:

class ProductPage(WebPage):
    def __init__(self, title, price, cost, category="General", date=None, author="Jeremy"):
        super().__init__(title, category=category, date=date, author=author)
        self.price,self.cost = price,cost

But now we’re violating the Don’t Repeat Yourself (DRY) principle. We’ve duplicated both our list of parameter names, and the defaults. So later on, we might decide to change the default author to “Rachel”, so we change the definition in WebPage.__init__. But we forget to do the same in ProductPage, and now we have a bug 🐛! (When writing the fastai deep learning library I’ve created bugs many times in this way, and sometimes they’ve been extremely hard to track down, because differences in deep learning hyper-parameters can have very subtle and hard to test or detect implications.)
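To make this failure mode concrete, here is a runnable sketch of the stale-default bug (the switch of the default author to "Rachel" is hypothetical, for illustration only):

```python
from datetime import datetime

class WebPage:
    # Suppose we later change the default author here...
    def __init__(self, title, category="General", date=None, author="Rachel"):
        self.title,self.category,self.author = title,category,author
        self.date = date or datetime.now()

class ProductPage(WebPage):
    # ...but forget to update the duplicated default here
    def __init__(self, title, price, cost, category="General", date=None, author="Jeremy"):
        super().__init__(title, category=category, date=date, author=author)
        self.price,self.cost = price,cost

p = ProductPage('Soap', 15.0, 10.50)
# p.author is still "Jeremy": the duplicated default silently shadows the new one
```

Nothing crashes here, which is exactly what makes this kind of bug so hard to notice.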

To avoid this, perhaps we could instead write it this way:

class ProductPage(WebPage):
    def __init__(self, title, price, cost, **kwargs):
        super().__init__(title, **kwargs)
        self.price,self.cost = price,cost

The key to this approach is the use of **kwargs. In Python, **kwargs in a parameter list means “put any additional keyword arguments into a dict called kwargs”. And **kwargs in an argument list means “insert all key/value pairs in the kwargs dict as named arguments here”. This approach is used in many popular libraries, such as matplotlib, in which the main plot function simply has the signature plot(*args, **kwargs). The plot documentation says “The kwargs are Line2D properties” and then lists those properties.
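As a minimal sketch of those two faces of ** (the names here are illustrative, not from the article):

```python
def describe(name, **kwargs):
    # Any keyword arguments beyond `name` are collected into the dict `kwargs`.
    return f"{name}: {sorted(kwargs.items())}"

opts = {'color': 'red', 'width': 2}
# **opts expands the dict back into keyword arguments at the call site.
print(describe('line', **opts))
# line: [('color', 'red'), ('width', 2)]
```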

It’s not just Python that uses this approach. For instance, in the R language the equivalent to **kwargs is simply written ... (an ellipsis). The R documentation explains: “Another frequent requirement is to allow one function to pass on argument settings to another. For example many graphics functions use the function par() and functions like plot() allow the user to pass on graphical parameters to par() to control the graphical output. This can be done by including an extra argument, literally ‘…’, of the function, which may then be passed on”.

For more details on using **kwargs in Python, Google will find you many nice tutorials, such as this one. The **kwargs solution appears to work quite nicely:

p = ProductPage('Soap', 15.0, 10.50, category='Bathroom', author="Sylvain")

However, this makes our API quite difficult to work with, because now the environment we’re using for editing our Python code (examples in this article assume we’re using Jupyter Notebook) doesn’t know what parameters are available, so things like tab-completion of parameter names and popup lists of signatures won’t work 😢. In addition, if we’re using an automatic tool for generating API documentation (such as fastai’s show_doc or Sphinx), our docs won’t include the full list of parameters, and we’ll need to manually add information about these delegated parameters (i.e. category, date, and author, in this case). In fact, we’ve seen this already, in matplotlib’s documentation for plot.
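To make this concrete, here is a self-contained sketch (with price and cost stored for completeness) showing exactly what introspection reports for the kwargs-based version:

```python
import inspect
from datetime import datetime

class WebPage:
    def __init__(self, title, category="General", date=None, author="Jeremy"):
        self.title,self.category,self.author = title,category,author
        self.date = date or datetime.now()

class ProductPage(WebPage):
    def __init__(self, title, price, cost, **kwargs):
        super().__init__(title, **kwargs)
        self.price,self.cost = price,cost

# Introspection (which powers tab-completion and doc tools) only sees
# the opaque **kwargs, not the delegated parameters:
print(inspect.signature(ProductPage.__init__))
# (self, title, price, cost, **kwargs)
```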

Another alternative is to avoid inheritance, and instead use composition, like so:

class ProductPage:
    def __init__(self, page, price, cost):
        self.page,self.price,self.cost = page,price,cost

p = ProductPage(WebPage('Soap', category='Bathroom', author="Sylvain"), 15.0, 10.50)

This has a new problem, however, which is that the most basic attributes are now hidden underneath p.page, which is not a great experience for our class users (and the constructor is now rather clunky compared to our inheritance version).

Solving the problem with delegated inheritance

The solution to this that I’ve recently come up with is to create a decorator that is used like this:

@delegates()
class ProductPage(WebPage):
    def __init__(self, title, price, cost, **kwargs):
        super().__init__(title, **kwargs)
        self.price,self.cost = price,cost

…which is almost identical to our kwargs-based inheritance version, but has this key difference: the signature that Python now reports for ProductPage is the complete one:

(title, price, cost, category='General', date=None, author='Jeremy')

It turns out that this approach, which I call delegated inheritance, solves all of our problems; in Jupyter if I hit the standard “show parameters” key Shift-Tab while instantiating a ProductPage, I see the full list of parameters, including those from WebPage. And hitting Tab will show me a completion list including the WebPage parameters. In addition, documentation tools see the full, correct signature, including the WebPage parameter details.
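Since the delegates() source is given at the end of this article, we can sketch the whole thing end to end and verify the reported signature:

```python
import inspect
from datetime import datetime

def delegates(to=None, keep=False):
    "Decorator: replace `**kwargs` in signature with params from `to`"
    def _f(f):
        if to is None: to_f,from_f = f.__base__.__init__,f.__init__
        else:          to_f,from_f = to,f
        sig = inspect.signature(from_f)
        sigd = dict(sig.parameters)
        k = sigd.pop('kwargs')
        s2 = {k:v for k,v in inspect.signature(to_f).parameters.items()
              if v.default != inspect.Parameter.empty and k not in sigd}
        sigd.update(s2)  # merge the delegated-to parameters into the signature
        if keep: sigd['kwargs'] = k
        from_f.__signature__ = sig.replace(parameters=sigd.values())
        return f
    return _f

class WebPage:
    def __init__(self, title, category="General", date=None, author="Jeremy"):
        self.title,self.category,self.author = title,category,author
        self.date = date or datetime.now()

@delegates()
class ProductPage(WebPage):
    def __init__(self, title, price, cost, **kwargs):
        super().__init__(title, **kwargs)
        self.price,self.cost = price,cost

# The **kwargs in the signature has been replaced by WebPage's defaults:
print(inspect.signature(ProductPage.__init__))
# (self, title, price, cost, category='General', date=None, author='Jeremy')
```

Jupyter and documentation tools display this same signature (without self), which is what makes tab-completion work.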

To decorate delegating functions instead of class __init__ we use much the same syntax. The only difference is that we now need to pass the function we’re delegating to:

def create_web_page(title, category="General", date=None, author="Jeremy"):
    ...

@delegates(create_web_page)
def create_product_page(title, price, cost, **kwargs):
    ...

After decoration, the reported signature of create_product_page is:

(title, price, cost, category='General', date=None, author='Jeremy')

I really can’t overstate how significant this little decorator is to my coding practice. In early versions of fastai we used kwargs frequently for delegation, because we wanted to ensure our code was as simple as possible to write (otherwise I tend to make a lot of mistakes!) We used it not just for delegating __init__ to the parent, but also for standard functions, similar to how it’s used in matplotlib’s plot function. However, as fastai got more popular, I heard more and more feedback along the lines of “I love everything about fastai, except I hate dealing with kwargs”! And I totally empathized; indeed, dealing with ... in R APIs and kwargs in Python APIs has been a regular pain-point for me too. But here I was, inflicting it on my users! 😯

I am, of course, not the first person to have dealt with this. The Use and Abuse of Keyword Arguments in Python is a thoughtful article which concludes “So it’s readability vs extensibility. I tend to argue for readability over extensibility, and that’s what I’ll do here: for the love of whatever deity/ies you believe in, use **kwargs sparingly and document their use when you do”. This is what we ended up doing in fastai too. Last year Sylvain spent a pleasant (!) afternoon removing every kwargs he could and replacing it with explicit parameter lists. And of course now we get the occasional bug resulting from one of us failing to update parameter defaults in all functions…

But now that’s all in the past. We can use **kwargs again, and have the simpler and more reliable code thanks to DRY, and also a great experience for developers. 🎈 And the basic functionality of delegates() is just a few lines of code (source at bottom of article).

Solving the problem with delegated composition

For an alternative solution, let’s look again at the composition approach:

class ProductPage:
    def __init__(self, page, price, cost): self.page,self.price,self.cost = page,price,cost

page = WebPage('Soap', category='Bathroom', author="Sylvain")
p = ProductPage(page, 15.0, 10.50)

How do we make it so we can just write p.author instead of p.page.author? It turns out that Python has a great solution to this: just override __getattr__, which is called automatically any time an unknown attribute is requested:

class ProductPage:
    def __init__(self, page, price, cost): self.page,self.price,self.cost = page,price,cost
    def __getattr__(self, k): return getattr(self.page,k)

p = ProductPage(page, 15.0, 10.50)

That’s a good start. But we have a couple of problems. The first is that we’ve lost our tab-completion again… But we can fix it! Python calls __dir__ to figure out what attributes are provided by an object, so we can override it and list the attributes in self.page as well.

The second problem is that we often want to control which attributes are forwarded to the composed object. Having anything and everything forwarded could lead to unexpected bugs. So we should consider providing a list of forwarded attributes, and use that in both __getattr__ and __dir__.
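Here is a minimal sketch combining both fixes, using a hypothetical _forwarded allow-list (not from the article):

```python
from datetime import datetime

class WebPage:
    def __init__(self, title, category="General", date=None, author="Jeremy"):
        self.title,self.category,self.author = title,category,author
        self.date = date or datetime.now()

class ProductPage:
    # Hypothetical allow-list: only these attributes are forwarded to the page.
    _forwarded = ['title', 'category', 'author', 'date']
    def __init__(self, page, price, cost):
        self.page,self.price,self.cost = page,price,cost
    def __getattr__(self, k):
        # Called only for attributes not found the normal way.
        if k in self._forwarded: return getattr(self.page, k)
        raise AttributeError(k)
    def __dir__(self):
        # Advertise forwarded names so tab-completion sees them too.
        return list(super().__dir__()) + self._forwarded

p = ProductPage(WebPage('Soap', author='Sylvain'), 15.0, 10.50)
print(p.author)           # Sylvain
print('title' in dir(p))  # True
```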

I’ve created a simple base class called GetAttr that fixes both of these issues. You just have to set the default attribute in your object to the object you wish to delegate to, and everything else is automatic! You can also optionally set the _xtra attribute to a list of strings containing the names of attributes you wish to forward (it defaults to every attribute in default, except those whose name starts with _).

class ProductPage(GetAttr):
    def __init__(self, page, price, cost):
        self.page,self.price,self.cost = page,price,cost
        self.default = page

p = ProductPage(page, 15.0, 10.50)

Here are the attributes you’ll see in tab completion:

[o for o in dir(p) if not o.startswith('_')]
['author', 'category', 'cost', 'date', 'default', 'page', 'price', 'title']

So now we have two really nice ways of handling delegation; which you choose will depend on the details of the problem you’re solving. If you’ll be using the composed object in a few different places, the composition approach will probably be best. If you’re adding some functionality to an existing class, delegated inheritance might result in a cleaner API for your class users.

See the end of this post for the source of GetAttr and delegates.

Making delegation work for you

Now that you have this tool in your toolbox, how are you going to use it?

I’ve recently started using it in many of my classes and functions. Most of my classes build on the functionality of other classes, either my own or from another module, so I often use composition or inheritance. When I do so, I normally like to make the full functionality of the original class available too. By using GetAttr and delegates I don’t need to make any compromises between maintainability, readability, and usability!

I’d love to hear if you try this, whether you find it helpful or not. I’d also be interested in hearing about other ways that people are solving the delegation problem. The best way to reach me is to mention me on Twitter, where I’m @jeremyphoward.

A brief note on coding style

PEP 8 describes the “coding conventions for the Python code comprising the standard library in the main Python distribution”. These conventions are also widely used in many other Python projects. I do not use PEP 8 for data science work, or for teaching more generally, since the goals and context are very different from those of the Python standard library (and PEP 8’s very first point is “A Foolish Consistency is the Hobgoblin of Little Minds”). Generally my code tends to follow the fastai style guide, which was designed for data science and teaching. So please:

  • Don’t follow the coding conventions in this code if you work on projects that use PEP 8
  • Don’t complain to me that my code doesn’t use PEP 8.

Source code

Here’s the delegates() function; just copy it somewhere and use it… I don’t know that it’s worth creating a pip package for:

import inspect

def delegates(to=None, keep=False):
    "Decorator: replace `**kwargs` in signature with params from `to`"
    def _f(f):
        if to is None: to_f,from_f = f.__base__.__init__,f.__init__
        else:          to_f,from_f = to,f
        sig = inspect.signature(from_f)
        sigd = dict(sig.parameters)
        k = sigd.pop('kwargs')
        s2 = {k:v for k,v in inspect.signature(to_f).parameters.items()
              if v.default != inspect.Parameter.empty and k not in sigd}
        sigd.update(s2)
        if keep: sigd['kwargs'] = k
        from_f.__signature__ = sig.replace(parameters=sigd.values())
        return f
    return _f

And here’s GetAttr. As you can see, there’s not much to it!

def custom_dir(c, add): return dir(type(c)) + list(c.__dict__.keys()) + add

class GetAttr:
    "Base class for attr accesses in `self._xtra` passed down to `self.default`"
    @property
    def _xtra(self): return [o for o in dir(self.default) if not o.startswith('_')]
    def __getattr__(self,k):
        if k in self._xtra: return getattr(self.default, k)
        raise AttributeError(k)
    def __dir__(self): return custom_dir(self, self._xtra)

USF Launches New Center of Applied Data Ethics

Update: The first year of the USF Center for Applied Data Ethics will be funded with a generous gift from Craig Newmark Philanthropies, the organization of craigslist founder Craig Newmark. Read the official press release for more details.

While the widespread adoption of data science and machine learning techniques has led to many positive discoveries, it also poses risks and is causing harm. Facial recognition technology sold by Amazon, IBM, and other companies has been found to have significantly higher error rates on Black women, yet these same companies are already selling facial recognition and predictive policing technology to police, with no oversight, regulations, or accountability. Millions of people’s photos have been compiled into databases, often without their knowledge, and shared with foreign governments, military operations, and police departments. Major tech platforms (such as Google’s YouTube, which auto-plays videos selected by an algorithm), have been shown to disproportionately promote conspiracy theories and disinformation, helping radicalize people into toxic views such as white supremacy.

USF Data Institute in downtown SF, Image Credit: <a href='https://commons.wikimedia.org/w/index.php?curid=3460420'>By Eric in SF - Own work, CC BY-SA 4.0</a>
USF Data Institute in downtown SF, Image Credit: By Eric in SF - Own work, CC BY-SA 4.0

In response to these risks and harms, I am helping to launch a new Center for Applied Data Ethics (CADE), housed within the University of San Francisco’s Data Institute to address issues surrounding the misuse of data through education, research, public policy and civil advocacy. The first year will include a tech policy workshop, a data ethics seminar series, and data ethics courses, all of which will be open to the community at-large.

Misuses of data and AI include the encoding & magnification of unjust bias, increasing surveillance & erosion of privacy, spread of disinformation & amplification of conspiracy theories, lack of transparency or oversight in how predictive policing is being deployed, and lack of accountability for tech companies. These problems are alarming, difficult, urgent, and systemic, and it will take the efforts of a broad and diverse range of people to address them. Many individuals, organizations, institutes, and entire fields are already hard at work tackling these problems. We will not reinvent the wheel, but instead will leverage existing tools and will amplify experts from a range of backgrounds. Diversity is a crucial component in addressing tech ethics issues, and we are committed to including a diverse range of speakers and supporting students and researchers from underrepresented groups.

I am director of the new center. Since you’re reading the fast.ai blog, you may be familiar with my work, but if not, you can read about my background here. I earned my PhD at Duke University in 2010, was selected by Forbes as one of “20 Incredible Women in AI”, am co-founder of fast.ai, and have been a researcher at USF Data Institute since it was founded in 2016. In the past few years, I have done a lot of writing and speaking on data ethics issues.

Speaking about misuses of AI at <a href='https://www.youtube.com/watch?v=LqjP7O9SxOM&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=2&t=0s'>TEDx SF</a>
Speaking about misuses of AI at TEDx SF

What is the USF Data Institute?

The Center for Applied Data Ethics will be housed within the USF Data Institute, located in downtown San Francisco, and will be able to leverage our existing community, partnerships, and successes. In the 3 years since the founding of the Data Institute, more than 900 entrepreneurs & employees from local tech companies have taken evening and weekend courses here, and we have granted more than 177 diversity scholarships to people from underrepresented groups. The USF MS in Data Science program, now housed in the Data Institute, is entering its 8th year, and all students complete 8 month practicum projects at our 160 partner companies. Jeremy Howard and I have both been involved with the USF Data Institute since it first began 3 years ago; it is where we have taught the in-person versions of our deep learning, machine learning, computational linear algebra, and NLP courses, and we have both been chairs of tracks for the Data Institute conference. Additionally, Jeremy launched the Wicklow AI in Medicine Research Initiative as part of the Data Institute last year.

What will you do in the 1st year? How can I get involved?

Data Ethics Seminar Series: We will bring in experts on issues of data ethics in talks open to the community, and high-quality recordings of the talks will be shared online. We are excited to have Deborah Raji as our first speaker. Please join us on Monday August 19 for a reception with food and Deborah’s talk on “Actionable Auditing and Algorithmic Justice.”

Tech Policy Workshop: Systemic problems require systemic solutions. Individual behavior change will not address the structural misalignment of incentives and lack of accountability. We need thoughtful and informed laws to safeguard human rights, and we do not want legislation written by corporate lobbyists. When it comes to setting policy in this area, too few legislators have the needed technical background and too few of those with knowledge of the tech industry have the needed policy background. We will hold a 3-day tech policy workshop, tentatively scheduled for November 15-17.

Data Ethics Certificate Course open to the community: The USF Data Institute has been offering part-time evening and weekend courses in downtown SF for the last 3 years, including the popular Practical Deep Learning for Coders course taught by Jeremy Howard. You do not need to be a USF student to attend these courses, and over 900 people, most working professionals, have attended past courses at the Data Institute. I will be teaching a Data Ethics course one evening per week in January-February 2020.

Required Data Ethics Course for MS in Data Science students: USF has added a required data ethics course that all students in the Masters of Science in Data Science program will take.

Data Ethics Fellows: We plan to offer research fellowships for those working on problems of applied data ethics, with a particular focus on work that has a direct, practical impact. Fellows will have access to the resources, community, and courses at the USF Data Institute. We will begin accepting applications this fall, for year-long fellowships with start dates of January 2020 or June 2020.

If you are interested in any of these upcoming initiatives, please sign up for our mailing list to be notified when applications open.

Other FAQ

Q: What does this mean for your involvement with fast.ai?

A: We plan to release a data ethics course through fast.ai, sometime in mid-2020. (We have previously covered ethics issues in our Deep Learning for Coders course, and our recent A Code-First Intro to NLP included lessons on unjust bias and disinformation). I will continue to blog here on the fast.ai site and am still committed to the fast.ai mission.

Q: Given misuses of AI, isn’t your work at fast.ai to make AI accessible to more people dangerous?

A: What is dangerous is having a homogeneous and exclusive group designing technology that impacts us all. Companies such as Amazon, Palantir, Facebook, and others are generally considered quite prestigious and only hire those with “elite” backgrounds, yet we can see the widespread harm these companies are causing. We need a broader and more diverse group of people involved with AI, both to take advantage of the positives, as well as to address misuses of the technology. Please see my TEDx San Francisco talk for more details on this.

Q: Will you be coming up with a set of AI ethics principles?

A: No, there are many sets of AI ethics principles out there. We will not attempt to duplicate the work of others, but instead hope to amplify excellent work that is already being done (in addition to doing our own research).

Q: What do you consider the biggest ethical issues in tech?

A: Some of the issues that alarm me most are the encoding & magnification of unjust bias, increasing surveillance & erosion of privacy, spread of disinformation & amplification of conspiracy theories, lack of transparency or oversight in how predictive policing is being deployed, and lack of accountability for tech companies. For more information on these, please see some of my talks and posts linked below.

Here are some of my talks that you may be interested in:

And some previous blog posts:

I hope you can join us for our first data ethics seminar on the evening of Monday Aug 19 downtown in SF, and please sign up for our mailing list to stay in touch!