8 Things You Need to Know about Surveillance

Over 225 police departments have partnered with Amazon for access to video footage from Ring, Amazon’s “smart” doorbell, and in many cases these partnerships are heavily subsidized with taxpayer money. Police departments are allowing Amazon to stream 911 call information directly in real-time, and Amazon requires police departments to read pre-approved scripts when talking about the program. If a homeowner doesn’t want to share data from their video doorbell with police, an officer for the Fresno County Sheriff’s Office said they can just go directly to Amazon to obtain it. This creation of an extensive surveillance network, the murky public-private partnership surrounding it, and the lack of any regulation or oversight are frightening. And this is just one of many examples related to surveillance technology that have recently come to light.

I frequently talk with people who are not that concerned about surveillance, or who feel that the positives outweigh the risks. Here, I want to share some important truths about surveillance:

  1. Surveillance can facilitate human rights abuses and even genocide
  2. Data is often used for purposes other than those for which it was collected
  3. Data often contains errors
  4. Surveillance typically operates with no accountability
  5. Surveillance changes our behavior
  6. Surveillance disproportionately impacts the marginalized
  7. Data privacy is a public good
  8. We don’t have to accept invasive surveillance

While I was writing this post, a number of investigative articles came out with disturbing new developments related to surveillance. Rather than attempt to include everything in one post (which would make it too long and too dense), I decided to go ahead and share the above facts about surveillance, as they are just as relevant as ever.

1. Surveillance can facilitate human rights abuses and even genocide

There is a long history of data about sensitive attributes being misused, including the use of the 1940 USA Census to intern Japanese Americans, a system of identity cards introduced by the Belgian colonial government that was later used during the 1994 Rwandan genocide (in which nearly a million people were murdered), and the role of IBM in supplying Nazi Germany with the punchcard computers used to identify and track millions of Jewish people in the mass killings of the Holocaust. More recently, the mass internment of over one million people who are part of an ethnic minority in Western China was facilitated through the use of a surveillance network of cameras, biometric data (including images of people’s faces, audio of their voices, and blood samples), and phone monitoring.

Adolf Hitler meeting with IBM CEO Tom Watson Sr. in 1937.  Source: <a href=' https://www.computerhistory.org/revolution/punched-cards/2/15/109'>Computer History</a>

Pictured above is Adolf Hitler (far left) meeting with IBM CEO Tom Watson Sr. (2nd from left), shortly before Hitler awarded Watson a special “Service to the Reich” medal in 1937 (for a timeline of the Holocaust, see here). Watson returned the medal in 1940, although IBM continued to do business with the Nazis. IBM technology helped the Nazis conduct detailed censuses in countries they occupied, to thoroughly identify anyone of Jewish descent. Nazi concentration camps used IBM’s punchcard machines to tabulate prisoners, recording whether they were Jewish, gay, or Gypsies, and whether they died of “natural causes,” execution, suicide, or via “special treatment” in gas chambers. It is not the case that IBM sold the machines and then was done with it. Rather, IBM and its subsidiaries provided regular training and maintenance on-site at the concentration camps: printing off cards, configuring machines, and repairing them as they broke frequently.

2. Data is often used for purposes other than those for which it was collected

In the above examples, the data collection began before genocide was committed. IBM began selling to Nazi Germany well before the Holocaust (although it continued for far too long), including helping with Germany’s 1933 census, conducted under Adolf Hitler, which identified far more Jewish people than had previously been recognized in Germany.

It is important to recognize how data and images gathered through surveillance can be weaponized later. Columbia professor Tim Wu wrote that “One [hard truth] is that data and surveillance networks created for one purpose can and will be used for others. You must assume that any personal data that Facebook or Android keeps are data that governments around the world will try to get or that thieves will try to steal.”

Plenty of data collection is not involved with such extreme abuse as genocide; however, in a time of global resurgence of white supremacist, ethno-nationalist, and authoritarian movements, it would be deeply irresponsible to not consider how data & surveillance can and will be weaponized against already vulnerable groups.

3. Data often has errors (and no mechanism for correcting them)

A database of suspected gang members maintained by California law enforcement officials was found to be full of errors, including 42 babies who had been added to the database when they were less than 1 year old (28 of whom were marked as “admitting to being gang members”). Even worse, there was no process in place for correcting mistakes or removing people once they had been added.

An NPR reporter recounts his experience of trying to rent an apartment and discovering that TransUnion, one of the 3 major credit bureaus, incorrectly reported him as having two felony firearms convictions. TransUnion only removed the mistakes after a dozen phone calls and notification that the story would be reported on. This is not an unusual story: the FTC’s large-scale study of credit reports in 2012 found 26% of consumers had at least one mistake in their files and 5% had errors that could be devastating. An even more opaque, unregulated “4th bureau” exists: a collection of companies buying and selling personal information about people on the margins of the banking system (such as immigrants, students, and people with low incomes), with no standards on what types of data are included, no way to opt out, and no system for identifying or correcting mistakes.

4. Surveillance typically operates with no accountability

What makes the examples in the previous section disturbing is not just that errors occurred, but that there was no way to identify or correct them, and no accountability for those profiting off the error-laden data. Often, even the existence of the systems being used is not publicly known (much less details of how these systems work), unless discovered by journalists or revealed by whistleblowers. The Detroit Police Department used facial recognition technology for nearly two years without public input and in violation of a requirement that a policy be approved by the city’s Board of Police Commissioners, until a study from Georgetown Law’s Center on Privacy & Technology drew attention to the issue. Palantir, the defense startup founded by billionaire Peter Thiel, ran a program with the New Orleans Police Department for 6 years that city council members did not even know about, much less oversee.

After two studies found that Amazon’s facial recognition software produced inaccurate and racially biased results, Amazon countered that the researchers should have changed the default parameters. However, it turned out that Amazon was not instructing police departments that use its software to do this either. Surveillance programs are operating with few regulations, no oversight, no accountability around accuracy or mistakes, and in many cases, no public knowledge of what is going on.

5. Surveillance changes our behavior

Hundreds of thousands of people in Hong Kong are protesting an unpopular new bill which would allow extradition to China. Typically, Hong Kong locals use their rechargeable smart cards to ride the subway. However, during the protests, long lines of people waited to use cash to buy paper tickets (usually something that only tourists do), concerned that they would be tracked for having attended the protests. Would fewer people protest if this were not an option?

In the United States, in 2015 the Baltimore Police Department used facial recognition technology to surveil people protesting the death of Freddie Gray, a young Black man who was killed in police custody, and arrested protesters with outstanding warrants. Mass surveillance could have a chilling impact on our rights to move about freely, to express ourselves, and to protest. “We act differently when we know we are ‘on the record.’ Mass privacy is the freedom to act without being watched and thus in a sense, to be who we really are,” Columbia professor Tim Wu wrote in the New York Times.

Flyer from the company Geofeedia <a href='https://www.aclunc.org/docs/20161011_geofeedia_baltimore_case_study.pdf'>Source</a>

6. Surveillance disproportionately impacts those who are already marginalized

Surveillance is applied unevenly, causing the greatest harm to people who are already marginalized, including immigrants, people of color, and people living in poverty. These groups are more heavily policed and surveilled. The Perpetual Line-Up from the Georgetown Law Center on Privacy and Technology studied the unregulated use of facial recognition by police, with half of all Americans appearing in law enforcement databases, and the risks of errors, racial bias, misuses, and threats to civil liberties. The researchers pointed out that African Americans are more likely to appear in these databases (many of which are drawn from mug shots) since they are disproportionately likely to be stopped, interrogated, or arrested. For another example, consider the contrast between how easily people over 65 can apply for Medicare benefits by filling out an online form and the invasive personal questions asked of a low-income mother on Medicaid about her lovers, hygiene, parental shortcomings, and personal habits.

In an article titled Trading privacy for survival is another tax on the poor, Ciara Byrne wrote, “Current public benefits programs ask applicants extremely detailed and personal questions and sometimes mandate home visits, drug tests, fingerprinting, and collection of biometric information… Employers of low-income workers listen to phone calls, conduct drug tests, monitor closed-circuit television, and require psychometric tests as conditions of employment. Prisoners in some states have to consent to be voiceprinted in order to make phone calls.”

7. Data privacy is a public good, like air quality or safe drinking water

Data is more revealing in aggregate. It can be nearly impossible to know what your individual data could reveal when combined with the data of others or with data from other sources, or when machine learning inference is performed on it. For instance, as Zeynep Tufekci wrote in the New York Times, individual Strava users could not have predicted how in aggregate their data could be used to identify the locations of US military bases. “Data privacy is not like a consumer good, where you click ‘I accept’ and all is well. Data privacy is more like air quality or safe drinking water, a public good that cannot be effectively regulated by trusting in the wisdom of millions of individual choices. A more collective response is needed.”

Unfortunately, this also means that you can’t fully safeguard your privacy on your own. You may choose not to purchase Amazon’s Ring doorbell, yet you can still show up in the video footage collected by others. You might strengthen your online privacy practices, yet conclusions will still be inferred about you based on the behavior of others. As Professor Tufekci wrote, we need a collective response.

8. We don’t have to accept invasive surveillance

Many people are uncomfortable with surveillance, but feel like they have no say in the matter. While the threats surveillance poses are large, it is not too late to act. We are seeing success: in response to community organizing and an audit, Los Angeles Police Department scrapped a controversial program to predict who is most likely to commit violent crimes. Citizens, researchers, and activists in Detroit have been effective at drawing attention to the Detroit Police Department’s unregulated use of facial recognition and a bill calling for a 5-year moratorium has been introduced to the state legislature. Local governments in San Francisco, Oakland, and Somerville have banned the use of facial recognition by police.

For further resources, please check out:

Make Delegation Work in Python

The Delegation Problem

Let’s look at a problem that all coders have faced; something that I call the delegation problem. To explain, I’ll use an example. Here’s an example class you might see in a content management system:

from datetime import datetime

class WebPage:
    def __init__(self, title, category="General", date=None, author="Jeremy"):
        self.title,self.category,self.author = title,category,author
        self.date = date or datetime.now()
        ...

Then, you want to add a subclass for certain types of page, such as a product page. It should have all the details of WebPage, plus some extra stuff. One way to do it would be with inheritance, like this:

class ProductPage(WebPage):
    def __init__(self, title, price, cost, category="General", date=None, author="Jeremy"):
        super().__init__(title, category=category, date=date, author=author)
        ...

But now we’re violating the Don’t Repeat Yourself (DRY) principle. We’ve duplicated both our list of parameter names and the defaults. So later on, we might decide to change the default author to “Rachel”, so we change the definition in WebPage.__init__. But we forget to do the same in ProductPage, and now we have a bug 🐛! (When writing the fastai deep learning library I’ve created bugs many times in this way, and sometimes they’ve been extremely hard to track down, because differences in deep learning hyper-parameters can have subtle implications that are hard to test for or detect.)
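To make that failure mode concrete, here is a stripped-down sketch of the bug (hypothetical classes, with the bodies reduced to just the relevant defaults):

```python
class WebPage:
    def __init__(self, title, author="Rachel"):  # default updated here...
        self.title, self.author = title, author

class ProductPage(WebPage):
    def __init__(self, title, price, author="Jeremy"):  # ...but forgotten here
        super().__init__(title, author=author)
        self.price = price

# The subclass's stale default silently wins:
print(ProductPage('Soap', 15.0).author)  # prints 'Jeremy', not 'Rachel'
```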

To avoid this, perhaps we could instead write it this way:

class ProductPage(WebPage):
    def __init__(self, title, price, cost, **kwargs):
        super().__init__(title, **kwargs)
        ...

The key to this approach is the use of **kwargs. In Python, **kwargs in a parameter list means “put any additional keyword arguments into a dict called kwargs”. And **kwargs in an argument list means “insert all key/value pairs in the kwargs dict as named arguments here”. This approach is used in many popular libraries, such as matplotlib, in which the main plot function simply has the signature plot(*args, **kwargs). The plot documentation says “The kwargs are Line2D properties” and then lists those properties.
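Here is a minimal illustration of both directions (hypothetical functions, just to show the mechanics):

```python
def base(a, b=1, c=2):
    return (a, b, c)

def wrapper(a, **kwargs):
    # **kwargs in the parameter list: collect extra keyword arguments into a dict
    print(kwargs)            # e.g. {'c': 5}
    # **kwargs in the argument list: expand that dict back into named arguments
    return base(a, **kwargs)

print(wrapper(0, c=5))       # (0, 1, 5)
```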

It’s not just Python that uses this approach. For instance, in the R language the equivalent to **kwargs is simply written ... (an ellipsis). The R documentation explains: “Another frequent requirement is to allow one function to pass on argument settings to another. For example many graphics functions use the function par() and functions like plot() allow the user to pass on graphical parameters to par() to control the graphical output. This can be done by including an extra argument, literally ‘…’, of the function, which may then be passed on”.

For more details on using **kwargs in python, Google will find you many nice tutorials, such as this one. The **kwargs solution appears to work quite nicely:

p = ProductPage('Soap', 15.0, 10.50, category='Bathroom', author="Sylvain")
p.author
'Sylvain'

However, this makes our API quite difficult to work with, because now the environment we’re using for editing our Python code (examples in this article assume we’re using Jupyter Notebook) doesn’t know what parameters are available, so things like tab-completion of parameter names and popup lists of signatures won’t work 😢. In addition, if we’re using an automatic tool for generating API documentation (such as fastai’s show_doc or Sphinx), our docs won’t include the full list of parameters, and we’ll need to manually add information about these delegated parameters (i.e. category, date, and author, in this case). In fact, we’ve seen this already, in matplotlib’s documentation for plot.

Another alternative is to avoid inheritance, and instead use composition, like so:

class ProductPage:
    def __init__(self, page, price, cost):
        self.page,self.price,self.cost = page,price,cost
        ...

p = ProductPage(WebPage('Soap', category='Bathroom', author="Sylvain"), 15.0, 10.50)
p.page.author
'Sylvain'

This has a new problem, however, which is that the most basic attributes are now hidden underneath p.page, which is not a great experience for our class users (and the constructor is now rather clunky compared to our inheritance version).

Solving the problem with delegated inheritance

The solution to this that I’ve recently come up with is to create a decorator that is used like this:

@delegates()
class ProductPage(WebPage):
    def __init__(self, title, price, cost, **kwargs):
        super().__init__(title, **kwargs)
        ...

…which looks exactly like what we had before for our inheritance version with kwargs, but has this key difference:

print(inspect.signature(ProductPage))
(title, price, cost, category='General', date=None, author='Jeremy')

It turns out that this approach, which I call delegated inheritance, solves all of our problems; in Jupyter if I hit the standard “show parameters” key Shift-Tab while instantiating a ProductPage, I see the full list of parameters, including those from WebPage. And hitting Tab will show me a completion list including the WebPage parameters. In addition, documentation tools see the full, correct signature, including the WebPage parameter details.

To decorate delegating functions instead of class __init__ we use much the same syntax. The only difference is that we now need to pass the function we’re delegating to:

def create_web_page(title, category="General", date=None, author="Jeremy"):
    ...

@delegates(create_web_page)
def create_product_page(title, price, cost, **kwargs):
    ...

print(inspect.signature(create_product_page))
(title, price, cost, category='General', date=None, author='Jeremy')

I really can’t overstate how significant this little decorator is to my coding practice. In early versions of fastai we used kwargs frequently for delegation, because we wanted to ensure the code was as simple as possible to write (otherwise I tend to make a lot of mistakes!). We used it not just for delegating __init__ to the parent, but also for standard functions, similar to how it’s used in matplotlib’s plot function. However, as fastai got more popular, I heard more and more feedback along the lines of “I love everything about fastai, except I hate dealing with kwargs”! And I totally empathized; indeed, dealing with ... in R APIs and kwargs in Python APIs has been a regular pain-point for me too. But here I was, inflicting it on my users! 😯

I am, of course, not the first person to have dealt with this. The Use and Abuse of Keyword Arguments in Python is a thoughtful article which concludes “So it’s readability vs extensibility. I tend to argue for readability over extensibility, and that’s what I’ll do here: for the love of whatever deity/ies you believe in, use **kwargs sparingly and document their use when you do”. This is what we ended up doing in fastai too. Last year Sylvain spent a pleasant (!) afternoon removing every kwargs he could and replacing it with explicit parameter lists. And of course now we get the occasional bug resulting from one of us failing to update parameter defaults in all functions…

But now that’s all in the past. We can use **kwargs again, and have simpler, more reliable code thanks to DRY, and also a great experience for developers. 🎈 And the basic functionality of delegates() is just a few lines of code (source at bottom of article).

Solving the problem with delegated composition

For an alternative solution, let’s look again at the composition approach:

class ProductPage:
    def __init__(self, page, price, cost): self.page,self.price,self.cost = page,price,cost

page = WebPage('Soap', category='Bathroom', author="Sylvain")
p = ProductPage(page, 15.0, 10.50)
p.page.author
'Sylvain'

How do we make it so we can just write p.author, instead of p.page.author? It turns out that Python has a great solution to this: just override __getattr__, which is called automatically any time an unknown attribute is requested:

class ProductPage:
    def __init__(self, page, price, cost): self.page,self.price,self.cost = page,price,cost
    def __getattr__(self, k): return getattr(self.page,k)

p = ProductPage(page, 15.0, 10.50)
p.author
'Sylvain'

That’s a good start. But we have a couple of problems. The first is that we’ve lost our tab-completion again… But we can fix it! Python calls __dir__ to figure out what attributes are provided by an object, so we can override it and list the attributes in self.page as well.

The second problem is that we often want to control which attributes are forwarded to the composed object. Having anything and everything forwarded could lead to unexpected bugs. So we should consider providing a list of forwarded attributes, and use that in both __getattr__ and __dir__.
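Both fixes can be sketched directly, before introducing a reusable base class (this is a hypothetical sketch; a minimal stand-in Page class replaces WebPage so the snippet runs on its own):

```python
class Page:
    # minimal stand-in for WebPage
    def __init__(self, title, author="Jeremy"):
        self.title, self.author = title, author

class ProductPage:
    # forward only these attributes, rather than anything and everything
    _forwarded = ['title', 'author']
    def __init__(self, page, price, cost):
        self.page, self.price, self.cost = page, price, cost
    def __getattr__(self, k):
        # called only when normal attribute lookup fails
        if k in self._forwarded: return getattr(self.page, k)
        raise AttributeError(k)
    def __dir__(self):
        # include the forwarded attributes so tab-completion can see them
        return super().__dir__() + self._forwarded

p = ProductPage(Page('Soap', author='Sylvain'), 15.0, 10.50)
p.author             # 'Sylvain'
'author' in dir(p)   # True
```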

I’ve created a simple base class called GetAttr that fixes both of these issues. You just have to set the default property in your object to the attribute you wish to delegate to, and everything else is automatic! You can also optionally set the _xtra attribute to a string list containing the names of attributes you wish to forward (it defaults to every attribute in default, except those whose name starts with _).

class ProductPage(GetAttr):
    def __init__(self, page, price, cost):
        self.page,self.price,self.cost = page,price,cost
        self.default = page

p = ProductPage(page, 15.0, 10.50)
p.author
'Sylvain'

Here are the attributes you’ll see in tab completion:

[o for o in dir(p) if not o.startswith('_')]
['author', 'category', 'cost', 'date', 'default', 'page', 'price', 'title']

So now we have two really nice ways of handling delegation; which you choose will depend on the details of the problem you’re solving. If you’ll be using the composed object in a few different places, the composition approach will probably be best. If you’re adding some functionality to an existing class, delegated inheritance might result in a cleaner API for your class users.

See the end of this post for the source of GetAttr and delegates.

Making delegation work for you

Now that you have this tool in your toolbox, how are you going to use it?

I’ve recently started using it in many of my classes and functions. Most of my classes build on the functionality of other classes, either my own or from another module, so I often use composition or inheritance. When I do so, I normally like to make the full functionality of the original class available too. By using GetAttr and delegates I don’t need to make any compromises between maintainability, readability, and usability!

I’d love to hear if you try this, whether you find it helpful or not. I’d also be interested in hearing about other ways that people are solving the delegation problem. The best way to reach me is to mention me on Twitter, where I’m @jeremyphoward.

A brief note on coding style

PEP 8 describes the “coding conventions for the Python code comprising the standard library in the main Python distribution”. They are also widely used in many other Python projects. I do not use PEP 8 for data science work, or for teaching more generally, since the goals and context are very different from those of the Python standard library (and PEP 8’s very first point is “A Foolish Consistency is the Hobgoblin of Little Minds”). Generally my code tends to follow the fastai style guide, which was designed for data science and teaching. So please:

  • Don’t follow the coding conventions in this code if you work on projects that use PEP 8
  • Don’t complain to me that my code doesn’t use PEP 8.

Source code

Here’s the delegates() function; just copy it somewhere and use it… I don’t know that it’s worth creating a pip package for:

import inspect

def delegates(to=None, keep=False):
    "Decorator: replace `**kwargs` in signature with params from `to`"
    def _f(f):
        if to is None: to_f,from_f = f.__base__.__init__,f.__init__
        else:          to_f,from_f = to,f
        sig = inspect.signature(from_f)
        sigd = dict(sig.parameters)
        k = sigd.pop('kwargs')
        s2 = {k:v for k,v in inspect.signature(to_f).parameters.items()
              if v.default != inspect.Parameter.empty and k not in sigd}
        sigd.update(s2)
        if keep: sigd['kwargs'] = k
        from_f.__signature__ = sig.replace(parameters=sigd.values())
        return f
    return _f
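One parameter that is easy to miss is keep: passing keep=True leaves **kwargs in the replaced signature, which matters if the decorated function itself passes extra arguments further along. A quick check (repeating delegates from above so the snippet runs on its own):

```python
import inspect

def delegates(to=None, keep=False):
    "Decorator: replace `**kwargs` in signature with params from `to`"
    def _f(f):
        if to is None: to_f,from_f = f.__base__.__init__,f.__init__
        else:          to_f,from_f = to,f
        sig = inspect.signature(from_f)
        sigd = dict(sig.parameters)
        k = sigd.pop('kwargs')
        s2 = {k:v for k,v in inspect.signature(to_f).parameters.items()
              if v.default != inspect.Parameter.empty and k not in sigd}
        sigd.update(s2)
        if keep: sigd['kwargs'] = k
        from_f.__signature__ = sig.replace(parameters=sigd.values())
        return f
    return _f

def create_web_page(title, category="General", date=None, author="Jeremy"): ...

@delegates(create_web_page, keep=True)
def create_product_page(title, price, cost, **kwargs): ...

print(inspect.signature(create_product_page))
# (title, price, cost, category='General', date=None, author='Jeremy', **kwargs)
```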

And here’s GetAttr. As you can see, there’s not much to it!

def custom_dir(c, add): return dir(type(c)) + list(c.__dict__.keys()) + add

class GetAttr:
    "Base class for attr accesses in `self._xtra` passed down to `self.default`"
    @property
    def _xtra(self): return [o for o in dir(self.default) if not o.startswith('_')]
    def __getattr__(self,k):
        if k in self._xtra: return getattr(self.default, k)
        raise AttributeError(k)
    def __dir__(self): return custom_dir(self, self._xtra)

USF Launches New Center of Applied Data Ethics

Update: The first year of the USF Center for Applied Data Ethics will be funded with a generous gift from Craig Newmark Philanthropies, the organization of craigslist founder Craig Newmark. Read the official press release for more details.

While the widespread adoption of data science and machine learning techniques has led to many positive discoveries, it also poses risks and is causing harm. Facial recognition technology sold by Amazon, IBM, and other companies has been found to have significantly higher error rates on Black women, yet these same companies are already selling facial recognition and predictive policing technology to police, with no oversight, regulations, or accountability. Millions of people’s photos have been compiled into databases, often without their knowledge, and shared with foreign governments, military operations, and police departments. Major tech platforms (such as Google’s YouTube, which auto-plays videos selected by an algorithm), have been shown to disproportionately promote conspiracy theories and disinformation, helping radicalize people into toxic views such as white supremacy.

USF Data Institute in downtown SF, Image Credit: <a href='https://commons.wikimedia.org/w/index.php?curid=3460420'>By Eric in SF - Own work, CC BY-SA 4.0</a>

In response to these risks and harms, I am helping to launch a new Center for Applied Data Ethics (CADE), housed within the University of San Francisco’s Data Institute to address issues surrounding the misuse of data through education, research, public policy and civil advocacy. The first year will include a tech policy workshop, a data ethics seminar series, and data ethics courses, all of which will be open to the community at-large.

Misuses of data and AI include the encoding & magnification of unjust bias, increasing surveillance & erosion of privacy, spread of disinformation & amplification of conspiracy theories, lack of transparency or oversight in how predictive policing is being deployed, and lack of accountability for tech companies. These problems are alarming, difficult, urgent, and systemic, and it will take the efforts of a broad and diverse range of people to address them. Many individuals, organizations, institutes, and entire fields are already hard at work tackling these problems. We will not reinvent the wheel, but instead will leverage existing tools and will amplify experts from a range of backgrounds. Diversity is a crucial component in addressing tech ethics issues, and we are committed to including a diverse range of speakers and supporting students and researchers from underrepresented groups.

I am director of the new center. Since you’re reading the fast.ai blog, you may be familiar with my work, but if not, you can read about my background here. I earned my PhD at Duke University in 2010, was selected by Forbes as one of “20 Incredible Women in AI”, am co-founder of fast.ai, and have been a researcher at USF Data Institute since it was founded in 2016. In the past few years, I have done a lot of writing and speaking on data ethics issues.

Speaking about misuses of AI at <a href='https://www.youtube.com/watch?v=LqjP7O9SxOM&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=2&t=0s'>TEDx SF</a>

What is the USF Data Institute?

The Center for Applied Data Ethics will be housed within the USF Data Institute, located in downtown San Francisco, and will be able to leverage our existing community, partnerships, and successes. In the 3 years since the founding of the Data Institute, more than 900 entrepreneurs & employees from local tech companies have taken evening and weekend courses here, and we have granted more than 177 diversity scholarships to people from underrepresented groups. The USF MS in Data Science program, now housed in the Data Institute, is entering its 8th year, and all students complete 8 month practicum projects at our 160 partner companies. Jeremy Howard and I have both been involved with the USF Data Institute since it first began 3 years ago; it is where we have taught the in-person versions of our deep learning, machine learning, computational linear algebra, and NLP courses, and we have both been chairs of tracks for the Data Institute conference. Additionally, Jeremy launched the Wicklow AI in Medicine Research Initiative as part of the Data Institute last year.

What will you do in the 1st year? How can I get involved?

Data Ethics Seminar Series: We will bring in experts on issues of data ethics in talks open to the community, and high-quality recordings of the talks will be shared online. We are excited to have Deborah Raji as our first speaker. Please join us on Monday August 19 for a reception with food and Deborah’s talk on “Actionable Auditing and Algorithmic Justice.”

Tech Policy Workshop: Systemic problems require systemic solutions. Individual behavior change will not address the structural misalignment of incentives and lack of accountability. We need thoughtful and informed laws to safeguard human rights, and we do not want legislation written by corporate lobbyists. When it comes to setting policy in this area, too few legislators have the needed technical background and too few of those with knowledge of the tech industry have the needed policy background. We will hold a 3-day tech policy workshop, tentatively scheduled for November 15-17.

Data Ethics Certificate Course open to the community: The USF Data Institute has been offering part-time evening and weekend courses in downtown SF for the last 3 years, including the popular Practical Deep Learning for Coders course taught by Jeremy Howard. You do not need to be a USF student to attend these courses, and over 900 people, most working professionals, have attended past courses at the Data Institute. I will be teaching a Data Ethics course one evening per week in January-February 2020.

Required Data Ethics Course for MS in Data Science students: USF has added a required data ethics course that all students in the Masters of Science in Data Science program will take.

Data Ethics Fellows: We plan to offer research fellowships for those working on problems of applied data ethics, with a particular focus on work that has a direct, practical impact. Fellows will have access to the resources, community, and courses at the USF Data Institute. We will begin accepting applications this fall for year-long fellowships with start dates of January 2020 or June 2020.

If you are interested in any of these upcoming initiatives, please sign up for our mailing list to be notified when applications open.

Other FAQ

Q: What does this mean for your involvement with fast.ai?

A: We plan to release a data ethics course through fast.ai, sometime in mid-2020. (We have previously covered ethics issues in our Deep Learning for Coders course, and our recent A Code-First Intro to NLP included lessons on unjust bias and disinformation). I will continue to blog here on the fast.ai site and am still committed to the fast.ai mission.

Q: Given misuses of AI, isn’t your work at fast.ai to make AI accessible to more people dangerous?

A: What is dangerous is having a homogeneous and exclusive group designing technology that impacts us all. Companies such as Amazon, Palantir, Facebook, and others are generally considered quite prestigious and only hire those with “elite” backgrounds, yet we can see the widespread harm these companies are causing. We need a broader and more diverse group of people involved with AI, both to take advantage of the positives, as well as to address misuses of the technology. Please see my TEDx San Francisco talk for more details on this.

Q: Will you be coming up with a set of AI ethics principles?

A: No, there are many sets of AI ethics principles out there. We will not attempt to duplicate the work of others, but instead hope to amplify excellent work that is already being done (in addition to doing our own research).

Q: What do you consider the biggest ethical issues in tech?

A: Some of the issues that alarm me most are the encoding & magnification of unjust bias, increasing surveillance & erosion of privacy, spread of disinformation & amplification of conspiracy theories, lack of transparency or oversight in how predictive policing is being deployed, and lack of accountability for tech companies. For more information on these, please see some of my talks and posts linked below.

Here are some of my talks that you may be interested in:

And some previous blog posts:

I hope you can join us for our first data ethics seminar on the evening of Monday Aug 19 downtown in SF, and please sign up for our mailing list to stay in touch!

new fast.ai course: A Code-First Introduction to Natural Language Processing

Our newest course is a code-first introduction to NLP, following the fast.ai teaching philosophy of sharing practical code implementations and giving students a sense of the “whole game” before delving into lower-level details. Applications covered include topic modeling, classification (identifying whether the sentiment of a review is positive or negative), language modeling, and translation. The course teaches a blend of traditional NLP topics (including regex, SVD, naive Bayes, tokenization) and recent neural network approaches (including RNNs, seq2seq, attention, and the transformer architecture), as well as addressing urgent ethical issues, such as bias and disinformation. Topics can be watched in any order.

All videos for the course are on YouTube and all code is on GitHub

All the code is in Python in Jupyter Notebooks, using PyTorch and the fastai library. You can find all code for the notebooks available on GitHub and all the videos of the lectures are in this playlist.

This course was originally taught in the University of San Francisco MS in Data Science program during May-June 2019. The USF MSDS has been around for 7 years (over 330 students have graduated and gone on to jobs as data scientists during this time!) and is now housed at the Data Institute in downtown SF. In previous years, Jeremy taught the machine learning course and I’ve taught a computational linear algebra elective as part of the program.

Highlights

Some highlights of the course that I’m particularly excited about:

Risks raised by new language models such as GPT-2

Most of the topics can stand alone, so there is no need to go through the course in order if you are only interested in particular topics (although I hope everyone will watch the videos on bias and disinformation, as these are important topics for everyone interested in machine learning). Note that videos vary in length from 20 to 90 minutes.

Course Topics

Overview

There have been many major advances in NLP in the last year, and new state-of-the-art results are being achieved every month. NLP is still very much a field in flux, with best practices changing and new standards not yet settled on. This makes for an exciting time to learn NLP. This course covers a blend of more traditional techniques, newer neural net approaches, and urgent issues of bias and disinformation.

Traditional NLP Methods

For the first third of the course, we cover topic modeling with SVD, sentiment classification via naive Bayes and logistic regression, and regex. Along the way, we learn crucial processing techniques such as tokenization and numericalization.
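To make the classical approach concrete, here is a minimal naive Bayes sentiment classifier with add-one (Laplace) smoothing. The toy reviews and the 50/50 class prior are invented for illustration; the course notebooks work with real review datasets.

```python
import math
from collections import Counter

# Toy labelled reviews (invented for illustration): 1 = positive, 0 = negative
train = [("loved the film", 1), ("great acting", 1),
         ("boring plot", 0), ("terrible film", 0)]

# Count word occurrences per class
counts = {0: Counter(), 1: Counter()}
for text, label in train:
    counts[label].update(text.split())
vocab = set(w for text, _ in train for w in text.split())

def log_prob(text, label):
    # log P(class) + sum of log P(word | class), with add-one smoothing
    total = sum(counts[label].values()) + len(vocab)
    lp = math.log(0.5)  # balanced classes in this toy set
    for w in text.split():
        lp += math.log((counts[label][w] + 1) / total)
    return lp

def predict(text):
    return 1 if log_prob(text, 1) > log_prob(text, 0) else 0
```

With this toy model, `predict("great film")` comes out positive and `predict("boring acting")` negative, purely from the per-class word counts.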

Deep Learning: Transfer learning for NLP

Jeremy shares Jupyter notebooks stepping through ULMFit, his groundbreaking work with Sebastian Ruder last year to successfully apply transfer learning to NLP. The technique involves training a language model on a large corpus, fine-tuning it for a different and smaller corpus, and then adding a classifier to the end. This work has been built upon by more recent papers such as BERT, GPT-2, and XLNet. In new material (accompanying updates to the fastai library), Jeremy shares tips and tricks to work with languages other than English, and walks through examples implementing ULMFit for Vietnamese and Turkish.

Jeremy shares ULMFit implementations in Vietnamese and Turkish

Deep Learning: Seq2Seq translation and the Transformer

We will dig into some underlying details of how simple RNNs work, and then consider a seq2seq model for translation. We build up our translation model, adding approaches such as teacher forcing, attention, and GRUs to improve performance. We are then ready to move on to the Transformer, exploring an implementation.

The Transformer for language translation

Ethical Issues in NLP

NLP raises important ethical issues, such as how stereotypes can be encoded in word embeddings and how the words of marginalized groups are often more likely to be classified as toxic. It was a special treat to have Stanford PhD student Nikhil Garg share his work which had been published in PNAS on this topic. We also learn about a framework for better understanding the causes of different types of bias, the importance of questioning what work we should avoid doing altogether, and steps towards addressing bias, such as Data Statements for NLP.

Nikhil Garg gave a guest lecture on his work showing how word embeddings quantify stereotypes over the last 100 years

Bias is not the only ethical issue in NLP. More sophisticated language models can create compelling fake prose that may drown out real humans or manipulate public opinion. We cover dynamics of disinformation, risks of compelling computer generated text, OpenAI’s controversial decision of staged release for GPT-2, and some proposed steps towards solutions, such as systems for verification or digital signatures.

On why algorithmic bias matters, different types, and steps towards addressing it

We hope you will check out the course! All the code for the Jupyter notebooks used in the class can be found on GitHub and a playlist of all the videos is available on YouTube.

Prerequisites

(Updated to add) Familiarity with working with data in Python, as well as with machine learning concepts (such as training and test sets) is a necessary prerequisite. Some experience with PyTorch and neural networks is helpful.

As always, at fast.ai we recommend learning on an as-needed basis (too many students feel like they need to spend months or even years on background material before they can get to what really interests them, and too often, much of that background material ends up not even being necessary). If you are interested in this course, but unsure whether you have the right background, go ahead and try the course! If you find necessary concepts that you are unfamiliar with, you can always pause and study up on them.

Also, please be sure to check out the fast.ai forums as a place to ask questions and share resources.

Deep Learning from the Foundations

Today we are releasing a new course (taught by me), Deep Learning from the Foundations, which shows how to build a state of the art deep learning model from scratch. It takes you all the way from the foundations of implementing matrix multiplication and back-propagation, through to high performance mixed-precision training, to the latest neural network architectures and learning techniques, and everything in between. It covers many of the most important academic papers that form the foundations of modern deep learning, using “code-first” teaching, where each method is implemented from scratch in Python and explained in detail (in the process, we’ll discuss many important software engineering techniques too). The whole course, covering around 15 hours of teaching and dozens of interactive notebooks, is entirely free (and ad-free), provided as a service to the community. The first five lessons use Python, PyTorch, and the fastai library; the last two lessons use Swift for TensorFlow, and are co-taught with Chris Lattner, the original creator of Swift, clang, and LLVM.

This course is the second part of fast.ai’s 2019 deep learning series; part 1, Practical Deep Learning for Coders, was released in January, and is a required pre-requisite. It is the latest in our ongoing commitment to providing free, practical, cutting-edge education for deep learning practitioners and educators—a commitment that has been appreciated by hundreds of thousands of students, led to The Economist saying “Demystifying the subject, to make it accessible to anyone who wants to learn how to build AI software, is the aim of Jeremy Howard… It is working”, and to CogX awarding fast.ai the Outstanding Contribution in AI award.

The purpose of Deep Learning from the Foundations is, in some ways, the opposite of part 1. This time, we’re not learning practical things that we will use right away, but are learning foundations that we can build on. This is particularly important nowadays because this field is moving so fast. In this new course, we will learn to implement a lot of things that are inside the fastai and PyTorch libraries. In fact, we’ll be reimplementing a significant subset of the fastai library! Along the way, we will practice implementing papers, which is an important skill to master when making state of the art models.

Chris Lattner at TensorFlow Dev Summit

A huge amount of work went into the last two lessons—not only did the team need to create new teaching materials covering both TensorFlow and Swift, but also create a new fastai Swift library from scratch, and add a lot of new functionality (and squash a few bugs!) in Swift for TensorFlow. It was a very close collaboration between Google Brain’s Swift for TensorFlow group and fast.ai, and wouldn’t have been possible without the passion, commitment, and expertise of the whole team, from both Google and fast.ai. This collaboration is ongoing, and today Google is releasing a new version of Swift for TensorFlow (0.4) to go with the new course. For more information about the Swift for TensorFlow release and lessons, have a look at this post on the TensorFlow blog.

In the remainder of this post I’ll provide a quick summary of some of the topics you can expect to cover in this course—if this sounds interesting, then get started now! And if you have any questions along the way (or just want to chat with other students) there’s a very active forum for the course, with thousands of posts already.

Lesson 8: Matrix multiplication; forward and backward passes

Our main goal is to build up to a complete system that can train Imagenet to a world-class result, both in terms of accuracy and speed. So we’ll need to cover a lot of territory.

Our roadmap for training a CNN

Step 1 is matrix multiplication! We’ll gradually refactor and accelerate our first, pure Python, matrix multiplication, and in the process will learn about broadcasting and Einstein summation. We’ll then use this to create a basic neural net forward pass, including a first look at how neural networks are initialized (a topic we’ll be going into in great depth in the coming lessons).

Broadcasting and einsum let us accelerate matmul dramatically
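To give a flavor of the refactoring path, here is a sketch in NumPy (the course notebooks use PyTorch, but broadcasting and einsum work the same way): a pure-Python triple loop, a broadcasting version that removes the two inner loops, and einsum, all computing the same result.

```python
import numpy as np

def matmul_loops(a, b):
    # Pure-Python triple loop: one scalar multiply-add at a time (slow)
    n, k = a.shape
    _, m = b.shape
    out = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for x in range(k):
                out[i, j] += a[i, x] * b[x, j]
    return out

def matmul_broadcast(a, b):
    # Broadcasting removes the two inner loops:
    # a[i][:, None] has shape (k, 1), so a[i][:, None] * b has shape (k, m)
    out = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        out[i] = (a[i][:, None] * b).sum(axis=0)
    return out

a, b = np.random.randn(5, 4), np.random.randn(4, 3)
# einsum expresses the whole operation in one index expression
assert np.allclose(matmul_loops(a, b), np.einsum('ik,kj->ij', a, b))
```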

Then we will implement the backwards pass, including a brief refresher of the chain rule (which is really all the backwards pass is). We’ll then refactor the backwards pass to make it more flexible and concise, and finally we’ll see how this translates to how PyTorch actually works.

Back propagation from scratch
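A minimal version of that forward/backward structure, for a single linear layer with MSE loss, might look like this sketch (my own simplification, not the course notebooks); the gradient is verified against a finite-difference approximation.

```python
import numpy as np

# Tiny linear model out = x @ w + b with MSE loss
x = np.random.randn(10, 3)
w = np.random.randn(3, 1)
b = np.zeros(1)
targ = np.random.randn(10, 1)

# Forward pass, keeping intermediates for the backward pass
out = x @ w + b
diff = out - targ
loss = (diff ** 2).mean()

# Backward pass: chain rule, starting from dloss/dout
dout = 2.0 * diff / diff.size   # derivative of the mean of squared errors
dw = x.T @ dout                 # dloss/dw
db = dout.sum(axis=0)           # dloss/db

# Check dw[0, 0] against a finite-difference approximation
eps = 1e-6
w2 = w.copy(); w2[0, 0] += eps
loss2 = (((x @ w2 + b) - targ) ** 2).mean()
assert abs((loss2 - loss) / eps - dw[0, 0]) < 1e-4
```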

Papers discussed

Lesson 9: Loss functions, optimizers, and the training loop

In the last lesson we had an outstanding question about PyTorch’s CNN default initialization. In order to answer it, I did a bit of research, and we start lesson 9 seeing how I went about that research, and what I learned. Students often ask “how do I do research”, so this is a nice little case study.

Then we do a deep dive into the training loop, and show how to make it concise and flexible. First we look briefly at loss functions and optimizers, including implementing softmax and cross-entropy loss (and the logsumexp trick). Then we create a simple training loop, and refactor it step by step to make it more concise and more flexible. In the process we’ll learn about nn.Parameter and nn.Module, and see how they work with nn.optim classes. We’ll also see how Dataset and DataLoader really work.
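The logsumexp trick mentioned above avoids overflow when exponentiating large logits: factor out the maximum before exponentiating, then add it back. A sketch (the example logits are chosen so that a naive `exp` would overflow):

```python
import numpy as np

def logsumexp(x):
    # Stable log(sum(exp(x))): subtract the max so exp never overflows
    m = x.max(axis=-1, keepdims=True)
    return m.squeeze(-1) + np.log(np.exp(x - m).sum(axis=-1))

def log_softmax(x):
    return x - logsumexp(x)[..., None]

def nll(log_probs, target):
    # Negative log likelihood of the target class, averaged over the batch
    return -log_probs[np.arange(len(target)), target].mean()

logits = np.array([[1000.0, 1001.0], [0.0, 1.0]])  # naive exp(1000) overflows
loss = nll(log_softmax(logits), np.array([1, 0]))
```

Note that cross-entropy here is just `log_softmax` followed by `nll`, which is exactly how PyTorch structures it as well.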

Once we have those basic pieces in place, we’ll look closely at some key building blocks of fastai: Callback, DataBunch, and Learner. We’ll see how they help, and how they’re implemented. Then we’ll start writing lots of callbacks to implement lots of new functionality and best practices!

Callbacks in the training loop
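The core of the callback-driven training loop can be sketched in a few lines of plain Python. Names like `on_batch_end` and `Recorder` mirror the spirit of fastai's callbacks, but this is a simplification of my own, not the course's implementation:

```python
class Callback:
    # No-op defaults; subclasses override only the events they care about
    def on_train_begin(self): pass
    def on_batch_end(self, loss): pass
    def on_train_end(self): pass

class Recorder(Callback):
    # Records every batch loss so we can inspect training afterwards
    def on_train_begin(self): self.losses = []
    def on_batch_end(self, loss): self.losses.append(loss)

def fit(batches, compute_loss, callbacks):
    # The loop itself stays tiny; all extra behavior lives in callbacks
    for cb in callbacks: cb.on_train_begin()
    for batch in batches:
        loss = compute_loss(batch)
        for cb in callbacks: cb.on_batch_end(loss)
    for cb in callbacks: cb.on_train_end()

rec = Recorder()
fit(batches=[1, 2, 3], compute_loss=lambda b: b * 0.5, callbacks=[rec])
```

Adding new functionality (metrics, schedulers, early stopping) then means writing a new `Callback` subclass, never touching the loop.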

Papers discussed

Lesson 10: Looking inside the model

In lesson 10 we start with a deeper dive into the underlying idea of callbacks and event handlers. We look at many different ways to implement callbacks in Python, and discuss their pros and cons. Then we do a quick review of some other important foundations:

  • __dunder__ special symbols in Python
  • How to navigate source code using your editor
  • Variance, standard deviation, covariance, and correlation
  • Softmax
  • Exceptions as control flow
Python's special methods let us create objects that behave like builtin ones
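As a quick illustration of the special-methods idea, here is a toy class (my own example, not from the lesson notebooks) that behaves like a builtin sequence just by defining a few dunder methods:

```python
class SloppyList:
    # __len__ and __getitem__ make this work with len(), indexing,
    # and for-loops; __add__ makes + work; __repr__ gives nice printing
    def __init__(self, items): self.items = list(items)
    def __len__(self): return len(self.items)
    def __getitem__(self, i): return self.items[i]
    def __add__(self, other): return SloppyList(self.items + list(other))
    def __repr__(self): return f"SloppyList({self.items})"

sl = SloppyList([1, 2]) + [3]   # __add__ then __repr__, __len__, __getitem__
```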

Next up, we use the callback system we’ve created to set up CNN training on the GPU. This is where we start to see how flexible this system is—we’ll be creating many callbacks during this course.

Some of the callbacks we'll create in this course

Then we move on to the main topic of this lesson: looking inside the model to see how it behaves during training. To do so, we first need to learn about hooks in PyTorch, which allow us to add callbacks to the forward and backward passes. We will use hooks to track the changing distribution of our activations in each layer during training. By plotting these distributions, we can try to identify problems with our training.

An example temporal activation histogram
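A minimal sketch of using a PyTorch forward hook to record per-layer activation statistics (assuming PyTorch is installed; the lesson's version tracks these across all of training and plots them):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 2))
stats = []

def hook(module, inputs, output):
    # Runs after every forward pass of the hooked module
    stats.append((output.mean().item(), output.std().item()))

# Attach the hook to each linear layer
handles = [m.register_forward_hook(hook) for m in model if isinstance(m, nn.Linear)]
model(torch.randn(32, 10))  # one forward pass fills `stats`
for h in handles:
    h.remove()  # always remove hooks when done, or they accumulate
```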

In order to fix the problems we see, we try changing our activation function, and introducing batchnorm. We study the pros and cons of batchnorm, and note some areas where it performs poorly. Finally, we develop a new kind of normalization layer to overcome these problems, compare it to previously published approaches, and see some very encouraging results.

Papers discussed

Lesson 11: Data Block API, and generic optimizer

We start lesson 11 with a brief look at a smart and simple initialization technique called Layer-wise Sequential Unit Variance (LSUV). We implement it from scratch, and then use the methods introduced in the previous lesson to investigate the impact of this technique on our model training. It looks pretty good!
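The variance half of LSUV is simple to sketch: run a batch through a layer and rescale its weights until the output standard deviation is close to 1. This NumPy sketch shows just that step for a single linear layer (the full technique, applied layer by layer, also adjusts biases to zero the mean):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 100))
w = rng.normal(size=(100, 100)) * 5.0  # deliberately badly scaled init

# LSUV-style loop: rescale w until this layer's outputs have unit std
for _ in range(10):
    std = (x @ w).std()
    if abs(std - 1.0) < 1e-3:
        break
    w /= std
```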

Then we look at one of the jewels of fastai: the Data Block API. We already saw how to use this API in part 1 of the course; but now we learn how to create it from scratch, and in the process we also will learn a lot about how to better use it and customize it. We’ll look closely at each step:

  • Get files: we’ll learn how os.scandir provides a highly optimized way to access the filesystem, and os.walk provides a powerful recursive tree walking abstraction on top of that
  • Transformations: we create a simple but powerful list and function composition to transform data on-the-fly
  • Split and label: we create flexible functions for each
  • DataBunch: we’ll see that DataBunch is a very simple container for our DataLoaders

Next up, we build a new StatefulOptimizer class, and show that nearly all optimizers used in modern deep learning training are just special cases of this one class. We use it to add weight decay, momentum, Adam, and LAMB optimizers, and take a detailed look at how momentum changes training.

The impact of varying momentum on a synthetic training example
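The "optimizers as special cases of one stateful class" idea can be sketched like this. The name `StatefulOptimizer` follows the lesson's description, but the implementation details here are my own simplification:

```python
import numpy as np

class StatefulOptimizer:
    # Keeps per-parameter state (e.g. momentum buffers) between steps,
    # and delegates the actual update to a list of "stepper" functions
    def __init__(self, params, steppers, **hyper):
        self.params, self.steppers, self.hyper = params, steppers, hyper
        self.state = [{} for _ in params]

    def step(self, grads):
        for p, g, st in zip(self.params, grads, self.state):
            for stepper in self.steppers:
                g = stepper(p, g, st, **self.hyper)

def momentum_step(p, g, state, lr=0.1, mom=0.9):
    # SGD with momentum: blend the current grad into a running buffer
    buf = state.get('buf', np.zeros_like(g))
    state['buf'] = buf = mom * buf + g
    p -= lr * buf
    return g

w = np.array([1.0])
opt = StatefulOptimizer([w], [momentum_step], lr=0.1, mom=0.9)
opt.step([np.array([1.0])])  # buf = 1.0, so w: 1.0 -> 0.9
opt.step([np.array([1.0])])  # buf = 1.9, so w: 0.9 -> 0.71
```

Plain SGD, momentum, Adam, and LAMB then differ only in which stepper functions (and which pieces of state) they use.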

Finally, we look at data augmentation, and benchmark various data augmentation techniques. We develop a new GPU-based data augmentation approach which we find speeds things up quite dramatically, and allows us to then add more sophisticated warp-based transformations.

Using GPU batch-level data augmentation provides big speedups

Papers discussed

Lesson 12: Advanced training techniques; ULMFiT from scratch

We implement some really important training techniques in lesson 12, all using callbacks:

  • MixUp, a data augmentation technique that dramatically improves results, particularly when you have less data, or can train for a longer time
  • Label smoothing, which works particularly well with MixUp, and significantly improves results when you have noisy labels
  • Mixed precision training, which trains models around 3x faster in many situations.
An example of MixUp augmentation
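Both techniques fit in a few lines each; this NumPy sketch (my own simplification of the lesson's callback-based versions) shows the core math. MixUp blends two examples and their labels by the same random lambda, while label smoothing moves a small amount of probability mass off the true class:

```python
import numpy as np

rng = np.random.default_rng(42)

def mixup(x1, y1, x2, y2, alpha=0.4):
    # Blend two examples; the labels are blended by the same lambda
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def smooth_labels(y_onehot, eps=0.1):
    # Move eps of the probability mass from the true class to all classes
    n = y_onehot.shape[-1]
    return y_onehot * (1 - eps) + eps / n

y = smooth_labels(np.array([0.0, 1.0, 0.0]))  # still sums to 1
```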

We also implement xresnet, which is a tweaked version of the classic resnet architecture that provides substantial improvements. And, even more important, the development of it provides great insights into what makes an architecture work well.

Finally, we show how to implement ULMFiT from scratch, including building an LSTM RNN, and looking at the various steps necessary to process natural language data to allow it to be passed to a neural network.

ULMFiT

Papers discussed

Lesson 13: Basics of Swift for Deep Learning

By the end of lesson 12, we’ve completed building much of the fastai library for Python from scratch. Next we repeat the process for Swift! The final two lessons are co-taught by Jeremy along with Chris Lattner, the original developer of Swift, and the lead of the Swift for TensorFlow project at Google Brain.

Swift code and Python code don't look all that different

In this lesson, Chris explains what Swift is, and what it’s designed to do. He shares insights on its development history, and why he thinks it’s a great fit for deep learning and numeric programming more generally. He also provides some background on how Swift and TensorFlow fit together, both now and in the future. Next up, Chris shows a bit about using types to ensure your code has fewer errors, whilst letting Swift figure out most of your types for you. And he explains some of the key pieces of syntax we’ll need to get started.

Chris also explains what a compiler is, and how LLVM makes compiler development easier. Then he shows how we can actually access and change LLVM builtin types directly from Swift! Thanks to the compilation and language design, basic code runs very fast indeed - about 8000 times faster than Python in the simple example Chris showed in class.

Learning about the implementation of `float` in Swift

Finally, we look at different ways of calculating matrix products in Swift, including using Swift for TensorFlow’s Tensor class.

Swift resources

Lesson 14: C interop; Protocols; Putting it all together

Today’s lesson starts with a discussion of the ways that Swift programmers will be able to write high performance GPU code in plain Swift. Chris Lattner discusses kernel fusion, XLA, and MLIR, which are exciting technologies coming soon to Swift programmers.

Then Jeremy talks about something that’s available right now: amazingly great C interop. He shows how to use this to quickly and easily get high performance code by interfacing with existing C libraries, using Sox audio processing, and VIPS and OpenCV image processing as complete working examples.

Behind the scenes of Swift's C interop

Next up, we implement the Data Block API in Swift! Well… actually in some ways it’s even better than the original Python version. We take advantage of an enormously powerful Swift feature: protocols (aka type classes).

Data blocks API in Swift!

We now have enough Swift knowledge to implement a complete fully connected network forward pass in Swift—so that’s what we do! Then we start looking at the backward pass, and use Swift’s optional reference semantics to replicate the PyTorch approach. But then we learn how to do the same thing in a more “Swifty” way, using value semantics to do the backward pass in a really concise and flexible manner.

Finally, we put it all together, implementing our generic optimizer, Learner, callbacks, etc, to train Imagenette from scratch! The final notebooks in Swift show how to build and use much of the fastai.vision library in Swift, even though in these two lessons there wasn’t time to cover everything. So be sure to study the notebooks to see lots more Swift tricks…

Further information

More lessons

We’ll be releasing even more lessons in the coming months and adding them to an attached course we’ll be calling Applications of Deep Learning. They’ll be linked from the Part 2 course page, so keep an eye out there. The first in this series will be a lesson about audio processing and audio models. I can’t wait to share it with you all!

Sneak peek at the forthcoming Audio lesson