Doing Data Science for Social Good, Responsibly

The phrase “data science for social good” is a broad umbrella, ambiguously defined. As many others have pointed out, the term often fails to specify good for whom. Data science for social good can refer to: nonprofits increasing their impact through more effective data use, hollow corporate PR efforts from big tech, well-intentioned projects that inadvertently result in surveillance and privacy invasion of marginalized groups, efforts steeped in colonialism, or many other types of projects. Note that none of the categories in this list are mutually exclusive, and one project may fit several of these descriptors.

Picture from a presentation given in 2018 by Sara Hooker, founder of non-profit Delta Analytics and an AI researcher at Google, on Why “data for good” lacks precision.

I have been involved with data science for social good efforts for several years: chairing the Data for Good track at the USF Data Institute Conference in 2017; coordinating and mentoring graduate students in internships with the nonprofits Human Rights Data Analysis Group (for a project on entity resolution to obtain more accurate casualty counts in the conflicts in Syria and Sri Lanka) and the American Civil Liberties Union (one student analyzed COVID-19 vaccine equity in California and another analyzed disparities in school disciplinary action against Black and disabled students) during my time as director of the Center for Applied Data Ethics at USF; and now as a co-lead of the Data Science for Social Good program at Queensland University of Technology (QUT). At QUT, grad students and recent graduates partnered with the non-profits Cancer Council Queensland (well known for their Australian Cancer Atlas) and FareShare food rescue organisation, which operates Australia’s largest charity kitchens. While data for good projects can be incredibly useful, there are also pitfalls to be mindful of when approaching this work.

Some Questions & Answers

I recently spoke on a panel at the QUT Data Science for Social Good showcase event. I appreciated the thoughtful, nuanced questions from the moderators, Dr. Timothy Graham and Dr. Char-lee Moyle, who brought up some of the potential risks. I want to share their questions below, along with an expanded version of my answers.

What ethical and governance considerations do you think not-for-profits should consider when starting to adopt data science?

  1. Be specific about the goals of the project and how different stakeholders will be impacted: A series of interviews with African data experts revealed that power imbalances, failure to acknowledge extractive practices, failure to build trust, and Western-centric policies were all prevalent. Even in “data for good” projects, the people whose data is accessed and shared may not reap the benefits that those who control the project do. Stakeholders such as government bodies and non-profits have significantly more power and leverage than data subjects. Problems also arise when data gathered for one goal ends up being repurposed or shared for other uses. And while Western “notions of privacy often focus on the individual, there is growing awareness that collective identity is also important within many African communities, and that sharing aggregate information about communities can also be regarded as a privacy violation.”
  2. Center the problem to be solved, not a flashy solution. Sometimes machine learning practitioners have a solution searching for a problem. It is important to stay focused on the root problem and be open to “mundane” or even non-technical solutions. One data for good project used the records of 15 million mobile phone owners in Kenya to quantify the movements of workers who migrate for seasonal work to an area with malaria, and made recommendations to increase malaria surveillance in their hometowns when they return. As a journalist for Nature reported, “But it’s unclear whether the results were needed, or useful. Malaria-control officers haven’t incorporated the analyses into their efforts.” The excitement around “flashy” big data approaches contrasts with the lack of funding for proven measures like bed nets, insecticides, treatment drugs, and health workers.
  3. Take data privacy seriously. Be clear about how the data will be stored, who has access to it, and what will happen to it later. Ask what data is truly needed, and if there are less invasive ways to get this information. Note that the above example tracking Kenyan mobile phone owners raises risks around lack of consent, invasion of privacy, and risk of de-anonymization.
  4. Include the people most impacted, and recognize that their values may be different from those of both non-profits or academic stakeholders involved. A recent article from AI Now Institute recommended that “social good projects should be developed at a small scale for local contexts — they should be designed in consultation with the community or social environment impacted by the systems in order to identify core values and needs.” One example of differing values: Indigenous scholars highlighted that a set of open data principles developed primarily by Western scholars to improve data discovery and reuse created tension with Indigenous values. The FAIR principles, first developed at a workshop in the Netherlands in 2014 and elaborated on in this paper published in Nature, call for data to be findable, accessible, interoperable, and reusable. In response, Indigenous scholars convened to develop the CARE principles for Indigenous data governance, calling for collective benefit, authority to control, responsibility, and ethics, intended as a complement to the FAIR principles.
  5. Avoid solving the “wrong problem.” For instance, many European governments are currently using algorithmic approaches to justify austerity cuts, often accompanied by arguments about reducing fraud even when fraud is minimal. In India, a biometric identity system has left many poor and elderly people unable to access their food benefits, owing to faded fingerprints, being unable to travel to scanners, or intermittent internet connections.

Do you think that data science for social good can increase the surveillance and control of disadvantaged groups or certain segments of society?

Many well-meaning projects inadvertently lead to increased surveillance, despite good intentions. Cell-phone data from millions of phone owners in over two dozen low- and middle-income countries has been anonymized and analyzed in the wake of humanitarian disasters. This raises concerns about the lack of consent from the phone users and the risk of de-anonymization. Furthermore, it is often questionable whether the results are truly useful, and whether they could have been obtained through other, less invasive approaches. One such project analyzed the cell phone data of people in Sierra Leone during an Ebola outbreak. However, this approach didn’t address how Ebola spreads (only through direct contact with body fluids) or help with the most urgent issue (which was convincing symptomatic people to come to clinics to isolate).
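To make the de-anonymization risk concrete, here is a minimal sketch (with entirely invented data and hypothetical names) of the well-known finding by de Montjoye et al. that just a few spatio-temporal points uniquely identify most individuals in a mobility dataset. Stripping names from location traces does little if an attacker knows a couple of facts about their target:

```python
# Invented example: "anonymized" location traces, keyed only by pseudonym.
# Each point is (cell tower, time slot).
traces = {
    "user_a": {("tower_1", "mon_am"), ("tower_3", "mon_pm"), ("tower_1", "tue_am")},
    "user_b": {("tower_2", "mon_am"), ("tower_3", "mon_pm"), ("tower_2", "tue_am")},
    "user_c": {("tower_1", "mon_am"), ("tower_4", "mon_pm"), ("tower_1", "tue_am")},
}

def candidates(known_points, traces):
    """Return the pseudonymous users whose trace contains every known point."""
    return [uid for uid, trace in traces.items() if known_points <= trace]

# An attacker who knows only two points about a target (say, the cell towers
# near their home and workplace at known times) can often narrow the
# "anonymous" dataset down to a single person:
known = {("tower_1", "mon_am"), ("tower_3", "mon_pm")}
print(candidates(known, traces))  # → ['user_a']
```

The point is not this toy code but the scaling behavior it illustrates: in real datasets with millions of users, a handful of outside observations is typically enough to re-identify someone, which is why “the data was anonymized” is a weak privacy guarantee on its own.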

What do you think is the role of government and universities in supporting and incentivising the not-for-profit sector in adopting data science?

Academia and government have a big role to play. Non-profits often lack the in-house data science skills to take advantage of their data, while many data scientists are searching for meaningful, impactful real-world problems to work on. We also need governments to regulate areas such as data privacy to help protect those who may be impacted. It is important to recognize that privacy should not just be considered an individual right, but also a public good.

What are your thoughts around the development of ethical frameworks to guide data science – are they more than marketing tactics to increase trustworthiness and reputation of data science?

We need ethical frameworks AND regulation. Both are crucially important. Many people want to do the right thing, and having standardized processes to guide them can help. I recommend the Markkula Center Tech Ethics Toolkit, which includes practical processes you can implement in your organization to try to identify ethical risks BEFORE they cause harm. At the same time, we need legal protections anywhere that data science impacts human rights and civil rights. Meaningful consequences are needed for those who cause harm to others. Policy is also the appropriate tool to address negative externalities, such as when corporations offload their costs and harms onto society while reaping the profits for themselves. Otherwise, there will always be a race to the bottom.

What skills and training do you think the not-for-profit sector needs to embrace data science, and what are the best strategies for upskilling?

The people already working at an organization are best positioned to understand that organization’s problems and challenges, and where data science can help. Upskilling in-house talent is an underutilized strategy. Don’t feel that you need to hire someone new with a fancy pedigree if there are people at your organization who are interested and eager to learn. I would start by learning to code in Python. Pick a project from your not-for-profit to work on as you go, and let that project motivate you to learn what you need as you need it (rather than feeling like you must spend years studying before you can tackle the problems you care about). One of our core missions with fast.ai is to train people in different domains to use machine learning for themselves, as they best understand the problems in their domain and what is needed. There are many myths that you need a super-elite background to use techniques like deep learning, but it’s not magic. Anyone with a year of coding experience can learn to use state-of-the-art deep learning.

Further Reading/Watching

Here are some additional articles (and one video) that I recommend to learn more on this topic: