As leaders of a developing field, data scientists often have to grapple with a frustratingly slippery question: What is data science, precisely, and what is it for?
Alfred Spector is a visiting scholar in MIT’s Department of Electrical Engineering and Computer Science (EECS), an influential developer of distributed computing systems and applications, and a successful technology executive at companies such as IBM and Google. Along with three co-authors (Peter Norvig of Stanford University and Google, Chris Wiggins of Columbia University and The New York Times, and Jeannette M. Wing of Columbia University), Spector recently published “Data Science in Context: Foundations, Challenges, Opportunities” (Cambridge University Press), which provides a comprehensive and conversational overview of the broad field driving change in sectors ranging from health care to transportation to commerce to entertainment.
Here, Spector talks about the data-driven life, what makes a good data scientist, and how his book came to be during the height of the Covid-19 pandemic.
Q: One of the most common buzzwords Americans hear is “data-driven,” but many may not know what that term is supposed to mean. Can you unpack it for us?
A: Data-driven refers broadly to techniques or algorithms that are powered by data: they provide insight or reach conclusions, for example a recommendation or a prediction. Data-driven models power algorithms that are becoming increasingly entwined in the fabric of science, commerce, and life, often delivering great results. The list of their successes is really too long to begin listing. However, one concern is that the proliferation of data makes it easier for us, as students, scientists, or simply members of the public, to draw the wrong conclusions. As just one example, our own confirmation biases make us prone to believe that some data item or insight “proves” something we already believe to be true. Also, we often tend to see causal relationships where the data only shows correlation. It may seem paradoxical, but data science makes critical reading and analysis of data even more important.
Q: What, in your opinion, makes a good data scientist?
A: [In talking to students and colleagues] I optimistically emphasize the power of data science and the importance of acquiring the computational, statistical, and machine learning skills to apply it. But I also remind students that we are obligated to solve problems well. In our book, Chris [Wiggins] paraphrases danah boyd, who says that a successful application of data science is not one that simply meets some technical goal, but one that actually improves lives. More specifically, I urge practitioners to provide a real solution to problems, or to clearly identify what we are not solving so that people see the limitations of our work. We must be extremely clear so as not to generate harmful results or lead others to wrong conclusions. I also remind people that all of us, including scientists and engineers, are human and are subject to the same human foibles as everyone else, such as various biases.
Q: You talk about Covid-19 in your book. While some short-range models for mortality were highly accurate during the heart of the pandemic, you note the failure of long-range models to predict any of the four major geotemporal Covid waves of 2020 in the United States. Do you think that Covid was an exceptionally difficult situation to model?
A: Covid was particularly difficult to predict in the long term due to many factors: the virus was changing, human behavior was changing, political entities changed their minds. Furthermore, we did not have detailed mobility data (perhaps for good reasons) and we lacked sufficient scientific knowledge of the virus, particularly in the first year.
I think there are many other domains that are just as difficult. Our book reveals many reasons why data-driven models may not be applicable. It may be too difficult to obtain or retain the necessary data. Perhaps the past does not predict the future. If the data models are used in life and death situations, we may not be able to make them reliable enough; this is particularly true since we’ve seen all the motivations bad actors have for finding vulnerabilities. So as we continue to apply data science, we need to think about all the requirements that we have and the ability of the field to meet them. They often line up, but not always. And, as data science seeks to solve problems in increasingly important areas, such as human health, education, transportation safety, etc., there will be many challenges.
Q: Let’s talk about the power of good visualization. You mention the popular Baby Name Voyager website from the early 2000s as one that changed your mind about the importance of data visualization. Tell us how that happened.
A: That website, recently reborn as Name Grapher, had two features that I found brilliant. First, it had a really natural interface, where you type the initial characters of a name and it shows a frequency graph of all the names starting with those letters and their popularity over time. Second, it’s much better than a spreadsheet with 140 columns representing years and rows representing names, even though it contains no additional information. It also provided instant feedback with its visualization graph that dynamically changes as you type. To me this showed the power of a very simple transformation done right.
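As a rough illustration of the transformation Spector describes, the following Python sketch filters a table of names by the characters typed so far and recomputes the per-year frequencies that a graph would display. This is not the website's actual code: the names `TOY_COUNTS`, `names_with_prefix`, and `yearly_totals` are hypothetical, and the counts are made up purely for illustration.

```python
from collections import defaultdict

# Made-up counts for illustration only; these are NOT real name statistics.
# Layout: name -> {year: number of babies given that name}.
TOY_COUNTS = {
    "Alice": {1900: 50, 1950: 30, 2000: 40},
    "Alfred": {1900: 80, 1950: 20, 2000: 5},
    "Peter": {1900: 10, 1950: 60, 2000: 25},
}


def names_with_prefix(counts, prefix):
    """Return the per-year series of every name that starts with `prefix`."""
    p = prefix.lower()
    return {
        name: series
        for name, series in counts.items()
        if name.lower().startswith(p)
    }


def yearly_totals(matches):
    """Sum the matching names' counts for each year, sorted by year."""
    totals = defaultdict(int)
    for series in matches.values():
        for year, n in series.items():
            totals[year] += n
    return dict(sorted(totals.items()))


# Simulate typing "A", then "Al", then "Ali": each keystroke narrows the view.
for typed in ("A", "Al", "Ali"):
    matches = names_with_prefix(TOY_COUNTS, typed)
    print(typed, sorted(matches), yearly_totals(matches))
```

The data here is the same table a spreadsheet would hold; the only change is that each keystroke re-filters it and redraws the result, which is the kind of instant feedback the answer above credits.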
Q: When you and your co-authors started planning “Data Science in Context”, what did you hope to deliver?
A: We present today’s data science as a field that has already had enormous benefits, that holds even more opportunities for the future, but that requires equally enormous care in its use. By referring to the word “context” in the title, we explain that proper use of data science should consider the details of the application, the laws and regulations of the society in which the application is used, and even the time period in which it is deployed. And, most importantly to an MIT audience, the practice of data science must go beyond the data and model to careful consideration of an application’s goals, its security, privacy, abuse, and resilience risks, and even the understanding it conveys to humans. Within this expansive notion of context, we finally explain that data scientists must also carefully consider the ethical tradeoffs and social implications.
Q: How did you stay focused throughout the entire process?
A: As with open source projects, I played the role of both coordinating author and general librarian for all the material, but we all made significant contributions. Chris Wiggins is very knowledgeable about the Belmont principles and applied ethics; he was the main contributor to those sections. Peter Norvig, as co-author of a best-selling AI textbook, was particularly involved in the sections on model building and causation. Jeannette Wing worked closely with me on our Seven Element Analysis Rubric and recognized that a checklist for data science professionals would end up being one of our book’s most important contributions.
From a practical perspective, we wrote the book during Covid, using one large shared Google doc and weekly video conferences. Surprisingly, Chris, Jeannette, and I never met in person, and Peter and I only saw each other once: sitting outside on a wooden bench on the Stanford campus.
Q: That is an unusual way to write a book! Do you recommend it?
A: It would be nice to have more social interaction, but a shared document, at least with a coordinating author, worked quite well for something of this size. The benefit is that we always had a single, consistent textual base, similar to the way a programming team works together.
This is an edited and condensed version of a longer interview that originally appeared on the MIT EECS website.