FAQ – Putting science into data science

Who is developing this site?

This page explains who we are and why we built this site. In a nutshell, we observed the same need for scientific reasoning in two different data science contexts: business analytics in the corporate world (Jay Cordes) and geospatial data science in academia (Julia Koschinsky).

Where did the domain name “Putting Science Into Data Science” come from?

The name came from this talk on “Putting the ‘Science’ into Data Science” that Jay Cordes gave at Claremont McKenna College in November 2022. We discovered later that the same title had been used earlier in several industry contexts. Several business analysts published short blog posts under it: Ulas Burkay emphasized the need to understand explanatory mechanisms in predictive modeling in 2016, as did Stephen Jaye in 2019. Also in 2019, Jeroen van Zeeland argued for integrating the scientific method with data science. A 3-hour paid course of the same name (Jennifer Leo, 2019) covers research methods for business managers, including designing experiments. Finally, a post on O’Reilly Radar shows data science teams how to make their workflows and results reproducible (Daniel Whitenack). Like Jay, all of these analysts identified a similar shortcoming in the corporate world.

What were the pain points that motivated this effort?

These are some of the problems that made us look for solutions.

What do you mean by “scientific reasoning”?

This is what we mean and here are some of the references that informed our understanding.

What are you hoping to achieve with this site?

We built this site for three main reasons. First, to raise awareness that data science students are learning tool skills but lack the core skills needed to solve problems critically. Second, to provide a home for a solution to this problem: open teaching materials, which we plan to release throughout 2023, that educators can incorporate into their courses. Third, to bring together the work of others at the intersection of scientific reasoning and data science.

What are the broad learning objectives of your teaching materials?

We aim for 1) less mechanical use of methods and more engagement in the thrill of discovery; 2) fewer statistical errors and biases; 3) more scientific rigor; 4) more critical thinking while analyzing data; and 5) a better integration of descriptive statistics with explanations. More on this here.

How do you know if this approach and these materials are effective?

We have come to trust scientific reasoning as the logic for analyzing data for several reasons. In Julia’s case, three applications of scientific reasoning to spatial data analysis convinced her that this integration holds promise. In 2020, she ran a 10-week summer lab with six students (high school, undergraduate, and graduate) at the University of Chicago to experiment with different ways of analyzing spatial data with scientific reasoning. Two goals were met: engaging students in the thrill of discovery, and detecting statistical errors and biases. Focusing on more explanatory insights worked only to a certain extent, and we did not achieve more surprising results. The integration has also been tested in many Data4All workshops, where student evaluations show that students end up thinking about programming in a larger context of competing explanations, evidence, and repeated testing.

In Jay’s case, as a professional data analyst, he saw firsthand how much lower the quality of work from analysts on autopilot was than the work of analysts who thought like scientists. He co-authored two books with many examples from the corporate world of what can go wrong when the importance of scientific rigor is underestimated.

At the same time, we do not yet know whether the new teaching modules we are still developing will be effective. To find out, we are designing assignments with hidden pitfalls to see whether data science students with scientific reasoning training are better able to detect them than those without. Pending funding, the outcomes will also be externally evaluated.

How are your teaching modules different from what already exists?

There’s a lot of existing work on scientific reasoning and a lot of work on data science but not much that integrates both for the data analyst (as opposed to the scientist). This integration is the “special sauce” we are interested in here. Fortunately, some recent work has emerged (featured here, including Fowler and Bueno de Mesquita’s Thinking Clearly with Data and Lakens’ Improving Your Statistical Inferences).

We are adding several things to what already exists: a modular approach to support the broader integration into existing courses; a focus on statistical pitfalls especially relevant in an industry context and on geospatial data & methods; applications of experimental designs; a greater reliance on statistical graphics; and exercises implemented with different tools (e.g., spreadsheet, Python, or R). The modules will also contain quiz questions to test if the materials are effective at showing students how to avoid pitfalls and match vetted explanations.

Will the full courses be offered online, too?

The pitfalls course will eventually be openly accessible online but we will first test it in actual classrooms to make sure it works.

The spatial course and workshop will not be offered online or recorded, since they are structured around in-person interactions. However, at least some of the spatial modules will be recorded and shared online.

May I use your teaching materials?

All materials are GPL-licensed, and we intend them for educational, non-commercial use. If you work in industry and would like to learn or teach using our materials, that is no problem, but you cannot sell them.

What are your case examples about?

They involve classic examples from the history of science, like the 19th-century search for how cholera is transmitted, as well as current questions, like what drives higher rates of COVID in some neighborhoods than in others.

Other examples address more specific questions of general interest, such as why the best-scoring schools are the smallest ones or whether drugs taken for relaxation help students score higher on the SAT. The examples also cover earthquakes, homicides and temperature, sports, premature mortality, presidential elections, and more.
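To give a flavor of the small-schools question, here is a minimal Python sketch of one well-known statistical explanation: averages of small groups vary more than averages of large groups, so small schools crowd both ends of a ranking even when no school is truly better. This is an illustration under stated assumptions, not the course's own exercise. The school counts, school sizes, and score distribution are invented for the simulation.

```python
import random
from statistics import fmean

SMALL_SIZE = 20    # hypothetical number of students in a small school
LARGE_SIZE = 500   # hypothetical number of students in a large school

def simulate_schools(n_small=200, n_large=200, seed=0):
    """Return (size, mean_score) pairs. Every student is drawn from the
    SAME distribution, so no school is truly better than any other."""
    rng = random.Random(seed)
    schools = []
    for size, count in ((SMALL_SIZE, n_small), (LARGE_SIZE, n_large)):
        for _ in range(count):
            scores = [rng.gauss(500, 100) for _ in range(size)]
            schools.append((size, fmean(scores)))
    return schools

# Rank schools by their average score, best first.
ranked = sorted(simulate_schools(), key=lambda s: s[1], reverse=True)
top, bottom = ranked[:20], ranked[-20:]

share_small_top = sum(size == SMALL_SIZE for size, _ in top) / len(top)
share_small_bottom = sum(size == SMALL_SIZE for size, _ in bottom) / len(bottom)

print(f"Share of small schools among the top 20:    {share_small_top:.0%}")
print(f"Share of small schools among the bottom 20: {share_small_bottom:.0%}")
```

Looking only at the top of the ranking suggests that small schools are better. Checking the bottom of the ranking as well shows that small schools dominate there too, which points to sample size rather than school quality.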

What kind of data are in the course modules? Are there any big data?

The exercises are mostly based on structured data in CSV or GeoJSON formats, as opposed to unstructured data like text, images, or video. Much of the data is small (under 50,000 records).
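For readers unfamiliar with these two formats, the following Python sketch reads a tiny example of each using only the standard library. The column names, property names, coordinates, and values are invented placeholders, not data from the course modules.

```python
import csv
import io
import json

# Hypothetical CSV content: one header row, then one record per line.
csv_text = "neighborhood,cases\nA,12\nB,30\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_cases = sum(int(row["cases"]) for row in rows)

# Hypothetical GeoJSON content: a FeatureCollection in which each feature
# pairs a geometry with a dictionary of attribute values ("properties").
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-87.60, 41.79]},
         "properties": {"neighborhood": "A"}},
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [-87.65, 41.85]},
         "properties": {"neighborhood": "B"}},
    ],
})
features = json.loads(geojson_text)["features"]
names = [feature["properties"]["neighborhood"] for feature in features]

print(f"{len(rows)} CSV records, {total_cases} cases in total")
print(f"{len(features)} GeoJSON features: {names}")
```

In practice, the same files can also be opened in a spreadsheet (CSV) or with dedicated geospatial tools (GeoJSON).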

We are convinced that a scientific mindset applies to all data, big and small. But for a general introduction to scientific reasoning, the additional computational skillset required to manage big data is a distraction. In some cases, big data makes a scientific mindset even more relevant, because of the increased potential for problems like p-hacking, publication bias, and HARKing.
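To illustrate why more data can mean more opportunities for p-hacking, here is a minimal Python simulation, offered as a sketch under stated assumptions rather than as material from the modules. Every "study" compares two groups on 20 outcomes that are all pure noise. The number of studies, the number of outcomes, the group size, and the simple z-test (which assumes a known standard deviation of 1) are choices made for this illustration.

```python
import random
from statistics import NormalDist, fmean

def p_value_no_effect(rng, n_per_group):
    """Two-sided z-test p-value for two groups drawn from the same N(0, 1)
    distribution, so any 'significant' difference is a false positive."""
    a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    z = (fmean(a) - fmean(b)) / (2 / n_per_group) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(n_studies=300, outcomes_per_study=20, n_per_group=30,
             alpha=0.05, seed=1):
    rng = random.Random(seed)
    significant_tests = 0
    studies_with_hit = 0
    for _ in range(n_studies):
        p_values = [p_value_no_effect(rng, n_per_group)
                    for _ in range(outcomes_per_study)]
        hits = sum(p < alpha for p in p_values)
        significant_tests += hits
        studies_with_hit += int(hits > 0)
    per_test = significant_tests / (n_studies * outcomes_per_study)
    per_study = studies_with_hit / n_studies
    return per_test, per_study

per_test_rate, per_study_rate = simulate()
print(f"False positive rate per single test: {per_test_rate:.1%}")
print(f"Studies with at least one 'finding': {per_study_rate:.1%}")
```

Each individual test is a false positive about 5% of the time, as designed. An analyst who checks all 20 outcomes and reports whichever one is significant will find something in well over half of the studies (in theory, 1 − 0.95^20 ≈ 64%), even though no real effect exists.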

What about the site images?

We explain the why, what and how of the site images here.

Can I get involved?

Yes! We would like to connect with others who are interested in integrating data science education with scientific reasoning. If this is you, please reach out.

How can I contact you?

You can reach us by email: spatial@uchicago.edu (Julia Koschinsky) and jjcordes@ca.rr.com (Jay Cordes).