This post is a summary of our paper, "Conversational Challenges in AI-Powered Data Science: Obstacles, Needs, and Design Opportunities". See the preprint for more details. This work was led by Bhavya Chopra at Microsoft. Special thanks to our collaborators: Ananya Singha, Anna Fariha, Sumit Gulwani, Chris Parnin, and Ashish Tiwari.
Programmers have found great value in ChatGPT, but can it do data science?
The challenges of using ChatGPT (e.g., providing context, false assumptions, and hallucinations) are amplified for data science tasks. Data scientists work with a variety of resources, like datasets, code, notebooks, visualizations, documentation, and pipelines. The data is often very large and may have quality issues. The tasks and data may require domain expertise that is tedious to fully articulate to a chat assistant. Moreover, these challenges cut both ways, as data scientists must also understand the context, data, code, and assumptions in ChatGPT's responses.
To understand how data scientists use ChatGPT and the challenges they face, we conducted two studies. In the first study, we observed 14 professional data scientists performing a series of common tasks while using ChatGPT. We avoided the use of integrated AI tools to give them complete control over prompt writing while also allowing us to study the fundamentals of interacting with AI for data science. The tasks involved type casting, splitting columns, feature selection, and plotting on the publicly available New York EMS emergency calls dataset.
In the second study, we surveyed 114 professional data scientists to validate and generalize the findings from the observational study. We filtered out any respondents who did not have experience using LLMs for data science tasks. The survey consisted of 8 agree/disagree statements and an open-ended question.
We observed the participants write 111 prompts to ChatGPT. They spent 64% of the task time preparing prompts, 27% adapting code, and the remaining time validating code. They often struggled with writing the initial prompt, especially for the feature selection task, and they iterated on and refined their prompts many times, especially for the plotting task. In short, every participant required multiple steps: they struggled first to provide the right context to ChatGPT and then to adapt its response to accomplish the task.
We observed the following challenges when participants communicated with ChatGPT:
Sharing context is difficult (10 of 14 participants). What information, and how much? The data scientists began prompt writing by figuring out which data and context they needed to provide; one even called getting started "daunting". Some tried providing raw data, while others took elaborate steps to fetch information they deemed relevant. For example, they wrote new snippets in their notebooks to get information specifically for ChatGPT, or they went to Excel to extract snippets of data. Some also spent considerable effort reformatting information after pasting it into ChatGPT.
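For illustration, here is a minimal sketch of the kind of context-gathering snippet we observed; the file name and dataframe are hypothetical stand-ins for a participant's notebook state.

```python
import pandas as pd

df = pd.read_csv("ems_calls.csv")  # hypothetical dataset

# Snippets written solely to produce context to paste into ChatGPT:
print(df.dtypes)                       # column names and types
print(df.head(3).to_csv(index=False))  # a few raw rows as text
print(df.isna().sum())                 # missing-value counts per column
```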
ChatGPT opaquely makes assumptions (14 of 14 participants). Participants were enthusiastic about ChatGPT's ability to infer knowledge about data just from column names. However, it still made many false assumptions, including about time formats, data types, and how to handle outliers. This led some participants to use code from ChatGPT that actually hid a data quality issue. One participant expressed the desire to iterate on small changes faster than the ChatGPT interface currently allows.
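As a concrete (hypothetical) illustration of how a plausible-looking suggestion can mask a data quality problem, consider coercing timestamps: unparseable values quietly become NaT instead of raising an error.

```python
import pandas as pd

df = pd.DataFrame({"INCIDENT_TIME": ["08:15:02", "8:15 AM", "unknown"]})

# A ChatGPT-style fix: coerce anything unparseable to NaT.
df["INCIDENT_TIME"] = pd.to_datetime(
    df["INCIDENT_TIME"], format="%H:%M:%S", errors="coerce"
)

# The code "works", but two malformed values vanished silently.
# Counting the coerced rows surfaces the hidden quality issue.
print(df["INCIDENT_TIME"].isna().sum(), "rows failed to parse")
```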
Misaligned expectations (13 of 14 participants). ChatGPT's responses are highly sensitive to the phrasing of the prompt: whether to include code or not, an explanation or not, one large code snippet or many small ones, Pandas or another library, and so on. Several participants observed that ChatGPT lacks their domain knowledge, which caused incorrect responses. They also complained of excessive explanations for straightforward portions, and we observed all participants skipping ahead to the code.
We observed several challenges faced by participants using the code generated by ChatGPT. Recall that ChatGPT does not have direct access to their notebook, data, or other resources.
Generation of repeated code (13 of 14 participants). ChatGPT often generated redundant code, such as a fresh call to pandas.read_csv(). Blindly copy-pasting such code would overwrite the participants' dataframes.
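A minimal sketch of the pattern (file and column names hypothetical): the generated snippet re-reads the CSV, clobbering the cleaning already done in the notebook.

```python
import pandas as pd

df = pd.read_csv("ems_calls.csv")    # hypothetical file
df = df.dropna(subset=["SEVERITY"])  # cleaning done earlier

# ChatGPT's answer often began by re-loading the data, e.g.:
df = pd.read_csv("ems_calls.csv")    # pasted verbatim...
# ...and the dropna() above is silently undone.
```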
Data and notebook preferences (11 of 14 participants). The participants followed specific patterns for organizing their code, which ChatGPT did not adhere to. For example, they broke the code from ChatGPT into small chunks and organized them as separate cells in their notebooks, sometimes in a different order than provided. A few participants also refactored the generated code to remove parameterization or to unwrap it from a function. Similarly, participants often modified ChatGPT's code to get the resulting data in the form they wanted (e.g., how to handle missing data and when to mutate a column versus creating a new one).
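For example, a sketch (with hypothetical column names) of the mutate-versus-new-column preference that several participants acted on:

```python
import pandas as pd

df = pd.DataFrame({"CALL_DURATION": ["0:05:30", "0:12:10"]})

# Generated code typically mutated the column in place:
# df["CALL_DURATION"] = pd.to_timedelta(df["CALL_DURATION"])

# Some participants rewrote it to create a new column instead,
# keeping the raw values around for inspection:
df["CALL_DURATION_TD"] = pd.to_timedelta(df["CALL_DURATION"])
print(df)
```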
Code validation (13 of 14 participants). Many participants stressed the importance of validating code from LLMs. One remarked that "affirmative language in responses, like 'Definitely! Here's the code you need', is extremely deceiving" since it conveys an unjustified level of confidence. Although participants did not spend much time verifying the code, they did employ several ways to check its correctness: they manually inspected the resulting dataframe, generated plots, checked for changes in descriptive statistics, and wrote scripts to provide further evidence.
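A sketch of those checks in code form, assuming a hypothetical transformation that should preserve the row count:

```python
import pandas as pd

def spot_check(df_before: pd.DataFrame, df_after: pd.DataFrame) -> None:
    """Lightweight checks like those participants performed by hand."""
    # Eyeball the transformed rows.
    print(df_after.head())

    # Look for unexpected shifts in descriptive statistics.
    print(df_before.describe())
    print(df_after.describe())

    # Assert an invariant, e.g. no rows were silently dropped.
    assert len(df_after) == len(df_before), "row count changed"
```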
We also observed strategies to overcome the challenges that data scientists faced:
Techniques for prompt construction (11 of 14 participants). We observed the use of one-shot prompting, few-shot prompting, chain-of-thought prompting, and asking ChatGPT to assume the role of an expert. Participants also wrote prompt templates to reduce context switching between ChatGPT and their notebook or spreadsheet.
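A hypothetical prompt template of this kind, sketched in Python; the placeholders and wording are illustrative, not the participants' exact prompts:

```python
import pandas as pd

df = pd.DataFrame({"INCIDENT_DATETIME": ["2023-01-05 08:15:02"]})

TEMPLATE = """Act as an expert data scientist.
My pandas DataFrame has these columns and dtypes:
{schema}

Sample rows:
{sample}

Task: {task}
Respond with pandas code only, no explanation."""

prompt = TEMPLATE.format(
    schema=df.dtypes.to_string(),
    sample=df.head(3).to_csv(index=False),
    task="Split INCIDENT_DATETIME into separate date and time columns.",
)
print(prompt)
```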
Scaffolding with domain expertise (3 of 14 participants). Participants were cautious about ChatGPT's "understanding" of their data, though we observed only a handful of cases where they used their own domain expertise to successfully guide ChatGPT. For example, one participant wanted to keep ChatGPT away from the time data, so they omitted columns whose names contained 'TIME' or 'ID' from the prompt. Another participant did something similar by first filtering the columns to those of type 'float'.
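Both strategies amount to filtering the schema before it reaches the prompt; a minimal sketch (column names hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "CAD_INCIDENT_ID": [1, 2],
    "INCIDENT_TIME": ["08:15:02", "09:40:11"],
    "RESPONSE_SECONDS": [312.0, 198.0],
})

# Strategy 1: drop columns whose names mark them as irrelevant.
keep = [c for c in df.columns if "TIME" not in c and "ID" not in c]

# Strategy 2: keep only float columns for a numeric task.
float_cols = df.select_dtypes(include="float").columns.tolist()

print("Columns to share with ChatGPT:", keep)  # ['RESPONSE_SECONDS']
print("Float columns:", float_cols)            # ['RESPONSE_SECONDS']
```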
Choosing an alternative resource (6 of 14 participants). Participants shared reasons why ChatGPT may not be appropriate for their work. Three asked about data sharing and privacy with ChatGPT. Others said that writing code to process new datasets is relatively rare, since they often rerun existing pipelines on new batches of data with the same structure. A few did remark that using LLMs saves them time over performing web searches.
The survey results can be seen below. Respondents were roughly split on whether data scientists can tell in advance if a task is suitable for ChatGPT. They strongly agreed that they need to include data in their prompts, that accomplishing a task takes multiple rounds of back and forth with ChatGPT, and that the generated code will require changes.
From our studies and the related work, we make three recommendations for the design of AI-powered data science tools:
Provide preemptive and fluid context when interacting with AI assistants. Participants spent considerable time writing their prompts and gathering context. In fact, 45% of the prompts included data that was manually entered or required additional scripts to obtain. Additional interfaces are needed to enable data scientists to efficiently select and manage context. For example, what if you could select regions of the screen to include or exclude as context?
Provide inquisitive feedback loops and validation-aware operations. For open-ended or complex tasks, it can take many interactions with ChatGPT to reach a satisfactory result. Instead of this back-and-forth conversation being instigated entirely by the user, the system could guide the user through the task and proactively ask clarifying questions.
Provide transparency about shared context and domain expertise solutions. It can be difficult to know what context is needed and what assumptions are being made. Mechanisms are needed for more efficient sharing of context in both directions between the data scientist and the AI. A possible interface could be a separate pane that maintains an updated list of assumptions, which the user can modify at any time.