Performing Analytical Queries with LLMs

A Practical Approach to Using LLMs for Data Exploration and Analysis

Consider the following scenario. You have a CSV file containing 5 million rows and 20 columns. This CSV file includes transaction records of customers, such as sales date, unit price, quantity, customer name, address, and more. Based on this data, you want an LLM to help answer questions like:

  • What did customer A purchase on a particular day?
  • What was the total sales amount for a particular month?
  • Tabulate the sales amount for each month in a particular year.

Limitations of LLMs

If you try using an LLM to answer these questions, you’ll quickly realize that it cannot provide accurate or reliable answers. Why is this so?

  • Most LLMs have a limited context window size, which restricts the amount of data they can process in a single interaction. The size of the dataset (5 million rows) far exceeds this limit, making it infeasible to feed the entire CSV file into the LLM at once; a rough token estimate follows this list. Without access to the entire dataset, the LLM cannot effectively perform analytical queries.
  • Additionally, LLMs are trained on static textual data and lack real-time access to external structured datasets. They do not have the ability to natively interact with or process large-scale tabular data. Furthermore, LLMs rely on probabilistic reasoning and generate approximate responses rather than performing precise computations, which are essential for tasks like calculating sales totals or tabulating data. These limitations make LLMs unsuitable for handling analytical queries directly from large datasets.
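
To get a sense of the scale mismatch, here is a rough back-of-envelope estimate. The tokens-per-row figure and the 128k context window are illustrative assumptions, not measurements:

```python
# Rough estimate of how much of a 5-million-row CSV fits in one prompt.
# The tokens-per-row value is an assumption (a 20-column transaction row is
# typically a few dozen tokens once serialized); adjust it for your own data.
rows = 5_000_000
tokens_per_row = 30            # assumed average
context_window = 128_000       # e.g., a typical 128k-token context window

total_tokens = rows * tokens_per_row
print(f"Estimated tokens for the full CSV: {total_tokens:,}")                         # ~150,000,000
print(f"Fraction that fits in a single prompt: {context_window / total_tokens:.2%}")  # well under 1%
```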

Apart from the limited context window size, another key concern with sending your data to a large language model (LLM) is privacy. When using an LLM hosted by a third party, such as OpenAI, Claude, or DeepSeek, organizations are particularly concerned about the potential exposure of sensitive or confidential information. Legal constraints, such as compliance with regulations like GDPR, HIPAA, and other regional data protection laws, further complicate the situation. These regulations require strict controls over how personal and organizational data is stored, processed, and shared.

  • For example, data sent to an external LLM provider may be logged for quality improvement, training, or debugging purposes. Even if these logs are anonymized, there is still the risk that sensitive information could inadvertently be exposed or misused. Moreover, the lack of transparency about how third-party providers handle, store, and delete data can raise questions about accountability, particularly in high-stakes industries like healthcare, finance, or legal services.
  • Furthermore, some organizations worry about the potential for data leakage or reverse engineering. If proprietary information is input into an LLM, there is a risk that aspects of that data could be “memorized” by the model and unintentionally included in future responses, whether to the same user or someone else.
  • In addition, companies may fear that LLMs could be exploited for malicious purposes, such as extracting trade secrets or confidential business strategies through carefully crafted prompts.

Addressing the challenge

Given these challenges, how can you effectively address the difficulties of using LLMs for analytical queries? One approach is to use prompt engineering to guide the LLM in generating code that retrieves or processes the necessary data. By combining this with complementary tools like SQL, pandas, or Retrieval-Augmented Generation (RAG), you can ensure that the data is properly handled before being used by the LLM. This combination leverages the strengths of structured data processing and the LLM’s ability to interpret and generate natural language and code, ultimately enhancing the accuracy and relevance of analytical insights.

The solution

The key steps for using an LLM to perform analytical queries, with prompt engineering as a central component, are outlined below:

  • Step 1 — The user poses a question related to the dataset, such as, “How much did Customer A spend in January 2023?”
  • Step 2 — A prompt is sent to the LLM. Rather than embedding the entire dataset, the prompt includes a detailed description of the dataset’s schema. This could be the structure of database tables or just the header of a CSV file, depending on the format of the data. The schema should be as comprehensive as possible. Using clear, descriptive column names will improve the LLM’s ability to interpret the question and generate meaningful responses. Including examples of the data can also help the model understand the data types and relationships within the dataset.
  • Step 3 — The LLM analyzes the user’s query based on the schema and formulates a response. Instead of directly computing the result, the LLM will return executable code — such as SQL queries or Python code — that you can run against your dataset to retrieve the desired information.
  • Step 4 — You take the code provided by the LLM and run it against your dataset or database to compute the result.
  • Step 5 — Finally, the output of the code is sent back to the user, providing the requested answer.

The approach outlined above offers several key advantages:

  1. Data Privacy: No actual data is ever shared with the LLM. Only the schema of the dataset is passed along, ensuring that sensitive or proprietary information remains secure. This makes the approach particularly suitable for enterprises concerned about data leakage or privacy breaches.
  2. Optimized Prompt Size: Since there’s no need to embed large datasets in the prompt, the prompt size remains small and efficient. This helps minimize the computational load, reducing the cost associated with using commercial LLMs, especially those that charge based on token usage.
  3. Flexibility in LLM Choice: You can choose to work with a commercially available LLM provider or run your own local LLM. With advancements in LLM capabilities, it’s becoming increasingly feasible to deploy and maintain your own LLM infrastructure. This gives you more control over both cost and performance, particularly if you have specialized data needs or prefer to keep everything in-house.

These benefits make the approach not only privacy-conscious but also cost-effective and scalable, which is crucial for businesses with sensitive data or large-scale operations.
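
To make the five steps concrete before moving to the Python example, here is a minimal sketch of the schema-only workflow using SQL. The table name, column names, question, and local SQLite database are hypothetical, and in practice you would strip any Markdown fences from the model's response more defensively (as this article does for Python code later):

```python
import sqlite3
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 2: send only the schema, never the data (table and column names are hypothetical)
schema = "transactions(sale_date TEXT, customer_name TEXT, unit_price REAL, quantity INTEGER)"
question = "How much did Customer A spend in January 2023?"

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Here is the schema of my SQLite table: {schema}\n"
                   "Return a single SQL query as plain text, with no Markdown formatting "
                   "and no explanation.\n"
                   f"Question: {question}"
    }],
    temperature=0,
)

# Step 3: the model returns SQL rather than an answer computed from the data
sql = completion.choices[0].message.content.strip()

# Steps 4-5: run the generated SQL locally and return the result to the user
conn = sqlite3.connect("transactions.db")   # hypothetical local database
print(conn.execute(sql).fetchall())
```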

Our Example using Python

Now that you understand the constraints of LLMs and how to work around them to perform analytical queries on your own dataset, let’s build a sample application to see how this can be implemented.

For this example, we’ll be using the Titanic dataset, a well-known dataset commonly used in machine learning tutorials and exercises. The goal is to leverage an LLM to help us better understand this dataset. You will be able to ask questions in plain English, and the LLM will generate Python code that you can run to analyze the data, extract insights, and perform various types of analysis. Whether you’re interested in exploring survival rates, identifying patterns, or making predictions, the LLM will guide you by generating the necessary code to perform these tasks efficiently.

LLMs

For this example, we are going to use two different LLMs:

  1. OpenAI: We will use the gpt-4o-mini model from OpenAI. This is a paid model, so you’ll need to register with OpenAI and generate an API key to access it.
  2. LM Studio: LM Studio allows you to run an LLM locally on your machine. For this article, we’ll be using the meta-llama-3.1-8b-instruct model, which you can download and run directly on your computer using LM Studio. This provides the flexibility of running the model without relying on an external API, while still offering powerful capabilities.

By using these two models, you’ll get a comparison of how cloud-based and local LLMs perform in the context of analyzing the Titanic dataset. Whether you choose to use OpenAI or LM Studio, both options provide an effective way to query and understand the data through natural language processing.

If you are new to LM Studio, refer to my earlier article on how to expose a model through an API.
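
Once LM Studio’s local server is running, you can quickly confirm that it is reachable before wiring it into the notebook. A minimal check, assuming the default port of 1234 and LM Studio’s OpenAI-compatible /v1/models endpoint:

```python
import requests

# LM Studio exposes an OpenAI-compatible API; 1234 is its default port (adjust if you changed it)
resp = requests.get("http://localhost:1234/v1/models", timeout=5)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])   # prints the identifiers of the models currently loaded
```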

Creating the notebook

For this example, I will use a Jupyter Notebook.

Adding the OpenAI API key

First, save your OpenAI API key to an environment variable:

import os

os.environ['OPENAI_API_KEY'] = "<your_OpenAI_API_key>"   # replace with your actual key
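
Hardcoding the key in a notebook is risky if you ever share or commit the file. One alternative, sketched here with the standard library’s getpass, is to prompt for the key at runtime:

```python
import os
from getpass import getpass

# Prompt for the key instead of embedding it in the notebook
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
```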

Creating a dictionary of LLM providers

Install the OpenAI package:

!pip install openai

Since the LM Studio API also uses the OpenAI class, you can set up two separate instances of the OpenAI class: one for interacting with OpenAI’s API and one for using LM Studio locally. To manage these instances efficiently, you can create a dictionary that stores them, making it easier to switch between the two models as needed:

import os
from openai import OpenAI

## connects to OpenAI
client_openai = OpenAI(
    api_key = os.environ.get("OPENAI_API_KEY"),
)

## connects to LM Studio API
client_lmstudio = OpenAI(
    base_url = "http://localhost:1234/v1"
)

## create a dictionary with the model name as the key and the value as the 
## OpenAI instance
clients = {
    "gpt-4o-mini": client_openai,
    "meta-llama-3.1-8b-instruct": client_lmstudio
}
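
Before building the full prompt loop, a quick sanity check can confirm that both endpoints respond. This is a minimal sketch; it assumes the model is already loaded in LM Studio:

```python
# Send a trivial prompt to every configured client to verify connectivity
for model_name, client in clients.items():
    completion = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "Reply with the single word: ready"}],
        max_tokens=5,
        temperature=0,
    )
    print(model_name, "->", completion.choices[0].message.content)
```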

Creating the prompt and getting the response

The next step is the most crucial — sending the prompt to the LLMs and getting a response:

import re
import pandas as pd

## load the CSV file as a Pandas DataFrame
df = pd.read_csv('Titanic_train.csv')

while True:
    messages = []
    messages.append(
    {
        'role':'user',
        'content':'''
            Here is the schema of my data:
            PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked            
            Note that for Survived, 0 means dead, 1 means alive
            Return the answer in Python code only.
            For your info, I have already loaded the CSV file into a dataframe named df.
        '''
    })

    prompt = input('\nAsk a question: ') 
    if prompt == "quit":
        break
        
    messages.append(
    {
        'role':'user',
        'content':prompt
    })

    # send the question to all the LLMs
    for client_name, client in clients.items():
        completion = client.chat.completions.create(
            model =  client_name,
            messages = messages,
            max_tokens = 1024,
            temperature = 0)

        print(f"Code generated by {client_name}")    
        response = completion.choices[0].message.content    
        print(response)

I’ve already loaded the Titanic_train.csv file into a Pandas DataFrame. Now, I need the LLM to generate Python code that can be executed on this DataFrame.

When you run this code snippet, you will be asked to enter a question:

Let’s enter the question “Plot the distribution of passengers by embarkation port”. The two LLMs respond with the following:

Code generated by gpt-4o-mini
```python
import matplotlib.pyplot as plt
import seaborn as sns

## Set the style of seaborn
sns.set(style="whitegrid")

## Count the number of passengers by embarkation port
embarked_counts = df['Embarked'].value_counts()

## Create a bar plot
plt.figure(figsize=(8, 5))
sns.barplot(x=embarked_counts.index, y=embarked_counts.values, palette='viridis')

## Add titles and labels
plt.title('Distribution of Passengers by Embarkation Port')
plt.xlabel('Embarkation Port')
plt.ylabel('Number of Passengers')

## Show the plot
plt.show()

```

Code generated by meta-llama-3.1-8b-instruct

```python
import matplotlib.pyplot as plt

## Plot the distribution of passengers by embarkation port
plt.figure(figsize=(10,6))
df['Embarked'].value_counts().plot(kind='bar')
plt.title('Distribution of Passengers by Embarkation Port')
plt.xlabel('Port')
plt.ylabel('Count')
plt.show()
```

This code will create a bar chart showing the distribution of passengers by embarkation port. The value_counts() function is used to count the number of occurrences for each unique value in the ‘Embarked’ column, and then we plot these counts as bars.

The first block of code is from gpt-4o-mini, while the second block is from meta-llama-3.1-8b-instruct. If your machine has limited processing power, running meta-llama-3.1-8b-instruct locally may be slightly slower.


Extracting and executing the response

With the responses from the LLMs in hand, the next step is to execute them. To do this, you can use regular expressions to extract the Python code and then execute it using the `exec()` or `eval()` function. The following code shows how you can extract the returned Python code and execute it:


```python
import re
import pandas as pd

## load the CSV file as a Pandas DataFrame
df = pd.read_csv('Titanic_train.csv')

while True:
    messages = []
    messages.append(
    {
        'role':'user',
        'content':'''
            Here is the schema of my data:
            PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked            
            Note that for Survived, 0 means dead, 1 means alive
            Return the answer in Python code only.
            For your info, I have already loaded the CSV file into a dataframe named df.
        '''
    })

    prompt = input('\nAsk a question: ') 
    if prompt == "quit":
        break
        
    messages.append(
    {
        'role':'user',
        'content':prompt
    })

    # send the question to all the LLMs
    for client_name, client in clients.items():
        completion = client.chat.completions.create(
            model =  client_name,
            messages = messages,
            max_tokens = 1024,
            temperature = 0)

        print(f"Code generated by {client_name}")
        response = completion.choices[0].message.content    
        print(response)

        #===Extract the Python code from the response===
        pattern = re.compile(r'```python\s*([\s\S]*)\n```')
        match = pattern.search(response)
    
        #===Execute the Python code===
        if match:
            extracted_content = match.group(1)
            if extracted_content.count('\n') > 1:
                exec(extracted_content)           # multi-line response    
            else:              
                display(eval(extracted_content))  # single-line response       
        else:
            print("No content found within ```python...```.")

Executing the code returned by gpt-4o-mini resulted in the following chart:

Executing the code returned by meta-llama-3.1-8b-instruct produced the following chart:

It seems that the result from OpenAI’s model is more impressive!

Let’s try another question: “Plot the survival rate for each passenger class (Pclass)”. This is the result from gpt-4o-mini:

And this is from meta-llama-3.1-8b-instruct:

Let’s try another question: “Plot a pie chart showing the embarkation point”. This is from gpt-4o-mini:

And this is from meta-llama-3.1-8b-instruct:

You can try out some other questions, such as:

  • What is the proportion of male and female passengers in the dataset? (e.g., Pie chart or Count plot)
  • How does survival vary for passengers with small vs. large families? (e.g., Grouped Bar chart)
  • Can you visualize the survival rate for passengers traveling alone vs. with family?
  • Can you visualize the interaction of passenger titles (e.g., Mr., Mrs., Miss, etc.) and survival? (e.g., Count plot or Bar chart)

As you try more questions, you’ll notice that not all the generated code is bug-free. However, with careful review and testing, you can often identify and correct these issues to ensure the code works as intended.

One concern is that the code generated by the LLM could potentially be malicious. For instance, it may include snippets that send your data to an external source. One way to address this is to set up restricted execution environments where the LLM-generated code is allowed to access only the necessary resources. For example, limit the code’s access to sensitive files, networks, and databases to reduce potential damage.
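
As a rough illustration (and not a real sandbox; for stronger isolation you would run the code in a separate process or container), you could execute the generated code with a restricted global namespace so it cannot casually import modules or touch files and the network:

```python
import matplotlib.pyplot as plt

# Minimal sketch: run LLM-generated code with a restricted set of globals.
# Because __import__ is not exposed, import statements in the generated code will
# fail, so you would also instruct the LLM to use only the pre-imported names below.
SAFE_GLOBALS = {
    "__builtins__": {"print": print, "len": len, "range": range, "min": min, "max": max},
    "df": df,      # the DataFrame the generated code is allowed to analyze
    "plt": plt,    # plotting is allowed; open(), os, requests, etc. are not available
}

def run_generated_code(code: str) -> None:
    exec(code, SAFE_GLOBALS)
```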

Summary

This article explores the limitations of large language models (LLMs) in performing analytical queries, highlighting privacy concerns and the reluctance of organizations to share sensitive data with third-party LLM providers. It emphasizes how prompt engineering can be leveraged to guide LLMs in generating effective and actionable code, offering a practical solution to address these challenges while maintaining data security.
