Data Exploration with Agentic AI: Exploring the Titanic Dataset using SmolAgents

Rifx.Online
Data Science, Machine Learning, Natural Language Processing
11 Jan, 2025

When I began my journey into machine learning a decade ago, like many of us, I started with the Titanic dataset. I vividly recall the thrill of performing my first exploratory data analysis (EDA), uncovering patterns and correlations. Fast forward to today, and the landscape of data analysis has evolved in ways I could never have imagined. In this era of agentic AI, we can delegate much of our EDA to intelligent agents. The question is no longer “Can we automate EDA?” but rather “How far can we push these capabilities?”

The short answer: quite far. With multi-agent frameworks built on large language models, it’s possible to perform detailed, dynamic EDA simply by asking questions. Imagine interacting with your dataset conversationally, requesting insights, clarifications, and visualizations as naturally as you would from a data science colleague. Let’s explore this capability.

Setting the Stage

What are SmolAgents? SmolAgents is a lightweight library from Hugging Face that lets developers build and run agents with just a few lines of code. Despite its simplicity, it is effective at streamlining complex workflows.

Here is a simple workflow to demonstrate the power of SmolAgents for EDA:

## Step 1: Import necessary libraries
import os  # needed for os.environ in Step 3
from dotenv import load_dotenv
from smolagents import CodeAgent, LiteLLMModel, tool, GradioUI
import pandas as pd

## Step 2: Load environment variables, including API keys, from a .env file
load_dotenv()  

## Step 3: Define the Language Model (LLM). Here, we use Google's Gemini model
model = LiteLLMModel(model_id="gemini/gemini-1.5-flash",  
                     api_key=os.environ["GOOGLE_API_KEY"])

This code begins by importing the necessary libraries, including smolagents for the agent functionality. Environment variables, such as API keys, are loaded from a .env file using load_dotenv. The language model is Google’s Gemini 1.5 Flash, instantiated via the LiteLLMModel class with the key read from the GOOGLE_API_KEY environment variable.
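If the key is missing from the .env file, the lookup in Step 3 fails with a bare KeyError. A minimal sketch of a friendlier check, assuming a hypothetical helper named require_env (it is not part of smolagents or python-dotenv):

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail with a clear message.

    `require_env` is a hypothetical helper for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add a line like {name}=<your key> to your .env file."
        )
    return value
```

It could then be called after load_dotenv(), for example as api_key=require_env("GOOGLE_API_KEY") when creating the model.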

## Step 4: Define tools

## Tool 1: A custom tool for loading the Titanic dataset
@tool
def get_titanic_data() -> dict:
    """Returns titanic dataset in a dictionary format.
    """    
    df = pd.read_csv('data/Titanic-Dataset.csv')    
    return df.to_dict()

## Tool 2: A custom tool for saving a dataset as a CSV file
@tool
def save_data(dataset: dict, file_name: str) -> None:
    """Takes the dataset in a dictionary format and saves it as a CSV file.

    Args:
        dataset: dataset in a dictionary format
        file_name: name of the saved file, without the .csv extension
    """
    df = pd.DataFrame(dataset)
    # The file is written to the data/ directory, which must already exist
    df.to_csv(f'data/{file_name}.csv', index=False)


## Step 5: Define the Agent
## Using SmolAgents, we configure the agent with tools, the chosen LLM, and authorized library imports
## Both tools are registered so the agent can load the data and save the cleaned result
agent = CodeAgent(tools=[get_titanic_data, save_data],
                  model=model,
                  additional_authorized_imports=['numpy', 'pandas', 'matplotlib.pyplot'])

Two custom tools are defined: get_titanic_data loads the Titanic dataset from a CSV file and returns it as a dictionary for further exploration, and save_data writes a dictionary-format dataset back to a CSV file. The CodeAgent, part of the SmolAgents framework, combines tools, the LLM, and authorized Python libraries to perform exploratory data analysis (EDA).
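The tools exchange data as a plain dictionary rather than a DataFrame. A small sketch of what that round trip looks like, using made-up toy rows instead of the real CSV:

```python
import pandas as pd

# Toy rows for illustration only; the real data comes from Titanic-Dataset.csv.
df = pd.DataFrame({"Pclass": [1, 3], "Survived": [1, 0]})

# to_dict() with its default orientation maps column -> {row index -> value}.
as_dict = df.to_dict()
print(as_dict)

# pd.DataFrame() accepts that nested dictionary and rebuilds the table,
# which is what save_data relies on before writing the CSV.
restored = pd.DataFrame(as_dict)
print(restored)
```

Inside the agent's generated code, the same pd.DataFrame(...) call is the natural first step after calling get_titanic_data.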

## Step 6: Launch a user-friendly chat interface with a single line of code
GradioUI(agent).launch()

Finally, the GradioUI class wraps the agent in a Gradio-based chat interface, which is launched with a single line of code.

Asking Questions

Here are some of the questions I posed to the agent.

The first set of questions I asked focused on understanding the Titanic dataset’s structure. These included explaining the columns based on their names, identifying missing values, and detecting outliers. The aim was to handle missing values, fix outliers, and save the cleaned data using the save_data tool.
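The agent writes its own code for these steps, so the exact approach varies from run to run. As a rough sketch of one common approach (median imputation and the 1.5 × IQR rule), assuming the Age and Fare column names and using made-up toy rows:

```python
import pandas as pd

# Toy rows for illustration only; Age and Fare are assumed column names.
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 512.33],
})

# Count missing values per column.
missing = df.isna().sum()

# Fill missing ages with the median age (one common choice, not the only one).
df["Age"] = df["Age"].fillna(df["Age"].median())

# Flag fares more than 1.5 * IQR beyond the quartiles as outliers.
q1, q3 = df["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["Fare"] < lower) | (df["Fare"] > upper)

# Cap outliers at the fences instead of dropping the rows.
df["Fare"] = df["Fare"].clip(lower=lower, upper=upper)
```

Capping keeps every passenger in the dataset; dropping the flagged rows would be an equally valid choice depending on the analysis.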

Next, I asked how specific features might influence survival rates. For example, I explored whether ticket class or age had any effect on survival and why these factors might play a role.
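A question like this usually boils down to a group-by. The sketch below shows the kind of code an agent might generate, assuming the Pclass and Survived column names; the rows are made up and do not reflect the real Titanic figures:

```python
import pandas as pd

# Toy rows for illustration only; these are not real Titanic results.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Survived is 0/1, so its mean per group is the survival rate for that class.
survival_by_class = df.groupby("Pclass")["Survived"].mean()
print(survival_by_class)
```

The same pattern works for age once it is binned, for example with pd.cut, before grouping.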

Finally, I shifted my focus to predictive modeling. I asked which new features could improve predictions, then had the agent build a predictive model and report its F1 score.
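For readers who want to sanity-check the score the agent reports, the F1 score is the harmonic mean of precision and recall. A minimal pure-Python sketch for binary labels, where f1_score_binary is a hypothetical helper name (in practice scikit-learn's f1_score does the same job):

```python
def f1_score_binary(y_true, y_pred):
    """Compute the F1 score for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(f1_score_binary(y_true, y_pred))
```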

Your Turn

If you’re intrigued by the potential of SmolAgents, why not try it yourself? Load your favorite dataset, start asking questions, and see what insights you uncover. The age of agentic AI is here, and it’s changing how we explore data.
