Smolagents + Web Scraper + DeepSeek V3 Python = Powerful AI Research Agent
- Rifx.Online
- Programming, Natural Language Processing, Chatbots
- 19 Jan, 2025
In this video, I give a super quick tutorial showing you how to create a multi-agent chatbot with Smolagents, a web scraper, and DeepSeek V3, making a powerful agent chatbot for your business or personal use.
If you have been around the AI community, you may have noticed the viral clips of Nvidia announcing that AI agents are a billion-dollar opportunity, or heard Zuckerberg suggest that Meta may not need to hire mid-level engineers next year.
I asked myself, how did that come about? When I developed AI agent systems, it took me many days. Building intelligent agents has always been an extremely complex and technically demanding task, involving many complicated and tedious steps, such as API integration, environment configuration, and dependency management.
Imagine building an AI agent in the time it takes to finish a cup of coffee. Smolagents makes this possible. It is a new generation of agent framework created by the Hugging Face team, and it makes building AI agents much easier. It's like a gift for developers: simple but powerful.
Smolagents set out to find the perfect balance between complexity and simplicity, and its biggest feature is the CodeAgent.
What's most impressive is that it takes only three lines of code to build your first agent, unlike CrewAI and AutoGen, which are feature-rich but complex. It can even build agentic Retrieval-Augmented Generation (RAG) systems.
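For a taste of that simplicity, here is what a first agent can look like, a minimal sketch based on the smolagents guided tour (it assumes the built-in DuckDuckGoSearchTool, the default HfApiModel, and a valid Hugging Face token):

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# Tool + model + run: a complete web-searching agent in three lines
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=HfApiModel())
agent.run("How many seconds would it take a leopard at full speed to run through Pont des Arts?")
```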
At the beginning of the new year, DeepSeek launched its latest large language model, DeepSeek V3, a powerful Mixture-of-Experts (MoE) model with a total parameter count of 671B.
The biggest reason DeepSeek V3 is attracting attention is its low price. But don't think that cheap means bad: DeepSeek V3 is inexpensive yet achieves high-precision natural language processing.
It is comparable in performance to competing models such as GPT-4 and Claude, but its price is lower. It is a very attractive option, especially for companies and individuals with limited budgets.
Users can freely customize it, making it easy to create AI models specialized for specific industries and applications. This flexibility is what differentiates DeepSeek V3 from its competitors.
So, let me give you a quick demo of a live chatbot to show you what I mean.
I went to a real estate website to scrape the agents' names, phone numbers, and company names. I prompted the agent with thoughtful inputs and tools for better results, since DeepSeek V3 performs strongly at coding and works well with a chain-of-thought, tool-based approach.
If you look at how SmolAgents generates output, you’ll see I built a scraper to gather real estate agent information — including names, phone numbers, and company names. I made it flexible so you can specify the state, city, and the number of pages to scrape.
I used the requests library to fetch web pages and ensured the scraper stops if the response code isn't 200. I searched the HTML for agent names, phone numbers, and company names using specific classes, cleaned up the data, and added it to lists. If no agents were found, the script would return an error message.
After scraping, I organized the data into a dictionary and wrote another function to save it as a CSV file. I padded shorter ones with empty strings to ensure all lists were the same length. Then, I used Pandas to create a DataFrame and saved it with UTF-8 encoding. If the save was successful, I returned a message with the filename and entry count; otherwise, I returned an error.
By the end of this video, you will understand what Smolagents is, its features, why it is worth paying attention to, how it works, and how it can be used to create a super AI agent.
Disclaimer: This article is only for educational purposes. We do not encourage anyone to scrape websites, especially those web properties that may have terms and conditions against such actions.
Before we start! 🦸🏻♀️
If you like this topic and you want to support me:
- Clap my article 50 times; that will really help me out.👏
- Follow me on Medium and subscribe to get my latest article for Free🫶
- Join the family — Subscribe to YouTube channel
What is SmolAgents?
Smolagents is a lightweight Python library that enables developers to build efficient agents in a minimalist way. It solves the common pain points in agent development: cumbersome code, inefficient processes, and difficulties integrating various tools. It supports secure "agents that write their actions in code" and is integrated with the Hugging Face Hub. It also handles a lot of non-routine complexity for you, for example, keeping code formats consistent across system prompts, parsers, and the execution chain.
Why is it worth paying attention to?
SmolAgents aims for simplicity and LLM-agnostic design and focuses on supporting secure code-writing agents. It also integrates with Hugging Face Hub.
The agent system takes a step away from traditional workflows for narrow tasks and expands the possibilities to tackle more complex real-world problems. Hugging Face engineers say that agents provide LLMs with access to the outside world.
Smolagents uses code, rather than JSON, as the way to describe actions. This allows for more composability, better data management, and greater versatility. Building an agent also involves challenges such as parsing the agent's output and synthesizing prompts based on previous iterations, and these are among the key things Smolagents handles for you.
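To see why code beats JSON here, compare the two styles below. This is an illustration only; get_weather is a hypothetical tool, stubbed out so the snippet runs:

```python
# Hypothetical tool, stubbed so the example is runnable; in smolagents this
# would be a function decorated with @tool.
def get_weather(city: str) -> dict:
    return {"Paris": {"rain_mm": 2}, "Tokyo": {"rain_mm": 8}, "Cairo": {"rain_mm": 0}}[city]

# JSON-style tool calling forces one rigid, schema-bound call per step, e.g.
#   {"tool": "get_weather", "arguments": {"city": "Paris"}}
# A code action lets the model loop, branch, and combine results in one step:
cities = ["Paris", "Tokyo", "Cairo"]
reports = {city: get_weather(city) for city in cities}
driest = min(reports, key=lambda c: reports[c]["rain_mm"])
print(f"Driest city today: {driest}")  # -> Cairo
```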
Hugging Face also ran benchmarks against other frameworks and found that open models can compete with closed ones: code-based agents used roughly 30% fewer steps and API calls than traditional solutions while showing stronger performance on difficult benchmarks.
Smolagents is not alone in this space; other frameworks include OpenAI's Swarm and Microsoft's Magentic-One. Prioritizing safety, SmolAgents includes a built-in sandbox mode for secure code execution, while its integration with the Hugging Face Hub allows easy sharing with just a single line of code.
How SmolAgents Works
SmolAgents is designed with usability and efficiency in mind. Its intuitive API allows developers to easily build intelligent agents that can understand commands, connect to external data sources, and dynamically generate and execute code. Specific features include:
- Understanding Language: Using advanced Natural Language Processing (NLP) models, SmolAgents can understand commands and queries.
- Intelligent Search: Connect to external data sources to provide fast and accurate search results.
- Dynamic Code Execution: Agents can generate and execute code as needed to solve specific problems.
The modular design of SmolAgents makes it suitable for a variety of scenarios, whether it is rapid prototyping or full-scale production environment applications. By leveraging pre-trained models, developers can save a lot of time and effort and get strong performance without having to customize models from scratch.
Let’s start coding
Let us now explore, step by step, how to create agents with SmolAgents. First, we will install the libraries that support the model. For this, we do a pip install of the requirements:
```
pip install -r requirements.txt
```
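The requirements file itself isn't shown in the article; based on the imports used below, a plausible requirements.txt would contain (names assumed, versions omitted):

```
smolagents
litellm
requests
beautifulsoup4
pandas
gradio
```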
The next step is the usual one where we will import the relevant libraries, the significance of which will become evident as we proceed.
- CodeAgent: the default agent. It writes and executes Python code snippets at each step.
- LiteLLMModel: lets you call 100+ different models.
- tools: a list of Tools that the agent can use to solve the task.

Once you have these two arguments, tools and model, you can create an agent and run it. You can use any LLM you'd like, either through the Hugging Face API, Transformers, Ollama, or LiteLLM.
```python
from typing import Optional, Dict, Any
from smolagents import CodeAgent, tool, LiteLLMModel, GradioUI
import requests
import os
import time
from bs4 import BeautifulSoup
import pandas as pd
```
I started by creating a scraper function to gather real estate agent information from realtor.com, where we are going to fetch names, phone numbers, and company names.
You can specify the state (e.g., "CA" for California), the city (e.g., "los-angeles", with hyphens instead of spaces), and the number of pages you want to scrape.
The function initializes lists to store the data and uses headers that mimic a browser request, building each page URL dynamically. I created a loop that runs from page 1 to the requested number of pages: the first page uses the plain listing URL, and for later pages we append the page number to the link.
I use the requests library to download the webpage. If the HTTP response status code isn't 200 (OK), the scraper stops and returns an error. We then search the HTML for elements with the class agent-name; if found, we strip any extra spaces and add the name to the list. For phone numbers, we check all the likely spots, clean the number, and add it to the list, along with the company name. If no agents are found, the function returns an error message; otherwise, it organizes the data into a dictionary and returns it.
```python
@tool
def scrape_realtor(state: str, city_name: str, num_pages: Optional[int] = 2) -> Dict[str, Any]:
    """Scrapes realtor.com for agent information in specified city and state

    Args:
        state: State abbreviation (e.g., 'CA', 'NY')
        city_name: City name with hyphens instead of spaces (e.g., 'buffalo')
        num_pages: Number of pages to scrape (default: 2)
    """
    try:
        # Initialize results
        results = []         # Names
        phone_results = []   # Phone numbers
        office_results = []  # Office names
        pages_scraped = 0

        # Set up headers to mimic a browser request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Connection": "keep-alive"
        }

        # Process pages
        for page in range(1, num_pages + 1):
            # Construct URL (the first page has no page suffix)
            if page == 1:
                url = f'https://www.realtor.com/realestateagents/{city_name}_{state}/'
            else:
                url = f'https://www.realtor.com/realestateagents/{city_name}_{state}/pg-{page}'

            print(f"Scraping page {page}...")

            # Get page content
            r = requests.get(url, headers=headers)
            if r.status_code != 200:
                return {"error": f"Failed to access page {page}: Status code {r.status_code}"}

            soup = BeautifulSoup(r.text, features="html.parser")

            # Find all agent cards
            agent_cards = soup.find_all('div', class_='agent-list-card')

            for card in agent_cards:
                # Find name
                name_elem = card.find('div', class_='agent-name')
                if name_elem:
                    name = name_elem.text.strip()
                    if name and name not in results:
                        results.append(name)
                        print(f"Found agent: {name}")

                # Find phone (check all the likely spots)
                phone_elem = card.find('a', {'data-testid': 'agent-phone'}) or \
                             card.find(class_='btn-contact-me-call') or \
                             card.find('a', href=lambda x: x and x.startswith('tel:'))
                if phone_elem:
                    phone = phone_elem.get('href', '').replace('tel:', '').strip()
                    if phone:
                        phone_results.append(phone)
                        print(f"Found phone: {phone}")

                # Get office/company name
                office_elem = card.find('div', class_='agent-group') or \
                              card.find('div', class_='text-semibold')
                if office_elem:
                    office = office_elem.text.strip()
                    office_results.append(office)
                    print(f"Found office: {office}")
                else:
                    office_results.append("")

            pages_scraped += 1
            time.sleep(2)  # Rate limiting

        if not results:
            return {"error": "No agents found. The website structure might have changed or no results for this location."}

        # Return structured data
        return {
            "names": results,
            "phones": phone_results,
            "offices": office_results,
            "total_agents": len(results),
            "pages_scraped": pages_scraped,
            "city": city_name,
            "state": state
        }

    except Exception as e:
        return {"error": f"Scraping error: {str(e)}"}
```
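Before wiring the tool into an agent, you can sanity-check it on its own (functions wrapped with @tool should remain directly callable; the location and page count here are just an example):

```python
# Quick manual check of the scraper outside the agent
data = scrape_realtor(state="NY", city_name="buffalo", num_pages=1)
if "error" in data:
    print(data["error"])
else:
    print(f"Scraped {data['total_agents']} agents across {data['pages_scraped']} page(s)")
```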
Next, I create another function that saves the scraped real estate data into a CSV file. It takes a dictionary, data, containing the scraped information and an optional filename. If data contains an error, it returns the error message. Otherwise, it ensures all lists (names, phones, offices) are of equal length by padding shorter lists with empty strings. A DataFrame is then created from the lists and written to a CSV file using UTF-8 encoding. If the operation succeeds, it returns a success message with the filename and the number of entries saved; otherwise, it returns an error message.
```python
@tool
def save_to_csv(data: Dict[str, Any], filename: Optional[str] = None) -> str:
    """Saves scraped realtor data to CSV file

    Args:
        data: Dictionary containing scraping results
        filename: Optional filename (default: cityname.csv)
    """
    try:
        if "error" in data:
            return f"Error: {data['error']}"

        if not filename:
            filename = f"{data['city'].replace('-', '')}.csv"

        # Ensure all lists are of equal length
        max_length = max(len(data['names']), len(data['phones']), len(data['offices']))

        # Pad shorter lists with empty strings
        data['names'].extend([""] * (max_length - len(data['names'])))
        data['phones'].extend([""] * (max_length - len(data['phones'])))
        data['offices'].extend([""] * (max_length - len(data['offices'])))

        # Create DataFrame with just names, phones, and offices
        df = pd.DataFrame({
            'Names': data['names'],
            'Phone': data['phones'],
            'Office': data['offices']
        })

        df.to_csv(filename, index=False, encoding='utf-8')
        return f"Data saved to {filename}. Total entries: {len(df)}"

    except Exception as e:
        return f"Error saving CSV: {str(e)}"
```
To use LiteLLMModel, you need to set the environment variable ANTHROPIC_API_KEY or OPENAI_API_KEY, or pass an api_key argument upon initialization.
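For example, if you were to use a hosted model instead of the local Ollama model below, either route should work (a sketch; the model ID is only an example):

```python
import os

# Option 1: set the key in the environment; LiteLLM reads it automatically
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, load from a secret store in practice

# Option 2: pass the key explicitly when creating the model
model = LiteLLMModel(model_id="gpt-4o", api_key=os.environ["OPENAI_API_KEY"])
```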
We use CodeAgent to execute the Python code. This should be safe, because the only functions that can be called are the tools you provided and a set of predefined safe functions (including functions from the math module), so you are already limited in what can be executed. The Python interpreter also doesn't allow imports outside of a safe list by default, so the most obvious attacks shouldn't be an issue. You can authorize additional imports by passing them as a list of strings in the additional_authorized_imports argument:
```python
deepseek_model = LiteLLMModel(
    model_id="ollama/nezahatkorkmaz/deepseek-v3"
)

# Create agent with tools
agent = CodeAgent(
    tools=[scrape_realtor, save_to_csv],
    model=deepseek_model,
    additional_authorized_imports=["pandas", "bs4", "time"]
)
```
Finally, let's run the agent, launch the Gradio interface for it, and test out the chatbot. Please keep in mind that the more precise your prompt is, the better the results you'll get:
````python
result = agent.run("""
Thought: Let's scrape realtor data
Code:
```python
## Scrape realtor data
data = scrape_realtor(state="NY", city_name="buffalo", num_pages=2)

## Save to CSV
if "error" not in data:
    result = save_to_csv(data)
    print(result)
else:
    print(f"Error: {data['error']}")
```
""")
print(result)

GradioUI(agent).launch()
````
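You don't have to hand the agent pre-written code, either; a plain-language request also works, with the model planning the tool calls itself (results will vary with the model's coding and tool-use ability):

```python
# Let the agent plan the tool calls from a natural-language request
result = agent.run(
    "Scrape real estate agents in buffalo, NY (2 pages) and save them to a CSV file."
)
print(result)
```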
## Conclusion
Hugging Face has always had a deep understanding of developer users. Coupled with the strength of its community, many of the frameworks it has released have been well received.
This framework may be the result of the thinking and accumulation of the entire AI community over the past two years.
It reminds us that sometimes, small but exquisite is the best choice.
In 2025, when agent technology is booming, SmolAgents solves the most critical problems in the simplest way.
**It is difficult to find a soulmate, and self-cultivation is also difficult. Seize the opportunities of cutting-edge technology and become an innovative super individual with us (grasp the personal power in the AIGC era).**
### Reference
* <https://github.com/huggingface/smolagents/blob/main/docs/source/en/guided_tour.md>
> ***🧙‍♂️ I am a Generative AI expert! If you want to collaborate on a project, drop an [inquiry here](https://docs.google.com/forms/d/e/1FAIpQLSelxGSNOdTXULOG0HbhM21lIW_mTgq7NsDbUTbx4qw-xLEkMQ/viewform) or Book a [1-on-1 Consulting](https://calendly.com/gao-dalie/ai-consulting-call) Call With Me.***
*📚 Feel free to check out my other articles:*