Demystifying Generative AI Agents
- Rifx.Online
- Generative AI , Chatbots , Autonomous Systems
- 19 Jan, 2025
From single interactions to complex multi-agent systems
Overview
Lost in the hype around Generative AI agents? You’re not alone. This post cuts through the noise, providing a clear definition of what an agent is and how it works. We break down the key components, including the crucial role of “tools,” and offer practical insights into building and deploying agents, from single interactions to complex multi-agent systems. We also explore how multi-agent architectures can be implemented within an enterprise context, drawing parallels with microservices. A future post will deep-dive into Agent Operations (AgentOps) and how to build a platform for enterprise-scale multi-agent systems.
What is an Agent?
“An Agent is nothing more than a prompt that instructs a foundation model to interact with specific tools.”
A Generative AI agent orchestrates the interaction between a foundation model (FM) and external tools through carefully crafted prompts. These prompts instruct the FM on when and how to use these tools.
Each “tool” is essentially a collection of function specifications (or as we call them “declarations”). These declarations include:
- Function Name: The identifier for the tool.
- Description: A comprehensive explanation of the tool’s purpose, the problems it addresses, and the cases in which it should be used.
- Parameters: A list of input parameters, with descriptions of their meaning, types, and expected values.
- Output (optional): A description of the expected output format and content.
To formalize these declarations, the industry commonly uses the OpenAPI format (based on JSON). This standardized format allows for clear, machine-readable descriptions of APIs, facilitating seamless integration with Generative AI models. Here’s an example of a function declaration to retrieve a stock price within a tool list using the OpenAPI format:
{
  "tools": [
    {
      "functionDeclarations": [
        {
          "name": "get_stock_price",
          "description": "Fetch the current stock price of a given company.",
          "parameters": {
            "type": "object",
            "properties": {
              "ticker": {
                "type": "string",
                "description": "Stock ticker symbol (e.g., AAPL, MSFT)."
              }
            },
            "required": ["ticker"]
          },
          "returns": {
            "type": "number",
            "description": "The current stock price."
          }
        }
      ]
    }
  ]
}
The same function declaration can be created programmatically using a Python SDK, such as the one available in Vertex AI. This allows for dynamic creation and management of tool specifications:
from vertexai.generative_models import (
    FunctionDeclaration,
    GenerationConfig,
    GenerativeModel,
    Part,
    Tool,
)
## Create Function Declarations
get_stock_price = FunctionDeclaration(
    name="get_stock_price",
    description="Fetch the current stock price of a given company",
    parameters={
        "type": "object",
        "properties": {
            "ticker": {
                "type": "string",
                "description": "Stock ticker symbol for a company",
            }
        },
        "required": ["ticker"]
    },
    returns={  # Optional output description (support for this field varies by SDK version)
        "type": "number",
        "description": "The current stock price."
    }
)
## ... provide more Function Declarations
## Define the list of available functions as a tool for the FM
company_insights_tool = Tool(
    function_declarations=[
        get_stock_price,
        # ... Other Function Declarations
    ],
)
To illustrate these concepts practically, we’ll use the Vertex AI SDK and Gemini’s function calling capabilities in the following examples, based on this code repository (we highly recommend exploring it). This approach provides a solid foundation for understanding how agents work at a lower level. Once you grasp these fundamentals, you’ll be well-equipped to use higher-level agentic frameworks like LangChain.
Up to this point, we’ve focused on defining the structure of tools using JSON and demonstrating how to create those same definitions programmatically using the Vertex AI SDK. These tool definitions, which are ultimately converted to text, are appended to the instruction prompt. This allows the model to reason about which tool (if any) is necessary to fulfill the user’s request and which parameters to use.
Here’s an example demonstrating how these elements — tools, model, and instructions — come together:
## Select the LLM, configuration and provide the available tools
gemini_model = GenerativeModel(
    "gemini-2.0-flash-exp",
    generation_config=GenerationConfig(temperature=0),
    tools=[company_insights_tool],
)
## Prepare the instructions for the LLM
instruction = """
Give a concise, high-level summary.
Only use information that you learn from
the API responses.
"""
agent = gemini_model.start_chat()
The next step is to start sending new inputs to the model:
## Prepare your query/question for the LLM
query = "What is the current stock price for Google?"
## Send both instructions and query to the LLM
prompt = instruction + query
response = agent.send_message(prompt)
What do you think the response will look like? Will the model call a real function?
If company_insights_tool is correctly defined (including the get_stock_price function with a ticker parameter and a returns field as shown in previous examples), Gemini should recognize that it has a tool capable of answering the question. It will likely generate a structured request to call the get_stock_price function with the parameter ticker="GOOG" (or "GOOGL", depending on how you want to handle Google's two classes of stock).
The important point is that Gemini will not directly execute external code or make a real-time stock price API call itself. Instead, it will generate a structured request for you (the developer) to execute. By running the following code:
## LLM checks the available declaration of the tools
## LLM returns the most applicable function and parameters
function_call = response.candidates[0].content.parts[0].function_call
the response might look something like this (simplified):
name: "get_stock_price"
args {
fields
{ key: "ticker"
value {string_value: "GOOG"}
}
}
So, how do you actually get the stock price? This is where your code comes in: you (or, more precisely, your code) are responsible for triggering the right function based on the model’s response. To achieve this, we need to implement a separate Python function for each tool declaration. Here’s an example for the get_stock_price function:
## Implement a Python function per Declaration
## (API_KEY is assumed to hold your Alpha Vantage API key)
import requests

def get_stock_price_from_api(content):
    url = (
        "https://www.alphavantage.co/query?function=GLOBAL_QUOTE"
        f"&symbol={content['ticker']}&apikey={API_KEY}"
    )
    api_request = requests.get(url)
    return api_request.text

## ... Other function implementations
To make triggering function calls easier, we recommend creating a function handler (e.g., a Python dictionary) that maps the function names from the declarations to the actual Python functions:
## Link function declarations with a specific Python function
function_handler = {
    "get_stock_price": get_stock_price_from_api,
    # ...,
}
With the Python function implemented, and the function name and parameters extracted from the model’s response, the next step is to execute the corresponding function:
## LLM checks the available declaration of the tools
## LLM returns the most applicable function and parameters
function_call = response.candidates[0].content.parts[0].function_call
function_name = function_call.name
params = {key: value for key, value in function_call.args.items()}
## Invoke the corresponding Python function (or API)
function_api_response = function_handler[function_name](params)[:20000]
The output of the API call looks like this:
{
  "Global Quote": {
    "01. symbol": "GOOG",
    "02. open": "179.7500",
    "03. high": "180.4450",
    "04. low": "176.0300",
    "05. price": "177.3500",
    "06. volume": "17925763",
    "07. latest trading day": "2024-11-14",
    "08. previous close": "180.4900",
    "09. change": "-3.1400",
    "10. change percent": "-1.7397%"
  }
}
The result can then be sent back to the model for final processing and response generation:
## Send the return value of the function to the LLM to generate the final answer
final_response = agent.send_message(
    Part.from_function_response(
        name=function_name,
        response={"content": function_api_response},
    ),
)
By passing the function’s response back to the LLM, we get the final answer:
Google's (GOOG) stock price is currently $177.35, reflecting a 1.74% dip from yesterday's $180.49 close.
And this is the final answer the end user expects.
Here’s the five-step summary of the process:
- User asks a question: The user asks the model a question (“What is the current stock price for Google?”).
- Model requests a function call: The model generates a request indicating which function to call (“get_stock_price”) and what parameter value to use (“GOOG”).
- Developer executes the function: Your code extracts the function-call information from the response and runs the corresponding Python function (“get_stock_price_from_api”) with that parameter value. This function retrieves the actual stock price data from an external API.
- Function result sent back to the model: The result of the external API call (the stock price data) is sent back to the LLM.
- Model generates the final response: The LLM incorporates the function result and produces the final natural-language answer for the user.
The process we have described is the foundation of Agents. Specifically, the workflow we’ve outlined represents the core logic of a single-turn agent: a single input triggering a function call and resulting in a response. This modular design is ideal for cloud deployment and scaling. By containerizing this logic and deploying it on a service like Google Cloud Run, you can create a robust, serverless agent accessible via API, both within your VPC and to the wider world.
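To make the deployment idea concrete, here is a minimal sketch, assuming Flask and a hypothetical run_agent helper that wraps the single-turn flow above (instruction plus query, function execution via the function handler, final response). It illustrates the shape of such a service rather than a production implementation.
## Minimal sketch of exposing the single-turn agent as an HTTP service
## that can be containerized and deployed on Cloud Run.
import os

from flask import Flask, jsonify, request

app = Flask(__name__)

def run_agent(query: str) -> str:
    # Hypothetical helper wrapping the single-turn flow described above:
    # send instruction + query, execute the requested function via
    # function_handler, and return the model's final answer.
    ...

@app.route("/agent", methods=["POST"])
def agent_endpoint():
    query = request.get_json().get("query", "")
    return jsonify({"answer": run_agent(query)})

if __name__ == "__main__":
    # Cloud Run provides the port to listen on via the PORT environment variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
Packaged in a container, this becomes an API endpoint that other applications inside or outside your VPC can call.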
Moving to Multi-Turn Agents
While the single-turn model provides a crucial foundation, most real-world Generative AI applications demand more sophisticated interactions. Users rarely get what they need in a single question and answer. This section explores multi-turn agents, which can maintain context, handle follow-up questions, and orchestrate multiple function calls to achieve more complex goals.
To illustrate this concept, we’ll use an example inspired by this code repository. Our goal is to create a Generative AI agent that can answer questions about movies and theaters in a specific area. As with single-turn agents, we first need to define the functions the agent can use. For simplicity, we’ll provide the function signatures and descriptions directly in the code:
def find_movies(description: str, location: str = ""):
    """Find movie titles currently playing in theaters based on any description, genre, title words, etc.
    Args:
        description: Any kind of description including category or genre, title words, attributes, etc.
        location: The city and state, e.g. San Francisco, CA or a zip code e.g. 95616
    """
    ...
    return ["Barbie", "Oppenheimer"]

def find_theaters(location: str, movie: str = ""):
    """Find theaters based on location and, optionally, a movie title that is currently playing.
    Args:
        location: The city and state, e.g. San Francisco, CA or a zip code e.g. 95616
        movie: Any movie title
    """
    ...
    return ["Googleplex 16", "Android Theatre"]

def get_showtimes(location: str, movie: str, theater: str, date: str):
    """Find the start times for movies playing in a specific theater.
    Args:
        location: The city and state, e.g. San Francisco, CA or a zip code e.g. 95616
        movie: Any movie title
        theater: Name of the theater
        date: Date for requested showtime
    """
    ...
    return ["10:00", "11:00"]
The next step is to run function identification, execution (via the function handler), and response generation in a loop until the model has enough information to respond fully to the user’s request. To support multi-turn interactions, Gemini offers automatic function calling: when the Python functions themselves are passed to the model as tools, the SDK can run this loop for you, as in the following code:
chat = model.start_chat(enable_automatic_function_calling=True)
response = chat.send_message(
    "Which comedy movies are shown tonight in Mountain view and at what time?")
for content in chat.history:
    print(content.role, "->", [type(part).to_dict(part) for part in content.parts])
    print("-" * 80)
The following interaction demonstrates the model’s behavior when the code is executed with the user query, “Which comedy movies are shown tonight in Mountain View and at what time?”:
user -> [{'text': 'Which comedy movies are shown tonight in Mountain view and at what time?'}]
--------------------------------------------------------------------------------
model -> [{'function_call': {'name': 'find_movies', 'args': {'location': 'Mountain View, CA', 'description': 'comedy'}}}]
--------------------------------------------------------------------------------
user -> [{'function_response': {'name': 'find_movies', 'response': {'result': ['Barbie', 'Oppenheimer']}}}]
--------------------------------------------------------------------------------
model -> [{'function_call': {'name': 'find_theaters', 'args': {'movie': 'Barbie', 'location': 'Mountain View, CA'}}}]
--------------------------------------------------------------------------------
user -> [{'function_response': {'name': 'find_theaters', 'response': {'result': ['Googleplex 16', 'Android Theatre']}}}]
--------------------------------------------------------------------------------
model -> [{'function_call': {'name': 'get_showtimes', 'args': {'date': 'tonight', 'location': 'Mountain View, CA', 'theater': 'Googleplex 16', 'movie': 'Barbie'}}}]
--------------------------------------------------------------------------------
user -> [{'function_response': {'name': 'get_showtimes', 'response': {'result': ['10:00', '11:00']}}}]
--------------------------------------------------------------------------------
model -> [{'text': 'The comedy movie "Barbie" is showing at Googleplex 16 at 10:00 and 11:00 tonight. \n'}]
--------------------------------------------------------------------------------
This interaction demonstrates that the model uses the complete conversation history at each turn to determine what information is still needed, which tool to use, and how to formulate its response. This record of past interactions, the so-called short-term memory, is crucial for multi-turn conversations. In addition to the conversation history, it is important to store operational metrics (e.g., execution time, latency, memory usage) for each model interaction, to support further experimentation and optimization.
Here’s a 7-step summary of the multi-turn agent execution process depicted in the diagram:
- New Query: The user initiates the interaction by providing a new query or question.
- Function Identification: The foundation model (FM), along with the available tools and instructions, analyzes the query and determines if a function call is necessary. If so, it identifies the appropriate function name and the required parameters.
- Function Call Preparation: The FM generates a structured function call request, specifying the function name and the parameters to be passed.
- Function Call Execution: This step is performed by the developer’s code, not by the FM itself (although Gemini’s automatic function calling can handle it for you, simplifying the implementation). The code receives the function call request, executes the corresponding function (e.g., making an API call), and retrieves the result.
- Intermediate Response: The result of the function execution (the data retrieved by the function) is sent back to the FM as an intermediate response.
- Context Update (Conversation History): The FM updates its conversation history (represented as “Short-term Memory” in the diagram) with the intermediate response. The FM then uses this updated context to decide whether further function calls are needed or if enough information has been gathered to generate a final response. The process loops back to step 2 if more information is needed.
- Final Response: Once the FM determines it has all the necessary information (or a maximum number of steps is reached to avoid infinite loops), it generates the final response to the user, incorporating the information gathered from all the function calls. A minimal code sketch of this loop follows below.
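To make steps 2 through 6 concrete, here is a minimal manual sketch of that loop, reusing the function_handler pattern from the stock-price example. It assumes the Vertex AI SDK objects used earlier; the exact way a function-call part is detected in the response may vary by SDK version, and max_steps is the safeguard against infinite loops mentioned in step 7.
## A minimal manual sketch of the function-calling loop (steps 2-6).
## agent is a chat session, function_handler maps function names to Python
## functions, and max_steps guards against infinite loops (step 7).
def run_agent_loop(agent, prompt, function_handler, max_steps=5):
    response = agent.send_message(prompt)
    for _ in range(max_steps):
        part = response.candidates[0].content.parts[0]
        function_call = getattr(part, "function_call", None)
        # No function call requested: the model has produced its final answer.
        if not function_call or not function_call.name:
            return response.text
        # Execute the matching Python function and send its result back to the model.
        params = {key: value for key, value in function_call.args.items()}
        function_api_response = function_handler[function_call.name](params)
        response = agent.send_message(
            Part.from_function_response(
                name=function_call.name,
                response={"content": function_api_response},
            )
        )
    return response.text
With automatic function calling enabled, the SDK performs essentially this loop on your behalf.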
This multi-turn interaction illustrates realistically how a Generative AI Agent handles a single user request. However, real-world usage often involves repeated interactions over time. Imagine a user who uses the agent to find movie showtimes one week and then returns the following week with a similar request. If the agent retains a long-term memory of past interactions, it can provide more tailored recommendations, perhaps suggesting movies or theaters the user has previously shown interest in. This long-term memory stores a summary or complete record of the short-term conversation history for each interaction.
Both short-term and long-term memory play crucial roles in enabling effective multi-turn agent interactions. Here’s a breakdown of each, along with implementation options (a combined sketch follows the two lists):
Short-Term Memory (Conversation History): Stores the ongoing conversation within a single user session. This includes the user’s queries, the model’s function calls, and the responses from those function calls. This context is essential for the model to understand follow-up questions and maintain coherence throughout the interaction. Implementation Options:
- Logs (small text logs): For simple applications with short conversations, storing the interaction history as plain text logs can be sufficient. This is easy to implement but may become inefficient for long conversations or high volumes of traffic.
- Cloud storage/Database (large non-text logs): For more complex applications (e.g., leveraging multi-modal model capabilities with images or audio as input), a cloud storage service or a database is a better choice. This allows for more structured storage and efficient retrieval of conversation history.
- API Session (client-side memory): The conversation history can also be managed on the client-side (e.g., in a web browser or mobile app) using API sessions. This reduces server-side storage requirements but may have limitations on the amount of data that can be stored.
- Combination of all the above: A hybrid approach might be used, combining different storage mechanisms depending on the specific needs of the application.
Long-Term Memory: Stores information about past user interactions across multiple sessions. This allows the agent to learn user preferences, provide personalized recommendations, and offer more efficient service over time. Implementation Options:
- Vector Databases (for RAG — Retrieval Augmented Generation): Vector databases are particularly well-suited for long-term memory in agent applications. They store data as vector embeddings, which capture the semantic meaning of the data. This enables efficient similarity search, allowing the agent to retrieve relevant information from past interactions based on the current user query. This is often used in Retrieval Augmented Generation (RAG) pipelines.
- Metadata Storage/Graphs (session ID, other metadata): Metadata storage, such as graph databases or key-value stores, can be used to store information about user sessions, such as session IDs, timestamps, and other relevant metadata. This can be used to organize and retrieve past conversation histories and interaction relationships.
- Cloud Storage/Databases (actual logs): Complete logs of past conversations can be stored in cloud storage or databases. This provides a full record of all interactions but may require more storage space and more complex retrieval mechanisms.
- Combination of all the above: Similar to short-term memory, a combination of storage mechanisms can be used to optimize performance and storage efficiency. For example, summarized information could be stored in a vector database for quick retrieval, while full logs could be stored in cheaper cloud storage for auditing or more detailed analysis.
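As a rough illustration of how the two layers can fit together, here is a minimal sketch that is not tied to any particular framework: SessionMemory plays the role of short-term memory for a single session, while archive_session persists a summary for long-term retrieval. The summarize, embed, and vector_db arguments are hypothetical placeholders for whatever summarization model, embedding model, and vector database you use.
from dataclasses import dataclass, field

## Short-term memory: the running conversation history for one session.
@dataclass
class SessionMemory:
    session_id: str
    turns: list = field(default_factory=list)

    def add_turn(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

## Long-term memory: persist a session summary for retrieval in later sessions.
def archive_session(memory: SessionMemory, summarize, embed, vector_db) -> None:
    # summarize, embed, and vector_db stand in for your own summarization
    # model, embedding model, and vector database client.
    summary = summarize(memory.turns)
    vector_db.upsert(
        id=memory.session_id,
        vector=embed(summary),
        metadata={"summary": summary},
    )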
Agents Calling Agents: The Power of Multi-Agent Systems
While individual agents can handle complex tasks, some problems require a coordinated effort. Multi-agent systems address this by enabling multiple agents to work together, each specializing in a particular sub-task. This collaborative approach allows for the decomposition of complex problems into smaller, more manageable parts, leading to more efficient and robust solutions.
A key concept in multi-agent systems is treating agents as tools: just as an agent can use external APIs or functions, it can also use other agents to perform specific sub-tasks. This “agents as tools” paradigm allows for the creation of hierarchical systems where one agent orchestrates the work of others.
Here’s a description of the most common multi-agent patterns:
- Router Agent (one by one): In this pattern, a central “router” agent receives the initial request and then delegates it to other agents one at a time. The router acts as a coordinator, determining which agent is best suited to handle each part of the task. After an agent completes its sub-task, the result is passed back to the router, which then decides the next step. This is useful for tasks that can be broken down into sequential steps, where the output of one step influences the next.
- Parallel (one to many): Here, a single agent distributes sub-tasks to multiple agents concurrently. This is effective when the sub-tasks are independent and can be executed in parallel, significantly reducing the overall processing time. Once all agents have completed their work, their results are aggregated, often by the initial agent that distributed the tasks.
- Sequential (predefined sequence): This pattern involves a predefined flow of information between agents. The output of one agent is directly passed as input to the next agent in a fixed sequence. This is suitable for tasks with a well-defined, linear workflow.
- Circular Flow (predefined sequence): Similar to the sequential pattern, but the flow of information forms a loop. The output of the last agent in the sequence is passed back to the first agent, creating a cycle. This can be useful for iterative processes where agents refine their output based on feedback from other agents in the loop.
- Dynamic (all-to-all): In this more complex pattern, any agent can communicate with any other agent. There’s no central coordinator or predefined flow. Agents can dynamically exchange information and negotiate with each other to achieve a common goal. This pattern is more flexible but also more complex to manage, requiring sophisticated communication and coordination mechanisms.
To summarize, think of it this way:
- In the Router pattern, the router agent is using other agents as specialized tools, calling them one by one as needed.
- In the Parallel pattern, the initial agent is using multiple agents as tools simultaneously to speed up the process.
- In the Sequential and Circular Flow patterns, agents are used as tools in a predefined pipeline or loop.
- In the Dynamic pattern, agents are interacting more like a team, with each agent acting as both a user and a tool for other agents, depending on the situation.
These patterns provide different ways to structure multi-agent interactions, allowing developers to choose the most appropriate approach based on the specific requirements of their application.
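To ground the “agents as tools” idea, here is a minimal sketch of the Router pattern. The billing_agent, support_agent, and classify callables are hypothetical: in practice, each specialist could be a separately deployed agent reached over an API, and classify would be a foundation-model call that picks the best-suited specialist.
## Hypothetical specialist agents; in practice these could be API calls
## to independently deployed agent services.
def billing_agent(request: str) -> str:
    return f"[billing agent] handled: {request}"

def support_agent(request: str) -> str:
    return f"[support agent] handled: {request}"

SPECIALISTS = {
    "billing": billing_agent,
    "support": support_agent,
}

def router_agent(request: str, classify) -> str:
    # classify stands in for a foundation-model call that returns one of
    # the keys in SPECIALISTS for the given request.
    choice = classify(request, options=list(SPECIALISTS))
    return SPECIALISTS[choice](request)
The Parallel, Sequential, and Circular Flow patterns differ only in how this dispatch step is arranged: fan-out to several specialists at once, a fixed pipeline, or a loop with feedback.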
Having established the concept of multi-agent collaboration, the next logical question is how to put this into practice within an enterprise setting. The diagram above illustrates an enterprise-grade architecture based on a microservices approach. This approach treats each agent as an independent service, similar to how microservices break down large applications into smaller, independently deployable components. This analogy is powerful because it allows us to leverage existing microservices best practices: each business unit can develop and deploy its own agents as independent microservices, tailored to their specific needs. This decentralized approach allows for greater agility and faster development cycles, as teams can work independently without affecting other parts of the system. Just as microservices communicate via APIs, agents in a multi-agent system communicate by exchanging messages, often in structured formats like JSON. To ensure interoperability and avoid redundant development, a central tool registry provides access to shared tools, while an agent template catalog offers reusable code and best practices. This approach fosters collaboration, accelerates development, and promotes consistency across the organization. We’ll explore this architecture in greater depth in an upcoming post.
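As an illustration only (there is no single standard schema here), an inter-agent message exchanged between such services might carry fields like the following; every field name below is an assumption for this sketch:
## Illustrative inter-agent message (assumed schema, not a standard).
inter_agent_message = {
    "sender": "orders-agent",
    "recipient": "billing-agent",
    "session_id": "abc-123",
    "task": "generate_invoice",
    "payload": {"order_id": "98765", "currency": "EUR"},
}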
Conclusion
This post has walked through the inner workings of Generative AI Agents: their core components, single-turn versus multi-turn interactions, and the collaborative power of multi-agent systems, giving you the fundamentals needed to apply these technologies in your own applications.
Key takeaways from this post include:
- Generative AI agents enable foundation models to interact with external tools through well-defined function declarations.
- An Agent is nothing more than a prompt that instructs a foundation model to interact with specific tools.
- Multi-turn agents handle complex user requests by dynamically calling various functions and maintaining context throughout the interaction.
- Multi-agent systems empower agents to collaborate and tackle intricate problems by delegating sub-tasks and coordinating their efforts.