Mastering AI Agents: 10 Key Questions Answered to Demystify Google's Revolutionary Whitepaper


10 FAQs

This article is part of a new series I’m launching called 10 FAQs. In this series, I aim to break down complex concepts by answering the ten most common questions you’re likely to have on the topic. My goal is to use simple language and relatable analogies to make these ideas easy to grasp.

Photo by Solen Feyissa on Unsplash

If you prefer to listen, here is an AI-generated podcast on the same topic that I made using NotebookLM.

Introduction

In September 2024, Google published a paper titled “Agents” by Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic. Recently, this paper went viral on Twitter. I read through the entire paper (so you don’t have to) and answered ten key questions to help you understand AI agents in depth. This single article is all you need to get started and excited about AI agents.

1. What are Agents, and why should I know about them?

Generative AI agents can be defined as applications that attempt to achieve a goal by observing the world and acting upon it using the tools at their disposal. Agents are autonomous and can act independently of human intervention. A simple human analogy helps: we are very good at learning complex topics and messy pattern recognition, yet we still rely on external tools like books and the internet. Similarly, foundational AI models can be given tools to access real-time information and act upon it.

Industry leaders like Mark Zuckerberg (Meta CEO) and Jensen Huang (NVIDIA CEO) have been praising AI agents. Mark Zuckerberg remarked:

There will be more AI agents than people, as businesses and individuals create AI agents that reflect their values and interact with the world on their behalf

Similarly, Jensen Huang referred to AI agents

as the ‘digital workforce’ that could revolutionize various sectors of jobs, coupled with their degree of autonomy, that could help companies who deploy it to keep running their business and workspace smoothly without a need for human interference

It’s essential to understand AI agents because they represent a revolutionary shift in how language models interact with the outside world. These agents can have a transformative impact on industries such as healthcare, finance, retail, and beyond, shaping the way we live and work.

2. What is Cognitive Architecture, and which components constitute it?

Agents can reason about what they should do next to achieve their goal even in the absence of explicit information from humans. The combination of components that drive an agent’s behaviour, action, & decision-making can be described as Cognitive Architecture.

Cognitive Architecture & its Components

Cognitive Architecture is formed using three components:

  1. Orchestration
  2. Model
  3. Tools

3. Explain the components of Cognitive Architecture in brief.

The Model

  • In the context of agents, the model refers to the Language Model (LM).
  • This LM is used as the centralised decision-maker.
  • It could be a single model or multiple models, of any size (small or large).
  • These LMs should be capable of following instruction-based reasoning and logic frameworks like ReAct, Chain-of-Thought, and Tree-of-Thoughts.
  • LMs can be general-purpose, multimodal, or fine-tuned for a specific agent.
  • The important thing to note here is that LMs are not typically trained on the agent's specific configuration settings (i.e. tool choices, orchestration, etc.).
  • However, it is possible to refine the model for the agent's tasks by providing it with examples that showcase the agent's capabilities (a sketch of such a prompt follows this list).
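To make the instruction-following idea concrete, here is a minimal sketch of the kind of ReAct-style prompt an orchestration layer might send to the model. The wording and tool names are hypothetical illustrations, not taken from the paper.

```python
# A minimal ReAct-style prompt template (hypothetical wording and tool names).
# The orchestration layer fills in {question} and appends the running
# Thought / Action / Observation trace on each loop iteration.
REACT_PROMPT = """You can use the following tools: search_flights, get_weather.

Answer the question by iterating over these steps:
Thought: reason about what to do next
Action: the tool to call, with its input
Observation: the tool's result (filled in by the system)

Repeat Thought/Action/Observation as needed, then finish with:
Final Answer: the answer to the user's question

Question: {question}
"""

prompt = REACT_PROMPT.format(question="Should I fly to Zurich this weekend?")
```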

The Tools

  • Tools bridge the gap between foundational models and the outside world, allowing agents to interact with external systems and data.
  • Tools can take various forms and have varying depths of complexity, but typically align with common web API methods like GET, POST, PATCH & DELETE (a hypothetical example follows this list).
  • Tools allow agents to access real-world information; this empowers them to support more specialised systems like Retrieval Augmented Generation (RAG).
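As an illustration, a tool is often exposed to the model as a structured specification that maps onto one of those HTTP methods. The weather tool below is a made-up example, not an API from the paper.

```python
import requests  # assumes the requests package is installed

# A hypothetical tool specification the agent can reason over at runtime.
# It maps a natural-language capability onto a plain HTTP GET endpoint.
get_weather_tool = {
    "name": "get_weather",
    "description": "Returns the current weather forecast for a city.",
    "method": "GET",
    "url": "https://api.example.com/weather",  # placeholder endpoint
    "parameters": {
        "city": {"type": "string", "description": "City name, e.g. 'Zurich'"},
    },
}

def call_tool(spec: dict, **kwargs) -> str:
    """Execute the HTTP call described by the tool spec (sketch only)."""
    response = requests.request(spec["method"], spec["url"], params=kwargs, timeout=10)
    return response.text
```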

The Orchestration Layer

  • The orchestration layer describes a cyclical process that governs how the agent takes in information, performs some internal reasoning, & uses that reasoning to inform its next action or decision.
  • In general, this loop will continue until the agent has reached its goal.
  • The complexity of the orchestration layer varies greatly depending on the agent and the task it's performing; a bare-bones version of this loop is sketched below.
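A minimal sketch of that reasoning-and-acting loop might look like the following. Here `model.generate` and `parse_action` are stand-ins for whatever LM client and output parser you actually use; nothing below comes from the paper itself.

```python
# A bare-bones orchestration loop (sketch). `model` is a placeholder LM client
# and `tools` is a dict mapping tool names to callables.
def run_agent(model, tools, question: str, max_steps: int = 5) -> str:
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        # 1. Ask the model to reason about the next step given the trace so far.
        step = model.generate(trace)  # assumed LM client method
        trace += step
        if "Final Answer:" in step:
            # Goal reached: stop looping and return the answer.
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            # 2. Execute the chosen tool and feed the result back as an observation.
            tool_name, tool_input = parse_action(step)  # hypothetical parser
            observation = tools[tool_name](tool_input)
            trace += f"\nObservation: {observation}\n"
    return "No final answer within the step budget."
```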

4. What are the differences between traditional models and AI Agents? (Refer to page 8 of the paper for details)

  • In traditional models, knowledge is limited to what is available in their training data; in agents, knowledge is extended through connections to external systems via tools.
  • Traditional models have no native logic layer; agents implement reasoning frameworks such as ReAct, Chain-of-Thought (CoT), and Tree-of-Thoughts (ToT).
  • Traditional models have no native memory management (chat history); agents manage chat history to produce more accurate, context-aware responses.
  • Traditional models have no concept of tools; in agents, tools are natively implemented in the agent architecture.

5. How does an agent use a reasoning framework like ReAct to reach its goal?

Example with ReAct framework

As you can see in the figure above, the agent uses a reasoning framework like ReAct to reach its end goal. This is an iterative process: the agent extracts information, makes informed decisions, and refines its next actions based on previous outputs.

At the core of the Cognitive Architecture lies the Orchestration layer, which is responsible for maintaining memory, state, reasoning, and planning.

6. What are Tools, and what types of tools does Google support (Extensions, Functions, & Data Stores)?

Tools bridge the gap between foundational models and the outside world. No matter how much training data you throw at a model, it still lacks the ability to interact with the outside world. Functions, Extensions, Data Stores, & Plugins are all ways to provide this critical ability to the model.

As of the date of publication of the paper (September 2024), Google supported three primary tool types that are able to interact with models:

  1. Extensions
  2. Functions
  3. Data Stores

7. What is the difference between Extensions vs. Functions and when to use what?

Extensions interacting with API

Extensions allow agents to seamlessly execute APIs regardless of their underlying implementation. Extensions bridge the gap between an agent and an API by:

  1. Teaching the agent how to use API endpoints using examples.
  2. Teaching the agent what arguments or parameters are needed to successfully call the API endpoint.

The key strength of using extensions is that the agent can decide which extension, if any, would be suitable for solving the user’s query based on examples at runtime.
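As a hypothetical illustration of that runtime selection, the orchestration layer might simply present the candidate extensions, each with its example usages, and let the model pick the best match. The extension names and helper below are invented for this sketch.

```python
# Hypothetical runtime selection among extensions: each candidate carries
# example queries, and the model is asked to pick the best match (or none).
extensions = [
    {"name": "flights_extension", "examples": ["Book a flight to Zurich", "Find flights under $300"]},
    {"name": "code_interpreter", "examples": ["Plot this CSV file", "Run this Python snippet"]},
]

def build_selection_prompt(query: str) -> str:
    listing = "\n".join(
        f"- {ext['name']}: e.g. " + "; ".join(ext["examples"]) for ext in extensions
    )
    return (
        f"User query: {query}\n"
        f"Available extensions:\n{listing}\n"
        "Reply with the single best extension name, or 'none' if no extension fits."
    )

# This prompt would be sent to the model; its reply decides which extension runs.
prompt = build_selection_prompt("I want to fly to Zurich next Friday")
```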

One-to-many relationship between Agents, Extensions & APIs

On the other hand, functions provide the developer with more granular control over the flow of data in the application. A model can take a set of functions and decide when to use each function and what arguments each function needs based on its specification.

Client-side vs Agent-side control for extension and function calling

Functions differ from Extensions in a few ways, most notably:

  1. A model outputs a function & its arguments but doesn’t make a live API call.
  2. Functions are executed on the client side, while extensions are executed on the agent side.

Reasons for choosing functions over extensions:

  1. API calls need to be made at another layer of the application stack, outside of the direct agent architecture flow.
  2. Security or authentication restrictions that prevent the agent from calling an API directly.
  3. Timing or order-of-operations constraints that prevent the agent from making API calls in real-time.
  4. Additional data transformation logic needs to be applied to the API response that the agent cannot perform.
  5. The developer wants to iterate on agent development without deploying additional infrastructure for the API endpoints.

Note: If you want to learn function calling in detail with an example, refer to pg. 23 of the original paper (link at the bottom).
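In the meantime, here is a rough, library-agnostic sketch of the idea (not the paper's Gemini example): the model only returns the function name and arguments as structured output, and the client application decides if and how to execute the call.

```python
import json

def get_flights(departure: str, destination: str) -> list[str]:
    """Placeholder business logic that lives on the client side."""
    return [f"{departure} -> {destination} (example flight)"]

# Assume the model returned this JSON instead of calling any API itself.
model_output = '{"name": "get_flights", "args": {"departure": "ZRH", "destination": "LHR"}}'

call = json.loads(model_output)
available_functions = {"get_flights": get_flights}

# The client executes the call on its own terms (auth, ordering, data transforms).
result = available_functions[call["name"]](**call["args"])
print(result)  # ['ZRH -> LHR (example flight)']
```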

8. What are Data Stores and When to use Data Stores (RAG)?

Foundational language models have a knowledge cutoff because they are not exposed to real-time information. Suppose a model was trained on data up to September 2024; it won't be able to answer questions about events that happened after September 2024. To solve this issue, we can make use of Data Stores.

Data store connects the agent to different sources of information

Data stores allow developers to provide additional data to an agent in its original format, eliminating the need for time-consuming data transformations, model retraining, or fine-tuning. Data stores convert the incoming documents into a set of vector database embeddings (a type of high-dimensional, mathematical representation of data) that the agent can use at runtime to extract the information it needs to supplement its next action or response to the user. One of the most prominent examples of data stores in practice is Retrieval Augmented Generation (RAG) based applications.

Sample RAG-based application with ReAct reasoning/planning

Note: A detailed lifecycle of RAG applications is provided in the original paper (refer to pg. 29 and figure 13)
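To give a feel for the mechanics, here is a deliberately toy sketch of the data-store idea: documents are embedded once, the closest chunk is retrieved at query time, and the retrieved text is prepended to the prompt. The `embed` function is a crude word-count stand-in for a real embedding model, and nothing here reflects the paper's Vertex AI implementation.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Crude stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = [
    "Our refund policy allows returns within 30 days.",
    "Support is available Monday to Friday, 9am to 5pm.",
]
index = [(doc, embed(doc)) for doc in documents]  # offline indexing step

query = "Can I return a product after two weeks?"
best_doc = max(index, key=lambda item: cosine(embed(query), item[1]))[0]

# The retrieved chunk is added to the prompt before the model answers.
prompt = f"Context: {best_doc}\n\nQuestion: {query}\nAnswer using only the context."
```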

9. How can model performance be enhanced, and what techniques are used to achieve this?

A crucial aspect of using models is their ability to choose the right tools when generating output. To achieve optimal model performance and help models gain access to specific types of knowledge, several approaches exist:

  • In-context learning
  • Retrieval-based in-context learning
  • Fine-tuning based learning

In-Context Learning: This method provides the generalised model with a prompt, tools, and few-shot examples at inference time, which allows it to learn "on the fly" how and when to use tools for a specific task. The ReAct framework is an example of this approach.

Retrieval-based in-context learning: This technique dynamically populates the model prompt with the most relevant information, tools, and associated examples by retrieving them from external memory.

Fine-tuning based learning: This method involves training a model using a larger dataset of specific examples prior to inference. This helps the model understand when and how to apply certain tools prior to receiving any user queries.

If this was too technical for you, we can understand all three approaches with a simple analogy:

  • Imagine a chef has received a specific recipe (the prompt), a few key ingredients (relevant tools) and some example dishes (few-shot examples) from a customer. Based on this limited information and the chef’s general knowledge of cooking, they will need to figure out how to prepare the dish ‘on the fly’ that most closely aligns with the recipe and the customer’s preferences. This is in-context learning.
  • Now let’s imagine our chef in a kitchen that has a well-stocked pantry (external data stores) filled with various ingredients and cookbooks (examples and tools). The chef is now able to dynamically choose ingredients and cookbooks from the pantry and better align to the customer’s recipe and preferences. This allows the chef to create a more informed and refined dish leveraging both existing and new knowledge. This is retrieval-based in-context learning.
  • Finally, let’s imagine that we sent our chef back to school to learn a new cuisine or set of cuisines (pre-training on a larger dataset of specific examples). This allows the chef to approach future unseen customer recipes with deeper understanding. This approach is perfect if we want the chef to excel in specific cuisines (knowledge domains). This is fine-tuning based learning.

10. How can I build an AI Agent?

Up until now we have explored the core concepts of AI agents, but building a production-grade AI agent requires integrating it with additional tooling like user interfaces, evaluation frameworks, and continuous improvement mechanisms. Google's Vertex AI platform simplifies this process by offering a fully managed environment with all the fundamental elements covered earlier. We can also leverage open-source libraries like LangChain and LangGraph to prototype agents. These popular open-source libraries allow users to build custom agents by "chaining" together sequences of logic, reasoning, and tool calls to answer a user's query.
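To give a feel for what "chaining" looks like in code, here is a minimal LangGraph-style sketch with two trivial placeholder nodes. The node bodies are invented for illustration, and the exact API surface may differ between langgraph versions.

```python
# Minimal "chaining" sketch using LangGraph's StateGraph (API may vary by version).
# Each node is a plain function that reads and updates a shared state dict.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    answer: str

def reason(state: AgentState) -> AgentState:
    # Placeholder for an LM call that plans which tool to use.
    return {"question": state["question"], "answer": "plan: look up the weather"}

def act(state: AgentState) -> AgentState:
    # Placeholder for the actual tool call based on the plan.
    return {"question": state["question"], "answer": "It is sunny in Zurich."}

graph = StateGraph(AgentState)
graph.add_node("reason", reason)
graph.add_node("act", act)
graph.set_entry_point("reason")
graph.add_edge("reason", "act")
graph.add_edge("act", END)

app = graph.compile()
print(app.invoke({"question": "What's the weather in Zurich?", "answer": ""}))
```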

Sample end-to-end agent architecture built on Vertex AI platform

  • Using a natural language interface, developers can rapidly define crucial elements of their agents — goals, task instructions, tools, sub-agents for task delegation, and examples — to easily construct the desired system behavior.
  • The platform also comes with a set of development tools that allow for testing, evaluation, measuring agent performance, debugging, and improving the overall quality of developed agents.
  • This allows developers to focus on building and refining their agents while the complexities of infrastructure, deployment and maintenance are managed by the platform itself.

Note: If you want to build custom AI agents using the Vertex AI platform, you can refer to the documentation here.

Conclusion

The future of AI agents holds exciting advancements and we’ve only scratched the surface of what is possible. As tools become more sophisticated and reasoning capabilities are enhanced, agents will be empowered to solve increasingly complex problems.

Furthermore, the strategic approach of "agent chaining" will continue to gain momentum. By combining specialized agents, each excelling in its particular industry or task, we can create a "mixture of agent experts" approach capable of delivering exceptional results across various industries and problem areas.

References

[1] “Agents” by Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic, Google, September 2024.

You can also refer to my Handwritten Notes.
