Building and Serving a RAG Agent with Custom Tools: A Complete Guide

Goal

The goal of this article is to demonstrate how to create a Large Language Model (LLM) agent using LangGraph and LangChain, which will perform Retrieval-Augmented Generation (RAG) on a set of documents. Additionally, we will explore how to build a tool for making API calls to enable the LLM to acquire real-time knowledge from external sources. Finally, we will serve this agent locally using FastAPI and use a local PostgreSQL database to store chat logs with the LLM.

Note: LLMs are likely already trained on Wikipedia articles and other sources, providing them with some inherent knowledge about finance. However, we are using them in this context to illustrate how to create a RAG tool effectively.

What are Tools?

Large Language Models (LLMs), though powerful, are limited by their reliance on the knowledge they were trained on. For an LLM to respond to:

  1. Events not seen by the LLM (e.g., events that occurred after its training period), or
  2. Data from documents not publicly available or previously unseen by the LLM,

we need to equip it with tools.

Tools are functions that we provide to the LLM. These can range from simple utilities, like performing mathematical calculations, to complex operations, such as calling APIs and using external resources to generate responses. Proper documentation of a tool’s usage gives the LLM context about when and why the tool should be used, as well as the parameters it requires.

In this tutorial, we will create the following two tools and bind them to OpenAI’s gpt-4o-mini LLM:

  1. A RAG tool: This tool will use Wikipedia articles about finance for Retrieval-Augmented Generation.
  2. A stock market trends tool: This tool will retrieve news articles for a given stock ticker symbol within a specified date range using the Finnhub API. It will help answer queries and explain trends in the stock market.

Note: To use OpenAI models, you will need an API key, which can be obtained by signing up on the OpenAI platform and requesting one. Some OpenAI models may require payment. Alternatively, you can explore other options like Mistral, Llama, and Cohere, which offer free access to some LLMs in exchange for using your data to train their models.

RAG Tool

Breakdown

RAG, or Retrieval-Augmented Generation, refers to a type of AI framework or architecture that focuses on equipping LLMs with external sources of knowledge. This technique is particularly useful when we need to provide the LLM with data from documents it has not been trained on.

Let's talk about each component of this tool:

  1. **Knowledge Base (Wikipedia articles):** The knowledge base consists of documents used to provide additional knowledge to the LLM. In a real-world use case, this could include annual reports, company-related documents, or any domain-specific data relevant to the task.
  2. **Document Loader:** Document loaders are responsible for loading data from the knowledge base. This component provides a standard interface for loading any type of data and formats it in a standardized way for LangChain to process.
  3. **Splitter (Chunking):** Loaded documents often consist of large texts with smaller sections, where only certain parts are relevant to a query. To improve accuracy and ensure the LLM processes only the relevant sections, we split (chunk) the documents into smaller parts. Several methods are available for chunking:
     a. Character-based text splitting: splits documents based on a separator you define.
     b. Recursive character splitting: attempts to maintain structure by keeping paragraphs intact (if under the chunk size) and ensuring sentences within the same chunk don't spill into the next.
     c. Document-structure-based splitting: useful for logically structured documents, such as HTML or XML, where tags define the splits.
     d. Semantic meaning-based splitting (the method we'll use): splits text based on its semantic meaning. This approach ensures chunks are separated by content relevance rather than by characters or logical structure, making it ideal when the text contains topic shifts.
  4. **Embedding Layer:** Once the documents are chunked, the text is converted into numerical representations, known as embeddings, which the LLM can understand. The embedding layer maps or 'embeds' the chunks into numerical values. We will generate embeddings with a Hugging Face Sentence Transformers model, accessed through the convenient HuggingFaceEmbeddings class from LangChain.
  5. **Vector Store:** After creating embeddings for the document chunks, we store them in a vector store. Similar to how relational databases store and index tabular data, vector stores are optimized for storing embeddings as vectors and performing semantic searches. Semantic search works by embedding a query as a vector and calculating its similarity to the stored vectors. Vector stores range from simple in-memory stores that use your computer's memory to distributed, cloud-hosted stores such as Pinecone. For this tutorial, we will use FAISS, which does a good job at real-time retrieval and can reduce memory usage for large collections by compressing vectors.
  6. **Index Stored on Disk:** Computing embeddings is time-intensive, and since the vector store resides in memory, there's a risk of losing data during an unexpected shutdown. To ensure embeddings persist between sessions, we will save the vector store index locally on disk, allowing it to be reused after restarts.

Finnhub API Tool

Finnhub.io provides a Stock API with a range of endpoints, from querying stock prices to news articles about the stock market. We will use their Company News endpoint, passing a stock ticker symbol and a date range to get articles relevant to that period. We will use an API key provided by Finnhub to send requests to its API.

What is an Agent?

Agents are systems designed to take on high-level tasks and utilize an LLM as the reasoning engine. They rely on LLMs to decide the next course of action given a specific task. LangChain recommends using LangGraph to build agents, as it allows the creation of a graph-like structure to define the flow of control within an agent.

Core Concepts of Graph Agents in LangGraph:

  1. **Nodes:** Nodes represent the vertices in the agent graph. In LangGraph, nodes are essentially Python functions that contain the logic for the agent.
  2. **State:** The state refers to the current snapshot of the application or a checkpoint during the graph's execution.
  3. **Edges:** Edges define the flow of logic in the agent graph. Different edge types exist for various use cases: normal edges represent fixed or unconditional transitions between nodes, while conditional edges represent transitions to different nodes based on the output of a function.

Agent Workflow in the Tutorial

In this tutorial, we will define an agent that performs the following steps:

  1. Accept a query from the user.
  2. Determine whether a tool is required and decide which tool to use.
  3. Use the selected tool's response to invoke the previously defined LLM, or, if no tool is selected, use the LLM's response directly.
  4. Return the response to the user.
  5. Persist the conversation by storing messages in a Postgres database (explained in the “Serving the Agent” section).

Developing the Agent

Libraries

  1. **langchain:** We will use the core LangChain library to chain components together and build the RAG architecture. It will also be used to create the Finnhub API tool.
     a. The langchain library makes use of a basic unit known as a Runnable.
     b. A Runnable abstracts and encapsulates the code required to access the basic components (e.g., LLMs, vector stores) of an LLM application architecture.
     c. Each Runnable has a consistent interface, enabling it to be invoked, batched, streamed, and more with the same functions.
  2. **langchain-openai and langchain-community:** These libraries integrate third-party APIs into LangChain as Runnables, ensuring all components work seamlessly. Think of LangChain as a middleman, simplifying access to APIs without requiring you to handle the complexities of different API methods.
  3. **langgraph:** LangGraph lets you create a graph-like structure for defining the flow of control in an agent.
  4. **unstructured:** Used while loading documents; unstructured is required by some of LangChain's document loaders.
  5. **faiss-gpu:** We will use FAISS (Facebook AI Similarity Search) as our vector store. If a GPU is unavailable, you can install faiss-cpu instead to keep the index in CPU memory.

Development Environment

We will develop the agent in a Python notebook, enabling an interactive and iterative development process.

Step 0. Installing dependencies

We will install all the Python dependencies required for the development phase.

%pip install --quiet --upgrade langchain langchain-openai langchain-community langgraph unstructured sentence-transformers faiss-gpu python-dotenv

Setting environment variables (for development purposes only):

import getpass
import os
os.environ["OPENAI_API_KEY"] = getpass.getpass()
os.environ["FINNHUB_API_KEY"] = getpass.getpass()

Using getpass gives you an input field where you can enter the API keys.

RAG Tool Development

Step 1. Document loader

For this tutorial, we will load Wikipedia articles using a predefined list of URLs stored in a links.txt file. Note: If you want to load documents from a local directory instead of URLs, you can use DirectoryLoader.
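
For reference, links.txt is simply a newline-separated list of article URLs. A short hypothetical example (any Wikipedia finance articles will do):

https://en.wikipedia.org/wiki/Option_(finance)
https://en.wikipedia.org/wiki/Stock_market
https://en.wikipedia.org/wiki/Derivative_(finance)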

with open('links.txt', 'r') as f:
    links = f.readlines()

## Remove duplicates and newline characters from each link
links = list(set([link.strip() for link in links]))

We will use the WebBaseLoader from the langchain_community library to load the documents from the provided links.

from langchain_community.document_loaders import WebBaseLoader

## Initialize the WebBaseLoader with the list of links
wikipedia_loader = WebBaseLoader(links)

## Load the documents
wikipedia_docs = wikipedia_loader.load()

Step 2. Chunking

For this tutorial, we will use the Sentence Transformers library to split text into chunks. The SentenceTransformersTokenTextSplitter sizes chunks using the tokenizer of a sentence-transformers model, which helps keep each chunk meaningful, contextually relevant, and within the embedding model's token window.

How It Works

  1. The splitter uses the sentence-transformers/all-mpnet-base-v2 model (or another model of your choice) to understand the document’s content at a sentence level.
  2. The documents are passed through the model, which produces tokens representing the text’s meaning.
  3. Based on sentence-level similarity, tokens are either grouped together or separated into distinct chunks.
  4. Finally, the tokens are decoded back into the original text, preserving their contextual integrity.

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

text_splitter = SentenceTransformersTokenTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
splits = text_splitter.split_documents(wikipedia_docs)
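
To sanity-check the splitter, you can inspect how many chunks were produced and peek at one of them (the exact counts and content will depend on your links.txt):

print(f"Created {len(splits)} chunks from {len(wikipedia_docs)} documents")
print(splits[0].page_content[:200])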

Step 3. Embedding Layer

We will use the all-MiniLM-L6-v2 model to generate embeddings for the document chunks. This process is similar to the one we discussed earlier with Sentence Transformers, where the text was tokenized. These tokens represent the document’s meaning and help the LLM efficiently retrieve relevant chunks.

Key Details

  • Padding Tokens: When embedding text, we add padding tokens to ensure that all inputs have a consistent size, even if some text chunks are shorter than others.
  • The all-MiniLM-L6-v2 model has an input size restriction of 256 tokens.
  • Padding ensures uniformity and avoids errors caused by varying input lengths.

from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings.client.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
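
As a quick check, you can embed a sample query and confirm the vector dimensionality (all-MiniLM-L6-v2 produces 384-dimensional embeddings); the query text here is just an example:

query_vector = embeddings.embed_query("What is an option contract?")
print(len(query_vector))  # 384 for all-MiniLM-L6-v2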

Step 4. Vector Store

We will use the FAISS class from langchain_community.vectorstores to create our vector store.

from langchain_community.vectorstores import FAISS

db = FAISS.from_documents(splits, embedding=embeddings)
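
Before wiring the store into a tool, you can verify semantic search directly on the vector store (the query string below is just an example):

results = db.similarity_search("How does options trading work?", k=2)
for doc in results:
    # Each result is a Document with the source URL in its metadata
    print(doc.metadata.get("source"), doc.page_content[:100])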

Step 5. Save index locally

Now that we have created the vector store from the splits/chunks using the previously defined embeddings, we will store the index locally.

db.save_local("faiss_index")

Step 6. Loading index from disk

When it's time to use the locally saved index in our application (i.e., when we serve it), we need to define the same embeddings again and can then load the index as follows:

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
embeddings.client.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
db = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)

Step 7. Create Retriever tool

To create the RAG tool, the first step is to define a retriever object. This object determines the strategy for retrieving relevant chunks. For example, we can retrieve the top-k results based on similarity or set a similarity threshold to return only those chunks that exceed it. In this tutorial, we will keep it simple and retrieve the single top chunk by similarity (a thresholded alternative is sketched below).

retriever = db.as_retriever(
        search_type="similarity", search_kwargs={"k": 1})

We use the retriever object to create a Retriever tool, which will be used in our agent. The arguments for creating the tool are as follows:

  1. The retriever object (first argument).
  2. The name of the tool (second argument).
  3. The description of the tool (third argument).

from langchain.tools.retriever import create_retriever_tool

retriever_tool = create_retriever_tool(
        retriever,
        "explain_financial_terms",
        "Explain financial terms in the query",)

That wraps up the development part for the RAG tool.

Finnhub API Tool Development

Step 8. Finnhub API Tool

First, let's create a function to send a GET request to the company-news endpoint of the Finnhub API.

import requests

def news_helper(symbol: str, start_date: str, last_date: str):
  API_KEY = os.environ["FINNHUB_API_KEY"]
  API_ENDPOINT = "https://finnhub.io/api/v1/company-news"
  queryString = f"{API_ENDPOINT}?symbol={symbol}&from={start_date}&to={last_date}&token={API_KEY}"

  # Send the request to the Finnhub company-news endpoint
  response = requests.get(queryString)
  # Read the response
  articles = response.json()[-5:]
  summaries = [article["summary"] for article in articles]
  return ",".join(summaries)

LangChain provides a simple way to implement tools using the @tool decorator. By adding this decorator, the function becomes a StructuredTool object. Below, we define the search_news_for_symbol tool, which includes documentation on when to use the tool and what inputs and outputs to expect. The news_helper function is called within search_news_for_symbol to send an API request to the Finnhub API endpoint.

from langchain_core.tools import tool

@tool
def search_news_for_symbol(symbol: str, start_date: str, last_date: str) -> str:
  """Search for news articles in a time period for a given ticker symbol. eg: NVDA, MSFT, TSLA etc.

  Args:
        symbol: The symbol to search for.
        start_date: The start date of the search.
        last_date: The last date of the search.
  Returns:
        A string containing the news articles.
  """

  company_news = news_helper(symbol=symbol, start_date=start_date, last_date=last_date)
  return company_news

When you print the search_news_for_symbol tool, you will see it is defined as a StructuredTool and has a func property, which references the search_news_for_symbol function (the Python function with the code). This completes the creation of the Finnhub API tool.
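
To confirm the tool works on its own, you can call it directly with invoke. The ticker and dates below are placeholders; any valid symbol and ISO dates supported by your Finnhub plan should work:

print(search_news_for_symbol.name)
print(search_news_for_symbol.invoke({
    "symbol": "NVDA",
    "start_date": "2024-06-01",
    "last_date": "2024-06-07",
})[:300])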

Step 9. Binding tools to LLM

LLM and Tools

Finally, we get to the LLM part of the article. In this step, we will define the LangChain OpenAI chat class and bind the tools to the LLM. By binding the tools, we allow the LLM to access and utilize them when needed.

We create a list containing the two tools, specifically the RAG retriever tool and the Finnhub API tool, and bind it to the LLM. This way, the LLM knows when to use these tools to fulfill specific tasks.

from langchain_openai import ChatOpenAI
tools = [retriever_tool, search_news_for_symbol]
llm = ChatOpenAI(model="gpt-4o-mini").bind_tools(tools)
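
As an optional sanity check, you can invoke the bound model directly and inspect which tool it decides to call; the exact arguments the model produces may vary:

ai_msg = llm.invoke("What happened with NVDA between 2024-06-01 and 2024-06-07?")
print(ai_msg.tool_calls)  # typically a call to search_news_for_symbol with the parsed dates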

Agent Development

Using the previously defined tools, we will now create an agent. As discussed in the concepts for an agent, we will define a graph with a state, nodes, and edges.

The StateGraph class is the main class in LangGraph for creating a graph for an agent. It uses the State as a variable to keep track of the current agent environment. For example, it can use the MessagesState class which helps to keep track of all messages in the agent.

from langgraph.graph import StateGraph, MessagesState
workflow = StateGraph(MessagesState)

The first node we define handles invoking the LLM with the conversation so far: the previous messages, the user's query, and any messages returned by the tools while processing the query.

def llm_node(state: MessagesState):
  messages = state['messages']
  response = llm.invoke(messages)
  # We return a list, because this will get added to the existing list
  return {"messages": [response]}

Next, we will define a ToolNode, which is essentially a Runnable (from LangChain). It takes in messages as input and returns the messages from the tools.

from langgraph.prebuilt import ToolNode
tool_node = ToolNode(tools) # tools: list of tools we defined earlier

Now, we will add the llm_node and tool_node to the graph.

workflow.add_node("llm_node", llm_node)  # agent
workflow.add_node("tools", tool_node)

We will define a should_continue function, which will be responsible for routing in the graph. If the last message in the state indicates that the LLM should call a tool, we will route the graph to the tools node. If the last message does not call any tool, we will route to the END node to stop the graph and return the response.

from typing import Literal
from langgraph.graph import END

def should_continue(state: MessagesState) -> Literal["tools", END]:
  messages = state['messages']
  last_message = messages[-1]
  # If the LLM makes a tool call, then we route to the "tools" node
  if last_message.tool_calls:
      return "tools"
  # Otherwise, we stop (reply to the user)
  return END

We've now added the nodes, but the graph still lacks edges between them. We will add the edges step by step, starting from the START node of the graph, which takes in the input. When adding an edge, the first argument is the FROM node, and the second argument is the TO node.

from langgraph.graph import START
workflow.add_edge(START, "llm_node")

Next, we will add a conditional edge, where the FROM node is fixed, but the TO node depends on the output of the should_continue function.

workflow.add_conditional_edges(
    "llm_node",
    should_continue,
)

Now, we have a conditional edge from the llm_node to the tools node, depending on the output of the should_continue function. We also need to add an edge from the tools node back to the llm_node.

workflow.add_edge("tools", 'llm_node')

We've defined the nodes and edges, and we're almost ready to compile the graph. Before the final step, let's define a place to store the state of the graph, or "checkpoint" it. This helps us keep track of the conversation and message history for the agent. Depending on the type of store used, messages may last only for a short time or be persisted indefinitely.

For now, we can use a MemorySaver for a quick and easy way to store messages in memory while we’re developing the agent. However, to ensure the messages aren’t lost when memory is cleared, we will store them in a database like Postgres when serving the application.

## In Memory store
from langgraph.checkpoint.memory import MemorySaver

checkpointer = MemorySaver()

Now that we've defined the graph and the checkpointer, let's compile it. We pass the checkpointer object when compiling the graph. After compiling, you can also render the graph visually to verify its structure. And that's it! The agent graph is ready for invocation.

graph = workflow.compile(checkpointer=checkpointer)
graph
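
If you are working in a notebook, one way to render the compiled graph (assuming the optional drawing dependencies are available) is via its Mermaid representation:

from IPython.display import Image, display

# Render the compiled graph as a Mermaid diagram
display(Image(graph.get_graph().draw_mermaid_png()))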

To invoke the graph, we can use a user prompt. We pass the prompt as a HumanMessage object, which helps the LLM keep track of the conversation and understand the source of the message.

from langchain_core.messages import HumanMessage

prompt = "Explain options trading"

final_state = graph.invoke(
    {"messages": [HumanMessage(content=prompt)]},
    config={"configurable": {"thread_id": 42}}
)

To get a more detailed understanding of the transitions that take place in the agent, you can print every message in the final_state.

for message in final_state["messages"]:
  print(message)

Output:

content='Explain options trading' additional_kwargs={} response_metadata={} id='b3ca2cc4-9604-4139-8b6f-e4105329d65a'
content='' additional_kwargs={'tool_calls': [{'id': 'call_tAJd7qTJxgeHNqaeiz5KKAyg', 'function': {'arguments': '{"query":"options trading"}', 'name': 'explain_financial_terms'}, 'type': 'function'}], 'refusal': None} response_metadata={'token_usage': {'completion_tokens': 19, 'prompt_tokens': 149, 'total_tokens': 168, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_0aa8d3e20b', 'finish_reason': 'tool_calls', 'logprobs': None} id='run-f4080ffb-4449-4837-921e-2c5a3604f16c-0' tool_calls=[{'name': 'explain_financial_terms', 'args': {'query': 'options trading'}, 'id': 'call_tAJd7qTJxgeHNqaeiz5KKAyg', 'type': 'tool_call'}] usage_metadata={'input_tokens': 149, 'output_tokens': 19, 'total_tokens': 168, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}
content='8 ] all made important improvements to the theory of options pricing. fischer black and myron scholes demonstrated in 1968 that a dynamic revision of a portfolio removes the expected return of the security, thus inventing the risk neutral argument. [ 9 ] [ 10 ] they based their thinking on work previously done by market researchers and practitioners including the work mentioned above, as well as work by sheen kassouf and edward o. thorp. black and scholes then attempted to apply the formula to the markets, but incurred financial losses, due to a lack of risk management in their trades. in 1970, they decided to return to the academic environment. [ 11 ] after three years of efforts, the formula — named in honor of them for making it public — was finally published in 1973 in an article titled " the pricing of options and corporate liabilities ", in the journal of political economy. [ 12 ] [ 13 ] [ 14 ] robert c. merton was the first to publish a paper expanding the mathematical understanding of the options pricing model, and coined the term " black – scholes options pricing model ". the formula led to a boom in options trading and provided mathematical legitimacy to the activities of the chicago board options exchange and other options markets around the world. [ 15 ] merton and scholes received the 1997 nobel memorial prize in economic sciences for their work, the committee citing their discovery of the risk neutral dynamic revision as a breakthrough that separates the option from the risk of the underlying security. [ 16 ] although ineligible for the prize because of his death in 1995, black was mentioned as a contributor by the swedish academy. [ 17 ] the black – scholes model assumes that the market consists of at least one risky asset, usually called the stock, and one riskless asset, usually called the money market, cash, or bond. the following assumptions are made about the assets ( which relate to the names of the assets' name='explain_financial_terms' id='bcf9ffc0-5a04-41e7-b95f-2acb1006eba8' tool_call_id='call_tAJd7qTJxgeHNqaeiz5KKAyg'
content='Options trading involves the buying and selling of options contracts, which are financial derivatives that provide the buyer the right, but not the obligation, to buy or sell an underlying asset at a predetermined price (the strike price) on or before a specific expiration date.\n\n### Key Concepts in Options Trading:\n\n1. **Options Contracts**: There are two main types of options contracts:\n   - **Call Options**: Gives the holder the right to buy the underlying asset at the strike price.\n   - **Put Options**: Gives the holder the right to sell the underlying asset at the strike price.\n\n2. **Strike Price**: The predetermined price at which the underlying asset can be bought or sold.\n\n3. **Expiration Date**: The date on which the option contract becomes void if not exercised.\n\n4. **Premium**: The price paid to purchase the option, which is a cost to the buyer and income to the seller (writer) of the option.\n\n5. **Underlying Asset**: The financial instrument (e.g., stock, commodity, index) that the option contract is based on.\n\n### Pricing Models:\nThe Black-Scholes model is one of the most well-known methods for pricing options. Developed by Fischer Black, Myron Scholes, and Robert Merton, it provides a mathematical formula to determine the fair price of options based on various factors, including the price of the underlying asset, the strike price, time to expiration, and volatility.\n\n### Uses of Options:\nOptions trading can be used for various purposes, including:\n- **Hedging**: Protecting against potential losses in an investment.\n- **Speculation**: Betting on the future price movement of an asset to generate profit.\n- **Income Generation**: Writing options to collect premiums.\n\n### Risks:\nOptions trading can be risky, especially for inexperienced traders. The potential for loss can be significant, particularly when trading strategies involve leverage or complex positions.\n\nOverall, options trading is a sophisticated financial practice that requires a good understanding of market dynamics, pricing models, and risk management strategies.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 415, 'prompt_tokens': 585, 'total_tokens': 1000, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_0aa8d3e20b', 'finish_reason': 'stop', 'logprobs': None} id='run-cef2ef0b-d6b6-4c4d-882f-a15537f9ce99-0' usage_metadata={'input_tokens': 585, 'output_tokens': 415, 'total_tokens': 1000, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'output_token_details': {'audio': 0, 'reasoning': 0}}

Serving the Agent

When serving the agent, we will move beyond a Python notebook and focus on how to serve the agent through a REST API. We'll structure the project by separating the steps to build and compile the agent from the tools themselves. Instead of a memory checkpointer, we will use a PostgreSQL database so the chat history persists beyond the lifetime of the process. Below is the directory structure we'll use:

.
├── agent.py
├── faiss_index
│   ├── index.faiss
│   └── index.pkl
├── main.py
├── requirements.txt
└── utils
    ├── __init__.py
    ├── nodes.py
    ├── state.py
    └── tools.py

3 directories, 9 files

Explanation of the Directory Structure:

  • agent.py: Contains the steps to add nodes, edges, and checkpointing to the graph.
  • faiss_index: Contains the FAISS index files (index.faiss and index.pkl).
  • utils: Contains files that define the nodes, state (if applicable), and tools in their respective files (a sketch of state.py follows this list).
  • main.py: Defines the API routes and handles the Postgres DB connection pool while invoking the agent.
  • requirements.txt: Lists the dependencies used to serve the agent.
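
The article uses LangGraph's prebuilt MessagesState, so state.py can stay minimal. A hypothetical sketch, in case you later want to extend the state with extra fields:

# utils/state.py
from langgraph.graph import MessagesState

class AgentState(MessagesState):
    # Extend here if the agent needs more than the message history,
    # e.g. a hypothetical user_id field for per-user behaviour.
    pass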

Step 0. Setting up PostgresDB

You can follow any tutorial on how to set up a PostgreSQL database. We just need the connection string for the database. In my case, the connection string is as follows:

"postgresql://username:password@localhost:5433/DatabaseName?sslmode=disable"

Step 1. Installing dependencies and setting environment variables

We will create a requirements.txt file for the packages we will use while serving. In addition to the previous packages, there are a few new ones:

  1. fastapi[standard]: We will use FastAPI, a quick way to create a REST API and good enough for our current use case.
  2. psycopg: The PostgreSQL adapter for Python. It will help us connect to PostgreSQL programmatically.
  3. psycopg-pool: This package creates a connection pool with PostgreSQL. A pool maintains and reuses connections instead of repeatedly opening and closing them.
  4. langgraph-checkpoint-postgres: LangGraph's implementation of the Checkpointer class for PostgreSQL.

Your requirements.txt will look as follows:

langchain
langchain-openai
langchain-community
unstructured
langgraph
faiss-cpu
sentence-transformers
fastapi[standard]
psycopg
psycopg-pool
langgraph-checkpoint-postgres
python-dotenv

To load environment variables locally, we can use the load_dotenv Python library. This is better suited for local development, but for hosting services like Heroku, follow the service’s instructions for setting environment variables.

Create a .env file in the root directory with the following content:

FINNHUB_API_KEY=c3****
OPENAI_API_KEY=sk****
DB_URI="postgresql://username:password@localhost:5433/DatabaseName?sslmode=disable"

In the main.py file, to load variables from the .env file, start with the following code:

from dotenv import load_dotenv
load_dotenv()

Step 2. Copy the FAISS index to root directory

Copy the folder containing the FAISS index that we saved earlier into the root directory. This folder should include the index.faiss and index.pkl files.

Step 3. Copy nodes and tools in the utils subdirectory

In utils/tools.py, define the tools as follows:

from langchain_core.tools import tool
import requests
import os
from langchain.tools.retriever import create_retriever_tool
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings


def get_retriever_tool():
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    embeddings.client.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    db = FAISS.load_local(
        "faiss_index", embeddings, allow_dangerous_deserialization=True
    )
    retriever = db.as_retriever(
        search_type="similarity", search_kwargs={"k": 1})
    retriever_tool = create_retriever_tool(
        retriever,
        "explain_financial_terms",
        "Explain financial terms in the query",)
    return retriever_tool


def news_helper(symbol: str, start_date: str, last_date: str):
    FINNHUB_API_KEY = os.environ["FINNHUB_API_KEY"]
    API_ENDPOINT = "https://finnhub.io/api/v1/company-news"
    queryString = f"{API_ENDPOINT}?symbol={symbol}&from={start_date}&to={last_date}&token={FINNHUB_API_KEY}"

    # Send the request to the Finnhub company-news endpoint
    response = requests.get(queryString)
    # Read the response
    articles = response.json()[-5:]
    summaries = [article["summary"] for article in articles]
    return ",".join(summaries)


@tool
def search_news_for_symbol(symbol: str, start_date: str, last_date: str) -> str:
    """Search for news articles in a time period for a given ticker symbol. eg: NVDA, MSFT, TSLA etc.

     Args:
          symbol: The symbol to search for.
          start_date: The start date of the search.
          last_date: The last date of the search.
    Returns:
          A string containing the news articles.
    """

    company_news = news_helper(
        symbol=symbol, start_date=start_date, last_date=last_date)
    return company_news


def get_tools():
    return [get_retriever_tool(), search_news_for_symbol]

As you can see, most of the code comes straight from the development phase. The steps to create the retriever tool are wrapped in get_retriever_tool(), which is called only once, and the get_tools() function returns the list of tools from tools.py.

In utils/nodes.py, define the nodes:

from .tools import get_tools
from langgraph.prebuilt import ToolNode
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import END, MessagesState
from typing import Literal

tools = get_tools()
llm = ChatOpenAI(model="gpt-4o-mini").bind_tools(tools)


def agent(state: MessagesState):
    messages = state['messages']
    response = llm.invoke(messages)
    return {"messages": [response]}

tool_node = ToolNode(tools)

def should_continue(state: MessagesState) -> Literal["tools", END]:
    messages = state['messages']
    last_message = messages[-1]
    # If the LLM makes a tool call, then we route to the "tools" node
    if last_message.tool_calls:
        return "tools"
    # Otherwise, we stop (reply to the user)
    return END

We bind the tools to the LLM in this file. We define the nodes like we did in the development phase.

Step 4. Add graph creation and compilation steps

In agent.py, define the graph creation and compilation steps:

from utils.nodes import agent, tool_node, should_continue
from langgraph.graph import START, StateGraph, MessagesState


def get_graph(checkpointer):
    workflow = StateGraph(MessagesState)
    workflow.add_node("agent", agent)  # agent
    workflow.add_node("tools", tool_node)

    workflow.add_edge(START, "agent")
    workflow.add_conditional_edges(
        "agent",
        should_continue,
    )
    workflow.add_edge("tools", 'agent')

    graph = workflow.compile(checkpointer=checkpointer)
    return graph

Note: We pass a checkpointer from main.py to agent.py.

Step 5. Setting up the API

main.py:

from dotenv import load_dotenv
load_dotenv()
from fastapi import FastAPI
from agent import get_graph
from langchain_core.messages import HumanMessage
from psycopg_pool import ConnectionPool
import os
from langgraph.checkpoint.postgres import PostgresSaver

app = FastAPI()

connection_kwargs = {
    "autocommit": True,
    "prepare_threshold": 0,
}

pool = ConnectionPool(
    # Example configuration
    conninfo=os.environ['DB_URI'],
    max_size=20,
    kwargs=connection_kwargs,
    )

@app.get("/")
def query_llm(query: str) -> str:
    checkpointer = PostgresSaver(pool)
    # checkpointer = MemorySaver()
    checkpointer.setup()
    graph = get_graph(checkpointer)
    final_state = graph.invoke(
        {"messages": [HumanMessage(content=query)]},
        config={"configurable": {"thread_id": "1"}}
    )
    last_message = final_state['messages'][-1].content
    return last_message

@app.on_event("shutdown")
async def shutdown_event():
    pool.close()

  1. First, we load the environment variables using the python-dotenv library. We also import all the requirements for serving the API.
  2. We call the FastAPI class and store it in the app variable.
  3. We then create a ConnectionPool with our Postgres DB, using the connection string stored in the DB_URI environment variable.
  4. We define a GET method on the "/" route and add a handler function.
  5. In query_llm(), we take the query as a request parameter, define and set up the checkpointer, and get the graph from get_graph() in agent.py.
  6. We invoke the graph with the query and return the last message of the final state, which is the response.
  7. At the end of the file, we add an event handler for app shutdown. This function runs when the API is shutting down; here, we close the connection pool to the Postgres DB.

You can run the API using:

fastapi dev main.py

The API is served by default on localhost:8000. You can send requests using a tool like Postman, or test it out with the FastAPI docs at localhost:8000/docs. Here is an example query:

"Explain NVDA trend from 01/01/2024 to 01/01/2025"

You can also use curl as follows for the above query:

curl -X 'GET' \
  'http://127.0.0.1:8000/?query=Explain%20NVDA%20trend%20from%2001%2F01%2F2024%20to%2001%2F01%2F2025' \
  -H 'accept: application/json'

When you query the agent through the API, LangGraph's Postgres checkpointer creates a few tables in the database you specified, and the checkpoints table will contain the stored conversation state after the request above.

Conclusion

This process covers the complete lifecycle of building and serving a RAG agent using FastAPI and PostgreSQL:

  1. Development: Defining tools, nodes, and graph to manage state.
  2. Serving: Using FastAPI to serve the agent and handle requests.
  3. Persistence: Storing conversation history in PostgreSQL for long-term storage.

By following these steps, you'll have a fully functional REST API that interacts with a RAG agent while maintaining state across multiple interactions. You can test, monitor, and scale the service as needed. This wraps up the whole RAG agent lifecycle, from development to serving!

Few additional points:

  1. For hosting the FastAPI app, render.com is a good platform with a generous free tier.
  2. If you do host the API, make sure to add authentication or rate limiting so you are not billed an excessively large amount.
  3. The libraries and components used in the RAG tool can be easily swapped with alternatives based on your preferences or requirements.
