Let’s build a Text Analysis Pipeline with LangGraph Agents
In this article, I will introduce LangGraph, a framework for building applications as graph-based workflows. I will share my experience with LangGraph and its key features, and then build a text analysis pipeline that illustrates what LangGraph can do.
Understanding LangGraph
Essentially, LangGraph is built around graph-based workflows: each node performs a specific computational step, and edges determine how data flows between nodes, optionally under conditions. This design gives applications a high degree of flexibility and modularity, making the framework well suited to complex tasks such as those found in natural language processing (NLP).
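As a mental model, the node-and-edge idea can be sketched in plain Python (the node names below are illustrative, not the LangGraph API):

```python
# A tiny pure-Python picture of the graph idea: nodes are steps, and each
# edge says where a node's output flows next. This models a linear chain.
graph = {
    "classify": "extract_entities",
    "extract_entities": "summarize",
    "summarize": None,  # end of the workflow
}

def walk(graph, start):
    """Follow the edges from a start node and return the execution order."""
    order, node = [], start
    while node is not None:
        order.append(node)
        node = graph[node]
    return order

walk(graph, "classify")  # → ["classify", "extract_entities", "summarize"]
```

LangGraph generalizes this picture with shared state, conditional edges, and persistence, as the following sections show.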
Key Features
- State Management: Perhaps LangGraph's most distinctive capability is maintaining state across nodes, so the application keeps context and can respond appropriately to the user's actions or input.
- Flexible Routing: The framework supports dynamic data routing between nodes, allowing for complex decision-making processes within workflows. This flexibility is essential for applications that require adaptability based on varying inputs.
- Persistence: LangGraph includes built-in persistence capabilities, enabling workflows to save their state after each step. This feature is crucial for applications that need to recover from interruptions or support human-in-the-loop interactions.
- Visualization: The graph-based structure allows developers to visualize workflows easily, which aids in understanding how different components interact and the overall flow of data within the application.
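The state-management idea can be illustrated with a pure-Python sketch (this mimics LangGraph's behavior but is not the library itself; the node outputs are hypothetical):

```python
# LangGraph-style state management in miniature: each node receives the full
# state but returns only the keys it updates, and the runner merges those
# partial updates back into the shared state.

def classify(state):
    return {"classification": "News"}          # hypothetical node output

def extract(state):
    return {"entities": ["OpenAI", "GPT-4"]}   # hypothetical node output

def run_pipeline(state, nodes):
    for node in nodes:
        state = {**state, **node(state)}       # merge the partial update
    return state

result = run_pipeline({"text": "OpenAI announced GPT-4."}, [classify, extract])
# result keeps the original "text" key and gains "classification" and "entities"
```

Returning partial updates rather than whole states is exactly the convention our node functions will follow later in this tutorial.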
Our Model for This Project: Text Analysis Pipeline
In this tutorial, we will build a multi-stage pipeline for text analysis using LangGraph. The pipeline will deal with processing a given text in three main steps:
1. Text Classification
In the initial stage, we classify the input text into predefined categories such as News, Blog, Research, or Other. The classification at this node determines the nature of the text and guides subsequent processing steps.
2. Entity Extraction
Next, we identify and extract the key entities in the text: people, organizations, and locations. Entity extraction deepens our understanding of the text and sets the stage for more detailed analysis.
3. Text Summarization
Finally, we generate a short summary of the input text, condensing the essential information for the reader. The summarization node runs after the classification and entity extraction stages, so their results are already available in the shared state.
Building the Pipeline
For the construction of this pipeline in LangGraph, we shall create nodes for each stage of processing and then lay down edges that will define the flow of data through these nodes.
- Define Nodes: We will represent each processing function (classification, entity extraction, summarization) as a node in our graph.
- Establish Edges: We will connect the nodes with edges so that the output of one stage becomes the input to the next.
- Implement Logic: Where needed, we can define conditional logic that determines which path to take based on the classification results or the extracted entities.
This process yields a modular, extensible workflow that can easily be modified or expanded as text analysis requirements change.
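The conditional logic mentioned above can be sketched as a routing function (in real LangGraph, a function like this would be passed to `add_conditional_edges`; the routing rule and category names here are purely illustrative):

```python
# Hypothetical router: inspect the classification result in the state and
# return the name of the next node to run.

def route_after_classification(state):
    # Illustrative rule: send research papers through entity extraction,
    # and route everything else straight to summarization.
    if state["classification"] == "Research":
        return "entity_extraction"
    return "summarization"

route_after_classification({"classification": "Research"})  # → "entity_extraction"
route_after_classification({"classification": "Blog"})      # → "summarization"
```

Our tutorial pipeline below uses plain (unconditional) edges, but this is the shape a conditional branch would take.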
Import Required Libraries
This cell imports all necessary modules and classes for our LangGraph tutorial.
import os
from typing import TypedDict, List
from langgraph.graph import StateGraph, END
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage
from langchain_core.runnables.graph import MermaidDrawMethod
from IPython.display import display, Image
from dotenv import load_dotenv
Set Up API Key
This cell loads the environment variables and configures the OpenAI API key. You need a .env file containing your OPENAI_API_KEY.
## Load environment variables
load_dotenv()
## Set OpenAI API key
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')
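Note that if the key is missing, `os.getenv` returns None and the assignment above fails with a confusing TypeError. A small guard (a hypothetical helper, not part of the tutorial) makes the failure mode explicit:

```python
# Fail early with a clear message instead of assigning None to os.environ
# when the key is missing from the environment.
import os

def require_api_key(name: str) -> str:
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

With this in place, `os.environ["OPENAI_API_KEY"] = require_api_key("OPENAI_API_KEY")` surfaces a missing key immediately.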
Building the Text Processing Pipeline
Define the State and Set Up the LLM
Here, we define the State class to manage our workflow data and initialize the ChatOpenAI model.
class State(TypedDict):
    text: str
    classification: str
    entities: List[str]
    summary: str
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
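It is worth noting that a TypedDict only provides static type hints; at runtime, a State is an ordinary dict, which is why nodes can read and merge it freely. A quick self-contained illustration:

```python
# TypedDict declares the expected keys for type checkers, but at runtime the
# value is a plain dict with no validation.
from typing import List, TypedDict

class State(TypedDict):
    text: str
    classification: str
    entities: List[str]
    summary: str

state: State = {"text": "hello", "classification": "", "entities": [], "summary": ""}
isinstance(state, dict)  # → True
```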
Define Node Functions
Here we define the functions that perform the operations at each node of our graph: classification, entity extraction, and summarization.
def classification_node(state: State):
    ''' Classify the text into one of the categories: News, Blog, Research, or Other '''
    prompt = PromptTemplate(
        input_variables=["text"],
        template="Classify the following text into one of the categories: News, Blog, Research, or Other.\n\nText:{text}\n\nCategory:"
    )
    message = HumanMessage(content=prompt.format(text=state["text"]))
    classification = llm.invoke([message]).content.strip()
    return {"classification": classification}

def entity_extraction_node(state: State):
    ''' Extract all the entities (Person, Organization, Location) from the text '''
    prompt = PromptTemplate(
        input_variables=["text"],
        template="Extract all the entities (Person, Organization, Location) from the following text. Provide the result as a comma-separated list.\n\nText:{text}\n\nEntities:"
    )
    message = HumanMessage(content=prompt.format(text=state["text"]))
    entities = llm.invoke([message]).content.strip().split(", ")
    return {"entities": entities}

def summarization_node(state: State):
    ''' Summarize the text in one short sentence '''
    prompt = PromptTemplate(
        input_variables=["text"],
        template="Summarize the following text in one short sentence.\n\nText:{text}\n\nSummary:"
    )
    message = HumanMessage(content=prompt.format(text=state["text"]))
    summary = llm.invoke([message]).content.strip()
    return {"summary": summary}
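One caveat: the entity node splits the model's output on the exact string ", ", which breaks if the model omits a space or emits a trailing comma. A slightly more defensive parser (a hypothetical helper, not part of the pipeline above) looks like this:

```python
# Parse a comma-separated entity list tolerantly: strip surrounding
# whitespace from each item and drop any empty entries.

def parse_entities(raw: str) -> list[str]:
    return [item.strip() for item in raw.split(",") if item.strip()]

parse_entities("OpenAI, GPT-4 , ,GPT-3")  # → ["OpenAI", "GPT-4", "GPT-3"]
```

Swapping this in for the bare `.split(", ")` makes `entity_extraction_node` more robust to formatting variation in the LLM's response.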
Build the Workflow
This cell constructs the StateGraph workflow.
workflow = StateGraph(State)
## Add nodes to the graph
workflow.add_node("classification_node", classification_node)
workflow.add_node("entity_extraction", entity_extraction_node)
workflow.add_node("summarization", summarization_node)
## Add edges to the graph
workflow.set_entry_point("classification_node") # Set the entry point of the graph
workflow.add_edge("classification_node", "entity_extraction")
workflow.add_edge("entity_extraction", "summarization")
workflow.add_edge("summarization", END)
## Compile the graph
app = workflow.compile()
Visualizing the Workflow
This cell visualizes the workflow graph as a Mermaid diagram.
display(
    Image(
        app.get_graph().draw_mermaid_png(
            draw_method=MermaidDrawMethod.API,
        )
    )
)
Testing the Pipeline
This cell runs a sample text through our pipeline and displays the results.
sample_text = """
OpenAI has announced the GPT-4 model, which is a large multimodal model that exhibits human-level performance on various professional benchmarks. It is developed to improve the alignment and safety of AI systems.
Additionally, the model is designed to be more efficient and scalable than its predecessor, GPT-3. The GPT-4 model is expected to be released in the coming months and will be available to the public for research and development purposes.
"""
state_input = {"text": sample_text}
result = app.invoke(state_input)
print("Classification:", result["classification"])
print("\nEntities:", result["entities"])
print("\nSummary:", result["summary"])
Output:
Classification: News
Entities: ['OpenAI', 'GPT-4', 'GPT-3']
Summary: OpenAI's upcoming GPT-4 model is a multimodal AI that aims for human-level performance, improved safety, and greater efficiency compared to GPT-3.
Conclusion
In this tutorial, we have:
- Studied the concepts of LangGraph
- Constructed a text-processing pipeline
- Showed an application of LangGraph in data processing workflows
- Visualized this workflow in Mermaid
LangGraph is a general framework for constructing complex graph-based workflows, and this pipeline is one example of its uses beyond conversational agents.