构建具有图像字幕和可视化问答功能的聊天应用程序

Rifx.Online
Programming , Chatbots , Computer Vision
24 Jan, 2025

学习如何使用 Chainlit 创建互动聊天用户界面

在我之前的文章中，我向您介绍了 Chainlit，这个开源的 Python 库，使得创建基于聊天的用户界面变得简单：

在那篇文章中，我涵盖了聊天应用的基本结构，并演示了如何处理文本和文件输入，特别是图像。在这篇文章中，我将基于这个基础，深入探讨 Chainlit。这一次，我们将创建一个完全功能的聊天应用程序，使您能够：

与大型语言模型（LLM）进行对话并提出问题。
上传一张图片并为其生成描述性标题。
上传一张图片并提出与其内容相关的问题。

让我们填补空白，让这个聊天应用程序变得生动起来！

构建应用程序

作为提醒，这里是我们应用程序的文件夹结构。ChainlitDemo.py 文件位于 chainlitapp 文件夹内：

chainlitapp
     |___ChainlitDemo.py

让我们开始构建应用程序的主要框架。首先将 ChainlitDemo.py 文件填充以下代码：

import chainlit as cl

@cl.on_message
async def main(message: cl.Message):
    # if the message contains attachments
    if message.elements:
        for element in message.elements:
            if element.type == "image":    # if the attachement in an image                
                await cl.Message(
                    content=f"You sent an image and asked: '{message.content}'",                    
                ).send()
    else:
        await cl.Message(
            content=f"You said: {message.content}",
        ).send()

要运行应用程序，请在终端中输入以下命令：

$ chainlit run ChainlitDemo.py -w

使用这个基础应用程序，您可以输入一条消息：

消息将会被回显给您：

您还可以附加一张图片并输入一些文本：

然后图片将被上传到应用程序。同时，您输入的文本将被回显给您：

与 llama3.2 聊天

现在基础应用程序已经设置好，让我们开始实现第一个功能：与大型语言模型（LLM）聊天。为此，我将使用 Ollama。我假设您已经安装了 Ollama，并且正在运行 llama3.2 模型。

如果您需要关于 Ollama 的复习，请查看我之前的两篇文章：

将以下代码片段添加到 ChainlitDemo.py 文件中：

import chainlit as cl
import requests

## =======================================================================
## 初始化对话历史
conversation_history = []

@cl.on_chat_start
async def handle_new_chat():
    """
    当新聊天开始时触发此函数。
    使用它来重置对话状态或初始化变量。
    """
    global conversation_history
    # 清除新聊天的对话历史
    conversation_history = []
    # 向用户发送欢迎消息
    await cl.Message(content="你好！今天我能为您提供什么帮助？").send()

def chat(message):
    url = "http://localhost:11434/api/generate"
    model = "llama3.2"  
    headers = {
        "Content-Type": "application/json",
    }

    # 将用户消息添加到对话历史中
    conversation_history.append(
        {
            "role": "User", 
            "content": message
        })
    
    # 用对话历史格式化提示
    formatted_prompt = ""
    for turn in conversation_history:
        formatted_prompt += f"{turn['role']}: {turn['content']}\n"
       
    data = {
        "model": model,
        "prompt": formatted_prompt.strip(),
        "stream": False
    }
    response = requests.post(url, json = data, headers = headers)
    
    # 将助手的响应添加到对话历史中
    conversation_history.append(
        {
            "role": "Assistant", 
            "content": response.json()["response"]
        })    
    return response.json()["response"]
## =======================================================================

@cl.on_message
async def main(message: cl.Message):
    # 如果消息包含附件
    if message.elements:
        for element in message.elements:
            if element.type == "image":            # 如果附件是图片                
                await cl.Message(
                    content=f"您发送了一张图片并询问：'{message.content}'",                    
                ).send()
    else:
        await cl.Message( 
            # =================================
            content=f"{chat(message.content)}",
            # =================================
        ).send()

在上述代码片段中，我们添加了以下内容：

conversation_history 变量是一个列表，用于存储用户与模型之间的对话历史。这是至关重要的，以便模型能够在对话之间保持上下文。
handle_new_chat() 函数在用户想要创建新对话时被调用。这是清除 conversation_history 变量的地方，以便删除先前的对话。
chat() 函数接收来自用户的消息并将其发送到 LLM，在本例中是通过 Ollama 的 REST 端点 (http://localhost:11434/api/generate) 暴露的 llama3.2。请注意，用户输入的所有消息和来自 LLM 的响应都被添加到 conversation_history 列表中。
当 main() 函数接收到用户的文本输入时，它调用 chat() 函数并将响应返回给用户：

    else:
        await cl.Message(            
            content=f"{chat(message.content)}",
        ).send()

让我们试试。首先，让我们请 LLM 给我们讲个笑话：

果然，笑话送到了：

既然它问您是否想听一个笑话，那就让我们回应：“当然，再来一个！”。得益于保存的对话历史，LLM 现在可以理解您回复的上下文并无缝地继续：

要创建新对话，请点击屏幕左上角的图标：

当您点击确认时，屏幕会被清除，conversation_history 变量会被重置，所有先前的上下文都会被擦除：

请继续提问！然后，跟进一个相关的问题，以测试 LLM 是否能够保持对话的上下文。这将有助于验证其有效处理多轮对话的能力。

图像标题生成

我们即将实现的下一个功能是 图像标题生成。为此，我们将使用 Salesforce/blip-image-captioning-base 模型，我在之前关于多模态模型的文章中已经介绍过：

将以下代码片段添加到 ChainlitDemo.py 文件中：

import chainlit as cl

## ==================================================================
## Image captioning
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

## Load the BLIP model and processor
model_name_caption = "Salesforce/blip-image-captioning-base"
processor_caption = BlipProcessor.from_pretrained(model_name_caption)
model_caption = BlipForConditionalGeneration.from_pretrained(model_name_caption)

def caption_image(image):
    # Prepare the image for the model
    inputs = processor_caption(image, return_tensors="pt")

    # Generate the caption
    caption_ids = model_caption.generate(**inputs)
    caption = processor_caption.decode(caption_ids[0], skip_special_tokens=True)
    return caption
## ==================================================================

import requests

## Initialize conversation history
conversation_history = []

@cl.on_chat_start
async def handle_new_chat():
   ...
   ...

def chat(message):
   ...
   ...

@cl.on_message
async def main(message: cl.Message):
    # if the message contains attachments
    if message.elements:
        for element in message.elements:
            if element.type == "image":  # if the attachement in an image                
                # ====================================================
                # Image captioning
                # send back the image caption
                caption = caption_image(Image.open(str(element.path)))
                await cl.Message(
                    content=f"Caption: {caption}",
                ).send()
                # ====================================================
    else:
        await cl.Message(            
            content=f"{chat(message.content)}",
        ).send()

在您添加的上述代码片段中，您：

加载了 Salesforce/blip-image-captioning-base 模型，以帮助您执行图像标题生成。
定义了一个名为 caption_image() 的函数。该函数将用户上传的图像传递给模型，并获取其生成的标题。
将标题发送回用户。

现在让我们试一试。对于这个例子，我将附上以下包含一排校车的图像：

一旦图像附加完成，输入一些文本并按回车：

Salesforce/blip-image-captioning-base 模型现在将获取图像并生成标题，然后将其返回给用户，如下所示：

这不是很有趣吗？尝试各种图像，检查标题是否准确。

视觉问答

接下来我们将实现的功能是 视觉问答 (VQA)。为此，我们将使用 Salesforce/blip-vqa-base 模型。将以下代码片段添加到 ChainlitDemo.py 文件中：

import chainlit as cl
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

## Load the BLIP model and processor
model_name_caption = "Salesforce/blip-image-captioning-base"
processor_caption = BlipProcessor.from_pretrained(model_name_caption)
model_caption = BlipForConditionalGeneration.from_pretrained(model_name_caption)

def caption_image(image):
  ...
  ...

## ==================================================================
## VQA
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

## Load the pre-trained BLIP model and processor for VQA
model_name_vqa = "Salesforce/blip-vqa-base"
processor_vqa = BlipProcessor.from_pretrained(model_name_vqa)
model_vqa = BlipForQuestionAnswering.from_pretrained(model_name_vqa)

def visual_qa(image, question):
    inputs = processor_vqa(image, question, return_tensors="pt")
    out = model_vqa.generate(**inputs)
    answer = processor_vqa.decode(out[0], skip_special_tokens=True)
    return answer
## ==================================================================

import requests

## Initialize conversation history
conversation_history = []

@cl.on_chat_start
async def handle_new_chat():
  ...
  ...

def chat(message):
  ...
  ...

@cl.on_message
async def main(message: cl.Message):
    # if the message contains attachments
    if message.elements:
        for element in message.elements:
            if element.type == "image":  # if the attachement in an image  
                # ==========================================================
                # VQA
                # send back the image caption and the answer to the question
                caption = caption_image(Image.open(str(element.path)))
                answer = visual_qa(Image.open(str(element.path)), message.content)
                await cl.Message(
                    content=f"Caption: {caption},\n Answer: {answer}",                    
                ).send()
                # ==========================================================
    else:
        await cl.Message(            
            content=f"{chat(message.content)}",
        ).send()