Qwen2-VL: A Vision Language Model That Runs Locally

This is an introduction to Qwen2-VL, a machine learning model that can be used with ailia SDK. You can easily use this model to create AI applications using ailia SDK, as well as many other ready-to-use ailia MODELS.

Overview

Qwen2-VL is a vision language model released by Alibaba in October 2024. It is available in three sizes (2B, 7B, and 72B) and enables users to ask questions about images using text, similar to the GPT-4 vision API.

Applications include multilingual image-text understanding, code/math reasoning, video analysis, live chat, and agents.

Previously, LLaVA was commonly used as an open-source solution for such tasks. However, it had certain limitations: its smallest model was relatively large at 7B, and it lacked support for some languages, such as Japanese. Qwen2-VL addresses these issues by providing a 2B model and supporting Japanese.

Architecture

In Qwen2-VL, input images are tokenized and combined with the prompt text, then transformed into latent representations using a Vision Encoder before being fed into the QwenLM Decoder. It also supports videos, where up to 30 frames can be tokenized together.
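As a concrete illustration of the video path, the sketch below evenly samples up to 30 frames from a clip so they can be tokenized together. It assumes OpenCV for decoding; the actual preprocessing in the Qwen2-VL pipeline may differ.

```python
import cv2  # assumption: OpenCV for frame decoding; Qwen2-VL's own preprocessing may differ

def sample_frames(video_path: str, max_frames: int = 30):
    """Evenly sample up to `max_frames` RGB frames so a clip fits in one vision input."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // max_frames, 1)
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if len(frames) == max_frames:
            break
    cap.release()
    return frames  # list of (H, W, 3) arrays, ready to be patchified and encoded
```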

Vision Language Models (VLMs) usually face the following challenges:

  • Encoding input images at a fixed resolution
  • Using CLIP as the Vision Encoder

Qwen2-VL addresses these issues by:

  • Handling input images at their native resolution, embedding positional information with RoPE
  • Using Vision Transformers (ViT) as the Vision Encoder and making it trainable

These improvements enhance the model’s accuracy.
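To make the RoPE point concrete, here is a generic rotary position embedding sketch. It is not Qwen2-VL's exact multimodal RoPE, which extends the same idea to the 2D patch grid of an image (and time for video), but it shows how positional information is injected by rotating feature pairs.

```python
import numpy as np

def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Generic rotary position embedding: rotate feature pairs by a position-dependent angle.

    x:         (seq_len, dim) token features, dim must be even
    positions: (seq_len,) integer positions (for image patches these can be derived
               from the 2D patch grid, which is roughly what Qwen2-VL does)
    """
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))   # (half,)
    angles = positions[:, None] * freqs[None, :]       # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# toy usage: 16 patch tokens with 64-dimensional features
tokens = np.random.rand(16, 64)
rotated = rope(tokens, np.arange(16))
print(rotated.shape)  # (16, 64)
```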

Qwen2-VL's training process is as follows:

  1. The first stage involves training the ViT
  2. The second stage trains all parameters, including those of the LLM
  3. In the final stage, ViT parameters are frozen, and instruction tuning is performed using an Instruction Dataset

During pretraining, 600 billion tokens are used. The LLM is initialized with Qwen2 parameters. In the second stage, an additional 800 billion image-related tokens are processed, bringing the total to 1.4 trillion tokens.
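Sketched below is what the stage-wise freezing could look like in PyTorch. The attribute names `visual` and `language_model` are hypothetical stand-ins for the ViT encoder and the QwenLM decoder, not the actual implementation.

```python
import torch.nn as nn

class DummyVLM(nn.Module):
    """Toy stand-in for the model; `visual` and `language_model` are hypothetical names."""
    def __init__(self):
        super().__init__()
        self.visual = nn.Linear(8, 8)          # stand-in for the ViT encoder
        self.language_model = nn.Linear(8, 8)  # stand-in for the QwenLM decoder

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: DummyVLM, stage: int) -> None:
    if stage == 1:    # stage 1: train the ViT only
        set_trainable(model, False)
        set_trainable(model.visual, True)
    elif stage == 2:  # stage 2: train all parameters
        set_trainable(model, True)
    elif stage == 3:  # stage 3: freeze the ViT, instruction-tune the rest
        set_trainable(model, True)
        set_trainable(model.visual, False)

model = DummyVLM()
configure_stage(model, 3)
print(any(p.requires_grad for p in model.visual.parameters()))  # False
```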

Performance

Qwen2-VL-72B outperforms GPT-4o in benchmark performance.

The graph below is a performance comparison of the 2B, 7B, and 72B models. While the 72B model delivers the highest accuracy, the 2B model also demonstrates solid performance.

Qwen2-VL-2B is the most efficient model, providing sufficient performance for most scenarios. The 7B model significantly enhances text recognition and video understanding capabilities. The 72B model further improves instruction adherence, decision-making, and agent-related capabilities.

The Vision Encoder has a fixed parameter count of 675M, ensuring high image recognition performance regardless of the model size. As a result, tasks like OCR can achieve high performance even with the 2B model.

Prompt templates

Qwen2-VL utilizes special tokens such as <|vision_start|> and <|vision_end|> for vision-related input. In dialogue, <|im_start|> is used. For encoding bounding boxes, <|box_start|> and <|box_end|> are employed. To link bounding boxes with captions, <|object_ref_start|> and <|object_ref_end|> are used.

Here is the prompt used when running a sample. <|image_pad|> is replaced with the tokenized values of the image and supplied to the Vision Encoder.

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>
<|im_start|>assistant

When the input tokens are of size (1, 913), the output from the Vision Encoder will be (1, 913, 1536). This output is then fed into the QwenLM Decoder to generate text.
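The sketch below illustrates, with made-up positions and random values, how the vision features are spliced into the embedded token sequence at the <|image_pad|> positions before being passed to the decoder.

```python
import numpy as np

# Illustrative only: positions and values are made up, shapes follow the example above.
hidden_size = 1536
token_embeds = np.random.rand(1, 913, hidden_size)   # embedded prompt incl. <|image_pad|> slots
image_pad_mask = np.zeros((1, 913), dtype=bool)
image_pad_mask[0, 20:820] = True                      # hypothetical <|image_pad|> positions
vision_features = np.random.rand(image_pad_mask.sum(), hidden_size)

decoder_input = token_embeds.copy()
decoder_input[image_pad_mask] = vision_features       # spliced sequence fed to the QwenLM Decoder
print(decoder_input.shape)                            # (1, 913, 1536)
```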

Tokenizer

Qwen2-VL uses Qwen2Tokenizer as its tokenizer. Qwen2Tokenizer is compatible with GPT2Tokenizer and employs the same BPE-based method.
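As a quick way to inspect the tokenization, the Hugging Face transformers tokenizer for the model can be used (an assumption here; the ailia sample may ship its own tokenizer files):

```python
from transformers import AutoTokenizer

# "Qwen/Qwen2-VL-2B-Instruct" is the model ID on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
text = "<|vision_start|><|image_pad|><|vision_end|>Describe this image."
ids = tokenizer(text)["input_ids"]
print(len(ids), tokenizer.convert_ids_to_tokens(ids)[:5])
```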

Usage

To run Qwen2-VL with ailia SDK (version 1.5 or later), use the following command. The 2B model has a size of 10GB (FP32), and Japanese can also be used in the prompt.

$ python3 qwen2_vl.py --input demo.jpeg --prompt "Describe this image."

When running inference on an M2 Mac CPU, it takes approximately 23 seconds for image embedding (1024x683 resolution) and about 42 seconds for text generation. The processing time for image embedding depends on the image resolution, while text generation time is less affected by resolution.

A reduced FP16 version of the ONNX model is also available. To use the FP16 version, include the --fp16 option.

$ python3 qwen2_vl.py --input demo.jpeg --prompt "Describe this image." --fp16

Qwen2-VL does not currently work with llama.cpp, but support is being addressed in Issue #9426.

Output examples

Here is an example of querying an image with Qwen2-VL.

Describe this image.

The image depicts a serene beach scene with a woman and a dog. The woman is sitting on the sand, wearing a plaid shirt and black pants, and appears to be smiling. She is giving a high-five to the dog, which is sitting on the sand next to her. The dog is wearing a colorful harness and appears to be wagging its tail. The background shows the ocean with gentle waves, and the sky is clear with a soft glow, suggesting either sunrise or sunset. The overall atmosphere is peaceful and joyful.

Queries can be made in Japanese, and Japanese OCR is also supported.

Applications

An application called ColQwen2 has been proposed, which uses the projected features of the Qwen2-VL Vision Encoder output to perform PDF-based Retrieval-Augmented Generation (RAG). Once the relevant pages are retrieved, users can ask questions about the page images through Qwen2-VL prompts to get answers.

Traditional RAG processes everything as text, which makes it difficult to handle charts and diagrams. By instead processing pages as images with the ColPali approach applied to Qwen2-VL (ColQwen2), this issue can be addressed effectively.
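The retrieval step behind the ColPali family, including ColQwen2, is late-interaction (MaxSim) scoring between query token embeddings and page patch embeddings. Below is a minimal, illustrative sketch with made-up dimensions.

```python
import numpy as np

def maxsim_score(query_embeds: np.ndarray, page_embeds: np.ndarray) -> float:
    """Late-interaction scoring: for each query token, take the best-matching page
    patch embedding and sum the similarities across query tokens."""
    sims = query_embeds @ page_embeds.T       # (num_query_tokens, num_page_patches)
    return float(sims.max(axis=1).sum())

# toy retrieval: score three hypothetical PDF pages against one query
query = np.random.rand(12, 128)                        # query token embeddings
pages = [np.random.rand(700, 128) for _ in range(3)]   # per-page patch embeddings
scores = [maxsim_score(query, page) for page in pages]
print(int(np.argmax(scores)), scores)                  # index of the best-matching page
```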

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services, from consulting and model creation to the development of AI-based applications and SDKs. Feel free to contact us with any inquiries.
