MiniCPM-o 2.6: The 8B Parameter Multimodal LLM Beating GPT-4o
In a groundbreaking development, MiniCPM-o 2.6 has taken the world of multimodal large language models (LLMs) by storm. With its 8-billion-parameter architecture, it not only outperforms GPT-4o on several benchmarks but also rivals it in vision, audio, and other multimodal capabilities. Let’s dive into this release, its capabilities, installation process, and use cases.
What Is MiniCPM-o 2.6?
MiniCPM-o 2.6 is an advanced multimodal LLM that seamlessly handles text, images, video, and audio as input. It delivers high-quality outputs, including text generation, speech synthesis, and multimodal interactions, matching or beating prominent models in speed and accuracy while introducing new possibilities in real-time processing.
Key Features of MiniCPM-o
- End-to-End Multimodal Processing: Handles vision, audio, and text input simultaneously in a single end-to-end model.
- Bilingual Real-Time Speech Conversion: Supports real-time speech-to-speech conversion with customizable voices.
- Emotion, Speed and Style Control: Enables unique capabilities like emotion modulation, voice cloning, and role-play scenarios.
- Video and Audio Understanding: Efficiently extracts insights from videos and audio files, making it a versatile tool for creative and analytical tasks.
- OCR Capabilities: Performs optical character recognition on images, supporting multiple languages.
- Time Division Multiplexing (TDM): A novel mechanism for online multimodal streaming, ensuring seamless real-time performance.
The project’s radar chart compares the performance of several models (GPT-4o, Gemini-1.5 Pro, Qwen2-VL, GLM-4-Voice, and MiniCPM-o 2.6) across tasks such as live streaming, speech conversation, and visual understanding benchmarks, visually highlighting each model’s strengths and weaknesses in the different evaluation categories.
Installing MiniCPM-o Locally
System Requirements:
- Python 3.10
- An NVIDIA GPU with CUDA (the example code below loads the model with .cuda())
Step-by-Step Guide:
1. Clone the Repository: Clone the open-source MiniCPM-o 2.6 repository from GitHub, then follow the steps below.
git clone https://github.com/OpenBMB/MiniCPM-o.git
2. Install Dependencies: Use the provided requirements file to install the necessary libraries.
cd MiniCPM-o
## Create Conda Environment
conda create -n minicpm python=3.10
conda activate minicpm
## Install the requirements
pip install -r requirements_o2.6.txt
3. Launch the Model Server:
python web_demos/minicpm-o_2.6/model_server.py
4. Launch the Web Server:
## Make sure Node and pnpm are installed.
sudo apt-get update
sudo apt-get install nodejs npm
npm install -g pnpm
cd web_demos/minicpm-o_2.6/web_server
## create ssl cert for https, https is required to request camera and microphone permissions.
bash ./make_ssl_cert.sh # output key.pem and cert.pem
pnpm install # install requirements
pnpm run dev # start server
5. Test in a Jupyter Notebook (optional): Launch Jupyter Notebook to interact with the model directly.
conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook
For Image Input:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
torch.manual_seed(100)
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
## First round chat
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
## Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
## Output
"The landform in the picture is a mountain range. The mountains appear to be karst formations, characterized by their steep, rugged peaks and smooth, rounded shapes. These types of mountains are often found in regions with limestone bedrock and are shaped by processes such as erosion and weathering. The reflection of the mountains in the water adds to the scenic beauty of the landscape."
"When traveling to this scenic location, it's important to pay attention to the weather conditions, as the area appears to be prone to fog and mist, especially during sunrise or sunset. Additionally, ensure you have proper footwear for navigating the potentially slippery terrain around the water. Lastly, respect the natural environment by not disturbing the local flora and fauna."
For Multiple Images:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
For Video Input:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_path="video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]
## Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
Exploring the Capabilities
Video Understanding:
MiniCPM-o can analyze and describe videos with remarkable accuracy. For instance, it can identify objects and surroundings, and even subtle details such as facial expressions and the landscape in a video.
Speech (TTS and Transcription):
The model excels at speech synthesis and also produces accurate transcriptions of audio files. Its bilingual capability ensures reliable results in the supported languages.
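As a rough illustration, the sketch below feeds a local audio clip to the same chat interface used in the image examples above. The file name is a placeholder, and the 16 kHz mono loading and the way the waveform is placed in the content list are assumptions based on the project’s audio examples; the repository README also passes extra chat parameters (e.g., sampling settings) that are omitted here, so consult it for the authoritative usage.
import torch
import librosa  # pip install librosa
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
## Load a local clip as a 16 kHz mono waveform (file name is a placeholder)
audio, _ = librosa.load('speech_sample.wav', sr=16000, mono=True)
## The waveform is passed in the content list, just like images in the examples above
question = 'Please listen to the audio carefully and transcribe the content.'
msgs = [{'role': 'user', 'content': [question, audio]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)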
OCR Functionality:
Its English OCR is highly accurate, while its performance in other languages is good but not great. This makes it a strong contender for English-centric applications.
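Because OCR runs through the same chat interface shown earlier, a minimal sketch looks like the following; the image path and prompt wording are placeholders, and everything else mirrors the image example above.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
## Image containing text to be recognized (path is a placeholder)
image = Image.open('document_scan.jpg').convert('RGB')
question = 'Extract all the text visible in this image.'
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)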
Limitations and Observations
- Language Support: While English OCR is robust, multilingual support needs refinement.
- GPU Dependency: Requires a high-performance GPU for optimal performance; a quantized checkpoint can reduce the memory footprint (see the sketch below).
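If GPU memory is the main constraint, OpenBMB also tends to publish quantized variants of its models. The sketch below assumes an int4 checkpoint named openbmb/MiniCPM-o-2_6-int4 loaded through the same AutoModel interface; verify the exact repository id and any additional dependencies on Hugging Face before relying on it.
import torch
from transformers import AutoModel, AutoTokenizer
## Assumed int4 checkpoint name; verify the exact repo id on Hugging Face before use
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
## The chat workflow is then identical to the image and video examples above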
Real-World Applications
- Content Creation: Ideal for video summarization, transcription, and voiceover tasks.
- Customer Support: Automates interactions with real-time bilingual speech conversion and emotion control.
- Data Analysis: Extracts insights from multimodal data, including images and videos.
- Education and Accessibility: Enhances learning experiences with TTS and OCR for visually impaired users.
Conclusion
MiniCPM-o 2.6 is a monumental leap in multimodal LLMs, offering unparalleled capabilities in handling diverse data types. Its real-time processing, bilingual TTS, and video analysis open doors to countless applications across industries.
If you found this article insightful, consider following and sharing it within your network. Together, let’s explore the future of AI innovation!