MiniCPM-o 2.6: The 8B Parameter Multimodal LLM Beating GPT-4o
In a groundbreaking development, MiniCPM-o 2.6 has taken the world of multimodal large language models (LLMs) by storm. With its 8-billion-parameter architecture, it not only outperforms GPT-4o on several benchmarks but also rivals it in vision, audio, and other multimodal capabilities. Let’s dive into this release, its capabilities, installation process, and use cases.
What Is MiniCPM-o 2.6?
MiniCPM-o 2.6 is an advanced multimodal LLM that seamlessly handles text, images, video, and audio as input. It delivers high-quality outputs, including text generation, speech synthesis, and multimodal interactions, matching or beating prominent models in speed and accuracy while introducing new possibilities in real-time processing.
Key Features of MiniCPM-o
- End-to-End Multimodal Processing: Handles vision, audio, and text input simultaneously in a single end-to-end model.
- Bilingual Real-Time Speech Conversion: Supports real-time speech-to-speech conversion with customizable voices.
- Emotion, Speed and Style Control: Enables unique capabilities like emotion modulation, voice cloning, and role-play scenarios.
- Video and Audio Understanding: Efficiently extracts insights from videos and audio files, making it a versatile tool for creative and analytical tasks.
- OCR Capabilities: Performs optical character recognition on images, supporting multiple languages.
- Time Division Multiplexing (TDM): A novel mechanism for online multimodal streaming, ensuring seamless real-time performance.
The project’s radar chart compares the performance of several models (GPT-4o, Gemini-1.5 Pro, Qwen2-VL, GLM-4-Voice, and MiniCPM-o 2.6) across tasks such as live streaming, speech conversation, and visual understanding benchmarks, visually highlighting each model’s strengths and weaknesses in the different evaluation categories.
Installing MiniCPM-o Locally
System Requirements:
- Python 3.10
- An NVIDIA GPU with CUDA (the example code below loads the model with .cuda())
Step-by-Step Guide:
1. Clone the Repository: Clone the open-source MiniCPM-o 2.6 repository from GitHub, then follow the steps below.
git clone https://github.com/OpenBMB/MiniCPM-o.git
2. Install Dependencies: Use the provided requirements file to install the necessary libraries.
cd MiniCPM-o
## Create Conda Environment
conda create -n minicpm python=3.10
conda activate minicpm
## Install the requirements
pip install -r requirements_o2.6.txt
3. Launch the Model Server:
python web_demos/minicpm-o_2.6/model_server.py
4. Launch the Web Server:
## Make sure Node and pnpm are installed.
sudo apt-get update
sudo apt-get install nodejs npm
npm install -g pnpm
cd web_demos/minicpm-o_2.6/web_server
## create ssl cert for https, https is required to request camera and microphone permissions.
bash ./make_ssl_cert.sh # output key.pem and cert.pem
pnpm install # install requirements
pnpm run dev # start server
5. Test in a Jupyter Notebook (optional): Launch Jupyter Notebook to interact with the model directly.
conda install -c conda-forge --override-channels notebook -y
conda install -c conda-forge --override-channels ipywidgets -y
jupyter notebook
For Image Input:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
torch.manual_seed(100)
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
image = Image.open('./assets/minicpmo2_6/show_demo.jpg').convert('RGB')
## First round chat
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
## Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
## Output
"The landform in the picture is a mountain range. The mountains appear to be karst formations, characterized by their steep, rugged peaks and smooth, rounded shapes. These types of mountains are often found in regions with limestone bedrock and are shaped by processes such as erosion and weathering. The reflection of the mountains in the water adds to the scenic beauty of the landscape."
"When traveling to this scenic location, it's important to pay attention to the weather conditions, as the area appears to be prone to fog and mist, especially during sunrise or sunset. Additionally, ensure you have proper footwear for navigating the potentially slippery terrain around the water. Lastly, respect the natural environment by not disturbing the local flora and fauna."
For Multiple Images:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
For Video Input:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_path="video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]
## Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
Exploring the Capabilities
Video Understanding:
MiniCPM-o can analyze and describe videos with remarkable accuracy. For instance, it can identify objects and surroundings, and even subtle details such as facial expressions and the landscape in a video.
Speech (TTS and Transcription):
The model excels at speech synthesis and also produces accurate transcriptions of audio files. Its bilingual capability ensures reliable results in the supported languages.
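As a rough illustration, the sketch below feeds a local audio clip to the same chat interface used in the image examples above. The file name is a placeholder, and the 16 kHz mono loading and the way the waveform is placed in the content list are assumptions based on the project’s audio examples; the repository README also passes extra chat parameters (e.g., sampling settings) that are omitted here, so consult it for the authoritative usage.
import torch
import librosa  # pip install librosa
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
## Load a local clip as a 16 kHz mono waveform (file name is a placeholder)
audio, _ = librosa.load('speech_sample.wav', sr=16000, mono=True)
## The waveform is passed in the content list, just like images in the examples above
question = 'Please listen to the audio carefully and transcribe the content.'
msgs = [{'role': 'user', 'content': [question, audio]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)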
OCR Functionality:
Its English OCR is highly accurate, while its performance in other languages is good but not great. This makes it a strong contender for English-centric applications.
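Because OCR runs through the same chat interface shown earlier, a minimal sketch looks like the following; the image path and prompt wording are placeholders, and everything else mirrors the image example above.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
## Image containing text to be recognized (path is a placeholder)
image = Image.open('document_scan.jpg').convert('RGB')
question = 'Extract all the text visible in this image.'
msgs = [{'role': 'user', 'content': [image, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)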
Limitations and Observations
- Language Support: While English OCR is robust, multilingual support needs refinement.
- GPU Dependency: Requires a high-performance GPU for optimal performance; a quantized checkpoint can reduce the memory footprint (see the sketch below).
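If GPU memory is the main constraint, OpenBMB also tends to publish quantized variants of its models. The sketch below assumes an int4 checkpoint named openbmb/MiniCPM-o-2_6-int4 loaded through the same AutoModel interface; verify the exact repository id and any additional dependencies on Hugging Face before relying on it.
import torch
from transformers import AutoModel, AutoTokenizer
## Assumed int4 checkpoint name; verify the exact repo id on Hugging Face before use
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
## The chat workflow is then identical to the image and video examples above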
Real-World Applications
- Content Creation: Ideal for video summarization, transcription, and voiceover tasks.
- Customer Support: Automates interactions with real-time bilingual speech conversion and emotion control.
- Data Analysis: Extracts insights from multimodal data, including images and videos.
- Education and Accessibility: Enhances learning experiences with TTS and OCR for visually impaired users.
Conclusion
MiniCPM-o 2.6 is a monumental leap in multimodal LLMs, offering unparalleled capabilities in handling diverse data types. Its real-time processing, bilingual TTS, and video analysis open doors to countless applications across industries.
If you found this article insightful, consider following and sharing it within your network. Together, let’s explore the future of AI innovation!