Multimodal understanding

Google: Gemini 2.0 Flash Experimental

Gemini 2.0 Flash offers a significantly faster time to first token (TTFT) than Gemini 1.5 Flash, while maintaining quality on par with larger models like Gemini 1.5 ...

Google | 976.56K context | $0.2/M input tokens | $0.6/M output tokens

Mistral: Pixtral Large 2411

Text + image to text

Pixtral Large is a 124B open-weights multimodal model built on top of Mistral Large 2. The model can understand documents, charts, and natural images. The mode ...

MistralAI | 125K context | $2/M input tokens | $6/M output tokens | $0.003/M image tokens

Qwen2-VL 7B Instruct

Text + image to text

Qwen2-VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: SoTA understanding of images of various resolutions and ratios: Qwen2-VL achieves state-of-the-art performance o...

Qwen | 32K context | $0.1/M input tokens | $0.1/M output tokens | $0.144/K image tokens
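
The per-million-token prices listed above can be turned into a rough per-request estimate. The following is a minimal sketch, not an official calculator: the function name estimate_cost and the example workload are hypothetical, and image-token charges are left out because the listings quote them in different units (per million for Pixtral Large, per thousand for Qwen2-VL).

```python
# Text-token prices copied from the listings above, in USD per million tokens.
# Image-token pricing is intentionally omitted (the listings use mixed units).
PRICES_PER_MILLION = {
    "Gemini 2.0 Flash Experimental": {"input": 0.2, "output": 0.6},
    "Pixtral Large 2411": {"input": 2.0, "output": 6.0},
    "Qwen2-VL 7B Instruct": {"input": 0.1, "output": 0.1},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated text-token cost in USD for a single request."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10,000 input tokens and 1,000 output tokens.
    for name in PRICES_PER_MILLION:
        print(f"{name}: ${estimate_cost(name, 10_000, 1_000):.6f}")
```

Keeping the prices in one table makes it easy to compare models on the same workload; actual billing may differ once image tokens are included.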
