Computer vision

MiniMax-01 combines MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding. It has 456 billion parameters, with 45.9 billion parameters activated per inference, and can ha ...

MiniMax: MiniMax-01
Rifx.Online
976.75K context $0.2/M input tokens $1.1/M output tokens

Grok 2 Vision 1212 advances image-based AI with stronger visual comprehension, refined instruction-following, and multilingual support. From object recognition to style analysis, it empowers develope ...

xAI: Grok 2 Vision 1212
X AI
32K context $2/M input tokens $10/M output tokens $0.004/M image tokens
70% OFF

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time c ...

nova-lite
Amazon
292.97K context $0.06/M input tokens $0.24/M output tokens
70% OFF

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the ...

nova-pro
Amazon
292.97K context $0.8/M input tokens $3.2/M output tokens $0.001/M image tokens

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the- ...

Amazon: Nova Pro 1.0
Amazon
292.97K context $0.8/M input tokens $3.2/M output tokens $0.001/M image tokens

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time cu ...

Amazon: Nova Lite 1.0
Amazon
292.97K context $0.06/M input tokens $0.24/M output tokens
40% OFF

Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and vid ...

Gemini Flash 1.5
Google
976.56K context $0.15/M input tokens $0.6/M output tokens $0.04/K image tokens
40% OFF

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of GPT-4 Turbo while being twi ...

gpt-4o
OpenAI
125K context $2.5/M input tokens $10/M output tokens $0.004/M image tokens
40% OFF

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: Coding: Autonomously writes, edits, and runs code w...

Claude 3.5 Sonnet-20240620
Anthropic
195.31K context $3/M input tokens $15/M output tokens $0.005/M image tokens

Pixtral Large is a 124B open-weights multimodal model built on top of Mistral Large 2. The model is able to understand documents, charts and natural images. The mode ...

Mistral: Pixtral Large 2411
MistralAI
125K context $2/M input tokens $6/M output tokens $0.003/M image tokens

Google's flagship multimodal model, supporting image and video in text or chat prompts for a text or code response. See the benchmarks and prompting guidelines from [Deepmind](https://deepmind.googl ...

Google: Gemini Pro Vision 1.0
Google
16K context $0.5/M input tokens $1.5/M output tokens $0.003/M image tokens

Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and vide ...

Google: Gemini Flash 1.5
Google
976.56K context $0.075/M input tokens $0.3/M output tokens $0.04/K image tokens

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: Coding: Autonomously writes, edits, and runs code wi...

Anthropic: Claude 3.5 Sonnet (2024-06-20)
Anthropic
195.31K context $3/M input tokens $15/M output tokens $0.005/M image tokens

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: Coding: Autonomously writes, edits, and runs code wi...

Anthropic: Claude 3.5 Sonnet
Anthropic
195.31K context $3/M input tokens $15/M output tokens $0.005/M image tokens

Qwen2 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance o...

Qwen2-VL 7B Instruct
Qwen
32K context $0.1/M input tokens $0.1/M output tokens $0.144/K image tokens
FREE

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answ ...

Meta: Llama 3.2 11B Vision Instruct (free)
Meta Llama
128K context $0 input tokens $0 output tokens $0.079/K image tokens

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answ ...

Meta: Llama 3.2 11B Vision Instruct
Meta Llama
128K context $0.055/M input tokens $0.055/M output tokens $0.079/K image tokens

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of GPT-4 Turbo while being twi ...

GPT-4o
OpenAI
125K context $2.5/M input tokens $10/M output tokens $0.004/M image tokens

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of GPT-4 Turbo while being twi ...

OpenAI: GPT-4o
OpenAI
125K context $2.5/M input tokens $10/M output tokens $0.004/M image tokens

Qwen2 VL 72B is a multimodal LLM from the Qwen Team with the following key enhancements: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance...

Qwen2-VL 72B Instruct
Qwen
32K context $0.4/M input tokens $0.4/M output tokens $0.578/K image tokens

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image caption ...

Meta: Llama 3.2 90B Vision Instruct
Meta Llama
128K context $0.35/M input tokens $0.4/M output tokens $0.506/K image tokens

GLM-4V Model Introduction. Key Capabilities and Primary Use Cases. Multimodal Conversations: Engages in text and image-based conversations. Image Understanding: Analyz...

glm-4v
ChatGLM
31.25K context $7/M input tokens $7/M output tokens

GLM-4V-Plus Model Introduction. Key Capabilities and Primary Use Cases. Multimodal Understanding: Excels in image and video understanding, including temporal sequence analys...

glm-4v-plus
ChatGLM
31.25K context $1.4/M input tokens $1.4/M output tokens

Liquid's 40.3B Mixture of Experts (MoE) model. Liquid Foundation Models (LFMs) are large neural networks built with computational units rooted in dynamic systems. LFMs are general-purp ...

Liquid: LFM 40B MoE
Liquid
32K context $1/M input tokens $2/M output tokens

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual ...

Meta: Llama 3.2 11B Vision Instruct
Meta Llama
128K context $0.055/M input tokens $0.055/M output tokens $0.079/K image tokens

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in ...

Meta: Llama 3.2 90B Vision Instruct
Meta Llama
128K context $0.35/M input tokens $0.4/M output tokens $0.506/K image tokens

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in ...

Meta: Llama 3.2 90B Vision Instruct (free)
Rifx.Online
4K context $0 input tokens $0 output tokens

Qwen2 VL 72B is a multimodal LLM from the Qwen Team with the following key enhancements: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-ar...

Qwen2-VL 72B Instruct
Qwen
32K context $0.4/M input tokens $0.4/M output tokens $0.578/K image tokens

The first image-to-text model from Mistral AI. Its weights were released via torrent, per their tradition: https://x.com/mistralai/status/1833758285167722836 ...

Mistral: Pixtral 12B
MistralAI
4K context $0.1/M input tokens $0.1/M output tokens $0.144/K image tokens

Qwen2 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art...

Qwen2-VL 7B Instruct
Qwen
32K context $0.1/M input tokens $0.1/M output tokens $0.144/K image tokens

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: Coding: Autonomously writes, edits, an...

Anthropic: Claude 3.5 Sonnet (2024-06-20)
Anthropic
195.31K context $3/M input tokens $15/M output tokens $0.005/M image tokens

Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, ...

Google: Gemini Flash 1.5
Google
976.56K context $0.075/M input tokens $0.3/M output tokens $0.04/K image tokens

Google's flagship multimodal model, supporting image and video in text or chat prompts for a text or code response. See the benchmarks and prompting guidelines from [Deepmind](https:// ...

Google: Gemini Pro Vision 1.0
Google
16K context $0.5/M input tokens $1.5/M output tokens $0.003/M image tokens
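
The prices on these cards are quoted per million tokens ("/M"), and some image prices per thousand ("/K"). As a rough illustration of how such a rate turns into a per-request cost, here is a minimal Python sketch. The function names `token_cost` and `request_cost` are hypothetical helpers, not part of any provider's API, and the token counts are made up for the example. It assumes cost scales linearly with token count and ignores any discount badges.

```python
from decimal import Decimal


def token_cost(tokens: int, price_usd: str, per: int = 1_000_000) -> Decimal:
    """Cost in USD for `tokens` tokens priced at `price_usd` per `per` tokens.

    Prices are passed as strings so Decimal keeps them exact.
    """
    return Decimal(price_usd) * tokens / per


def request_cost(input_tokens: int, output_tokens: int,
                 input_price: str, output_price: str) -> Decimal:
    """Sum the input and output cost for one request, both priced per million."""
    return (token_cost(input_tokens, input_price)
            + token_cost(output_tokens, output_price))


# Rates copied from the Gemini Pro Vision 1.0 card: $0.5/M input, $1.5/M output.
# The token counts (10,000 in, 1,000 out) are illustrative assumptions.
cost = request_cost(10_000, 1_000, "0.5", "1.5")
print(f"Estimated cost: ${cost}")
```

For a rate quoted per thousand, pass `per=1_000` to `token_cost` so the unit matches the card.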