computer-vision

MiniMax: MiniMax-01

MiniMax-01 combines MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding. It has 456 billion parameters, with 45.9 billion parameters activated per inference, and can ha ...

Rifx.Online 976.75K context $0.2/M input tokens $1.1/M output tokens

xAI: Grok 2 Vision 1212

Text image 2 text

Grok 2 Vision 1212 advances image-based AI with stronger visual comprehension, refined instruction-following, and multilingual support. From object recognition to style analysis, it empowers develope ...

xAI 32K context $2/M input tokens $10/M output tokens $0.004/M image tokens

70% OFF

nova-lite

Text image 2 text

# Discount

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time c ...

Amazon 292.97K context $0.06/M input tokens $0.24/M output tokens

70% OFF

nova-pro

Text image 2 text

# Discount

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the ...

Amazon 292.97K context $0.8/M input tokens $3.2/M output tokens $0.001/M image tokens

Amazon: Nova Pro 1.0

Text image 2 text

# New

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the- ...

Amazon 292.97K context $0.8/M input tokens $3.2/M output tokens $0.001/M image tokens

Amazon: Nova Lite 1.0

Text image 2 text

# New

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time cu ...

Amazon 292.97K context $0.06/M input tokens $0.24/M output tokens

40% OFF

Gemini Flash 1.5

Text image 2 text

# Discount

Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and vid ...

Google 976.56K context $0.15/M input tokens $0.6/M output tokens $0.04/K image tokens

40% OFF

gpt-4o

Text image 2 text

# Discount

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of GPT-4 Turbo while being twi ...

OpenAI 125K context $2.5/M input tokens $10/M output tokens $0.004/M image tokens

40% OFF

Claude 3.5 Sonnet-20240620

Text image 2 text

# Discount # 40%Off

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: Coding: Autonomously writes, edits, and runs code w...

Anthropic 195.31K context $3/M input tokens $15/M output tokens $0.005/M image tokens

Mistral: Pixtral Large 2411

Text image 2 text

Pixtral Large is a 124B open-weights multimodal model built on top of Mistral Large 2. The model is able to understand documents, charts and natural images. The mode ...

MistralAI 125K context $2/M input tokens $6/M output tokens $0.003/M image tokens

Google: Gemini Pro Vision 1.0

Text image 2 text

Google's flagship multimodal model, supporting image and video in text or chat prompts for a text or code response. See the benchmarks and prompting guidelines from [Deepmind](https://deepmind.googl ...

Google 16K context $0.5/M input tokens $1.5/M output tokens $0.003/M image tokens

Google: Gemini Flash 1.5

Text image 2 text

Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and vide ...

Google 976.56K context $0.075/M input tokens $0.3/M output tokens $0.04/K image tokens

Anthropic: Claude 3.5 Sonnet (2024-06-20)

Text image 2 text

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: Coding: Autonomously writes, edits, and runs code wi...

Anthropic 195.31K context $3/M input tokens $15/M output tokens $0.005/M image tokens

Anthropic: Claude 3.5 Sonnet

Text image 2 text

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: Coding: Autonomously writes, edits, and runs code wi...

Anthropic 195.31K context $3/M input tokens $15/M output tokens $0.005/M image tokens

Qwen2-VL 7B Instruct

Text image 2 text

Qwen2 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance o...

Qwen 32K context $0.1/M input tokens $0.1/M output tokens $0.144/K image tokens

FREE

Meta: Llama 3.2 11B Vision Instruct (free)

Text image 2 text

# Free

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answ ...

Meta Llama 128K context $0 input tokens $0 output tokens $0.079/K image tokens

Meta: Llama 3.2 11B Vision Instruct

Text image 2 text

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answ ...

Meta Llama 128K context $0.055/M input tokens $0.055/M output tokens $0.079/K image tokens

GPT-4o

Text image 2 text

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of GPT-4 Turbo while being twi ...

OpenAI 125K context $2.5/M input tokens $10/M output tokens $0.004/M image tokens

OpenAI: GPT-4o

Text image 2 text

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of GPT-4 Turbo while being twi ...

OpenAI 125K context $2.5/M input tokens $10/M output tokens $0.004/M image tokens

Qwen2-VL 72B Instruct

Text image 2 text

Qwen2 VL 72B is a multimodal LLM from the Qwen Team with the following key enhancements: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance...

Qwen 32K context $0.4/M input tokens $0.4/M output tokens $0.578/K image tokens

Meta: Llama 3.2 90B Vision Instruct

Text image 2 text

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image caption ...

Meta Llama 128K context $0.35/M input tokens $0.4/M output tokens $0.506/K image tokens

glm-4v

Text 2 text

GLM-4V Model Introduction. Key Capabilities and Primary Use Cases: Multimodal Conversations: Engages in text and image-based conversations. Image Understanding: Analyz...

ChatGLM 31.25K context $7/M input tokens $7/M output tokens

glm-4v-plus

Text 2 text

GLM-4V-Plus Model Introduction. Key Capabilities and Primary Use Cases: Multimodal Understanding: Excels in image and video understanding, including temporal sequence analys...

ChatGLM 31.25K context $1.4/M input tokens $1.4/M output tokens

Liquid: LFM 40B MoE

Text 2 text

Liquid's 40.3B Mixture of Experts (MoE) model. Liquid Foundation Models (LFMs) are large neural networks built with computational units rooted in dynamic systems. LFMs are general-purp ...

Liquid 32K context $1/M input tokens $2/M output tokens

Meta: Llama 3.2 11B Vision Instruct

Text image 2 text

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual ...

Meta Llama 128K context $0.055/M input tokens $0.055/M output tokens $0.079/K image tokens

Meta: Llama 3.2 90B Vision Instruct

Text image 2 text

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in ...

Meta Llama 128K context $0.35/M input tokens $0.4/M output tokens $0.506/K image tokens

Meta: Llama 3.2 90B Vision Instruct (free)

Text image 2 text

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in ...

Rifx.Online 4K context $0 input tokens $0 output tokens

Qwen2-VL 72B Instruct

Text image 2 text

Qwen2 VL 72B is a multimodal LLM from the Qwen Team with the following key enhancements: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-ar...

Qwen 32K context $0.4/M input tokens $0.4/M output tokens $0.578/K image tokens

Mistral: Pixtral 12B

Text image 2 text

The first image-to-text model from Mistral AI. Its weights were released via torrent, per their tradition: https://x.com/mistralai/status/1833758285167722836 ...

MistralAI 4K context $0.1/M input tokens $0.1/M output tokens $0.144/K image tokens

Qwen2-VL 7B Instruct

Text image 2 text

Qwen2 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art...

Qwen 32K context $0.1/M input tokens $0.1/M output tokens $0.144/K image tokens

Anthropic: Claude 3.5 Sonnet (2024-06-20)

Text image 2 text

Claude 3.5 Sonnet delivers better-than-Opus capabilities, faster-than-Sonnet speeds, at the same Sonnet prices. Sonnet is particularly good at: Coding: Autonomously writes, edits, an...

Anthropic 195.31K context $3/M input tokens $15/M output tokens $0.005/M image tokens

Google: Gemini Flash 1.5

Text image 2 text

Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, ...

Google 976.56K context $0.075/M input tokens $0.3/M output tokens $0.04/K image tokens

Google: Gemini Pro Vision 1.0

Text image 2 text

Google's flagship multimodal model, supporting image and video in text or chat prompts for a text or code response. See the benchmarks and prompting guidelines from [Deepmind](https:// ...

Google 16K context $0.5/M input tokens $1.5/M output tokens $0.003/M image tokens
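
The per-million-token prices on the cards above can be turned into a rough per-request estimate. The following is a minimal sketch, not part of the listing itself: the `Pricing` class and `estimate_cost` function are hypothetical names introduced for illustration, the two price pairs are copied from the GPT-4o and Gemini Flash 1.5 cards, and image-token charges are left out because the cards quote them in mixed units (per K and per M).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Text-token prices in USD per million tokens, as quoted on a card."""
    input_per_million: float
    output_per_million: float


def estimate_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request, counting text tokens only."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Prices copied from the listing cards (assumed to be current only for this page).
gpt_4o = Pricing(input_per_million=2.5, output_per_million=10.0)
gemini_flash = Pricing(input_per_million=0.075, output_per_million=0.3)

if __name__ == "__main__":
    # Hypothetical request size: 10,000 input tokens and 1,000 output tokens.
    for name, pricing in [("GPT-4o", gpt_4o), ("Gemini Flash 1.5", gemini_flash)]:
        print(f"{name}: ${estimate_cost(pricing, 10_000, 1_000):.5f}")
```

For the hypothetical request above, the GPT-4o prices work out to about $0.035 and the Gemini Flash 1.5 prices to about $0.00105, before any image-token charges.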
