AI Models with Vision Support

This page lists large language models that support vision (image input). Compare the models, see how each one implements the feature, and find the best option for projects that depend on strong vision capabilities.

Providers

Anthropic
Google
OpenAI

Models with this Capability

Claude 3 Haiku

Anthropic · Claude 3

Current

Input

200K tokens

Output

4.1K tokens

Input Cost

$0.25/1M

Output Cost

$1.25/1M

multimodal input
long context
vision
+1

Gemini 2.5 Flash Preview

Google · Gemini

preview

Input

1M tokens

Output

0 tokens

Input Cost

$0.15/1M

Output Cost

$0.60/1M

Exceptional at:

mathematics
multimodal input
long context
thinking
+5

Gemini 2.0 Flash

Google · Gemini

Current

Input

1M tokens

Output

0 tokens

Input Cost

$0.10/1M

Output Cost

$0.40/1M

Exceptional at:

instruction following
multimodal input
long context
vision
+1

Gemini 2.0 Flash

Google · Gemini

Current

Input

1M tokens

Output

0 tokens

Input Cost

$0.10/1M

Output Cost

$0.40/1M

Exceptional at:

instruction following
multimodal input
long context
vision
+14

Gemini 1.5 Flash-8B

Google · Gemini

Current

Input

1M tokens

Output

0 tokens

Input Cost

$0.04/1M

Output Cost

$0.15/1M

multimodal input
long context
vision
+4

Claude 3.7 Sonnet

Anthropic · Claude 3

Current

Input

200K tokens

Output

4.1K tokens

Input Cost

$3.00/1M

Output Cost

$15.00/1M

Exceptional at:

agentic coding
reasoning
+2
multimodal input
long context
function calling
+15

Claude 3.5 Haiku

Anthropic · Claude 3.5

Current

Input

200K tokens

Output

8.2K tokens

Input Cost

$0.80/1M

Output Cost

$4.00/1M

Exceptional at:

speed
targeted performance
multimodal input
long context
vision
+4

Claude 3.5 Sonnet

Anthropic · Claude 3

Current

Input

200K tokens

Output

4.1K tokens

Input Cost

$3.00/1M

Output Cost

$15.00/1M

Exceptional at:

complex reasoning
agent applications
+2
vision
prompt caching
batch processing
+12

Claude 3 Haiku

Anthropic · Claude 3

Current

Input

200K tokens

Output

4.1K tokens

Input Cost

$0.25/1M

Output Cost

$1.25/1M

Exceptional at:

quick response applications
targeted performance
multimodal input
multilingual
vision
+7

GPT-4o

OpenAI · GPT-4o

Current

Input

128K tokens

Output

16.4K tokens

Input Cost

$2.50/1M

Output Cost

$10.00/1M

Exceptional at:

general reasoning
multimodal understanding
+1
multimodal input
code interpretation via tool
computer use via tool
+8

GPT-4o mini Audio

OpenAI · GPT-4o

preview

Input

128K tokens

Output

4.1K tokens

Input Cost

$0.15/1M

Output Cost

$0.60/1M

Exceptional at:

audio to text transcription
text to audio synthesis
+1
multimodal input
audio to text
text to audio
+4

omni-moderation

OpenAI · omni-moderation

Current

Input

0 tokens

Output

0 tokens

Input Cost

$0.00/1M

Output Cost

$0.00/1M

Exceptional at:

harmful content detection
multimodal moderation
vision
moderation
multimodal input

o3-2025-04-16

OpenAI · o3

Current

Input

200K tokens

Output

100K tokens

Input Cost

$10.00/1M

Output Cost

$40.00/1M

Exceptional at:

math
science
+6
code interpretation
web browsing via tool
streaming
+6

o3

OpenAI · OpenAI

Current

Input

200K tokens

Output

100K tokens

Input Cost

$10.00/1M

Output Cost

$40.00/1M

Exceptional at:

complex reasoning
math
+6
multimodal input
long context
function calling
+7

GPT-4o

OpenAI · GPT-4o

Current

Input

128K tokens

Output

16.4K tokens

Input Cost

$2.50/1M

Output Cost

$10.00/1M

Exceptional at:

multimodal understanding
complex reasoning
+2
streaming
structured outputs
multimodal input
+8

Gemini 2.5 Pro Preview

Google · Gemini

preview

Input

1M tokens

Output

0 tokens

Input Cost

$1.25/1M

Output Cost

$10.00/1M

Exceptional at:

complex reasoning
coding
+7
multimodal input
long context
advanced reasoning
+9

Claude 3 Opus

Anthropic · Claude 3

Current

Input

200K tokens

Output

4.1K tokens

Input Cost

$15.00/1M

Output Cost

$75.00/1M

Exceptional at:

top level intelligence
fluency
+2
multimodal input
long context
vision
+3

GPT-4o

OpenAI · GPT-4o

outdated

Input

128K tokens

Output

16.4K tokens

Input Cost

$2.50/1M

Output Cost

$10.00/1M

Exceptional at:

general reasoning
multimodal understanding
+2
multimodal input
long context
function calling
+7
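
Because every card quotes prices per 1M tokens, comparing models for a given workload comes down to simple arithmetic. The following is a minimal sketch, not an official calculator: the `PRICES` table copies three entries from the listing above, and the names `estimate_cost` and `cheapest` are hypothetical helpers introduced here for illustration. It assumes you already know your token counts. How many tokens an image consumes varies by provider and is not stated on this page.

```python
# Prices in USD per 1M tokens (input, output), copied from the cards above.
# Only three models are included here to keep the sketch short.
PRICES = {
    "Claude 3 Haiku": (0.25, 1.25),
    "Gemini 2.0 Flash": (0.10, 0.40),
    "GPT-4o": (2.50, 10.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request for the given model."""
    input_price, output_price = PRICES[model]
    total = input_tokens * input_price + output_tokens * output_price
    return total / 1_000_000  # listed prices are per 1M tokens


def cheapest(input_tokens: int, output_tokens: int) -> str:
    """Return the name of the lowest-cost model in PRICES for this workload."""
    return min(PRICES, key=lambda m: estimate_cost(m, input_tokens, output_tokens))


if __name__ == "__main__":
    # Hypothetical workload: 10,000 input tokens and 1,000 output tokens.
    for name in PRICES:
        print(f"{name}: ${estimate_cost(name, 10_000, 1_000):.5f}")
    print("Cheapest:", cheapest(10_000, 1_000))
```

Price is only one axis: the cards also differ in context window, output limit, and release status (Current, preview, outdated), so a cost ranking like this is best treated as a first filter.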