AI Models with Multimodal Input Support

This page lists large language models that support multimodal input. Compare the models, see how each one implements the feature, and find the best option for projects that depend on it.

Providers

Anthropic
Google
OpenAI

Models with this Capability

Claude 3 Haiku

Anthropic · Claude 3

Current

Input

200K tokens

Output

4.1K tokens

Input Cost

$0.25/1M

Output Cost

$1.25/1M

multimodal input
long context
vision
+1

Gemini 2.5 Flash Preview

Google · Gemini

Preview

Input

1M tokens

Output

Not listed

Input Cost

$0.15/1M

Output Cost

$0.60/1M

Exceptional at:

mathematics
multimodal input
long context
thinking
+5

Gemini 2.0 Flash

Google · Gemini

Current

Input

1M tokens

Output

Not listed

Input Cost

$0.10/1M

Output Cost

$0.40/1M

Exceptional at:

instruction following
multimodal input
long context
vision
+1

Gemini 2.0 Flash

Google · Gemini

Current

Input

1M tokens

Output

Not listed

Input Cost

$0.10/1M

Output Cost

$0.40/1M

Exceptional at:

instruction following
multimodal input
long context
vision
+14

GPT-4.1 nano

OpenAI · GPT-4.1

Current

Input

1.0M tokens

Output

32.8K tokens

Input Cost

$0.10/1M

Output Cost

$0.40/1M

Exceptional at:

cost-effectiveness
speed
multimodal input
long context
function calling
+3

Gemini 1.5 Flash-8B

Google · Gemini

Current

Input

1M tokens

Output

Not listed

Input Cost

$0.04/1M

Output Cost

$0.15/1M

multimodal input
long context
vision
+4

Claude 3.7 Sonnet

Anthropic · Claude 3

Current

Input

200K tokens

Output

4.1K tokens

Input Cost

$3.00/1M

Output Cost

$15.00/1M

Exceptional at:

agentic coding
reasoning
+2
multimodal input
long context
function calling
+15

Claude 3.5 Haiku

Anthropic · Claude 3.5

Current

Input

200K tokens

Output

8.2K tokens

Input Cost

$0.80/1M

Output Cost

$4.00/1M

Exceptional at:

speed
targeted performance
multimodal input
long context
vision
+4

Claude 3 Haiku

Anthropic · Claude 3

Current

Input

200K tokens

Output

4.1K tokens

Input Cost

$0.25/1M

Output Cost

$1.25/1M

Exceptional at:

quick response applications
targeted performance
multimodal input
multilingual
vision
+7

GPT-4o mini Realtime

OpenAI · GPT-4o

Preview

Input

128K tokens

Output

4.1K tokens

Input Cost

$0.60/1M

Output Cost

$2.40/1M

Exceptional at:

realtime processing
audio input
+1
audio input
audio output
text generation
+2

GPT-4 Turbo

OpenAI · GPT-4

Outdated

Input

128K tokens

Output

4.1K tokens

Input Cost

$10.00/1M

Output Cost

$30.00/1M

Exceptional at:

long context processing
function calling
+2
long context
function calling
streaming
+2

GPT-4o

OpenAI · GPT-4o

Outdated

Input

128K tokens

Output

16.4K tokens

Input Cost

$2.50/1M

Output Cost

$10.00/1M

Exceptional at:

high intelligence
versatility
+4
multimodal input
function calling
structured outputs
+5

GPT-4o

OpenAI · GPT-4o

Current

Input

128K tokens

Output

16.4K tokens

Input Cost

$2.50/1M

Output Cost

$10.00/1M

Exceptional at:

general reasoning
multimodal understanding
+1
multimodal input
code interpretation via tool
computer use via tool
+8

GPT-4o mini Audio

OpenAI · GPT-4o

Preview

Input

128K tokens

Output

4.1K tokens

Input Cost

$0.15/1M

Output Cost

$0.60/1M

Exceptional at:

audio to text transcription
text to audio synthesis
+1
multimodal input
audio to text
text to audio
+4

o1-pro

OpenAI · o1

Current

Input

200K tokens

Output

100K tokens

Input Cost

$150.00/1M

Output Cost

$600.00/1M

Exceptional at:

complex reasoning
multi turn interactions
advanced reasoning
function calling
structured outputs
+4

omni-moderation

OpenAI · omni-moderation

Current

Input

Not listed

Output

Not listed

Input Cost

$0.00/1M

Output Cost

$0.00/1M

Exceptional at:

harmful content detection
multimodal moderation
vision
moderation
multimodal input

GPT-4o Audio Preview

OpenAI · GPT-4o

Outdated

Input

Not listed

Output

Not listed

Input Cost

$40.00/1M

Output Cost

$80.00/1M

Exceptional at:

audio input processing
audio output generation
+1
advanced reasoning
function calling
audio input
+2

omni-moderation

OpenAI · Moderation

Current

Input

Not listed

Output

Not listed

Input Cost

$0.00/1M

Output Cost

$0.00/1M

Exceptional at:

identifying harmful content in text and images
content moderation
multimodal input

ChatGPT-4o

OpenAI · GPT-4o

Current

Input

128K tokens

Output

4.1K tokens

Input Cost

$5.00/1M

Output Cost

$15.00/1M

Exceptional at:

text understanding
vision tasks
+1
file inputs
web browsing via tool
code interpretation
+5

GPT-4o mini

OpenAI · GPT-4o

Current

Input

128K tokens

Output

16.4K tokens

Input Cost

$0.15/1M

Output Cost

$0.60/1M

Exceptional at:

fast
affordable
+2
multimodal input
function calling
structured outputs
+2

GPT-4o mini Realtime

OpenAI · GPT-4o

Preview

Input

128K tokens

Output

4.1K tokens

Input Cost

$0.60/1M

Output Cost

$2.40/1M

Exceptional at:

realtime text processing
realtime audio processing
function calling
realtime conversations
realtime transcription
+3

o3-2025-04-16

OpenAI · o3

Current

Input

200K tokens

Output

100K tokens

Input Cost

$10.00/1M

Output Cost

$40.00/1M

Exceptional at:

math
science
+6
code interpretation
web browsing via tool
streaming
+6

o3

OpenAI · o3

Current

Input

200K tokens

Output

100K tokens

Input Cost

$10.00/1M

Output Cost

$40.00/1M

Exceptional at:

complex reasoning
math
+6
multimodal input
long context
function calling
+7

GPT-4o

OpenAI · GPT-4o

Current

Input

128K tokens

Output

16.4K tokens

Input Cost

$2.50/1M

Output Cost

$10.00/1M

Exceptional at:

multimodal understanding
complex reasoning
+2
streaming
structured outputs
multimodal input
+8

GPT-4o Audio

OpenAI · GPT-4o

Preview

Input

128K tokens

Output

16.4K tokens

Input Cost

$2.50/1M

Output Cost

$10.00/1M

Exceptional at:

audio input processing
audio output generation
multimodal input
function calling
streaming

GPT-4o Realtime

OpenAI · GPT-4o

Preview

Input

128K tokens

Output

4.1K tokens

Input Cost

$5.00/1M

Output Cost

$20.00/1M

Exceptional at:

realtime audio processing
realtime text processing
+1
multimodal input
function calling
realtime text input
+3

GPT-4.1 mini

OpenAI · GPT-4.1

Current

Input

1.0M tokens

Output

32.8K tokens

Input Cost

$0.40/1M

Output Cost

$1.60/1M

file search via tool
multimodal input
computer use via tool
+7

Gemini 2.0 Flash-Lite

Google · Gemini

Current

Input

Not listed

Output

Not listed

Input Cost

$0.07/1M

Output Cost

$0.30/1M

Exceptional at:

instruction following
multimodal input

Gemini 1.5 Flash

Google · Gemini

Current

Input

1M tokens

Output

1M tokens

Input Cost

$0.07/1M

Output Cost

$0.30/1M

Exceptional at:

long context processing
multimodal understanding
+3
multimodal input
long context
context caching
+2

Gemini 2.5 Pro Preview

Google · Gemini

Preview

Input

1M tokens

Output

Not listed

Input Cost

$1.25/1M

Output Cost

$10.00/1M

Exceptional at:

complex reasoning
coding
+7
multimodal input
long context
advanced reasoning
+9

Gemini 1.5 Pro

Google · Gemini

Current

Input

2M tokens

Output

Not listed

Input Cost

$1.25/1M

Output Cost

$5.00/1M

Exceptional at:

long context processing
complex reasoning
+1
multimodal input
long context
structured output
+8

Claude 3 Opus

Anthropic · Claude 3

Current

Input

200K tokens

Output

4.1K tokens

Input Cost

$15.00/1M

Output Cost

$75.00/1M

Exceptional at:

top level intelligence
fluency
+2
multimodal input
long context
vision
+3

GPT-4o

OpenAI · GPT-4o

Outdated

Input

128K tokens

Output

16.4K tokens

Input Cost

$2.50/1M

Output Cost

$10.00/1M

Exceptional at:

general reasoning
multimodal understanding
+2
multimodal input
long context
function calling
+7

GPT-4o Audio Preview

OpenAI · GPT-4o

Preview

Input

Not listed

Output

Not listed

Input Cost

$40.00/1M

Output Cost

$80.00/1M

Exceptional at:

realtime audio processing
realtime text processing
realtime text
multimodal input
audio input
+2

omni-moderation-latest

OpenAI · omni-moderation

Current

Input

Not listed

Output

Not listed

Input Cost

$0.00/1M

Output Cost

$0.00/1M

Exceptional at:

multimodal content moderation
identifying harmful content (text and images)
multimodal input
content moderation

GPT-4o mini Realtime

OpenAI · GPT-4o

Preview

Input

128K tokens

Output

4.1K tokens

Input Cost

$0.60/1M

Output Cost

$2.40/1M

Exceptional at:

realtime processing
audio input
+1
multimodal input
realtime processing
audio input
+2

o4-mini

OpenAI · o-series

Current

Input

200K tokens

Output

100K tokens

Input Cost

$1.10/1M

Output Cost

$4.40/1M

Exceptional at:

reasoning
coding
+1
multimodal input
long context
function calling
+5

GPT-4.1

OpenAI · GPT-4.1

Current

Input

1.0M tokens

Output

32.8K tokens

Input Cost

$2.00/1M

Output Cost

$8.00/1M

Exceptional at:

complex reasoning
problem solving across domains
+4
long context
function calling
structured outputs
+5

GPT-4o mini Audio

OpenAI · GPT-4o

Preview

Input

128K tokens

Output

16.4K tokens

Input Cost

$0.15/1M

Output Cost

$0.60/1M

Exceptional at:

audio input processing
audio output generation
streaming
multimodal input
function calling

GPT Image 1

OpenAI · GPT Image

Current

Input

Not listed

Output

Not listed

Input Cost

$10.00/1M

Output Cost

$40.00/1M

Exceptional at:

image generation
image editing
multimodal input
image generation
inpainting
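
Estimating cost from the listed prices

The cards quote prices per 1M tokens, so comparing models for a given workload is simple arithmetic: tokens times price, divided by one million. The Python sketch below shows one way to do that for a handful of models, using input and output prices copied from this listing. The names `ModelPrice`, `estimate_cost`, and `rank_by_cost` are hypothetical helpers introduced only for illustration, and the sketch assumes the workload is already expressed as token counts. How images or audio convert to tokens differs by provider and is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical record holding the per-1M-token prices from a card."""
    name: str
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Prices copied from the cards in this listing; they may change over time.
PRICES = [
    ModelPrice("Claude 3 Haiku", 0.25, 1.25),
    ModelPrice("GPT-4.1 nano", 0.10, 0.40),
    ModelPrice("Gemini 1.5 Flash-8B", 0.04, 0.15),
    ModelPrice("GPT-4o", 2.50, 10.00),
    ModelPrice("o1-pro", 150.00, 600.00),
]


def estimate_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the listed prices."""
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000


def rank_by_cost(prices, input_tokens: int, output_tokens: int):
    """Return the models sorted from cheapest to most expensive for this workload."""
    return sorted(prices, key=lambda p: estimate_cost(p, input_tokens, output_tokens))


if __name__ == "__main__":
    # Example workload: 100K input tokens and 2K output tokens per request.
    for p in rank_by_cost(PRICES, 100_000, 2_000):
        print(f"{p.name}: ${estimate_cost(p, 100_000, 2_000):.4f}")
```

Price is only one axis. A model's context window and output limit, where the listing gives them, still decide whether a request fits at all.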