Gemini Embedding 2: Google’s Multimodal AI Just Changed Search


Key Takeaways

  • Gemini Embedding 2 (gemini-embedding-2-preview) is Google’s first natively multimodal embedding model, released on March 10, 2026.
  • It maps text, images, video, audio, and PDF documents into a single unified embedding space — enabling true cross-modal search.
  • It supports over 100 languages and generates 3072-dimensional vectors by default.
  • It uses Matryoshka Representation Learning (MRL), allowing output dimensions to be scaled down to 128–3072 without significant quality loss.
  • It is available now via the Gemini API and Vertex AI in Public Preview.
  • It is incompatible with the previous gemini-embedding-001 model — existing data must be re-embedded when migrating.
  • It powers use cases including RAG systems, semantic search, classification, clustering, and anomaly detection — now across all media types, not just text.

What Is Gemini Embedding 2?

Gemini Embedding 2 is Google’s first natively multimodal AI embedding model that converts text, images, video, audio, and documents into numerical vectors within a single, unified embedding space.

Released on March 10, 2026, it is built on the Gemini architecture and represents a fundamental upgrade over its predecessor gemini-embedding-001, which was text-only. With Gemini Embedding 2, a search query typed in English can now retrieve a matching video clip, an image, or a PDF page — all using the same vector math.

The model ID is gemini-embedding-2-preview and it is available in Public Preview through both the Gemini API and Vertex AI.

Why Does This Matter?

Most embedding models today operate on a single modality. You need a separate model for text, a different one for images, another for audio. Gemini Embedding 2 collapses all of that into one model, one API call, and one shared vector space.

This means:

  • A text query can find the most relevant image, video clip, or audio segment
  • An image can be used to retrieve semantically similar documents
  • Mixed-media content (a slide deck with text + images) can be embedded in a single request

How Does Gemini Embedding 2 Work?

Gemini Embedding 2 converts any supported input — text, image, audio, video, or PDF — into a high-dimensional numerical vector that captures its semantic meaning, then places it into a shared mathematical space where similar concepts cluster together regardless of modality.

The Core Concept: Embedding Space

An embedding is a list of numbers (a vector) that represents the meaning of content. When two pieces of content are semantically similar — even if one is a text description and the other is an image — their vectors will be mathematically close in the embedding space.

Gemini Embedding 2 generates 3072-dimensional vectors by default. Each dimension captures a different aspect of meaning: topic, tone, context, visual content, acoustic properties, and more.
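
To make "mathematically close" concrete, here is a toy cosine similarity computation in plain Python. The four-dimensional vectors are invented for illustration only; real Gemini Embedding 2 vectors have up to 3072 dimensions and come from the API:

```python
import math

def cosine_similarity(a, b):
    # cosine(a, b) = dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" for illustration only.
text_vec = [0.9, 0.1, 0.3, 0.0]   # e.g. the text "a photo of a dog"
image_vec = [0.8, 0.2, 0.4, 0.1]  # e.g. an actual dog photo
other_vec = [0.0, 0.9, 0.0, 0.8]  # e.g. an unrelated audio clip

# The text and the matching image land close together; the unrelated clip does not.
print(cosine_similarity(text_vec, image_vec))  # high (close to 1.0)
print(cosine_similarity(text_vec, other_vec))  # low
```

In a shared multimodal space, this same comparison works regardless of which modality each vector came from.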

Matryoshka Representation Learning (MRL)

Gemini Embedding 2 is trained using MRL — a technique that “nests” information within the vector. This means:

  • You can truncate the 3072-dimensional output to a smaller size (e.g., 768 or 1536 dimensions)
  • Smaller vectors cost less to store and process
  • Performance degrades gracefully — 768 dimensions still achieves competitive quality

MTEB benchmark scores by dimension (text):

Dimension | MTEB Score
2048 | 68.16
1536 | 68.17
768 | 67.99
512 | 67.55
256 | 66.19
128 | 63.31

Recommended dimensions: 768, 1536, or 3072 for highest quality.

⚠️ Important: The 3072-dimension output is automatically normalized. For any smaller dimension (e.g., 768 or 1536), you must manually L2-normalize the vector before computing cosine similarity.
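
The manual normalization step can be sketched in plain Python. The short vector here is illustrative; a real truncated embedding would come from `result.embeddings`:

```python
import math

def l2_normalize(vec):
    # Scale the vector to unit length so cosine similarity
    # reduces to a simple dot product.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return vec
    return [x / norm for x in vec]

# Illustrative truncated embedding (a real 768-dim vector would come from the API).
truncated = [0.3, -0.4, 0.5, 0.1]
unit = l2_normalize(truncated)

# After normalization the vector has length 1.0.
print(math.sqrt(sum(x * x for x in unit)))
```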

Interleaved Multimodal Input

Unlike previous models that process one modality at a time, Gemini Embedding 2 natively understands interleaved input — meaning you can pass text + image + audio in a single API request, and it generates one aggregated embedding that captures the combined meaning.

What Can Gemini Embedding 2 Process?

Gemini Embedding 2 accepts five modality types: text, images, audio, video, and PDF documents — each with specific format and size limits.

Supported Modalities and Limits

Modality | Max Per Request | Max Duration / Size | Supported Formats
Text | 8,192 tokens | n/a | Any text
Images | 6 images | No file size limit | PNG, JPEG
Audio | 1 file | 80 seconds | MP3, WAV
Video | 1 file | 128 sec (no audio) / 80 sec (with audio) | MP4, MOV (H264, H265, AV1, VP9)
PDF Documents | 1 file | 6 pages | PDF

Total input limit across all modalities: 8,192 tokens per request.

Special Capabilities

  • Document OCR — The model reads and embeds text extracted from PDFs, not just the visual appearance.
  • Audio Track Extraction — When embedding video, the model can automatically extract and process the audio track alongside the visual frames — no manual preprocessing required.

Key Features: What’s New vs. gemini-embedding-001

Gemini Embedding 2 introduces four major features that its predecessor lacked: multimodal input, custom task instructions, adjustable output dimensions, and document OCR.

1. Multimodal Input

The most significant upgrade. gemini-embedding-001 was text-only. Gemini Embedding 2 handles all five modality types in one unified model.

2. Custom Task Instructions

You can now specify what you intend to do with the embedding. This helps the model optimize the vector for the specific task, increasing accuracy.

Supported task types:

Task Type | Use Case
SEMANTIC_SIMILARITY | Comparing two pieces of content for meaning closeness
CLASSIFICATION | Sentiment analysis, spam detection
CLUSTERING | Document organization, anomaly detection
RETRIEVAL_DOCUMENT | Indexing articles, books, web pages for search
RETRIEVAL_QUERY | User search queries
CODE_RETRIEVAL_QUERY | Finding code blocks from natural language queries
QUESTION_ANSWERING | Finding documents that answer a specific question
FACT_VERIFICATION | Retrieving evidence to verify a claim

3. Adjustable Output Dimensions

Use the output_dimensionality parameter to get a smaller, cheaper vector when full precision isn’t needed. Supported range: 128 to 3072.

4. Document OCR

Embed PDFs by processing their actual textual and visual content — not just metadata. The model reads and understands what’s on each page.

How to Use Gemini Embedding 2: Step-by-Step

Gemini Embedding 2 is available via the google-genai Python SDK, JavaScript SDK, REST API, and third-party integrations including LangChain, LlamaIndex, and ChromaDB.

Step 1: Install the SDK

pip install google-genai

Step 2: Set Up Your API Key

Get your API key from Google AI Studio.

import os

os.environ["GOOGLE_API_KEY"] = "your_api_key_here"

Step 3: Generate a Text Embedding

from google import genai

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="What is the meaning of life?"
)

print(result.embeddings)

Step 4: Embed an Image

from google import genai
from google.genai import types

client = genai.Client()

with open("example.png", "rb") as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[
        types.Part.from_bytes(
            data=image_bytes,
            mime_type="image/png",
        ),
    ]
)

print(result.embeddings)

Step 5: Embed Text + Image Together (Single Aggregated Vector)

from google import genai
from google.genai import types

client = genai.Client()

with open("dog.png", "rb") as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[
        types.Content(
            parts=[
                types.Part(text="An image of a dog"),
                types.Part.from_bytes(
                    data=image_bytes,
                    mime_type="image/png",
                )
            ]
        )
    ]
)

# Returns ONE aggregated embedding for both inputs
for embedding in result.embeddings:
    print(embedding.values)

Step 6: Use a Task Type for Better Accuracy

from google import genai
from google.genai import types

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="How do transformers work in NLP?",
    config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY")
)

Step 7: Control Output Dimensions

from google import genai
from google.genai import types

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="Semantic search example",
    config=types.EmbedContentConfig(output_dimensionality=768)
)

⚠️ Remember: Normalize the vector manually if using dimensions other than 3072.

Use Cases: What Can You Build With Gemini Embedding 2?

Gemini Embedding 2 enables any application that needs to find, compare, or organize information across mixed media types.

Multimodal RAG (Retrieval-Augmented Generation)

Build a knowledge base that includes text documents, images, and audio recordings. A user’s text question retrieves the most relevant content — regardless of what format it’s stored in.

  • Search a video library using a text description
  • Find images that match an audio description
  • Retrieve PDF pages using a photo as the query

Semantic Search Across 100+ Languages

Index content in any language; search in any other. The unified embedding space handles cross-lingual retrieval without translation.

Document Intelligence

Embed PDFs directly. No need to extract text first. The model reads and understands the content visually and textually, then places it in the vector space.

Classification and Sentiment Analysis

Embed incoming content (text, image, or mixed) and classify it against label embeddings. Works for spam detection, content moderation, and review analysis.
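
A minimal sketch of this nearest-label approach, using invented toy vectors in place of real label and content embeddings (in practice both would come from `embed_content` with `task_type="CLASSIFICATION"`):

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical precomputed label embeddings (toy 3-dim vectors for illustration).
label_embeddings = {
    "spam": [0.9, 0.1, 0.0],
    "not_spam": [0.1, 0.9, 0.2],
}

# Toy embedding of an incoming message.
incoming = [0.85, 0.15, 0.05]

# Classify by picking the label whose embedding is closest.
predicted = max(label_embeddings, key=lambda lbl: cosine(incoming, label_embeddings[lbl]))
print(predicted)  # spam
```

The same pattern extends to any label set, and because the embedding space is multimodal, the incoming item can be text, an image, or mixed content.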

Anomaly Detection

Embed operational logs, sensor data, or media assets. Flag items whose vectors are statistical outliers from the expected cluster.
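
A minimal outlier check might look like this; the toy two-dimensional vectors and the hand-picked distance threshold stand in for real embeddings and tuned parameters:

```python
import math

def distance(a, b):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings of log entries (toy 2-dim vectors for illustration).
embeddings = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 0.95], [5.0, 4.8]]

# Centroid of the cluster of "normal" items.
n = len(embeddings)
centroid = [sum(v[i] for v in embeddings) / n for i in range(len(embeddings[0]))]

# Flag items whose distance from the centroid exceeds a fixed threshold.
threshold = 2.0
anomalies = [i for i, v in enumerate(embeddings) if distance(v, centroid) > threshold]
print(anomalies)  # [4]
```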

Supported Vector Databases and Frameworks

Gemini Embedding 2 integrates natively with:

  • LangChain — docs.langchain.com
  • LlamaIndex — developers.llamaindex.ai
  • Haystack — haystack.deepset.ai
  • Weaviate — docs.weaviate.io
  • Qdrant — qdrant.tech
  • ChromaDB — docs.trychroma.com
  • Pinecone — via REST API
  • BigQuery, AlloyDB, Cloud SQL — via Google Cloud

Migrating from gemini-embedding-001

If you are currently using gemini-embedding-001, you cannot simply swap model names — the embedding spaces are mathematically incompatible.

What You Must Do

  1. Re-embed all existing data using gemini-embedding-2-preview
  2. Update your model ID in all API calls
  3. Update dimension handling — check if you need to normalize vectors for non-3072 outputs
  4. Update task type parameters if using the task type feature

What Stays the Same

  • The API call structure (embed_content method) is identical
  • The output_dimensionality parameter works the same way
  • Default output dimensions (3072) remain the same

✅ Batch processing tip: Use the Gemini API Batch Mode for re-embedding large datasets. Batch API runs at 50% of the standard embedding price, making large migrations cost-effective.

Pricing and Availability

Gemini Embedding 2 is available now in Public Preview through the Gemini API and Vertex AI, billed under Standard PayGo pricing.

Access Options

Platform | Access | Current Availability
Gemini API | Standard PayGo | ✅ Public Preview
Vertex AI | Standard PayGo | ✅ Public Preview (us-central1)
Vertex AI Provisioned Throughput | n/a | ❌ Not yet supported
Vertex AI Batch Prediction | n/a | ❌ Not yet supported

Batch API discount: If latency is not critical, use the Gemini API Batch Mode for 50% cost savings on large embedding jobs.

Knowledge Cutoff

The model’s training knowledge cutoff is November 2025.

Troubleshooting Common Issues

“My cosine similarity scores are unexpected”

Solution: Check whether you’re using 3072 dimensions (auto-normalized) or a smaller dimension (requires manual normalization). Non-3072 vectors must be L2-normalized before cosine similarity calculations.

“I’m getting different results than with gemini-embedding-001”

Expected behavior. The two models use incompatible embedding spaces. You must re-embed all your documents with the new model before comparing results. Do not mix embeddings from the two models.

“My video embedding seems incomplete”

Solution: Videos longer than 128 seconds are not fully processed in a single request. Chunk your video into overlapping segments of ≤128 seconds and embed each segment individually.
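
Computing such segment boundaries can be sketched as follows; the 8-second overlap is an arbitrary choice for illustration, not a documented recommendation:

```python
def chunk_segments(total_seconds, max_len=128, overlap=8):
    # Split a video into overlapping (start, end) segments of at most
    # max_len seconds. Use max_len=80 if the audio track is embedded too.
    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_len, total_seconds)
        segments.append((start, end))
        if end == total_seconds:
            break
        start = end - overlap  # overlap so no content falls on a boundary
    return segments

print(chunk_segments(300))  # [(0, 128), (120, 248), (240, 300)]
```

Each segment is then embedded in its own request, and the resulting vectors are indexed individually.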

“The model isn’t available in my region”

Current limitation. During Public Preview, Gemini Embedding 2 on Vertex AI is only available in us-central1. Check the Vertex AI locations page for regional expansion updates.

“Embedding audio from a video is failing”

Solution: The model supports audio track extraction from video natively, but the video must be ≤80 seconds when audio is included (the limit drops from 128 to 80 seconds when the audio track is processed).

BONUS: Turn Your AI-Retrieved Content Into Full Video with Gaga AI

Gemini Embedding 2 helps you find and organize content. Gaga AI helps you present it.

Once Gemini Embedding 2 powers your search or RAG pipeline, the natural next step for many creators and businesses is turning that retrieved content into something people actually watch. Gaga AI is an all-in-one AI video creation platform purpose-built for that exact workflow.

What Gaga AI Offers

Image to Video AI

Convert any retrieved image or static asset into a dynamic video clip. Perfect for:

  • Turning Gemini Embedding 2-retrieved product images into demo videos
  • Animating retrieved visual search results for social media
  • Building preview clips from image archives

Video and Audio Infusion

Don’t just generate video — synchronize it with audio intelligently:

  • Layer retrieved audio content over video clips with precise timing
  • Add AI-generated background music that adapts to video mood
  • Balance voiceover, music, and sound effects in one step
  • Sync visual transitions to beat detection automatically

This is especially powerful when combined with Gemini Embedding 2’s audio retrieval — find the right audio, then use Gaga AI to infuse it into the final video output.

AI Avatar

Create a photorealistic AI presenter that can deliver your retrieved content on camera — without you ever recording a video:

  • Presenting search results or RAG-generated summaries as talking-head videos
  • Narrating multimodal content retrieved by Gemini Embedding 2
  • Building branded video spokespeople for product or documentation pages
  • Multilingual video delivery: same avatar, multiple languages

AI Voice Clone

Record a brief voice sample and Gaga AI builds a digital clone of your voice:

  • Narrate AI-retrieved content in your own voice consistently
  • Localize content rapidly — clone once, speak in any language
  • Generate podcast-style audio summaries of search results
  • Maintain a consistent voice identity across all video content

Text-to-Speech (TTS)

Skip the voice recording entirely with Gaga AI’s high-quality TTS engine:

  • Natural-sounding voices in multiple languages and accents
  • Emotional tone control: neutral, professional, warm, energetic
  • SSML support for fine-grained pacing and emphasis
  • Adjustable speed, pitch, and style per script segment

A Practical Gemini Embedding 2 + Gaga AI Workflow

  1. Index your content library (text, images, audio, video) using Gemini Embedding 2
  2. Retrieve the most semantically relevant assets via cross-modal search
  3. Animate retrieved images into video clips using Gaga AI
  4. Infuse retrieved audio into the video with Gaga AI’s audio layer
  5. Add an AI Avatar to present the results or narrate the summary
  6. Voice it with TTS or your voice clone for the final narration
  7. Publish the finished video to YouTube, LinkedIn, or TikTok

This pipeline takes raw multimodal data, surfaces the right content with Gemini Embedding 2’s semantic intelligence, and wraps it in a production-ready video with Gaga AI — end to end, without a camera crew or editor.

Frequently Asked Questions (FAQ)

What is Gemini Embedding 2?

Gemini Embedding 2 (gemini-embedding-2-preview) is Google’s first natively multimodal embedding model. Released on March 10, 2026, it converts text, images, video, audio, and PDF documents into numerical vectors within a single unified embedding space, enabling cross-modal semantic search and retrieval.

What makes Gemini Embedding 2 different from other embedding models?

Most embedding models are text-only or single-modality. Gemini Embedding 2 is natively multimodal — it maps all five media types into the same mathematical space using one model. It also supports custom task types, adjustable output dimensions via MRL, document OCR, and audio extraction from video.

Is Gemini Embedding 2 free?

Gemini Embedding 2 is available via Standard PayGo pricing on both the Gemini API and Vertex AI. There is no free tier listed in the Public Preview documentation. A 50% discount is available when using the Gemini API Batch Mode for non-latency-sensitive jobs.

Can I use gemini-embedding-2-preview to replace gemini-embedding-001?

No — not directly. The two models produce vectors in incompatible embedding spaces. If you switch, you must re-embed all existing documents and data using the new model before running any similarity comparisons.

What languages does Gemini Embedding 2 support?

Gemini Embedding 2 captures semantic intent across more than 100 languages, enabling cross-lingual retrieval — a query in one language can retrieve semantically matching content in another.

What are the input limits for Gemini Embedding 2?

The total input limit is 8,192 tokens. Per-modality limits: text (8,192 tokens), images (6 per request, PNG/JPEG), audio (1 file, max 80 seconds, MP3/WAV), video (1 file, max 128 seconds without audio / 80 seconds with audio, MP4/MOV), PDF (1 file, max 6 pages).

What output dimensions does Gemini Embedding 2 support?

The default output is a 3,072-dimensional vector. Using MRL, you can reduce this to any size between 128 and 3,072. Google recommends using 768, 1,536, or 3,072 for best quality. Vectors smaller than 3,072 must be manually L2-normalized before use.

Can I embed video and audio together in one request?

Yes. Gemini Embedding 2 supports interleaved multimodal input. You can pass text, image, audio, and video parts within a single request. When submitted as a single Content entry, it returns one aggregated embedding.

Where is Gemini Embedding 2 available?

It is available via the Gemini API globally and via Vertex AI in the us-central1 region during Public Preview. Broader regional availability is expected as the model moves toward General Availability.

What can I build with Gemini Embedding 2?

Key use cases include: multimodal RAG systems, cross-modal semantic search, document intelligence pipelines, content classification and moderation, multilingual search, clustering and anomaly detection, and recommendation engines that work across text, image, audio, and video content.

Turn Your Ideas Into a Masterpiece

Discover how Gaga AI delivers perfect lip-sync and nuanced emotional performances.