
Key Takeaways
- Gemini Embedding 2 (gemini-embedding-2-preview) is Google’s first natively multimodal embedding model, released on March 10, 2026.
- It maps text, images, video, audio, and PDF documents into a single unified embedding space — enabling true cross-modal search.
- It supports over 100 languages and generates 3072-dimensional vectors by default.
- It uses Matryoshka Representation Learning (MRL), allowing output dimensions to be scaled down to 128–3072 without significant quality loss.
- It is available now via the Gemini API and Vertex AI in Public Preview.
- It is incompatible with the previous gemini-embedding-001 model — existing data must be re-embedded when migrating.
- It powers use cases including RAG systems, semantic search, classification, clustering, and anomaly detection — now across all media types, not just text.
What Is Gemini Embedding 2?
Gemini Embedding 2 is Google’s first natively multimodal AI embedding model that converts text, images, video, audio, and documents into numerical vectors within a single, unified embedding space.
Released on March 10, 2026, it is built on the Gemini architecture and represents a fundamental upgrade over its predecessor gemini-embedding-001, which was text-only. With Gemini Embedding 2, a search query typed in English can now retrieve a matching video clip, an image, or a PDF page — all using the same vector math.
The model ID is gemini-embedding-2-preview and it is available in Public Preview through both the Gemini API and Vertex AI.
Why Does This Matter?
Most embedding models today operate on a single modality. You need a separate model for text, a different one for images, another for audio. Gemini Embedding 2 collapses all of that into one model, one API call, and one shared vector space.
This means:
- A text query can find the most relevant image, video clip, or audio segment
- An image can be used to retrieve semantically similar documents
- Mixed-media content (a slide deck with text + images) can be embedded in a single request
How Does Gemini Embedding 2 Work?
Gemini Embedding 2 converts any supported input — text, image, audio, video, or PDF — into a high-dimensional numerical vector that captures its semantic meaning, then places it into a shared mathematical space where similar concepts cluster together regardless of modality.
The Core Concept: Embedding Space
An embedding is a list of numbers (a vector) that represents the meaning of content. When two pieces of content are semantically similar — even if one is a text description and the other is an image — their vectors will be mathematically close in the embedding space.
Gemini Embedding 2 generates 3072-dimensional vectors by default. Each dimension captures a different aspect of meaning: topic, tone, context, visual content, acoustic properties, and more.
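The idea that "close vectors mean similar content" comes down to cosine similarity. Here is a minimal sketch with toy 4-dimensional vectors standing in for real embeddings (which the API returns at 3072 dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings.
dog_text = [0.9, 0.1, 0.0, 0.2]   # "a photo of a dog" (text)
dog_image = [0.8, 0.2, 0.1, 0.3]  # a dog image
invoice = [0.0, 0.1, 0.9, 0.7]    # an invoice PDF

print(cosine_similarity(dog_text, dog_image))  # high: same concept, different modality
print(cosine_similarity(dog_text, invoice))    # low: unrelated content
```

The cross-modal promise is exactly this: the text vector and the image vector of the same concept score high against each other, while unrelated content scores low.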
Matryoshka Representation Learning (MRL)
Gemini Embedding 2 is trained using MRL — a technique that “nests” information within the vector. This means:
- You can truncate the 3072-dimensional output to a smaller size (e.g., 768 or 1536 dimensions)
- Smaller vectors cost less to store and process
- Performance degrades gracefully — 768 dimensions still achieves competitive quality
MTEB benchmark scores by dimension (text):
| Dimension | MTEB Score |
| --- | --- |
| 2048 | 68.16 |
| 1536 | 68.17 |
| 768 | 67.99 |
| 512 | 67.55 |
| 256 | 66.19 |
| 128 | 63.31 |
Recommended dimensions: 768, 1536, or 3072 for highest quality.
⚠️ Important: The 3072-dimension output is automatically normalized. For any smaller dimension (such as 768 or 1536), you must manually L2-normalize the vector before cosine similarity calculations.
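Truncating and re-normalizing an MRL embedding is a two-step operation. A minimal sketch (the 3072-value list here is simulated, not a real API response):

```python
import math

def truncate_and_normalize(embedding, dim):
    # MRL nests the most important information first, so truncation keeps
    # the leading `dim` values, then re-scales to unit length (L2 norm).
    truncated = embedding[:dim]
    norm = math.sqrt(sum(v * v for v in truncated))
    return [v / norm for v in truncated]

full = [0.01 * (i % 7 - 3) for i in range(3072)]  # stand-in for a real embedding
small = truncate_and_normalize(full, 768)

print(len(small))                  # 768
print(sum(v * v for v in small))   # ~1.0 after normalization
```

After this step the truncated vector is safe to use in cosine similarity calculations alongside other normalized vectors.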
Interleaved Multimodal Input
Unlike previous models that process one modality at a time, Gemini Embedding 2 natively understands interleaved input — meaning you can pass text + image + audio in a single API request, and it generates one aggregated embedding that captures the combined meaning.
What Can Gemini Embedding 2 Process?
Gemini Embedding 2 accepts five modality types: text, images, audio, video, and PDF documents — each with specific format and size limits.
Supported Modalities and Limits
| Modality | Max Per Request | Max Duration / Size | Supported Formats |
| --- | --- | --- | --- |
| Text | — | 8,192 tokens | Any text |
| Images | 6 images | No file size limit | PNG, JPEG |
| Audio | 1 file | 80 seconds | MP3, WAV |
| Video | 1 file | 128 sec (no audio) / 80 sec (with audio) | MP4, MOV (H264, H265, AV1, VP9) |
| PDF Documents | 1 file | 6 pages | PDF |
Total input limit across all modalities: 8,192 tokens per request.
Special Capabilities
- Document OCR — The model reads and embeds text extracted from PDFs, not just the visual appearance.
- Audio Track Extraction — When embedding video, the model can automatically extract and process the audio track alongside the visual frames — no manual preprocessing required.
Key Features: What’s New vs. gemini-embedding-001
Gemini Embedding 2 introduces four major features that its predecessor lacked: multimodal input, custom task instructions, adjustable output dimensions, and document OCR.
1. Multimodal Input
The most significant upgrade. gemini-embedding-001 was text-only. Gemini Embedding 2 handles all five modality types in one unified model.
2. Custom Task Instructions
You can now specify what you intend to do with the embedding. This helps the model optimize the vector for the specific task, increasing accuracy.
Supported task types:
| Task Type | Use Case |
| --- | --- |
| SEMANTIC_SIMILARITY | Comparing two pieces of content for meaning closeness |
| CLASSIFICATION | Sentiment analysis, spam detection |
| CLUSTERING | Document organization, anomaly detection |
| RETRIEVAL_DOCUMENT | Indexing articles, books, web pages for search |
| RETRIEVAL_QUERY | User search queries |
| CODE_RETRIEVAL_QUERY | Finding code blocks from natural language queries |
| QUESTION_ANSWERING | Finding documents that answer a specific question |
| FACT_VERIFICATION | Retrieving evidence to verify a claim |
3. Adjustable Output Dimensions
Use the output_dimensionality parameter to get a smaller, cheaper vector when full precision isn’t needed. Supported range: 128 to 3072.
4. Document OCR
Embed PDFs by processing their actual textual and visual content — not just metadata. The model reads and understands what’s on each page.
How to Use Gemini Embedding 2: Step-by-Step
Gemini Embedding 2 is available via the google-genai Python SDK, JavaScript SDK, REST API, and third-party integrations including LangChain, LlamaIndex, and ChromaDB.
Step 1: Install the SDK
```
pip install google-genai
```
Step 2: Set Up Your API Key
Get your API key from Google AI Studio.
```python
import os

os.environ["GOOGLE_API_KEY"] = "your_api_key_here"
```
Step 3: Generate a Text Embedding
```python
from google import genai

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="What is the meaning of life?"
)
print(result.embeddings)
```
Step 4: Embed an Image
```python
from google import genai
from google.genai import types

client = genai.Client()

with open("example.png", "rb") as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[
        types.Part.from_bytes(
            data=image_bytes,
            mime_type="image/png",
        ),
    ],
)
print(result.embeddings)
```
Step 5: Embed Text + Image Together (Single Aggregated Vector)
```python
from google import genai
from google.genai import types

client = genai.Client()

with open("dog.png", "rb") as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[
        types.Content(
            parts=[
                types.Part(text="An image of a dog"),
                types.Part.from_bytes(
                    data=image_bytes,
                    mime_type="image/png",
                ),
            ]
        )
    ],
)

# Returns ONE aggregated embedding for both inputs
for embedding in result.embeddings:
    print(embedding.values)
```
Step 6: Use a Task Type for Better Accuracy
```python
from google import genai
from google.genai import types

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="How do transformers work in NLP?",
    config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY"),
)
```
Step 7: Control Output Dimensions
```python
from google import genai
from google.genai import types

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="Semantic search example",
    config=types.EmbedContentConfig(output_dimensionality=768),
)
```
⚠️ Remember: Normalize the vector manually if using dimensions other than 3072.
Use Cases: What Can You Build With Gemini Embedding 2?
Gemini Embedding 2 enables any application that needs to find, compare, or organize information across mixed media types.
Multimodal RAG (Retrieval-Augmented Generation)
Build a knowledge base that includes text documents, images, and audio recordings. A user’s text question retrieves the most relevant content — regardless of what format it’s stored in.
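Once every asset is embedded, the retrieval step of a RAG pipeline reduces to a nearest-neighbor search over the stored vectors. A minimal in-memory sketch with placeholder 3-dimensional vectors and hypothetical asset names (a production system would use a vector database and real API embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Stand-ins for embeddings returned by embed_content; keys name the stored assets.
knowledge_base = {
    "onboarding.pdf": [0.9, 0.2, 0.1],
    "demo_video.mp4": [0.7, 0.6, 0.2],
    "logo.png": [0.1, 0.1, 0.9],
}

def retrieve(query_embedding, top_k=2):
    # Rank every stored asset by similarity to the query, regardless of modality.
    ranked = sorted(knowledge_base.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

print(retrieve([0.8, 0.3, 0.1]))  # ['onboarding.pdf', 'demo_video.mp4']
```

The retrieved asset names (or their contents) would then be passed to a generative model as context.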
Cross-Modal Search
- Search a video library using a text description
- Find images that match an audio description
- Retrieve PDF pages using a photo as the query
Semantic Search Across 100+ Languages
Index content in any language; search in any other. The unified embedding space handles cross-lingual retrieval without translation.
Document Intelligence
Embed PDFs directly. No need to extract text first. The model reads and understands the content visually and textually, then places it in the vector space.
Classification and Sentiment Analysis
Embed incoming content (text, image, or mixed) and classify it against label embeddings. Works for spam detection, content moderation, and review analysis.
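Classification against label embeddings amounts to picking the label whose vector is closest to the input's. A sketch with toy 2-dimensional vectors (real label embeddings would be generated once with task_type="CLASSIFICATION" and cached):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy label embeddings; in practice, embed each label description once.
labels = {
    "spam": [0.9, 0.1],
    "not_spam": [0.1, 0.9],
}

def classify(content_embedding):
    # Pick the label whose embedding is most similar to the content's.
    return max(labels, key=lambda name: cosine(content_embedding, labels[name]))

print(classify([0.8, 0.2]))  # spam
```

Because the input embedding can come from text, an image, or mixed media, the same classifier covers all modalities.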
Anomaly Detection
Embed operational logs, sensor data, or media assets. Flag items whose vectors are statistical outliers from the expected cluster.
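Outlier flagging can be as simple as measuring each vector's distance from the cluster centroid and thresholding. A sketch with toy 2-dimensional vectors (a real threshold would be tuned on your own data):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy embeddings of routine log lines, plus one unusual entry.
embeddings = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [5.0, 5.0]]

# Centroid (mean) of all points.
centroid = [sum(v[i] for v in embeddings) / len(embeddings) for i in range(2)]

# Flag vectors whose distance from the centroid exceeds a threshold.
threshold = 2.0
anomalies = [v for v in embeddings if euclidean(v, centroid) > threshold]

print(anomalies)  # [[5.0, 5.0]]
```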
Supported Vector Databases and Frameworks
Gemini Embedding 2 integrates natively with:
- LangChain — docs.langchain.com
- LlamaIndex — developers.llamaindex.ai
- Haystack — haystack.deepset.ai
- Weaviate — docs.weaviate.io
- Qdrant — qdrant.tech
- ChromaDB — docs.trychroma.com
- Pinecone — via REST API
- BigQuery, AlloyDB, Cloud SQL — via Google Cloud
Migrating from gemini-embedding-001
If you are currently using gemini-embedding-001, you cannot simply swap model names — the embedding spaces are mathematically incompatible.
What You Must Do
- Re-embed all existing data using gemini-embedding-2-preview
- Update your model ID in all API calls
- Update dimension handling — check if you need to normalize vectors for non-3072 outputs
- Update task type parameters if using the task type feature
What Stays the Same
- The API call structure (embed_content method) is identical
- The output_dimensionality parameter works the same way
- Default output dimensions (3072) remain the same
✅ Batch processing tip: Use the Gemini API Batch Mode for re-embedding large datasets. Batch API runs at 50% of the standard embedding price, making large migrations cost-effective.
Pricing and Availability
Gemini Embedding 2 is available now in Public Preview through the Gemini API and Vertex AI, billed under Standard PayGo pricing.
Access Options
| Platform | Access | Current Availability |
| --- | --- | --- |
| Gemini API | Standard PayGo | ✅ Public Preview |
| Vertex AI | Standard PayGo | ✅ Public Preview (us-central1) |
| Vertex AI Provisioned Throughput | — | ❌ Not yet supported |
| Vertex AI Batch Prediction | — | ❌ Not yet supported |
Batch API discount: If latency is not critical, use the Gemini API Batch Mode for 50% cost savings on large embedding jobs.
Knowledge Cutoff
The model’s training knowledge cutoff is November 2025.
Troubleshooting Common Issues
“My cosine similarity scores are unexpected”
Solution: Check whether you’re using 3072 dimensions (auto-normalized) or a smaller dimension (requires manual normalization). Non-3072 vectors must be L2-normalized before cosine similarity calculations.
“I’m getting different results than with gemini-embedding-001”
Expected behavior. The two models use incompatible embedding spaces. You must re-embed all your documents with the new model before comparing results. Do not mix embeddings from the two models.
“My video embedding seems incomplete”
Solution: Videos longer than 128 seconds are not fully processed in a single request. Chunk your video into overlapping segments of ≤128 seconds and embed each segment individually.
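Computing the segment boundaries for chunking is straightforward. A sketch that splits a duration into windows of at most 128 seconds with a fixed overlap (the actual cutting would be done with a tool such as ffmpeg; the 8-second overlap is an illustrative choice, not a documented requirement):

```python
def video_segments(duration_sec, max_len=128, overlap=8):
    # Return (start, end) windows of at most max_len seconds, each overlapping
    # the previous one by `overlap` seconds so no content is lost at boundaries.
    segments = []
    start = 0
    while start < duration_sec:
        end = min(start + max_len, duration_sec)
        segments.append((start, end))
        if end == duration_sec:
            break
        start = end - overlap
    return segments

print(video_segments(300))  # [(0, 128), (120, 248), (240, 300)]
```

Each window is then cut from the source file and embedded individually; remember to drop max_len to 80 if the audio track is included.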
“The model isn’t available in my region”
Current limitation. During Public Preview, Gemini Embedding 2 on Vertex AI is only available in us-central1. Check the Vertex AI locations page for regional expansion updates.
“Embedding audio from a video is failing”
Solution: The model supports audio track extraction from video natively, but the video must be ≤80 seconds when audio is included (the limit drops from 128 to 80 seconds when the audio track is processed).
BONUS: Turn Your AI-Retrieved Content Into Full Video with Gaga AI
Gemini Embedding 2 helps you find and organize content. Gaga AI helps you present it.

Once Gemini Embedding 2 powers your search or RAG pipeline, the natural next step for many creators and businesses is turning that retrieved content into something people actually watch. Gaga AI is an all-in-one AI video creation platform purpose-built for that exact workflow.
What Gaga AI Offers
Image to Video AI
Convert any retrieved image or static asset into a dynamic video clip. Perfect for:
- Turning Gemini Embedding 2-retrieved product images into demo videos
- Animating retrieved visual search results for social media
- Building preview clips from image archives
Video and Audio Infusion
Don’t just generate video — synchronize it with audio intelligently:
- Layer retrieved audio content over video clips with precise timing
- Add AI-generated background music that adapts to video mood
- Balance voiceover, music, and sound effects in one step
- Sync visual transitions to beat detection automatically
This is especially powerful when combined with Gemini Embedding 2’s audio retrieval — find the right audio, then use Gaga AI to infuse it into the final video output.
AI Avatar
Create a photorealistic AI presenter that can deliver your retrieved content on camera — without you ever recording a video:
- Presenting search results or RAG-generated summaries as talking-head videos
- Narrating multimodal content retrieved by Gemini Embedding 2
- Building branded video spokespeople for product or documentation pages
- Multilingual video delivery: same avatar, multiple languages
AI Voice Clone
Record a brief voice sample and Gaga AI builds a digital clone of your voice:
- Narrate AI-retrieved content in your own voice consistently
- Localize content rapidly — clone once, speak in any language
- Generate podcast-style audio summaries of search results
- Maintain a consistent voice identity across all video content
Text-to-Speech (TTS)
Skip the voice recording entirely with Gaga AI’s high-quality TTS engine:
- Natural-sounding voices in multiple languages and accents
- Emotional tone control: neutral, professional, warm, energetic
- SSML support for fine-grained pacing and emphasis
- Adjustable speed, pitch, and style per script segment
A Practical Gemini Embedding 2 + Gaga AI Workflow
- Index your content library (text, images, audio, video) using Gemini Embedding 2
- Retrieve the most semantically relevant assets via cross-modal search
- Animate retrieved images into video clips using Gaga AI
- Infuse retrieved audio into the video with Gaga AI’s audio layer
- Add an AI Avatar to present the results or narrate the summary
- Voice it with TTS or your voice clone for the final narration
- Publish the finished video to YouTube, LinkedIn, or TikTok
This pipeline takes raw multimodal data, surfaces the right content with Gemini Embedding 2’s semantic intelligence, and wraps it in a production-ready video with Gaga AI — end to end, without a camera crew or editor.
Frequently Asked Questions (FAQ)
What is Gemini Embedding 2?
Gemini Embedding 2 (gemini-embedding-2-preview) is Google’s first natively multimodal embedding model. Released on March 10, 2026, it converts text, images, video, audio, and PDF documents into numerical vectors within a single unified embedding space, enabling cross-modal semantic search and retrieval.
What makes Gemini Embedding 2 different from other embedding models?
Most embedding models are text-only or single-modality. Gemini Embedding 2 is natively multimodal — it maps all five media types into the same mathematical space using one model. It also supports custom task types, adjustable output dimensions via MRL, document OCR, and audio extraction from video.
Is Gemini Embedding 2 free?
Gemini Embedding 2 is available via Standard PayGo pricing on both the Gemini API and Vertex AI. There is no free tier listed in the Public Preview documentation. A 50% discount is available when using the Gemini API Batch Mode for non-latency-sensitive jobs.
Can I use gemini-embedding-2-preview to replace gemini-embedding-001?
No — not directly. The two models produce vectors in incompatible embedding spaces. If you switch, you must re-embed all existing documents and data using the new model before running any similarity comparisons.
What languages does Gemini Embedding 2 support?
Gemini Embedding 2 captures semantic intent across more than 100 languages, enabling cross-lingual retrieval — a query in one language can retrieve semantically matching content in another.
What are the input limits for Gemini Embedding 2?
The total input limit is 8,192 tokens. Per-modality limits: text (8,192 tokens), images (6 per request, PNG/JPEG), audio (1 file, max 80 seconds, MP3/WAV), video (1 file, max 128 seconds without audio / 80 seconds with audio, MP4/MOV), PDF (1 file, max 6 pages).
What output dimensions does Gemini Embedding 2 support?
The default output is a 3,072-dimensional vector. Using MRL, you can reduce this to any size between 128 and 3,072. Google recommends using 768, 1,536, or 3,072 for best quality. Vectors smaller than 3,072 must be manually L2-normalized before use.
Can I embed video and audio together in one request?
Yes. Gemini Embedding 2 supports interleaved multimodal input. You can pass text, image, audio, and video parts within a single request. When submitted as a single Content entry, it returns one aggregated embedding.
Where is Gemini Embedding 2 available?
It is available via the Gemini API globally and via Vertex AI in the us-central1 region during Public Preview. Broader regional availability is expected as the model moves toward General Availability.
What can I build with Gemini Embedding 2?
Key use cases include: multimodal RAG systems, cross-modal semantic search, document intelligence pipelines, content classification and moderation, multilingual search, clustering and anomaly detection, and recommendation engines that work across text, image, audio, and video content.