Gemini Embedding 2: Google’s Multimodal AI Just Changed Search


Key Takeaways

  • Gemini Embedding 2 (gemini-embedding-2-preview) is Google’s first natively multimodal embedding model, released on March 10, 2026.
  • It maps text, images, video, audio, and PDF documents into a single unified embedding space — enabling true cross-modal search.
  • It supports over 100 languages and generates 3072-dimensional vectors by default.
  • It uses Matryoshka Representation Learning (MRL), allowing output dimensions to be scaled down to 128–3072 without significant quality loss.
  • It is available now via the Gemini API and Vertex AI in Public Preview.
  • It is incompatible with the previous gemini-embedding-001 model — existing data must be re-embedded when migrating.
  • It powers use cases including RAG systems, semantic search, classification, clustering, and anomaly detection — now across all media types, not just text.

What Is Gemini Embedding 2?

Gemini Embedding 2 is Google’s first natively multimodal AI embedding model that converts text, images, video, audio, and documents into numerical vectors within a single, unified embedding space.

Released on March 10, 2026, it is built on the Gemini architecture and represents a fundamental upgrade over its predecessor gemini-embedding-001, which was text-only. With Gemini Embedding 2, a search query typed in English can now retrieve a matching video clip, an image, or a PDF page — all using the same vector math.

The model ID is gemini-embedding-2-preview and it is available in Public Preview through both the Gemini API and Vertex AI.

Why Does This Matter?

Most embedding models today operate on a single modality. You need a separate model for text, a different one for images, another for audio. Gemini Embedding 2 collapses all of that into one model, one API call, and one shared vector space.

This means:

  • A text query can find the most relevant image, video clip, or audio segment
  • An image can be used to retrieve semantically similar documents
  • Mixed-media content (a slide deck with text + images) can be embedded in a single request

How Does Gemini Embedding 2 Work?

Gemini Embedding 2 converts any supported input — text, image, audio, video, or PDF — into a high-dimensional numerical vector that captures its semantic meaning, then places it into a shared mathematical space where similar concepts cluster together regardless of modality.

The Core Concept: Embedding Space

An embedding is a list of numbers (a vector) that represents the meaning of content. When two pieces of content are semantically similar — even if one is a text description and the other is an image — their vectors will be mathematically close in the embedding space.

Gemini Embedding 2 generates 3072-dimensional vectors by default. Each dimension captures a different aspect of meaning: topic, tone, context, visual content, acoustic properties, and more.
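
To make "mathematically close" concrete, here is a toy cosine similarity computation in plain Python. The four-dimensional vectors are invented for illustration only; real Gemini Embedding 2 vectors have up to 3072 dimensions and come from the API:

```python
import math

def cosine_similarity(a, b):
    # cosine(a, b) = dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" for illustration only.
text_vec = [0.9, 0.1, 0.3, 0.0]   # e.g. the text "a photo of a dog"
image_vec = [0.8, 0.2, 0.4, 0.1]  # e.g. an actual dog photo
other_vec = [0.0, 0.9, 0.0, 0.8]  # e.g. an unrelated audio clip

# The text and the matching image land close together; the unrelated clip does not.
print(cosine_similarity(text_vec, image_vec))  # high (close to 1.0)
print(cosine_similarity(text_vec, other_vec))  # low
```

In a shared multimodal space, this same comparison works regardless of which modality each vector came from.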

Matryoshka Representation Learning (MRL)

Gemini Embedding 2 is trained using MRL — a technique that “nests” information within the vector. This means:

  • You can truncate the 3072-dimensional output to a smaller size (e.g., 768 or 1536 dimensions)
  • Smaller vectors cost less to store and process
  • Performance degrades gracefully — 768 dimensions still achieves competitive quality

MTEB benchmark scores by dimension (text):

Dimension | MTEB Score
2048 | 68.16
1536 | 68.17
768 | 67.99
512 | 67.55
256 | 66.19
128 | 63.31

Recommended dimensions: 768, 1536, or 3072 for highest quality.

⚠️ Important: The 3072-dimension output is automatically normalized. For any smaller dimension (e.g., 768 or 1536), you must manually L2-normalize the vector before computing cosine similarity.
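
The manual normalization step can be sketched in plain Python. The short vector here is illustrative; a real truncated embedding would come from `result.embeddings`:

```python
import math

def l2_normalize(vec):
    # Scale the vector to unit length so cosine similarity
    # reduces to a simple dot product.
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return vec
    return [x / norm for x in vec]

# Illustrative truncated embedding (a real 768-dim vector would come from the API).
truncated = [0.3, -0.4, 0.5, 0.1]
unit = l2_normalize(truncated)

# After normalization the vector has length 1.0.
print(math.sqrt(sum(x * x for x in unit)))
```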

Interleaved Multimodal Input

Unlike previous models that process one modality at a time, Gemini Embedding 2 natively understands interleaved input — meaning you can pass text + image + audio in a single API request, and it generates one aggregated embedding that captures the combined meaning.

What Can Gemini Embedding 2 Process?

Gemini Embedding 2 accepts five modality types: text, images, audio, video, and PDF documents — each with specific format and size limits.

Supported Modalities and Limits

Modality | Max Per Request | Max Duration / Size | Supported Formats
Text | 8,192 tokens | n/a | Any text
Images | 6 images | No file size limit | PNG, JPEG
Audio | 1 file | 80 seconds | MP3, WAV
Video | 1 file | 128 sec (no audio) / 80 sec (with audio) | MP4, MOV (H264, H265, AV1, VP9)
PDF Documents | 1 file | 6 pages | PDF

Total input limit across all modalities: 8,192 tokens per request.

Special Capabilities

  • Document OCR — The model reads and embeds text extracted from PDFs, not just the visual appearance.
  • Audio Track Extraction — When embedding video, the model can automatically extract and process the audio track alongside the visual frames — no manual preprocessing required.

Key Features: What’s New vs. gemini-embedding-001

Gemini Embedding 2 introduces four major features that its predecessor lacked: multimodal input, custom task instructions, adjustable output dimensions, and document OCR.

1. Multimodal Input

The most significant upgrade. gemini-embedding-001 was text-only. Gemini Embedding 2 handles all five modality types in one unified model.

2. Custom Task Instructions

You can now specify what you intend to do with the embedding. This helps the model optimize the vector for the specific task, increasing accuracy.

Supported task types:

Task Type | Use Case
SEMANTIC_SIMILARITY | Comparing two pieces of content for meaning closeness
CLASSIFICATION | Sentiment analysis, spam detection
CLUSTERING | Document organization, anomaly detection
RETRIEVAL_DOCUMENT | Indexing articles, books, web pages for search
RETRIEVAL_QUERY | User search queries
CODE_RETRIEVAL_QUERY | Finding code blocks from natural language queries
QUESTION_ANSWERING | Finding documents that answer a specific question
FACT_VERIFICATION | Retrieving evidence to verify a claim

3. Adjustable Output Dimensions

Use the output_dimensionality parameter to get a smaller, cheaper vector when full precision isn’t needed. Supported range: 128 to 3072.

4. Document OCR

Embed PDFs by processing their actual textual and visual content — not just metadata. The model reads and understands what’s on each page.

How to Use Gemini Embedding 2: Step-by-Step

Gemini Embedding 2 is available via the google-genai Python SDK, JavaScript SDK, REST API, and third-party integrations including LangChain, LlamaIndex, and ChromaDB.

Step 1: Install the SDK

pip install google-genai

Step 2: Set Up Your API Key

Get your API key from Google AI Studio.

import os

os.environ["GOOGLE_API_KEY"] = "your_api_key_here"

Step 3: Generate a Text Embedding

from google import genai

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="What is the meaning of life?"
)

print(result.embeddings)

Step 4: Embed an Image

from google import genai
from google.genai import types

client = genai.Client()

with open("example.png", "rb") as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[
        types.Part.from_bytes(
            data=image_bytes,
            mime_type="image/png",
        ),
    ]
)

print(result.embeddings)

Step 5: Embed Text + Image Together (Single Aggregated Vector)

from google import genai
from google.genai import types

client = genai.Client()

with open("dog.png", "rb") as f:
    image_bytes = f.read()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents=[
        types.Content(
            parts=[
                types.Part(text="An image of a dog"),
                types.Part.from_bytes(
                    data=image_bytes,
                    mime_type="image/png",
                )
            ]
        )
    ]
)

# Returns ONE aggregated embedding for both inputs
for embedding in result.embeddings:
    print(embedding.values)

Step 6: Use a Task Type for Better Accuracy

from google import genai
from google.genai import types

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="How do transformers work in NLP?",
    config=types.EmbedContentConfig(task_type="RETRIEVAL_QUERY")
)

Step 7: Control Output Dimensions

from google import genai
from google.genai import types

client = genai.Client()

result = client.models.embed_content(
    model="gemini-embedding-2-preview",
    contents="Semantic search example",
    config=types.EmbedContentConfig(output_dimensionality=768)
)

⚠️ Remember: Normalize the vector manually if using dimensions other than 3072.

Use Cases: What Can You Build With Gemini Embedding 2?

Gemini Embedding 2 enables any application that needs to find, compare, or organize information across mixed media types.

Multimodal RAG (Retrieval-Augmented Generation)

Build a knowledge base that includes text documents, images, and audio recordings. A user’s text question retrieves the most relevant content — regardless of what format it’s stored in.

  • Search a video library using a text description
  • Find images that match an audio description
  • Retrieve PDF pages using a photo as the query

Semantic Search Across 100+ Languages

Index content in any language; search in any other. The unified embedding space handles cross-lingual retrieval without translation.

Document Intelligence

Embed PDFs directly. No need to extract text first. The model reads and understands the content visually and textually, then places it in the vector space.

Classification and Sentiment Analysis

Embed incoming content (text, image, or mixed) and classify it against label embeddings. Works for spam detection, content moderation, and review analysis.
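
A minimal sketch of this nearest-label approach, using invented toy vectors in place of real label and content embeddings (in practice both would come from `embed_content` with `task_type="CLASSIFICATION"`):

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical precomputed label embeddings (toy 3-dim vectors for illustration).
label_embeddings = {
    "spam": [0.9, 0.1, 0.0],
    "not_spam": [0.1, 0.9, 0.2],
}

# Toy embedding of an incoming message.
incoming = [0.85, 0.15, 0.05]

# Classify by picking the label whose embedding is closest.
predicted = max(label_embeddings, key=lambda lbl: cosine(incoming, label_embeddings[lbl]))
print(predicted)  # spam
```

The same pattern extends to any label set, and because the embedding space is multimodal, the incoming item can be text, an image, or mixed content.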

Anomaly Detection

Embed operational logs, sensor data, or media assets. Flag items whose vectors are statistical outliers from the expected cluster.
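
A minimal outlier check might look like this; the toy two-dimensional vectors and the hand-picked distance threshold stand in for real embeddings and tuned parameters:

```python
import math

def distance(a, b):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings of log entries (toy 2-dim vectors for illustration).
embeddings = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 0.95], [5.0, 4.8]]

# Centroid of the cluster of "normal" items.
n = len(embeddings)
centroid = [sum(v[i] for v in embeddings) / n for i in range(len(embeddings[0]))]

# Flag items whose distance from the centroid exceeds a fixed threshold.
threshold = 2.0
anomalies = [i for i, v in enumerate(embeddings) if distance(v, centroid) > threshold]
print(anomalies)  # [4]
```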

Supported Vector Databases and Frameworks

Gemini Embedding 2 integrates natively with:

  • LangChain — docs.langchain.com
  • LlamaIndex — developers.llamaindex.ai
  • Haystack — haystack.deepset.ai
  • Weaviate — docs.weaviate.io
  • Qdrant — qdrant.tech
  • ChromaDB — docs.trychroma.com
  • Pinecone — via REST API
  • BigQuery, AlloyDB, Cloud SQL — via Google Cloud

Migrating from gemini-embedding-001

If you are currently using gemini-embedding-001, you cannot simply swap model names — the embedding spaces are mathematically incompatible.

What You Must Do

  1. Re-embed all existing data using gemini-embedding-2-preview
  2. Update your model ID in all API calls
  3. Update dimension handling — check if you need to normalize vectors for non-3072 outputs
  4. Update task type parameters if using the task type feature

What Stays the Same

  • The API call structure (embed_content method) is identical
  • The output_dimensionality parameter works the same way
  • Default output dimensions (3072) remain the same

✅ Batch processing tip: Use the Gemini API Batch Mode for re-embedding large datasets. Batch API runs at 50% of the standard embedding price, making large migrations cost-effective.

Pricing and Availability

Gemini Embedding 2 is available now in Public Preview through the Gemini API and Vertex AI, billed under Standard PayGo pricing.

Access Options

Platform | Access | Current Availability
Gemini API | Standard PayGo | ✅ Public Preview
Vertex AI | Standard PayGo | ✅ Public Preview (us-central1)
Vertex AI Provisioned Throughput | n/a | ❌ Not yet supported
Vertex AI Batch Prediction | n/a | ❌ Not yet supported

Batch API discount: If latency is not critical, use the Gemini API Batch Mode for 50% cost savings on large embedding jobs.

Knowledge Cutoff

The model’s training knowledge cutoff is November 2025.

Troubleshooting Common Issues

“My cosine similarity scores are unexpected”

Solution: Check whether you’re using 3072 dimensions (auto-normalized) or a smaller dimension (requires manual normalization). Non-3072 vectors must be L2-normalized before cosine similarity calculations.

“I’m getting different results than with gemini-embedding-001”

Expected behavior. The two models use incompatible embedding spaces. You must re-embed all your documents with the new model before comparing results. Do not mix embeddings from the two models.

“My video embedding seems incomplete”

Solution: Videos longer than 128 seconds are not fully processed in a single request. Chunk your video into overlapping segments of ≤128 seconds and embed each segment individually.
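
Computing such segment boundaries can be sketched as follows; the 8-second overlap is an arbitrary choice for illustration, not a documented recommendation:

```python
def chunk_segments(total_seconds, max_len=128, overlap=8):
    # Split a video into overlapping (start, end) segments of at most
    # max_len seconds. Use max_len=80 if the audio track is embedded too.
    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_len, total_seconds)
        segments.append((start, end))
        if end == total_seconds:
            break
        start = end - overlap  # overlap so no content falls on a boundary
    return segments

print(chunk_segments(300))  # [(0, 128), (120, 248), (240, 300)]
```

Each segment is then embedded in its own request, and the resulting vectors are indexed individually.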

“The model isn’t available in my region”

Current limitation. During Public Preview, Gemini Embedding 2 on Vertex AI is only available in us-central1. Check the Vertex AI locations page for regional expansion updates.

“Embedding audio from a video is failing”

Solution: The model supports audio track extraction from video natively, but the video must be ≤80 seconds when audio is included (the limit drops from 128 to 80 seconds when the audio track is processed).

BONUS: Turn Your AI-Retrieved Content Into Full Video with Gaga AI

Gemini Embedding 2 helps you find and organize content. Gaga AI helps you present it.

Once Gemini Embedding 2 powers your search or RAG pipeline, the natural next step for many creators and businesses is turning that retrieved content into something people actually watch. Gaga AI is an all-in-one AI video creation platform purpose-built for that exact workflow.

What Gaga AI Offers

Image to Video AI

Convert any retrieved image or static asset into a dynamic video clip. Perfect for:

  • Turning Gemini Embedding 2-retrieved product images into demo videos
  • Animating retrieved visual search results for social media
  • Building preview clips from image archives

Video and Audio Infusion

Don’t just generate video — synchronize it with audio intelligently:

  • Layer retrieved audio content over video clips with precise timing
  • Add AI-generated background music that adapts to video mood
  • Balance voiceover, music, and sound effects in one step
  • Sync visual transitions to beat detection automatically

This is especially powerful when combined with Gemini Embedding 2’s audio retrieval — find the right audio, then use Gaga AI to infuse it into the final video output.

AI Avatar

Create a photorealistic AI presenter that can deliver your retrieved content on camera — without you ever recording a video:

  • Presenting search results or RAG-generated summaries as talking-head videos
  • Narrating multimodal content retrieved by Gemini Embedding 2
  • Building branded video spokespeople for product or documentation pages
  • Multilingual video delivery: same avatar, multiple languages

AI Voice Clone

Record a brief voice sample and Gaga AI builds a digital clone of your voice:

  • Narrate AI-retrieved content in your own voice consistently
  • Localize content rapidly — clone once, speak in any language
  • Generate podcast-style audio summaries of search results
  • Maintain a consistent voice identity across all video content

Text-to-Speech (TTS)

Skip the voice recording entirely with Gaga AI’s high-quality TTS engine:

  • Natural-sounding voices in multiple languages and accents
  • Emotional tone control: neutral, professional, warm, energetic
  • SSML support for fine-grained pacing and emphasis
  • Adjustable speed, pitch, and style per script segment

A Practical Gemini Embedding 2 + Gaga AI Workflow

  1. Index your content library (text, images, audio, video) using Gemini Embedding 2
  2. Retrieve the most semantically relevant assets via cross-modal search
  3. Animate retrieved images into video clips using Gaga AI
  4. Infuse retrieved audio into the video with Gaga AI’s audio layer
  5. Add an AI Avatar to present the results or narrate the summary
  6. Voice it with TTS or your voice clone for the final narration
  7. Publish the finished video to YouTube, LinkedIn, or TikTok

This pipeline takes raw multimodal data, surfaces the right content with Gemini Embedding 2’s semantic intelligence, and wraps it in a production-ready video with Gaga AI — end to end, without a camera crew or editor.

Frequently Asked Questions (FAQ)

What is Gemini Embedding 2?

Gemini Embedding 2 (gemini-embedding-2-preview) is Google’s first natively multimodal embedding model. Released on March 10, 2026, it converts text, images, video, audio, and PDF documents into numerical vectors within a single unified embedding space, enabling cross-modal semantic search and retrieval.

What makes Gemini Embedding 2 different from other embedding models?

Most embedding models are text-only or single-modality. Gemini Embedding 2 is natively multimodal — it maps all five media types into the same mathematical space using one model. It also supports custom task types, adjustable output dimensions via MRL, document OCR, and audio extraction from video.

Is Gemini Embedding 2 free?

Gemini Embedding 2 is available via Standard PayGo pricing on both the Gemini API and Vertex AI. There is no free tier listed in the Public Preview documentation. A 50% discount is available when using the Gemini API Batch Mode for non-latency-sensitive jobs.

Can I use gemini-embedding-2-preview to replace gemini-embedding-001?

No — not directly. The two models produce vectors in incompatible embedding spaces. If you switch, you must re-embed all existing documents and data using the new model before running any similarity comparisons.

What languages does Gemini Embedding 2 support?

Gemini Embedding 2 captures semantic intent across more than 100 languages, enabling cross-lingual retrieval — a query in one language can retrieve semantically matching content in another.

What are the input limits for Gemini Embedding 2?

The total input limit is 8,192 tokens. Per-modality limits: text (8,192 tokens), images (6 per request, PNG/JPEG), audio (1 file, max 80 seconds, MP3/WAV), video (1 file, max 128 seconds without audio / 80 seconds with audio, MP4/MOV), PDF (1 file, max 6 pages).

What output dimensions does Gemini Embedding 2 support?

The default output is a 3,072-dimensional vector. Using MRL, you can reduce this to any size between 128 and 3,072. Google recommends using 768, 1,536, or 3,072 for best quality. Vectors smaller than 3,072 must be manually L2-normalized before use.

Can I embed video and audio together in one request?

Yes. Gemini Embedding 2 supports interleaved multimodal input. You can pass text, image, audio, and video parts within a single request. When submitted as a single Content entry, it returns one aggregated embedding.

Where is Gemini Embedding 2 available?

It is available via the Gemini API globally and via Vertex AI in the us-central1 region during Public Preview. Broader regional availability is expected as the model moves toward General Availability.

What can I build with Gemini Embedding 2?

Key use cases include: multimodal RAG systems, cross-modal semantic search, document intelligence pipelines, content classification and moderation, multilingual search, clustering and anomaly detection, and recommendation engines that work across text, image, audio, and video content.

Turn Your Ideas Into a Masterpiece

Discover how Gaga AI delivers perfect lip-sync and nuanced emotional performances.