Gemini Embedding 2: The Paper, Explained

A beginner-friendly guide to how one model turns text, images, video, and audio into a single shared "meaning space" — and why that beats a pile of specialized models. Every AI term is defined. Every concept is grounded in analogy.

Paper by the Gemini Embedding Team (Google DeepMind, 2026) • Explainer published May 2026

Every modality flows through one model into a single vector space. Things that mean the same thing land in the same region, no matter whether they arrived as text, an image, a clip, or a sound.

The Big Picture

A huge amount of modern AI plumbing (search engines, recommendation feeds, the "retrieval" half of RAG systems) runs on one humble primitive: the embedding. RAG stands for Retrieval-Augmented Generation: before a language model answers, a retrieval system fetches relevant documents and pastes them into the prompt, so the model can ground its answer in real source material instead of relying on memory. An embedding is a list of numbers (a vector) that captures the meaning of a piece of content, arranged so that content that means similar things gets similar numbers; "embedding" the content means producing that vector. Then "find things like this" becomes "find the nearest numbers."

For years, the catch was that you needed a different embedding model for each kind of content. One model for text. A separate one for images. Audio usually wasn't embedded at all — it was transcribed to text first and then handed to the text model. Video was an afterthought. Stitching these together into one search system was a mess, and none of them could handle a query that mixed, say, an image and a question about it.

Gemini Embedding 2 collapses all of that into a single model. It takes text, images, video, audio, or any interleaved combination, and produces one embedding in one shared space. The paper tackles several problems at once:

  1. Separate per-modality models can't truly mix modalities. The old CLIP-style approach uses one encoder per modality and only compares them at the very end (“late fusion”). CLIP is a 2021 model from OpenAI that learns a shared image-text space using two separate encoders (one for images, one for text) trained on image-caption pairs. It was the dominant design for years, and late fusion is cheap and modular, but the two towers only ever meet at the final comparison step. That's fine for matching a caption to a photo, but it can't reason over an image and text together.
  2. Audio retrieval leans on brittle transcription. The standard recipe is "run ASR to get text, then embed the text." ASR (Automatic Speech Recognition) is software that converts spoken audio into written text; it is the first step in most voice pipelines and a common point of failure. If the transcriber mishears one word, the error poisons everything downstream.
  3. Specialized domains usually need specialized models. A model great at fine art might be terrible at microscope images. The dream is one embedder that works out of the box on astronomy, biology, recipes, and code alike.

The paper's fix: build the embedder on top of Gemini, a model that already understands all these modalities deeply, and fine-tune it to produce embeddings. Because every modality is processed by the same network with full cross-modal interaction (“deep fusion”: all modalities are fed into a single network as one interleaved sequence, so the model can attend from a word to a pixel and back), the resulting space is genuinely unified. The single model reaches state-of-the-art results on text, image, video, and audio retrieval simultaneously, scoring 77.2 overall vs. 68–70 for the next-best multimodal embedders, and even beats specialized models in their own domains.

Background Concepts

The paper assumes you're comfortable with embeddings, contrastive learning, and the CLIP lineage of multimodal models. Let's build those up from scratch.

What is an embedding (and a vector space)?

An embedding is a list of numbers, called a vector, that represents the meaning of some content. A vector is just an ordered list of numbers: a 3-dimensional vector is a point in 3D space, and a 3,072-dimensional vector is a point in 3,072-dimensional space. You can't picture it, but the math (distances, angles) works exactly the same. You can think of each number as a coordinate, so the vector is a single point in a high-dimensional space. Content with similar meaning lands at nearby points; unrelated content lands far apart. Gemini Embedding 2 outputs up to 3,072 numbers per item. That shared coordinate system is the vector space, and "unified" means text, images, video, and audio all map into the same space, so their vectors can be compared directly with one another.

Think of a library where every book is placed on a vast floor so that books about similar topics sit near each other — cookbooks in one area, astronomy in another. To find books like the one in your hand, you just look at its neighbors. An embedding space is that floor, except it has thousands of dimensions instead of two, and the "books" can be sentences, photos, video clips, or sounds.

To measure how close two vectors are, the paper uses cosine similarity: the cosine of the angle between them, which ignores their length. A score of +1 means same direction (very similar), 0 means unrelated, and −1 means opposite. Two vectors pointing the same way are very similar; perpendicular ones are unrelated.
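As a minimal sketch of that measure, here is cosine similarity written out with NumPy. The function name and the toy 2-dimensional vectors are illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (their lengths cancel out)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors chosen to hit the three landmark values.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])           # same direction, different length
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # unrelated
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])     # opposite direction
```

Because the lengths are divided out, doubling a vector does not change its similarity to anything else; only its direction matters.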

What is retrieval, and why do embeddings power it?

Retrieval means: given a query, find the most relevant items in a big pile. With embeddings, you embed everything once, then answer "find things like X" by embedding X and grabbing its nearest neighbors, the items whose vectors are closest to the query's. This is the engine behind semantic search, recommendation systems, and RAG.
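The "embed once, then grab nearest neighbors" loop can be sketched in a few lines. The item names and 3-dimensional vectors below are hypothetical stand-ins; in practice the vectors would come from an embedding model, and a large collection would typically use an approximate nearest-neighbor index rather than this brute-force scan.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k(query_vec, item_vecs, k=2):
    """Return indices of the k items most similar to the query, best first."""
    scores = normalize(item_vecs) @ normalize(query_vec)
    return [int(i) for i in np.argsort(-scores)[:k]]

# Hypothetical precomputed item embeddings (one row per item).
items = ["pasta recipe", "soup recipe", "telescope guide"]
item_vecs = np.array([[0.9, 0.1, 0.0],
                      [0.8, 0.3, 0.1],
                      [0.0, 0.2, 0.9]])
query_vec = np.array([1.0, 0.0, 0.0])  # stands in for an embedded "dinner ideas" query

ranked = [items[i] for i in top_k(query_vec, item_vecs, k=2)]
```

Here `ranked` holds the two recipe items, with the telescope guide left out because its vector points in a different direction.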


A modality is just a type of content: text, images, video, audio. A model that handles several is multimodal. When the query is one modality and the answer is another, like typing "a golden retriever in snow" to find a photo, that's cross-modal retrieval, and it only works if both modalities live in the same vector space.

Late fusion vs. deep fusion: the central design choice

This is the heart of what makes Gemini Embedding 2 different from the models that came before it, so it's worth seeing rather than just reading.

Late fusion (CLIP-style) runs each modality through its own encoder and compares only at the end. Deep fusion feeds everything through one transformer as a single interleaved sequence, letting modalities interact throughout.

Late fusion (the old way)

Models like CLIP, ALIGN, and SigLIP use two separate encoders — one for images, one for text — trained so a matched image and caption land near each other.

  • Great at simple "does this caption match this image?" tasks.
  • Modular and efficient.
  • But: the two halves never interact during processing. They can't handle a single input that mixes an image and a question, and they miss the rich interplay between modalities.

Deep fusion (this paper)

All modalities are turned into one interleaved sequence of tokens and fed through one transformer. Tokens are the small chunks a transformer actually reads: text becomes word-pieces, an image or video frame becomes a grid of visual tokens, and audio becomes audio tokens.

  • Words can attend to pixels and vice versa, throughout the network.
  • Handles arbitrary mixed inputs (image + text + video together).
  • Produces one genuinely shared space, with the richer understanding modern MLLMs are known for. (MLLMs are Multimodal Large Language Models: LLMs like Gemini that natively read images, audio, and video in addition to text. Gemini Embedding 2 is built on top of one.)
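The contrast between the two designs can be made concrete with a toy sketch. Everything here is an assumption for illustration: random matrices stand in for trained encoders, a single attention layer stands in for the whole transformer, and the token vectors are random. The point is only structural: in late fusion the image side never sees the text, while in deep fusion the image tokens' outputs depend on which text they are paired with.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # toy hidden size
W_img, W_txt, W_shared = (rng.normal(size=(D, D)) for _ in range(3))

def late_fusion(img_tokens, txt_tokens):
    """Two towers: each modality is encoded alone and pooled separately."""
    e_img = (img_tokens @ W_img).mean(axis=0)
    e_txt = (txt_tokens @ W_txt).mean(axis=0)
    return e_img, e_txt  # these only meet later, at the similarity step

def deep_fusion(img_tokens, txt_tokens):
    """One interleaved sequence through one shared self-attention layer."""
    seq = np.concatenate([img_tokens, txt_tokens], axis=0) @ W_shared
    scores = seq @ seq.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ seq  # per-token outputs, image rows first

image = rng.normal(size=(3, D))
text_a = rng.normal(size=(2, D))
text_b = rng.normal(size=(2, D))

late_a, _ = late_fusion(image, text_a)
late_b, _ = late_fusion(image, text_b)
deep_a = deep_fusion(image, text_a)[:3]
deep_b = deep_fusion(image, text_b)[:3]

image_side_unchanged_late = bool(np.allclose(late_a, late_b))  # text cannot affect it
image_side_unchanged_deep = bool(np.allclose(deep_a, deep_b))  # text changes it
```

Swapping the text leaves the late-fusion image embedding untouched but changes the deep-fusion image-token outputs, which is what "modalities interact throughout" means in practice.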

Transformers and bidirectional attention

A transformer is the neural network architecture behind almost all modern AI (GPT, Gemini, etc.). It processes a sequence of tokens using attention, a mechanism where each token decides how much to "look at" every other token and pulls in information accordingly. That is how a transformer builds context-aware representations.

Generative models like Gemini normally use causal attention: each token can only see the ones before it, never future ones (necessary for generating text word by word). But for embeddings you want every token to see the whole input. So the paper switches the model to bidirectional attention: every token sees every other token, in both directions, so nothing is hidden from any position.

Causal attention is reading a sentence one word at a time with the rest of the page covered — good if you're trying to predict the next word. Bidirectional attention is reading the whole sentence at once before forming an opinion — better if your job is to summarize what it means.
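In code, the difference between the two reading styles comes down to a mask on the attention scores. The sketch below is a generic illustration of causal versus bidirectional masking, assuming a 4-token input with equal raw scores; it is not the paper's implementation.

```python
import numpy as np

def attention_weights(scores, causal):
    """Row i holds how much token i attends to each token j (rows sum to 1)."""
    scores = np.array(scores, dtype=float)  # copy so the caller's array is untouched
    if causal:
        # Hide future positions (j > i) by setting their scores to -inf.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores[future] = -np.inf
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

raw_scores = np.zeros((4, 4))  # equal raw scores for a 4-token input
causal = attention_weights(raw_scores, causal=True)
bidirectional = attention_weights(raw_scores, causal=False)
```

With the causal mask, the first token can only attend to itself, so its row is `[1, 0, 0, 0]`. Without the mask, every row spreads its attention evenly over all four tokens.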

Contrastive learning

How do you teach a model to put similar things close together? Contrastive learning is the answer, and it's covered in detail in the training section below. The one-line version: show the model matched pairs (a query and its correct answer) and a bunch of mismatched pairs, then nudge the matched ones together and shove the rest apart. The model learns meaning by comparison, without needing explicit labels for every concept.

How It Works

Architecturally, Gemini Embedding 2 is surprisingly clean. It starts from a pretrained Gemini model and turns it into an embedder with a few well-chosen steps.

Input: text, image, video, audio, or any interleaved mix.

  1. Gemini-native tokenization: each modality is converted into Gemini's native token format. Raw images, video, and audio go in directly, with no separate encoders to bolt on.
  2. Transformer M (initialized from Gemini, bidirectional): the full sequence of L tokens is processed with bidirectional attention, producing one embedding per token. This is the "pre-trained" core; it inherits everything Gemini already knows.
  3. Mean pooling: average all the per-token embeddings into a single vector that represents the whole input.
  4. Linear projection f: a learned layer scales that vector to the final output size (up to 3,072 dimensions).

Output: one embedding E of up to 3,072 numbers.

Two design choices are worth unpacking.

Why "initialize from Gemini" counts as pre-training

The model doesn't learn about the world from scratch. It starts as a copy of Gemini's parameters — which already encode an enormous amount of knowledge across all modalities — and then gets fine-tuned to encode rather than generate. The authors describe initializing from Gemini as effectively the embedding model's pre-training stage. The pooling and projection are deliberately simple precisely because the heavy lifting already happened inside Gemini.

What's "mean pooling" and why so simple?

The transformer outputs one vector per input token, but a search index needs one vector per item. Mean pooling just averages them, element by element. The authors cite prior research showing that when your backbone model is this capable, fancy pooling schemes add little, so the simplest option wins.
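Mean pooling followed by a linear projection is only a couple of lines of arithmetic. The sketch below uses toy sizes (3 tokens, 2 hidden units, 3 output dimensions) and a hand-picked projection matrix, all of which are assumptions for illustration; in the real model the projection is learned and the sizes are far larger.

```python
import numpy as np

def embed(token_vectors, projection):
    """Mean-pool per-token vectors (L x d), then apply a linear projection (d x out)."""
    pooled = token_vectors.mean(axis=0)  # one vector of size d
    return pooled @ projection           # one vector of size out

# Toy per-token outputs from the transformer: L=3 tokens, d=2.
token_vectors = np.array([[1.0, 0.0],
                          [3.0, 2.0],
                          [2.0, 4.0]])
# Hypothetical projection weights: d=2 in, 3 out.
projection = np.array([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0]])

embedding = embed(token_vectors, projection)
```

The three token vectors average to `[2, 2]`, and the projection turns that into the single vector `[2, 2, 4]` that represents the whole input.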

Matryoshka: one vector, many sizes

Gemini Embedding 2 outputs 3,072-dimensional vectors, but storing and comparing millions of 3,072-number vectors is expensive. The trick is Matryoshka Representation Learning (MRL), a training technique that nests smaller usable embeddings inside a larger one: the model is trained so the first 768 numbers are a complete, usable embedding on their own, and so are the first 1,536. You can chop the vector short to save space and speed, without retraining, and it still works.

Russian nesting dolls. The full 3,072-dimensional vector is the biggest doll, but inside it sits a perfectly good 1,536 doll, and inside that a 768 doll. You pick whichever size fits your storage budget — no need to keep different models for different sizes.
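On the consumer side, using a smaller doll is just slicing. The sketch below shows the mechanics on a random stand-in vector; re-normalizing after truncation is a common convention assumed here rather than something stated in the explainer, and the truncated vector is only meaningful when the model was trained with MRL.

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` numbers and re-normalize to unit length."""
    head = np.asarray(vec, dtype=float)[:dims]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)
full = rng.normal(size=3072)  # random stand-in for a real 3,072-dimensional embedding

small = truncate_embedding(full, 768)
medium = truncate_embedding(full, 1536)
```

Queries and stored items must be truncated to the same size before comparing them, since vectors of different lengths cannot be compared directly.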

How It's Trained

The model's secret sauce is its training recipe. Everything revolves around one objective — contrastive learning — applied across many tasks and several stages.

The contrastive objective

Each query is pulled toward its correct target and pushed away from every other target in the batch (free "in-batch negatives") plus any deliberately tricky "hard negatives."

Each training example is a query, its correct positive target, and optionally a hard negative. The model embeds all of them and is trained with a contrastive loss that maximizes the cosine similarity between the query and its positive while minimizing it against the negatives. Specifically, it is a noise-contrastive estimation (NCE) loss with in-batch negatives, using cosine similarity scaled by a temperature.

The clever, cheap part is in-batch negatives: within a batch of, say, 1,000 examples, the other 999 targets serve as "wrong answers" for your query, for free, without anyone having to collect them. That's why contrastive training loves big batches: more examples per batch means more negatives to contrast against.

A hard negative is a wrong answer that's deliberately close to the right one, forcing the model to learn fine distinctions instead of just obvious ones. For example, for the query "first purchase date," a hard negative might be a passage about the most recent purchase.
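The objective described above can be sketched as a softmax cross-entropy over cosine similarities, where each query's correct "class" is its own positive. This is a generic InfoNCE-style sketch, not the paper's exact formula: the temperature value of 0.05, the function name, and the choice to share every hard negative across the whole batch are assumptions.

```python
import numpy as np

def info_nce_loss(queries, positives, hard_negatives=None, temperature=0.05):
    """Mean cross-entropy where row i's correct column is i (its own positive)."""
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    q, p = unit(queries), unit(positives)
    # Candidates: every positive in the batch, plus optional hard negatives.
    candidates = p if hard_negatives is None else np.vstack([p, unit(hard_negatives)])
    logits = (q @ candidates.T) / temperature    # cosine similarity / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = len(q)
    return float(-log_probs[np.arange(n), np.arange(n)].mean())

queries = np.eye(3)  # three toy queries pointing in different directions
aligned = info_nce_loss(queries, queries)                       # positives match
shuffled = info_nce_loss(queries, np.roll(queries, 1, axis=0))  # positives mismatched
with_hard = info_nce_loss(queries, queries, hard_negatives=queries)
```

The loss is near zero when each query sits on its positive, large when positives are mismatched, and it rises again once hard negatives that look exactly like the right answer are added to the candidates.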

The details: temperature, masking, and a task string
  • Temperature (τ): a knob that sharpens or softens how aggressively the loss separates positives from negatives.
  • Masking: if two examples in a batch happen to share the same query or positive, they're masked out so the model isn't punished for a "negative" that's actually correct — important for classification tasks with few labels.
  • Task string: text examples can carry a short instruction like "question answering" or "fact checking." During training these are randomly dropped so the model stays robust even when no task hint is given — one reason it works zero-shot without prompt engineering.
  • MRL multi-loss: the loss is computed several times over nested sub-dimensions (768, 1,536, full 3,072) so all the Matryoshka sizes stay usable.
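Two of the details above, masking and the MRL multi-loss, can be sketched together. The assumptions are spelled out in the code: duplicates are detected by a hypothetical `positive_ids` list (the explainer also mentions shared queries, which this sketch does not handle), the nested losses are summed with equal weight, and toy sizes of 2 and 4 dimensions stand in for 768, 1,536, and 3,072.

```python
import numpy as np

def build_duplicate_mask(positive_ids):
    """True where two different rows share the same positive (a false negative)."""
    ids = np.asarray(positive_ids)
    same = ids[:, None] == ids[None, :]
    return same & ~np.eye(len(ids), dtype=bool)

def masked_nce(q, p, mask, temperature=0.05):
    """In-batch contrastive loss; mask[i, j] = True hides column j from row i."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature
    logits = np.where(mask, -np.inf, logits)  # masked entries get zero probability
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())

def mrl_loss(q, p, mask, dims=(768, 1536, 3072)):
    """Sum the loss over nested prefixes so each truncated size gets trained."""
    return sum(masked_nce(q[:, :d], p[:, :d], mask) for d in dims)

rng = np.random.default_rng(0)
targets = rng.normal(size=(2, 4))
positive_ids = [0, 0, 1]               # rows 0 and 1 share the same positive
p = targets[positive_ids]
q = p + 0.1 * rng.normal(size=p.shape)  # queries close to their positives

mask = build_duplicate_mask(positive_ids)
loss_masked = mrl_loss(q, p, mask, dims=(2, 4))
loss_unmasked = mrl_loss(q, p, np.zeros_like(mask), dims=(2, 4))
```

Without the mask, rows 0 and 1 are each penalized for being close to the other's target, even though that target is also correct for them. Masking removes that penalty, so `loss_masked` comes out lower than `loss_unmasked`.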

The multi-stage recipe

Training proceeds in three stages, borrowed and extended from Google's earlier Gecko and Gemini Embedding models.

  1. Pre-Fine-Tuning (PFT): adapt the model from generating to encoding, using a huge set of noisy query–target pairs. Big batches smooth out the noise. Only text, image, and code tasks here.
  2. Fine-Tuning (FT): train on cleaner, harder data across all modalities (text, code, documents, image, audio, video) with hard negatives and per-task tuned batch sizes.
  3. Model Soup: average the weights of several fine-tuned checkpoints into one, gaining extra robustness across modalities for free.

Wait, you can just average model weights together?

Yes. A model soup takes several models that were fine-tuned differently and literally averages their weights, element by element. The surprising and well-replicated result is that the averaged model is often more robust than any single ingredient, and it costs nothing extra at inference time, since you still end up with one model. The paper uses it to balance the per-modality trade-offs that fine-tuning introduces (see the video-data result below).
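A uniform soup really is this simple. The sketch below assumes each checkpoint is a dictionary of NumPy arrays with matching names and shapes, and that all ingredients get equal weight; the paper may select or weight its ingredients differently.

```python
import numpy as np

def model_soup(checkpoints):
    """Element-wise average of parameter dicts that share names and shapes."""
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in names}

# Two hypothetical fine-tuned checkpoints with the same parameter layout.
ckpt_a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
ckpt_b = {"w": np.array([3.0, 6.0]), "b": np.array([2.0])}

soup = model_soup([ckpt_a, ckpt_b])
```

The result has the same shape as either ingredient (`w` becomes `[2, 4]`, `b` becomes `[1]`), which is why serving it costs exactly as much as serving one model.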

Why synthetic data matters

The team uses Gemini itself to generate high-quality training data. On code-retrieval tasks this is dramatic: adding Gemini-synthesized data lifts the average score by +15.8 points over the previous text-only Gemini Embedding model — the difference between a 70.5 and an 86.3 average on the code benchmarks studied.

Native Audio vs. ASR

One of the paper's most striking results is about audio. The conventional pipeline transcribes speech to text first (ASR), then embeds the text. Gemini Embedding 2 can skip that and embed the raw audio directly — and that turns out to matter a lot.

The cascade commits to one text guess, and if it's wrong, retrieval inherits the error. Native audio keeps the continuous signal, preserving ambiguity and acoustic cues until the very end.

The problem with the cascade is error propagation. The classic example: an ASR system has to make a hard choice between "recognize speech" and "wreck a nice beach" — they sound almost identical. Once it commits to the wrong text, the retrieval system is searching for the wrong thing, and there's no recovering.

Embedding the audio directly avoids the forced decision. The model keeps the continuous acoustic signal, including prosody (the rhythm, stress, and intonation of speech, which are lost the moment audio is flattened to text), so the embedding preserves the inherent ambiguity instead of throwing it away too early.

| Setup (MSEB retrieval, MRR@10) | Same-language | Cross-language | Average |
|---|---|---|---|
| Cascade (ASR → embed text) | 73.58 | 67.55 | 70.40 |
| Native audio (embed directly) | 75.58 | 72.56 | 73.99 |

MRR@10 is Mean Reciprocal Rank at 10: for each query, score 1/(rank of the first correct result) within the top 10, then average. Higher is better; it rewards putting the right answer near the top.

Native audio wins everywhere, and the gap widens for cross-language retrieval (+5.0 points), where the model matches meaning across languages without being trapped by the phonetic guesses of an intermediate transcriber.
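For readers who want to see how an MRR@10 number like the ones in the table is produced, here is a minimal sketch. The function name and the three toy queries are illustrative assumptions; real benchmark scores are usually multiplied by 100, as in the table.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Average of 1/rank of the first relevant item within the top k (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first correct result counts
    return total / len(ranked_lists)

# Three toy queries: correct item at rank 1, at rank 3, and missing entirely.
ranked_lists = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
relevant_sets = [{"a"}, {"z"}, {"q"}]

score = mrr_at_k(ranked_lists, relevant_sets)
```

The three queries contribute 1, 1/3, and 0, so the average is 4/9, or about 44.4 on the ×100 scale.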

Results

Multimodal retrieval: one model, best at everything

Across a broad suite of image, text, and video retrieval benchmarks, Gemini Embedding 2 posts the highest overall score — while being one of only two compared models to even support all four modalities.

| Benchmark (metric) | Gemini Embedding 2 | Amazon Nova MME | Voyage-3.5-MM | Legacy Google |
|---|---|---|---|---|
| Image→Image, GUIEC (R@1) | 79.4 | 68.6 | 69.4 | 69.5 |
| Text→Image, MSCOCO (R@1) | 62.9 | 57.2 | 58.1 | 53.1 |
| Text→Video, Vatex (NDCG@10) | 68.8 | 60.3 | 55.2 | 54.9 |
| Image+Text→Text, EncyclopedicVQA (R@20) | 71.5 | † | † | † |
| Document retrieval, ViDoRe V2 (NDCG@10) | 64.9 | 60.6 | 65.5 | 28.9 |
| Overall | 77.2 | 68.2 | 70.0 | 64.1 |

† Only one of the three baselines reports a score on this benchmark (58.6).

"R@1" is Recall@1, the fraction of queries where the single top-ranked result is correct (Recall@k allows the right answer anywhere in the top k). "NDCG@10" is Normalized Discounted Cumulative Gain at 10, a ranking-quality score for the top 10 results that rewards putting more-relevant items higher; it ranges from 0 to 1 (shown here ×100). Higher is better for both. Bottom line: Gemini Embedding 2 leads almost everywhere and stays competitive even where it doesn't win outright.

Text and code: no compromise

A worry with any "do-everything" model is that the new abilities dilute the old ones. They don't. MTEB, the Massive Text Embedding Benchmark, and its multilingual edition, MMTEB, form the standard leaderboard for text embeddings, spanning retrieval, classification, clustering, and more across 250+ languages. On MTEB multilingual, Gemini Embedding 2 scores 69.9, better than the previous text-only Gemini Embedding (68.4). On code retrieval it reaches 84.0, a large jump and a new state of the art, beating even code-specialist models like voyage-code-3.

Specialized domains: robust out of the box

Tested zero-shot (evaluated on domains it was never specifically trained for, with no fine-tuning or examples) on niche fields, the model doesn't just win; it wins consistently, where rival models swing wildly between domains (image→text Recall@5):

Microscopy & bioscience

MicroVQA: 79.3 vs. 53.3 for the next best, over 48% better in relative terms (a 26.0-point gap).

Astronomy

AstroLLaVA: 64.4 — roughly double the best baseline.

Culinary (Recipe1M)

Ingredients 90.2, instructions 92.1 — breaking 90 where the next best sits in the low 80s.

The authors emphasize consistency: models like SigLIP 2 might score 81 on recipes but collapse to 8.4 on fine art. Gemini Embedding 2 stays reliable across all of them — the practical value of a truly general embedder.
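The Recall@k and NDCG@10 scores quoted throughout these results can be computed as sketched below. Two assumptions are worth flagging: recall is written in the "hit" form the explainer describes (was a correct item anywhere in the top k?), and NDCG uses plain relevance values as gains, whereas some benchmarks use an exponential gain instead.

```python
import math

def recall_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0 (per query)."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def ndcg_at_k(ranked, gains, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(pos + 1) for pos, g in enumerate(values, start=1))

    actual = dcg([gains.get(item, 0.0) for item in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# One toy query whose only relevant item, "a", was ranked second.
ranked = ["b", "a", "c"]
r_at_1 = recall_at_k(ranked, {"a"}, k=1)
r_at_5 = recall_at_k(ranked, {"a"}, k=5)
ndcg = ndcg_at_k(ranked, {"a": 1.0}, k=10)
```

For this query Recall@1 is 0 and Recall@5 is 1, while NDCG@10 gives partial credit of 1/log2(3), about 0.63, for placing the right item second rather than first. A benchmark score is the average of these per-query values over all queries.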

The in-domain video data trade-off (and how souping fixes it)

Adding a few thousand examples of a specific video dataset's training split sharply boosts that dataset (e.g. MSR-VTT jumps +7.9 points) but can slightly hurt others (YouCook2 dipped 0.6). This is classic overfitting to a narrow target. The fix: model soup. Averaging the fine-tuned weights with the original model retains the in-domain gains while restoring the broad robustness, in several cases beating the baseline across the board.

Final Quiz

  1. What's the core architectural difference between Gemini Embedding 2 and CLIP-style models?
  2. Why does embedding audio directly beat the "ASR then embed text" cascade?
  3. What are "in-batch negatives" in contrastive training?
  4. What does Matryoshka Representation Learning (MRL) let you do?
  5. Why do the authors call "initializing from Gemini" the embedding model's pre-training stage?
  6. What's notable about Gemini Embedding 2's performance on specialized domains like astronomy and microscopy?

Why This Paper Matters

For builders and practitioners

If you build search, recommendations, or RAG, the operational win is consolidation. Instead of running and maintaining a text embedder, an image embedder, an ASR pipeline, and glue code to reconcile their incompatible vector spaces, you call one model. Mixed-modality queries ("find the moment in this video where someone explains X," with the query combining an image and text) become possible rather than hacky. And because it works zero-shot, that is, out of the box without task-specific tuning or brittle task instructions, you skip a whole category of fragile prompt engineering. The Matryoshka sizing means you can dial down to 768 dimensions to cut storage and latency when you need to.

For the research community

Two results are load-bearing. First, native multimodal beats cascades: embedding raw audio outperforms transcribe-then-embed, a concrete demonstration that flattening one modality into another throws away signal you can't recover. Expect the same argument to be made for video and documents. Second, generality need not cost specialization: the multimodal model matches or beats the previous text-only model on pure text, and beats domain specialists in their own domains. That challenges the default assumption that you trade breadth for depth, and it leans heavily on three ingredients the paper validates — a strong MLLM backbone, Gemini-synthesized training data, and model souping.

The bigger picture

Embeddings are the connective tissue between raw data and everything an AI system does with it. As applications become agentic and multimodal — an assistant that reasons over your documents, screenshots, voice notes, and videos at once — a single unified representation space stops being a convenience and becomes infrastructure. This paper is a bet that the embedding layer should be as natively multimodal as the generative models it serves, built from the same foundation rather than assembled from a patchwork of specialists. If that bet holds, "which embedding model do I use for this data type?" becomes a question that no longer needs asking.

Glossary

Embedding

A vector (list of numbers) representing the meaning of content, arranged so similar meanings have similar vectors.

Vector space

The shared high-dimensional space all embeddings live in. "Unified" means all modalities map into the same one.

Cosine similarity

How aligned two vectors are, by the angle between them. +1 = same direction, 0 = unrelated.

Retrieval

Finding the most relevant items for a query by comparing embeddings — the engine behind search, recommendation, and RAG.

RAG

Retrieval-Augmented Generation. Fetch relevant documents, then let a language model answer using them as grounding.

Modality

A type of data: text, image, video, audio. Multimodal = handles several; cross-modal = query and result differ.

Late fusion

Process each modality separately, combine only at the end (CLIP-style). Modalities never interact during processing.

Deep fusion

Feed all modalities into one network as an interleaved sequence so they interact throughout. This paper's approach.

CLIP / ALIGN / SigLIP

The dominant lineage of dual-encoder multimodal models that pioneered shared image-text spaces via late fusion.

MLLM

Multimodal Large Language Model — an LLM (like Gemini) that natively reads images, audio, and video too.

Transformer

The neural network architecture behind modern AI, processing token sequences using attention.

Attention

The mechanism letting each token gather information from other tokens, weighting how much to "look at" each.

Bidirectional attention

Every token attends to every other token in both directions — better for encoding a fixed input than causal attention.

Causal attention

Each token sees only earlier tokens. Needed for generating text, limiting for embeddings.

Token

The small chunk a transformer reads. Text, image patches, and audio all get converted into tokens.

Contrastive learning

Training that pulls matching pairs together and pushes mismatched pairs apart in embedding space.

In-batch negatives

Other examples' targets in the same batch, reused for free as wrong answers. Big batches give more of them.

Hard negative

A wrong answer deliberately similar to the right one, forcing the model to learn fine distinctions.

Mean pooling

Averaging per-token vectors into one vector for the whole input. The simplest pooling, used here.

Matryoshka (MRL)

Training so smaller embeddings nest inside larger ones — truncate 3,072 to 1,536 or 768 and it still works.

Model soup

Averaging the weights of several fine-tuned models into one, often more robust than any single ingredient.

ASR

Automatic Speech Recognition — speech-to-text. The cascade approach embeds its output; errors propagate.

Prosody

The rhythm, stress, and intonation of speech — lost in transcription, preserved in native audio embedding.

Zero-shot

Working on a task or domain with no specific training or tuning — "out of the box."

MTEB / MMTEB

Massive (Multilingual) Text Embedding Benchmark — the standard text-embedding leaderboard.

Recall@k

Fraction of queries where a correct result appears in the top k. Recall@1 = the very top result is right.

NDCG@10

A ranking-quality score for the top 10 results, rewarding more-relevant items ranked higher.

MRR@10

Mean Reciprocal Rank at 10 — rewards putting the first correct result as high as possible in the top 10.