A deep dive into Matryoshka embeddings, dimension reduction economics, and why pizza statistically belongs closer to video games than a controller does.
OpenAI's third-generation models introduce dynamic dimensionality via Matryoshka learning: a complete departure from the rigid fixed-size vectors of the past.
| Model | Max Dimensions | MTEB English | MIRACL Multilingual | Price / 1M tokens |
|---|---|---|---|---|
| text-embedding-3-small (3rd gen) | 1,536 | 62.3 | 44.0 | $0.02 |
| text-embedding-3-large (3rd gen) | 3,072 | 64.6 | 54.9 | $0.13 |
Matryoshka Representation Learning (MRL) makes dimension-flexible embeddings possible. Understanding it reshapes how you think about vector storage.
A 512×512 and a 1024×1024 photo both show you a leather boot; one just has more detail. But do you need 10,000 px to recognize it's a boot? At some point, extra resolution stops adding meaningful information. The first few hundred dimensions capture the broad ontological category: enough for most retrieval tasks.
Two cameras, both set to 512×512. One is a basic smartphone; the other a professional DSLR. Same resolution, but the DSLR captures better color fidelity, texture, and depth. The Large model is that DSLR: more parameters and a deeper network, packing 768 dimensions with higher-quality representations than Small does.
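The trim-and-keep-the-prefix trick is mechanically simple. A minimal sketch with NumPy, using a random placeholder vector instead of a real model output (the third-generation API can also do this server-side via its `dimensions` parameter):

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` coordinates of a Matryoshka embedding,
    then re-normalize to unit length so cosine math still behaves."""
    v = np.asarray(vec, dtype=np.float64)[:dims]
    return v / np.linalg.norm(v)

# Placeholder standing in for a full 3,072-dim text-embedding-3-large output.
rng = np.random.default_rng(0)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 768)
print(short.shape)            # (768,)
print(np.linalg.norm(short))  # ~1.0
```

Re-normalizing after the cut matters: cosine similarity assumes unit-length vectors, and the truncated prefix alone is slightly shorter than 1.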
Query: "Video game". Results ordered by cosine similarity (most similar first), extracted directly from the experiment. Two stories worth telling.
A quarter of the max storage. Better results. The model's internal architecture matters more than dimension count.
Half the storage. Identical ranking. This is Matryoshka learning doing exactly what it promises.
Query: "Video game" tested against 11 candidate terms, across both models and four dimension sizes, using Euclidean distance and Cosine similarity.
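That experiment loop can be sketched as follows. The vectors here are random stand-ins (a real run would fetch embeddings from the API) and the candidate names are illustrative, not the actual 11 terms:

```python
import numpy as np

def rankings(query, candidates, dims, metric="cosine"):
    """Rank candidate names against the query at a given truncation size."""
    q = query[:dims] / np.linalg.norm(query[:dims])
    scored = []
    for name, vec in candidates.items():
        c = vec[:dims] / np.linalg.norm(vec[:dims])
        if metric == "cosine":
            score = -(q @ c)              # higher similarity -> smaller score
        else:                             # Euclidean distance on unit vectors
            score = np.linalg.norm(q - c)
        scored.append((score, name))
    return [name for _, name in sorted(scored)]

rng = np.random.default_rng(1)
query = rng.normal(size=1536)
candidates = {w: rng.normal(size=1536)
              for w in ["controller", "console", "pizza", "keyboard"]}

for dims in (1536, 768, 384, 256):        # four truncation sizes
    print(dims, rankings(query, candidates, dims))
```

One detail worth noting: on unit vectors, Euclidean distance is a monotone function of cosine distance, so the two metrics always produce the same ranking order.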
Storage Win

Half the storage. Half the vector length. Identical semantic rankings in this context. For product-catalog matching tasks, you can cut your vector database footprint in half with zero measurable loss in retrieval quality.

⚠️ Tested on product names in an e-commerce context. More nuanced tasks, like matching paragraphs of dense legal or academic text, may reveal differences at higher dimensions.

Architecture Wins

One-eighth of the Large model's max storage, and it still produces better rankings. This confirms the camera-sensor analogy: the model's internal architecture dominates raw dimension count. The Small model, fully unrolled, can't match Large's compressed precision in this domain.

⚠️ This finding applies to short, categorical terms (product names, labels). For longer, denser inputs the gap may narrow or shift.

Mind-bending

A controller is literally a video game accessory. Pizza is food. Yet the model places pizza closer in the vector space, and once you understand how embeddings actually work, this makes complete sense.
Query: "Video game" · Model: text-embedding-3-large · 768 dimensions · Cosine distance (lower = more similar).
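For reference, the quantity plotted above is cosine distance, which any embedding library computes the same way:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (identical direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```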
Ontologically, pizza and video games share zero overlap. Yet in the statistical geography of human language, across billions of forum posts, stories, and conversations, they're neighbors.
Semantic similarity: direct synonyms or near-identical meaning. Words that can substitute for each other without changing the core meaning.
Semantic relatedness: linked by function, culture, or co-occurrence; not necessarily alike in meaning, but appearing in the same situations.
The word "controller" appears across vastly different contexts in billions of training examples. Its vector is pulled in many directions simultaneously, diluting its gravitational pull toward gaming.
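A toy illustration of that dilution, with three hand-picked "sense" directions standing in for the many contexts a real training corpus would supply (all vectors and weights here are invented for the sketch):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Invented orthogonal "context" directions.
gaming  = np.array([1.0, 0.0, 0.0])
music   = np.array([0.0, 1.0, 0.0])   # e.g. MIDI controllers
finance = np.array([0.0, 0.0, 1.0])   # e.g. financial controllers

# "pizza" co-occurs mostly with one cluster of casual-leisure contexts;
# "controller" is pulled toward three unrelated clusters at once.
pizza      = unit(0.9 * gaming + 0.1 * music)
controller = unit(gaming + music + finance)

print(gaming @ pizza)        # ~0.99: stays close to the gaming direction
print(gaming @ controller)   # ~0.58: averaged across senses, pulled away
```

The averaged multi-sense vector ends up farther from the gaming direction than the single-sense one, even though "controller" is the more gaming-specific word.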
API inference costs are often negligible at enterprise scale. The compounding cost lives in storage, RAM, and query latency, and that's where dimension reduction pays off.
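The storage arithmetic is easy to check. Assuming float32 vectors (4 bytes per dimension) and a hypothetical 100-million-document corpus, ignoring index overhead:

```python
def storage_gb(n_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage in gigabytes (index overhead not included)."""
    return n_vectors * dims * bytes_per_dim / 1e9

N = 100_000_000  # hypothetical corpus size
for dims in (3072, 1536, 768, 384):
    print(f"{dims:>5} dims: {storage_gb(N, dims):,.1f} GB")
# 3072 dims -> 1,228.8 GB; truncating to 384 drops that to 153.6 GB
```

Every halving of dimensions halves not just disk, but the RAM an in-memory index needs and the bytes each query must touch, which is where the latency win comes from.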