What Are Embeddings in AI?
Embeddings turn words into numbers that capture meaning. Learn how they work — from one-hot vectors to king - man + woman = queen and beyond.
How does a computer understand that "king" and "queen" are related, that "Paris" is to "France" as "Tokyo" is to "Japan," or that "happy" and "joyful" mean almost the same thing? The answer is embeddings — dense numerical representations that encode meaning as geometry.
Embeddings are arguably the most important idea in modern NLP. They're the foundation every language model is built on.
The problem: computers need numbers
Computers can't process words directly. They need numerical representations. The simplest approach is one-hot encoding: assign each word in your vocabulary a unique index and represent it as a vector of all zeros with a single 1.
If your vocabulary is ["cat", "dog", "fish", "king", "queen"], then:
- "cat" = [1, 0, 0, 0, 0]
- "king" = [0, 0, 0, 1, 0]
- "queen" = [0, 0, 0, 0, 1]
This works, but it has two fatal problems:
- No similarity information. The distance between "king" and "queen" is the same as between "king" and "fish." The representation encodes identity but not meaning.
- Impossibly large vectors. A vocabulary of 100,000 words means every word is a 100,000-dimensional vector with 99,999 zeros. Massively wasteful.
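Both problems are easy to see in code. This is a minimal sketch using the five-word toy vocabulary from above; note that every pair of distinct words ends up exactly the same distance apart:

```python
import math

# Toy vocabulary from the text above.
vocab = ["cat", "dog", "fish", "king", "queen"]

def one_hot(word):
    """One-hot encoding: all zeros, a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(one_hot("king"))  # [0, 0, 0, 1, 0]

# Every pair of distinct words is exactly sqrt(2) apart: the encoding
# records identity but carries no similarity information at all.
print(euclidean(one_hot("king"), one_hot("queen")))  # 1.414...
print(euclidean(one_hot("king"), one_hot("fish")))   # 1.414...
```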
Dense embeddings: meaning as geometry
The breakthrough idea: instead of sparse, meaningless vectors, represent each word as a short, dense vector (typically 256–4096 dimensions) where similar meanings map to nearby points.
In embedding space:
- "happy" and "joyful" are close together
- "cat" and "kitten" are close together
- "cat" and "democracy" are far apart
- "run" (the verb) and "run" (a scoring run) might be at different locations depending on context
These vectors aren't hand-crafted. They're learned from data — the model discovers the geometry of meaning by processing billions of sentences.
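"Nearby" is usually measured with cosine similarity: the cosine of the angle between two vectors, which is close to 1 for similar directions and close to 0 for unrelated ones. The 4-dimensional vectors below are hand-made for illustration; real embeddings are learned and have hundreds of dimensions:

```python
import math

# Hand-made toy vectors, chosen so that similar meanings point in
# similar directions. Real embeddings are learned from data.
embeddings = {
    "happy":     [0.90, 0.80, 0.10, 0.00],
    "joyful":    [0.85, 0.75, 0.15, 0.05],
    "democracy": [0.00, 0.10, 0.90, 0.80],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine(embeddings["happy"], embeddings["joyful"]))     # close to 1.0
print(cosine(embeddings["happy"], embeddings["democracy"]))  # close to 0
```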
Word2Vec: where it started
In 2013, Tomas Mikolov and his colleagues at Google published Word2Vec, and the field changed overnight. The idea was elegantly simple: train a shallow neural network to predict a word from its context (or vice versa), and the hidden layer weights become your embeddings.
Two architectures:
- CBOW (Continuous Bag of Words): Given surrounding words, predict the center word
- Skip-gram: Given a center word, predict surrounding words
The trained model was never the point — the learned weight matrix was. Each row was a 300-dimensional embedding for one word, and these embeddings exhibited remarkable properties.
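The training loop can be sketched in a few dozen lines of NumPy. This is a bare-bones skip-gram with a full softmax and no negative sampling, on an invented toy corpus; real Word2Vec trains on billions of words with heavy optimizations, so treat this as an illustration of the mechanics only:

```python
import numpy as np

# Toy corpus and hyperparameters, invented for illustration.
corpus = "the king rules the land the queen rules the land".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 2            # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, D))   # input embeddings -- the ones we keep
W_out = rng.normal(0, 0.1, (D, V))  # output weights for context prediction

# Skip-gram training pairs: (center, context) with a window of 1.
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

def avg_loss():
    total = 0.0
    for c, o in pairs:
        logits = W_in[c] @ W_out
        logits = logits - logits.max()
        total += np.log(np.exp(logits).sum()) - logits[o]
    return total / len(pairs)

before = avg_loss()
lr = 0.1
for _ in range(200):
    for c, o in pairs:
        h = W_in[c]                    # hidden layer = embedding lookup
        logits = h @ W_out
        p = np.exp(logits - logits.max())
        p /= p.sum()                   # softmax over the vocabulary
        grad = p.copy()
        grad[o] -= 1.0                 # d(cross-entropy)/d(logits)
        dh = W_out @ grad              # gradient w.r.t. the embedding
        W_out -= lr * np.outer(h, grad)
        W_in[c] -= lr * dh
after = avg_loss()

# "king" and "queen" occur in identical contexts in this corpus, so their
# rows of W_in tend to drift toward each other as training proceeds.
print(f"avg loss: {before:.3f} -> {after:.3f}")
```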
The king - man + woman = queen moment
The most famous result in embedding history:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
This isn't a parlor trick. It works because the model learned that the relationship between "king" and "man" (royalty vs commoner, male) is parallel to the relationship between "queen" and "woman." The gender direction in the vector space is consistent across word pairs.
Other examples that work:
- Paris - France + Japan ≈ Tokyo (capital city relationship)
- bigger - big + small ≈ smaller (comparative form)
- walked - walk + swim ≈ swam (past tense)
The model wasn't taught grammar rules or geography. It learned these relationships purely from patterns of word co-occurrence in text. The structure of language encodes the structure of the world.
Static vs contextual embeddings
Word2Vec (and its successors GloVe and FastText) produce static embeddings: each word gets one fixed vector regardless of context. "Bank" has the same embedding whether you're talking about a river bank or a financial bank.
This is a real limitation. Enter contextual embeddings.
In a transformer-based model like GPT or BERT, each word's representation changes based on the surrounding sentence. The embedding for "bank" in "river bank" is different from "bank" in "bank account." Each layer of the transformer refines these representations, adding more context.
In a large model such as GPT-3, a token passes through 96 transformer layers. Probing studies suggest that early layers capture mostly local, syntactic information, middle layers increasingly semantic content, and the deepest layers nuanced, context-specific meaning. The "embedding" at any given layer is a snapshot of the model's evolving understanding of that token.
This is why large models are so powerful: they don't just look up a word's meaning — they compute its meaning fresh for every occurrence based on full context.
How LLMs use embeddings
In a modern LLM, embeddings serve as the input and output layers:
Input: When you type "The cat sat on the mat," each token is first converted to its embedding vector (via a lookup table with ~100,000 rows). These vectors become the initial input to the transformer layers.
Processing: Each transformer layer takes the embeddings, applies attention and feed-forward operations, and outputs updated embeddings. Information flows between tokens through attention — each token's embedding gradually incorporates context from the entire sequence.
Output: The final layer's embeddings are projected to vocabulary-sized logits via a linear layer (often sharing weights with the input embedding table). These logits become probabilities for the next token.
The entire model can be viewed as a function that takes in embedding vectors and transforms them through dozens of layers until they encode enough information to predict what comes next.
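The input and output ends of that pipeline can be sketched in NumPy. The vocabulary, sizes, and the do-nothing "transformer" below are stand-ins (a real model has on the order of 100,000 vocabulary rows and dozens of attention and feed-forward layers), but the lookup, projection, and weight tying are the real structure:

```python
import numpy as np

# Stand-in vocabulary and sizes, invented for illustration.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 16
E = rng.normal(0, 0.02, (V, D))    # embedding lookup table, one row per token

token_ids = [0, 1, 2, 3, 0, 4]     # "the cat sat on the mat"
x = E[token_ids]                   # input: lookup, shape (seq_len, D)

# ... transformer layers would repeatedly update x via attention and
# feed-forward blocks; here we leave x unchanged as a placeholder ...
h = x

# Output: project the last position's embedding back to vocabulary logits,
# sharing weights with the input table (the "tied embeddings" trick).
logits = h[-1] @ E.T               # shape (V,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()               # next-token probability distribution
print(vocab[int(np.argmax(probs))])
```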
Practical applications beyond LLMs
Embeddings aren't just an internal mechanism — they're a powerful tool in their own right:
Semantic search: Embed documents and queries into the same vector space. The most relevant documents are those whose embeddings are closest to the query embedding. This powers modern search engines and is far more nuanced than keyword matching.
Retrieval-Augmented Generation (RAG): Embed your knowledge base into a vector database. When a user asks a question, find the most relevant chunks via embedding similarity and inject them into the LLM's prompt. This grounds the model in factual, up-to-date information.
Recommendations: Embed users and items into the same space. Recommend items whose embeddings are close to the user's embedding. Netflix, Spotify, and Amazon all use variations of this.
Clustering and classification: Embed text samples and run standard ML algorithms (k-means, SVM) on the vectors. This works surprisingly well even for complex tasks like sentiment analysis, topic modeling, and spam detection.
Anomaly detection: If a new piece of text has an embedding far from any cluster, it might be unusual — useful for fraud detection, content moderation, or novelty detection.
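All of these applications share one primitive: embed everything into the same space, then find nearest neighbors. The sketch below uses a bag-of-words count vector as a stand-in embedding so it stays self-contained; note that, unlike a learned embedding, it cannot know that "kitten" is near "cat", which is exactly the keyword-matching limitation neural embeddings fix:

```python
import math
from collections import Counter

docs = [
    "the cat chased the mouse",
    "stock markets fell sharply today",
    "a kitten played with a toy mouse",
]

def embed(text):
    # Bag-of-words stand-in; a real system would call a neural embedding model.
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def search(query, docs):
    # Return the document whose embedding is closest to the query's.
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

print(search("cat and mouse games", docs))  # "the cat chased the mouse"
```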
The geometry of meaning
What makes embeddings profound isn't just that they work — it's what they reveal. The fact that arithmetic on word vectors captures semantic relationships suggests that meaning itself has geometric structure. Directions in embedding space correspond to concepts: gender, tense, formality, sentiment.
This geometry isn't designed — it's discovered from data. And it's consistent enough to be useful across languages, domains, and tasks. That's remarkable.
Explore embeddings interactively
See how words cluster by meaning in our interactive embedding explorer — type words and watch them plotted in a 2D projection of embedding space, with similar words gravitating toward each other.