Transformers

Transformers made modern language models practical by changing the unit of sequence modeling. A recurrent network carries information through a hidden state one step at a time. A transformer lets every token build a content-addressed view of the other tokens in the context, then repeats the same block structure layer after layer, each layer with its own weights. The original paper removed recurrence and convolution from an encoder-decoder translation model and used stacked attention plus feed-forward layers instead (Attention Is All You Need).

Figure: Full transformer architecture, showing the encoder, decoder, self-attention, cross-attention, feed-forward layers, normalization, and residual paths.

The Core Bet

The transformer bet is simple: token interactions should be explicit matrix operations. At each layer, the model projects the current token representations into queries, keys, and values. Queries ask what a token needs. Keys advertise what each token contains. Values carry the information that gets mixed into the next representation. Scaled dot-product attention computes softmax(QK^T / sqrt(d_k))V, so each token receives a weighted mixture of value vectors from the sequence (Attention Is All You Need).
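
The formula is short enough to write out directly. The sketch below is a minimal, unbatched version in plain Python, assuming Q, K, and V are already projected and stored as lists of row vectors. The toy numbers are made up for illustration, and a real implementation would use a tensor library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V on lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One score per key: how well this query matches that key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        row = softmax(scores)
        weights.append(row)
        # Weighted mixture of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy inputs (illustrative only): 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to one, so every output row is a convex combination of the rows of `V`.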

That mechanism gives the model a dense interaction graph. A token near the end of a prompt can attend to a token near the beginning without waiting for a recurrent state to carry information across hundreds of steps. The price is concrete: the attention score matrix grows with the square of sequence length, and long contexts also enlarge the key-value cache used during autoregressive inference (FlashAttention, GQA).

Token Stream

A transformer does not see words directly. A tokenizer maps text to token IDs, an embedding table maps IDs to vectors, and positional information tells the model where each token sits. The 2017 transformer used sinusoidal positional encodings added to token embeddings because the architecture itself has no recurrence or convolution to encode order (Attention Is All You Need).
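
A small sketch of that pipeline, assuming an even model dimension and the sine/cosine formula from the original paper. The embedding table and token IDs are invented for illustration; a trained model would learn the table.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table; d_model is assumed even."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

# Hypothetical 3-entry embedding table with d_model = 4.
embedding_table = {
    0: [0.1, 0.2, 0.3, 0.4],
    1: [0.5, 0.6, 0.7, 0.8],
    2: [0.9, 1.0, 1.1, 1.2],
}
token_ids = [2, 0, 2]  # the same token appears at positions 0 and 2

pe = sinusoidal_encoding(len(token_ids), 4)
inputs = [
    [e + p for e, p in zip(embedding_table[tok], pe[pos])]
    for pos, tok in enumerate(token_ids)
]
```

Token 2 appears twice, but `inputs[0]` and `inputs[2]` differ because the positional term differs. That difference is the only thing telling the attention layers which occurrence came first.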

Modern LLMs often use learned or rotary positional schemes rather than the original sinusoidal encoding. Rotary Position Embedding, or RoPE, rotates query and key vectors as a function of position, which makes relative position enter the attention score itself (RoFormer). That detail matters for long context. Position is not a cosmetic feature; it determines whether the attention operation can distinguish "the file mentioned earlier" from "the file mentioned just now."
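
The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of dimensions, which is one common layout; some implementations pair dimensions differently. The vectors and the base of 10000 are illustrative assumptions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 5), rope(k, 3))    # query at 5, key at 3: offset 2
score_b = dot(rope(q, 12), rope(k, 10))  # both shifted by 7: offset still 2
score_c = dot(rope(q, 5), rope(k, 4))    # offset 1

print(round(score_a, 6), round(score_b, 6), round(score_c, 6))
```

`score_a` and `score_b` agree up to floating-point error because only the offset between query and key survives the dot product. `score_c` differs because the offset changed.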

Attention Heads

Multi-head attention repeats the query-key-value operation several times with different learned projections. The original transformer concatenates the head outputs and projects them back into the model dimension (Attention Is All You Need).

token representations
  -> Q, K, V projections for head 1 -> attention output
  -> Q, K, V projections for head 2 -> attention output
  -> ...
  -> concatenate heads
  -> output projection

Heads do not guarantee neat human categories. One head may learn a syntax pattern, another may track delimiters, and another may collapse into redundant behavior. The architectural point is broader: the model gets multiple learned subspaces for routing information before the feed-forward network transforms each position independently.
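
The concatenate-then-project flow in the diagram can be sketched in plain Python. The weights are random and untrained, and the sizes (`d_model = 8`, two heads) are arbitrary choices for illustration, so this shows only the data flow and shapes.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    # Each head has its own Q, K, V projections into a smaller subspace.
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate head outputs along the feature axis, one row per token.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    # Project back into the model dimension.
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
W_o = random_matrix(d_model, d_model, rng)
X = random_matrix(5, d_model, rng)  # 5 tokens

Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))
```

The output has the same shape as the input, which is what lets the result be added back into the residual stream.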

The Block

A transformer layer is a residual block with two main jobs. The attention sublayer mixes information across positions. The feed-forward sublayer transforms each position's representation independently through a shallow MLP, which in the original is two linear layers with a ReLU between them. The original architecture wraps those sublayers with residual connections, layer normalization, and dropout (Attention Is All You Need).

residual stream
  -> attention mixes tokens
  -> add back to residual stream
  -> feed-forward network transforms each position
  -> add back to residual stream

The residual stream gives the stack a working memory channel. Each layer edits the representation rather than replacing it from scratch. In current decoder-only LLMs, implementation details vary: normalization can sit before or after sublayers, activation functions differ, and feed-forward layers may use gated variants. Those changes matter, but they preserve the same division of labor: attention routes context; the MLP computes local feature updates.
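
A sketch of the "edit, don't replace" pattern, assuming the pre-normalization layout (normalize, run the sublayer, add back). The original paper normalizes after the addition instead. The two sublayers here are deliberately crude stand-ins, not real attention or a trained MLP.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one position's vector (no learned gain or bias in this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def mean_mix(xs):
    """Stand-in for attention: every position reads the average of all positions."""
    n = len(xs)
    avg = [sum(col) / n for col in zip(*xs)]
    return [avg[:] for _ in xs]

def relu_mlp(x):
    """Stand-in for the feed-forward network: acts on one position, no token mixing."""
    return [0.1 * max(v, 0.0) for v in x]

def block(stream, mix=mean_mix, ffn=relu_mlp):
    # Sublayer 1: mix across positions, then add back to the residual stream.
    normed = [layer_norm(x) for x in stream]
    stream = [[a + b for a, b in zip(x, u)] for x, u in zip(stream, mix(normed))]
    # Sublayer 2: per-position update, then add back again.
    normed = [layer_norm(x) for x in stream]
    stream = [[a + b for a, b in zip(x, ffn(h))] for x, h in zip(stream, normed)]
    return stream

stream = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
after = block(stream)

# If both sublayers output zeros, the block is the identity.
zero_mix = lambda xs: [[0.0] * len(x) for x in xs]
zero_ffn = lambda x: [0.0] * len(x)
unchanged = block(stream, mix=zero_mix, ffn=zero_ffn)
```

With zeroed sublayers the stream passes through untouched, which is the sense in which each layer writes an update into the stream rather than rebuilding it.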

Encoder, Decoder, Decoder-Only

The original transformer used an encoder-decoder layout for machine translation. The encoder reads the source sequence with unmasked self-attention. The decoder uses masked self-attention over the generated prefix, then cross-attends to encoder outputs before predicting the next target token (Attention Is All You Need).

Shape	Attention pattern	Usual job	Example source
Encoder-only	Bidirectional self-attention	Represent or classify text	BERT pretrains bidirectional representations with masked language modeling (BERT)
Encoder-decoder	Encoder self-attention plus decoder causal and cross-attention	Conditional generation such as translation or text-to-text tasks	The original transformer and T5 use this family of layout (Attention Is All You Need, T5)
Decoder-only	Causal self-attention	Autoregressive language modeling	GPT-style models generate by predicting the next token from the prefix (Improving Language Understanding by Generative Pre-Training, GPT-2 paper)

Most chat LLMs use the decoder-only shape because next-token prediction pairs cleanly with open-ended generation. A causal mask prevents each position from attending to future positions during training, so the model can learn from many positions in parallel while preserving the left-to-right generation rule at inference time (Attention Is All You Need, GPT-2 paper).
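
The mask itself is a small piece of code. The sketch below assumes a square matrix of raw attention scores for one head and sets every future position to negative infinity before the softmax. The score values are arbitrary.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions in a square score matrix, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may read positions 0..i; later positions get -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Arbitrary raw scores for a 4-token sequence.
scores = [
    [0.2, 1.5, -0.3, 0.9],
    [1.0, 0.1, 2.0, -1.2],
    [0.4, 0.4, 0.4, 3.0],
    [-0.5, 0.7, 0.0, 0.3],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The result is lower-triangular: row `i` puts all of its weight on positions `0..i`. All four rows are computed in one pass, yet none of them can see a token that would not exist yet during generation.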

Training and Inference

Transformer training is parallel across positions inside a sequence. The model can process a whole training context, apply the causal mask, and compute next-token losses for many positions in one pass. That property made the architecture fit accelerators better than recurrent sequence models, which have a stronger step-by-step dependency during training (Attention Is All You Need).

Inference still runs one generated token at a time for a decoder-only model. The serving system caches each layer's keys and values for the previous tokens, then computes attention for the new token against that cache. Multi-query attention and grouped-query attention shrink that cache, and the memory bandwidth spent reading it, by using fewer key-value heads than query heads. GQA sits between full multi-head attention and MQA, which keeps a single key-value head (GQA).
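
A single-head, single-layer sketch of the caching idea. The projection matrices and token vectors are invented for illustration, and `KVCache` is a hypothetical helper, not an API from any serving library. The cached path projects only the new token at each step; the uncached path re-projects the whole prefix.

```python
import math

# Made-up 2x2 projection matrices for one attention head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, -0.5], [0.5, 0.5]]
W_V = [[1.0, 1.0], [0.0, 2.0]]

def project(x, W):
    """Row vector times matrix."""
    return [sum(a * w for a, w in zip(x, col)) for col in zip(*W)]

def attend(q, keys, values):
    """Attention output for one query against a list of keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]

def step_without_cache(prefix):
    """Recompute keys and values for the whole prefix, then attend from the last token."""
    keys = [project(x, W_K) for x in prefix]
    values = [project(x, W_V) for x in prefix]
    return attend(project(prefix[-1], W_Q), keys, values)

class KVCache:
    """Hypothetical per-layer cache: one key and one value vector per past token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Project only the new token, append it, then attend over everything cached.
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        return attend(project(x, W_Q), self.keys, self.values)

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached = [cache.step(x) for x in tokens]
uncached = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
```

Both paths give the same outputs. The cache trades memory (one key and value per past token, per layer, per key-value head) for not redoing the prefix projections on every step.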

Long Context

Long context stresses two parts of the system. The number of attention scores grows with the square of sequence length during full-sequence training or prefill. The autoregressive cache grows with sequence length, layer count, key-value head count, and head dimension during serving. FlashAttention keeps exact attention but tiles computation to reduce reads and writes between GPU memory levels, which improves practical throughput without changing the model's mathematical attention result (FlashAttention).
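
Both growth rates are easy to put numbers on. The configuration below (32 layers, 32 query heads, head dimension 128, 16-bit values, 8,192 tokens) is a hypothetical example, not the specification of any particular model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2, batch=1):
    """Bytes needed to cache keys and values for one or more sequences."""
    # The leading 2 counts one tensor for keys and one for values.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def score_entries(seq_len, n_layers, n_query_heads):
    """Attention scores computed in one full-sequence pass, ignoring the causal mask.

    This counts work, not storage: a tiled kernel need not hold them all at once.
    """
    return n_layers * n_query_heads * seq_len * seq_len

# Hypothetical configuration, chosen only for round numbers.
n_layers, n_query_heads, head_dim, seq_len = 32, 32, 128, 8192

mha = kv_cache_bytes(seq_len, n_layers, n_kv_heads=32, head_dim=head_dim)  # one KV head per query head
gqa = kv_cache_bytes(seq_len, n_layers, n_kv_heads=8, head_dim=head_dim)   # 4 query heads share a KV head
mqa = kv_cache_bytes(seq_len, n_layers, n_kv_heads=1, head_dim=head_dim)   # all query heads share one

GIB = 2 ** 30
print(mha / GIB, gqa / GIB, mqa / GIB)  # 4.0 1.0 0.125

# Doubling the context doubles the cache but quadruples the score count.
growth = score_entries(2 * seq_len, n_layers, n_query_heads) / score_entries(seq_len, n_layers, n_query_heads)
```

Under these assumptions one sequence costs 4 GiB of cache with full multi-head attention, 1 GiB with eight key-value heads, and 128 MiB with one. The cache is linear in context length; the score count is quadratic.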

That distinction matters. Kernel work can make long context cheaper, and key-value sharing can reduce cache pressure, but neither gives a transformer unlimited working memory. A longer prompt still asks the model to route through more tokens, preserve more constraints, and spend more serving memory. Architecture and product design have to meet: retrieval, summarization, and state compression still belong outside the model for many applications.

Why Transformers Scale

Transformers scale well because they turn sequence learning into large dense tensor operations, and dense tensor operations map cleanly to GPU and TPU hardware. Empirical scaling work found smooth power-law relationships between language-model loss and model size, data size, and training compute across large ranges, while many architectural details had smaller effects within the studied regimes (Scaling Laws for Neural Language Models).
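
"Smooth power law" has a concrete meaning: multiplying model size by a fixed factor multiplies loss by a fixed factor, wherever you start. The sketch below uses the generic form loss = (n_c / n) ** alpha with placeholder constants. The values of `n_c` and `alpha` are illustrative assumptions, not the fitted values from the scaling-laws paper.

```python
def power_law_loss(n_params, n_c=1e13, alpha=0.08):
    """Generic power-law loss curve; n_c and alpha are placeholder constants."""
    return (n_c / n_params) ** alpha

# Doubling model size at two very different scales.
ratio_small = power_law_loss(2e8) / power_law_loss(1e8)
ratio_large = power_law_loss(2e10) / power_law_loss(1e10)

print(round(ratio_small, 4), round(ratio_large, 4))
```

Both ratios equal `2 ** -alpha`, so the curve is a straight line on log-log axes. That regularity is what makes loss at larger scales predictable from smaller runs, within the range where the fit holds.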

That does not mean architecture stopped mattering. Attention layout, positional encoding, normalization, activation choice, optimizer, data mixture, and inference kernel all decide whether a training run converts compute into useful behavior. The scaling result says larger transformer language models can improve predictably when data and compute grow together; it does not say any transformer configuration deserves more compute.

Failure Modes

Transformers fail in ways that follow from their structure. Long-context evaluations show that language models do not use every position equally well; Liu and colleagues found that performance can drop when relevant information sits in the middle of a long input rather than near the beginning or end (Lost in the Middle). During generation, every sampled token becomes part of the next context, so a bad early token changes the distribution for later tokens (GPT-2 paper).
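
The feedback loop in generation can be shown with a toy. The "model" below is a made-up deterministic rule standing in for a real next-token predictor, used only to show that each output is appended to the context that produces the next one.

```python
def toy_next_token(context):
    """Stand-in for a language model: a fixed rule over the whole context."""
    return sum(context) % 7

def generate(next_token, prompt, steps):
    context = list(prompt)
    for _ in range(steps):
        # The new token is appended and becomes input to every later step.
        context.append(next_token(context))
    return context

good = generate(toy_next_token, [1, 1], steps=4)
drifted = generate(toy_next_token, [1, 2], steps=4)  # one early token differs

print(good)     # [1, 1, 2, 4, 1, 2]
print(drifted)  # [1, 2, 3, 6, 5, 3]
```

Changing a single early token changes every token after it. A real model is stochastic and far less brittle than this rule, but the dependency structure is the same.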

My take: transformer reliability depends as much on the surrounding harness as on the block itself. A decoder can continue text, but product systems need retrieval, validation, tool boundaries, and memory management around it. The architecture supplies a powerful context mixer. It does not supply truth, intent, or operational safety by itself.

Takeaways

The transformer is useful because it separates sequence modeling into routing and local computation. Attention lets tokens read from other tokens. The feed-forward network edits each position. Residual streams carry state through depth. Causal masks turn the same machinery into a next-token generator. The main cost is also clear: attention and key-value caches become expensive as context grows. Most LLM engineering lives inside that trade-off.

References

author: Ope
tag: #ai, #architecture
links: [[Multi-Token Prediction]], [[Harness Engineering]], [[Full-Duplex Speech Models]]