Transformers Handbook
A practical guide to transformer architecture: self-attention, encoder and decoder stacks, BERT-style encoders, modern descendants, and the trade-offs between dense and Mixture-of-Experts models.
Attention Is All You Need
The original transformer paper (Vaswani et al., 2017) introduced the architecture that made modern LLMs possible. The key move was self-attention: instead of processing tokens one at a time, as recurrent networks do, the model lets each token attend directly to every other token in the context window. That gives the model global context, efficient parallelism during training, and a flexible foundation for scaling.
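The idea of each token attending to every other token can be sketched as scaled dot-product attention. This is a minimal illustration in plain Python, not a production implementation: the function names are hypothetical, and the toy example reuses the same vectors as queries, keys, and values instead of applying the learned projections a real model would use.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    queries, keys, values: lists of vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how strongly
    token i attends to token j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this token's query with every token's key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy input: 3 tokens with 2-dimensional vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Every row of `weights` sums to 1, and no row depends on the order in which the other rows are computed, which is why the whole sequence can be processed in parallel.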
Encoder, Decoder, and Encoder-Decoder
Transformer models are built from repeated blocks. The standard building blocks are multi-head attention, feed-forward layers, residual connections, and layer normalization.
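How those building blocks fit together can be sketched as follows. This assumes the post-norm arrangement (normalize after adding the residual) and uses hypothetical stand-ins for the learned attention and feed-forward sublayers, so it shows only the wiring of one block, not a trained model.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance.
    The learned scale and shift parameters are omitted for brevity."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def add_and_norm(inputs, sublayer_outputs):
    """Residual connection followed by layer normalization, per token."""
    return [layer_norm([a + b for a, b in zip(x, y)])
            for x, y in zip(inputs, sublayer_outputs)]

def transformer_block(tokens, attention_fn, feed_forward_fn):
    """One block: an attention sublayer, then a feed-forward sublayer.

    attention_fn maps the whole sequence to a new sequence (tokens mix);
    feed_forward_fn maps a single token vector to a new vector (no mixing).
    """
    attended = add_and_norm(tokens, attention_fn(tokens))
    fed = [feed_forward_fn(t) for t in attended]
    return add_and_norm(attended, fed)

# Hypothetical stand-ins for the learned sublayers, for illustration only.
def mean_mixing(tokens):
    """Replace every token with the sequence average (uniform attention)."""
    avg = [sum(col) / len(tokens) for col in zip(*tokens)]
    return [avg[:] for _ in tokens]

def relu_ffn(vec):
    """Element-wise ReLU standing in for a two-layer feed-forward network."""
    return [max(0.0, x) for x in vec]

x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
y = transformer_block(x, mean_mixing, relu_ffn)
```

The output has the same shape as the input, which is what allows identical blocks to be stacked.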
Encoder
The encoder reads the entire input sequence and produces contextual embeddings. Each token representation is enriched by attention over all other input tokens. This is ideal for understanding and representation tasks.
Decoder
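The decoder generates tokens autoregressively. It uses masked (causal) self-attention, so each position can attend only to itself and earlier positions. During training the mask hides future tokens; at inference those tokens have not been generated yet. This is the basis for GPT-style text generation models.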
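The masking step can be sketched as below. The helper names and the raw scores are assumptions made for illustration; a real implementation typically applies the mask to a score matrix inside the attention computation.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; masked positions get weight 0."""
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed raw attention scores for a 3-token sequence.
raw_scores = [[0.5, 1.0, 2.0],
              [0.1, 0.2, 0.3],
              [1.0, 1.0, 1.0]]
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(raw_scores, mask)]
```

The first token can attend only to itself, so its weight row is `[1.0, 0.0, 0.0]` even though its raw scores for later positions were higher.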
Encoder + Decoder
The encoder-decoder pattern fits tasks that map a complete input sequence to a separate output sequence, such as translation, summarization, and structured generation. The decoder attends to encoder outputs through cross-attention.
# Conceptual structure, not a full framework example
input_tokens -> [Encoder stack] -> contextual representations
contextual representations + previous output tokens -> [Decoder stack with masked self-attention + cross-attention] -> next output token
| Component | What it does | Typical use |
|---|---|---|
| Encoder | Builds contextual input representations | Classification, retrieval, embeddings |
| Decoder | Generates tokens one step at a time | Chat, code generation, completion |
| Encoder-decoder | Maps one sequence to another sequence | Translation, summarization, structured tasks |
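
Cross-attention differs from self-attention only in where the inputs come from: queries come from the decoder, while keys and values come from the encoder outputs. The sketch below illustrates that with assumed toy vectors and a hypothetical `attend` helper; learned projections are omitted.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return mixed, weights

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source tokens
decoder_states = [[2.0, 0.0], [0.0, 2.0]]               # 2 target positions so far

# Each decoder position queries all encoder outputs; no causal mask is
# needed here because the entire input is already known.
cross = [attend(state, encoder_outputs, encoder_outputs) for state in decoder_states]
contexts = [mixed for mixed, _ in cross]
weights = [w for _, w in cross]
```

The weight matrix has one row per decoder position and one column per encoder token, so the input and output sequences can have different lengths.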
BERT and Encoder-Only Models
BERT is one of the most influential transformer descendants. It uses only the encoder stack, with bidirectional attention, so every token can attend to the full input on both its left and its right. That makes it strong for understanding tasks rather than generation.
How Modern Models Use This Architecture
Most modern foundation models are transformer variants. The core design is still attention + feed-forward blocks, but the surrounding training recipe, parameter scaling, attention tricks, and routing mechanisms have evolved.
| Model family | Transformer type | Typical use |
|---|---|---|
| GPT-style models | Decoder-only | Chat, generation, agents, code |
| BERT-style models | Encoder-only | Embedding, retrieval, classification |
| T5 / FLAN-T5 | Encoder-decoder | Translation, summarization, instruction tasks |
| Vision Transformers | Encoder-only, with image patches as tokens | Vision classification and multimodal backbones |
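
For the Vision Transformer row, "image patches as tokens" can be sketched as follows. This is a simplified illustration that assumes a single-channel image stored as a list of rows; the function name is hypothetical. In a real ViT each flattened patch is then passed through a learned linear projection and combined with a position embedding before entering the encoder.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D grayscale image (list of rows) into flattened square
    patches, in row-major patch order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size window into a vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Assumed toy input: a 4x4 image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
# patches == [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

A 4x4 image with 2x2 patches yields a sequence of 4 tokens, each of length 4, which the encoder treats the same way it would treat word embeddings.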
Derived Architectures
Transformer research has produced many specialized descendants. Examples include:
- Decoder-only LLMs: GPT, Llama, Mistral, Qwen, DeepSeek-style families.
- Encoder-only stacks: BERT, RoBERTa, DeBERTa, modern embedding models.
- Encoder-decoder systems: T5, BART, mT5, FLAN-T5, many translation and summarization models.
- Vision and multimodal transformers: ViT, CLIP-style dual encoders, and vision-language backbones.
Dense vs MoE
Dense transformers activate all parameters for every token. Mixture-of-Experts (MoE) models route each token to only a small subset of experts. Per-token compute therefore tracks the active parameters rather than the total, so an MoE model can grow its total parameter count without a matching rise in compute per token.
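The gap between total and active parameters is simple arithmetic. The sketch below uses hypothetical layer sizes and assumes each expert is a feed-forward network with two weight matrices; it ignores biases, the router's own weights, and the attention layers, which are shared by all tokens.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) feed-forward parameter counts for one MoE layer."""
    per_expert = 2 * d_model * d_ff   # up-projection plus down-projection
    total = num_experts * per_expert  # what must be stored
    active = top_k * per_expert       # what one token actually uses
    return total, active

# Hypothetical sizes chosen only for illustration.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
ratio = total // active  # 8 experts with 2 active gives a 4x gap
```

Under these assumptions the layer stores 67,108,864 feed-forward parameters but uses only 16,777,216 of them for any one token.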
| Type | Strength | Trade-off |
|---|---|---|
| Dense | Simple, predictable, easier to deploy and optimize | Per-token compute grows with total parameter count |
| MoE | Very large parameter capacity with lower active compute | Routing and load-balancing complexity; all experts must stay in memory, which makes serving harder |
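
Token routing can be sketched as top-k gating. This is one common scheme shown under stated assumptions, not the method of any specific model: the router logits are passed in directly (a real router computes them from the token with a learned layer), and the experts are hypothetical scaling functions.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    top = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - top) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and combine their outputs by gate weight."""
    routes = top_k_route(router_logits, k)
    out = [0.0] * len(token)
    for idx, gate in routes:
        expert_out = experts[idx](token)  # unselected experts are never called
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in routes]

# Hypothetical experts: each one just scales its input by a constant.
experts = [
    lambda v: [x * 1.0 for x in v],
    lambda v: [x * 2.0 for x in v],
    lambda v: [x * 3.0 for x in v],
    lambda v: [x * 4.0 for x in v],
]
router_logits = [0.1, 2.0, 0.3, 2.0]  # assumed router scores for one token
output, used = moe_layer([1.0, 1.0], experts, router_logits, k=2)
```

Here experts 1 and 3 tie for the highest score, so each receives a gate of 0.5 and the other two experts do no work for this token. Uneven routing across many tokens is what makes load balancing a practical concern.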
Which Transformer Should You Use?
Choose the architecture based on the task instead of starting from model size alone.
| Task | Best transformer family | Why |
|---|---|---|
| Chat / generation | Decoder-only | Autoregressive generation is the natural fit |
| Search / embeddings / classification | Encoder-only | Bidirectional context suits understanding tasks |
| Summarization / translation | Encoder-decoder | The encoder reads the full input; the decoder generates a separate output sequence |
| Very large capacity at lower active compute | MoE transformer | Useful when scale matters more than simplicity |
| Small team, stable deployment, low complexity | Dense transformer | Operationally simpler and easier to serve |