
Transformers Handbook

A practical guide to transformer architecture: self-attention, encoder and decoder stacks, BERT-style encoders, modern descendants, and the trade-offs between dense and Mixture-of-Experts models.

Attention-based models · Encoder / decoder stacks · Dense vs MoE · May 2026
Core idea: transformers remove recurrence and convolution from the sequence path and replace them with attention plus feed-forward layers. That one change made parallel training practical for large language models and modern multimodal systems.

Attention Is All You Need

The original transformer paper introduced the architecture that made modern LLMs possible. The key move was self-attention: instead of reading tokens one by one, the model lets each token directly attend to every other token in the context window. That gives the model global context, efficient parallelism during training, and a flexible foundation for scaling.
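
To make "each token attends to every other token" concrete, here is a minimal sketch of scaled dot-product attention for a single head, written in plain Python. It is an illustration under simplifying assumptions, not the paper's full implementation: the learned query, key, and value projections are omitted, so the same toy vectors play all three roles, and the function names are chosen for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries, keys, values: lists of vectors, one per token.
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Compare this token's query against every key in the sequence.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy input: three tokens with 2-dimensional vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, and every output vector mixes information from all three tokens. Because the per-token loop has no dependency between iterations, real implementations compute it as one batched matrix operation, which is where the training parallelism comes from.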

Self-Attention: Every token computes a weighted view of the full sequence, which makes long-range dependencies much easier to model than with pure RNNs.
Parallelism: During training, all tokens in a sequence are processed in parallel, so training scales far better than with sequential architectures.
Scalability: The same block design can be stacked deeply and adapted into encoder-only, decoder-only, or encoder-decoder systems.
Reference: Vaswani et al., Attention Is All You Need (2017). The paper is still the canonical starting point for understanding modern transformer design.

Encoder, Decoder, and Encoder-Decoder

Transformer models are built from repeated blocks. The standard building blocks are multi-head attention, feed-forward layers, residual connections, and layer normalization.
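
The way these four pieces fit together can be sketched as follows. This is a simplified illustration that assumes the post-norm ordering of the original paper, LayerNorm(x + Sublayer(x)). The attention sublayer is replaced by a uniform-averaging stand-in (`mean_attention`, a hypothetical helper), layer normalization has no learned gain or bias, and the feed-forward weights are arbitrary toy values.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum, used for the residual connection."""
    return [x + y for x, y in zip(a, b)]

def feed_forward(x, w1, w2):
    """Position-wise feed-forward: linear, ReLU, linear (biases omitted)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def transformer_block(seq, attend, w1, w2):
    """One post-norm block applied to a sequence of token vectors."""
    attended = attend(seq)
    # Sublayer 1: attention, then residual add, then layer norm.
    seq = [layer_norm(add(x, a)) for x, a in zip(seq, attended)]
    # Sublayer 2: feed-forward, then residual add, then layer norm.
    seq = [layer_norm(add(x, feed_forward(x, w1, w2))) for x in seq]
    return seq

def mean_attention(seq):
    """Stand-in for attention: every token receives the mean of all tokens."""
    n = len(seq)
    mean = [sum(col) / n for col in zip(*seq)]
    return [mean[:] for _ in seq]

w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # toy weights, 2 -> 3
w2 = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]    # toy weights, 3 -> 2
seq = [[1.0, 2.0], [3.0, 0.5]]
out = transformer_block(seq, mean_attention, w1, w2)
```

The block returns a sequence with the same shape as its input, which is what allows identical blocks to be stacked. Many later models move the normalization before each sublayer (pre-norm); the sketch keeps the original ordering only for simplicity.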

Encoder

The encoder reads the entire input sequence and produces contextual embeddings. Each token representation is enriched by attention over all other input tokens. This is ideal for understanding and representation tasks.

Decoder

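The decoder generates tokens autoregressively, one at a time, each conditioned on the tokens before it. It uses masked (causal) self-attention so that each position can attend only to earlier positions. During training this hides future tokens while the whole sequence is still processed in parallel, and during inference it matches left-to-right generation. This is the basis for GPT-style text generation models.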
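
The masking step can be sketched in a few lines. The example below is illustrative only: it builds a causal mask and applies it to one row of hypothetical attention scores, so the helper names and the score values are assumptions made for this example.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions; disallowed positions get weight 0."""
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Hypothetical raw attention scores from position 1 to all four positions.
scores = [0.5, 1.5, -0.2, 2.0]
weights_for_position_1 = masked_softmax(scores, mask[1])
```

Position 1 ends up with non-zero weight only on positions 0 and 1, even though position 3 had the highest raw score. Frameworks usually get the same effect by adding a large negative value to the masked scores before the softmax.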

Encoder + Decoder

The encoder-decoder pattern fits tasks that map a complete input sequence to a distinct output sequence, such as translation, summarization, and structured generation. The decoder attends to the encoder outputs through cross-attention.
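
Cross-attention uses the same arithmetic as self-attention, except that the queries come from the decoder while the keys and values come from the encoder. A minimal sketch, assuming toy vectors and no learned projections (the `attend` helper and all values are illustrative):

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention; queries may come from a different
    sequence than keys and values."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 input tokens
decoder_states = [[0.5, 0.5], [2.0, 0.0]]              # 2 output positions so far

# Each decoder position gets a weighted summary of the encoder outputs.
context = attend(decoder_states, encoder_states, encoder_states)
```

The result has one vector per decoder position, regardless of how long the input sequence is, which is why input and output lengths can differ freely.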

# Conceptual structure, not a full framework example
input tokens -> [Encoder stack] -> contextual representations
previous output tokens + contextual representations -> [Decoder stack: masked self-attention + cross-attention] -> next output token

| Component | What it does | Typical use |
| --- | --- | --- |
| Encoder | Builds contextual input representations | Classification, retrieval, embeddings |
| Decoder | Generates tokens one step at a time | Chat, code generation, completion |
| Encoder-decoder | Maps one sequence to another sequence | Translation, summarization, structured tasks |

BERT and Encoder-Only Models

BERT is one of the most important transformer descendants. It uses the encoder stack only, with bidirectional attention, so every token can see the full input. That makes it strong for understanding tasks rather than generation.

BERT-style models: excellent for embeddings, search, classification, extraction, reranking, and retrieval pipelines. They are not naturally suited to open-ended text generation because they are trained with bidirectional attention rather than a causal, left-to-right objective.
Good fit: Sentence embeddings, semantic search, document classification, named entity extraction, and reranking.
Not a great fit: Open-ended generation where the model needs to write long answers token by token from scratch.
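
As a rough sketch of how encoder outputs feed semantic search: pool the per-token vectors into one sentence vector, then rank documents by cosine similarity to the query. Mean pooling is only one common choice, and the token vectors below are made-up stand-ins for what an encoder would produce; no real model is called.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into a single fixed-size embedding."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical encoder outputs: one 3-dimensional vector per token.
docs = {
    "doc_a": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "doc_b": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}
query_tokens = [[1.0, 0.0, 0.0], [0.6, 0.4, 0.2]]

query_vec = mean_pool(query_tokens)
ranked = sorted(
    docs,
    key=lambda name: cosine(query_vec, mean_pool(docs[name])),
    reverse=True,
)
```

Here `ranked` lists `doc_a` first because its pooled vector points in nearly the same direction as the query. In a retrieval pipeline the document embeddings would typically be computed once and stored, with only the query embedded at search time.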

How Modern Models Use This Architecture

Most modern foundation models are transformer variants. The core design is still attention + feed-forward blocks, but the surrounding training recipe, parameter scaling, attention tricks, and routing mechanisms have evolved.

| Model family | Transformer type | Typical use |
| --- | --- | --- |
| GPT-style models | Decoder-only | Chat, generation, agents, code |
| BERT-style models | Encoder-only | Embedding, retrieval, classification |
| T5 / FLAN-T5 | Encoder-decoder | Translation, summarization, instruction tasks |
| Vision Transformers | Encoder-based, with image patches as tokens | Vision classification and multimodal backbones |

Derived Architectures

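Transformer research has produced many specialized descendants, including the encoder-only, decoder-only, and encoder-decoder families above and the Mixture-of-Experts variants discussed next.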

Dense vs MoE

Dense transformers activate all parameters for every token. Mixture-of-Experts (MoE) models instead route each token to a small subset of experts, so only those experts' parameters are used for that token. As a result, an MoE model can grow its total parameter count much faster than its per-token compute cost.
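
The routing idea can be sketched with a top-k gate. This is a toy illustration, not any particular model's router: the experts are simple scaling functions, the router scores are hard-coded rather than produced by a learned layer, and `route_top_k` and `moe_layer` are names invented for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return {i: probs[i] / total for i in top}

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and combine their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    out = [0.0] * len(x)
    for i, gate in gates.items():
        y = experts[i](x)  # experts that were not selected are never called
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, gates

# Four toy experts, each scaling its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical router scores for one token

out, gates = moe_layer([1.0, -1.0], experts, router_logits, k=2)
```

With these scores only experts 1 and 3 run for this token; the other two contribute no compute. In a real MoE layer the router scores come from a learned projection of the token, and training usually adds a load-balancing term so tokens do not all pile onto the same experts.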

| Type | Strength | Trade-off |
| --- | --- | --- |
| Dense | Simple, predictable, easier to deploy and optimize | Compute cost rises directly with model size |
| MoE | Very large parameter capacity with lower active compute | Routing complexity, operational complexity, and harder serving |
Practical trade-off: MoE is attractive when you want very large capacity and can tolerate more complex infra. Dense models are usually easier to deploy, easier to reason about, and often the better default for smaller teams or strict latency targets.
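
A back-of-the-envelope calculation shows the gap between stored and active parameters. The sizes below are assumed for illustration and describe no specific model; the count covers only the expert feed-forward weights of a single layer and ignores biases, the router, and attention.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) feed-forward parameter counts for one MoE layer."""
    per_expert = 2 * d_model * d_ff  # up-projection plus down-projection
    total = num_experts * per_expert  # must be stored and served
    active = top_k * per_expert       # actually used for any single token
    return total, active

# Assumed sizes, chosen only to make the arithmetic easy to follow.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
active_fraction = active / total
```

With 8 experts and top-2 routing, each token touches a quarter of the layer's feed-forward weights, yet all of them must still sit in memory. That is the serving cost behind the trade-off above.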

Which Transformer Should You Use?

Choose the architecture based on the task instead of starting from model size alone.

| Task | Best transformer family | Why |
| --- | --- | --- |
| Chat / generation | Decoder-only | Autoregressive generation is the natural fit |
| Search / embeddings / classification | Encoder-only | Bidirectional context is great for understanding |
| Summarization / translation | Encoder-decoder | Best when input and output both matter |
| Very large capacity at lower active compute | MoE transformer | Useful when scale matters more than simplicity |
| Small team, stable deployment, low complexity | Dense transformer | Operationally simpler and easier to serve |
Rule of thumb: if you are building an assistant or generator, start with a decoder-only model. If you are building retrieval, classification, or embeddings, start with an encoder-only model. If you are translating or compressing one sequence into another, use encoder-decoder.
