Transformers Handbook

A practical guide to transformer architecture: self-attention, encoder and decoder stacks, BERT-style encoders, modern descendants, and the trade-offs between dense and Mixture-of-Experts models.

Attention-based models | Encoder / decoder stacks | Dense vs MoE | May 2026

Core idea: transformers remove recurrence and convolution from the sequence path and replace them with attention plus feed-forward layers. That one change made parallel training practical for large language models and modern multimodal systems.

Attention Is All You Need

The original transformer paper introduced the architecture that made modern LLMs possible. The key move was self-attention: instead of reading tokens one by one, the model lets each token directly attend to every other token in the context window. That gives the model global context, efficient parallelism during training, and a flexible foundation for scaling.

Self-Attention

Every token computes a weighted view of the full sequence, which makes long-range dependency modeling much easier than with pure RNNs.
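As a rough illustration of that weighted view, here is a minimal single-head sketch of scaled dot-product attention in plain Python. The toy vectors are made up, and the inputs are used directly as queries, keys, and values; a real model would first apply learned projection matrices, which this sketch leaves out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries, keys, values: one vector per token.
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this token's query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Weighted average of all value vectors.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy example: three tokens with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

Each row of `weights` sums to 1, so every output vector is a weighted average over the whole sequence, regardless of how far apart the tokens are.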

Parallelism

All tokens in a sequence can be processed in parallel during training, so training scales far better than it does with sequential architectures such as RNNs.

Scalability

The same block design can be stacked deeply and adapted into encoder-only, decoder-only, or encoder-decoder systems.

Reference: Vaswani et al., Attention Is All You Need (2017). The paper is still the canonical starting point for understanding modern transformer design.

Encoder, Decoder, and Encoder-Decoder

Transformer models are built from repeated blocks. The standard building blocks are multi-head attention, feed-forward layers, residual connections, and layer normalization.
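The way these pieces fit together can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: it assumes the add-then-normalize ordering of the original paper (many later models normalize before each sublayer instead), omits biases and learned normalization parameters, and uses a placeholder `uniform_attention` function and made-up weights in place of real multi-head attention.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, applied per token."""
    return [layer_norm([a + b for a, b in zip(xi, si)])
            for xi, si in zip(x, sublayer_out)]

def feed_forward(vec, w1, w2):
    """Position-wise feed-forward layer: linear -> ReLU -> linear (biases omitted)."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, unit))) for unit in w1]
    return [sum(h * w for h, w in zip(hidden, unit)) for unit in w2]

def transformer_block(x, attention_fn, w1, w2):
    """One block: attention sublayer, then feed-forward sublayer, each with add-and-norm."""
    x = add_and_norm(x, attention_fn(x))
    x = add_and_norm(x, [feed_forward(tok, w1, w2) for tok in x])
    return x

def uniform_attention(x):
    """Placeholder attention: every token receives the mean of all tokens."""
    n = len(x)
    mean_tok = [sum(tok[c] for tok in x) / n for c in range(len(x[0]))]
    return [mean_tok[:] for _ in x]

# Made-up weights: model width 2, hidden width 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # 3 hidden units, each reading 2 inputs
w2 = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]     # 2 outputs, each reading 3 hidden units
block_out = transformer_block([[1.0, 2.0], [3.0, 1.0]], uniform_attention, w1, w2)
```

Because the block maps a list of token vectors to a list of the same shape, copies of it can be stacked one after another.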

Encoder

The encoder reads the entire input sequence and produces contextual embeddings. Each token representation is enriched by attention over all other input tokens. This is ideal for understanding and representation tasks.

Decoder

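The decoder generates tokens autoregressively, one at a time, with each new token conditioned on the tokens before it. It uses masked (causal) self-attention so that a position can attend only to earlier positions; future tokens are hidden during both training and inference. This is the basis for text generation models such as GPT-style systems.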
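A small sketch of the masking step, assuming raw attention scores have already been computed (the numbers below are made up). Positions a token may not see are set to negative infinity before the softmax, so they receive exactly zero weight.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, giving zero weight to disallowed positions."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw attention scores for a 3-token sequence.
scores = [[0.2, 1.5, 0.3],
          [0.1, 0.4, 2.0],
          [1.0, 1.0, 1.0]]
mask = causal_mask(3)
causal_weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
```

The first token can only attend to itself, while the last token spreads its weight over all three positions.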

Encoder + Decoder

The encoder-decoder pattern fits tasks that map a full input sequence to a different output sequence, such as translation, summarization, and structured generation. The encoder reads the input, and the decoder attends to the encoder outputs through cross-attention while it generates.

# Conceptual structure, not a full framework example
input_tokens -> [Encoder stack] -> contextual representations
contextual representations + previous output tokens -> [Decoder stack with masked self-attention + cross-attention] -> next output token
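The cross-attention step can be sketched as ordinary attention in which the queries come from the decoder and the keys and values come from the encoder. The vectors below are made up for illustration, and learned projections are omitted.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention: each query takes a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * v[c] for e, v in zip(exps, values))
                        for c in range(len(values[0]))])
    return outputs

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 input tokens
decoder_states = [[0.0, 2.0], [2.0, 0.0]]                           # 2 output positions so far

# Cross-attention: queries from the decoder, keys and values from the encoder.
cross_out = attend(decoder_states, encoder_outputs, encoder_outputs)
```

The result has one vector per decoder position, each a mixture of encoder outputs, so input and output lengths are free to differ.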

Component	What it does	Typical use
Encoder	Builds contextual input representations	Classification, retrieval, embeddings
Decoder	Generates tokens one step at a time	Chat, code generation, completion
Encoder-decoder	Maps one sequence to another sequence	Translation, summarization, structured tasks

BERT and Encoder-Only Models

BERT is one of the most important transformer descendants. It uses the encoder stack only, with bidirectional attention, so every token can see the full input. That makes it strong for understanding tasks rather than generation.

BERT-style models: excellent for embeddings, search, classification, extraction, reranking, and retrieval pipelines. They are not naturally suited to open-ended text generation because bidirectional attention lets every token see the whole input, whereas generation requires predicting each next token from earlier tokens only.

Good Fit

Sentence embeddings, semantic search, document classification, named entity extraction, and reranking.
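To make the embedding and search use case concrete, here is a hedged sketch of how per-token encoder outputs might be turned into one vector per text and then ranked. Mean pooling is only one common pooling choice, and the token vectors and document names below are invented stand-ins for real encoder outputs.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size embedding."""
    n = len(token_vectors)
    return [sum(tok[c] for tok in token_vectors) / n
            for c in range(len(token_vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical contextual token vectors, as an encoder might produce them.
docs = {
    "doc_a": [[0.9, 0.1], [0.8, 0.2]],
    "doc_b": [[0.1, 0.9], [0.2, 0.7], [0.0, 1.0]],
}
query_tokens = [[1.0, 0.0], [0.7, 0.3]]

query_vec = mean_pool(query_tokens)
# Rank documents by similarity to the query, most similar first.
ranked = sorted(docs, key=lambda name: cosine(query_vec, mean_pool(docs[name])),
                reverse=True)
```

Texts of different lengths all end up as vectors of the same size, which is what makes them comparable in a search index.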

Not a Great Fit

Open-ended generation where the model needs to write long answers token by token from scratch.

How Modern Models Use This Architecture

Most modern foundation models are transformer variants. The core design is still attention + feed-forward blocks, but the surrounding training recipe, parameter scaling, attention tricks, and routing mechanisms have evolved.

Model family	Transformer type	Typical use
GPT-style models	Decoder-only	Chat, generation, agents, code
BERT-style models	Encoder-only	Embedding, retrieval, classification
T5 / FLAN-T5	Encoder-decoder	Translation, summarization, instruction tasks
Vision Transformers	Encoder-only over image-patch tokens	Vision classification and multimodal backbones

Derived Architectures

Transformer research produced many specialized descendants. Some examples include:

Decoder-only LLMs: GPT, Llama, Mistral, Qwen, DeepSeek-style families.
Encoder-only stacks: BERT, RoBERTa, DeBERTa, modern embedding models.
Encoder-decoder systems: T5, BART, mT5, FLAN-T5, many translation and summarization models.
Vision and multimodal transformers: ViT, CLIP-style dual encoders, and vision-language backbones.
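For the vision case, the "tokens" are image patches. The following sketch shows only the patch-splitting step on a tiny made-up single-channel image; a real vision transformer would then project each flattened patch into an embedding and add position information, which is not shown here.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# 4x4 single-channel image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
```

A 4x4 image with 2x2 patches yields four patch tokens, which the encoder then treats like a four-token sequence.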

Dense vs MoE

Dense transformers activate all parameters for every token. Mixture-of-Experts (MoE) models route tokens to only a subset of experts. That means MoE can scale parameter count much faster without paying the full compute cost on every token.
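The routing idea can be sketched as follows. This is a simplified illustration rather than any particular model's router: it assumes top-k selection with a softmax over the selected gate logits, uses toy lambda functions where real experts would be feed-forward networks, and hard-codes gate logits that a learned router would normally produce.

```python
import math

def top_k_route(gate_logits, k):
    """Pick the k experts with the highest gate logits and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(token, experts, gate_logits, k=2):
    """Run only the selected experts on the token and mix their outputs by gate weight."""
    routing = top_k_route(gate_logits, k)
    out = [0.0] * len(token)
    for idx, weight in routing.items():
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, routing

# Four toy "experts"; only two of them run for this token.
experts = [
    lambda t: [2.0 * v for v in t],
    lambda t: [v + 1.0 for v in t],
    lambda t: [-v for v in t],
    lambda t: [0.0 for _ in t],
]
gate_logits = [0.1, 2.0, 2.0, -1.0]  # made up; normally produced by a learned router
moe_out, routing = moe_layer([1.0, 3.0], experts, gate_logits, k=2)
```

Experts 0 and 3 are never called for this token, which is where the compute saving comes from.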

Type	Strength	Trade-off
Dense	Simple, predictable, easier to deploy and optimize	Compute cost rises directly with model size
MoE	Very large parameter capacity with lower active compute	Routing complexity, operational complexity, and harder serving
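The gap between total and active parameters is simple arithmetic. The sketch below uses a deliberately simplified model, with shared parameters plus identically sized experts, and purely hypothetical numbers that do not describe any real model.

```python
def moe_parameter_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active-per-token) parameter counts for a simplified MoE model."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active

# Hypothetical configuration: 8 experts, 2 used per token.
total, active = moe_parameter_counts(
    shared_params=2_000_000_000,
    params_per_expert=1_000_000_000,
    num_experts=8,
    experts_per_token=2,
)
```

With these assumed numbers the model stores 10 billion parameters but uses only 4 billion for any single token. All 10 billion still have to be held in memory at serving time, which is part of the operational cost.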

Practical trade-off: MoE is attractive when you want very large capacity and can tolerate more complex infrastructure for routing and serving. Dense models are usually easier to deploy, easier to reason about, and often the better default for smaller teams or strict latency targets.

Which Transformer Should You Use?

Choose the architecture based on the task instead of starting from model size alone.

Task	Best transformer family	Why
Chat / generation	Decoder-only	Autoregressive generation is the natural fit
Search / embeddings / classification	Encoder-only	Bidirectional context is great for understanding
Summarization / translation	Encoder-decoder	Best when input and output both matter
Very large capacity at lower active compute	MoE transformer	Useful when scale matters more than simplicity
Small team, stable deployment, low complexity	Dense transformer	Operationally simpler and easier to serve

Rule of thumb: if you are building an assistant or generator, start with a decoder-only model. If you are building retrieval, classification, or embeddings, start with an encoder-only model. If you are translating or compressing one sequence into another, use encoder-decoder.
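The rule of thumb can be written down as a small lookup. The task labels and the `recommend_family` helper are hypothetical names chosen for this sketch; the mapping only restates the guidance above and is a starting point, not a substitute for evaluating models on your own task.

```python
# Starting-point architecture per task, following the rule of thumb above.
RECOMMENDED_FAMILY = {
    "chat": "decoder-only",
    "generation": "decoder-only",
    "search": "encoder-only",
    "embeddings": "encoder-only",
    "classification": "encoder-only",
    "summarization": "encoder-decoder",
    "translation": "encoder-decoder",
}

def recommend_family(task):
    """Return the suggested family for a task, or None if the table does not cover it."""
    return RECOMMENDED_FAMILY.get(task.strip().lower())
```

For example, `recommend_family("Chat")` returns `"decoder-only"`, and a task the table does not list returns `None`.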
