Back to handbooks index

Advanced Prompt Engineering Handbook

A production-focused guide to prompt design, reasoning strategies, retrieval prompts, and structured outputs for teams shipping LLM features in real systems.

Chat APIs RAG + Structured Output OpenAI + Anthropic April 2026
i
Prompt engineering is interface design, not spell casting. Strong prompts do not rely on magic words. They work because they set role, scope, evidence, format, and decision criteria so the model can navigate its latent space toward the answer shape you actually need.

Table of Contents

This handbook is organized around the core workflow prompt engineers use in production: define the message contract, pick the right prompting technique, improve reasoning reliability, inject external context safely, and force outputs into machine-readable structures.

Module 1: Anatomy of a Prompt
Message roles, decoding controls, and a baseline prompt contract that turns vague requests into stable model behavior.
Module 2: Core Prompting Techniques
Zero-shot, few-shot, and delimiter design for getting the model to parse your intent exactly once and correctly.
Module 3: Advanced Reasoning
Chain of Thought, self-consistency, and decomposition strategies for problems that require multi-step reasoning.
Module 4: Context Injection & RAG
Prompt layouts that keep retrieval grounded, reduce hallucination, and defend against weak or missing evidence.
Module 5: Enforcing JSON
Contracts, schemas, and API-level structure enforcement so downstream software can trust the response.

Operating Principles

The cleanest mental model is to treat prompt engineering like briefing a senior contractor. If you hand over a vague sentence, you get improvisation. If you provide scope, examples, evidence, constraints, and a required deliverable, you get a much better first draft.

Role -> Context -> Task -> Constraints -> Output Contract

Module 1: System vs. User vs. Assistant

In the Chat Completions model, message roles are not cosmetic labels. They are part of the instruction hierarchy. A simple analogy is a film set: the system message is the director's brief, the user message is today's scene request, and prior assistant messages are continuity notes from earlier takes.

RolePurposeWhen to Use It
SystemSets durable behavior, persona, guardrails, and priorities.Use for policies, tone, output rules, and non-negotiable constraints.
UserProvides the live task, inputs, and clarifying requirements.Use for the actual work request and runtime context.
AssistantRepresents prior model responses or inserted exemplars.Use in chat history or few-shot demonstrations when continuity matters.
System message template

You are {{assistant_persona}}.
Your job is to {{primary_responsibility}}.
Always prioritize:
1. {{priority_one}}
2. {{priority_two}}
3. {{priority_three}}

Never do the following:
- {{forbidden_action_one}}
- {{forbidden_action_two}}

When information is missing, ask for {{missing_information_policy}}.

User message template

Context:
{{context}}

Task:
{{task}}

Constraints:
{{constraints}}

Desired output:
{{output_format}}
from typing import Any

from openai import OpenAI


client = OpenAI()


def run_message_stack(system_prompt: str, user_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.2,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content or ""

Hyperparameters

Hyperparameters control sampling behavior, not truthfulness. Think of them like steering sensitivity in a vehicle: lower values keep the model tightly on rails, higher values allow more variation. For code generation and extraction, stability usually wins. For ideation and copywriting, controlled variation helps.

Temperature
Lower for deterministic tasks like coding, classification, extraction, or policy answers. Higher for brainstorming and creative writing.
Top-P
Nucleus sampling cap. Usually leave it alone unless you are explicitly tuning generation diversity.
Frequency / Presence
Useful when repetition is the problem, especially in creative tasks or long generations. Usually keep near default for code or structured outputs.
Task tuning prompt header

Task type: {{task_type}}
Risk tolerance: {{risk_tolerance}}
Need for novelty: {{novelty_level}}
Need for exact wording: {{exactness_level}}
from openai import OpenAI


client = OpenAI()


def complete_task(task_prompt: str, creative: bool) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.8 if creative else 0.1,
        top_p=1.0,
        frequency_penalty=0.3 if creative else 0.0,
        presence_penalty=0.2 if creative else 0.0,
        messages=[{"role": "user", "content": task_prompt}],
    )
    return response.choices[0].message.content or ""

The "Perfect" Baseline Prompt

A strong baseline prompt is a reusable contract. It does not chase hidden incantations. It simply answers the five questions every model needs: who it is, what it knows, what it must do, what it must avoid, and what shape the output must take.

Baseline Template Persona + Context + Task + Constraints + Output

Use this when a prompt will be reused in code, tested in evals, or handed between engineers.

# Persona
You are {{persona}}.

# Context
You are helping with {{business_context}}.
Relevant background:
{{background_context}}

# Task
Your task is to {{task_to_complete}}.

# Constraints
- Optimize for {{primary_goal}}.
- Do not assume facts outside: {{allowed_sources}}.
- If information is missing, {{fallback_behavior}}.
- Keep the response within {{length_limit}}.

# Output Format
Return your answer as:
1. Summary
2. Key reasoning
3. Recommended next action

# Input
{{runtime_input}}
from typing import Mapping

from openai import OpenAI


client = OpenAI()


BASELINE_TEMPLATE = """# Persona
You are {{persona}}.

# Context
You are helping with {{business_context}}.
Relevant background:
{{background_context}}

# Task
Your task is to {{task_to_complete}}.

# Constraints
- Optimize for {{primary_goal}}.
- Do not assume facts outside: {{allowed_sources}}.
- If information is missing, {{fallback_behavior}}.
- Keep the response within {{length_limit}}.

# Output Format
Return your answer as:
1. Summary
2. Key reasoning
3. Recommended next action

# Input
{{runtime_input}}"""


def render_template(template: str, values: Mapping[str, str]) -> str:
    rendered = template
    for key, value in values.items():
        rendered = rendered.replace(f"{{{{{key}}}}}", value)
    return rendered


def run_baseline_prompt(values: Mapping[str, str]) -> str:
    prompt = render_template(BASELINE_TEMPLATE, values)
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.2,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

Module 2: Zero-Shot vs. Few-Shot Prompting

Zero-shot prompting asks the model to infer the pattern from instructions alone. Few-shot prompting shows the pattern explicitly. The analogy is onboarding a new analyst: zero-shot is handing over the assignment brief; few-shot is handing over the brief plus three finished examples.

Few-shot works best when the format is the point. If you care about tone, label style, reasoning style, or classification boundaries, examples are usually more effective than more adjectives.
Few-shot template

Task: {{task_description}}

Follow the pattern shown in the examples.

Example 1
Input: {{example_input_1}}
Output: {{example_output_1}}

Example 2
Input: {{example_input_2}}
Output: {{example_output_2}}

Example 3
Input: {{example_input_3}}
Output: {{example_output_3}}

Now complete the real task.
Input: {{actual_input}}
Output:
from anthropic import Anthropic


client = Anthropic()


def run_few_shot(prompt: str) -> str:
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=700,
        temperature=0.1,
        messages=[
            {"role": "user", "content": prompt},
        ],
    )
    return "".join(block.text for block in response.content if block.type == "text")

Formatting & Delimiters

Delimiters reduce hallucination because they reduce ambiguity. They act like labeled folders in a filing cabinet. Instead of forcing the model to guess where the instructions end and the raw input begins, you separate them with explicit structure.

XML-Tagged Prompt Best for complex prompts

XML-style sections are especially useful when a prompt contains instructions, data, examples, and output rules in the same request.

<instructions>
You are {{persona}}.
Complete the task exactly as specified.
Do not use knowledge outside <input> unless explicitly allowed.
</instructions>

<task>
{{task}}
</task>

<requirements>
- Audience: {{audience}}
- Tone: {{tone}}
- Maximum length: {{max_length}}
- Must include: {{must_include}}
</requirements>

<input>
{{input_text}}
</input>

<output_format>
Return Markdown with these headings:
## Summary
## Evidence
## Risks
</output_format>
from openai import OpenAI


client = OpenAI()


def run_xml_prompt(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.2,
        messages=[
            {"role": "system", "content": "Parse tagged sections carefully and follow the output format exactly."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content or ""

Module 3: Chain of Thought (CoT)

The famous “let’s think step by step” effect works because it nudges the model to allocate tokens to intermediate reasoning instead of jumping directly to the answer. The analogy is asking someone to show their work on a whiteboard before committing to the final number.

!
Production caution: do not depend on hidden proprietary reasoning. If you need inspectable steps, request a visible reasoning or planning block explicitly and evaluate that format.
You are {{persona}}.
Solve the problem carefully.

Problem:
{{problem}}

Rules:
- First write a <thinking> block that breaks the problem into steps.
- Then write a <final_answer> block with only the answer the user should see.
- If the evidence is insufficient, say so in <final_answer>.

Required output:
<thinking>
...
</thinking>
<final_answer>
...
</final_answer>
from openai import OpenAI


client = OpenAI()


def run_cot(problem: str) -> str:
    prompt = f"""You are a careful analyst.
Solve the problem carefully.

Problem:
{problem}

Rules:
- First write a <thinking> block.
- Then write a <final_answer> block.
"""
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        temperature=0.2,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

Self-Consistency

Self-consistency improves reliability by sampling multiple reasoning paths and selecting the most common final answer. Think of it like asking several analysts to solve the same logic puzzle independently, then trusting the convergent answer more than any single draft.

Self-consistency base prompt

Solve the following problem.
Reason step by step.
End your response with a single line in this exact format:
FINAL_ANSWER: {{answer_format}}

Problem:
{{problem}}
from collections import Counter
from typing import Iterable

from openai import OpenAI


client = OpenAI()


def sample_answers(prompt: str, runs: int = 5) -> str:
    answers: list[str] = []

    for _ in range(runs):
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            temperature=0.7,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.choices[0].message.content or ""
        final_line = next((line for line in text.splitlines() if line.startswith("FINAL_ANSWER:")), "FINAL_ANSWER: UNKNOWN")
        answers.append(final_line.replace("FINAL_ANSWER:", "").strip())

    return Counter(answers).most_common(1)[0][0]

Tree of Thoughts (ToT) / Step-back Prompting

Tree of Thoughts asks the model to explore multiple candidate paths before committing. Step-back prompting is simpler: first ask the model for the governing principles, then solve the concrete case. This is useful when the task is too entangled for one-pass generation.

Step-back template

You are {{persona}}.

Step 1: Abstract the problem.
What general principles, frameworks, or decision rules apply to this type of problem?

Step 2: Apply those principles.
Using the principles above, solve the specific case below.

Specific case:
{{specific_problem}}

Return:
## General Principles
## Candidate Approaches
## Recommended Approach
## Final Answer
from anthropic import Anthropic


client = Anthropic()


def run_step_back(prompt: str) -> str:
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",
        max_tokens=900,
        temperature=0.3,
        messages=[{"role": "user", "content": prompt}],
    )
    return "".join(block.text for block in response.content if block.type == "text")
  • Use Tree of Thoughts when you need deliberate branching, candidate evaluation, and backtracking across possible solutions.
  • Use step-back prompting when the model is solving the instance too literally and needs a principles-first frame.
  • Use ordinary CoT when one linear reasoning path is enough and you do not need branch comparison.
  • Module 4: The "Lost in the Middle" Phenomenon

    LLMs often attend less effectively to information buried in the middle of a long prompt. The analogy is a long contract review: readers remember the opening instructions and the closing clause, but the critical sentence buried on page 47 is easier to miss.

    Context Ordering Strategy Front-load and restate key evidence

    Put the question up front, place the highest-value retrieved evidence early, and restate the decisive excerpts near the end if the prompt is long.

    Long-context layout template
    
    Question:
    {{user_question}}
    
    Critical evidence summary:
    {{high_signal_summary}}
    
    Retrieved context blocks:
    {{documents}}
    
    Reminder of decisive facts:
    {{decisive_facts}}
    
    Instructions:
    - Answer only from the provided context.
    - If the answer is not supported, say "Insufficient context".
    - Cite the relevant document ids.
    from typing import Sequence
    
      from openai import OpenAI
    
    
      client = OpenAI()
    
    
      def build_long_context_prompt(question: str, high_signal_summary: str, documents: Sequence[str], decisive_facts: str) -> str:
        ordered_documents = "\n\n".join(documents)
        return f"""Question:
      {question}
    
      Critical evidence summary:
      {high_signal_summary}
    
      Retrieved context blocks:
      {ordered_documents}
    
      Reminder of decisive facts:
      {decisive_facts}
    
      Instructions:
      - Answer only from the provided context.
      - If the answer is not supported, say \"Insufficient context\".
      - Cite the relevant document ids."""
    
    
      def run_long_context_prompt(question: str, high_signal_summary: str, documents: Sequence[str], decisive_facts: str) -> str:
        prompt = build_long_context_prompt(question, high_signal_summary, documents, decisive_facts)
        response = client.chat.completions.create(
          model="gpt-4.1-mini",
          temperature=0.1,
          messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content or ""

    Structuring a RAG Prompt

    A RAG prompt should act like a disciplined research assistant: use only the retrieved evidence, cite it, and refuse to over-claim when the evidence is missing. The retrieval system fetches the books; the prompt tells the model how to read them.

    Do not say “use the context” and stop there. You need a fallback contract for missing evidence, a citation rule, and a clear instruction to separate supported facts from assumptions.
    <system>
    You are {{assistant_role}}.
    Answer questions using only the provided documents.
    If the documents do not contain the answer, explicitly say: "I do not have enough context to answer that reliably."
    Do not invent citations.
    </system>
    
    <user_question>
    {{question}}
    </user_question>
    
    <documents>
    {{documents}}
    </documents>
    
    <instructions>
    1. Read the question first.
    2. Identify the smallest set of relevant documents.
    3. Answer only with supported claims.
    4. Cite sources inline as [doc_id].
    5. If evidence is partial, say what is known and what is missing.
    </instructions>
    
    <output_format>
    ## Answer
    ## Evidence
    ## Gaps
    </output_format>
    from typing import Iterable
    
    from openai import OpenAI
    
    
    client = OpenAI()
    
    
    def format_documents(documents: Iterable[tuple[str, str]]) -> str:
        return "\n\n".join(
            f"[doc_id={doc_id}]\n{content}" for doc_id, content in documents
        )
    
    
    def run_rag(question: str, documents: Iterable[tuple[str, str]]) -> str:
        prompt = f"""<system>
    You are a grounded QA assistant.
    Answer questions using only the provided documents.
    If the documents do not contain the answer, explicitly say: \"I do not have enough context to answer that reliably.\"
    Do not invent citations.
    </system>
    
    <user_question>
    {question}
    </user_question>
    
    <documents>
    {format_documents(documents)}
    </documents>
    
    <instructions>
    1. Use only supported claims.
    2. Cite sources inline as [doc_id].
    3. Call out missing evidence clearly.
    </instructions>"""
    
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            temperature=0.1,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content or ""

    Module 5: Enforcing Output Structures (JSON)

    Structured output is where prompt engineering becomes software engineering. If a response feeds a UI, workflow engine, or validator, prose is a liability. JSON is the equivalent of giving the model a tax form instead of a blank page.

    Prompt-Level Contract
    Tell the model exactly which fields are required, which are optional, and what types they should contain.
    API-Level Enforcement
    Use response-format or schema features when the provider supports them. Prompting alone is weaker than transport-level guarantees.
    Validation
    Always validate parsed JSON with application code before trusting it downstream.
    Strict JSON prompt template
    
    You are {{persona}}.
    Return a valid JSON object only.
    Do not wrap the JSON in markdown fences.
    Do not include commentary before or after the JSON.
    
    Schema:
    {
      "answer": "string",
      "confidence": "number from 0 to 1",
      "citations": ["string"],
      "needs_human_review": "boolean"
    }
    
    Task:
    {{task}}
    
    Input:
    {{input}}
    import json
      from typing import Any
    
      from openai import OpenAI
    
    
      client = OpenAI()
    
    
      def run_json_prompt(task: str, input_text: str) -> dict[str, Any]:
        response = client.chat.completions.create(
            model="gpt-4.1-mini",
            response_format={"type": "json_object"},
            messages=[
                {
                    "role": "user",
                    "content": f"""You are a structured extraction assistant.
    Return a valid JSON object only.
    
    Schema:
    {{
      \"answer\": \"string\",
      \"confidence\": \"number from 0 to 1\",
      \"citations\": [\"string\"],
      \"needs_human_review\": \"boolean\"
    }}
    
    Task:
    {task}
    
    Input:
    {input_text}""",
                }
            ],
        )
      content = response.choices[0].message.content or "{}"
      return json.loads(content)
    Production pattern: combine prompt instructions, API-level JSON enforcement, and server-side schema validation. Any single layer can fail; all three together are much more robust.

    Reference Links