Advanced Prompt Engineering Handbook
A production-focused guide to prompt design, reasoning strategies, retrieval prompts, and structured outputs for teams shipping LLM features in real systems.
Table of Contents
This handbook is organized around the core workflow prompt engineers use in production: define the message contract, pick the right prompting technique, improve reasoning reliability, inject external context safely, and force outputs into machine-readable structures.
Operating Principles
The cleanest mental model is to treat prompt engineering like briefing a senior contractor. If you hand over a vague sentence, you get improvisation. If you provide scope, examples, evidence, constraints, and a required deliverable, you get a much better first draft.
- Be explicit about authority. The model should know which instructions outrank others.
- Separate data from instructions. Ambiguity grows when raw user input and policy text are blended together.
- Design for parsability. Your prompt should be easy for both humans and models to scan under token pressure.
- Prefer contracts over vibes. Ask for specific fields, steps, or decisions rather than “something good.”
Module 1: System vs. User vs. Assistant
In the Chat Completions model, message roles are not cosmetic labels. They are part of the instruction hierarchy. A simple analogy is a film set: the system message is the director's brief, the user message is today's scene request, and prior assistant messages are continuity notes from earlier takes.
| Role | Purpose | When to Use It |
|---|---|---|
| System | Sets durable behavior, persona, guardrails, and priorities. | Use for policies, tone, output rules, and non-negotiable constraints. |
| User | Provides the live task, inputs, and clarifying requirements. | Use for the actual work request and runtime context. |
| Assistant | Represents prior model responses or inserted exemplars. | Use in chat history or few-shot demonstrations when continuity matters. |
System message template
You are {{assistant_persona}}.
Your job is to {{primary_responsibility}}.
Always prioritize:
1. {{priority_one}}
2. {{priority_two}}
3. {{priority_three}}
Never do the following:
- {{forbidden_action_one}}
- {{forbidden_action_two}}
When information is missing, ask for {{missing_information_policy}}.
User message template
Context:
{{context}}
Task:
{{task}}
Constraints:
{{constraints}}
Desired output:
{{output_format}}
from typing import Any
from openai import OpenAI
client = OpenAI()
def run_message_stack(system_prompt: str, user_prompt: str) -> str:
response = client.chat.completions.create(
model="gpt-4.1-mini",
temperature=0.2,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
)
return response.choices[0].message.content or ""
Hyperparameters
Hyperparameters control sampling behavior, not truthfulness. Think of them like steering sensitivity in a vehicle: lower values keep the model tightly on rails, higher values allow more variation. For code generation and extraction, stability usually wins. For ideation and copywriting, controlled variation helps.
Task tuning prompt header
Task type: {{task_type}}
Risk tolerance: {{risk_tolerance}}
Need for novelty: {{novelty_level}}
Need for exact wording: {{exactness_level}}
from openai import OpenAI
client = OpenAI()
def complete_task(task_prompt: str, creative: bool) -> str:
response = client.chat.completions.create(
model="gpt-4.1-mini",
temperature=0.8 if creative else 0.1,
top_p=1.0,
frequency_penalty=0.3 if creative else 0.0,
presence_penalty=0.2 if creative else 0.0,
messages=[{"role": "user", "content": task_prompt}],
)
return response.choices[0].message.content or ""
The "Perfect" Baseline Prompt
A strong baseline prompt is a reusable contract. It does not chase hidden incantations. It simply answers the five questions every model needs: who it is, what it knows, what it must do, what it must avoid, and what shape the output must take.
Use this when a prompt will be reused in code, tested in evals, or handed between engineers.
# Persona
You are {{persona}}.
# Context
You are helping with {{business_context}}.
Relevant background:
{{background_context}}
# Task
Your task is to {{task_to_complete}}.
# Constraints
- Optimize for {{primary_goal}}.
- Do not assume facts outside: {{allowed_sources}}.
- If information is missing, {{fallback_behavior}}.
- Keep the response within {{length_limit}}.
# Output Format
Return your answer as:
1. Summary
2. Key reasoning
3. Recommended next action
# Input
{{runtime_input}}
from typing import Mapping
from openai import OpenAI
client = OpenAI()
BASELINE_TEMPLATE = """# Persona
You are {{persona}}.
# Context
You are helping with {{business_context}}.
Relevant background:
{{background_context}}
# Task
Your task is to {{task_to_complete}}.
# Constraints
- Optimize for {{primary_goal}}.
- Do not assume facts outside: {{allowed_sources}}.
- If information is missing, {{fallback_behavior}}.
- Keep the response within {{length_limit}}.
# Output Format
Return your answer as:
1. Summary
2. Key reasoning
3. Recommended next action
# Input
{{runtime_input}}"""
def render_template(template: str, values: Mapping[str, str]) -> str:
rendered = template
for key, value in values.items():
rendered = rendered.replace(f"{{{{{key}}}}}", value)
return rendered
def run_baseline_prompt(values: Mapping[str, str]) -> str:
prompt = render_template(BASELINE_TEMPLATE, values)
response = client.chat.completions.create(
model="gpt-4.1-mini",
temperature=0.2,
messages=[{"role": "user", "content": prompt}],
)
return response.choices[0].message.content or ""
Module 2: Zero-Shot vs. Few-Shot Prompting
Zero-shot prompting asks the model to infer the pattern from instructions alone. Few-shot prompting shows the pattern explicitly. The analogy is onboarding a new analyst: zero-shot is handing over the assignment brief; few-shot is handing over the brief plus three finished examples.
Few-shot template
Task: {{task_description}}
Follow the pattern shown in the examples.
Example 1
Input: {{example_input_1}}
Output: {{example_output_1}}
Example 2
Input: {{example_input_2}}
Output: {{example_output_2}}
Example 3
Input: {{example_input_3}}
Output: {{example_output_3}}
Now complete the real task.
Input: {{actual_input}}
Output:
from anthropic import Anthropic
client = Anthropic()
def run_few_shot(prompt: str) -> str:
response = client.messages.create(
model="claude-3-7-sonnet-latest",
max_tokens=700,
temperature=0.1,
messages=[
{"role": "user", "content": prompt},
],
)
return "".join(block.text for block in response.content if block.type == "text")
Formatting & Delimiters
Delimiters reduce hallucination because they reduce ambiguity. They act like labeled folders in a filing cabinet. Instead of forcing the model to guess where the instructions end and the raw input begins, you separate them with explicit structure.
XML-style sections are especially useful when a prompt contains instructions, data, examples, and output rules in the same request.
<instructions>
You are {{persona}}.
Complete the task exactly as specified.
Do not use knowledge outside <input> unless explicitly allowed.
</instructions>
<task>
{{task}}
</task>
<requirements>
- Audience: {{audience}}
- Tone: {{tone}}
- Maximum length: {{max_length}}
- Must include: {{must_include}}
</requirements>
<input>
{{input_text}}
</input>
<output_format>
Return Markdown with these headings:
## Summary
## Evidence
## Risks
</output_format>
from openai import OpenAI
client = OpenAI()
def run_xml_prompt(prompt: str) -> str:
response = client.chat.completions.create(
model="gpt-4.1-mini",
temperature=0.2,
messages=[
{"role": "system", "content": "Parse tagged sections carefully and follow the output format exactly."},
{"role": "user", "content": prompt},
],
)
return response.choices[0].message.content or ""
Module 3: Chain of Thought (CoT)
The famous “let’s think step by step” effect works because it nudges the model to allocate tokens to intermediate reasoning instead of jumping directly to the answer. The analogy is asking someone to show their work on a whiteboard before committing to the final number.
You are {{persona}}.
Solve the problem carefully.
Problem:
{{problem}}
Rules:
- First write a <thinking> block that breaks the problem into steps.
- Then write a <final_answer> block with only the answer the user should see.
- If the evidence is insufficient, say so in <final_answer>.
Required output:
<thinking>
...
</thinking>
<final_answer>
...
</final_answer>
from openai import OpenAI
client = OpenAI()
def run_cot(problem: str) -> str:
prompt = f"""You are a careful analyst.
Solve the problem carefully.
Problem:
{problem}
Rules:
- First write a <thinking> block.
- Then write a <final_answer> block.
"""
response = client.chat.completions.create(
model="gpt-4.1-mini",
temperature=0.2,
messages=[{"role": "user", "content": prompt}],
)
return response.choices[0].message.content or ""
Self-Consistency
Self-consistency improves reliability by sampling multiple reasoning paths and selecting the most common final answer. Think of it like asking several analysts to solve the same logic puzzle independently, then trusting the convergent answer more than any single draft.
Self-consistency base prompt
Solve the following problem.
Reason step by step.
End your response with a single line in this exact format:
FINAL_ANSWER: {{answer_format}}
Problem:
{{problem}}
from collections import Counter
from typing import Iterable
from openai import OpenAI
client = OpenAI()
def sample_answers(prompt: str, runs: int = 5) -> str:
answers: list[str] = []
for _ in range(runs):
response = client.chat.completions.create(
model="gpt-4.1-mini",
temperature=0.7,
messages=[{"role": "user", "content": prompt}],
)
text = response.choices[0].message.content or ""
final_line = next((line for line in text.splitlines() if line.startswith("FINAL_ANSWER:")), "FINAL_ANSWER: UNKNOWN")
answers.append(final_line.replace("FINAL_ANSWER:", "").strip())
return Counter(answers).most_common(1)[0][0]
- Best use case: arithmetic, logic, symbolic reasoning, and rubric-based analysis.
- Tradeoff: higher cost and latency because you are deliberately sampling more than once.
- Bad fit: deterministic extraction or strict JSON output, where multiple samples usually add noise.
Tree of Thoughts (ToT) / Step-back Prompting
Tree of Thoughts asks the model to explore multiple candidate paths before committing. Step-back prompting is simpler: first ask the model for the governing principles, then solve the concrete case. This is useful when the task is too entangled for one-pass generation.
Step-back template
You are {{persona}}.
Step 1: Abstract the problem.
What general principles, frameworks, or decision rules apply to this type of problem?
Step 2: Apply those principles.
Using the principles above, solve the specific case below.
Specific case:
{{specific_problem}}
Return:
## General Principles
## Candidate Approaches
## Recommended Approach
## Final Answer
from anthropic import Anthropic
client = Anthropic()
def run_step_back(prompt: str) -> str:
response = client.messages.create(
model="claude-3-7-sonnet-latest",
max_tokens=900,
temperature=0.3,
messages=[{"role": "user", "content": prompt}],
)
return "".join(block.text for block in response.content if block.type == "text")
Module 4: The "Lost in the Middle" Phenomenon
LLMs often attend less effectively to information buried in the middle of a long prompt. The analogy is a long contract review: readers remember the opening instructions and the closing clause, but the critical sentence buried on page 47 is easier to miss.
Put the question up front, place the highest-value retrieved evidence early, and restate the decisive excerpts near the end if the prompt is long.
Long-context layout template
Question:
{{user_question}}
Critical evidence summary:
{{high_signal_summary}}
Retrieved context blocks:
{{documents}}
Reminder of decisive facts:
{{decisive_facts}}
Instructions:
- Answer only from the provided context.
- If the answer is not supported, say "Insufficient context".
- Cite the relevant document ids.
from typing import Sequence
from openai import OpenAI
client = OpenAI()
def build_long_context_prompt(question: str, high_signal_summary: str, documents: Sequence[str], decisive_facts: str) -> str:
ordered_documents = "\n\n".join(documents)
return f"""Question:
{question}
Critical evidence summary:
{high_signal_summary}
Retrieved context blocks:
{ordered_documents}
Reminder of decisive facts:
{decisive_facts}
Instructions:
- Answer only from the provided context.
- If the answer is not supported, say \"Insufficient context\".
- Cite the relevant document ids."""
def run_long_context_prompt(question: str, high_signal_summary: str, documents: Sequence[str], decisive_facts: str) -> str:
prompt = build_long_context_prompt(question, high_signal_summary, documents, decisive_facts)
response = client.chat.completions.create(
model="gpt-4.1-mini",
temperature=0.1,
messages=[{"role": "user", "content": prompt}],
)
return response.choices[0].message.content or ""
Structuring a RAG Prompt
A RAG prompt should act like a disciplined research assistant: use only the retrieved evidence, cite it, and refuse to over-claim when the evidence is missing. The retrieval system fetches the books; the prompt tells the model how to read them.
<system>
You are {{assistant_role}}.
Answer questions using only the provided documents.
If the documents do not contain the answer, explicitly say: "I do not have enough context to answer that reliably."
Do not invent citations.
</system>
<user_question>
{{question}}
</user_question>
<documents>
{{documents}}
</documents>
<instructions>
1. Read the question first.
2. Identify the smallest set of relevant documents.
3. Answer only with supported claims.
4. Cite sources inline as [doc_id].
5. If evidence is partial, say what is known and what is missing.
</instructions>
<output_format>
## Answer
## Evidence
## Gaps
</output_format>
from typing import Iterable
from openai import OpenAI
client = OpenAI()
def format_documents(documents: Iterable[tuple[str, str]]) -> str:
return "\n\n".join(
f"[doc_id={doc_id}]\n{content}" for doc_id, content in documents
)
def run_rag(question: str, documents: Iterable[tuple[str, str]]) -> str:
prompt = f"""<system>
You are a grounded QA assistant.
Answer questions using only the provided documents.
If the documents do not contain the answer, explicitly say: \"I do not have enough context to answer that reliably.\"
Do not invent citations.
</system>
<user_question>
{question}
</user_question>
<documents>
{format_documents(documents)}
</documents>
<instructions>
1. Use only supported claims.
2. Cite sources inline as [doc_id].
3. Call out missing evidence clearly.
</instructions>"""
response = client.chat.completions.create(
model="gpt-4.1-mini",
temperature=0.1,
messages=[{"role": "user", "content": prompt}],
)
return response.choices[0].message.content or ""
Module 5: Enforcing Output Structures (JSON)
Structured output is where prompt engineering becomes software engineering. If a response feeds a UI, workflow engine, or validator, prose is a liability. JSON is the equivalent of giving the model a tax form instead of a blank page.
Strict JSON prompt template
You are {{persona}}.
Return a valid JSON object only.
Do not wrap the JSON in markdown fences.
Do not include commentary before or after the JSON.
Schema:
{
"answer": "string",
"confidence": "number from 0 to 1",
"citations": ["string"],
"needs_human_review": "boolean"
}
Task:
{{task}}
Input:
{{input}}
import json
from typing import Any
from openai import OpenAI
client = OpenAI()
def run_json_prompt(task: str, input_text: str) -> dict[str, Any]:
response = client.chat.completions.create(
model="gpt-4.1-mini",
response_format={"type": "json_object"},
messages=[
{
"role": "user",
"content": f"""You are a structured extraction assistant.
Return a valid JSON object only.
Schema:
{{
\"answer\": \"string\",
\"confidence\": \"number from 0 to 1\",
\"citations\": [\"string\"],
\"needs_human_review\": \"boolean\"
}}
Task:
{task}
Input:
{input_text}""",
}
],
)
content = response.choices[0].message.content or "{}"
return json.loads(content)
Reference Links
- OpenAI OpenAI API documentation
- Anthropic Anthropic API documentation
- Prompting Prompt Engineering Guide
- RAG LangChain RAG concepts
- Structured Output Structured outputs guide