Prompt Engineering Is an Engineering Discipline
The term "prompt engineering" undersells what it actually is. In production, prompts are configuration that affects system behavior at runtime. They have the same properties as code: they need version control, regression testing, and systematic improvement based on measured outcomes. An LLM call with a carelessly written prompt will fail inconsistently and in ways that are hard to debug.
This article covers four patterns that have consistent, measurable impact on output quality. Each section explains what the pattern does mechanically, when it pays off, what it costs, and provides runnable Python code. All examples use the OpenAI Python SDK v1.x but the patterns are model-agnostic.
Pattern 1: Chain-of-Thought (CoT) Prompting
Chain-of-Thought prompting elicits step-by-step reasoning before the final answer. The original Google Brain paper (Wei et al., 2022) showed that asking the model to reason through a problem — rather than jump to an answer — significantly improves accuracy on multi-step tasks including math, logic, and code debugging.
Why CoT Works
Autoregressive language models generate tokens sequentially. Each token is conditioned on all previous tokens. When you force the model to generate intermediate reasoning steps, those steps become part of the context for the final answer token. The model is, in effect, doing computation in the token stream before committing to an answer. Without CoT, the model has to "compute" the answer in a single forward pass through its attention layers — which is essentially asking it to hold all the reasoning in its weights without scratchpad space.
Zero-Shot CoT vs. Few-Shot CoT
```python
from openai import OpenAI

client = OpenAI()

def call_llm(messages: list[dict], model: str = "gpt-4o-mini", **kwargs) -> str:
    response = client.chat.completions.create(
        model=model, messages=messages, **kwargs
    )
    return response.choices[0].message.content.strip()

# --- Zero-Shot CoT ---
# Simply adding "Let's think step by step." triggers reasoning in most capable models.
def zero_shot_cot(question: str) -> str:
    return call_llm([
        {
            "role": "system",
            "content": "You are a precise analytical assistant. Always reason step by step before giving your final answer.",
        },
        {
            "role": "user",
            "content": f"{question}\n\nLet's think step by step.",
        },
    ])

# --- Few-Shot CoT ---
# Demonstrated examples consistently outperform zero-shot on domain-specific tasks.
FEW_SHOT_COT_EXAMPLES = [
    {
        "question": "A server handles 1,200 requests per second at 40% CPU. If CPU scales linearly, how many requests per second can it handle before hitting 100% CPU?",
        "reasoning": (
            "Step 1: At 40% CPU, the server handles 1,200 RPS.\n"
            "Step 2: To find RPS at 100% CPU, I need to scale up proportionally.\n"
            "Step 3: Scale factor = 100% / 40% = 2.5\n"
            "Step 4: Maximum RPS = 1,200 × 2.5 = 3,000 RPS\n"
            "But CPU rarely scales perfectly linearly. Real-world capacity planning usually caps utilization at 80-90%."
        ),
        "answer": "3,000 RPS at theoretical 100% CPU. In practice, plan for ~2,400 RPS at 80% CPU ceiling.",
    },
]

def few_shot_cot(question: str) -> str:
    messages = [
        {
            "role": "system",
            "content": "Solve problems by reasoning step by step, then give a clear final answer.",
        }
    ]
    for ex in FEW_SHOT_COT_EXAMPLES:
        messages.append({"role": "user", "content": ex["question"]})
        messages.append({
            "role": "assistant",
            "content": f"Reasoning:\n{ex['reasoning']}\n\nAnswer: {ex['answer']}",
        })
    messages.append({"role": "user", "content": question})
    return call_llm(messages)

# Test both approaches
q = "If a database query takes 50ms and 30% of that is network latency, what is the maximum QPS if you have 8 parallel workers?"
print("Zero-shot CoT:")
print(zero_shot_cot(q))
print("\nFew-shot CoT:")
print(few_shot_cot(q))
```
When CoT Is Worth the Cost
CoT increases output token count, which increases latency and cost. On a 200ms LLM call, CoT might push you to 400-800ms depending on reasoning depth. This is worth it when:
- The task requires multi-step logic (math, code debugging, eligibility checks)
- Error rate without CoT is above 5% on your test set
- Incorrect answers have high downstream cost
Do not use CoT for tasks that are genuinely single-step (classification, extraction, simple reformatting). The added tokens produce no quality improvement and inflate costs.
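To make the cost trade-off concrete, here is a back-of-envelope sketch. The per-token prices and token counts below are illustrative assumptions, not real provider pricing:

```python
# Illustrative prices in USD per 1M tokens — check your provider's current rates.
PRICE_IN_PER_M = 0.15
PRICE_OUT_PER_M = 0.60

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one call in USD under the illustrative prices above."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1e6

# A terse answer might emit ~20 output tokens; a CoT answer ~300.
direct = call_cost(500, 20)
with_cot = call_cost(500, 300)
print(f"Direct: ${direct:.6f}, CoT: ${with_cot:.6f}, multiplier: {with_cot / direct:.1f}x")
```

Because output tokens are typically priced higher than input tokens, reasoning-heavy outputs dominate the bill even when the prompt is short.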
Pattern 2: Self-Consistency
Self-Consistency (Wang et al., 2022) addresses a core weakness of CoT: even with step-by-step reasoning, a single sample can reach the wrong answer via a plausible but incorrect reasoning path. Self-Consistency generates multiple independent reasoning paths for the same question and takes the majority vote answer.
Think of it as running the same problem through multiple independent solvers and trusting the answer that appears most often. It does not reduce the variance of any single path — it aggregates across paths to find the most likely correct answer.
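Stripped of the LLM calls, the aggregation at the heart of the pattern is just a majority vote over the extracted answers — a few lines of pure logic:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Most common answer plus the fraction of paths that agree (the confidence)."""
    counts = Counter(answers)
    best, n = counts.most_common(1)[0]
    return best, n / len(answers)

print(majority_vote(["3000 RPS", "3000 RPS", "2400 RPS", "3000 RPS", "2400 RPS"]))
# ('3000 RPS', 0.6)
```

The full implementation below wraps this vote around parallel sampled reasoning paths.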
```python
from openai import OpenAI
from collections import Counter
import asyncio

client = OpenAI()

async def generate_reasoning_path(question: str, path_id: int) -> dict:
    """Generate a single CoT reasoning path with a final answer."""
    response = await asyncio.to_thread(
        client.chat.completions.create,
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Solve the problem step by step. At the end, state your final answer "
                    "clearly on a new line starting with 'FINAL ANSWER:'"
                ),
            },
            {"role": "user", "content": question},
        ],
        temperature=0.7,  # Non-zero temperature is essential for diversity
    )
    content = response.choices[0].message.content.strip()
    # Extract the final answer
    answer = None
    for line in reversed(content.splitlines()):
        if line.startswith("FINAL ANSWER:"):
            answer = line.replace("FINAL ANSWER:", "").strip()
            break
    return {"path_id": path_id, "reasoning": content, "answer": answer}

async def self_consistency(
    question: str,
    num_paths: int = 7,
) -> dict:
    """
    Generate num_paths reasoning paths and return the majority-vote answer.
    Odd path counts (5, 7, 9) reduce the chance of ties; with more than two
    distinct answers, ties remain possible.
    """
    tasks = [generate_reasoning_path(question, i) for i in range(num_paths)]
    paths = await asyncio.gather(*tasks)
    answers = [p["answer"] for p in paths if p["answer"]]
    if not answers:
        return {"answer": None, "confidence": 0.0, "paths": paths}
    vote_counts = Counter(answers)
    best_answer, best_count = vote_counts.most_common(1)[0]
    confidence = best_count / len(answers)
    return {
        "answer": best_answer,
        "confidence": confidence,
        "vote_distribution": dict(vote_counts),
        "paths": paths,
    }

# Run self-consistency
question = (
    "A microservice handles 500 requests/sec. Each request consumes 10ms of CPU time. "
    "If the server has 4 vCPUs, is the system CPU-bound? "
    "What is the maximum sustainable RPS before CPU saturation?"
)
result = asyncio.run(self_consistency(question, num_paths=7))
print(f"Majority answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"Vote distribution: {result['vote_distribution']}")
```
Self-Consistency Cost Model
Self-consistency with N paths costs approximately N times a single CoT call. For latency-critical paths, run paths in parallel (as shown above) and the wall-clock time is roughly the time of the slowest single path, not N times the average. The sweet spot in practice is 5-9 paths: fewer paths have high variance; more than 9 paths yield diminishing returns on most tasks.
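A toy model of that trade-off, with illustrative per-path cost and latency figures:

```python
def sc_cost_and_latency(num_paths: int, cost_per_path: float,
                        path_latencies_ms: list[float]) -> tuple[float, float]:
    """Token cost scales linearly with N; parallel wall-clock time tracks the slowest path."""
    return num_paths * cost_per_path, max(path_latencies_ms)

cost, latency = sc_cost_and_latency(7, 0.002, [800, 950, 1100, 870, 990, 1020, 905])
print(f"~${cost:.3f} total, ~{latency:.0f} ms wall-clock (slowest path, not the sum)")
```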
Use self-consistency when answer accuracy is worth the token cost. It is well-suited for: clinical decision support, financial calculations, code generation where correctness can be verified, and any task where a single wrong answer has high cost.
Pattern 3: Constitutional AI (Critiquing and Revision)
Constitutional AI (Bai et al., 2022, from Anthropic) is a method for making LLM outputs safer and more aligned by having the model critique its own output against a set of principles, then revise. In production engineering, the pattern generalizes to any domain where you want an LLM to self-correct against a specification.
The three-step process is: (1) generate an initial response, (2) critique the response against a set of rules, (3) revise based on the critique. You can run multiple critique-revise rounds.
```python
from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class ConstitutionalPrinciple:
    name: str
    critique_prompt: str
    revision_prompt: str

# Define your constitution — these are domain-specific rules, not just safety
API_DOCUMENTATION_CONSTITUTION = [
    ConstitutionalPrinciple(
        name="completeness",
        critique_prompt=(
            "Review the API documentation above. Does it cover all required fields: "
            "endpoint URL, HTTP method, request parameters (with types and required/optional status), "
            "response schema, error codes, and at least one code example? "
            "List any missing elements."
        ),
        revision_prompt=(
            "Revise the documentation to include all the missing elements identified in the critique. "
            "Add concrete examples for any missing parts."
        ),
    ),
    ConstitutionalPrinciple(
        name="accuracy",
        critique_prompt=(
            "Review the documentation for technical inaccuracies. "
            "Are HTTP status codes used correctly? Are data types accurately described? "
            "Are parameter names consistent with REST conventions? "
            "Identify any inaccuracies."
        ),
        revision_prompt=(
            "Correct all technical inaccuracies identified in the critique. "
            "Do not change correct information."
        ),
    ),
    ConstitutionalPrinciple(
        name="clarity",
        critique_prompt=(
            "Review the documentation for clarity issues. "
            "Are there ambiguous parameter descriptions? Unexplained jargon? "
            "Unclear error handling instructions? List specific clarity issues."
        ),
        revision_prompt=(
            "Rewrite the flagged sections for clarity. Use concrete, specific language."
        ),
    ),
]

def call_llm(messages: list[dict], temperature: float = 0) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message.content.strip()

def constitutional_revision(
    task: str,
    initial_response: str,
    constitution: list[ConstitutionalPrinciple],
    max_rounds: int = 1,
) -> dict:
    """
    Apply constitutional critique-revision to an initial response.
    Returns the final revised response and the full critique history.
    """
    current_response = initial_response
    history = []
    for round_num in range(max_rounds):
        round_critiques = []
        for principle in constitution:
            # Critique step
            critique = call_llm([
                {"role": "system", "content": "You are a precise technical reviewer."},
                {"role": "user", "content": (
                    f"Original task:\n{task}\n\n"
                    f"Response to review:\n{current_response}\n\n"
                    f"{principle.critique_prompt}"
                )},
            ])
            round_critiques.append({"principle": principle.name, "critique": critique})
        # Aggregate critiques, pairing each one with its principle's revision
        # instruction, then revise once per round
        combined_critique = "\n\n".join(
            f"[{p.name.upper()}]\nCritique: {c['critique']}\nRevision instruction: {p.revision_prompt}"
            for p, c in zip(constitution, round_critiques)
        )
        revised = call_llm([
            {"role": "system", "content": "You are a precise technical writer."},
            {"role": "user", "content": (
                f"Original task:\n{task}\n\n"
                f"Current draft:\n{current_response}\n\n"
                f"Critique:\n{combined_critique}\n\n"
                f"Revise the draft to address all critique points. "
                f"Return only the revised documentation, no meta-commentary."
            )},
        ])
        history.append({
            "round": round_num + 1,
            "critiques": round_critiques,
            "revised": revised,
        })
        current_response = revised
    return {"final": current_response, "history": history}

# Example usage
task = "Write API documentation for a POST /api/v1/users endpoint that creates a new user account."
initial = call_llm([
    {"role": "system", "content": "Write API documentation."},
    {"role": "user", "content": task},
])
result = constitutional_revision(
    task=task,
    initial_response=initial,
    constitution=API_DOCUMENTATION_CONSTITUTION,
    max_rounds=2,
)
print("=== INITIAL RESPONSE ===")
print(initial)
print("\n=== FINAL RESPONSE (after constitutional revision) ===")
print(result["final"])
```
Constitutional AI Beyond Safety
The pattern is not limited to content safety. The "constitution" is just a set of evaluation criteria with correction instructions. Useful constitutions for engineering workflows include:
- Code review constitutions: Principles covering security, performance, error handling, testability
- Technical writing constitutions: Principles covering accuracy, completeness, clarity, example quality
- Data extraction constitutions: Principles covering format compliance, completeness, type accuracy
The main cost is additional LLM calls per principle per round. For a 3-principle constitution with 2 rounds, you make 1 (generate) + 6 (critique) + 2 (revise) = 9 LLM calls instead of 1. Use this pattern when output quality significantly affects downstream decisions, not for high-volume low-stakes generation tasks.
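The call count generalizes to 1 (generate) + P×R (critiques) + R (revisions) for P principles and R rounds. A quick sanity check against the numbers above:

```python
def constitutional_call_count(principles: int, rounds: int) -> int:
    """Total LLM calls: one generation, one critique per principle per round,
    and one revision per round."""
    return 1 + principles * rounds + rounds

print(constitutional_call_count(3, 2))  # 9 — matches the 3-principle, 2-round example
```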
Pattern 4: Prompt Chaining
Prompt chaining breaks a complex task into a sequence of simpler LLM calls where each call's output feeds the next. This is the most general-purpose pattern and underlies most production agent systems.
The key insight is that a single LLM call handling a complex task is harder to control, debug, and improve than a pipeline of focused calls. Each step in the chain can use a different model, different temperature, different context, and can be independently tested and optimized.
Linear Chain
```python
from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class ChainStep:
    name: str
    system_prompt: str
    user_prompt_template: str  # Use {variable} placeholders
    model: str = "gpt-4o-mini"
    temperature: float = 0
    output_key: str = "output"  # Key to store this step's output in context

def run_chain_step(step: ChainStep, context: dict) -> str:
    """Execute one chain step, injecting context variables into prompts."""
    user_prompt = step.user_prompt_template.format(**context)
    response = client.chat.completions.create(
        model=step.model,
        messages=[
            {"role": "system", "content": step.system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=step.temperature,
    )
    return response.choices[0].message.content.strip()

def run_chain(steps: list[ChainStep], initial_context: dict) -> dict:
    """Run a linear chain of prompts, accumulating context."""
    context = dict(initial_context)
    results = []
    for step in steps:
        print(f"Running step: {step.name}")
        output = run_chain_step(step, context)
        context[step.output_key] = output
        results.append({"step": step.name, "output": output})
    return {"context": context, "steps": results}

# Example: Code review pipeline
# Each step specializes in one aspect; simpler steps can use cheaper models.
CODE_REVIEW_CHAIN = [
    ChainStep(
        name="extract_intent",
        system_prompt="You are a senior engineer who understands code intent. Be concise.",
        user_prompt_template=(
            "Analyze this code and describe in 2-3 sentences what it is trying to do:\n\n```python\n{code}\n```"
        ),
        model="gpt-4o-mini",
        output_key="intent",
    ),
    ChainStep(
        name="security_review",
        system_prompt=(
            "You are a security engineer. Focus only on security vulnerabilities. "
            "Return a JSON array of issues: [{severity: high/medium/low, issue: str, line_hint: str}]. "
            "Return [] if no issues found."
        ),
        user_prompt_template=(
            "Review this Python code for security vulnerabilities:\n\n```python\n{code}\n```\n"
            "Context: {intent}"
        ),
        model="gpt-4o",
        output_key="security_issues",
    ),
    ChainStep(
        name="performance_review",
        system_prompt=(
            "You are a performance engineer. Focus only on performance issues. "
            "Return a JSON array: [{severity: high/medium/low, issue: str, suggestion: str}]. "
            "Return [] if no issues found."
        ),
        user_prompt_template=(
            "Review this Python code for performance issues:\n\n```python\n{code}\n```\n"
            "Context: {intent}"
        ),
        model="gpt-4o-mini",
        output_key="performance_issues",
    ),
    ChainStep(
        name="synthesize_review",
        system_prompt=(
            "You are a tech lead writing a final code review. "
            "Be direct and actionable. Prioritize issues by severity."
        ),
        user_prompt_template=(
            "Write a code review for the following code based on the analysis results.\n\n"
            "Code:\n```python\n{code}\n```\n\n"
            "Intent: {intent}\n\n"
            "Security issues: {security_issues}\n\n"
            "Performance issues: {performance_issues}\n\n"
            "Write a concise review with clear action items."
        ),
        model="gpt-4o",
        output_key="final_review",
    ),
]

sample_code = """
import sqlite3
import hashlib

def authenticate_user(username, password):
    conn = sqlite3.connect('users.db')
    cursor = conn.cursor()
    query = f"SELECT * FROM users WHERE username = '{username}'"
    cursor.execute(query)
    user = cursor.fetchone()
    if user:
        stored_hash = user[2]
        input_hash = hashlib.md5(password.encode()).hexdigest()
        return stored_hash == input_hash
    return False
"""

result = run_chain(CODE_REVIEW_CHAIN, {"code": sample_code})
print(result["context"]["final_review"])
```
Branching Chains
Not all pipelines are linear. Many production workflows require routing to different chains based on intermediate outputs:
```python
def classify_intent(user_message: str) -> str:
    """Route to the appropriate chain based on message classification."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the user message into exactly one category. "
                    "Return only the category name, nothing else.\n"
                    "Categories: bug_report, feature_request, how_to_question, billing_issue, other"
                ),
            },
            {"role": "user", "content": user_message},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# Each value is a list[ChainStep] for that intent, defined elsewhere
# (bug_report_chain, feature_request_chain, etc. are not shown here).
CHAIN_REGISTRY = {
    "bug_report": bug_report_chain,
    "feature_request": feature_request_chain,
    "how_to_question": how_to_chain,
    "billing_issue": billing_chain,
    "other": general_chain,
}

def route_and_run(user_message: str) -> dict:
    intent = classify_intent(user_message)
    chain = CHAIN_REGISTRY.get(intent, CHAIN_REGISTRY["other"])
    return run_chain(chain, {"user_message": user_message, "intent": intent})
```
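The dispatch itself is plain dictionary lookup with a fallback, independent of any LLM. With hypothetical stub values standing in for the real chains, it behaves like:

```python
# Stub strings stand in for the real list[ChainStep] values (hypothetical names).
CHAIN_STUBS = {
    "bug_report": "bug_report_chain",
    "feature_request": "feature_request_chain",
    "how_to_question": "how_to_chain",
    "billing_issue": "billing_chain",
    "other": "general_chain",
}

def pick_chain(intent: str) -> str:
    """Unknown intents (classifier drift, new categories) fall back to 'other'."""
    return CHAIN_STUBS.get(intent, CHAIN_STUBS["other"])

print(pick_chain("bug_report"))  # bug_report_chain
print(pick_chain("nonsense"))    # general_chain
```

The fallback matters in production: classifiers occasionally emit labels outside the requested set, and a `KeyError` at the router is a worse failure mode than a generic chain.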
Pattern Composition: When to Combine Patterns
These patterns compose. The most robust production pipelines combine them:
- CoT + Self-Consistency: Generate N CoT paths and take majority vote. The standard approach for high-accuracy reasoning tasks.
- Prompt Chaining + Constitutional AI: Generate output in a chain, then apply critique-revision on the final step. Keeps generation fast and applies quality checking only at the output boundary.
- Prompt Chaining + CoT: Use CoT for the reasoning-heavy steps in a chain; use standard prompts for extraction and formatting steps. Minimizes CoT cost while getting its benefits where it matters.
Prompt Version Control and Testing
Treating prompts as code means versioning them and testing changes. A minimal prompt management setup:
```python
import yaml
from pathlib import Path
from dataclasses import dataclass
from typing import Optional

@dataclass
class PromptVersion:
    name: str
    version: str
    system: str
    user_template: str
    model: str
    temperature: float
    notes: str = ""

def load_prompt(name: str, version: Optional[str] = None) -> PromptVersion:
    """Load a prompt from the prompts/ directory."""
    prompt_dir = Path("prompts") / name
    if version:
        path = prompt_dir / f"{version}.yaml"
    else:
        # Load latest version. NOTE: this sort is lexicographic — zero-pad
        # versions (1.09) or use a numeric sort key so 1.10 sorts after 1.9.
        versions = sorted(prompt_dir.glob("*.yaml"))
        path = versions[-1]
    data = yaml.safe_load(path.read_text())
    return PromptVersion(**data)

# prompts/code_reviewer/1.2.yaml:
# name: code_reviewer
# version: "1.2"
# system: "You are a senior engineer reviewing Python code..."
# user_template: "Review this code:\n\n{code}"
# model: gpt-4o
# temperature: 0
# notes: "Added security focus in system prompt"

# In tests:
# def test_code_reviewer_v1_2():
#     prompt = load_prompt("code_reviewer", "1.2")
#     result = run_prompt(prompt, {"code": SQL_INJECTION_SAMPLE})
#     assert "SQL injection" in result.lower()
```
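One caveat when loading the "latest" version by sorted filename: `sorted` on paths is lexicographic, so `1.10.yaml` sorts before `1.9.yaml`. A hypothetical numeric sort key avoids this:

```python
from pathlib import Path

def version_key(path: Path) -> tuple[int, ...]:
    """Numeric sort key: orders 1.9 before 1.10, with or without a leading 'v'."""
    return tuple(int(part) for part in path.stem.lstrip("v").split("."))

files = [Path("1.10.yaml"), Path("1.2.yaml"), Path("1.9.yaml")]
print([p.name for p in sorted(files, key=version_key)])
# ['1.2.yaml', '1.9.yaml', '1.10.yaml']
```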
Performance Benchmarks: When Each Pattern Wins
Based on published research and practical experience, here is guidance on where each pattern delivers meaningful improvement over a baseline single-shot prompt:
- Zero-Shot CoT: +10-30% on multi-step reasoning tasks. Negligible improvement on single-step classification or extraction. Cost: 2-4x output tokens.
- Few-Shot CoT: +20-40% on domain-specific reasoning where examples guide the reasoning format. Cost: +20-50% input tokens for examples.
- Self-Consistency (N=7): +5-15% accuracy over single CoT path. Biggest gains on problems with clear correct/incorrect answers. Cost: 7x single CoT call, but parallelizable.
- Constitutional AI (3 principles, 2 rounds): Significant qualitative improvement in output structure and completeness. Hard to quantify as a single metric. Cost: ~9x single call.
- Prompt Chaining: Transforms intractable single-shot tasks into solvable pipelines. Also enables using smaller (cheaper) models for simpler sub-tasks, often reducing total cost vs. a single GPT-4o call.
Conclusion
The patterns covered here — Chain-of-Thought, Self-Consistency, Constitutional AI, and Prompt Chaining — are not tricks. They are systematic techniques with documented mechanisms and measurable effects. Applying them without understanding when they pay off will inflate costs without improving quality. Applying them to the right tasks with appropriate cost/quality calibration is what separates ad-hoc prompt tinkering from professional prompt engineering.
Start by identifying the steps in your pipeline with the highest error rates. Apply CoT first — it is the cheapest improvement. Add Self-Consistency if single-path variance is hurting you. Use Prompt Chaining to break apart complex tasks that resist single-shot reliability. Apply Constitutional AI where output structure and specification compliance matter. Measure everything with a fixed evaluation set before and after any change.