API Gateway Patterns — Field Handbook


// Kong · Apigee · Rate Limiting · Throttling · Safe AI Exposure

The definitive operational guide to API gateway architecture. Covers Kong and Apigee deep-dives, rate limiting algorithms, quota management, threat protection, and the emerging discipline of safely exposing AI tools and LLM endpoints to the outside world.

Kong Gateway · Apigee · Rate Limiting · Throttling · mTLS / Zero Trust · AI Tool Exposure · Prompt Guard

What Is an API Gateway

// THE FRONT DOOR TO YOUR BACKEND UNIVERSE

An API gateway is the single entry point through which all external (and increasingly, internal) API traffic passes before reaching your backend services. It is simultaneously a reverse proxy, a policy enforcement point, a traffic shaper, and an observability hub. Unlike a basic load balancer, a gateway understands the HTTP/gRPC/WebSocket application layer and can make intelligent decisions per-request.

Traffic Control

Rate limiting, throttling, load balancing, circuit breaking, retries, and timeouts — all enforced before requests reach your services.

Security Enforcement

Authentication, authorization, mTLS termination, JWT validation, API key management, WAF rules, and threat protection in a centralized layer.

Observability

Unified access logs, distributed traces, latency histograms, and error-rate metrics across all services — without instrumenting each service individually.

Gateway vs. Other Patterns

Pattern	Layer	Use When	Limitations
API Gateway	L7 — app aware	Public APIs, multi-consumer, policy enforcement	Potential single point of failure; latency overhead
Load Balancer	L4/L7 basic	Distributing traffic to identical backends	No auth, rate limiting, or transform logic
Service Mesh	L7 — east-west	Internal service-to-service (mTLS, observability)	Not designed for north-south external traffic
BFF (Backend-for-Frontend)	Application	Client-specific API aggregation	Custom code, not declarative, no shared policy
Reverse Proxy (nginx)	L7 basic	Static routing, SSL termination, caching	No dynamic policy, auth plugins, quota management

▸

Complementary, not competing: A real production stack typically uses all three layers — an API Gateway for north-south traffic (external consumers), a Service Mesh for east-west (internal services), and a Load Balancer underneath each. They serve different segments of the traffic matrix.

Gateway Anatomy

// REQUEST PIPELINE INTERNALS

Every enterprise gateway processes requests through a plugin/policy pipeline. Knowing the execution order matters — a misconfigured plugin order can bypass authentication, double-charge rate limits, or corrupt telemetry.

Client

Request Arrives

TLS termination
SNI routing

→

Phase 1

Auth & Identity

JWT · API Key
mTLS · OAuth

→

Phase 2

Rate Limit & Quota

Window checks
Token bucket

→

Phase 3

Transform

Header rewrite
Body modify

→

Phase 4

Route & Proxy

Service LB
Circuit break

→

Response

Log & Return

Trace export
Metrics emit

⚠

Plugin order is security-critical: Authentication MUST run before rate limiting. If rate limiting runs first, unauthenticated requests consume quota, enabling quota-exhaustion DoS against legitimate users. In Kong, plugins with higher priority values run first, and the order comes from each plugin's built-in priority rather than from the order in which plugins are listed, so check the priority of every custom plugin explicitly.

Control Plane vs. Data Plane

Control Plane

Stores gateway configuration — routes, plugins, consumers, credentials, upstream services. In Kong this is the Admin API + database (PostgreSQL or declarative YAML). In Apigee, the management console and API Management APIs.

Separate it. The control plane should never be publicly accessible. Admin API on Kong must be firewalled; Apigee management APIs require service account auth.

Data Plane

The runtime nodes that actually proxy traffic. They pull configuration from the control plane and enforce policies in-process — no database calls per request. Data plane nodes scale horizontally and can run in multiple regions.

Resilience: Configure data planes to keep serving traffic from cached configuration if the control plane is unreachable (Kong hybrid-mode data planes retain the last configuration they received; the Apigee hybrid runtime behaves similarly).

Kong Gateway

// OSS & ENTERPRISE — DECLARATIVE CONFIGURATION

Kong is built on nginx + OpenResty (LuaJIT) with a plugin framework that intercepts requests at each processing phase. Kong Gateway OSS is fully open-source; Kong Enterprise adds RBAC, secrets management, Dev Portal, and audit logs. Configuration can be managed via the Admin API, declaratively with decK (deck sync), or through Kubernetes CRDs (Kong Ingress Controller).

YAML — decK declarative config (service + route + plugins)

# deck.yaml — full service definition with security plugins
# Apply with: deck sync -s deck.yaml

_format_version: "3.0"

services:
  - name: payments-api
    url: http://payments-svc.internal:8080
    connect_timeout: 5000
    read_timeout: 30000
    write_timeout: 30000
    retries: 2

    routes:
      - name: payments-v1
        paths: ["/v1/payments"]
        methods: ["POST", "GET"]
        strip_path: false
        preserve_host: true

    plugins:
      - name: jwt                               # Phase 1: authenticate
        config:
          key_claim_name: kid
          claims_to_verify: [exp, nbf]
          secret_is_base64: false

      - name: rate-limiting                     # Phase 2: enforce quota
        config:
          minute: 60              # 60 req/min per consumer
          hour: 1000
          policy: redis           # shared counter across nodes
          redis_host: redis.internal
          redis_port: 6379
          hide_client_headers: false
          error_code: 429
          error_message: "Rate limit exceeded. Retry-After is in the response header."

      - name: request-size-limiting           # Protect against large payloads
        config:
          allowed_payload_size: 2           # 2 MB max
          size_unit: megabytes

      - name: cors
        config:
          origins: ["https://app.example.com"]
          methods: ["GET", "POST"]
          headers: ["Authorization", "Content-Type"]
          max_age: 3600

      - name: prometheus                       # Metrics emission
        config:
          per_consumer: true

      - name: opentelemetry                   # Distributed tracing
        config:
          endpoint: "http://otel-collector:4318/v1/traces"

Consumer & Credential Model · Core Concept

Kong's authorization unit is the Consumer — a representation of a client app or user. Credentials (API keys, JWTs, OAuth tokens) are attached to consumers. Rate limits, ACLs, and quotas can be scoped per-consumer, allowing different limits for free vs. paid tiers.

Plugin Priority Order · Security-Critical

Higher values run first (figures are approximate and vary by plugin and Kong version):
1000+: Bot detection, IP restriction
~1000: Authentication (JWT, Key Auth, OAuth)
~950: ACL, authorization checks
~900: Rate Limiting (post-auth, per consumer)
~800: Request transforms, validation
~0: Logging, tracing, metrics
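The ordering rule can be checked mechanically. The sketch below sorts a set of enabled plugins by priority, highest first, and verifies that authentication precedes rate limiting. The numeric values and the helper names are illustrative assumptions, not Kong's actual constants.

```python
# Illustrative priorities only; real values vary by plugin and Kong version.
PLUGIN_PRIORITY = {
    "bot-detection": 2500,
    "jwt": 1000,
    "acl": 950,
    "rate-limiting": 901,
    "request-transformer": 800,
    "file-log": 9,
}

def execution_order(enabled):
    """Return plugin names in the order they run: highest priority first."""
    return sorted(enabled, key=lambda name: PLUGIN_PRIORITY[name], reverse=True)

def auth_precedes_rate_limit(order):
    """True when the jwt plugin runs before rate-limiting."""
    return order.index("jwt") < order.index("rate-limiting")

order = execution_order(["rate-limiting", "file-log", "jwt", "acl"])
print(order)
print(auth_precedes_rate_limit(order))
```

A check like this fits naturally in CI next to the declarative config, so a custom plugin with an unexpected priority is caught before deployment.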

Shell — Kong Admin API operations

# Create a consumer and attach an API key
curl -X POST http://localhost:8001/consumers \
  --data username=acme-corp \
  --data custom_id=tenant-001

curl -X POST http://localhost:8001/consumers/acme-corp/key-auth \
  --data key=sk-prod-abc123xyz

# Apply per-consumer rate limit override (Enterprise)
# 300/min is 5x the default (paid tier). Comments cannot follow a trailing
# backslash, so they sit above the command.
curl -X POST http://localhost:8001/consumers/acme-corp/plugins \
  --data name=rate-limiting \
  --data config.minute=300 \
  --data config.hour=5000 \
  --data config.policy=redis

# Rotate API key (old key stays active during grace period)
curl -X POST http://localhost:8001/consumers/acme-corp/key-auth \
  --data key=sk-prod-newkey456

# Revoke old key after client confirms migration
curl -X DELETE http://localhost:8001/consumers/acme-corp/key-auth/<key-id>

# Health check — data plane node status
curl http://localhost:8001/status | jq '.database, .server'

Apigee (Google Cloud)

// ENTERPRISE API MANAGEMENT — POLICY-DRIVEN

Apigee is Google Cloud's full-lifecycle API management platform. Unlike Kong's plugin model, Apigee uses XML policy bundles attached to proxy flows (PreFlow → Conditional Flows → PostFlow). Apigee X runs entirely on Google Cloud; Apigee hybrid deploys runtime pods in your own Kubernetes cluster while keeping management in GCP.

Proxy Structure

Inbound

ProxyEndpoint
PreFlow

Auth, rate limit,
spike arrest

→

Routing

Conditional
Flows

Path/verb
matching

→

Backend

TargetEndpoint

LB, health,
mTLS to upstream

→

Response

PostFlow
Response

Header mask
error transform

→

Client

Final Response

Logged, traced,
metered

XML — Apigee proxy bundle (ProxyEndpoint)

<ProxyEndpoint name="default">
  <PreFlow name="PreFlow">
    <Request>
      <Step><Name>VA-OAuthV2-Token</Name></Step>        <!-- 1. Verify OAuth -->
      <Step><Name>SC-SpikeArrest-Global</Name></Step>   <!-- 2. Global burst limit -->
      <Step><Name>QL-Quota-PerAppPerMin</Name></Step>   <!-- 3. Per-app quota -->
      <Step><Name>AM-SetTarget-Headers</Name></Step>    <!-- 4. Inject context -->
    </Request>
  </PreFlow>

  <Flows>
    <Flow name="GetPayments">
      <Condition>(proxy.pathsuffix MatchesPath "/payments") and (request.verb = "GET")</Condition>
      <Request>
        <Step><Name>RL-RaiseFault-ReadOnly</Name></Step>
      </Request>
    </Flow>
  </Flows>

  <PostFlow name="PostFlow">
    <Response>
      <Step><Name>AM-RemoveInternalHeaders</Name></Step>
      <Step><Name>ST-StatisticsCollector</Name></Step>
    </Response>
  </PostFlow>
</ProxyEndpoint>

<!-- SpikeArrest Policy: caps instantaneous burst -->
<SpikeArrest name="SC-SpikeArrest-Global">
  <Rate>500ps</Rate>           <!-- 500 req/second global cap -->
  <UseEffectiveCount>true</UseEffectiveCount>
</SpikeArrest>

<!-- Quota Policy: per-app per-minute limit -->
<!-- The PreFlow verifies an OAuth token, so the product quota variables have no
     verifyapikey.* prefix (that prefix applies only after a VerifyAPIKey policy) -->
<Quota name="QL-Quota-PerAppPerMin">
  <Allow countRef="apiproduct.developer.quota.limit"/>
  <Interval ref="apiproduct.developer.quota.interval">1</Interval>
  <TimeUnit ref="apiproduct.developer.quota.timeunit">minute</TimeUnit>
  <Identifier ref="client_id"/>   <!-- unique counter per app -->
  <Distributed>true</Distributed>
  <Synchronous>true</Synchronous>
</Quota>

Kong Strengths

✓ Open-source core — no vendor lock-in
✓ Lua plugin framework — extend anything
✓ Kubernetes-native (KIC)
✓ Lower operational cost at scale
✓ Fast data plane (nginx/OpenResty)
✗ Less mature developer portal
✗ Analytics require external stack

Apigee Strengths

✓ Full developer portal + monetization
✓ Built-in analytics + custom dashboards
✓ Advanced API product/plan management
✓ GCP-native IAM, Secret Manager
✓ Integrated threat protection rules
✗ Higher cost, GCP dependency
✗ XML policies: verbose and complex

Rate Limiting

// ALGORITHMS, WINDOWS, AND DISTRIBUTED COUNTERS

Rate limiting controls how many requests a client can send in a time window. The choice of algorithm determines fairness, burst tolerance, and implementation complexity. All production gateway deployments need distributed counters: with local-only counters each node sees only its own share of the traffic, so a client spread across N nodes can get up to N times the intended limit.

The Four Algorithms

Fixed Window Counter · Simple

Divide time into fixed windows (e.g., each minute). Count requests per window. Reset at window boundary.

Problem: "Boundary burst". With a limit of 100 req/min, a client can send 100 requests at 00:59 and 100 more at 01:00, which is 200 requests in about 2 seconds while staying within the limit of each individual window.
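The boundary burst is easy to reproduce. The following in-memory sketch (class and parameter names are assumptions for illustration) applies a 100 req/min fixed window and sends two bursts on either side of the window boundary.

```python
from collections import defaultdict

class FixedWindowLimiter:
    """In-memory fixed window counter; timestamps are passed in for clarity."""

    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.counts = defaultdict(int)   # window index -> request count

    def allow(self, now):
        window = int(now // self.window_secs)
        if self.counts[window] >= self.limit:
            return False
        self.counts[window] += 1
        return True

limiter = FixedWindowLimiter(limit=100, window_secs=60)
# 100 requests in the last second of window 0, 100 more in the first second of window 1
late = sum(limiter.allow(59.5) for _ in range(100))
early = sum(limiter.allow(60.5) for _ in range(100))
print(late + early)   # every request was accepted, twice the nominal limit
```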

Sliding Window Log · Accurate

Store timestamp of every request. Count only requests within the last N seconds. Evict old entries.

Trade-off: Most accurate, but O(n) memory per consumer at high volume. Acceptable for low-QPS privileged endpoints; not for public APIs.

Sliding Window Counter · Balanced

Hybrid: two fixed windows + weighted interpolation. Count = (prev_window × weight) + current_window. Weight = 1 − (elapsed / window_size).

Best trade-off for most use cases: O(1) storage, near-accurate smoothing, and a boundary burst that is largely eliminated. Kong's rate-limiting-advanced plugin (Enterprise) takes this approach when window_type is set to sliding.
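The weighted formula above can be sketched in a few lines. This is an in-memory illustration with assumed names, not any gateway's implementation; it expects timestamps that never go backwards.

```python
class SlidingWindowCounter:
    """Two fixed windows plus weighted interpolation, as in the formula above."""

    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.window_index = 0
        self.current = 0
        self.previous = 0

    def _roll(self, now):
        index = int(now // self.window_secs)
        if index != self.window_index:
            # Keep the old count only if it belongs to the directly preceding window.
            self.previous = self.current if index == self.window_index + 1 else 0
            self.current = 0
            self.window_index = index

    def allow(self, now):
        self._roll(now)
        elapsed = now - self.window_index * self.window_secs
        weight = 1 - elapsed / self.window_secs
        estimated = self.previous * weight + self.current
        if estimated + 1 > self.limit:
            return False
        self.current += 1
        return True

limiter = SlidingWindowCounter(limit=100, window_secs=60)
late = sum(limiter.allow(59.5) for _ in range(100))    # fills window 0
early = sum(limiter.allow(60.5) for _ in range(100))   # boundary burst attempt
mid = sum(limiter.allow(90.0) for _ in range(100))     # halfway through window 1
print(late, early, mid)
```

Half a second into the new window the previous window still carries almost full weight, so the second burst is rejected; capacity returns gradually as the weight decays.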

Token Bucket · Burst-Friendly

Tokens fill at a constant rate (e.g., 10/sec). Each request consumes 1 token. Bucket has a max capacity — allows short bursts up to bucket size.

Ideal for AI endpoints where one request may cost vastly different amounts of computation. Pair with token-weight logic for LLM APIs.

Python — Token bucket implementation (Redis-backed)

import time
import redis

r = redis.Redis(host='redis.internal', decode_responses=True)

def token_bucket_check(consumer_id: str, cost: int = 1) -> dict:
    """
    Redis Lua script ensures atomic read-modify-write.
    Returns: {allowed: bool, tokens_remaining: int, retry_after_secs: float}
    """
    key        = f"rl:token_bucket:{consumer_id}"
    rate       = 10          # tokens added per second
    capacity   = 50          # max burst size
    now        = time.time()

    lua_script = """
    local key      = KEYS[1]
    local rate     = tonumber(ARGV[1])
    local capacity = tonumber(ARGV[2])
    local now      = tonumber(ARGV[3])
    local cost     = tonumber(ARGV[4])

    local data     = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens   = tonumber(data[1]) or capacity
    local last     = tonumber(data[2]) or now

    -- Refill tokens based on elapsed time
    local elapsed  = math.max(0, now - last)
    tokens = math.min(capacity, tokens + elapsed * rate)

    if tokens >= cost then
        tokens = tokens - cost
        redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return {1, math.floor(tokens), "0"}
    else
        -- Redis truncates Lua numbers to integers in replies,
        -- so the fractional wait is returned as a string.
        local wait = (cost - tokens) / rate
        return {0, math.floor(tokens), tostring(wait)}
    end
    """

    result = r.eval(lua_script, 1, key, rate, capacity, now, cost)
    return {
        "allowed":           bool(result[0]),
        "tokens_remaining":  int(result[1]),
        "retry_after_secs":  float(result[2])
    }

Recommended Response Headers (429 status from RFC 6585; RateLimit fields from the IETF draft)

Header	Value	When to Send
`RateLimit-Limit`	Max requests in the window (e.g., `100`)	Every response
`RateLimit-Remaining`	Requests left in current window	Every response
`RateLimit-Reset`	Seconds until the window resets (the IETF draft uses delta-seconds; legacy `X-RateLimit-Reset` headers often carry a Unix timestamp instead)	Every response
`Retry-After`	Seconds until client may retry	429 responses
`X-RateLimit-Limit-Minute`	Kong-specific: per-minute limit	OPTIONAL
`X-RateLimit-Limit-Hour`	Kong-specific: hourly limit	OPTIONAL
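A small helper shows how these headers might be assembled from limiter state. The function name and arguments are assumptions; the sketch emits the reset as delta-seconds, so swap in an epoch timestamp if your clients expect one.

```python
import math

def rate_limit_headers(limit, remaining, reset_in_secs, rejected=False):
    """Build rate-limit response headers; Retry-After is added only on rejection."""
    reset = str(math.ceil(reset_in_secs))   # round up so clients never retry early
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": reset,
    }
    if rejected:
        headers["Retry-After"] = reset
    return headers

ok_headers = rate_limit_headers(100, 42, 17.2)
blocked_headers = rate_limit_headers(100, 0, 12.3, rejected=True)
print(ok_headers)
print(blocked_headers)
```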

Throttling Strategies

// SHAPING TRAFFIC WITHOUT DROPPING IT

Throttling is the act of slowing down requests rather than rejecting them. Where rate limiting drops excess requests with 429, throttling queues or delays them. The right strategy depends on your SLA, consumer expectations, and backend capacity model.

Hard Limit

Reject immediately — 429 Too Many Requests. For burst abuse, DDoS mitigation, unauthenticated traffic. No queue, no delay. Use SpikeArrest in Apigee or the rate-limiting plugin in Kong with error_code: 429.

Soft Throttle

Add artificial delay — slow the client without dropping. Useful for preventing hot consumers from starving others. Inject a response delay (e.g., 200ms–2s) when a consumer hits 80% of their limit. The client experiences degraded performance before hitting the hard wall.
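One way to compute the injected delay is a linear ramp that starts at 80% usage and reaches the maximum at the hard limit. The function below is a sketch under that assumption; the thresholds and delay bounds are illustrative defaults.

```python
def throttle_delay(used, limit, start_ratio=0.8, min_delay=0.2, max_delay=2.0):
    """Seconds of artificial delay for a consumer that has used `used` of `limit`."""
    ratio = used / limit
    if ratio < start_ratio:
        return 0.0
    # Linear ramp from min_delay at start_ratio up to max_delay at 100% usage.
    progress = min(1.0, (ratio - start_ratio) / (1.0 - start_ratio))
    return min_delay + progress * (max_delay - min_delay)

for used in (50, 80, 90, 100):
    print(used, round(throttle_delay(used, 100), 2))
```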

Queue & Retry

Accept the request, queue it, drain at allowed rate. Gateway returns 202 Accepted immediately with a job ID; client polls for completion. Requires async architecture. Ideal for expensive operations (AI inference, report generation).

Priority Queue

Tiered queue with guaranteed SLAs per plan. Enterprise tier gets dedicated queue capacity. Free tier requests wait behind paid tier. Implement with Redis Sorted Sets scored by consumer tier + arrival time.
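The ordering rule (tier first, arrival time second) can be sketched with an in-memory heap. The tier ranks and class name are assumptions; in Redis the same pair would be folded into a single sorted-set score.

```python
import heapq
import itertools

# Lower rank drains first; the tier names mirror the plans in this handbook.
TIER_RANK = {"enterprise": 0, "business": 1, "developer": 2, "free": 3}

class TieredQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # arrival order, used as tiebreaker

    def push(self, request_id, tier):
        # Tuples compare element by element: tier rank, then arrival order.
        heapq.heappush(self._heap, (TIER_RANK[tier], next(self._seq), request_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = TieredQueue()
for request_id, tier in [("a", "free"), ("b", "business"), ("c", "free"), ("d", "enterprise")]:
    queue.push(request_id, tier)
drained = [queue.pop() for _ in range(len(queue))]
print(drained)
```

Strict priority can starve the lowest tier under sustained load, so a production queue usually reserves some capacity for it or ages waiting requests upward.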

YAML — Kong queue-depth load shedding with pre-function

# Pattern: shed load when the async queue is full, tag the rest for priority routing
# Requires: an upstream async handler that returns 202 + job ID and maintains the
# Redis key queue:ai-inference:depth (assumed here; Kong does not provide it).
# pre-function runs sandboxed: allow resty.redis via untrusted_lua_sandbox_requires.

plugins:
  - name: pre-function                # Lua to check queue depth
    config:
      access:
        - |
          local redis = require "resty.redis"
          local red = redis:new()
          red:set_timeouts(100, 100, 100)   -- connect/send/read, in ms
          local ok, err = red:connect("redis.internal", 6379)
          if ok then
            local depth = red:get("queue:ai-inference:depth")
            red:set_keepalive(10000, 100)
            -- A missing key comes back as ngx.null; treat it as an empty queue
            if depth ~= ngx.null and (tonumber(depth) or 0) > 500 then
              return kong.response.exit(503, { error = "queue_full" },
                                        { ["Retry-After"] = "30" })
            end
          else
            kong.log.warn("queue depth check skipped: ", err)
          end

          -- Tag request with consumer tier for priority routing
          local tier = kong.ctx.shared.consumer_tier or "free"
          kong.service.request.set_header("X-Consumer-Tier", tier)

  # X-Queue-Position and X-Estimated-Wait must be set by the upstream handler:
  # response-transformer adds static header values only, and request-termination
  # cannot evaluate an expression such as "queue_depth > 500".

Quotas & API Plans

// MONETIZATION & ENTITLEMENT ARCHITECTURE

Rate limits protect infrastructure; quotas enforce business entitlements. Quotas are typically longer-lived (daily, monthly) and tied to API products or subscription plans. A free-tier consumer may be rate-limited to 60 req/min AND quota-capped at 10,000 req/month — both apply simultaneously.

Free Tier

60 req/min · 10,000 req/month · No SLA guarantee
No auth for some endpoints; API key required for others. Quota hard-stops at monthly cap. Upgrade prompt embedded in 429 response body.

Developer

300 req/min · 100,000 req/month · 99.5% SLA
OAuth 2.0 required. Access to beta endpoints. Quota rollover: unused 10% carries to next month. Dashboard with usage analytics.

Business

1,000 req/min · 1M req/month · 99.9% SLA
Priority queue. Dedicated gateway pool in Enterprise plans. Burst allowance: up to 3× for 30-second windows. Webhook-based quota alerts.

Enterprise

Custom limits · Unlimited or metered · 99.99% SLA
Per-contract negotiation. BYOK encryption. Dedicated egress IPs. Private gateway deployment. Custom SLAs with financial penalties.

JSON — Apigee API Product quota configuration (the // comments are annotations; strip them before submitting)

// Apigee API Product definition (via management API)
{
  "name": "InferenceAPI-Business",
  "displayName": "Inference API — Business Plan",
  "proxies": ["inference-api-v1"],
  "environments": ["production"],
  "quota": "1000",            // 1000 requests
  "quotaInterval": "1",       // per
  "quotaTimeUnit": "minute", // minute

  // Monthly absolute cap — enforced by second Quota policy
  "attributes": [
    { "name": "monthly_quota",   "value": "1000000" },
    { "name": "burst_multiplier", "value": "3" },       // 3x burst for 30s
    { "name": "consumer_tier",    "value": "business" },
    { "name": "sla_target",      "value": "99.9" }
  ],

  // Scoped API operations included in this product
  "operationGroup": {
    "operationConfigs": [{
      "apiSource": "inference-api-v1",
      "operations": [
        { "resource": "/v1/chat/completions",  "methods": ["POST"] },
        { "resource": "/v1/embeddings",         "methods": ["POST"] }
      ]
    }]
  }
}

Circuit Breakers

// FAIL FAST, RECOVER GRACEFULLY

A circuit breaker prevents cascading failures by stopping requests to a failing upstream before those failures propagate. Without a circuit breaker, a slow or failed backend causes gateway threads to block, consuming resources until the gateway itself degrades.

State: CLOSED

Normal Traffic

Count failures
vs threshold

→

State: OPEN

Fail Fast 503

No upstream
calls made

→

State: HALF-OPEN

Probe Upstream

1 req allowed
per window

→

Recovery

Back to CLOSED

If probe
succeeds
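The three states translate into a small state machine. This is a minimal single-threaded sketch with assumed names and thresholds; the clock is injectable so the behavior can be tested without waiting.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"   # let exactly one probe through
                return True
            return False                   # fail fast, no upstream call
        if self.state == "HALF_OPEN":
            return False                   # a probe is already in flight
        return True

    def record_success(self):
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        if self.state == "HALF_OPEN":
            self._trip()                   # probe failed, reopen immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
        self.failures = 0

fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=3, reset_timeout=10.0, clock=lambda: fake_now[0])
for _ in range(3):
    breaker.record_failure()
print(breaker.state, breaker.allow_request())
```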

YAML — Kong upstream circuit breaker + passive health checks

upstreams:
  - name: payments-upstream
    algorithm: round-robin
    slots: 10000

    healthchecks:
      passive:                        # Tracks real traffic — no extra probes
        healthy:
          http_statuses: [200, 201, 204]
          successes: 1               # 1 success reopens
        unhealthy:
          http_statuses: [500, 502, 503, 504]
          http_failures: 5           # 5 failures → mark node down
          timeouts: 3               # 3 timeouts also mark the node down

      active:                         # Periodic health probe
        type: http
        http_path: /health
        healthy:
          interval: 10              # probe every 10s
          successes: 2              # 2 successes to mark healthy
        unhealthy:
          interval: 5               # probe sick nodes more often
          http_failures: 3

    targets:
      - target: payments-1.internal:8080
        weight: 100
      - target: payments-2.internal:8080
        weight: 100

Authentication Patterns

// API KEYS, OAUTH 2.0, JWT, mTLS, AND HMAC

Method	Best For	Revocability	Gateway Support
API Key	Machine-to-machine, simple integrations, public APIs	MANUAL — must be rotated explicitly	UNIVERSAL — all gateways
OAuth 2.0 + PKCE	User-delegated access, mobile/web apps	REAL-TIME — token revocation via introspection	UNIVERSAL
JWT (stateless)	High-throughput internal auth, claims-bearing tokens	HARD — requires blocklist or short TTL	UNIVERSAL
mTLS	Service-to-service, highest assurance, zero-trust	REAL-TIME — CRL/OCSP revocation	UNIVERSAL
HMAC Signature	Webhook validation, financial APIs, tamper-proof requests	MEDIUM — key rotation required	PLUGIN — may require custom

YAML — Kong OAuth 2.0 plugin with PKCE

plugins:
  - name: oauth2
    config:
      scopes: ["read:payments", "write:payments", "read:profile"]
      mandatory_scope: true
      enable_authorization_code: true
      enable_client_credentials: true
      enable_implicit_grant: false   # deprecated, insecure
      enable_password_grant: false    # deprecated, avoid
      token_expiration: 3600           # 1 hour access token
      refresh_token_ttl: 1209600       # 14 days refresh token
      pkce: strict                      # enforce PKCE for every client ("lax" enforces it for public clients only)
      accept_http_if_already_terminated: false  # reject non-TLS

# Local signature and claims validation for short-lived JWTs (no introspection call)
  - name: jwt
    config:
      uri_param_names: []              # never accept token in URL — header only
      cookie_names: []
      claims_to_verify: ["exp", "nbf"]  # Kong's jwt plugin accepts only exp and nbf here
      maximum_expiration: 3600         # reject tokens > 1h lifetime
      key_claim_name: "kid"
      run_on_preflight: false

mTLS & Zero Trust at the Gateway

// MUTUAL AUTHENTICATION — BOTH SIDES PROVE IDENTITY

Mutual TLS (mTLS) extends standard TLS by requiring the client to present a certificate, not just the server. At the gateway, this means only clients holding a certificate signed by your corporate CA can establish connections — credential theft alone is insufficient to gain access.

YAML — Kong mTLS plugin + client cert validation

# 1. Upload CA certificate to Kong
curl -X POST http://localhost:8001/ca_certificates \
  --data-urlencode "cert=$(cat /path/to/corporate-ca.pem)"

# 2. Configure service to require client cert
services:
  - name: high-assurance-api
    url: https://backend.internal:8443
    # Gateway → Backend: also mTLS (gateway presents its own cert)
    tls_verify: true
    ca_certificates: ["<ca-cert-uuid>"]
    client_certificate: "<gateway-cert-uuid>"

    plugins:
      - name: mtls-auth                      # Client → Gateway: mTLS
        config:
          ca_certificates: ["<ca-cert-uuid>"]
          # Certificate-bound tokens (RFC 8705) are enforced by the plugin that validates
          # the OAuth token, not by mtls-auth; see the plugin reference for your Kong version
          revocation_check_mode: IGNORE_CA_ERROR
          skip_consumer_lookup: false            # map cert CN to Kong consumer
          authenticated_group_by: DN             # group by Distinguished Name

      - name: acl
        config:
          allow: ["tier:enterprise", "service:internal"]  # cert-derived groups

✓

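Certificate-Bound Tokens (RFC 8705): Combine OAuth 2.0 with mTLS by binding access tokens to the client certificate thumbprint, carried in the token's cnf claim as x5t#S256. The token is useless without the matching certificate and private key, so a stolen token alone doesn't grant access. In Kong, this is enabled on the plugin that validates the token; the exact setting depends on the plugin and version. FAPI 2.0 (Financial-grade API) requires sender-constrained tokens, which mTLS binding satisfies (DPoP is the alternative).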
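The binding check itself is short. The sketch below computes the RFC 8705 thumbprint (base64url-encoded SHA-256 of the DER certificate, without padding) and compares it with the token's cnf claim. The function names are assumptions, and the byte string stands in for a real DER-encoded certificate.

```python
import base64
import hashlib
import hmac

def cert_thumbprint_s256(cert_der: bytes) -> str:
    """base64url(SHA-256(DER certificate)) without padding, as used for x5t#S256."""
    digest = hashlib.sha256(cert_der).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def token_bound_to_cert(claims: dict, cert_der: bytes) -> bool:
    """True only if the token's cnf claim matches the presented client certificate."""
    expected = claims.get("cnf", {}).get("x5t#S256")
    if expected is None:
        return False
    return hmac.compare_digest(expected, cert_thumbprint_s256(cert_der))

client_cert = b"placeholder-der-bytes"          # stand-in for a real DER certificate
claims = {"sub": "acme-corp", "cnf": {"x5t#S256": cert_thumbprint_s256(client_cert)}}
print(token_bound_to_cert(claims, client_cert))
print(token_bound_to_cert(claims, b"some-other-certificate"))
```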

Threat Protection

// WAF, INJECTION, SCHEMA VALIDATION

Injection Defense · Critical

Validate all inputs against strict JSON Schema before routing
Block SQLi, XSS, SSRF, path traversal via regex deny-lists at gateway
Reject unexpected content types: application/x-www-form-urlencoded should never reach a JSON API
Limit request body depth (JSON bomb protection — max 5 levels)
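The depth limit can be enforced before the body is parsed at all, which avoids handing a deeply nested payload to a recursive parser. This sketch (function names are assumptions) scans the raw text and counts bracket nesting while skipping string contents.

```python
def max_nesting_depth(raw: str) -> int:
    """Deepest object/array nesting in a JSON text, ignoring brackets inside strings."""
    depth = max_depth = 0
    in_string = escaped = False
    for ch in raw:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch in "}]":
            depth -= 1
    return max_depth

def within_depth_limit(raw: str, limit: int = 5) -> bool:
    return max_nesting_depth(raw) <= limit

print(max_nesting_depth('{"a": {"b": [1, 2]}}'))
print(within_depth_limit("[" * 10 + "]" * 10))
```

The scan does not validate the JSON; it only bounds nesting, so the schema validator still runs afterwards on payloads that pass.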

Resource Exhaustion · DoS

Max request body size: enforce at gateway, not backend
Max URL length: 2048 chars (reject longer)
Max header count: 100; max header size: 8KB
Max query string parameters: 50
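Timeout cascade: backend timeout < gateway timeout < client timeout, so the deepest layer gives up first and the gateway never abandons a request the backend is still working on (prevents resource exhaustion and retry pile-ups)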

YAML — Kong request-validator + bot detection

# JSON Schema validation (request-validator, Kong Enterprise): reject malformed
# payloads at the gateway layer.
# Notes sit outside body_schema because "#" inside the "|" block would become
# part of the JSON text and break parsing:
#   model       allowlist only
#   messages    maxItems 50 prevents context stuffing
#   content     maxLength 32000 characters (~8k tokens)
#   additionalProperties false rejects unknown fields
plugins:
  - name: request-validator
    config:
      version: draft4                          # body_schema is JSON Schema, not Kong's native format
      body_schema: |
        {
          "type": "object",
          "required": ["model", "messages"],
          "properties": {
            "model": {
              "type": "string",
              "enum": ["gpt-4o", "claude-3-7-sonnet-20250219"]
            },
            "messages": {
              "type": "array",
              "maxItems": 50,
              "items": {
                "type": "object",
                "required": ["role", "content"],
                "properties": {
                  "role":    { "type": "string", "enum": ["user", "assistant", "system"] },
                  "content": { "type": "string", "maxLength": 32000 }
                }
              }
            },
            "max_tokens": { "type": "integer", "minimum": 1, "maximum": 4096 },
            "temperature": { "type": "number", "minimum": 0, "maximum": 2 }
          },
          "additionalProperties": false
        }
      allowed_content_types: ["application/json"]
      verbose_response: false                  # don't leak schema in errors

  # Bot detection: allow/deny entries are regexes matched against the User-Agent header
  - name: bot-detection
    config:
      allow: []
      deny: ["generic-bot", "crawler", "scanner"]

AI Gateway Patterns

// SAFELY EXPOSING AI TOOLS TO THE OUTSIDE WORLD

Exposing AI model endpoints externally introduces risks that standard API gateway patterns don't fully cover. LLMs are non-deterministic, computationally expensive, and susceptible to prompt injection, data leakage, and jailbreaking. An AI gateway layer adds controls specific to this threat model, sitting between your consumers and the model API.

Consumer

API Request

External app
or user

→

Gateway Layer 1

Auth + Rate
Limit + Quota

Standard
policies

→

Gateway Layer 2

AI-Specific
Controls

Token limit
Prompt guard

→

Model Router

Semantic Cache
+ Model Select

Cost control
fallback

→

LLM API

OpenAI · Claude
Gemini · Self-host

Actual
inference

AI Gateway Capabilities

Semantic Caching · Cost Control

Cache LLM responses by semantic similarity of prompts, not exact string match. If the embedding of a new prompt has a cosine similarity above 0.95 to a cached query, return the cached response. This can cut inference costs substantially for repetitive query patterns (FAQs, code completion, classification); the commonly quoted 30–70% range depends heavily on the workload.

Tools: Kong AI Gateway, LiteLLM, Portkey, GPTCache
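The lookup logic behind a semantic cache is a nearest-neighbor search with a similarity cutoff. The sketch below uses a linear scan and hand-written toy vectors in place of a real embedding model and vector store; all names are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # callable: prompt -> list of floats
        self.threshold = threshold
        self.entries = []           # (embedding, response) pairs

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine_similarity(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

# Toy vectors standing in for real embeddings.
TOY_VECTORS = {
    "reset password": [1.0, 0.0, 0.0],
    "how do I reset my password": [0.98, 0.1, 0.0],
    "weather today": [0.0, 1.0, 0.0],
}
cache = SemanticCache(embed=TOY_VECTORS.__getitem__)
cache.put("reset password", "Use the account settings page.")
hit = cache.get("how do I reset my password")
miss = cache.get("weather today")
print(hit, miss)
```

Cache keys should also include the consumer or tenant where responses may contain user-specific data, otherwise one consumer's answer can be served to another.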

Model Routing & Fallback · Resilience

Route to cheaper models for simple tasks; escalate to frontier models for complex ones. If primary model is unavailable (429, 503), auto-fallback to alternative — with model mapping for API compatibility. Maintain per-model rate limits separately.

Pattern: GPT-4o-mini → GPT-4o → Claude as fallback chain
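A fallback chain reduces to a loop that moves on only when the failure is retryable. In this sketch `UpstreamError`, `call_with_fallback`, and the fake provider are hypothetical names; a real router would also translate request and response formats between providers.

```python
RETRYABLE = {429, 503}

class UpstreamError(Exception):
    def __init__(self, status):
        super().__init__(f"upstream returned {status}")
        self.status = status

def call_with_fallback(chain, call_model, prompt):
    """Try each model in order; fall through only on retryable statuses."""
    last_error = None
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except UpstreamError as err:
            if err.status not in RETRYABLE:
                raise               # client errors should not trigger fallback
            last_error = err
    if last_error is None:
        raise ValueError("empty model chain")
    raise last_error

def fake_provider(model, prompt):
    # Stand-in for real provider calls: the first model is rate limited.
    if model == "primary-model":
        raise UpstreamError(429)
    return f"{model} answered: {prompt}"

used_model, answer = call_with_fallback(
    ["primary-model", "secondary-model"], fake_provider, "ping"
)
print(used_model, answer)
```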

PII Scrubbing · Compliance

Detect and redact PII (names, emails, SSNs, credit cards, phone numbers) from prompts before sending to external model APIs. Use regex + NER model in the gateway pipeline. Log the redaction event but never the original PII.
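The regex half of that pipeline might look like the sketch below. The patterns are deliberately simple assumptions that will miss many real-world formats, and the NER stage is not shown; only redaction counts are returned so that the original values never reach a log.

```python
import re

# Simplified patterns for illustration; production rules need broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}

def redact(text):
    """Replace matches with a label and return (clean_text, counts_per_label)."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, found = pattern.subn(f"[{label}]", text)
        if found:
            counts[label] = found
    return text, counts

clean, counts = redact(
    "Mail jane@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
)
print(clean)
print(counts)
```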

Output Filtering · Safety

Inspect LLM responses before returning to consumer. Strip: PII leakage, hallucinated credentials, toxic content, competitor mentions (per policy). Apply regex + classifier in the response pipeline. Log filtered content for fine-tuning feedback.

YAML — Kong AI Gateway plugin configuration

# Kong AI Gateway (KonnectAI / Kong 3.6+)
# Field names below are simplified; verify each plugin's schema for your Kong version.
plugins:
  - name: ai-proxy
    config:
      route_type: llm/v1/chat
      auth:
        header_name: Authorization
        header_value: "{vault://secrets/openai-key}"   # vault reference must be the whole value; the secret stores "Bearer <key>"
      model:
        provider: openai
        name: gpt-4o
        options:
          max_tokens: 2048
          temperature: 0.7
          input_cost: 2.5          # USD per 1M input tokens (for metering)
          output_cost: 10.0         # USD per 1M output tokens

  - name: ai-semantic-cache
    config:
      embeddings_provider: openai
      embeddings_model: text-embedding-3-small
      similarity_threshold: 0.95          # 95%+ match = cache hit
      cache_ttl: 3600
      vectordb:
        strategy: redis
        redis:
          host: redis.internal
          port: 6379

  - name: ai-prompt-guard
    config:
      allow_patterns: []
      deny_patterns:                                   # plain list of regex strings
        - "ignore previous instructions"
        - "DAN mode"
        - "jailbreak"
        - "\\bSSN\\b.*\\d{3}-\\d{2}-\\d{4}"   # SSN in prompt
        - "\\b4[0-9]{12}(?:[0-9]{3})?\\b"     # Visa card number
      match_all_roles: true

  - name: ai-rate-limiting-advanced
    config:
      limit_by: consumer
      limits:
        - tokens_per_minute: 100000     # input + output tokens
        - tokens_per_hour:   1000000
        - requests_per_minute: 60

LLM-Specific Security Risks

// OWASP LLM TOP 10 AT THE GATEWAY

The OWASP Top 10 for LLM Applications defines risks specific to AI systems. The gateway is the first — and most scalable — line of defense against several of these, though some require application-layer mitigations.

Threat Coverage Matrix

Risk	Gateway Can Block	App-Layer Also Needed	Severity
LLM01: Prompt Injection	Partial: regex deny-lists	Input sanitization, system prompt hardening	CRITICAL
LLM02: Insecure Output Handling	Response content filtering	Output encoding before rendering	HIGH
LLM03: Training Data Poisoning	Not applicable at gateway	Model provenance, red-teaming	MEDIUM
LLM04: Model DoS	Token limits, rate limiting, request size	Backend timeouts	HIGH
LLM06: Sensitive Data Disclosure	PII scrubbing in request + response	RAG access controls, data classification	CRITICAL
LLM07: Insecure Plugin Design	Tool/function allowlisting in request schema	Capability sandboxing, principle of least privilege	CRITICAL
LLM09: Overreliance	Not applicable at gateway	Human-in-the-loop for high-stakes decisions	MEDIUM

⚠

Tool/Function call allowlisting: If you expose function-calling or tool-use APIs, the gateway MUST validate that tools arrays in requests only reference approved function names and schemas. An attacker crafting a malicious tool definition can invoke arbitrary code if your application naively executes whatever the model returns. Treat every LLM output as untrusted.
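A gateway-side check for the tools array could look like the following sketch. It assumes an OpenAI-style request shape (`tools[].function.name` and `parameters.properties`); the approved tool names and parameters are hypothetical.

```python
# Hypothetical allowlist: tool name -> parameter names the tool may declare.
APPROVED_TOOLS = {
    "get_weather": {"city"},
    "lookup_order": {"order_id"},
}

def validate_tools(body):
    """Return a list of violations; an empty list means the tools array is acceptable."""
    violations = []
    for tool in body.get("tools", []):
        fn = tool.get("function", {})
        name = fn.get("name")
        if name not in APPROVED_TOOLS:
            violations.append(f"tool not allowlisted: {name!r}")
            continue
        declared = set(fn.get("parameters", {}).get("properties", {}))
        extra = declared - APPROVED_TOOLS[name]
        if extra:
            violations.append(f"{name}: unexpected parameters {sorted(extra)}")
    return violations

request_body = {
    "tools": [
        {"function": {"name": "get_weather",
                      "parameters": {"properties": {"city": {"type": "string"}}}}},
        {"function": {"name": "run_shell",
                      "parameters": {"properties": {"cmd": {"type": "string"}}}}},
    ]
}
problems = validate_tools(request_body)
print(problems)
```

Reject the request with a 400 when the list is non-empty; the application must still validate the arguments the model returns before executing any tool.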

Token-Aware Rate Limiting

// REQUEST COUNT IS THE WRONG METRIC FOR LLMs

Standard rate limiting counts requests. For LLM APIs, two requests can differ by 1000× in computational cost — a 100-token prompt vs a 128,000-token context window. Token-based limiting controls actual resource consumption, not request count.

Python — Token-aware gateway middleware

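import tiktoken
import redis.asyncio as redis  # async client: Redis calls must not block the event loop
from fastapi import Request, HTTPException

enc = tiktoken.encoding_for_model("gpt-4o")
r   = redis.Redis(host="redis.internal", decode_responses=True)

TOKEN_LIMITS = {
    "free":       {"minute": 10_000,  "day": 100_000},
    "developer":  {"minute": 50_000,  "day": 1_000_000},
    "business":   {"minute": 200_000, "day": 10_000_000},
}
WINDOW_TTL = {"minute": 60, "day": 86_400}  # fixed-window lengths in seconds

async def token_limit_middleware(request: Request, consumer_id: str, tier: str):
    body = await request.json()

    # Pre-flight: count input tokens BEFORE sending to model.
    # Approximation: only plain-string content is counted; multimodal parts
    # and per-message framing tokens are ignored.
    messages_text = " ".join(
        m["content"] for m in body.get("messages", [])
        if isinstance(m.get("content"), str)
    )
    input_tokens = len(enc.encode(messages_text))

    # Reserve max_tokens from the request as the worst-case output cost
    max_output    = body.get("max_tokens", 1024)
    expected_cost = input_tokens + max_output

    limits = TOKEN_LIMITS.get(tier, TOKEN_LIMITS["free"])
    keys   = {w: f"tokens:{consumer_id}:{w}" for w in WINDOW_TTL}

    # Reserve first, check second: INCRBY is atomic, so two concurrent
    # requests cannot both pass on the same stale counter value.
    pipe = r.pipeline()
    for key in keys.values():
        pipe.incrby(key, expected_cost)
    totals = dict(zip(keys, await pipe.execute()))

    for window, total in totals.items():
        # Start the window only when the key was just created. Calling EXPIRE
        # on every request would keep postponing the reset.
        if total == expected_cost:
            await r.expire(keys[window], WINDOW_TTL[window])

    for window, total in totals.items():
        if total > limits[window]:
            # Roll back the reservation in both windows before rejecting
            undo = r.pipeline()
            for key in keys.values():
                undo.decrby(key, expected_cost)
            await undo.execute()
            used = total - expected_cost
            raise HTTPException(
                status_code=429,
                detail={
                    "error": "token_limit_exceeded",
                    "window": window,
                    "limit": limits[window],
                    "used": used,
                    "request_cost": expected_cost,
                    "remaining": max(0, limits[window] - used),
                }
            )

    # After response: reconcile actual usage (tokens_used from API response)
    # Subtract over-estimate if model returned fewer tokens than max_tokens
    return input_tokens, max_output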
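
The reconciliation step mentioned in the closing comments could look like the sketch below. It assumes the upstream response reports the tokens actually consumed (for example an OpenAI-style usage.total_tokens field) and uses a plain dict in place of Redis so that it runs standalone. The reconcile_usage helper is a hypothetical name; the key layout mirrors the tokens:{consumer_id}:{window} keys used by the middleware.

```python
def reconcile_usage(counters: dict, consumer_id: str, reserved: int, actual: int) -> int:
    """Refund the over-estimate that was reserved at admission time.

    `counters` stands in for Redis (key -> int). With Redis, the subtraction
    below would be a DECRBY on each window key.
    Returns the number of tokens refunded (0 if the estimate was not too high).
    """
    refund = max(0, reserved - actual)
    if refund:
        for window in ("minute", "day"):
            key = f"tokens:{consumer_id}:{window}"
            # Skip windows that have already expired, and never go below zero
            if key in counters:
                counters[key] = max(0, counters[key] - refund)
    return refund
```

For example, a request admitted with reserved=1500 that actually used 900 tokens returns 600 tokens to both windows. When porting this to Redis, check that the key still exists before decrementing; a DECRBY on an expired key would create a negative counter with no TTL.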

Prompt Inspection & Guardrails

// DETECT INJECTION, JAILBREAKS, AND PII BEFORE INFERENCE

Prompt inspection at the gateway layer is the first — and cheapest — defense against malicious inputs. Combine rule-based pattern matching (fast, low cost) with classifier-based detection (higher accuracy, higher latency) tiered by risk level.

Python — Prompt inspection pipeline

import re
from enum import Enum

class Verdict(Enum):
    ALLOW    = "allow"
    WARN     = "warn"       # log + continue
    SANITIZE = "sanitize"  # redact PII, then continue
    BLOCK    = "block"     # reject with 400

INJECTION_PATTERNS = [
    # Direct jailbreak attempts
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions?", re.I),
    re.compile(r"(you are|act as|pretend to be)\s+(dan|evil|unrestricted)", re.I),
    re.compile(r"(DAN|jailbreak|GPT-4-DAN|STAN)\s+(mode|prompt)", re.I),
    # System prompt extraction
    re.compile(r"(reveal|print|repeat|output)\s+(your\s+)?(system|initial)\s+(prompt|instruction)", re.I),
    # Indirect injection via data
    re.compile(r"<(s|INST|SYS|SYSTEM|human|assistant)>"),  # model-specific control tokens
    re.compile(r"\[\[HUMAN\]\]|\[\[ASSISTANT\]\]"),
]

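PII_PATTERNS = {
    "ssn":         re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Visa (13 or 16 digits), Mastercard 51-55, Amex; contiguous digits only,
    # so numbers written with spaces or dashes are not caught
    "credit_card": re.compile(r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b"),
    # [A-Za-z] for the TLD: a "|" inside a character class is a literal pipe
    "email":       re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "api_key":     re.compile(r"\b(sk-[a-zA-Z0-9]{32,}|ghp_[a-zA-Z0-9]{36})\b"),
}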

def inspect_prompt(text: str) -> tuple[Verdict, str, dict]:
    # 1. Injection check — block immediately
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return Verdict.BLOCK, text, {"reason": "prompt_injection_detected"}

    # 2. PII check — sanitize, don't block
    sanitized = text
    found_pii = []
    for pii_type, pattern in PII_PATTERNS.items():
        if pattern.search(sanitized):
            sanitized = pattern.sub(f"[REDACTED:{pii_type.upper()}]", sanitized)
            found_pii.append(pii_type)

    if found_pii:
        return Verdict.SANITIZE, sanitized, {"redacted": found_pii}

    return Verdict.ALLOW, text, {}

ℹ

Classifier-based detection: Regex is fast but easily evaded with Unicode look-alike characters, zero-width characters, or paraphrasing. For high-security deployments, add a secondary ML classifier (Llama Guard, ShieldLM, or a fine-tuned BERT) that evaluates semantics, not syntax. Run classifiers asynchronously and cache verdicts by prompt hash to keep the added latency low (<50 ms target).
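
Some of the cheaper evasions can be removed before the regex pass by normalizing the text first. The sketch below uses only the standard library; normalize_prompt is a hypothetical helper name, and it is meant to run on a copy of the prompt that is used for inspection.

```python
import unicodedata

def normalize_prompt(text: str) -> str:
    """Fold compatibility characters and drop invisible format characters."""
    # NFKC maps full-width and other compatibility forms to their plain equivalents
    folded = unicodedata.normalize("NFKC", text)
    # Category "Cf" covers zero-width spaces and joiners, the BOM, and bidi controls
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
```

A prompt such as "ig\u200bnore previous instructions" becomes "ignore previous instructions" and is then caught by the injection patterns. This does not handle cross-script homoglyphs (for example a Cyrillic "а" standing in for a Latin "a") or paraphrasing, which is where the classifier tier is still needed. Because stripping format characters also breaks emoji joiner sequences, forward the original text to the model and inspect only the normalized copy.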

Observability

// METRICS, TRACES, LOGS — THE GATEWAY TELEMETRY STACK

Golden Signals

Latency — p50/p95/p99 per route
Traffic — req/sec, tokens/sec
Errors — 4xx/5xx rates by consumer
Saturation — connection pool, CPU

AI-Specific Metrics

Input / output tokens per minute
Token cost (USD) per consumer
Cache hit rate (semantic cache)
Prompt injection block rate
Model fallback frequency
Time-to-first-token (streaming)

Alert Thresholds

p99 latency > 3s: PagerDuty P2
Error rate > 5% (5m): P1
429 rate > 20%: quota tuning
Token cost anomaly > 3σ: billing alert
Cache hit rate < 20%: cache review
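
The "> 3σ" threshold for token cost can be read as a z-score test against a recent baseline. A minimal sketch, assuming the caller supplies a list of recent per-interval cost samples for one consumer; is_cost_anomaly is a hypothetical helper, and the choice of baseline window is left to the operator.

```python
from statistics import mean, pstdev

def is_cost_anomaly(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """True when `current` exceeds the historical mean by more than `sigmas` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current > mu  # flat baseline: any increase is a deviation
    return (current - mu) / sigma > sigmas
```

Only upward deviations are flagged, since a drop in spend is not a billing risk. Consumers with bursty but legitimate traffic have a large σ and therefore a wide tolerance, which is usually the desired behavior for a billing alert.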

YAML — Prometheus alerting rules for AI gateway

# prometheus/rules/ai-gateway.yml
groups:
  - name: ai_gateway
    rules:
      - alert: HighTokenCostAnomaly
        # Approximates the 3σ threshold: fires when the output-token rate exceeds 3x its trailing 1h mean
        expr: |
          rate(ai_gateway_tokens_total{direction="output"}[5m])
          > 3 * avg_over_time(
              rate(ai_gateway_tokens_total{direction="output"}[5m])[1h:5m]
            )
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Token usage spike detected — possible prompt stuffing"

      - alert: PromptInjectionSpike
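        # rate() is per second; multiply by 60 to compare against the per-minute threshold.
        # sum() aggregates across gateway instances and label sets.
        expr: |
          sum(rate(ai_gateway_prompt_blocked_total[5m])) * 60 > 10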
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: ">10 prompt injections/min — possible coordinated attack"

      - alert: ModelFallbackHigh
        expr: |
          rate(ai_gateway_model_fallback_total[10m]) /
          rate(ai_gateway_requests_total[10m]) > 0.1
        for: 5m
        annotations:
          summary: ">10% of requests falling back — primary model degraded"

      - alert: SemanticCacheHitRateLow
        expr: |
          rate(ai_gateway_cache_hit_total[1h]) /
          rate(ai_gateway_requests_total[1h]) < 0.2
        for: 30m
        annotations:
          summary: "Cache hit rate <20% — review embedding similarity threshold"

Policy-as-Code

// OPA, CEDAR, AND DECLARATIVE GATEWAY CONFIG

Gateway policies should be version-controlled, tested, and deployed through CI/CD — not configured manually via UIs. For complex authorization logic, offload to a policy engine (OPA, Cedar) that the gateway calls synchronously at the enforcement point.

Rego — OPA policy for AI API authorization

# policy/ai_api.rego
package ai.api.authz

import future.keywords

# Default: deny all
default allow = false

# Consumers may only use models in their tier's allowlist
allow if {
    requested_model := input.request.body.model
    consumer_tier   := input.consumer.tier
    allowed_models[consumer_tier][requested_model]
}

allowed_models := {
    "free":       {"gpt-4o-mini": true, "text-embedding-3-small": true},
    "developer":  {"gpt-4o-mini": true, "gpt-4o": true, "text-embedding-3-large": true},
    "business":   {"gpt-4o": true, "claude-3-7-sonnet-20250219": true, "o3": true},
    "enterprise": {"_all": true},
}

# Enterprise override: any model allowed
allow if {
    input.consumer.tier == "enterprise"
}

# Callers must evaluate both rules: grant access only if allow is true AND deny is empty.

# Block tool/function calls for free tier (prevents code execution)
# Partial-set rules written with `if` need `contains` so that deny is a set of reasons
deny contains reason if {
    input.consumer.tier == "free"
    count(input.request.body.tools) > 0
    reason := "Free tier does not support function calling"
}

# Enforce max_tokens ceiling per tier
deny contains reason if {
    input.request.body.max_tokens > max_tokens_ceiling[input.consumer.tier]
    reason := sprintf("max_tokens exceeds tier ceiling of %v",
                       [max_tokens_ceiling[input.consumer.tier]])
}

max_tokens_ceiling := {
    "free":       512,
    "developer":  4096,
    "business":   16384,
    "enterprise": 128000,
}
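
On the gateway side, the synchronous call to the policy engine might look like the sketch below. It assumes OPA runs as a local sidecar on port 8181 and is queried through its Data API at the path matching the ai.api.authz package. The OPA_URL address and the build_opa_input, decide, and authorize helpers are hypothetical names, and the input shape simply mirrors the fields the policy reads (input.consumer.tier, input.request.body).

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181/v1/data/ai/api/authz"  # assumed sidecar address

def build_opa_input(consumer_tier: str, body: dict) -> dict:
    """Wrap request data in the {"input": ...} envelope the OPA Data API expects."""
    return {"input": {"consumer": {"tier": consumer_tier}, "request": {"body": body}}}

def decide(opa_response: dict) -> tuple[bool, list[str]]:
    """Combine the policy's allow and deny results into one decision."""
    result = opa_response.get("result", {})
    reasons = sorted(result.get("deny", []))  # set (JSON array) or object keys
    allowed = result.get("allow", False) is True and not reasons
    return allowed, reasons

def authorize(consumer_tier: str, body: dict, timeout: float = 0.5) -> tuple[bool, list[str]]:
    """Query OPA synchronously; fail closed if the engine is unreachable."""
    payload = json.dumps(build_opa_input(consumer_tier, body)).encode()
    req = urllib.request.Request(
        OPA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return decide(json.load(resp))
    except (OSError, ValueError):
        return False, ["policy engine unavailable"]
```

The decide function requires allow to be true and deny to be empty, because the policy itself does not make allow depend on deny. Failing closed on a timeout is a deliberate choice for an authorization path; a short timeout keeps the added latency bounded.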

▸

CI Pipeline for Gateway Config: Store all Kong/Apigee config in Git → lint with deck validate or apigeelint → run policy unit tests (opa test) → diff against prod with deck diff → deploy via deck sync in CD pipeline. Never mutate production via Admin API manually — treat it as infra drift.

Quick Reference

// STATUS CODES, LIMITS, COMMANDS, CHEATSHEET

HTTP Status Code Guide for Gateway

Code	Meaning	When Gateway Returns It
`400`	Bad Request	Schema validation failure, malformed JSON, disallowed parameters
`401`	Unauthorized	Missing, malformed, expired, or otherwise invalid credentials
`403`	Forbidden	Valid credentials, but insufficient scope or ACL denial
`404`	Not Found	Route not matched, or the endpoint is deliberately hidden from unauthorized clients (SDP-style "dark" service)
`408`	Request Timeout	Client took too long to send request body
`413`	Payload Too Large	Request body exceeds `allowed_payload_size`
`422`	Unprocessable Entity	Syntactically valid JSON but semantically invalid (OPA denial)
`429`	Too Many Requests	Rate limit or quota exceeded — include `Retry-After`
`502`	Bad Gateway	Upstream returned an invalid response
`503`	Service Unavailable	All upstream nodes unhealthy (circuit open); queue full
`504`	Gateway Timeout	Upstream did not respond within `read_timeout`

Kong CLI Cheatsheet

# Validate declarative config
deck validate -s deck.yaml

# Diff current vs what would be synced
deck diff -s deck.yaml

# Apply configuration
deck sync -s deck.yaml

# Dump current config to file
deck dump -o current.yaml

# Live reload (no downtime)
kong reload

# Check gateway status
curl localhost:8001/status | jq .

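# List configured plugins with their scope
# (plugin objects returned by /plugins have no instance_config or priority field)
curl -s localhost:8001/plugins | jq \
  '[.data[] | {name, enabled, service: .service.id, route: .route.id}]'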

Apigee CLI Cheatsheet

# Authenticate
gcloud auth application-default login

# Deploy a proxy bundle
apigeecli apis create bundle \
  --name my-api --proxy ./apiproxy \
  --org $ORG --env production

# List deployments
apigeecli deployments list \
  --org $ORG --env production

# Create API product
apigeecli products create \
  --name "AI-Business" \
  --proxies "inference-api-v1" \
  --quota 1000 --interval 1 \
  --unit minute --org $ORG

# Check proxy traffic (last 1h)
apigeecli stats get \
  --org $ORG --env production \
  --dims apiproxy --metric message_count

AI Gateway Configuration Checklist

Checklist — AI endpoint exposure

## Authentication & Authorization
□ OAuth 2.0 with PKCE or mTLS for all consumers
□ Scopes defined per capability (e.g., ai:chat, ai:embed, ai:tools)
□ OPA/Cedar policy: model allowlist per consumer tier
□ Tool/function calling disabled for free tier

## Traffic Control
□ Request-count rate limit: per minute AND per hour (per consumer)
□ Token-count rate limit: input+output tokens per minute AND per day
□ Max request body size: 1–10 MB depending on use case
□ max_tokens ceiling enforced per tier (schema validation)
□ Spike arrest: global burst cap before per-consumer limits
□ Circuit breaker: passive + active health checks on model API

## Prompt Security
□ JSON Schema validation: allowlist known fields, reject additionalProperties
□ Prompt injection pattern matching (regex deny-list)
□ PII scrubbing in request: SSN, credit card, email, API keys
□ PII scrubbing in response: model may hallucinate or echo PII
□ Tool/function call schema validation: only approved function names
□ Context depth limit: max messages array length

## Cost Control
□ Semantic cache enabled with similarity threshold
□ Token cost metered per consumer (input + output separately)
□ Billing alert on anomalous token spend (>3σ)
□ Model fallback chain configured (expensive → cheap on 429/503)

## Observability
□ Distributed tracing: inject trace-id into all upstream requests
□ Token usage exported to Prometheus/Datadog per consumer
□ Prompt injection block rate alerting
□ Per-model error rates and latency histograms
□ Audit log: every request logged (consumer, model, tokens, verdict)
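
Several of the Prompt Security and Traffic Control items reduce to structural checks on the request body. A minimal sketch in plain Python, assuming an OpenAI-style chat payload; the ALLOWED_FIELDS set, the MAX_MESSAGES value, and the validate_request helper are illustrative assumptions, not recommended values.

```python
ALLOWED_FIELDS = {"model", "messages", "max_tokens", "temperature", "stream", "tools"}  # assumed field allowlist
MAX_MESSAGES = 50  # assumed context depth limit

def validate_request(body: dict, max_tokens_ceiling: int) -> list[str]:
    """Return validation errors; an empty list means the request shape is acceptable."""
    errors = []

    # Allowlist known fields (the equivalent of additionalProperties: false)
    unknown = sorted(set(body) - ALLOWED_FIELDS)
    if unknown:
        errors.append(f"unknown fields: {', '.join(unknown)}")

    # Context depth limit on the messages array
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        errors.append("messages must be a non-empty array")
    elif len(messages) > MAX_MESSAGES:
        errors.append(f"messages exceeds limit of {MAX_MESSAGES}")

    # max_tokens ceiling for the consumer's tier (bool is excluded explicitly,
    # because bool is a subclass of int in Python)
    max_tokens = body.get("max_tokens")
    if max_tokens is not None and (
        not isinstance(max_tokens, int)
        or isinstance(max_tokens, bool)
        or not 1 <= max_tokens <= max_tokens_ceiling
    ):
        errors.append(f"max_tokens must be an integer between 1 and {max_tokens_ceiling}")

    return errors
```

Any non-empty result maps to a 400 response before the request is counted against token limits or forwarded upstream.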

▸

Reference Resources: Kong documentation at docs.konghq.com · Apigee docs at cloud.google.com/apigee/docs · OWASP LLM Top 10 at owasp.org/www-project-top-10-for-large-language-model-applications · RFC 6585 (429 Status Code) · RFC 8705 (Certificate-Bound Tokens) · NIST AI RMF at airc.nist.gov · LiteLLM AI gateway at litellm.ai
