API Gateway Patterns — Field Handbook

Gateway
Patterns

// Kong · Apigee · Rate Limiting · Throttling · Safe AI Exposure

The definitive operational guide to API gateway architecture. Covers Kong and Apigee deep-dives, rate limiting algorithms, quota management, threat protection, and the emerging discipline of safely exposing AI tools and LLM endpoints to the outside world.

Kong Gateway · Apigee · Rate Limiting · Throttling · mTLS / Zero Trust · AI Tool Exposure · Prompt Guard
01

What Is an API Gateway

// THE FRONT DOOR TO YOUR BACKEND UNIVERSE

An API gateway is the single entry point through which all external (and increasingly, internal) API traffic passes before reaching your backend services. It is simultaneously a reverse proxy, a policy enforcement point, a traffic shaper, and an observability hub. Unlike a basic load balancer, a gateway understands the HTTP/gRPC/WebSocket application layer and can make intelligent decisions per-request.

Traffic Control

Rate limiting, throttling, load balancing, circuit breaking, retries, and timeouts — all enforced before requests reach your services.

Security Enforcement

Authentication, authorization, mTLS termination, JWT validation, API key management, WAF rules, and threat protection in a centralized layer.

Observability

Unified access logs, distributed traces, latency histograms, and error-rate metrics across all services — without instrumenting each service individually.

Gateway vs. Other Patterns

Pattern | Layer | Use When | Limitations
API Gateway | L7 (app-aware) | Public APIs, multi-consumer, policy enforcement | Potential single point of failure; latency overhead
Load Balancer | L4 / basic L7 | Distributing traffic to identical backends | No auth, rate limiting, or transform logic
Service Mesh | L7 (east-west) | Internal service-to-service (mTLS, observability) | Not designed for north-south external traffic
BFF (Backend-for-Frontend) | Application | Client-specific API aggregation | Custom code, not declarative, no shared policy
Reverse Proxy (nginx) | Basic L7 | Static routing, SSL termination, caching | No dynamic policy, auth plugins, quota management
Complementary, not competing: A real production stack typically uses all three layers — an API Gateway for north-south traffic (external consumers), a Service Mesh for east-west (internal services), and a Load Balancer underneath each. They serve different segments of the traffic matrix.
02

Gateway Anatomy

// REQUEST PIPELINE INTERNALS

Every enterprise gateway processes requests through a plugin/policy pipeline. Knowing the execution order matters — a misconfigured plugin order can bypass authentication, double-charge rate limits, or corrupt telemetry.

Client: Request Arrives (TLS termination, SNI routing)
→ Phase 1: Auth & Identity (JWT · API Key · mTLS · OAuth)
→ Phase 2: Rate Limit & Quota (window checks, token bucket)
→ Phase 3: Transform (header rewrite, body modify)
→ Phase 4: Route & Proxy (service LB, circuit break)
→ Response: Log & Return (trace export, metrics emit)
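Plugin order is security-critical: Authentication MUST run before rate limiting. If rate limiting runs first, unauthenticated requests consume quota, enabling quota-exhaustion DoS against legitimate users, and per-consumer limits cannot be applied because the consumer is not yet known. In Kong, plugins with higher priority numbers run first. Each plugin's priority is defined in its handler rather than in kong.conf, so verify the effective order for your Kong version explicitly, especially when you add custom plugins.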
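A minimal sketch of the ordering check described above, assuming Kong's convention that a higher priority value runs earlier. The priority numbers in `PLUGIN_PRIORITIES` are illustrative placeholders, not authoritative values for any Kong release, and both function names are hypothetical.

```python
# Illustrative priorities only; real values differ by plugin and Kong version.
PLUGIN_PRIORITIES = {
    "bot-detection": 2500,
    "jwt": 1450,
    "acl": 950,
    "rate-limiting": 910,
    "request-transformer": 801,
    "http-log": 12,
}


def execution_order(priorities):
    """Return plugin names in run order: highest priority first."""
    return sorted(priorities, key=priorities.get, reverse=True)


def auth_runs_before_rate_limit(priorities, auth="jwt", limiter="rate-limiting"):
    """True when the auth plugin executes before the rate limiter."""
    order = execution_order(priorities)
    return order.index(auth) < order.index(limiter)


print(execution_order(PLUGIN_PRIORITIES))
print(auth_runs_before_rate_limit(PLUGIN_PRIORITIES))
```

A check like this can run in CI against the priorities of any custom plugins, so a mis-prioritized limiter is caught before deployment.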

Control Plane vs. Data Plane

Control Plane

Stores gateway configuration — routes, plugins, consumers, credentials, upstream services. In Kong this is the Admin API + database (PostgreSQL or declarative YAML). In Apigee, the management console and API Management APIs.

Separate it. The control plane should never be publicly accessible. Admin API on Kong must be firewalled; Apigee management APIs require service account auth.

Data Plane

The runtime nodes that actually proxy traffic. They pull configuration from the control plane and enforce policies in-process — no database calls per request. Data plane nodes scale horizontally and can run in multiple regions.

Resilience: Configure data planes to operate with cached config if the control plane is unreachable (Kong's declarative mode, Apigee hybrid runtime).

03

Kong Gateway

// OSS & ENTERPRISE — DECLARATIVE CONFIGURATION

Kong is built on nginx + OpenResty (LuaJIT) with a plugin framework that intercepts requests at each processing phase. Kong Gateway OSS is fully open-source; Kong Enterprise adds RBAC, secrets management, Dev Portal, and audit logs. Configuration can be managed via Admin API, declarative deck sync, or Kubernetes CRDs (Kong Ingress Controller).

YAML — decK declarative config (service + route + plugins)
# deck.yaml: full service definition with security plugins
# Apply with: deck sync -s deck.yaml
_format_version: "3.0"
services:
  - name: payments-api
    url: http://payments-svc.internal:8080
    connect_timeout: 5000
    read_timeout: 30000
    write_timeout: 30000
    retries: 2
    routes:
      - name: payments-v1
        paths: ["/v1/payments"]
        methods: ["POST", "GET"]
        strip_path: false
        preserve_host: true
    plugins:
      - name: jwt                       # Phase 1: authenticate
        config:
          key_claim_name: kid
          claims_to_verify: [exp, nbf]
          secret_is_base64: false
      - name: rate-limiting             # Phase 2: enforce quota
        config:
          minute: 60                    # 60 req/min per consumer
          hour: 1000
          policy: redis                 # shared counter across nodes
          redis_host: redis.internal
          redis_port: 6379
          hide_client_headers: false
          error_code: 429
          error_message: "Rate limit exceeded. Retry-After is in the response header."
      - name: request-size-limiting     # Protect against large payloads
        config:
          allowed_payload_size: 2       # 2 MB max
          size_unit: megabytes
      - name: cors
        config:
          origins: ["https://app.example.com"]
          methods: ["GET", "POST"]
          headers: ["Authorization", "Content-Type"]
          max_age: 3600
      - name: prometheus                # Metrics emission
        config:
          per_consumer: true
      - name: opentelemetry             # Distributed tracing
        config:
          endpoint: "http://otel-collector:4318/v1/traces"
Consumer & Credential Model Core Concept

Kong's authorization unit is the Consumer — a representation of a client app or user. Credentials (API keys, JWTs, OAuth tokens) are attached to consumers. Rate limits, ACLs, and quotas can be scoped per-consumer, allowing different limits for free vs. paid tiers.

Plugin Priority Order Security-Critical
  • Higher numbers run first; the values below are approximate and vary by plugin and Kong version
  • 1000+: Bot detection, IP restriction
  • ~999: Authentication (JWT, Key Auth, OAuth)
  • ~950: ACL, authorization checks
  • ~901: Rate Limiting (post-auth, per consumer)
  • ~800: Request transforms, validation
  • ~0: Logging, tracing, metrics
Shell — Kong Admin API operations
# Create a consumer and attach an API key
curl -X POST http://localhost:8001/consumers \
  --data username=acme-corp \
  --data custom_id=tenant-001

curl -X POST http://localhost:8001/consumers/acme-corp/key-auth \
  --data key=sk-prod-abc123xyz

# Apply per-consumer rate limit override (Enterprise)
# config.minute=300 is 5x the default (paid tier). The comment lives here
# because a comment after a trailing backslash would break the command.
curl -X POST http://localhost:8001/consumers/acme-corp/plugins \
  --data name=rate-limiting \
  --data config.minute=300 \
  --data config.hour=5000 \
  --data config.policy=redis \
  --data config.redis_host=redis.internal

# Rotate API key (old key stays active during grace period)
curl -X POST http://localhost:8001/consumers/acme-corp/key-auth \
  --data key=sk-prod-newkey456

# Revoke old key after client confirms migration
curl -X DELETE http://localhost:8001/consumers/acme-corp/key-auth/<key-id>

# Health check: data plane node status
curl http://localhost:8001/status | jq '.database, .server'
04

Apigee (Google Cloud)

// ENTERPRISE API MANAGEMENT — POLICY-DRIVEN

Apigee is Google Cloud's full-lifecycle API management platform. Unlike Kong's plugin model, Apigee uses XML policy bundles attached to proxy flows (PreFlow → Conditional Flows → PostFlow). Apigee X runs entirely on Google Cloud; Apigee hybrid deploys runtime pods in your own Kubernetes cluster while keeping management in GCP.

Proxy Structure

Inbound: ProxyEndpoint PreFlow (auth, rate limit, spike arrest)
→ Routing: Conditional Flows (path/verb matching)
→ Backend: TargetEndpoint (LB, health, mTLS to upstream)
→ Response: PostFlow Response (header mask, error transform)
→ Client: Final Response (logged, traced, metered)
XML — Apigee proxy bundle (ProxyEndpoint)
<ProxyEndpoint name="default">
  <PreFlow name="PreFlow">
    <Request>
      <Step><Name>VA-OAuthV2-Token</Name></Step>        <!-- 1. Verify OAuth -->
      <Step><Name>SC-SpikeArrest-Global</Name></Step>   <!-- 2. Global burst limit -->
      <Step><Name>QL-Quota-PerAppPerMin</Name></Step>   <!-- 3. Per-app quota -->
      <Step><Name>AM-SetTarget-Headers</Name></Step>    <!-- 4. Inject context -->
    </Request>
  </PreFlow>
  <Flows>
    <Flow name="GetPayments">
      <Condition>(proxy.pathsuffix MatchesPath "/payments") and (request.verb = "GET")</Condition>
      <Request>
        <Step><Name>RL-RaiseFault-ReadOnly</Name></Step>
      </Request>
    </Flow>
  </Flows>
  <PostFlow name="PostFlow">
    <Response>
      <Step><Name>AM-RemoveInternalHeaders</Name></Step>
      <Step><Name>ST-StatisticsCollector</Name></Step>
    </Response>
  </PostFlow>
</ProxyEndpoint>

<!-- SpikeArrest policy (separate file): caps instantaneous burst -->
<SpikeArrest name="SC-SpikeArrest-Global">
  <Rate>500ps</Rate>   <!-- 500 req/second global cap -->
  <UseEffectiveCount>true</UseEffectiveCount>
</SpikeArrest>

<!-- Quota policy (separate file): per-app per-minute limit.
     The flow verifies an OAuth token, so the API product quota variables are
     read as apiproduct.developer.quota.*; a VerifyAPIKey policy would expose
     them under verifyapikey.{policy_name}.apiproduct.developer.quota.* instead. -->
<Quota name="QL-Quota-PerAppPerMin">
  <Allow countRef="apiproduct.developer.quota.limit"/>
  <Interval ref="apiproduct.developer.quota.interval">1</Interval>
  <TimeUnit ref="apiproduct.developer.quota.timeunit">minute</TimeUnit>
  <Identifier ref="client_id"/>   <!-- unique counter per app -->
  <Distributed>true</Distributed>
  <Synchronous>true</Synchronous>
</Quota>
Kong Strengths
  • ✓ Open-source core — no vendor lock-in
  • ✓ Lua plugin framework — extend anything
  • ✓ Kubernetes-native (KIC)
  • ✓ Lower operational cost at scale
  • ✓ Fast data plane (nginx/OpenResty)
  • ✗ Less mature developer portal
  • ✗ Analytics require external stack
Apigee Strengths
  • ✓ Full developer portal + monetization
  • ✓ Built-in analytics + custom dashboards
  • ✓ Advanced API product/plan management
  • ✓ GCP-native IAM, Secret Manager
  • ✓ Integrated threat protection rules
  • ✗ Higher cost, GCP dependency
  • ✗ XML policies: verbose and complex
05

Rate Limiting

// ALGORITHMS, WINDOWS, AND DISTRIBUTED COUNTERS

Rate limiting controls how many requests a client can send in a time window. The choice of algorithm determines fairness, burst tolerance, and implementation complexity. All production gateway deployments need distributed counters: with local-only counters each node sees only its own share of traffic, so a client spread across N nodes can exceed the intended limit by up to a factor of N.

The Four Algorithms

Fixed Window Counter Simple

Divide time into fixed windows (e.g., each minute). Count requests per window. Reset at window boundary.

Problem: "Boundary burst" — a client can send 100 req at 00:59 and 100 more at 01:00, effectively 200 req in 2 seconds while staying within limits.
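The boundary burst can be reproduced with a few lines. This is an in-memory sketch for illustration only (the class name and `now` parameter are hypothetical, and old window keys are never evicted); a gateway would keep these counters in a shared store.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Fixed window counter: one counter per (client, window index)."""

    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.counts = defaultdict(int)

    def allow(self, client, now):
        window = int(now // self.window_secs)  # resets at each boundary
        key = (client, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_secs=60)
# 100 requests in the last second of one window...
late = sum(limiter.allow("c1", now=59.0) for _ in range(100))
# ...and 100 more in the first second of the next window.
early = sum(limiter.allow("c1", now=60.0) for _ in range(100))
print(late + early)  # admitted within roughly one second of wall time
```

Both batches are admitted in full because each lands in a different window, which is exactly the weakness the sliding variants address.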

Sliding Window Log Accurate

Store timestamp of every request. Count only requests within the last N seconds. Evict old entries.

Trade-off: Most accurate, but O(n) memory per consumer at high volume. Acceptable for low-QPS privileged endpoints; not for public APIs.

Sliding Window Counter Balanced

Hybrid: two fixed windows + weighted interpolation. Count = (prev_window × weight) + current_window. Weight = 1 − (elapsed / window_size).

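Best trade-off for most use cases: O(1) storage and near-accurate smoothing that largely removes the boundary burst (the count is an estimate that assumes requests were evenly spread across the previous window). Kong's sliding window mode, offered by the Rate Limiting Advanced plugin, uses this approach.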
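The weighted formula above can be sketched as follows. This is a single-process illustration under the assumption that timestamps never move backwards; the class and method names are hypothetical, and a real deployment would hold the two counters in a shared store.

```python
class SlidingWindowCounter:
    """Estimate = previous_window * weight + current_window."""

    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window_secs = window_secs
        self.index = 0      # index of the current fixed window
        self.current = 0    # requests counted in the current window
        self.previous = 0   # requests counted in the window before it

    def _roll(self, now):
        index = int(now // self.window_secs)
        if index != self.index:
            # Keep the old count only if we moved exactly one window ahead.
            self.previous = self.current if index == self.index + 1 else 0
            self.current = 0
            self.index = index

    def estimate(self, now):
        self._roll(now)
        elapsed = now - self.index * self.window_secs
        weight = 1 - elapsed / self.window_secs
        return self.previous * weight + self.current

    def allow(self, now):
        if self.estimate(now) + 1 > self.limit:
            return False
        self.current += 1
        return True


limiter = SlidingWindowCounter(limit=100, window_secs=60)
first = sum(limiter.allow(59.0) for _ in range(100))
at_boundary = sum(limiter.allow(60.0) for _ in range(100))
half_way = sum(limiter.allow(90.0) for _ in range(100))
print(first, at_boundary, half_way)
```

Right at the boundary the previous window still carries full weight, so the second burst is rejected; halfway through the next window half of the old count has aged out.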

Token Bucket Burst-Friendly

Tokens fill at a constant rate (e.g., 10/sec). Each request consumes 1 token. Bucket has a max capacity — allows short bursts up to bucket size.

Ideal for AI endpoints where one request may cost vastly different amounts of computation. Pair with token-weight logic for LLM APIs.

Python — Token bucket implementation (Redis-backed)
import time
import redis

r = redis.Redis(host='redis.internal', decode_responses=True)


def token_bucket_check(consumer_id: str, cost: int = 1) -> dict:
    """
    Redis Lua script ensures atomic read-modify-write.
    Returns: {allowed: bool, tokens_remaining: int, retry_after_secs: float}
    """
    key = f"rl:token_bucket:{consumer_id}"
    rate = 10       # tokens added per second
    capacity = 50   # max burst size
    now = time.time()

    lua_script = """
    local key      = KEYS[1]
    local rate     = tonumber(ARGV[1])
    local capacity = tonumber(ARGV[2])
    local now      = tonumber(ARGV[3])
    local cost     = tonumber(ARGV[4])

    local data   = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or capacity
    local last   = tonumber(data[2]) or now

    -- Refill tokens based on elapsed time
    local elapsed = math.max(0, now - last)
    tokens = math.min(capacity, tokens + elapsed * rate)

    if tokens >= cost then
        tokens = tokens - cost
        redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return {1, math.floor(tokens), '0'}
    else
        -- Redis truncates Lua numbers to integers in replies, so the
        -- fractional wait is returned as a string.
        local wait = (cost - tokens) / rate
        return {0, math.floor(tokens), tostring(wait)}
    end
    """

    result = r.eval(lua_script, 1, key, rate, capacity, now, cost)
    return {
        "allowed": bool(result[0]),
        "tokens_remaining": int(result[1]),
        "retry_after_secs": float(result[2]),
    }

Required Response Headers (RFC 6585 + Best Practice)

Header | Value | Required?
RateLimit-Limit | Max requests in the window (e.g., 100) | YES
RateLimit-Remaining | Requests left in current window | YES
RateLimit-Reset | Seconds until the window resets (the IETF RateLimit draft uses delta-seconds; the legacy X-RateLimit-Reset often carries a Unix timestamp) | YES
Retry-After | Seconds until client may retry (429 responses only) | YES on 429
X-RateLimit-Limit-Minute | Kong-specific: per-window breakdown | OPTIONAL
X-RateLimit-Limit-Hour | Kong-specific: hourly limit | OPTIONAL
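A small sketch of how a limiter's decision might be turned into the headers in the table. It assumes `RateLimit-Reset` is expressed as seconds remaining (as in the IETF RateLimit header draft) rather than a Unix timestamp; the function name and parameters are hypothetical.

```python
import math


def rate_limit_headers(limit, remaining, reset_in_secs, rejected=False):
    """Build response headers from a limiter decision."""
    headers = {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        # Round up so clients never retry a fraction of a second too early.
        "RateLimit-Reset": str(math.ceil(reset_in_secs)),
    }
    if rejected:
        # Retry-After is only meaningful on the 429 response itself.
        headers["Retry-After"] = str(math.ceil(reset_in_secs))
    return headers


print(rate_limit_headers(100, 0, 12.3, rejected=True))
```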
06

Throttling Strategies

// SHAPING TRAFFIC WITHOUT DROPPING IT

Throttling is the act of slowing down requests rather than rejecting them. Where rate limiting drops excess requests with 429, throttling queues or delays them. The right strategy depends on your SLA, consumer expectations, and backend capacity model.

Hard Limit
Reject immediately — 429 Too Many Requests. For burst abuse, DDoS mitigation, unauthenticated traffic. No queue, no delay. Use SpikeArrest in Apigee or the rate-limiting plugin in Kong with error_code: 429.
Soft Throttle
Add artificial delay — slow the client without dropping. Useful for preventing hot consumers from starving others. Inject a response delay (e.g., 200ms–2s) when a consumer hits 80% of their limit. The client experiences degraded performance before hitting the hard wall.
Queue & Retry
Accept the request, queue it, drain at allowed rate. Gateway returns 202 Accepted immediately with a job ID; client polls for completion. Requires async architecture. Ideal for expensive operations (AI inference, report generation).
Priority Queue
Tiered queue with guaranteed SLAs per plan. Enterprise tier gets dedicated queue capacity. Free tier requests wait behind paid tier. Implement with Redis Sorted Sets scored by consumer tier + arrival time.
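The priority queue idea can be sketched in memory with a heap ordered by tier, then arrival. The tier ranks and class name are hypothetical; with Redis Sorted Sets the same ordering would be encoded into the member's score.

```python
import heapq
import itertools

# Lower rank drains first; ranks are illustrative, not from any product.
TIER_RANK = {"enterprise": 0, "business": 1, "developer": 2, "free": 3}


class TieredQueue:
    """Strict-priority queue: tier first, arrival order within a tier."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def push(self, request_id, tier):
        rank = TIER_RANK.get(tier, TIER_RANK["free"])
        heapq.heappush(self._heap, (rank, next(self._arrival), request_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = TieredQueue()
queue.push("a", "free")
queue.push("b", "business")
queue.push("c", "free")
queue.push("d", "enterprise")
drained = [queue.pop() for _ in range(len(queue))]
print(drained)
```

Strict priority can starve the free tier under sustained paid load, so a production design usually reserves some drain capacity for lower tiers or ages waiting requests upward.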
YAML — Kong request queuing (pre-function queue-depth check)
# Pattern: return 202 immediately, process async
# Requires: custom Kong plugin or upstream async handler
plugins:
  - name: pre-function              # Lua to check queue depth; returns 503 when the queue is full
    config:
      access:
        - |
          -- Assumes the untrusted_lua settings allow requiring resty.redis here
          local redis = require "resty.redis"
          local red = redis:new()
          red:set_timeouts(100, 100, 100)   -- connect, send, read (ms)
          local ok = red:connect("redis.internal", 6379)
          if ok then
            local depth = red:get("queue:ai-inference:depth")
            red:set_keepalive(10000, 100)
            if depth ~= ngx.null and (tonumber(depth) or 0) > 500 then
              kong.response.set_header("Retry-After", "30")
              return kong.response.exit(503, {
                error = "queue_full",
                message = "Queue at capacity. Retry in 30 seconds."
              })
            end
          end
          -- Tag request with consumer tier for priority routing.
          -- pre-function runs before authentication plugins, so move this
          -- tagging to post-function if the tier comes from the consumer.
          local tier = kong.ctx.shared.consumer_tier or "free"
          kong.service.request.set_header("X-Consumer-Tier", tier)
  # X-Queue-Position and X-Estimated-Wait are per-request values, so the
  # upstream async handler (or a custom plugin) has to set them;
  # response-transformer only adds static header values.
07

Quotas & API Plans

// MONETIZATION & ENTITLEMENT ARCHITECTURE

Rate limits protect infrastructure; quotas enforce business entitlements. Quotas are typically longer-lived (daily, monthly) and tied to API products or subscription plans. A free-tier consumer may be rate-limited to 60 req/min AND quota-capped at 10,000 req/month — both apply simultaneously.

Free Tier
60 req/min · 10,000 req/month · No SLA guarantee
No auth for some endpoints; API key required for others. Quota hard-stops at monthly cap. Upgrade prompt embedded in 429 response body.
Developer
300 req/min · 100,000 req/month · 99.5% SLA
OAuth 2.0 required. Access to beta endpoints. Quota rollover: unused 10% carries to next month. Dashboard with usage analytics.
Business
1,000 req/min · 1M req/month · 99.9% SLA
Priority queue. Dedicated gateway pool in Enterprise plans. Burst allowance: up to 3× for 30-second windows. Webhook-based quota alerts.
Enterprise
Custom limits · Unlimited or metered · 99.99% SLA
Per-contract negotiation. BYOK encryption. Dedicated egress IPs. Private gateway deployment. Custom SLAs with financial penalties.
JSON — Apigee API Product quota configuration
Apigee API Product definition (via management API). The quota fields read as 1000 requests per 1 minute. The monthly absolute cap is carried as a custom attribute and enforced by a second Quota policy; burst_multiplier allows a 3x burst for 30 s. operationGroup lists the scoped API operations included in this product. JSON does not allow comments, so these notes sit outside the payload.
{
  "name": "InferenceAPI-Business",
  "displayName": "Inference API — Business Plan",
  "proxies": ["inference-api-v1"],
  "environments": ["production"],
  "quota": "1000",
  "quotaInterval": "1",
  "quotaTimeUnit": "minute",
  "attributes": [
    { "name": "monthly_quota", "value": "1000000" },
    { "name": "burst_multiplier", "value": "3" },
    { "name": "consumer_tier", "value": "business" },
    { "name": "sla_target", "value": "99.9" }
  ],
  "operationGroup": {
    "operationConfigs": [{
      "apiSource": "inference-api-v1",
      "operations": [
        { "resource": "/v1/chat/completions", "methods": ["POST"] },
        { "resource": "/v1/embeddings", "methods": ["POST"] }
      ]
    }]
  }
}
08

Circuit Breakers

// FAIL FAST, RECOVER GRACEFULLY

A circuit breaker prevents cascading failures by stopping requests to a failing upstream before those failures propagate. Without a circuit breaker, a slow or failed backend causes gateway threads to block, consuming resources until the gateway itself degrades.

State: CLOSED, Normal Traffic (count failures vs threshold)
→ State: OPEN, Fail Fast 503 (no upstream calls made)
→ State: HALF-OPEN, Probe Upstream (1 req allowed per window)
→ Recovery: Back to CLOSED (if probe succeeds)
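The state machine above can be sketched as a small class. This is a generic illustration, not Kong's or Apigee's implementation; the thresholds, the `clock` injection point, and the method names are assumptions chosen to keep the example testable.

```python
import time


class CircuitBreaker:
    """CLOSED -> OPEN after repeated failures; HALF_OPEN admits one probe."""

    def __init__(self, failure_threshold=5, open_secs=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_secs = open_secs
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.open_secs:
                self.state = "HALF_OPEN"  # cool-down over: let one probe through
                return True
            return False                  # fail fast, no upstream call
        if self.state == "HALF_OPEN":
            return False                  # a probe is already in flight
        return True

    def record_success(self):
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

The caller wraps each upstream call: check `allow_request()`, return 503 immediately if it is false, otherwise proxy the request and report the outcome with `record_success()` or `record_failure()`.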
YAML — Kong upstream circuit breaker + passive health checks
upstreams:
  - name: payments-upstream
    algorithm: round-robin
    slots: 10000
    healthchecks:
      passive:                        # Tracks real traffic, no extra probes
        healthy:
          http_statuses: [200, 201, 204]
          successes: 1                # counted on live traffic only; a target marked
                                      # down receives none, so recovery relies on the
                                      # active probe below
        unhealthy:
          http_statuses: [500, 502, 503, 504]
          http_failures: 5            # 5 failures → mark node down
          timeouts: 3                 # 3 timeouts also mark it down
      active:                         # Periodic health probe
        type: http
        http_path: /health
        healthy:
          interval: 10                # probe every 10s
          successes: 2                # 2 successes to mark healthy
        unhealthy:
          interval: 5                 # probe sick nodes more often
          http_failures: 3
    targets:
      - target: payments-1.internal:8080
        weight: 100
      - target: payments-2.internal:8080
        weight: 100
09

Authentication Patterns

// API KEYS, OAUTH 2.0, JWT, AND SAML
Method | Best For | Revocability | Gateway Support
API Key | Machine-to-machine, simple integrations, public APIs | MANUAL: must be rotated explicitly | UNIVERSAL: all gateways
OAuth 2.0 + PKCE | User-delegated access, mobile/web apps | REAL-TIME: token revocation via introspection | UNIVERSAL
JWT (stateless) | High-throughput internal auth, claims-bearing tokens | HARD: requires blocklist or short TTL | UNIVERSAL
mTLS | Service-to-service, highest assurance, zero-trust | REAL-TIME: CRL/OCSP revocation | UNIVERSAL
HMAC Signature | Webhook validation, financial APIs, tamper-proof requests | MEDIUM: key rotation required | PLUGIN: may require custom
YAML — Kong OAuth 2.0 plugin with PKCE
plugins:
  - name: oauth2
    config:
      scopes: ["read:payments", "write:payments", "read:profile"]
      mandatory_scope: true
      enable_authorization_code: true
      enable_client_credentials: true
      enable_implicit_grant: false        # deprecated, insecure
      enable_password_grant: false        # deprecated, avoid
      token_expiration: 3600              # 1 hour access token
      refresh_token_ttl: 1209600          # 14 days refresh token
      pkce: lax                           # PKCE required for public clients; "strict" requires it for all
      accept_http_if_already_terminated: false   # reject non-TLS

  # Stateless validation of short-lived JWTs (signature and claims; no introspection call)
  - name: jwt
    config:
      uri_param_names: []                 # never accept token in URL, header only
      cookie_names: []
      claims_to_verify: ["exp", "nbf"]    # the plugin accepts only exp and nbf here
      maximum_expiration: 3600            # reject tokens > 1h lifetime
      key_claim_name: "kid"
      run_on_preflight: false
10

mTLS & Zero Trust at the Gateway

// MUTUAL AUTHENTICATION — BOTH SIDES PROVE IDENTITY

Mutual TLS (mTLS) extends standard TLS by requiring the client to present a certificate, not just the server. At the gateway, this means only clients holding a certificate signed by your corporate CA can establish connections — credential theft alone is insufficient to gain access.

YAML — Kong mTLS plugin + client cert validation
# 1. Upload CA certificate to Kong (shell)
curl -X POST http://localhost:8001/ca_certificates \
  --data-urlencode "cert=$(cat /path/to/corporate-ca.pem)"

# 2. Configure service to require client cert (declarative YAML)
services:
  - name: high-assurance-api
    url: https://backend.internal:8443
    # Gateway → Backend: also mTLS (gateway presents its own cert)
    tls_verify: true
    ca_certificates: ["<ca-cert-uuid>"]
    client_certificate: "<gateway-cert-uuid>"
    plugins:
      - name: mtls-auth                   # Client → Gateway: mTLS
        config:
          ca_certificates: ["<ca-cert-uuid>"]
          certificate_bound_access_tokens: true   # bind OAuth token to cert
          revocation_check_mode: IGNORE_CA_ERROR
          skip_consumer_lookup: false     # map cert CN to Kong consumer
          authenticated_group_by: DN      # group by Distinguished Name
      - name: acl
        config:
          allow: ["tier:enterprise", "service:internal"]   # cert-derived groups
Certificate-Bound Tokens (RFC 8705): Combine OAuth 2.0 with mTLS by binding access tokens to the client certificate thumbprint. The token is useless without the certificate, so a stolen token alone doesn't grant access. Enable in Kong with certificate_bound_access_tokens: true. FAPI 2.0 (Financial-grade API) requires sender-constrained tokens, and certificate binding is one of the two accepted mechanisms (the other is DPoP).
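As a rough illustration of the binding check, RFC 8705 carries the certificate's SHA-256 thumbprint in the token's `cnf` claim under `x5t#S256` (base64url, no padding, computed over the DER encoding). The sketch below assumes the token has already been verified and decoded into a `claims` dict and that the gateway has the presented certificate as DER bytes; the function names are hypothetical.

```python
import base64
import hashlib
import hmac


def cert_thumbprint_s256(cert_der: bytes) -> str:
    """base64url(SHA-256(DER certificate)) without padding."""
    digest = hashlib.sha256(cert_der).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def token_bound_to_cert(claims: dict, cert_der: bytes) -> bool:
    """True only if the token's cnf thumbprint matches the presented cert."""
    expected = claims.get("cnf", {}).get("x5t#S256")
    if not expected:
        return False  # unbound token: reject on a cert-bound route
    return hmac.compare_digest(expected, cert_thumbprint_s256(cert_der))


# Placeholder bytes stand in for a real DER-encoded certificate.
demo_cert = b"placeholder-der-bytes"
demo_claims = {"sub": "acme-corp", "cnf": {"x5t#S256": cert_thumbprint_s256(demo_cert)}}
print(token_bound_to_cert(demo_claims, demo_cert))
```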
11

Threat Protection

// WAF, INJECTION, SCHEMA VALIDATION
Injection Defense Critical
  • Validate all inputs against strict JSON Schema before routing
  • Block SQLi, XSS, SSRF, path traversal via regex deny-lists at gateway
  • Reject unexpected content types: application/x-www-form-urlencoded should never reach a JSON API
  • Limit request body depth (JSON bomb protection — max 5 levels)
Resource Exhaustion DoS
  • Max request body size: enforce at gateway, not backend
  • Max URL length: 2048 chars (reject longer)
  • Max header count: 100; max header size: 8KB
  • Max query string parameters: 50
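  • Timeout cascade: timeouts should shrink at each hop (client > gateway > backend) so downstream work is abandoned before its caller gives up, which prevents orphaned work and connection exhaustion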
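The nesting-depth limit from the injection-defense list can be sketched as below. It assumes the body-size limit has already been applied, counts only objects and arrays as levels, and treats unparseable input as rejected; the function names are hypothetical.

```python
import json


def json_depth(value, limit):
    """Iteratively measure container nesting, stopping once limit is passed."""
    stack = [(value, 1)]
    deepest = 0
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            continue  # scalars add no depth
        deepest = max(deepest, depth)
        if depth > limit:
            return deepest  # early exit: already too deep
        stack.extend((child, depth + 1) for child in children)
    return deepest


def within_depth(raw_body, max_depth=5):
    """Reject bodies that are invalid JSON or nested deeper than max_depth."""
    try:
        parsed = json.loads(raw_body)
    except (ValueError, RecursionError):
        return False
    return json_depth(parsed, max_depth) <= max_depth


print(within_depth('{"a": {"b": {"c": [1, 2, 3]}}}'))
```

Parsing before checking means an extreme payload still costs a parse, which is why the size limit belongs ahead of this check in the pipeline.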
YAML — Kong request-validator + bot detection
# JSON Schema validation: reject malformed payloads at gateway layer
# Schema notes (JSON itself cannot carry comments):
#   model: allowlist only
#   messages.maxItems 50: prevent context stuffing
#   content.maxLength 32000: roughly 8k tokens
#   additionalProperties false: reject unknown fields
plugins:
  - name: request-validator
    config:
      version: draft4                 # JSON Schema mode; the default "kong" mode uses Kong's own schema format
      body_schema: |
        {
          "type": "object",
          "required": ["model", "messages"],
          "properties": {
            "model": {
              "type": "string",
              "enum": ["gpt-4o", "claude-3-7-sonnet-20250219"]
            },
            "messages": {
              "type": "array",
              "maxItems": 50,
              "items": {
                "type": "object",
                "required": ["role", "content"],
                "properties": {
                  "role": { "type": "string", "enum": ["user", "assistant", "system"] },
                  "content": { "type": "string", "maxLength": 32000 }
                }
              }
            },
            "max_tokens": { "type": "integer", "minimum": 1, "maximum": 4096 },
            "temperature": { "type": "number", "minimum": 0, "maximum": 2 }
          },
          "additionalProperties": false
        }
      allowed_content_types: ["application/json"]
      verbose_response: false         # don't leak schema in errors

  # Bot detection: allow and deny are regexes matched against the User-Agent header
  - name: bot-detection
    config:
      allow: []
      deny: ["generic-bot", "crawler", "scanner"]
12

AI Gateway Patterns

// SAFELY EXPOSING AI TOOLS TO THE OUTSIDE WORLD

Exposing AI model endpoints externally introduces risks that standard API gateway patterns don't fully cover. LLMs are non-deterministic, computationally expensive, and susceptible to prompt injection, data leakage, and jailbreaking. An AI gateway layer adds controls specific to this threat model, sitting between your consumers and the model API.

Consumer: API Request (external app or user)
→ Gateway Layer 1: Auth + Rate Limit + Quota (standard policies)
→ Gateway Layer 2: AI-Specific Controls (token limit, prompt guard)
→ Model Router: Semantic Cache + Model Select (cost control, fallback)
→ LLM API: OpenAI · Claude · Gemini · Self-host (actual inference)

AI Gateway Capabilities

Semantic Caching Cost Control

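Cache LLM responses by semantic similarity of prompts, not exact string match. If the embedding of a new prompt has a cosine similarity above a threshold (e.g., 0.95) to a cached query, return the cached response. This can reduce inference costs by 30–70% for repetitive query patterns (FAQs, code completion, classification), depending on workload and threshold. Set the threshold conservatively: a loose match returns an answer to a different question.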
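A toy version of the lookup, to make the threshold logic concrete. `toy_embed` is a stand-in letter-frequency vector, not a real embedding model, and the linear scan replaces what a vector database would do; every name here is hypothetical.

```python
import math
import string


def toy_embed(text):
    """Stand-in embedding: letter-frequency vector (NOT a real model)."""
    lowered = text.lower()
    return [lowered.count(ch) for ch in string.ascii_lowercase]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Serve from cache only when the closest entry clears the threshold.
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(toy_embed, threshold=0.95)
cache.put("What is your refund policy?", "Refunds within 30 days.")
print(cache.get("what is your refund policy"))
print(cache.get("zzz qqq xxx"))
```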

Tools: Kong AI Gateway, LiteLLM, Portkey, GPTCache

Model Routing & Fallback Resilience

Route to cheaper models for simple tasks; escalate to frontier models for complex ones. If primary model is unavailable (429, 503), auto-fallback to alternative — with model mapping for API compatibility. Maintain per-model rate limits separately.

Pattern: GPT-4o-mini → GPT-4o → Claude as fallback chain
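The fallback chain can be sketched as a loop that moves to the next model only on retryable statuses. `call_model` is a caller-supplied function and `UpstreamError` a hypothetical exception type; neither corresponds to a specific SDK.

```python
class UpstreamError(Exception):
    """Hypothetical error carrying the upstream HTTP status."""

    def __init__(self, status):
        super().__init__(f"upstream returned {status}")
        self.status = status


RETRYABLE = {429, 503}  # the statuses named above as fallback triggers


def call_with_fallback(chain, prompt, call_model):
    """Try each model in order; return (model_used, response)."""
    if not chain:
        raise ValueError("fallback chain is empty")
    last_error = None
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except UpstreamError as err:
            if err.status not in RETRYABLE:
                raise  # e.g. 400: falling back would not help
            last_error = err
    raise last_error  # every model in the chain was unavailable


def demo_call(model, prompt):
    # Simulated provider: the first model is overloaded.
    if model == "primary-model":
        raise UpstreamError(503)
    return f"{model} answered: {prompt}"


print(call_with_fallback(["primary-model", "backup-model"], "hello", demo_call))
```

Recording which model actually served each request keeps the per-model rate limits and the fallback-frequency metric accurate.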

PII Scrubbing Compliance

Detect and redact PII (names, emails, SSNs, credit cards, phone numbers) from prompts before sending to external model APIs. Use regex + NER model in the gateway pipeline. Log the redaction event but never the original PII.

Output Filtering Safety

Inspect LLM responses before returning to consumer. Strip: PII leakage, hallucinated credentials, toxic content, competitor mentions (per policy). Apply regex + classifier in the response pipeline. Log filtered content for fine-tuning feedback.

YAML — Kong AI Gateway plugin configuration
# Kong AI Gateway (ai-proxy from Kong 3.6; the other AI plugins arrived in later releases)
# Field names for ai-semantic-cache and ai-rate-limiting-advanced are illustrative;
# confirm them against the plugin reference for your Kong version.
plugins:
  - name: ai-proxy
    config:
      route_type: llm/v1/chat
      auth:
        header_name: Authorization
        # A vault reference must be the entire field value, so the stored
        # secret includes the "Bearer " prefix.
        header_value: "{vault://secrets/openai-key}"
      model:
        provider: openai
        name: gpt-4o
        options:
          max_tokens: 2048
          temperature: 0.7
          input_cost: 2.5               # USD per 1M input tokens (for metering)
          output_cost: 10.0             # USD per 1M output tokens

  - name: ai-semantic-cache
    config:
      embeddings_provider: openai
      embeddings_model: text-embedding-3-small
      similarity_threshold: 0.95        # 0.95+ similarity = cache hit
      cache_ttl: 3600
      vectordb:
        strategy: redis
        redis:
          host: redis.internal
          port: 6379

  - name: ai-prompt-guard
    config:
      allow_patterns: []
      deny_patterns:                    # plain list of regex strings
        - "ignore previous instructions"
        - "DAN mode"
        - "jailbreak"
        - "\\bSSN\\b.*\\d{3}-\\d{2}-\\d{4}"     # SSN in prompt
        - "\\b4[0-9]{12}(?:[0-9]{3})?\\b"       # Visa card number
      match_all_roles: true

  - name: ai-rate-limiting-advanced
    config:
      limit_by: consumer
      limits:
        - tokens_per_minute: 100000     # input + output tokens
        - tokens_per_hour: 1000000
        - requests_per_minute: 60
13

LLM-Specific Security Risks

// OWASP LLM TOP 10 AT THE GATEWAY

The OWASP Top 10 for LLM Applications defines risks specific to AI systems. The gateway is the first — and most scalable — line of defense against several of these, though some require application-layer mitigations.

Threat Coverage Matrix

Risk | Gateway Can Block | App-Layer Also Needed | Severity
LLM01: Prompt Injection | Partial (regex deny-lists) | Input sanitization, system prompt hardening | CRITICAL
LLM02: Insecure Output Handling | Response content filtering | Output encoding before rendering | HIGH
LLM03: Training Data Poisoning | Not applicable at gateway | Model provenance, red-teaming | MEDIUM
LLM04: Model DoS | Token limits, rate limiting, request size | Backend timeouts | HIGH
LLM06: Sensitive Information Disclosure | PII scrubbing in request + response | RAG access controls, data classification | CRITICAL
LLM07: Insecure Plugin Design | Tool/function allowlisting in request schema | Capability sandboxing, principle of least privilege | CRITICAL
LLM09: Overreliance | Not applicable at gateway | Human-in-the-loop for high-stakes decisions | MEDIUM
Tool/Function call allowlisting: If you expose function-calling or tool-use APIs, the gateway MUST validate that tools arrays in requests only reference approved function names and schemas. An attacker crafting a malicious tool definition can invoke arbitrary code if your application naively executes whatever the model returns. Treat every LLM output as untrusted.
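A sketch of the allowlist check, assuming an OpenAI-style request body where each entry in `tools` looks like `{"type": "function", "function": {"name": ..., "parameters": {...}}}`. The tool names and permitted parameters in `ALLOWED_TOOLS` are hypothetical.

```python
# Hypothetical allowlist: tool name -> parameter names it may declare.
ALLOWED_TOOLS = {
    "get_weather": {"city", "units"},
    "lookup_order": {"order_id"},
}


def validate_tools(body):
    """Return a list of violations; an empty list means the request passes."""
    errors = []
    tools = body.get("tools", [])
    if not isinstance(tools, list):
        return ["tools must be an array"]
    for tool in tools:
        function = tool.get("function", {}) if isinstance(tool, dict) else {}
        name = function.get("name")
        if name not in ALLOWED_TOOLS:
            errors.append(f"tool not allowed: {name!r}")
            continue
        declared = set(function.get("parameters", {}).get("properties", {}))
        extra = declared - ALLOWED_TOOLS[name]
        if extra:
            errors.append(f"unexpected parameters for {name}: {sorted(extra)}")
    return errors


request_body = {
    "tools": [
        {"type": "function", "function": {"name": "get_weather",
            "parameters": {"properties": {"city": {"type": "string"}}}}},
        {"type": "function", "function": {"name": "run_shell",
            "parameters": {"properties": {"cmd": {"type": "string"}}}}},
    ]
}
print(validate_tools(request_body))
```

This only constrains what the request declares. The application still has to validate the arguments the model returns before executing anything.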
14

Token-Aware Rate Limiting

// REQUEST COUNT IS THE WRONG METRIC FOR LLMs

Standard rate limiting counts requests. For LLM APIs, two requests can differ by 1000× in computational cost — a 100-token prompt vs a 128,000-token context window. Token-based limiting controls actual resource consumption, not request count.

Python — Token-aware gateway middleware
import time

import tiktoken
import redis
from fastapi import Request, HTTPException

enc = tiktoken.encoding_for_model("gpt-4o")
# Blocking client kept for brevity; redis.asyncio avoids stalling the event loop.
r = redis.Redis(host="redis.internal", decode_responses=True)

TOKEN_LIMITS = {
    "free":      {"minute": 10_000,  "day": 100_000},
    "developer": {"minute": 50_000,  "day": 1_000_000},
    "business":  {"minute": 200_000, "day": 10_000_000},
}
WINDOW_SECS = {"minute": 60, "day": 86400}


async def token_limit_middleware(request: Request, consumer_id: str, tier: str):
    body = await request.json()

    # Pre-flight: count input tokens BEFORE sending to model
    messages_text = " ".join(
        m["content"] for m in body.get("messages", [])
        if isinstance(m.get("content"), str)
    )
    input_tokens = len(enc.encode(messages_text))

    # Add max_tokens from request as expected output cost
    max_output = body.get("max_tokens", 1024)
    expected_cost = input_tokens + max_output

    limits = TOKEN_LIMITS.get(tier, TOKEN_LIMITS["free"])
    now = int(time.time())
    # The window index is part of the key, so each counter expires on its own
    # instead of having its TTL refreshed by every request.
    keys = {
        window: f"tokens:{consumer_id}:{window}:{now // secs}"
        for window, secs in WINDOW_SECS.items()
    }

    # Reserve first (INCRBY is atomic), then roll back if a window overflows.
    # A separate read-then-increment would let concurrent requests overshoot.
    pipe = r.pipeline()
    for window, key in keys.items():
        pipe.incrby(key, expected_cost)
        pipe.expire(key, WINDOW_SECS[window])
    results = pipe.execute()
    totals = dict(zip(keys, results[0::2]))  # INCRBY replies sit at even positions

    for window, total in totals.items():
        limit = limits[window]
        if total > limit:
            undo = r.pipeline()
            for key in keys.values():
                undo.decrby(key, expected_cost)
            undo.execute()
            used = total - expected_cost
            raise HTTPException(
                status_code=429,
                detail={
                    "error": "token_limit_exceeded",
                    "window": window,
                    "limit": limit,
                    "used": used,
                    "request_cost": expected_cost,
                    "remaining": max(0, limit - used),
                },
            )

    # After response: reconcile actual usage (tokens_used from API response)
    # Subtract over-estimate if model returned fewer tokens than max_tokens
    return input_tokens, max_output
15

Prompt Inspection & Guardrails

// DETECT INJECTION, JAILBREAKS, AND PII BEFORE INFERENCE

Prompt inspection at the gateway layer is the first — and cheapest — defense against malicious inputs. Combine rule-based pattern matching (fast, low cost) with classifier-based detection (higher accuracy, higher latency) tiered by risk level.

Python — Prompt inspection pipeline
import re
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"           # log + continue
    SANITIZE = "sanitize"   # redact PII, then continue
    BLOCK = "block"         # reject with 400


INJECTION_PATTERNS = [
    # Direct jailbreak attempts
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions?", re.I),
    re.compile(r"(you are|act as|pretend to be)\s+(dan|evil|unrestricted)", re.I),
    re.compile(r"(DAN|jailbreak|GPT-4-DAN|STAN)\s+(mode|prompt)", re.I),
    # System prompt extraction
    re.compile(r"(reveal|print|repeat|output)\s+(your\s+)?(system|initial)\s+(prompt|instruction)", re.I),
    # Indirect injection via data
    re.compile(r"<(s|INST|SYS|SYSTEM|human|assistant)>"),   # model-specific control tokens
    re.compile(r"\[\[HUMAN\]\]|\[\[ASSISTANT\]\]"),
]

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Visa (13 or 16 digits), Mastercard, Amex
    "credit_card": re.compile(r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b"),
    # "|" inside a character class is a literal, so the TLD class is [A-Za-z]
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "api_key": re.compile(r"\b(sk-[a-zA-Z0-9]{32,}|ghp_[a-zA-Z0-9]{36})\b"),
}


def inspect_prompt(text: str) -> tuple[Verdict, str, dict]:
    # 1. Injection check: block immediately
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return Verdict.BLOCK, text, {"reason": "prompt_injection_detected"}

    # 2. PII check: sanitize, don't block
    sanitized = text
    found_pii = []
    for pii_type, pattern in PII_PATTERNS.items():
        if pattern.search(sanitized):
            sanitized = pattern.sub(f"[REDACTED:{pii_type.upper()}]", sanitized)
            found_pii.append(pii_type)

    if found_pii:
        return Verdict.SANITIZE, sanitized, {"redacted": found_pii}

    return Verdict.ALLOW, text, {}
Classifier-based detection: Regex is fast but easily evaded with Unicode homoglyphs, zero-width characters, or paraphrasing, so normalize input before matching. For high-security deployments, add a secondary ML classifier (LlamaGuard, ShieldLM, or a fine-tuned BERT) that evaluates semantics, not syntax. Run classifiers asynchronously and cache verdicts by prompt hash to keep the added latency low (target: under 50 ms).
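The two mitigations named above, normalizing input and caching verdicts by prompt hash, can be sketched as follows. This is a minimal illustration, not a hardened implementation: `normalize_prompt`, `VerdictCache`, and the `classify` callable are hypothetical names, the list of invisible characters is deliberately short, and the cache is an unbounded in-process dict where a real deployment would more likely use a shared store with a TTL.

```python
import hashlib
import unicodedata

# Zero-width and invisible characters often used to split trigger words (illustrative, not exhaustive).
_INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize_prompt(text: str) -> str:
    """Fold compatibility forms (e.g. fullwidth letters) and drop invisible characters."""
    return unicodedata.normalize("NFKC", text).translate(_INVISIBLE)


class VerdictCache:
    """In-memory verdict cache keyed by the SHA-256 of the normalized prompt."""

    def __init__(self, classify):
        self._classify = classify  # hypothetical callable: str -> str verdict
        self._verdicts = {}

    def verdict(self, text: str) -> str:
        normalized = normalize_prompt(text)
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in self._verdicts:
            # Cache miss: pay the classifier latency once per distinct prompt.
            self._verdicts[key] = self._classify(normalized)
        return self._verdicts[key]
```

Running the regex tier on `normalize_prompt(text)` rather than the raw text means that prompts differing only by zero-width padding or fullwidth letters hit the same patterns and the same cache entry.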
16

Observability

// METRICS, TRACES, LOGS — THE GATEWAY TELEMETRY STACK
Golden Signals
  • Latency — p50/p95/p99 per route
  • Traffic — req/sec, tokens/sec
  • Errors — 4xx/5xx rates by consumer
  • Saturation — connection pool, CPU
AI-Specific Metrics
  • Input / output tokens per minute
  • Token cost (USD) per consumer
  • Cache hit rate (semantic cache)
  • Prompt injection block rate
  • Model fallback frequency
  • Time-to-first-token (streaming)
Alert Thresholds
  • p99 latency > 3s: PagerDuty P2
  • Error rate > 5% (5m): P1
  • 429 rate > 20%: quota tuning
  • Token cost anomaly > 3σ: billing alert
  • Cache hit rate < 20%: cache review
YAML — Prometheus + Grafana alert rules for AI gateway
# prometheus/rules/ai-gateway.yml
groups:
  - name: ai_gateway
    rules:
      - alert: HighTokenCostAnomaly
        expr: |
          rate(ai_gateway_tokens_total{direction="output"}[5m])
            > 3 * avg_over_time(
              rate(ai_gateway_tokens_total{direction="output"}[5m])[1h:5m]
            )
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Token usage spike detected, possible prompt stuffing"

      - alert: PromptInjectionSpike
        # rate() is per second; multiply by 60 to compare against a per-minute threshold
        expr: |
          rate(ai_gateway_prompt_blocked_total[5m]) * 60 > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: ">10 prompt injections/min, possible coordinated attack"

      - alert: ModelFallbackHigh
        expr: |
          rate(ai_gateway_model_fallback_total[10m])
            / rate(ai_gateway_requests_total[10m]) > 0.1
        for: 5m
        annotations:
          summary: ">10% of requests falling back, primary model degraded"

      - alert: SemanticCacheHitRateLow
        expr: |
          rate(ai_gateway_cache_hit_total[1h])
            / rate(ai_gateway_requests_total[1h]) < 0.2
        for: 30m
        annotations:
          summary: "Cache hit rate <20%, review embedding similarity threshold"
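The Alert Thresholds list specifies a 3σ token cost anomaly, while the `HighTokenCostAnomaly` rule compares against three times the hourly average. For a billing job that wants the standard-deviation form, a minimal sketch might look like this; `is_cost_anomaly` and its inputs are hypothetical, and it assumes per-interval cost samples have already been collected for the consumer.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """True when `current` exceeds the historical mean by more than `sigmas` standard deviations."""
    if len(history) < 2:
        return False  # not enough samples to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu  # flat baseline: treat any increase as anomalous
    return (current - mu) / sigma > sigmas
```

Only upward deviations are flagged, since a drop in spend is not a billing risk.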
17

Policy-as-Code

// OPA, CEDAR, AND DECLARATIVE GATEWAY CONFIG

Gateway policies should be version-controlled, tested, and deployed through CI/CD — not configured manually via UIs. For complex authorization logic, offload to a policy engine (OPA, Cedar) that the gateway calls synchronously at the enforcement point.

Rego — OPA policy for AI API authorization
# policy/ai_api.rego
package ai.api.authz

import future.keywords

# Default: deny all
default allow := false

# Consumers may only use models in their tier's allowlist
allow if {
    requested_model := input.request.body.model
    consumer_tier := input.consumer.tier
    allowed_models[consumer_tier][requested_model]
}

allowed_models := {
    "free": {"gpt-4o-mini": true, "text-embedding-3-small": true},
    "developer": {"gpt-4o-mini": true, "gpt-4o": true, "text-embedding-3-large": true},
    "business": {"gpt-4o": true, "claude-3-7-sonnet-20250219": true, "o3": true},
    "enterprise": {"_all": true},
}

# Enterprise override: any model allowed
allow if {
    input.consumer.tier == "enterprise"
}

# Block tool/function calls for free tier (prevents code execution)
deny contains reason if {
    input.consumer.tier == "free"
    count(input.request.body.tools) > 0
    reason := "Free tier does not support function calling"
}

# Enforce max_tokens ceiling per tier
deny contains reason if {
    input.request.body.max_tokens > max_tokens_ceiling[input.consumer.tier]
    reason := sprintf("max_tokens exceeds tier ceiling of %v", [max_tokens_ceiling[input.consumer.tier]])
}

max_tokens_ceiling := {
    "free": 512,
    "developer": 4096,
    "business": 16384,
    "enterprise": 128000,
}
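On the gateway side, the synchronous call to a policy like this one can be sketched against OPA's REST Data API (`POST /v1/data/<package path>` with an `input` document). The sketch below makes several assumptions: OPA runs as a local sidecar on port 8181, the 200 ms timeout is a placeholder, and a request is admitted only when `allow` is true and `deny` is empty. The policy keeps those two rules independent, so combining them is the caller's choice. The helper names are hypothetical.

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181/v1/data/ai/api/authz"  # assumed local OPA sidecar


def build_input(tier: str, body: dict) -> dict:
    """Shape the document the way the policy reads it: input.consumer.tier, input.request.body."""
    return {"input": {"consumer": {"tier": tier}, "request": {"body": body}}}


def decide(result: dict) -> tuple[bool, list[str]]:
    """Admit only when allow is true and no deny reason fired."""
    reasons = result.get("deny", [])
    if isinstance(reasons, dict):  # a map-style deny rule serializes as a JSON object
        reasons = list(reasons)
    reasons = sorted(reasons)
    return bool(result.get("allow", False)) and not reasons, reasons


def authorize(tier: str, body: dict, timeout: float = 0.2) -> tuple[bool, list[str]]:
    """Query OPA synchronously; network errors propagate so the caller can fail closed."""
    req = urllib.request.Request(
        OPA_URL,
        data=json.dumps(build_input(tier, body)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return decide(json.load(resp).get("result", {}))
```

For example, `authorize("free", {"model": "gpt-4o-mini", "max_tokens": 256})` would return the decision and any deny reasons, which the gateway can surface in the error body.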
CI Pipeline for Gateway Config: Store all Kong/Apigee config in Git → lint with deck validate or apigeelint → run policy unit tests (opa test) → diff against prod with deck diff → deploy via deck sync in CD pipeline. Never mutate production via Admin API manually — treat it as infra drift.
18

Quick Reference

// STATUS CODES, LIMITS, COMMANDS, CHEATSHEET

HTTP Status Code Guide for Gateway

Code | Meaning | When Gateway Returns It
400 | Bad Request | Schema validation failure, malformed JSON, disallowed parameters
401 | Unauthorized | Missing or malformed credentials (no Authorization header)
403 | Forbidden | Valid credentials, but insufficient scope or ACL denial
404 | Not Found | Route not matched, or dark SDP hiding the endpoint
408 | Request Timeout | Client took too long to send request body
413 | Payload Too Large | Request body exceeds allowed_payload_size
422 | Unprocessable Entity | Syntactically valid JSON but semantically invalid (OPA denial)
429 | Too Many Requests | Rate limit or quota exceeded; include Retry-After
502 | Bad Gateway | Upstream returned an invalid response
503 | Service Unavailable | All upstream nodes unhealthy (circuit open); queue full
504 | Gateway Timeout | Upstream did not respond within read_timeout
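As an illustration of the 429 row, the following sketch computes a `Retry-After` value for a fixed-window limiter. It assumes windows aligned to epoch time and a caller that already tracks how many requests were used in the current window; `throttle_response` is a hypothetical helper, not a Kong or Apigee API.

```python
import math
from typing import Optional


def throttle_response(limit: int, used: int, window_s: int, now: float) -> Optional[tuple[int, dict[str, str]]]:
    """Return (429, headers) when the window's quota is spent, else None to let the request through."""
    if used < limit:
        return None
    reset_in = window_s - (now % window_s)  # seconds until the current window ends
    # Retry-After takes whole seconds; round up so clients never retry early.
    return 429, {"Retry-After": str(math.ceil(reset_in))}
```

With a 60-second window and `now = 125.0`, the window is 5 seconds old, so a throttled client is told to wait 55 seconds.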
Kong CLI Cheatsheet
# Validate declarative config
deck validate -s deck.yaml

# Diff current vs what would be synced
deck diff -s deck.yaml

# Apply configuration
deck sync -s deck.yaml

# Dump current config to file
deck dump -o current.yaml

# Live reload (no downtime)
kong reload

# Check gateway status
curl -s localhost:8001/status | jq .

# List configured plugins and whether each is enabled
# (execution priority is fixed per plugin and is not returned by this endpoint)
curl -s localhost:8001/plugins | jq '[.data[] | {name, enabled}]'
Apigee CLI Cheatsheet
# Authenticate
gcloud auth application-default login

# Deploy a proxy bundle
apigeecli apis create bundle \
  --name my-api --proxy ./apiproxy \
  --org $ORG --env production

# List deployments
apigeecli deployments list \
  --org $ORG --env production

# Create API product
apigeecli products create \
  --name "AI-Business" \
  --proxies "inference-api-v1" \
  --quota 1000 --interval 1 \
  --unit minute --org $ORG

# Check proxy traffic (last 1h)
apigeecli stats get \
  --org $ORG --env production \
  --dims apiproxy --metric message_count

AI Gateway Configuration Checklist

Checklist — AI endpoint exposure
## Authentication & Authorization
  • OAuth 2.0 with PKCE or mTLS for all consumers
  • Scopes defined per capability (e.g., ai:chat, ai:embed, ai:tools)
  • OPA/Cedar policy: model allowlist per consumer tier
  • Tool/function calling disabled for free tier

## Traffic Control
  • Request-count rate limit: per minute AND per hour (per consumer)
  • Token-count rate limit: input+output tokens per minute AND per day
  • Max request body size: 1–10 MB depending on use case
  • max_tokens ceiling enforced per tier (schema validation)
  • Spike arrest: global burst cap before per-consumer limits
  • Circuit breaker: passive + active health checks on model API

## Prompt Security
  • JSON Schema validation: allowlist known fields, reject additionalProperties
  • Prompt injection pattern matching (regex deny-list)
  • PII scrubbing in request: SSN, credit card, email, API keys
  • PII scrubbing in response: model may hallucinate or echo PII
  • Tool/function call schema validation: only approved function names
  • Context depth limit: max messages array length

## Cost Control
  • Semantic cache enabled with similarity threshold
  • Token cost metered per consumer (input + output separately)
  • Billing alert on anomalous token spend (>3σ)
  • Model fallback chain configured (expensive → cheap on 429/503)

## Observability
  • Distributed tracing: inject trace-id into all upstream requests
  • Token usage exported to Prometheus/Datadog per consumer
  • Prompt injection block rate alerting
  • Per-model error rates and latency histograms
  • Audit log: every request logged (consumer, model, tokens, verdict)
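The "model fallback chain" item in the checklist can be illustrated with a short sketch. It assumes a hypothetical `send` callable that forwards the request to a named model and returns `(status, body)`, and it falls through to the next model only on the two statuses the checklist names, 429 and 503.

```python
from typing import Callable

RETRYABLE = {429, 503}  # statuses that trigger fallback, per the checklist


def call_with_fallback(chain: list[str], send: Callable[[str], tuple[int, str]]) -> tuple[str, int, str]:
    """Try each model in order; move to the next only on a retryable status."""
    if not chain:
        raise ValueError("fallback chain must not be empty")
    for model in chain:
        status, body = send(model)
        if status not in RETRYABLE:
            break  # success or a non-retryable error: stop here
    # If every model was throttled or unavailable, the last attempt is returned.
    return model, status, body
```

Returning the model name alongside the response lets the gateway record which model actually served the request, which feeds the model fallback frequency metric.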
Reference Resources: Kong documentation at docs.konghq.com · Apigee docs at cloud.google.com/apigee/docs · OWASP LLM Top 10 at owasp.org/www-project-top-10-for-large-language-model-applications · RFC 6585 (429 Status Code) · RFC 8705 (Certificate-Bound Tokens) · NIST AI RMF at airc.nist.gov · LiteLLM AI gateway at litellm.ai