API Gateway Patterns — Field Handbook
// Kong · Apigee · Rate Limiting · Throttling · Safe AI Exposure
The definitive operational guide to API gateway architecture. Covers Kong and Apigee deep-dives, rate limiting algorithms, quota management, threat protection, and the emerging discipline of safely exposing AI tools and LLM endpoints to the outside world.
Kong Gateway
Apigee
Rate Limiting
Throttling
mTLS / Zero Trust
AI Tool Exposure
Prompt Guard
An API gateway is the single entry point through which all external (and increasingly, internal) API traffic passes before reaching your backend services. It is simultaneously a reverse proxy, a policy enforcement point, a traffic shaper, and an observability hub. Unlike a basic load balancer, a gateway understands the HTTP/gRPC/WebSocket application layer and can make intelligent decisions per-request.
Traffic Control
Rate limiting, throttling, load balancing, circuit breaking, retries, and timeouts — all enforced before requests reach your services.
Security Enforcement
Authentication, authorization, mTLS termination, JWT validation, API key management, WAF rules, and threat protection in a centralized layer.
Observability
Unified access logs, distributed traces, latency histograms, and error-rate metrics across all services — without instrumenting each service individually.
Gateway vs. Other Patterns
| Pattern | Layer | Use When | Limitations |
| API Gateway | L7 — app aware | Public APIs, multi-consumer, policy enforcement | Potential single point of failure; latency overhead |
| Load Balancer | L4/L7 basic | Distributing traffic to identical backends | No auth, rate limiting, or transform logic |
| Service Mesh | L7 — east-west | Internal service-to-service (mTLS, observability) | Not designed for north-south external traffic |
| BFF (Backend-for-Frontend) | Application | Client-specific API aggregation | Custom code, not declarative, no shared policy |
| Reverse Proxy (nginx) | L7 basic | Static routing, SSL termination, caching | No dynamic policy, auth plugins, quota management |
Complementary, not competing: A real production stack typically uses all three layers — an API Gateway for north-south traffic (external consumers), a Service Mesh for east-west (internal services), and a Load Balancer underneath each. They serve different segments of the traffic matrix.
Every enterprise gateway processes requests through a plugin/policy pipeline. Knowing the execution order matters — a misconfigured plugin order can bypass authentication, double-charge rate limits, or corrupt telemetry.
Client
Request Arrives
TLS termination
SNI routing
→
Phase 1
Auth & Identity
JWT · API Key
mTLS · OAuth
→
Phase 2
Rate Limit & Quota
Window checks
Token bucket
→
Phase 3
Transform
Header rewrite
Body modify
→
Phase 4
Route & Proxy
Service LB
Circuit break
→
Response
Log & Return
Trace export
Metrics emit
Plugin order is security-critical: Authentication MUST run before rate limiting. If rate limiting runs first, unauthenticated requests consume quota, enabling quota-exhaustion DoS against legitimate users. In Kong, plugins with higher priority values run first, and the priority is defined in each plugin's handler rather than in kong.conf, so verify the effective order for your Kong version (and for any custom plugins) explicitly.
Control Plane vs. Data Plane
Control Plane
Stores gateway configuration — routes, plugins, consumers, credentials, upstream services. In Kong this is the Admin API + database (PostgreSQL or declarative YAML). In Apigee, the management console and API Management APIs.
Separate it. The control plane should never be publicly accessible. Admin API on Kong must be firewalled; Apigee management APIs require service account auth.
Data Plane
The runtime nodes that actually proxy traffic. They pull configuration from the control plane and enforce policies in-process — no database calls per request. Data plane nodes scale horizontally and can run in multiple regions.
Resilience: Configure data planes to operate with cached config if the control plane is unreachable (Kong's declarative mode, Apigee hybrid runtime).
Kong is built on nginx + OpenResty (LuaJIT) with a plugin framework that intercepts requests at each processing phase. Kong Gateway OSS is fully open-source; Kong Enterprise adds RBAC, secrets management, Dev Portal, and audit logs. Configuration can be managed via Admin API, declarative deck sync, or Kubernetes CRDs (Kong Ingress Controller).
# deck.yaml: full service definition with security plugins
# Apply with: deck sync -s deck.yaml
_format_version: "3.0"
services:
  - name: payments-api
    url: http://payments-svc.internal:8080
    connect_timeout: 5000
    read_timeout: 30000
    write_timeout: 30000
    retries: 2
    routes:
      - name: payments-v1
        paths: ["/v1/payments"]
        methods: ["POST", "GET"]
        strip_path: false
        preserve_host: true
    plugins:
      - name: jwt                      # Phase 1: authenticate
        config:
          key_claim_name: kid
          claims_to_verify: [exp, nbf]
          secret_is_base64: false
      - name: rate-limiting            # Phase 2: enforce quota
        config:
          minute: 60                   # 60 req/min per consumer
          hour: 1000
          policy: redis                # shared counter across nodes
          redis_host: redis.internal
          redis_port: 6379
          hide_client_headers: false
          error_code: 429
          error_message: "Rate limit exceeded. Retry-After is in the response header."
      - name: request-size-limiting    # Protect against large payloads
        config:
          allowed_payload_size: 2      # 2 MB max
          size_unit: megabytes
      - name: cors
        config:
          origins: ["https://app.example.com"]
          methods: ["GET", "POST"]
          headers: ["Authorization", "Content-Type"]
          max_age: 3600
      - name: prometheus               # Metrics emission
        config:
          per_consumer: true
      - name: opentelemetry            # Distributed tracing
        config:
          endpoint: "http://otel-collector:4318/v1/traces"
Consumer & Credential Model Core Concept
Kong's authorization unit is the Consumer — a representation of a client app or user. Credentials (API keys, JWTs, OAuth tokens) are attached to consumers. Rate limits, ACLs, and quotas can be scoped per-consumer, allowing different limits for free vs. paid tiers.
Plugin Priority Order Security-Critical
Higher priority values run first. The figures below are approximate and vary by Kong version:
- 1000+: Bot detection, IP restriction
- ~999: Authentication (JWT, Key Auth, OAuth)
- ~950: ACL, authorization checks
- ~901: Rate Limiting (post-auth, per consumer)
- ~800: Request transforms, validation
- ~0: Logging, tracing, metrics
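To make the ordering rule concrete, here is a minimal Python sketch (not Kong code) that sorts a set of enabled plugins the way a highest-priority-first pipeline would. The priority numbers are illustrative assumptions and differ between Kong versions; `execution_order` and `auth_runs_before_rate_limit` are hypothetical helpers.

```python
# Illustrative priorities only; real values differ between Kong versions.
PLUGIN_PRIORITY = {
    "bot-detection": 2500,
    "jwt": 1450,
    "acl": 950,
    "rate-limiting": 910,
    "request-transformer": 801,
    "http-log": 12,
}


def execution_order(enabled: list[str]) -> list[str]:
    """Return plugins in the order a highest-priority-first pipeline runs them."""
    return sorted(enabled, key=lambda name: PLUGIN_PRIORITY[name], reverse=True)


def auth_runs_before_rate_limit(enabled: list[str]) -> bool:
    """True when the jwt plugin executes ahead of rate-limiting."""
    order = execution_order(enabled)
    return order.index("jwt") < order.index("rate-limiting")


order = execution_order(["rate-limiting", "http-log", "jwt", "acl"])
print(order)
```

A check like this could run in CI against the deployed plugin list, so a custom plugin with an unexpected priority is caught before it ships.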
# Create a consumer and attach an API key
curl -X POST http://localhost:8001/consumers \
--data username=acme-corp \
--data custom_id=tenant-001
curl -X POST http://localhost:8001/consumers/acme-corp/key-auth \
--data key=sk-prod-abc123xyz
# Apply per-consumer rate limit override (Enterprise)
# config.minute=300 is 5x the default, for the paid tier
curl -X POST http://localhost:8001/consumers/acme-corp/plugins \
  --data name=rate-limiting \
  --data config.minute=300 \
  --data config.hour=5000 \
  --data config.policy=redis
# Rotate API key (old key stays active during grace period)
curl -X POST http://localhost:8001/consumers/acme-corp/key-auth \
--data key=sk-prod-newkey456
# Revoke old key after client confirms migration
curl -X DELETE http://localhost:8001/consumers/acme-corp/key-auth/<key-id>
# Health check — data plane node status
curl http://localhost:8001/status | jq '.database, .server'
Apigee is Google Cloud's full-lifecycle API management platform. Unlike Kong's plugin model, Apigee uses XML policy bundles attached to proxy flows (PreFlow → Conditional Flows → PostFlow). Apigee X runs entirely on Google Cloud; Apigee hybrid deploys runtime pods in your own Kubernetes cluster while keeping management in GCP.
Proxy Structure
Inbound
ProxyEndpoint
PreFlow
Auth, rate limit,
spike arrest
→
Routing
Conditional
Flows
Path/verb
matching
→
Backend
TargetEndpoint
LB, health,
mTLS to upstream
→
Response
PostFlow
Response
Header mask
error transform
→
Client
Final Response
Logged, traced,
metered
<ProxyEndpoint name="default">
<PreFlow name="PreFlow">
<Request>
<Step><Name>VA-OAuthV2-Token</Name></Step> <!-- 1. Verify OAuth -->
<Step><Name>SC-SpikeArrest-Global</Name></Step> <!-- 2. Global burst limit -->
<Step><Name>QL-Quota-PerAppPerMin</Name></Step> <!-- 3. Per-app quota -->
<Step><Name>AM-SetTarget-Headers</Name></Step> <!-- 4. Inject context -->
</Request>
</PreFlow>
<Flows>
<Flow name="GetPayments">
<Condition>(proxy.pathsuffix MatchesPath "/payments") and (request.verb = "GET")</Condition>
<Request>
<Step><Name>RL-RaiseFault-ReadOnly</Name></Step>
</Request>
</Flow>
</Flows>
<PostFlow name="PostFlow">
<Response>
<Step><Name>AM-RemoveInternalHeaders</Name></Step>
<Step><Name>ST-StatisticsCollector</Name></Step>
</Response>
</PostFlow>
</ProxyEndpoint>
<!-- SpikeArrest Policy: caps instantaneous burst -->
<SpikeArrest name="SC-SpikeArrest-Global">
<Rate>500ps</Rate> <!-- 500 req/second global cap -->
<UseEffectiveCount>true</UseEffectiveCount>
</SpikeArrest>
<!-- Quota Policy: per-app per-minute limit.
     The apiproduct.* variables are populated by the OAuthV2 token verification
     step in the PreFlow. With a VerifyAPIKey policy they would instead carry
     the prefix verifyapikey.{policy-name}. -->
<Quota name="QL-Quota-PerAppPerMin">
<Allow countRef="apiproduct.developer.quota.limit"/>
<Interval ref="apiproduct.developer.quota.interval">1</Interval>
<TimeUnit ref="apiproduct.developer.quota.timeunit">minute</TimeUnit>
<Identifier ref="client_id"/> <!-- unique counter per app -->
<Distributed>true</Distributed>
<Synchronous>true</Synchronous>
</Quota>
Kong Strengths
- ✓ Open-source core — no vendor lock-in
- ✓ Lua plugin framework — extend anything
- ✓ Kubernetes-native (KIC)
- ✓ Lower operational cost at scale
- ✓ Fast data plane (nginx/OpenResty)
- ✗ Less mature developer portal
- ✗ Analytics require external stack
Apigee Strengths
- ✓ Full developer portal + monetization
- ✓ Built-in analytics + custom dashboards
- ✓ Advanced API product/plan management
- ✓ GCP-native IAM, Secret Manager
- ✓ Integrated threat protection rules
- ✗ Higher cost, GCP dependency
- ✗ XML policies: verbose and complex
Rate limiting controls how many requests a client can send in a time window. The choice of algorithm determines fairness, burst tolerance, and implementation complexity. Multi-node gateway deployments need shared (distributed) counters: with local-only counters each node sees only its own share of a client's traffic, so a client spread across N nodes can send up to N times the configured limit.
The Four Algorithms
Fixed Window Counter Simple
Divide time into fixed windows (e.g., each minute). Count requests per window. Reset at window boundary.
Problem: "Boundary burst" — a client can send 100 req at 00:59 and 100 more at 01:00, effectively 200 req in 2 seconds while staying within limits.
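The boundary burst is easy to reproduce. The sketch below is a minimal in-memory fixed window counter (the class name `FixedWindowLimiter` and the 100 req/min limit are assumptions for illustration, not any gateway's implementation) that accepts 200 requests within about one second while never exceeding 100 per window.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """In-memory fixed window counter, for illustration only."""

    def __init__(self, limit: int, window_secs: int):
        self.limit = limit
        self.window_secs = window_secs
        self.counts = defaultdict(int)  # (consumer, window index) -> request count

    def allow(self, consumer: str, now: float) -> bool:
        window = int(now // self.window_secs)  # counter resets at each boundary
        key = (consumer, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_secs=60)
late = sum(limiter.allow("acme", 59.0) for _ in range(100))   # end of window 0
early = sum(limiter.allow("acme", 60.0) for _ in range(100))  # start of window 1
print(late + early)  # 200 requests accepted about one second apart
```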
Sliding Window Log Accurate
Store timestamp of every request. Count only requests within the last N seconds. Evict old entries.
Trade-off: Most accurate, but O(n) memory per consumer at high volume. Acceptable for low-QPS privileged endpoints; not for public APIs.
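A minimal single-consumer sketch of the log approach, assuming an in-memory deque (a shared deployment would more likely keep the timestamps in something like a Redis sorted set). The class name `SlidingWindowLog` is hypothetical.

```python
from collections import deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request; memory grows with the limit."""

    def __init__(self, limit: int, window_secs: float):
        self.limit = limit
        self.window_secs = window_secs
        self.log = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now: float) -> bool:
        cutoff = now - self.window_secs
        while self.log and self.log[0] <= cutoff:
            self.log.popleft()  # evict entries that fell out of the window
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True


log = SlidingWindowLog(limit=3, window_secs=10)
results = [log.allow(t) for t in (0.0, 1.0, 2.0, 3.0, 10.5)]
print(results)
```

The fourth request is rejected because three are already inside the window; the fifth is accepted once the request at t=0 has aged out.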
Sliding Window Counter Balanced
Hybrid: two fixed windows + weighted interpolation. Count = (prev_window × weight) + current_window. Weight = 1 − (elapsed / window_size).
Best trade-off for most use cases — O(1) storage, near-accurate smoothing, no boundary burst. Kong's sliding window mode uses this approach.
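The interpolation formula can be written out directly. This is a sketch of the arithmetic only, with hypothetical function names; it is not taken from Kong's source.

```python
def sliding_window_estimate(prev_count: int, curr_count: int,
                            elapsed: float, window_size: float) -> float:
    """Estimate requests in the trailing window from two fixed-window counters."""
    weight = 1 - (elapsed / window_size)  # share of the previous window still in view
    return prev_count * weight + curr_count


def allow(prev_count: int, curr_count: int, elapsed: float,
          window_size: float, limit: int) -> bool:
    """Accept the request only if the estimate is still below the limit."""
    return sliding_window_estimate(prev_count, curr_count, elapsed, window_size) < limit


# 15s into a 60s window: 75% of the previous window still counts.
estimate = sliding_window_estimate(prev_count=80, curr_count=30, elapsed=15, window_size=60)
print(estimate)  # 80 * 0.75 + 30
```

Only two integers per consumer are stored, which is where the O(1) storage comes from.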
Token Bucket Burst-Friendly
Tokens fill at a constant rate (e.g., 10/sec). Each request consumes 1 token. Bucket has a max capacity — allows short bursts up to bucket size.
Ideal for AI endpoints where one request may cost vastly different amounts of computation. Pair with token-weight logic for LLM APIs.
import time
import redis

r = redis.Redis(host='redis.internal', decode_responses=True)

def token_bucket_check(consumer_id: str, cost: int = 1) -> dict:
    """
    Redis Lua script ensures atomic read-modify-write.
    Returns: {allowed: bool, tokens_remaining: int, retry_after_secs: float}
    """
    key = f"rl:token_bucket:{consumer_id}"
    rate = 10       # tokens added per second
    capacity = 50   # max burst size
    now = time.time()
    lua_script = """
    local key = KEYS[1]
    local rate = tonumber(ARGV[1])
    local capacity = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])
    local cost = tonumber(ARGV[4])
    local data = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or capacity
    local last = tonumber(data[2]) or now
    -- Refill tokens based on elapsed time
    local elapsed = math.max(0, now - last)
    tokens = math.min(capacity, tokens + elapsed * rate)
    if tokens >= cost then
        tokens = tokens - cost
        redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return {1, math.floor(tokens), '0'}
    else
        -- Redis truncates Lua numbers to integers in replies,
        -- so the fractional wait is returned as a string.
        local wait = (cost - tokens) / rate
        return {0, math.floor(tokens), tostring(wait)}
    end
    """
    result = r.eval(lua_script, 1, key, rate, capacity, now, cost)
    return {
        "allowed": bool(result[0]),
        "tokens_remaining": int(result[1]),
        "retry_after_secs": float(result[2]),
    }
Recommended Response Headers (RFC 6585 defines 429; the rest is best practice)
| Header | Value | Required? |
| RateLimit-Limit | Max requests in the window (e.g., 100) | YES |
| RateLimit-Remaining | Requests left in current window | YES |
| RateLimit-Reset | Seconds until the window resets (the IETF RateLimit header draft uses delta-seconds; some APIs send a Unix timestamp in X-RateLimit-Reset instead) | YES |
| Retry-After | Seconds until client may retry (429 responses only) | YES on 429 |
| X-RateLimit-Limit-Minute | Kong-specific: per-window breakdown | OPTIONAL |
| X-RateLimit-Limit-Hour | Kong-specific: hourly limit | OPTIONAL |
Throttling is the act of slowing down requests rather than rejecting them. Where rate limiting drops excess requests with 429, throttling queues or delays them. The right strategy depends on your SLA, consumer expectations, and backend capacity model.
Hard Limit
Reject immediately — 429 Too Many Requests. For burst abuse, DDoS mitigation, unauthenticated traffic. No queue, no delay. Use SpikeArrest in Apigee or the rate-limiting plugin in Kong with error_code: 429.
Soft Throttle
Add artificial delay — slow the client without dropping. Useful for preventing hot consumers from starving others. Inject a response delay (e.g., 200ms–2s) when a consumer hits 80% of their limit. The client experiences degraded performance before hitting the hard wall.
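One way to compute the injected delay is a linear ramp from the 80% mark to the hard limit. This is a sketch under assumed parameters (the 200ms to 2s range from the text, a linear curve, and the hypothetical name `soft_throttle_delay`); the right curve depends on your traffic.

```python
def soft_throttle_delay(used: int, limit: int, start_ratio: float = 0.8,
                        min_delay: float = 0.2, max_delay: float = 2.0) -> float:
    """Seconds of artificial delay to inject for a consumer at `used` of `limit`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio < start_ratio:
        return 0.0  # below the threshold: no throttling
    # Ramp linearly from min_delay at the threshold to max_delay at the limit.
    progress = min(1.0, (ratio - start_ratio) / (1.0 - start_ratio))
    return min_delay + progress * (max_delay - min_delay)


for used in (50, 80, 90, 100):
    print(used, round(soft_throttle_delay(used, limit=100), 2))
```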
Queue & Retry
Accept the request, queue it, drain at allowed rate. Gateway returns 202 Accepted immediately with a job ID; client polls for completion. Requires async architecture. Ideal for expensive operations (AI inference, report generation).
Priority Queue
Tiered queue with guaranteed SLAs per plan. Enterprise tier gets dedicated queue capacity. Free tier requests wait behind paid tier. Implement with Redis Sorted Sets scored by consumer tier + arrival time.
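The ordering rule (tier first, then arrival) can be sketched in memory with a heap. The tier ranks and the class name `TieredQueue` are assumptions; with Redis Sorted Sets the same pair would be folded into a single numeric score.

```python
import heapq
import itertools

# Lower rank drains first. Ranks are illustrative.
TIER_RANK = {"enterprise": 0, "business": 1, "developer": 2, "free": 3}


class TieredQueue:
    """Orders jobs by (tier rank, arrival sequence)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # arrival order breaks ties within a tier

    def push(self, job_id: str, tier: str) -> None:
        heapq.heappush(self._heap, (TIER_RANK[tier], next(self._seq), job_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


q = TieredQueue()
for job_id, tier in [("a", "free"), ("b", "enterprise"), ("c", "free"), ("d", "business")]:
    q.push(job_id, tier)
drained = [q.pop() for _ in range(len(q))]
print(drained)
```

Strict tier ordering can starve the free tier under sustained paid load, so a production design would usually add an aging term or a reserved share for lower tiers.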
# Pattern: shed load when the async queue is full, otherwise tag the request for priority routing
# Requires: an upstream async handler that returns 202 with a job ID and maintains the depth counter
# Assumes: kong.conf lists resty.redis in untrusted_lua_sandbox_requires so pre-function can load it
plugins:
  - name: pre-function                 # Lua executed in the access phase
    config:
      access:
        - |
          local redis = require "resty.redis"
          local red = redis:new()
          red:set_timeout(100)         -- ms; a slow Redis is treated as "queue not full"
          local depth
          if red:connect("redis.internal", 6379) then
            depth = red:get("queue:ai-inference:depth")   -- ngx.null when the key is unset
            red:set_keepalive(10000, 100)
          end
          if (tonumber(depth) or 0) > 500 then
            kong.response.set_header("Retry-After", "30")
            return kong.response.exit(503, { error = "queue_full" })
          end
          -- Tag request with consumer tier for priority routing
          local tier = kong.ctx.shared.consumer_tier or "free"
          kong.service.request.set_header("X-Consumer-Tier", tier)
# X-Queue-Position and X-Estimated-Wait are per-job values the gateway cannot see,
# so the upstream async handler should set them on its 202 response.
Rate limits protect infrastructure; quotas enforce business entitlements. Quotas are typically longer-lived (daily, monthly) and tied to API products or subscription plans. A free-tier consumer may be rate-limited to 60 req/min AND quota-capped at 10,000 req/month — both apply simultaneously.
Free Tier
60 req/min · 10,000 req/month · No SLA guarantee
No auth for some endpoints; API key required for others. Quota hard-stops at monthly cap. Upgrade prompt embedded in 429 response body.
Developer
300 req/min · 100,000 req/month · 99.5% SLA
OAuth 2.0 required. Access to beta endpoints. Quota rollover: unused 10% carries to next month. Dashboard with usage analytics.
Business
1,000 req/min · 1M req/month · 99.9% SLA
Priority queue. Dedicated gateway pool in Enterprise plans. Burst allowance: up to 3× for 30-second windows. Webhook-based quota alerts.
Enterprise
Custom limits · Unlimited or metered · 99.99% SLA
Per-contract negotiation. BYOK encryption. Dedicated egress IPs. Private gateway deployment. Custom SLAs with financial penalties.
# Apigee API Product definition (via management API)
# quota / quotaInterval / quotaTimeUnit: 1000 requests per 1 minute.
# The monthly absolute cap is carried as a custom attribute and enforced by a second Quota policy.
# burst_multiplier: 3x burst for 30s. operationGroup: the API operations included in this product.
{
  "name": "InferenceAPI-Business",
  "displayName": "Inference API: Business Plan",
  "proxies": ["inference-api-v1"],
  "environments": ["production"],
  "quota": "1000",
  "quotaInterval": "1",
  "quotaTimeUnit": "minute",
  "attributes": [
    { "name": "monthly_quota", "value": "1000000" },
    { "name": "burst_multiplier", "value": "3" },
    { "name": "consumer_tier", "value": "business" },
    { "name": "sla_target", "value": "99.9" }
  ],
  "operationGroup": {
    "operationConfigs": [{
      "apiSource": "inference-api-v1",
      "operations": [
        { "resource": "/v1/chat/completions", "methods": ["POST"] },
        { "resource": "/v1/embeddings", "methods": ["POST"] }
      ]
    }]
  }
}
A circuit breaker prevents cascading failures by stopping requests to a failing upstream before those failures propagate. Without a circuit breaker, a slow or failed backend causes gateway threads to block, consuming resources until the gateway itself degrades.
State: CLOSED
Normal Traffic
Count failures
vs threshold
→
State: OPEN
Fail Fast 503
No upstream
calls made
→
State: HALF-OPEN
Probe Upstream
1 req allowed
per window
→
Recovery
Back to CLOSED
If probe
succeeds
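The CLOSED, OPEN, HALF-OPEN cycle can be sketched as a small state machine. This is a generic illustration, not Kong's or Apigee's implementation; the thresholds, the injectable `clock`, and the class name `CircuitBreaker` are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal three-state breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"  # let exactly one probe through
                return True
            return False                  # fail fast, no upstream call
        if self.state == "HALF_OPEN":
            return False                  # a probe is already in flight
        return True

    def record_success(self) -> None:
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()


now = [0.0]
cb = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
cb.record_failure()
cb.record_failure()
print(cb.state, cb.allow_request())  # breaker has tripped, request refused
now[0] = 10.0
print(cb.allow_request(), cb.state)  # probe admitted after the timeout
cb.record_success()
print(cb.state)
```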
upstreams:
  - name: payments-upstream
    algorithm: round-robin
    slots: 10000
    healthchecks:
      passive:                          # Tracks real traffic, no extra probes
        healthy:
          http_statuses: [200, 201, 204]
          successes: 1                  # successes reset the failure counters
        unhealthy:
          http_statuses: [500, 502, 503, 504]
          http_failures: 5              # 5 failures → mark node down
          timeouts: 3                   # 3 timeouts also mark it down
      active:                           # Periodic health probe; this is what brings a node back
        type: http
        http_path: /health
        healthy:
          interval: 10                  # probe every 10s
          successes: 2                  # 2 successes to mark healthy
        unhealthy:
          interval: 5                   # probe sick nodes more often
          http_failures: 3
    targets:
      - target: payments-1.internal:8080
        weight: 100
      - target: payments-2.internal:8080
        weight: 100
| Method | Best For | Revocability | Gateway Support |
| API Key | Machine-to-machine, simple integrations, public APIs | MANUAL: must be rotated explicitly | UNIVERSAL: all gateways |
| OAuth 2.0 + PKCE | User-delegated access, mobile/web apps | REAL-TIME: token revocation via introspection | UNIVERSAL |
| JWT (stateless) | High-throughput internal auth, claims-bearing tokens | HARD: requires blocklist or short TTL | UNIVERSAL |
| mTLS | Service-to-service, highest assurance, zero-trust | REAL-TIME: CRL/OCSP revocation | UNIVERSAL |
| HMAC Signature | Webhook validation, financial APIs, tamper-proof requests | MEDIUM: key rotation required | PLUGIN: may require custom |
plugins:
  - name: oauth2
    config:
      scopes: ["read:payments", "write:payments", "read:profile"]
      mandatory_scope: true
      enable_authorization_code: true
      enable_client_credentials: true
      enable_implicit_grant: false      # deprecated, insecure
      enable_password_grant: false      # deprecated, avoid
      token_expiration: 3600            # 1 hour access token
      refresh_token_ttl: 1209600        # 14 days refresh token
      pkce: lax                         # lax = PKCE required for public clients; strict = required for all
      accept_http_if_already_terminated: false   # reject non-TLS
  # Local signature and claim verification for short-lived JWTs (no introspection call)
  - name: jwt
    config:
      uri_param_names: []               # never accept token in URL, header only
      cookie_names: []
      claims_to_verify: ["exp", "nbf"]  # the plugin verifies exp and nbf; validate iss separately
      maximum_expiration: 3600          # reject tokens > 1h lifetime
      key_claim_name: "kid"
      run_on_preflight: false
Mutual TLS (mTLS) extends standard TLS by requiring the client to present a certificate, not just the server. At the gateway, this means only clients holding a certificate signed by your corporate CA can establish connections — credential theft alone is insufficient to gain access.
# 1. Upload CA certificate to Kong
curl -X POST http://localhost:8001/ca_certificates \
  --data-urlencode "cert=$(cat /path/to/corporate-ca.pem)"

# 2. Configure service to require client cert
services:
  - name: high-assurance-api
    url: https://backend.internal:8443
    # Gateway → Backend: also mTLS (gateway presents its own cert)
    tls_verify: true
    ca_certificates: ["<ca-cert-uuid>"]
    client_certificate: "<gateway-cert-uuid>"
    plugins:
      - name: mtls-auth                 # Client → Gateway: mTLS
        config:
          ca_certificates: ["<ca-cert-uuid>"]
          certificate_bound_access_tokens: true   # bind OAuth token to cert
          revocation_check_mode: IGNORE_CA_ERROR  # failed revocation lookups are ignored; STRICT fails closed
          skip_consumer_lookup: false   # map cert CN to Kong consumer
          authenticated_group_by: DN    # group by Distinguished Name
      - name: acl
        config:
          allow: ["tier:enterprise", "service:internal"]   # cert-derived groups
Certificate-Bound Tokens (RFC 8705): Combine OAuth 2.0 with mTLS by binding access tokens to the client certificate thumbprint. The token is useless without the matching certificate and private key, so a stolen token alone does not grant access. Enable in Kong with certificate_bound_access_tokens: true. FAPI 2.0 (Financial-grade API) requires sender-constrained access tokens, and mTLS binding is one of the two mechanisms it permits (the other is DPoP).
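The binding check itself is small. RFC 8705 places the base64url-encoded SHA-256 hash of the DER-encoded client certificate in the token's `cnf` claim under `x5t#S256`; the sketch below compares that value with the certificate seen on the TLS connection. The function names are hypothetical, and token signature verification is assumed to have happened already.

```python
import base64
import hashlib
import hmac


def cert_thumbprint_s256(cert_der: bytes) -> str:
    """base64url(SHA-256(DER certificate)) without padding, per RFC 8705."""
    digest = hashlib.sha256(cert_der).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def token_bound_to_cert(claims: dict, cert_der: bytes) -> bool:
    """True only if the token's cnf thumbprint matches the presented certificate."""
    cnf = claims.get("cnf")
    expected = cnf.get("x5t#S256") if isinstance(cnf, dict) else None
    if not isinstance(expected, str):
        return False  # unbound tokens are rejected on this route
    return hmac.compare_digest(expected, cert_thumbprint_s256(cert_der))


# Placeholder bytes stand in for real DER-encoded certificates.
client_cert = b"placeholder-der-bytes-for-client"
other_cert = b"placeholder-der-bytes-for-attacker"
claims = {"sub": "acme-corp", "cnf": {"x5t#S256": cert_thumbprint_s256(client_cert)}}
print(token_bound_to_cert(claims, client_cert), token_bound_to_cert(claims, other_cert))
```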
Injection Defense Critical
- Validate all inputs against strict JSON Schema before routing
- Block SQLi, XSS, SSRF, path traversal via regex deny-lists at gateway
- Reject unexpected content types: application/x-www-form-urlencoded should never reach a JSON API
- Limit request body depth (JSON bomb protection — max 5 levels)
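The depth limit from the last bullet can be sketched as an iterative walk over the parsed body. The function names and the rejection tuple are assumptions for illustration; the traversal is iterative so that a deeply nested payload cannot exhaust the call stack of the checker itself.

```python
import json


def exceeds_depth(value, max_depth: int = 5) -> bool:
    """True if any object or array is nested deeper than max_depth levels."""
    stack = [(value, 1)]  # the top-level container counts as level 1
    while stack:
        node, depth = stack.pop()
        if isinstance(node, (dict, list)):
            if depth > max_depth:
                return True
            children = node.values() if isinstance(node, dict) else node
            stack.extend((child, depth + 1) for child in children)
    return False


def validate_body(raw: str, max_depth: int = 5) -> tuple[bool, str]:
    """Parse and depth-check a request body; returns (ok, reason)."""
    try:
        parsed = json.loads(raw)
    except (ValueError, RecursionError):
        return False, "invalid_json"
    if exceeds_depth(parsed, max_depth):
        return False, "too_deep"
    return True, "ok"


print(validate_body('{"a": {"b": {"c": 1}}}'))
print(validate_body("[" * 6 + "]" * 6))
```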
Resource Exhaustion DoS
- Max request body size: enforce at gateway, not backend
- Max URL length: 2048 chars (reject longer)
- Max header count: 100; max header size: 8KB
- Max query string parameters: 50
- Timeout cascade: gateway timeout < backend timeout (prevent thread exhaustion)
# JSON Schema validation: reject malformed payloads at gateway layer
# The schema is embedded as a JSON string, so it cannot carry inline comments:
#   model       allowlist only
#   messages    maxItems 50 prevents context stuffing; content maxLength 32000 is roughly 8k tokens
#   additionalProperties false rejects unknown fields
plugins:
  - name: request-validator
    config:
      body_schema: |
        {
          "$schema": "http://json-schema.org/draft-07/schema",
          "type": "object",
          "required": ["model", "messages"],
          "properties": {
            "model": {
              "type": "string",
              "enum": ["gpt-4o", "claude-3-7-sonnet-20250219"]
            },
            "messages": {
              "type": "array",
              "maxItems": 50,
              "items": {
                "type": "object",
                "required": ["role", "content"],
                "properties": {
                  "role": { "type": "string", "enum": ["user", "assistant", "system"] },
                  "content": { "type": "string", "maxLength": 32000 }
                }
              }
            },
            "max_tokens": { "type": "integer", "minimum": 1, "maximum": 4096 },
            "temperature": { "type": "number", "minimum": 0, "maximum": 2 }
          },
          "additionalProperties": false
        }
      allowed_content_types: ["application/json"]
      verbose_response: false           # don't leak schema in errors
# Bot detection (Kong Enterprise)
- name: bot-detection
config:
allow: []
deny: ["generic-bot", "crawler", "scanner"]
Exposing AI model endpoints externally introduces risks that standard API gateway patterns don't fully cover. LLMs are non-deterministic, computationally expensive, and susceptible to prompt injection, data leakage, and jailbreaking. An AI gateway layer adds controls specific to this threat model, sitting between your consumers and the model API.
Consumer
API Request
External app
or user
→
Gateway Layer 1
Auth + Rate
Limit + Quota
Standard
policies
→
Gateway Layer 2
AI-Specific
Controls
Token limit
Prompt guard
→
Model Router
Semantic Cache
+ Model Select
Cost control
fallback
→
LLM API
OpenAI · Claude
Gemini · Self-host
Actual
inference
AI Gateway Capabilities
Semantic Caching Cost Control
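Cache LLM responses by semantic similarity of prompts, not exact string match. If the embedding of a new prompt is more than 95% similar to a cached query (for example, cosine similarity above 0.95), return the cached response. Savings depend on how repetitive the traffic is; reductions in the 30–70% range are plausible for FAQ, code completion, and classification workloads and should be confirmed against your own cache hit rate.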
Tools: Kong AI Gateway, LiteLLM, Portkey, GPTCache
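The lookup step can be sketched with plain cosine similarity. Embeddings are assumed to come from whatever embedding model you use; the tiny two-dimensional vectors, the linear scan, and the class name `SemanticCache` are simplifications for illustration (a real deployment would use a vector index).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan cache keyed by prompt embedding."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, embedding: list[float], response: str) -> None:
        self.entries.append((embedding, response))

    def get(self, embedding: list[float]):
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(cached_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Our refund window is 30 days.")
print(cache.get([0.99, 0.05]))  # near-duplicate prompt: served from cache
print(cache.get([0.0, 1.0]))    # unrelated prompt: cache miss
```

Scoping entries per tenant or consumer is worth considering, so that one customer's cached answer is never returned to another.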
Model Routing & Fallback Resilience
Route to cheaper models for simple tasks; escalate to frontier models for complex ones. If primary model is unavailable (429, 503), auto-fallback to alternative — with model mapping for API compatibility. Maintain per-model rate limits separately.
Pattern: GPT-4o-mini → GPT-4o → Claude as fallback chain
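A fallback chain reduces to a loop that moves on only for retryable statuses. In this sketch `call_model`, `ModelUnavailable`, and the model names are placeholders; the provider client is assumed to raise an error carrying the HTTP status.

```python
from typing import Callable

RETRYABLE_STATUSES = {429, 503}


class ModelUnavailable(Exception):
    """Raised by the (hypothetical) provider client with the HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"upstream returned {status}")
        self.status = status


def call_with_fallback(chain: list[str],
                       call_model: Callable[[str, str], str],
                       prompt: str) -> tuple[str, str]:
    """Try each model in order; return (model_used, response)."""
    last_error = None
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as err:
            if err.status not in RETRYABLE_STATUSES:
                raise  # e.g. 400 or 401: falling back would not help
            last_error = err
    raise RuntimeError("all models in the fallback chain are unavailable") from last_error


def fake_call(model: str, prompt: str) -> str:
    # Stand-in for a real provider call: the first model is rate limited.
    if model == "gpt-4o-mini":
        raise ModelUnavailable(429)
    return f"{model} answered"


used, reply = call_with_fallback(["gpt-4o-mini", "gpt-4o", "claude"], fake_call, "hello")
print(used, reply)
```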
PII Scrubbing Compliance
Detect and redact PII (names, emails, SSNs, credit cards, phone numbers) from prompts before sending to external model APIs. Use regex + NER model in the gateway pipeline. Log the redaction event but never the original PII.
Output Filtering Safety
Inspect LLM responses before returning to consumer. Strip: PII leakage, hallucinated credentials, toxic content, competitor mentions (per policy). Apply regex + classifier in the response pipeline. Log filtered content for fine-tuning feedback.
# Kong AI Gateway (KonnectAI / Kong 3.6+)
# Field names for ai-semantic-cache and ai-rate-limiting-advanced are simplified here;
# check each plugin's schema for your Kong version before applying.
plugins:
  - name: ai-proxy
    config:
      route_type: llm/v1/chat
      auth:
        header_name: Authorization
        header_value: "{vault://secrets/openai-key}"   # the secret stores the full "Bearer ..." value
      model:
        provider: openai
        name: gpt-4o
        options:
          max_tokens: 2048
          temperature: 0.7
          input_cost: 2.5               # USD per 1M input tokens (for metering)
          output_cost: 10.0             # USD per 1M output tokens
  - name: ai-semantic-cache
    config:
      embeddings_provider: openai
      embeddings_model: text-embedding-3-small
      similarity_threshold: 0.95        # 95%+ match = cache hit
      cache_ttl: 3600
      vectordb:
        strategy: redis
        redis:
          host: redis.internal
          port: 6379
  - name: ai-prompt-guard
    config:
      allow_patterns: []
      deny_patterns:                    # list of regex strings
        - "ignore previous instructions"
        - "DAN mode"
        - "jailbreak"
        - "\\bSSN\\b.*\\d{3}-\\d{2}-\\d{4}"        # SSN in prompt
        - "\\b4[0-9]{12}(?:[0-9]{3})?\\b"          # Visa card number
      match_all_roles: true
  - name: ai-rate-limiting-advanced
    config:
      limit_by: consumer
      limits:
        - tokens_per_minute: 100000     # input + output tokens
        - tokens_per_hour: 1000000
        - requests_per_minute: 60
The OWASP Top 10 for LLM Applications defines risks specific to AI systems. The gateway is the first — and most scalable — line of defense against several of these, though some require application-layer mitigations.
Threat Coverage Matrix
| Risk | Gateway Can Block | App-Layer Also Needed | Severity |
| LLM01: Prompt Injection | Partial: regex deny-lists | Input sanitization, system prompt hardening | CRITICAL |
| LLM02: Insecure Output Handling | Response content filtering | Output encoding before rendering | HIGH |
| LLM03: Training Data Poisoning | Not applicable at gateway | Model provenance, red-teaming | MEDIUM |
| LLM04: Model DoS | Token limits, rate limiting, request size | Backend timeouts | HIGH |
| LLM06: Sensitive Data Disclosure | PII scrubbing in request + response | RAG access controls, data classification | CRITICAL |
| LLM07: Insecure Plugin Design | Tool/function allowlisting in request schema | Capability sandboxing, principle of least privilege | CRITICAL |
| LLM09: Overreliance | Not applicable at gateway | Human-in-the-loop for high-stakes decisions | MEDIUM |
Tool/Function call allowlisting: If you expose function-calling or tool-use APIs, the gateway MUST validate that tools arrays in requests only reference approved function names and schemas. An attacker crafting a malicious tool definition can invoke arbitrary code if your application naively executes whatever the model returns. Treat every LLM output as untrusted.
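A gateway-side allowlist check might look like the sketch below. It assumes the OpenAI-style request shape, where each entry in `tools` has a `function` object with `name` and a JSON Schema under `parameters`; the approved tool names and parameter sets are hypothetical.

```python
# Hypothetical allowlist: tool name -> parameter names the tool may declare.
ALLOWED_TOOLS = {
    "get_weather": {"city", "units"},
    "lookup_order": {"order_id"},
}


def validate_tools(body: dict) -> list[str]:
    """Return a list of violations; an empty list means the tools array is acceptable."""
    errors = []
    tools = body.get("tools", [])
    if not isinstance(tools, list):
        return ["tools must be an array"]
    for tool in tools:
        function = tool.get("function") if isinstance(tool, dict) else None
        if not isinstance(function, dict):
            errors.append("malformed tool entry")
            continue
        name = function.get("name")
        if name not in ALLOWED_TOOLS:
            errors.append(f"tool not allowed: {name!r}")
            continue
        declared = set(function.get("parameters", {}).get("properties", {}))
        extra = sorted(declared - ALLOWED_TOOLS[name])
        if extra:
            errors.append(f"unexpected parameters for {name}: {extra}")
    return errors


request_body = {
    "tools": [
        {"type": "function", "function": {"name": "get_weather",
            "parameters": {"type": "object", "properties": {"city": {"type": "string"}}}}},
        {"type": "function", "function": {"name": "run_shell",
            "parameters": {"type": "object", "properties": {"cmd": {"type": "string"}}}}},
    ]
}
print(validate_tools(request_body))
```

This only constrains what the client may declare. The application still has to validate the arguments the model returns before executing any tool.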
Standard rate limiting counts requests. For LLM APIs, two requests can differ by 1000× in computational cost — a 100-token prompt vs a 128,000-token context window. Token-based limiting controls actual resource consumption, not request count.
import time

import redis.asyncio as redis   # async client, so Redis calls do not block the event loop
import tiktoken
from fastapi import HTTPException, Request

enc = tiktoken.encoding_for_model("gpt-4o")
r = redis.Redis(host="redis.internal", decode_responses=True)

TOKEN_LIMITS = {
    "free": {"minute": 10_000, "day": 100_000},
    "developer": {"minute": 50_000, "day": 1_000_000},
    "business": {"minute": 200_000, "day": 10_000_000},
}
WINDOW_SECS = {"minute": 60, "day": 86_400}

async def token_limit_middleware(request: Request, consumer_id: str, tier: str):
    body = await request.json()
    # Pre-flight: count input tokens BEFORE sending to model
    messages_text = " ".join(
        m["content"] for m in body.get("messages", [])
        if isinstance(m.get("content"), str)
    )
    # disallowed_special=() encodes strings like "<|endoftext|>" as plain text
    # instead of raising, so user input cannot crash the counter.
    input_tokens = len(enc.encode(messages_text, disallowed_special=()))
    # Add max_tokens from request as expected output cost
    max_output = body.get("max_tokens", 1024)
    expected_cost = input_tokens + max_output
    limits = TOKEN_LIMITS.get(tier, TOKEN_LIMITS["free"])
    now = int(time.time())
    pipe = r.pipeline()
    for window, secs in WINDOW_SECS.items():
        limit = limits[window]
        # The key includes the window index, so every window starts from zero
        # and a busy consumer cannot keep extending one counter's TTL.
        key = f"tokens:{consumer_id}:{window}:{now // secs}"
        used = int(await r.get(key) or 0)
        if used + expected_cost > limit:
            raise HTTPException(
                status_code=429,
                detail={
                    "error": "token_limit_exceeded",
                    "window": window,
                    "limit": limit,
                    "used": used,
                    "request_cost": expected_cost,
                    "remaining": max(0, limit - used),
                },
            )
        pipe.incrby(key, expected_cost)
        pipe.expire(key, secs)
    # Check-then-increment is not atomic: concurrent requests can overshoot slightly.
    # Move the check into a Lua script if the limit must be strict.
    await pipe.execute()
    # After response: reconcile actual usage (tokens_used from API response)
    # Subtract over-estimate if model returned fewer tokens than max_tokens
    return input_tokens, max_output
Prompt inspection at the gateway layer is the first — and cheapest — defense against malicious inputs. Combine rule-based pattern matching (fast, low cost) with classifier-based detection (higher accuracy, higher latency) tiered by risk level.
import re
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"           # log + continue
    SANITIZE = "sanitize"   # redact PII, then continue
    BLOCK = "block"         # reject with 400

INJECTION_PATTERNS = [
    # Direct jailbreak attempts
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions?", re.I),
    re.compile(r"(you are|act as|pretend to be)\s+(dan|evil|unrestricted)", re.I),
    re.compile(r"(DAN|jailbreak|GPT-4-DAN|STAN)\s+(mode|prompt)", re.I),
    # System prompt extraction
    re.compile(r"(reveal|print|repeat|output)\s+(your\s+)?(system|initial)\s+(prompt|instruction)", re.I),
    # Indirect injection via data
    re.compile(r"<(s|INST|SYS|SYSTEM|human|assistant)>"),   # model-specific control tokens
    re.compile(r"\[\[HUMAN\]\]|\[\[ASSISTANT\]\]"),
]

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Visa (13 or 16 digits), Mastercard 51-55 range, American Express
    "credit_card": re.compile(r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "api_key": re.compile(r"\b(sk-[a-zA-Z0-9]{32,}|ghp_[a-zA-Z0-9]{36})\b"),
}

def inspect_prompt(text: str) -> tuple[Verdict, str, dict]:
    # 1. Injection check: block immediately
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            return Verdict.BLOCK, text, {"reason": "prompt_injection_detected"}
    # 2. PII check: sanitize, don't block
    sanitized = text
    found_pii = []
    for pii_type, pattern in PII_PATTERNS.items():
        if pattern.search(sanitized):
            sanitized = pattern.sub(f"[REDACTED:{pii_type.upper()}]", sanitized)
            found_pii.append(pii_type)
    if found_pii:
        return Verdict.SANITIZE, sanitized, {"redacted": found_pii}
    return Verdict.ALLOW, text, {}
Classifier-based detection: Regex is fast but easily evaded with Unicode look-alike characters, zero-width characters, or paraphrasing. Normalizing input before matching closes some of these gaps, but not paraphrasing. For high-security deployments, add a secondary ML classifier (LlamaGuard, ShieldLM, or a fine-tuned BERT) that evaluates semantics, not syntax. Run classifiers asynchronously and cache verdicts by prompt hash to minimize latency overhead (<50ms target).
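Two cheap steps help before any classifier runs: canonicalize the prompt so trivial obfuscation does not defeat pattern matching, and key the verdict cache on the canonical form. The sketch below uses NFKC folding and strips a small set of zero-width code points; both choices, and the function names, are assumptions. NFKC does not map cross-script look-alikes (for example Cyrillic letters resembling Latin ones), so this narrows the gap rather than closing it.

```python
import hashlib
import unicodedata

# Zero-width and BOM code points commonly used to split keywords.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize_prompt(text: str) -> str:
    """Fold compatibility forms (fullwidth, ligatures), drop zero-width chars, casefold."""
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(ZERO_WIDTH).casefold()


def verdict_cache_key(text: str) -> str:
    """Stable cache key so obfuscated variants of one prompt share a verdict."""
    return hashlib.sha256(normalize_prompt(text).encode("utf-8")).hexdigest()


plain = "Ignore previous instructions"
obfuscated = "Ig\u200bnore \uff50revious instructions"  # zero-width space + fullwidth p
print(normalize_prompt(obfuscated))
print(verdict_cache_key(plain) == verdict_cache_key(obfuscated))
```

Pattern matching would then run against `normalize_prompt(text)` rather than the raw input.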
Golden Signals
- Latency — p50/p95/p99 per route
- Traffic — req/sec, tokens/sec
- Errors — 4xx/5xx rates by consumer
- Saturation — connection pool, CPU
AI-Specific Metrics
- Input / output tokens per minute
- Token cost (USD) per consumer
- Cache hit rate (semantic cache)
- Prompt injection block rate
- Model fallback frequency
- Time-to-first-token (streaming)
Alert Thresholds
- p99 latency > 3s: PagerDuty P2
- Error rate > 5% (5m): P1
- 429 rate > 20%: quota tuning
- Token cost anomaly > 3σ: billing alert
- Cache hit rate < 20%: cache review
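The 3σ token-cost threshold can be sketched as a z-score test against a consumer's recent history. The sample values and the function name `is_cost_anomaly` are made up for illustration; real spend data is often skewed, so a robust statistic (median and MAD) may suit it better.

```python
import statistics


def is_cost_anomaly(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """True if `current` is more than `sigmas` standard deviations above the mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current > mean  # flat history: any increase stands out
    return (current - mean) / stdev > sigmas


hourly_cost_usd = [10, 11, 9, 10, 10, 12, 8, 10]  # illustrative samples
print(is_cost_anomaly(hourly_cost_usd, 12))
print(is_cost_anomaly(hourly_cost_usd, 14))
```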
# prometheus/rules/ai-gateway.yml
groups:
  - name: ai_gateway
    rules:
      - alert: HighTokenCostAnomaly
        # Fires when the 5m output-token rate exceeds 3x its 1h average
        expr: |
          rate(ai_gateway_tokens_total{direction="output"}[5m])
            > 3 * avg_over_time(
              rate(ai_gateway_tokens_total{direction="output"}[5m])[1h:5m]
            )
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Token usage spike detected: possible prompt stuffing"
      - alert: PromptInjectionSpike
        # rate() is per second; multiply by 60 to compare against a per-minute threshold
        expr: |
          sum(rate(ai_gateway_prompt_blocked_total[5m])) * 60 > 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: ">10 prompt injections/min: possible coordinated attack"
      - alert: ModelFallbackHigh
        # sum() both sides so differing label sets do not break the division
        expr: |
          sum(rate(ai_gateway_model_fallback_total[10m])) /
          sum(rate(ai_gateway_requests_total[10m])) > 0.1
        for: 5m
        annotations:
          summary: ">10% of requests falling back: primary model degraded"
      - alert: SemanticCacheHitRateLow
        expr: |
          sum(rate(ai_gateway_cache_hit_total[1h])) /
          sum(rate(ai_gateway_requests_total[1h])) < 0.2
        for: 30m
        annotations:
          summary: "Cache hit rate <20%: review embedding similarity threshold"
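The alert thresholds list a 3σ token cost anomaly, whereas the HighTokenCostAnomaly rule compares against three times the hourly average. A true standard-deviation test can be sketched as below, assuming the billing job has a list of recent per-interval cost samples for one consumer. The function name and the sample window are illustrative choices, not part of any metering API.

```python
from statistics import mean, stdev


def is_cost_anomaly(history: list[float], current: float, sigmas: float = 3.0) -> bool:
    """True when `current` exceeds the baseline mean by more than `sigmas` std devs."""
    if len(history) < 2:
        return False  # a sample standard deviation needs at least two points
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        # Perfectly flat baseline: any increase counts as anomalous.
        return current > mu
    return (current - mu) / sd > sigmas


# Example: eight 5-minute token-cost samples (USD) for one consumer.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_cost_anomaly(baseline, 105))  # within 3 sigma of the baseline
print(is_cost_anomaly(baseline, 107))  # beyond 3 sigma of the baseline
```

The same test can be expressed in PromQL with `avg_over_time` plus three times `stddev_over_time` over a subquery, which keeps the alert inside Prometheus instead of a separate job.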
Gateway policies should be version-controlled, tested, and deployed through CI/CD — not configured manually via UIs. For complex authorization logic, offload to a policy engine (OPA, Cedar) that the gateway calls synchronously at the enforcement point.
# policy/ai_api.rego
package ai.api.authz
import future.keywords
# Default: deny all
default allow := false
# Consumers may only use models in their tier's allowlist
allow if {
    requested_model := input.request.body.model
    consumer_tier := input.consumer.tier
    allowed_models[consumer_tier][requested_model]
}
allowed_models := {
    "free": {"gpt-4o-mini": true, "text-embedding-3-small": true},
    "developer": {"gpt-4o-mini": true, "gpt-4o": true, "text-embedding-3-large": true},
    "business": {"gpt-4o": true, "claude-3-7-sonnet-20250219": true, "o3": true},
    "enterprise": {"_all": true},
}
# Enterprise override: any model allowed
allow if {
    input.consumer.tier == "enterprise"
}
# Deny rules form a set of reasons. `allow` does not consult them, so the
# caller must require allow == true AND an empty deny set.
# Block tool/function calls for free tier (prevents code execution)
deny contains reason if {
    input.consumer.tier == "free"
    count(input.request.body.tools) > 0
    reason := "Free tier does not support function calling"
}
# Enforce max_tokens ceiling per tier
deny contains reason if {
    input.request.body.max_tokens > max_tokens_ceiling[input.consumer.tier]
    reason := sprintf("max_tokens exceeds tier ceiling of %v",
        [max_tokens_ceiling[input.consumer.tier]])
}
max_tokens_ceiling := {
    "free": 512,
    "developer": 4096,
    "business": 16384,
    "enterprise": 128000,
}
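How a gateway plugin might call this policy synchronously can be sketched with OPA's Data API, which accepts a POST to `/v1/data/<package path>` with an `input` document and returns the package's rules under `result`. The sidecar address, the 200 ms timeout, and the fail-closed behavior are assumptions made for illustration. The `post` parameter is injectable so the decision logic can be unit tested without a running OPA.

```python
import json
from typing import Callable
from urllib import request as urlrequest

# Assumed sidecar address; the path mirrors the package name ai.api.authz.
OPA_URL = "http://localhost:8181/v1/data/ai/api/authz"


def http_post_json(url: str, payload: dict, timeout: float = 0.2) -> dict:
    """POST a JSON document and decode the JSON response (standard library only)."""
    req = urlrequest.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlrequest.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def authorize(
    consumer: dict,
    body: dict,
    post: Callable[[str, dict], dict] = http_post_json,
) -> tuple[bool, list[str]]:
    """Return (permitted, deny_reasons). Any policy engine error fails closed."""
    opa_input = {"input": {"consumer": consumer, "request": {"body": body}}}
    try:
        result = post(OPA_URL, opa_input).get("result", {})
    except Exception:
        return False, ["policy engine unavailable"]
    reasons = list(result.get("deny", []))
    # Permit only when allow is true AND no deny rule produced a reason.
    permitted = result.get("allow", False) is True and not reasons
    return permitted, reasons
```

Failing closed means an OPA outage blocks traffic. Whether that trade-off is acceptable depends on the deployment, so treat it as a design choice to revisit rather than a default.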
CI Pipeline for Gateway Config: Store all Kong/Apigee config in Git → lint with deck validate or apigeelint → run policy unit tests (opa test) → diff against prod with deck diff → deploy via deck sync in CD pipeline. Never mutate production via Admin API manually — treat it as infra drift.
HTTP Status Code Guide for Gateway
| Code | Meaning | When Gateway Returns It |
| 400 | Bad Request | Schema validation failure, malformed JSON, disallowed parameters |
| 401 | Unauthorized | Missing, malformed, or invalid credentials (e.g., no Authorization header) |
| 403 | Forbidden | Valid credentials, but insufficient scope or ACL denial |
| 404 | Not Found | Route not matched, or the endpoint is intentionally hidden ("dark") behind a software-defined perimeter (SDP) |
| 408 | Request Timeout | Client took too long to send request body |
| 413 | Payload Too Large | Request body exceeds allowed_payload_size |
| 422 | Unprocessable Entity | Syntactically valid JSON but semantically invalid (OPA denial) |
| 429 | Too Many Requests | Rate limit or quota exceeded; include a Retry-After header |
| 502 | Bad Gateway | Upstream returned an invalid response |
| 503 | Service Unavailable | All upstream nodes unhealthy (circuit open); queue full |
| 504 | Gateway Timeout | Upstream did not respond within read_timeout |
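The 429 row asks for a Retry-After header. A minimal sketch of building that response is shown below, assuming a fixed-window limiter that knows when the current window resets. The function name, body fields, and error string are hypothetical, not any gateway's actual response schema.

```python
import math


def rate_limit_response(
    limit: int, window_reset_epoch: float, now: float
) -> tuple[int, dict[str, str], dict]:
    """Build a 429 response whose Retry-After is whole seconds until the window resets."""
    # Round up so a client that waits exactly Retry-After seconds is not rejected again;
    # clamp to 1 so the header is never zero or negative.
    retry_after = max(1, math.ceil(window_reset_epoch - now))
    headers = {
        "Retry-After": str(retry_after),
        "Content-Type": "application/json",
    }
    body = {
        "error": "rate_limit_exceeded",  # hypothetical error code
        "limit": limit,
        "retry_after_seconds": retry_after,
    }
    return 429, headers, body
```

Retry-After also accepts an HTTP date, but delta-seconds avoids clock skew between gateway and client.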
Kong CLI Cheatsheet
# Validate declarative config
deck validate -s deck.yaml
# Diff current vs what would be synced
deck diff -s deck.yaml
# Apply configuration
deck sync -s deck.yaml
# Dump current config to file
deck dump -o current.yaml
# Live reload (no downtime)
kong reload
# Check gateway status
curl localhost:8001/status | jq .
# List configured plugins (execution priority is set in each plugin's handler and is not returned by this endpoint)
curl -s localhost:8001/plugins | jq \
'[.data[] | {name, enabled}]'
Apigee CLI Cheatsheet
# Authenticate
gcloud auth application-default login
# Deploy a proxy bundle
apigeecli apis create bundle \
--name my-api --proxy ./apiproxy \
--org $ORG --env production
# List deployments
apigeecli deployments list \
--org $ORG --env production
# Create API product
apigeecli products create \
--name "AI-Business" \
--proxies "inference-api-v1" \
--quota 1000 --interval 1 \
--unit minute --org $ORG
# Check proxy traffic (last 1h)
apigeecli stats get \
--org $ORG --env production \
--dims apiproxy --metric message_count
AI Gateway Configuration Checklist
## Authentication & Authorization
□ OAuth 2.0 with PKCE or mTLS for all consumers
□ Scopes defined per capability (e.g., ai:chat, ai:embed, ai:tools)
□ OPA/Cedar policy: model allowlist per consumer tier
□ Tool/function calling disabled for free tier
## Traffic Control
□ Request-count rate limit: per minute AND per hour (per consumer)
□ Token-count rate limit: input+output tokens per minute AND per day
□ Max request body size: 1–10 MB depending on use case
□ max_tokens ceiling enforced per tier (schema validation)
□ Spike arrest: global burst cap before per-consumer limits
□ Circuit breaker: passive + active health checks on model API
## Prompt Security
□ JSON Schema validation: allowlist known fields, reject additionalProperties
□ Prompt injection pattern matching (regex deny-list)
□ PII scrubbing in request: SSN, credit card, email, API keys
□ PII scrubbing in response: model may hallucinate or echo PII
□ Tool/function call schema validation: only approved function names
□ Context depth limit: max messages array length
## Cost Control
□ Semantic cache enabled with similarity threshold
□ Token cost metered per consumer (input + output separately)
□ Billing alert on anomalous token spend (>3σ)
□ Model fallback chain configured (expensive → cheap on 429/503)
## Observability
□ Distributed tracing: inject trace-id into all upstream requests
□ Token usage exported to Prometheus/Datadog per consumer
□ Prompt injection block rate alerting
□ Per-model error rates and latency histograms
□ Audit log: every request logged (consumer, model, tokens, verdict)
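The checklist item on a model fallback chain (expensive to cheap on 429/503) can be sketched as a simple ordered retry. The `send` callable is a hypothetical transport that returns a status code and parsed payload. The chain contents and the choice to rewrite the `model` field in the request body are assumptions for illustration.

```python
from typing import Callable

# Only throttling and unavailability trigger a fallback; other errors are returned as-is.
RETRYABLE = {429, 503}


def call_with_fallback(
    chain: list[str],
    body: dict,
    send: Callable[[str, dict], tuple[int, dict]],
) -> tuple[str, int, dict]:
    """Try each model in order; move on only when the upstream returns 429 or 503."""
    if not chain:
        raise ValueError("fallback chain must contain at least one model")
    for model in chain:
        # Copy the request and point it at the current model in the chain.
        status, payload = send(model, {**body, "model": model})
        if status not in RETRYABLE:
            return model, status, payload
    # Every model was throttled or unavailable: surface the last response.
    return model, status, payload
```

Each fallback is worth counting in a metric such as `ai_gateway_model_fallback_total`, so the ModelFallbackHigh alert has data to work with.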
Reference Resources: Kong documentation at docs.konghq.com · Apigee docs at cloud.google.com/apigee/docs · OWASP LLM Top 10 at owasp.org/www-project-top-10-for-large-language-model-applications · RFC 6585 (429 Status Code) · RFC 8705 (Certificate-Bound Tokens) · NIST AI RMF at airc.nist.gov · LiteLLM AI gateway at litellm.ai