Docker & Kubernetes Command Reference
A tactical SRE cheat sheet for multi-cluster environments — every command annotated with the exact production scenario that warrants it. No hello-world fluff.
Replace <values> in angle brackets with your own identifiers.
Container Operations
The core verbs for managing container runtime state. Knowing the difference between stop, kill, and rm prevents accidents in production.
| Command | Description | Scenario |
|---|---|---|
| docker ps | List all running containers with ID, image, status, and ports. | First step in any incident. Confirm the container is actually running before investigating further. |
| docker ps -a | List all containers including stopped/exited. | Find recently crashed containers by name or exit code before their logs vanish. |
| docker ps --filter "status=exited" --format "table {{.ID}}\t{{.Names}}\t{{.Status}}" | Filtered, formatted list of exited containers with clean tabular output. | Quickly audit how many containers crashed on the host. Pipe to wc -l for a count. |
| docker start <container> | Start a stopped container preserving its original configuration. | Re-start a container that was cleanly stopped without re-creating it from scratch. |
| docker stop <container> | Send SIGTERM, wait 10 s (default), then send SIGKILL. Graceful shutdown. | Controlled shutdown for deployments or maintenance. Gives the process time to flush buffers. |
| docker stop -t 30 <container> | Extend the grace period to 30 seconds before force-kill. | When your app needs more time to drain connections (e.g., Kafka consumers, database writers). |
| docker restart <container> | Stop then start in one command. Equivalent to stop + start. | Quick fix for a container exhibiting memory leaks or stale connections: restores state without a full re-deploy. |
| docker kill <container> | Send SIGKILL immediately. No grace period. | Container is stuck: ignoring SIGTERM, consuming 100 % CPU, or the process is wedged in an uninterruptible wait. |
| docker kill --signal SIGUSR1 <container> | Send a custom POSIX signal to the container's PID 1. | Trigger a hot-reload in nginx (SIGHUP) or a heap dump in a JVM process (SIGUSR1) without restarting. |
| docker rm <container> | Remove a stopped container. Anonymous volumes are kept unless you add -v. | Post-incident cleanup. Always remove before re-creating to avoid name conflicts. |
| docker rm -f <container> | Force-remove a running container (implicit kill + rm). | Emergency teardown when the container must be gone immediately (e.g., security breach, runaway process). |
| docker run -d --name <name> --restart=always <image> | Start a container in detached mode with automatic restart on failure or host reboot. | Production services on bare-metal or single-node hosts where no orchestrator handles restarts. |
| docker run --rm -it <image> /bin/sh | Run a throwaway interactive container that self-destructs on exit. | One-off debugging, testing environment variables, or validating an image's contents without leaving garbage. |
| docker rename <old> <new> | Rename a running or stopped container. | Resolve name collisions without killing the container when a deployment creates a conflict. |
| docker update --memory 512m --cpus 1.5 <container> | Live-update resource limits without restarting. | Emergency throttling when a container is starving neighboring services on the same host. |
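The stop, rm, and run verbs above combine into a common re-create pattern. A minimal sketch, assuming a POSIX shell; `recreate_container` is a hypothetical helper name, and the 30-second default grace period is an arbitrary choice, not a Docker default:

```shell
# Hypothetical helper: replace a container without hitting a name conflict.
# Usage: recreate_container <name> <image> [grace-seconds]
recreate_container() {
  name=$1
  image=$2
  grace=${3:-30}   # assumed default; tune to your app's drain time

  # "docker container inspect" exits non-zero when no such container exists.
  if docker container inspect "$name" >/dev/null 2>&1; then
    # Graceful stop first (SIGTERM, then SIGKILL after the grace period).
    docker stop -t "$grace" "$name" >/dev/null || return 1
    # Remove the stopped container so the name is free again.
    docker rm "$name" >/dev/null || return 1
  fi

  docker run -d --name "$name" --restart=always "$image"
}
```

Reaching for `docker rm -f` instead would skip the grace period entirely, which is why the sketch stops first and removes second.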
Image Operations
Image management from build through registry push. Understand layer caching and image provenance before you're under pressure during an incident.
| Command | Description | Scenario |
|---|---|---|
| docker build -t <repo>:<tag> . | Build an image from the Dockerfile in the current directory and tag it. | Standard CI/CD build step. Always tag with both a semantic version and latest in production pipelines. |
| docker build --no-cache -t <repo>:<tag> . | Rebuild every layer from scratch, ignoring all cached layers. | When a base image has a security patch and every layer must be rebuilt on top of it. Add --pull as well: --no-cache alone may reuse a base image that is already present locally. |
| docker build --build-arg APP_VERSION=1.2.3 -t <repo>:1.2.3 . | Inject build-time variables into the Dockerfile. | Bake version strings, git SHAs, or feature flags into the image at build time for reproducibility. |
| docker build --target <stage> -t <repo>:dev . | Build only up to a named stage in a multi-stage Dockerfile. | Build a fat "developer" image with test tools without producing the slim production artifact. |
| docker tag <source>:<tag> <registry>/<repo>:<tag> | Create an additional tag pointing to the same image ID. | Promote an image from a staging registry to a production registry without rebuilding. |
| docker pull <image>:<tag> | Fetch the image from its registry to the local daemon cache. | Pre-pull images to a node before a time-sensitive deployment to eliminate pull latency from the critical path. |
| docker push <registry>/<repo>:<tag> | Upload the image to the registry. | End of every CI pipeline after image tests pass. |
| docker images --filter "dangling=true" | List untagged (dangling) image layers consuming disk space. | Pre-cleanup audit to see how much space dangling images are occupying before running prune. |
| docker rmi <image> | Remove one or more images from the local daemon. | Retire a specific image version after confirming rollback is no longer needed. |
| docker history <image>:<tag> | Show every layer of an image: the creating command, size, and timestamp. | Diagnose bloated images. Find which RUN step is responsible for excessive size. Essential for security audits to verify no secrets were accidentally committed to a layer. |
| docker history --no-trunc <image>:<tag> | Same as above but shows the full command without truncation. | When the truncated history hides the specific apt-get or curl command you need to audit. |
| docker save -o <file.tar> <image>:<tag> | Export an image to a tarball for air-gapped transfer. | Deploy to an environment with no internet access or a restricted registry. Transfer via SCP, then docker load. |
| docker load -i <file.tar> | Import an image from a tarball into the local daemon. | Air-gapped environment image ingestion. Pair with image signing to validate integrity. |
| docker image inspect <image>:<tag> | Dump the full JSON manifest: layers, env vars, entrypoint, exposed ports, labels. | Verify the build pipeline embedded correct labels (git SHA, build date), or confirm the entrypoint before deploying to production. |
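The docker history audit for leaked secrets can be scripted. A rough sketch, assuming a POSIX shell; `scan_image_history` is a hypothetical name and the keyword list is only illustrative, so a match is a prompt for manual review and a clean result is not proof of safety:

```shell
# Hypothetical helper: print history entries whose creating command mentions
# a credential-like keyword. Exits 0 when at least one entry matches.
# Usage: scan_image_history <image>:<tag>
scan_image_history() {
  docker history --no-trunc --format '{{.CreatedBy}}' "$1" \
    | grep -iE 'password|secret|token|api[_-]?key'
}
```

One possible use is as a gate before a push, for example `scan_image_history <repo>:<tag> && echo "review before pushing"`.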
Exec & Interactive Shell
Getting inside a running container. The most frequently used debugging primitive — know the options cold.
| Command | Description | Scenario |
|---|---|---|
| docker exec -it <container> /bin/bash | Open an interactive bash shell in a running container. | Live inspection — check running processes, environment variables, file contents, or network connectivity without restarting. |
| docker exec -it <container> /bin/sh | Open a POSIX sh shell (use when bash is not available in the image). | Minimal images (Alpine, distroless lite) where bash is not installed. |
| docker exec <container> env | Print all environment variables in the container's process space. | Verify that secrets injected at runtime (e.g., via --env-file or a secrets manager) are present with the correct values. |
| docker exec <container> cat /etc/hosts | Read a file without entering an interactive shell. | Quickly check DNS overrides or network configuration without the overhead of an interactive session. |
| docker exec -it -u root <container> /bin/bash | Enter the container as root regardless of the default USER setting. Without -it the shell has no terminal and exits immediately. | When the app runs as a non-root user but you need root privileges to install a debug tool or inspect a protected file. |
| docker exec <container> sh -c "ps aux | grep <process>" | Run a compound shell command in one line without an interactive shell. Leave out -it here: allocating a TTY fails when the command runs from cron or a CI job. | Scripted health checks or automated incident response playbooks that need to confirm a specific process is running inside the container. |
| docker cp <container>:/path/to/file ./local/ | Copy a file from a running container to the host. | Extract a crash dump, log file, or generated config from the container for analysis without terminating it. |
| docker cp ./local/file <container>:/path/to/ | Copy a file from the host into a running container. | Hot-patch a configuration file or inject a debug script without rebuilding the image (use with care — ephemeral change only). |
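Since bash is missing from many minimal images, the choice between the first two commands in this table can be automated. A small sketch, assuming the image ships at least /bin/sh (images with no shell at all will fail both branches); `container_shell` is a hypothetical name:

```shell
# Hypothetical helper: open bash when the image ships it, otherwise fall back to sh.
# Usage: container_shell <container>
container_shell() {
  # "command -v" is a POSIX shell builtin, so this probe needs only sh.
  if docker exec "$1" sh -c 'command -v bash' >/dev/null 2>&1; then
    docker exec -it "$1" bash
  else
    docker exec -it "$1" sh
  fi
}
```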
Log Analysis
Effective log triage during an incident. The --since and --until flags are critical for isolating the blast radius of a specific event.
| Command | Description | Scenario |
|---|---|---|
| docker logs <container> | Dump the full stdout/stderr log for a container. | First look at what a container printed before or during a crash. Caution: can produce enormous output for long-running services. |
| docker logs --tail 200 <container> | Show only the last N lines of log output. | Quick sanity check on a running service — see the most recent activity without scrolling through the entire history. |
| docker logs -f <container> | Follow (stream) logs in real time. Behaves like tail -f. | Watch a deployment roll out, monitor a batch job, or observe a container's behaviour during a load test live. |
| docker logs --since 30m <container> | Show logs from the last 30 minutes only. | Isolate logs to the window of an incident. Accepts durations (30m, 2h) or RFC3339 timestamps. |
| docker logs --since "2026-04-19T14:00:00Z" --until "2026-04-19T14:15:00Z" <container> | Retrieve logs for an exact time window using RFC3339 timestamps. | Post-incident log extraction for a SIEM or a support ticket. Pinpoint output to the 15-minute blast radius of an outage. |
| docker logs -f --tail 50 <container> 2>&1 | grep -i "error\|exception\|fatal" | Follow logs and filter for error-level messages in real time. | High-noise service emitting thousands of INFO lines — stream only the signals that matter during an incident bridge. |
| docker logs --timestamps <container> | Prepend an RFC3339Nano timestamp to each log line. | Correlate Docker container log lines with host-level events (kernel OOM killer, disk I/O spikes) that have precise timestamps. |
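Combining --timestamps with a --since/--until window makes it possible to see when errors clustered, not just that they occurred. A sketch under the assumption that error lines contain "error", "exception", or "fatal" (case-insensitive); `errors_per_minute` is a hypothetical name:

```shell
# Hypothetical helper: count error-level lines per minute inside a time window.
# Usage: errors_per_minute <container> <since> <until>
errors_per_minute() {
  docker logs --timestamps --since "$2" --until "$3" "$1" 2>&1 \
    | awk 'tolower($0) ~ /error|exception|fatal/ {
             # With --timestamps, field 1 is an RFC3339Nano timestamp;
             # the first 16 characters are YYYY-MM-DDTHH:MM.
             count[substr($1, 1, 16)]++
           }
           END { for (minute in count) print minute, count[minute] }' \
    | sort
}
```

Each output line is a minute followed by the number of matching lines in that minute, which is usually enough to mark the start of an outage on a timeline.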
Resource Monitoring & Deep Inspection
| Command | Description | Scenario |
|---|---|---|
| docker stats | Live, refreshing table of CPU %, memory usage/limit, network I/O, and block I/O for all running containers. | Real-time host health check during a traffic spike. Identify which container is consuming runaway resources before the host OOM-kills something. |
| docker stats --no-stream | Print a single snapshot of stats and exit — useful for scripting and alerting. | Feed into a cron-based alerting script or Prometheus textfile collector when a native exporter is not available. |
| docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}" | Custom output format showing only Name, CPU %, and Memory. | Screen-share-friendly view during an incident call — less noise, more signal. |
| docker top <container> | Show the running processes inside a container (like top on the host but scoped to the container's PID namespace). | Determine which subprocess inside a container is consuming CPU. Especially useful when the container is a shell script that spawns children. |
| docker inspect <container> | Dump the full JSON object for a container: network settings, mounts, env vars, health check config, restart policy, resource limits. | The authoritative source of truth for a container's configuration. When you suspect a misconfiguration (wrong mount path, missing env var, wrong network), start here. |
| docker inspect --format '{{.State.Health.Status}}' <container> | Extract a specific field from the inspect JSON using Go template syntax. | Quickly check health check status (healthy/unhealthy/starting) in a script without parsing the full JSON. |
| docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container> | Extract the container's IP address(es) across all attached networks. | Determine the container's IP for direct connectivity testing (bypassing port mappings) when debugging inter-container network failures. |
| docker diff <container> | Show files added (A), changed (C), or deleted (D) in the container's writable layer vs the image. | Security audit — confirm that no unexpected files have been written to the container filesystem. Useful in forensics after a suspected compromise. |
| docker port <container> | List all port mappings for a container (host port → container port). | Confirm that the expected ports are exposed and mapped correctly after a deployment or when investigating connectivity failures. |
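The health-status template above lends itself to a polling loop in deploy scripts. A minimal sketch, assuming the image defines a HEALTHCHECK (for containers without one the template may fail, and the loop simply runs out of polls); `wait_healthy` and its return codes are hypothetical conventions:

```shell
# Hypothetical helper: poll the health status until it settles.
# Returns 0 = healthy, 1 = unhealthy, 2 = gave up.
# Usage: wait_healthy <container> [max-polls] [seconds-between-polls]
wait_healthy() {
  name=$1
  polls=${2:-30}
  interval=${3:-2}
  while [ "$polls" -gt 0 ]; do
    # Treat any inspect failure as "unknown" and keep polling.
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null) \
      || status=unknown
    case $status in
      healthy)   return 0 ;;
      unhealthy) return 1 ;;
    esac
    polls=$((polls - 1))
    sleep "$interval"
  done
  return 2
}
```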
Housekeeping & Disk Reclamation
docker system prune -a removes all images not used by at least one container, including images needed for rapid rollback. The prune commands have no dry-run mode, so review what would be removed first (docker system df -v, docker images, docker ps -a) and narrow the scope with the --filter until option before executing destructive prune commands.
| Command | Description | Scenario |
|---|---|---|
| docker system df | Show disk usage breakdown: images, containers, volumes, and build cache — with reclaimable sizes. | Pre-prune assessment. Run this first to quantify the problem before deciding which prune command to use. |
| docker container prune | Remove all stopped containers. Prompts for confirmation. | Routine host cleanup on CI runners after builds complete, so stopped containers and their writable layers do not pile up on disk. |
| docker container prune -f | Remove all stopped containers without prompting. Force mode. | Non-interactive scripts and cron jobs for automated host hygiene. |
| docker image prune | Remove only dangling images (layers with no tag and not referenced by any container). | Safe, targeted cleanup after repeated builds. Reclaims space from intermediate layers without touching tagged images needed for rollback. |
| docker image prune -a --filter "until=72h" | Remove all images (not just dangling) that were created more than 72 hours ago and are not in use. | Scheduled maintenance window cleanup on a build host where images older than 3 days are no longer needed for rollback. |
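| docker volume prune | Remove all anonymous volumes not associated with a running or stopped container. Add -a to include unused named volumes. | Recover disk space consumed by orphaned database volumes left over from removed containers. Named volumes are preserved by default on Docker 23.0 and later; older releases removed unused named volumes as well. |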
| docker volume prune --filter "label!=keep" | Prune volumes that do not carry a keep label. | Fine-grained cleanup strategy: label critical volumes at creation time with --label keep=true to protect them from bulk prune operations. |
| docker network prune | Remove all user-defined networks not in use by any container. | Tidy up after docker-compose teardowns that leave orphaned overlay or bridge networks. |
| docker builder prune | Remove BuildKit build cache. | Reclaim build cache space on CI nodes where cache has grown large over many runs. |
| docker builder prune --keep-storage 5gb | Prune build cache but keep the most recent 5 GB (useful for preserving recent cache for faster rebuilds). | Balance between disk reclamation and build speed. Keeps enough cache to accelerate common rebuild patterns. |
| docker system prune | Remove all stopped containers, dangling images, unused networks, and build cache in one operation. | Comprehensive cleanup on dev/staging hosts. Does not remove volumes or non-dangling images by default. |
| docker system prune -a --volumes | Nuclear option: removes everything not associated with a running container, including all unused images and unused volumes (anonymous volumes only on Docker 23.0 and later). | Decommissioning a host, or recovering a host that has run completely out of disk space. Ensure every image you still need has been pushed to a registry before running. |
    # Practical disk-reclamation workflow

    # Step 1: Assess
    docker system df

    # Step 2: Safe targeted cleanup (removes stopped containers + dangling images
    # + unused networks + unused build cache)
    docker system prune -f

    # Step 3: If still tight, remove images older than 72h not in active use
    docker image prune -a --filter "until=72h" -f

    # Step 4: Remove orphaned anonymous volumes (careful: confirm named volumes are excluded)
    docker volume ls -f dangling=true
    docker volume prune -f

    # Step 5: Verify reclamation
    docker system df
Contexts & Namespaces
Context and namespace mismatches cause the most expensive incidents — running a delete command against the wrong cluster. Adopt strict habits here.
Run kubectl config current-context as the first line of every runbook. Consider tools like kubectx/kubens, or a prompt helper such as kube-ps1, to make the context visible in your shell prompt at all times.
| Command | Description | Scenario |
|---|---|---|
| kubectl config get-contexts | List all contexts in your kubeconfig with their cluster, user, and namespace. | At the start of a shift or before any multi-cluster operation to confirm which clusters are available and which is active. |
| kubectl config current-context | Print the name of the currently active context. | First command of every runbook. Confirm you are on the correct cluster before proceeding with any state-modifying operation. |
| kubectl config use-context <context-name> | Switch the active context to a different cluster/user/namespace combination. | Switch between prod, staging, and dev clusters during a multi-cluster incident or cross-environment investigation. |
| kubectl config set-context --current --namespace=<ns> | Set the default namespace for all subsequent kubectl commands in the current context. | Avoid accidentally omitting -n <ns> when working in a non-default namespace for an extended period. Reset to default when done. |
| kubectl get namespaces | List all namespaces in the cluster with their status and age. | Audit namespace sprawl. Identify stale namespaces from old feature branches or forgotten load tests that are consuming quota. |
| kubectl create namespace <ns> | Create a new namespace. | Bootstrap a new team or environment namespace before applying their workloads and RBAC policies. |
| kubectl get all -n <ns> | List the resource types in the all category (pods, services, Deployments, ReplicaSets, StatefulSets, DaemonSets, Jobs, CronJobs, HPAs) in a namespace. | Rapid triage of a namespace's workload state during an incident. Despite the name, it omits ConfigMaps, Secrets, Ingresses, PVCs, and most other types. |
| kubectl api-resources --verbs=list --namespaced=true | List all API resource types that are namespace-scoped. | When get all does not return certain resources (e.g., Ingress, PDB, NetworkPolicy) because they are not part of the all category. |
    # Shell function: context switcher that echoes the target and confirms the result
    kctx() {
      echo "Switching context to: $1"
      kubectl config use-context "$1"
      kubectl config current-context
    }

    # Equivalent of kubens: set the default namespace
    kubectl config set-context --current --namespace=production

    # Always reset after a focused session
    kubectl config set-context --current --namespace=default
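The habit of checking the context before any state-modifying command can be enforced rather than remembered. A minimal sketch, assuming a POSIX shell; `require_context` is a hypothetical helper and the context names are whatever your kubeconfig defines:

```shell
# Hypothetical guard: succeed only when the active context is the expected one.
# Usage: require_context <expected-context> && <destructive kubectl command>
require_context() {
  expected=$1
  actual=$(kubectl config current-context) || return 1
  if [ "$actual" != "$expected" ]; then
    echo "refusing: current context is '$actual', expected '$expected'" >&2
    return 1
  fi
}
```

Chained as `require_context <context-name> && kubectl delete -f <manifest.yaml>`, the delete never runs against a cluster you did not name explicitly.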
Node Management
Node-level operations for routine maintenance, hardware replacement, and kernel upgrades. The cordon → drain → maintain → uncordon sequence is the safe pattern.
| Command | Description | Scenario |
|---|---|---|
| kubectl get nodes | List all nodes with their status, roles, Kubernetes version, and age. | First step when pods cannot be scheduled. Identify NotReady nodes that may have lost kubelet connectivity or failed health checks. |
| kubectl get nodes -o wide | Extended view including internal/external IPs, OS image, kernel version, and container runtime. | Identify nodes running old kernel versions before a patching campaign, or confirm which IPs belong to which nodes when debugging network policies. |
| kubectl describe node <node-name> | Full node detail: capacity, allocatable resources, current conditions, events, taints, and list of all pods scheduled on it. | Diagnose a NotReady node — check the Conditions section for disk pressure, memory pressure, or PID pressure. Review Events for kubelet errors. |
| kubectl top nodes | Live CPU and memory consumption per node (requires Metrics Server to be running). | Identify hot nodes during a traffic spike. A node at 95 % memory usage is the likely target of the next OOM kill — act before it happens. |
| kubectl top nodes --sort-by=memory | Sort node resource usage by memory consumption (descending). | Quickly identify the most memory-pressured node in a cluster during a memory incident. |
| kubectl cordon <node-name> | Mark a node as unschedulable. Existing pods continue running; no new pods will be placed on this node. | Step 1 of node maintenance. Use before draining to ensure the scheduler stops landing new pods on the node while you prepare to drain it. |
| kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data | Evict all non-DaemonSet pods from the node. DaemonSet pods are ignored. Pods using emptyDir will lose that data. | Step 2 of node maintenance. Gracefully reschedule workloads to other nodes before applying OS patches, replacing hardware, or upgrading the node's kubelet. Always check for PodDisruptionBudgets — drain respects them by default. |
| kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=120 | Drain with a 120-second grace period for each pod to terminate cleanly. The flag overrides each pod's own terminationGracePeriodSeconds. | When stateful services (JVM apps, databases) need more time for graceful shutdown than the default 30 seconds allows. |
| kubectl drain <node-name> --force --ignore-daemonsets | Force drain even when some pods are not managed by a controller; those pods are deleted and not recreated. --force does not bypass PodDisruptionBudgets; that requires --disable-eviction. | Emergency node removal where you accept data loss risk: the node has failing hardware and the cluster cannot wait for clean eviction. |
| kubectl uncordon <node-name> | Mark the node as schedulable again so the scheduler can place new pods on it. | Step 3 of node maintenance. Run after patching/rebooting the node and confirming kubelet is Ready. Existing pods are not moved back automatically; the node fills up only as new pods are scheduled. |
| kubectl taint nodes <node-name> key=value:NoSchedule | Add a taint that prevents pods without a matching toleration from being scheduled on the node. | Reserve dedicated nodes for specific workloads (GPU jobs, high-memory services) or to quarantine a misbehaving node without fully cordoning it. |
| kubectl taint nodes <node-name> key=value:NoSchedule- | Remove a taint from a node (the trailing - is the removal syntax). | Undo a quarantine taint after confirming the node issue is resolved and it is safe to receive general workloads again. |
    # Safe node maintenance runbook (3-step procedure)

    # 1. Prevent new pods from landing on the node
    kubectl cordon node-1.prod.internal

    # 2. Drain existing workloads (respects PodDisruptionBudgets)
    kubectl drain node-1.prod.internal \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --grace-period=120

    # ... perform maintenance (kernel upgrade, hardware swap, etc.) ...

    # 3. Return node to service
    kubectl uncordon node-1.prod.internal

    # Verify the node is Ready and accepting workloads
    kubectl get node node-1.prod.internal
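The final step of the runbook can be made stricter by refusing to uncordon until the kubelet reports Ready. A short sketch using kubectl wait; `return_node_to_service` is a hypothetical name and the 300-second default timeout is an arbitrary assumption:

```shell
# Hypothetical helper: uncordon only after the node reports condition Ready.
# Usage: return_node_to_service <node-name> [timeout]
return_node_to_service() {
  kubectl wait --for=condition=Ready "node/$1" --timeout="${2:-300s}" || return 1
  kubectl uncordon "$1"
}
```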
Daily Operations
| Command | Description | Scenario |
|---|---|---|
| kubectl get pods -n <ns> | List pods in a namespace with their status, restarts, and age. | The most-used command in day-to-day operations. Check pod health after a deployment, during an incident, or after a node event. |
| kubectl get pods -n <ns> -o wide | Extended view with the node each pod is running on and its internal IP. | Identify whether failing pods are all on the same node (node-level issue) or spread across nodes (application-level issue). |
| kubectl get pods -n <ns> --sort-by='.status.containerStatuses[0].restartCount' | Sort pods by restart count — highest restarters at the bottom. | Quickly spot the most crash-looping pod in a namespace during a noisy incident with many pods restarting. |
| kubectl get pods -A --field-selector=status.phase!=Running | List pods across all namespaces that are not in the Running phase. | Cluster-wide health sweep: identify Pending, Failed, or Evicted pods across all teams at a glance. Completed Job pods (phase Succeeded) appear in the output too. |
| kubectl describe pod <pod-name> -n <ns> | Full pod detail: events, init container status, resource requests/limits, image, node, volumes, readiness/liveness probe config. | Diagnose CrashLoopBackOff, ImagePullBackOff, Pending, or OOMKilled status. The Events section is the most valuable part during debugging. |
| kubectl get deployments -n <ns> | List deployments with ready (ready/desired), up-to-date, and available replica counts. | Post-deployment verification. The ready count must match desired before the deployment is considered successful. |
| kubectl describe deployment <name> -n <ns> | Deployment detail: strategy, replica sets, conditions, and events. | Diagnose a deployment stuck in progress — check the Conditions for Progressing=False and the Events for the root cause. |
| kubectl get services -n <ns> | List services with their type, cluster IP, external IP, and target port. | Verify that a LoadBalancer service has received an external IP from the cloud provider, or that a ClusterIP service is assigned. |
| kubectl apply -f <manifest.yaml> | Apply a manifest declaratively — creates or updates resources as needed. | Standard GitOps deployment verb. Idempotent — safe to run repeatedly. Prefer over create in all production workflows. |
| kubectl apply -f <manifest.yaml> --dry-run=server | Validate the manifest against the live API server without making changes. Catches schema errors and admission webhook rejections. | Pre-flight check in CI pipelines before applying to production. Catches invalid YAML, missing fields, and policy violations from OPA/Kyverno. |
| kubectl delete -f <manifest.yaml> | Delete all resources defined in a manifest file. | Clean teardown of a specific workload (feature branch, load test environment) without having to know the individual resource names. |
| kubectl delete pod <pod-name> -n <ns> | Delete a pod. The controlling ReplicaSet or Deployment will immediately create a replacement. | Force a pod replacement without restarting the Deployment. Useful when one replica in a set is misbehaving (stuck goroutines, bad local cache). |
| kubectl get endpoints <service-name> -n <ns> | Show the IP:port pairs that a Service resolves to (the ready pods behind the service). | Diagnose "service returns 502/503". If the endpoints list is empty, either the selector matches no pods or none of the matching pods is Ready; it is not a networking issue. |
| kubectl get hpa -n <ns> | List HorizontalPodAutoscalers with current vs target metrics and min/max replicas. | Understand why a deployment has an unexpected replica count. HPA may have scaled down or up independently of manual changes. |
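The empty-endpoints check is easy to script for runbooks. A minimal sketch, assuming the classic Endpoints object layout (ready addresses under .subsets[].addresses); `service_has_endpoints` is a hypothetical name:

```shell
# Hypothetical helper: succeed only when the Service has at least one ready endpoint.
# Usage: service_has_endpoints <service-name> <namespace>
service_has_endpoints() {
  ips=$(kubectl get endpoints "$1" -n "$2" \
          -o jsonpath='{.subsets[*].addresses[*].ip}') || return 1
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $1: check the selector and pod readiness" >&2
    return 1
  fi
  # Print the ready pod IPs, space-separated.
  echo "$ips"
}
```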
Rollouts & Scaling
| Command | Description | Scenario |
|---|---|---|
| kubectl scale deployment <name> --replicas=10 -n <ns> | Immediately set the desired replica count for a Deployment. | Manual scale-out during an unexpected traffic surge before HPA kicks in, or manual scale-in to reduce costs in a quiet period. |
| kubectl rollout status deployment/<name> -n <ns> | Watch and block until a rollout completes (all replicas updated and healthy) or reports an error. | Gate the next step in a CI/CD pipeline. Do not mark a deploy as "done" until this command exits 0. |
| kubectl rollout status deployment/<name> -n <ns> --timeout=5m | Rollout status with a 5-minute timeout. Exits non-zero if the rollout does not complete in time. | Protect CI pipelines from hanging indefinitely when a broken image causes pods to CrashLoop and the rollout never completes. |
| kubectl rollout history deployment/<name> -n <ns> | Show the revision history for a Deployment — revision number and change cause. | Before a rollback, confirm which revision to target. Use --record flag on apply (deprecated) or annotate with kubernetes.io/change-cause for human-readable history. |
| kubectl rollout history deployment/<name> --revision=3 -n <ns> | Show the detailed spec of a specific revision — the exact image, environment variables, and resource settings at that revision. | Confirm what version was running before the incident so you can validate the rollback target before executing it. |
| kubectl rollout undo deployment/<name> -n <ns> | Roll back to the previous revision. The rollback itself proceeds as a rolling update. | Primary rollback command during an incident. When a bad deploy is in progress and you need to revert immediately, this is faster than re-running the pipeline. |
| kubectl rollout undo deployment/<name> --to-revision=3 -n <ns> | Roll back to a specific numbered revision. | When "the last version" is also broken (back-to-back bad deployments) and you need to target a specific known-good revision from history. |
| kubectl rollout restart deployment/<name> -n <ns> | Trigger a rolling restart of all pods in a deployment without changing the image or config. | Flush a bad in-memory cache, force config re-reads (ConfigMap or Secret values consumed as environment variables or read only at startup), or recover from goroutine leaks without a full re-deploy. |
| kubectl set image deployment/<name> <container>=<image>:<tag> -n <ns> | Update the image for a specific container in a Deployment imperatively (triggers a rolling update). | Emergency hotfix deployment when the GitOps pipeline is too slow and you need to push a specific image tag immediately. |
| kubectl patch deployment <name> -p '{"spec":{"paused":true}}' -n <ns> | Pause a Deployment rollout. Changes accumulate without being applied. | Batch multiple spec changes (image, env vars, resource limits) into a single rollout event to reduce disruption. |
| kubectl patch deployment <name> -p '{"spec":{"paused":false}}' -n <ns> | Resume a paused Deployment. All accumulated changes are applied as a single rollout. | After batching changes on a paused deployment, resume to trigger one cohesive rolling update. |
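The set image, rollout status, and rollout undo commands chain naturally into a gated deploy. A sketch under stated assumptions: `deploy_or_rollback` is a hypothetical name, the 5-minute timeout mirrors the table above, and an automatic undo may not suit every pipeline:

```shell
# Hypothetical helper: update one container image, wait for the rollout,
# and roll back if it does not complete in time.
# Usage: deploy_or_rollback <deployment> <namespace> <container> <image>:<tag>
deploy_or_rollback() {
  name=$1
  ns=$2
  container=$3
  image=$4
  kubectl set image "deployment/$name" "$container=$image" -n "$ns" || return 1
  # rollout status exits non-zero on failure or when the timeout expires.
  if ! kubectl rollout status "deployment/$name" -n "$ns" --timeout=5m; then
    kubectl rollout undo "deployment/$name" -n "$ns"
    return 1
  fi
}
```

The function returns non-zero after a rollback so that the calling pipeline still marks the deploy as failed.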
Log Aggregation
| Command | Description | Scenario |
|---|---|---|
| kubectl logs <pod-name> -n <ns> | Dump logs from a pod's primary container (first container in spec). | The first log command in any pod investigation. If the pod has multiple containers, use -c to specify which one. |
| kubectl logs <pod-name> -c <container> -n <ns> | Logs from a specific named container within a multi-container pod. | Sidecar architectures (Envoy proxy, Vault agent, log shipper) where the main app and the sidecar each have separate logs. |
| kubectl logs <pod-name> -n <ns> -f | Follow (stream) log output in real time. | Watch application behaviour live during a load test, a specific user action, or a manually triggered event. |
| kubectl logs <pod-name> -n <ns> --tail=100 | Show only the last 100 lines. | When a pod has been running for days and you only need recent activity — avoids megabytes of historical output. |
| kubectl logs <pod-name> -n <ns> --since=1h | Show logs from the last 1 hour. Accepts s, m, h. | Post-incident log scoping: limit output to the incident window to avoid analysis paralysis from unrelated log noise. |
| kubectl logs <pod-name> -n <ns> -p | Logs from the previously terminated container instance in the same pod. | The most important flag for CrashLoopBackOff diagnosis. The current container may be alive but the crash happened in the previous instance. -p retrieves what was printed before the crash. |
| kubectl logs -l app=<label-value> -n <ns> --all-containers=true | Print logs from all pods matching a label selector, including all containers within each pod. With a selector, only the last 10 lines per container are shown unless you set --tail; add -f to stream. | Aggregate logs from all replicas of a Deployment simultaneously. Essential when the error is intermittent and you cannot predict which replica will produce it next. |
| kubectl logs -l app=<label-value> -n <ns> -f --max-log-requests=20 | Follow logs from up to 20 pods matching a label in parallel. | Real-time streaming from multiple replicas during a canary deployment to compare behaviour between old and new pod versions. |
| kubectl logs <pod-name> -n <ns> --timestamps=true | Prepend RFC3339 timestamps to each log line. | Correlate pod logs with external system events (database slow queries, CDN logs, ALB access logs) that use wall-clock timestamps. |
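For CrashLoopBackOff triage, the -p flag is worth trying first, with the current instance as a fallback when no previous instance exists. A small sketch; `crash_logs` is a hypothetical name and the 100-line default is an arbitrary assumption:

```shell
# Hypothetical helper: prefer the previous (crashed) instance's logs,
# falling back to the current instance when -p has nothing to return.
# Usage: crash_logs <pod-name> <namespace> [tail-lines]
crash_logs() {
  kubectl logs "$1" -n "$2" -p --tail="${3:-100}" 2>/dev/null \
    || kubectl logs "$1" -n "$2" --tail="${3:-100}"
}
```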
Network Debugging & Port Forwarding
Bypass ingress controllers and service meshes entirely to test direct connectivity to a specific pod or service. The fastest way to isolate whether an issue is in the application or the network layer.
| Command | Description | Scenario |
|---|---|---|
| kubectl port-forward pod/<pod-name> 8080:8080 -n <ns> | Tunnel local port 8080 directly to port 8080 on the named pod, bypassing all services, ingresses, and load balancers. | Confirm whether an application bug is real or an infrastructure routing issue. If the app responds correctly via port-forward but not via the service, the problem is in the service/ingress layer. |
| kubectl port-forward svc/<service-name> 5432:5432 -n <ns> | Tunnel to a Service (Kubernetes routes to a backing pod via the service's selector). | Access an in-cluster database (Postgres, Redis, MySQL) from your local machine for debugging without exposing it externally. |
| kubectl port-forward deployment/<name> 9090:9090 -n <ns> | Tunnel to a pod selected by a Deployment (Kubernetes picks one replica). | Access a Prometheus or metrics endpoint on one replica for manual scraping or dashboard setup without creating an Ingress. |
| kubectl port-forward pod/<pod-name> 8080:8080 -n <ns> --address 0.0.0.0 | Bind the forwarded port to all interfaces, making it accessible from other hosts on the network. | Share the tunnel with a colleague during pair debugging or make it accessible from a load testing tool on another machine in the same VPN. |
| kubectl run netdebug --rm -it --image=nicolaka/netshoot -n <ns> -- /bin/bash | Spin up a throwaway pod with a comprehensive network debugging toolkit (curl, dig, netstat, tcpdump, nmap, etc.) inside the cluster. | Test DNS resolution, HTTP connectivity, TCP reachability, and network policies from within the cluster network — perspectives impossible to get from the host. |
| kubectl exec <pod> -n <ns> -- curl -sv http://<service-name>.<ns>.svc.cluster.local/health | Make an HTTP request from one pod to another service using Kubernetes DNS, printing connection details and request/response headers (-v). | Validate in-cluster DNS resolution and service-to-service HTTP connectivity without installing tools on the host machine. |
| kubectl exec <pod> -n <ns> -- nslookup <service-name> | DNS lookup for a Kubernetes service from inside a pod. | Diagnose DNS-related service discovery failures. Confirm that the CoreDNS entry exists and resolves to the correct ClusterIP. |
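For repeated checks, the in-cluster curl test can be wrapped in a small function that prints only the HTTP status code. This is a sketch under assumptions: the helper names `svc_fqdn` and `svc_status` are hypothetical, the source pod must have curl installed, the cluster domain is the default `cluster.local`, and `/health` is only a placeholder path.

```shell
# Hypothetical helper: build the in-cluster DNS name of a Service.
# Usage: svc_fqdn <service-name> <ns>
svc_fqdn() { printf '%s.%s.svc.cluster.local' "$1" "$2"; }

# Hypothetical helper: print the HTTP status code a pod sees for a Service.
# Usage: svc_status <from-pod> <ns> <service-name> [path]
svc_status() {
  from_pod="$1"; ns="$2"; svc="$3"; path="${4:-/health}"
  url="http://$(svc_fqdn "$svc" "$ns")$path"
  # -s silent, -o /dev/null discards the body, -w prints the status code,
  # --max-time bounds the whole request to 5 seconds.
  kubectl exec "$from_pod" -n "$ns" -- \
    curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 "$url"
}
```

A status of `000` from curl generally means the connection itself failed (DNS, routing, or a network policy), which points at the network layer rather than the application.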
Interactive Debugging & Ephemeral Containers
When a container is distroless or scratch-based and ships no shell, kubectl exec is useless. kubectl debug with ephemeral containers is the solution: it attaches a debug container that shares the target container's process namespace, without restarting the pod.
| Command | Description | Scenario |
|---|---|---|
| kubectl exec -it <pod-name> -n <ns> -- /bin/bash | Open an interactive shell in the pod's primary container. | Inspect the filesystem, test environment variables, or run diagnostic commands inside a container that has a shell available. |
| kubectl exec -it <pod-name> -c <container> -n <ns> -- /bin/sh | Exec into a specific named container in a multi-container pod. | Debug a specific sidecar (e.g., the Envoy proxy container) without affecting the main application container. |
| kubectl exec <pod-name> -n <ns> -- env | sort | Print and sort all environment variables without entering interactive mode. | Verify secret injection — confirm the expected keys are present and the values are not empty (avoid printing sensitive values in logs). |
| kubectl debug -it <pod-name> --image=busybox --target=<container> -n <ns> | Attach an ephemeral debug container sharing the process namespace of <container>. Can inspect its /proc, filesystem, and network stack. | Debugging distroless or scratch-based containers that have no shell. The ephemeral container shares the PID namespace so you can inspect the running process's file descriptors, open sockets, and memory maps with tools from the debug image. |
| kubectl debug -it <pod-name> --image=nicolaka/netshoot --target=<container> -n <ns> | Attach a netshoot ephemeral container (full network toolkit) to a distroless pod. | Run tcpdump, ss, or strace against a distroless production pod without restarting it and without embedding debug tools in the production image. |
| kubectl debug node/<node-name> -it --image=ubuntu -n <ns> | Spawn a debug pod on a specific node. It runs in the host's PID, network, and IPC namespaces with the node's root filesystem mounted at /host, although the pod itself is not privileged by default. | Node-level debugging: inspect the kubelet's logs, examine the container runtime (containerd/cri-o), or investigate a kernel issue from within the cluster without SSH access to the node. |
| kubectl debug <pod-name> --copy-to=<debug-pod-name> --image=<debug-image> -n <ns> | Create a copy of the pod with an extra debug container running the given image. Does not affect the original pod. To swap the image of an existing container in the copy instead, use --set-image. | When you need to test behaviour with a debug image but cannot tolerate any risk to the live pod. Useful for reproducing intermittent issues with full debugging tools. |
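The choice between exec and an ephemeral container can be scripted. The sketch below probes for `/bin/sh` and falls back to `kubectl debug` when the probe fails. The function name `shell_into` and the `busybox` default are assumptions for illustration. The probe also fails for other reasons (wrong pod name, RBAC), so treat the fallback as a convenience and not a diagnosis.

```shell
# Hypothetical helper: open a shell in a container, falling back to an
# ephemeral debug container when the image ships no shell.
# Usage: shell_into <pod-name> <container> <ns> [debug-image]
shell_into() {
  pod="$1"; ctr="$2"; ns="$3"; image="${4:-busybox}"

  # Non-interactive probe: succeeds only if /bin/sh can run in the container.
  if kubectl exec "$pod" -c "$ctr" -n "$ns" -- /bin/sh -c 'exit 0' 2>/dev/null; then
    kubectl exec -it "$pod" -c "$ctr" -n "$ns" -- /bin/sh
  else
    # Likely a distroless or scratch image: attach an ephemeral container
    # that targets the same process namespace instead.
    kubectl debug -it "$pod" --image="$image" --target="$ctr" -n "$ns"
  fi
}
```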
Event Tracking
Kubernetes events are the cluster's audit trail for scheduling, image pulls, volume mounts, probe failures, and more. They expire after ~1 hour by default — capture them early in an incident.
| Command | Description | Scenario |
|---|---|---|
| kubectl get events -n <ns> | List all events in a namespace with type (Normal/Warning), reason, object, and message. | First step when pods are in Pending, CrashLoopBackOff, or OOMKilled states. The event usually tells you exactly why. |
| kubectl get events -n <ns> --sort-by='.lastTimestamp' | Sort events chronologically — most recent at the bottom. | Establish a timeline of failures. See the sequence of events that led to the current broken state (e.g., ImagePullBackOff → OOMKilled → CrashLoopBackOff). |
| kubectl get events -n <ns> --field-selector reason=BackOff | Filter events by a specific reason code. | Quickly count and isolate all BackOff events in a namespace when you suspect a widespread image pull or container start failure. |
| kubectl get events -n <ns> --field-selector type=Warning --sort-by='.lastTimestamp' | Show only Warning-level events, sorted chronologically. | Cut through Normal events during an incident and focus only on conditions that represent problems. |
| kubectl get events -A --sort-by='.lastTimestamp' | tail -30 | Show the 30 most recent events across the entire cluster. | Cluster-wide incident triage — get a panoramic view of what is failing across all namespaces simultaneously. |
| kubectl describe pod <pod-name> -n <ns> | grep -A 20 Events: | Extract and display the Events section from a pod description. | Fastest way to see events scoped to a single pod — especially useful when there are many events in the namespace and you only care about one pod. |
| kubectl get events -n <ns> -w | Watch the events stream live — new events appear as they are created. | Monitor a deployment, a rolling restart, or a node drain operation in real time and see exactly what the control plane is doing step by step. |
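Because events expire, it can help to write them to a file as soon as an incident starts. The following is a minimal sketch. The function name `save_warning_events` and the file naming scheme are hypothetical.

```shell
# Hypothetical helper: snapshot Warning events to a timestamped file.
# Usage: save_warning_events <ns> [out-dir]   (prints the file path)
save_warning_events() {
  ns="$1"; out="${2:-.}"
  file="$out/events-$ns-$(date -u +%Y%m%dT%H%M%SZ).txt"
  kubectl get events -n "$ns" --field-selector type=Warning \
    --sort-by='.lastTimestamp' -o wide > "$file" || return 1
  printf '%s\n' "$file"
}
```

The printed path can be attached directly to the incident ticket, so the timeline survives after the events themselves have aged out of the API server.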
Clearing Stuck Terminating Pods
| Command | Description | Scenario |
|---|---|---|
| kubectl get pods -n <ns> | grep Terminating | List pods currently stuck in the Terminating state. | Audit step before cleanup — confirm which pods are actually stuck vs which are mid-graceful-shutdown and should be left alone. |
| kubectl delete pod <pod-name> -n <ns> --grace-period=0 --force | Bypass the grace period and immediately remove the pod record from the API server, even if the kubelet has not confirmed termination. | Pod is stuck in Terminating and the node it ran on is dead or the kubelet is unreachable. The pod will never self-terminate — this removes the API object. The container process may still be running on the node. |
| kubectl patch pod <pod-name> -p '{"metadata":{"finalizers":[]}}' --type=merge -n <ns> | Remove all finalizers from a pod, allowing the API server to proceed with deletion. | A finalizer (e.g., from a service mesh, a storage driver, or a custom operator) is preventing deletion. Only safe when the finalizer's purpose (resource cleanup) has been completed out-of-band. |
| kubectl get pods -n <ns> --field-selector=status.phase=Failed -o name | xargs kubectl delete -n <ns> | Find all pods in Failed phase in a namespace and delete them. | Post-incident cleanup of failed pods that are not managed by a controller (standalone pods) or were left orphaned after a failed Job. |
    # Force-delete a stuck Terminating pod (use only when node is known unreachable)
    kubectl delete pod <pod-name> -n <ns> --grace-period=0 --force

    # If it's still stuck, the finalizer is the culprit. Remove it:
    kubectl patch pod <pod-name> -n <ns> \
      -p '{"metadata":{"finalizers":[]}}' \
      --type=merge
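Force deletion is easy to over-apply, so one option is a wrapper that only prints the commands unless explicitly confirmed. This is a sketch under assumptions: the function names and the `FORCE=yes` switch are hypothetical, and the pod list is taken from the STATUS column of `kubectl get pods --no-headers`.

```shell
# Hypothetical helper: list pods whose STATUS column reads Terminating.
# Usage: terminating_pods <ns>
terminating_pods() {
  kubectl get pods -n "$1" --no-headers | awk '$3 == "Terminating" { print $1 }'
}

# Force-delete them only when FORCE=yes is set; otherwise print the commands.
# Usage: FORCE=yes force_delete_terminating <ns>
force_delete_terminating() {
  ns="$1"
  terminating_pods "$ns" | while read -r pod; do
    if [ "${FORCE:-no}" = "yes" ]; then
      kubectl delete pod "$pod" -n "$ns" --grace-period=0 --force
    else
      echo "would run: kubectl delete pod $pod -n $ns --grace-period=0 --force"
    fi
  done
}
```

Matching on the exact column value avoids the false positives a plain `grep Terminating` can produce when the word appears in a pod name.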
Bulk Cleanup of Failed & Evicted Pods
Over time, nodes under memory pressure evict pods, and failed Jobs leave pod carcasses. These consume etcd storage and create namespace noise that masks real problems.
| Command | Description | Scenario |
|---|---|---|
| kubectl get pods -A --field-selector=status.phase=Failed | List all Failed pods across all namespaces. | Cluster-wide audit before cleanup. Understand the scale of the problem and which namespaces are most affected. |
| kubectl delete pods -n <ns> --field-selector=status.phase=Failed | Delete all Failed pods in a specific namespace. | Namespace-scoped cleanup after a failed batch job run or a period of high eviction activity. |
| kubectl get pods -A --field-selector=status.phase=Failed -o json | jq -r '.items[] | .metadata.namespace + "/" + .metadata.name' | xargs -I{} bash -c 'ns=$(echo {} | cut -d/ -f1); pod=$(echo {} | cut -d/ -f2); kubectl delete pod $pod -n $ns' | Delete all Failed pods across every namespace in the cluster. | Full cluster cleanup sweep after a widespread eviction event or a mass Job failure. Run in a maintenance window. |
| kubectl get pods -n <ns> -o json | jq -r '.items[] | select(.status.reason=="Evicted") | .metadata.name' | xargs kubectl delete pod -n <ns> | Find and delete all Evicted pods in a namespace by parsing the status reason field. | After a node memory pressure event, evicted pods remain in the Failed phase with reason Evicted and do not self-clean. This reclaims etcd space and clears the pod listing noise. |
| kubectl get pods -A -o json | jq -r '.items[] | select(.status.reason=="Evicted") | [.metadata.namespace, .metadata.name] | @tsv' | while IFS=$'\t' read -r ns pod; do kubectl delete pod "$pod" -n "$ns"; done | Cluster-wide deletion of all Evicted pods across all namespaces. | Large-scale post-incident sweep when evictions occurred across multiple namespaces simultaneously (e.g., after a cluster-wide memory spike or a node flap). |
| kubectl get pods -n <ns> | awk '/Error/{print $1}' | xargs kubectl delete pod -n <ns> | Delete pods in Error state using awk for lightweight field parsing. | Scriptable, dependency-free cleanup for Error state pods on hosts where jq is not available. |
    # Comprehensive eviction cleanup: safe one-liner for a single namespace
    kubectl get pods -n production -o json \
      | jq -r '.items[] | select(.status.reason=="Evicted") | .metadata.name' \
      | xargs kubectl delete pod -n production

    # Count evicted pods before deleting (audit step)
    kubectl get pods -n production -o json \
      | jq '[.items[] | select(.status.reason=="Evicted")] | length'
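To see which namespaces are most affected before any deletion, a per-namespace tally is often enough. The sketch below avoids jq and reads the STATUS column of the cluster-wide pod listing. The function name `evicted_by_namespace` is hypothetical.

```shell
# Hypothetical helper: count Evicted pods per namespace, largest first.
# With -A the columns are NAMESPACE NAME READY STATUS ..., so STATUS is $4.
evicted_by_namespace() {
  kubectl get pods -A --no-headers \
    | awk '$4 == "Evicted" { count[$1]++ } END { for (ns in count) print count[ns], ns }' \
    | sort -rn
}
```

Each output line is a count followed by the namespace name, which makes it easy to decide where a namespace-scoped cleanup is worth running first.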
Cleaning Up Orphaned ReplicaSets
Kubernetes retains old ReplicaSets by default (controlled by revisionHistoryLimit on the Deployment, default 10). While this enables rollback, it creates clutter in large or long-running clusters. Zero-replica ReplicaSets that you no longer need as rollback targets are safe to remove; deleting one also removes that revision from the Deployment's rollout history.
| Command | Description | Scenario |
|---|---|---|
| kubectl get rs -n <ns> | List all ReplicaSets in a namespace showing desired/current/ready counts. | Identify how many old, zero-replica ReplicaSets have accumulated. An RS with 0 desired, 0 current, 0 ready is usually a retained old revision, but it can also be the active RS of a Deployment that is scaled to zero. |
| kubectl get rs -n <ns> -o json | jq -r '.items[] | select(.spec.replicas==0) | .metadata.name' | List the names of all ReplicaSets with 0 desired replicas. | Pre-cleanup audit — confirm which RS objects are safe to delete. Cross-reference with rollout history if rollback to specific revisions is needed. |
| kubectl get rs -n <ns> -o json | jq -r '.items[] | select(.spec.replicas==0) | .metadata.name' | xargs kubectl delete rs -n <ns> | Delete all zero-replica ReplicaSets in a namespace. | Namespace housekeeping when rollback history is managed by a GitOps tool (ArgoCD, Flux) and old RS objects provide no rollback value. |
| kubectl patch deployment <name> -p '{"spec":{"revisionHistoryLimit":3}}' -n <ns> | Set the Deployment to retain only the last 3 ReplicaSet revisions. Older ones are automatically cleaned up. | Preventive measure. Set this on all Deployments at creation time. 3 revisions provides adequate rollback depth without accumulating RS sprawl in long-lived clusters. |
| kubectl get rs -A -o json | jq '[.items[] | select(.spec.replicas==0)] | length' | Count all zero-replica ReplicaSets across the entire cluster. | Cluster hygiene metric. If this number is in the hundreds, it is time to enforce revisionHistoryLimit at the namespace or admission controller level. |
    # Find and delete all zero-replica ReplicaSets in a namespace
    kubectl get rs -n production -o json \
      | jq -r '.items[] | select(.spec.replicas==0) | .metadata.name' \
      | xargs --no-run-if-empty kubectl delete rs -n production

    # Prevent future RS accumulation: cap history at 3 revisions
    kubectl patch deployment my-app \
      -p '{"spec":{"revisionHistoryLimit":3}}' \
      -n production

    # Cluster-wide count of orphaned ReplicaSets
    kubectl get rs -A -o json \
      | jq '[.items[] | select(.spec.replicas==0)] | length'
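On hosts without jq, the same audit can be approximated from the plain table output. This is a sketch that assumes the default `kubectl get rs` column order (NAME DESIRED CURRENT READY AGE). The function name `zero_replica_rs` is hypothetical.

```shell
# Hypothetical helper: list ReplicaSets with 0 desired, 0 current, 0 ready.
# Usage: zero_replica_rs <ns>
zero_replica_rs() {
  kubectl get rs -n "$1" --no-headers \
    | awk '$2 == 0 && $3 == 0 && $4 == 0 { print $1 }'
}
```

Review the list against `kubectl rollout history` before piping it into a delete, since each name is a revision you will no longer be able to roll back to.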
Set revisionHistoryLimit: 3 on all Deployments in your base Helm charts or Kustomize patches.
Run eviction cleanup as a weekly CronJob in each namespace.
Use kubectl get events -A --sort-by=.lastTimestamp | tail -50 as the first command at the start of every incident response.
Label all resources that must survive prune operations (keep: "true").
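Until the chart or Kustomize change lands, the history cap can be applied to existing Deployments in bulk. The sketch below is a stopgap under assumptions: the function name `cap_revision_history` is hypothetical, and a GitOps controller may revert the patch if the manifests in Git do not carry the same value.

```shell
# Hypothetical helper: cap revisionHistoryLimit on every Deployment in a namespace.
# Usage: cap_revision_history <ns> [limit]
cap_revision_history() {
  ns="$1"; limit="${2:-3}"
  kubectl get deployments -n "$ns" -o name | while read -r deploy; do
    # $deploy looks like "deployment.apps/my-app", which kubectl patch accepts.
    kubectl patch "$deploy" -n "$ns" \
      -p "{\"spec\":{\"revisionHistoryLimit\":$limit}}"
  done
}
```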