DevOps · SecOps
A practical reference covering the full engineering lifecycle — from writing your first Dockerfile to owning incident response, cost governance, and everything in between. Every concept explained at two depths.
The Engineering Lifecycle
Source Control & Team Flow
Use short-lived feature branches off main. Merge often. Never commit directly to main. Keep main always deployable.
Add branch protection rules requiring PR reviews + passing CI. Use release branches only when your cadence demands them. Enforce naming conventions (feat/, fix/).
Avoid:
- Long-lived feature branches (merge hell)
- Committing directly to main/master
- No branch naming convention
- Merging without CI checks passing
Prefer:
- Short-lived branches, merged within days
- Required PR review + status checks
- Consistent naming: feat/, fix/, chore/
- Delete branches after merge
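The naming convention above can be checked mechanically, for example in a CI step. The following is a minimal sketch, assuming the three prefixes listed above and a hypothetical slug rule (lowercase letters, digits, and hyphens) that the original text does not specify.

```python
import re

# Prefixes come from the convention above; the slug rule after the slash
# (lowercase letters, digits, single hyphens) is an illustrative assumption.
BRANCH_PATTERN = re.compile(r"(feat|fix|chore)/[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name follows the prefix/slug convention."""
    return BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for branch in ["feat/login-page", "fix/null-pointer", "my-branch"]:
        print(branch, is_valid_branch(branch))
```

A check like this could run early in the pipeline and fail the build when the current branch name does not match.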
Conventional commit messages follow the format type(scope): description. This enables automated changelogs, semantic versioning, and better history readability.
Learn the 5 common types: feat, fix, chore, docs, refactor. Just using these consistently makes your history dramatically more readable.
Wire commit types to semantic-release or release-please to automate version bumps and CHANGELOG generation. Add commitlint as a commit-msg Git hook to enforce the format (a pre-commit hook runs before the message exists, so it cannot check it).
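To make the type(scope): description format concrete, here is a small sketch of a header parser. It is not how commitlint is implemented; the function name and the returned dictionary are hypothetical. The optional "!" marker for breaking changes comes from the Conventional Commits specification rather than from the text above.

```python
import re

# The five common types named in the text above.
COMMIT_TYPES = ("feat", "fix", "chore", "docs", "refactor")

# type, optional (scope), optional "!" for a breaking change, then ": description".
HEADER = re.compile(
    r"(?P<type>[a-z]+)(\((?P<scope>[^()\s]+)\))?(?P<breaking>!)?: (?P<description>\S.*)"
)


def parse_header(line: str):
    """Parse the first line of a commit message; return None if it is malformed."""
    match = HEADER.fullmatch(line)
    if match is None or match.group("type") not in COMMIT_TYPES:
        return None
    return {
        "type": match.group("type"),
        "scope": match.group("scope"),
        "breaking": match.group("breaking") is not None,
        "description": match.group("description"),
    }


if __name__ == "__main__":
    print(parse_header("feat(api): add pagination"))
    print(parse_header("Fixed stuff"))
```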
Semantic versioning uses MAJOR.MINOR.PATCH, where each part has a precise meaning based on backward compatibility.
| Part | Increments When | Example |
|---|---|---|
| MAJOR | Breaking API change — existing clients may break | 2.0.0 |
| MINOR | New feature added, backward compatible | 1.3.0 |
| PATCH | Bug fix, backward compatible | 1.3.7 |
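The table can be read as a bump rule driven by commit types. The sketch below assumes a commonly used mapping (breaking change bumps MAJOR, feat bumps MINOR, fix bumps PATCH, anything else leaves the version alone); release tools make this mapping configurable, and the function name is hypothetical.

```python
def bump(version: str, changes: list[tuple[str, bool]]) -> str:
    """Return the next version for a list of (commit_type, is_breaking) pairs."""
    major, minor, patch = (int(part) for part in version.split("."))
    if any(breaking for _, breaking in changes):
        # Breaking change: bump MAJOR and reset the lower parts.
        return f"{major + 1}.0.0"
    if any(commit_type == "feat" for commit_type, _ in changes):
        # Backward-compatible feature: bump MINOR and reset PATCH.
        return f"{major}.{minor + 1}.0"
    if any(commit_type == "fix" for commit_type, _ in changes):
        # Backward-compatible bug fix: bump PATCH only.
        return f"{major}.{minor}.{patch + 1}"
    # Only chore/docs/refactor commits: no release needed.
    return version


if __name__ == "__main__":
    print(bump("1.3.7", [("fix", False), ("docs", False)]))
```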
Developer Experience
A .devcontainer/ configuration ensures every developer uses identical tool versions, CLI dependencies, and configurations, regardless of their host OS.
Open the repo in VS Code with the Dev Containers extension. All tools are pre-installed. Run the same make commands as your teammates.
Pin exact tool versions in the Dockerfile. Add a post-create.sh to auto-run setup (install hooks, configure git). Review pinned versions quarterly.
make test or task build.
Continuous Integration
Quality Gates: Linting, SAST & SCA
- Linting: golangci-lint, eslint, ruff, terraform fmt. Run on every commit.
- SAST: semgrep, CodeQL, gosec.
- SCA: Snyk, Dependabot, trivy fs, npm audit.
- Container scanning: run trivy image in CI and gate on CRITICAL/HIGH severity. Rebuild images regularly to pick up OS patches.
- IaC scanning: checkov, tfsec, terrascan, kics.
- Secret detection: gitleaks, truffleHog, detect-secrets. Treat every hit as real until proven otherwise.
Infrastructure as Code
terraform plan to preview, terraform apply to execute.
Separate reusable modules/ from environments/. Always run terraform plan before apply. Start with remote state in S3 + DynamoDB lock table.
Add Terragrunt for DRY multi-environment configs. Run checkov in CI for policy checks. Use workspace strategies or directory-per-env. Add Infracost for cost estimation in PRs.
Kubernetes Packaging
runAsNonRoot: true). These are the most commonly missing production requirements.
Kustomize
- Patch-based: overlay diffs on a base
- No templating language to learn
- Great for your own app's env variation
- Built into kubectl
- Use for: your apps, different env configs
Helm
- Template-based: Go templates with values
- Packaging format with versioned releases
- Great for distributable, reusable charts
- Huge ecosystem of community charts
- Use for: shared infra (Prometheus, cert-manager)
GitOps
Instead of running kubectl apply from a CI pipeline (push model), a controller running inside the cluster watches a Git repo and continuously reconciles cluster state to match it (pull model). The repo is the only source of truth: no manual kubectl edits.
Never edit cluster state with kubectl directly in production. All changes go through Git. If you make a manual change, the GitOps controller will revert it (drift correction).
Add health checks and sync windows (e.g., no auto-sync during business peak hours). Use app-of-apps pattern in Argo CD for managing many applications. Separate the app repo from the config/infra repo.
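The pull-model reconciliation described in this section can be illustrated with a toy diff over two dictionaries. This is only a sketch of the idea, not how Argo CD or Flux work internally: the function names and the resource shape are hypothetical, and real controllers make deletion of removed resources (pruning) configurable.

```python
def diff_state(desired: dict, actual: dict) -> dict:
    """Compare desired state (from Git) with actual cluster state."""
    return {
        "create": sorted(k for k in desired if k not in actual),
        "update": sorted(k for k in desired if k in actual and desired[k] != actual[k]),
        "delete": sorted(k for k in actual if k not in desired),
    }


def reconcile(desired: dict, actual: dict) -> dict:
    """Return a new cluster state after applying the diff (drift correction)."""
    plan = diff_state(desired, actual)
    new_state = dict(actual)
    for name in plan["create"] + plan["update"]:
        new_state[name] = desired[name]
    for name in plan["delete"]:
        del new_state[name]
    return new_state


if __name__ == "__main__":
    git = {"web": {"replicas": 3}, "api": {"replicas": 2}}
    cluster = {"web": {"replicas": 5}, "debug": {"replicas": 1}}  # manual edits
    print(diff_state(git, cluster))
    print(reconcile(git, cluster) == git)
```

Here the manually scaled web deployment and the hand-created debug resource are both treated as drift, which mirrors the revert behaviour described above.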
Shift-Left Security
| # | Risk | Simple Rule |
|---|---|---|
| A01 | Broken Access Control | Check authorization on every request server-side. Never trust the client. |
| A02 | Cryptographic Failures | Enforce TLS 1.2+. Use AES-256-GCM. Ban MD5/SHA-1 for security purposes. |
| A03 | Injection (SQL, OS, LDAP) | Always use parameterized queries. Never concatenate user input into commands. |
| A04 | Insecure Design | Threat model before coding. Security is a design requirement, not a patch. |
| A05 | Security Misconfiguration | IaC scanning in CI. CIS Benchmarks. Default-deny everything. |
| A06 | Vulnerable Components | SCA scanning + automated dependency updates (Dependabot/Renovate). |
| A07 | Auth & Session Failures | Use established auth libraries. Enforce MFA. Rotate tokens. Short sessions. |
| A08 | Software Integrity Failures | Sign artifacts. Verify signatures. Use trusted CI for builds. |
| Severity | CVSS | Patch SLA | If Unpatchable |
|---|---|---|---|
| CRITICAL | 9.0–10 | 24 hours | Isolate system immediately |
| HIGH | 7.0–8.9 | 7 days | WAF rule or network ACL |
| MEDIUM | 4.0–6.9 | 30 days | Document risk acceptance |
| LOW | 0.1–3.9 | 90 days | Risk acceptance OK |
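The SLA table translates directly into a lookup that turns a CVSS score into a severity and a patch deadline. The sketch below uses the thresholds and SLAs from the table; the function names are hypothetical, and treating a score of 0.0 as "NONE" with no deadline is an assumption, since the table starts at 0.1.

```python
from datetime import datetime, timedelta

# (minimum CVSS score, severity, patch SLA), highest severity first.
SLA_TABLE = [
    (9.0, "CRITICAL", timedelta(hours=24)),
    (7.0, "HIGH", timedelta(days=7)),
    (4.0, "MEDIUM", timedelta(days=30)),
    (0.1, "LOW", timedelta(days=90)),
]


def classify(cvss: float):
    """Map a CVSS base score to (severity, patch SLA)."""
    if not 0.0 <= cvss <= 10.0:
        raise ValueError("CVSS score must be between 0.0 and 10.0")
    for floor, severity, sla in SLA_TABLE:
        if cvss >= floor:
            return severity, sla
    return "NONE", None  # assumption: a 0.0 score carries no SLA


def patch_deadline(cvss: float, discovered: datetime):
    """Return the datetime by which the finding must be patched, or None."""
    _, sla = classify(cvss)
    return None if sla is None else discovered + sla


if __name__ == "__main__":
    print(classify(9.8))
    print(patch_deadline(7.5, datetime(2024, 1, 1)))
```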
Identity & Secrets Management
ref:refs/heads/main) to limit which pipelines can assume which roles.
Don't:
- Hardcode secrets in source code
- Store secrets in .env files committed to git
- Share secrets via Slack or email
- Use same credential across dev/staging/prod
- Set secrets that never rotate
- Log secrets in CI pipeline output
Do:
- Use a central secret store (Vault, AWS Secrets Manager)
- Use dynamic secrets with short TTLs
- Separate secrets per environment
- Automate rotation and test it
- Audit all secret access (who, when, what)
- Run secret detection as a pre-commit hook
Policy as Code
Start in Audit mode to see what would fail without breaking things. Common beginner policies: require labels, require resource limits, require non-root containers.
Move critical policies to Enforce mode. Add Kyverno Policy Reports to your observability stack. Namespace-scope enforcement for gradual rollout. Exception workflows for approved deviations.
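The difference between Audit and Enforce mode can be sketched with the three beginner policies named above. This is plain Python for illustration, not Kyverno policy syntax; the function names are hypothetical, and the check only looks at container-level securityContext even though runAsNonRoot can also be set at pod level.

```python
def check_pod(pod: dict) -> list[str]:
    """Return policy violations for a pod manifest given as a dict."""
    violations = []
    if not pod.get("metadata", {}).get("labels"):
        violations.append("missing labels")
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        if not container.get("resources", {}).get("limits"):
            violations.append(f"{name}: missing resource limits")
        # Simplification: only the container-level securityContext is checked.
        if container.get("securityContext", {}).get("runAsNonRoot") is not True:
            violations.append(f"{name}: runAsNonRoot not set")
    return violations


def admit(pod: dict, mode: str = "audit"):
    """Audit mode reports violations but admits; enforce mode rejects."""
    violations = check_pod(pod)
    allowed = mode == "audit" or not violations
    return allowed, violations


if __name__ == "__main__":
    pod = {"metadata": {}, "spec": {"containers": [{"name": "web"}]}}
    print(admit(pod, mode="audit"))
    print(admit(pod, mode="enforce"))
```

Running the same pod through both modes shows why starting in audit is safe: the violations are identical, only the admission decision changes.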
Supply Chain Security
syft or cyclonedx. Required for vulnerability impact analysis: "Are we affected by Log4Shell?"
cosign (Sigstore). Verify signatures at deploy time via admission policy. Ensures the image in production came from your trusted CI pipeline, not an attacker.
FROM ubuntu@sha256:abc... not FROM ubuntu:latest). Pin dependencies at exact versions. Unpinned deps are the primary supply chain attack vector.
Runtime Security & Incident Response
| Framework | When It Applies | Key Focus |
|---|---|---|
| SOC 2 Type II | SaaS / cloud services with enterprise customers | Trust Service Criteria — security, availability, confidentiality |
| ISO 27001 | Any org seeking certification for third-party trust | ISMS: 93 Annex A controls in the 2022 edition (114 in the 2013 edition), annual surveillance audits |
| PCI DSS v4.0 | Handling credit card / cardholder data | 12 requirements — mandatory, audited by QSA |
| GDPR | Processing personal data of EU/UK residents | Lawful basis, 72h breach notification, DPO |
| HIPAA | US healthcare — PHI handling | PHI safeguards, BAAs, breach notification |
The Three Pillars
Answer: "Is something wrong? How often? How bad?"
Tools: Prometheus (scrape-based), Grafana (visualization), OpenTelemetry SDK.
- Track 4 Golden Signals: latency, traffic, errors, saturation
- Use histograms for latency, not averages
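A short numeric example shows why averages mislead for latency. The sample data below is invented for illustration (95 fast requests and 5 very slow ones), the helper names are hypothetical, and the bucket function only imitates the cumulative "less than or equal" buckets that Prometheus histograms use.

```python
import math
from statistics import mean


def percentile(samples: list, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def bucket_counts(samples: list, bounds: list) -> dict:
    """Cumulative count of samples at or below each upper bound."""
    return {bound: sum(1 for s in samples if s <= bound) for bound in bounds}


# Invented data: 95 requests at 50 ms, 5 requests at 2000 ms.
latencies_ms = [50] * 95 + [2000] * 5

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))
    print("p50:", percentile(latencies_ms, 50))
    print("p99:", percentile(latencies_ms, 99))
    print(bucket_counts(latencies_ms, [100, 500, 2500]))
```

The mean lands at a value no real request experienced, while the p99 and the bucket counts expose the slow tail directly.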
Answer: "What exactly happened at this moment?"
Tools: Loki (log aggregation), Elasticsearch/OpenSearch, Datadog.
- Always use structured JSON logs
- Include correlation IDs linking to traces
- Never log PII or credentials
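The three logging rules above can be combined in one formatter using only the standard library. This is a minimal sketch: the field names correlation_id and fields, and the list of redacted keys, are assumptions for illustration rather than an established convention.

```python
import json
import logging

# Hypothetical list of keys that must never reach the log output.
REDACTED_KEYS = {"password", "token", "email"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        # Extra structured fields, with sensitive values masked.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key in REDACTED_KEYS else value
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "order placed",
        extra={"correlation_id": "req-123", "fields": {"order_id": 42, "email": "a@b.c"}},
    )
```

In a real service the correlation ID would typically be the trace ID of the current request, so a log line can be matched to its trace.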
Answer: "Where is the bottleneck in this request path?"
Tools: Tempo, Jaeger, Zipkin, OpenTelemetry.
- Instrument entry points (HTTP handlers, consumers)
- Propagate context via W3C Trace Context headers
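Context propagation with W3C Trace Context comes down to reading the incoming traceparent header and sending a new one downstream with the same trace ID. The sketch below handles only the version 00 layout (version, trace-id, parent-id, flags); in practice an OpenTelemetry SDK does this for you. The function names are hypothetical, and starting a new sampled trace when no valid header arrives is an assumption.

```python
import re
import secrets

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Return the trace context from a traceparent header, or None if invalid."""
    match = TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_traceparent(incoming):
    """Build the header for a downstream call: same trace, new span ID."""
    ctx = parse_traceparent(incoming) if incoming else None
    if ctx is None:
        # Assumption: start a new, sampled trace when nothing valid came in.
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, sampled = ctx["trace_id"], ctx["sampled"]
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(parse_traceparent(header))
    print(outgoing_traceparent(header))
```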
SLOs & Error Budgets
| Term | What It Is | Example |
|---|---|---|
| SLI (Service Level Indicator) | The actual metric you measure | % of HTTP requests returning 2xx in under 500ms |
| SLO (Service Level Objective) | Your target for that metric | 99.9% of requests must be successful over a 30-day window |
| SLA (Service Level Agreement) | External contractual commitment, usually with penalties if missed | 99.5% uptime in the contract (looser than the SLO, which leaves the org a buffer) |
| Error Budget | The allowed unreliability from the SLO | 99.9% SLO = 0.1% budget = 43.2 min of allowed downtime per 30-day window (about 43.8 min per average calendar month) |
Start with one availability SLO per critical service (e.g., "99.5% of checkout requests succeed in 30 days"). Track it. That's it. One measured SLO is infinitely better than zero.
Add latency SLOs. Define error budget burn rate alerts (multi-window: 1h + 6h). When burn rate is too high, halt feature work and focus on reliability. Error budget = the bridge between product and operations teams.
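The error budget and burn rate arithmetic is simple enough to sketch directly. The budget depends on the window length: a 99.9% SLO allows 43.2 minutes over 30 days, or about 43.8 minutes over an average calendar month of 30.44 days. The function names and the burn-rate threshold of 6.0 are assumptions for illustration; pick thresholds that fit your own budget policy.

```python
def error_budget_minutes(slo: float, window_days: float = 30) -> float:
    """Minutes of full downtime allowed by an SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1 - slo)


def should_page(burn_1h: float, burn_6h: float, threshold: float = 6.0) -> bool:
    """Multi-window alert: page only when both windows exceed the threshold."""
    return burn_1h >= threshold and burn_6h >= threshold


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))         # 30-day window
    print(round(error_budget_minutes(0.999, 30.44), 1))  # average calendar month
    print(round(burn_rate(0.005, 0.999), 1))             # 0.5% errors against a 0.1% budget
```

Requiring both windows to exceed the threshold filters out short spikes (the long window stays low) and already-recovered incidents (the short window drops back).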
Alerting Best Practices
The FinOps Operating Loop
Start by identifying your 10 largest resource consumers with kubectl top pods -A. Compare requests vs actual usage. Even a 20% reduction in your top 10 consumers can mean significant monthly savings.
Automate rightsizing recommendations as pull requests using scripts that read VPA recommendations and generate manifest diffs. Gate changes through PR review to prevent accidental under-provisioning.
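Comparing requests against actual usage can be sketched as a small filter over per-workload numbers. The input shape, the 20% headroom, and the 20% minimum saving are all assumptions for illustration; in practice the usage figures would come from VPA recommendations or your metrics backend rather than hand-written dictionaries.

```python
import math


def rightsizing_candidates(pods: list, headroom_pct: int = 20, min_saving: float = 0.2) -> list:
    """Suggest lower CPU requests where usage plus headroom is well below the request.

    Each pod is a dict with name, cpu_request_m and cpu_usage_m (millicores).
    """
    candidates = []
    for pod in pods:
        recommended = math.ceil(pod["cpu_usage_m"] * (100 + headroom_pct) / 100)
        saving = 1 - recommended / pod["cpu_request_m"]
        if saving >= min_saving:
            candidates.append({
                "name": pod["name"],
                "current_m": pod["cpu_request_m"],
                "recommended_m": recommended,
                "saving": round(saving, 2),
            })
    # Largest absolute reduction first.
    return sorted(candidates, key=lambda c: c["current_m"] - c["recommended_m"], reverse=True)


if __name__ == "__main__":
    sample = [
        {"name": "api", "cpu_request_m": 1000, "cpu_usage_m": 300},
        {"name": "worker", "cpu_request_m": 500, "cpu_usage_m": 450},
        {"name": "db", "cpu_request_m": 2000, "cpu_usage_m": 1500},
    ]
    for candidate in rightsizing_candidates(sample):
        print(candidate)
```

The output of a script like this is what would be turned into a manifest diff and opened as a pull request, so a human still reviews every reduction.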
Cost Allocation & Governance
Service Catalog & Ownership
Register every production service. Even a simple list with owner + oncall contact is transformative during an incident. The first question is always "who owns this?"
Generate CODEOWNERS automatically from catalog metadata to avoid drift. Validate catalog entries in CI (schema check + URL validation). Enforce registration as a production deployment gate.
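Catalog validation and CODEOWNERS generation can both be driven from the same entries. The schema below (name, owner, oncall, repo_url, paths) is hypothetical, the URL check is syntactic only, and the output follows the "path owner" line format that CODEOWNERS files use.

```python
from urllib.parse import urlparse

# Hypothetical catalog schema: every entry must carry these fields.
REQUIRED_FIELDS = ("name", "owner", "oncall", "repo_url", "paths")


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems with a catalog entry; empty means valid."""
    errors = [f"missing field: {field}" for field in REQUIRED_FIELDS if not entry.get(field)]
    repo_url = entry.get("repo_url")
    if repo_url:
        parsed = urlparse(repo_url)
        if parsed.scheme != "https" or not parsed.netloc:
            errors.append("repo_url must be an https URL")
    owner = entry.get("owner")
    if owner and not owner.startswith("@"):
        errors.append("owner must be a @user or @org/team handle")
    return errors


def render_codeowners(entries: list) -> str:
    """Generate CODEOWNERS content from validated catalog entries."""
    lines = []
    for entry in sorted(entries, key=lambda e: e["name"]):
        for path in entry["paths"]:
            lines.append(f"{path} {entry['owner']}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    catalog = [{
        "name": "checkout",
        "owner": "@acme/payments",
        "oncall": "payments-oncall",
        "repo_url": "https://example.com/acme/checkout",
        "paths": ["/services/checkout/"],
    }]
    for entry in catalog:
        print(entry["name"], validate_entry(entry))
    print(render_codeowners(catalog), end="")
```

Failing the CI job whenever validate_entry returns a non-empty list is one way to turn registration into a deployment gate.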
Learning Path
| Phase | Concepts | Goal |
|---|---|---|
| Phase 1 (Week 1–2) | Git branching, conventional commits, dev containers, Makefile | Every developer working identically. Code history is readable. No "works on my machine." |
| Phase 2 (Week 3–4) | Pre-commit hooks, basic CI pipeline, secret detection, dep scanning | Automated quality gates on every push. Security checks running in CI. |
| Phase 3 (Month 2) | Terraform basics, Kubernetes manifests, container scanning, IaC scanning | Infrastructure defined as code. Deployments reproducible. |
| Phase 4 (Month 3) | Kustomize/Helm, GitOps with Argo CD, OIDC federation, secrets management | GitOps-driven deployments. No static credentials anywhere. |
| Phase 5 (Month 4–5) | Policy as code (Kyverno), observability (metrics + logs + traces), SLOs | Governance automated. Reliability is measured and owned. |
| Phase 6 (Ongoing) | FinOps loop, service catalog, supply chain security, incident response playbooks | Full platform engineering maturity. Cost visible. Ownership clear. Incidents structured. |
| Concept | Primary Tool(s) | Learn First |
|---|---|---|
| Branching | Git, GitHub/GitLab | YES — Day 1 |
| CI Pipeline | GitHub Actions, GitLab CI | YES — Week 1 |
| Secret Detection | gitleaks, truffleHog | YES — Week 1 |
| IaC | Terraform, Pulumi | Month 1 |
| Kubernetes | kubectl, k9s, Lens | Month 1–2 |
| GitOps | Argo CD, Flux | Month 2–3 |
| SLOs | Prometheus, Grafana | Month 3–4 |
| Policy as Code | Kyverno, OPA/Conftest | After Kubernetes basics |
| Supply Chain Sec | cosign, syft, SLSA | After CI/CD mastery |
| FinOps | Infracost, VPA, Kubecost | After infra is running |