Azure Cloud Engineer Handbook
An engineer-focused guide to Azure architecture, cost mechanics, portal navigation, and repeatable delivery with Azure Resource Manager primitives.
Table of Contents
This handbook is organized around the decisions Azure engineers make in production: what service to pick, how it is billed, where to inspect it in the portal, and how to automate it with Infrastructure as Code or SDKs.
| Module | Primary Decision | What You Must Know |
|---|---|---|
| 1. Compute & Serverless | Where application code should run | Billing model, scaling behavior, deployment path, and first-response troubleshooting steps |
| 2. Data & Storage | How data is stored and accessed | Capacity pricing, transaction charges, throughput planning, and lifecycle choices |
| 3. Networking & Security | How resources are isolated and trusted | Subnet boundaries, secret handling, RBAC, effective rules, and private connectivity |
| 4. Pricing & Cost | How to estimate and control spend | Calculator discipline, discount models, and proactive alerts |
| 5. IaC Standard | How infrastructure gets delivered | Version control, repeatability, reviewability, and environment-safe parameters |
| 6. Pitfalls | What repeatedly causes incidents or bill spikes | Zombie resources, outbound data charges, and credential management mistakes |
Module 1: Compute & Serverless
Azure compute choices are mostly a tradeoff between operational control and platform abstraction. App Service is the fastest path for standard web workloads, Azure Functions is the lowest-friction option for event-driven execution, and AKS exists for teams that truly need Kubernetes primitives, release independence, or container ecosystem tooling.
App Service
Azure App Service is a managed platform for hosting web applications, REST APIs, and lightweight background workloads without managing operating systems, patching cycles, or load balancer plumbing yourself. It is the default choice when a team wants a fast deployment path for customer-facing sites, internal line-of-business applications, or Python, .NET, Node.js, and Java APIs that do not require container orchestration.
Use App Service when the application is HTTP-centric, state lives outside the process, and the team values managed deployment slots, TLS termination, autoscale, and Microsoft-managed patching over raw infrastructure control.
| Tier | Best Use Case | How It Is Billed | Operational Impact |
|---|---|---|---|
| Free (F1) | Learning, demos, proof-of-concept apps | Shared compute with strict limits and no production-grade capacity guarantees | No SLA, constrained CPU minutes, not appropriate for real production traffic |
| Basic (B1-B3) | Small steady-state web apps and internal APIs | Fixed price per dedicated App Service Plan instance-hour | You reserve compute even when traffic is idle, so predictable but always-on cost |
| Premium (P1v3+ / P1v4+) | Production APIs, secure apps, autoscale workloads | Higher fixed price per dedicated instance-hour, plus related networking and outbound bandwidth costs | Supports autoscale, private networking scenarios, more memory and CPU, and stronger enterprise fit |
Portal and CLI Navigation
- Portal: search for the app name or App Services, then check `Overview`, `Diagnose and solve problems`, `Log stream`, and `App Service plan`.
- CLI: use `az webapp up` for a fast deploy path, then `az webapp log tail` and `az webapp config appsettings list` to confirm runtime and configuration.
// app-service.bicep
// Deploy a Linux App Service Plan and Web App with a system-assigned identity.
// This keeps credentials out of code and makes the app ready for Key Vault or Storage RBAC.
@description('Location for all resources.')
param location string = resourceGroup().location
@description('Globally unique web app name.')
param webAppName string
@description('App Service Plan name.')
param appServicePlanName string = 'asp-azure-handbook'
@allowed([
'F1'
'B1'
'P1v3'
])
@description('Choose Free for labs, Basic for fixed low-volume apps, Premium for production.')
param skuName string = 'B1'
var skuTier = skuName == 'F1' ? 'Free' : skuName == 'B1' ? 'Basic' : 'PremiumV3'
resource plan 'Microsoft.Web/serverfarms@2024-04-01' = {
name: appServicePlanName
location: location
kind: 'linux'
sku: {
name: skuName
tier: skuTier
capacity: 1
}
properties: {
reserved: true
}
}
resource site 'Microsoft.Web/sites@2024-04-01' = {
name: webAppName
location: location
kind: 'app,linux'
identity: {
type: 'SystemAssigned'
}
properties: {
serverFarmId: plan.id
httpsOnly: true
siteConfig: {
linuxFxVersion: 'PYTHON|3.12'
alwaysOn: skuName == 'F1' ? false : true
minTlsVersion: '1.2'
ftpsState: 'Disabled'
appSettings: [
{
name: 'WEBSITE_RUN_FROM_PACKAGE'
value: '1'
}
]
}
}
}
output hostname string = site.properties.defaultHostName
Azure Functions
Azure Functions is Azure's event-driven serverless runtime. Use it when code should run in response to HTTP requests, storage events, timers, queue messages, Service Bus triggers, or Event Grid notifications without reserving a full web tier all day.
| Plan | Billing Model | Best Fit | What To Watch |
|---|---|---|---|
| Consumption | Pay per execution count and execution duration based on memory used | Spiky event workloads, low or unpredictable traffic | Cold starts, execution limits, and network constraints matter |
| Premium | Pay for pre-warmed and active instances by allocated compute | Latency-sensitive functions, VNet integration, steady enterprise traffic | Higher baseline cost because capacity is reserved |
| Dedicated | Runs on an App Service Plan that you already pay for | Functions that should share reserved web compute with other apps | Cheap only if that plan already exists for a good reason |
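The Consumption model above (pay per execution plus execution duration weighted by memory) can be sketched as simple arithmetic. The rates and free-grant figures below are illustrative assumptions, not current Azure prices; always confirm against the official pricing page.

```python
# Rough Azure Functions Consumption-plan cost model.
# All rates and allowances here are placeholder assumptions for illustration.

ILLUSTRATIVE_PRICE_PER_MILLION_EXECUTIONS = 0.20  # assumed USD
ILLUSTRATIVE_PRICE_PER_GB_SECOND = 0.000016       # assumed USD
FREE_EXECUTIONS = 1_000_000                       # assumed monthly grant
FREE_GB_SECONDS = 400_000                         # assumed monthly grant

def estimate_consumption_cost(executions: int, avg_duration_s: float,
                              avg_memory_gb: float) -> float:
    """Estimate monthly Consumption-plan cost after the free grants."""
    gb_seconds = executions * avg_duration_s * avg_memory_gb
    billable_execs = max(0, executions - FREE_EXECUTIONS)
    billable_gb_s = max(0.0, gb_seconds - FREE_GB_SECONDS)
    exec_cost = billable_execs / 1_000_000 * ILLUSTRATIVE_PRICE_PER_MILLION_EXECUTIONS
    duration_cost = billable_gb_s * ILLUSTRATIVE_PRICE_PER_GB_SECOND
    return round(exec_cost + duration_cost, 2)

# 5M executions at 300 ms average and 512 MB memory:
print(estimate_consumption_cost(5_000_000, 0.3, 0.5))
```

The useful takeaway is structural: duration and memory multiply, so trimming average execution time often saves more than trimming execution count.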
Portal and CLI Navigation
- Portal: search the function app name or Function App, then inspect `Functions`, `Monitor`, `Deployment Center`, and linked `Application Insights`.
- CLI: use `az functionapp list` for discovery, `az functionapp show` for configuration, and deployment commands after confirming the plan and storage account.
# function_app.py
# Basic Azure Functions Python v2 HTTP trigger with input validation and safe error handling.
import json
import logging
import azure.functions as func
app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
@app.route(route="hello", methods=["GET"])
def hello(req: func.HttpRequest) -> func.HttpResponse:
try:
name = req.params.get("name")
if not name:
return func.HttpResponse(
json.dumps({"error": "Query string parameter 'name' is required."}),
status_code=400,
mimetype="application/json",
)
payload = {
"message": f"Hello, {name}. Your Azure Function is running.",
"executionModel": "Python v2",
}
return func.HttpResponse(
json.dumps(payload),
status_code=200,
mimetype="application/json",
)
except Exception as exc:
logging.exception("HTTP trigger failed: %s", exc)
return func.HttpResponse(
json.dumps({"error": "Unexpected server error."}),
status_code=500,
mimetype="application/json",
)
Azure Kubernetes Service (AKS)
AKS is Azure's managed Kubernetes offering for teams that need container orchestration, independent service release cycles, Kubernetes-native tooling, or deep control over ingress, networking, and workload composition. It is powerful, but it should be chosen because you need Kubernetes, not because it sounds “more cloud-native.”
At a high level, the control plane is Microsoft-managed and free on the Free tier, while the Standard tier (uptime SLA) bills per cluster; either way you still pay for worker nodes, node pool VM sizes, managed disks, load balancers, public IPs, monitoring, and outbound bandwidth. Check any SLA or premium add-ons against current pricing.
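Because node pools dominate AKS spend, a back-of-the-envelope node cost model is worth keeping handy. The VM and disk rates below are placeholder assumptions, not real Azure prices; look up your actual region and SKU.

```python
# Back-of-the-envelope AKS node pool cost model. Rates are assumptions only.

HOURS_PER_MONTH = 730  # common convention for monthly cloud estimates

def estimate_node_pool_cost(node_count: int, vm_hourly_rate: float,
                            disk_monthly_rate: float = 0.0) -> float:
    """Monthly cost of a node pool: VM instance-hours plus per-node OS disk."""
    compute = node_count * vm_hourly_rate * HOURS_PER_MONTH
    disks = node_count * disk_monthly_rate
    return round(compute + disks, 2)

# 3 nodes at an assumed $0.23/hour plus an assumed $10/month OS disk each:
print(estimate_node_pool_cost(3, 0.23, 10.0))
```

Run the same arithmetic before enabling the cluster autoscaler: the max node count, not the current one, bounds your worst-case bill.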
Portal and CLI Navigation
- Portal: search
Kubernetes servicesor the cluster name, then check node pressure, pending pods, and cluster insights before resizing anything. - CLI: start with
az aks list -o table, thenaz aks showandaz aks get-credentialsif you need to inspect cluster state withkubectl.
// aks.bicep
// Minimal AKS deployment. Use this as a baseline, then add private networking,
// Azure Policy, workload identity, and dedicated user pools for real production clusters.
@description('Location for the AKS cluster.')
param location string = resourceGroup().location
@description('AKS cluster name.')
param aksName string = 'aks-handbook-demo'
@description('DNS prefix used by the API server endpoint.')
param dnsPrefix string = 'aks-handbook-demo'
@minValue(1)
@description('System node count. Start small, then scale with measured demand.')
param nodeCount int = 3
@description('Worker node VM size. This is where most AKS cost starts.')
param nodeVmSize string = 'Standard_D4ds_v5'
resource aks 'Microsoft.ContainerService/managedClusters@2024-09-01' = {
name: aksName
location: location
sku: {
name: 'Base'
tier: 'Free'
}
identity: {
type: 'SystemAssigned'
}
properties: {
dnsPrefix: dnsPrefix
enableRBAC: true
agentPoolProfiles: [
{
name: 'system'
mode: 'System'
count: nodeCount
vmSize: nodeVmSize
osType: 'Linux'
osSKU: 'Ubuntu'
type: 'VirtualMachineScaleSets'
maxPods: 30
}
]
networkProfile: {
networkPlugin: 'azure'
loadBalancerSku: 'standard'
}
}
}
output clusterName string = aks.name
Module 2: Data & Storage
Azure data services differ most on access pattern, consistency expectations, operational overhead, and pricing mechanics. Blob Storage is the default landing zone for unstructured data, Cosmos DB fits globally distributed low-latency NoSQL workloads, and Azure SQL Database is the relational PaaS default when transactions, joins, and mature SQL tooling matter.
Azure Blob Storage
Azure Blob Storage is Azure's object store for unstructured data such as images, documents, backups, logs, parquet files, and application uploads. It is usually the simplest and cheapest place to put large binary assets, but the final bill depends on more than raw gigabytes.
| Tier | Best Fit | How It Is Billed | Common Mistake |
|---|---|---|---|
| Hot | Frequently read objects, active app content, recent analytics files | Higher storage capacity rate, lower access charges | Keeping infrequently used backups hot for months |
| Cool | Infrequently accessed content with occasional retrieval | Lower capacity rate, higher read and retrieval charges, minimum retention period applies | Moving transactional data here and then reading it constantly |
| Archive | Long-term retention, audit files, compliance snapshots | Lowest capacity rate, highest retrieval latency and rehydration cost, minimum retention period applies | Treating archive as if it were an online filesystem |
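The tier tradeoff in the table is a two-term equation: capacity rate versus access rate. A quick comparison, using illustrative placeholder rates only (real prices vary by region and redundancy), shows why read frequency decides the tier:

```python
# Compare monthly blob cost across tiers for a given footprint.
# capacity_rate and read_rate_per_gb are assumed illustrative USD rates.

def monthly_blob_cost(gb_stored: float, gb_read: float,
                      capacity_rate: float, read_rate_per_gb: float) -> float:
    return round(gb_stored * capacity_rate + gb_read * read_rate_per_gb, 2)

# 10 TB of backups with 300 GB of monthly reads, under assumed rates:
hot = monthly_blob_cost(10_240, 300, capacity_rate=0.018, read_rate_per_gb=0.0)
cool = monthly_blob_cost(10_240, 300, capacity_rate=0.010, read_rate_per_gb=0.01)
print(hot, cool)
```

Under these assumptions, cool wins because reads are rare; invert the read volume and hot wins. Model your own numbers before setting a lifecycle policy.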
Portal and CLI Navigation
- Portal: search the storage account name, then inspect `Containers`, `Access keys`, `Networking`, `Lifecycle management`, and `Metrics`.
- CLI: use `az storage account show`, `az storage container list --auth-mode login`, and `az storage blob list` to confirm the account, container, and object path before changing application code.
# blob_upload.py
# Upload a file to Azure Blob Storage using Entra ID and RBAC instead of account keys.
# Required role for the caller or managed identity: Storage Blob Data Contributor.
from pathlib import Path
import logging
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, ContentSettings
def upload_file_to_blob(account_url: str, container_name: str, source_path: str, blob_name: str) -> None:
credential = DefaultAzureCredential()
blob_service_client = BlobServiceClient(account_url=account_url, credential=credential)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
file_path = Path(source_path)
if not file_path.exists():
raise FileNotFoundError(f"Source file not found: {file_path}")
try:
with file_path.open("rb") as data:
blob_client.upload_blob(
data,
overwrite=True,
content_settings=ContentSettings(content_type="application/octet-stream"),
)
logging.info("Uploaded %s to %s/%s", file_path.name, container_name, blob_name)
except AzureError as exc:
logging.exception("Blob upload failed: %s", exc)
raise
if __name__ == "__main__":
upload_file_to_blob(
account_url="https://mystorageaccount.blob.core.windows.net",
container_name="raw-ingest",
source_path="./sample-data/orders.csv",
blob_name="2026/03/orders.csv",
)
Azure Cosmos DB
Azure Cosmos DB is Azure's globally distributed NoSQL database for low-latency document, key-value, graph, and related models. It is powerful when you genuinely need global distribution, elastic scale, or predictable low latency, but it is also one of the easiest Azure services to overspend on when partition design is poor.
| Throughput Model | How It Is Billed | Best Fit | Watchouts |
|---|---|---|---|
| Provisioned RU/s | Pay for reserved request units every hour whether you consume them or not | Steady traffic with predictable throughput needs | Idle but over-provisioned containers waste money continuously |
| Autoscale RU/s | Pay for autoscale capacity based on configured max RU/s | Workloads with daily or weekly spikes | Setting max RU/s far too high can still inflate spend materially |
| Serverless | Pay per request consumed plus storage, without reserved RU/s | Low-volume, sporadic, or dev and test workloads | Not the right fit for sustained heavy production load |
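Autoscale billing deserves a concrete sketch: each hour is billed at the highest RU/s the container actually scaled to, and autoscale never drops below 10% of the configured maximum. The following minimal model illustrates why an over-generous max RU/s still costs money at idle:

```python
# Sketch of Cosmos DB autoscale billing mechanics: each hour bills at the
# peak RU/s reached, with a floor of 10% of the configured maximum.

def autoscale_hourly_ru(observed_peak_ru: float, max_ru: float) -> float:
    floor = max_ru * 0.10
    return min(max(observed_peak_ru, floor), max_ru)

# With max 4000 RU/s, an idle hour still bills at the 400 RU/s floor:
print(autoscale_hourly_ru(0, 4000))      # 400.0
print(autoscale_hourly_ru(2500, 4000))   # 2500
print(autoscale_hourly_ru(9000, 4000))   # 4000
```

This is why "set max RU/s very high just in case" inflates spend: the idle floor scales with the maximum you configure.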
Portal and CLI Navigation
- Portal: search the account name, then inspect `Data Explorer`, `Keys`, `Replicate data globally`, `Metrics`, and `Scale & settings`.
- CLI: start with `az cosmosdb show`, `az cosmosdb sql database list`, and `az cosmosdb sql container throughput show` to verify throughput and container design.
// cosmosdb.bicep
// Provision a Cosmos DB SQL API account, database, and container with autoscale throughput.
// Pick the partition key carefully. It should spread writes evenly and match common query patterns.
@description('Deployment location for the Cosmos DB account.')
param location string = resourceGroup().location
@description('Globally unique Cosmos DB account name.')
param accountName string
@description('SQL database name.')
param databaseName string = 'appdb'
@description('Container name.')
param containerName string = 'orders'
@description('Partition key path. Example: /tenantId or /customerId.')
param partitionKeyPath string = '/tenantId'
@minValue(1000)
@description('Autoscale max RU/s. Start with measured demand, not guesswork.')
param maxAutoscaleRu int = 4000
resource cosmos 'Microsoft.DocumentDB/databaseAccounts@2023-04-15' = {
name: accountName
location: location
kind: 'GlobalDocumentDB'
properties: {
databaseAccountOfferType: 'Standard'
publicNetworkAccess: 'Enabled'
enableAutomaticFailover: false
locations: [
{
locationName: location
failoverPriority: 0
isZoneRedundant: false
}
]
consistencyPolicy: {
defaultConsistencyLevel: 'Session'
}
}
}
resource sqlDb 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases@2023-04-15' = {
name: '${cosmos.name}/${databaseName}'
properties: {
resource: {
id: databaseName
}
options: {}
}
}
resource container 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2023-04-15' = {
name: '${cosmos.name}/${databaseName}/${containerName}'
properties: {
resource: {
id: containerName
partitionKey: {
paths: [
partitionKeyPath
]
kind: 'Hash'
version: 2
}
indexingPolicy: {
indexingMode: 'consistent'
automatic: true
includedPaths: [
{
path: '/*'
}
]
excludedPaths: [
{
path: '/"_etag"/?'
}
]
}
}
options: {
autoscaleSettings: {
maxThroughput: maxAutoscaleRu
}
}
}
}
output cosmosEndpoint string = cosmos.properties.documentEndpoint
Azure SQL Database
Azure SQL Database is Azure's PaaS relational database for transactional systems that need SQL Server compatibility without managing Windows hosts, patching, backups, or cluster operations. It is usually the default for line-of-business apps, reporting stores, and APIs that rely on relational constraints and mature SQL tooling.
| Model | How It Is Billed | Best Fit | Decision Note |
|---|---|---|---|
| DTU | Bundled compute, memory, and IO in fixed tiers | Smaller legacy workloads or teams that want simplified sizing | Simple to buy, but less transparent about what resources you are getting |
| vCore | Pay for chosen compute generation, vCores, storage, backups, and extras such as zone redundancy | Modern production workloads needing clearer sizing and cost control | Usually preferred because it maps better to actual resource planning and discount options |
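The serverless vCore variant (used in the Bicep example later in this module) bills compute per vCore-second of actual use, never below the configured minimum capacity while the database is online, and nothing for compute while auto-paused. A rough sketch, with an assumed placeholder rate:

```python
# Sketch of Azure SQL serverless compute billing. The per-vCore-second rate
# is an illustrative assumption, not a real Azure price.

ILLUSTRATIVE_RATE_PER_VCORE_SECOND = 0.000145  # assumed USD

def serverless_compute_cost(seconds_online: int, avg_vcores_used: float,
                            min_vcores: float) -> float:
    billed_vcores = max(avg_vcores_used, min_vcores)
    return round(seconds_online * billed_vcores * ILLUSTRATIVE_RATE_PER_VCORE_SECOND, 2)

# A database busy 8 hours/day for 30 days at ~1.2 vCores, 0.5 vCore floor:
print(serverless_compute_cost(8 * 3600 * 30, 1.2, 0.5))
```

The structural point: serverless only saves money if the database genuinely pauses; a workload with constant trickle traffic never pauses and may cost more than provisioned compute.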
Portal and CLI Navigation
- Portal: search the logical server or database name, then check `Overview`, `Query Performance Insight`, `Connection strings`, and `Networking`.
- CLI: use `az sql server show`, `az sql db show`, and `az sql db list-usages` to confirm configuration and capacity usage.
// azure-sql.bicep
// Deploy a logical SQL server and a vCore-based Azure SQL Database.
// Credentials are parameters so they are never hardcoded in source control.
@description('Location for SQL resources.')
param location string = resourceGroup().location
@description('Globally unique logical SQL server name.')
param sqlServerName string
@description('Database name.')
param databaseName string = 'appdb'
@description('SQL administrator login name.')
param administratorLogin string
@secure()
@description('SQL administrator password supplied at deployment time.')
param administratorPassword string
resource sqlServer 'Microsoft.Sql/servers@2023-08-01-preview' = {
name: sqlServerName
location: location
properties: {
administratorLogin: administratorLogin
administratorLoginPassword: administratorPassword
minimalTlsVersion: '1.2'
publicNetworkAccess: 'Disabled'
}
}
resource database 'Microsoft.Sql/servers/databases@2023-08-01-preview' = {
name: '${sqlServer.name}/${databaseName}'
location: location
sku: {
name: 'GP_S_Gen5_2'
tier: 'GeneralPurpose'
capacity: 2
}
properties: {
collation: 'SQL_Latin1_General_CP1_CI_AS'
maxSizeBytes: 10737418240
autoPauseDelay: 60
minCapacity: json('0.5') // Bicep has no decimal literals; json() supplies the 0.5 vCore floor
backupStorageRedundancy: 'Local'
}
}
output fullyQualifiedServerName string = sqlServer.properties.fullyQualifiedDomainName
Module 3: Networking & Security
Azure security starts with network boundaries and identity boundaries working together. VNets and subnets create containment. NSGs express allowed traffic. Key Vault keeps secrets out of code. Network Watcher helps you prove whether the network is actually the problem before the incident drifts into guesswork.
Virtual Networks (VNet) & Subnets
A VNet is the core Azure network boundary for private IP addressing, segmentation, and controlled connectivity between resources. Subnets turn that boundary into meaningful isolation domains. Separating app, data, and management paths is critical because it limits blast radius, simplifies policy, and makes route and security intent explicit.
VNets and subnets themselves are generally not the expensive line items. The bill grows from attached components such as NAT Gateway, Azure Firewall, VPN Gateway, Private Endpoints, cross-region peering, and outbound bandwidth.
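Before deploying, it is worth validating subnet plans mechanically: every subnet must fit inside the VNet address space and none may overlap. The standard library handles this; the CIDR values below mirror the Bicep example that follows.

```python
# Subnet planning sketch using only the standard library: confirm proposed
# subnets fit inside the VNet address space and do not overlap each other.
import ipaddress

def validate_subnets(vnet_cidr: str, subnet_cidrs: list[str]) -> bool:
    vnet = ipaddress.ip_network(vnet_cidr)
    subnets = [ipaddress.ip_network(c) for c in subnet_cidrs]
    for subnet in subnets:
        if not subnet.subnet_of(vnet):
            return False  # subnet falls outside the VNet address space
    for i, a in enumerate(subnets):
        for b in subnets[i + 1:]:
            if a.overlaps(b):
                return False  # two subnets claim the same addresses
    return True

print(validate_subnets('10.20.0.0/16', ['10.20.1.0/24', '10.20.2.0/24']))    # True
print(validate_subnets('10.20.0.0/16', ['10.20.1.0/24', '10.20.1.128/25']))  # False
```

Running a check like this in CI catches address-plan mistakes before Azure Resource Manager rejects, or worse, silently accepts, a deployment.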
Portal and CLI Navigation
- Portal: search the VNet name, then open `Subnets`, `Peerings`, `DDoS protection`, and the linked NSG.
- CLI: use `az network vnet show`, `az network vnet subnet list`, and `az network nsg rule list` to verify what the network is actually enforcing.
// vnet.bicep
// Create a VNet with separate application and data subnets plus an NSG.
// Network isolation is cheap compared to retrofitting security after go-live.
@description('Location for networking resources.')
param location string = resourceGroup().location
@description('Virtual network name.')
param vnetName string = 'vnet-app-prod'
resource nsg 'Microsoft.Network/networkSecurityGroups@2024-03-01' = {
name: 'nsg-app-subnet'
location: location
properties: {
securityRules: [
{
name: 'AllowHttpsInbound'
properties: {
protocol: 'Tcp'
sourcePortRange: '*'
destinationPortRange: '443'
sourceAddressPrefix: 'Internet'
destinationAddressPrefix: '*'
access: 'Allow'
priority: 100
direction: 'Inbound'
}
}
]
}
}
resource vnet 'Microsoft.Network/virtualNetworks@2024-03-01' = {
name: vnetName
location: location
properties: {
addressSpace: {
addressPrefixes: [
'10.20.0.0/16'
]
}
subnets: [
{
name: 'app'
properties: {
addressPrefix: '10.20.1.0/24'
networkSecurityGroup: {
id: nsg.id
}
}
}
{
name: 'data'
properties: {
addressPrefix: '10.20.2.0/24'
}
}
]
}
}
output vnetId string = vnet.id
Azure Key Vault
Azure Key Vault stores secrets, certificates, and cryptographic keys so application teams do not hardcode or manually rotate sensitive values. It should be the default home for application secrets unless there is a stronger service-specific reason to use another secure store.
| Capability | Purpose | How It Is Billed | Operational Note |
|---|---|---|---|
| Secrets | Passwords, connection strings, tokens, API keys | Per operation and storage at the vault tier | Usually the default for app configuration that must stay secret |
| Keys | Encryption keys and signing material | Per key and per cryptographic operation, higher for HSM-backed options | Use Premium when hardware-backed keys are required |
| Certificates | TLS certificate lifecycle management | Vault operations plus any external CA cost | Useful when teams need central renewal and access control |
Portal and CLI Navigation
- Portal: search the vault name, then verify `Secrets`, `Access control (IAM)`, `Role assignments`, and `Networking`.
- CLI: use `az keyvault show`, `az keyvault secret list`, and `az role assignment list --scope` to verify whether the caller should be able to read the secret at all.
# key_vault_read.py
# Fetch a Key Vault secret securely using DefaultAzureCredential.
# This works locally with az login and in Azure with a managed identity.
import logging
from azure.core.exceptions import AzureError, ResourceNotFoundError
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
def get_secret(vault_url: str, secret_name: str) -> str:
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)
try:
secret = client.get_secret(secret_name)
return secret.value
except ResourceNotFoundError as exc:
logging.exception("Secret not found: %s", exc)
raise
except AzureError as exc:
logging.exception("Failed to read secret from Key Vault: %s", exc)
raise
if __name__ == "__main__":
value = get_secret(
vault_url="https://my-shared-vault.vault.azure.net/",
secret_name="sql-admin-password",
)
print(f"Retrieved secret with length: {len(value)}")
Navigation and Troubleshooting: Network Watcher & Effective Security Rules
Network Watcher is the first-response toolbox for proving where traffic is being dropped. Use its topology view, Connection troubleshoot, IP flow verify, and effective security rules rather than guessing which NSG, route, or peering change caused the failure.
Portal and CLI Navigation
- Portal: go to `Network Watcher`, then use `IP flow verify`, `Topology`, and `Connection troubleshoot`. For a VM, open the NIC and inspect `Effective security rules`.
- CLI: use `az network watcher test-connectivity`, `az network watcher ip-flow-verify`, and `az network nic list-effective-nsg` for scriptable checks.
# effective_nsg.py
# Retrieve effective NSG rules for a VM network interface using the Azure Python SDK.
import logging
import os
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
def print_effective_nsg(resource_group: str, nic_name: str) -> None:
subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
credential = DefaultAzureCredential()
client = NetworkManagementClient(credential, subscription_id)
try:
poller = client.network_interfaces.begin_get_effective_network_security_groups(
resource_group_name=resource_group,
network_interface_name=nic_name,
)
result = poller.result()
for association in result.value:
logging.info("NSG association: %s", association.network_security_group.id)
for rule in association.effective_security_rules:
print(
f"{rule.name}: {rule.direction} {rule.access} "
f"ports={rule.destination_port_range} source={rule.source_address_prefix}"
)
except AzureError as exc:
logging.exception("Failed to retrieve effective NSG rules: %s", exc)
raise
if __name__ == "__main__":
print_effective_nsg(resource_group="rg-network-prod", nic_name="vm-app-01-nic")
Module 4: Pricing Calculation & Cost Management
Azure cost control is a design discipline, not a finance afterthought. Estimates should be built from real SKUs, realistic traffic, storage growth, and data transfer assumptions. Governance then keeps reality from drifting far away from the estimate.
The Pricing Calculator
The official Azure Pricing Calculator is the right starting point for workload estimation, but only if you treat it as an engineering exercise instead of a quick sales estimate. Calculator output is usually list-price guidance. It does not automatically know your reserved capacity, enterprise agreement discounts, or workload inefficiencies.
| Step | What To Do | Why It Matters |
|---|---|---|
| 1. Define scope | List every service in the architecture, not just the headline compute tier | Many Azure bills are driven by supporting services such as storage, monitoring, and data transfer |
| 2. Pin region and SKU | Select the exact Azure region, tier, redundancy mode, and expected hours | Region and SKU change prices materially, especially for compute and storage redundancy |
| 3. Model usage | Estimate requests, GB stored, GB transferred, RU/s, or node-hours based on real load expectations | Under-modeling usage makes the calculator look cheap and the production bill look wrong |
| 4. Add operations costs | Include transactions, backup retention, retrieval fees, and monitoring ingestion | These line items are often missing from first-pass estimates |
| 5. Validate against reality | Compare the estimate with a pilot environment or a prior month's spend | This exposes bad assumptions before leadership starts relying on the number |
| 6. Revisit monthly | Update estimates as SKUs, traffic, and architecture change | Pricing drift is normal. Unreviewed estimates become fiction quickly |
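The workflow above reduces to a simple discipline: enumerate every line item, including the supporting services, then sum and sanity-check. The service names and dollar amounts below are assumptions for illustration only:

```python
# Minimal estimate-aggregation sketch. Every amount here is an assumed
# placeholder; the point is the discipline of listing every line item.

def estimate_monthly_total(line_items: dict[str, float]) -> float:
    return round(sum(line_items.values()), 2)

estimate = {
    'app_service_p1v3': 240.0,   # assumed instance-hours
    'sql_database': 150.0,       # assumed vCores + storage + backups
    'blob_storage': 45.0,        # capacity plus transactions
    'log_ingestion': 60.0,       # monitoring is a real line item
    'egress_bandwidth': 35.0,    # outbound data is not free
}
print(estimate_monthly_total(estimate))  # 530.0
```

Notice that the non-compute rows are a meaningful share of the total; first-pass estimates that only price the headline compute tier miss exactly these.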
Cost Optimization: Reserved Instances, PAYG, and Azure Hybrid Benefit
Pay-As-You-Go (PAYG) is the baseline model: you pay standard rates with maximum flexibility and no long-term commitment. Reserved Instances or reserved capacity trade commitment for lower unit pricing on eligible services. Azure Hybrid Benefit lets you apply qualifying existing Windows Server or SQL Server licenses to reduce Azure software charges.
| Option | Best Fit | How It Changes The Bill | Risk |
|---|---|---|---|
| PAYG | New workloads, uncertain demand, short-lived environments | Highest flexibility, usually highest unit cost | Teams forget to revisit it after usage stabilizes |
| Reserved Instances / Reserved Capacity | Steady-state production with predictable baseline usage | Lower hourly cost in exchange for 1-year or 3-year commitment | Wrong sizing or wrong region commitment limits savings |
| Azure Hybrid Benefit | Organizations with eligible Microsoft licenses | Reduces software portion of Azure compute or database cost | License eligibility and assignment must be governed carefully |
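The commitment decision in the table is a break-even calculation: a reservation bills for every committed hour, while PAYG bills only for hours actually used. A simplified sketch (the discount and rate figures are assumptions for illustration):

```python
# Simplified PAYG vs reservation break-even. The reservation bills 100% of
# committed hours; PAYG bills only the hours you actually run.

def reserved_saves_money(payg_hourly: float, reserved_discount: float,
                         expected_utilization: float) -> bool:
    reserved_hourly_effective = payg_hourly * (1 - reserved_discount)
    payg_effective = payg_hourly * expected_utilization
    return reserved_hourly_effective < payg_effective

# Under an assumed 40% discount, commitment only wins at high utilization:
print(reserved_saves_money(0.50, 0.40, 0.90))  # True
print(reserved_saves_money(0.50, 0.40, 0.50))  # False
```

This is why reservations fit steady-state baselines and PAYG fits spiky or uncertain workloads: below the break-even utilization, the "discount" costs more than list price.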
- CLI: use `az consumption usage list` and any service-specific SKU inspection commands to validate that usage is steady enough to justify commitment.
Budgets and Cost Alerts
Budgets are the simplest protection against bill shock. A budget does not stop spend by itself, but it gives teams a threshold, forecast visibility, and a forcing function to react before the invoice arrives.
Portal and CLI Navigation
- Portal: go to `Cost Management + Billing`, then open `Budgets`, `Cost analysis`, and `Exports` for recurring reporting.
- CLI: use `az consumption budget list` and `az consumption usage list` to confirm whether the right scope is covered and what is actually driving the spend.
// budget.bicep
// Subscription-scope monthly budget with actual and forecast alerts.
targetScope = 'subscription'
@description('Budget resource name.')
param budgetName string = 'engineering-monthly-budget'
@description('Budget amount in USD or the billing currency of the subscription.')
param budgetAmount int = 1500
@description('Budget notification recipients.')
param contactEmails array = [
'cloud-ops@example.com'
]
resource budget 'Microsoft.Consumption/budgets@2023-05-01' = {
name: budgetName
properties: {
category: 'Cost'
amount: budgetAmount
timeGrain: 'Monthly'
timePeriod: {
startDate: '2026-01-01T00:00:00Z'
endDate: '2027-01-01T00:00:00Z'
}
notifications: {
actual80: {
enabled: true
operator: 'GreaterThan'
threshold: 80
thresholdType: 'Actual'
contactEmails: contactEmails
}
forecast100: {
enabled: true
operator: 'GreaterThan'
threshold: 100
thresholdType: 'Forecasted'
contactEmails: contactEmails
}
}
}
}
Module 5: Infrastructure as Code (IaC) Standard
Infrastructure as Code is the production standard because Azure Resource Manager changes should be reviewable, repeatable, parameterized, and deployable across environments without human clicking. The portal is excellent for inspection and emergency response. It is not a durable production change-management system.
Why IaC
Clicking through the portal is an anti-pattern for production because it creates undocumented drift, makes peer review impossible, and turns every rebuild into archaeology. Bicep fixes that by giving Azure engineers a native ARM abstraction with reusable modules, parameter files, and deployment history.
Run `az deployment group what-if` or `az deployment sub what-if` before applying changes.
Bicep Example: Storage Account + App Service with Managed Identity
The following Bicep file is a clean baseline for a typical web workload. It provisions a standard Storage Account and a Linux App Service with a system-assigned managed identity. The storage account is locked down with HTTPS-only and public blob access disabled. The App Service is ready to authenticate to other Azure services using its identity instead of embedded credentials.
Confirm what was actually deployed with `az deployment group show`.
// main.bicep
// Deploy a standard storage account and Linux App Service with a system-assigned managed identity.
@description('Deployment location for all resources.')
param location string = resourceGroup().location
@description('Globally unique storage account name using only lowercase letters and numbers.')
param storageAccountName string
@description('Globally unique web app name.')
param webAppName string
@description('App Service Plan name.')
param appServicePlanName string = 'asp-standard-linux'
@allowed([
'B1'
'P1v3'
])
@description('Choose B1 for smaller steady-state workloads and P1v3 for production autoscale scenarios.')
param appServiceSku string = 'B1'
var appServiceTier = appServiceSku == 'B1' ? 'Basic' : 'PremiumV3'
resource storage 'Microsoft.Storage/storageAccounts@2023-05-01' = {
name: storageAccountName
location: location
sku: {
name: 'Standard_LRS'
}
kind: 'StorageV2'
properties: {
accessTier: 'Hot'
allowBlobPublicAccess: false
allowSharedKeyAccess: false
minimumTlsVersion: 'TLS1_2'
publicNetworkAccess: 'Enabled'
supportsHttpsTrafficOnly: true
}
}
resource plan 'Microsoft.Web/serverfarms@2024-04-01' = {
name: appServicePlanName
location: location
kind: 'linux'
sku: {
name: appServiceSku
tier: appServiceTier
capacity: 1
}
properties: {
reserved: true
}
}
resource site 'Microsoft.Web/sites@2024-04-01' = {
name: webAppName
location: location
kind: 'app,linux'
identity: {
type: 'SystemAssigned'
}
properties: {
serverFarmId: plan.id
httpsOnly: true
siteConfig: {
linuxFxVersion: 'PYTHON|3.12'
alwaysOn: true
minTlsVersion: '1.2'
ftpsState: 'Disabled'
appSettings: [
{
name: 'AZURE_STORAGE_ACCOUNT_NAME'
value: storage.name
}
{
name: 'WEBSITE_RUN_FROM_PACKAGE'
value: '1'
}
]
}
}
}
output storageBlobEndpoint string = storage.properties.primaryEndpoints.blob
output webAppHostname string = site.properties.defaultHostName
output webAppPrincipalId string = site.identity.principalId
Module 6: Common Pitfalls & Anti-Patterns
Most Azure overspend and avoidable security exposure comes from a short list of repeated mistakes, not obscure platform behavior. The recurring pattern is the same: no inventory discipline, weak ownership, and changes made without cost or identity review.
1. Leaving Idle Resources Running
Idle resources are one of the fastest ways to waste money in Azure. Unattached managed disks, unused public IPs, forgotten App Service Plans, dormant AKS node pools, and test databases continue billing even when no application is using them.
Spot-check quickly with `az disk list` and `az network public-ip list`.
# idle_resource_audit.py
# Identify unattached managed disks and unassociated public IP addresses.
import logging
import os
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient
def find_idle_resources() -> None:
subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, subscription_id)
network_client = NetworkManagementClient(credential, subscription_id)
try:
print("Unattached managed disks:")
for disk in compute_client.disks.list():
if not disk.managed_by:
print(f"- {disk.name} ({disk.location}) sku={disk.sku.name}")
print("\nUnassociated public IP addresses:")
for public_ip in network_client.public_ip_addresses.list_all():
if public_ip.ip_configuration is None:
print(f"- {public_ip.name} ({public_ip.location}) sku={public_ip.sku.name}")
except AzureError as exc:
logging.exception("Idle resource discovery failed: %s", exc)
raise
if __name__ == "__main__":
find_idle_resources()
2. Misunderstanding Egress Bandwidth Costs
Data entering Azure is usually free. Data leaving Azure is not. That applies to internet egress, many cross-region transfers, some peering patterns, CDN origin traffic, backups restored across boundaries, and storage-heavy workloads that serve large files directly to users.
| Scenario | Why Teams Miss It | What Actually Costs Money |
|---|---|---|
| Serving files from Blob Storage | Storage capacity looks cheap in the estimate | Outbound bandwidth and high read volume can become the real bill |
| Cross-region architectures | Teams focus on resiliency, not transfer patterns | Replication and inter-region data movement can materially raise run cost |
| API-heavy applications | Developers count requests but not response payload size | Large response bodies multiply egress charges as traffic grows |
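For the API-heavy case in the table, the egress math is response size times request volume. The per-GB rate and free allowance below are placeholder assumptions, not current Azure prices:

```python
# Egress sketch for an API workload: outbound bytes dominate as traffic grows.
# Rate and free allowance are illustrative assumptions only.

ILLUSTRATIVE_EGRESS_RATE_PER_GB = 0.087  # assumed USD
FREE_EGRESS_GB = 100                     # assumed monthly allowance

def monthly_egress_cost(requests: int, avg_response_kb: float) -> float:
    egress_gb = requests * avg_response_kb / 1_048_576  # KB -> GB
    billable = max(0.0, egress_gb - FREE_EGRESS_GB)
    return round(billable * ILLUSTRATIVE_EGRESS_RATE_PER_GB, 2)

# 50M requests at 200 KB each is roughly 9.5 TB outbound per month:
print(monthly_egress_cost(50_000_000, 200))
```

Halving the average response payload, through pagination, field selection, or compression, halves this line item, which is usually cheaper than re-architecting anything else.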
3. Hardcoding Credentials Instead of Using Managed Identities and RBAC
Hardcoded credentials create both security risk and operational drag. Secrets leak into repositories, pipelines, laptops, and ticket comments. Rotation becomes painful, and incident response expands because nobody is sure where the credential was copied. Managed identities and RBAC remove that problem at the root.
Verify the identity and its grants with `az webapp identity show` and `az role assignment list`.
# managed_identity_blob.py
# Access Blob Storage without a connection string by using DefaultAzureCredential.
import logging
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
def list_blobs(account_url: str, container_name: str) -> None:
credential = DefaultAzureCredential()
service_client = BlobServiceClient(account_url=account_url, credential=credential)
container_client = service_client.get_container_client(container_name)
try:
for blob in container_client.list_blobs(name_starts_with="2026/"):
print(blob.name)
except AzureError as exc:
logging.exception("Managed identity blob access failed: %s", exc)
raise
if __name__ == "__main__":
list_blobs(
account_url="https://mystorageaccount.blob.core.windows.net",
container_name="raw-ingest",
)