Production-Ready Azure Reference

Azure Cloud Engineer Handbook

An engineer-focused guide to Azure architecture, cost mechanics, portal navigation, and repeatable delivery with Azure Resource Manager primitives.

Azure Resource Manager | Bicep + Python SDK | Cost-Aware Architecture | March 2026
This handbook now includes all six modules. It is organized around the decisions Azure engineers make most often in production: what to deploy, how Azure bills it, where to inspect it, and how to standardize delivery with Azure Resource Manager and Bicep.

Table of Contents

This handbook is organized around the decisions Azure engineers make in production: what service to pick, how it is billed, where to inspect it in the portal, and how to automate it with Infrastructure as Code or SDKs.

Module 1: Compute & Serverless
App Service, Functions, AKS, and fast portal or CLI paths for deployment and diagnosis.

Module 2: Data & Storage
Blob Storage, Cosmos DB, Azure SQL Database, access tiers, throughput models, and pricing traps.

Module 3: Networking & Security
VNets, subnets, NSGs, Network Watcher, Key Vault, and secure identity-driven access patterns.

Module 4: Pricing & Cost Management
Calculator methodology, Reserved Instances, PAYG, Hybrid Benefit, budgets, and alerting.

Module 5: Infrastructure as Code Standard
Why portal-only delivery fails in production and how to standardize on Bicep modules and safe parameters.

Module 6: Common Pitfalls & Anti-Patterns
Idle resources, egress surprises, secret sprawl, and the operational habits that prevent avoidable spend.
Module | Primary Decision | What You Must Know
1. Compute & Serverless | Where application code should run | Billing model, scaling behavior, deployment path, and first-response troubleshooting steps
2. Data & Storage | How data is stored and accessed | Capacity pricing, transaction charges, throughput planning, and lifecycle choices
3. Networking & Security | How resources are isolated and trusted | Subnet boundaries, secret handling, RBAC, effective rules, and private connectivity
4. Pricing & Cost | How to estimate and control spend | Calculator discipline, discount models, and proactive alerts
5. IaC Standard | How infrastructure gets delivered | Version control, repeatability, reviewability, and environment-safe parameters
6. Pitfalls | What repeatedly causes incidents or bill spikes | Zombie resources, outbound data charges, and credential management mistakes

Module 1: Compute & Serverless

Azure compute choices are mostly a tradeoff between operational control and platform abstraction. App Service is the fastest path for standard web workloads, Azure Functions is the lowest-friction option for event-driven execution, and AKS exists for teams that truly need Kubernetes primitives, release independence, or container ecosystem tooling.

Architectural Lens
Pick the simplest runtime that still satisfies scaling, networking, compliance, and deployment requirements.
Billing Lens
Always ask whether you pay per reserved instance, per execution, or per node pool before approving a design.
Operations Lens
Portal search, metrics, deployment logs, and identity settings should be part of the runbook from day one.
Web App / API -> App Service
Event Trigger -> Functions
Kubernetes Control -> AKS

App Service

Azure App Service is a managed platform for hosting web applications, REST APIs, and lightweight background workloads without managing operating systems, patching cycles, or load balancer plumbing yourself. It is the default choice when a team wants a fast deployment path for customer-facing sites, internal line-of-business applications, or Python, .NET, Node.js, and Java APIs that do not require container orchestration.

When App Service Is The Right Fit: Web / API First

Use App Service when the application is HTTP-centric, state lives outside the process, and the team values managed deployment slots, TLS termination, autoscale, and Microsoft-managed patching over raw infrastructure control.

Tier | Best Use Case | How It Is Billed | Operational Impact
Free (F1) | Learning, demos, proof-of-concept apps | Shared compute with strict limits and no production-grade capacity guarantees | No SLA, constrained CPU minutes, not appropriate for real production traffic
Basic (B1-B3) | Small steady-state web apps and internal APIs | Fixed price per dedicated App Service Plan instance-hour | You reserve compute even when traffic is idle, so predictable but always-on cost
Premium (P1v3+ / P1v4+) | Production APIs, secure apps, autoscale workloads | Higher fixed price per dedicated instance-hour, plus related networking and outbound bandwidth costs | Supports autoscale, private networking scenarios, more memory and CPU, and stronger enterprise fit
Real-World Use Case
A product team hosts a customer portal and REST API on Linux App Service, uses deployment slots for zero-downtime swaps, and injects secrets from Key Vault using managed identity.
Pricing Reality
You pay for the App Service Plan, not just the app. Multiple apps on the same plan share the same reserved compute and memory pool.
Portal Tips
Use portal search with the app name, then open Diagnose and solve problems, Deployment Center, and Scale up before changing code or infrastructure.
Cost mistake to avoid: teams often create separate App Service Plans for every app. If security and scaling profiles allow it, consolidating compatible apps onto the same plan reduces idle reserved spend.
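A back-of-the-envelope model of that consolidation decision; the hourly rate here is a placeholder, not a quoted Azure price:

```python
# plan_consolidation_sketch.py
# Illustrative only: compare N single-app plans against one shared plan.
# The hourly rate is a placeholder, not a quoted Azure price.

HOURS_PER_MONTH = 730  # Azure's conventional billing month

def monthly_plan_cost(plan_count: int, hourly_rate: float) -> float:
    """Monthly cost of always-on App Service Plan instances at a given rate."""
    return plan_count * hourly_rate * HOURS_PER_MONTH

def consolidation_savings(app_count: int, hourly_rate: float) -> float:
    """Savings from hosting all apps on one shared plan instead of one plan each."""
    return monthly_plan_cost(app_count, hourly_rate) - monthly_plan_cost(1, hourly_rate)

# Five apps, each on its own plan, at a placeholder $0.10 per instance-hour:
print(f"Separate plans: ${monthly_plan_cost(5, 0.10):.2f}/month")   # 365.00
print(f"One shared plan: ${monthly_plan_cost(1, 0.10):.2f}/month")  # 73.00
print(f"Savings: ${consolidation_savings(5, 0.10):.2f}/month")      # 292.00
```

The shared-plan number only holds when the apps have compatible scaling and isolation requirements; one noisy app can starve the others on a shared plan.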

Portal and CLI Navigation

// app-service.bicep
// Deploy a Linux App Service Plan and Web App with a system-assigned identity.
// This keeps credentials out of code and makes the app ready for Key Vault or Storage RBAC.

@description('Location for all resources.')
param location string = resourceGroup().location

@description('Globally unique web app name.')
param webAppName string

@description('App Service Plan name.')
param appServicePlanName string = 'asp-azure-handbook'

@allowed([
  'F1'
  'B1'
  'P1v3'
])
@description('Choose Free for labs, Basic for fixed low-volume apps, Premium for production.')
param skuName string = 'B1'

var skuTier = skuName == 'F1' ? 'Free' : skuName == 'B1' ? 'Basic' : 'PremiumV3'

resource plan 'Microsoft.Web/serverfarms@2024-04-01' = {
  name: appServicePlanName
  location: location
  kind: 'linux'
  sku: {
    name: skuName
    tier: skuTier
    capacity: 1
  }
  properties: {
    reserved: true
  }
}

resource site 'Microsoft.Web/sites@2024-04-01' = {
  name: webAppName
  location: location
  kind: 'app,linux'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    serverFarmId: plan.id
    httpsOnly: true
    siteConfig: {
      linuxFxVersion: 'PYTHON|3.12'
      alwaysOn: skuName == 'F1' ? false : true
      minTlsVersion: '1.2'
      ftpsState: 'Disabled'
      appSettings: [
        {
          name: 'WEBSITE_RUN_FROM_PACKAGE'
          value: '1'
        }
      ]
    }
  }
}

output hostname string = site.properties.defaultHostName

Azure Functions

Azure Functions is Azure's event-driven serverless runtime. Use it when code should run in response to HTTP requests, storage events, timers, queue messages, Service Bus triggers, or Event Grid notifications without reserving a full web tier all day.

Plan | Billing Model | Best Fit | What To Watch
Consumption | Pay per execution count and execution duration based on memory used | Spiky event workloads, low or unpredictable traffic | Cold starts, execution limits, and network constraints matter
Premium | Pay for pre-warmed and active instances by allocated compute | Latency-sensitive functions, VNet integration, steady enterprise traffic | Higher baseline cost because capacity is reserved
Dedicated | Runs on an App Service Plan that you already pay for | Functions that should share reserved web compute with other apps | Cheap only if that plan already exists for a good reason
Real-World Use Case
A finance integration receives daily CSV drops in Blob Storage, validates them, and posts clean rows into downstream systems using a Blob trigger and queue fan-out.
Cost Mechanics
Consumption is attractive for bursty workloads, but Premium becomes easier to justify when cold starts, private networking, or predictable sustained volume matter.
Portal Tips
Open Functions, Monitor, Application Insights, and Configuration. When a function fails, check trigger binding settings before debugging business logic.
Billing shortcut: if your team says “serverless is always cheaper,” push back. Premium and Dedicated plans can cost more than a small web tier if the workload is always hot.
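One way to pressure-test that claim is to put rough numbers on the Consumption meters (execution count plus GB-seconds). The rates below are placeholders, not quoted prices, and the monthly free grant is ignored for simplicity:

```python
# consumption_cost_sketch.py
# Rough Consumption-plan estimate. Rates are placeholders; check the Azure
# pricing page for current values. The monthly free grant is ignored here.

def consumption_monthly_cost(executions: int, avg_seconds: float, memory_gb: float,
                             per_million_execs: float = 0.20,
                             per_gb_second: float = 0.000016) -> float:
    """Estimate = execution-count charge + GB-seconds charge."""
    execution_cost = executions / 1_000_000 * per_million_execs
    gb_seconds = executions * avg_seconds * memory_gb
    return execution_cost + gb_seconds * per_gb_second

# Always-hot workload: 10M invocations/month, 0.5 s each, 512 MB memory.
estimate = consumption_monthly_cost(10_000_000, avg_seconds=0.5, memory_gb=0.5)
print(f"Consumption estimate: ${estimate:.2f}/month")  # 42.00
# Compare that number to the fixed monthly price of a small dedicated tier
# before assuming serverless wins.
```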

Portal and CLI Navigation

# function_app.py
# Basic Azure Functions Python v2 HTTP trigger with input validation and safe error handling.

import json
import logging
import azure.functions as func

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)


@app.route(route="hello", methods=["GET"])
def hello(req: func.HttpRequest) -> func.HttpResponse:
    try:
        name = req.params.get("name")

        if not name:
            return func.HttpResponse(
                json.dumps({"error": "Query string parameter 'name' is required."}),
                status_code=400,
                mimetype="application/json",
            )

        payload = {
            "message": f"Hello, {name}. Your Azure Function is running.",
            "executionModel": "Python v2",
        }
        return func.HttpResponse(
            json.dumps(payload),
            status_code=200,
            mimetype="application/json",
        )

    except Exception as exc:
        logging.exception("HTTP trigger failed: %s", exc)
        return func.HttpResponse(
            json.dumps({"error": "Unexpected server error."}),
            status_code=500,
            mimetype="application/json",
        )

Azure Kubernetes Service (AKS)

AKS is Azure's managed Kubernetes offering for teams that need container orchestration, independent service release cycles, Kubernetes-native tooling, or deep control over ingress, networking, and workload composition. It is powerful, but it should be chosen because you need Kubernetes, not because it sounds “more cloud-native.”

Billing Model To Remember: Nodes Drive Spend

At a high level, the AKS control plane is free and Microsoft-managed, but you still pay for worker nodes, node pool VM sizes, managed disks, load balancers, public IPs, monitoring, and outbound bandwidth. Optional SLA or premium add-ons should still be checked against current pricing.
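Because nodes drive spend, the first AKS estimate is simple multiplication. The VM rate and disk figure below are placeholders, not quoted prices:

```python
# aks_node_cost_sketch.py
# Worker-node spend only. Load balancers, public IPs, monitoring, and egress
# bill separately. Rates are placeholders, not quoted Azure prices.

HOURS_PER_MONTH = 730

def aks_node_pool_monthly(node_count: int, vm_hourly_rate: float,
                          disk_monthly_per_node: float = 0.0) -> float:
    """Monthly node-pool cost: VM hours plus per-node managed disk."""
    per_node = vm_hourly_rate * HOURS_PER_MONTH + disk_monthly_per_node
    return node_count * per_node

# Three always-on nodes at a placeholder $0.20/hour with a $10/month OS disk each:
print(f"Node pool: ${aks_node_pool_monthly(3, 0.20, 10.0):.2f}/month")  # 468.00
```

This is why a lightly used cluster still costs real money: the multiplication runs 24/7 regardless of pod utilization.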

Real-World Use Case
A platform team runs multiple microservices with separate deployment pipelines, ingress rules, and service meshes, and needs pod-level autoscaling plus cluster policy enforcement.
Cost Mechanics
A lightly used AKS cluster can still cost materially more than App Service because the node pool is always running. Over-sized system pools are a common waste pattern.
Portal Tips
Search the cluster name, then inspect Node pools, Workloads, Insights, Activity log, and Kubernetes resources before making scaling changes.
Architecture warning: AKS adds meaningful operational overhead. If your workload is a single HTTP service with straightforward scaling needs, App Service or Container Apps is usually the better default.

Portal and CLI Navigation

// aks.bicep
// Minimal AKS deployment. Use this as a baseline, then add private networking,
// Azure Policy, workload identity, and dedicated user pools for real production clusters.

@description('Location for the AKS cluster.')
param location string = resourceGroup().location

@description('AKS cluster name.')
param aksName string = 'aks-handbook-demo'

@description('DNS prefix used by the API server endpoint.')
param dnsPrefix string = 'aks-handbook-demo'

@minValue(1)
@description('System node count. Start small, then scale with measured demand.')
param nodeCount int = 3

@description('Worker node VM size. This is where most AKS cost starts.')
param nodeVmSize string = 'Standard_D4ds_v5'

resource aks 'Microsoft.ContainerService/managedClusters@2024-09-01' = {
  name: aksName
  location: location
  sku: {
    name: 'Base'
    tier: 'Free'
  }
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: dnsPrefix
    enableRBAC: true
    agentPoolProfiles: [
      {
        name: 'system'
        mode: 'System'
        count: nodeCount
        vmSize: nodeVmSize
        osType: 'Linux'
        osSKU: 'Ubuntu'
        type: 'VirtualMachineScaleSets'
        maxPods: 30
      }
    ]
    networkProfile: {
      networkPlugin: 'azure'
      loadBalancerSku: 'standard'
    }
  }
}

output clusterName string = aks.name

Portal Search and Azure CLI Navigation

Fast diagnosis in Azure usually starts with two habits: using the global portal search bar aggressively and using Azure CLI to confirm what actually exists in the subscription. Both habits reduce time lost to clicking through the wrong resource group or reading stale deployment assumptions.

Task | Portal Shortcut | CLI Shortcut | First Diagnostic Check
Find a web app | Search exact app name or App Services | az webapp list -o table | Deployment status, app settings, and log stream
Find a function app | Search exact app name or Function App | az functionapp list -o table | Monitor tab, trigger bindings, storage account link
Find an AKS cluster | Search exact cluster name or Kubernetes services | az aks list -o table | Node pools, insights, recent activity log entries
Portal Workflow
Search by resource name first, then confirm subscription and resource group immediately. Azure estates often fail operationally because engineers troubleshoot the wrong environment.
CLI Workflow
Use CLI output to validate naming, SKU, region, and runtime before assuming the portal reflects the last deployment. Scriptable checks beat memory.
# Azure CLI quick-start for compute discovery and first-response troubleshooting

az login
az account set --subscription "<subscription-id-or-name>"

# Fast deploy for a simple App Service web app
az webapp up \
  --name "<globally-unique-webapp-name>" \
  --resource-group "<resource-group>" \
  --location "eastus" \
  --runtime "PYTHON:3.12" \
  --sku B1

# Inventory current compute assets
az webapp list -o table
az functionapp list -o table
az aks list -o table

# Inspect a specific resource when something looks wrong
az webapp show --name "<webapp-name>" --resource-group "<resource-group>"
az webapp log tail --name "<webapp-name>" --resource-group "<resource-group>"
az functionapp show --name "<function-app-name>" --resource-group "<resource-group>"
az aks show --name "<aks-name>" --resource-group "<resource-group>"
az aks get-credentials --name "<aks-name>" --resource-group "<resource-group>" --overwrite-existing
Recommended operational habit: when an incident begins, capture the exact resource name, resource group, subscription, SKU, and region in the first five minutes. That shortens almost every Azure troubleshooting loop.

Module 2: Data & Storage

Azure data services differ most on access pattern, consistency expectations, operational overhead, and pricing mechanics. Blob Storage is the default landing zone for unstructured data, Cosmos DB fits globally distributed low-latency NoSQL workloads, and Azure SQL Database is the relational PaaS default when transactions, joins, and mature SQL tooling matter.

Storage Lens
Ask whether the workload is object, document, or relational before choosing a service. Azure charges very differently for each model.
Cost Lens
Capacity is only part of the bill. Transactions, throughput reservations, geo-replication, backups, and retrieval fees frequently dominate spend.
Operations Lens
Lifecycle management, partition design, failover settings, and network restrictions should be reviewed before any production launch.

Azure Blob Storage

Azure Blob Storage is Azure's object store for unstructured data such as images, documents, backups, logs, parquet files, and application uploads. It is usually the simplest and cheapest place to put large binary assets, but the final bill depends on more than raw gigabytes.

Tier | Best Fit | How It Is Billed | Common Mistake
Hot | Frequently read objects, active app content, recent analytics files | Higher storage capacity rate, lower access charges | Keeping infrequently used backups hot for months
Cool | Infrequently accessed content with occasional retrieval | Lower capacity rate, higher read and retrieval charges, minimum retention period applies | Moving transactional data here and then reading it constantly
Archive | Long-term retention, audit files, compliance snapshots | Lowest capacity rate, highest retrieval latency and rehydration cost, minimum retention period applies | Treating archive as if it were an online filesystem
Real-World Use Case
A data engineering team lands raw vendor files in Blob Storage, triggers validation functions, then moves approved data to curated containers with lifecycle policies.
Pricing Reality
Blob Storage bills on storage capacity, read and write operations, data retrieval for cooler tiers, redundancy choice, and outbound data transfer when objects leave Azure.
Portal Tips
Open Containers, Lifecycle management, Data protection, Metrics, and Networking before assuming a storage problem is application-side.
Cost trap: cool and archive tiers look cheap until retrieval spikes. Capacity is only one line item. Frequent reads, rehydration, and cross-region traffic can erase expected savings very quickly.
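A quick model makes the trap concrete. The capacity and retrieval rates below are placeholders, not quoted prices:

```python
# blob_tier_cost_sketch.py
# Capacity plus retrieval only; operation and egress meters are extra.
# All rates are placeholders, not quoted Azure prices.

def blob_tier_monthly(gb_stored: float, gb_read: float,
                      capacity_rate: float, retrieval_rate: float) -> float:
    """Monthly estimate: storage capacity charge + data-retrieval charge."""
    return gb_stored * capacity_rate + gb_read * retrieval_rate

# 1 TB stored, 2 TB read back per month.
# Placeholder rates: hot ~$0.018/GB stored with free retrieval;
# cool ~$0.010/GB stored but ~$0.01/GB on retrieval.
hot = blob_tier_monthly(1000, gb_read=2000, capacity_rate=0.018, retrieval_rate=0.0)
cool = blob_tier_monthly(1000, gb_read=2000, capacity_rate=0.010, retrieval_rate=0.01)
print(f"Hot:  ${hot:.2f}/month")   # 18.00
print(f"Cool: ${cool:.2f}/month")  # 30.00 -- the "cheaper" tier loses under heavy reads
```

Run the same arithmetic with your actual read volume before any tier migration; the crossover point is entirely workload-dependent.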

Portal and CLI Navigation

# blob_upload.py
# Upload a file to Azure Blob Storage using Entra ID and RBAC instead of account keys.
# Required role for the caller or managed identity: Storage Blob Data Contributor.

from pathlib import Path
import logging

from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, ContentSettings


def upload_file_to_blob(account_url: str, container_name: str, source_path: str, blob_name: str) -> None:
    credential = DefaultAzureCredential()
    blob_service_client = BlobServiceClient(account_url=account_url, credential=credential)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

    file_path = Path(source_path)
    if not file_path.exists():
        raise FileNotFoundError(f"Source file not found: {file_path}")

    try:
        with file_path.open("rb") as data:
            blob_client.upload_blob(
                data,
                overwrite=True,
                content_settings=ContentSettings(content_type="application/octet-stream"),
            )
        logging.info("Uploaded %s to %s/%s", file_path.name, container_name, blob_name)
    except AzureError as exc:
        logging.exception("Blob upload failed: %s", exc)
        raise


if __name__ == "__main__":
    upload_file_to_blob(
        account_url="https://mystorageaccount.blob.core.windows.net",
        container_name="raw-ingest",
        source_path="./sample-data/orders.csv",
        blob_name="2026/03/orders.csv",
    )

Azure Cosmos DB

Azure Cosmos DB is Azure's globally distributed NoSQL database for low-latency document, key-value, graph, and related models. It is powerful when you genuinely need global distribution, elastic scale, or predictable low latency, but it is also one of the easiest Azure services to overspend on when partition design is poor.

Throughput Model | How It Is Billed | Best Fit | Watchouts
Provisioned RU/s | Pay for reserved request units every hour whether you consume them or not | Steady traffic with predictable throughput needs | Idle but over-provisioned containers waste money continuously
Autoscale RU/s | Pay for autoscale capacity based on configured max RU/s | Workloads with daily or weekly spikes | Setting max RU/s far too high can still inflate spend materially
Serverless | Pay per request consumed plus storage, without reserved RU/s | Low-volume, sporadic, or dev and test workloads | Not the right fit for sustained heavy production load
Real-World Use Case
A SaaS product stores tenant-scoped user preferences and activity events in Cosmos DB so reads stay low latency across multiple regions.
Pricing Reality
Cosmos DB charges for RU consumption or provisioned RU/s, storage, backup options, and each replicated region. Multi-region writes increase cost further.
Portal Tips
Check Data Explorer, Metrics, Replicate data globally, and Scale & settings before assuming a query issue is just application latency.
Partition key failure is the classic Cosmos billing disaster. A bad key creates hot partitions, 429 throttling, uneven storage distribution, and a team response that is usually “buy more RU/s” instead of fixing the data model.
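Putting numbers on that pattern helps. The per-100-RU/s hourly rate below is a placeholder, not a quoted price:

```python
# cosmos_ru_cost_sketch.py
# Provisioned throughput bills per 100 RU/s per hour, consumed or not.
# The rate is a placeholder, not a quoted Azure price.

HOURS_PER_MONTH = 730

def provisioned_ru_monthly(ru_per_second: int,
                           rate_per_100_ru_hour: float = 0.008) -> float:
    """Monthly bill for a provisioned-throughput container."""
    return ru_per_second / 100 * rate_per_100_ru_hour * HOURS_PER_MONTH

def idle_waste(ru_provisioned: int, ru_avg_used: float) -> float:
    """Spend attributable to throughput you reserved but did not consume."""
    unused_fraction = (ru_provisioned - ru_avg_used) / ru_provisioned
    return provisioned_ru_monthly(ru_provisioned) * unused_fraction

# 4,000 RU/s provisioned to absorb hot-partition throttling,
# but averaging only 800 RU/s of real traffic:
print(f"Monthly throughput bill: ${provisioned_ru_monthly(4000):.2f}")  # 233.60
print(f"Paid for idle capacity:  ${idle_waste(4000, 800):.2f}")         # 186.88
```

When most of the bill is idle capacity bought to outrun throttling, the fix is the partition key, not more RU/s.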

Portal and CLI Navigation

// cosmosdb.bicep
// Provision a Cosmos DB SQL API account, database, and container with autoscale throughput.
// Pick the partition key carefully. It should spread writes evenly and match common query patterns.

@description('Deployment location for the Cosmos DB account.')
param location string = resourceGroup().location

@description('Globally unique Cosmos DB account name.')
param accountName string

@description('SQL database name.')
param databaseName string = 'appdb'

@description('Container name.')
param containerName string = 'orders'

@description('Partition key path. Example: /tenantId or /customerId.')
param partitionKeyPath string = '/tenantId'

@minValue(1000)
@description('Autoscale max RU/s. Start with measured demand, not guesswork.')
param maxAutoscaleRu int = 4000

resource cosmos 'Microsoft.DocumentDB/databaseAccounts@2023-04-15' = {
  name: accountName
  location: location
  kind: 'GlobalDocumentDB'
  properties: {
    databaseAccountOfferType: 'Standard'
    publicNetworkAccess: 'Enabled'
    enableAutomaticFailover: false
    locations: [
      {
        locationName: location
        failoverPriority: 0
        isZoneRedundant: false
      }
    ]
    consistencyPolicy: {
      defaultConsistencyLevel: 'Session'
    }
  }
}

resource sqlDb 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases@2023-04-15' = {
  parent: cosmos
  name: databaseName
  properties: {
    resource: {
      id: databaseName
    }
    options: {}
  }
}

resource container 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2023-04-15' = {
  parent: sqlDb
  name: containerName
  properties: {
    resource: {
      id: containerName
      partitionKey: {
        paths: [
          partitionKeyPath
        ]
        kind: 'Hash'
        version: 2
      }
      indexingPolicy: {
        indexingMode: 'consistent'
        automatic: true
        includedPaths: [
          {
            path: '/*'
          }
        ]
        excludedPaths: [
          {
            path: '/"_etag"/?'
          }
        ]
      }
    }
    options: {
      autoscaleSettings: {
        maxThroughput: maxAutoscaleRu
      }
    }
  }
}

output cosmosEndpoint string = cosmos.properties.documentEndpoint

Azure SQL Database

Azure SQL Database is Azure's PaaS relational database for transactional systems that need SQL Server compatibility without managing Windows hosts, patching, backups, or cluster operations. It is usually the default for line-of-business apps, reporting stores, and APIs that rely on relational constraints and mature SQL tooling.

Model | How It Is Billed | Best Fit | Decision Note
DTU | Bundled compute, memory, and IO in fixed tiers | Smaller legacy workloads or teams that want simplified sizing | Simple to buy, but less transparent about what resources you are getting
vCore | Pay for chosen compute generation, vCores, storage, backups, and extras such as zone redundancy | Modern production workloads needing clearer sizing and cost control | Usually preferred because it maps better to actual resource planning and discount options
Real-World Use Case
An order management API uses Azure SQL Database for transactional writes, relational integrity, predictable backups, and built-in query performance tooling.
Pricing Reality
Beyond compute, Azure SQL bills for database storage, backup storage over included amounts, zone redundancy, and outbound data transfer. Over-sizing for peak load is a common waste pattern.
Portal Tips
Inspect Query editor, Intelligent Performance, Firewall and virtual networks, and Geo-replication before changing the app or the database tier.
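For serverless vCore databases, compute bills per vCore-second only while the database is active, and auto-pause drops compute cost to zero between bursts. A hedged sketch with a placeholder rate:

```python
# sql_serverless_cost_sketch.py
# Compute portion only; storage and backup meters bill separately.
# The per-vCore-second rate is a placeholder, not a quoted Azure price.

def serverless_sql_compute_monthly(active_hours: float, avg_vcores: float,
                                   vcore_second_rate: float = 0.000145) -> float:
    """Compute cost while active; paused time accrues no compute charge."""
    return active_hours * 3600 * avg_vcores * vcore_second_rate

# Business-hours workload: ~160 active hours/month averaging 1 vCore.
print(f"Serverless compute: ${serverless_sql_compute_monthly(160, 1.0):.2f}/month")  # 83.52
```

If the database is active nearly all month, provisioned compute (optionally with reservations) usually prices out better than serverless; run the comparison before picking the SKU.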

Portal and CLI Navigation

// azure-sql.bicep
// Deploy a logical SQL server and a vCore-based Azure SQL Database.
// Credentials are parameters so they are never hardcoded in source control.

@description('Location for SQL resources.')
param location string = resourceGroup().location

@description('Globally unique logical SQL server name.')
param sqlServerName string

@description('Database name.')
param databaseName string = 'appdb'

@description('SQL administrator login name.')
param administratorLogin string

@secure()
@description('SQL administrator password supplied at deployment time.')
param administratorPassword string

resource sqlServer 'Microsoft.Sql/servers@2023-08-01-preview' = {
  name: sqlServerName
  location: location
  properties: {
    administratorLogin: administratorLogin
    administratorLoginPassword: administratorPassword
    minimalTlsVersion: '1.2'
    publicNetworkAccess: 'Disabled'
  }
}

resource database 'Microsoft.Sql/servers/databases@2023-08-01-preview' = {
  parent: sqlServer
  name: databaseName
  location: location
  sku: {
    name: 'GP_S_Gen5_2'
    tier: 'GeneralPurpose'
    capacity: 2
  }
  properties: {
    collation: 'SQL_Latin1_General_CP1_CI_AS'
    maxSizeBytes: 10737418240
    autoPauseDelay: 60
    minCapacity: 0.5
    backupStorageRedundancy: 'Local'
  }
}

output fullyQualifiedServerName string = sqlServer.properties.fullyQualifiedDomainName

Module 3: Networking & Security

Azure security starts with network boundaries and identity boundaries working together. VNets and subnets create containment. NSGs express allowed traffic. Key Vault keeps secrets out of code. Network Watcher helps you prove whether the network is actually the problem before the incident drifts into guesswork.

Isolation Lens
Put application, data, and management paths in deliberately chosen subnets so a compromise or misconfiguration does not spill across the estate.
Identity Lens
Managed identities and RBAC remove credential sprawl. This is a security control and an operations control, not just a coding preference.
Diagnostics Lens
When packets fail, use Network Watcher and effective rules first. Network incidents get expensive when teams troubleshoot by intuition.

Virtual Networks (VNet) & Subnets

A VNet is the core Azure network boundary for private IP addressing, segmentation, and controlled connectivity between resources. Subnets turn that boundary into meaningful isolation domains. Separating app, data, and management paths is critical because it limits blast radius, simplifies policy, and makes route and security intent explicit.
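Subnet planning is easy to sanity-check in code before anything is deployed. A small sketch with Python's ipaddress module, using the same 10.20.0.0/16 plan as the VNet template later in this module; note that Azure reserves 5 IP addresses in every subnet:

```python
# subnet_plan_sketch.py
# Sanity-check a VNet carve-up before deployment.

import ipaddress

# Azure reserves 5 addresses per subnet: network, broadcast, and three
# Azure-internal addresses (gateway and DNS).
AZURE_RESERVED_IPS = 5

def carve_subnets(vnet_cidr: str, subnet_prefix: int):
    """Enumerate candidate subnets and report usable IPs per subnet."""
    vnet = ipaddress.ip_network(vnet_cidr)
    subnets = list(vnet.subnets(new_prefix=subnet_prefix))
    usable = subnets[0].num_addresses - AZURE_RESERVED_IPS
    return subnets, usable

subnets, usable = carve_subnets("10.20.0.0/16", 24)
print(f"{len(subnets)} /24 subnets available, {usable} usable IPs each")
print("app  ->", subnets[1])   # 10.20.1.0/24
print("data ->", subnets[2])   # 10.20.2.0/24
```

Running the numbers up front avoids the classic retrofit problem: a subnet that cannot grow because adjacent address space was already handed out.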

Billing Model To Remember: Boundary, Not The Bill

VNets and subnets themselves are generally not the expensive line items. The bill grows from attached components such as NAT Gateway, Azure Firewall, VPN Gateway, Private Endpoints, cross-region peering, and outbound bandwidth.

Real-World Use Case
A three-tier app places App Service integration in an application subnet, SQL private endpoints in a data subnet, and jump box or management services in a locked-down admin subnet.
Security Reality
Flat address spaces and shared subnets create unclear ownership and looser controls. Teams then compensate with brittle NSG rules and emergency exceptions.
Portal Tips
Inspect Address space, Subnets, Peerings, and any associated Network security group before changing route or firewall behavior.

Portal and CLI Navigation

// vnet.bicep
// Create a VNet with separate application and data subnets plus an NSG.
// Network isolation is cheap compared to retrofitting security after go-live.

@description('Location for networking resources.')
param location string = resourceGroup().location

@description('Virtual network name.')
param vnetName string = 'vnet-app-prod'

resource nsg 'Microsoft.Network/networkSecurityGroups@2024-03-01' = {
  name: 'nsg-app-subnet'
  location: location
  properties: {
    securityRules: [
      {
        name: 'AllowHttpsInbound'
        properties: {
          protocol: 'Tcp'
          sourcePortRange: '*'
          destinationPortRange: '443'
          sourceAddressPrefix: 'Internet'
          destinationAddressPrefix: '*'
          access: 'Allow'
          priority: 100
          direction: 'Inbound'
        }
      }
    ]
  }
}

resource vnet 'Microsoft.Network/virtualNetworks@2024-03-01' = {
  name: vnetName
  location: location
  properties: {
    addressSpace: {
      addressPrefixes: [
        '10.20.0.0/16'
      ]
    }
    subnets: [
      {
        name: 'app'
        properties: {
          addressPrefix: '10.20.1.0/24'
          networkSecurityGroup: {
            id: nsg.id
          }
        }
      }
      {
        name: 'data'
        properties: {
          addressPrefix: '10.20.2.0/24'
        }
      }
    ]
  }
}

output vnetId string = vnet.id

Azure Key Vault

Azure Key Vault stores secrets, certificates, and cryptographic keys so application teams do not hardcode or manually rotate sensitive values. It should be the default home for application secrets unless there is a stronger service-specific reason to use another secure store.

Capability | Purpose | How It Is Billed | Operational Note
Secrets | Passwords, connection strings, tokens, API keys | Per operation and storage at the vault tier | Usually the default for app configuration that must stay secret
Keys | Encryption keys and signing material | Per key and per cryptographic operation, higher for HSM-backed options | Use Premium when hardware-backed keys are required
Certificates | TLS certificate lifecycle management | Vault operations plus any external CA cost | Useful when teams need central renewal and access control
Real-World Use Case
An App Service reads a database password from Key Vault using its system-assigned managed identity, so no credential is stored in source control or deployment variables.
Pricing Reality
Key Vault itself is inexpensive, but high-volume secret polling, premium HSM usage, and noisy retry loops can still create avoidable cost and operational churn.
Portal Tips
Open Secrets, Access control (IAM), Networking, Events, and Diagnostic settings before assuming an identity issue is an app bug.
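One cheap mitigation for high-volume secret polling is an in-process TTL cache in front of the vault client. A minimal sketch; the fetch callable, TTL, and class name are assumptions, and in practice you would wire fetch to something like SecretClient.get_secret:

```python
# secret_cache_sketch.py
# TTL cache so hot request paths do not call Key Vault on every invocation.
# The fetch callable and TTL are illustrative assumptions.

import time

class TtlSecretCache:
    """Wrap a secret fetcher (name -> value) with a time-to-live cache."""

    def __init__(self, fetch, ttl_seconds: float = 300.0, clock=time.monotonic):
        self._fetch = fetch            # callable: secret name -> secret value
        self._ttl = ttl_seconds
        self._clock = clock            # injectable for testing
        self._entries: dict[str, tuple[str, float]] = {}
        self.fetch_count = 0           # how many real vault calls happened

    def get(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is None or now - entry[1] >= self._ttl:
            self.fetch_count += 1
            entry = (self._fetch(name), now)
            self._entries[name] = entry
        return entry[0]
```

A five-minute TTL turns thousands of per-request reads into a handful of vault operations per hour; pick a TTL shorter than your rotation window so rotated values propagate promptly.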

Portal and CLI Navigation

# key_vault_read.py
# Fetch a Key Vault secret securely using DefaultAzureCredential.
# This works locally with az login and in Azure with a managed identity.

import logging

from azure.core.exceptions import AzureError, ResourceNotFoundError
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient


def get_secret(vault_url: str, secret_name: str) -> str:
    credential = DefaultAzureCredential()
    client = SecretClient(vault_url=vault_url, credential=credential)

    try:
        secret = client.get_secret(secret_name)
        return secret.value
    except ResourceNotFoundError as exc:
        logging.exception("Secret not found: %s", exc)
        raise
    except AzureError as exc:
        logging.exception("Failed to read secret from Key Vault: %s", exc)
        raise


if __name__ == "__main__":
    value = get_secret(
        vault_url="https://my-shared-vault.vault.azure.net/",
        secret_name="sql-admin-password",
    )
    print(f"Retrieved secret with length: {len(value)}")

Navigation and Troubleshooting: Network Watcher & Effective Security Rules

Network Watcher is the first-response toolbox for proving where traffic is being dropped. Use it to check topology, connection troubleshoot, IP flow verify, and effective security rules rather than guessing which NSG, route, or peering change caused the failure.

Real-World Use Case
A VM can reach the internet but not a private database endpoint. Network Watcher narrows the issue to an NSG deny on the subnet instead of a DNS or application problem.
Pricing Reality
Basic Network Watcher features are not usually the main cost driver, but flow logs, Traffic Analytics, and Log Analytics ingestion absolutely are. Turn them on deliberately.
Portal Tips
Use Network Watcher for IP flow verify and Connection troubleshoot. On a VM NIC, open Effective security rules before changing any NSG.

Portal and CLI Navigation

# effective_nsg.py
# Retrieve effective NSG rules for a VM network interface using the Azure Python SDK.

import logging
import os

from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient


def print_effective_nsg(resource_group: str, nic_name: str) -> None:
    subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
    credential = DefaultAzureCredential()
    client = NetworkManagementClient(credential, subscription_id)

    try:
        poller = client.network_interfaces.begin_get_effective_network_security_groups(
            resource_group_name=resource_group,
            network_interface_name=nic_name,
        )
        result = poller.result()
        for association in result.value:
            logging.info("NSG association: %s", association.network_security_group.id)
            for rule in association.effective_security_rules:
                print(
                    f"{rule.name}: {rule.direction} {rule.access} "
                    f"ports={rule.destination_port_range} source={rule.source_address_prefix}"
                )
    except AzureError as exc:
        logging.exception("Failed to retrieve effective NSG rules: %s", exc)
        raise


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print_effective_nsg(resource_group="rg-network-prod", nic_name="vm-app-01-nic")

Module 4: Pricing Calculation & Cost Management

Azure cost control is a design discipline, not a finance afterthought. Estimates should be built from real SKUs, realistic traffic, storage growth, and data transfer assumptions. Governance then keeps reality from drifting far away from the estimate.

Estimate First
Use the Azure Pricing Calculator before deployment so architecture tradeoffs are visible while they are still cheap to change.
Discount Second
Reserved capacity and Hybrid Benefit can materially improve unit economics, but only after the baseline sizing is credible.
Guardrails Always
Budgets, forecast alerts, and cost reviews are how you prevent small experiments from turning into monthly surprises.

The Pricing Calculator

The official Azure Pricing Calculator is the right starting point for workload estimation, but only if you treat it as an engineering exercise instead of a quick sales estimate. Calculator output is usually list-price guidance. It does not automatically know your reserved capacity, enterprise agreement discounts, or workload inefficiencies.

StepWhat To DoWhy It Matters
1. Define scopeList every service in the architecture, not just the headline compute tierMany Azure bills are driven by supporting services such as storage, monitoring, and data transfer
2. Pin region and SKUSelect the exact Azure region, tier, redundancy mode, and expected hoursRegion and SKU change prices materially, especially for compute and storage redundancy
3. Model usageEstimate requests, GB stored, GB transferred, RU/s, or node-hours based on real load expectationsUnder-modeling usage makes the calculator look cheap and the production bill look wrong
4. Add operations costsInclude transactions, backup retention, retrieval fees, and monitoring ingestionThese line items are often missing from first-pass estimates
5. Validate against realityCompare the estimate with a pilot environment or a prior month's spendThis exposes bad assumptions before leadership starts relying on the number
6. Revisit monthlyUpdate estimates as SKUs, traffic, and architecture changePricing drift is normal. Unreviewed estimates become fiction quickly
Real-World Use Case
A migration team compares App Service Premium, Functions Premium, and AKS for the same API workload. The calculator shows that worker node and observability costs make AKS materially more expensive before any engineering labor is considered.
Navigation Tip
Use the official Azure Pricing Calculator for the estimate, then validate assumptions in Cost Management > Cost analysis after the pilot is live. The calculator tells you expected spend. Cost analysis tells you actual spend.
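The six-step methodology above can be sanity-checked with a back-of-the-envelope model before opening the calculator. The sketch below is illustrative only: every unit price and usage figure is an assumption, not a current Azure list price, and the line items stand in for whatever your architecture actually contains.

```python
# pricing_estimate.py
# Back-of-the-envelope monthly estimate following the six-step methodology.
# All unit prices and quantities below are illustrative assumptions, not
# current Azure list prices; pin real SKUs and rates in the Pricing Calculator.

from dataclasses import dataclass


@dataclass
class LineItem:
    name: str
    quantity: float    # units consumed per month (hours, GB, etc.)
    unit_price: float  # assumed price per unit in USD

    @property
    def monthly_cost(self) -> float:
        return self.quantity * self.unit_price


def estimate(items: list[LineItem]) -> float:
    """Print each line item and return the estimated monthly total."""
    for item in items:
        print(f"{item.name:<32} {item.monthly_cost:>10.2f} USD")
    total = sum(item.monthly_cost for item in items)
    print(f"{'Estimated monthly total':<32} {total:>10.2f} USD")
    return total


if __name__ == "__main__":
    estimate([
        LineItem("App Service plan (hours)", 730, 0.10),     # step 2: SKU hours
        LineItem("Blob storage (GB-months)", 500, 0.02),     # step 3: GB stored
        LineItem("Egress bandwidth (GB)", 200, 0.08),        # step 3: GB transferred
        LineItem("Log Analytics ingestion (GB)", 50, 2.50),  # step 4: operations costs
    ])
```

Note how the monitoring ingestion line dominates this toy total: that is exactly the step-4 trap the table warns about, and the reason supporting services belong in the first estimate, not a revision.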

Cost Optimization: Reserved Instances, PAYG, and Azure Hybrid Benefit

Pay-As-You-Go (PAYG) is the baseline model: you pay standard rates with maximum flexibility and no long-term commitment. Reserved Instances or reserved capacity trade commitment for lower unit pricing on eligible services. Azure Hybrid Benefit lets you apply qualifying existing Windows Server or SQL Server licenses to reduce Azure software charges.

OptionBest FitHow It Changes The BillRisk
PAYGNew workloads, uncertain demand, short-lived environmentsHighest flexibility, usually highest unit costTeams forget to revisit it after usage stabilizes
Reserved Instances / Reserved CapacitySteady-state production with predictable baseline usageLower hourly cost in exchange for 1-year or 3-year commitmentWrong sizing or wrong region commitment limits savings
Azure Hybrid BenefitOrganizations with eligible Microsoft licensesReduces software portion of Azure compute or database costLicense eligibility and assignment must be governed carefully
Real-World Use Case
A production SQL estate running predictable business hours moves from PAYG vCore pricing to reserved capacity and applies Azure Hybrid Benefit, cutting monthly run cost without changing application code.
Portal and CLI Navigation
Use Reservations, Advisor, and Cost analysis in the portal. In CLI, start with az consumption usage list and any service-specific SKU inspection commands to validate that usage is steady enough to justify commitment.
i
Optimization rule: do not buy reservations to compensate for bad architecture. Rightsize first, then commit to the baseline that remains.
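A simple breakeven comparison makes the PAYG-versus-reservation decision concrete before anything is purchased. This is a sketch under stated assumptions: the hourly rate and discount are made up for illustration, and real numbers should come from the Pricing Calculator and the Reservations blade.

```python
# reservation_breakeven.py
# Compare PAYG against a 1-year reservation for a steady-state workload.
# The hourly rate and discount are illustrative assumptions, not real SKU
# pricing; validate with actual quotes before committing.

HOURS_PER_MONTH = 730


def payg_monthly(hourly_rate: float, utilization: float) -> float:
    """PAYG bills only the hours actually used."""
    return hourly_rate * HOURS_PER_MONTH * utilization


def reserved_monthly(hourly_rate: float, discount: float) -> float:
    """A reservation bills every hour, discounted, regardless of utilization."""
    return hourly_rate * HOURS_PER_MONTH * (1.0 - discount)


def breakeven_utilization(discount: float) -> float:
    """Utilization above which the reservation beats PAYG."""
    return 1.0 - discount


if __name__ == "__main__":
    rate = 0.50      # assumed PAYG hourly rate in USD
    discount = 0.35  # assumed 1-year reservation discount
    for utilization in (0.40, 0.65, 0.90):
        payg = payg_monthly(rate, utilization)
        reserved = reserved_monthly(rate, discount)
        winner = "reserve" if reserved < payg else "stay PAYG"
        print(f"utilization {utilization:.0%}: PAYG {payg:.2f} vs reserved {reserved:.2f} -> {winner}")
    print(f"breakeven utilization: {breakeven_utilization(discount):.0%}")
```

The rule above still applies: rightsize first, because committing to an oversized SKU lowers the unit price of waste without removing it.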

Budgets and Cost Alerts

Budgets are the simplest protection against bill shock. A budget does not stop spend by itself, but it gives teams a threshold, forecast visibility, and a forcing function to react before the invoice arrives.

Portal Setup Steps
Open Cost Management + Billing, choose the target subscription or resource group, open Budgets, create a monthly budget, then add alert thresholds for actual and forecasted spend such as 80 percent, 90 percent, and 100 percent.
Billing Reality
Budget alerts themselves are not the expensive part. The real value is operational: they shorten the time between unexpected spend and action. The underlying Azure resources keep billing until you remediate them.

Portal and CLI Navigation

// budget.bicep
// Subscription-scope monthly budget with actual and forecast alerts.

targetScope = 'subscription'

@description('Budget resource name.')
param budgetName string = 'engineering-monthly-budget'

@description('Budget amount in USD or the billing currency of the subscription.')
param budgetAmount int = 1500

@description('Budget notification recipients.')
param contactEmails array = [
  'cloud-ops@example.com'
]

resource budget 'Microsoft.Consumption/budgets@2023-05-01' = {
  name: budgetName
  properties: {
    category: 'Cost'
    amount: budgetAmount
    timeGrain: 'Monthly'
    timePeriod: {
      startDate: '2026-01-01T00:00:00Z'
      endDate: '2027-01-01T00:00:00Z'
    }
    notifications: {
      actual80: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 80
        thresholdType: 'Actual'
        contactEmails: contactEmails
      }
      forecast100: {
        enabled: true
        operator: 'GreaterThan'
        threshold: 100
        thresholdType: 'Forecasted'
        contactEmails: contactEmails
      }
    }
  }
}

Module 5: Infrastructure as Code (IaC) Standard

Infrastructure as Code is the production standard because Azure Resource Manager changes should be reviewable, repeatable, parameterized, and deployable across environments without human clicking. The portal is excellent for inspection and emergency response. It is not a durable production change-management system.

Why IaC

Clicking through the portal is an anti-pattern for production because it creates undocumented drift, makes peer review impossible, and turns every rebuild into archaeology. Bicep fixes that by giving Azure engineers a native ARM abstraction with reusable modules, parameter files, and deployment history.

Real-World Use Case
A platform team provisions identical non-production environments from Bicep parameter files, reducing onboarding time and eliminating “it works only in one subscription” drift.
Operational Benefit
Deployments become diffable and testable. Roll-forward becomes safer than manual rollback because the desired state is explicit.
Portal and CLI Navigation
Use the portal Deployments blade to inspect ARM history and use az deployment group what-if or az deployment sub what-if before applying changes.
!
Production rule: if the only copy of a change exists in somebody's browser history, the infrastructure is not under control.

Bicep Example: Storage Account + App Service with Managed Identity

The following Bicep file is a clean baseline for a typical web workload. It provisions a standard Storage Account and a Linux App Service with a system-assigned managed identity. The storage account is locked down with HTTPS-only and public blob access disabled. The App Service is ready to authenticate to other Azure services using its identity instead of embedded credentials.

Real-World Use Case
A product team deploys the same web tier to dev, test, and prod using different parameter files, then grants the app identity RBAC access to Storage or Key Vault without changing application secrets.
Portal and CLI Navigation
After deployment, verify the Identity blade on the web app, Access control (IAM) on the storage account, and the deployment output using az deployment group show.
// main.bicep
// Deploy a standard storage account and Linux App Service with a system-assigned managed identity.

@description('Deployment location for all resources.')
param location string = resourceGroup().location

@description('Globally unique storage account name using only lowercase letters and numbers.')
param storageAccountName string

@description('Globally unique web app name.')
param webAppName string

@description('App Service Plan name.')
param appServicePlanName string = 'asp-standard-linux'

@allowed([
  'B1'
  'P1v3'
])
@description('Choose B1 for smaller steady-state workloads and P1v3 for production autoscale scenarios.')
param appServiceSku string = 'B1'

var appServiceTier = appServiceSku == 'B1' ? 'Basic' : 'PremiumV3'

resource storage 'Microsoft.Storage/storageAccounts@2023-05-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
  properties: {
    accessTier: 'Hot'
    allowBlobPublicAccess: false
    allowSharedKeyAccess: false
    minimumTlsVersion: 'TLS1_2'
    publicNetworkAccess: 'Enabled'
    supportsHttpsTrafficOnly: true
  }
}

resource plan 'Microsoft.Web/serverfarms@2024-04-01' = {
  name: appServicePlanName
  location: location
  kind: 'linux'
  sku: {
    name: appServiceSku
    tier: appServiceTier
    capacity: 1
  }
  properties: {
    reserved: true
  }
}

resource site 'Microsoft.Web/sites@2024-04-01' = {
  name: webAppName
  location: location
  kind: 'app,linux'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    serverFarmId: plan.id
    httpsOnly: true
    siteConfig: {
      linuxFxVersion: 'PYTHON|3.12'
      alwaysOn: true
      minTlsVersion: '1.2'
      ftpsState: 'Disabled'
      appSettings: [
        {
          name: 'AZURE_STORAGE_ACCOUNT_NAME'
          value: storage.name
        }
        {
          name: 'WEBSITE_RUN_FROM_PACKAGE'
          value: '1'
        }
      ]
    }
  }
}

output storageBlobEndpoint string = storage.properties.primaryEndpoints.blob
output webAppHostname string = site.properties.defaultHostName
output webAppPrincipalId string = site.identity.principalId

Module 6: Common Pitfalls & Anti-Patterns

Most Azure overspend and avoidable security exposure comes from a short list of repeated mistakes, not obscure platform behavior. The recurring pattern is the same: no inventory discipline, weak ownership, and changes made without cost or identity review.

1. Leaving Idle Resources Running

Idle resources are one of the fastest ways to waste money in Azure. Unattached managed disks, unused public IPs, forgotten App Service Plans, dormant AKS node pools, and test databases continue billing even when no application is using them.

Real-World Use Case
A project deletes its VMs after a migration test but leaves premium disks and public IPs behind. The application is gone, but the monthly bill remains.
Billing Reality
Managed disks and reserved compute keep billing independently of whether they are attached. Azure charges resources, not intent.
Portal and CLI Navigation
Check Cost analysis, Advisor, Disks, and Public IP addresses. In CLI, inventory with az disk list and az network public-ip list.
# idle_resource_audit.py
# Identify unattached managed disks and unassociated public IP addresses.

import logging
import os

from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient


def find_idle_resources() -> None:
    subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
    credential = DefaultAzureCredential()
    compute_client = ComputeManagementClient(credential, subscription_id)
    network_client = NetworkManagementClient(credential, subscription_id)

    try:
        print("Unattached managed disks:")
        for disk in compute_client.disks.list():
            if not disk.managed_by:
                print(f"- {disk.name} ({disk.location}) sku={disk.sku.name}")

        print("\nUnassociated public IP addresses:")
        for public_ip in network_client.public_ip_addresses.list_all():
            if public_ip.ip_configuration is None:
                print(f"- {public_ip.name} ({public_ip.location}) sku={public_ip.sku.name}")
    except AzureError as exc:
        logging.exception("Idle resource discovery failed: %s", exc)
        raise


if __name__ == "__main__":
    find_idle_resources()

2. Misunderstanding Egress Bandwidth Costs

Data entering Azure is usually free. Data leaving Azure is not. That applies to internet egress, many cross-region transfers, some peering patterns, CDN origin traffic, backups restored across boundaries, and storage-heavy workloads that serve large files directly to users.

ScenarioWhy Teams Miss ItWhat Actually Costs Money
Serving files from Blob StorageStorage capacity looks cheap in the estimateOutbound bandwidth and high read volume can become the real bill
Cross-region architecturesTeams focus on resiliency, not transfer patternsReplication and inter-region data movement can materially raise run cost
API-heavy applicationsDevelopers count requests but not response payload sizeLarge response bodies multiply egress charges as traffic grows
Real-World Use Case
A media app stores assets cheaply in Blob Storage, then serves them globally without CDN or caching. The storage line item stays modest while bandwidth becomes the surprise top charge.
Portal and CLI Navigation
In the portal, use Cost analysis filtered by meter category Bandwidth and inspect storage account metrics. In CLI, review usage data and validate where traffic leaves the platform before scaling the same pattern further.
!
Design correction: if large content is leaving Azure frequently, evaluate CDN, caching, compression, regional placement, and whether clients really need the current payload size.
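The payload-size row in the table above is worth quantifying, because egress is roughly requests times average response size times rate. The sketch below uses assumed figures throughout: the per-GB rate and free allowance are placeholders, since actual bandwidth pricing varies by zone, tier, and monthly volume.

```python
# egress_estimate.py
# Rough internet egress estimate for a content-serving workload.
# The per-GB rate and free-tier allowance are illustrative assumptions;
# check current bandwidth pricing for your zone and tier.

def monthly_egress_gb(requests_per_month: int, avg_response_mb: float) -> float:
    """Total outbound volume: request count times average response size."""
    return requests_per_month * avg_response_mb / 1024


def egress_cost(total_gb: float, free_gb: float, price_per_gb: float) -> float:
    """Bill only the traffic above the assumed free allowance."""
    return max(total_gb - free_gb, 0.0) * price_per_gb


if __name__ == "__main__":
    gb = monthly_egress_gb(requests_per_month=10_000_000, avg_response_mb=2.0)
    cost = egress_cost(gb, free_gb=100.0, price_per_gb=0.08)  # assumed rate
    print(f"~{gb:,.0f} GB out per month -> ~{cost:,.2f} USD in bandwidth alone")
```

Because average payload size is a multiplier, halving it through compression or caching halves the dominant term, which usually beats any achievable discount on the rate itself.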

3. Hardcoding Credentials Instead of Using Managed Identities and RBAC

Hardcoded credentials create both security risk and operational drag. Secrets leak into repositories, pipelines, laptops, and ticket comments. Rotation becomes painful, and incident response expands because nobody is sure where the credential was copied. Managed identities and RBAC remove that problem at the root.

Real-World Use Case
An App Service authenticates to Blob Storage with its managed identity and a role assignment, so there is no connection string to rotate when people or environments change.
Billing Reality
Managed identities do not add a meaningful direct cost. The bill stays with the target service and any Key Vault operations, while the security posture improves materially.
Portal and CLI Navigation
Check the workload Identity blade, then validate Access control (IAM) on the target resource. In CLI, inspect with az webapp identity show and az role assignment list.
# managed_identity_blob.py
# Access Blob Storage without a connection string by using DefaultAzureCredential.

import logging

from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient


def list_blobs(account_url: str, container_name: str) -> None:
    credential = DefaultAzureCredential()
    service_client = BlobServiceClient(account_url=account_url, credential=credential)
    container_client = service_client.get_container_client(container_name)

    try:
        for blob in container_client.list_blobs(name_starts_with="2026/"):
            print(blob.name)
    except AzureError as exc:
        logging.exception("Managed identity blob access failed: %s", exc)
        raise


if __name__ == "__main__":
    list_blobs(
        account_url="https://mystorageaccount.blob.core.windows.net",
        container_name="raw-ingest",
    )
!
Standard to enforce: new Azure workloads should default to managed identity plus RBAC. Hardcoded secrets should require an explicit exception and review, not the other way around.