Azure Cloud Engineer Handbook
An engineer-focused guide to Azure architecture, cost mechanics, portal navigation, and repeatable delivery with Azure Resource Manager primitives.
Table of Contents
This handbook is organized around the decisions Azure engineers make in production: what service to pick, how it is billed, where to inspect it in the portal, and how to automate it with Infrastructure as Code or SDKs.
| Module | Primary Decision | What You Must Know |
|---|---|---|
| 1. Compute & Serverless | Where application code should run | Billing model, scaling behavior, deployment path, and first-response troubleshooting steps |
| 2. Data & Storage | How data is stored and accessed | Capacity pricing, transaction charges, throughput planning, and lifecycle choices |
| 3. Networking & Security | How resources are isolated and trusted | Subnet boundaries, secret handling, RBAC, effective rules, and private connectivity |
| 4. Pricing & Cost | How to estimate and control spend | Calculator discipline, discount models, and proactive alerts |
| 5. IaC Standard | How infrastructure gets delivered | Version control, repeatability, reviewability, and environment-safe parameters |
| 6. Pitfalls | What repeatedly causes incidents or bill spikes | Zombie resources, outbound data charges, and credential management mistakes |
Module 1: Compute & Serverless
Azure compute choices are mostly a tradeoff between operational control and platform abstraction. App Service is the fastest path for standard web workloads, Azure Functions is the lowest-friction option for event-driven execution, and AKS exists for teams that truly need Kubernetes primitives, release independence, or container ecosystem tooling.
App Service
Azure App Service is a managed platform for hosting web applications, REST APIs, and lightweight background workloads without managing operating systems, patching cycles, or load balancer plumbing yourself. It is the default choice when a team wants a fast deployment path for customer-facing sites, internal line-of-business applications, or Python, .NET, Node.js, and Java APIs that do not require container orchestration.
Use App Service when the application is HTTP-centric, state lives outside the process, and the team values managed deployment slots, TLS termination, autoscale, and Microsoft-managed patching over raw infrastructure control.
| Tier | Best Use Case | How It Is Billed | Operational Impact |
|---|---|---|---|
| Free (F1) | Learning, demos, proof-of-concept apps | Shared compute with strict limits and no production-grade capacity guarantees | No SLA, constrained CPU minutes, not appropriate for real production traffic |
| Basic (B1-B3) | Small steady-state web apps and internal APIs | Fixed price per dedicated App Service Plan instance-hour | You reserve compute even when traffic is idle, so predictable but always-on cost |
| Premium (P1v3+ / P1v4+) | Production APIs, secure apps, autoscale workloads | Higher fixed price per dedicated instance-hour, plus related networking and outbound bandwidth costs | Supports autoscale, private networking scenarios, more memory and CPU, and stronger enterprise fit |
Portal and CLI Navigation
- Portal: search for the app name or App Services, then check `Overview`, `Diagnose and solve problems`, `Log stream`, and `App Service plan`.
- CLI: use `az webapp up` for a fast deploy path, then `az webapp log tail` and `az webapp config appsettings list` to confirm runtime and configuration.
// app-service.bicep
// Deploy a Linux App Service Plan and Web App with a system-assigned identity.
// This keeps credentials out of code and makes the app ready for Key Vault or Storage RBAC.
@description('Location for all resources.')
param location string = resourceGroup().location
@description('Globally unique web app name.')
param webAppName string
@description('App Service Plan name.')
param appServicePlanName string = 'asp-azure-handbook'
@allowed([
'F1'
'B1'
'P1v3'
])
@description('Choose Free for labs, Basic for fixed low-volume apps, Premium for production.')
param skuName string = 'B1'
var skuTier = skuName == 'F1' ? 'Free' : skuName == 'B1' ? 'Basic' : 'PremiumV3'
resource plan 'Microsoft.Web/serverfarms@2024-04-01' = {
name: appServicePlanName
location: location
kind: 'linux'
sku: {
name: skuName
tier: skuTier
capacity: 1
}
properties: {
reserved: true
}
}
resource site 'Microsoft.Web/sites@2024-04-01' = {
name: webAppName
location: location
kind: 'app,linux'
identity: {
type: 'SystemAssigned'
}
properties: {
serverFarmId: plan.id
httpsOnly: true
siteConfig: {
linuxFxVersion: 'PYTHON|3.12'
alwaysOn: skuName == 'F1' ? false : true
minTlsVersion: '1.2'
ftpsState: 'Disabled'
appSettings: [
{
name: 'WEBSITE_RUN_FROM_PACKAGE'
value: '1'
}
]
}
}
}
output hostname string = site.properties.defaultHostName
Azure Functions
Azure Functions is Azure's event-driven serverless runtime. Use it when code should run in response to HTTP requests, storage events, timers, queue messages, Service Bus triggers, or Event Grid notifications without reserving a full web tier all day.
| Plan | Billing Model | Best Fit | What To Watch |
|---|---|---|---|
| Consumption | Pay per execution count and execution duration based on memory used | Spiky event workloads, low or unpredictable traffic | Cold starts, execution limits, and network constraints matter |
| Premium | Pay for pre-warmed and active instances by allocated compute | Latency-sensitive functions, VNet integration, steady enterprise traffic | Higher baseline cost because capacity is reserved |
| Dedicated | Runs on an App Service Plan that you already pay for | Functions that should share reserved web compute with other apps | Cheap only if that plan already exists for a good reason |
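The Consumption model above (pay per execution plus execution duration weighted by memory) can be sketched as simple arithmetic. The rates and free-grant figures below are illustrative assumptions, not current Azure prices; always confirm against the official pricing page.

```python
# Rough Azure Functions Consumption-plan cost model.
# All rates and allowances here are placeholder assumptions for illustration.

ILLUSTRATIVE_PRICE_PER_MILLION_EXECUTIONS = 0.20  # assumed USD
ILLUSTRATIVE_PRICE_PER_GB_SECOND = 0.000016       # assumed USD
FREE_EXECUTIONS = 1_000_000                       # assumed monthly grant
FREE_GB_SECONDS = 400_000                         # assumed monthly grant

def estimate_consumption_cost(executions: int, avg_duration_s: float,
                              avg_memory_gb: float) -> float:
    """Estimate monthly Consumption-plan cost after the free grants."""
    gb_seconds = executions * avg_duration_s * avg_memory_gb
    billable_execs = max(0, executions - FREE_EXECUTIONS)
    billable_gb_s = max(0.0, gb_seconds - FREE_GB_SECONDS)
    exec_cost = billable_execs / 1_000_000 * ILLUSTRATIVE_PRICE_PER_MILLION_EXECUTIONS
    duration_cost = billable_gb_s * ILLUSTRATIVE_PRICE_PER_GB_SECOND
    return round(exec_cost + duration_cost, 2)

# 5M executions at 300 ms average and 512 MB memory:
print(estimate_consumption_cost(5_000_000, 0.3, 0.5))
```

The useful takeaway is structural: duration and memory multiply, so trimming average execution time often saves more than trimming execution count.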
Portal and CLI Navigation
- Portal: search the function app name or Function App, then inspect `Functions`, `Monitor`, `Deployment Center`, and linked `Application Insights`.
- CLI: use `az functionapp list` for discovery, `az functionapp show` for configuration, and deployment commands after confirming the plan and storage account.
# function_app.py
# Basic Azure Functions Python v2 HTTP trigger with input validation and safe error handling.
import json
import logging
import azure.functions as func
app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
@app.route(route="hello", methods=["GET"])
def hello(req: func.HttpRequest) -> func.HttpResponse:
try:
name = req.params.get("name")
if not name:
return func.HttpResponse(
json.dumps({"error": "Query string parameter 'name' is required."}),
status_code=400,
mimetype="application/json",
)
payload = {
"message": f"Hello, {name}. Your Azure Function is running.",
"executionModel": "Python v2",
}
return func.HttpResponse(
json.dumps(payload),
status_code=200,
mimetype="application/json",
)
except Exception as exc:
logging.exception("HTTP trigger failed: %s", exc)
return func.HttpResponse(
json.dumps({"error": "Unexpected server error."}),
status_code=500,
mimetype="application/json",
)
Azure Kubernetes Service (AKS)
AKS is Azure's managed Kubernetes offering for teams that need container orchestration, independent service release cycles, Kubernetes-native tooling, or deep control over ingress, networking, and workload composition. It is powerful, but it should be chosen because you need Kubernetes, not because it sounds “more cloud-native.”
At a high level, the control plane is Microsoft-managed and free on the Free tier, while the Standard tier (uptime SLA) bills per cluster; either way you still pay for worker nodes, node pool VM sizes, managed disks, load balancers, public IPs, monitoring, and outbound bandwidth. Check any SLA or premium add-ons against current pricing.
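Because node pools dominate AKS spend, a back-of-the-envelope node cost model is worth keeping handy. The VM and disk rates below are placeholder assumptions, not real Azure prices; look up your actual region and SKU.

```python
# Back-of-the-envelope AKS node pool cost model. Rates are assumptions only.

HOURS_PER_MONTH = 730  # common convention for monthly cloud estimates

def estimate_node_pool_cost(node_count: int, vm_hourly_rate: float,
                            disk_monthly_rate: float = 0.0) -> float:
    """Monthly cost of a node pool: VM instance-hours plus per-node OS disk."""
    compute = node_count * vm_hourly_rate * HOURS_PER_MONTH
    disks = node_count * disk_monthly_rate
    return round(compute + disks, 2)

# 3 nodes at an assumed $0.23/hour plus an assumed $10/month OS disk each:
print(estimate_node_pool_cost(3, 0.23, 10.0))
```

Run the same arithmetic before enabling the cluster autoscaler: the max node count, not the current one, bounds your worst-case bill.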
Portal and CLI Navigation
- Portal: search
Kubernetes servicesor the cluster name, then check node pressure, pending pods, and cluster insights before resizing anything. - CLI: start with
az aks list -o table, thenaz aks showandaz aks get-credentialsif you need to inspect cluster state withkubectl.
// aks.bicep
// Minimal AKS deployment. Use this as a baseline, then add private networking,
// Azure Policy, workload identity, and dedicated user pools for real production clusters.
@description('Location for the AKS cluster.')
param location string = resourceGroup().location
@description('AKS cluster name.')
param aksName string = 'aks-handbook-demo'
@description('DNS prefix used by the API server endpoint.')
param dnsPrefix string = 'aks-handbook-demo'
@minValue(1)
@description('System node count. Start small, then scale with measured demand.')
param nodeCount int = 3
@description('Worker node VM size. This is where most AKS cost starts.')
param nodeVmSize string = 'Standard_D4ds_v5'
resource aks 'Microsoft.ContainerService/managedClusters@2024-09-01' = {
name: aksName
location: location
sku: {
name: 'Base'
tier: 'Free'
}
identity: {
type: 'SystemAssigned'
}
properties: {
dnsPrefix: dnsPrefix
enableRBAC: true
agentPoolProfiles: [
{
name: 'system'
mode: 'System'
count: nodeCount
vmSize: nodeVmSize
osType: 'Linux'
osSKU: 'Ubuntu'
type: 'VirtualMachineScaleSets'
maxPods: 30
}
]
networkProfile: {
networkPlugin: 'azure'
loadBalancerSku: 'standard'
}
}
}
output clusterName string = aks.name
Module 2: Data & Storage
Azure data services differ most on access pattern, consistency expectations, operational overhead, and pricing mechanics. Blob Storage is the default landing zone for unstructured data, Cosmos DB fits globally distributed low-latency NoSQL workloads, and Azure SQL Database is the relational PaaS default when transactions, joins, and mature SQL tooling matter.
Azure Blob Storage
Azure Blob Storage is Azure's object store for unstructured data such as images, documents, backups, logs, parquet files, and application uploads. It is usually the simplest and cheapest place to put large binary assets, but the final bill depends on more than raw gigabytes.
| Tier | Best Fit | How It Is Billed | Common Mistake |
|---|---|---|---|
| Hot | Frequently read objects, active app content, recent analytics files | Higher storage capacity rate, lower access charges | Keeping infrequently used backups hot for months |
| Cool | Infrequently accessed content with occasional retrieval | Lower capacity rate, higher read and retrieval charges, minimum retention period applies | Moving transactional data here and then reading it constantly |
| Archive | Long-term retention, audit files, compliance snapshots | Lowest capacity rate, highest retrieval latency and rehydration cost, minimum retention period applies | Treating archive as if it were an online filesystem |
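The tier tradeoff in the table is a two-term equation: capacity rate versus access rate. A quick comparison, using illustrative placeholder rates only (real prices vary by region and redundancy), shows why read frequency decides the tier:

```python
# Compare monthly blob cost across tiers for a given footprint.
# capacity_rate and read_rate_per_gb are assumed illustrative USD rates.

def monthly_blob_cost(gb_stored: float, gb_read: float,
                      capacity_rate: float, read_rate_per_gb: float) -> float:
    return round(gb_stored * capacity_rate + gb_read * read_rate_per_gb, 2)

# 10 TB of backups with 300 GB of monthly reads, under assumed rates:
hot = monthly_blob_cost(10_240, 300, capacity_rate=0.018, read_rate_per_gb=0.0)
cool = monthly_blob_cost(10_240, 300, capacity_rate=0.010, read_rate_per_gb=0.01)
print(hot, cool)
```

Under these assumptions, cool wins because reads are rare; invert the read volume and hot wins. Model your own numbers before setting a lifecycle policy.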
Portal and CLI Navigation
- Portal: search the storage account name, then inspect `Containers`, `Access keys`, `Networking`, `Lifecycle management`, and `Metrics`.
- CLI: use `az storage account show`, `az storage container list --auth-mode login`, and `az storage blob list` to confirm the account, container, and object path before changing application code.
# blob_upload.py
# Upload a file to Azure Blob Storage using Entra ID and RBAC instead of account keys.
# Required role for the caller or managed identity: Storage Blob Data Contributor.
from pathlib import Path
import logging
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, ContentSettings
def upload_file_to_blob(account_url: str, container_name: str, source_path: str, blob_name: str) -> None:
credential = DefaultAzureCredential()
blob_service_client = BlobServiceClient(account_url=account_url, credential=credential)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
file_path = Path(source_path)
if not file_path.exists():
raise FileNotFoundError(f"Source file not found: {file_path}")
try:
with file_path.open("rb") as data:
blob_client.upload_blob(
data,
overwrite=True,
content_settings=ContentSettings(content_type="application/octet-stream"),
)
logging.info("Uploaded %s to %s/%s", file_path.name, container_name, blob_name)
except AzureError as exc:
logging.exception("Blob upload failed: %s", exc)
raise
if __name__ == "__main__":
upload_file_to_blob(
account_url="https://mystorageaccount.blob.core.windows.net",
container_name="raw-ingest",
source_path="./sample-data/orders.csv",
blob_name="2026/03/orders.csv",
)
Azure Cosmos DB
Azure Cosmos DB is Azure's globally distributed NoSQL database for low-latency document, key-value, graph, and related models. It is powerful when you genuinely need global distribution, elastic scale, or predictable low latency, but it is also one of the easiest Azure services to overspend on when partition design is poor.
| Throughput Model | How It Is Billed | Best Fit | Watchouts |
|---|---|---|---|
| Provisioned RU/s | Pay for reserved request units every hour whether you consume them or not | Steady traffic with predictable throughput needs | Idle but over-provisioned containers waste money continuously |
| Autoscale RU/s | Pay for autoscale capacity based on configured max RU/s | Workloads with daily or weekly spikes | Setting max RU/s far too high can still inflate spend materially |
| Serverless | Pay per request consumed plus storage, without reserved RU/s | Low-volume, sporadic, or dev and test workloads | Not the right fit for sustained heavy production load |
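Autoscale billing deserves a concrete sketch: each hour is billed at the highest RU/s the container actually scaled to, and autoscale never drops below 10% of the configured maximum. The following minimal model illustrates why an over-generous max RU/s still costs money at idle:

```python
# Sketch of Cosmos DB autoscale billing mechanics: each hour bills at the
# peak RU/s reached, with a floor of 10% of the configured maximum.

def autoscale_hourly_ru(observed_peak_ru: float, max_ru: float) -> float:
    floor = max_ru * 0.10
    return min(max(observed_peak_ru, floor), max_ru)

# With max 4000 RU/s, an idle hour still bills at the 400 RU/s floor:
print(autoscale_hourly_ru(0, 4000))      # 400.0
print(autoscale_hourly_ru(2500, 4000))   # 2500
print(autoscale_hourly_ru(9000, 4000))   # 4000
```

This is why "set max RU/s very high just in case" inflates spend: the idle floor scales with the maximum you configure.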
Portal and CLI Navigation
- Portal: search the account name, then inspect `Data Explorer`, `Keys`, `Replicate data globally`, `Metrics`, and `Scale & settings`.
- CLI: start with `az cosmosdb show`, `az cosmosdb sql database list`, and `az cosmosdb sql container throughput show` to verify throughput and container design.
// cosmosdb.bicep
// Provision a Cosmos DB SQL API account, database, and container with autoscale throughput.
// Pick the partition key carefully. It should spread writes evenly and match common query patterns.
@description('Deployment location for the Cosmos DB account.')
param location string = resourceGroup().location
@description('Globally unique Cosmos DB account name.')
param accountName string
@description('SQL database name.')
param databaseName string = 'appdb'
@description('Container name.')
param containerName string = 'orders'
@description('Partition key path. Example: /tenantId or /customerId.')
param partitionKeyPath string = '/tenantId'
@minValue(1000)
@description('Autoscale max RU/s. Start with measured demand, not guesswork.')
param maxAutoscaleRu int = 4000
resource cosmos 'Microsoft.DocumentDB/databaseAccounts@2023-04-15' = {
name: accountName
location: location
kind: 'GlobalDocumentDB'
properties: {
databaseAccountOfferType: 'Standard'
publicNetworkAccess: 'Enabled'
enableAutomaticFailover: false
locations: [
{
locationName: location
failoverPriority: 0
isZoneRedundant: false
}
]
consistencyPolicy: {
defaultConsistencyLevel: 'Session'
}
}
}
resource sqlDb 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases@2023-04-15' = {
name: '${cosmos.name}/${databaseName}'
properties: {
resource: {
id: databaseName
}
options: {}
}
}
resource container 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2023-04-15' = {
name: '${cosmos.name}/${databaseName}/${containerName}'
properties: {
resource: {
id: containerName
partitionKey: {
paths: [
partitionKeyPath
]
kind: 'Hash'
version: 2
}
indexingPolicy: {
indexingMode: 'consistent'
automatic: true
includedPaths: [
{
path: '/*'
}
]
excludedPaths: [
{
path: '/"_etag"/?'
}
]
}
}
options: {
autoscaleSettings: {
maxThroughput: maxAutoscaleRu
}
}
}
}
output cosmosEndpoint string = cosmos.properties.documentEndpoint
Azure SQL Database
Azure SQL Database is Azure's PaaS relational database for transactional systems that need SQL Server compatibility without managing Windows hosts, patching, backups, or cluster operations. It is usually the default for line-of-business apps, reporting stores, and APIs that rely on relational constraints and mature SQL tooling.
| Model | How It Is Billed | Best Fit | Decision Note |
|---|---|---|---|
| DTU | Bundled compute, memory, and IO in fixed tiers | Smaller legacy workloads or teams that want simplified sizing | Simple to buy, but less transparent about what resources you are getting |
| vCore | Pay for chosen compute generation, vCores, storage, backups, and extras such as zone redundancy | Modern production workloads needing clearer sizing and cost control | Usually preferred because it maps better to actual resource planning and discount options |
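The serverless vCore variant (used in the Bicep example later in this module) bills compute per vCore-second of actual use, never below the configured minimum capacity while the database is online, and nothing for compute while auto-paused. A rough sketch, with an assumed placeholder rate:

```python
# Sketch of Azure SQL serverless compute billing. The per-vCore-second rate
# is an illustrative assumption, not a real Azure price.

ILLUSTRATIVE_RATE_PER_VCORE_SECOND = 0.000145  # assumed USD

def serverless_compute_cost(seconds_online: int, avg_vcores_used: float,
                            min_vcores: float) -> float:
    billed_vcores = max(avg_vcores_used, min_vcores)
    return round(seconds_online * billed_vcores * ILLUSTRATIVE_RATE_PER_VCORE_SECOND, 2)

# A database busy 8 hours/day for 30 days at ~1.2 vCores, 0.5 vCore floor:
print(serverless_compute_cost(8 * 3600 * 30, 1.2, 0.5))
```

The structural point: serverless only saves money if the database genuinely pauses; a workload with constant trickle traffic never pauses and may cost more than provisioned compute.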
Portal and CLI Navigation
- Portal: search the logical server or database name, then check `Overview`, `Query Performance Insight`, `Connection strings`, and `Networking`.
- CLI: use `az sql server show`, `az sql db show`, and `az sql db list-usages` to confirm configuration and capacity usage.
// azure-sql.bicep
// Deploy a logical SQL server and a vCore-based Azure SQL Database.
// Credentials are parameters so they are never hardcoded in source control.
@description('Location for SQL resources.')
param location string = resourceGroup().location
@description('Globally unique logical SQL server name.')
param sqlServerName string
@description('Database name.')
param databaseName string = 'appdb'
@description('SQL administrator login name.')
param administratorLogin string
@secure()
@description('SQL administrator password supplied at deployment time.')
param administratorPassword string
resource sqlServer 'Microsoft.Sql/servers@2023-08-01-preview' = {
name: sqlServerName
location: location
properties: {
administratorLogin: administratorLogin
administratorLoginPassword: administratorPassword
minimalTlsVersion: '1.2'
publicNetworkAccess: 'Disabled'
}
}
resource database 'Microsoft.Sql/servers/databases@2023-08-01-preview' = {
name: '${sqlServer.name}/${databaseName}'
location: location
sku: {
name: 'GP_S_Gen5_2'
tier: 'GeneralPurpose'
capacity: 2
}
properties: {
collation: 'SQL_Latin1_General_CP1_CI_AS'
maxSizeBytes: 10737418240
autoPauseDelay: 60
minCapacity: json('0.5') // Bicep has no decimal literals; json() supplies the 0.5 vCore floor
backupStorageRedundancy: 'Local'
}
}
output fullyQualifiedServerName string = sqlServer.properties.fullyQualifiedDomainName
Module 3: Networking & Security
Azure security starts with network boundaries and identity boundaries working together. VNets and subnets create containment. NSGs express allowed traffic. Key Vault keeps secrets out of code. Network Watcher helps you prove whether the network is actually the problem before the incident drifts into guesswork.
Virtual Networks (VNet) & Subnets
A VNet is the core Azure network boundary for private IP addressing, segmentation, and controlled connectivity between resources. Subnets turn that boundary into meaningful isolation domains. Separating app, data, and management paths is critical because it limits blast radius, simplifies policy, and makes route and security intent explicit.
VNets and subnets themselves are generally not the expensive line items. The bill grows from attached components such as NAT Gateway, Azure Firewall, VPN Gateway, Private Endpoints, cross-region peering, and outbound bandwidth.
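Before deploying, it is worth validating subnet plans mechanically: every subnet must fit inside the VNet address space and none may overlap. The standard library handles this; the CIDR values below mirror the Bicep example that follows.

```python
# Subnet planning sketch using only the standard library: confirm proposed
# subnets fit inside the VNet address space and do not overlap each other.
import ipaddress

def validate_subnets(vnet_cidr: str, subnet_cidrs: list[str]) -> bool:
    vnet = ipaddress.ip_network(vnet_cidr)
    subnets = [ipaddress.ip_network(c) for c in subnet_cidrs]
    for subnet in subnets:
        if not subnet.subnet_of(vnet):
            return False  # subnet falls outside the VNet address space
    for i, a in enumerate(subnets):
        for b in subnets[i + 1:]:
            if a.overlaps(b):
                return False  # two subnets claim the same addresses
    return True

print(validate_subnets('10.20.0.0/16', ['10.20.1.0/24', '10.20.2.0/24']))    # True
print(validate_subnets('10.20.0.0/16', ['10.20.1.0/24', '10.20.1.128/25']))  # False
```

Running a check like this in CI catches address-plan mistakes before Azure Resource Manager rejects, or worse, silently accepts, a deployment.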
Portal and CLI Navigation
- Portal: search the VNet name, then open `Subnets`, `Peerings`, `DDoS protection`, and the linked NSG.
- CLI: use `az network vnet show`, `az network vnet subnet list`, and `az network nsg rule list` to verify what the network is actually enforcing.
// vnet.bicep
// Create a VNet with separate application and data subnets plus an NSG.
// Network isolation is cheap compared to retrofitting security after go-live.
@description('Location for networking resources.')
param location string = resourceGroup().location
@description('Virtual network name.')
param vnetName string = 'vnet-app-prod'
resource nsg 'Microsoft.Network/networkSecurityGroups@2024-03-01' = {
name: 'nsg-app-subnet'
location: location
properties: {
securityRules: [
{
name: 'AllowHttpsInbound'
properties: {
protocol: 'Tcp'
sourcePortRange: '*'
destinationPortRange: '443'
sourceAddressPrefix: 'Internet'
destinationAddressPrefix: '*'
access: 'Allow'
priority: 100
direction: 'Inbound'
}
}
]
}
}
resource vnet 'Microsoft.Network/virtualNetworks@2024-03-01' = {
name: vnetName
location: location
properties: {
addressSpace: {
addressPrefixes: [
'10.20.0.0/16'
]
}
subnets: [
{
name: 'app'
properties: {
addressPrefix: '10.20.1.0/24'
networkSecurityGroup: {
id: nsg.id
}
}
}
{
name: 'data'
properties: {
addressPrefix: '10.20.2.0/24'
}
}
]
}
}
output vnetId string = vnet.id
Azure Key Vault
Azure Key Vault stores secrets, certificates, and cryptographic keys so application teams do not hardcode or manually rotate sensitive values. It should be the default home for application secrets unless there is a stronger service-specific reason to use another secure store.
| Capability | Purpose | How It Is Billed | Operational Note |
|---|---|---|---|
| Secrets | Passwords, connection strings, tokens, API keys | Per operation and storage at the vault tier | Usually the default for app configuration that must stay secret |
| Keys | Encryption keys and signing material | Per key and per cryptographic operation, higher for HSM-backed options | Use Premium when hardware-backed keys are required |
| Certificates | TLS certificate lifecycle management | Vault operations plus any external CA cost | Useful when teams need central renewal and access control |
Portal and CLI Navigation
- Portal: search the vault name, then verify `Secrets`, `Access control (IAM)`, `Role assignments`, and `Networking`.
- CLI: use `az keyvault show`, `az keyvault secret list`, and `az role assignment list --scope` to verify whether the caller should be able to read the secret at all.
# key_vault_read.py
# Fetch a Key Vault secret securely using DefaultAzureCredential.
# This works locally with az login and in Azure with a managed identity.
import logging
from azure.core.exceptions import AzureError, ResourceNotFoundError
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
def get_secret(vault_url: str, secret_name: str) -> str:
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)
try:
secret = client.get_secret(secret_name)
return secret.value
except ResourceNotFoundError as exc:
logging.exception("Secret not found: %s", exc)
raise
except AzureError as exc:
logging.exception("Failed to read secret from Key Vault: %s", exc)
raise
if __name__ == "__main__":
value = get_secret(
vault_url="https://my-shared-vault.vault.azure.net/",
secret_name="sql-admin-password",
)
print(f"Retrieved secret with length: {len(value)}")
Navigation and Troubleshooting: Network Watcher & Effective Security Rules
Network Watcher is the first-response toolbox for proving where traffic is being dropped. Use its topology view, Connection troubleshoot, IP flow verify, and effective security rules rather than guessing which NSG, route, or peering change caused the failure.
Portal and CLI Navigation
- Portal: go to `Network Watcher`, then use `IP flow verify`, `Topology`, and `Connection troubleshoot`. For a VM, open the NIC and inspect `Effective security rules`.
- CLI: use `az network watcher test-connectivity`, `az network watcher ip-flow-verify`, and `az network nic list-effective-nsg` for scriptable checks.
# effective_nsg.py
# Retrieve effective NSG rules for a VM network interface using the Azure Python SDK.
import logging
import os
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
def print_effective_nsg(resource_group: str, nic_name: str) -> None:
subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
credential = DefaultAzureCredential()
client = NetworkManagementClient(credential, subscription_id)
try:
poller = client.network_interfaces.begin_get_effective_network_security_groups(
resource_group_name=resource_group,
network_interface_name=nic_name,
)
result = poller.result()
for association in result.value:
logging.info("NSG association: %s", association.network_security_group.id)
for rule in association.effective_security_rules:
print(
f"{rule.name}: {rule.direction} {rule.access} "
f"ports={rule.destination_port_range} source={rule.source_address_prefix}"
)
except AzureError as exc:
logging.exception("Failed to retrieve effective NSG rules: %s", exc)
raise
if __name__ == "__main__":
print_effective_nsg(resource_group="rg-network-prod", nic_name="vm-app-01-nic")
Module 4: Pricing Calculation & Cost Management
Azure cost control is a design discipline, not a finance afterthought. Estimates should be built from real SKUs, realistic traffic, storage growth, and data transfer assumptions. Governance then keeps reality from drifting far away from the estimate.
The Pricing Calculator
The official Azure Pricing Calculator is the right starting point for workload estimation, but only if you treat it as an engineering exercise instead of a quick sales estimate. Calculator output is usually list-price guidance. It does not automatically know your reserved capacity, enterprise agreement discounts, or workload inefficiencies.
| Step | What To Do | Why It Matters |
|---|---|---|
| 1. Define scope | List every service in the architecture, not just the headline compute tier | Many Azure bills are driven by supporting services such as storage, monitoring, and data transfer |
| 2. Pin region and SKU | Select the exact Azure region, tier, redundancy mode, and expected hours | Region and SKU change prices materially, especially for compute and storage redundancy |
| 3. Model usage | Estimate requests, GB stored, GB transferred, RU/s, or node-hours based on real load expectations | Under-modeling usage makes the calculator look cheap and the production bill look wrong |
| 4. Add operations costs | Include transactions, backup retention, retrieval fees, and monitoring ingestion | These line items are often missing from first-pass estimates |
| 5. Validate against reality | Compare the estimate with a pilot environment or a prior month's spend | This exposes bad assumptions before leadership starts relying on the number |
| 6. Revisit monthly | Update estimates as SKUs, traffic, and architecture change | Pricing drift is normal. Unreviewed estimates become fiction quickly |
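The workflow above reduces to a simple discipline: enumerate every line item, including the supporting services, then sum and sanity-check. The service names and dollar amounts below are assumptions for illustration only:

```python
# Minimal estimate-aggregation sketch. Every amount here is an assumed
# placeholder; the point is the discipline of listing every line item.

def estimate_monthly_total(line_items: dict[str, float]) -> float:
    return round(sum(line_items.values()), 2)

estimate = {
    'app_service_p1v3': 240.0,   # assumed instance-hours
    'sql_database': 150.0,       # assumed vCores + storage + backups
    'blob_storage': 45.0,        # capacity plus transactions
    'log_ingestion': 60.0,       # monitoring is a real line item
    'egress_bandwidth': 35.0,    # outbound data is not free
}
print(estimate_monthly_total(estimate))  # 530.0
```

Notice that the non-compute rows are a meaningful share of the total; first-pass estimates that only price the headline compute tier miss exactly these.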
Cost Optimization: Reserved Instances, PAYG, and Azure Hybrid Benefit
Pay-As-You-Go (PAYG) is the baseline model: you pay standard rates with maximum flexibility and no long-term commitment. Reserved Instances or reserved capacity trade commitment for lower unit pricing on eligible services. Azure Hybrid Benefit lets you apply qualifying existing Windows Server or SQL Server licenses to reduce Azure software charges.
| Option | Best Fit | How It Changes The Bill | Risk |
|---|---|---|---|
| PAYG | New workloads, uncertain demand, short-lived environments | Highest flexibility, usually highest unit cost | Teams forget to revisit it after usage stabilizes |
| Reserved Instances / Reserved Capacity | Steady-state production with predictable baseline usage | Lower hourly cost in exchange for 1-year or 3-year commitment | Wrong sizing or wrong region commitment limits savings |
| Azure Hybrid Benefit | Organizations with eligible Microsoft licenses | Reduces software portion of Azure compute or database cost | License eligibility and assignment must be governed carefully |
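The commitment decision in the table is a break-even calculation: a reservation bills for every committed hour, while PAYG bills only for hours actually used. A simplified sketch (the discount and rate figures are assumptions for illustration):

```python
# Simplified PAYG vs reservation break-even. The reservation bills 100% of
# committed hours; PAYG bills only the hours you actually run.

def reserved_saves_money(payg_hourly: float, reserved_discount: float,
                         expected_utilization: float) -> bool:
    reserved_hourly_effective = payg_hourly * (1 - reserved_discount)
    payg_effective = payg_hourly * expected_utilization
    return reserved_hourly_effective < payg_effective

# Under an assumed 40% discount, commitment only wins at high utilization:
print(reserved_saves_money(0.50, 0.40, 0.90))  # True
print(reserved_saves_money(0.50, 0.40, 0.50))  # False
```

This is why reservations fit steady-state baselines and PAYG fits spiky or uncertain workloads: below the break-even utilization, the "discount" costs more than list price.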
- CLI: use `az consumption usage list` and any service-specific SKU inspection commands to validate that usage is steady enough to justify commitment.
Budgets and Cost Alerts
Budgets are the simplest protection against bill shock. A budget does not stop spend by itself, but it gives teams a threshold, forecast visibility, and a forcing function to react before the invoice arrives.
Portal and CLI Navigation
- Portal: go to `Cost Management + Billing`, then open `Budgets`, `Cost analysis`, and `Exports` for recurring reporting.
- CLI: use `az consumption budget list` and `az consumption usage list` to confirm whether the right scope is covered and what is actually driving the spend.
// budget.bicep
// Subscription-scope monthly budget with actual and forecast alerts.
targetScope = 'subscription'
@description('Budget resource name.')
param budgetName string = 'engineering-monthly-budget'
@description('Budget amount in USD or the billing currency of the subscription.')
param budgetAmount int = 1500
@description('Budget notification recipients.')
param contactEmails array = [
'cloud-ops@example.com'
]
resource budget 'Microsoft.Consumption/budgets@2023-05-01' = {
name: budgetName
properties: {
category: 'Cost'
amount: budgetAmount
timeGrain: 'Monthly'
timePeriod: {
startDate: '2026-01-01T00:00:00Z'
endDate: '2027-01-01T00:00:00Z'
}
notifications: {
actual80: {
enabled: true
operator: 'GreaterThan'
threshold: 80
thresholdType: 'Actual'
contactEmails: contactEmails
}
forecast100: {
enabled: true
operator: 'GreaterThan'
threshold: 100
thresholdType: 'Forecasted'
contactEmails: contactEmails
}
}
}
}
Module 5: Infrastructure as Code (IaC) Standard
Infrastructure as Code is the production standard because Azure Resource Manager changes should be reviewable, repeatable, parameterized, and deployable across environments without human clicking. The portal is excellent for inspection and emergency response. It is not a durable production change-management system.
Why IaC
Clicking through the portal is an anti-pattern for production because it creates undocumented drift, makes peer review impossible, and turns every rebuild into archaeology. Bicep fixes that by giving Azure engineers a native ARM abstraction with reusable modules, parameter files, and deployment history.
Run `az deployment group what-if` or `az deployment sub what-if` before applying changes.
Bicep Example: Storage Account + App Service with Managed Identity
The following Bicep file is a clean baseline for a typical web workload. It provisions a standard Storage Account and a Linux App Service with a system-assigned managed identity. The storage account is locked down with HTTPS-only and public blob access disabled. The App Service is ready to authenticate to other Azure services using its identity instead of embedded credentials.
Confirm what was actually deployed with `az deployment group show`.
// main.bicep
// Deploy a standard storage account and Linux App Service with a system-assigned managed identity.
@description('Deployment location for all resources.')
param location string = resourceGroup().location
@description('Globally unique storage account name using only lowercase letters and numbers.')
param storageAccountName string
@description('Globally unique web app name.')
param webAppName string
@description('App Service Plan name.')
param appServicePlanName string = 'asp-standard-linux'
@allowed([
'B1'
'P1v3'
])
@description('Choose B1 for smaller steady-state workloads and P1v3 for production autoscale scenarios.')
param appServiceSku string = 'B1'
var appServiceTier = appServiceSku == 'B1' ? 'Basic' : 'PremiumV3'
resource storage 'Microsoft.Storage/storageAccounts@2023-05-01' = {
name: storageAccountName
location: location
sku: {
name: 'Standard_LRS'
}
kind: 'StorageV2'
properties: {
accessTier: 'Hot'
allowBlobPublicAccess: false
allowSharedKeyAccess: false
minimumTlsVersion: 'TLS1_2'
publicNetworkAccess: 'Enabled'
supportsHttpsTrafficOnly: true
}
}
resource plan 'Microsoft.Web/serverfarms@2024-04-01' = {
name: appServicePlanName
location: location
kind: 'linux'
sku: {
name: appServiceSku
tier: appServiceTier
capacity: 1
}
properties: {
reserved: true
}
}
resource site 'Microsoft.Web/sites@2024-04-01' = {
name: webAppName
location: location
kind: 'app,linux'
identity: {
type: 'SystemAssigned'
}
properties: {
serverFarmId: plan.id
httpsOnly: true
siteConfig: {
linuxFxVersion: 'PYTHON|3.12'
alwaysOn: true
minTlsVersion: '1.2'
ftpsState: 'Disabled'
appSettings: [
{
name: 'AZURE_STORAGE_ACCOUNT_NAME'
value: storage.name
}
{
name: 'WEBSITE_RUN_FROM_PACKAGE'
value: '1'
}
]
}
}
}
output storageBlobEndpoint string = storage.properties.primaryEndpoints.blob
output webAppHostname string = site.properties.defaultHostName
output webAppPrincipalId string = site.identity.principalId
Module 6: Common Pitfalls & Anti-Patterns
Most Azure overspend and avoidable security exposure comes from a short list of repeated mistakes, not obscure platform behavior. The recurring pattern is the same: no inventory discipline, weak ownership, and changes made without cost or identity review.
1. Leaving Idle Resources Running
Idle resources are one of the fastest ways to waste money in Azure. Unattached managed disks, unused public IPs, forgotten App Service Plans, dormant AKS node pools, and test databases continue billing even when no application is using them.
Spot-check quickly with `az disk list` and `az network public-ip list`.
# idle_resource_audit.py
# Identify unattached managed disks and unassociated public IP addresses.
import logging
import os
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient
def find_idle_resources() -> None:
subscription_id = os.environ["AZURE_SUBSCRIPTION_ID"]
credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, subscription_id)
network_client = NetworkManagementClient(credential, subscription_id)
try:
print("Unattached managed disks:")
for disk in compute_client.disks.list():
if not disk.managed_by:
print(f"- {disk.name} ({disk.location}) sku={disk.sku.name}")
print("\nUnassociated public IP addresses:")
for public_ip in network_client.public_ip_addresses.list_all():
if public_ip.ip_configuration is None:
print(f"- {public_ip.name} ({public_ip.location}) sku={public_ip.sku.name}")
except AzureError as exc:
logging.exception("Idle resource discovery failed: %s", exc)
raise
if __name__ == "__main__":
find_idle_resources()
2. Misunderstanding Egress Bandwidth Costs
Data entering Azure is usually free. Data leaving Azure is not. That applies to internet egress, many cross-region transfers, some peering patterns, CDN origin traffic, backups restored across boundaries, and storage-heavy workloads that serve large files directly to users.
| Scenario | Why Teams Miss It | What Actually Costs Money |
|---|---|---|
| Serving files from Blob Storage | Storage capacity looks cheap in the estimate | Outbound bandwidth and high read volume can become the real bill |
| Cross-region architectures | Teams focus on resiliency, not transfer patterns | Replication and inter-region data movement can materially raise run cost |
| API-heavy applications | Developers count requests but not response payload size | Large response bodies multiply egress charges as traffic grows |
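For the API-heavy case in the table, the egress math is response size times request volume. The per-GB rate and free allowance below are placeholder assumptions, not current Azure prices:

```python
# Egress sketch for an API workload: outbound bytes dominate as traffic grows.
# Rate and free allowance are illustrative assumptions only.

ILLUSTRATIVE_EGRESS_RATE_PER_GB = 0.087  # assumed USD
FREE_EGRESS_GB = 100                     # assumed monthly allowance

def monthly_egress_cost(requests: int, avg_response_kb: float) -> float:
    egress_gb = requests * avg_response_kb / 1_048_576  # KB -> GB
    billable = max(0.0, egress_gb - FREE_EGRESS_GB)
    return round(billable * ILLUSTRATIVE_EGRESS_RATE_PER_GB, 2)

# 50M requests at 200 KB each is roughly 9.5 TB outbound per month:
print(monthly_egress_cost(50_000_000, 200))
```

Halving the average response payload, through pagination, field selection, or compression, halves this line item, which is usually cheaper than re-architecting anything else.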
3. Hardcoding Credentials Instead of Using Managed Identities and RBAC
Hardcoded credentials create both security risk and operational drag. Secrets leak into repositories, pipelines, laptops, and ticket comments. Rotation becomes painful, and incident response expands because nobody is sure where the credential was copied. Managed identities and RBAC remove that problem at the root.
Verify the identity and its grants with `az webapp identity show` and `az role assignment list`.
# managed_identity_blob.py
# Access Blob Storage without a connection string by using DefaultAzureCredential.
import logging
from azure.core.exceptions import AzureError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
def list_blobs(account_url: str, container_name: str) -> None:
credential = DefaultAzureCredential()
service_client = BlobServiceClient(account_url=account_url, credential=credential)
container_client = service_client.get_container_client(container_name)
try:
for blob in container_client.list_blobs(name_starts_with="2026/"):
print(blob.name)
except AzureError as exc:
logging.exception("Managed identity blob access failed: %s", exc)
raise
if __name__ == "__main__":
list_blobs(
account_url="https://mystorageaccount.blob.core.windows.net",
container_name="raw-ingest",
)