Production-Ready GCP Reference

Google Cloud (GCP) Engineer Handbook

A practical, data-focused guide to Google Cloud architecture — compute, BigQuery, networking, cost management, and security patterns with Terraform and Google Cloud Python Client Libraries.

Google Cloud Platform · Terraform · Python SDKs · Data & Analytics Focus · March 2026
Data-first platform: GCP is the industry leader in data and analytics. BigQuery, Cloud Storage, and Pub/Sub form the backbone of most GCP architectures. This handbook reflects that bias — the data sections are intentionally the most detailed. Every service section includes primary cost drivers.

Table of Contents

Module 1: Compute & Containers
Compute Engine VM families, sustained use discounts, Cloud Functions serverless triggers, and GKE Autopilot vs Standard.
Module 2: Data, Storage & Analytics
Cloud Storage classes, BigQuery architecture and pricing, Cloud SQL vs Firestore for relational and document workloads.
Module 3: Networking & IAM
Global VPC architecture, Shared VPC, IAM resource hierarchy, Service Accounts, and Identity-Aware Proxy (IAP).
Module 4: Pricing & Billing Management
GCP Pricing Calculator, Billing Export to BigQuery for SQL-based cost analysis, and managing API quotas.
Module 5: Common Pitfalls
Default VPC security risks, runaway BigQuery costs, and the danger of project proliferation without billing strategy.

Module 1: Compute & Containers

GCP offers compute across the full abstraction spectrum — from Compute Engine VMs you fully control, to Cloud Functions where you write a single handler and GCP manages everything else. GKE sits in the middle as the premier managed Kubernetes offering in any cloud.

Compute Engine

Compute Engine provides virtual machines running on Google's infrastructure. You select a Machine Family (general-purpose, compute-optimized, memory-optimized, accelerator-optimized), choose an OS image, and pick a zone. Unlike AWS, GCP automatically applies Sustained Use Discounts — no upfront commitment needed.

Machine Families

| Family | Series | Use Case | Example |
|---|---|---|---|
| General Purpose | E2, N2, N2D, T2D, C3 | Web servers, dev/test, small databases, microservices | e2-medium (2 vCPU / 4 GB) |
| Compute-Optimized | C2, C2D, H3 | Batch processing, gaming servers, HPC, CI/CD | c2-standard-8 (8 vCPU / 32 GB) |
| Memory-Optimized | M1, M2, M3 | SAP HANA, large in-memory databases, real-time analytics | m2-ultramem-208 (208 vCPU / 5.8 TB) |
| Accelerator-Optimized | A2, A3, G2 | ML training/inference, video transcoding, GPU workloads | a2-highgpu-1g (12 vCPU / 85 GB + A100) |

Pricing Models

On-Demand
Pay per second (1-minute minimum). No commitment. Full flexibility for unpredictable workloads. Sustained Use Discounts accrue automatically within each billing month once usage passes 25% of the month.
Committed Use Discounts (CUDs)
1 or 3-year commitment for specific vCPU/memory amounts. Up to 57% savings (3-year) vs on-demand. Applied at the billing account level, not per-VM.
Spot VMs
Up to 91% savings. GCP can reclaim capacity with 30 seconds' notice. Best for fault-tolerant batch, rendering, and data processing. Unlike the legacy Preemptible VMs they replace (which were limited to a 24-hour lifetime), Spot VMs have no maximum runtime.
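As a back-of-the-envelope comparison, the sketch below shows what the "up to 91%" figure means over a month. The on-demand hourly rate is an illustrative assumption, not an official price — check the GCP pricing page for your region.

```python
# spot_vs_ondemand.py — illustrative monthly cost comparison.
# The e2-medium rate below is an assumed placeholder value.

HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost for a VM billed at a flat hourly rate."""
    return round(hourly_rate * hours, 2)

on_demand_e2_medium = 0.0335   # assumed $/hour for illustration
spot_discount = 0.91           # "up to 91%" from the section above

spot_rate = on_demand_e2_medium * (1 - spot_discount)

print(f"On-demand:        ${monthly_cost(on_demand_e2_medium):.2f}/month")
print(f"Spot (best case): ${monthly_cost(spot_rate):.2f}/month")
```

The gap is large enough that for reclaim-tolerant batch work, engineering around preemption (checkpointing, retry queues) usually pays for itself quickly.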

Sustained Use Discounts (SUD)

GCP automatically applies discounts when a VM runs for more than 25% of a month. No action required — no reservations, no upfront payment. The discount deepens in usage tiers (N1-series rates shown):

| Monthly Usage | Incremental Billing Rate | You Pay |
|---|---|---|
| 0–25% of month | 100% | Full on-demand rate |
| 25–50% of month | 80% | 80% of on-demand rate on incremental usage |
| 50–75% of month | 60% | 60% of on-demand rate on incremental usage |
| 75–100% of month | 40% | 40% of on-demand rate on incremental usage |

A VM that runs the full month therefore pays a blended ~70% of the on-demand price — a net ~30% discount. Note the schedule varies by series: N2/N2D use a shallower schedule that nets out around 20%, and E2 instances don't receive SUD at all (their base price is already lower).
💡
SUD advantage over AWS: AWS requires you to purchase Reserved Instances upfront. GCP applies SUD automatically. For workloads that run most of the month but you're unsure about committing, GCP's automatic discounts are a significant advantage.
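To see how the tiers combine, here is a small arithmetic sketch of the blended price for a VM that runs the whole month, using the N1 incremental rates from the table above:

```python
# sud_blend.py — net effective price of a full-month VM under
# Sustained Use Discounts (N1-series incremental schedule).

# (fraction of the month in this band, billed fraction of on-demand)
N1_SCHEDULE = [
    (0.25, 1.00),  # 0–25% of month: full price
    (0.25, 0.80),  # 25–50%: 80% of on-demand
    (0.25, 0.60),  # 50–75%: 60% of on-demand
    (0.25, 0.40),  # 75–100%: 40% of on-demand
]

def blended_rate(schedule) -> float:
    """Fraction of the on-demand price paid across the whole month."""
    return sum(band * rate for band, rate in schedule)

net = blended_rate(N1_SCHEDULE)
print(f"Full-month blended rate: {net:.0%} of on-demand "
      f"(a {1 - net:.0%} net discount)")
```

The blended rate works out to 70% of on-demand, which is where the commonly quoted "30% sustained use discount" figure comes from.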
💰
Cost drivers: Machine type × hours running + persistent disk storage (GB/month) + network egress + static IP addresses ($0.01/hour when not attached to a running VM). A forgotten n2-standard-8 costs ~$280/month.

Console Navigation

GCP Console → Compute Engine → VM Instances → Create Instance → Select Machine Family → Configure Boot Disk → Add Firewall Rules

Terraform — Launch a Compute Engine VM

# terraform/compute.tf — Production-ready Compute Engine instance

provider "google" {
  project = "my-gcp-project-id"
  region  = "us-central1"
}

resource "google_compute_instance" "web_server" {
  name         = "web-server-prod"
  machine_type = "e2-medium"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
      size  = 30  # GB
      type  = "pd-balanced"
    }
  }

  network_interface {
    network    = "default"
    subnetwork = "default"

    # Omit access_config block for private-only VM (no external IP)
    access_config {
      # Ephemeral external IP — use for testing only
    }
  }

  # Use a Service Account instead of user credentials
  service_account {
    email  = google_service_account.vm_sa.email
    scopes = ["cloud-platform"]
  }

  # Enable Shielded VM features for security
  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }

  metadata = {
    # Block project-wide SSH keys — use OS Login instead
    block-project-ssh-keys = "true"
  }

  labels = {
    environment = "production"
    managed_by  = "terraform"
  }
}

resource "google_service_account" "vm_sa" {
  account_id   = "web-server-sa"
  display_name = "Web Server Service Account"
}
💡
Always use Shielded VMs. Setting enable_secure_boot = true prevents rootkits and bootkits. Combined with block-project-ssh-keys and OS Login, you get a hardened VM baseline.

Cloud Functions

Cloud Functions is Google's serverless compute platform for event-driven code. You write a function, attach it to a trigger (HTTP, Cloud Storage, Pub/Sub, Firestore, Cloud Scheduler), and GCP handles provisioning, scaling, and patching. Cloud Functions (2nd gen) is built on Cloud Run, giving you longer timeouts (up to 60 minutes) and concurrency support.

Gen 1 vs Gen 2

| Feature | Gen 1 | Gen 2 (Recommended) |
|---|---|---|
| Max timeout | 9 minutes | 60 minutes (HTTP) / 9 min (event) |
| Concurrency | 1 request per instance | Up to 1,000 concurrent requests per instance |
| Min instances | Supported | Supported — eliminates cold starts |
| Traffic splitting | Not available | Supported via Cloud Run revisions |
| Built on | Custom runtime | Cloud Run + Eventarc |
💰
Cost drivers: Invocations ($0.40 per 1M) + compute time (vCPU-seconds + GB-seconds) + network egress. Free tier: 2M invocations/month, 400,000 GB-seconds, 200,000 GHz-seconds. Min instances add a flat idle cost even when not handling requests.
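The drivers above can be turned into a rough monthly estimate. The invocation price and free-tier figures come from the callout above; the per-GB-second compute rate is an assumed illustrative value (Gen 2 billing follows Cloud Run pricing, so verify current rates before relying on this):

```python
# cf_cost_estimate.py — rough Cloud Functions monthly cost sketch.
# price_per_gb_second is an assumed illustrative rate, not an official price.

def cf_monthly_cost(invocations: int, avg_ms: float, memory_gb: float,
                    price_per_million: float = 0.40,        # from the callout
                    price_per_gb_second: float = 0.0000025, # assumed
                    free_invocations: int = 2_000_000,
                    free_gb_seconds: int = 400_000) -> float:
    """Estimated monthly cost, ignoring CPU (GHz-seconds) and egress."""
    gb_seconds = invocations * (avg_ms / 1000) * memory_gb
    invoke_cost = max(invocations - free_invocations, 0) / 1e6 * price_per_million
    compute_cost = max(gb_seconds - free_gb_seconds, 0) * price_per_gb_second
    return round(invoke_cost + compute_cost, 2)

# 10M invocations/month, 200 ms average duration, 256 MiB memory
print(f"Estimated: ${cf_monthly_cost(10_000_000, 200, 0.25)}/month")
```

Note how the free tier absorbs small workloads entirely — a function under 2M invocations and 400K GB-seconds a month costs nothing in this model.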

Python — Cloud Function Triggered by GCS Upload

# main.py — Cloud Function (Gen 2) triggered by a Cloud Storage object upload
import functions_framework
from google.cloud import storage

@functions_framework.cloud_event
def process_gcs_upload(cloud_event):
    """Triggered when a new object is created in a GCS bucket.

    The event payload contains bucket name, object name, and metadata.
    """
    data = cloud_event.data

    bucket_name = data["bucket"]
    file_name = data["name"]
    content_type = data.get("contentType", "unknown")
    size_bytes = data.get("size", 0)

    print(f"New file uploaded: gs://{bucket_name}/{file_name}")
    print(f"Content type: {content_type}, Size: {size_bytes} bytes")

    # Example: Read the file and process it
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(file_name)

    # Only process CSV files
    if file_name.endswith(".csv"):
        content = blob.download_as_text()
        line_count = len(content.strip().split("\n"))
        print(f"CSV has {line_count} lines — sending to BigQuery...")
        # Insert into BigQuery, call another service, etc.
    else:
        print(f"Skipping non-CSV file: {file_name}")

Deploy with gcloud CLI

# Deploy the Gen 2 Cloud Function with a GCS trigger
gcloud functions deploy process-gcs-upload \
  --gen2 \
  --runtime python312 \
  --region us-central1 \
  --source . \
  --entry-point process_gcs_upload \
  --trigger-event-filters="type=google.cloud.storage.object.v1.finalized" \
  --trigger-event-filters="bucket=my-data-bucket" \
  --memory 256Mi \
  --timeout 120s \
  --service-account my-cf-sa@my-gcp-project-id.iam.gserviceaccount.com

Terraform — Cloud Function (Gen 2)

# terraform/cloud_function.tf — Gen 2 Cloud Function with GCS trigger

resource "google_storage_bucket" "source_code" {
  name     = "cf-source-${var.project_id}"
  location = "US"
}

resource "google_storage_bucket_object" "function_zip" {
  name   = "function-source.zip"
  bucket = google_storage_bucket.source_code.name
  source = "${path.module}/function-source.zip"
}

resource "google_cloudfunctions2_function" "processor" {
  name     = "process-gcs-upload"
  location = "us-central1"

  build_config {
    runtime     = "python312"
    entry_point = "process_gcs_upload"
    source {
      storage_source {
        bucket = google_storage_bucket.source_code.name
        object = google_storage_bucket_object.function_zip.name
      }
    }
  }

  service_config {
    max_instance_count = 10
    min_instance_count = 0
    available_memory   = "256Mi"
    timeout_seconds    = 120
    service_account_email = google_service_account.cf_sa.email
  }

  event_trigger {
    trigger_region = "us-central1"
    event_type     = "google.cloud.storage.object.v1.finalized"
    event_filters {
      attribute = "bucket"
      value     = google_storage_bucket.data_bucket.name  # the watched bucket, defined elsewhere
    }
  }
}

resource "google_service_account" "cf_sa" {
  account_id   = "cloud-function-sa"
  display_name = "Cloud Function Service Account"
}

Google Kubernetes Engine (GKE)

GKE is widely considered the best managed Kubernetes service in any cloud. Kubernetes originated at Google (it descends from the internal Borg system), and GKE reflects that heritage — it supports the latest K8s versions first, has the tightest integration with GCP services, and offers an Autopilot mode that eliminates node management entirely.

Autopilot vs Standard Mode

GKE Autopilot (Recommended)
Google manages the nodes. You only define Pods — GKE provisions the right node types, sizes, and scaling automatically. You pay per Pod resource request (vCPU, memory, ephemeral storage). No node pools to manage, no OS patching, no capacity planning. Best for teams that want Kubernetes without the infrastructure overhead.
GKE Standard
You manage the nodes. Full control over node pools, machine types, OS images, and scaling policies. You pay for the entire node regardless of pod utilization. Offers DaemonSets, privileged containers, and access to the node OS. Best for workloads that need custom kernel settings, GPU scheduling, or specific machine types.
Why GKE Leads the Market Industry Context

- Fastest K8s upgrades: GKE supports new Kubernetes versions weeks before EKS/AKS.
- Release channels: Rapid, Regular, and Stable channels with automatic upgrades.
- GKE Enterprise: Multi-cluster management, service mesh (Anthos Service Mesh), and fleet-level policy enforcement.
- Binary Authorization: Only deploy signed container images — built-in supply chain security.

💰
Cost drivers (Autopilot): Pod vCPU-seconds + Pod memory-seconds + ephemeral storage. Cost drivers (Standard): Node VM cost (same as Compute Engine) + $0.10/hour/cluster management fee. The management fee is waived for one zonal cluster per billing account. Autopilot is often cheaper because you don't pay for unused node capacity.
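Because Autopilot bills on Pod resource requests rather than nodes, you can estimate a Pod's cost directly. The per-vCPU-hour and per-GB-hour rates below are assumed illustrative values (check the GKE pricing page for your region):

```python
# autopilot_pod_cost.py — rough monthly cost of an Autopilot Pod.
# vcpu_rate and mem_rate are assumed illustrative rates.

def autopilot_pod_cost(vcpu: float, memory_gb: float, hours: int = 730,
                       vcpu_rate: float = 0.0445,   # assumed $/vCPU-hour
                       mem_rate: float = 0.0049) -> float:  # assumed $/GB-hour
    """Monthly cost driven purely by the Pod's resource requests."""
    return round((vcpu * vcpu_rate + memory_gb * mem_rate) * hours, 2)

# A small service requesting 0.5 vCPU and 1 GB, running all month
print(f"0.5 vCPU / 1 GB Pod: ${autopilot_pod_cost(0.5, 1)}/month")
```

The key design consequence: over-stated resource requests cost real money on Autopilot, so right-sizing requests matters more than on Standard, where utilization within a node is free.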

Console Navigation

GCP Console → Kubernetes Engine → Clusters → Create Cluster → Choose Autopilot or Standard → Select Region & Release Channel

Terraform — GKE Autopilot Cluster

# terraform/gke.tf — GKE Autopilot cluster

resource "google_container_cluster" "autopilot" {
  name     = "prod-autopilot-cluster"
  location = "us-central1"

  # Enable Autopilot mode
  enable_autopilot = true

  # Use a release channel for automatic upgrades
  release_channel {
    channel = "REGULAR"
  }

  # Private cluster — nodes have no external IPs
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false  # Keep API server public for kubectl access
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  # Network configuration
  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.gke_subnet.name

  # IP allocation for Pods and Services
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  # Workload Identity — maps K8s SAs to GCP SAs
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
}

Module 2: Data, Storage & Analytics

Data is GCP's strongest domain. BigQuery is arguably the most important service in all of cloud computing for analytics workloads. Cloud Storage is the universal data lake foundation. This module covers the storage and database services that underpin modern data architectures.

Cloud Storage (GCS)

Cloud Storage (GCS) is Google's object storage service — the equivalent of AWS S3. It stores unstructured data (files, images, backups, ML training data) in buckets. GCS is globally unique by bucket name, and objects are stored in the closest region or in dual/multi-region configurations for high availability.

Storage Classes

| Class | Min Storage Duration | Use Case | Storage $/GB/month | Retrieval $/GB |
|---|---|---|---|---|
| Standard | None | Frequently accessed data, hot data, serving website assets | $0.020 | Free |
| Nearline | 30 days | Data accessed <1x/month — backups, long-tail content | $0.010 | $0.01 |
| Coldline | 90 days | Data accessed <1x/quarter — disaster recovery | $0.004 | $0.02 |
| Archive | 365 days | Data accessed <1x/year — regulatory compliance, long-term retention | $0.0012 | $0.05 |
💡
Autoclass: Enable Autoclass on a bucket and GCS will automatically move objects between storage classes based on actual access patterns. No lifecycle rules to write — Google observes and optimizes. This is a significant UX improvement over manually configuring S3 lifecycle policies.
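The minimum-duration column above matters more than it looks: deleting early still bills the full minimum. A simplified sketch, using only the storage rates from the table (retrieval and operation fees are ignored here):

```python
# gcs_class_cost.py — why short-lived data can cost MORE in a colder class.
# Uses the per-GB/month storage rates from the table above; ignores
# retrieval and operation charges for simplicity.

def class_cost(gb: float, days_stored: int,
               rate_per_gb_month: float, min_days: int) -> float:
    """Storage cost for one object; early deletion bills the minimum duration."""
    billed_days = max(days_stored, min_days)
    return round(gb * rate_per_gb_month * billed_days / 30, 2)

# 500 GB kept for only 10 days
print("Standard:", class_cost(500, 10, 0.020, 0))   # billed 10 days
print("Nearline:", class_cost(500, 10, 0.010, 30))  # billed 30 days minimum
```

In this scenario Nearline's early-deletion minimum makes it the more expensive choice despite the lower per-GB rate — exactly the trap Autoclass is designed to avoid by observing real access patterns first.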

Location Types

Region
Single region (e.g., us-central1). Lowest cost. Best for compute co-location — store data in the same region as your VMs/GKE.
Dual-Region
Two specific regions (e.g., us-east1 + us-central1). Asynchronous replication by default; the optional turbo replication feature targets a 15-minute RPO. Good balance of HA and cost.
Multi-Region
Broad area (e.g., US, EU, ASIA). Highest availability and geo-redundancy. Best for serving content worldwide.
💰
Cost drivers: Storage (GB/month × class rate) + network egress ($0.12/GB to internet) + operations (Class A: $0.05/10K, Class B: $0.004/10K). Early deletion fees apply for Nearline/Coldline/Archive. Egress to BigQuery in the same region is free — this is a major advantage for data pipelines.

Python — Upload and Download Objects

# gcs_operations.py — Upload and download objects using Google Cloud Storage client
from google.cloud import storage

# Client uses Application Default Credentials (ADC)
# On GCE/GKE: automatically uses the attached Service Account
# Locally: uses `gcloud auth application-default login`
client = storage.Client()

# ── Upload a file with custom metadata ──
bucket = client.bucket("my-data-bucket")
blob = bucket.blob("uploads/2026/03/report.csv")

# Set custom metadata (searchable, useful for tracking)
blob.metadata = {
    "uploaded_by": "data-pipeline-v2",
    "source_system": "salesforce",
    "record_count": "45230",
}

blob.upload_from_filename(
    "./local-data/report.csv",
    content_type="text/csv",
)
print(f"Uploaded to gs://{bucket.name}/{blob.name}")

# ── Download a file ──
download_blob = bucket.blob("uploads/2026/03/report.csv")
download_blob.download_to_filename("./downloads/report.csv")
print("Downloaded successfully")

# ── List objects with a prefix ──
blobs = client.list_blobs("my-data-bucket", prefix="uploads/2026/")
for b in blobs:
    print(f"  {b.name} ({b.size} bytes, {b.storage_class})")

Terraform — Secure GCS Bucket

# terraform/gcs.tf — Production GCS bucket with lifecycle and access control

resource "google_storage_bucket" "data_lake" {
  name          = "data-lake-${var.project_id}"
  location      = "US"
  storage_class = "STANDARD"

  # Enable Autoclass to automatically optimize storage costs
  autoclass {
    enabled = true
  }

  # Prevent accidental deletion
  force_destroy = false

  # Enable uniform bucket-level access (recommended over legacy ACLs)
  uniform_bucket_level_access = true

  # Versioning for data recovery
  versioning {
    enabled = true
  }

  # Lifecycle rule: delete old versions after 90 days
  lifecycle_rule {
    condition {
      age                = 90
      with_state         = "ARCHIVED"
    }
    action {
      type = "Delete"
    }
  }

  # Block public access
  public_access_prevention = "enforced"
}

BigQuery

BigQuery is Google's "killer app" — the service that differentiates GCP from every other cloud. It's a fully managed, serverless, petabyte-scale data warehouse with built-in ML, geospatial analysis, and BI Engine for caching. There are no indexes to tune, no clusters to manage, and no vacuum operations. You write SQL, and Google handles the rest.

Architecture: Columnar & Serverless

BigQuery separates storage and compute. Data is stored in Google's Capacitor columnar format on Colossus (Google's distributed file system). Queries are executed by Dremel, a multi-tenant execution engine that distributes work across thousands of workers. This separation means storage and compute scale independently: you can load petabytes without provisioning a cluster, you pay for storage and queries separately, and many teams can query the same datasets without making copies.

Why BigQuery Matters Strategic Context

Google has claimed that BigQuery processes over 110 TB of data per second internally and can scan a 1 PB table in under 30 seconds. No other cloud service at any price point matches this throughput for ad-hoc analytics. AWS Redshift Serverless and Azure Synapse are the closest competitors, but both require more tuning and have steeper cost curves at scale.

Pricing Models

On-Demand Pricing
Pay per query: $6.25 per TB scanned. The first 1 TB per month is free. Best for: exploratory analytics, development, and variable workloads. Risk: an unpartitioned SELECT * on a 10 TB table costs ~$62.50 per query.
Capacity Pricing (Slots)
Purchase BigQuery Editions (Standard, Enterprise, Enterprise Plus). You buy "slots" (units of compute) on a per-second, autoscaling basis. Best for: production pipelines with predictable, high-volume queries. You pay for compute time, not data scanned. Starting at 100 slots (~$0.04/slot-hour for Standard edition).
💰
Cost drivers (On-Demand): Bytes scanned per query (after column pruning and partition pruning). Use --dry_run to preview cost before executing. Cost drivers (Slots): Slot-hours consumed × edition rate. Storage: $0.02/GB/month active, $0.01/GB/month long-term (90+ days unmodified). Streaming inserts: $0.01 per 200 MB.
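A quick way to reason about on-demand versus capacity pricing is the breakeven scan volume. This sketch uses the figures quoted above ($6.25/TB on-demand, ~$0.04/slot-hour Standard edition, 100-slot baseline) and assumes the slots run continuously — autoscaling editions usually bill less than this worst case:

```python
# bq_breakeven.py — on-demand vs always-on slots breakeven sketch,
# using the rates quoted in the pricing section above.

ON_DEMAND_PER_TB = 6.25      # $ per TB scanned
SLOT_HOUR_STANDARD = 0.04    # $ per slot-hour, Standard edition
BASELINE_SLOTS = 100
HOURS_PER_MONTH = 730

def slots_monthly(slots: int = BASELINE_SLOTS,
                  rate: float = SLOT_HOUR_STANDARD,
                  hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost of a fixed slot reservation running 24/7."""
    return slots * rate * hours

def breakeven_tb() -> float:
    """TB scanned per month at which on-demand matches 100 always-on slots."""
    return slots_monthly() / ON_DEMAND_PER_TB

print(f"100 slots ≈ ${slots_monthly():,.0f}/month "
      f"≈ {breakeven_tb():.0f} TB scanned on-demand")
```

If your teams scan well under this volume, on-demand pricing (with maximum_bytes_billed guards) is usually the cheaper and simpler option.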

Console Navigation

GCP Console → BigQuery → SQL Workspace → Compose New Query → Check "Bytes processed" estimate before running

Python — Query a Public Dataset

# bigquery_demo.py — Query a public dataset using the BigQuery Python client
from google.cloud import bigquery

# Client uses Application Default Credentials
client = bigquery.Client()

# ── Query the public GitHub dataset ──
# This scans ~6 GB on-demand = ~$0.04
query = """
    SELECT
        language.name AS language,
        COUNT(*) AS repo_count
    FROM
        `bigquery-public-data.github_repos.languages`,
        UNNEST(language) AS language
    GROUP BY
        language
    ORDER BY
        repo_count DESC
    LIMIT 20
"""

# Use QueryJobConfig for cost control
job_config = bigquery.QueryJobConfig(
    # Set a maximum bytes billed to prevent runaway costs
    maximum_bytes_billed=10 * 1024 ** 3,  # 10 GB limit
    # Use standard SQL (default, but explicit is good)
    use_legacy_sql=False,
)

# Dry run first to check cost
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_legacy_sql=False)
dry_run_job = client.query(query, job_config=dry_run_config)
mb_scanned = dry_run_job.total_bytes_processed / (1024 ** 2)
print(f"This query will scan {mb_scanned:.1f} MB")

# Execute the actual query
query_job = client.query(query, job_config=job_config)
results = query_job.result()

print("\nTop programming languages on GitHub:")
for row in results:
    print(f"  {row.language:<20} {row.repo_count:>10,} repos")

print(f"\nTotal bytes billed: {query_job.total_bytes_billed:,}")
💡
Always set maximum_bytes_billed. This acts as a safety net — if the query would scan more than the limit, BigQuery rejects it instead of running. This single line of code can prevent a $500 mistake.

Cost Optimization Strategies

BigQuery Cost Reduction Checklist Save 50–90%

- Partition tables by date/timestamp — queries that filter on the partition column only scan relevant partitions.
- Cluster tables by frequently filtered columns (up to 4).
- Never use SELECT * — always specify columns.
- Use materialized views for repeated queries.
- Use BI Engine for sub-second dashboard queries (results cached in memory).
- Set per-user and per-project query quotas to prevent accidental cost spikes.

Terraform — BigQuery Dataset and Table

# terraform/bigquery.tf — Partitioned and clustered BigQuery table

resource "google_bigquery_dataset" "analytics" {
  dataset_id = "analytics"
  location   = "US"

  # Default table expiration: 180 days (auto-cleanup for temp data)
  default_table_expiration_ms = 15552000000

  labels = {
    environment = "production"
    managed_by  = "terraform"
  }
}

resource "google_bigquery_table" "events" {
  dataset_id = google_bigquery_dataset.analytics.dataset_id
  table_id   = "events"

  # Partition by ingestion time (or a specific TIMESTAMP/DATE column)
  time_partitioning {
    type  = "DAY"
    field = "event_timestamp"
    expiration_ms = 7776000000  # 90-day partition expiry
  }

  # Cluster by frequently filtered columns — up to 4
  clustering = ["user_id", "event_type"]

  schema = jsonencode([
    { name = "event_id",        type = "STRING",    mode = "REQUIRED" },
    { name = "event_timestamp", type = "TIMESTAMP", mode = "REQUIRED" },
    { name = "user_id",         type = "STRING",    mode = "REQUIRED" },
    { name = "event_type",      type = "STRING",    mode = "REQUIRED" },
    { name = "properties",      type = "JSON",      mode = "NULLABLE" },
  ])

  labels = {
    environment = "production"
  }
}

Cloud SQL & Firestore

GCP offers both managed relational databases (Cloud SQL) and a serverless NoSQL document store (Firestore). Choose based on your data model — if you need joins, transactions, and schemas, use Cloud SQL. If you need flexible documents with real-time sync, use Firestore.

Cloud SQL

Cloud SQL is a fully managed service for MySQL, PostgreSQL, and SQL Server. Google handles replication, backups, encryption, and patching. It supports High Availability with automatic failover (regional instance with a standby in another zone).

| Feature | Cloud SQL | AlloyDB (Premium) |
|---|---|---|
| Engine | MySQL, PostgreSQL, SQL Server | PostgreSQL-compatible only |
| Performance | Standard managed DB | 4x faster than standard PostgreSQL (Google claims) |
| HA | Regional with zonal failover | Regional with <1 sec failover |
| Best for | Standard OLTP workloads, lift-and-shift | High-performance OLTP, hybrid OLAP/OLTP |
| Pricing | vCPU/hour + storage/GB | vCPU/hour + storage/GB (higher base) |

Firestore

Firestore is a serverless, NoSQL document database with real-time syncing and offline support. Documents are organized in collections, and queries are indexed automatically. Firestore operates in two modes: Native mode (real-time listeners, offline sync, mobile/web SDKs) and Datastore mode (server-side workloads, compatible with the legacy Datastore API).

💰
Cloud SQL cost drivers: Instance vCPU-hours + storage (SSD/HDD per GB/month) + HA doubles the instance cost + backups. Firestore cost drivers: Document reads ($0.06/100K), writes ($0.18/100K), deletes ($0.02/100K) + storage ($0.18/GB/month). Firestore free tier: 50K reads, 20K writes, 20K deletes per day.

Terraform — Cloud SQL with HA

# terraform/cloud_sql.tf — Cloud SQL PostgreSQL with HA and private IP

resource "google_sql_database_instance" "postgres" {
  name             = "prod-postgres"
  database_version = "POSTGRES_16"
  region           = "us-central1"

  settings {
    tier = "db-custom-4-16384"  # 4 vCPU, 16 GB RAM

    # High Availability — automatic failover to another zone
    availability_type = "REGIONAL"

    # Disk configuration
    disk_size         = 100  # GB
    disk_type         = "PD_SSD"
    disk_autoresize   = true

    # Private IP only — no public exposure
    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.vpc.id
    }

    # Automated backups
    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "03:00"
      transaction_log_retention_days = 7
      backup_retention_settings {
        retained_backups = 14
      }
    }

    # Maintenance window — Sunday 4AM
    maintenance_window {
      day          = 7
      hour         = 4
      update_track = "stable"
    }
  }

  deletion_protection = true
}

Module 3: Networking & IAM

GCP's networking model is fundamentally different from AWS and Azure. VPCs are global, subnets are regional, and firewall rules are centralized. The IAM hierarchy (Organization → Folders → Projects) is how Google expects you to model your organization.

VPC & Shared VPC

A GCP Virtual Private Cloud (VPC) is a global resource. Unlike AWS (where a VPC is regional) or Azure (where a VNet is regional), a single GCP VPC spans all regions. Subnets, however, are regional — each subnet belongs to one region and one VPC.

GCP VPC vs AWS VPC vs Azure VNet

| Feature | GCP VPC | AWS VPC | Azure VNet |
|---|---|---|---|
| Scope | Global | Regional | Regional |
| Subnets | Regional | Zonal (AZ-bound) | Regional |
| Firewall | VPC-level rules with tags/SAs | Security Groups per ENI | NSGs per subnet/NIC |
| Peering | Global (cross-region peering built in) | Regional (cross-region costs extra) | Global |
| Cross-VPC | Shared VPC (centralized) | Transit Gateway | Virtual WAN |
💡
Key insight: Because GCP VPCs are global, you don't need to create separate VPCs per region or peer them together. A single VPC with regional subnets can span your entire multi-region infrastructure. This dramatically simplifies network architecture.

Shared VPC

Shared VPC lets you designate a host project that owns the VPC and subnets, while service projects use those subnets for their resources (VMs, GKE clusters, etc.). This centralizes network management and security while allowing teams to manage their own compute resources.

When to Use Shared VPC Architecture Pattern

Enterprise standard: If you have more than 2–3 projects, use Shared VPC. It prevents IP overlap, centralizes firewall rules, and gives the networking team a single pane of glass. Without Shared VPC: Each project creates its own VPC, leading to IP conflicts, duplicated firewall rules, and peering nightmares. Shared VPC is free — there's no reason not to use it.

💰
Cost drivers: VPCs and subnets are free. You pay for: Cloud NAT ($0.045/hour/gateway + $0.045/GB processed), Cloud VPN ($0.075/hour/tunnel), Egress ($0.12/GB to internet; inter-region egress $0.01–0.08/GB), and Static external IPs ($0.01/hour when unattached). Cloud NAT is the most common "hidden" cost.
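Since Cloud NAT is the most common surprise on the bill, it's worth sizing up front. This sketch uses the simplified per-gateway figures quoted above (actual NAT billing also depends on the number of VMs behind the gateway, so treat this as a floor estimate):

```python
# nat_cost.py — rough Cloud NAT monthly cost using the per-gateway
# figures quoted in the cost-drivers callout above.

def nat_monthly(gateways: int = 1, gb_processed: float = 0,
                gateway_rate: float = 0.045,  # $/hour/gateway (quoted above)
                data_rate: float = 0.045,     # $/GB processed (quoted above)
                hours: int = 730) -> float:
    """Monthly Cloud NAT cost: gateway uptime plus data processed."""
    return round(gateways * gateway_rate * hours + gb_processed * data_rate, 2)

print(f"1 gateway, 500 GB egress: ${nat_monthly(1, 500)}/month")
```

Even an idle gateway costs ~$33/month, so tearing down NAT in unused dev environments is an easy win.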

Console Navigation

GCP Console → VPC Network → VPC Networks → Create VPC Network → Add Subnets (Custom Mode) → Configure Firewall Rules

Terraform — VPC with Public and Private Subnets

# terraform/vpc.tf — Custom VPC with public and private subnets

resource "google_compute_network" "vpc" {
  name                    = "prod-vpc"
  auto_create_subnetworks = false  # Custom mode — we define our own subnets
  routing_mode            = "GLOBAL"
}

# ── Public subnet (VMs can have external IPs) ──
resource "google_compute_subnetwork" "public" {
  name          = "public-us-central1"
  ip_cidr_range = "10.10.0.0/24"
  region        = "us-central1"
  network       = google_compute_network.vpc.id

  # Enable Flow Logs for network monitoring
  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

# ── Private subnet (no external IPs, uses Cloud NAT for outbound) ──
resource "google_compute_subnetwork" "private" {
  name                     = "private-us-central1"
  ip_cidr_range            = "10.20.0.0/24"
  region                   = "us-central1"
  network                  = google_compute_network.vpc.id
  private_ip_google_access = true  # Access Google APIs without external IP

  # Secondary ranges for GKE Pods and Services
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.100.0.0/16"
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.200.0.0/20"
  }
}

# ── Cloud NAT for private subnet outbound access ──
resource "google_compute_router" "router" {
  name    = "nat-router"
  region  = "us-central1"
  network = google_compute_network.vpc.id
}

resource "google_compute_router_nat" "nat" {
  name                               = "cloud-nat"
  router                             = google_compute_router.router.name
  region                             = "us-central1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"

  subnetwork {
    name                    = google_compute_subnetwork.private.id
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
}

# ── Firewall: Allow SSH via IAP (no public SSH port) ──
resource "google_compute_firewall" "allow_iap_ssh" {
  name    = "allow-iap-ssh"
  network = google_compute_network.vpc.id

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  # IAP's IP range — only IAP can initiate SSH
  source_ranges = ["35.235.240.0/20"]
  target_tags   = ["allow-ssh"]
}

# ── Firewall: Allow internal traffic between subnets ──
resource "google_compute_firewall" "allow_internal" {
  name    = "allow-internal"
  network = google_compute_network.vpc.id

  allow {
    protocol = "tcp"
    ports    = ["0-65535"]
  }
  allow {
    protocol = "udp"
    ports    = ["0-65535"]
  }
  allow {
    protocol = "icmp"
  }

  source_ranges = ["10.10.0.0/24", "10.20.0.0/24"]
}

IAM & Resource Hierarchy

GCP's Identity and Access Management is built around a resource hierarchy. Permissions granted at a higher level are inherited by all children. This is fundamentally different from AWS (where IAM is account-flat) and gives GCP a powerful organizational model.

The GCP Hierarchy

# GCP Resource Hierarchy — permissions flow downward

Organization (example.com)
├── Folder: Engineering
│   ├── Folder: Backend
│   │   ├── Project: backend-prod        ←  VMs, GKE, Cloud SQL live here
│   │   └── Project: backend-staging
│   └── Folder: Data
│       ├── Project: data-warehouse-prod  ←  BigQuery datasets live here
│       └── Project: data-warehouse-dev
├── Folder: Marketing
│   └── Project: marketing-analytics
└── Folder: Shared Services
    ├── Project: shared-networking        ←  Shared VPC host project
    └── Project: shared-monitoring

Key IAM Concepts

| Concept | What It Is | Example |
|---|---|---|
| Principal | Who is making the request (user, group, service account) | user:alice@example.com |
| Role | Collection of permissions (predefined or custom) | roles/bigquery.dataViewer |
| Policy Binding | Attaches a role to a principal at a resource level | "Alice gets BigQuery Viewer on project X" |
| Service Account | Identity for applications and VMs (not humans) | my-app@project.iam.gserviceaccount.com |
| Workload Identity | Maps Kubernetes SAs to GCP Service Accounts | GKE pods authenticate as GCP SAs |

Service Accounts — The GCP Way

In GCP, Service Accounts are the primary way applications authenticate. Unlike AWS IAM Users (with Access Keys), GCP Service Accounts use short-lived tokens automatically rotated by the platform. Never download service account key files — use attached service accounts on Compute Engine, GKE (Workload Identity), or Cloud Functions.

🚨
Never download SA key files. Service Account JSON keys are the GCP equivalent of AWS Access Keys — long-lived credentials that can be leaked. Instead, attach SAs directly to resources (VMs, functions, GKE pods). For external applications, use Workload Identity Federation.

Python — Authenticate and List Resources

# iam_demo.py — Authenticate with a Service Account and list resources
from google.cloud import compute_v1
import google.auth

# ── Application Default Credentials (ADC) ──
# On GCE/GKE: automatically uses the attached Service Account
# Locally: uses credentials from `gcloud auth application-default login`
credentials, project = google.auth.default()
print(f"Authenticated as project: {project}")

# ── List Compute Engine instances ──
instance_client = compute_v1.InstancesClient()
request = compute_v1.AggregatedListInstancesRequest(project=project)

print("\nCompute Engine instances:")
for zone, instances_scoped_list in instance_client.aggregated_list(request=request):
    if instances_scoped_list.instances:
        for instance in instances_scoped_list.instances:
            print(f"  {instance.name:30} {instance.status:10} {zone}")

# ── Impersonate a Service Account (no key file needed) ──
from google.auth import impersonated_credentials

target_sa = "data-pipeline@my-project.iam.gserviceaccount.com"
target_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

# Create impersonated credentials — requires iam.serviceAccountTokenCreator role
impersonated_creds = impersonated_credentials.Credentials(
    source_credentials=credentials,
    target_principal=target_sa,
    target_scopes=target_scopes,
    lifetime=3600,  # 1 hour max
)

# Use impersonated credentials with any Google Cloud client
from google.cloud import bigquery
bq_client = bigquery.Client(credentials=impersonated_creds, project=project)
datasets = list(bq_client.list_datasets())
print(f"\nDatasets accessible as {target_sa}: {len(datasets)}")

Terraform — IAM Bindings

# terraform/iam.tf — Project-level and resource-level IAM bindings

# ── Grant a group BigQuery Data Viewer on the project ──
resource "google_project_iam_member" "bq_viewer" {
  project = var.project_id
  role    = "roles/bigquery.dataViewer"
  member  = "group:data-analysts@example.com"
}

# ── Grant a Service Account storage access on a specific bucket ──
resource "google_storage_bucket_iam_member" "pipeline_writer" {
  bucket = google_storage_bucket.data_lake.name
  role   = "roles/storage.objectCreator"
  member = "serviceAccount:${google_service_account.pipeline_sa.email}"
}

# ── Create a custom role with minimal permissions ──
resource "google_project_iam_custom_role" "log_reader" {
  role_id     = "customLogReader"
  title       = "Custom Log Reader"
  description = "Can read logs but not modify anything"

  permissions = [
    "logging.logEntries.list",
    "logging.logs.list",
    "logging.logServices.list",
  ]
}

Cloud Identity-Aware Proxy (IAP)

IAP lets you control access to your web applications and VMs without a VPN. It acts as a reverse proxy that verifies the user's identity (via Google Sign-In or external IdP) and checks IAM permissions before forwarding the request. This is Google's implementation of the BeyondCorp zero-trust security model.

How IAP Works

User Request → Load Balancer → IAP (Identity Check) → IAM Authorization → Backend Service

Use Cases

💡
Replace your bastion hosts with IAP tunnels. Instead of maintaining a bastion VM ($30+/month), use gcloud compute ssh --tunnel-through-iap VM_NAME. Traffic is encrypted end-to-end, access is IAM-controlled, and there's no public IP to expose. IAP TCP tunneling is free.
💰
Cost drivers: IAP itself is free. You pay for the Load Balancer fronting the IAP-protected backend ($0.025/hour + $0.008/GB processed). IAP TCP tunnels (SSH/RDP) are free. The savings from removing VPN infrastructure and bastion hosts typically exceed the LB cost.

Terraform — IAP-Protected Backend

# terraform/iap.tf — Enable IAP on a backend service

# Enable IAP on the backend service
resource "google_iap_web_backend_service_iam_member" "access" {
  web_backend_service = google_compute_backend_service.app.name
  role                = "roles/iap.httpsResourceAccessor"
  member              = "group:developers@example.com"
}

# IAP OAuth consent — required for web app protection
resource "google_iap_brand" "default" {
  support_email     = "admin@example.com"
  application_title = "Internal Tools"
}

resource "google_iap_client" "default" {
  display_name = "IAP Client"
  brand        = google_iap_brand.default.name
}

Module 4: Pricing & Billing Management

GCP's pricing model rewards data-heavy, long-running workloads with automatic discounts (SUD, CUDs). But without visibility, costs can spiral quickly — especially with BigQuery, egress, and over-provisioned VMs. This module covers the tools to estimate, monitor, and control spend.

GCP Pricing Calculator

The Google Cloud Pricing Calculator helps estimate monthly costs for multi-service architectures before deployment. For BigQuery specifically, you need to estimate both storage and analysis costs separately.

Estimating BigQuery Costs (Example)

Scenario: Your team runs 50 queries/day, each scanning an average of 20 GB, on a 5 TB dataset.

Component | Calculation | Monthly Cost
Storage (Active) | 5 TB × $0.02/GB = 5,000 GB × $0.02 | $100.00
On-Demand Analysis | 50 queries/day × 20 GB × 30 days = 30 TB × $6.25/TB | $187.50
Free Tier Offset | First 1 TB of analysis per month is free | -$6.25
Total | | $281.25 / month
Compare with capacity pricing: if your teams run heavy queries around the clock, BigQuery Editions (slot-based) may be cheaper. At 100 Standard slots × $0.04/slot-hour × 730 hours = $2,920/month, capacity only wins if you would otherwise exceed that in on-demand scanning (roughly 467 TB/month). For most teams, on-demand plus good partitioning is the sweet spot.
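The break-even arithmetic in this comparison can be sketched in a few lines of Python. Prices are the list prices quoted in this section; verify against current pricing before relying on the numbers:

```python
# bq_cost_model.py — on-demand vs. capacity (Editions) break-even sketch,
# using the list prices quoted in this section.

ON_DEMAND_PER_TB = 6.25    # $ per TB scanned, on-demand
SLOT_HOUR = 0.04           # $ per Standard Edition slot-hour
HOURS_PER_MONTH = 730
FREE_TB_PER_MONTH = 1.0    # on-demand free tier

def on_demand_monthly(tb_scanned: float) -> float:
    """Monthly on-demand analysis cost after the 1 TB free tier."""
    return max(tb_scanned - FREE_TB_PER_MONTH, 0.0) * ON_DEMAND_PER_TB

def capacity_monthly(slots: int) -> float:
    """Monthly cost of an always-on slot reservation."""
    return slots * SLOT_HOUR * HOURS_PER_MONTH

def break_even_tb(slots: int) -> float:
    """TB scanned per month at which capacity pricing becomes cheaper."""
    return capacity_monthly(slots) / ON_DEMAND_PER_TB + FREE_TB_PER_MONTH

# The worked example above: 50 queries/day × 20 GB × 30 days = 30 TB/month
print(on_demand_monthly(30.0))   # 181.25 (analysis only, storage excluded)
print(capacity_monthly(100))     # 2920.0
print(break_even_tb(100))        # ~468 TB/month
```

At 30 TB/month of scanning, on-demand is an order of magnitude cheaper than a 100-slot reservation, which is why the table above totals only $281.25.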

Console Navigation

cloud.google.com/products/calculator → Add Services → Configure Parameters → Share Estimate Link

Hidden Costs Checklist

Common Cost Surprises on GCP (Budget Review)

Cloud NAT: $0.045/hour ≈ $32/month per gateway, plus per-GB processing fees.
Load Balancers: $0.025/hour ≈ $18/month even with zero traffic.
Egress: $0.12/GB to the internet (first 1 GB/month free).
Static IPs: $0.01/hour when not attached.
Persistent Disk snapshots: charged per GB stored.
Cloud Logging ingestion: first 50 GB/month free, then $0.50/GB.

Billing Export to BigQuery

GCP's most powerful cost analysis feature is Billing Export to BigQuery. Once enabled, every line item on your invoice is exported to a BigQuery table in near-real-time. You can then run SQL queries to find cost anomalies, break down spend by project/service/label, and build custom dashboards.

Setup

Billing Account → Billing Export → Edit Settings (BigQuery Export) → Select Project & Dataset → Enable Standard & Detailed Usage

SQL — Top 10 Costliest Services This Month

-- billing_analysis.sql — Find top cost drivers in your billing export

SELECT
  service.description AS service_name,
  SUM(cost) + SUM(IFNULL(
    (SELECT SUM(c.amount) FROM UNNEST(credits) c), 0
  )) AS net_cost
FROM
  `my-billing-project.billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE
  invoice.month = FORMAT_DATE('%Y%m', CURRENT_DATE())
GROUP BY
  service_name
ORDER BY
  net_cost DESC
LIMIT 10

SQL — Daily Spend by Project (for Anomaly Detection)

-- daily_by_project.sql — Track spend trends per project per day

SELECT
  DATE(usage_start_time) AS usage_date,
  project.id AS project_id,
  SUM(cost) AS daily_cost
FROM
  `my-billing-project.billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE
  usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
  usage_date, project_id
ORDER BY
  usage_date DESC, daily_cost DESC
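One way to put that query's output to work: pull the rows into Python and flag day-over-day spikes per project. A minimal sketch; the 2× ratio, the $10 floor, and the billing table name are assumptions to tune for your environment:

```python
# cost_anomaly.py — flag day-over-day spend spikes per project, using rows
# shaped like the daily_by_project.sql output above.

def flag_spikes(rows, ratio=2.0, min_cost=10.0):
    """rows: (usage_date, project_id, daily_cost) tuples in date-ascending order.
    Returns (project_id, date, cost, prev_cost) where spend jumped ratio× or more."""
    prev = {}
    spikes = []
    for usage_date, project_id, cost in rows:
        last = prev.get(project_id)
        if last is not None and last > 0 and cost >= min_cost and cost / last >= ratio:
            spikes.append((project_id, usage_date, cost, last))
        prev[project_id] = cost
    return spikes

if __name__ == "__main__":
    # Deferred import; the table name is a placeholder for your billing export
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
      SELECT DATE(usage_start_time) AS d, project.id AS p, SUM(cost) AS c
      FROM `my-billing-project.billing_dataset.gcp_billing_export_v1_XXXXXX`
      WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      GROUP BY d, p ORDER BY d
    """
    rows = [(str(r.d), r.p, float(r.c)) for r in client.query(sql).result()]
    for project_id, day, cost, prev_cost in flag_spikes(rows):
        print(f"⚠ {project_id} on {day}: ${cost:,.2f} (was ${prev_cost:,.2f})")
```

Run it from a daily Cloud Function or cron job and pipe the output to your alerting channel of choice.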
💡
Pro tip: Connect Looker Studio (formerly Data Studio) to your billing BigQuery table for free, auto-refreshing cost dashboards. Label all resources with team, environment, and cost_center labels — the billing export includes labels, making per-team chargeback trivial.

Quotas

Every GCP API has quotas — limits on how many requests per minute, how many resources per project, or how much capacity you can use. Quotas exist to protect both Google and you (preventing a runaway script from creating 10,000 VMs). Unlike AWS service limits, GCP quotas are granular and can block you silently.

Common Quota Traps

Quota | Default Limit | What Happens
CPUs per region | 24 (new projects) | Cannot create VMs — API returns QUOTA_EXCEEDED
GKE nodes per zone | 1,000 | Node pool can't scale
BigQuery concurrent slots | 2,000 (on-demand) | Queries queue and slow down
Cloud Functions per region | 1,000 | Deployment fails
External IP addresses | 8 per region | Cannot attach public IPs

How to Request Quota Increases

GCP Console → IAM & Admin → Quotas & System Limits → Filter by Service → Select Quota → Edit Quotas
Quota increases are not instant. Some take minutes (CPU quotas), others take days (GPU quotas, especially for A100/H100). Plan ahead — request GPU quotas weeks before you need them. New billing accounts start with very low quotas; spending history increases your default limits.

Python — Check Quotas Programmatically

# check_quotas.py — List quotas and usage for a project/region
from google.cloud import compute_v1

client = compute_v1.RegionsClient()
project = "my-gcp-project-id"
region = "us-central1"

region_info = client.get(project=project, region=region)

print(f"Quotas for {region}:")
for quota in region_info.quotas:
    usage_pct = (quota.usage / quota.limit * 100) if quota.limit > 0 else 0
    if usage_pct > 50:  # Only show quotas above 50% usage
        print(f"  ⚠ {quota.metric:35} {quota.usage:.0f}/{quota.limit:.0f} ({usage_pct:.0f}%)")

Module 5: Common Pitfalls

Every cloud platform has traps. GCP's are unique because of its global VPC model, BigQuery's scan-based pricing, and the ease of creating new projects. This module covers the mistakes that cost real money and create real security incidents.

Default VPC Rules — The Silent Security Risk

Every GCP project comes with a default VPC and four dangerously permissive firewall rules:

Rule Name | What It Allows | Why It's Dangerous
default-allow-internal | All TCP, UDP, ICMP from source range 10.128.0.0/9 | Any VM can talk to any other VM across all subnets in the VPC — no segmentation
default-allow-ssh | TCP:22 from 0.0.0.0/0 | SSH is open to the entire internet
default-allow-rdp | TCP:3389 from 0.0.0.0/0 | RDP is open to the entire internet
default-allow-icmp | ICMP from 0.0.0.0/0 | Enables reconnaissance via ping sweeps
🚨
Never use the default VPC for production. The default-allow-ssh rule alone means every VM you create has SSH open to the internet by default. This is the GCP equivalent of leaving your front door open. Always create a custom VPC with explicit, least-privilege firewall rules.

Fix: Delete Default VPC Firewall Rules

# Delete the dangerous default rules (run once per project)
gcloud compute firewall-rules delete default-allow-ssh \
  --project=my-gcp-project-id --quiet

gcloud compute firewall-rules delete default-allow-rdp \
  --project=my-gcp-project-id --quiet

gcloud compute firewall-rules delete default-allow-icmp \
  --project=my-gcp-project-id --quiet

# Or better: delete the entire default VPC and create a custom one
# (all of the network's firewall rules must be deleted first)
gcloud compute networks delete default \
  --project=my-gcp-project-id --quiet
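After cleaning up, it helps to audit continuously. A small sketch that flags any remaining ingress rules open to 0.0.0.0/0 (the project ID is a placeholder; the main block assumes google-cloud-compute and list permission on firewalls):

```python
# firewall_audit.py — flag firewall rules whose ingress is open to the internet.

def open_to_world(rules):
    """Return names of enabled ingress rules with 0.0.0.0/0 as a source range."""
    return [
        r.name
        for r in rules
        if r.direction == "INGRESS"
        and not r.disabled
        and "0.0.0.0/0" in list(r.source_ranges)
    ]

if __name__ == "__main__":
    # Deferred import so the helper above stays dependency-free
    from google.cloud import compute_v1

    client = compute_v1.FirewallsClient()
    rules = list(client.list(project="my-gcp-project-id"))
    for name in open_to_world(rules):
        print(f"⚠ open to the internet: {name}")
```

Scheduled daily, this catches anyone who recreates a world-open rule after the defaults have been deleted.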

Terraform — Organization Policy to Block Default Network

# terraform/org_policy.tf — Prevent default VPC creation in all new projects

resource "google_organization_policy" "skip_default_network" {
  org_id     = var.org_id
  constraint = "constraints/compute.skipDefaultNetworkCreation"

  boolean_policy {
    enforced = true
  }
}

# This ensures every new project starts with NO default VPC.
# Teams must create custom VPCs with proper firewall rules.

BigQuery Query Costs — The $100 SELECT *

BigQuery's on-demand pricing charges $6.25 per TB scanned. A single SELECT * on a large, unpartitioned table can be shockingly expensive. This is the most common cost mistake on GCP.

The Scenario

Real-World Cost Example $$$

A developer runs SELECT * FROM events on a 20 TB unpartitioned table to "just look at a few rows." BigQuery scans the entire table — all 20 TB. Cost: 20 × $6.25 = $125.00 for one query. They run it 5 times with small modifications: $625 in an afternoon. With partitioning and a WHERE event_date = '2026-03-31' filter, the same query scans 50 GB: $0.31.

Prevention Strategies

bq — Per-User and Per-Query Scan Limits

# There is no Terraform resource for per-user BigQuery query quotas.
# Set the "Query usage per day per user" quota in the Console:
#   IAM & Admin → Quotas → BigQuery API

# Per-query guardrail: cap billable bytes with the bq CLI.
# A query that would scan more than this fails instead of billing.
bq query \
  --use_legacy_sql=false \
  --maximum_bytes_billed=1000000000 \
  'SELECT event_id FROM `my-project.analytics.events` WHERE event_date = "2026-03-31"'

# For org-wide enforcement, use BigQuery Reservations to set
# per-project slot caps in capacity (Editions) mode
💡
Use the BigQuery query validator. In the GCP Console BigQuery editor, the green/red status badge in the top-right shows estimated bytes scanned before you click "Run." Train your team to always check this badge. If it shows "TB" instead of "GB", stop and add a partition filter.
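The same pre-flight check can be scripted: a BigQuery dry run returns the estimated bytes scanned without executing or billing the query. A sketch, assuming google-cloud-bigquery and a placeholder table name:

```python
# dry_run_guard.py — estimate a query's scan size before running it,
# the scripted equivalent of the editor's validator badge.

TB = 1024 ** 4

def estimated_cost_usd(bytes_processed: int, price_per_tb: float = 6.25) -> float:
    """On-demand cost estimate for a given scan size."""
    return bytes_processed / TB * price_per_tb

def dry_run_bytes(client, sql: str) -> int:
    """Return estimated bytes scanned; a dry run never executes or bills."""
    from google.cloud import bigquery  # deferred: needs google-cloud-bigquery
    cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=cfg).total_bytes_processed

if __name__ == "__main__":
    from google.cloud import bigquery
    client = bigquery.Client()
    n = dry_run_bytes(client, "SELECT * FROM `my-project.analytics.events`")
    print(f"Would scan {n / 1e9:.1f} GB, about ${estimated_cost_usd(n):.2f}")
```

Wire this into CI or a pre-commit hook so any query estimated above a threshold (say, 1 TB) is rejected before it ever reaches production.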

Project Proliferation

GCP makes it easy to create projects — too easy. Without governance, teams create ad-hoc projects for experiments, POCs, and one-off demos. These accumulate. Each project may have running resources (VMs, Cloud SQL instances, GKE clusters) that no one monitors. This is how a $5K/month GCP bill becomes $50K/month.

The Problem

Symptom | Root Cause | Impact
50+ projects in the org | No naming convention or folder structure | Impossible to track ownership or costs
Projects with no labels | No enforcement of labeling at creation | Cannot attribute costs to teams
Projects with no billing budget | No org-level budget policy | Runaway spend goes unnoticed for weeks
"Zombie" projects | POC completed but project not deleted | VMs, Cloud SQL, GKE clusters running 24/7

Prevention: Centralized Project Factory

Use a Project Factory pattern (via Terraform) to standardize project creation. Every project gets a naming convention, required labels, a billing budget, and folder placement.

# terraform/project_factory.tf — Standardized project creation

resource "google_project" "managed" {
  name            = "${var.team}-${var.environment}"
  project_id      = "${var.org_prefix}-${var.team}-${var.environment}"
  folder_id       = var.folder_id
  billing_account = var.billing_account_id

  labels = {
    team        = var.team
    environment = var.environment
    cost_center = var.cost_center
    created_by  = "terraform"
  }
}

# ── Automatically enable required APIs ──
resource "google_project_service" "required_apis" {
  for_each = toset([
    "compute.googleapis.com",
    "container.googleapis.com",
    "bigquery.googleapis.com",
    "logging.googleapis.com",
    "monitoring.googleapis.com",
  ])

  project = google_project.managed.project_id
  service = each.value
}

# ── Create a billing budget for the project ──
resource "google_billing_budget" "project_budget" {
  billing_account = var.billing_account_id
  display_name    = "Budget: ${google_project.managed.name}"

  budget_filter {
    projects = ["projects/${google_project.managed.number}"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = var.monthly_budget
    }
  }

  threshold_rules {
    threshold_percent = 0.5   # Alert at 50%
  }
  threshold_rules {
    threshold_percent = 0.8   # Alert at 80%
  }
  threshold_rules {
    threshold_percent = 1.0   # Alert at 100%
  }
  threshold_rules {
    threshold_percent = 1.5   # Alert at 150% (overspend)
    spend_basis       = "CURRENT_SPEND"
  }

  all_updates_rule {
    monitoring_notification_channels = [var.notification_channel_id]
  }
}
Project Governance Checklist (Organization Admin)

1. Use a Terraform Project Factory for all project creation — no Console/gcloud ad-hoc.
2. Enforce required labels via Organization Policy.
3. Attach a billing budget to every project at creation.
4. Quarterly audit: list all projects, check for running resources, delete zombies.
5. Use a folder structure (Engineering, Data, Shared) to group projects logically.
6. Enable the Recommender API to surface idle VMs, overprovisioned instances, and unused IPs.

🚨
Audit regularly. Run gcloud projects list --filter="lifecycleState=ACTIVE" monthly. Cross-reference with billing export to find projects with non-zero spend but no recent code deployments. These are your "zombie" projects.
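Part of that audit can be automated. A sketch that lists active projects missing the required labels from the checklist above (the main block assumes google-cloud-resource-manager and search permission at the org scope; the required-label set mirrors the Project Factory labels):

```python
# project_audit.py — find active projects missing required governance labels.

REQUIRED_LABELS = frozenset({"team", "environment", "cost_center"})

def missing_labels(projects, required=REQUIRED_LABELS):
    """Return (project_id, missing-label set) for projects lacking any required label."""
    out = []
    for p in projects:
        gaps = set(required) - set(dict(p.labels))
        if gaps:
            out.append((p.project_id, gaps))
    return out

if __name__ == "__main__":
    from google.cloud import resourcemanager_v3

    client = resourcemanager_v3.ProjectsClient()
    active = client.search_projects(request={"query": "state:ACTIVE"})
    for project_id, gaps in missing_labels(active):
        print(f"⚠ {project_id} missing labels: {sorted(gaps)}")
```

Unlabeled projects are exactly the ones the billing export cannot attribute to a team — flag them first when hunting zombies.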