Google Cloud (GCP) Engineer Handbook
A practical, data-focused guide to Google Cloud architecture — compute, BigQuery, networking, cost management, and security patterns with Terraform and Google Cloud Python Client Libraries.
Table of Contents
Module 1: Compute & Containers
GCP offers compute across the full abstraction spectrum — from Compute Engine VMs you fully control, to Cloud Functions where you write a single handler and GCP manages everything else. GKE sits in the middle as the premier managed Kubernetes offering in any cloud.
Compute Engine
Compute Engine provides virtual machines running on Google's infrastructure. You select a Machine Family (general-purpose, compute-optimized, memory-optimized, accelerator-optimized), choose an OS image, and pick a zone. Unlike AWS, GCP automatically applies Sustained Use Discounts — no upfront commitment needed.
Machine Families
| Family | Series | Use Case | Example |
|---|---|---|---|
| General Purpose | E2, N2, N2D, T2D, C3 | Web servers, dev/test, small databases, microservices | e2-medium (2 vCPU / 4 GB) |
| Compute-Optimized | C2, C2D, H3 | Batch processing, gaming servers, HPC, CI/CD | c2-standard-8 (8 vCPU / 32 GB) |
| Memory-Optimized | M1, M2, M3 | SAP HANA, large in-memory databases, real-time analytics | m2-ultramem-208 (208 vCPU / 5.8 TB) |
| Accelerator-Optimized | A2, A3, G2 | ML training/inference, video transcoding, GPU workloads | a2-highgpu-1g (12 vCPU / 85 GB + A100) |
Pricing Models
Sustained Use Discounts (SUD)
GCP automatically applies discounts when a VM runs for more than 25% of a month. No action required — no reservations, no upfront payment. The discount deepens in tiers:
| Monthly Usage | Effective Discount | You Pay |
|---|---|---|
| 0–25% of month | 0% | Full on-demand rate |
| 25–50% of month | ~20% | 80% of on-demand rate on incremental usage |
| 50–75% of month | ~40% | 60% of on-demand rate on incremental usage |
| 75–100% of month | ~60% | 40% of on-demand rate on incremental usage |
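The tier table above can be turned into a quick cost model. This is a minimal sketch, assuming each tier's discount applies only to the usage that falls inside that tier (which is how the incremental rates above compose); run for a full month, the tiers blend to a ~30% net discount.

```python
# Effective sustained-use discount, per the tier table above.
# Each tuple is (width of tier as fraction of month, discount in that tier).
SUD_TIERS = [(0.25, 0.00), (0.25, 0.20), (0.25, 0.40), (0.25, 0.60)]

def effective_cost(on_demand_monthly: float, fraction_of_month: float) -> float:
    """Cost after SUD when the VM runs `fraction_of_month` of the month."""
    remaining = fraction_of_month
    cost = 0.0
    for width, discount in SUD_TIERS:
        used = min(remaining, width)          # usage billed at this tier's rate
        cost += on_demand_monthly * used * (1 - discount)
        remaining -= used
        if remaining <= 0:
            break
    return cost

full_month = effective_cost(280.0, 1.0)  # VM at ~$280/month on demand
print(f"Full-month cost: ${full_month:.2f}")  # $196.00 — a 30% net discount
```

The blended full-month rate is 0.25 × (1.0 + 0.8 + 0.6 + 0.4) = 70% of on-demand, which is why GCP advertises "up to 30%" for sustained use.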
Example: an n2-standard-8 costs ~$280/month at the full on-demand rate, before sustained use discounts are applied.
Console Navigation
Terraform — Launch a Compute Engine VM
```hcl
# terraform/compute.tf — Production-ready Compute Engine instance

provider "google" {
  project = "my-gcp-project-id"
  region  = "us-central1"
}

resource "google_compute_instance" "web_server" {
  name         = "web-server-prod"
  machine_type = "e2-medium"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
      size  = 30 # GB
      type  = "pd-balanced"
    }
  }

  network_interface {
    network    = "default"
    subnetwork = "default"

    # Omit access_config block for private-only VM (no external IP)
    access_config {
      # Ephemeral external IP — use for testing only
    }
  }

  # Use a Service Account instead of user credentials
  service_account {
    email  = google_service_account.vm_sa.email
    scopes = ["cloud-platform"]
  }

  # Enable Shielded VM features for security
  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }

  metadata = {
    # Block project-wide SSH keys — use OS Login instead
    block-project-ssh-keys = "true"
  }

  labels = {
    environment = "production"
    managed_by  = "terraform"
  }
}

resource "google_service_account" "vm_sa" {
  account_id   = "web-server-sa"
  display_name = "Web Server Service Account"
}
```
Setting enable_secure_boot = true protects against rootkits and bootkits. Combined with block-project-ssh-keys and OS Login, you get a hardened VM baseline.
Cloud Functions
Cloud Functions is Google's serverless compute platform for event-driven code. You write a function, attach it to a trigger (HTTP, Cloud Storage, Pub/Sub, Firestore, Cloud Scheduler), and GCP handles provisioning, scaling, and patching. Cloud Functions (2nd gen) is built on Cloud Run, giving you longer timeouts (up to 60 minutes) and concurrency support.
Gen 1 vs Gen 2
| Feature | Gen 1 | Gen 2 (Recommended) |
|---|---|---|
| Max timeout | 9 minutes | 60 minutes (HTTP) / 9 min (event) |
| Concurrency | 1 request per instance | Up to 1,000 concurrent requests per instance |
| Min instances | Supported | Supported — eliminates cold starts |
| Traffic splitting | Not available | Supported via Cloud Run revisions |
| Built on | Custom runtime | Cloud Run + Eventarc |
Python — Cloud Function Triggered by GCS Upload
```python
# main.py — Cloud Function (Gen 2) triggered by a Cloud Storage object upload
import functions_framework
from google.cloud import storage


@functions_framework.cloud_event
def process_gcs_upload(cloud_event):
    """Triggered when a new object is created in a GCS bucket.

    The event payload contains bucket name, object name, and metadata.
    """
    data = cloud_event.data
    bucket_name = data["bucket"]
    file_name = data["name"]
    content_type = data.get("contentType", "unknown")
    size_bytes = data.get("size", 0)

    print(f"New file uploaded: gs://{bucket_name}/{file_name}")
    print(f"Content type: {content_type}, Size: {size_bytes} bytes")

    # Example: Read the file and process it
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(file_name)

    # Only process CSV files
    if file_name.endswith(".csv"):
        content = blob.download_as_text()
        line_count = len(content.strip().split("\n"))
        print(f"CSV has {line_count} lines — sending to BigQuery...")
        # Insert into BigQuery, call another service, etc.
    else:
        print(f"Skipping non-CSV file: {file_name}")
```
Deploy with gcloud CLI
```shell
# Deploy the Gen 2 Cloud Function with a GCS trigger
gcloud functions deploy process-gcs-upload \
  --gen2 \
  --runtime python312 \
  --region us-central1 \
  --source . \
  --entry-point process_gcs_upload \
  --trigger-event-filters="type=google.cloud.storage.object.v1.finalized" \
  --trigger-event-filters="bucket=my-data-bucket" \
  --memory 256Mi \
  --timeout 120s \
  --service-account my-cf-sa@my-gcp-project-id.iam.gserviceaccount.com
```
Terraform — Cloud Function (Gen 2)
```hcl
# terraform/cloud_function.tf — Gen 2 Cloud Function with GCS trigger

resource "google_storage_bucket" "source_code" {
  name     = "cf-source-${var.project_id}"
  location = "US"
}

resource "google_storage_bucket_object" "function_zip" {
  name   = "function-source.zip"
  bucket = google_storage_bucket.source_code.name
  source = "${path.module}/function-source.zip"
}

resource "google_cloudfunctions2_function" "processor" {
  name     = "process-gcs-upload"
  location = "us-central1"

  build_config {
    runtime     = "python312"
    entry_point = "process_gcs_upload"
    source {
      storage_source {
        bucket = google_storage_bucket.source_code.name
        object = google_storage_bucket_object.function_zip.name
      }
    }
  }

  service_config {
    max_instance_count    = 10
    min_instance_count    = 0
    available_memory      = "256Mi"
    timeout_seconds       = 120
    service_account_email = google_service_account.cf_sa.email
  }

  event_trigger {
    trigger_region = "us-central1"
    event_type     = "google.cloud.storage.object.v1.finalized"
    event_filters {
      attribute = "bucket"
      value     = google_storage_bucket.data_bucket.name
    }
  }
}

resource "google_service_account" "cf_sa" {
  account_id   = "cloud-function-sa"
  display_name = "Cloud Function Service Account"
}
```
Google Kubernetes Engine (GKE)
GKE is widely considered the best managed Kubernetes service in any cloud. Google created Kubernetes, drawing on lessons from its internal Borg system, and GKE reflects that heritage — it supports the latest K8s versions first, has the tightest integration with GCP services, and offers an Autopilot mode that eliminates node management entirely.
Autopilot vs Standard Mode
- Fastest K8s upgrades: GKE supports new Kubernetes versions weeks before EKS/AKS.
- Release channels: Rapid, Regular, and Stable channels with automatic upgrades.
- GKE Enterprise: Multi-cluster management, service mesh (Anthos Service Mesh), and fleet-level policy enforcement.
- Binary Authorization: Only deploy signed container images — built-in supply chain security.
Console Navigation
Terraform — GKE Autopilot Cluster
```hcl
# terraform/gke.tf — GKE Autopilot cluster

resource "google_container_cluster" "autopilot" {
  name     = "prod-autopilot-cluster"
  location = "us-central1"

  # Enable Autopilot mode
  enable_autopilot = true

  # Use a release channel for automatic upgrades
  release_channel {
    channel = "REGULAR"
  }

  # Private cluster — nodes have no external IPs
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false # Keep API server public for kubectl access
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }

  # Network configuration
  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.gke_subnet.name

  # IP allocation for Pods and Services
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  # Workload Identity — maps K8s SAs to GCP SAs
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
}
```
Module 2: Data, Storage & Analytics
Data is GCP's strongest domain. BigQuery is arguably the most important service in all of cloud computing for analytics workloads. Cloud Storage is the universal data lake foundation. This module covers the storage and database services that underpin modern data architectures.
Cloud Storage (GCS)
Cloud Storage (GCS) is Google's object storage service — the equivalent of AWS S3. It stores unstructured data (files, images, backups, ML training data) in buckets. Bucket names are globally unique, and objects live in the bucket's configured location — a single region, or dual/multi-region configurations for high availability.
Storage Classes
| Class | Min Storage Duration | Use Case | Storage $/GB/month | Retrieval $/GB |
|---|---|---|---|---|
| Standard | None | Frequently accessed data, hot data, serving website assets | $0.020 | Free |
| Nearline | 30 days | Data accessed <1x/month — backups, long-tail content | $0.010 | $0.01 |
| Coldline | 90 days | Data accessed <1x/quarter — disaster recovery | $0.004 | $0.02 |
| Archive | 365 days | Data accessed <1x/year — regulatory compliance, long-term retention | $0.0012 | $0.05 |
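The class table above reduces to simple arithmetic once you know how much you store and how much you read back each month. A minimal sketch using the table's per-GB prices (operation charges and minimum storage durations are ignored here, so treat results as lower bounds):

```python
# (storage $/GB/month, retrieval $/GB) per class, from the table above.
CLASSES = {
    "standard": (0.020, 0.00),
    "nearline": (0.010, 0.01),
    "coldline": (0.004, 0.02),
    "archive":  (0.0012, 0.05),
}

def monthly_cost(klass: str, stored_gb: float, read_gb: float) -> float:
    """Storage plus retrieval cost for one month."""
    storage, retrieval = CLASSES[klass]
    return stored_gb * storage + read_gb * retrieval

# 1 TiB of backups with ~5% read back per month: Nearline beats Standard.
for klass in CLASSES:
    print(f"{klass:8} ${monthly_cost(klass, 1024, 51.2):7.2f}")
```

The crossover matters: for data that is read back heavily, retrieval fees can make a "colder" class more expensive than Standard, which is also why Autoclass (shown in the Terraform example below) is often the safer default.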
Location Types
- Region (e.g., us-central1): Lowest cost. Best for compute co-location — store data in the same region as your VMs/GKE.
- Dual-region (e.g., us-east1 + us-central1): Automatic replication with turbo mode (<15 min RPO). Good balance of HA and cost.
- Multi-region (US, EU, ASIA): Highest availability and geo-redundancy. Best for serving content worldwide.
Python — Upload and Download Objects
```python
# gcs_operations.py — Upload and download objects using Google Cloud Storage client
from google.cloud import storage

# Client uses Application Default Credentials (ADC)
# On GCE/GKE: automatically uses the attached Service Account
# Locally: uses `gcloud auth application-default login`
client = storage.Client()

# ── Upload a file with custom metadata ──
bucket = client.bucket("my-data-bucket")
blob = bucket.blob("uploads/2026/03/report.csv")

# Set custom metadata (searchable, useful for tracking)
blob.metadata = {
    "uploaded_by": "data-pipeline-v2",
    "source_system": "salesforce",
    "record_count": "45230",
}

blob.upload_from_filename(
    "./local-data/report.csv",
    content_type="text/csv",
)
print(f"Uploaded to gs://{bucket.name}/{blob.name}")

# ── Download a file ──
download_blob = bucket.blob("uploads/2026/03/report.csv")
download_blob.download_to_filename("./downloads/report.csv")
print("Downloaded successfully")

# ── List objects with a prefix ──
blobs = client.list_blobs("my-data-bucket", prefix="uploads/2026/")
for b in blobs:
    print(f"  {b.name} ({b.size} bytes, {b.storage_class})")
```
Terraform — Secure GCS Bucket
```hcl
# terraform/gcs.tf — Production GCS bucket with lifecycle and access control

resource "google_storage_bucket" "data_lake" {
  name          = "data-lake-${var.project_id}"
  location      = "US"
  storage_class = "STANDARD"

  # Enable Autoclass to automatically optimize storage costs
  autoclass {
    enabled = true
  }

  # Prevent accidental deletion
  force_destroy = false

  # Enable uniform bucket-level access (recommended over legacy ACLs)
  uniform_bucket_level_access = true

  # Versioning for data recovery
  versioning {
    enabled = true
  }

  # Lifecycle rule: delete old versions after 90 days
  lifecycle_rule {
    condition {
      age        = 90
      with_state = "ARCHIVED"
    }
    action {
      type = "Delete"
    }
  }

  # Block public access
  public_access_prevention = "enforced"
}
```
BigQuery
BigQuery is Google's "killer app" — the service that differentiates GCP from every other cloud. It's a fully managed, serverless, petabyte-scale data warehouse with built-in ML, geospatial analysis, and BI Engine for caching. There are no indexes to tune, no clusters to manage, and no vacuum operations. You write SQL, and Google handles the rest.
Architecture: Columnar & Serverless
BigQuery separates storage and compute. Data is stored in Google's Capacitor columnar format on Colossus (Google's distributed file system). Queries are executed by Dremel, a multi-tenant execution engine that distributes work across thousands of workers. This separation means:
- Storage scales independently — you only pay for data at rest, and it's cheap ($0.02/GB/month, dropping to $0.01 after 90 days of no edits).
- Compute scales on demand — Dremel allocates slots (units of compute) dynamically. No cluster sizing.
- Columnar format — queries only read the columns referenced in your SELECT statement. A query on 3 columns of a 200-column table reads ~1.5% of the data.
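The "~1.5% of the data" figure above is just columns-read arithmetic. A simplistic sketch, assuming equal bytes per column (real Capacitor columns vary with type and compression, so treat this as a first-order estimate); the $6.25/TB on-demand rate is the one used elsewhere in this guide:

```python
# Fraction of table bytes a columnar query reads, assuming equal-width columns.
def scan_fraction(cols_selected: int, total_cols: int) -> float:
    return cols_selected / total_cols

PRICE_PER_TB = 6.25  # on-demand analysis rate
table_tb = 10

select_star_cost = table_tb * PRICE_PER_TB                        # full scan
three_col_cost = table_tb * scan_fraction(3, 200) * PRICE_PER_TB  # 3 of 200 cols

print(f"3 of 200 columns → {scan_fraction(3, 200):.1%} of bytes scanned")
print(f"SELECT * ${select_star_cost:.2f} vs 3 columns ${three_col_cost:.2f}")
```

This is the mechanical reason "never SELECT *" appears in every BigQuery cost guide: column selection is not a style preference, it is the billing unit.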
Google has cited figures of over 110 TB of data per second processed across its fleet, with a 1 PB table scannable in under 30 seconds. No other cloud service at any price point matches this throughput for ad-hoc analytics; AWS Redshift Serverless and Azure Synapse are the closest competitors, but both require more tuning and have steeper cost curves at scale.
Pricing Models
On-demand analysis is billed at $6.25 per TB scanned — a SELECT * on a 10 TB table costs ~$62.50 per query. Use bq query --dry_run to preview cost before executing. Other cost drivers — Slots (capacity pricing): slot-hours consumed × edition rate. Storage: $0.02/GB/month active, $0.01/GB/month long-term (90+ days unmodified). Streaming inserts: $0.01 per 200 MB.
Console Navigation
Python — Query a Public Dataset
```python
# bigquery_demo.py — Query a public dataset using the BigQuery Python client
from google.cloud import bigquery

# Client uses Application Default Credentials
client = bigquery.Client()

# ── Query the public GitHub dataset ──
# This scans ~6 GB on-demand = ~$0.04
query = """
    SELECT
      language.name AS language,
      COUNT(*) AS repo_count
    FROM `bigquery-public-data.github_repos.languages`,
      UNNEST(language) AS language
    GROUP BY language
    ORDER BY repo_count DESC
    LIMIT 20
"""

# Use QueryJobConfig for cost control
job_config = bigquery.QueryJobConfig(
    # Set a maximum bytes billed to prevent runaway costs
    maximum_bytes_billed=10 * 1024 ** 3,  # 10 GB limit
    # Use standard SQL (default, but explicit is good)
    use_legacy_sql=False,
)

# Dry run first to check cost
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_legacy_sql=False)
dry_run_job = client.query(query, job_config=dry_run_config)
mb_scanned = dry_run_job.total_bytes_processed / (1024 ** 2)
print(f"This query will scan {mb_scanned:.1f} MB")

# Execute the actual query
query_job = client.query(query, job_config=job_config)
results = query_job.result()

print("\nTop programming languages on GitHub:")
for row in results:
    print(f"  {row.language:20} {row.repo_count:>12,} repos")

print(f"\nTotal bytes billed: {query_job.total_bytes_billed:,}")
```
Always set maximum_bytes_billed. It acts as a safety net — if the query would scan more than the limit, BigQuery rejects it instead of running. This single line of code can prevent a $500 mistake.
Cost Optimization Strategies
- Partition tables by date/timestamp — queries that filter on the partition column only scan relevant partitions.
- Cluster tables by frequently filtered columns (up to 4).
- Never use SELECT * — always specify columns.
- Use materialized views for repeated queries.
- Enable BI Engine for sub-second dashboard queries (cached in memory).
- Set per-user and per-project query quotas to prevent accidental cost spikes.
Terraform — BigQuery Dataset and Table
```hcl
# terraform/bigquery.tf — Partitioned and clustered BigQuery table

resource "google_bigquery_dataset" "analytics" {
  dataset_id = "analytics"
  location   = "US"

  # Default table expiration: 180 days (auto-cleanup for temp data)
  default_table_expiration_ms = 15552000000

  labels = {
    environment = "production"
    managed_by  = "terraform"
  }
}

resource "google_bigquery_table" "events" {
  dataset_id = google_bigquery_dataset.analytics.dataset_id
  table_id   = "events"

  # Partition by ingestion time (or a specific TIMESTAMP/DATE column)
  time_partitioning {
    type          = "DAY"
    field         = "event_timestamp"
    expiration_ms = 7776000000 # 90-day partition expiry
  }

  # Cluster by frequently filtered columns — up to 4
  clustering = ["user_id", "event_type"]

  schema = jsonencode([
    { name = "event_id", type = "STRING", mode = "REQUIRED" },
    { name = "event_timestamp", type = "TIMESTAMP", mode = "REQUIRED" },
    { name = "user_id", type = "STRING", mode = "REQUIRED" },
    { name = "event_type", type = "STRING", mode = "REQUIRED" },
    { name = "properties", type = "JSON", mode = "NULLABLE" },
  ])

  labels = {
    environment = "production"
  }
}
```
Cloud SQL & Firestore
GCP offers both managed relational databases (Cloud SQL) and a serverless NoSQL document store (Firestore). Choose based on your data model — if you need joins, transactions, and schemas, use Cloud SQL. If you need flexible documents with real-time sync, use Firestore.
Cloud SQL
Cloud SQL is a fully managed service for MySQL, PostgreSQL, and SQL Server. Google handles replication, backups, encryption, and patching. It supports High Availability with automatic failover (regional instance with a standby in another zone).
| Feature | Cloud SQL | AlloyDB (Premium) |
|---|---|---|
| Engine | MySQL, PostgreSQL, SQL Server | PostgreSQL-compatible only |
| Performance | Standard managed DB | 4x faster than standard PostgreSQL (Google claims) |
| HA | Regional with zonal failover | Regional with <1 sec failover |
| Best for | Standard OLTP workloads, lift-and-shift | High-performance OLTP, hybrid OLAP/OLTP |
| Pricing | vCPU/hour + storage/GB | vCPU/hour + storage/GB (higher base) |
Firestore
Firestore is a serverless, NoSQL document database with real-time syncing and offline support. Documents are organized in collections, and queries are indexed automatically. Firestore operates in two modes:
- Native mode: Full Firestore features including real-time listeners, offline cache, and mobile SDK support. Best for web/mobile apps.
- Datastore mode: Backward-compatible with the legacy Datastore API. No real-time features, but supports server-side workloads with higher throughput.
Terraform — Cloud SQL with HA
```hcl
# terraform/cloud_sql.tf — Cloud SQL PostgreSQL with HA and private IP

resource "google_sql_database_instance" "postgres" {
  name             = "prod-postgres"
  database_version = "POSTGRES_16"
  region           = "us-central1"

  settings {
    tier = "db-custom-4-16384" # 4 vCPU, 16 GB RAM

    # High Availability — automatic failover to another zone
    availability_type = "REGIONAL"

    # Disk configuration
    disk_size       = 100 # GB
    disk_type       = "PD_SSD"
    disk_autoresize = true

    # Private IP only — no public exposure
    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.vpc.id
    }

    # Automated backups
    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "03:00"
      transaction_log_retention_days = 7
      backup_retention_settings {
        retained_backups = 14
      }
    }

    # Maintenance window — Sunday 4AM
    maintenance_window {
      day          = 7
      hour         = 4
      update_track = "stable"
    }
  }

  deletion_protection = true
}
```
Module 3: Networking & IAM
GCP's networking model is fundamentally different from AWS and Azure. VPCs are global, subnets are regional, and firewall rules are centralized. The IAM hierarchy (Organization → Folders → Projects) is how Google expects you to model your organization.
VPC & Shared VPC
A GCP Virtual Private Cloud (VPC) is a global resource. Unlike AWS (where a VPC is regional) or Azure (where a VNet is regional), a single GCP VPC spans all regions. Subnets, however, are regional — each subnet belongs to one region and one VPC.
GCP VPC vs AWS VPC vs Azure VNet
| Feature | GCP VPC | AWS VPC | Azure VNet |
|---|---|---|---|
| Scope | Global | Regional | Regional |
| Subnets | Regional | Zonal (AZ-bound) | Regional |
| Firewall | VPC-level rules with tags/SAs | Security Groups per ENI | NSGs per subnet/NIC |
| Peering | Global (cross-region peering built in) | Regional (cross-region costs extra) | Global |
| Cross-VPC | Shared VPC (centralized) | Transit Gateway | Virtual WAN |
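Because a GCP VPC is one global address space carved into regional subnets, CIDR planning is where mistakes happen. A minimal sketch using only the stdlib `ipaddress` module — the subnet names and ranges are illustrative examples, not required values:

```python
import ipaddress

# Candidate regional subnets carved from one global VPC (example ranges).
subnets = {
    "public-us-central1":  "10.10.0.0/24",
    "private-us-central1": "10.20.0.0/24",
    "gke-pods":            "10.100.0.0/16",
    "gke-services":        "10.200.0.0/20",
}

def find_overlaps(ranges: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named CIDR ranges that overlap."""
    nets = [(name, ipaddress.ip_network(cidr)) for name, cidr in ranges.items()]
    return [
        (a, b)
        for i, (a, na) in enumerate(nets)
        for b, nb in nets[i + 1:]
        if na.overlaps(nb)
    ]

print(find_overlaps(subnets))  # [] — no conflicts in this plan
```

Running a check like this in CI before `terraform apply` catches the IP-overlap problem that Shared VPC exists to prevent organizationally.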
Shared VPC
Shared VPC lets you designate a host project that owns the VPC and subnets, while service projects use those subnets for their resources (VMs, GKE clusters, etc.). This centralizes network management and security while allowing teams to manage their own compute resources.
- Enterprise standard: If you have more than 2–3 projects, use Shared VPC. It prevents IP overlap, centralizes firewall rules, and gives the networking team a single pane of glass.
- Without Shared VPC: Each project creates its own VPC, leading to IP conflicts, duplicated firewall rules, and peering nightmares.
- Shared VPC is free — there's no reason not to use it.
Console Navigation
Terraform — VPC with Public and Private Subnets
```hcl
# terraform/vpc.tf — Custom VPC with public and private subnets

resource "google_compute_network" "vpc" {
  name                    = "prod-vpc"
  auto_create_subnetworks = false # Custom mode — we define our own subnets
  routing_mode            = "GLOBAL"
}

# ── Public subnet (VMs can have external IPs) ──
resource "google_compute_subnetwork" "public" {
  name          = "public-us-central1"
  ip_cidr_range = "10.10.0.0/24"
  region        = "us-central1"
  network       = google_compute_network.vpc.id

  # Enable Flow Logs for network monitoring
  log_config {
    aggregation_interval = "INTERVAL_5_SEC"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

# ── Private subnet (no external IPs, uses Cloud NAT for outbound) ──
resource "google_compute_subnetwork" "private" {
  name                     = "private-us-central1"
  ip_cidr_range            = "10.20.0.0/24"
  region                   = "us-central1"
  network                  = google_compute_network.vpc.id
  private_ip_google_access = true # Access Google APIs without external IP

  # Secondary ranges for GKE Pods and Services
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.100.0.0/16"
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.200.0.0/20"
  }
}

# ── Cloud NAT for private subnet outbound access ──
resource "google_compute_router" "router" {
  name    = "nat-router"
  region  = "us-central1"
  network = google_compute_network.vpc.id
}

resource "google_compute_router_nat" "nat" {
  name                               = "cloud-nat"
  router                             = google_compute_router.router.name
  region                             = "us-central1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "LIST_OF_SUBNETWORKS"

  subnetwork {
    name                    = google_compute_subnetwork.private.id
    source_ip_ranges_to_nat = ["ALL_IP_RANGES"]
  }
}

# ── Firewall: Allow SSH via IAP (no public SSH port) ──
resource "google_compute_firewall" "allow_iap_ssh" {
  name    = "allow-iap-ssh"
  network = google_compute_network.vpc.id

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }

  # IAP's IP range — only IAP can initiate SSH
  source_ranges = ["35.235.240.0/20"]
  target_tags   = ["allow-ssh"]
}

# ── Firewall: Allow internal traffic between subnets ──
resource "google_compute_firewall" "allow_internal" {
  name    = "allow-internal"
  network = google_compute_network.vpc.id

  allow {
    protocol = "tcp"
    ports    = ["0-65535"]
  }
  allow {
    protocol = "udp"
    ports    = ["0-65535"]
  }
  allow {
    protocol = "icmp"
  }

  source_ranges = ["10.10.0.0/24", "10.20.0.0/24"]
}
```
IAM & Resource Hierarchy
GCP's Identity and Access Management is built around a resource hierarchy. Permissions granted at a higher level are inherited by all children. This is fundamentally different from AWS (where IAM is account-flat) and gives GCP a powerful organizational model.
The GCP Hierarchy
```text
# GCP Resource Hierarchy — permissions flow downward
Organization (example.com)
├── Folder: Engineering
│   ├── Folder: Backend
│   │   ├── Project: backend-prod        ← VMs, GKE, Cloud SQL live here
│   │   └── Project: backend-staging
│   └── Folder: Data
│       ├── Project: data-warehouse-prod ← BigQuery datasets live here
│       └── Project: data-warehouse-dev
├── Folder: Marketing
│   └── Project: marketing-analytics
└── Folder: Shared Services
    ├── Project: shared-networking       ← Shared VPC host project
    └── Project: shared-monitoring
```
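Inheritance down this tree can be modeled in a few lines. This is a toy sketch of the mechanism only — the folder names, role names, and members below are hypothetical, and real IAM also involves deny policies and conditions not modeled here:

```python
# Toy model of IAM policy inheritance: the effective policy at a node is
# the union of bindings on the node and on every ancestor.
HIERARCHY = {  # child -> parent
    "backend-prod": "folder:Backend",
    "backend-staging": "folder:Backend",
    "folder:Backend": "folder:Engineering",
    "folder:Engineering": "org:example.com",
}
BINDINGS = {  # resource -> {(role, member)}
    "org:example.com": {("roles/viewer", "group:sre@example.com")},
    "folder:Backend": {("roles/compute.admin", "group:backend@example.com")},
    "backend-prod": {("roles/cloudsql.client", "sa:app@backend-prod.iam")},
}

def effective_bindings(resource: str) -> set[tuple[str, str]]:
    """Union of bindings on `resource` and all of its ancestors."""
    acc: set[tuple[str, str]] = set()
    node = resource
    while node is not None:
        acc |= BINDINGS.get(node, set())
        node = HIERARCHY.get(node)  # walk up; None at the org root
    return acc

for role, member in sorted(effective_bindings("backend-prod")):
    print(f"{member:35} {role}")
```

The practical consequence: a role granted at the organization or folder level cannot be revoked on a child project, so grant broad roles as high in the tree as you truly intend, and nowhere higher.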
Key IAM Concepts
| Concept | What It Is | Example |
|---|---|---|
| Principal | Who is making the request (user, group, service account) | user:alice@example.com |
| Role | Collection of permissions (predefined or custom) | roles/bigquery.dataViewer |
| Policy Binding | Attaches a role to a principal at a resource level | "Alice gets BigQuery Viewer on project X" |
| Service Account | Identity for applications and VMs (not humans) | my-app@project.iam.gserviceaccount.com |
| Workload Identity | Maps Kubernetes SAs to GCP Service Accounts | GKE pods authenticate as GCP SAs |
Service Accounts — The GCP Way
In GCP, Service Accounts are the primary way applications authenticate. Unlike AWS IAM Users (with Access Keys), GCP Service Accounts use short-lived tokens automatically rotated by the platform. Never download service account key files — use attached service accounts on Compute Engine, GKE (Workload Identity), or Cloud Functions.
Python — Authenticate and List Resources
```python
# iam_demo.py — Authenticate with a Service Account and list resources
import google.auth
from google.auth import impersonated_credentials
from google.cloud import bigquery
from google.cloud import compute_v1

# ── Application Default Credentials (ADC) ──
# On GCE/GKE: automatically uses the attached Service Account
# Locally: uses credentials from `gcloud auth application-default login`
credentials, project = google.auth.default()
print(f"Authenticated as project: {project}")

# ── List Compute Engine instances ──
instance_client = compute_v1.InstancesClient()
request = compute_v1.AggregatedListInstancesRequest(project=project)

print("\nCompute Engine instances:")
for zone, instances_scoped_list in instance_client.aggregated_list(request=request):
    if instances_scoped_list.instances:
        for instance in instances_scoped_list.instances:
            print(f"  {instance.name:30} {instance.status:10} {zone}")

# ── Impersonate a Service Account (no key file needed) ──
target_sa = "data-pipeline@my-project.iam.gserviceaccount.com"
target_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

# Create impersonated credentials — requires iam.serviceAccountTokenCreator role
impersonated_creds = impersonated_credentials.Credentials(
    source_credentials=credentials,
    target_principal=target_sa,
    target_scopes=target_scopes,
    lifetime=3600,  # 1 hour max
)

# Use impersonated credentials with any Google Cloud client
bq_client = bigquery.Client(credentials=impersonated_creds, project=project)
datasets = list(bq_client.list_datasets())
print(f"\nDatasets accessible as {target_sa}: {len(datasets)}")
```
Terraform — IAM Bindings
```hcl
# terraform/iam.tf — Project-level and resource-level IAM bindings

# ── Grant a group BigQuery Data Viewer on the project ──
resource "google_project_iam_member" "bq_viewer" {
  project = var.project_id
  role    = "roles/bigquery.dataViewer"
  member  = "group:data-analysts@example.com"
}

# ── Grant a Service Account storage access on a specific bucket ──
resource "google_storage_bucket_iam_member" "pipeline_writer" {
  bucket = google_storage_bucket.data_lake.name
  role   = "roles/storage.objectCreator"
  member = "serviceAccount:${google_service_account.pipeline_sa.email}"
}

# ── Create a custom role with minimal permissions ──
resource "google_project_iam_custom_role" "log_reader" {
  role_id     = "customLogReader"
  title       = "Custom Log Reader"
  description = "Can read logs but not modify anything"
  permissions = [
    "logging.logEntries.list",
    "logging.logs.list",
    "logging.logServices.list",
  ]
}
```
Cloud Identity-Aware Proxy (IAP)
IAP lets you control access to your web applications and VMs without a VPN. It acts as a reverse proxy that verifies the user's identity (via Google Sign-In or external IdP) and checks IAM permissions before forwarding the request. This is Google's implementation of the BeyondCorp zero-trust security model.
How IAP Works
Use Cases
- Internal dashboards: Protect Grafana, admin panels, or internal tools without exposing them to the internet or requiring VPN access.
- SSH/RDP via IAP tunnels: Replace bastion hosts. Use gcloud compute ssh --tunnel-through-iap to SSH into private VMs through IAP without any external IP.
- Context-aware access: Combine with Access Context Manager to require device posture (managed device, screen lock enabled) before granting access.
Run gcloud compute ssh --tunnel-through-iap VM_NAME to reach a private VM. Traffic is encrypted end-to-end, access is IAM-controlled, and there's no public IP to expose. IAP TCP tunneling is free.
Terraform — IAP-Protected Backend
```hcl
# terraform/iap.tf — Enable IAP on a backend service

# Enable IAP on the backend service
resource "google_iap_web_backend_service_iam_member" "access" {
  web_backend_service = google_compute_backend_service.app.name
  role                = "roles/iap.httpsResourceAccessor"
  member              = "group:developers@example.com"
}

# IAP OAuth consent — required for web app protection
resource "google_iap_brand" "default" {
  support_email     = "admin@example.com"
  application_title = "Internal Tools"
}

resource "google_iap_client" "default" {
  display_name = "IAP Client"
  brand        = google_iap_brand.default.name
}
```
Module 4: Pricing & Billing Management
GCP's pricing model rewards data-heavy, long-running workloads with automatic discounts (SUD, CUDs). But without visibility, costs can spiral quickly — especially with BigQuery, egress, and over-provisioned VMs. This module covers the tools to estimate, monitor, and control spend.
GCP Pricing Calculator
The Google Cloud Pricing Calculator helps estimate monthly costs for multi-service architectures before deployment. For BigQuery specifically, you need to estimate both storage and analysis costs separately.
Estimating BigQuery Costs (Example)
Scenario: Your team runs 50 queries/day, each scanning an average of 20 GB, on a 5 TB dataset.
| Component | Calculation | Monthly Cost |
|---|---|---|
| Storage (Active) | 5 TB × $0.02/GB = 5,000 GB × $0.02 | $100.00 |
| On-Demand Analysis | 50 queries/day × 20 GB × 30 days = 30 TB × $6.25/TB | $187.50 |
| Free tier offset | First 1 TB/month is free | -$6.25 |
| Total | | $281.25 |
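The table above is reproducible in a few lines, which is handy for plugging in your own query volumes. A minimal sketch using the on-demand rates from this module (the free tier is applied to scanned bytes, so the $187.50 gross analysis cost nets to $181.25):

```python
# Reproduce the estimate above: 50 queries/day × 20 GB each, 5 TB stored.
STORAGE_PER_GB = 0.02      # active storage, $/GB/month
ON_DEMAND_PER_TB = 6.25    # analysis, $/TB scanned
FREE_TB_PER_MONTH = 1.0    # on-demand free tier

storage = 5 * 1000 * STORAGE_PER_GB                  # 5 TB = 5,000 GB → $100.00
scanned_tb = 50 * 20 / 1000 * 30                     # 30 TB scanned per month
analysis = max(scanned_tb - FREE_TB_PER_MONTH, 0) * ON_DEMAND_PER_TB  # $181.25

print(f"storage ${storage:.2f} + analysis ${analysis:.2f} "
      f"= ${storage + analysis:.2f}/month")
```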
Console Navigation
Hidden Costs Checklist
- Cloud NAT: $0.045/hour = ~$32/month per gateway, plus per-GB processing fees.
- Load Balancers: $0.025/hour = ~$18/month even with zero traffic.
- Egress: $0.12/GB to the internet (first 1 GB/month free).
- Static IPs: $0.01/hour when not attached.
- Persistent Disk snapshots: charged per GB stored.
- Log ingestion (Cloud Logging): first 50 GB/month free, then $0.50/GB.
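These hidden costs add up to a fixed monthly floor before any real traffic flows. A minimal sketch totaling the checklist rates above (730 hours/month average; the 50 GB egress figure is an illustrative assumption):

```python
# Rough idle-infrastructure floor from the hidden-costs checklist above.
HOURS = 730  # average hours in a month

fixed = {
    "cloud_nat_gateway": 0.045 * HOURS,  # ~$32.85, before per-GB processing
    "load_balancer":     0.025 * HOURS,  # ~$18.25, even with zero traffic
    "idle_static_ip":    0.010 * HOURS,  # ~$7.30 when unattached
}

egress_gb = 50                            # assumed monthly internet egress
egress = max(egress_gb - 1, 0) * 0.12     # first 1 GB/month is free

total = sum(fixed.values()) + egress
print(f"Idle-infrastructure floor: ${total:.2f}/month")
```

Log ingestion and snapshot storage scale with usage rather than time, so they are omitted from the fixed floor here.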
Billing Export to BigQuery
GCP's most powerful cost analysis feature is Billing Export to BigQuery. Once enabled, every line item on your invoice is exported to a BigQuery table in near-real-time. You can then run SQL queries to find cost anomalies, break down spend by project/service/label, and build custom dashboards.
Setup
SQL — Top 10 Costliest Services This Month
```sql
-- billing_analysis.sql — Find top cost drivers in your billing export
SELECT
  service.description AS service_name,
  SUM(cost) + SUM(IFNULL(
    (SELECT SUM(c.amount) FROM UNNEST(credits) c), 0
  )) AS net_cost
FROM `my-billing-project.billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE invoice.month = FORMAT_DATE('%Y%m', CURRENT_DATE())
GROUP BY service_name
ORDER BY net_cost DESC
LIMIT 10
```
SQL — Daily Spend by Project (for Anomaly Detection)
```sql
-- daily_by_project.sql — Track spend trends per project per day
SELECT
  DATE(usage_start_time) AS usage_date,
  project.id AS project_id,
  SUM(cost) AS daily_cost
FROM `my-billing-project.billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY usage_date, project_id
ORDER BY usage_date DESC, daily_cost DESC
```
Label every resource with team, environment, and cost_center labels — the billing export includes labels, making per-team chargeback trivial.
Quotas
Every GCP API has quotas — limits on how many requests per minute, how many resources per project, or how much capacity you can use. Quotas exist to protect both Google and you (preventing a runaway script from creating 10,000 VMs). Unlike AWS service limits, GCP quotas are granular and can block you silently.
Common Quota Traps
| Quota | Default Limit | What Happens |
|---|---|---|
| CPUs per region | 24 (new projects) | Cannot create VMs — API returns QUOTA_EXCEEDED |
| GKE nodes per zone | 1,000 | Node pool can't scale |
| BigQuery concurrent slots | 2,000 (on-demand) | Queries queue and slow down |
| Cloud Functions per region | 1,000 | Deployment fails |
| External IP addresses | 8 per region | Cannot attach public IPs |
How to Request Quota Increases
In the Cloud Console, open IAM & Admin → Quotas, filter to the quota you need, select it, and click Edit Quotas to submit a request with a brief justification. Small increases are often granted automatically; larger ones are reviewed and can take a couple of business days, so request headroom before a launch, not during one.
Python — Check Quotas Programmatically
```python
# check_quotas.py — List quotas and usage for a project/region
from google.cloud import compute_v1

client = compute_v1.RegionsClient()
project = "my-gcp-project-id"
region = "us-central1"

region_info = client.get(project=project, region=region)
print(f"Quotas for {region}:")
for quota in region_info.quotas:
    usage_pct = (quota.usage / quota.limit * 100) if quota.limit > 0 else 0
    if usage_pct > 50:  # Only show quotas above 50% usage
        print(f"  ⚠ {quota.metric:35} {quota.usage:.0f}/{quota.limit:.0f} ({usage_pct:.0f}%)")
```
Module 5: Common Pitfalls
Every cloud platform has traps. GCP's are unique because of its global VPC model, BigQuery's scan-based pricing, and the ease of creating new projects. This module covers the mistakes that cost real money and create real security incidents.
Default VPC Rules — The Silent Security Risk
Every GCP project comes with a default VPC and two dangerously permissive firewall rules:
| Rule Name | What It Allows | Why It's Dangerous |
|---|---|---|
| default-allow-internal | All TCP, UDP, ICMP within 10.128.0.0/9 | Any VM can talk to any other VM across all subnets in the VPC — no segmentation |
| default-allow-ssh | TCP:22 from 0.0.0.0/0 | SSH is open to the entire internet |
| default-allow-rdp | TCP:3389 from 0.0.0.0/0 | RDP is open to the entire internet |
| default-allow-icmp | ICMP from 0.0.0.0/0 | Enables reconnaissance via ping sweeps |
The default-allow-ssh rule alone means every VM you create has SSH open to the internet by default. This is the GCP equivalent of leaving your front door open. Always create a custom VPC with explicit, least-privilege firewall rules.

Fix: Delete Default VPC Firewall Rules
```bash
# Delete the dangerous default rules (run once per project)
gcloud compute firewall-rules delete default-allow-ssh \
    --project=my-gcp-project-id --quiet
gcloud compute firewall-rules delete default-allow-rdp \
    --project=my-gcp-project-id --quiet
gcloud compute firewall-rules delete default-allow-icmp \
    --project=my-gcp-project-id --quiet

# Or better: delete the entire default VPC and create a custom one
gcloud compute networks delete default \
    --project=my-gcp-project-id --quiet
```
Terraform — Organization Policy to Block Default Network
```hcl
# terraform/org_policy.tf — Prevent default VPC creation in all new projects
resource "google_organization_policy" "skip_default_network" {
  org_id     = var.org_id
  constraint = "constraints/compute.skipDefaultNetworkCreation"

  boolean_policy {
    enforced = true
  }
}

# This ensures every new project starts with NO default VPC.
# Teams must create custom VPCs with proper firewall rules.
```
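Beyond prevention, existing projects should be audited for world-open rules. The sketch below operates on plain dicts shaped like Compute Engine firewall resources; the `sourceRanges`/`allowed` field names mirror the API, but the helper itself is illustrative (in practice you would fetch live rules with `compute_v1.FirewallsClient`).

```python
# firewall_audit.py — Flag firewall rules open to the whole internet.
# Works on plain dicts shaped like Compute Engine firewall resources;
# in a real audit you would fetch them per project with
# google.cloud.compute_v1.FirewallsClient().list().
def find_open_rules(rules):
    """Return names of ingress allow-rules reachable from 0.0.0.0/0."""
    open_rules = []
    for rule in rules:
        is_ingress = rule.get("direction", "INGRESS") == "INGRESS"
        allows = bool(rule.get("allowed"))  # deny rules have no "allowed"
        world_open = "0.0.0.0/0" in rule.get("sourceRanges", [])
        if is_ingress and allows and world_open:
            open_rules.append(rule["name"])
    return open_rules

if __name__ == "__main__":
    sample = [
        {"name": "default-allow-ssh", "direction": "INGRESS",
         "sourceRanges": ["0.0.0.0/0"],
         "allowed": [{"IPProtocol": "tcp", "ports": ["22"]}]},
        {"name": "allow-internal", "direction": "INGRESS",
         "sourceRanges": ["10.128.0.0/9"],
         "allowed": [{"IPProtocol": "tcp"}]},
    ]
    print(find_open_rules(sample))
```

Run a check like this on a schedule and alert on any non-empty result: the default rules have a way of reappearing in newly created projects.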
BigQuery Query Costs — The $100 SELECT *
BigQuery's on-demand pricing charges $6.25 per TB scanned. A single SELECT * on a large, unpartitioned table can be shockingly expensive. This is the most common cost mistake on GCP.
The Scenario
A developer runs SELECT * FROM events on a 20 TB unpartitioned table to "just look at a few rows." BigQuery scans the entire table — all 20 TB. Cost: 20 × $6.25 = $125.00 for one query. They run it 5 times with small modifications: $625 in an afternoon. With partitioning and a WHERE event_date = '2026-03-31' filter, the same query scans 50 GB: $0.31.
Prevention Strategies
- Always partition tables: Use `time_partitioning` on a date/timestamp column. Queries with partition filters scan only matching partitions.
- Always cluster tables: Cluster by frequently filtered columns (`user_id`, `event_type`). BigQuery skips irrelevant blocks.
- Never use `SELECT *`: Specify only the columns you need. BigQuery is columnar — fewer columns = less data scanned.
- Use `--dry_run`: Preview bytes scanned before executing. In the Console, check the green badge showing "This query will process X GB."
- Set `maximum_bytes_billed`: Add `maximumBytesBilled` to every query config as an automatic guard.
- Set per-user quotas: Limit each user to X TB/day of on-demand scanning.
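The `--dry_run` and `maximum_bytes_billed` strategies combine naturally into a pre-flight guard. In the sketch below, `within_budget` is pure arithmetic, while `dry_run_bytes` shows the `google-cloud-bigquery` dry-run pattern (`QueryJobConfig(dry_run=True)` and `total_bytes_processed` are real API surface; the function names themselves are illustrative):

```python
# dry_run_guard.py — Preview a query's bytes scanned and refuse to run
# it past a byte budget.
def within_budget(estimated_bytes: int, maximum_bytes_billed: int) -> bool:
    """True if a dry-run estimate fits under the byte cap."""
    return estimated_bytes <= maximum_bytes_billed

def dry_run_bytes(sql: str, project: str) -> int:
    """Ask BigQuery how many bytes a query WOULD scan, without running it."""
    # Import kept local so the pure helper above stays dependency-free;
    # requires the google-cloud-bigquery package.
    from google.cloud import bigquery
    client = bigquery.Client(project=project)
    cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=cfg)
    return job.total_bytes_processed
```

A CI hook that dry-runs every committed query and fails the build when `within_budget` returns False catches the $125 `SELECT *` before it ever executes.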
gcloud — BigQuery Per-User Quota
```bash
# Set a custom per-user query quota (not directly supported in Terraform).
# This caps each user's daily on-demand scanning; the value is in MiB,
# so 1,048,576 MiB = 1 TiB/user/day. The command is alpha — verify the
# metric and unit names against the current BigQuery "custom query
# quotas" docs before relying on it.
gcloud alpha services quota update \
    --service=bigquery.googleapis.com \
    --consumer=projects/my-project-id \
    --metric=bigquery.googleapis.com/quota/query/usage \
    --unit=1/d/{project}/{user} \
    --value=1048576

# For programmatic enforcement, use the BigQuery Reservations API
# to set per-project slot caps in capacity mode.
```
Project Proliferation
GCP makes it easy to create projects — too easy. Without governance, teams create ad-hoc projects for experiments, POCs, and one-off demos. These accumulate. Each project may have running resources (VMs, Cloud SQL instances, GKE clusters) that no one monitors. This is how a $5K/month GCP bill becomes $50K/month.
The Problem
| Symptom | Root Cause | Impact |
|---|---|---|
| 50+ projects in the org | No naming convention or folder structure | Impossible to track ownership or costs |
| Projects with no labels | No enforcement of labeling at creation | Cannot attribute costs to teams |
| Projects with no billing budget | No org-level budget policy | Runaway spend goes unnoticed for weeks |
| "Zombie" projects | POC completed but project not deleted | VMs, Cloud SQL, GKE clusters running 24/7 |
Prevention: Centralized Project Factory
Use a Project Factory pattern (via Terraform) to standardize project creation. Every project gets a naming convention, required labels, a billing budget, and folder placement.
```hcl
# terraform/project_factory.tf — Standardized project creation
resource "google_project" "managed" {
  name            = "${var.team}-${var.environment}"
  project_id      = "${var.org_prefix}-${var.team}-${var.environment}"
  folder_id       = var.folder_id
  billing_account = var.billing_account_id

  labels = {
    team        = var.team
    environment = var.environment
    cost_center = var.cost_center
    created_by  = "terraform"
  }
}

# ── Automatically enable required APIs ──
resource "google_project_service" "required_apis" {
  for_each = toset([
    "compute.googleapis.com",
    "container.googleapis.com",
    "bigquery.googleapis.com",
    "logging.googleapis.com",
    "monitoring.googleapis.com",
  ])

  project = google_project.managed.project_id
  service = each.value
}

# ── Create a billing budget for the project ──
resource "google_billing_budget" "project_budget" {
  billing_account = var.billing_account_id
  display_name    = "Budget: ${google_project.managed.name}"

  budget_filter {
    projects = ["projects/${google_project.managed.number}"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = var.monthly_budget
    }
  }

  threshold_rules {
    threshold_percent = 0.5 # Alert at 50%
  }
  threshold_rules {
    threshold_percent = 0.8 # Alert at 80%
  }
  threshold_rules {
    threshold_percent = 1.0 # Alert at 100%
  }
  threshold_rules {
    threshold_percent = 1.5 # Alert at 150% (overspend)
    spend_basis       = "CURRENT_SPEND"
  }

  all_updates_rule {
    monitoring_notification_channels = [var.notification_channel_id]
  }
}
```
1. Use the Terraform Project Factory for all project creation — no ad-hoc Console or gcloud projects.
2. Enforce required labels via Organization Policy.
3. Attach a billing budget to every project at creation.
4. Quarterly audit: list all projects, check for running resources, delete zombies.
5. Use a Folder structure (Engineering, Data, Shared) to group projects logically.
6. Enable the Recommender API to surface idle VMs, overprovisioned instances, and unused IPs.
Run `gcloud projects list --filter="lifecycleState=ACTIVE"` monthly. Cross-reference the result with the billing export to find projects with non-zero spend but no recent code deployments. These are your "zombie" projects.
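The quarterly zombie audit can be semi-automated. A sketch of the cross-reference step, with all three example inputs illustrative; in practice they would come from `gcloud projects list`, the billing export query, and your deployment logs.

```python
# zombie_projects.py — Cross-reference active projects with billing spend.
# A sketch: `active_projects` would come from `gcloud projects list` (or
# the Resource Manager API), `spend_by_project` from the billing export,
# and `recently_deployed` from your CI/CD or deploy logs.
def find_zombie_candidates(active_projects, spend_by_project,
                           recently_deployed, min_monthly_spend=10.0):
    """Projects that are active and spending money but have seen no
    recent deployments — prime candidates for a zombie audit."""
    return sorted(
        p for p in active_projects
        if spend_by_project.get(p, 0.0) >= min_monthly_spend
        and p not in recently_deployed
    )

if __name__ == "__main__":
    active = ["data-prod", "poc-recsys", "demo-2023"]
    spend = {"data-prod": 8200.0, "poc-recsys": 640.0, "demo-2023": 75.0}
    deployed = {"data-prod"}
    print(find_zombie_candidates(active, spend, deployed))
```

Anything this flags deserves a human look before deletion, but the list turns a vague "audit everything" task into a short, prioritized queue.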