Atsushi2022の日記

I post articles related to data engineering.

Learnings on GCP DevOps

What is the content?

A memorandum of my learnings for the GCP DevOps certification.

SRE (Site Reliability Engineering)

DevOps has five goals.

  1. Reduce organizational silos
  2. Accept failure as normal
  3. Implement gradual changes
  4. Leverage tooling and automation
  5. Measure everything

SRE is a way to realize DevOps.

  • Blameless Postmortem
  • Error Budgets
  • Eliminating Toil

Metrics

  • SLO
    • Internal service-level target. It's not a commitment to external parties.
    • SLO + Error budget = 100% (e.g., a 99.9% SLO leaves a 0.1% error budget, about 43 minutes per 30-day month).
  • SLI
    • A metric that indicates how well the service is meeting its SLO.
  • SLA
    • A service-level commitment to external parties.

CI/CD pipeline

Cloud Run deployment

  • Scenario
    • Once you push source code to the repository, Cloud Build builds a Docker image, pushes it to Container/Artifact Registry, and deploys it to Cloud Run.
  • Configuration
    • Create a trigger that builds the Docker image, pushes it to Container/Artifact Registry, and deploys it to Cloud Run whenever source code is pushed to the repository.
steps:
# Build the container image
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/PROJECT_ID/IMAGE', '.']
# Push the container image to Container Registry
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/PROJECT_ID/IMAGE']
# Deploy container image to Cloud Run
- name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
  entrypoint: gcloud
  args: ['run', 'deploy', 'SERVICE-NAME', '--image', 'gcr.io/PROJECT_ID/IMAGE', '--region', 'REGION']
images:
- gcr.io/PROJECT_ID/IMAGE
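
The trigger itself can be created in the console or with gcloud. A minimal sketch, assuming the source lives in Cloud Source Repositories and the config above is saved as cloudbuild.yaml (the repository name is hypothetical):

gcloud beta builds triggers create cloud-source-repositories \
    --repo=my-repo \
    --branch-pattern="^main$" \
    --build-config=cloudbuild.yaml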

Container Registry vs. Artifact Registry

Container Registry is the older service and offers limited functionality compared to Artifact Registry.

Only container images can be stored in Container Registry. On the other hand, Artifact Registry can store container images as well as Maven, npm, and Python packages, and RPM and Debian packages.

You can designate the storage region for Artifact Registry, but not for Container Registry.

Access control for Container Registry is very limited, whereas you can control access to Artifact Registry in detail.

Container Scanning

To find vulnerabilities in the image, you can scan the container manually or automatically.

You can run the following command to check for vulnerabilities in your image.

gcloud artifacts docker images scan <Image URI> --remote

Or, you can simply turn on vulnerability scanning in the settings of Container Registry or Artifact Registry.

Once you turn it on, the image will be scanned automatically whenever you upload it to Container Registry or Artifact Registry.
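
As a sketch, automatic scanning is turned on by enabling the Container Scanning API for the project:

gcloud services enable containerscanning.googleapis.com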

Secure Image

Use the images provided on the Google Cloud Marketplace.

Those images are developed by Google and used in its own operations.

Google maintains these images to keep them free of known vulnerabilities.

Binary Authorization

This is a way to prevent untrusted container deployment on GKE and Cloud Run.

It ensures that only trusted container images are deployed on GKE or Cloud Run.

First, you should turn on the Binary Authorization API.

Then, edit the Binary Authorization policy. For instance, you can allow all images or block all images from being deployed.

You can also allow only images that have been verified by attestors.
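
A sketch of a policy that requires attestations, imported with gcloud container binauthz policy import policy.yaml (the project and attestor names are placeholders):

defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
  - projects/PROJECT_ID/attestors/ATTESTOR_NAME
globalPolicyEvaluationMode: ENABLE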

An attestor can sign the image with the following command.

gcloud beta container binauthz attestations sign-and-create --artifact-url=ARTIFACT_URL \
    --attestor=ATTESTOR \
    --attestor-project=ATTESTOR_PROJECT \
    --keyversion=KEYVERSION \
    --keyversion-key=KEYVERSION_KEY \
    --keyversion-keyring=KEYVERSION_KEYRING \
    --keyversion-location=KEYVERSION_LOCATION \
    --keyversion-project=KEYVERSION_PROJECT

Cloud Monitoring

Monitor resources under multiple projects

By default, you can monitor the resources in your own project.

Once you add other projects to the metrics scope in Cloud Monitoring, you can monitor those projects as well.

A scoping project is a hub project for monitoring all resources across multiple projects.

Google recommends creating a new scoping project that doesn't contain any resources.

That way, you can simplify the management of privileges and monitored metrics.

If someone has the privilege to view resources only in the scoping project, that person can monitor both the scoping project and the monitored projects.

Dashboard

A monitoring dashboard is created once a resource is created.

For instance, once a virtual machine instance gets created, the VM Instances dashboard is automatically created.

There are pre-defined dashboards such as Cloud Functions, MySQL, Nginx, AWS, and Azure in the sample library.

You can also build a custom dashboard from scratch by choosing charts from the selection.

Ops agent

You can install the Ops Agent on virtual machine instances to collect more metrics.

For instance, the disk utilization percentage is collected.

You can find the full list of metrics collected by the Ops Agent at the link below.

https://cloud.google.com/monitoring/api/metrics_opsagent

Uptime checks

An uptime check repeatedly makes an HTTP/HTTPS request to a server to check that the application is running.

You can issue the test requests from multiple regions.

Alerting

You can create alerts in case:

  • a metric exceeds a threshold
  • an uptime check fails
  • a process fails or stops

Resource grouping

You can create a group to bundle related resources.

For instance, you can create a VM group based on the instance name, e.g., all instances whose name contains "prd".

You can then use the group in alerting policies.

Cloud Logging

It's a fully managed service and requires no setup.

You can query logs and download them in CSV or JSON format.

Cloud Audit Logging

There are four types of audit logs.

  • Admin activity log
    • Enabled by default
    • 400-day retention
    • Free
    • e.g., creation or deletion of a VM
  • System event log
    • Enabled by default
    • 400-day retention
    • Free
    • e.g., operations performed by Google Cloud
  • Data access log
    • Not enabled by default
    • 30-day retention
    • Not free
    • e.g., data access to Cloud Storage
  • Policy denied log
    • Enabled by default
    • Records requests denied by Google Cloud services
    • 30-day retention
    • Not free
    • Cannot be disabled, but can be filtered

You can query logs by log name in the Cloud Logging Logs Explorer.
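
For example, Admin Activity audit logs can be queried with a filter like the following (PROJECT_ID is a placeholder):

logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"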

Data Access Log

You need to enable Data Access logs for each service individually.

Go to IAM & Admin -> Audit Logs and find the target service.

You can choose the types of logs to be collected, i.e., Admin Read, Data Read, and Data Write.
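
Under the hood, this setting lives in the project's IAM policy. A sketch of the resulting auditConfigs section, taking Cloud Storage as an example:

auditConfigs:
- service: storage.googleapis.com
  auditLogConfigs:
  - logType: DATA_READ
  - logType: DATA_WRITE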

Log Collection

  • Cloud Run, GKE, App Engine
    • Logs are automatically collected
  • Compute Engine
    • Ops agent collects logs
  • gcloud SDK
  • Cloud Logging API

Log collection with gcloud SDK

You can write and read logs with the gcloud SDK as follows.

gcloud logging write my-test-log "A simple entry."
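
To read the entries back, something like the following should work (PROJECT_ID is a placeholder):

gcloud logging read "logName=projects/PROJECT_ID/logs/my-test-log" --limit=10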

You can find more details at the following link.

https://cloud.google.com/logging/docs/write-query-log-entries-gcloud?hl=ja#write-log-entries-using-sdk

Log collection with Ops agent

https://cloud.google.com/logging/docs/agent/ops-agent/configuration?hl=ja

https://blog.g-gen.co.jp/entry/opsagent-windows

By default, it collects:

  • Linux: /var/log/syslog
  • Linux: /var/log/messages
  • Windows: Event Log

Ops agent supports collecting logs from third-party applications such as MySQL, Nginx, and so on.

The Ops Agent can also collect custom logs from designated log file paths.

In this case, you should edit the following config file.

  • Linux: /etc/google-cloud-ops-agent/config.yaml
  • Windows: C:\Program Files\Google\Cloud Operations\Ops Agent\config\config.yaml

config.yaml should look like the following.

logging:
  receivers:
    syslog: # This is the name of the receiver
      type: files
      include_paths:
      - /var/log/messages
      - /var/log/syslog
  service:
    pipelines:
      default_pipeline:
        receivers: [syslog] # designating the receiver

receivers configures log collection, such as the file paths to read.

processors configures log transformation, such as parsing with regular expressions.

pipelines connects receivers to processors.
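
A sketch of a config that adds a processor (the application name, paths, and regex are hypothetical):

logging:
  receivers:
    myapp: # reads the application's log files
      type: files
      include_paths:
      - /var/log/myapp/*.log
  processors:
    parse_severity: # extracts fields via named capture groups
      type: parse_regex
      field: message
      regex: "^(?<severity>\\w+): (?<message>.*)$"
  service:
    pipelines:
      myapp_pipeline:
        receivers: [myapp]
        processors: [parse_severity]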

Log-based Metrics

There are two types of log-based metrics.

  • Counter
    • the number of log entries matching a given filter
  • Distribution
    • collecting numeric data from log entries matching a given filter
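
As a sketch, a counter metric could be created with gcloud (the metric name and filter are hypothetical):

gcloud logging metrics create error_count \
    --description="Number of ERROR-level log entries" \
    --log-filter="severity>=ERROR"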

Log Router

  • _Required
    • 400 days retention
    • Not configurable
  • _Default
    • 30 days retention
    • Configurable
  • User-defined
    • 30 days retention
    • Configurable

Sink destinations for user-defined log routing

  • Log Storage
  • Cloud Storage
  • BigQuery
  • Pub/Sub
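
A sketch of creating a sink that routes warnings and above to BigQuery (the sink, project, and dataset names are hypothetical); note that the sink's writer identity then needs write access to the destination dataset:

gcloud logging sinks create my-bq-sink \
    bigquery.googleapis.com/projects/PROJECT_ID/datasets/my_dataset \
    --log-filter="severity>=WARNING"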

Cloud Error Reporting

Cloud Error Reporting collects errors happening in cloud services.

It helps you locate the cause of an error quickly.

Snapshot Debugger

The Snapshot Debugger lets you inspect the state of a running cloud application, at any code location, without stopping or slowing it down.

Cloud Trace

Cloud Trace is a distributed tracing system that collects latency data from your applications and displays it in the Google Cloud Console.

Cloud Trace can capture traces from all of your VMs, containers, or App Engine projects.

Cloud Profiler

Continuous profiling of production systems is an effective way to discover where resources like CPU cycles and memory are consumed as a service operates in its working environment.

But profiling adds an additional load on the production system: in order to be an acceptable way to discover patterns of resource consumption, the additional load of profiling must be small.

Fluentd DaemonSet deployment on GKE

https://cloud.google.com/architecture/customizing-stackdriver-logs-fluentd?hl=ja

IAM

roles/logging.viewer (Logs Viewer) gives you read-only access to all features of Logging, except Data Access audit logs.

roles/logging.privateLogViewer (Private Logs Viewer) additionally gives you read access to Data Access audit logs.

roles/logging.logWriter provides the permissions to write log entries.

roles/logging.configWriter provides permissions to read and write the configurations of logs-based metrics and sinks for exporting logs.

https://cloud.google.com/logging/docs/access-control?hl=ja

Logging Access Control on GKE

https://cloud.google.com/stackdriver/docs/solutions/gke/managing-logs?hl=ja

Applications need permission to write logs to Cloud Logging, which is granted by assigning the IAM role roles/logging.logWriter to the service account attached to the underlying node pool.

GKE Cluster Autoscale

Once you enable cluster autoscaling on GKE, the cluster autoscaler adds nodes to the node pool when pods cannot be scheduled due to insufficient resources, and removes underutilized nodes.
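
A sketch of enabling it on an existing node pool (the cluster and pool names are hypothetical):

gcloud container clusters update my-cluster \
    --enable-autoscaling \
    --node-pool=default-pool \
    --min-nodes=1 --max-nodes=5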

Third party platform / OSS

  • Spinnaker
  • OpenTelemetry

https://cloud.google.com/learn/what-is-opentelemetry?hl=ja

https://logicmonitor.saaspresto.jp/blog/opentelemetry/

Incident Response

  • Incident Commander
  • Operations Lead
  • Communications Lead

General Glossary

  • MTTD (Mean time to detect)
  • MTTR (Mean time to recover)
  • MTBF (Mean time between failures)

Cloud Monitoring Workspace

This is the former name of Scoping Project.

MetricKind

https://cloud.google.com/monitoring/api/v3/kinds-and-types?hl=ja

https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.metricDescriptors#MetricKind

  • GAUGE
    • An instantaneous measurement of a value.
  • DELTA
    • The change in a value during a time interval.
  • CUMULATIVE
    • A value accumulated over a time interval. Cumulative measurements in a time series should have the same start time and increasing end times, until an event resets the cumulative value to zero and sets a new start time for the following points.

Cloud Billing

You can associate projects with billing accounts and create your own billing reports.

It provides alerts based on a pre-defined budget.

It can export charges to BigQuery.

Preemptible Virtual Machine

  • Lifespan is at most 24 hours, but it's cheaper
  • A good choice when you host fault-tolerant workloads on the VM
  • Up to 80% discount
  • Not always available
  • Google gives you a 30-second warning before shutting it down
  • You get a preemptible VM by choosing the spot provisioning model at VM creation, as sketched below
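
A sketch of creating such a VM (the instance name and zone are hypothetical):

gcloud compute instances create my-vm \
    --zone=us-central1-a \
    --provisioning-model=SPOT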

Flat-rate, Sustained use discount, Committed use discount

  • Flat rate
    • No discount
    • Charged as per usage
  • Sustained use discounts
    • If you run an N1/N2 machine (VM instances and GKE nodes) for a large part of the month, you get a discount that grows with usage time.
    • The discount applies once usage exceeds 25% of the month.
  • Committed use discounts
    • You commit to using VMs for 1 or 3 years
    • Up to 70% discount
    • For GKE and VM instances
    • The commitment cannot be cancelled

Total Cost of Operations

Total Cost of Operations = Purchase Cost of Asset + Cost of Operation

Deployment methods

Blue/Green deployment

Prepare a live server and a staging server.

The staging server is upgraded to v2.0. Then, the staging server is switched to become the live server.

Rolling deployment

There are multiple servers.

One server is upgraded first.

Then the remaining servers are upgraded gradually, one after another.

Canary deployment

There are multiple servers.

Some of the servers are upgraded first.

If there are no problems on those upgraded servers, the rest of the servers are upgraded.

Traffic splitting deployment

A small percentage of users access the upgraded servers.

If those users experience no problems, all users are redirected to the upgraded servers.
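
On Cloud Run, for instance, traffic splitting between revisions can be sketched like this (the service and revision names are hypothetical):

gcloud run services update-traffic my-service \
    --to-revisions=my-service-v2=10 \
    --region=us-central1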

GKE

https://googleblog.g-gen.co.jp/entry/gke-explained

There are two types of clusters. Google recommends using Autopilot clusters.

  • Autopilot cluster
    • Google manages the worker nodes as well as the control plane.
    • It automatically scales and recovers the worker nodes and applies security patches to them.
    • You are charged only for the compute resources requested by your workloads.
  • Standard cluster
    • Google manages the control plane only.
    • You are charged for the worker nodes themselves, regardless of the compute resources your workloads actually use.

Availability type

There are three types of cluster availability.

  1. Single-zone cluster
    • Control plane and worker nodes are in a single zone
  2. Multi-zone cluster
    • Worker nodes are in multiple zones in a single region, but the control plane is in a single zone
  3. Regional cluster
    • Control plane and worker nodes are in multiple zones in a single region

An Autopilot cluster is always a regional cluster.

The cluster availability type cannot be changed after cluster creation.

Cluster network

https://medium.com/google-cloud-jp/gke-network-basic-8a22be15517d

There are two types of cluster networks.

  • VPC-native cluster
    • Pod IP addresses are routable within the VPC network
    • A subnet route is automatically created
    • Firewall rules can be created for the pod IP address ranges
    • The primary IP address range of the subnet is allocated to the nodes in the cluster
    • Secondary IP address ranges of the subnet are allocated to pods and services
  • Routes-based cluster
    • A new IP address space is created
    • For connecting to pods, routes from the VPC network to the new IP address space are automatically created
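
A sketch of creating a VPC-native cluster (the cluster, subnet, and secondary range names are hypothetical):

gcloud container clusters create my-cluster \
    --enable-ip-alias \
    --subnetwork=my-subnet \
    --cluster-secondary-range-name=pods-range \
    --services-secondary-range-name=services-range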

Node Scaling

  • Cluster autoscaler
    • Worker nodes are added/deleted according to pod resource requests.
  • Node auto-provisioning
    • Node pools are added/deleted according to workload requirements such as CPU and memory.

Pod Scaling

  • Horizontal Pod auto-scaling
    • Pod is added/deleted
  • Vertical Pod auto-scaling
    • CPU and memory resources of pods are scaled
  • Multidimensional Pod auto-scaling
    • Both horizontal and vertical scaling are performed
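
A sketch of horizontal Pod autoscaling via a standard HorizontalPodAutoscaler manifest (the Deployment name and targets are hypothetical):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef: # the Deployment to scale
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target: # add/remove pods to keep average CPU near 60%
        type: Utilization
        averageUtilization: 60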

Storage

The following database, storage, and registry services can be accessed from a GKE cluster, each for its own purpose.

  • Cloud SQL
  • Datastore
  • Cloud Spanner
  • Cloud Storage
  • Persistent Disk
  • Artifact Registry
  • Container Registry
  • Cloud Filestore

Backup

Backup for GKE can back up a whole GKE cluster. The whole cluster or a part of the workload can be restored/rolled back.

You can use Kubernetes volume snapshots to take snapshots of PersistentVolumes such as Persistent Disk.

Kubernetes RBAC

https://cloud.google.com/kubernetes-engine/docs/how-to/role-based-access-control?hl=ja

You can control the access to cluster or namespace using Role and RoleBinding (or, ClusterRole and ClusterRoleBinding).

Example of Role

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: accounting
  name: pod-reader
rules:
- apiGroups: [""] # "" indicates the core API group
  resources: ["pods"]
  verbs: ["get", "watch", "list"]

Example of RoleBinding

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-reader-binding
  namespace: accounting
subjects:
# Google Cloud user account
- kind: User
  name: janedoe@example.com
# Kubernetes service account
- kind: ServiceAccount
  name: johndoe
# IAM service account
- kind: User
  name: test-account@test-project.iam.gserviceaccount.com
# Google Group
- kind: Group
  name: accounting-group@example.com
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Access from Pod to Google Cloud APIs

https://blog.g-gen.co.jp/entry/gke-workload-identity

  • Service account
    • The IAM service account associated with the node pool is used to access the APIs.
    • Pods on each node use that service account to access the APIs.
    • All pods in the node pool share the same service account.
  • Workload Identity
    • Associates a k8s service account with a GCP service account.
    • You create a k8s service account, then associate it with a GCP service account using a gcloud command (see the sketch below). Once a pod specifies the k8s service account in its manifest, the pod uses the associated GCP service account for Google Cloud API access.
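
A sketch of the association step, binding the roles/iam.workloadIdentityUser role (the project, namespace, and account names are hypothetical):

gcloud iam service-accounts add-iam-policy-binding \
    my-gsa@my-project.iam.gserviceaccount.com \
    --role=roles/iam.workloadIdentityUser \
    --member="serviceAccount:my-project.svc.id.goog[my-namespace/my-ksa]"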

k8s service account

apiVersion: v1
kind: ServiceAccount
metadata:
  name: <Name of k8s service account>
  annotations:
    iam.gke.io/gcp-service-account: <GCP Service Account>@<GCP Project Name>.iam.gserviceaccount.com

Pod manifest

apiVersion: v1
kind: Pod
metadata:
  name: pod-workloadid
spec:
  containers:
  - name: cloud-sdk
    image: google/cloud-sdk:slim
    command: ["sleep", "infinity"]
  serviceAccountName: <Name of k8s service account>

Cloud Run Job

https://blog.g-gen.co.jp/entry/cloud-run-jobs-explained#%E9%95%B7%E6%99%82%E9%96%93%E3%81%AE%E5%AE%9F%E8%A1%8C

Cloud Run Service

https://blog.g-gen.co.jp/entry/cloud-run-explained

https://blog.g-gen.co.jp/entry/using-cloud-run-tagged-revision

Cloud KMS

https://blog.g-gen.co.jp/entry/cloud-kms-explained

https://blog.g-gen.co.jp/entry/gke-workload-identity

Cloud Deploy

https://medium.com/google-cloud-jp/cloud-deploy-397c8a7c68c0

The CI tool Cloud Build used to be used for deployment as well.

But Cloud Build is just a CI tool and doesn't give you enough control over the deployment process.

Cloud Deploy is a CD tool that provides managed deployments, rollouts, and rollbacks.