Google SRE's best practicee

https://sre.google/sre-book/table-of-contents/

Incident Document

It should contain the followings.

Incident timeline
List of actions carried out to restore the service
Command hierarchy: The roles such as Incident Commander.
Root cause analysis

Incident Update

Google. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, Inc.

Give timely updates to all stakeholders
Prioritize responses to customers over internal stakeholders
Set 'next update' times in all communications.

Postmortem

A postmortem should be reviewed by all of the relevant stakeholders before it is finalized.
This includes the engineers who were involved in the incident, as well as the product managers, managers, and other stakeholders who may be affected by the incident.
The purpose of reviewing the postmortem is to ensure that it identifies all of the root causes of the incident and to get feedback from the stakeholders so that they can understand the incident and how it can be prevented from happening again.

Command hierarchy

Assign one team member as Incident Commander and another team member as Communications Lead. Keep stakeholders updated and ensure a postmortem analysis is conducted.

Trbouleshooting Steps

https://sre.google/sre-book/effective-troubleshooting/

Understand the severity of the issue.
- This includes determining how many customers are affected, how severe the impact is, and whether or not the issue is widespread.
- Opening a bug ticket or reviewing application logs can be helpful, but they should not be done before you have a good understanding of the severity of the issue. This is because you may be wasting your time troubleshooting a problem that is not actually affecting a large number of customers.
- Making the system work as well as it can while you troubleshoot is also a good idea, but it is not the first step. You need to understand the severity of the issue before you can make any changes to the system.
Identify the affected components.
Gather data.
Analyze the data.
Reproduce the issue.
Fix the issue.
Roll back the changes.
Monitor the system.

SLIs

https://sre.google/sre-book/postmortem-culture/

SLIs for big data systems

SLIs for the pipeline to ensure that the data in the final storage is up to date.

Throughput shows speed of processing
Latency shows total time to process a request
Correctness measures the accuracy of results returned.

Durability is not typically considered a Service Level Indicator (SLI) when ensuring that data in the final storage is up to date. Durability generally refers to the ability of a storage system to reliably store data without loss or corruption over time. While durability is an important characteristic for data storage systems, it may not be directly related to the freshness or up-to-dateness of the data in the context of a data processing pipeline.

Roles for DevOps

Cloud Source Repository

https://cloud.google.com/source-repositories/docs/configure-access-control?hl=ja#roles_and_permissions_matrix

roles/source.reader Source Repository 読み取り
roles/source.writer Source Repository 書き込み
roles/source.admin Source Repository 管理者

Docker image

compute.imageUser create instances from images

Viewer of Monitoring Dashboard

Monitoring viewer role roles/monitoring.viewer
- Provides read-only access to get and list information about all monitoring data and configurations.
Logs Viewer role (roles/logging.viewer
- Provides access to view logs.

Agent

Ops agent
- This collects both logs and metrics
- https://cloud.google.com/monitoring/agent/ops-agent?hl=ja
Logging agent (fluentd agent)
- This is old agent which collects only logs
- https://cloud.google.com/stackdriver/docs/solutions/agents/monitoring?hl=ja
Monitoring agent (Collectd agent)
- This is old agent which collects only metrics
- https://cloud.google.com/logging/docs/agent/logging?hl=ja

Billing

Export Cloud Billing data to BigQuery

https://cloud.google.com/billing/docs/how-to/export-data-bigquery-setup?hl=en

Create a project where the Cloud Billing data will be stored, and enable billing on the project
Configure permissions on the project and on the Cloud Billing account
Enable the BigQuery Data Transfer Service API
Create a BigQuery dataset in which to store the data
Enable Cloud Billing export of cost data and pricing data to be written into the dataset

Permissions

To enable and configure the export of Google Cloud billing usage cost data to a BigQuery dataset, you need the following permissions:

For Cloud Billing, you need either the Billing Account Costs Manager role or the Billing Account Administrator role on the target Cloud Billing account.
For BigQuery, you need the BigQuery User role for the Google Cloud project that contains the BigQuery dataset to be used to store the Cloud Billing data.

To enable and configure the export of Cloud Billing pricing data, you need the following permissions:

For Cloud Billing, you need the Billing Account Administrator role on the target Cloud Billing account.
For BigQuery, you need the BigQuery Admin role for the Google Cloud project that contains the BigQuery dataset to be used to store the Cloud Billing pricing data.
For the Google Cloud project containing the target dataset, you need the resourcemanager.projects.update permission. This permission is included in the roles/editor role.

Security marks, Labels, Network tags, and Resource tags

https://cloud.google.com/blog/products/gcp/labelling-and-grouping-your-google-cloud-platform-resources?hl=en

Security marks

https://cloud.google.com/security-command-center/docs/how-to-security-marks?hl=en

Use cases
- classifying and organizing assets and findings independent of resource-level labelling mechanisms, including multi-parented groupings
- enabling tracking of violation severity and priority
- integrating with workflow systems for assignment and resolution of incidents
- enabling differentiated policy enforcement on resources, projects or groups of projects
- enhancing security focused insights into your resources, e.g., clarifying which publicly accessible buckets are within policy and which are not
Resources that can be annotated
- Almost all major resources can be annotated using Security marks
- https://cloud.google.com/security-command-center/docs/supported-asset-types?hl-en&hl=en

Labels

Use cases
- Identify resources used by individual teams or cost centers
- Distinguish deployment environments (prod,stage, qa, test)
- Identify owners, state labels.
- Use for cost allocation and billing breakdowns.
- Monitor resource groups via Stackdriver, which can use labels accessible in the resource metadata
Resouces can be labeled
- https://cloud.google.com/resource-manager/docs/creating-managing-labels?hl=ja#label_support

Network tags

Use cases
- Create additional isolation between subnetworks by selectively allowing only certain instances to communicate.
- If you arrange for all instances in a subnetwork to share the same tag, you can specify that tag in firewall rules to simulate a per-subnetwork firewall. For example if you have a subnet called ‘subnet-a’, you can tag all instances in subnet-a with the tag ‘my-subnet-a’, and use that tag in firewall rules as a source or destination.
Resouces can be annotated
- Only Compute Engine VM

Resource tags

https://cloud.google.com/resource-manager/docs/tags/tags-overview?hl=en#policies

https://cloud.google.com/iam/docs/tags-access-control?hl=ja

Use cases
- you can conditionally grant Identity and Access Management (IAM) roles and conditionally deny IAM permissions based on whether a resource has a specific tag.

Networking pricing

https://cloud.google.com/vpc/network-pricing?hl=en

Premium Tier
- It leverages Google's premium backbone to carry traffic to and from your external users.
Standard Tier
- It leverages the public internet to carry traffic between your services and your users.
- While using the public internet provides a lower quality of service, it is more economical than Premium Tier.

Log export

Logs can be exported to Cloud Storage, Big Query, and Cloud Pub/Sub using Log sink.
Logs can be exported to third-party platform such as Splunk through Cloud Pub/Sub.
The --include-children flag is important so that logs from all the Google Cloud projects within your organization are also included.

gcloud logging sinks create all-audit-logs-sink \
logging.googleapis.com/projects/logs-test-project/locations/global/buckets/all-audit-logs-bucket \
  --log-filter='logName:cloudaudit.googleapis.com' \
  --description="All audit logs from my org log sink" \
  --organization=12345 \
  --include-children

https://cloud.google.com/logging/docs/central-log-storage?hl=en

Creating a sink with a filter to export logs to BigQuery, and enabling the "Include logs in the sink" option ensures logs are kept in Cloud Logging while being exported.

Deployment

Blue/Green deployment

Supported by:

Spinnaker deployed on GKE
Jenkins deployed on GKE

Jenkins

Google’s recommended approach is deploying Jenkins on either GCE or GKE

Prevent Server Overload

https://sre.google/sre-book/addressing-cascading-failures/

Queue Management
Load Shedding and Graceful Degradation
Retries

VPC flow log

https://cloud.google.com/vpc/docs/flow-logs?hl=en#filtering

VPC Flow Logs records a sample of network flows sent from and received by VM instances, including instances used as Google Kubernetes Engine nodes.

Filtering VPC flow log

https://cloud.google.com/vpc/docs/flow-logs?hl=en#log-sampling

When you enable VPC Flow Logs, you can set a filter based on both base and metadata fields that only preserves logs that match the filter. All other logs are discarded before being written to Logging, which saves you money and reduces the time needed to find the information you are looking for.

Metadata annotations

https://cloud.google.com/vpc/docs/flow-logs?hl=en#metadata

If you select all metadata, all metadata fields in the VPC Flow Logs record format are included in the flow logs. When new metadata fields are added to the record format, the flow logs automatically include the new fields.
If you select no metadata, this omits all metadata fields.

Secret Manager

https://cloud.google.com/secret-manager/docs/overview?hl=en

Secret Manager allows you to store, manage, and access secrets as binary blobs or text strings. With the appropriate permissions, you can view the contents of the secret.
Secret Manager works well for storing configuration information such as database passwords, API keys, or TLS certificates needed by an application at runtime.
A key management system, such as Cloud KMS, allows you to manage cryptographic keys and to use them to encrypt or decrypt data. However, you cannot view, extract, or export the key material itself.
Similarly, you can use a key management system to encrypt sensitive data before transmitting it or storing it. You can then decrypt the sensitive data before using it. Using a key management system to protect a secret in this way is more complex and less efficient than using Secret Manager.
Secrets rotation policies can only be done through the API or gcloud commands

Packages

The Container Registry can hold images and packages can be stored in Cloud Storage. This is Google’s recommended approach.

GKE

Customise logging at GKE

https://cloud.google.com/architecture/customizing-stackdriver-logs-fluentd?hl=en

When you create GKE cluster, you should disable logging other than "SYSTEM"

https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--logging
Deploy custom FluentD datemonset to the cluster for log filtering

Binary Authorization at GKE cluster

https://cloud.google.com/binary-authorization/docs/enable-cluster?hl=en

You can turn on Binary Authorization at GKE cluster as well as Cloud Run.

It ensures only trusted container images are deployed on Google Kubernetes Engine (GKE).

Cloud Operations for GKE

Cloud Operations for GKE is designed to monitor GKE clusters. It manages Monitoring and Logging services together and features a Cloud Operations for GKE dashboard that provides a customized interface for GKE clusters:

You can view a cluster's key metrics, such as CPU utilization, memory utilization, and the number of open incidents.
You can view clusters by their infrastructure, workloads, or services.
You can inspect namespaces, nodes, workloads, services, pods, and containers.
For pods and containers, you can view metrics as a function of time and view log entries.

Cloud Operations for GKE is enabled by default for Autopilot cluster.

You can enbable logging and Cloud monitoring in the following way.

https://cloud.google.com/stackdriver/docs/solutions/gke/installing?hl=en

Reduce network costs for your GKE

Set up a Google Cloud HTTP Load Balancer as Ingress. It can be used to reduce network costs by routing traffic to the most efficient nodes in your cluster. It can also be used to cache content, which can further reduce network costs.
Use Kubernetes autoscaling to scale your cluster up or down based on demand. This will help you avoid overprovisioning, which can lead to unnecessary network costs.
Use spot VMs for your GKE nodes. Spot VMs are available at a discounted price, but they can be terminated at any time. This is a good option if you have a flexible workload.
Use caching to reduce the amount of data that needs to be transferred from your GKE cluster. Caching can be done at the application level or at the network level.
Optimize your network configuration. This includes things like using the right MTU size and avoiding unnecessary network hops.

Project structure

https://cloud.google.com/docs/enterprise/best-practices-for-enterprise-organizations?hl=ja#project-structure

The general recommendation is to use one project per environment and per application.
If you have two applications called "app1" and "app2", each with a development and production environment, create four projects: app1-dev, app1-prod, app2-dev, app2-prod .
This isolates the environments from each other so that changes made to the development project do not negatively impact the production environment.
You can also increase access control by allowing all developers access to development projects while restricting production access to CI/CD pipelines.

Metrics Explorer

You can explore metric data by building a temporary chart with Metrics Explorer.
For example, to view the CPU utilization of a virtual machine (VM), you can use Metrics Explorer to construct a chart that displays the most recent data.
To keep a reference to the chart configuration, save the chart URL. Because the chart URL encodes the chart configuration, when you paste this URL into a browser the chart you configured is displayed.To obtain the chart's URL, click Link in the chart toolbar.

Mirroring a GitHub repository

https://cloud.google.com/source-repositories/docs/mirroring-a-github-repository?hl=en#force_a_repository_sync

You can easily mirror a GitHub repository from Cloud Source Repository.

Add a repository and Connect external repository selecting GitHub as a Git provider.

Need to provide a GitHub machine credential and then Cloud Source Repository access to all repositories in the GitHub user account.

.gcloudignore files

https://cloud.google.com/sdk/gcloud/reference/topic/gcloudignore

If there is a file called .gcloudignore in the top-level directory to upload, the files that it specifies (see "SYNTAX") will be ignored.

The following gcloud commands respect the .gcloudignore file:

gcloud app deploy
gcloud functions deploy
gcloud builds submit
gcloud composer environments storage {dags, data, plugins} import
gcloud container builds submit
gcloud run deploy
gcloud run jobs deploy
gcloud alpha deploy releases create
gcloud alpha infra-manager deployments apply
gcloud alpha functions local deploy
gcloud alpha run jobs deploy
gcloud beta run jobs deploy

Scoping Project (=Cloud Monitoring Workspace)

https://cloud.google.com/monitoring/settings?hl=en

This is the former name of Scoping Project.

You can monitor resources under your project.

Once you add a project at Cloud Monitoring, you can monitor other projects as well.

Scoping project is a kind of hub project to monitor all resources in multiple projects.

Google recommends creating a new scoping project which doesn't have any resources.

Then, you can simplify the management of privileges and monitored metrics.

If someone has the privilege to view the resources only for the scoping project, the person can monitor both the scoping project and the monitored projects.

Realtime monitoring data export to BigQuery

Creating a Pub/Sub topic to export monitoring data allows for real-time streaming and decoupling of the data ingestion process. Using Dataflow to process the data provides a scalable, serverless solution for data transformation, and the BigQuery sink enables efficient storage of the data in BigQuery. This approach is the most appropriate, as it offers a scalable, cost-effective, and real-time solution to integrate Cloud Monitoring with BigQuery.

Dev and Test environment

Create a development environment for writing code and a test environment for configurations, experiments, and load testing.

By creating separate environments for development, testing, and production, you can reduce the likelihood of introducing errors into the production environment.

Dashboard

Create custom dashboards with the required metrics, filter dashboards based on specific conditions, then grant IAM roles to other teams and ask them to access the Google Cloud Console.

-> Incorrect. Although granting IAM roles to other teams allows them to access the Google Cloud Console, it does not directly share the filtered view of the dashboards. This option may also expose additional information or grant excessive permissions to other teams.

Web Security Scanner

A web application security scanning tool that identifies vulnerabilities in App Engine, Compute Engine, and Google Kubernetes Engine applications.

Only a web application that is hosted in the same project can be scanned by Web Security Scanner in the project.

Cloud Security Command Center

A comprehensive security and data risk platform that helps DevOps teams to discover, manage, and remediate security risks across their Google Cloud infrastructure.

It provides a centralized view of security findings, actionable recommendations, and integrated remediation tools.

You can activate Security Command Center for an entire organization (organization-level activation) or for individual projects (project-level activation).

Activating Security Command Center at the organization level is considered a best practice because it provides the most complete protection for your business by allowing Security Command Center to access and scan resources and assets across all of the folders and projects in the organization.

the Standard tier, which offers a limited feature set for free, or the Premium tier, which offers the full feature set.

Cloud Armor

A security service that provides Distributed Denial of Service (DDoS) protection and Web Application Firewall (WAF) capabilities for applications running on Google Cloud.

Data Access Log

You can enable Data Acceess Audi logs in the following way.

https://cloud.google.com/logging/docs/audit/configure-data-access?hl=en#config-console

Once enabling Data Access Log, you can get logs at real-time in Log Bucket (ログストレージ).

https://cloud.google.com/logging/docs/routing/overview?hl=en#buckets

You can query the logs in Log Exploler.

https://cloud.google.com/logging/docs/audit?hl=en#view-logs

Multi-cloud environment

Using cloud-native tools such as Terraform and platforms can help teams to manage and deploy service components efficiently and consistently across different hybrid and multicloud environments.
It can also help to manage dependencies and version control, and ensure that the service components are deployed consistently across different environments.

Log Router

When you configure Log Router, you can desingate Sink Destination and inclusion filter.

At optional, you can create exlusion filter as well.

https://cloud.google.com/logging/docs/export/configure_export_v2?hl=en#creating_sink

Log Storage

https://cloud.google.com/logging/docs/buckets?hl=en#create_bucket

When creating a Log Bucket, you can configure a region where logs will be stored and the log retention period (1 ~ 3,650 days).

If you upgrade a bucket to use Log Analytics, you can query the logs in the Log Analytics page by using SQL queries.

*Log Analytics is not Log Exploler.

Network Service Tiers

You can choose Network Service Tiers at Project level in the following way.

https://cloud.google.com/network-tiers/docs/set-network-tier?hl=en#setting_the_tier_for_all_resources_in_a_project

Also, choose Network Service Tiers at Resource level such as Static External IP address, Network Interface, and Loadbalancer.

Resource level Network Service Tier is more prioritized than project level.

Synthetic Monitoring

https://cloud.google.com/monitoring/uptime-checks/introduction?hl=en#about-sm

Fully configurable synthetic monitors let you deploy a single-purpose 2nd gen Cloud Function, which is built on Cloud Run. You configure your Cloud Function to run your Node.js test script. Cloud Monitoring periodically executes your Cloud Function, and it collects metrics, logs, status, and other test results. You can create synthetic monitors by using the Google Cloud console or the Cloud Monitoring API.
When you create a synthetic monitor, you create a 2nd gen Cloud Function that executes code written in Node.js by using the open source Synthetics SDK framework. Cloud Monitoring distributes and manages this framework.
The request-execution system for synthetic monitors, which is provided by Google Cloud, manages the following:
Periodic execution of your Cloud Function.
Collecting and storing the results of each execution:
- Success and failure information, such as the error message, error type, and line of code.
- Execution time
- Logs
- Metrics

Performance of Cloud Build

Caching intermediate artifacts in Cloud Storage can significantly reduce build time while minimizing cost and development effort.
By storing the artifacts, Cloud Build can access them more quickly and efficiently compared to re-building them from scratch.
This can lead to notable performance gains, especially for larger projects.

Terraform Workspace

https://developer.hashicorp.com/terraform/language/state/workspaces

Google Cloud Deployment Manager

While Google Cloud Deployment Manager is used for creating and managing cloud resources, it's not used for handling the traffic redirection in a canary deployment scenario in GKE.
That is typically handled via Kubernetes service objects.

VPC Flow Log

https://cloud.google.com/vpc/docs/using-flow-logs?hl=ja&ga=2.58477865.-282534489.1681516806&gac=1.19081162.1695168359.CjwKCAjwjaWoBhAmEiwAXz8DBbUkBS5ewU0asLwi3iTzALgc73TytDLg9MHBYe63-YI9pGwnFvO_WhoCa3kQAvD_BwE

You can enable VPC flow log at subnet level.

While enabling VPC flow log, you can configure Sample rate and Aggregation interval.

If sample rate is 100%, all log entries will be captured.

An aggregation interval is an interval to collect logs into a batch.

Kubernetes

https://kubernetes.io/ja/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

Implementing separate CI/CD pipelines for each microservice and using Kubernetes readiness and liveness probes to manage inter-service dependencies during deployment allows for better isolation, traceability, and reliability.
Readiness probes ensure that dependent services are available before a service starts receiving traffic, and liveness probes monitor the health of running services, restarting them if necessary.
This approach promotes efficient and reliable deployments while maintaining the advantages of a microservices architecture.

Retiring service

When retiring a service, it is important to consider the impact on users and other services.

If the service is suddenly retired, users may experience disruptions.

Leaving the service running for an extended period of time to allow users to migrate to a replacement service.

Communicate with users and other services about the retirement. This will help to minimize disruptions.
Create a plan for migrating users and other services to a replacement service. This plan should include a timeline for the migration.
Test the replacement service to ensure that it meets the needs of users and other services.
Monitor the replacement service after it is deployed to ensure that it is performing as expected.

Troubleshooting Ops agent

https://www.google.com/search?q=what+is+name+of+process+ops+agent&sca_esv=567840750&sxsrf=AM9HkKnrp9xLYKRqUoogGuN3Xpkb4WX1iA%3A1695471448415&ei=WNcOZcqWGIXe-Qa6ipf4CA&ved=0ahUKEwjKoqKr28CBAxUFb94KHTrFBY8Q4dUDCBA&uact=5&oq=what+is+name+of+process+ops+agent&gs_lp=Egxnd3Mtd2l6LXNlcnAiIXdoYXQgaXMgbmFtZSBvZiBwcm9jZXNzIG9wcyBhZ2VudDILEAAY8QQYiQUYogQyBRAAGKIESPtIUMcJWI1CcAJ4AZABAJgBfqABtxGqAQQ1LjE2uAEDyAEA-AEBwgIKEAAYRxjWBBiwA8ICBxAjGLACGCfiAwQYACBBiAYBkAYJ&sclient=gws-wiz-serp#fpstate=ive&vld=cid:7c651999,vid:Sd0iznXSVcc,st:0

# systemctl status google-cloud-ops-agent"*"

Cloud Build

The images field in the build config file specifies one or more Linux Docker images to be pushed by Cloud Build to Artifact Registry or Container Registry (Deprecated).

https://cloud.google.com/build/docs/build-config-file-schema?hl=ja#images

The artifacts field in the build config file specifies one or more non-container artifacts to be stored in Cloud Storage.

https://cloud.google.com/build/docs/build-config-file-schema?hl=ja#artifacts

Use the tags field to organize your builds into groups and to filter your builds.

https://cloud.google.com/build/docs/build-config-file-schema?hl=ja#tags

Federating Google Cloud with Azure Active Directory

you can configure Cloud Identity or Google Workspace to use Azure AD as IdP and source for identities.

https://cloud.google.com/architecture/identity/federating-gcp-with-azure-active-directory?hl=en

Google Cloud Directory Sync

https://support.google.com/a/answer/106368?hl=en

With Google Cloud Directory Sync (GCDS), you can synchronize the data in your Google Account with your Microsoft Active Directory or LDAP server.

Filtering at Logging Agent

Use a Fluentd filter plugin with the Cloud Monitoring Agent to remove log entries containing userinfo, and then copy the entries to a Cloud Storage bucket. Fluentd is a log management tool that can be used to filter logs before they are sent to Cloud Logging. By using a Fluentd filter plugin with the Cloud Logging Agent, the log entries containing userinfo can be removed and copied to a Cloud Storage bucket. This will ensure that the PII is securely stored in a separate location and will not leak to Cloud Logging.