Book: Release It! - Michael Nygard

August 16th, 2022


 

Part 1 - Create Stability

[ ] Ch2 case study

[ ] Ch3 Stabilize your system

[ ] Ch04 Stability Anti-patterns

[ ] Ch05 Stability Patterns

 

Part 2 - Design for production

[ ] Ch06 case study

[ ] Ch07 Foundations

[x] Ch08 Processes on Machines

[x] Ch09 Interconnect

[x] Ch10 Control plane

[x] Ch11 Security

 

Part 3 - Deliver your system

[ ] Ch12

[ ] Ch13

[ ] Ch14

 

Part 4 - Solve Systemic Problems

[ ] Ch15

[ ] Ch16

[ ] Ch17

 



Model: CAP theorem

Model: OSI model

 

Model: 12-factor app

 

 



Ch08 Processes on Machines

  1. Code

    1. Building the code

    2. Immutable and disposable infrastructure

  2. Configuration

    1. Configuration files

    2. Configuration with Disposable Infrastructure

  3. Transparency

    1. Designing for transparency

    2. Enabling technologies

    3. Logging

      1. Log locations

      2. Log levels

      3. Human factors

      4. Voodoo operations

    4. Instance metrics

    5. Health checks

Pattern: split deployment view and runtime view

 

(1) Code

Building the code

Pattern: put all dependencies in a private repository

 

Immutable and disposable infrastructure

Pattern: immutable infrastructure

 

(2) Configuration

 

(3) Transparency

(3.1) Designing for transparency

  • watch for coupling of monitoring framework and application code

    • "should be like an exoskeleton"

    • Keep out of instance

      • What metrics should trigger alerts

      • Where to set thresholds

      • How to roll-up state variables into an overall system health status

(3.2) Enabling technologies

  • Black-box - outside the process

  • White-box - inside the process

 

(3.3) Logging

Log locations

  • Security: separate folder

    • Code: read only

    • Logs: write

  • Configurable

  • Physical machines - Separate drive

  • Containers - stdout

Log levels

 

Human factors

"logs are a Human-computer interface"

Users = ops engineer

 

Pattern: add trace information in logs (eg. User ID, session, transaction, random number)

 

Heuristic: log level INFO for interesting state transitions
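
A minimal sketch of the two logging notes above, using Python's standard logging module. The logger name, field names, and the `TraceFilter` class are illustrative assumptions, not something taken from the book.

```python
import io
import logging
import uuid


class TraceFilter(logging.Filter):
    """Attach trace fields to every record (hypothetical helper)."""

    def __init__(self, user_id, session_id):
        super().__init__()
        self.user_id = user_id
        self.session_id = session_id
        # Random transaction id so related lines can be grepped together.
        self.transaction_id = uuid.uuid4().hex[:8]

    def filter(self, record):
        record.user_id = self.user_id
        record.session_id = self.session_id
        record.transaction_id = self.transaction_id
        return True


def build_logger(stream, user_id, session_id):
    logger = logging.getLogger(f"orders.{session_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s user=%(user_id)s session=%(session_id)s "
        "txn=%(transaction_id)s %(message)s"))
    logger.addHandler(handler)
    logger.addFilter(TraceFilter(user_id, session_id))
    return logger


stream = io.StringIO()
log = build_logger(stream, user_id="u-42", session_id="s-7")
log.info("order state changed: PENDING -> PAID")  # interesting state transition
log.debug("cart contents: ...")  # below INFO, so not written
log_output = stream.getvalue()
```

Every line carries the same user, session, and transaction fields, so an ops engineer can follow one request across many log lines.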

 

(3.4) Instance metrics

 

(3.5) Health checks

Model: health check
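
A possible shape for a health check, sketched under assumptions: the check names, the report fields, and the choice of 200/503 status codes are illustrative, not a prescribed format.

```python
def health_check(checks, version="unknown"):
    """Run each named check and roll the results up into one report."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:
            # A check that blows up counts as unhealthy, not as a crash.
            results[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in results.values())
    report = {"version": version, "accepting_work": healthy, "checks": results}
    return (200 if healthy else 503), report


def broken_cache():
    raise ConnectionError("cache unreachable")


status_ok, report_ok = health_check({"db_pool": lambda: True}, version="1.4.2")
status_bad, report_bad = health_check(
    {"db_pool": lambda: True, "cache": broken_cache})
```

A load balancer could poll this and stop sending traffic to an instance that reports it is not accepting work.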

 



 

Ch09 Interconnect

  1. DNS

    1. Service discovery with DNS

    2. Load balancing with DNS

    3. Global server load balancing (GSLB)

  2. Load balancing

    1. Software LB

    2. Hardware LB

    3. Health checks

    4. Stickiness

    5. Partitioning ("content based routing")

  3. Demand control

    1. How systems fail

    2. Preventing disaster ("load shedding")

  4. Network routing

  5. Discovering services

 

(1) DNS

When services don't change often

Load balancing with DNS

Round Robin (Model: OSI layer 7 - application)

  • :) cheap

  • :( all instances must be routable

  • :( control in client hands

  • :( no health check

  • :( load is not distributed (traffic!=load)

  • :( clients cache DNS results
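
A toy illustration of the "traffic != load" point above. The instance names and request costs are made up; the sketch only shows that handing out requests evenly does not spread the work evenly.

```python
from itertools import cycle


def round_robin(instances, request_costs):
    """Assign requests in strict rotation and tally count and cost."""
    counts = {name: 0 for name in instances}
    load = {name: 0 for name in instances}
    for instance, cost in zip(cycle(instances), request_costs):
        counts[instance] += 1
        load[instance] += cost
    return counts, load


# Alternating cheap and expensive requests (cost in arbitrary units).
counts, load = round_robin(["a", "b"], [1, 50, 1, 50, 1, 50])
```

Both instances receive three requests, yet instance "b" ends up with all of the expensive ones.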

Global server load balancing (GSLB) with DNS

Multi-region

Sends requests to all regions, closest will be fastest to respond

  • :) great fit for DNS

 

Architecture pattern: GSLB + regional load balancers

 

(2) Load Balancing

Virtual IPs (VIPs)

each VIP maps to a pool


policy information

  • algorithm

  • what health checks

  • session stickiness

  • how to handle incoming requests when no pool member available

Listen with DNS name of the VIP

 

Software load balancing (reverse proxy)

  • Model: OSI layer 7 - application layer

  • :) cheap

  • :) easy to operate

  • :( limited max scale

(optional) caching

Heuristic: when software load balancing becomes insufficient

 

Hardware load balancing

Model: OSI layer 4 through 7

  • Not just http or ftp

  • :) fail over (divert traffic to another site)

  • Works well with Architecture pattern: GSLB + regional load balancers

 

Heuristic: when to run your own software load balancer

 

(3) Demand Control

2 strategies

  1. Refuse work

  2. Scale out

 

How systems fail (request-reply)

  • Sockets

    • Each request claims a socket on each tier it passes through

    • Model: Little's Law

      • Service slows down under heavy load

    • Fewer sockets available

  • I/O bandwidth through the NICs

    • Time for packets through the wire

    • Buffers

      • Write buffers full - write calls will block

      • Read buffers full - connection will stall

      • Listen queue (server socket)
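
Little's Law (L = λW) applied to the socket notes above: the average number of requests in flight equals the arrival rate times the average response time. The numbers are invented for illustration.

```python
def sockets_in_use(arrival_rate_per_s, response_time_s):
    """Little's Law: average in-flight requests = arrival rate * time in system."""
    return arrival_rate_per_s * response_time_s


# Same traffic, but the service slows down under heavy load.
normal = sockets_in_use(arrival_rate_per_s=200, response_time_s=0.05)
slow = sockets_in_use(arrival_rate_per_s=200, response_time_s=2.0)
```

With the arrival rate unchanged, a response time that grows from 50 ms to 2 s means roughly 40 times as many sockets held open on that tier.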

 

Preventing disaster

Pattern: use SLA to determine when to start load shedding
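
One way this pattern could look, as a sketch: the service tracks its own recent response times and refuses work once the average exceeds the SLA. The moving-average window, the `LoadShedder` class, and the 503 response are assumptions made for the example.

```python
import time
from collections import deque


class LoadShedder:
    """Shed load when the recent average response time exceeds the SLA."""

    def __init__(self, sla_seconds, window=20):
        self.sla_seconds = sla_seconds
        self.samples = deque(maxlen=window)  # most recent response times

    def record(self, response_time_s):
        self.samples.append(response_time_s)

    def should_shed(self):
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.sla_seconds


def handle_request(shedder, work):
    if shedder.should_shed():
        return 503, None  # refuse quickly instead of queueing
    started = time.monotonic()
    body = work()
    shedder.record(time.monotonic() - started)
    return 200, body


shedder = LoadShedder(sla_seconds=0.5, window=5)
for t in (0.1, 0.2, 0.1):
    shedder.record(t)
calm = handle_request(shedder, lambda: "ok")

for t in (2.0, 2.5, 3.0, 2.0, 2.5):
    shedder.record(t)
overloaded = handle_request(shedder, lambda: "ok")
```

A real service would also need a way to recover, for example by letting a few probe requests through so that fresh samples can bring the average back down.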

 

(4) Network Routing

Nearby servers - subnet address

Further

  • Default gateway

  • Static route definitions

  • Software-defined networking

    • :) Virtualize infrastructure

    • :) container based infrastructure

 

(5) Discovering services

When?

  • Too many services for DNS

  • Highly dynamic environment

Parts

  1. Services announce themselves

  2. Lookup
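
The two parts above, sketched as an in-memory registry. The TTL-based expiry, the class name, and the addresses are assumptions for illustration; real discovery services are distributed and considerably more involved.

```python
import time


class ServiceRegistry:
    """Toy registry: services announce themselves, clients look them up."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # service name -> {address: last announce time}

    def announce(self, service, address):
        self._entries.setdefault(service, {})[address] = self.clock()

    def lookup(self, service):
        # Instances that stopped announcing drop out after the TTL.
        cutoff = self.clock() - self.ttl_seconds
        known = self._entries.get(service, {})
        return sorted(addr for addr, seen in known.items() if seen >= cutoff)


now = [0.0]  # fake clock so the example is deterministic
registry = ServiceRegistry(ttl_seconds=30.0, clock=lambda: now[0])
registry.announce("cart", "10.0.0.1:8080")
now[0] = 20.0
registry.announce("cart", "10.0.0.2:8080")
now[0] = 40.0
live = registry.lookup("cart")
```

At time 40 only the second instance is returned, because the first has not announced itself within the last 30 seconds.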

 

 



Ch10 Control Plane

  1. How much is right for you?

  2. Mechanical Advantage

    1. system failure, not human error

    2. automation goes really fast

  3. Platform and Ecosystem

  4. Development is Production

  5. System-Wide Transparency

    1. real user monitoring

    2. economic value

    3. the risk of fragmentation

    4. logs and stats

    5. what to expose

  6. Configuration Services

  7. Provisioning and Deployment Services

  8. Command and Control

    1. controls to offer

    2. sending commands

    3. scriptable interfaces

    4. remember this

  9. The Platform Players

  10. The Shopping List

Control plane

all the software and services that run in the background to make production load successful

 

Heuristic: production software vs control plane

 

(1) How much is right for you?

 

(2) Mechanical Advantage

system failure, not human error

Model: Blameless Postmortem

 

automation goes really fast

Heuristic: when to use automation

 

(3) Platform and Ecosystem

Model: platform team

 

(4) Development is Production

 

(5) System-Wide Transparency

fundamental questions

  • Are users receiving a good experience?

  • Is the system producing the economic value we want?

 

economic value

Focus - connect revenue to cost

  • Income (top line)

    • Performance bottlenecks

    • Errors preventing people from registering

    • ...

  • Costs (bottom line)

    • Infrastructure

    • Operations

    • Platforms & runtimes

 

risk of fragmentation

marketing uses tracking bugs on web pages

sales uses conversions reported in a business intelligence tool

operations analyzes log files in splunk

development uses blind hope and intuition

-> same data through different interfaces

different perspectives

same information system

 

Logs & stats - what to expose

Model: push vs pull log collection

Heuristic: useful metrics

Heuristic: nominal values for continuous metrics

 

(6) Configuration Services

distributed databases

Model: elastic vs scalable

 

(7) Provisioning and Deployment Services

Pattern: canary deployments
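
A rough sketch of the decision step in a canary deployment: a small set of instances gets the new version, and its error rate is compared with the rest before rolling further. The tolerance value and the three verdicts are assumptions chosen for the example.

```python
def canary_verdict(baseline, canary, tolerance=0.01):
    """Compare error rates; each argument is an (errors, requests) pair."""
    base_errors, base_requests = baseline
    canary_errors, canary_requests = canary
    if canary_requests == 0:
        return "wait"  # no traffic yet, so nothing to judge
    base_rate = base_errors / base_requests if base_requests else 0.0
    canary_rate = canary_errors / canary_requests
    if canary_rate <= base_rate + tolerance:
        return "promote"
    return "rollback"


healthy = canary_verdict(baseline=(10, 10_000), canary=(1, 500))
unhealthy = canary_verdict(baseline=(10, 10_000), canary=(40, 500))
no_traffic = canary_verdict(baseline=(10, 10_000), canary=(0, 0))
```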

 

(8) Command and Control

Command and control - live control

When? If services have a long startup time (not containers)

 

controls to offer

Heuristic: useful controls for Control Plane - live control

 

sending commands

  • Admin API over http (small scale)

    • Security: Different port

    • Swagger UI

  • Command queue (large scale)

    • Prevent dog piling

      • Random delay

      • Waves
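
The two dog-pile countermeasures above could be sketched like this. The function names and instance names are hypothetical; the idea is only that instances should not all act on a command at the same moment.

```python
import random


def jittered_delays(instances, max_delay_s, rng=None):
    """Random delay: each instance waits a different time before acting."""
    rng = rng or random.Random()
    return {name: rng.uniform(0, max_delay_s) for name in instances}


def waves(instances, wave_size):
    """Waves: split instances into batches that act one after another."""
    return [instances[i:i + wave_size]
            for i in range(0, len(instances), wave_size)]


instances = ["app-1", "app-2", "app-3", "app-4", "app-5"]
delays = jittered_delays(instances, max_delay_s=30.0, rng=random.Random(1))
batches = waves(instances, wave_size=2)
```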

 

Operating costs

"anything you build must either be maintained or torn down"

Choose options that are appropriate

  • Team size

  • Scale of workload

consider

  1. Visibility: Logging, tracing, metrics

  2. Dynamic systems? Configuration, provisioning, deployment services

  3. Long-lived? Control mechanisms

 

(9) The Platform Players

 

(10) The Shopping List

not every org needs everything on this list


 



Ch11 Security

 

(1) OWASP top 10

Model: OWASP top 10

 

(2) The Principle of Least Privilege

anti-pattern: run as root (unix) or administrator (windows)

pattern: each major application should have its own user

Heuristic: only use root when

opening a socket on a port below 1024 is the only thing that a UNIX application might require root privilege for

 

pattern: configure timed builds for any application that isn't still under active development

-> so that you get updates from the base image

 

(3) Configured Passwords

pattern: keep passwords to prod databases separate from other config files

anti-pattern: passwords in installation directory

pattern: password files readable only to the owner (application's own user)

pattern: Password vaulting - keep passwords in encrypted files

pattern: expunge keys from memory as soon as possible

  • memory dumps

pattern: disable core dumps on production
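
For the "password files readable only to the owner" pattern above, a small sketch assuming a POSIX file system. The file name and helper functions are made up for the example, and the secret is a placeholder.

```python
import os
import stat
import tempfile


def write_secret_file(path, secret):
    """Create the file with owner-only permissions from the start."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(secret)
    os.chmod(path, 0o600)  # also tighten a file that already existed


def is_owner_only(path):
    """True when neither group nor others have any permission bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


with tempfile.TemporaryDirectory() as tmp:
    secret_path = os.path.join(tmp, "db_password")
    write_secret_file(secret_path, "not-a-real-password")
    locked_down = is_owner_only(secret_path)
    os.chmod(secret_path, 0o644)  # simulate someone loosening the mode
    after_loosening = is_owner_only(secret_path)
```

An application could run a check like `is_owner_only` at startup and refuse to start when the password file is readable by anyone else.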

 



 
