(Book: Release It! - Michael Nygard)

Part 1 - Create Stability

  • Ch02 case study
  • Ch03 Stabilize your system
  • Ch04 Stability Anti-patterns
  • Ch05 Stability Patterns

Part 2 - Design for production

  • Ch06 case study

  • Ch07 Foundations

  • Ch08 Processes on Machines

  • Ch09 Interconnect

  • Ch10 Control plane

  • Ch11 Security

Part 3 - Deliver your system

  • Ch12
  • Ch13
  • Ch14

Part 4 - Solve Systemic Problems

  • Ch15
  • Ch16
  • Ch17


Model: CAP theorem

Model: Open Systems Interconnection (OSI) model

Model: 12-factor app



Ch08 Processes on Machines

  1. Code
    1. Building the code
    2. Immutable and disposable infrastructure
  2. Configuration
    1. Configuration files
    2. Configuration with Disposable Infrastructure
  3. Transparency
    1. Designing for transparency
    2. Enabling technologies
    3. Logging
      1. Log locations
      2. Log levels
      3. Human factors
      4. Voodoo operations
    4. Instance metrics
    5. Health checks

Pattern: split deployment view and runtime view

(1) Code
Building the code
Pattern: put all dependencies in a private repository

Immutable and disposable infrastructure
Pattern: immutable infrastructure (aka phoenix server)

(2) Configuration

(3) Transparency
(3.1) Designing for transparency

  • watch for coupling of monitoring framework and application code
    • “should be like an exoskeleton”
    • Keep out of instance
      • What metrics should trigger alerts
      • Where to set thresholds
      • How to roll-up state variables into an overall system health status

(3.2) Enabling technologies

  • Black-box - outside the process
  • White-box - inside the process

(3.3) Logging
Log locations

  • Security: separate folder
    • Code: read only
    • Logs: write
  • Configurable
  • Physical machines - Separate drive
  • Containers - stdout

Log levels

Human factors
“logs are a human-computer interface”
Users = ops engineers

Pattern: add trace information in logs (eg. User ID, session, transaction, random number)

Heuristic: log level INFO for interesting state transitions
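
A minimal sketch of the two logging notes above, using Python's standard `logging` module. The logger name `orders`, the field name `trace_id` and the log messages are hypothetical; the in-memory buffer stands in for stdout in a container.

```python
import io
import logging
import uuid


def make_logger(stream, trace_id: str) -> logging.LoggerAdapter:
    """Build a logger that stamps every line with the same trace ID."""
    base = logging.getLogger("orders")  # hypothetical logger name
    base.setLevel(logging.INFO)         # INFO and above reach the handler
    base.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s")
    )
    base.addHandler(handler)
    # The adapter injects trace_id into every record via the `extra` mechanism.
    return logging.LoggerAdapter(base, {"trace_id": trace_id})


buffer = io.StringIO()                # stand-in for stdout
trace_id = uuid.uuid4().hex[:8]       # random number used as trace information
log = make_logger(buffer, trace_id)

log.info("order state: PENDING -> PAID")  # interesting state transition
log.debug("cart contents dumped")         # below INFO, so it is dropped
```

An ops engineer can then filter on `trace=<id>` to follow one transaction across log lines.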

(3.4) Instance metrics

(3.5) Health checks
Model: health check
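
One possible shape for an instance health check, as a sketch only: the check names (`database_pool`, `cache`) and the report layout are assumptions, not taken from the book. Each check is a callable that returns True when its dependency is usable.

```python
from typing import Callable, Dict


def health_report(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run every check and summarise the instance's ability to take work."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed check.
            results[name] = False
    status = "ok" if all(results.values()) else "unavailable"
    return {"status": status, "checks": results}


def broken_cache() -> bool:
    raise ConnectionError("cache unreachable")  # simulated failure


# Hypothetical checks for illustration.
report = health_report({"database_pool": lambda: True, "cache": broken_cache})
```

A load balancer could poll an endpoint that returns this report and stop routing to instances whose status is not `ok`.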



Ch09 Interconnect

  1. DNS
    1. Service discovery with DNS
    2. Load balancing with DNS
    3. Global server load balancing (GSLB)
  2. Load balancing
    1. Software LB
    2. Hardware LB
    3. Health checks
    4. Stickiness
    5. Partitioning (“content based routing”)
  3. Demand control
    1. How systems fail
    2. Preventing disaster (“load shedding”)
  4. Network routing
  5. Discovering services

(1) DNS
When services don’t change often
Load balancing with DNS
Round robin (Model: Open Systems Interconnection (OSI) model layer 7 - application)

  • :) cheap
  • :( all instances must be routable
  • :( control in client hands
  • :( no health check
  • :( load is not distributed (traffic!=load)
  • :( clients cache DNS results
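
The "traffic != load" point can be illustrated with a toy simulation. This is a simplification: a plain rotation over two made-up addresses stands in for round-robin DNS, and the per-request costs are invented numbers.

```python
from collections import Counter
from itertools import cycle

addresses = ["10.0.0.1", "10.0.0.2"]            # hypothetical instances
request_costs_ms = [900, 10, 900, 10, 900, 10]  # expensive and cheap requests alternate

rotation = cycle(addresses)  # round robin: hand out addresses in turn
requests = Counter()
busy_ms = Counter()

for cost in request_costs_ms:
    target = next(rotation)
    requests[target] += 1    # traffic: number of requests sent
    busy_ms[target] += cost  # load: work the instance actually performs
```

Both instances receive three requests, yet one does 2700 ms of work and the other 30 ms. Round robin balances request counts, not load.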

Global server load balancing (GSLB) with DNS
Multi-region
Sends requests to all regions; the closest will be fastest to respond

  • :) great fit for DNS

Architecture pattern: global dns + regional load balancers

(2) Load Balancing
Virtual IPs (VIPs)
each VIP maps to a pool
./resources/book-release-it-michael-nygard.resources/img_20220816_102044.696.jpg
policy information

  • algorithm
  • which health checks to run (Model: health checks)
  • session stickiness
  • how to handle incoming requests when no pool member available

Listen with DNS name of the VIP

Software load balancing (reverse proxy)

(optional) caching
Heuristic: when software load balancing becomes insufficient

Hardware load balancing
Model: Open Systems Interconnection (OSI) model layer 4 through 7

  • Not just http or ftp
  • :) fail over (divert traffic to another site)
  • Works well with Architecture pattern: GSLB + regional load balancers

Heuristic: when to run your own software load balancer

(3) Demand Control
2 strategies

  1. Refuse work
  2. Scale out

How systems fail (request-reply)

  • Sockets
    • Each request claims a socket on each tier it passes through
    • Model: Little’s Law
      • Service slows down under heavy load
    • Fewer sockets available
  • I/O bandwidth through the NICs
    • Time for packets through the wire
    • Buffers
      • Write buffers full - write calls will block
      • Read buffers full - connection will stall
      • Listen queue (server socket)
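
Little's Law (L = λ × W: average number of requests in the system equals arrival rate times average time in the system) explains why a slowdown eats sockets. A small worked example; the arrival rate and response times are made-up numbers.

```python
def concurrent_requests(arrival_rate_per_s: float, response_time_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate * time in system."""
    return arrival_rate_per_s * response_time_s


# Same incoming traffic, but the service slows from 250 ms to 2 s under load.
healthy = concurrent_requests(100, 0.25)  # 25.0 requests (sockets) in flight
degraded = concurrent_requests(100, 2.0)  # 200.0 requests (sockets) in flight
```

With unchanged traffic, an 8x slowdown means 8x as many sockets held open on every tier the request passes through.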

Preventing disaster
Pattern: use SLA to determine when to start load shedding
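
A sketch of SLA-driven load shedding, under assumptions of my own: the trigger is a moving average of recent response times over a small window, and the class name `LoadShedder` and window size are hypothetical. The notes do not specify how the SLA is measured.

```python
from collections import deque


class LoadShedder:
    """Decide whether to refuse new work based on recent response times."""

    def __init__(self, sla_seconds: float, window: int = 5) -> None:
        self.sla_seconds = sla_seconds
        self.samples = deque(maxlen=window)  # keeps only the newest samples

    def record(self, response_seconds: float) -> None:
        self.samples.append(response_seconds)

    def should_shed(self) -> bool:
        """True when the average recent response time exceeds the SLA."""
        if not self.samples:
            return False
        return sum(self.samples) / len(self.samples) > self.sla_seconds


shedder = LoadShedder(sla_seconds=0.5, window=3)
for seconds in (0.1, 0.2):
    shedder.record(seconds)
before = shedder.should_shed()  # average 0.15 s, within the SLA

for seconds in (1.0, 1.2, 0.9):
    shedder.record(seconds)
after = shedder.should_shed()   # window now holds only the slow samples
```

When `should_shed()` is true, the listener would refuse work quickly (for HTTP, typically a 503) instead of queueing it. A real implementation also needs a way to keep sampling while shedding, otherwise it never recovers.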

(4) Network Routing
Nearby servers - subnet address
Further away

  • Default gateway
  • Static route definitions
  • Software-defined networking
    • :) Virtualize infrastructure
    • :) container based infrastructure

(5) Discovering services
When?

  • Too many services for DNS
  • Highly dynamic environment

Parts

  1. Services announce themselves
  2. Lookup
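
The two parts above, announce and lookup, can be sketched as an in-memory registry. The expiry (TTL) on announcements is my assumption for handling instances that disappear without saying goodbye; the service name and addresses are hypothetical.

```python
import time
from typing import Callable, Dict, List


class Registry:
    """Toy service registry: instances announce themselves, clients look them up."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Dict[str, float]] = {}  # service -> address -> expiry

    def announce(self, service: str, address: str) -> None:
        """Register or refresh an instance; it expires unless re-announced."""
        self._entries.setdefault(service, {})[address] = self._clock() + self._ttl

    def lookup(self, service: str) -> List[str]:
        """Return the addresses whose announcements have not expired."""
        now = self._clock()
        live = self._entries.get(service, {})
        return sorted(addr for addr, expiry in live.items() if expiry > now)


now = [0.0]  # fake clock so the example is deterministic
registry = Registry(ttl_seconds=30, clock=lambda: now[0])
registry.announce("inventory", "10.0.1.5:8080")
registry.announce("inventory", "10.0.1.6:8080")
at_start = registry.lookup("inventory")

now[0] = 20.0
registry.announce("inventory", "10.0.1.5:8080")  # only one instance re-announces
now[0] = 40.0
later = registry.lookup("inventory")             # the silent instance has expired
```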


Ch10 Control Plane

  1. How much is right for you?
  2. Mechanical Advantage
    1. system failure, not human error
    2. automation goes really fast
  3. Platform and Ecosystem
  4. Development is Production
  5. System-Wide Transparency
    1. real user monitoring
    2. economic value
    3. the risk of fragmentation
    4. logs and stats
    5. what to expose
  6. Configuration Services
  7. Provisioning and Deployment Services
  8. Command and Control
    1. controls to offer
    2. sending commands
    3. scriptable interfaces
    4. remember this
  9. The Platform Players
  10. The Shopping List

Control plane

all the software and services that run in the background to make production load successful

Heuristic: production software vs control plane

(1) How much is right for you?

(2) Mechanical Advantage
system failure, not human error
Model: Blameless Postmortem

automation goes really fast
Heuristic: when to use automation

(3) Platform and Ecosystem
Model: platform team

(4) Development is Production

(5) System-Wide Transparency
fundamental questions

  • Are users receiving a good experience?
  • Is the system producing the economic value we want?

economic value
Focus - connect revenue to cost

  • Income (top line)
    • Performance bottlenecks
    • Errors preventing people from registering
  • Costs (bottom line)
    • Infrastructure
    • Operations
    • Platforms & runtimes

risk of fragmentation

marketing uses tracking bugs on web pages
sales uses conversions reported in a business intelligence tool
operations analyzes log files in Splunk
development uses blind hope and intuition

-> same data through different interfaces

different perspectives
same information system

Logs & stats - what to expose
Model: push vs pull log collection
Heuristic: useful metrics
Heuristic: nominal values for continuous metrics

(6) Configuration Services
distributed databases
Model: elastic vs scalable

(7) Provisioning and Deployment Services
Pattern: canary deployments

(8) Command and Control
Command and control - live control
When? If services have a long startup time (not containers)

controls to offer
Heuristic: useful controls for Control Plane - live control

sending commands

  • Admin API over http (small scale)
    • Security: Different port
    • Swagger UI
  • Command queue (large scale)
    • Prevent dog piling
      • Random delay
      • Waves
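
The two dog-pile countermeasures above can be sketched as small helpers. The instance names, wave size and maximum delay are hypothetical; how commands actually reach instances (the queue itself) is left out.

```python
import random
from typing import Dict, List, Sequence


def in_waves(instances: Sequence[str], wave_size: int) -> List[List[str]]:
    """Split instances into consecutive waves that receive the command in turn."""
    return [list(instances[i:i + wave_size]) for i in range(0, len(instances), wave_size)]


def with_random_delay(
    instances: Sequence[str], max_delay_seconds: float, rng: random.Random
) -> Dict[str, float]:
    """Give each instance its own delay so they do not all act at the same instant."""
    return {name: rng.uniform(0, max_delay_seconds) for name in instances}


fleet = ["app-1", "app-2", "app-3", "app-4", "app-5"]  # hypothetical instances
waves = in_waves(fleet, wave_size=2)
delays = with_random_delay(fleet, max_delay_seconds=30, rng=random.Random(42))
```

Either approach spreads out the moment at which instances execute a command such as "reset cache", so a shared dependency is not hit by the whole fleet at once.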

Operating costs
“anything you build must either be maintained or torn down”
Choose options that are appropriate

  • Team size
  • Scale of workload

consider

  1. Visibility: Logging, tracing, metrics
  2. Dynamic systems? Configuration, provisioning, deployment services
  3. Long-lived? Control mechanisms

(9) The Platform Players

(10) The Shopping List
not every org needs everything on this list
./resources/book-release-it-michael-nygard.resources/img_20220816_091905.049.jpg



Ch11 Security

(1) OWASP top 10
Model: OWASP top 10

(2) The Principle of Least Privilege
anti-pattern: run as root (UNIX) or administrator (Windows)
pattern: each major application should have its own user
Heuristic: only use root when

opening a socket on a port below 1024 is the only thing that a UNIX application might require root privilege for

pattern: configure timed builds for any application that isn’t still under active development -> so that you get updates from the base image

(3) Configured Passwords
pattern: keep passwords to prod databases separate from other config files
anti-pattern: passwords in installation directory
pattern: password files readable only to the owner (application’s own user)
pattern: Password vaulting - keep passwords in encrypted files
pattern: expunge keys from memory as soon as possible

  • memory dumps

pattern: disable core dumps on production
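
Two of the patterns in this section can be checked from inside a process on a UNIX-like system. A sketch, assuming Python's Unix-only `resource` module is available; the file name `db.password` is hypothetical. In practice core dumps are more often disabled in the service manager or shell limits than in application code.

```python
import os
import resource  # Unix-only standard library module
import stat
import tempfile


def is_owner_only(path: str) -> bool:
    """True when neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def disable_core_dumps() -> None:
    """Set the core file size limit to zero for this process and its children."""
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))


with tempfile.TemporaryDirectory() as directory:
    secret_path = os.path.join(directory, "db.password")  # hypothetical file
    with open(secret_path, "w") as handle:
        handle.write("not-a-real-password")

    os.chmod(secret_path, 0o644)                 # readable by everyone
    loose_result = is_owner_only(secret_path)
    os.chmod(secret_path, 0o600)                 # readable by the owner only
    strict_result = is_owner_only(secret_path)

disable_core_dumps()
```

An application could refuse to start when `is_owner_only` returns False for its password file.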