(Book: Release It! - Michael Nygard)
Part 1 - Create Stability
- Ch02 case study
- Ch03 Stabilize your system
- Ch04 Stability Anti-patterns
- Ch05 Stability Patterns
Part 2 - Design for production
- Ch06 case study
- Ch07 Foundations
- Ch08 Processes on Machines
- Ch09 Interconnect
- Ch10 Control plane
- Ch11 Security
Part 3 - Deliver your system
- Ch12
- Ch13
- Ch14
Part 4 - Solve Systemic Problems
- Ch15
- Ch16
- Ch17
Model: CAP theorem
Model: Open Systems Interconnection (OSI) model
Ch08 Processes on Machines
- Code
- Building the code
- Immutable and disposable infrastructure
- Configuration
- Configuration files
- Configuration with Disposable Infrastructure
- Transparency
- Designing for transparency
- Enabling technologies
- Logging
- Log locations
- Log levels
- Human factors
- Voodoo operations
- Instance metrics
- Health checks
Pattern: split deployment view and runtime view
(1) Code
Building the code
Pattern: put all dependencies in a private repository
Immutable and disposable infrastructure
Pattern: immutable infrastructure (aka phoenix server)
(2) Configuration
(3) Transparency
(3.1) Designing for transparency
- watch for coupling of monitoring framework and application code
- “should be like an exoskeleton”
- Keep out of instance
- What metrics should trigger alerts
- Where to set thresholds
- How to roll-up state variables into an overall system health status
(3.2) Enabling technologies
- Black-box - outside the process
- White-box - inside the process
(3.3) Logging
Log locations
- Security: separate folder
- Code: read only
- Logs: write
- Configurable
- Physical machines - Separate drive
- Containers - stdout
Log levels
- Primary consumer
- Not developer
- Administrator / ops engineer
- Heuristic: log levels ERROR & SEVERE should require action by the operators
- Pattern: no debug logs in prod
- Noise
- Confusion
Human factors
“logs are a human-computer interface”
Users = ops engineers
Pattern: add trace information in logs (e.g. user ID, session, transaction, random number)
Heuristic: log level INFO for interesting state transitions
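The logging notes above (no debug logs in prod, trace information on every line, INFO for state transitions) can be sketched with Python's standard `logging` module. This is a minimal illustration, not the book's code: the logger name, the field names `user_id`, `session_id`, `txn_id`, and the sample messages are all assumptions made up for the example.

```python
import io
import logging
import uuid


class TraceFilter(logging.Filter):
    """Attach trace fields to every record so related lines can be found together."""

    def __init__(self, user_id, session_id):
        super().__init__()
        self.user_id = user_id
        self.session_id = session_id
        # Random identifier for this transaction (hypothetical naming).
        self.txn_id = uuid.uuid4().hex[:8]

    def filter(self, record):
        record.user_id = self.user_id
        record.session_id = self.session_id
        record.txn_id = self.txn_id
        return True


def build_logger(stream, user_id, session_id):
    logger = logging.getLogger(f"orders.{session_id}")
    logger.setLevel(logging.INFO)  # DEBUG is dropped: no debug logs in prod
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s user=%(user_id)s session=%(session_id)s "
        "txn=%(txn_id)s %(message)s"))
    logger.addHandler(handler)
    logger.addFilter(TraceFilter(user_id, session_id))
    return logger


stream = io.StringIO()
log = build_logger(stream, user_id="u42", session_id="s7")
log.debug("cache probe details")           # suppressed, below INFO
log.info("order state: PENDING -> PAID")   # interesting state transition
log.error("payment gateway unreachable")   # should require operator action
log_lines = stream.getvalue().splitlines()
```

Writing to an in-memory stream keeps the sketch self-contained; in a container the handler would point at stdout instead.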
(3.4) Instance metrics
(3.5) Health checks Model: health check
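A health check can report more than "the process is up". The sketch below is one possible shape, under assumptions: the check names (`db_pool`, `cache`), the `version` field, and the choice of HTTP status 503 for an unhealthy instance are illustrative and not taken from the notes.

```python
import json


def health_report(checks, version):
    """Run each named check and roll the results up into one status.

    `checks` maps a name to a zero-argument callable returning True/False.
    Returns an (http_status, json_body) pair.
    """
    results = {name: bool(check()) for name, check in checks.items()}
    healthy = all(results.values())
    body = json.dumps(
        {"version": version, "accepting_work": healthy, "checks": results},
        sort_keys=True,
    )
    return (200 if healthy else 503), body


# Hypothetical checks: the connection pool is fine, the cache is not.
status, body = health_report(
    {"db_pool": lambda: True, "cache": lambda: False},
    version="1.0",
)
```

A load balancer polling this endpoint could take the instance out of rotation on any non-200 status.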
Ch09 Interconnect
- DNS
- Service discovery with dns
- Load balancing with dns
- Global server load balancing (GSLB)
- Load balancing
- Software LB
- Hardware LB
- Health checks
- Stickiness
- Partitioning (“content based routing”)
- Demand control
- How systems fail
- Preventing disaster (“load shedding”)
- Network routing
- Discovering services
(1) DNS
When services don’t change often
Load balancing with DNS
Round robin (Model: Open Systems Interconnection (OSI) model layer 7 - application)
- :) cheap
- :( all instances must be routable
- :( control in client hands
- :( no health check
- :( load is not distributed (traffic!=load)
- :( client cached dns results
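The "traffic != load" point in the list above can be made concrete with a toy simulation. The addresses and per-request costs below are invented for illustration; the rotation stands in for round-robin DNS answers and ignores client-side caching.

```python
from itertools import cycle


def round_robin(addresses, request_costs):
    """Hand out addresses in strict rotation, as round-robin DNS would."""
    assignment = {address: [] for address in addresses}
    for address, cost in zip(cycle(addresses), request_costs):
        assignment[address].append(cost)
    return assignment


# Cheap and expensive requests happen to alternate (made-up numbers).
costs = [1, 50, 1, 50, 1, 50]
assignment = round_robin(["10.0.0.1", "10.0.0.2"], costs)

request_counts = {a: len(c) for a, c in assignment.items()}
load = {a: sum(c) for a, c in assignment.items()}
```

Both instances receive three requests each, yet one carries 3 units of work and the other 150: equal traffic, very unequal load.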
Global server load balancing (GSLB) with DNS
Multi-region
Sends requests to all regions; the closest will be fastest to respond
- :) great fit for DNS
Architecture pattern: global dns + regional load balancers
(2) Load Balancing
Virtual IPs (VIPs): each VIP maps to a pool
Policy information
- algorithm
- which health checks to run (Model: health checks)
- session stickiness
- how to handle incoming requests when no pool member available
Listen with DNS name of the VIP
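A pool behind a VIP, with its policy decisions (algorithm, health checks, what to do when no member is available), might be sketched as follows. The `Pool` class, the round-robin algorithm, and returning `None` when the pool is empty are assumptions chosen for illustration; real load balancers expose these as configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Members behind one virtual IP, with a simple round-robin policy."""
    members: list
    healthy: set = field(default_factory=set)
    _next: int = 0

    def run_health_checks(self, probe):
        # `probe` is any callable that returns True for a healthy member.
        self.healthy = {m for m in self.members if probe(m)}

    def pick(self):
        candidates = [m for m in self.members if m in self.healthy]
        if not candidates:
            # Policy decision: no pool member available.
            return None
        member = candidates[self._next % len(candidates)]
        self._next += 1
        return member


pool = Pool(["a", "b", "c"])
pool.run_health_checks(lambda member: member != "b")   # "b" fails its check
picks = [pool.pick() for _ in range(4)]

pool.run_health_checks(lambda member: False)           # everything is down
no_member = pool.pick()
```

What the caller does with `None` (serve an error page, redirect to another site) is exactly the "no pool member available" policy listed above.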
Software load balancing (reverse proxy)
- Model: Open Systems Interconnection (OSI) model layer 7 - application layer
- :) cheap
- :) easy to operate
- :( limited max scale
(optional) caching Heuristic: when software load balancing becomes insufficient
Hardware load balancing Model: Open Systems Interconnection (OSI) model layer 4 through 7
- Not just http or ftp
- :) fail over (divert traffic to another site)
- Works well with Architecture pattern: GSLB + regional load balancers
Heuristic: when to run your own software load balancer
(3) Demand Control 2 strategies
- Refuse work
- Scale out
How systems fail (request-reply)
- Sockets
- Each request claims a socket on each tier it passes through
- Model: Little’s Law
- Service slows down under heavy load
- Fewer sockets available
- I/O bandwidth through the NICs
- Time for packets through the wire
- Buffers
- Write buffers full - write calls will block
- Read buffers full - connection will stall
- Listen queue (server socket)
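Little's Law (average number in the system = arrival rate x average time in the system) explains why a slowdown eats sockets. A small worked example, assuming one socket per in-flight request on a tier; the request rate and response times are made-up figures.

```python
def sockets_in_use(arrival_rate_per_s, response_time_s):
    """Little's Law: L = lambda * W, with one socket per in-flight request."""
    return arrival_rate_per_s * response_time_s


# Same traffic, two response times (illustrative numbers only).
normal = sockets_in_use(arrival_rate_per_s=200, response_time_s=0.25)
slow = sockets_in_use(arrival_rate_per_s=200, response_time_s=2.0)
growth = slow / normal
```

With arrivals unchanged, an 8x slower response means 8x as many sockets held at once, so fewer are left for new requests.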
Preventing disaster Pattern: use SLA to determine when to start load shedding
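One way to act on the SLA pattern is to track recent response times and refuse work once their average exceeds the SLA. This is a sketch under assumptions: the `LoadShedder` class, the moving-average window, and answering with status 503 are illustrative choices, not details given in the notes.

```python
from collections import deque


class LoadShedder:
    """Refuse new work while recent response times exceed the SLA."""

    def __init__(self, sla_seconds, window=20):
        self.sla_seconds = sla_seconds
        self.samples = deque(maxlen=window)   # most recent response times

    def record(self, response_time_s):
        self.samples.append(response_time_s)

    def should_shed(self):
        if not self.samples:
            return False
        average = sum(self.samples) / len(self.samples)
        return average > self.sla_seconds

    def handle(self, work):
        if self.should_shed():
            return 503, None      # refuse quickly instead of queueing
        return 200, work()


shedder = LoadShedder(sla_seconds=0.5, window=3)
shedder.record(0.1)
shedder.record(0.2)
before = shedder.handle(lambda: "ok")

for _ in range(3):
    shedder.record(2.0)           # the service has slowed down
after = shedder.handle(lambda: "ok")
```

Refusing early keeps the listen queue short, so callers get a fast "no" rather than a slow timeout.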
(4) Network Routing
Nearby servers - subnet address
Further away
- Default gateway
- Static route definitions
- Software-defined networking
- :) Virtualize infrastructure
- :) container based infrastructure
(5) discovering services When?
- Too many services for DNS
- Highly dynamic environment
Parts
- Services announce themselves
- Lookup
Ch10 Control Plane
- How much is right for you?
- Mechanical Advantage
- system failure, not human error
- automation goes really fast
- Platform and Ecosystem
- Development is Production
- System-Wide Transparency
- real user monitoring
- economic value
- the risk of fragmentation
- logs and stats
- what to expose
- Configuration Services
- Provisioning and Deployment Services
- Command and Control
- controls to offer
- sending commands
- scriptable interfaces
- remember this
- The Platform Players
- The Shopping List
Control plane
all the software and services that run in the background to make production load successful
Heuristic: production software vs control plane
(1) How much is right for you?
(2) Mechanical Advantage system failure, not human error Model: Blameless Postmortem
automation goes really fast Heuristic: when to use automation
(3) Platform and Ecosystem Model: platform team
(4) Development is Production
(5) System-Wide Transparency fundamental questions
- Are users receiving a good experience?
- Is the system producing the economic value we want?
economic value Focus - connect revenue to cost
- Income (top line)
- Performance bottlenecks
- Errors preventing people from registering
- …
- Costs (bottom line)
- Infrastructure
- Operations
- Platforms & runtimes
risk of fragmentation
marketing uses tracking bugs on web pages
sales uses conversions reported in a business intelligence tool
operations analyzes log files in splunk
development uses blind hope and intuition
-> same data through different interfaces
different perspectives
same information system
Logs & stats - what to expose Model: push vs pull log collection Heuristic: useful metrics Heuristic: nominal values for continuous metrics
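For nominal values of a continuous metric, one common rule of thumb (assumed here, since the notes do not spell the heuristic out) is to treat the mean of a comparable time period plus or minus two standard deviations as the nominal range. The sample values are invented.

```python
import statistics


def nominal_range(samples, k=2.0):
    """Return (low, high) as mean +/- k population standard deviations."""
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)
    return mean - k * spread, mean + k * spread


def is_nominal(value, samples, k=2.0):
    low, high = nominal_range(samples, k)
    return low <= value <= high


# Hypothetical requests-per-second readings from the same hour on past days.
history = [100, 102, 98, 101, 99]
typical = is_nominal(101, history)
anomalous = is_nominal(120, history)
```

Values outside the range are candidates for an alert; the multiplier `k` is a tuning knob, not a fixed rule.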
(6) Configuration Services distributed databases Model: elastic vs scalable
(7) Provisioning and Deployment Services Pattern: canary deployments
(8) Command and Control
Command and control - live control
When? If services have a long startup time (not containers)
controls to offer Heuristic: useful controls for Control Plane - live control
sending commands
- Admin API over http (small scale)
- Security: Different port
- Swagger UI
- Command queue (large scale)
- Prevent dog piling
- Random delay
- Waves
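Random delays and waves can be combined when a command goes out to many instances through a queue. The sketch below is an assumption-laden illustration: the function name, the wave size, the gap between waves, and the jitter bound are all hypothetical parameters.

```python
import random


def schedule_command(instances, wave_size, wave_gap_s, max_jitter_s, rng=None):
    """Return a start offset in seconds per instance.

    Instances are grouped into waves; each instance also waits a random
    extra delay so that a wave does not act in lockstep (dog piling).
    """
    rng = rng or random.Random()
    schedule = {}
    for index, instance in enumerate(instances):
        wave = index // wave_size
        schedule[instance] = wave * wave_gap_s + rng.uniform(0, max_jitter_s)
    return schedule


instances = [f"app-{n}" for n in range(6)]
schedule = schedule_command(
    instances, wave_size=2, wave_gap_s=60, max_jitter_s=10,
    rng=random.Random(0),   # seeded only to make the example repeatable
)
```

With these numbers, the first wave starts within the first 10 seconds, the second between 60 and 70 seconds, and the third between 120 and 130 seconds.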
Operating costs
“anything you build must either be maintained or torn down”
Choose options that are appropriate
- Team size
- Scale of workload
consider
- Visibility: Logging, tracing, metrics
- Dynamic systems? Configuration, provisioning, deployment services
- Long-lived? Control mechanisms
(9) The Platform Players
(10) The Shopping List not every org needs everything on this list
Ch11 Security
(1) OWASP top 10 Model: OWASP top 10
(2) The Principle of Least Privilege
anti-pattern: run as root (unix) or administrator (windows)
pattern: each major application should have its own user
Heuristic: only use root when opening a socket on a port below 1024; that is the only thing a UNIX application might require root privilege for
pattern: configure timed builds for any application that isn’t still under active development -> so that you get updates from the base image
(3) Configured Passwords
pattern: keep passwords to prod databases separate from other config files
anti-pattern: passwords in installation directory
pattern: password files readable only to the owner (application’s own user)
pattern: password vaulting - keep passwords in encrypted files
pattern: expunge keys from memory as soon as possible
- memory dumps
pattern: disable core dumps on production
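The "readable only to the owner" pattern can be checked at startup on a POSIX system. A minimal sketch: the helper name `readable_only_by_owner` and the file name `db.password` are made up for the example, and the check only inspects permission bits, not file ownership.

```python
import os
import stat
import tempfile


def readable_only_by_owner(path):
    """True if group and others have no permissions at all on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


with tempfile.TemporaryDirectory() as workdir:
    secret = os.path.join(workdir, "db.password")   # hypothetical file name
    with open(secret, "w") as handle:
        handle.write("not-a-real-password")

    os.chmod(secret, 0o600)                          # owner read/write only
    owner_only = readable_only_by_owner(secret)

    os.chmod(secret, 0o644)                          # group/other can read
    too_open = not readable_only_by_owner(secret)
```

An application could refuse to start when the check fails, which turns a silent misconfiguration into a visible one.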