Book: Release It! - Michael Nygard
August 16th, 2022
Part 1 - Create Stability
[ ] Ch2 case study
[ ] Ch3 Stabilize your system
[ ] Ch04 Stability Anti-patterns
[ ] Ch05 Stability Patterns
Part 2 - Design for production
[ ] Ch06 case study
[ ] Ch07 Foundations
[x] Ch08 Processes on Machines
[x] Ch09 Interconnect
[x] Ch10 Control plane
[x] Ch11 Security
Part 3 - Deliver your system
[ ] Ch12
[ ] Ch13
[ ] Ch14
Part 4 - Solve Systemic Problems
[ ] Ch15
[ ] Ch16
[ ] Ch17
Model: CAP theorem
Ch08 Processes on Machines
Code
Building the code
Immutable and disposable infrastructure
Configuration
Configuration files
Configuration with Disposable Infrastructure
Transparency
Designing for transparency
Enabling technologies
Logging
Log locations
Log levels
Human factors
Voodoo operations
Instance metrics
Health checks
Pattern: split deployment view and runtime view
(1) Code
Building the code
Pattern: put all dependencies in a private repository
Immutable and disposable infrastructure
Pattern: immutable infrastructure
(2) Configuration
(3) Transparency
(3.1) Designing for transparency
watch for coupling of monitoring framework and application code
"should be like an exoskeleton"
Keep out of instance
What metrics should trigger alerts
Where to set thresholds
How to roll-up state variables into an overall system health status
(3.2) Enabling technologies
Black-box - outside the process
White-box - inside the process
(3.3) Logging
Log locations
Security: separate folder
Code: read only
Logs: write
Configurable
Physical machines - Separate drive
Containers - stdout
Log levels
Primary consumer
Not developer
Administrator / ops engineer
Heuristic: log levels ERROR & SEVERE should require action by the operators
Pattern: no debug logs in prod
Noise
Confusion
Human factors
"logs are a Human-computer interface"
Users = ops engineer
Pattern: add trace information in logs (eg. User ID, session, transaction, random number)
Heuristic: log level INFO for interesting state transitions
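A minimal sketch of the logging notes above, assuming Python's standard logging module. The logger name "orders", the field names and build_logger are hypothetical; the stream defaults to stdout as in the container case, and the level is INFO so no debug lines reach production.

```python
import logging
import sys
import uuid


class TraceFilter(logging.Filter):
    """Attach trace fields to every record so related lines can be found together."""

    def __init__(self, user_id, session_id):
        super().__init__()
        self.user_id = user_id
        self.session_id = session_id
        # Random per-transaction token (the "random number" from the notes).
        self.txn_id = uuid.uuid4().hex[:8]

    def filter(self, record):
        record.user_id = self.user_id
        record.session_id = self.session_id
        record.txn_id = self.txn_id
        return True  # never drop the record, only annotate it


def build_logger(user_id, session_id, stream=sys.stdout):
    """Return a logger that writes INFO and above, with trace fields, to `stream`."""
    logger = logging.getLogger(f"orders.{session_id}")
    logger.setLevel(logging.INFO)  # DEBUG is filtered out: no debug logs in prod
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s user=%(user_id)s "
        "session=%(session_id)s txn=%(txn_id)s %(message)s"
    ))
    handler.addFilter(TraceFilter(user_id, session_id))
    logger.handlers = [handler]
    return logger
```

Usage: build_logger("u42", "s1").info("order state: PENDING -> PAID") emits one INFO line for an interesting state transition, which an operator can correlate by user, session or transaction.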
(3.4) Instance metrics
(3.5) Health checks
Ch09 Interconnect
DNS
Service discovery with DNS
Load balancing with DNS
Global server load balancing (GSLB)
Load balancing
Software LB
Hardware LB
Health checks
Stickiness
Partitioning ("content based routing")
Demand control
How systems fail
Preventing disaster ("load shedding")
Network routing
Discovering services
(1) DNS
When services don't change often
Load balancing with DNS
Round Robin (Model: OSI layer 7 - application)
:) cheap
:( all instances must be routable
:( control in client hands
:( no health check
:( load is not distributed (traffic!=load)
:( clients cache DNS results
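To illustrate the "traffic != load" point, here is a small simulation. It is only a sketch: the client names, request counts and the 192.0.2.x addresses are made up, and real resolvers add caching on top of this.

```python
from collections import Counter
from itertools import cycle


def round_robin_assign(clients, addresses):
    """Hand each resolving client the next address in rotation."""
    rotation = cycle(addresses)
    return {client: next(rotation) for client in clients}


def load_per_address(assignment, requests_per_client):
    """Sum the requests that end up on each address."""
    load = Counter()
    for client, address in assignment.items():
        load[address] += requests_per_client[client]
    return load


# Hypothetical workload: one heavy client and three light ones.
requests = {"a": 1000, "b": 10, "c": 10, "d": 10}
assignment = round_robin_assign(list(requests), ["192.0.2.1", "192.0.2.2"])
load = load_per_address(assignment, requests)
```

Each address is handed to two clients, yet 192.0.2.1 receives 1010 requests and 192.0.2.2 only 20, because the rotation distributes clients rather than work.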
Global server load balancing (GSLB) with DNS
Multi-region
Sends requests to all regions, closest will be fastest to respond
:) great fit for DNS
Architecture pattern: GSLB + regional load balancers
(2) Load Balancing
Virtual IPs (VIPs)
each VIP maps to a pool
policy information
algorithm
what health checks
session stickiness
how to handle incoming requests when no pool member available
Listen with DNS name of the VIP
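The VIP-to-pool mapping and its policy information could be modelled roughly as below. This is a sketch under assumptions: Member, VirtualIP, the round-robin algorithm and the fallback text are illustrative choices, not how any particular load balancer is configured.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Member:
    address: str
    healthy: bool = True  # would be updated by the configured health check


@dataclass
class VirtualIP:
    dns_name: str  # clients connect to the DNS name of the VIP
    pool: list     # pool members behind this VIP
    sticky: bool = False
    fallback: str = "503 Service Unavailable"  # policy when no member is available
    _sessions: dict = field(default_factory=dict)
    _counter: count = field(default_factory=count)

    def route(self, session_id):
        """Return the address that should serve this session, or the fallback."""
        live = [m for m in self.pool if m.healthy]
        if not live:
            return self.fallback
        if self.sticky and session_id in self._sessions:
            member = self._sessions[session_id]
            if member.healthy:
                return member.address
        # Algorithm assumed here: plain round robin over healthy members.
        member = live[next(self._counter) % len(live)]
        if self.sticky:
            self._sessions[session_id] = member
        return member.address
```

With sticky=True, repeated calls for the same session return the same member until that member fails its health check.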
Software load balancing (reverse proxy)
Model: OSI layer 7 - application layer
:) cheap
:) easy to operate
:( limited max scale
(optional) caching
Heuristic: when software load balancing becomes insufficient
Hardware load balancing
Model: OSI layer 4 through 7
Not just http or ftp
:) fail over (divert traffic to another site)
Works well with Architecture pattern: GSLB + regional load balancers
Heuristic: when to run your own software load balancer
(3) Demand Control
2 strategies
Refuse work
Scale out
How systems fail (request-reply)
Sockets
Each request claims a socket on each tier it passes through
Service slows down under heavy load
Fewer sockets available
I/O bandwidth through the NICs
Time for packets to travel through the wire
Buffers
Write buffers full - write calls will block
Read buffers full - connection will stall
Listen queue (server socket)
Preventing disaster
Pattern: use SLA to determine when to start load shedding
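One way to read the pattern: compare recent response times with the SLA and refuse work once they exceed it. The sketch below assumes a simple moving average over a fixed window; LoadShedder, the window size and the choice of an average rather than a percentile are all assumptions.

```python
from collections import deque


class LoadShedder:
    """Decide whether to refuse new work based on recent response times."""

    def __init__(self, sla_seconds, window=100):
        self.sla_seconds = sla_seconds
        self.samples = deque(maxlen=window)  # most recent response times only

    def record(self, response_seconds):
        self.samples.append(response_seconds)

    def should_shed(self):
        if not self.samples:
            return False  # no evidence yet, so accept work
        average = sum(self.samples) / len(self.samples)
        return average > self.sla_seconds
```

A caller would check should_shed() before accepting a request and answer with an error such as HTTP 503 when it returns True. Samples have to keep arriving while shedding (for example from health probes or a trickle of admitted requests), otherwise the window never recovers.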
(4) Network Routing
Nearby servers - subnet address
Further
Default gateway
Static route definitions
Software-defined networking
:) Virtualize infrastructure
:) container based infrastructure
(5) discovering services
When?
Too many services for DNS
Highly dynamic environment
Parts
Services announce themselves
Lookup
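The two parts, announce and lookup, can be sketched as an in-memory registry. The TTL-based expiry is an assumption added so that instances which stop announcing eventually disappear; Registry and its method names are hypothetical, and a real deployment would use a distributed service rather than a single process.

```python
import time


class Registry:
    """Toy service registry: instances announce themselves, clients look them up."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._instances = {}  # service name -> {address: time of last announcement}

    def announce(self, service, address):
        """Called by an instance at startup and then periodically as a heartbeat."""
        self._instances.setdefault(service, {})[address] = self.clock()

    def lookup(self, service):
        """Return addresses that announced within the last `ttl` seconds."""
        now = self.clock()
        entries = self._instances.get(service, {})
        return sorted(a for a, seen in entries.items() if now - seen <= self.ttl)
```

Because instances re-announce on a schedule, this suits highly dynamic environments where addresses change faster than DNS records can reasonably be updated.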
Ch10 Control Plane
How much is right for you?
Mechanical Advantage
system failure, not human error
automation goes really fast
Platform and Ecosystem
Development is Production
System-Wide Transparency
real user monitoring
economic value
the risk of fragmentation
logs and stats
what to expose
Configuration Services
Provisioning and Deployment Services
Command and Control
controls to offer
sending commands
scriptable interfaces
remember this
The Platform Players
The Shopping List
Control plane
all the software and services that run in the background to make production load successful
Heuristic: production software vs control plane
(1) How much is right for you?
(2) Mechanical Advantage
system failure, not human error
automation goes really fast
Heuristic: when to use automation
(3) Platform and Ecosystem
(4) Development is Production
(5) System-Wide Transparency
fundamental questions
Are users receiving a good experience?
Is the system producing the economic value we want?
economic value
Focus - connect revenue to cost
Income (top line)
Performance bottlenecks
Errors preventing people from registering
...
Costs (bottom line)
Infrastructure
Operations
Platforms & runtimes
risk of fragmentation
marketing uses tracking bugs on web pages
sales uses conversions reported in a business intelligence tool
operations analyzes log files in splunk
development uses blind hope and intuition
-> same data through different interfaces
different perspectives
same information system
Logs & stats - what to expose
Model: push vs pull log collection
Heuristic: nominal values for continuous metrics
(6) Configuration Services
distributed databases
(7) Provisioning and Deployment Services
(8) Command and Control
Command and control - live control
When? If services have a long startup time (not containers)
controls to offer
Heuristic: useful controls for Control Plane - live control
sending commands
Admin API over http (small scale)
Security: Different port
Swagger UI
Command queue (large scale)
Prevent dog piling
Random delay
Waves
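Random delay and waves can be combined when a command goes out to many instances at once. The following is a sketch with hypothetical names and parameters (plan_waves, wave_size, wave_gap_seconds, max_jitter_seconds); it only computes a schedule and does not send anything.

```python
import random


def plan_waves(instances, wave_size, wave_gap_seconds, max_jitter_seconds, rng=None):
    """Map each instance to a start offset in seconds.

    Instances are grouped into waves of `wave_size`; each wave starts
    `wave_gap_seconds` after the previous one, and every instance adds its
    own random delay so that a wave does not act in lockstep.
    """
    if rng is None:
        rng = random.Random()
    schedule = {}
    for index, instance in enumerate(instances):
        wave = index // wave_size
        jitter = rng.uniform(0, max_jitter_seconds)
        schedule[instance] = wave * wave_gap_seconds + jitter
    return schedule
```

For example, plan_waves(hosts, wave_size=2, wave_gap_seconds=60, max_jitter_seconds=10) starts two hosts somewhere in the first 10 seconds, the next two between 60 and 70 seconds, and so on, which keeps a shared dependency from being hit by every instance at the same moment.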
Operating costs
"anything you build must either be maintained or torn down"
Choose options that are appropriate
Team size
Scale of workload
consider
Visibility: Logging, tracing, metrics
Dynamic systems? Configuration, provisioning, deployment services
Long-lived? Control mechanisms
(9) The Platform Players
(10) The Shopping List
not every org needs everything on this list
Ch11 Security
(1) OWASP top 10
(2) The Principle of Least Privilege
anti-pattern: run as root (unix) or administrator (windows)
pattern: each major application should have its own user
Heuristic: only use root when
opening a socket on a port below 1024 is the only thing that a UNIX application might require root privilege for
pattern: configure timed builds for any application that isn't still under active development
-> so that you get updates from the base image
(3) Configured Passwords
pattern: keep passwords to prod databases separate from other config files
anti-pattern: passwords in installation directory
pattern: password files readable only to the owner (application's own user)
pattern: Password vaulting - keep passwords in encrypted files
pattern: expunge keys from memory as soon as possible
memory dumps
pattern: disable core dumps in production
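The "readable only to the owner" pattern can be checked at startup. A minimal sketch, assuming a POSIX file system; readable_only_by_owner is a hypothetical helper, and a stricter version would also verify that the file is owned by the application's own user.

```python
import os
import stat


def readable_only_by_owner(path):
    """Return True if group and others have no permissions on `path`."""
    mode = os.stat(path).st_mode
    # S_IRWXG and S_IRWXO cover read, write and execute for group and others.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

An application could refuse to start when this returns False for its password file, which turns a silent misconfiguration into a visible failure.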
This post was referenced in:
- Heuristic: when software load balancing becomes insufficient
- Model: Blameless Postmortem
- Model: OWASP top 10
- OWASP Injection
- OWASP Broken Authentication and Session Management
- OWASP Cross Site Scripting (XSS)
- OWASP Broken Access Control
- OWASP Security Misconfiguration
- OWASP Sensitive Data Exposure
- OWASP Cross-Site Request Forgery (CSRF)
- OWASP Underprotected APIs
- OWASP Insufficient Attack Protection
- OWASP Using Components with Known Vulnerabilities
- OWASP XML External Entities (XXE)
- Architecture pattern: global DNS + regional load balancers
- Pattern: use SLA to determine when to start load shedding
- Pattern: put all dependencies in a private repository
- Pattern: split deployment view and runtime view
- Model: 12-factor app
- Pattern: immutable infrastructure (aka phoenix server)
- Heuristic: log levels ERROR & SEVERE should require action by the operators
- Heuristic: log level INFO for interesting state transitions
- Heuristic: when to run your own software load balancer
- Stability pattern: back pressure
- Stability Pattern: Shed load
- Heuristic: production software vs control plane
- Heuristic: nominal values for continuous metrics
- Heuristic: useful metrics
- Heuristic: when to use automation
- Model: elastic vs scalable
- Pattern: canary deployments
- Heuristic: useful controls for Control Plane - live control
- Model: push vs pull log collection
- Model: platform team
- Model: health check