Pets vs Cattle

One of the most useful questions I ask myself about a software project is: "Is this a pet or is it cattle?" I've been surprised how often asking this question helps clarify my thinking.

The product maturity lifecycle for SaaS applications tends to start with pets that graduate to cattle. In development you may have only one instance of each of your critical components. When one breaks, you spend the time to debug and diagnose it. This is a productive exercise, since it helps you harden the binaries against faults and also teaches you how to operationalize your system.

But as time goes on and the system matures, individual requests, and then even jobs and machines, become cattle. The first customer requests flowing through the system are magical; the 10-millionth is just another blip on the dashboard. When I worked at Google, the old-hat SREs were incredibly principled about not investigating apparent issues unless they fired an alert. "You could spend all your time trying to figure out why something in the dashboard looks weird. If it isn't hurting customers enough to fire an alert, choose not to care." -- A wise SRE.
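As a rough sketch of that rule in code, here is how I might write it down. Everything in it is an assumption made up for illustration: the `Window` shape, the function names, and the 1% threshold do not come from any real alerting system.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts over one evaluation window (hypothetical shape)."""
    total: int
    failed: int


# Assumed threshold for illustration: alert when more than 1% of requests fail.
ALERT_ERROR_RATE = 0.01


def alert_fires(window: Window, threshold: float = ALERT_ERROR_RATE) -> bool:
    """Return True when the failure rate in the window exceeds the threshold."""
    if window.total == 0:
        return False  # no traffic, so no customers are being hurt
    return window.failed / window.total > threshold


def should_investigate(window: Window) -> bool:
    """Cattle rule: only spend debugging time when an alert fires."""
    return alert_fires(window)
```

The point of the sketch is that the decision to investigate hangs on the alert alone, not on whether a graph looks odd. If something is hurting customers and the alert stays quiet, the fix is to change the alert.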

This isn't an epistle against deep debugging in distributed systems. It's incredibly important to understand exactly how and why things go wrong, especially when they go wrong all at once. But if one job out of 200 goes

(Credit for introducing me to this concept goes to John Truscott Reese.)