In the process world, large scale is just an opportunity for process improvement.
At small scale, many improvements won't earn back their investment in a reasonable time frame. But at large, or increasing, scale, more and more improvements will. The flip side is that at small scale, exceptional occurrences really are exceptional, whereas at large scale they become common ones, and thus a good opportunity for process improvement.
Great examples from Jason Cohen of WP Engine:
Suppose I told you that on average our servers experience one fatal failure every three years. The kernel panics (the Linux equivalent of the Blue Screen of Death), or both the main and redundant power supply fails, or some other rare event that causes outage. Does that sound like a bad batting average? [...]
So, even if our servers are more hardy than your MacBook Pro, they're taking 100x the beating, so one failure every three years seems pretty reasonable.
But remember, we have 1,000 servers. Three years is about 1,000 days. So that means, on average, every single day we have a fatal server error.
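The arithmetic behind "one failure a day" is easy to check. The sketch below is only an illustration, not WP Engine's model: it assumes each server fails independently at a constant average rate of once every three years (a simple Poisson model), and the function names are hypothetical. It compares a 50-server fleet with a 1,000-server one.

```python
import math

# Assumption: "one fatal failure every three years" per server, i.e. a mean
# time between failures (MTBF) of roughly 3 * 365 days.
MTBF_DAYS = 3 * 365


def expected_failures_per_day(num_servers: int, mtbf_days: float) -> float:
    """Fleet-wide expected failures per day, assuming independent servers
    that each fail on average once every mtbf_days."""
    return num_servers / mtbf_days


def prob_at_least_one_failure(num_servers: int, mtbf_days: float, days: float = 1.0) -> float:
    """Probability of one or more failures in `days`, using a Poisson model:
    P(N >= 1) = 1 - exp(-rate * days)."""
    rate = expected_failures_per_day(num_servers, mtbf_days)
    return 1.0 - math.exp(-rate * days)


for servers in (50, 1000):
    per_day = expected_failures_per_day(servers, MTBF_DAYS)
    p_today = prob_at_least_one_failure(servers, MTBF_DAYS)
    print(
        f"{servers:>5} servers: {per_day:.2f} failures/day, "
        f"about {1 / per_day:.0f} days between failures, "
        f"P(at least one failure today) = {p_today:.0%}"
    )
```

With 50 servers a fatal failure shows up roughly once every three weeks. With 1,000 servers the expected gap shrinks to about a day, which is the point Jason is making: the per-server reliability didn't change, only the fleet size did.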
In other words, the "exception" is now a daily occurrence: the kind of thing you might invest in improving over time. Jason takes it further:
The insight is that scale causes rare events to become regular. Things happen with 1000 servers that you literally never once saw with 50 servers, and things which used to happen once in a blue moon, where a shrug and a manual reboot every six months was in fact an appropriate "process," now happen every week, or even every day.
This is a key insight for startups, growing companies, and operations running at scale, and a great one for improving process.
Many people assume that improving process means automating, or, as Jason puts it, reaching for "automate everything" as the knee-jerk response:
Sure, without automated monitoring we'd be blind, and without automated problem-solving we'd be overwhelmed. So yes, "automate everything."
But some things you can't automate. You can't "automate" a knowledgeable, friendly customer support team. You can't "automate" responding to a complaint on social media, which as our Twitter meister Austin Gunter says is usually a customer's last resort and thus should always be treated as the very legitimate issue that it is. You can't "automate" the recruiting, training, rapport, culture, and downright caring of teams of human beings who are awake 24/7/365, with skills ranging from multi-tasking on support chat to communicating clearly and professionally over the phone to logging into servers and identifying and fixing issues as fast as (humanly?) possible.
He's right. You can't automate away the personal touch or the knowledgeable help. You also can't automate the truly rare technical problems, some of which may never have been seen before.
On the other hand, a business that is scaling, as WP Engine is, can afford to invest in improvements that take care of customers and even improve service despite the challenges of scale. As a WP Engine customer, I can attest that, if anything, customer support has improved since we first became a customer. At BP3, our BP Labs group is investing in improvements to handle the increased scale of our own operations, and as we roll each one out publicly, the common theme will become more obvious. It is nice to have good examples like WP Engine to learn from as we plan our own investments.