Keep your traffic models updated!
Another outage this year for Gmail (a similar one occurred in February).
Estimation, traffic models and feedback are crucial for handling performance scenarios, for understanding the network characteristics, and especially for live-traffic scenarios such as maintenance procedures.
Some of the lessons they are learning are akin to using a sledgehammer to crack a nut…
Let’s hope that the lessons learned include:
- Keeping the traffic models up to date with respect to both traffic distribution and intensity, to help understand the network characteristics for redundancy and maintenance.
- Updates to the in-service introduction routines.
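
To make the first point concrete, here is a rough sketch of the kind of pre-maintenance check an up-to-date traffic model allows. It is not how Gmail does it; the function names, the 70% utilization target and all of the numbers are assumptions made up for illustration.

```python
import math


def routers_needed(peak_rps, per_router_rps, target_utilization=0.7):
    """Smallest router count that keeps every router at or below the
    target utilization at peak load (all inputs are hypothetical)."""
    usable_rps = per_router_rps * target_utilization
    return math.ceil(peak_rps / usable_rps)


def safe_to_drain(total_routers, drained, peak_rps, per_router_rps,
                  target_utilization=0.7):
    """True if the routers left in service can still carry the peak load."""
    remaining = total_routers - drained
    return remaining >= routers_needed(peak_rps, per_router_rps,
                                       target_utilization)


# Stale model: each router is assumed to handle 2,000 requests/s.
print(safe_to_drain(80, 5, peak_rps=100_000, per_router_rps=2_000))

# Updated model: recent changes made each request more expensive,
# so a router now handles only 1,800 requests/s (invented figure).
print(safe_to_drain(80, 5, peak_rps=100_000, per_router_rps=1_800))
```

With the stale figure the drain looks safe; with the updated one the same drain leaves too few routers. A small error in the per-router load estimate is enough to flip the answer, which is why the model has to follow every change that affects load.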
Amplify’d from gmailblog.blogspot.com
Gmail’s web interface had a widespread outage earlier today, lasting about 100 minutes.
Here’s what happened: This morning (Pacific Time) we took a small fraction of Gmail’s servers offline to perform routine upgrades. This isn’t in itself a problem — we do this all the time, and Gmail’s web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response.
The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online
What’s next: We’ve turned our full attention to helping ensure this kind of event doesn’t happen again.
Read more at gmailblog.blogspot.com

