This is a repost of a story I posted on Reddit a few years ago.
Story participants
Me: Slazer
Boss: the boss
T1: Tech 1
T2: Tech 2
Backstory
The boss is all about redundancy and backup. If he finds a single point of failure that I have missed, he lets us know and sets a time frame for when he wants it resolved, along with when the failover testing should be done. Because an untested backup is worse than no backup.
To spare you the boring BGP details:
We have 2 data centres in our closest state capital, with transit multihomed through a single level 2 carrier (while not truly multihomed, we have transit of last resort through one of our layer 2 customers).
One day the boss arrives in the office around 10:30 AM, in a huff after hearing of a major outage in a competitor’s network.
Boss: Slazer, did you get our traffic balanced over our 2 transit paths like we discussed a while ago?
Me: Yes, DC1 advertises prefixes 1, 3, 5 and the aggregate. DC2 advertises prefixes 2, 4, 6 and the aggregate.
Boss: What happens when one of the transits fails?
Me: I am advertising the DC2 prefixes out DC1 with the backup BGP community, then doing the same thing for DC1 prefixes over DC2. In the event of a transit failure the upstream has a backup path ready to go.
Boss: And it works?
Me: Yes, last time I tested it was about 2 or 3 months ago and it failed over correctly.
Boss: Why haven’t you tested it sooner?
Me: RANCID hasn’t reported a configuration change since the last test. I only test it if there has been a config change on any of those routers.
Boss: But how can you be sure it still works?
Me: Shall I force a failover now to show it works?
Boss: Sure. (which I assume he said with sarcasm)
Me: Starts logging in to the DC1 core router
T1 sees me wearing my configuration change face.
T1: If you are doing that I am going for a break.
I shut down our transit interface for DC1 and wait for BGP to time out.
After about 10 min with no calls the boss turns around and continues the conversation.
Boss: So when will you be testing the failover?
Me: We are, right now.
Boss: What??!! as his face drops.
Me: You agreed. Plus, this way you know for sure it works, because the phones haven’t started ringing.
T2: Slazer is right. The graphs show an increase in traffic on the DC2 transit.
Boss slides over to T2’s desk. Sure enough, the graph for DC1 transit is reading zero traffic and the graph for DC2 is showing all the transit traffic for the state.
Boss: That doesn’t look like much traffic.
Me: Only about 20-30% of our traffic goes via Transit, the rest goes via the various IXs we are on.
Boss: Who don’t we get via the IX?
Me: Customers of our transit provider who aren’t on any IX, Telstra and Optus as they aren’t on any IX, and any international site that doesn’t use a CDN.
We continue discussing for a good 20 - 30 min about where we get various traffic from and further redundancy in the core networks. During which time T1 returns from his break.
T1: Phones are quiet?
Me: Yes.
Boss: Can you turn the DC1 transit back on?
I walk back to my desk, turn the transit interface back on and see the BGP peer come back up. While T2 and the boss are watching, the graph for DC2 transit drops by about 2/3 and that traffic reappears on DC1 transit.
And from that day the Boss hasn’t asked about the transit failover because now he knows it works.
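For anyone who wants the advertisement scheme from the conversation spelled out, here is a rough Python sketch of it. This is a toy model, not router config: it assumes the upstream maps the backup community to a lower local preference (the numbers 100 and 50 are made up), and the prefix names, `Advert`, `build_adverts` and `best_paths` are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Assumed preference values; the real ones depend on the upstream's community policy.
PRIMARY_PREF = 100
BACKUP_PREF = 50


@dataclass(frozen=True)
class Advert:
    prefix: str
    via: str         # transit link (data centre) carrying this advertisement
    local_pref: int  # preference the upstream assigns to this path


def build_adverts():
    """Each DC announces its own prefixes as primary and the other DC's as backup."""
    dc1_prefixes = ["prefix1", "prefix3", "prefix5"]
    dc2_prefixes = ["prefix2", "prefix4", "prefix6"]
    adverts = []
    for p in dc1_prefixes:
        adverts.append(Advert(p, "DC1", PRIMARY_PREF))
        adverts.append(Advert(p, "DC2", BACKUP_PREF))
    for p in dc2_prefixes:
        adverts.append(Advert(p, "DC2", PRIMARY_PREF))
        adverts.append(Advert(p, "DC1", BACKUP_PREF))
    # Both DCs also announce the aggregate.
    for dc in ("DC1", "DC2"):
        adverts.append(Advert("aggregate", dc, PRIMARY_PREF))
    return adverts


def best_paths(adverts, links_up):
    """Pick the highest-preference path per prefix among links that are still up.

    Ties keep the first advertisement seen; real BGP has further tie-breakers.
    """
    best = {}
    for a in adverts:
        if a.via not in links_up:
            continue  # advertisement withdrawn along with the failed link
        current = best.get(a.prefix)
        if current is None or a.local_pref > current.local_pref:
            best[a.prefix] = a
    return {prefix: a.via for prefix, a in best.items()}


if __name__ == "__main__":
    adverts = build_adverts()
    print("both links up:", best_paths(adverts, {"DC1", "DC2"}))
    print("DC1 transit down:", best_paths(adverts, {"DC2"}))
```

With both links up, each prefix comes in over its own data centre's transit. With DC1 removed from `links_up`, every prefix falls back to DC2, which is the behaviour the graphs showed during the test.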
I’ve worked at the same site for almost 20 years. We’ve never actually cut over to our COOP (Continuity Of Operations Plan) site in all that time. Not once - not even partially.
I’ve recommended top management go to the data center, yank out the plugs and say “Let’s see how it goes!” (after verifying plans are up to date and ready to go). No one is the least bit interested in doing anything like that.