The uptime questions every engineering leader should ask this week

In this interview with Help Net Security, Mattias Geniar, CTO at Oh Dear, explains why most outages start quietly, as creeping latency or a slow rise in errors. He argues teams alert on the wrong things: absolute numbers instead of changes, isolated endpoints instead of real user outcomes. He covers alert fatigue, the DNS and certificate failures buried deep in the stack, the risk of leaning on one provider, and the mistakes tired engineers make at 3am. Geniar closes with questions leaders should ask to test their uptime story.


A lot of outages do not announce themselves as outages. They show up as rising latency or a 2 percent error rate that slowly climbs. How do you separate a degradation worth waking someone at 3am from noise, and where do teams draw that line badly?

The mistake I see most often is that teams monitor too specifically. If you’ve got a large fleet of servers, a CPU spike on one box isn’t worth a page. When it’s one server out of a hundred all doing the same workload, a one-off doesn’t warrant a 3am wake-up. But if you only have one server and it spikes to 90%, maybe it does. It depends entirely on the case, and that’s the part people skip.

A better way to look at it is to watch changes and outcomes rather than absolute numbers. Take that 2% error rate. If it’s been flat at 2% for the past year, it doesn’t need alerting. That’s your baseline: it’s what “normal” looks like for your organisation. It only becomes a problem when it moves drastically, up or down, compared to what you measured before. You’re looking at the delta from previous values. If that delta exceeds what’s normal for you, that’s worth alerting on. The number 2% on its own tells you nothing.
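As a rough illustration of alerting on the delta rather than the absolute number, here is a minimal Python sketch. The function name, the three-sigma tolerance, and the `min_spread` floor are assumptions made for this example, not something described in the interview; real systems would tune these against their own history.

```python
from statistics import mean, stdev


def deviates_from_baseline(history, current, max_sigma=3.0, min_spread=0.1):
    """Return True when `current` is far from what `history` says is normal.

    history: past error rates in percent (hypothetical input format)
    current: the latest error rate in percent
    max_sigma: how many standard deviations count as "drastic" (assumed value)
    min_spread: floor for the spread, in percentage points, so a perfectly
        flat history does not turn every tiny wobble into an alert (assumed)
    """
    if len(history) < 2:
        return False  # not enough data to know what "normal" looks like
    baseline = mean(history)
    spread = max(stdev(history), min_spread)
    # A move in either direction counts, up or down.
    return abs(current - baseline) > max_sigma * spread


flat_year = [2.0, 2.1, 1.9, 2.0, 2.0]
print(deviates_from_baseline(flat_year, 2.1))  # a steady 2% stays quiet
print(deviates_from_baseline(flat_year, 6.0))  # a jump away from baseline
```

The absolute value never appears in the decision. A service that has always run at 2% and a service that has always run at 0.1% are both judged against their own history.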

The same goes for how you monitor applications in general. The common approach is to monitor everything in isolation: you’ve got a bunch of API endpoints, so you monitor each endpoint; you’ve got a bunch of pages, so you monitor each page. But an application is the sum of its parts.

What’s far more powerful, on top of all that, is monitoring the actual outcome. Simulate a login and measure how fast it is, whether it works, and what errors you get if it doesn’t. Or do it shopping-basket style: navigate the site, add a few products, go to the basket, expect to land on a Visa payment screen. A full scenario you reproduce end to end, measuring the timings along the way and checking for a specific outcome.
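An end-to-end scenario check of this kind could be sketched as follows. This is only an outline under assumptions: the step names and the `run_scenario` and `scenario_ok` helpers are hypothetical, and each step is a plain callable standing in for whatever browser automation or HTTP client a team actually uses.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    name: str
    seconds: float
    error: Optional[str]  # None when the step succeeded


def run_scenario(steps: list[tuple[str, Callable[[], None]]]) -> list[StepResult]:
    """Run steps in order, timing each one, and stop at the first failure."""
    results = []
    for name, action in steps:
        started = time.perf_counter()
        try:
            action()
            error = None
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
        results.append(StepResult(name, time.perf_counter() - started, error))
        if error is not None:
            break  # later steps depend on this one, so there is no point continuing
    return results


def scenario_ok(results: list[StepResult], expected_steps: int,
                max_total_seconds: float) -> bool:
    """The outcome is good only if every step ran, none failed, and it was fast enough."""
    completed = len(results) == expected_steps and all(r.error is None for r in results)
    return completed and sum(r.seconds for r in results) <= max_total_seconds


# Placeholder steps; in practice each would drive the real site.
def log_in(): pass
def add_products(): pass
def open_basket(): pass
def expect_payment_screen(): pass

results = run_scenario([
    ("log in", log_in),
    ("add products", add_products),
    ("open basket", open_basket),
    ("expect payment screen", expect_payment_screen),
])
print(scenario_ok(results, expected_steps=4, max_total_seconds=10.0))
```

The per-step timings and the first error message are kept so that a failed run says where in the journey the user would have been stopped.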

That changes what’s worth waking up for. If your fleet of a hundred servers has some downtime but it doesn’t affect the outcome, if the application still works for the user, it’s probably not a page. It’s worth telling someone, maybe in a non-urgent channel, that it still needs a look. But if there’s no actual downtime for the user, I don’t think it’s good enough to drag someone out of bed at 3am.

Then there’s where teams draw this line badly. The big one, and I’ve seen it over and over in previous teams, is alert fatigue. People monitor the details and end up drowning in them, and it turns into a non-stop stream of alerts. Eventually you get the boy-who-cried-wolf scenario: you receive so many alerts that the alerts themselves stop meaning anything. They become a bad signal of urgency or priority. Everything is flagged as urgent, so in reality nothing is.

Back when I was CTO at a hosting company we had a standing rule for this. Every Monday the whole team got together and we went over every alert from the past week, urgent and non-urgent alike. For each one we asked out loud: was this alert useful? Did it warrant an action? Is the severity right? The goal was to actively bring down the number of notifications people got off-hours while on call. We worked on catching the early warning signals during office hours instead, so we could prevent the 3am outage with a 2pm routine job.

The same caveat applies to anomaly detection. It’s powerful, but it needs a duration attached. Say your baseline is a 2% error rate and it spikes to 10% for 30 seconds, then normalises. You have to ask yourself: do you need to wake up for that? Is it enough to be notified the next day, do a postmortem, look at what happened and fine-tune your alerts? Or do you genuinely need to get out of bed for a blip that’s already gone? Most of the time, a short-lived spike isn’t a 3am problem. It’s a fix-the-alert-tomorrow problem.
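One way to attach a duration to an anomaly rule is sketched below. The sample format, the 5% threshold, and the five-minute minimum are assumptions picked for illustration; the interview does not prescribe specific values.

```python
def should_page(samples, threshold, min_duration_s):
    """Decide whether a breach has lasted long enough to wake someone.

    samples: list of (timestamp_seconds, error_rate_percent), oldest first
        (hypothetical input format)
    threshold: error rate above which a sample counts as anomalous
    min_duration_s: how long the breach must be ongoing before paging
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp  # first sample of a new breach
        else:
            breach_started = None  # recovered, so the clock resets
    if breach_started is None:
        return False  # currently healthy: review it tomorrow, no page
    return samples[-1][0] - breach_started >= min_duration_s


# A 30-second spike to 10% that normalises on its own.
blip = [(0, 2.0), (30, 10.0), (60, 2.0), (90, 2.0)]
# The same spike, but still going five minutes later.
sustained = [(0, 2.0), (60, 10.0), (120, 11.0), (180, 12.0),
             (240, 10.0), (300, 10.0), (360, 10.0)]

print(should_page(blip, threshold=5.0, min_duration_s=300))
print(should_page(sustained, threshold=5.0, min_duration_s=300))
```

A blip that has already recovered would still be worth logging for the next-day review; it just does not reach the pager.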

DNS misconfigurations and expired TLS certificates still take down large organizations with mature engineering teams. Why do these supposedly solved problems keep causing high-profile incidents, and what does a realistic example look like?

Both DNS and TLS certificates look easy on the surface, but they get complex fast. Everything is a DNS problem, and by extension everything is a certificate problem.

There’s the outside, which is the easy part. You’ve got a web application, it has a domain name, it has DNS, and it has a certificate. That’s trivial to monitor. A simple probe checks that your DNS still resolves and your certificate is still valid, and you’re done.
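For the outside-facing case, a simple probe might look like the sketch below, using only Python's standard library. The function names and the hostname in the usage note are placeholders for this example. Connecting to the host also exercises DNS resolution, since `socket.create_connection` raises an error if the name does not resolve.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days left before a certificate expires.

    not_after: the certificate's notAfter field, e.g. "Jan  5 09:34:43 2018 GMT"
    now: current time as seconds since the epoch
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the peer certificate's notAfter field.

    With the default context, a certificate that is already expired or
    untrusted makes the handshake itself fail with an SSL error, which is
    also a signal worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A caller would combine the two, for example `days_until_expiry(fetch_not_after("example.com"), time.time())` after importing `time`, and warn when the result drops below a chosen number of days. This only covers the front door; the internal certificates and resolvers discussed next need a probe running from inside the environment that trusts them.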

What’s much harder is everything happening behind the scenes. You’ve got Kubernetes clusters that rely on DNS for internal routing, services talking to each other over HTTPS, internal certificates bound to a root that’s only trusted inside that one cluster. It escalates from there. All that software leans on the same two basics, DNS and TLS, but they get harder to monitor the deeper they sit. A failure five layers down, one Kubernetes module that can’t reach its sibling service, might be DNS caching somewhere along the path, or a misconfigured certificate. It’s so deep in the stack that it’s genuinely difficult to reach and watch.

When an outage gets blamed on DNS or on an expired certificate, it’s rarely the one sitting at the front. The front one is monitored, everyone checks that. It’s the one buried deep down that bites you. Teams spend their time and effort securing the application, making sure all traffic is routed properly and encrypted properly, and that work leaves black holes: certificates and resolvers people forgot to monitor because they worked fine the day they were set up, and they’re so far down the stack that reaching that exact spot to monitor it is hard in the first place.

Third-party dependencies mean an enterprise can go down without anything in its own infrastructure breaking. How should teams reason about outages they cannot fix themselves, and what have you seen work for containing the blast radius?

If a third-party dependency is so crucial that your organisation goes down when it goes down, you’ve got no choice but to react. That might mean a 3am wake-up call and getting on the phone with the vendor so they’re aware and working on it. What I’d hope is that the supplier already knows and is already on it, but you might not know that from the outside. Either way, if you depend on that third party you don’t have much choice but to react, if only so you can tell your own customers you’re aware of the problem, it’s being worked on, and you’re doing everything you can.

The ideal, and this is easier said than done, is not to be too reliant on a single provider. If it’s a cloud provider or your server provider, failover scenarios exist for exactly this reason. But we don’t take a failover lightly. It’s not a single click with zero consequences every time. A human needs to be in the loop to judge how bad the outage actually is.

An example: we had servers in AWS’s Bahrain region, and regional conflict took the entire data center offline. At that point you just have to decide this isn’t recovering in the next few minutes. We weren’t talking minutes, we were talking weeks. So you start a recovery scenario. But it’s always case dependent: it comes down to what’s actually causing the third-party outage.

The only real way to contain that blast radius is to have alternatives. As long as you’ve got options you can activate, or automatically fail over to, you’re no longer too dependent on any one of them. That’s also easier said than done, because the reliance on a handful of major technology providers these days is enormous.

Recovery is its own failure mode. Teams sometimes make an incident worse during the fix, or the failover itself fails over to nothing. What recovery-time mistakes do you see repeated, and how do you verify the safety net holds before you depend on it?

It boils down to testing. You might have the best recovery scenario, the best backups, the best spare capacity in another region. If you don’t regularly test and actually fail over to that location, then the moment you need it, it’s a wild guess.

And the failover to a second region is only half the story. At some point you have to think about going back to the main region, or about treating the backup region as your new primary and reconfiguring the original primary as the secondary. It’s not over once you’ve failed over, and a lot of teams seem to forget that. You either need a rollback scenario that doesn’t lose data, or you redesign the architecture so the secondary becomes the primary.

The other recovery mistake is more human. Someone who’s just been dragged out of bed at 3am to debug a full production outage is running on adrenaline. They feel wide awake, but they’re not. That’s exactly when the silly mistakes happen: overwriting the wrong file, restoring a database to the wrong place, copying something where it doesn’t belong. It’s never malice or bad will. It’s a person trying to do the right thing on a brain that’s still half asleep, and a half-asleep brain makes stupid mistakes. The fix becomes the second outage. The way you guard against that is the same answer: the more of the recovery you’ve turned into a tested, automated, boring procedure, the less you’re asking a groggy human to improvise the dangerous parts at 3am.

The only way I’ve seen teams get really good at this is testing, testing, and more testing. You make the failover such a routine, boring procedure that people stop enjoying it. There’s no thrill, no excitement, it’s just a series of actions, mostly automated, that work. That’s exactly what gives you the confidence it’ll work when you actually need it.

For a CISO or VP of Engineering reading this interview, what uncomfortable question should they put to their team this week that would expose how fragile their uptime story is?

Ask the team what they think is the most fragile piece of the infrastructure, and how it’s currently being monitored. The engineers usually know exactly where the Achilles heel is. It’s the area they spend most of their time in, either putting out fires or sticking band-aids on things before a fire actually breaks out. A team almost always knows where the weaknesses are. And often it’s not out of bad will that it hasn’t been fixed. It’s that the business case for spending money to improve it only becomes obvious once it fails. It’s so much easier to get budget for a failure that needs resolving than for preventing one in the first place.

Another question you can put to your engineers: what could a customer notice before we do? Would a customer see a failing login, or a failure to order a product, before your own engineering team sees it? If so, that’s a gap. Customers, especially the ones in the middle of using your application, are very quick to notice a change in speed or an error message popping up. If they’re your early-warning system, your monitoring isn’t doing its job.

Or ask how often, and when last, the backups were actually tested. Not “the backup exists and the file size looks about right”, but restored, with the data validated. It doesn’t have to take long, but having someone lay eyes on a real restore is hugely valuable for spotting problems before you need it.
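To make "restored, with the data validated" concrete, here is a small sketch that checks a restored database instead of the backup file. It assumes SQLite purely so the example is self-contained; the `validate_restore` helper, the `orders` table, and the minimum row counts are all hypothetical.

```python
import sqlite3


def validate_restore(restored: sqlite3.Connection,
                     expected_min_rows: dict[str, int]) -> list[str]:
    """Return a list of problems found in a restored database (empty means it passed)."""
    problems = []
    # SQLite's built-in consistency check returns the single row "ok" when healthy.
    result = restored.execute("PRAGMA integrity_check").fetchone()[0]
    if result != "ok":
        problems.append(f"integrity_check reported: {result}")
    for table, minimum in expected_min_rows.items():
        try:
            count = restored.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        except sqlite3.OperationalError as exc:
            problems.append(f"{table}: {exc}")  # e.g. the table is missing entirely
            continue
        if count < minimum:
            problems.append(f"{table}: {count} rows, expected at least {minimum}")
    return problems


# Simulate a backup and restore cycle with two in-memory databases.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
source.executemany("INSERT INTO orders (id) VALUES (?)", [(1,), (2,), (3,)])
source.commit()

restored = sqlite3.connect(":memory:")
source.backup(restored)  # stands in for restoring from a real backup file

print(validate_restore(restored, {"orders": 3}))
```

The checks are deliberately crude. The point is that they run against data that has actually been restored somewhere, which is a different claim from "the backup file exists".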

Or look at your team. Look at the person everyone associates with solving problems, and ask: what if they quit tomorrow? What if the team suddenly has to carry on without the one person who always seems to have the answer when things break? Most teams have exactly that bottleneck. You probably consider them your strongest performer, your biggest asset, the one who’s there for every outage and always pulls through. There will be a day when that person leaves, or is simply off sick, and the rest of the team has to cover. That’s when knowledge sharing and shared responsibility stop being nice-to-haves.
