One of the big disconnects in infosec lies between people who build infosec products and people who end up using them on the ground.
On the one hand, this manifests as misplaced effort: features that are used once in a product's lifetime get tons of developer effort, while tiny pieces of friction that chafe users daily are dismissed as insignificant. On the other, it leaves a swath of problems that are considered “solved” but really aren’t.
The first problem is why using many security products feels like pulling teeth. This is partially explained by who does what on the development team. The natural division of labor amongst developers means that the super talented developers are working on the hairy-edge-case problems (which by definition are edge-cases) while less experienced developers are thrown at “mundane” / CRUD parts of the system.
But most of your users will spend most of their time on those “mundane” parts of the system. It’s those common paths that are most in need of talented re-thinking.
The second problem is more insidious and is why we have a zillion security products that barely register as speed-bumps to determined attackers / penetration-testers: because shipping the feature is not the same thing as solving the problem.
I recall reading that one of the most trying parts of swimming the English Channel is the final stage, where one can see the shore but it’s still a long way off. Building challenging features sometimes brings the same kind of pain: you find yourself on the wrong end of the Pareto Principle, with the last 20% requiring 80% of the work. When you add pressure and deadlines to this, it’s easy to see why many features will ship at this point. (Some smart process-optimizer might even get a raise for having maximized output per unit of input).
The wrinkle, though, is that the problem itself might not have been solved. (A big part of that is that real-world problems seldom fall to idealised solutions. We can’t just “assume spherical cows”.)
A few years back a major retail corporation invested millions of dollars in a popular brand of threat detection appliances. When the retailer was publicly exploited a few months later (by attackers who had roamed the network for months), it turned out that the threat detection devices (and the team behind them) had been raising alerts periodically, but that those alerts were being ignored. The retailer probably replaced the CISO, and the product vendor walked away, making it clear that it wasn’t their fault.
But wasn’t it?
The argument that security product vendors often make is that their products weren’t deployed properly, or that their alerts were ignored. Quite apart from being inexcusably user-hostile, this goes against the axiom made famous by Theodore Levitt: “People don’t want to buy a quarter-inch drill. They want a quarter-inch hole!”
Customers don’t buy security products because we generate alerts. They buy security products because they want us to stop badness (or catch bad actors). If we are generating ten thousand alerts, and customers can’t separate Alice from Carol, then to butcher analogies, they wanted a quarter-inch hole and we sold them 500 drill-bits and a power cord. (Some assembly required).
Solving “alerting” isn’t easy. There’s a ton of academic work done on Alert-Fatigue and humans have spent a long time trying to figure out how to protect our flocks without crying wolf. Consider this example from our own Canary product:
- We see an attacker brute-forcing a service (let’s say an internal admin interface).
- We alert our customer with: “Attacker at IP:10.11.2.45 tried admin:secret to login to the NewYork-Wan”
Over the next 5 minutes, the attacker runs through her userlist (or password list) trying 10,000 credential pairs. What do we do?
There are a bunch of simple options, but in this case, most are pretty sub-optimal. 10k alerts are a pain, but throwing away the information is also silly (It makes a huge difference to me if the attacker is throwing random usernames and passwords at the site or if she is using real usernames and passwords from my Active Directory).
On our Canary Console, we would generate a single event: “Admin thing being Brute Forced” and we would update the event continually. So you’d have one alert, and when you looked at it, you’d have the extra context you need.
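The coalescing described above can be sketched in a few lines. This is a toy illustration (the class and field names are ours, not Canary’s actual implementation): fold a flood of login-failure events into a single incident per attacker/service pair, updated in place, and keep the context that matters, such as whether the usernames being tried match real accounts.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Incident:
    """One coalesced alert, updated in place as events stream in."""
    source_ip: str
    service: str
    first_seen: float
    last_seen: float
    attempts: int = 0
    usernames: set = field(default_factory=set)


class AlertCoalescer:
    """Fold many login-failure events into one incident per
    (attacker IP, service) pair, instead of one alert per attempt."""

    def __init__(self, known_users):
        # e.g. usernames pulled from the customer's directory
        self.known_users = set(known_users)
        self.incidents = {}

    def ingest(self, source_ip, service, username):
        key = (source_ip, service)
        now = time.time()
        inc = self.incidents.get(key)
        if inc is None:
            # First event for this pair: this is where the single
            # alert would be raised immediately.
            inc = Incident(source_ip, service, first_seen=now, last_seen=now)
            self.incidents[key] = inc
        inc.attempts += 1
        inc.last_seen = now
        inc.usernames.add(username)
        return inc

    def summary(self, source_ip, service):
        inc = self.incidents[(source_ip, service)]
        real = inc.usernames & self.known_users
        return (f"{inc.service} being brute-forced from {inc.source_ip}: "
                f"{inc.attempts} attempts, {len(inc.usernames)} usernames "
                f"({len(real)} match real accounts)")
```

Ten thousand credential attempts become one incident, and the “real accounts vs. random guesses” distinction survives instead of being drowned in alert volume.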
But how best to handle the SMS alerts? We can’t update an SMS with new information the way we can on other channels. We could hold back the alert for a bit, to gain more context, and send one summary alert, but that’s a dangerous trade-off. Do we delay screaming “fire” so we can be more accurate about the spread of the blaze?
We could send n SMSes and then throttle them, but that’s pretty arbitrary too. There are tons of little niggly details to be solved here. (The same is true for sending alerts via Syslog or webhooks.)
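To show why “send n and then throttle” is arbitrary, here is a minimal sketch of that policy. The numbers and names are illustrative assumptions, not Canary’s actual behaviour: send the first few alerts immediately, then suppress individual messages and emit an occasional rollup.

```python
import time


class SmsThrottle:
    """Send the first `burst` alerts immediately; after that, suppress
    individual messages and send one rollup every `summary_every` seconds.
    The defaults are arbitrary -- which is exactly the problem."""

    def __init__(self, send, burst=3, summary_every=300, clock=time.time):
        self.send = send              # callable that actually delivers an SMS
        self.burst = burst
        self.summary_every = summary_every
        self.clock = clock            # injectable for testing
        self.sent = 0
        self.suppressed = 0
        self.last_summary = None

    def alert(self, message):
        now = self.clock()
        if self.sent < self.burst:
            # Within the initial burst: scream "fire" right away.
            self.sent += 1
            self.send(message)
            return
        # Past the burst: count, and only send periodic rollups.
        self.suppressed += 1
        if self.last_summary is None:
            self.last_summary = now
        if now - self.last_summary >= self.summary_every:
            self.send(f"...and {self.suppressed} further alerts since last SMS")
            self.suppressed = 0
            self.last_summary = now
```

Every constant here is a judgment call: burst of 3 or 10? Rollup every 5 minutes or every hour? None of the answers fall out of the code; they fall out of caring about the person holding the phone.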
A few years ago, Slack posted their flowchart for deciding when and where you get a notification if you’re tagged in a message.
It’s clear that lots of thought has gone into this relatively straightforward feature. It’s also obvious that all messaging apps will claim to handle “Notifications”.
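The kind of cascading decision such a flowchart encodes can be sketched as a chain of early returns. To be clear, the conditions below are invented for illustration; they are not Slack’s actual rules, and the real chart is far longer.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Illustrative inputs only -- a real flowchart consults many more."""
    channel_muted: bool
    do_not_disturb: bool
    is_direct_mention: bool
    active_on_desktop: bool
    mobile_push_enabled: bool


def should_notify(ctx: Context) -> str:
    """Return which channel (if any) gets the notification.
    Each early return corresponds to one box in a flowchart."""
    if ctx.channel_muted and not ctx.is_direct_mention:
        return "none"      # muted channels only break through for mentions
    if ctx.do_not_disturb:
        return "none"      # respect DND unconditionally in this sketch
    if ctx.active_on_desktop:
        return "desktop"   # don't double-ping someone already looking at it
    if ctx.mobile_push_enabled:
        return "mobile"
    return "email"
```

The point isn’t this particular logic; it’s that “handle notifications” decomposes into dozens of such branches, each of which someone had to think hard about.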
You can see here where Pareto thinking backfires. Generating an alert when you see an event is easy. You can totally ship the feature at that point. It demos well, and if your sales team is selling off a list of features, they probably want you to move on to a new feature anyway. But this is how we end up here.
It’s easy to bring in a bunch of developers and hand out bonuses for features shipped. It takes dedication and a commitment to solving the problem to keep hacking at a feature until it’s actually useful.
All our natural behaviour pushes us towards new features. When I go on a podcast, the host asks me “What’s new with Canary?” because that’s probably what his listeners want to hear, and a part of me really wants to tell him about all the new stuff we are thinking about, because it shows that we are charging forward. But there’s another, more important side to the work we do, and often what I want to talk about is something much less “exciting”, like “We are working really hard to optimize alerting”. It isn’t catchy, but it might be where the effort is most needed.
Maybe instead of constantly focusing on shiny new features, we all need to focus a little harder on making sure the ones we already built actually work.