Many people need to shift away from this blaming mindset and think about systems that prevent these things from happening. I doubt anyone at CrowdStrike desired to ground airlines and disrupt emergency systems. No one will prevent incidents like this by finding scapegoats.
That means spending time and money on developing such a system, which means increasing costs in the short term… which is kryptonite for current-day CEOs.
Right. More than money, I say it’s about incentives. You might change the entire C-suite, management, and engineering teams, but if the incentives remain the same (e.g. developers are evaluated by number of commits), the new staff is bound to make the same mistakes.
I strongly believe in no-blame mindsets, but “blame” is not the same as “consequences” and lack of consequences is definitely the biggest driver of corporate apathy. Every incident should trigger a review of systemic and process failures, but in my experience corporate leadership either sucks at this, does not care, or will bury suggestions that involve spending man-hours on a complex solution if the problem lies in that “low likelihood, big impact” corner.
Because when the problem happens (again), they’ll likely be able to sweep it under the rug (again) or will have moved on to greener pastures. What the author of the article suggests is actually a potential fix: if developers (in a broad sense of the word, including POs and such) were accountable (both responsible and empowered), then they would have the power to say No to shortsighted management decisions (and/or deflect the blame in a way that would actually stick to whoever went against an engineer’s recommendation).
CrowdStrike ToS, section 8.6 Disclaimer
[…] THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION. NEITHER THE OFFERINGS NOR CROWDSTRIKE TOOLS ARE FOR USE IN THE OPERATION OF AIRCRAFT NAVIGATION, NUCLEAR FACILITIES, COMMUNICATION SYSTEMS, WEAPONS SYSTEMS, DIRECT OR INDIRECT LIFE-SUPPORT SYSTEMS, AIR TRAFFIC CONTROL, OR ANY APPLICATION OR INSTALLATION WHERE FAILURE COULD RESULT IN DEATH, SEVERE PHYSICAL INJURY, OR PROPERTY DAMAGE. […]
It’s about safety, but it is truly ironic how it mentions aircraft-related uses twice, plus communication systems (very broad).
It certainly doesn’t inspire confidence in the overall stability. But it’s also general ToS-speak, and may only be noteworthy now, after the fact.
That’s just covering up, like a disclaimer that your software is intended to only be used on 29ᵗʰ of February. You don’t expect anyone to follow that rule, but you expect the court to rule that the user is at fault.
Luckily, it doesn’t always work that way, but we will see how it turns out this time.
Lawful Masses with Leonard French covered this yesterday. He is a copyright attorney. He starts the video with the opinion that the ToS wouldn’t protect CrowdStrike.
I’m pretty sure if a client pays for use in any of that they’ll shut up and take the money. Pretty ethical.
Sure, it is the dev who is to blame, and not the clueless managers who evaluate devs based on number of commits/reviews per day, or the CEOs who think such managers are on top of their game.
Is that the case at CrowdStrike?
I don’t have any information on that. This was more a criticism of where the world seems to be heading.
I’ve been working as a professional programmer for many years and have never ever seen this kind of evaluation, not even once. I’m pretty convinced it’s an exception rather than a rule. And I’d add that it’s probably a very rare exception.
NGL, I am also a second-hand witness to it. This particular example may be rare, but there are a lot of others to the same effect: evaluating performance based on number of lines of code, trying to combine multiple dev responsibilities into a single position, unrealistic deadlines which can usually be met only superficially, managers looking for opportunities to replace coders with AI and further tasking other devs with AI code checking responsibilities, replacing experienced coders with new graduates because they are willing to work more for less. All of these are some form of quantity over quality and usually end up with some sort of crisis.
Yeah, and at the end of the day, it is just as much a very rare exception that a dev actually gets enough time to complete their work at a level of quality they would take responsibility for.
Hell, it is standard industry practice to ship things and then start fixing the issues that crop up.
No, no, listen to me, it’s agile
It’s never a single person who caused a failure.
Yeah, exactly. You’d think they’d run a test suite before pushing an update, or do a staggered rollout where they only push it to a sample of machines first. Just blaming one guy because you had an inadequate UAT process is ridiculous.
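For anyone wondering what a staggered rollout can look like in practice, here is a minimal sketch. It is not how CrowdStrike ships anything; the function names and the `host-…` IDs are made up for illustration. The idea is to hash each machine ID into a stable bucket from 0 to 99 and only push to machines whose bucket is below the current rollout percentage.

```python
import hashlib


def rollout_bucket(machine_id: str) -> int:
    """Map a machine ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(machine_id: str, percent: int) -> bool:
    """True if this machine is inside the first `percent` buckets."""
    return rollout_bucket(machine_id) < percent


def machines_for_stage(machine_ids, percent):
    """Return the machines that receive the update at this rollout stage."""
    return [m for m in machine_ids if in_rollout(m, percent)]


# Hypothetical fleet: push to roughly 1%, then 10%, then everyone.
fleet = [f"host-{i}" for i in range(1000)]
for stage in (1, 10, 100):
    print(stage, len(machines_for_stage(fleet, stage)))
```

Because the bucket depends only on the machine ID, raising the percentage only ever adds machines, so a bad update caught at the first stage never reaches the rest of the fleet.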
Allow me to introduce myself
Microsoft also started blaming the EU. It’s such a shitshow, it’s ridiculous.
OMG the article conflates kennel API calls and kennel drivers such as what crowdstrike actually does. I refuse to read it until the end.
Kennel? You mean kernel?
Oops, my dumb keyboard still hasn’t learned what I do
It’s a systemic, multi-layered problem.
The simplest, least-effort thing that could have limited the scale of the issues is not installing updates automatically, but waiting four days and triggering the install afterwards if no issues have surfaced.
Automatically forwarding updates is also forwarding risk. The larger the impact area, the more worthwhile safeguards are.
Testing/staging or partial successive rollouts could also have mitigated a large number of issues, but they require more investment.
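A hold period like that is a very small amount of logic on the consumer side. This is only a sketch under my own assumptions: `should_install`, `HOLD_PERIOD`, and the `known_issues` count are hypothetical names, and where the issue count comes from (vendor advisories, your own canary machines) is left open.

```python
from datetime import datetime, timedelta, timezone

# The "four days" mentioned above; purely illustrative.
HOLD_PERIOD = timedelta(days=4)


def should_install(released_at: datetime, now: datetime, known_issues: int,
                   hold_period: timedelta = HOLD_PERIOD) -> bool:
    """Install only after the hold period has passed with no reported issues."""
    if known_issues > 0:
        return False
    return now - released_at >= hold_period


released = datetime(2024, 7, 19, tzinfo=timezone.utc)
print(should_install(released, released + timedelta(days=1), known_issues=0))
print(should_install(released, released + timedelta(days=5), known_issues=0))
```

The trade-off is exactly the one raised in the replies: every day of delay is also a day without the newest protections.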
The update that crashed things was an anti-malware definitions update. CrowdStrike offers no way to delay or stage them (they are downloaded automatically as soon as they are available), and there’s good reason for not wanting to delay definition updates, as it leaves you vulnerable to known malware for longer.
And there’s a better reason for wanting to delay definition updates: this outage.
How does a definitions update crash Windows with a BSOD?
Four days for an update to malware definitions is how computers get infected with malware. But you’re right that they should at least do some sort of simple test. “Does the machine boot, and are its files not getting overzealously deleted?”
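Even a crude gate along those lines is easy to express. The sketch below is hypothetical: the check names and the lambdas stand in for real probes against a test machine, and nothing here reflects CrowdStrike’s actual release pipeline. Every named check must pass, and a check that throws counts as a failure.

```python
from typing import Callable, Dict, List


def failed_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check is treated as a failed check
        if not ok:
            failed.append(name)
    return failed


def safe_to_publish(checks: Dict[str, Callable[[], bool]]) -> bool:
    """The update may ship only if no check failed."""
    return not failed_checks(checks)


# Placeholder probes; real ones would inspect a test machine after the update.
checks = {
    "machine_boots": lambda: True,
    "files_not_deleted": lambda: True,
}
print(safe_to_publish(checks))
```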
One of the fixes was deleting a System32 driver file. Is a Windows driver how they update definitions?
The driver was one installed on the computer by the security company. The driver would look for and block threats incoming via the internet or intranet.
The definitions update included a driver update, and most of the computers the software was used on were configured to automatically restart to install the update. Unfortunately, the faulty driver update caused computers to BSOD and enter a boot loop.
Because of the boot loop, the driver could only be removed manually by entering Safe Mode. (That’s the thing you saw about deleting that file.) Then the updated driver, the one they released when they discovered the bug, would ideally be able to be installed normally after exiting Safe Mode.
Reading between the lines, CrowdStrike is certainly going to be sued for damages. Putting a dev on the hook means nobody gets - or pays - anything so long as one guy’s life gets absolutely ruined. Great system.
Yesterday I was browsing /r/programming
:tabclose
I blame the users for using that software in the first place
CrowdStrike’s CEO should go to jail. The corporation should get the death sentence.
Edit: For the downvoters, they for real negligently designed a system that killed people when it failed. The CEO, as an officer of the company, holds liability. If corporations want rights like people, they should be punished when they are grossly negligent. We can’t put them in jail, so they should be forced to divest their assets and be “killed.” This doesn’t even sound radical to me; this sounds like a basic safeguard against corporate overreach.
We don’t blame the leopards who ate the guy’s face. We blame the guy who stuck his face near the leopards.
But how do you identify a leopard when you don’t know about animals and it’s wearing a shiny mask?
That one, then go up the chain of command.