Cybersecurity firm Crowdstrike pushed an update that caused millions of Windows computers to enter recovery mode, triggering the blue screen of death. Learn …
Cybersecurity firm Crowdstrike pushed an update that caused millions of Windows computers to enter recovery mode, triggering the blue screen of death. Learn …
There is learning here.
As companies, we put faith in an external entity with goals not identical to our own: a lot of faith, and a lot of control.
That company had the power to destroy our businesses, cripple travel and medicine and our courts, and delay daily work that could include some timely and critical tasks.
This is not crowdstrike’s fault; for the bad code yes, but for the indirect effects of that no. We knew - please tell me we had the brains god gave a gnat and we knew - that putting so much control in the hands of outsiders not concerned or aware of our detailed needs and priorities, was a negligent and foolish thing to do.
The lesson is to do our jobs: we need to ensure we have the ability to make the decisions to which were entrusted, and the power that authority gives us that our decisions when accepted are not threatened by a negligent mistake so boneheaded it’s all but the whim of a simpleton. We cannot choose to manage our part of our organization effectively, no matter how (un)important that organization or part is, and then share control with a force that we’ve seen can run roughshod over it.
It’s exactly like the leopards eating our face, except people didn’t see they were leopards. No one blames the leopards, as they’re just confirming to their nature, eventually.
And no one should blame this company for a small mistake, just because we let the jaws get so close to our faces that we became complacent.
Have you never worked in corporate IT or something? Of course we should blame Crowdstrike, that way we don’t get a sev 1 on our scorecard.
https://youtu.be/v0mwT3DkG4w?t=438
It’s funny that corporate IT will be one of the groups getting the blame in this case, despite it being in most cases not their decision that a company lacks a separate test and production environment. The executives that decided that usually gets off scot free.
Hahah, no doubt, while popping in and out of the outage call repeating the phrases “can I get an update?”, " Is there an ETA on recovery?" and “We need to get this back online”
Unless you have the ability and capacity to develop your own ISA/CPU architecture, firmware, OS, and every tool you use from the ground up, you will always be, at some point, “relying on others stuff” which can break on you at a moments notice.
That could be Intel, or Microsoft, or OpenSSH, or CrowdStrike^0. Very, very, very few organizations can exist in the modern computing world without relying on others code/hardware (with the main two that could that come to mind outside smaller embedded systems being IBM and Apple).
I do wish that consumers had held Microsoft more to account over the last few decades to properly use the Intel Protection Rings (if the CrowdStrike driver were able to run in Ring 1, then it’s possible the OS could have isolated it and prevented a BSOD, but instead it runs in Ring 0 with the kernel and has access to damage anything and everything) — but that horse appears to be long out of the gate (enough so that X86S proposes only having Ring 0 and Ring 3 for future processors).
But back to my basic thesis: saying “it’s your fault for relying on other peoples code” is unhelpful and overly reductive, as in the modern day it’s virtually impossible to do so. Even fully auditing your stacks is prohibitive. There is a good argument to be made about not living in a compute monoculture^1; and lots of good arguments against ever using Windows^2 (especially in the cloud) — but those aren’t the arguments you’re making. Saying “this is your fault for relying on other peoples stuff” is unhelpful — and I somehow doubt you designed your own ISA, CPU architecture, firmware, OS, network stack, and application code to post your comment.
——- ^0 — Indeed, all four of these organizations/projects have let us down like this; Intel with Spectre/Meltdown, Microsoft with the 28 day 32-bit Windows reboot bug, and OpenSSH just announced regreSSHion.
^1 — My organization was hit by the Falcon Sensor outage — our app tier layers running on Linux and developer machines running on macOS were unaffected, but our DBMS is still a legacy MS SQL box, so the outage hammered our stack pretty badly. We’ve fortunately been well funded to remove our dependency on MS SQL (and Windows in general), but that’s a multi-year effort that won’t pay off for some time yet.
^2 — my Windows hate is well documented elsewhere.