…according to a Twitter post by the Chief Information Security Officer of Grand Canyon Education.
So, does anyone else find it odd that the file that caused everything CrowdStrike to freak out, C-00000291-00000000-00000032.sys, was 42KB of blank/null values, while the replacement file C-00000291-00000000-00000.033.sys was 35KB and looked like a normal, if not obfuscated sys/.conf file?
Also, apparently CrowdStrike had at least 5 hours to work on the problem between the time it was discovered and the time it was fixed.
This file compresses so well. 🤏
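For anyone curious how well, here is a quick sketch using Python's standard zlib module. It assumes the file really is 43,008 null bytes (42 × 1024, matching the size quoted above); the actual file was not inspected here.

```python
import zlib

# Assumption: 42 KB of null bytes, as described in the thread.
blob = bytes(43008)

# Level 9 asks zlib for its strongest compression setting.
packed = zlib.compress(blob, 9)

# A long run of identical bytes collapses to a tiny fraction of its size.
print(len(blob), "->", len(packed), "bytes")
```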
The fact that a single bad file can cause a kernel panic like this tells you everything you need to know about using this kind of integrated security product. Crowdstrike is apparently a rootkit, and windows apparently has zero execution integrity.
This is a pretty hot take. A single bad file can topple pretty much any operating system depending on what the file is. That’s part of why it’s important to be able to detect file corruption in a mission critical system.
Imagine the world if those companies were using Atomic distribution and the only thing you would need to do is to boot the previous good image.
ohno_anyway.png
How can all of those zeroes cause a major OS crash?
If I send you on stage at the Olympic Games opening ceremony with a sealed envelope
And I say “This contains your script, just open it and read it”
And then when you open it, the script is blank
You’re gonna freak out
This guy ELI5s
The funny bit is, I’m sure more than a few people at Crowdstrike are preparing 3 envelopes right now.
Except “freak out” could have various manifestations.
In this case it was “burn down the venue”.
It should have been “I’m sorry, there’s been an issue, let’s move on to the next speaker”
Except since it was an antivirus software the system is basically told “I must be running for you to finish booting”, which does make sense as it means the antivirus can watch the system before any malicious code can get its hooks into things.
I don’t think the kernel could continue like that. The driver runs in kernel mode and took a null pointer exception. The kernel can’t know how badly it’s been screwed by that, the only feasible option is to BSOD.
The driver itself is where the error handling should take place. First off it ought to have static checks to prove it can’t have trivial memory errors like this. Secondly, if a configuration file fails to load, it should make a determination about whether it’s safe to continue or halt the system to prevent a potential exploit. You know, instead of shitting its pants and letting Windows handle it.
Computers have social anxiety.
You’re right of course and that should be on Microsoft to better implement their driver loading. But yes.
The driver is in kernel mode. If it crashes, the kernel has no idea if any internal structures have been left in an inconsistent state. If it doesn’t halt then it has the potential to cause all sorts of damage.
The envelope contains a barrel of diesel and a lit flare
Ah yes. So Windows is the screaming in terror version and other systems are the “oh, sorry everyone, looks like there’s an error. Let’s just move on to the next bit” version.
Nice analogy, except you’d check the script before you tried to use it. Computers are really good at crc/hash checking files to verify their integrity, and that’s exactly what a privileged process like antivirus should do with every source of information.
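A minimal sketch of that kind of integrity check, using SHA-256 from Python's standard library. The payload and the idea of a digest shipped alongside the file are assumptions for illustration, not a description of how CrowdStrike distributes its files.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_hex: str) -> bool:
    """Check data against a digest that was published with it."""
    # compare_digest does a constant-time comparison of the two strings.
    return hmac.compare_digest(sha256_hex(data), expected_hex)


published = b"example channel file contents"  # hypothetical payload
manifest_digest = sha256_hex(published)       # hypothetical shipped digest

corrupted = bytes(len(published))             # same length, all null bytes

print(is_intact(published, manifest_digest))  # the untouched file passes
print(is_intact(corrupted, manifest_digest))  # the zeroed file is rejected
```

Note that this only catches corruption that happens after the digest was computed. If the file was already blank when it was hashed, the check passes happily.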
Maybe. But I’d like to think I’d just say something clever like, “says here that this year the pommel horse will be replaced by yours truly!”
Problem is that software cannot deal with unexpected situations like a human brain can. A computer does exactly what a programmer tells it to do, nothing more, nothing less. So if a situation arises that the programmer hasn’t written code for, then there will be a crash.
Poorly written code can’t.
In this case:
- Load config data
- If data is valid:
  - Use config data
- If data is invalid:
  - Crash entire OS
Is just poor code.
I agree that the code is probably poor but I doubt it was a conscious decision to crash the OS.
The code is probably just:
1. Load config data
2. Do something with data
And 2 fails unexpectedly because the data is garbage and wasn’t checked if it’s valid.
You can still catch the error at runtime and do something appropriate. That might be to say this update might have been tampered with and refuse to boot, but more likely it’d be to just send an error report back to the developers that an unexpected condition is being hit and just continue without loading that one faulty definition file.
If there’s an error, use last known good config. So many systems do this.
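A rough sketch of that pattern: validate the new config before using it, and fall back to the last known good one if validation fails. The file layout here (a `CFG1` header followed by an entry count) is entirely made up for illustration and has nothing to do with the real Falcon file format.

```python
import struct

MAGIC = b"CFG1"  # hypothetical 4-byte header, not a real format


def parse_config(blob: bytes) -> dict:
    """Parse a hypothetical config blob, raising ValueError on bad data."""
    if len(blob) < 8:
        raise ValueError("config too short")
    if blob[:4] != MAGIC:
        raise ValueError("bad magic header")
    (count,) = struct.unpack_from("<I", blob, 4)
    if len(blob) != 8 + 4 * count:
        raise ValueError("length does not match entry count")
    return {"entries": list(struct.unpack_from(f"<{count}I", blob, 8))}


def load_with_fallback(new_blob: bytes, last_known_good: bytes):
    """Return (config, source), falling back if the new blob is invalid."""
    try:
        return parse_config(new_blob), "new"
    except ValueError as err:
        # This is where an error report would go back to the developers.
        print(f"rejecting update: {err}")
        return parse_config(last_known_good), "last_known_good"


good = MAGIC + struct.pack("<I", 2) + struct.pack("<2I", 7, 9)
zeros = bytes(43008)  # 42 KB of null bytes, as described in the thread

config, source = load_with_fallback(zeros, good)
print(source, config)
```

In userland this is cheap. In a kernel driver the same idea applies, but the check has to happen before anything dereferences the parsed data.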
Unfortunately, an OS that covers such cases is a lost monetization opportunity, fuck the system, use a Linux distro, you get the idea. Microsoft makes money off of tech support for people too unversed in computers to fix it themselves.
When talking about the driver level, you can’t always just proceed to the next thing when an error happens.
Imagine if you went in for open heart surgery but the doctor forgot to put in the new valve while he was in there. He can’t just stitch you up and tell you to get on with it, you’ll be bleeding away inside.
In this specific case we’re talking about security for business devices and critical infrastructure. If a security driver is compromised, in a lot of cases it may legitimately be better for the computer to not run at all, because a security compromise could mean it’s open season for hackers on your sensitive device. We’ve seen hospitals held to ransom, we’ve seen customer data swiped from major businesses. A day of downtime is arguably better than those outcomes.
The real answer here is crowdstrike needs a more reliable CI/CD pipeline. A failure of this magnitude is inexcusable and represents a major systemic failure in their development process. But the OS crashing as a result of that systemic failure may actually be the most reasonable desirable outcome compared to any other possible outcome.
That’s a bad analogy. CrowdStrike’s driver encountering an error isn’t the same as not having disk IO or a memory corruption. If CrowdStrike’s driver wasn’t installed the system could still boot. It should absolutely be expected that if the CrowdStrike driver itself encounters an error, there should be a process that allows the system to gracefully recover. The issue is that CrowdStrike likely thought of their code as not being able to crash as they likely only ever tested with good configs, and thus never considered a graceful failure of their driver.
But the OS crashing as a result of that systemic failure may actually be the most reasonable desirable outcome compared to any other possible outcome.
In which case this should’ve been documented behaviour and probably configurable.
This error isn’t intentionally crashing because of a security risk, though that could happen. It’s a null pointer exception, so evidently no static or runtime checks were in place to prevent it or handle it more gracefully. This was presumably a bug in the driver for a long time, then a faulty config file came and triggered the crashes. Better static analysis and testing of the kernel driver is one aspect, how these live config updates are deployed and monitored is another.
If AV suddenly stops working, it could mean the AV is compromised. A BSOD is a desirable outcome in that case. Booting a compromised system anyway is bad code.
You know there’s a whole other scenario where the system can simply boot the last known good config.
And what guarantees that that “last known good config” is available, not compromised and there’s no malicious actor trying to force the system to use a config that has a vulnerability?
I’m gonna take from this that we should have AI doing disaster recovery on all deployments. Tech CEOs have been hyping AI up so much, what could possibly go wrong?
What are the chances that Crowdstrike started using AI to do their update deployments, and they just won’t admit it?
Great layman’s explanation.
Ah, makes sense. I guess a driver would completely freak out if that file gave no instructions and was just like “…”
Well, the file shouldn’t be zeroes
The front of the file fell off
The file is used to store values to use as denominators on some divisions down the process. Being all zeros caused a division by zero error. Pretty rookie mistake, you should do IFERROR(;0) when using divisions to avoid that.
IFERROR(;0)
Maybe they should use a more appropriate development tool for their critical security platform than Excel.
Windows
school districts were also affected… at least mine was.
I can’t imagine how much worse this would have been for global GDP if schools had to be closed for it.
have they ruled out any possibility of a man in the middle attack by a foreign actor?
In the middle of the download path of all the machines that got the update?
Or it being an intentional proof of concept
The CEO made a statement to the effect of “It’s not an attack, it’s just me and my company being shockingly incompetent.” He didn’t use exactly those words but that was the gist.
This was not a cyberattack.
https://www.crowdstrike.com/blog/statement-on-falcon-content-update-for-windows-hosts/
I guess they could be lying, but if they were lying, I don’t know if their argument of “we’re incompetent” is instilling more trust in them.
I’m not a dev, but don’t they have like a/b updates or at least test their updates in a sandbox before releasing them?
It could have been the release process itself that was bugged. The actual update that was supposed to go out was tested and worked, then the upload was corrupted/failed. They need to add tests on the actual released version instead of a local copy.
Could also be that the Windows versions they tested on weren’t as problematic as the updated drivers around the time they released.
I wonder how many governments and companies will take this as a lesson on why brittle systems suck. My guess is most of them won’t… It’s popular to rely on very large third party services, which makes this type of incident inevitable.
If it had been all ones this could have been avoided.
Just needed to add 42k of ones to balance the data. Everyone knows that, like tires, you need to balance your data.
I mean, joking aside, isn’t that how parity calculations used to work? “Got more uppy bits than downy bits - that’s a paddlin’” or something.
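Joking aside too, a single even-parity bit works roughly like the sketch below. This is a generic illustration of parity, not anything CrowdStrike is known to use.

```python
def parity_bit(data: bytes) -> int:
    """Even parity: return 1 if the count of 1 bits is odd, else 0."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def check(data: bytes, stored_parity: int) -> bool:
    """True if the data still matches the parity bit stored with it."""
    return parity_bit(data) == stored_parity


message = b"\x07"        # 0b00000111, three 1 bits
p = parity_bit(message)  # parity bit that makes the total count even
flipped = b"\x06"        # same byte with one bit flipped in transit

print(check(message, p))   # intact data matches its parity bit
print(check(flipped, p))   # a single flipped bit is detected
print(parity_bit(bytes(43008)))  # an all-zero buffer has parity 0
```

So a file of nothing but zeroes is perfectly "balanced" as far as parity goes, which is one reason parity alone is a weak integrity check.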
Assuming they were all calculations, which they won’t have been.
We will probably never know for sure, because the company will never actually release a postmortem, but I suspect that the file was essentially just treated as unreadable, and didn’t actually do anything. The problem will have been that important bits of code, that should have been in there, now no longer existed.
You would have thought they’d do some testing before releasing an update wouldn’t you. I’m sure their software developers have a bright future at Boeing ahead of them. Although in fairness to them, this will almost certainly have been a management decision.
If I had to bet my money, a bad machine with corrupted memory pushed the file at a very final stage of the release.
The astonishing fact is that for security software I would expect all files to be verified against a signature (that would have prevented this issue and some kinds of attacks).
From my experience it was more likely to be an accidental overwrite from human error with recent policy changes that removed vetting steps.
So here’s my uneducated question: Don’t huge software companies like this usually do updates in “rollouts” to a small portion of users (companies) at a time?
That’s certainly what we do in my workplace. Shocked that they don’t.
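A minimal sketch of how a percentage-based staged rollout can be done, assuming each machine has a stable identifier. The host names and the 1% / 25% waves are made up for illustration.

```python
import hashlib


def bucket(host_id: str) -> int:
    """Map a host to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_update(host_id: str, rollout_percent: int) -> bool:
    """A host gets the update once the rollout percentage passes its bucket."""
    return bucket(host_id) < rollout_percent


hosts = [f"host-{n}" for n in range(1000)]  # hypothetical fleet

# Start small, watch for crashes, then widen the rollout.
wave_1 = [h for h in hosts if should_update(h, 1)]
wave_2 = [h for h in hosts if should_update(h, 25)]

print(len(wave_1), "hosts in the first wave,", len(wave_2), "in the second")
```

Because the bucket depends only on the host identifier, every host in an earlier wave is also in every later wave, so widening the percentage never pulls the update back from a machine that already has it.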
Companies don’t like to be beta testers. Apparently the solution is to just not test anything and call it production ready.
Every company has a full-scale test environment. Some companies are just lucky enough to have a separate prod environment.
When I worked at a different enterprise IT company, we published updates like this to our customers and strongly recommended they all have a dedicated pool of canary machines to test the update in their own environment first.
I wonder if CRWD advised their customers to do the same, or soft-pedaled the practice because it’s an admission there could be bugs in the updates.
I know the suggestion of keeping a stage environment was off putting to smaller customers.
I mean yes, but one of the issues with “state of the art AV” is they are trying to roll out updates faster than bad actors can push out code to exploit discovered vulnerabilities.
The code/config/software push may have worked on some test systems but MS is always changing things too.
the smart ones probably do
Which is still unacceptable.
Windows kernel drivers are signed by Microsoft. They must have rubber stamped this for this to go through, though.
This was not the driver, it was a config file or something read by the driver. Now having a driver in kernel space depending on a config on a regular path is another fuck up
isn’t .sys a driver?
So yes, .sys is by convention on Windows for a kernel mode driver. However, Crowdstrike specifically uses .sys for non-driver files and this specifically was not a driver.
Not just drivers, no https://fileinfo.com/extension/sys
What about the Mac and Linux PCs? Did Microsoft sign those too?
Not sure about Mac, but on Linux, they’re signed by the distro maintainer or with the computer’s secure boot key.
So… Microsoft couldn’t have “rubber-stamped” anything to do with the outage.
The outage only affected the Windows version of Falcon. OSX and Linux were not affected.
This time. Last time it did affect Linux. It doesn’t have anything to do with Microsoft.
Sorry to burst your bubble.
what are you on about? who suggested anything about microsoft?
only the Windows version was affected
Ah, a classic off by 43,008 zeroes error.
Every affected company should be extremely thankful that this was an accidental bug, because if crowdstrike gets hacked, it means the bad actors could basically ransom I don’t know how many millions of computers overnight
Not to mention that crowdstrike will now be a massive target from hackers trying to do exactly this
security as a service is about to cost the world a pretty penny.
You mean it’s going to cost corporations a pretty penny. Which means they’ll pass those “costs of operation” on to the rest of us. Fuck.
Either that or cyber insurance
well, the world does include the rest of us.
and it’s not just operational costs. what happens when an outage lasts 3+ days and affects all communication and travel? that’s another massive shock to the system.
they come faster and faster.
You did not just fall out of a coconut tree. You exist in a context of all that came before you.
I’d assume state (or other serious) actors already know about these companies.
Don’t Google solar winds
That one turns out to have been largely Microsoft’s fault for repeatedly ignoring warnings of a severe vulnerability relating to Active Directory. Microsoft were warned about it, acknowledged it and ignored it for years until it got used in the Solar Winds hack.
Holy hell
On Monday I will once again be raising the point of not automatically updating software. Just because it’s being updated does not mean it’s better and does not mean we should be running it on production servers.
Of course they won’t listen to me but at least it’s been brought up.
I thought it was a security definition download; as in, there’s nothing short of not connecting to the Internet that you can do about it.
Well I haven’t looked into it for this piece of software but essentially you can prevent automatic updates from applying to the network. Usually because the network is behind a firewall that you can use to block the update until you decide that you like it.
Also a lot of companies recognize that businesses like to check updates and so have more streamlined ways of doing it. For instance Apple have a whole dedicated update system for iOS devices that only businesses have access to where you can decide you don’t want the latest iOS and it’s easy you just don’t enable it and it doesn’t happen.
Regardless of the method, what should happen is you should download the update to a few testing computers (preferably also physically isolated from the main network) and run some basic checks to see if it works. In this case the testing computers would have blue screened instantly, and you would have known that this is not an update that you want on your system. Although it usually requires a little bit more investigation to determine problems.
It makes me so fuckdamn angry that people make this assumption.
This Crowdstrike update was NOT pausable. You cannot disable updates without disabling the service as they get fingerprint files nearly every day.
Thank God someone else said it. I was constantly in an existential battle with IT at my last job when they were constantly forcing updates, many of which did actually break systems we rely on because Apple loves introducing breaking changes in OS updates (like completely fucking up how dynamic libraries work).
Updates should be vetted. It’s a pain in the ass to do because companies never provide an easy way to rollback, but this really should be standard practice.
You can use AirWatch to deal with Apple devices. Although it is a clunky program it does at least give you the ability to roll things back.
I’ve got a feeling crowdstrike won’t be as grand of a target anymore. They’re sure to lose a lot of clients… at least until they spin up a new name and erase all traces of “crowdstrike”.
That trick doesn’t work for B2B as organizations tend to do their research before buying. Consumers tend not to.
This is why I openly advocate for a diverse ecosystem of services, so not everyone is affected if the biggest gets targeted.
But unfortunately, capitalism favors only the frontrunner and everyone else can go spin, and we aren’t getting rid of capitalism anytime soon.
So basically, it is inevitable that crowdstrike WILL be hacked, and the next time will be much much worse.
Properly regulated capitalism breaks up monopolies so new players can enter the market. What you’re seeing is dysfunctional capitalism - an economy of monopolies.
Sorry no, capitalism is working exactly as intended. Concentration of wealth breaks regulation with unlimited political donations.
You call it unregulated, but that is the natural trend for when the only acceptable goal is the greater accumulation of wealth. There comes a time when that wealth is financially best spent buying politicians.
Until there are inherent mechanisms within capitalism to prevent special interest money from pushing policy and direct regulatory capture, capitalism will ALWAYS trend to deregulation.
You call it unregulated, but that is the natural trend for when the only acceptable goal is the greater accumulation of wealth.
Yes…obviously.
And that IS dysfunctional capitalism.
Until there are inherent mechanisms within capitalism to prevent special interest money from pushing policy and direct regulatory capture
That’s exactly what I’m saying, dude.
This is NOT capitalism working as intended. This is broken capitalism. Runaway capitalism. Corrupt capitalism.
It’s like saying we just need good kings. No, it’s a bad system. Any capitalist system will devolve into corruption and monopoly. No regulations can survive the unavoidable regulatory capture and corruption.
No system is perfect. All systems require some form of keeping power from accruing to the few.
Yes, very insightful.
If you start regulating capitalism, that’s called something else. That would be saying that the markets cannot regulate themselves, and proving one of the basics of capitalism to be a myth.
So I, as well, think capitalism is working as intended, and it sure is based on greed.
Something else, as in what? As long as the means of production is privately owned for profit, it’s capitalism.
would you like an introduction to the almighty red rose?
I know you are trying to be clever but I’m not really in a clever mood rn.
Years ago I read a study about insurance companies and diversification of assets in Brazil. By regulation, an individual insurance company needs to have a diversified investment portfolio, but the insurance market as a whole does not. So despite the diversification of every individual company, the insurance market as a whole was exposed, and the researchers found, iirc, something like 3 banks that, if they failed, could cause a chain reaction that would take out the entire insurance market.
Don’t know why, but your comment reminded me of that.
That’s kind of fascinating, never considered what the results of that kind of regulation can bring without anyone even noticing it at the time. Thanks for a good reading topic for lunch!
Third parties being able to push updates to production machines without being tested first is a giant red flag for me. We’re human … we fuck up. I understand that. But that’s why you test things first.
I don’t trust myself without double checking, so why would we trust a third party so completely?
d’000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Damnit, your comment just crashed the rest of the computers that were still up.
thank you for the visual representation ☺️