Rust doesn’t solve the CrowdStrike outage
Look, I like Rust. I really, really do, and I agree with the premise that memory-unsafe languages like C++ should not be used anymore. But claiming that Rust would have prevented the massive outage that the world went through last Friday is misleading and actively harmful to Rust’s evangelism.
Having CrowdStrike written in Rust would have minimized the chances of the outage happening, but it would not have resolved the root cause that allowed the outage to happen in the first place. Thus, it irks me to see various folks blanket-claiming that Rust is the answer. It’s not, and pushing this agenda hurts Rust’s adoption more than it helps: C++ experts can understand the root cause and see that this claim is misleading, deepening the divide in the systems programming world.
So, why won’t Rust help? Let me try to answer that question, but while we are at it, let’s also delve deeper into the causes of the outage. In a way, let me put my SRE hat on and write my own version of the postmortem.
Here is what CrowdStrike’s official “postmortem” has to say about the problem that the industry faced:
On July 19, 2024 at 04:09 UTC, as part of ongoing operations, CrowdStrike released a sensor configuration update to Windows systems. Sensor configuration updates are an ongoing part of the protection mechanisms of the Falcon platform. This configuration update triggered a logic error resulting in a system crash and blue screen (BSOD) on impacted systems.
The sensor configuration update that caused the system crash was remediated on Friday, July 19, 2024 05:27 UTC.
Paraphrasing:
- CrowdStrike (the company) pushed a configuration change.
- The change tickled a latent bug in “the Falcon platform” (the product).
- This bug in Falcon resulted in a crash that brought down Windows.
The first two points are not too strange: configuration changes are “business as usual” for any online system and having changes tickle bugs in code is unfortunately common. In fact, the majority of outages (citation needed) are caused by human-initiated configuration changes.
Obviously, we should ask why the bug existed and how it could be remediated in order to increase the robustness of the product. But we must also question the third point: why was the bug able to bring down the whole machine? And, much more importantly, why did this bug bring down so many systems across the world?
Let’s start with the first question: what was the nature of the bug in Falcon?
Easy: there was a logic bug in the “Channel Files” (aka configuration files) parser that, given some invalid input, made the code try to access an invalid memory position. The details are really not interesting: this could have been a null pointer dereference, a general protection fault, or whatever. The point is: the crash was triggered by an invalid memory access.
And this is where some Rust enthusiasts will zero in and say “Ah-HAH! We got you, fools. If the code had been written in Rust, this bug would not have existed!” And, you know what, that’s literally true: this specific bug would not have happened.
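To make that concrete, here is a minimal sketch of what I mean, assuming a made-up record format (Falcon’s real Channel File layout is not public as far as I know): in safe Rust, a truncated or corrupt input surfaces as an error value that the caller has to deal with, not as a read through an invalid pointer.

```rust
// Hypothetical sketch: a "channel file" record parser in safe Rust.
// The field names and the binary layout are invented for illustration.

#[derive(Debug)]
struct ChannelRecord {
    rule_id: u32,
    payload: Vec<u8>,
}

#[derive(Debug)]
enum ParseError {
    Truncated,
    BadLength,
}

fn parse_record(input: &[u8]) -> Result<ChannelRecord, ParseError> {
    // Need at least 4 bytes of rule id plus 4 bytes of payload length.
    if input.len() < 8 {
        return Err(ParseError::Truncated);
    }
    let rule_id = u32::from_le_bytes(input[0..4].try_into().expect("length checked above"));
    let len = u32::from_le_bytes(input[4..8].try_into().expect("length checked above")) as usize;
    let rest = &input[8..];
    if rest.len() < len {
        return Err(ParseError::BadLength);
    }
    Ok(ChannelRecord { rule_id, payload: rest[..len].to_vec() })
}

fn main() {
    // A truncated update file yields an error value, not a wild pointer read.
    println!("{:?}", parse_record(&[0x01, 0x02]));
}
```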
But so what? Avoiding this specific type of error would just have delayed this outage until another time when a different class of error that Rust doesn’t protect against happened. (This is the exact same argument I raised in a critique of forbidding assertions in Go back in 2018 by the way.) Focusing on the memory bug is missing the forest for the trees because of the nature of what Falcon is.
OK so, what is Falcon?
Falcon is “malware… but for the good guys”. Oops, I mean: Falcon is an endpoint security system. Falcon is a product typically installed on corporate machines so that the security team can detect and mitigate threats in real time (while monitoring the actions of their employees). There is some value in this: most cyberattacks (citation needed again) start by compromising corporate machines, often via social engineering practices.
This type of product must have control over the machine. It must be able to intercept all user file and network operations to scan their content. And it must be tamper-proof so that “savvy” corporate users don’t disable it when they read sketchy online instructions to fix their broken WiFi in an attempt to (shudder) not have to create IT tickets.
How can you implement a product like Falcon? The easiest approach, and the approach that Windows encourages, is to write a kernel module. It easily follows that Falcon is a kernel module and, as such, it runs in kernel space. This means that any mess up in Falcon’s code can damage the running kernel and, in turn, bring the whole system down.
And when I say “any mess up”, I really mean it. The kernel is not only brought down by memory errors, and you don’t have to “crash the kernel” to make a machine unusable. Think of a deadlock preventing the kernel from making forward progress. Think of a logic error in the open(2) system call handler preventing user space from opening any file later on. Think of mistakenly mapping file system code as pageable when it should never be paged out. Think of writing an unbounded recursive algorithm that exhausts the kernel’s stack. Think of… an innocent buggy unwrap() call if the code actually used Rust.
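To illustrate that last point with another made-up example (the function and the configuration field are entirely invented): Rust’s type system tells you the lookup can fail, but nothing stops a developer from “handling” that with an unwrap() that aborts on malformed input.

```rust
// Hypothetical sketch: even in Rust, a careless unwrap() on untrusted input
// is a logic bug that aborts execution. In user space that's a dead process;
// in code running with kernel privileges, a panic is exactly the kind of
// "mess up" that takes the whole machine with it.

fn threat_signature(config: &str) -> u64 {
    // BUG: assumes the field is always present and always a valid number.
    config
        .lines()
        .find(|l| l.starts_with("signature="))
        .unwrap()                       // panics if the field is missing
        .trim_start_matches("signature=")
        .parse()
        .unwrap()                       // panics if it is not a number
}

fn main() {
    // A truncated or corrupt update file triggers the panic path.
    println!("{}", threat_signature("version=42\n"));
}
```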
There are just too many ways to destroy the kernel’s stability, which is why claiming that Rust would have prevented this incident irks me. Rust’s memory-safety only addresses one type of crash. It is true, though, that the focus on correctness in the Rust ecosystem via strong types could minimize the chances of other types of logic bugs. But… as much as we want to reach perfection, we must accept that bugs happen, and asserting that Rust is the only answer to the problem is as negligent as sticking to C++.
And, you know, there are many more C++ developers working in kernel space than there are Rust developers who know kernel internals (oops, another citation needed). So, naturally, a large portion of C++ developers can smell the rubbish in this claim. Which is unfortunate because it increases animosity between the two communities, which goes against the goal of converting folks to safe languages. Rust folks know that Rust can definitely help make the situation better, but C++ folks won’t buy into it because the arguments they hear don’t resonate with them.
There have also been other claims saying that this would not have happened at all if Falcon was not running in the kernel. OK, that’s a better take, but… it is not crystal-clear that this alone would help either.
As I mentioned earlier, Falcon needs to be as tamper-proof as possible to prevent malware from interfering with it and compromised users from trying to disable it. If malware or humans could easily do that, the product would be useless.
Now, the Windows kernel could definitely forbid kernel modules for anything similar to Falcon. Instead, the kernel could expose a bunch of APIs so that user space applications could hook into them to expose similar functionality. And you know what? Microsoft did try to move Windows in that direction, but the antivirus mob threatened to sue on antitrust grounds and the whole thing went nowhere. So we are stuck with a less secure system because antivirus companies need to be able to sell their obnoxious products.
But let’s step aside from that dumpster fire for a moment. Even if Falcon ran in user space and communicated with the kernel via controlled APIs… would that be sufficient to prevent a system malfunction? Note that these APIs would need to be tamper-proof too. Imagine if, say, you wanted this user space driver to validate every binary before it’s executed by the kernel. If you make the kernel require an answer from the user space driver on every execution, and that driver is faulty, the system won’t be able to execute any program anymore. And if you make it optional for the kernel to communicate with the driver so that the kernel can tolerate a crashing driver, then you open up a path for malware to crash the driver first and then infiltrate the system.
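A hedged sketch of that dilemma, with invented names and types, might look like this: the kernel has to pick a policy for the case where the user space driver never answers, and both obvious policies are bad.

```rust
// Hypothetical sketch of the fail-closed vs. fail-open dilemma described
// above. None of these names correspond to a real Windows or Falcon API.

enum Verdict {
    Allow,
    Deny,
}

enum DriverReply {
    Answered(Verdict),
    Crashed, // the user space security driver died or timed out
}

// Fail-closed: a dead driver means nothing can execute any more.
fn fail_closed(reply: DriverReply) -> Verdict {
    match reply {
        DriverReply::Answered(v) => v,
        DriverReply::Crashed => Verdict::Deny,
    }
}

// Fail-open: the machine keeps working, but malware only needs to crash
// the driver once to run unchecked afterwards.
fn fail_open(reply: DriverReply) -> Verdict {
    match reply {
        DriverReply::Answered(v) => v,
        DriverReply::Crashed => Verdict::Allow,
    }
}

fn main() {
    // With a crashed driver, fail-closed blocks everything and fail-open
    // lets everything through; neither is obviously right.
    assert!(matches!(fail_closed(DriverReply::Crashed), Verdict::Deny));
    assert!(matches!(fail_open(DriverReply::Crashed), Verdict::Allow));
}
```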
Thus it is not obvious that “just moving to user space” is the answer here either. Oh, by the way, Apple has been moving macOS in the direction of disallowing kernel modules for years, providing alternative APIs to implement things like file systems and antivirus-type software without kernel privileges. They also ship this kind of security daemon in the default system: it runs in user space, validates process executions, and drove me crazy a while back.
If we must accept that bugs exist, that memory-related bugs are not the only ones that can kill a system, and that moving the driver to user space isn’t an obvious fix either… are we doomed? Is there nothing we could do to prevent this from happening?
The above are all things that could (and should!) be done to reduce the chances of a misbehavior happening, but we must accept that the code bug was just the specific trigger this time around and a different trigger could have had similarly nefarious consequences. The root cause behind the outage lies in the process to get the configuration change shipped to the world.
Now, SRE 101 (or DevOps or whatever you want to call it) says that configuration changes must be staged for slow and controlled deployment, and validated at every step. Those changes should first be validated in a very small scale before being pushed to the world later on, and every push should be incremental.
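As an illustration of what that means in practice (the waves, the percentages, and the health check below are all invented for the sake of the example), a staged rollout is essentially a loop like this:

```rust
// Hypothetical sketch of a staged rollout plan of the kind SRE 101 calls
// for: each wave targets a larger slice of the fleet and the push only
// proceeds if health signals from the previous wave look good.

struct Wave {
    name: &'static str,
    fleet_percent: f64,
}

fn healthy(wave: &Wave) -> bool {
    // In a real system this would compare crash/BSOD telemetry from the
    // machines in this wave against a baseline before moving on.
    println!("checking health signals for wave '{}'", wave.name);
    true
}

fn main() {
    let plan = [
        Wave { name: "internal test fleet", fleet_percent: 0.01 },
        Wave { name: "canary customers", fleet_percent: 0.1 },
        Wave { name: "early adopters", fleet_percent: 1.0 },
        Wave { name: "everyone else", fleet_percent: 100.0 },
    ];

    for wave in &plan {
        println!("deploying to {}% of the fleet ({})", wave.fleet_percent, wave.name);
        if !healthy(wave) {
            println!("bad signals after '{}'; halting and rolling back", wave.name);
            return;
        }
    }
}
```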
I find it quite hard to believe that CrowdStrike has nothing in place to validate deployments given the criticality of Falcon and the massive impact that a bug can have (and has had). Maybe they don’t have any process whatsoever, as the external evidence seems to suggest, which would be incredibly negligent, but let’s give them the benefit of the doubt.
Part of why I say this is due to new information from Microsoft with an assessment of how many machines were impacted by the outage. Here is what Microsoft has to say:
While software updates may occasionally cause disturbances, significant incidents like the CrowdStrike event are infrequent. We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines. While the percentage was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services.
Read that: “less than 1% of the machines were impacted”. Does this mean that CrowdStrike does have a staged rollout in which they push configuration changes to just a subset of 1% of the machines worldwide? That doesn’t seem crazy actually and is in line with how many services are deployed, but if that’s their first step in the deployment process, it’s time to reevaluate their approach because this 1% of machines is way too many and has proven to be catastrophic. (This is leaving aside the question of whether 1% refers to the total number of Windows machines in the world or 1% of Windows machines with CrowdStrike on them.)
The question that remains, then, is what sort of testing happened between “the configuration change is done” and “let’s roll it out to 1% of the world”. Did any testing happen at all? That’s where it gets concerning and where we may not get any official information on the matter from CrowdStrike.
So, yes, CrowdStrike’s deployment practices are definitely to blame for the incident. This outage was a process problem, not a code/technology problem.
Before concluding, let’s shift away from CrowdStrike and look at the downstream companies impacted by this bad configuration push. Did you notice how many companies absolved themselves of blame by pointing out that their service was down “due to a third-party IT incident”? OK, yes, that is literally true once again, but pointing fingers is not helpful.
The problem in this situation is the monoculture around CrowdStrike. Certain security certifications require “endpoint protection” as a line item, and it seems perfectly plausible that most IT departments just deploy Falcon due to aggressive marketing on CrowdStrike’s part and call it a day without putting any more thought into it. It’s just not interesting to them to spend any extra time on the issue.
But this raises a problem: dependencies are always a liability, and when you choose to take a dependency on another vendor or another piece of code, you must own your choice. You must model how your own system will fail when the dependency fails, and then, if the risk warrants it, engineer a solution around the potential problem.
I do not know how Falcon works, so I cannot tell if CrowdStrike offered sufficient clarity on how upgrades work to customers or even if their product offers a way to control how the deployment of configuration changes happens within an organization. These are things that will have to be built so that the few customers that care can increase the reliability of their systems.
And finally, let’s talk about the last part of CrowdStrike’s official statement:
This issue is not the result of or related to a cyberattack.
Is that true? We may never know. It is unlikely that this is a cyberattack because crashing systems left and right doesn’t seem to serve a clear purpose. I guess you could claim that an attacker may have wanted to hide something else while the world was scrambling, and that might be possible, but… well, you can imagine all sorts of conspiracy theories.
What is interesting is this part of CrowdStrike’s official postmortem:
Although Channel Files end with the SYS extension, they are not kernel drivers.
Emphasis theirs. Funny, huh? They seem to want to hide the fact that Falcon runs in the kernel. But it does. Knowing that these files are not drivers is not comforting: the kernel module reacts to changes to these files, so these files influence the behavior of the kernel. As a result, it seems plausible that malformed Channel Files could tamper with the kernel in ways more subtle than a blatant crash.
And this situation, my friend, is precisely where Rust would definitely help. Rust’s memory safety would minimize the chances that a malformed configuration file could exploit bugs like buffer overflows to escalate privileges within the kernel, resulting in much more subtle, but dangerous, attacks.
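As a final hedged sketch (the helper and its inputs are hypothetical, not anything Falcon actually does): in Rust, an attacker-controlled offset read from a malformed file hits a bounds check and becomes a handleable error, instead of a silent out-of-bounds read that an exploit can be built on.

```rust
// Hypothetical sketch: bounds-checked access in Rust turns a hostile offset
// into a None (or, with indexing, a deterministic panic), never into silent
// memory corruption that an attacker can steer.

fn read_field(data: &[u8], offset_from_file: usize) -> Option<u8> {
    // A corrupt or hostile file can claim any offset it wants; the bounds
    // check turns that into None instead of a read past the buffer.
    data.get(offset_from_file).copied()
}

fn main() {
    let data = [0xAAu8, 0xBB, 0xCC];
    assert_eq!(read_field(&data, 1), Some(0xBB));
    assert_eq!(read_field(&data, 1000), None); // no overflow, no exploit primitive
}
```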
But that’s not what happened this time around.