Are Outages The Price We Pay for Innovation?
Do the ends of better security solutions justify the occasional Blue Screen of Death?
Introduction
Waking up to news on Friday morning of blue screen outages worldwide, caused not by a cyberattack but by a faulty CrowdStrike update, left me, a CrowdStrike alum, fan, and shareholder, feeling confused and frustrated. Like millions globally, I was on a flight Friday and assumed the worst. While travel was impacted, I reached my destination safely. However, my heart goes out to those whose flights were canceled, the hospitals that lost the ability to deliver critical services, and the countless other organizations that were affected.
It’s hard to turn anywhere on LinkedIn, TikTok, or even the Washington Post without running into a discussion of the catastrophic failure that occurred. It’s no secret that this is one of the worst outages ever. However, Friday’s news, as impactful as it was, was to some extent unsurprising to me. The cybersecurity vendor marketplace has been innovating at breakneck speed for the past decade, and it felt like only a matter of time before something like this happened. What I did not expect was that CrowdStrike, a newly added S&P 500 company and security industry darling, would be the one to cause the problem.
I’d like to venture a different approach to the CrowdStrike outage discussion: is CrowdStrike’s Blue Screen of Death Debacle the price we pay for good security and the innovation required to achieve it?
…but first, a little history and a quick overview of what the heck the Blue Screen of Death is.
What IS the Blue Screen of Death?
The Blue Screen of Death (BSOD) is a notorious error screen displayed by the Windows operating system when a critical system error occurs. This error forces the system to halt completely, preventing potential damage or data loss. Typically, a BSOD shows a blue background with white text, detailing the error and troubleshooting steps.
Why Does BSOD Happen?
BSODs occur due to hardware malfunctions, driver conflicts, or software errors that the operating system cannot recover from. Causes include faulty memory, overheating components, corrupted drivers, or system software bugs. When Windows encounters an unrecoverable error, it triggers a BSOD to protect system integrity.
Example of a Normal BSOD Cause
A common BSOD trigger is a driver conflict. For example, a newly installed hardware device with an incompatible driver can cause the system to crash, forcing Windows to display a BSOD to prevent further damage or instability.
History Lesson: The DAT File
In antivirus software, a DAT file contains the virus definition database that helps identify and neutralize threats. Regular updates to DAT files ensure protection against the latest threats. However, faulty DAT files can cause significant problems, like a BSOD.
CrowdStrike, along with Cylance and others, set out in the early 2010s to eliminate this class of failure by shifting from signature-based antivirus to machine learning-based endpoint detection and prevention. Ironically, a DAT file at McAfee, where CrowdStrike co-founders George Kurtz and Dmitri Alperovitch previously worked, caused a critical outage in 2010 similar to the recent CrowdStrike incident. This resemblance is notable because the CrowdStrike team used such incidents to justify moving to next-generation, machine-learning-driven endpoint security.
CrowdStrike Incident (2024)
What Happened
On July 19, 2024, a faulty software update for CrowdStrike's Falcon endpoint security product caused Windows systems to crash, resulting in the BSOD. This incident highlighted the risks associated with automated software updates and the need for rigorous pre-release testing.
Timeline of the CrowdStrike BSOD
Routine Update Release: At 04:09 UTC, CrowdStrike released a routine sensor configuration update to Windows systems.
Detection and Remediation: Windows hosts that received the update began crashing with BSOD errors and failing to initialize correctly, leading to global disruption. At 05:27 UTC, CrowdStrike remediated the update.
Identification of Fault: CrowdStrike identified the issue as a logic error in a configuration update (a "channel file") that is read by the kernel-level driver used by Falcon. The update targeted newly observed malicious named pipes.
Causes
Faulty Update: "Channel File 291" controls how Falcon evaluates named pipe execution on Windows systems, and the update to it triggered a logic error.
Kernel-Level Driver Issue: Because Falcon's driver runs inside the Windows kernel, the logic error crashed the operating system itself rather than a single application, causing BSODs and boot failures.
Impact
Who Was Impacted:
Global Reach: The incident affected sectors worldwide, including airlines, banks, broadcasters, healthcare providers, and government agencies.
Extent of Impact: Approximately 8.5 million computers were affected globally, which Microsoft estimated at less than 1% of all Windows machines.
Fix Provided
Steps for Resolution:
Boot into Safe Mode or Windows Recovery Environment.
Navigate to the C:\Windows\System32\drivers\CrowdStrike directory.
Locate and delete the file matching C-00000291*.sys.
Reboot the system.
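For readers who think in code, the manual steps above can be sketched in Python. This is an illustration only: CrowdStrike's published workaround targeted channel files matching C-00000291*.sys, but the helper names, the dry-run default, and the idea of running Python at all are my assumptions. On a machine stuck in a boot loop you would do this by hand or with a script from the recovery environment, where Python is unlikely to be available.

```python
from pathlib import Path

# File pattern from CrowdStrike's published workaround; the directory is the
# one named in the steps above. Everything else here is a hypothetical helper.
CHANNEL_FILE_PATTERN = "C-00000291*.sys"
DEFAULT_DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")


def find_channel_files(driver_dir: Path = DEFAULT_DRIVER_DIR) -> list[Path]:
    """Return the files in driver_dir that match the faulty channel file pattern."""
    return sorted(driver_dir.glob(CHANNEL_FILE_PATTERN))


def remove_channel_files(driver_dir: Path = DEFAULT_DRIVER_DIR,
                         dry_run: bool = True) -> list[Path]:
    """List matching files, deleting them only when dry_run is False."""
    matches = find_channel_files(driver_dir)
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling remove_channel_files() with the default dry_run=True only reports what would be deleted, which seems like the safer default for anything that touches a drivers directory.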
Swift Response: Once the problem was identified, CrowdStrike had a fix in place in less than 90 minutes. However, teams across CrowdStrike’s 24,000 customers had to manually fix thousands of Windows machines over the weekend. Many are still working on these fixes today.
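The "less than 90 minutes" figure is easy to check. CrowdStrike's published timeline lists the release at 04:09 UTC and the remediation at 05:27 UTC on July 19, 2024; the short Python sketch below just subtracts the two. The function and variable names are mine.

```python
from datetime import datetime, timezone

# Timestamps from CrowdStrike's published timeline for July 19, 2024.
released = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
remediated = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)


def exposure_minutes(start: datetime, end: datetime) -> float:
    """Minutes between two timestamps."""
    return (end - start).total_seconds() / 60


window = exposure_minutes(released, remediated)
print(f"The faulty update was live for {window:.0f} minutes")  # 78 minutes
```

That 78-minute window only covers how long the bad update was being served. It says nothing about how long it took customers to recover machines that had already crashed.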
The McAfee Incident (2010): The Infamous DAT File
In 2010, a faulty update to McAfee's antivirus software caused Windows XP systems to repeatedly reboot or display the BSOD. This incident, known as the Infamous DAT File incident, highlighted vulnerabilities in update management processes and the importance of thorough testing.
What Happened
An update to McAfee's DAT file, version 5958, incorrectly identified a crucial Windows system file, svchost.exe, as a virus. This led the antivirus to delete or quarantine the file, causing Windows XP systems to become unstable and crash, resulting in BSODs and endless reboot cycles.
This incident affected hundreds of thousands of computers worldwide, disrupting businesses, hospitals, police departments, and schools. McAfee quickly released a fix and provided instructions for manually removing the faulty update. Sound familiar?
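To make the failure mode concrete, here is a minimal Python sketch of signature-based scanning, plus the kind of pre-release check that could catch a definition that flags a trusted system file. It is a toy model, not McAfee's DAT format or scanning engine: the hash-keyed signature table, the helper names, the threat names, and the stand-in file contents are all assumptions for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest, used here as a toy 'signature' for a file."""
    return hashlib.sha256(data).hexdigest()


def scan(files: dict[str, bytes], signatures: dict[str, str]) -> dict[str, str]:
    """Return {file name: threat name} for each file whose hash is in the table."""
    detections = {}
    for name, content in files.items():
        threat = signatures.get(sha256_hex(content))
        if threat is not None:
            detections[name] = threat
    return detections


def flagged_known_good(signatures: dict[str, str],
                       known_good: dict[str, bytes]) -> list[str]:
    """Pre-release check: names of trusted files the update would flag."""
    return sorted(scan(known_good, signatures))


# Hypothetical stand-ins for a trusted system file and a real threat.
known_good = {"svchost.exe": b"stand-in bytes for a trusted system file"}
malware_sample = b"stand-in bytes for a real threat"

good_update = {sha256_hex(malware_sample): "Example.Threat.A"}
faulty_update = {**good_update,
                 sha256_hex(known_good["svchost.exe"]): "Example.Threat.B"}

print(flagged_known_good(good_update, known_good))    # []
print(flagged_known_good(faulty_update, known_good))  # ['svchost.exe']
```

An update that returns a non-empty list here should never ship. Real engines are far more sophisticated than a hash lookup, but the principle of testing new definitions against a corpus of known-good files is the same.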
So How Did CrowdStrike Get Here?
Before Friday’s events, CrowdStrike’s name had been in headlines across the world ever since it went public in 2019. It has been seen, by and large, as the gold standard for any venture-funded (dare I say any) cybersecurity company. The firm grew from a unicorn valuation in May 2017 to one that topped $80 billion at the peak of its share price, and every single vendor in cybersecurity (with the exception of Palo Alto Networks, Cisco, and a handful of others) wants to, aspires to…craves to be “the next CrowdStrike.” With valuations like that, it’s tough to see why not.
Innovation:
With the transition from signature-based antivirus to machine-learning-driven “next generation” antivirus, CrowdStrike delivered the first fully cloud-native, large-scale enterprise cybersecurity capability. They, along with Carbon Black and Cylance, showed that endpoint prevention is more than just blocking files; it requires large analytics databases and a versatile business model to cover threat intelligence, incident response, and endpoint detection and response.
Go to Market:
CrowdStrike’s “land and expand” go-to-market was quite effective in the mid-to-late 2010s, and we’ve seen similar strategies play out at other security vendors. I myself used this approach after leaving CrowdStrike post-IPO to sell Palo Alto Networks’ (formidable) extended detection and response (XDR) capability to enterprises, competing head to head with CrowdStrike. Now, almost every endpoint vendor seeds prospects in hopes of a competitive bake-off to prove technical dominance over the incumbent.
Public/Private Partnerships:
Notable public/private partnerships have also emerged, driving innovation and collaboration in the cybersecurity sector. CrowdStrike has collaborated with the U.S. Department of Homeland Security (DHS) to enhance national security through initiatives like the Cybersecurity and Infrastructure Security Agency (CISA) programs. Similarly, partnerships with the FBI have facilitated the sharing of threat intelligence, leading to more robust and proactive security measures.
Analysis: Vendor Service Level Agreement Comparisons
Considering CrowdStrike’s (five-day and counting) outage, I’ve decided to compare it along two axes: uptime service level agreements (SLAs) from Amazon Web Services (AWS), and a ransomware attack. It’s worth noting that CrowdStrike had, prior to Friday, never experienced a major outage in its 12 years of platform delivery (and 13 years of existence), and has been successful at preventing ransomware and providing incident response services to assist in ransomware recovery for a decade.
Current AWS SLA:
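Amazon Web Services (AWS) publishes SLAs per service rather than a single blanket guarantee. Several core services commit to 99.99% monthly uptime; Amazon EC2 at the Region level is one example. This level of reliability is critical for businesses relying on cloud infrastructure for their operations.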
CrowdStrike's SLA:
For CrowdStrike, a one-day outage and a five-day outage can each be compared to the total time the platform has been running: 12 years, or 4,380 days.
One-day outage: 1 day out of 4,380 days (12 years) equates to approximately 99.98% uptime.
Five-day outage: 5 days out of 4,380 days equates to approximately 99.89% uptime.
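The arithmetic behind those two figures is simple enough to check in a few lines of Python. This sketch uses the same simplification as the numbers above (12 years counted as 12 × 365 days, ignoring leap days); the function name is mine.

```python
DAYS_IN_SERVICE = 12 * 365  # 4,380 days, ignoring leap days


def uptime_percent(outage_days: float, total_days: float = DAYS_IN_SERVICE) -> float:
    """Uptime as a percentage of the total window, given downtime in days."""
    return 100 * (1 - outage_days / total_days)


for outage in (1, 5):
    print(f"{outage}-day outage: {uptime_percent(outage):.2f}% uptime")
# 1-day outage: 99.98% uptime
# 5-day outage: 99.89% uptime
```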
Comparison with AWS SLA:
It could be argued that CrowdStrike’s 90-minute fix response could satisfy an “SLA” of sorts to their customers that’s close to AWS’s. However, I’ve decided to compare a 1-day and a 5-day resolution instead, based on the amount of time it took many organizations to apply the fix on Friday, and on the organizations still struggling with updates today.
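Another way to frame the comparison is as a downtime budget: how many minutes of outage does a 99.99% target allow over a given window? The Python sketch below is my own back-of-the-envelope framing, not how any provider measures an SLA. The 12-year window matches the comparison above; the 30-day window is included because AWS's SLAs are stated in terms of monthly uptime.

```python
MINUTES_PER_DAY = 24 * 60
WINDOW_DAYS = 12 * 365  # same 4,380-day window used above


def downtime_budget_minutes(sla_percent: float,
                            window_days: float = WINDOW_DAYS) -> float:
    """Minutes of downtime an uptime target allows over the window."""
    return window_days * MINUTES_PER_DAY * (1 - sla_percent / 100)


def within_budget(outage_minutes: float, sla_percent: float = 99.99,
                  window_days: float = WINDOW_DAYS) -> bool:
    """True if the outage fits inside the downtime budget for the window."""
    return outage_minutes <= downtime_budget_minutes(sla_percent, window_days)


scenarios = {"90-minute fix": 90, "1-day outage": 1440, "5-day outage": 7200}
budget = downtime_budget_minutes(99.99)
print(f"99.99% over 12 years allows about {budget:.0f} minutes of downtime")
for label, minutes in scenarios.items():
    verdict = "within" if within_budget(minutes) else "over"
    print(f"{label}: {verdict} budget")

monthly = downtime_budget_minutes(99.99, window_days=30)
print(f"99.99% over 30 days allows about {monthly:.2f} minutes of downtime")
```

Measured over 12 years, the 90-minute fix fits inside a 99.99% budget of roughly 631 minutes, while the 1-day and 5-day scenarios do not. Measured over a single 30-day month, the budget shrinks to about 4.3 minutes, so even the 90-minute window would miss it.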
The Irony of Cyber Attacks and Innovation
The closest comparable cyberattack, one that caused disruption on a similar scale to the CrowdStrike event, was WannaCry. Thinking out loud, I wonder how many WannaCrys CrowdStrike and its peers have prevented thanks to the competition-driven innovation in cybersecurity?
WannaCry Ransomware Attack
The WannaCry ransomware attack was a global epidemic that impacted over 200,000 computers in more than 150 countries. The attack used "EternalBlue," an exploit for a vulnerability in Microsoft Windows' SMB file-sharing protocol, which had been developed by the NSA and later leaked by a hacker group called the Shadow Brokers. WannaCry spread rapidly, encrypting files on infected computers and demanding a ransom in Bitcoin to decrypt the data.
Key Facts about WannaCry:
Date: May 12, 2017
Impact: Over 200,000 computers in 150 countries
Mechanism: Used the EternalBlue exploit against a vulnerability in Microsoft Windows
Notable Victims: FedEx, Honda, Nissan, and the UK's National Health Service (NHS)
Resolution: A security researcher discovered a "kill switch" that temporarily halted the spread of the ransomware, though many systems remained encrypted until the ransom was paid or the encryption was reversed.
Comparison with CrowdStrike Outage
CrowdStrike Outage: July 19, 2024
Cause: Issue in a content update for CrowdStrike Falcon sensor
Impact: Approximately 8.5 million computers affected globally
Notable Victims: Various enterprises relying on CrowdStrike services
Resolution: Microsoft and CrowdStrike are still working on recovery as this is written.
Both the CrowdStrike outage and the WannaCry ransomware attack had significant global impacts, disrupting hundreds of thousands to millions of systems and highlighting the vulnerabilities in cybersecurity infrastructures. While the CrowdStrike outage was due to a flawed software update, WannaCry was a direct ransomware attack exploiting a known vulnerability. CrowdStrike’s outage had a fix, and WannaCry’s kill switch was fortunately found soon after the attack began. Both incidents underscore the critical need for robust cybersecurity measures, redundancies, and timely updates to protect against such large-scale disruptions.
Conclusion
So, are BSODs the price we pay for good security? In the case of CrowdStrike's recent misstep, it appears that the cost of rapid innovation and cutting-edge technology is sometimes unforeseen, widespread consequences. The incident was not just a momentary lapse but a significant disruption that affected millions worldwide.
CrowdStrike's swift response, identifying and fixing the issue within 90 minutes, underscores their commitment to resolving such critical failures promptly. However, the broader impact—airlines grounded, hospitals disrupted, and countless businesses halted—serves as a stark reminder of the delicate balance between innovation and reliability.
This situation highlights the inherent risks in cybersecurity, where rapid advancements are necessary to stay ahead of increasingly sophisticated threats. The competition and innovation driven by companies like CrowdStrike have undoubtedly raised the bar for security solutions, but they also introduce new, unexpected risks. The fact that CrowdStrike has helped prevent numerous cyberattacks, including potentially catastrophic ones akin to WannaCry, cannot be overlooked.
Ultimately, while BSODs are an undesirable outcome, I believe they’re a byproduct of the relentless pursuit of better security. The key takeaway here is that rigorous testing, robust disaster recovery planning, and maintaining scalable infrastructure are crucial to mitigating these risks. For organizations and vendors alike, proactive measures and continuous improvement are essential to balancing the benefits of innovation with the need for reliability.
As we continue to push the boundaries of cybersecurity, we must accept that occasional setbacks are part of the journey. The lessons learned from these incidents will only make the industry stronger and better equipped to handle future challenges.
Stay secure, and stay curious, my friends!
Damien
Note: All points of view are solely my opinion and do not endorse or dissuade investment in CrowdStrike, McAfee, or other security vendors noted in this blog post.
References
CrowdStrike. (2024, July 19). Falcon update for Windows hosts: Technical details. CrowdStrike. https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
Lemos, R. (2010, April 21). Defective McAfee update causes worldwide meltdown of XP PCs. ZDNet. https://www.zdnet.com/article/defective-mcafee-update-causes-worldwide-meltdown-of-xp-pcs/
Microsoft. (2017). WannaCry ransomware attack report. Microsoft Corporation.
Amazon Web Services. (n.d.). Service Level Agreement. https://aws.amazon.com/sla/
U.S. Department of Homeland Security. (n.d.). Cybersecurity and Infrastructure Security Agency (CISA) programs. https://www.cisa.gov
Federal Bureau of Investigation. (n.d.). Public-private partnerships in cybersecurity. https://www.fbi.gov