As a cybersecurity specialist, I write about incidents as part of the job. However, when the CrowdStrike outage happened, I found myself hesitant to pen my thoughts. CrowdStrike, a leader in endpoint security, had experienced a significant outage. My reluctance stemmed from a mixture of professional respect and the pervasive fatigue surrounding cybersecurity failures. In this article, I’ll explore why I initially resisted writing about the incident and the valuable insights I gained by listening to the cybersecurity community.
My company cellphone went off, so I picked up my pillow and made my way to the couch. I usually do this because working alerts at 1 AM can wake up some family members. A login from Russia by someone who swears it wasn’t them is scary, but waking up a sleeping baby is scarier. When I actually opened my groggy eyes, I realized it was just an alert from a newsfeed. It wasn’t until I looked at my email that I began to see what everyone else was talking about. The headlines read “Global IT outage,” “Airlines down due to IT outage,” and “Cybersecurity Issue outage.” I had to open my eyes wider to see that CrowdStrike was the reason for the outage. The Y2K nightmare had finally come true, and we were all about to bear witness.
FYI - Y2K isn’t a famous rapper, and if you need more information because you are too young, visit the link below.
https://education.nationalgeographic.org/resource/Y2K-bug/
Local and worldwide news outlets were losing their minds. They were claiming that the whole world was shut down due to the outage. Businesses, banks, airports, hospitals, and more were down. Nobody could do anything, it seemed. You could literally drive downtown wherever you were and see a business sign showing a BSOD (“Blue Screen of Death”). It seemed worse than trying to get a roll of toilet paper during the COVID-19 pandemic.
Then came the yelling. It wasn’t from me or from my one-year-old, who heard me sobbing in the bathroom over all the work I was going to have to do that morning. The yelling came from businesses that could not conduct business and wanted answers from CrowdStrike about what had happened and how to fix it. Please keep in mind that with every minute that passed, a company was losing money, a patient was not getting admitted, and someone’s paycheck was not getting deposited. These are just examples, but you get the point.
Before I start verbally bashing CrowdStrike, let’s take a moment and talk about how truly great they are as a company. Cue the Sarah McLachlan music and grab a Snuggie.
CrowdStrike Holdings, Inc. is a US-based cybersecurity company that offers software and services to businesses to help protect against cyberattacks. Their core technology is the Falcon platform, which combines antivirus, endpoint detection and response (EDR), and threat-hunting services into a single agent. CrowdStrike's cloud-based architecture collects and analyzes more than 30 billion endpoint events per day from sensors in 176 countries.
CrowdStrike's customers include some of the world's largest tech companies, such as Google, Amazon, and Intel, as well as retail giant Target, Formula One team Mercedes-AMG PETRONAS, and the US government. The company has been in business since 2011 and is based in Austin, Texas. CrowdStrike is publicly traded on the Nasdaq as CRWD and is a component of both the Nasdaq-100 and S&P 500.
I could go on and on about CrowdStrike and show you shiny graphs and statistics, but you don’t want that. The big picture to remember is that CrowdStrike is a great product and company. Their products and services alone save people millions of dollars from data breaches and cybercrime. I know my opinion of them is different from that of the support guy who was up at 1 AM fixing computers after an all-night bender at a Metallica concert. Support folks like him are the real heroes of this event.
On Friday, July 19, 2024, CrowdStrike released a content configuration update for the Windows sensor to gather telemetry on possible novel threat techniques. Some have speculated that it was prompted by an updated release of Cobalt Strike, a well-known exploit framework. Updates like this are a regular part of the dynamic protection mechanism of CrowdStrike’s platform. This one crashed Windows systems around the globe, while Mac and Linux hosts were not impacted. Lucky for them.
CrowdStrike said in its report on the incident that the company routinely tests its software updates before pushing them out to customers. But on July 19, a bug in CrowdStrike’s cloud-based testing system (specifically, the part that runs validation checks on new updates prior to release) ended up allowing the software to be pushed out “despite containing problematic content data.” When Windows devices using CrowdStrike’s cybersecurity tools tried to access the flawed file, it caused an “out-of-bounds memory read” that “could not be gracefully handled, resulting in a Windows operating system crash.” In all, the crash took down roughly 8.5 million Windows devices around the world.
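If “out-of-bounds memory read” sounds like jargon, here is a minimal sketch of the idea in Python. It is an illustration only, not CrowdStrike’s code: the function names `validate_record` and `read_field` are hypothetical, and Python is far more forgiving than kernel-mode code, where reading past the end of the data can take the whole operating system down with it.

```python
def validate_record(fields, expected_count):
    """Return True only when the record has exactly the number of fields the reader expects.

    Hypothetical pre-release check: a record that fails here should never ship.
    """
    return len(fields) == expected_count


def read_field(fields, index):
    """Return fields[index], or None when the index falls outside the data.

    Hypothetical defensive read: check the bounds first instead of assuming
    the data is as long as it is supposed to be.
    """
    if 0 <= index < len(fields):
        return fields[index]
    return None
```

The point of the sketch is that there are two chances to catch the problem: once before release, when the content is validated, and once at read time, when the code checks that the thing it is reaching for actually exists.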
Resolving the issue was not an easy task. Some users were able to resolve the problem by rebooting their computers several times. Where the problem persisted, CrowdStrike offered a manual workaround for the blue screen error. The workaround involved booting the system into Safe Mode or the Windows Recovery Environment, a mode where CrowdStrike and other third-party drivers aren’t able to operate, then navigating to the C:\Windows\System32\drivers\CrowdStrike directory and deleting the file matching “C-00000291*.sys”. Because the fix was manual, someone had to be in front of each computer to apply it. This was a very tedious method for companies that had thousands of computers and a limited workforce, especially with angry customers who needed attention. To put this fix in context from a technical person’s perspective, let’s pretend that you are the person who is going to help with it.
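Before we step into that scenario, here is a rough sketch of what the file-deletion step boils down to. In reality this was done by hand from a recovery prompt, where Python would not normally be available, so treat this purely as an illustration of the logic. The function name `remove_faulty_channel_files` and the `dry_run` switch are my own inventions; only the file pattern comes from the workaround described above.

```python
from pathlib import Path

# File pattern taken from the workaround described above.
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(driver_dir, dry_run=True):
    """Find files matching FAULTY_PATTERN directly inside driver_dir.

    With dry_run=True (the default) nothing is deleted and the matches are
    only reported. With dry_run=False each matching file is deleted.
    Returns the sorted list of matching paths either way.
    """
    matches = sorted(Path(driver_dir).glob(FAULTY_PATTERN))
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches
```

On an affected machine, the call would look something like `remove_faulty_channel_files(r"C:\Windows\System32\drivers\CrowdStrike", dry_run=False)`. The dry run is the default on purpose: when you are deleting files out of a drivers directory at 1 AM, you want to see the list before you commit to it.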
You have been asleep for a few hours when you hear your company phone ring. Your boss is on the line, telling you that employees are calling to say that they cannot work. The details: there is a “Blue Screen,” they cannot log in to do their work, and the problem appears to be spreading to more computers. Your boss needs you to come in and troubleshoot the issue right away because, at this point, business is at a standstill and the company is losing money. You arrive on the scene after a quick coffee stop. The issue has escalated to hundreds of computers, including some very critical infrastructure assets. Your boss is on the phone with some very concerned shareholders and C-suite executives. They want answers sooner rather than later.
As you are thanking God that you are wearing your brown pants today, you receive word that your beloved antivirus solution is responsible for the outage. You take a minute to reflect and gather more information on how to proceed. The workaround being discussed is a manual process that puts you in front of every affected computer. You have no idea how many computers are affected, but you have to start somewhere. You are able to fix a handful of machines, but then you remember that you have other branches, and the calls about affected computers keep coming. By the end of the day, hundreds of computers are down, and only three poor technical souls are available to fix them. This is what a lot of companies faced, and it was a miracle that they were able to recover without going out of business. One example is Delta Air Lines, which canceled more than 6,000 flights over a six-day period, impacting more than 500,000 passengers. The damages are put at $500 million, and like other companies, Delta is engaged in a lawsuit to make CrowdStrike pay up.
As the world was shut down by the CrowdStrike outage, I was able to listen to the needs of the cybersecurity community and the world. There were a lot of questions and accusations about the outage. Some people blamed CrowdStrike, and some blamed Microsoft. Some didn’t have time to blame anyone because they were too busy working.
Here is what we know:
So, we ask ourselves: who is to blame for the outage? I feel it is a difficult call to make because there is a lot to consider. CrowdStrike updates their system all the time without any issues. They have been protecting companies and assets for as long as they have been in business. Every other security solution follows a similar update process and has problems as well, even if its vendor says otherwise.
Microsoft has been “keeping the world moving” since Bill Gates put on those bitchin’ glasses and the world fell in love with him. Microsoft is embedded in every single form of communication and technology in our world today. You cannot take a job at a company without seeing some sort of Microsoft product. I can see why someone would blame Microsoft for the outage, but it’s only because Windows is what blue-screened when it went to access the faulty update. Also, as a former ethical hacker, I made a living off of exploiting everything Microsoft. You can Google all the exploits available for Windows products and the list of vulnerabilities that seems to keep stacking up.
The other factor that was discussed is that not all companies were affected. Sounds good, right? Does it now? One such example is Southwest Airlines. According to a story that made the rounds online (and has since been disputed), Southwest avoided the outage because it’s still running Windows 3.1. In that telling, the fourth-largest US airline remained unaffected because its operating system hasn’t been updated in decades. But let’s be mad at Microsoft and CrowdStrike and not at the fact that some companies today that are in charge of airlines, banks, hospitals, and more can’t even keep their systems up to date. It’s not a problem, really; they are just in charge of travel, money, and your lives.
The true blame lies with companies not being ready when something horrible happens. For example, according to reporting, it took FEMA several days to provide sufficient water and supplies to evacuees sheltered at the Superdome after Hurricane Katrina struck the U.S. Gulf Coast in August 2005. We all make mistakes, and people are not very forgiving, but we have to become more aware of how disasters happen. We also need to learn from them to do better for our companies and our customers.