Friday, July 19, 2024

I'm a software engineer stranded in the IT outage airport chaos. CrowdStrike broke a cardinal rule of software development and I can't believe the fallout.

Lines at Barcelona-El Prat Airport at the check-in desks amid travel chaos.
Check-in was congested at Barcelona-El Prat Airport.
  • Ahmed Al Sharif has worked in software engineering and consumer electronics for decades.
  • The CTO found himself caught up in the outage-induced travel chaos early on July 19.
  • He said CrowdStrike broke a cardinal rule for developers, and the fallout has been eye-opening.

This as-told-to essay is based on a transcribed conversation with Ahmed Al Sharif, 32, the CTO of Sandsoft, a game developer. Al Sharif was stranded at Barcelona airport on Friday, July 19th, due to the IT outage that's been disrupting travel and other services. The following has been edited for length and clarity.

I began my tech career almost two decades ago as a software engineer. I've been a startup founder and have worked for large companies like EA and Meta.

On Friday, July 19th, I was meant to be flying from Barcelona International Airport to London Heathrow on business at 11 a.m. local time.

I was surprised to arrive and find flights suspended. It was also surprising to learn that there had been significant malfunctions across multiple Windows-based systems at the airport. It took some digging to find out that it was a global event.

Even as an engineer, the outage has been strange to witness. I didn't believe that there was this much dependency on a single piece of third-party software that, if they pushed an update irresponsibly, would cause this much havoc.

I realized the IT outage was happening after I arrived to chaos at the airport

I left for the airport at 8 a.m. Before I got there, there were some early signs of something being wrong.

I couldn't log into my online banking app, and things were a bit slow when I logged into my Outlook-based email, but I chalked it up to my hotel WiFi.

When I got to the airport at around 8:20 a.m., it was packed. The queues were endless. Several check-in desks displayed blue screens, and no one was being processed. I realized something bigger was happening.

I couldn't understand where to queue, and when I asked an airport advisor, they said there was no point queuing now because there was a fault with the ticketing, booking, and reservation systems.

I asked if our airport was the only one affected, and they told me that it was happening everywhere.

That's when my furious Googling began. I realized that the issue was with CrowdStrike, and it was happening globally.

Disruptions at the airport continued throughout the day

Over the course of the day, baggage drop machines, vending machines, and most display boards at the airport weren't working.

Check-ins were being done manually. Before I was given a handwritten paper ticket, I had to prove I had booked a flight for that day by showing staff my emails as proof of payment. Anyone with checked luggage had to bring it to the gate, and airport staff were throwing it into the cargo hold by hand.

A handwritten boarding pass
Al Sharif's handwritten boarding pass.

My own flight has been delayed by six hours, and I'm still in the airport at the time of speaking. It's been annoying and inconvenient, but thankfully, the company is covering my travel, and they understand the situation is out of my control.

I've spoken to people waiting in the airport for 11 hours. People seem pretty frustrated.

I was surprised to learn about CrowdStrike's dominance on Microsoft devices

During my delay, I've been occupying myself by trying to understand the situation more. It's quite intriguing and reveals that we've taken for granted how interconnected our world is and how dependent we are.

I was surprised to learn that the cascading failures were caused by an update CrowdStrike pushed early Friday morning.

CrowdStrike is well-known in the cybersecurity industry, but until today, I don't think anyone was aware of its dominance as a platform on Windows. It didn't cross my mind that a third-party solution could cause this much damage to Windows-based machines.

One of the cardinal rules of software development is that you typically don't want to push a fix later in the week or on a Friday. You'll have less support available to fix any issues, and the weekend is gone.

It feels like cardinal rules were broken. However, CrowdStrike might have evidence to suggest that this is a freak accident, so I don't think it's fair to point fingers too much.

I've seen people comment on this situation about diversification and companies not relying on just one or two providers. I don't think that's realistic.

Windows is commonly used. In any free market economy, the best-performing and most suitable product will dominate the market.

We need more stringent reviewing and understanding of the tools we use on critical infrastructure. The impact here was so big because it affected infrastructure like airports, railway stations, and hospitals. It's going to cost the economy a lot.

Had there been more due diligence or even government regulation, I don't think this would have happened.

I'd expect that after this, government authorities will expect widely used platforms like CrowdStrike to notify them before they push changes like this.

Technology will shape our future, so we need to consider how we regulate it

We haven't fully seen the impact this outage will have, and it's definitely put into perspective how the little devices we keep in our pockets dictate the rhythms of our lives.

I don't think anyone really understood the true scope and presence of CrowdStrike before today.

I still see our future becoming increasingly automated. There will probably be many more situations like this, especially as our society pushes toward more automation. I still think technology is a good thing, but it's made me want to get my bills in the post now and then.

It usually takes a horizon event like this to cause lessons to be learned. I don't think technology is intrinsically good or evil; it depends on how it's used and regulated.

Read the original article on Business Insider


from All Content from Business Insider https://www.businessinsider.com/crowd-strike-outage-travel-chaos-tech-expert-2024-7