How the AWS outage happened: Amazon blames rare software bug and ‘faulty automation’ for massive glitch

24 October 2025 at 00:46
(GeekWire Photo / Todd Bishop)

A detailed explanation of this week’s Amazon Web Services outage, released Thursday morning, confirms that it wasn’t a hardware glitch or an outside attack but a complex, cascading failure triggered by a rare software bug in one of the company’s most critical systems.

The company said a “faulty automation” in its internal systems — two independent programs that began racing each other to update records — erased key network entries for its DynamoDB database service, triggering a domino effect that temporarily broke many other AWS tools.
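One way a race like this can erase a record is sketched below. It is an illustration only, not AWS's actual automation: the record name, plan IDs, addresses and function names are all invented, and the interleaving is hard-coded rather than produced by real concurrent programs. Two workers write "plans" to the same entry. A delayed worker overwrites the newer plan with a stale one, and a cleanup step then deletes the entry because it looks outdated.

```python
# Hypothetical illustration only: names, plan IDs and addresses are invented,
# and this does not reproduce AWS's internal systems.

def apply_plan(records, name, plan_id, addresses):
    """Unsafe write: does not check whether a newer plan is already applied."""
    records[name] = (plan_id, addresses)

def apply_plan_checked(records, name, plan_id, addresses):
    """Safer write: refuses to replace a newer plan with an older one."""
    current = records.get(name)
    if current is None or current[0] < plan_id:
        records[name] = (plan_id, addresses)

def cleanup_stale(records, name, newest_plan_id):
    """Delete the entry if it was written by a plan older than the newest one."""
    current = records.get(name)
    if current is not None and current[0] < newest_plan_id:
        del records[name]

def run(apply):
    """Replay one fixed interleaving of two workers using the given write function."""
    records = {}
    name = "db.example.internal"
    apply(records, name, 2, ["10.0.0.2"])  # worker B applies the newer plan
    apply(records, name, 1, ["10.0.0.1"])  # delayed worker A applies a stale plan
    cleanup_stale(records, name, 2)        # worker B removes anything older than plan 2
    return records

unsafe_result = run(apply_plan)          # the entry ends up deleted
safe_result = run(apply_plan_checked)    # the newer plan survives
```

In this sketch the version check in `apply_plan_checked` is what keeps the entry alive. Whether AWS's planned safety checks work along these lines is not stated in the company's explanation.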

AWS said it has turned off the flawed automation worldwide and will fix the bug before bringing it back online. The company also plans to add new safety checks and improve how quickly its systems recover if something similar happens again.

Amazon apologized and acknowledged the widespread disruption caused by the outage.

“While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses,” the company said, promising to learn from the incident.

The outage began early Monday and impacted sites and online services around the world, again illustrating the internet’s deep reliance on Amazon’s cloud and showing how a single failure inside AWS can quickly ripple across the web.

Related: The AWS outage is a warning about the risks of digital dependence and AI infrastructure

Tech Moves: Allen Institute gets new exec; AWS leader shifts roles; NuScale names legal officer

23 October 2025 at 21:54
Susan Kaech. (Allen Institute Photo)

Award-winning immunologist Susan Kaech is the new executive vice president of the Allen Institute’s Immunology Moonshot, an initiative that aims to understand the immune system’s role in human health and disease.

Kaech currently leads the NOMIS Center for Immunobiology and Microbial Pathogenesis at the Salk Institute for Biological Studies and will join the Allen Institute in January.

“The appointment comes at a critical time in bioscience when the immune system is regarded as the cornerstone of all diseases and understanding its foundational principles is vital to unlocking new treatments and therapies,” the institute said in a statement.

Kaech’s research includes the investigation of how the immune system remembers infections to develop immunity, T-cell communications, and the role of metabolism in the immune system’s fight against cancer.

Arthur Valdez Jr. (LinkedIn Photo)

— Seattle RFID company Impinj named Arthur Valdez Jr. to its board of directors.

Valdez recently left the role of executive VP of global supply chain and customer solutions at Starbucks and his career includes leadership roles at Amazon, Target and elsewhere.

“Arthur’s expertise transforming and optimizing strategic supply chain and logistics networks for large consumer-facing companies will be invaluable as we continue to advance our vision of connecting every thing,” said Impinj CEO Chris Diorio in a statement.

Jason Bennett. (LinkedIn Photo)

Jason Bennett has taken a new role at Amazon Web Services, shifting from VP of U.S. enterprise to VP of worldwide startups and venture capital. Bennett has been with the company for more than 17 years.

On LinkedIn Bennett shared his fondness for working with startups and said he was eager to return to a position serving that community.

“I’m energized by the opportunity to work alongside our teams to support a thriving startup ecosystem — from founders and VCs, to accelerators, and the broader innovation community,” he said, adding that the work “has a lasting impact on the direction of industries and the future of AI.”

James Canafax. (NuScale Photo)

NuScale Power named James Canafax as chief legal officer and corporate secretary. The Tigard, Ore.-based nuclear energy company is developing small modular reactors.

Canafax has decades of legal experience and joins NuScale from Maritime Partners. Past positions include executive leadership at BWX Technologies, which supplies nuclear components and services.

“[Canafax’s] extensive experience in the nuclear industry, deep familiarity with the regulatory environment and track record of guiding organizations through key growth periods make him uniquely suited to support NuScale at this important moment for our company,” CEO John Hopkins said in a statement.

Elvis Dieguez. (symphonie Photo)

— Seattle entrepreneur Elvis Dieguez is now VP of data science, analytics and platforms for the healthcare startup hims & hers. Dieguez joins the company from symphonie, a Seattle e-commerce marketing platform where he was CEO and co-founder. He was previously at Amazon for more than four years working in business analytics and as a senior manager.

Hims & hers offers a telehealth platform for conditions including sexual health, hair loss, mental health, skincare and weight loss.

“I look forward to leading and working with a ~70 person team who’ve been working hard to make the #healthcare system work for all Americans,” Dieguez said on LinkedIn.

Ariel Brumbaugh. (LinkedIn Photo)

— Biotech startup Synthesize Bio named Ariel Brumbaugh as senior director of business development. In the role, Brumbaugh will help the company partner with biopharma companies interested in using Synthesize’s AI-based research platform to accelerate and de-risk drug development.

Seattle’s Synthesize Bio was founded by leaders from Fred Hutchinson Cancer Center. Last month it announced $10 million in funding from Madrona.

Brumbaugh joined the startup from the San Francisco biotech company Gladstone Institutes.

Sophie Brougham is director of philanthropic operations for the recently launched Clean Economy Project. Nicknamed CleanEcon, the effort includes past employees of the Bill Gates-led Breakthrough Energy and is a policy and advocacy platform promoting clean power.

Prior to Breakthrough, Brougham was with the Paul Allen holding company Vulcan (now known as Vale Group) for more than a decade, where she was a senior manager and led programs including philanthropic and grants management.

— Seattle’s Jake Laes is now executive director of AI Tinkerers, a global network of AI engineers and builders. Laes joined the group from Deel, where he helped facilitate partnerships between investors and accelerator programs. Laes is the founder of YoungTech Seattle, and his background includes mentoring and leadership roles at the University of Washington’s CoMotion and Techstars.

Pranam Kolari, VP of search and recommendations at Coupang, is resigning from his role next month. Coupang is South Korea’s largest e-commerce platform and is headquartered in Seattle. Kolari, based in San Jose, Calif., was previously at Walmart Labs for nearly a decade where his roles included vice president of engineering for search.

Datavault AI appointed Pete Scobell as VP of global security. The Beaverton, Ore.-based company helps businesses monetize their data and create digital twins of physical objects. Scobell is a decorated U.S. Navy SEAL veteran and will oversee Datavault AI’s security operations, risk management and asset logistics.

Erin McHugh Saif, a former Massachusetts-based Microsoft executive, is CEO of an as-yet unnamed data and AI venture to serve “place-based partnerships,” which are networks of nonprofits, government agencies, and educational entities that aim to address education, jobs and housing needs.

“With better access to data, these organizations will leap ahead in this moment of AI transformation, gaining faster insight into which programs deliver the greatest improvement to significantly scale their impact,” Saif said on LinkedIn.

The effort has the support of the Ballmer Group, a philanthropic organization co-founded by former Microsoft CEO Steve Ballmer and his wife Connie, and the nonprofit TechSoup.

Karen Ng was promoted to executive VP of product at HubSpot. Ng has been with the company since 2022, joining as senior VP of product and partnerships. Past employers include Common Room, Google and Microsoft, where she was chief of staff across the company’s developer tools business. Ng is based in the Seattle area.

The AWS outage is a warning about the risks of digital dependence and AI infrastructure

23 October 2025 at 00:08
The show floor at AWS re:Invent 2024 in Las Vegas. (GeekWire File Photo)

Unless you’ve been on a “digital cleanse” this week, you know that Amazon Web Services (AWS) had a major outage at the start of the week.

You know this because apps and sites you use were down. Credible reports estimate at least 1,000 sites and apps were affected. Large swaths of modern digital life went dark: from finance (Venmo and Robinhood) to gaming (Roblox and Fortnite) to communications (Signal and Slack). Some people couldn’t even get a good night’s sleep because the outage took out “smart beds.” Even sporting events were impacted when Ticketmaster failed.

We’ve seen outages before, but this one seemed broader and harder to ignore.

In the wake of the outage, many well-intentioned hot takes boiled down to: “They should’ve used more cloud providers.”

Setting aside the subtle victim-blaming, there’s also the fact that in a world with only three major cloud providers (AWS, Microsoft Azure, Google Cloud), there’s not a lot of diversity out there if you want to “diversify.”

And the argument for diversity in cloud providers is really about market diversity, not individual organizations juggling multiple vendors. More competition in the cloud market would mean fewer cascading failures when one provider goes down.

The key question when something like this happens is whether we’re taking the lessons about risk and applying them beyond the immediate problem to the problems that are emerging.

Instead of saying organizations need to have multiple cloud providers, we should be asking how we’re dealing with the reality of highly concentrated risks with exceptionally broad impact because we just had an object lesson in what that really means.

In this recent outage there’s a pointer to where we should be looking proactively to apply this lesson: generative AI. This recent AWS outage gives us two lessons for the emerging generative AI ecosystem.

Concentration crisis in AI

With the generative AI ecosystem, I’m talking not about chatbots — I mean AI-native applications that are built on generative AI as a platform. We just saw that when there’s no cloud, there’s no cloud-native application. Likewise, when there’s no generative AI provider, there’s no AI-native application.

The first lesson from the AWS outage for AI-native applications is what happens to an industry when there’s a limited number of providers for centralized resources and there’s an outage. We just saw: it has huge rippling effects across the industry and all walks of life built on it.

It’s a throwback to the mainframe era: when “the computer” is down, it’s down for everyone.

There are as few, if not fewer, generative AI providers as there are cloud providers. A major outage is inevitable — that’s just engineering reality. When that happens, every AI-native app built on that generative AI platform will also go down, full stop.

The impact could be even more severe than the AWS outage. It will be more like “the computer is down, and the people are gone” for many different industries and services. Ironically, the “smarter” the industry and service, the greater the potential fallout.

The second lesson is one of intertwined risk. OpenAI itself was affected by this week’s AWS outage. 

That means AI-native apps have double exposure to the risks around a limited number of providers for critical, centralized resources. For AI-native apps, it’s like the mainframe era squared. If the generative AI platform fails, everything built on it fails. And if the cloud that hosts the AI platform fails, it all goes down, too.
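A rough back-of-the-envelope calculation shows what this double exposure can mean. The sketch below assumes an app that works only when every layer it depends on is up, and it treats the layers as failing independently. The 99.9% availability figures are made up for illustration and are not measurements of any provider. Real failures are often correlated, so this is an illustration of the stacking effect, not a prediction.

```python
# Illustrative arithmetic only: the availability figures are invented, and the
# independence assumption is a simplification that real outages often violate.

HOURS_PER_YEAR = 24 * 365

def serial_availability(*availabilities):
    """Availability of a stack that needs every listed dependency to be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def expected_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

# A cloud-native app with a single critical dependency at 99.9%.
cloud_only = serial_availability(0.999)

# An AI-native app that needs both an AI platform and a cloud, each at 99.9%.
ai_native = serial_availability(0.999, 0.999)

print(f"cloud only: {expected_downtime_hours(cloud_only):.2f} hours/year")
print(f"AI native:  {expected_downtime_hours(ai_native):.2f} hours/year")
```

Under these assumptions, each added layer multiplies the availability down, so the expected downtime of the stacked app is close to the sum of the downtime of its layers.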

This is not to say don’t do cloud or don’t do AI. But it is to say we need to understand this new, complex intertwining of risks inherent in a world where everything is relying on a small number of key providers and that small number of key providers also rely on a small number of key providers.

The realities of physical requirements and capital investment required for cloud and generative AI make a truly diverse ecosystem impracticable for either. I don’t think anyone sees more than a literal handful of providers for either of these in the future. 

The bottom line

Highly concentrated risks with exceptionally broad impact aren’t going away anytime soon. 

But the growth of generative AI providers, and their reliance on cloud providers, shows where growth is going to happen and where and what those risks will be. The growth will be upwards, as technologies stack on top of and rely on each other. And that means these risks are only going to become more concentrated and the impacts even broader.

In the world of security, there’s the “CIA” triad: “confidentiality”, “integrity” and “availability.” In the first days of “Trustworthy Computing” at Microsoft, the principles included “availability.” But in recent years, availability has often been overlooked as security and privacy concerns understandably dominate.

A thoughtful reading of the AWS outage tells us that outages like this aren’t an anomaly: they’re inherent in the nature of today’s technology reality. And since there are no easy solutions and only increasingly complex problems around this, we need to start understanding this new reality and thinking seriously about how to mitigate these risks.

AWS outage affects Ticketmaster for pivotal Mariners vs. Blue Jays playoff game in Toronto

21 October 2025 at 02:46
(Photo by appshunter.io on Unsplash)

The effects of the massive AWS outage reached the sports world on Monday.

Ticketmaster was dealing with ticket management issues as a result of the outage, according to messages shared by several sports teams hosting games on Monday, including the Toronto Blue Jays and Seattle Seahawks.

The Blue Jays, facing off against the Seattle Mariners in a Game 7 MLB playoff bout at Rogers Centre in Toronto, posted a statement earlier Monday about the outage and advised fans to “hold off on managing your tickets as we work through this.”

A few hours later, the team said ticket management was returning to normal.

>World Series appearance on the line
>AWS outage sends Ticketmaster down
>Blue Jays fans can't access Game 7 tickets
>Blue Jays opponent…Seattle
>Amazon headquarters…Seattle https://t.co/OYjjDj5cdf pic.twitter.com/rbNnwKYegG

— Morning Brew ☕️ (@MorningBrew) October 20, 2025

The Seahawks, which are hosting the Houston Texans for Monday Night Football in Seattle, issued a statement about the outage “that may impact access to Ticketmaster, Seahawks Account Manager, and the Seahawks Mobile App.”

The Detroit Lions, hosting their own Monday Night Football game, also had ticketing impacted.

The outage effects went beyond just ticketing. The Premier League said its VAR tech system, used to determine offside calls in soccer, would not be available for Monday’s match between West Ham and Brentford.

Amazon’s outage began shortly after midnight Pacific in Amazon’s Northern Virginia (US-EAST-1) region, which is AWS’s oldest and largest cloud region, a popular nerve center for online services.

In an initial update, AWS said the outage was related to a DNS resolution issue with its DynamoDB product, meaning the internet’s phone book failed to find the correct address for a database service used by thousands of apps to store and find data.

Amazon later said the root cause of the outage was an “underlying internal subsystem responsible for monitoring the health of our network load balancers.”

By 3 p.m. PT, the company said all AWS services had returned to normal operations.

Major sites and services including Facebook, Snapchat, Coinbase and Amazon itself were impacted — reviving concerns about the internet’s heavy reliance on the cloud giant.

The outage suggests that many sites have not adequately implemented the redundancy needed to quickly fall back to other regions or cloud providers in the event of AWS outages.
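What such a fallback looks like in its simplest form can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how any affected site is built: `call_with_failover`, `AllRegionsFailed` and `fake_request` are hypothetical names, the outage is simulated, and the sketch ignores the harder problem of keeping data replicated so that a second region has something useful to serve.

```python
# Hypothetical sketch: helper names are invented and the outage is simulated.
from typing import Callable, Sequence

class AllRegionsFailed(Exception):
    """Raised when no region in the list could serve the request."""

def call_with_failover(regions: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    failed = []
    for region in regions:
        try:
            return request(region)
        except ConnectionError:
            failed.append(region)  # remember the failure and try the next region
    raise AllRegionsFailed(f"no region answered: {failed}")

def fake_request(region: str) -> str:
    """Stand-in for a real service call; pretends us-east-1 is unreachable."""
    if region == "us-east-1":
        raise ConnectionError("simulated outage")
    return f"served from {region}"

result = call_with_failover(["us-east-1", "us-west-2"], fake_request)
print(result)
```

The ordering of the list is the design choice here: the primary region is tried first, and the secondary is contacted only after the primary fails.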
