It Took Just Four Days From Elon Gleefully Admitting He’d Unplugged A Server Rack For Twitter To Have A Major Outage
from the whoops dept
"I know, I know. Some of the more angry commenters around here keep insisting that I should stop talking about Elon Musk and Twitter, and I want to do exactly that. I planned to do exactly that and not write another post about it all until next week. And then… Twitter crashed hard last night. Downdetector has the receipts:
Here’s what happened when I went to visit Twitter:
I especially like that “it’s not your fault” bit, because, well, yeah. It’s not.
As I write this, there hasn’t been anything official about what happened, but I’m assuming that Elon will show up at some point to blame the “woke mind virus” or the federal reserve or SBF or Anthony Fauci.
And, it may be a total coincidence, but it was just four days ago that he bragged about pulling the plug on an “important server rack.”
Separately, there have been reports that Musk decided (with little to no notice, and almost no planning) to shut down Twitter’s Sacramento data center and massively downsize its Atlanta data center. Twitter only has one other data center in the US, in Portland, Oregon. Twitter’s use of data centers rather than the cloud is something that’s been discussed over the years, and two years ago the company did sign a deal to start using Amazon Web Services, though I don’t think the company relies too heavily on it yet, and the first link in this paragraph notes that Elon has been trying to renegotiate the AWS contract as well (which might mean he’s also stopped paying the bills, as he seems to have done with many vendors as part of his “renegotiation” efforts).
Separately, I’ve heard from three separate people that Elon more or less ordered the shutdown of an entire data center (presumably the Sacramento one) with basically one day’s notice and no planning.
And, with that in mind, I’ll remind people that one part of former Twitter security chief Peiter “Mudge” Zatko’s whistleblower report noted that the company had a deep need for more redundancy, not less:
Insufficient data center redundancy, without a plan to cold-boot or recover from even minor overlapping data center failure, raising the risk of a brief outage to that of a catastrophic and existential risk for Twitter’s survival
That report also presented a redacted version of the “threat matrix” Mudge claims he wanted to show the Board, though he was urged to give only a high-level oral overview rather than present a more complete written report. It again notes that a data center failure could be catastrophic.
Later in the report, Mudge notes that this almost happened in the past:
Cascading data center problems: In or around the spring of 2021, Twitter’s primary data center began to experience problems from a runaway engineering process, requiring the company to move operations to other systems outside of this datacenter. But, the other systems could not handle these rapid changes and also began experiencing problems. Engineers flagged the catastrophic danger that all the data centers might go offline simultaneously. A couple months earlier in February, Mudge had flagged this precise risk to the Board because Twitter data centers were fragile, and Twitter lacked plans and processes to “cold boot.” That meant that if all the centers went offline simultaneously, even briefly, Twitter was unsure if they could bring the service back up. Downtime estimates ranged from weeks of round-the-clock work, to permanent irreparable failure.
“Black Swan” existential threat: In fact, in or about Spring of 2021, just such an event was underway, and shutdown looked imminent. Hundreds of engineers nervously watched the data centers struggle to stay running. The senior executive who supervised the Head of Engineering, aware that the incident was on the verge of taking Twitter offline for weeks, months or permanently, insisted the Board of Directors be informed of an impending catastrophic “Black Swan” event. Board Member [REDACTED] responded with words to the effect of “Isn’t this exactly what Mudge warned us about?” Mudge told [REDACTED] that he was correct. In the end, Twitter engineers working around the clock were narrowly able to stabilize the problem before the whole platform shut down.
That’s not to say that this has anything to do with the outages last night, but at the very least there are strong arguments that Twitter’s infrastructure is inherently fragile, and shutting down “sensitive” server racks or closing down entire data centers without careful planning seems like the sort of thing that could, well, backfire pretty badly.
Meanwhile, the only comment so far from Musk appears (it’s tough to know because Twitter only loads intermittently) to be him replying “works for me” when someone asked about site problems. Also, in context, Musk is replying to a joke about the site being down, rather than a legitimate concern: someone asked if anyone could see or respond to their tweet, one of Musk’s biggest fans tweeted “I can’t see or respond to it” (obviously making light of the whole thing), and then Musk responded with “works for me.”
So it’s not entirely fair to say this is a comment directly about the widespread outages. Assuming Musk realizes Billy is joking, then… it could just be a weak attempt at playing along? But here’s the actual funny part. The Guardian has an article about Musk’s tweet saying stuff “works for me” except that stuff isn’t working, because the Twitter embed is not showing properly, but instead is showing in failover mode, where if the embed won’t load it just shows the alt-text in as “tweet-like” a form as possible. This screenshot is just pure irony.
I eagerly await the comments from folks who were insisting to me just yesterday that Twitter under Musk was functioning much better than before, and that this all proved he was right to get rid of approximately 75% of the workforce who obviously did nothing…
Oh and just as this post was being completed, Elon has a new story, claiming that Twitter was just rolling out “significant backend server architecture changes” and that “Twitter should feel much faster” (it doesn’t, unless you’re talking about the difference from not working at all… to kinda working some of the time?).
Even if that was the cause of the outage (and… I’m doubtful), that still raises all sorts of questions about how the company prepared for the switchover, if it caused such a massive disruption in the process. That’s… not how any of this should work."
Filed Under: datacenters, downtime, elon musk, fail whale, mudge
Companies: twitter