Microsoft is blaming a software “code issue” for an outage that impacted Microsoft 365 services for five hours on Monday.
“A code issue caused a portion of our infrastructure to experience delays processing authentication requests, which prevented users from being able to access multiple M365 services,” said Microsoft in an email update to Microsoft administrators impacted by the outage.
Microsoft said it is currently “reviewing our code” to understand what caused the code to “stop processing authentication requests in a timely fashion.” Microsoft promised a post-incident report within five business days.
Microsoft said the software code issue impacted users on Sept. 28 from 5:25 pm to 10:25 pm.
Microsoft customers began reporting on Downdetector.com at 5:21 pm Monday that they could not access Office 365. Within an hour, more than 18,000 posts documenting those problems had flooded the website, which tracks cloud outages.
Microsoft told administrators that users may have been unable to access multiple Microsoft 365 services that rely on Azure Active Directory, including Outlook, Microsoft Teams and Teams Live Events as well as Office.com.
Microsoft said Power Platform and Dynamics 365 properties were also impacted by the outage.
Separately, Microsoft said in a public Azure status update last night that a “subset of customers in the Azure Public and Azure Government clouds may have encountered errors performing authentication operations for a number of Microsoft or Azure services, including access to the Azure Portals.” Microsoft said that Azure issue lasted from 5:25 pm EST Monday to 8:23 pm EST Monday.
Microsoft attributed the Azure service outage to a “recent configuration change” that “impacted a backend storage layer, which caused latency to authentication requests.”
Microsoft said the configuration was rolled back to “mitigate the issue.”
As for that Azure issue, Microsoft said services that “still experience residual impact will receive separate portal communications.” It promised a full post-incident report on that issue within the next 72 hours.
A senior executive for one of Microsoft’s top partners, who did not want to be identified, said it appears that a Microsoft software developer made a software code change that took Office 365 and Azure down.
“It’s amazing to me that a change in code could cause a platform as big as Azure to go down,” said the executive. “It sounds like someone wrote some code that was merged into a production environment and it broke authentication. That’s ridiculous. If you can’t get into email or documents for five hours it’s pretty bad.”
The senior executive said Microsoft is going to need to do a deep dive to determine how someone could deploy a software code change that caused a five-hour outage.
“Everyone expects outages, a hiccup here or there is understandable,” said the executive. “But this appears to be a faulty source control software policy issue. They would presumably be in a source control/DevOps environment that should have prevented this. With billions and billions of dollars invested in Azure how could one developer write some code, release it into production and take the whole thing down? It looks like somehow someone overrode the continuous software integration cycle.”
An outage like the one Microsoft just experienced definitely has a ripple effect in the sales trenches, said the executive, who noted that large companies with mission-critical applications often use such outages as a reason not to go to public cloud.
“It’s a tough scenario for sales reps,” said the executive. “There are a lot of frozen middle accounts that hang onto an issue like this and it causes another three-year evaluation cycle. In industries like oil and gas and financial services they hang on to something like this. It has a snowball effect.”
In an email response, Tony Safoian, president and CEO of SADA, a top Google Cloud partner, said he sees Google Cloud as “the most resilient and reliable” cloud platform. At the same time, he said, outages are to be expected from “time to time” with hyperscaler providers.
Larry Cannell, a senior research director at market research firm Gartner who focuses on Microsoft Teams and the digital workplace, said in an email that an outage is “not a good look for any cloud service.” That said, he applauded Microsoft for doing “a good job keeping everyone up to date on what actions they were taking.”
Additional reporting by O’Ryan Johnson.