With Microsoft suffering its second major Office 365 outage within a three-day period, some partners believe the tech giant is grappling with a DevOps crisis that has resulted in software code changes taking down production systems.
“It looks to me like they have a DevOps problem,” said a channel source impacted by one of the outages. “It looks like they are pushing out software updates that are causing the outages. They have so much going on right now, rolling Teams out at a breakneck pace. I think they are running into an issue where code tested out fine but there is a configuration problem when they deploy it.”
DevOps is a set of practices that, according to the Wikipedia definition, shortens the systems development life cycle and provides continuous delivery of code with high software quality. “For Microsoft DevOps this is a very serious issue because they are pushing out software updates hourly for mission-critical applications,” said the executive. “It’s not a small organization with a weekly release schedule. Microsoft can’t do that. They are doing updates hourly. I am sure they are working around the clock to figure this out. There may be some finger-pointing going on inside the organization between the developers and DevOps group. There is too much money at stake for them to be going down like this.”
A Microsoft spokesperson refused to comment on whether the outages were due to a DevOps issue. In a statement Microsoft said: “No cloud vendor is immune to downtime. Our number one priority is to get to resolution as quickly as possible and ensure our customers stay updated along the way, as was the case here. We continuously invest in the resilience of our platform and focus on learning from these incidents to ultimately reduce the impact of inevitable outages.”
The latest Office 365 outage, which left some users intermittently unable to access Microsoft Exchange Online, began Thursday at 12:52 a.m. and lasted until 10:50 p.m., according to a Microsoft email update to Office 365 administrators.
Microsoft said the Thursday outage affected users “intermittently if they were routed through the affected infrastructure.”
Microsoft said a “configuration update to the components that route user requests caused impact to specific features and services that utilize the Representational state transfer (REST) functionality within Microsoft 365.”
A senior executive for one of Microsoft’s top partners, who did not want to be identified, said he sees both recent outages as clearly DevOps-related. “The REST functionality within Office 365 cited in the latest outage is all about DevOps and quality of code,” he said. “It totally looks like a DevOps issue. Remember, DevOps is supposed to ensure good code quality and integration with existing code.”
Companies that are going to be successful in the “new software-eats-the-world era” must be well-versed in DevOps, said the executive. “Remember, DevOps stands for development operations, which means how you are integrating your code into version control, safely and intelligently allowing multiple people to contribute to that code,” said the executive. “It’s all about how you are testing that code from an automated perspective and how you are managing it so it doesn’t go into production before it has been tested. The whole purpose of DevOps is to allow you to deploy code rapidly and safely.”
The Thursday outage came after a software code issue was blamed for an outage that impacted Microsoft 365 services for five hours Monday night. “A code issue caused a portion of our infrastructure to experience delays processing authentication requests, which prevented users from being able to access multiple M365 services,” said Microsoft in an email update to Microsoft administrators impacted by the outage.
The senior executive said it is extremely troubling to see two Office 365 outages in such a short period of time that appear to be the result of software code changes. “Microsoft is a development-first company, well known in general for DevOps, so the question is: why is this happening?” said the executive. “I love Microsoft but why is a company that paid $7.5 billion for GitHub, the leading source code repository company in the world, getting taken down by code that is not being well tested or has a single point of failure? That is ridiculous. If we caused this kind of production outage for a customer we would be fired and possibly blacklisted from the ecosystem. We have to bat 1.000 as a partner.”
The lesson from the outages may well be that a company’s DevOps is only as “good as the humans who configure it and execute upon it,” said the executive.
The executive said the outages will definitely have a ripple effect in the channel. “I bet the Google G Suite sales reps threw a party when they saw this,” he said.
Donna Goodison contributed to this story.