Microsoft has blamed a botched update to a load balancer for an outage that cut access to a number of hosted services, including Office 365 and Windows Live, a fortnight ago.
The outage, which occurred on September 9, was initially thought to have been caused by power cuts in Southern California, although speculation had centred on whether it was related to the domain name system (DNS).
A post-mortem released by Microsoft confirmed it was an issue in Microsoft's DNS service.
"A tool that helps balance network traffic was being updated and the update did not work correctly," Windows Live test and service engineering vice president Arthur de Haan said.
"As a result, configuration settings were corrupted, which caused a service disruption."
The file corruption occurred for two reasons, Microsoft said. First, the load-balancing tool was unable to parse an incorrectly constructed line in the updated configuration file, de Haan said.
"The second condition was related to how the configuration is synchronised across the DNS service to ensure all client requests return the same response regardless of the connection location of the client," he said.
"Each of these conditions was tracked to the networking device firmware used in the Microsoft DNS service."
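The two conditions de Haan describes, a line the tool could not parse and a configuration that is then synchronised everywhere, can be illustrated with a minimal sketch. This is not Microsoft's tooling: the record format, the function names `parse_config` and `apply_update`, and the use of in-memory dictionaries to stand in for DNS nodes are all assumptions made for illustration. The idea shown is simply to validate the whole file before any node is touched, so a single malformed line leaves the last known-good settings in place.

```python
import re

# Hypothetical record format (an assumption): "<hostname> <IPv4 address>"
LINE_RE = re.compile(
    r"^(?P<name>[A-Za-z0-9.-]+)\s+(?P<addr>\d{1,3}(?:\.\d{1,3}){3})$"
)


def parse_config(text):
    """Parse every line; raise ValueError on the first malformed one."""
    records = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: cannot parse {raw!r}")
        records[match["name"]] = match["addr"]
    return records


def apply_update(nodes, new_text):
    """Replace the config on every node only if the whole file parses.

    `nodes` is a list of dicts standing in for the servers that must all
    return the same answer; real synchronisation would be far more involved.
    """
    try:
        records = parse_config(new_text)
    except ValueError:
        return False  # keep the last known-good config on every node
    for node in nodes:
        node.clear()
        node.update(records)
    return True
```

Under these assumptions, calling `apply_update` with a file that contains one bad line returns `False` and leaves every node unchanged, while a clean file is copied to all nodes so they continue to give identical responses.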
De Haan said Microsoft was focused on hardening the DNS service, improving redundancy and failover capabilities.
"We are also developing an additional recovery process that will allow a specific property the ability to fail over to restore service and then fail back when the DNS service is restored," he said.
"In addition, we are reviewing the recovery tools to see if we can make more improvements that will decrease the time it takes to resolve outages."
Copyright © iTnews.com.au. All rights reserved.