Teams, Exchange Online, and other services were knocked offline for more than 14 hours
Microsoft has blamed a key rotation issue for a large-scale 365 outage that affected many of its services on Monday and Tuesday.
The outage – which took down Teams, Exchange Online, and other 365 services – kicked in at around 19:00 UTC on Monday and was only resolved more than 14 hours later, at around 09:25 on Tuesday.
Problems in the periodic rotation of cryptographic keys caused authentication checks to fail for any application that relied on Azure Active Directory, and the failures persisted overnight until engineers were able to apply a fix.
In a status update, Microsoft explained that the authentication problems arose because a key marked for retention had erroneously been deleted by the system. This caused particular problems because the key was needed to manage a migration project, as the company explained:
The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD’s use of OpenID and other identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system on a time-based schedule removes keys that are no longer in use.
Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key.
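To illustrate the kind of check Microsoft describes, here is a minimal Python sketch of a time-based key cleanup that honours a "retain" state. It is not Microsoft's code: the `SigningKey` class, its `retain` flag, the `MAX_IDLE` threshold, and the `keys_to_remove` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed idle period after which an unused key becomes eligible for removal.
MAX_IDLE = timedelta(days=30)


@dataclass
class SigningKey:
    key_id: str
    last_used: datetime
    retain: bool = False  # hypothetical flag: keep the key even if it looks idle


def keys_to_remove(keys, now):
    """Return keys idle for longer than MAX_IDLE, skipping any marked retain."""
    return [
        key
        for key in keys
        # Dropping the `not key.retain` test would reproduce the class of bug
        # described above: a retained key would be removed on schedule.
        if not key.retain and now - key.last_used > MAX_IDLE
    ]


now = datetime(2021, 3, 15, tzinfo=timezone.utc)
keys = [
    SigningKey("active", now - timedelta(days=1)),
    SigningKey("stale", now - timedelta(days=60)),
    SigningKey("migration", now - timedelta(days=60), retain=True),
]
removed_ids = [key.key_id for key in keys_to_remove(keys, now)]
print(removed_ids)
```

In this sketch only the idle, unretained key is selected for removal; the key held back for the migration survives the scheduled cleanup despite being just as old.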
Azure Admin Portal, Teams, Exchange, Azure Key Vault, SharePoint, and Storage were all affected to a greater or lesser extent by the problem.
Security vendor Venafi warned that outages of this nature are likely to become more common as digital transformation accelerates, thus heightening the importance of key rotation.
Michael Thelander, director of machine identity strategy at Venafi, commented: “Poorly orchestrated key rotation is the Achilles heel of modern digital transformation efforts; this oversight is capable of bringing down entire applications and services in an instant.
“Keys and certificates have numerous ‘states’ that guide their automation and orchestration processes. They also have hard-coded expirations.
“‘Retain’ is a tag that tells the system, ‘This key may be retired or expired, but the system needs to keep it to enable any overlap between dynamic processes’.
“If the ‘retain’ tag is overlooked and the keys are deleted before replacements are ready – and this all happens in microseconds – systems fail,” he added.
Thelander concluded: “Unfortunately, these kinds of outages will only continue until organizations adopt an enterprise-wide approach to managing the machine identities these keys and certificates represent.
“Digital transformation is not going to slow down, and this requires automation of keys and certificates found in workloads, containers, and across cloud environments as well as those in on-prem environments.”