DNS-related outage follows extended downtime last week

Microsoft Azure multi-factor authentication (MFA) was out for more than two hours on Tuesday, days after a much longer outage at the start of last week.

The latest issues meant that users had trouble signing into Azure resources, such as Azure Active Directory, when MFA was required by policy, between 14:25 UTC and 17:08 UTC on Tuesday (November 27).

Preliminary indications point to Domain Name System (DNS) resolution glitches. “Engineers found that an earlier DNS issue triggered a large number of sign-in requests to fail, which resulted in backend infrastructure becoming unhealthy,” Microsoft explained.

Tuesday’s authentication headache follows hot on the heels of a much longer, 14-hour outage on November 19.

Microsoft has published a post-mortem on the earlier outage, which was triggered by three interlinked problems. The tangle of issues goes a long way towards explaining why it took so long to restore the cloud-based system to normal.

In cases where MFA was required, users of Office 365, Azure, Dynamics, and other services which use Azure Active Directory for authentication were left unable to log in.

Enterprises in Europe, Asia, and the Americas were all affected to a lesser or greater extent.

The first two causes of the downtime have been pinpointed as issues on the MFA frontend server, both introduced in a code update whose rollout began in some data centres about a week before the problems kicked in.

The update worked smoothly until traffic rose, at which point things began to fail.

A latency issue in the MFA frontend’s communication with its cache services started to cause problems when business got brisker. This fed into the second problem, a race condition in processing responses from the MFA backend server, and things then went seriously awry.

A third problem, a previously undetected flaw in the MFA backend server that was triggered by the second issue, compounded matters and meant the whole system stopped working properly.

“This issue causes accumulation of processes on the MFA backend leading to resource exhaustion on the backend at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring,” Microsoft explained.
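Microsoft has not published how its monitoring is built, so the following is only a toy sketch of the general failure mode it describes: a backend that passes a shallow liveness check while every worker is tied up. The `Backend` class, its worker-slot model, and the method names are all hypothetical.

```python
class Backend:
    """Toy backend with a fixed number of worker slots (hypothetical model)."""

    def __init__(self, max_workers: int = 4) -> None:
        self.max_workers = max_workers
        self.busy_workers = 0  # workers stuck on requests that never finish

    def is_alive(self) -> bool:
        # Shallow check: only confirms the process is up, so it always passes.
        return True

    def handle(self, request_hangs: bool = False) -> bool:
        """Try to process a request; return False if no worker is free."""
        if self.busy_workers >= self.max_workers:
            return False
        if request_hangs:
            self.busy_workers += 1  # this worker is never released
        return True

    def can_serve(self) -> bool:
        # Deeper check: push a synthetic request through the real request path.
        return self.handle(request_hangs=False)


backend = Backend(max_workers=2)
backend.handle(request_hangs=True)
backend.handle(request_hangs=True)

print(backend.is_alive())   # shallow check still reports healthy
print(backend.can_serve())  # synthetic request reveals the exhaustion
```

In this sketch, a probe that exercises the real request path is what exposes the exhaustion; whether that matches the changes Microsoft has in mind is not stated in its post-mortem.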

The practical upshot was that authentication requests failed or weren’t processed promptly, leaving users who were required to use MFA locked out of Azure-based systems.

Login codes are generally only valid for 60-90 seconds, so system latency that merely delays the delivery of codes to users causes all sorts of problems.
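The arithmetic behind that point can be sketched as follows. This is a minimal illustration, not a description of Azure's implementation: the 90-second window is the upper end of the range quoted above, while the 15-second typing allowance and both function names are assumptions made for the example.

```python
def code_still_valid(issued_at: float, submitted_at: float,
                     validity_seconds: float = 90.0) -> bool:
    """Return True if the code is submitted within its validity window."""
    return 0 <= submitted_at - issued_at <= validity_seconds


def usable_after_delay(delivery_delay: float, typing_time: float = 15.0,
                       validity_seconds: float = 90.0) -> bool:
    """Check whether a code delayed in delivery can still be used.

    The code is issued at t=0, reaches the user after delivery_delay
    seconds, and is submitted typing_time seconds after that.
    """
    return code_still_valid(0.0, delivery_delay + typing_time, validity_seconds)


for delay in (5.0, 75.0, 80.0):
    print(delay, usable_after_delay(delay))
```

Under these assumed numbers, a code that takes 80 seconds to arrive has already expired by the time the user finishes typing it, even though it was delivered inside the window.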

Worse yet, Microsoft’s operational staff were left with no clear indication of what was going wrong.

Microsoft’s explanation of how it diagnosed the problem and restored service can be found in an update to its Azure status history page.

It vowed to review its update deployment and status monitoring systems in the wake of the outage, an assurance that seems a little less reassuring in the aftermath of similar problems just a week later.

Microsoft’s authentication service has historically been reliable. If this changes, it might encourage orgs to drop MFA in order to skirt the impact of future issues, leaving business users far more exposed to hacker attacks as a result.