How Microsoft scrambled to fix its COVID cloud capacity crunch      
Tom Krazit | Protocol | June 17, 2020
  This is not a drill
  What would you do if demand for several of your cloud services doubled overnight?
   That sounds like an interesting hypothetical question for a planning session. Or a Google interview question. But it's what happened to Microsoft in March, as the pandemic took hold of Europe and the U.S. after devastating China earlier in the year. 
Now that demand has stabilized, Microsoft on Tuesday released several details about the technical steps it took to deal with the surge, as demand almost doubled across several of its key services, including Teams, Virtual Desktop and Xbox Live, during one week in early March. Some of the details are fascinating. In a video, Microsoft Azure CTO Mark Russinovich explained how the company moved traffic around the globe and rewrote code to tweak the way some of its applications consume computing resources, all in just a few weeks.
For example, Microsoft turned off the little "your co-worker is typing" and read-receipt notifications for Teams users in peak-demand regions, reducing the CPU capacity required to process those functions by 30% and returning that capacity to Azure customers. The company also pleaded with local ISPs and video-game publishers to delay the release of game updates over Xbox Live until after business hours, freeing up computing capacity that could also be returned to Azure customers. Microsoft also made a number of changes to its networking strategy as people left their office parks, with their local-area networks and better internet connections, and started working from home, putting a strain on wide-area networks.
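  Microsoft hasn't published the Teams code behind this, but the general idea is the familiar feature-flag pattern: nonessential work gets skipped in regions under peak load so the cycles go back to other tenants. A minimal Python sketch, with entirely hypothetical region and feature names:

```python
# Hypothetical sketch of region-scoped feature flags for graceful degradation.
# None of these names come from Microsoft; they only illustrate the pattern
# described in the article: skip low-value work (typing indicators, read
# receipts) in regions under peak load.

PEAK_DEMAND_REGIONS = {"westeurope", "eastus"}        # assumed region IDs
NONESSENTIAL_FEATURES = {"typing_indicator", "read_receipts"}

def feature_enabled(feature: str, region: str) -> bool:
    """Return False for nonessential features in regions under peak load."""
    if region in PEAK_DEMAND_REGIONS and feature in NONESSENTIAL_FEATURES:
        return False
    return True

def handle_keystroke(sender: str, chat_id: str, region: str) -> None:
    # Only fan out "X is typing..." events when the flag is on; skipping
    # the fan-out entirely is where the CPU savings come from.
    if feature_enabled("typing_indicator", region):
        broadcast_typing_event(sender, chat_id)

def broadcast_typing_event(sender: str, chat_id: str) -> None:
    # Placeholder for the real fan-out to other chat participants.
    print(f"{sender} is typing in {chat_id}")

if __name__ == "__main__":
    handle_keystroke("alice", "chat-123", "brazilsouth")   # fans out
    handle_keystroke("bob", "chat-456", "westeurope")      # silently skipped
```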
Azure relies on four undersea cables between the U.S. and Europe to handle traffic traveling across the pond, and to increase capacity on one of those cables it "borrowed" some advanced networking equipment from another Azure region to quickly upgrade that connection. It moved some Xbox Live activity from China and Europe back to the U.S. to free up Azure capacity for business customers in those regions. Engineers also quickly built a time-based system for managing traffic between regions, which automatically balanced surges in traffic as people in one region woke up and started working while people in other regions were off the clock.
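  Microsoft didn't publish how that time-based system works, but the core idea is steering deferrable work toward regions whose users are off the clock. A rough Python sketch under that assumption, with made-up region names and UTC offsets:

```python
# Hypothetical sketch of time-based inter-region balancing. Microsoft hasn't
# described its algorithm; this only illustrates routing deferrable work to
# regions that are currently outside local business hours.

from datetime import datetime, timedelta, timezone

# Assumed regions with rough UTC offsets for their user populations.
REGION_UTC_OFFSETS = {"eastus": -5, "westeurope": 1, "japaneast": 9}

BUSINESS_HOURS = range(8, 18)  # 8:00-17:59 local time counts as peak

def off_peak_regions(now_utc: datetime) -> list[str]:
    """Return regions currently outside their local business hours."""
    result = []
    for region, offset in REGION_UTC_OFFSETS.items():
        local_hour = (now_utc + timedelta(hours=offset)).hour
        if local_hour not in BUSINESS_HOURS:
            result.append(region)
    return result

def pick_region_for_deferrable_work(now_utc: datetime) -> str:
    """Send flexible workloads to an off-peak region when one exists."""
    candidates = off_peak_regions(now_utc) or list(REGION_UTC_OFFSETS)
    return candidates[0]

if __name__ == "__main__":
    print(pick_region_for_deferrable_work(datetime.now(timezone.utc)))
```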
  It also relied on the tried-and-true method of building capacity: buying all the servers it could get its hands on.
During its last earnings call, Microsoft cited supply-chain delays from hard-hit server-manufacturing regions in China during January as another part of the scaling problem. The trouble with upgrading amid the pandemic, though, was that installing new hardware required people, who had to figure out how to quickly upgrade racks of servers while staying six feet apart from each other. Still, Azure added 12 new so-called "edge sites," essentially mini-datacenters that serve as entry points into the broader Azure network in parts of the world far from actual Azure data centers.
  The experience validated a few modern application design philosophies. Microsoft now plans to shift Teams from virtual machines to containers, for instance, and said it found it easier to quickly scale and adjust because several Azure applications were designed around microservices. 
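  Protocol doesn't spell out why microservices made the difference, but the usual argument is that each service can be scaled on its own signal instead of scaling one monolithic deployment. A small Python sketch of that design choice, with illustrative service names and per-replica targets that are not Microsoft's:

```python
# Hypothetical sketch: with microservices, each service gets its own replica
# count from its own load, rather than scaling everything together.

import math

# Assumed per-replica capacity, in requests per second, for each service.
TARGET_RPS_PER_REPLICA = {
    "chat": 500,       # message delivery
    "presence": 2000,  # cheap status updates
    "calling": 100,    # expensive media sessions
}

def desired_replicas(service: str, current_rps: float) -> int:
    """Size each service independently from its own request rate."""
    target = TARGET_RPS_PER_REPLICA[service]
    return max(1, math.ceil(current_rps / target))

if __name__ == "__main__":
    observed = {"chat": 120_000, "presence": 40_000, "calling": 9_000}
    for svc, rps in observed.items():
        print(svc, desired_replicas(svc, rps))
```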
  Still, it's worth noting that Microsoft was the lone company among the Big Three cloud providers to endure this type of crunch during the first half of the year, and that it was already struggling with Azure capacity long before most people had heard of COVID-19.
  The company deserves credit for quickly reacting to an unprecedented event, but if AWS and Google faced similar challenges this year, they kept it quiet.