Google Is Searching for a Way to Win the Cloud
  Trailing far behind its biggest rivals, the company is undergoing a multiyear effort to improve the reliability of its infrastructure.
  By Nico Grant, Bloomberg, February 2, 2022, 6:00 AM CST
  When Amazon’s cloud computing network went down on Dec. 7, it hampered a broad swath of companies that rely on its servers, including Disney, Netflix, and Ticketmaster. The rare company with reason to welcome the widespread disruptions caused by Amazon Web Services’ worst outage in more than a year? Google.
  Despite its dominance in consumer services, Alphabet Inc.’s search giant has long trailed Amazon.com Inc. and Microsoft Corp. in the fast-growing cloud computing industry. Cloud companies compete on various fronts (speed, features, reliability), but a key part of Google Cloud Chief Executive Officer Thomas Kurian’s plan to catch up is to convince customers that Google’s cloud infrastructure is more reliable than the competition’s.
  There may be no way to evaluate that claim. Cloud industry analysts say measuring the relative downtime of competing services is almost impossible because of the scale of the networks, the diversity of services they offer, and the complex mix of factors that lead to failures. Corey Quinn, chief cloud economist at the Duckbill Group, which works with companies to reduce their AWS bills, says Amazon and Google Cloud are “neck and neck with regard to reliability,” with Microsoft’s Azure trailing because of significant disruptions in 2020. (A Microsoft spokesperson says that the company’s cloud offers “industry-leading reliability” and that it gives customers payment credits after some outages.)
  Still, Google faces technical challenges of its own. When the company first began building its global system of data centers, the aim was to serve its own consumer-facing tech products. Its design was well suited to the task of keeping Google’s search, email, and video streaming running around the globe. But using the same server farms as the backbone of a cloud computing network introduces a new set of technical complications, and resolving that tension has been a major engineering focus for Kurian.
  Cloud computing is one segment of the tech industry where everyone anticipates runaway growth. The cloud market will grow about 30% annually until 2025, when it will reach $400 billion, according to research company IDC. In 2020, Amazon commanded 41% of the public cloud market, Microsoft held 20%, and Google had 6%, according to Gartner, another researcher.
  That doesn’t mean Google is doing badly. Analysts expect its cloud division to generate $26 billion in revenue this year, about four and a half times what it made in 2018, the year before Kurian became CEO. The operation isn’t profitable, but Kurian has cut its losses, and he’s said its focus is still on growth rather than profit. Google said on Feb. 1 that its cloud unit generated $5.54 billion in sales in the fourth quarter, beating analyst estimates. Google Cloud’s workforce has grown to 40,000, from 25,000 when Kurian took charge; its roster of multinational clients includes Goldman Sachs, HSBC, and Twitter.
  Kurian, a former Oracle Corp. executive, took over from Diane Greene, a brilliant engineer who co-founded VMware Inc. before joining Google to turn its cloud division into a serious business. When she left, Google’s cloud was a distant third. Under Kurian the company amped up its client services. Its sales force grew rapidly and began to emphasize the strategic partnerships that were likely to keep large customers engaged. But he also quickly focused on reliability after a litany of consumer complaints in his first few months.
  One big challenge is the extreme centralization of Google’s network of data centers. The company designed its infrastructure so that machines in far-flung parts of the world would depend heavily on those closer to home. This design made it easier for Google to provide the same range of services to billions of people all around the globe. It also allowed the company to keep data fresh and quickly update software.
  That approach has pitfalls, though, which came to a head in June 2019 in an incident that Googlers now refer to as the Maya Apocalypse. Google data center workers were making physical repairs on some machines in Oregon when a bug in a software program called Maya, which automatically shifts responsibilities among servers, shut down another system, the Borg Masters, which effectively acts as a control for the entire network. This set off a domino effect that crashed services across North and South America. As servers failed, Google’s network capacity shrank and became more congested, causing slowdowns for YouTube viewers and delaying efforts to get the system back online.
  After the Maya Apocalypse, Kurian told employees the company needed a “reliability reset.” Initially he froze all software updates for a month, anticipating it would take that long to resolve the reliability issues. Instead the company has spent much of the past three years on the project, according to three people familiar with the situation, who asked not to be named discussing internal operations.
  To a certain extent, the goal has been to re-create a specific aspect of Amazon’s cloud infrastructure. “AWS has done something most providers don’t—they have a strong regional separation,” says Duckbill’s Quinn. “An outage in one region almost never impacts other regions. Meanwhile, Google has its vaunted global network, where as a result you see things like global rolling outages. When Google Cloud goes down, it tends to go down a lot harder and in more regions than AWS does.”
  Google Cloud undertook a series of initiatives to isolate its servers from one another when the need arose. The seeds of the main effort, Project Drawbridge, were planted in the aftermath of the Maya Apocalypse, though the project didn’t officially begin until spring 2021. The idea was to give customers the option to temporarily sever—or “pull up”—the connections between regions to separate data and keep problems from spreading. This was particularly important for banks and other regulated industries operating in multiple jurisdictions, according to Google. Within Drawbridge, Google also introduced Project Moat, a program that allows customers to host independent versions of their applications and services in various zones around the world, so clients can connect to the closest regional version of a service if the main version is down.
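  The failover behavior described for Project Moat can be illustrated with a minimal sketch. It is not Google's implementation: the Replica type, the pick_replica function, the region names, and the distance-based notion of "closest" are all hypothetical, chosen only to show a client falling back to the nearest healthy regional copy when the main version is down.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One independently hosted copy of a service (hypothetical model)."""
    region: str
    distance_km: float  # assumed distance from the client; stands in for "closest"
    healthy: bool = True


def pick_replica(primary: Replica, replicas: list[Replica]) -> Replica:
    """Return the primary if it is up, otherwise the nearest healthy replica."""
    if primary.healthy:
        return primary
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available")
    return min(candidates, key=lambda r: r.distance_km)


# Illustrative data only: the main version is down, one nearby copy is also down.
primary = Replica("us-central", 0.0, healthy=False)
replicas = [
    Replica("europe-west", 7000.0),
    Replica("us-east", 1200.0),
    Replica("us-west", 900.0, healthy=False),
]
print(pick_replica(primary, replicas).region)  # us-east
```

  In this sketch an unhealthy copy is simply skipped, so a failure in one region does not prevent clients from reaching another, which is the separation the regionalization effort is meant to provide.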
  “The reliability of our customers’ workloads is our top priority—it’s how our team demonstrates customer empathy,” Ben Treynor Sloss, a vice president at Google who oversees technical staff, wrote in an email. He added that customers had differing preferences with regard to regionalization, so Google’s approach is to offer a choice.
  The company’s efforts are still a work in progress, as illustrated by an incident on Nov. 16, when a network configuration issue disrupted multiple Google cloud computing products, undermining the websites and apps of clients Home Depot, Snapchat, and Spotify. As they’ve strived to solve the engineering issues, Kurian and Sloss have also tried to make Google’s engineers understand the stakes. They’ve asked technical staff to join meetings with clients or read descriptions clients have written about their experiences with outages to understand how they’re affected when Google’s services go down.
  Some clients have expressed deep frustration and anger, saying they’d lost faith in Google or might consider alternative cloud providers, employees say. It’s hard to assess how serious these threats are; changing cloud providers is a significant undertaking, and competing providers have suffered their own periodic outages as well. Given the difficulty outsiders face in assessing networks’ relative reliability, clients can’t be sure they’ll encounter fewer disruptions elsewhere. Even if Kurian does succeed in making Google the most reliable cloud, turning that into a business advantage may be as much a marketing challenge as a technological one.