Google chief's endless search for fast and reliable services

“IF GOOGLE ever goes down it’s my fault,” says Ben Treynor, Google’s global vice-president for site reliability. “I also like to refer to it as a low-stress job with regular hours. Note I had a full head of hair when I joined the company.”

Treynor may be laughing, but he has a deadly serious job at the global internet giant. He and his team of engineers have to make sure not only that the myriad Google services – search, e-mail, maps etc – are available when users log on, but also that they work quickly and with minimal time lag.

“The user experience for Google should always be good,” says Treynor. “Google should always be available, it should always be fast; you should always be able to reach it wherever you are.”

On a recent visit to Dublin, one of the places where the global site reliability team are based, Treynor explained that, when he joined the firm five-and-a-half years ago, a small team of senior engineers dubbed “production” made sure the search engine didn’t crash. Treynor doesn’t recruit system administrators but rather engineers who try to automate as much of the management as possible. He says the decision to locate some of the team in Ireland in 2004 was partly pragmatic, as Google already had an office here at a time when the firm was still largely US-based. “The eight-hour time difference between Mountain View and Dublin is ideal for maintaining 24/7 coverage with some human . . . on deck for people to call,” says Treynor.

With more than half of web searches going through Google and the company adding new services every month, is there ever a tension between the services that engineers want to introduce and what Treynor knows will work effectively?

“Site reliability knows a lot about what works and what doesn’t at scale,” says Treynor. “The best testing in the world will not tell you how the feature is going to perform when it is exposed to Google-scale numbers of users. The query stream we get is extremely high and extremely varied. The query string we get from Kenya, for example, is very different from what we get from Korea.”

Being a global service means Treynor and his team have to be aware of cultures outside the US, as he discovered in the summer of 2006. Sitting in his California office, he suddenly realised traffic to Google from around the world was falling off dramatically.

He rushed over to one of his engineers in a panic to see if he knew what was happening.

“He looked at the traffic graph and chuckled,” remembers Treynor. “He found an e-mail from four years earlier because my predecessor had popped up with the exact same question – the World Cup final was on.”

Although it is not always apparent to users, Google is constantly revising its services and adding and removing features.

“The reality is that search quality at Google is one of the most important things that we do. We are constantly changing the algorithms that are used to determine what is a good answer to a query of this type from a person in this geography.”