‘You should automate yourself out of a job every 18 months’

Dublin-based site reliability team has become central to Google’s infrastructure

 

“One of the primary skills I look for in an engineer is laziness.” Dave O’Connor’s statement may seem rather glib, but when you think about what the site reliability engineering (SRE) team that he leads has to deal with every day, it actually makes sense.

The 250-strong team is responsible for managing products such as Google’s Ads network and search engine, Google’s corporate engineering efforts, and the storage and network infrastructure that underpins it all. That’s a lot of services and resources to keep up and running, and it is practically impossible to do by human intervention alone.

“It’s scale that is the primary driver of what we’re trying to do. We have to do it with automation and software. And systems that can self-diagnose and heal and only tell a human it’s broken if a human needs to intervene.”
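The pattern O’Connor describes, in which a system diagnoses itself, attempts its own repair and only alerts a person when that fails, can be sketched in a few lines of Python. This is a minimal illustration, not Google’s tooling: the `Service`, `Supervisor` and `make_flaky_service` names, the retry limit and the alert text are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Service:
    """A hypothetical monitored service (illustrative only)."""
    name: str
    is_healthy: Callable[[], bool]  # self-diagnosis probe
    remediate: Callable[[], None]   # automated fix, e.g. a restart


@dataclass
class Supervisor:
    """Tries automated repair first; pages a human only as a last resort."""
    max_attempts: int = 3                            # assumed retry limit
    pages: List[str] = field(default_factory=list)   # alerts sent to humans

    def check(self, service: Service) -> str:
        if service.is_healthy():
            return "healthy"
        for _ in range(self.max_attempts):
            service.remediate()
            if service.is_healthy():
                return "healed"  # fixed without human involvement
        # Automation has run out of options, so a human needs to intervene.
        self.pages.append(f"{service.name}: automated repair failed")
        return "escalated"


def make_flaky_service(name: str, failures_before_recovery: int) -> Service:
    """Build a fake service that recovers after a set number of repairs."""
    state = {"remaining": failures_before_recovery}

    def is_healthy() -> bool:
        return state["remaining"] <= 0

    def remediate() -> None:
        state["remaining"] -= 1

    return Service(name, is_healthy, remediate)
```

Calling `Supervisor().check(...)` on a service that recovers after two repair attempts returns `"healed"` and pages no one; a service that needs more repairs than the limit allows is escalated, and a single message lands in `pages`.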

So he looks for people who will fix something and then figure out a way so that they never have to do that task again. “That’s a virtue. We want people to be in that position. We say you should automate yourself out of a job every 18 months.”

To work in O’Connor’s team you also need to have a healthy dose of cop-on, he says.

“People who can approach a problem from first principles and think about it not in terms of what they’ve seen before or been taught or experienced. We attract people who like to build stuff, who thrive on ambiguity. When a problem comes out of left field they say ‘let’s see where that takes us’.”

O’Connor’s official title is director of SRE Dublin and Global SRE lead for Infrastructure Storage. The site reliability team in Ireland was formed in 2004, the same year that O’Connor joined Google. It has been growing ever since, and is now central to Google’s infrastructure.

Engineering team

It’s not the only engineering team in Dublin; there is also a network engineering team which is another crucial part of Google’s infrastructure.

“When I graduated from college in 2001 it was with an expectation that I would be writing software to ship on CD or – this brand new thing – that people would download it from the internet and install it on their PCs,” he says. “Most software engineers of my generation would be of that ilk, that your problems end when you ship your software.”

That’s no longer the case. In fact, deploying software into an environment where more and more demands are placed on it is just the start of the problems. Neither software nor the tech environment in general stands still, so software can become less reliable over time and stop working just when you need it most. That’s where the SRE team does its work. It may not be the first thing that people associate with innovation, but it is an important one.

“When the average person thinks about innovation in engineering they think about it in terms of the person who produces the shiny box and the people who produce ‘the thing that I click’,” he says. “A lot of the work we do here is innovative work in terms of having that shiny box not be an expensive paperweight because it can’t talk to whatever Internet of Things service it wants to talk to.”

The key thing is reliability for products. As O’Connor points out, having something with “whizz bang” features is great. But it still has to fulfil its promises.

“If it doesn’t work people will go elsewhere. To do that you have to engineer reliability into your product from day one, at the core of your product. So reliability is the product in a lot of cases.”

‘Out of a job’

While many professions watch the growing march towards automation warily, O’Connor and his team actively work towards it. Without it Google’s infrastructure wouldn’t be able to work effectively; the scale of the set-up would require a huge team.

“We need to automate ourselves out of a job, and once you’re done you have to do it again,” he says. “You’re inventing new problems for yourself the minute you automate away problems you have. It’s a constant process.”

SRE is still a relatively new discipline in engineering, although Google has been involved in it for about 15 years. In recent years, however, O’Connor says other companies have begun to talk about it. The problems are universal; the only difference is scale.

The SRE team in Dublin quite literally writes the book on site reliability engineering. Published in 2016, Site Reliability Engineering featured chapters contributed by some of the Dublin team, including O’Connor. The book was all about how Google handles site reliability engineering; a second book is set to be published with worked examples from the first book, written by people from other companies.

So Google’s influence in SRE is spreading. It is shaping the future of networks both inside Google and out. One thing the Dublin team was heavily involved in is zero-trust networks. Inside Google it’s known as BeyondCorp.

Rory Ward explains exactly what it means. “It’s a complete re-architecture of the enterprise security infrastructure in Google.”

Typical security setups follow a perimeter-based model. On the inside everyone is good and has access to everything; on the outside everyone is bad, with access – hopefully – to nothing, and there’s one way in.

Security model

But walls don’t work. That was the conclusion Google’s engineers came to. While it was once the best solution, when everyone worked inside a building and no one had access from the outside, the trend towards mobile working has effectively blown holes in that security model.

“We put all the internal services of Google on the internet. Which sounds like a really dumb idea,” said Ward. “But there is no way that you can assume that the internal network is ever really going to be safe.

“The idea of having a firewall and a protected internal privileged network really is broken. It worked well when everyone had to go into the building but that no longer works.”

With a zero-trust network there is no internal network as such, and BeyondCorp is the first large-scale implementation of that type of network. Dublin is the site behind it, says Ward. “It’s an innovative programme we put together over a couple of years.”

First of all they had to get the backing of Google’s executives. Then they had one brief: don’t break the system for anyone. The implementation had to be done without people losing access to Google’s systems.

“We do it based on what we know about you, and what we know about the device. We’ve built a bunch of technology to be able to identify and categorise every device that we have. Every access is authenticated and encrypted in transit and at rest. Every Googler can do all their job everywhere without a VPN,” said Ward. “We managed to completely redefine the security model in Google, but also improve productivity.”
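The core idea Ward outlines, deciding each request on what is known about the user and the device rather than on where the request comes from, can be illustrated with a short Python sketch. It is a simplified assumption-laden example and not BeyondCorp’s actual design: the `User`, `Device` and `Request` types, the `managed` and `patched` flags and the `RESOURCE_GROUPS` policy table are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Device:
    device_id: str
    managed: bool  # listed in a (hypothetical) device inventory
    patched: bool  # meets a (hypothetical) patch-level policy


@dataclass(frozen=True)
class User:
    username: str
    groups: FrozenSet[str]


@dataclass(frozen=True)
class Request:
    user: User
    device: Device
    resource: str
    source_network: str  # recorded, but deliberately ignored by allow()


# Hypothetical policy: which group a user must belong to for each resource.
RESOURCE_GROUPS = {"bug-tracker": "engineering", "payroll": "finance"}


def allow(request: Request) -> bool:
    """Decide access from user and device facts, never from network location."""
    required_group = RESOURCE_GROUPS.get(request.resource)
    if required_group is None:
        return False  # unknown resources are denied by default
    device_ok = request.device.managed and request.device.patched
    user_ok = required_group in request.user.groups
    return device_ok and user_ok
```

In this sketch an engineer on a managed, patched laptop reaches the bug tracker whether the request comes from a cafe or an office, while the same engineer on an unmanaged device is refused even from inside the building. That is the inversion of the perimeter model: being "inside" grants nothing.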

Five papers

The next step was to tell the world. Google engineers – including Ward and others on the Dublin team – have written five papers on the subject and published them, and a sixth is on the way.

Zero-trust networks and BeyondCorp are almost synonymous. The concept has now begun filtering into other companies.

Meanwhile, inside Google the team achieved its goal. No one noticed any interruption to access when the zero-trust network was rolled out. But staff contacted their IT support more often anyway; they noticed they could access Google internal resources on devices without the traditional authentication process they were used to and thought there was a glitch.

So the Dublin team’s project has had a global impact.

“That’s the sign of a really good SRE team. You should really only hear about SREs when things go bad,” said Ward. “They haven’t heard of us too much.”