Technology is only one part of microservices. This is how Yelp thought about their organization and operational ownership.
John: Okay, can you hear me? Yes? Good. So, my name is John. I’m a Tech Lead in the infrastructure team at Yelp. And we’re going to be talking about the human side of services. And this is Joey.
Joey: Hi, I’m Joey. I’m going to be helping him.
John: Okay, so just a little bit about Yelp. What do we do? We connect people with great local businesses. And we have 89 million monthly unique visitors as of Q3 2015. We have 90 million reviews now. Seventy-one percent of our search is now mobile. And we’re in 32 different countries.
Joey: Cool. All right, so whenever we’re talking about architecture…can I have the [inaudible]? Okay, whenever we’re talking about architecture there are kind of two important things to consider. We’re going to look at technology, code, architecture, things like that. And we’re going to look at the developers, the people who actually write that code and ship it. And our talk is primarily going to be about the second one.
And just kind of a disclaimer from the very beginning, these are obviously our opinions from a mid-size company, a couple-hundred-engineer org. When you’re Google, you’re going to have different opinions on how to build an organization. And when you’re a two-person startup you’re going to have different opinions. But we’re going to be focusing on that middle, the lessons learned, and what we’ve seen there.
We’re going to mention technologies. I have opinions about technologies. John has opinions about technologies. But we’re going to try not to focus on those, and that’s because what we think is really important when you’re thinking about microservices is getting better at this. Getting better at shipping your code, getting better at delivering that business value to the customers, which means that you’ve got to optimize your development process. So okay, cool. How are we going to do that?
John: So, you may recognize this graph. Moore’s Law. In 1971 we had about 2.3 thousand transistors. In 2011, we’re up to 2.6 billion, so that’s an exponential increase in complexity. You might say, “Well, what’s that got to do with services, John? This is a crazy thing.” Yes, we know this. Well, we have a monolith. We started in 2005 with approximately zero lines of code. And then fast forward to 2016 and we have 3 million lines of Python code in our one sort of single thing that we deploy.
And so this is great. Obviously, the business has succeeded. We are still writing code, shipping code. But we found that we were kind of slowing down a bit. You can only keep adding lines of code and programmers so far until the push process kind of gets a little bit awkward. You might say, “Well, John, maybe you’ve got a really bad push process.” And so what does that look like? Well, what we do is we do not have a QA team. It’s all dev-driven and so there’s a rotating responsibility for Push Master.
And so about three times a day a Push Master, who’s a developer, what they do is they accept about 20 push requests. These are just branches of reviewed code. And then those go out into production. And hopefully the Push certifies, and so that’s 20 changes, 3 times a day, so that’s 60 changes. Okay, so that’s what we do. But there’s a bit of a bottleneck here. We can’t keep pushing code ever faster. So what did we do? Well in 2011, we had an idea. Services, this is a thing, right?
Let’s give it a go. Let’s just start with something. It doesn’t have to be great. It’s just an experiment. And so what did we do? We wrote this geocoder service. And it’s not really a geocoder; it was completely misnamed. The core of it is this quadtree, which takes a latitude and longitude, and it just returns either the neighborhood that that lat and long is in, or the country. That’s all it does. It’s very simple. But we happened to hit this service quite a lot and so it was a really good way of exercising our infrastructure.
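As a rough sketch of the kind of lookup being described (the region names and the flat list of bounding boxes are made-up placeholders; the real service used a quadtree over much richer polygon data):

```python
# Hypothetical sketch of the lat/long -> neighborhood-or-country lookup.
from collections import namedtuple

BBox = namedtuple("BBox", ["min_lat", "min_lon", "max_lat", "max_lon"])
Region = namedtuple("Region", ["name", "bbox"])

# Listed most-specific first; placeholder data, not Yelp's real regions.
REGIONS = [
    Region("Mission District", BBox(37.748, -122.426, 37.769, -122.401)),
    Region("United States", BBox(24.5, -125.0, 49.4, -66.9)),
]

def locate(lat, lon):
    """Return the most specific region containing the point, or None."""
    for region in REGIONS:
        b = region.bbox
        if b.min_lat <= lat <= b.max_lat and b.min_lon <= lon <= b.max_lon:
            return region.name
    return None

print(locate(37.76, -122.42))  # -> "Mission District"
```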
And we also had to figure out how to actually get the code out to production, how to do monitoring, how to do alerting. And so just having this one service was really helpful. And now fast forward five years later, we’ve got our PaaS, PaaSTA, which we recently open sourced and which uses Mesos, Marathon, and Docker. And so we’ve iterated quite a lot, but this was how we began. And so maybe some of you are in a similar position. How do you get started?
Well just try something and work from there. So, over the five years, we had this Cambrian explosion of services. We started off with 0 services, we now have over 150 production services and so it’s been very successful. Maybe we’ve overshot the mark a little too much and Joey will talk about that in a bit. And along with a kind of this proliferation of services we’ve had to change our organizational structures quite a lot. So the big change is that we’ve had to eliminate queuing on our ops team.
We still have an ops team. They are responsible for the monolith, for Yelp Main. But initially, if you wanted to ship a service, you talked to the ops team, they’d get back to you, and you’d go backwards and forwards. Maybe this sounds a bit familiar. And maybe six or eight weeks later it would actually get out into production. And this was just slowing everyone down, so it wasn’t working very well. So we found that we had to really spread out responsibilities across the organization.
And a key idea here is education. It’s been so important for us. So why is education important? Well, there’s this idea that programming the monolith is hard. If you’ve just come out of college, you’re hit with three million lines of Python code. That’s hard. And now you’re trying to actually do distributed systems on top of that, and it’s sort of 10 times harder. So the two hard problems in distributed systems: exactly-once delivery, guaranteed order of messages, exactly-once delivery.
And we actually saw this problem a few weeks ago where we had some message duplication due to overloading Kafka, and some of these messages were responsible for putting photos in business timelines. And so we got duplicates. Some of them were duplicated 60 times, so sometimes we saw 60 duplicate photos in our business timelines, on our website, which was embarrassing. So idempotency, it’s important.
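A minimal sketch of the idempotent-consumer fix, assuming each message carries a unique ID; the in-memory set and the timeline function are placeholders, since a real consumer would deduplicate against a durable shared store such as a database unique key:

```python
# Idempotent consumer sketch: processing the same message twice has no extra effect.
seen_message_ids = set()  # in-memory for illustration; use a durable, shared store in practice

def add_photo_to_business_timeline(business_id, photo_url):
    ...  # hypothetical side effect; the point is it runs at most once per message id

def handle_photo_message(message):
    if message["id"] in seen_message_ids:
        return  # duplicate delivery: drop it instead of double-posting the photo
    seen_message_ids.add(message["id"])
    add_photo_to_business_timeline(message["business_id"], message["photo_url"])
```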
How do we educate people? Well, there are a load of different ways we’ve done this. First of all, we published this service principles document. It’s on GitHub if you want to check it out. And it’s meant to be a technology-agnostic set of principles for designing and operating services. We also wrote this tutorial. And what it does is it walks service authors through from first steps to building up a fully functional service that uses some sort of backing data store, how to do monitoring, and people find that very useful. Another very important thing that we’ve done is to create these deputy programs.
And so what is a deputy program? Well, the idea is that, say, on the operations team, before, they were the ones doing the Puppet changes or DNS changes, that sort of thing. Well, ideally maybe you want everyone to be able to do this, but it’s kind of dangerous because there’s a really large blast radius if everyone’s touching Puppet or everyone’s making DNS changes. So what you do is you take one or two more senior developers from each team and you train them up to be able to do these tasks.
Ideally, you’d automate all of them away, but the reality is you can’t automate everything. And so you train these people up. You put them back on the teams they came from. And these are probably people who wanted to do that anyway, so they’re very happy to get these privileges. And then you spread this knowledge across the organization. And so we did that for release engineering, ops, and also mobile.
We also run these office hours. So every week, there will be a team of people who are kind of experts in distributed systems who will just sit in a meeting room in Yelp. And if you have a problem, you can just come along and talk to these people. And it’s a great way of fostering sort of community around this and discussions. We do deep dives. So every Monday we have an engineering-wide meeting. And at the end of it there’s 5 or 10 minutes where somebody gets to talk on a topic of interest.
And so this is a great way, again, of disseminating knowledge across the organization. And the final thing we do is we have this service creation form, SCF. And for any new service, the service authors have to fill in this form. And what does it specify? It specifies things like how you’re going to load balance your service. What happens if your service goes down in the middle of the night, will the website go down or is it just a minor inconvenience? What data stores are you using? Those sorts of things.
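A hypothetical version of such a form (the field names here are illustrative, not Yelp’s actual SCF) might capture answers like these:

```python
# Hypothetical service creation form (SCF); fields are illustrative placeholders.
service_creation_form = {
    "service_name": "geocoder",
    "owning_team": "infrastructure",
    "load_balancing": "haproxy, round-robin",  # how traffic reaches instances
    "failure_impact": "minor inconvenience",   # vs. "site down" if it dies at 3 a.m.
    "data_stores": ["mysql"],
    "oncall_schedule": "infra-primary",
}
```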
And this gets reviewed by two or three experts in that sort of area. Now, it is a balancing act: if you have no process, we found it becomes really chaotic; if you have too much, then again you have queuing and it slows everything down. So we found this is a pretty good middle ground. So now let’s talk about consistency. And this is a topic that’s come up a few times already.
So here is our map of the world. You may have a different map. What do we use? We use things like MySQL and Elasticsearch, and we use Python and Java as our preferred languages. We have Cassandra, ZooKeeper, Sensu for monitoring. Initially, when we started doing services, it was like the Wild West. It was amazing. Everyone could choose whatever they wanted, whatever language, and we kind of ended up doing that quite a lot. And the problem was, we got these single points of failure in terms of people.
There would be the one expert in a particular language who had gone off and written their service in that language. Then maybe that person changed teams. And so their team was left trying to learn a new programming language and also run the service. And so nowadays, we don’t say you can’t use a particular tool X. We just say, “Well, throughout the organization these are our core competencies, so if possible, really try and stick to those. But if you have a compelling reason not to, that’s fine. But you have to be aware of what you’re taking on.”
So, interface design. One of the things we really encourage people to do is use HTTP. I think almost all of our services have HTTP interfaces. And on top of that, we use Swagger for quite a few of our services to define the interfaces. Hopefully, quite a few of you are familiar with it. I gave a talk last weekend at Scale about Swagger. It’s really nice for being able to define your interfaces, and there’s some great tooling on top of that to generate client libraries, to create a central directory of service interfaces, that sort of thing.
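As a minimal sketch of the idea, here is a hypothetical Swagger 2.0 spec for the lat/long lookup described earlier; tools like bravado can generate a client library from a spec of this shape:

```python
# Hypothetical Swagger 2.0 interface for the lookup service; names are illustrative.
SWAGGER_SPEC = {
    "swagger": "2.0",
    "info": {"title": "geocoder", "version": "1.0.0"},
    "paths": {
        "/neighborhood": {
            "get": {
                "parameters": [
                    {"name": "lat", "in": "query", "type": "number", "required": True},
                    {"name": "lon", "in": "query", "type": "number", "required": True},
                ],
                "responses": {
                    "200": {
                        "description": "The neighborhood or country containing the point",
                        "schema": {
                            "type": "object",
                            "properties": {"name": {"type": "string"}},
                        },
                    }
                },
            }
        }
    },
}
```

Over to you, Joey.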
Joey: All right, so you, as an org, might have decided on certain technology stacks, certain tools. Obviously, they’re going to be different than ours. But one thing that we realized after we had standardized on tooling and languages was that we started running into cross-cutting organizational objectives that we had to start tackling, which previously we kind of had experts for.
So we might have the operations team who understood how much things cost, or we had a security team that was in charge of making sure the single web app was secure. And when we started moving to a distributed, service-oriented architecture, we realized that we had to start disseminating this knowledge across the org. So you go, “Okay, cool. That’s no problem guys. We’ll just send out a memo. We’ll say, ‘Hey everybody. Performance is important. You’ve got to be secure. Make sure that your service stays up and keep the costs down.’”
Okay, problem solved. But what we found is this didn’t solve the problem at all. And in fact, developers just kind of ignored it. So instead, we turned to incentives. And what we found was that if you take inspiration from the quote “What gets measured gets managed,” you rapidly start making progress on those cross-cutting objectives. So for example, with performance, one thing we invested in very early was the ability for all of our services to, in a standard way, export performance metrics.
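A minimal sketch of that kind of standard per-endpoint instrumentation, assuming a statsd-style metrics client; the service prefix and the handler are placeholders:

```python
import time
from functools import wraps

import statsd  # assumes the Python statsd package; any metrics client works similarly

stats = statsd.StatsClient("localhost", 8125, prefix="geocoder")

def timed_endpoint(handler):
    """Export a timer per endpoint so every service reports metrics the same way."""
    @wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return handler(*args, **kwargs)
        finally:
            elapsed_ms = (time.time() - start) * 1000
            stats.timing("endpoint.%s" % handler.__name__, elapsed_ms)
    return wrapper

@timed_endpoint
def get_neighborhood(lat, lon):
    ...  # placeholder handler body
```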
And this is an example of one of our services, which is responsible for routing database queries for other services. And you’ll notice, every single endpoint on that service has great performance metrics. And everyone talks about metrics, but the next step is taking this and putting it up on a wall. So one of my favorite examples of this is back when I was on the search team. I was really unhappy with how slow search was. I was like, “Come on, guys. We can do better than this.”
So I took a graph and I put it up on the wall and I just colored it red. So anything above 150 milliseconds was red. And all of a sudden, out of nowhere, quality developers who prior to that day never cared about performance came to me and were like, “Hey, Joey. I made it faster. Look, it’s underneath the red line.” And I was like, “All right.” So without sending out an email, I didn’t say, “Hey, quality developers. Make sure that when you develop things it’s fast.”
I just gave them numbers. And they did it themselves. We see this again and again with developers. We saw this with security. We were having all these situations where our security engineers were like, “Guys, we write super insecure code. We have all these Content Security Policy violations. What’s up with this?” And so what we did was we started recording those metrics, we put them up on a wall, and we found things like this. We were like, “Hey, that sounds like a lot of violations. We should probably lock that down.”
And some developer goes and fixes this, and overnight you go from 1,000 per day to almost none. So we saw it with performance, we saw it with security. We see it with reliability the moment we give people the ability to measure nines on their service, to measure the availability of their service across all their endpoints, across all their data centers, and aggregate and disseminate that information. In the case of reliability, what was really important was the ability to alert on this.
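A sketch of what such an alert check might look like; the SLO numbers match the 99.9%-over-5-minutes example mentioned next, and the paging hook is a placeholder:

```python
SLO_AVAILABILITY = 0.999   # 99.9%, the example SLA from the talk
WINDOW_SECONDS = 5 * 60    # evaluated over a 5-minute window

def page_oncall(message):
    ...  # placeholder hook into the paging system (see the PagerDuty sketch below)

def check_availability(success_count, total_count):
    """Page the owning team if the windowed success rate drops below the SLO."""
    if total_count == 0:
        return  # no traffic in the window; nothing to judge
    availability = success_count / total_count
    if availability < SLO_AVAILABILITY:
        page_oncall("availability %.4f below SLO %.3f over last %ds"
                    % (availability, SLO_AVAILABILITY, WINDOW_SECONDS))
```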
So instead of just saying, “Hey yeah, guys. We’ll get paged when something gets slow,” we can say, “Okay, we can also get paged when it drops below our SLA of 99.9% availability for 5 minutes.” And then this kind of goes hand in hand with introspection tools, so this is where things like Splunk or other log analysis tools might come in, where, okay, cool, we know it’s down, but why? And finally, cost. So at least at Yelp, cost came sort of late.
And the previous speaker gave a really good example of why you have to expose these, because originally we didn’t expose this. Developers just launched whenever they wanted and built whatever they wanted and it got really expensive. So instead of just emailing out, “Hey guys. Stop spending money,” we started sending each team a report that said, “Hey, this week you spent this much money. Did you mean to do that?”
And within a week we started having drastic cost reductions because people were like, “Oh, we don’t need that giant Elasticsearch cluster. Nobody even uses that anymore.” But now that they actually have the metric, they’re able to take action based on it. So yeah, this really wraps up nicely in the idea of contracts and service level objectives, which some previous speakers have talked about. And this is something that Yelp hasn’t done yet.
We started recording these metrics. We started sending these reports. We started exposing these to developers. But I think the next step for us is definitely writing these things down and actually having a contract between service authors. We think that’s exciting, and we think other organizations pursuing microservices should definitely consider doing something like this. Cool. All right, so we’ve got these services. How do we operate them?
John: All right. So, sharing is caring, right? Well, when it comes to microservices that’s not always true. We try quite hard to define ownership by team, so mostly by dev team. If you’re on a dev team you’ll have these services. These are the services you own. If there’s a problem, you talk to the manager of that team. So it’s very clear who owns what code. And the same goes for operational responsibilities. If you build it, you’re responsible for running it until it’s decommissioned, if that happens.
And we found that the ops team was quite good at their best practices. They’ve been doing this for a long time. But when we got dev teams doing this there was a learning curve. And so again, it was all about disseminating that knowledge. And so understandably, failures do happen. That’s the reality of making changes to code. How do we deal with it? Well, we use PagerDuty across the organization. Each dev team has their own on-call schedule in PagerDuty. And so it’s very easy to contact a team if there’s a problem.
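For illustration, paging a team programmatically might look like this against PagerDuty’s Events API v2; the routing key is a per-team placeholder, and the talk doesn’t say exactly how Yelp wires its alerts into PagerDuty:

```python
import requests  # assumes the requests package

def trigger_page(routing_key, summary, source):
    """Open an incident on a team's PagerDuty service via the Events API v2."""
    resp = requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": routing_key,  # the team's integration key (placeholder)
            "event_action": "trigger",
            "payload": {
                "summary": summary,      # e.g. "iad1 is sick"
                "source": source,        # e.g. the alerting check or host
                "severity": "critical",
            },
        },
    )
    resp.raise_for_status()
```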
So maybe the ops team will contact them if there’s a problem, or maybe their monitoring will fire for their services and alert them. What else do we do? Well, when there is an issue, we use JIRA to create a ticket. So you see it in the top left: iad1 is sick. One of our data centers isn’t working. We use that to coordinate a lot of the response. And then once we’ve solved the problem, the JIRA ticket actually goes into the postmortem state. And so bottom left, there’s an example of our postmortem mailing list that goes to every single engineer in the company.
And we really encourage people to write postmortems. From my own personal experience, it does take a lot of time to write the postmortem if you actually want to reconstruct what happened. There’s a timeline you need to figure out. You need to talk to other engineers. But Yelp’s very good at supporting that process.
Joey: And I think something that’s interesting about that is that our operations team always did what John just talked about. For us, the challenge with moving to a service-oriented architecture was getting all the other teams to do it and really writing these things down and saying, “Hey, guys. Having a pager rotation is important. Here’s how you respond to on-call [inaudible].” And while we’re speaking of things we’ve done wrong, let’s chat a little bit about pitfalls.
So we’ve heard a couple of different pitfalls in previous presentations and we’re just going to recount some of the ones we ran into. So the first one, I kind of took this inspiration from a wonderful post. I think the guy works at Twitter. And basically his contention was that a good way to engineer something like microservices was to let 1,000 flowers bloom and then rip 999 of them out by the roots and keep the one that won, kind of like natural selection. Yelp was really good at the first part of this.
We were really good at starting 1,000 things. And we were really bad at going back afterwards and saying, “You know what? Actually, those 999 solutions are not so great.” And this really bit us hard, because we had a really quick spurt of velocity, in terms of being able to ship 1,000 services in 14 days, but then our velocity rapidly tanked because we weren’t able to maintain all of those solutions. So this is definitely something that we learned through experience, and it kind of hurt.
This was the second one. And I believe this phrase was coined by Jay Kreps over at LinkedIn. But we ran into this. It’s hilarious to me because we were talking to Jay Kreps before we did our service-oriented architecture. And he was like, “Guys, be careful to not have an append-only service architecture.” And we’re like, “Okay, we won’t do that.” Then we did it. So what is an append-only service architecture?
It’s basically an alignment of incentives whereby developers have incentives to launch a new service rather than rearchitecting or putting their code in an existing service. So originally it was very hard for us to launch services, and then we really rejoiced when it was like, “Okay, cool. A developer can launch a service in 30 minutes. This is awesome.” But then the incentives wildly shifted. And now, when developers wanted to develop a new feature, they would just go, “Okay, I don’t need to worry about my neighbor. I can just develop this in a service. I can ship it with one function. It’ll be great. It’ll be a microservice. I’ll have 100 of them and I’ll be the only person that understands how they work. And it’s okay, because I shipped it.”
And we [inaudible] where we ended up with [inaudible] who had shipped that one-method service that was written in Clojure. It was deployed totally differently from everything else. And we had to get a grip on this. We had to start encouraging developers: “Hey guys. Why don’t you look at the services that already exist and see if maybe it makes more sense to put your search endpoint in the existing search service instead of creating a whole new one that someone else has to be on call for.”
All right, so this one’s interesting because I think that what we’re going to talk about here kind of goes in contrast with some of the other things that we’ve talked about. But we think it’s a pretty big anti-practice to ditch libraries entirely or try to make them into services. Because we all have an incentive to do this when we start having a service-oriented architecture. We can get rid of all these bad things. We don’t have to have our tests break anymore. We don’t have to manage 20 versions of my library.
Now somebody can call it from any language. And what we found was that this kind of happened. People would just take a library and put it in a service. And they’d go, “All right, guys. I’ve solved all these problems.” But the reality is that this just slows everything down, because it’s not a service-oriented architecture at that point. They’re really just taking what used to be function calls and turning them into RPC. And if you’re Google and you have super low latency networks and very reliable network calls, sure, go ahead and do this.
But we don’t have that kind of network. Our network drops packets. I’m sure yours does, too. And this rapidly led to a combinatorial explosion of performance characteristics. So this kind of led us to recognize that libraries are kind of awesome. All those things that we previously thought were terrible: it’s kind of cool to have my tests break instead of me getting woken up at 2:00 in the morning. I really don’t like waking up at 2:00 in the morning, so I’d much prefer that the build break at 4:00 p.m. than me getting woken up at 2:00 a.m.
And deploying 20 versions actually can be really useful. It’s somewhat nice to not have to upgrade the whole org because the core library, as a service, has to be upgraded and we want to deprecate an old version. So the reality that we found is that it’s a balancing act. And I think that’s really what we’ve been finding throughout this entire process: sometimes things make sense as a library, sometimes they make sense as a service. And this kind of tradeoff decision we codified in that service principles document.
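To make that tradeoff concrete, here is a hypothetical sketch of the same lookup consumed both ways (the library stub and the service URL are made up); the remote version suddenly needs a timeout, retries, and a failure mode:

```python
import requests  # assumes the requests package

# As a library: a plain function call. Failures surface as test or build
# breakage in the afternoon, not as pages at 2:00 a.m.
def locate(lat, lon):
    ...  # stand-in for the in-process library call (the quadtree lookup)

def neighborhood_from_library(lat, lon):
    return locate(lat, lon)

# As a service: the same call is now an RPC, so the caller has to handle
# timeouts, retries, and partial failure, on a network that drops packets.
def neighborhood_from_service(lat, lon, retries=2):
    for attempt in range(retries + 1):
        try:
            resp = requests.get(
                "http://geocoder.example.com/neighborhood",  # hypothetical URL
                params={"lat": lat, "lon": lon},
                timeout=0.2,  # seconds; every caller now has to pick one
            )
            resp.raise_for_status()
            return resp.json()["name"]
        except requests.RequestException:
            if attempt == retries:
                raise  # callers now need a degraded behavior for this case
```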
All right, so the final pitfall I want to talk about is this one. So initially, when we started doing this we were really excited. We were like, “All right, guys. We’ll dev ops it up. Our devs are frustrated. Our ops are burning out. It’s all right. We’ll make them dev ops and then it’ll be better.” Right? No, it won’t be better. What we rapidly figured out was that not all of our devs wanted to do ops. In fact, a lot of them were very content with never strace-ing a process.
Not all ops wanted to do dev. Maybe they don’t care about dependency diamonds in software. They want to just SSH into the box and debug it. And we also realized that we had a lot of specialized roles, at least in our organization with a couple hundred engineers. We had people like DBAs who didn’t know about Puppet, and that’s okay. So what we aimed for instead, after we took the deep dive into “everyone will do ops” and went, “Oh God, this is terrible,” was to encourage cooperation between teams and within teams.
And acknowledge that our engineers had varied skills. Within a team, you might have somebody who wants to do ops. That’s cool. Empower them. They might be interested in security. That’s cool. Empower them. But don’t make them do things that they’re not good at or don’t want to do. And that kind of led to us…I don’t know if it’s fair to say we coined the term, but we build dev ops teams. We don’t really try to hire dev ops engineers.
We hire engineers with diverse skill sets and then we combine them into teams which are able to deliver functionality and business value. We found that this was much more useful and led to everybody being happier. So that’s always good. Cool. So are there any questions? No questions? Oh no. Surely you can’t all agree with us. We said some crazy stuff. All right, there we go. What’s up?
Question:
Joey: Yeah, so the question was do we give pagers to devs, and do we encourage our devs to take primary on-call responsibility?
Question:
Joey: Yeah, so we’re still trying to find the right compromise. Initially, we went very hardcore. All the developers carried pagers all the time. And what we found was that rapidly led to miserable developers. And instead, what we’ve tried to do is bring the pager escalation up a level. So there’s the two extremes. There’s the operations team doing all the pages. That’s bad. We had that seven years ago. That was bad.
There’s the other extreme where the two people who understand the service are on call for it, and they’re on call every week, and they’re miserable. And we found that compromising somewhere in between, where we have reasonable-size pager rotations but they’re still closer to the code, tends to work well. But obviously, that will change depending on your organizational structure. So for example, if you’ve managed to hire a ton of people who work dev and ops, then that’s a different tradeoff, or if you have a lot of devs, that’s a different tradeoff.
John: Yeah, I’d add that the ops team is still responsible for the on-call for our monolith, Yelp Main. That hasn’t really changed because they were always doing that anyway. But certainly we’ve seen ops responsibilities pushed out to the individual teams that own the services, so they take on additional duties.
Joey: I think there’s a question.
Question:
Joey: So I think that we…
John: Repeat the question.
Joey: Oh yeah, sorry. So the question was how fundamental is something like PaaSTA or other kinds of shared deployment infrastructure, monitoring, alerting, and such? Do you want to take it?
John: Yeah, so initially we didn’t have a PaaS, and once we got the initial problems sorted out, it was quite easy to deploy services. But then we hit the problems of scale, and we found that we were manually allocating services to machines. This was great until a machine failed and then suddenly, on a Friday afternoon, everyone would be running around trying to move the services between machines, contacting individual owners. And so that really did become a problem at scale.
So I would say initially you don’t need fancy PaaS infrastructure, but as soon as you have more than, I don’t know, it varies, 10, 20, 30 services, then you start hitting those problems.
Joey: I mean, if you can offload that to a provider like if your company is all in AWS, use their products. They’re pretty good I would say.
Austin: So at this point, do you have a standard deployment process with a single group of people who own that or how do you deal with that?
Joey: I’d say we have many standard deployment processes.
John: I mean, we are converging on [inaudible]. A lot of our critical services are either on it or are moving to it. But there is always a long tail.
Joey: The service that nobody knows about that’s existed for 12 years. And we don’t know what it does, whether it’s important or not. And that one doesn’t get ported.
John: It takes time. It takes a long time to move everything over, but certainly there is this migration.
Joey: And I think we still maintain a pretty big separation between stateless and stateful services. At least we haven’t figured out a technology that does both well.
Question:
John: The question is quality engineer, what is that? I think…
Joey: Sorry, so on the search team there are kind of different segregations of responsibility. A quality engineer in that role has to do with the quality of search results. They tend to be machine learning engineers or maybe even data scientists, as opposed to a hardcore infra engineer who would SSH into the box and strace it.
John: It’s not a ranking amongst engineers.
Joey: It’s not like high-quality, low-quality. It’s like quality, infra.
John: Other questions?
Question:
John: So the question is we mentioned spinning up new services, so maybe append-only services as opposed to putting that functionality into existing services. Who is in control of that process? I think it’s fair to say we don’t actually have a single body that is in charge of that process. We certainly encourage a lot of communication between our tech leads and our engineers, but it is rare that anyone can say, “No, don’t do that.” It’s rather just trying to educate everyone into watching for anti-patterns, and just trying to get people to kind of understand what the tradeoffs are.
Joey: And I think that principles doc, for us, has been pretty huge because when someone wants to make a service you say, “Have you read this doc?” And they go, “No,” and then they read it and they go, “Oh, maybe this shouldn’t be a service.” And then we go, “Yeah, it shouldn’t be a service,” or, “Yeah, it should be a service.” It goes both ways.
John: At the back.
Question:
John: So the question is, so we have the monolith and then we have all of these new fancy services with new fancy deployment mechanisms, metrics, monitoring. What happens to developers who are still working on the monolith? Are they left out in the cold with all our old technology? Do you want to take this, Joey?
Joey: Yeah, sure. So both yes and no. So there are certainly some things which just don’t apply to the monolith, so like our service metric system doesn’t apply to the monolith. But I would say a majority of our developers still do a lot of work in the monolith. So we actually have entire teams that are kind of dedicated to making sure that the monolith experience doesn’t suffer at the expense of the services. But obviously, there are some things that don’t apply.
And I think that that has encouraged, to a certain extent, people to split things out from the monolith, kind of organically. I think originally we were like, “All right, we’re going to carve up the monolith,” and about six months into that we were like, “Oh God, this isn’t going to work. The teams don’t want to do this.” But over time, people have organically been like, “You know, that PaaSTA thing, it’s pretty cool. If only we could take our one endpoint that we care about from the monolith and put it in that system,” and they do it naturally.
John: I would add yes, but when new features come along a frequent discussion is, “Can we actually put this out in a service?” It is much less common for a developer to look at the monolith and go, “Hey, I want to break this apart.” The other thing I’d add is that we’re working right now on treating our monolith as a regular service, albeit a very big service. And so hopefully, at some point in the future, through deployment and some of the monitoring, it will look exactly like any other service that we’re running.
Austin: So is that your end goal with the monolith or is our end goal with the monolith to have it go away or be much, much smaller or something along those lines? What do you aspire to here?
John: What do we aspire? We aspire to shipping code.
Austin: Excellent.
John: I’m avoiding the question. I think our current goal is to very much make the monolith look like a big service. I don’t think it is one of our current goals to get rid of the monolith because we’re too busy shipping code and adding features, basically.
Joey: Yeah, I think that if we’re still around 20, 30 years from now, that’s the eventual goal, is to have more manageable components and modules. All right, I think there was a question in the back?
Question:
Joey: Okay, awesome.
Austin: I think that’s it. Cool.
John: Thank you very much.