This talk presents microservices as a development methodology for systems, rather than an architectural methodology, and breaks microservices down into three perspectives: people, process, and technology.
Austin: Coming up next is Rafael Schloming from Datawire. Before he joined Datawire, one of the things he did was serve as a spec author of AMQP, so if you've ever run a credit card transaction, his code has somehow not gotten in the way of that, which is cool. So without further ado, take it away Rafael.
Rafael Schloming:
Hey. All right. Thank you.
Thank you Austin. To set up the talk I'm about to give, I'd like to share a little bit about the history of Datawire and tell you a little more about where I started.
Datawire was founded in about 2014 and we knew we wanted to focus on microservices, and as Austin said, I had lots of distributed systems experience. I had participated in actually every version of the AMQP specification, and I had worked on and was one of the founding members of the Qpid project, so I wasn't just defining the spec but also implementing it and helping people use application-level protocols to build distributed systems for years.
So back when we started this in 2014, I thought I was all good to go with microservices because it's just more distributed systems, right? And you know, the distributed systems are the hard part. But now, three years later, looking back, I realize that even though I didn't know it, I was pretty much starting from zero with microservices. So what I'm going to talk about is what I've learned over those three years, what I wish I could tell myself if I could go back in time, and what I hope will be useful to people coming in and trying to make sense of all the information that's out there about microservices.
So this is a question I get asked a lot, "What exactly is a microservice?", and it always bugs me, because I don't feel like there's a great answer. I can kind of explain what microservices are, but you know, it takes a long time. And if you Google, you can see this. You look at the Wikipedia definition, well, at least as it was current a few days ago: the first paragraph, and even the second, even the detailed description, doesn't have a whole lot that's really super useful. It literally says there's no industry consensus on what microservices really are, and then it has other things like, an implementation approach for SOA, processes that communicate with each other to fulfill a goal, which I found super useful ... And, you know, things that maybe aren't necessarily super accurate, like the idea that just using services will naturally enforce a modular structure.
There's a lot of other stuff if you Google, but a lot of it's really voluminous. There are tons and tons of essays, and some of them are really good ones actually. Some of them are not so great, and some of them are these ugly stories of how things can go horribly wrong. If you're trying to understand what microservices are when you're coming into this, it's hard to know how to sift through that and figure out, okay, what's relevant to me?
So as I go through sort of the story of what we learned over the past few years I'm going to really put things into three different buckets: the technology, the process and the people. This is something I found really useful in helping to sift through and figure out what is actually important to pay attention to in any given situation. And I'm going to talk about what we learned really from three sources: what we can learn from all the technology that all the experts have created, what we learned ourselves sort of bootstrapping a microservices system from scratch, and what we learned talking to lots and lots of different people at various stages of migrating over to microservices from various different origin points.
And just to give you an idea of what our starting point was three years ago for what a microservice is and what microservices are: it was very technically focused. It said, okay, it's an application that's composed of a network of small services, and a service in that network, that's a microservice. It's not really a lot more than is in the word itself. And there was stuff about forcing better abstractions and things like that, but I was kind of skeptical about that part of [inaudible 00:05:44] because I don't really believe that tools create good abstractions so much as a long series of less bad abstractions creates better abstractions. And our view on the process and the people was pretty much nonexistent.
So we started learning as much as we could from all the experts who had succeeded and were running microservices at scale. We read just about every first-hand story we could find. We went to any conferences that had info on this, and we talked to a lot of people. We established relationships with people at companies that were operating microservices at scale, like Netflix and Twitter. And, as Austin mentioned at the beginning of the day, that was really the impetus for starting the first summit: there was so much valuable stuff to share about how to do this, including stuff beyond the technology, and people just didn't have a good format to share it in.
So armed with this little bit of knowledge we started to fill in the picture. The people picture: we said, all right, pretty much everyone doing this has this sort of developer happiness, or tooling, or platform team. It was called different things at different organizations, but it looked to us like the same thing. You had these tooling teams that build the infrastructure, and the service teams, well, they build the features.
The technical picture that emerged was something you've actually seen here today: this idea of a control plane that has things like service discovery, logging, and metrics configuration; then a mesh of services that are all tied into the control plane via smart endpoints; and then a traffic layer where these services talk to each other through a variety of different protocols, depending on the needs of each particular service.
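To make that picture a bit more concrete, here is a minimal sketch, in Python, of what a smart endpoint does: register its node with the control plane, keep the registration alive with heartbeats, and resolve peers from the routing table the control plane publishes. The `ControlPlaneClient` name and its methods are illustrative assumptions, not an actual Datawire API.

```python
import threading
import time


class ControlPlaneClient:
    """Illustrative 'smart endpoint': registers with a control plane,
    heartbeats to stay live, and resolves peers from the routing table."""

    def __init__(self, control_plane, service, address, heartbeat_secs=3):
        self.control_plane = control_plane   # hypothetical control-plane API
        self.service = service
        self.address = address
        self.heartbeat_secs = heartbeat_secs

    def start(self):
        # Announce this node, then keep the registration alive in the background.
        self.control_plane.register(self.service, self.address)
        threading.Thread(target=self._heartbeat_loop, daemon=True).start()

    def _heartbeat_loop(self):
        while True:
            self.control_plane.heartbeat(self.service, self.address)
            time.sleep(self.heartbeat_secs)

    def resolve(self, service):
        # Pick a live node for the target service; a real endpoint would also
        # handle load balancing, retries, and circuit breaking here.
        nodes = self.control_plane.routing_table().get(service, [])
        if not nodes:
            raise LookupError(f"no live nodes for {service}")
        return nodes[0]
```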
So we said, all right, this is making a little more sense now. There's a little more going on in the technical picture than just a network of small services: it's a network of small services that are connected via a control plane and a traffic layer. And in the people picture you've got this platform team and the service teams. That's when we said, all right, well, it's a lot of work to build one of these control planes, so let's make it easier for people to migrate and build one that works out of the box, offered as a service.
That's when we started bootstrapping our own system, and we started out with about five engineers to build this thing. We thought of it as a service that ingested interesting application-level events. So whenever a service node comes up you get a start event; stop events; heartbeat events to let you know that service nodes are still alive and [inaudible 00:08:51] ... thank you ... and, you know, log messages, metrics, [inaudible 00:08:58], anything interesting, right, ingest all these things. Store them in whatever appropriate piece of infrastructure there was, a service registry or log store depending on the type of data. Then transform these into derived data stores that provide a value-added view. In this case we wanted to provide a realtime view of the routing table and service health, and historic views of request traces, and lots of other nice things.
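A rough sketch of that ingest-and-derive shape, assuming a very simplified event model (the event kinds come from the talk; the class and store names are hypothetical):

```python
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    """Application-level events the platform ingests: start, stop,
    heartbeat, log, metric."""
    kind: str          # "start" | "stop" | "heartbeat" | "log" | "metric"
    service: str
    node: str
    payload: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


def ingest(event, registry, log_store, metric_store):
    # Route each event to the appropriate backing store ...
    if event.kind in ("start", "stop", "heartbeat"):
        registry.update(event)          # feeds the realtime routing/health view
    elif event.kind == "log":
        log_store.append(event)         # feeds search and request traces
    elif event.kind == "metric":
        metric_store.append(event)      # feeds historical dashboards
    # ... derived, value-added views (routing tables, health, traces)
    # are then computed off these stores.
```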
So when we were building microservices infrastructure, we were actually building the same kind of application that most microservices applications are targeted at: the same sort of data processing pipeline where you ingest lots of interesting events, store them in some source of truth, create lots of derived information, put it in value-added stores, and present it to audiences. Hopefully you can make a lot of money along the way. So despite the fact that we were trying to build infrastructure for microservices rather than trying to use microservices, a lot of our experiences were the same, because the shape of the problem we were solving was very much the same. This kind of shaped our development experience too.
For version one we started with discovery, because that was the core thing that we thought everyone needed. And we thought about the requirements up front, in a lot of the terms that you see here. We wanted something that was really highly available. We didn't put a huge emphasis on throughput, because we knew there wouldn't necessarily be lots of arrivals and departures of nodes, but we did want it to be low latency so we could present that realtime view of what nodes are actually out there running. And we really wanted low operational complexity. Actually the first talk, where Matt was talking about the [inaudible 00:11:23] nature of discovery being eventually consistent, was great. That was great. That was something we saw as well. We saw all these people building these discovery systems on top of fully consistent stores, and that kind of confused us too.
So we said, all right, we're going to make it really low operational [inaudible 00:11:45], we're going to make it eventually consistent, and we actually went further than that and said we can make it stateless too, because it's really just presenting a cached view of everything that's actually out there. We also wanted it to survive a complete restart and be capable of handling spikes. We really wanted it to be rock solid, because it is something your infrastructure depends on.
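As an illustration of that design, here is a minimal sketch of a discovery service that is effectively stateless: the routing table is just a cache of recent heartbeats, so an instance can be restarted and rebuild its view from incoming traffic. The class and parameter names are invented for this sketch.

```python
import time


class DiscoveryCache:
    """Eventually consistent, effectively stateless discovery: the routing
    table is just a cache of recent heartbeats and can be rebuilt from
    scratch after a restart."""

    def __init__(self, ttl_secs=10):
        self.ttl_secs = ttl_secs
        self.nodes = {}   # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address):
        self.nodes[(service, address)] = time.time()

    def routing_table(self):
        # Nodes that stop heartbeating simply age out of the view.
        cutoff = time.time() - self.ttl_secs
        table = {}
        for (service, address), seen in self.nodes.items():
            if seen >= cutoff:
                table.setdefault(service, []).append(address)
        return table
```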
So this drove our initial choices, and we chose Vert.x and Hazelcast to implement the service. We used an async protocol over WebSockets to talk to our smart clients. And because we were offering this as a service we used Auth0, because we didn't want to write that stuff ourselves, and we had a Python shim around it.
So we started out with roughly two services at the beginning, and things went pretty fast at the start. Then for version two we added tracing, and we knew this thing had some different requirements. It had to be high throughput; high-ish latency was okay; but we knew this thing should never actually impact the application, so we made some more reasonable choices. Vert.x and Hazelcast were fine again, but we stored only a transient buffer of log messages, and we had this smart circular buffer so the clients were never impacted if the service fell behind. So we were at about three services here.
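The circular-buffer idea is essentially a bounded, drop-oldest queue on the client side, so a slow tracing service costs you old trace records rather than application latency. A minimal sketch of that behavior (names are illustrative):

```python
from collections import deque


class TraceBuffer:
    """Bounded, drop-oldest buffer: if the tracing service falls behind,
    old trace records are discarded instead of blocking the application."""

    def __init__(self, max_records=10_000):
        # A deque with maxlen silently evicts the oldest entry when full.
        self.records = deque(maxlen=max_records)

    def record(self, trace):
        self.records.append(trace)       # never blocks the caller

    def drain(self, batch_size=500):
        # Called by a background flusher that ships batches to the
        # tracing service whenever it can keep up.
        batch = []
        while self.records and len(batch) < batch_size:
            batch.append(self.records.popleft())
        return batch
```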
Then for version three, we showed this thing to people, people liked the proof of concept, so we needed to add persistence, because that's the obvious thing you want next. We wanted to keep history, provide full-text search, and do some filtering and sorting. So we threw Elasticsearch into the mix and built a query service around it. And we had a total of about four services at that point.
This is where the first hint of pain showed up. This was really when we stopped operating in an append-only mode and started actually changing things. We had to reroute some data pathways that were already there on a running system. This touched multiple services, we had coupled changes, and we didn't have a great tooling base. We had a really poor local dev experience, so firing up our fabric meant wiring everything together manually. The different services had inconsistent configurations, and we had a slow deployment pipeline on top of that.
So we ended up with a whole lot of bunched-up changes at this point, and the whole thing took a lot longer than we wanted it to. The result was a big scary cut-over from the previous version. Not the kind of thing you really want. So we said, all right, well, we're just going to be better about this; we knew we were not following best dev practices, so we were going to soldier on and see what happens.
For the next version we wanted to add some persistence for discovery: we wanted to track errors associated with particular service nodes so we could do some of that reputation-based routing stuff, and store different routing strategies. We threw in Postgres for this; we were a small team, so we didn't want to incur any more operational overhead than we really had to, so we picked RDS. And lo and behold, we ended up with another big cut-over. So at this point we said, enough is enough, and we took a look at why this was happening, and we decided, all right, our tooling is just inadequate for this kind of thing. We need to fix it once and for all.
We had tried various deployment strategies over the course of developing so far. Everything from delivering everything as a Docker image, which actually was good, it certainly solved the inconsistent setup issues, but we still had to wire everything together from scratch in order to actually bootstrap the system. We tried using Kubernetes for everything, but that forced an unpleasant trade-off on us, because we depended on Amazon services, and to get a managed Kubernetes cluster at the time the only option was Google Cloud, so we had an unpleasant choice there. So we backed off on that. We knew we needed something that would meet the development requirements we had. We wanted a fast dev cycle, because we were iteratively building and evolving this thing. We wanted good visibility so that we could see the impact of our changes, and we wanted fast rollback in case we made a mistake. On top of that we wanted the ability to leverage commodity services like Auth0 and RDS, and ElastiCache, which we were using as well.
So we did a big redesign of our deployment system, and we said, all right, we want a complete system definition that contains all the information necessary to bootstrap this thing from scratch in whatever environment we want to run it in, whether it's dev, test, or prod. But it doesn't run the same way in all these different environments, so we need that system definition to be well factored with respect to its environment.
So we had an abstract definition that says, basically, okay, my service needs Postgres and Redis. And we had separate mappings for how to actually bootstrap that abstract service in each different environment. For development we used Minikube locally, and we would bring in Docker images for Postgres and Docker images for Redis so we could get our local dev experience. For test, well, we were a small team and we held off on fancy automated test environments for the moment; we said, our system isn't huge yet, we can just use the same development environment setup, even though in the future we would want test data sets, shared test data sets, and things like that.
For production, of course, we did need to bring in a Kubernetes cluster; we decided that was worth it to manage our stateless services. But everything else we could run in Amazon and take advantage of the commodity services there, so Postgres was in RDS, Redis was in ElastiCache, and that would let us minimize the operational overhead.
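A sketch of what "well factored with respect to its environment" can look like: one abstract definition of what a service needs, plus per-environment mappings saying how each dependency is actually provided. The structure and names below are illustrative, not the actual Datawire tooling:

```python
# Abstract definition: what the service needs, independent of environment.
SYSTEM = {
    "query-service": {"depends_on": ["postgres", "redis"]},
}

# Per-environment mappings: how each dependency is actually provided.
ENVIRONMENTS = {
    "dev": {    # Minikube locally, dependencies as plain Docker images
        "postgres": {"kind": "container", "image": "postgres:9.6"},
        "redis":    {"kind": "container", "image": "redis:3"},
    },
    "test": {   # small team: reuse the dev setup for now
        "postgres": {"kind": "container", "image": "postgres:9.6"},
        "redis":    {"kind": "container", "image": "redis:3"},
    },
    "prod": {   # stateless services on Kubernetes, state in managed services
        "postgres": {"kind": "managed", "provider": "aws-rds"},
        "redis":    {"kind": "managed", "provider": "aws-elasticache"},
    },
}


def resolve(service, environment):
    """Expand a service's abstract dependencies for a concrete environment."""
    deps = SYSTEM[service]["depends_on"]
    return {dep: ENVIRONMENTS[environment][dep] for dep in deps}
```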
So we built tooling to cater to this and give us that fast feedback cycle for dev, repeatable environments for test, and quick and safe updates and rollbacks for production. It mostly boiled down to some scripting around Kubernetes, because that does a lot of the hard work there.
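Much of that scripting can be a thin wrapper over standard kubectl commands for applying a change, waiting for the rollout, and rolling back quickly if it misbehaves. A hedged sketch (the wrapper functions are hypothetical; the kubectl subcommands are standard):

```python
import subprocess


def deploy(manifest_path):
    # Apply the rendered manifests; Kubernetes rolls the change out incrementally.
    subprocess.run(["kubectl", "apply", "-f", manifest_path], check=True)


def wait_until_ready(deployment, timeout="120s"):
    # Block until the rollout finishes, or fail fast if it doesn't.
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}",
         f"--timeout={timeout}"],
        check=True,
    )


def rollback(deployment):
    # Fast, safe rollback to the previous revision when a change misbehaves.
    subprocess.run(["kubectl", "rollout", "undo", f"deployment/{deployment}"],
                   check=True)
```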
One of the pain points we had before was the slowness of our pipeline. That made it really difficult to debug any sort of configuration differences or changes. So we really wanted the tooling to help maintain parity between the environments. It was really frustrating to make a one-line change to something that works completely fine in your dev environment and then wait twenty minutes to see if it actually fixes anything when it hits the staging environment.
You might wonder at this point, well, didn't we just figure out DevOps again the hard way? That's probably true ... except I think there's something a little more going on. We knew we weren't paying enough attention to tooling along the way, but when we thought about DevOps, and about a lot of the DevOps advice, it was all really presented in an organizational context, as a solution for organizational problems. You know the typical DevOps diagram, that Venn diagram with your different departments, and you've got the synergy area in between them all. But we all sat in the same room; we were pretty much forced to be on the same team by nature of our size. One thing we did have in common with a lot of the DevOps thinking was that we were actually thinking about operational factors from day one. We were already this cross-functional team. We were thinking about throughput, latency, availability, and building a service, really, not a server. And because we were doing this work of trying to keep the system running while trying to evolve it, this forced us to follow a really incremental process, and our tooling for this process was really inadequate. When we thought about that process, it helped us figure out the tooling.
This helped fill in part of the missing picture for us. That quote from earlier about people not understanding the process really hit home. People look at the tools and just try to apply the same tools that big companies like Netflix use, and without understanding the process that can be pretty dangerous. So it was helpful for us to look at the process in terms of architecture versus development. Systems, the shape of systems in particular, have traditionally been architected, and that's a really particular kind of process. You generally do lots of up-front thinking, you have a really slow feedback cycle, and you try to keep things super simple and general so they don't have to change, because you don't know what the impact of your changes will be.
This is the classic "distributed systems are hard" kind of thing. Small changes to the shape of the system can cause massive ripple effects.
Development on the other hand is a much more incremental process. You have all these frequent small changes and you have a really quick feedback cycle. You measure the impact in every step. This is because you're working on a much more complex system in some way. With architecture the point is to keep things simple, the shape of the system simple, so you understand all the failure modes. With development, well, it's more like it's too big to fit in my head so I'm going to make these really small changes and make sure at every step I haven't broken anything so I'm going to automate checks for everything to make sure it stays working.
We felt like we were actually doing this sort of systems development, and we wanted to use this much more developmental process. This was an aha moment for us. It felt like we could understand something more here, because it wasn't that we didn't understand the process ... we did actually understand the process. We just didn't know that instead of applying that process to a code base, we were actually trying to apply that process to a running system. From that idea you can figure out more of what you need in terms of tooling and what's important to you. When this process is applied to code bases, you have lots of tooling to give you rapid feedback: your compilers and your IDEs will tell you if you made syntax errors right away, you've got incremental builds, you've got test suites, and you've got tooling for good visibility, like printf debug statements at a minimum, and then fancy logging and debuggers and profilers, to give you insight into the logic of your program.
But when you go to systems, the key characteristics of a system go beyond just logic and correctness. Part of a service is the fact that it is available. You care about the fact that it is running. You need the performance of that running system to be within whatever tolerances you've specified. That's a critical cross-cutting feature that every single service in a microservices application has to provide. And tests don't cut it anymore for measuring your actual impact on the throughput, availability, and latency of your service as a whole.
So we said, all right, we need to update the dev cycle. The tests that assess the impact on correctness are not good enough for this system-level impact. So instead of the dev cycle going from build to test to deploy, we need to add a way of assessing the impact on the system. The first level of that is to actually measure it, and this is something the Google SRE book talks about with these service level objectives; it's a great read. Of all the statistics out there, knowing what to focus on and what to bring to the attention of the individual service authors is actually really helpful. It boils down to the factors that affect overall system stability, and that is throughput, latency, and availability. The other really valuable point there is, don't think about availability as a binary on or off; it's actually an error rate, and you can never achieve 100%. Those are the three things we're trying to bring visibility into in the dev cycle.
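As a small worked example of treating availability as an error rate rather than a binary, here is a sketch that reduces a window of request records to the three numbers mentioned: throughput, latency, and availability. The function and field names are invented for illustration:

```python
def slo_snapshot(requests, window_secs):
    """requests: list of (latency_secs, succeeded) pairs for one time window.
    Returns throughput, approximate p99 latency, and availability as a
    success rate -- never a binary up/down."""
    if not requests:
        return {"throughput_rps": 0.0, "p99_latency_secs": None, "availability": None}

    latencies = sorted(latency for latency, _ in requests)
    successes = sum(1 for _, ok in requests if ok)
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))

    return {
        "throughput_rps": len(requests) / window_secs,
        "p99_latency_secs": latencies[p99_index],
        "availability": successes / len(requests),   # e.g. 0.999, never a hard 1.0
    }
```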
And with an understanding of the process, when we looked back at all the tech the experts were using, things like canary testing, circuit breakers, dark launching, all the tracing and metrics and the deployment tooling, it all fit into the picture much better. These are all ways to enable that dev cycle for running systems: making small, frequent changes, measuring the impact of those changes on the running system, and providing good visibility so you can tell when something goes wrong.
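For instance, a circuit breaker is a small piece of that scaffolding: it watches the failure rate of calls to a dependency and fails fast when the dependency looks unhealthy, limiting the blast radius of a bad change. A minimal sketch, not any particular library's API:

```python
import time


class CircuitBreaker:
    """Open the circuit after too many consecutive failures; after a cool-off,
    let one trial call through to see whether the dependency has recovered."""

    def __init__(self, max_failures=5, reset_secs=30):
        self.max_failures = max_failures
        self.reset_secs = reset_secs
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_secs:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0                  # success closes the circuit again
        return result
```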
This helped us fill in the picture a bit more. We had the technical picture of, okay it's a network of small services, but it's more than that. All of that control plane, it actually has a purpose. It's not just a tool, it's a tool that we now know how to use. It's the scaffolding to safely enable these small frequent changes.
The process, it's service oriented development. Thinking about it as an architecture can get a little misleading. We were happy with this and we felt we learned a lot about the process as you would expect from doing it ourselves. But because we were a small team we really didn't learn more about the people picture until we started talking to people who were trying to migrate.
The migration perspective helped us fill in things even more. We talked to people at a variety of stages: everywhere from your typical monolith, a Django or Rails monolith, to the mothership monolith plus a bunch of little duckling services. There were the small SOA-ish flocks of services, and a whole lot of in-between things, something that was maybe 30 to 50 services, somewhere in the middle of the migration to full-fledged microservices. We noticed that some of these were moving really slowly, taking months, even a year, to create one microservice. Others were moving much faster. So we asked, what's the difference here?
One big theme seemed to be how people were thinking about the problem. Different people had different starting points. Some people would look at this as a technical problem. Oh we need to pick the perfect tech stack for our entire organization to adopt. That's actually really slow because you have lots of organizational friction around trying to converge on the perfect tech stack.
There were other companies that thought of this as a sort of re-platforming or refactoring exercise: take the monolith and break it down into a lot of services. They had lots of questions about how to do that; they looked at it as a very architectural thing. That was also slow; it had lots of organizational and orchestration friction associated with it. But creating a relatively autonomous team to tackle a problem in the form of a service, that was fast. You just had the people to apply to the problem. And you could do that over and over again and create lots of services quickly.
That helped us understand why different companies had different levels of success getting off the ground. There was also this growing pains thing. There seemed to be a sticking point, when you get more and more services, between stability and progress. It's great at first: you start out, you grow quickly, you add lots of services, and then at some point you run into these stability issues and you have to slow down your progress.
In order to understand what was going on there, we found that it really helped to think of microservices in terms of dividing up the work, the work of building a cloud application. This is a very people-centric view. The work of building a cloud application has two aspects: you need to build the features, the dev, and you need to keep the system running, the ops. And what DevOps says is really an observation about the nature of this work, and we had made that same observation even as a small team without any of the organizational aspects that come with DevOps: you can't usefully divide the work up along these lines. We saw it in the process space. Because we were small, we were dividing the work up over time, and a process that took the work of building the features and running the system and broke that work up over time didn't work for us. We needed to work in these small incremental steps in order to keep the system running while evolving it. DevOps is the organizational observation associated with that.
The reason behind this is the basic fact that, if you're trying to do this, new features are the biggest source of instability, because every new feature comes with bugs. If you have separate roles for dev and ops, you really have a misalignment: you have part of your organization trying to keep things stable and running to deliver functionality to end users, and then you have the other part of the organization trying to deliver features, accidentally delivering the bugs with them, and making life harder for the ops part of your org.
So you can't really divide the work along these lines without creating these misaligned incentives, but a big part of the work is still actually keeping things running. Microservices is the flip side of this coin. If DevOps says, you can't usefully split up work across the dev/ops boundary, microservices says, well, actually you don't have to; there are other ways to split up the work. You can split up the work by breaking your big application into smaller ones. And I think one of the things that gets lost when you think about microservices as an architecture rather than as a strategy for dividing up the work is that sometimes you forget to divide up the operational responsibility as you divide up the application. That is a fundamental part of delivering the application as a service.
So you need to figure out a way to do that in order to align the incentives in your organization. That really explains that sticking point. As these organizations got more and more services, and generally grew their teams along with that, the roles and responsibilities of the teams could get muddied. You can have cross-service dependencies, and you can have all of the hard distributed systems problems come in, and if you don't have a way to focus people on fixing those, people won't really learn the tools to actually do that. So this filled in the third part of the picture for us. Microservices, from a people perspective, is really about dividing up the work of building a cloud application. By using service teams to deliver features to the user, rather than to an ops team, and having a platform team support all the service teams, you can align incentives so that you can actually have stability and progress at the same time.
This is something we learned the hard way, by starting with the tech, reverse engineering the process and the people, and making a whole lot of mistakes along the way. Hopefully you can learn from them. But if you want to do things the easy way, I'd recommend understanding the principles of the people and the process, and using those as a framework to pick the technology that fits and to learn from other people's mistakes. To help with that I've created this microservices cheat sheet, which I thought was going to be way too big to be readable, but this is a big screen so maybe it is. It's in the deck if you want to look; it's pretty much just a summary of what we learned, in matrix form, because I'm an engineer and I think if you can quantify people things it's easier to understand. The way you read it is: the top row is the what, the middle row is the why, the bottom row is the how. Then in the columns you have the domains.
My hope is that with this framework, it's a lot easier to understand which parts of all the information out there about microservices you need for your situation.
So I'm always looking to improve this and evolve it, and so if this framework fits with your experiences I'd love to hear that and if your experiences are different I'd love to hear that as well. And if there's missing stuff that can be added I'd like to hear that as well, so please find me and talk to me.
And I'm happy to answer any questions now.
Flynn:
So at the beginning of your talk, and this is always a little dicey because we work together, so I have to be a lot more careful what question I ask him ... but at the beginning of your talk, Rafael, you said something to the tune of: looking back over this with an eye toward what you would have told yourself when you were starting out, if you'd known then what you know now.
So, how would you boil all this down to the one message to young Rafael going into microservices? In the meantime while he's answering that who has a question?
Rafael Schloming:
I think the big thing I would say is start with the people and think about how to divide up the work. That's the thing. Coming at it from the technical perspective, I was very biased toward looking at and understanding all the technology, but I think the people part of the picture is really more important in a lot of respects if you want to succeed.
Audience:
About how much time did you spend on research, on looking for technologies, on basic background work that really never saw the light of day of production, versus how much time did you actually spend writing code that made it all the way through?
Rafael Schloming:
It's hard to quantify that because it was all kind of fragmented. I kind of told the story in phases because that's how it's easy to tell, but it was all kind of jumbled up.
You kind of ... you need to solve a problem, you Google, and if there are easy answers you go where they take you, and if there aren't you don't, and that's one of the reasons you get this sort of fragmentation and this very fragmented view.
Audience:
Hi, do you see the effects of Conway's Law on your team size?
Rafael Schloming:
It's interesting, yeah, you do. There's definitely an impact there, but one of the things about going through and trying to fit all this information into this picture is that it made me look at Conway's Law a little differently. It's not just the people. Conway's Law is often parroted as, the organization drives the shape of the technology. But it happens the other way too. There's the shape of the problem: if you're building one of these data processing pipelines, there are laws of physics, and sometimes you need to distribute your business logic throughout the data processing pipeline. I think one of the really interesting things about developing software today is that there are more and more data-driven systems being built every day, so you see the effect of that. It's like the whole industry has turned on its side. So you see the shape of the problem coming up and influencing the shape of organizations.