Christian Posta, principal architect at Red Hat, discusses how to manage your data within a microservices architecture at the 2017 Microservices.com Practitioner Summit.
Austin: This is Christian Posta from Red Hat. He’s going to talk about managing data inside of your microservices. Take it away, man.
Christian Posta: Thank you, Austin. Can you guys hear me okay out there? Perfect.
All right, so I’m going to be talking about a topic that is way bigger than what I’m going to be able to fit in this 40-minute slot, and certainly more expansive than the talk I’ve prepared. If you’re curious about the full scope, the complete set of slides is available. I had to trim them significantly and I’m hoping to pull off a live demo for you all. I’m Christian Posta, a principal architect at Red Hat, and author of “Microservices for Java Developers.” In the realm of open-source, I’m an active contributor and committer, notably on Apache ActiveMQ, and I sit on its board.
In my scarce spare time, I contribute to other projects too, such as Fabric8.io, a microservices tooling community on Kubernetes, as well as Apache Kafka and Debezium.io, which I’ll delve into later, hopefully with a demo. I’ve had tenure at a renowned SOA internet unicorn well before the surge of the devops, cloud, and microservices trend. Currently, my role involves guiding our enterprise customers through the maze of microservices architecture. It’s about imparting the knowledge I gained at the forefront of the tech revolution to traditional enterprises grappling with modernization. Let’s see by a show of hands—how many of you hail from traditional enterprise backgrounds rather than internet companies or startups? Alright, a few of you. My insights today draw from the very real challenges that such enterprises face when embracing the microservices paradigm.
Technology’s hot. As developers we love new technology; we see new technology all the time. Containers and cloud services and so on are becoming very popular, and there’s this mythology around them, but when we as developers start to approach building these systems of services, which at one point we called SOA and now we call microservices, there’s still application-level stuff that we have to contend with. Building cloud-native applications is not just about the infrastructure they run on. Furthermore, when talking to developers and architects, I run into the scenario where our customers believe that if they just use the same technology that the Internet companies use, they will then be doing what the Internet companies are doing, and that’s a very false premise.
The Internet companies evolved organically to some of these architectures, where the architecture at, let’s say, Amazon might share some commonalities, in that it can scale, but it’s a very different implementation from what, for example, Google or Twitter might have. The principles and the methodologies are very similar; they’re coming from a common background. I think as enterprise IT starts to approach this new world, there’s a bit of a heritage incompatibility between the way those companies originally evolved their principles and the way enterprises work. I think that can be summarized in this line here: microservices and cloud and agile and all this stuff is really about optimizing; it’s about changing the focus and the optimization of the way the company has operated in the past and focusing on speed.
I don’t mean speed as in performance speed or efficient use of the CPU; I mean the ability to make changes to the system. To be able to do deployments, get new changes out, experiment, try new things in production, gather that feedback, generate those feedback loops, and understand where to ask the next set of questions in terms of how they run their experiments. This is very different from how traditional enterprise companies operate. IT has always been seen as a cost center. How do we minimize cost? How do we reduce the cost of operating the business with this technology stuff, which was really just: how do we automate paper processes? How do we stand up email servers and ERPs and all that stuff?
Now, technology is becoming the business. The famous quote, “software is eating the world”—it has already happened. That is so five years ago. I mean, the Ubers of the world, the Amazons of the world, the Lyfts of the world, they’re driven by technology and they’re providing service through technology, and enterprises need to quickly grapple with that and recognize it, otherwise they risk some of the disruption that we see. How do you change a system that was not designed from the beginning, or never evolved, to be able to deal with this level of change and instrumentation and learning? How do we change it to go fast? There are a lot of challenges in doing that. I’m going to narrow the microscope down to a set of those challenges, and it comes down, not surprisingly, to: how do you manage, or in some cases reduce, your dependencies between services?
As developers and architects we can think about our dependencies between services, but we also should be thinking about our dependencies between teams, how we work together, and how we have to synchronize with all these other teams just to get a single change done. Dependencies are a very important part of the equation; I could give a whole talk on that, but we’re going to talk about data. Data is a very big dependency in any system, but before we start looking at what the hardest parts of data are, we have to give a definition to what data is. Data is really how we conceptualize and understand something in the real world and explain it to another person. In our IT systems, unfortunately, what we’re doing is taking that concept and using an interpreter or a mediator: a computer. We’re going to take this idea, try to put it into a computer, and use that to explain it to other people.
The problem is that as people, when we’re talking about data and trying to explain concepts and ideas, we can naturally disambiguate, based on the context of what we’re talking about, what those terms really mean. In a computer system it’s not that simple. I wrote a book called Microservices for Java Developers. If you’re trying to model what that looks like in a system, what is one thing? What is a book? Well, I know it’s got a cover and some pages. I wrote a book, but there might be 100 different copies of the book. Is each one of those a book? Sometimes books get so big that they actually have to be split into smaller physical chunks. Is each one of those a book, or is the whole collection a book, and on and on and on. The very first step is explaining what the data is, and then, in our technology systems, there will be different understandings, different semantics, about what the data is and how we implement it.
For example, carrying that silly little example forward: a book checkout system for an online retailer that sells books may absolutely care about each physical copy of the book, because they’re trying to make money on each physical copy. A title search service might not care about each individual copy; a title, an author, a genre, and those sorts of things might be more important. A recommendation engine might not see a book in those terms at all; it might just see it as a set of loosely related metadata that it uses to model the book. What we see is that the same concept is shared across multiple services, each service looks at things a little bit differently, and disambiguating that, crystallizing the models behind our data, is absolutely the first step. There’s this little community called Domain Driven Design that’s been around for a while that helps with patterns and practices and principles for doing this.
Now, I’ve actually run into people who’ve asked, well, I don’t hear Netflix talking about Domain Driven Design, or I don’t hear Amazon talking about Domain Driven Design, and that’s partially true. That community doesn’t really exist there; the terms and the jargon from the Domain Driven Design community might not exist on the Internet company side, but when you look at what they’re doing, there are similarities between the concepts. I think more importantly, these big enterprises that have been around for decades, and drag legacy and process around with them, are a lot more complex in their domain than, let’s say, some of the Internet companies. The Twitters and LinkedIns and so on had to solve for a massive number of users and for scale.
Data scale and so on, but their domain … Posting a tweet on Twitter is fairly straightforward. Updating your LinkedIn profile is fairly straightforward. Processing an insurance claim or a security audit at our FSI companies, those are far more complex. I think our enterprise companies are going to have to deal with complexity and scale and infrastructure, but we shouldn’t overlook the complexity that we have innately in our domain. The second part of the story, I think, is that developers … If you follow these patterns of breaking things up into nice bounded contexts and models and so on: developers in enterprises are very familiar with deploying things in a collocated fashion, where everything is bundled up into a single deployment and we try to maintain modularization within the application the best we can.
Sometimes it all just kind of melds together, but down here, where we actually store the data representations, this abstraction is incredibly powerful. We’ve built a lot of assumptions on top of it, and as developers, getting away from it when we’re trying to break things apart is going to be really hard. These relational databases, these traditional databases, have things like normalization and SQL. The Internet companies are doing things at tremendous scale, with databases more optimized for what they’re doing at that scale, but as a developer you just write your data into your database and can go back later and ask questions about that data that you may never have planned for or thought about. You don’t have to think up front about what you’re going to query with traditional SQL as much as you might with some of these scaled-out databases, where you actually have to plan up front for the exact types of queries you’re going to be running.
Things like the ACID properties of the database. Atomicity, where developers are shielded from the impacts of the network and failures and partial failures. For developers it’s so warm and cozy; that’s what the C stands for: cozy. Isolation, dealing with concurrency issues, just making it look and feel like you’re the only application thread in the database. And all this other stuff: yeah, it happens in order, don’t worry about it. All that stuff is shielded from us. Now, I definitely think that if you prematurely try to go down the microservices path, you’re going to end up in a lot of hurt, and I see this with some of the teams at our enterprise customers. If you can, especially if you’re just starting, stick with those abstractions for as long as you can.
Of course, we’re going to evolve, we’re going to need to change, we’re going to hit limitations. We want to look at how we reduce our dependencies, but with respect to data and the database that we’ve relied on and built all kinds of assumptions on, really what we’re saying is: our database designers have spent 40 years implementing the way the database works, and we’re telling them, we got it, we’ll take it from here. That’s a big leap for the developers at our enterprise customers, especially when we go through this litany of checklists and requirements: a microservice has to do this, and it has to be that, and it has to use this technology, and so on. Then we get to the part where it says a microservice owns its own data and has its own database.
That starts to throw people for a loop, and rightfully so, right? They’re not used to that. Fundamentally, what we’re doing when we start to go down this path is building a data-oriented distributed system, and that might sound like, okay, what’s the big problem with that? The nature of distributed systems, a lot of the stuff that Matt was talking about earlier, we can’t ignore that; we have to actually make it a first-class citizen, and it starts with the fact that we’re talking to services over the network. In a collocated system, we just do method dispatch inside the machine, but if we want to communicate with other systems, we’re talking over this asynchronous network, and this asynchronous network has no guarantee of anything, really.
No guarantee of time; there’s no guarantee that the message we send from service A to service B is going to get there in a certain amount of time. These asynchronous networks go through routers that can be backed up, that can crash, that can do all kinds of things. Things that look like failures are really just delays. Building systems in a distributed environment is really hard. Now, here are some of the problems that we end up facing, and this is just a minor subset; like I said, I had to cut back a lot of the slides. We end up bringing the same assumptions and the same mindset that we had in the traditional database world over with us, at least at the enterprise customers, like I said. We run into situations where we may need to update data that is similar and shared across all these services.
We do things like dual writes, or triplicate writes in this case. Maybe we want to update the customer profile, so then we tell the loyalty service, well, the customer profile was updated, or maybe an audit service or a recommendation engine or something else that’s keeping track of customers. Now, when we look at this diagram and we try to do that, the assumption we’re making is that these things will always be available, that we’re always going to be able to talk to those services, when in reality we just said the network is … We can’t make those kinds of assumptions, so then we build in things like, well, if it doesn’t work, then we’ll just roll it back. We’ll do some compensating transaction. There’s nothing wrong with compensating transactions; they can be a way of solving some of these problems.
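To make the failure mode concrete, here is a minimal sketch of the dual-write-with-compensation pattern being described. The service names and interfaces are entirely hypothetical; the point is the partial-failure behavior, not the API.

```java
// Hypothetical collaborators for the profile service.
interface ProfileRepository { void save(String customerId, String profileJson); }
interface LoyaltyService { void customerUpdated(String customerId); void undoCustomerUpdate(String customerId); }
interface RecommendationService { void customerUpdated(String customerId); }

class ProfileUpdater {
    private final ProfileRepository repo;
    private final LoyaltyService loyalty;
    private final RecommendationService recommendations;

    ProfileUpdater(ProfileRepository repo, LoyaltyService loyalty, RecommendationService recommendations) {
        this.repo = repo;
        this.loyalty = loyalty;
        this.recommendations = recommendations;
    }

    void updateProfile(String customerId, String profileJson) {
        repo.save(customerId, profileJson);              // local write commits here
        try {
            loyalty.customerUpdated(customerId);          // remote call over an unreliable network
            recommendations.customerUpdated(customerId);  // if this fails, loyalty has already seen the change
        } catch (RuntimeException e) {
            // "Compensating transaction": try to undo work that downstream services
            // may have already acted on; they effectively read uncommitted data.
            loyalty.undoCustomerUpdate(customerId);
            throw e;
        }
    }
}
```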
Now, this service becomes more complicated, because now we have to keep track of state, and of who responded correctly and where things failed, and if we go down we need to be able to restart and pick up where we left off. We’re kind of building a transaction manager in some ways. Furthermore, if you carry over the assumptions you’re used to from a traditional database, what you’re actually doing here is sending data downstream that could be visible to any other service that might be calling this service. Then we’re rolling it back and saying, never mind, that never happened. In the database world that’s called read uncommitted isolation, and we typically don’t want that. Another problem that you might run into is this: you make a query to a service and it could potentially return an unbounded number of objects, a list of objects that you now have to go and enrich, one by one.
Maybe you call a customer recommendation engine that returns a list; you don’t really know what the size is going to be, and it turns out to be gigantic, and now for each one of those entries you have to go off and call the customer profile service and get enriched data about them. Now you run into a situation where you don’t really know what the load is going to be; it could be unpredictable depending on the use case. You’re asking this one service, hey, can I call you billions of times a second, and they may or may not like that. Maybe we do things like add APIs, maybe we’ll do a bulk API and we’ll add pagination. Each service will do this a little differently, and it’ll look kind of hokey. We’ll even say, well, I just want certain types of data coming back from this bulk API. I’ve actually seen this where they say, fine, just send me a query object, and you end up building a database, and that’s not what we’re trying to do here.
We have an expectation that shared data gets updated together in some way, and we need to be able to deal with failures, because that’s an innate part of our environment, and this whole thing is starting to sound like CAP. Who’s heard of CAP? The CAP theorem, or conjecture, or whatever. Now, the CAP theorem is interesting as a way to get started talking about these problems, but it’s a bit extreme in how it characterizes the trade-offs you’re making. The conjecture says you have to pick between consistency, availability, and partition tolerance, and you can only pick two, but like I said, the networks we’re dealing with are inherently asynchronous; you can’t predict anything about them. You can’t trade off P. P is not a trade-off. You get P. You have to pick C or A in the presence of P, but the C, as defined in the CAP theorem that was eventually proven, is actually the strictest definition of consistency you can get, and the A is the highest level of availability you can get. In reality, there are many more consistency models.
Consistency is a continuum; it’s not an absolute, it’s not binary, and there are many more consistency models that we can explore when we try to model network failures and time and delay into our system. Strict consistency, where we say that if something in our data changes, then it’s immediately visible to everyone; the change is visible immediately, with no arbitrary delay. If I return from a write, I know it was visible to the rest of the system at some point, unless somebody else changed it. You start trying to do that over multiple nodes in a cluster, and they have to coordinate and build up consensus to make sure this is actually true. You start to go down the line and maybe say, well, okay, we do want things in order, but they don’t have to be immediately available. You look at something like sequential consistency or causal consistency, where you say these events are related to each other and caused in some way by each other. For example, in a stream of events, I don’t want to see a comment on a blog post before I’ve actually seen the blog post.
There’s some causal ordering there, or first-in-first-out, or monotonic reads, all the way down to eventual consistency, which is kind of: we could read anything, really, as long as it eventually converges. It’s sort of the lowest of the consistency models. Now, is this useful? In real life there are very, very few examples of strict consistency. In the way businesses operated even before computers, there were very few examples of strict consistency. A lot of business problems actually align better with some of these less consistent models. I can’t go into the details about all of this and how it applies to real life, but Doug Terry did an amazing job; go find his paper explaining consistency through baseball, where he shows that each observer of the game really has different needs for consistency, and that you don’t need to think about it strictly through linearizability or strict consistency at all times.
That is what we’re used to when we’re looking at a traditional database. Maybe we can use some of these relaxed consistency models for exposing our data and interacting with other services that share some of the same data. Here’s a quick example where we try to keep those downstream systems up to date using a relaxed consistency model based on being sequentially consistent. A customer profile service may update the customer profile, and the downstream services might be interested in knowing about that, but instead of trying to build our own transaction manager and coordinate two-phase-commit or three-phase-commit type scenarios, what we say is: we’re going to put this in a log, we’re going to put this in a queue, and they can read it whenever they’re ready, then react to it and store it locally, mung it up, increase a counter, whatever they’re doing with the data.
If we add more services that are interested in that data, then go for it; it’s all right there. But if you squint at this a little bit, what we’ve actually done is take a lot of what the database, what traditional ACID databases, do for us, things like replication, indexes, and materialized views, and make that a first-class citizen of our data services here. We’ve gone off and built this at the application layer, and this is not inconsistent with, I think, how the Internet companies have arrived at some of these solutions. At Yelp, they built an entire ecosystem of tools around this idea. Are there people here from Yelp? One person? They built an entire ecosystem of tools around the idea that we can take changes from a service, and those changes might actually be going into a relational database; in their case they use MySQL. We take the data from MySQL, turn it into an event stream, and publish it.
LinkedIn did the same thing, Zendesk did the same thing, and there are probably others, but these are the three open source ones that I found. Now, at Red Hat we have customers that have this same problem, and we were looking at how to build this, because our customers use MySQL and Oracle and Postgres and Mongo, lots of different databases. How do we build something like this that uses some of the best-of-breed open source technology already out there, is pluggable, and is a little more slimmed down, so that you can pull it into the use cases you have without absorbing an entire ecosystem of tools, and that lets you start doing some of this without making tremendously impactful changes to your existing applications? You can build this sort of thing at the application layer using event sourcing and that kind of thing, but that requires quite a bit of application change.
There’s this project called Debezium.io that we started that helps solve these types of problems. Debezium is a change data capture system. Change data capture, if you’re familiar with it, especially at enterprises, is not a new idea; it’s been around for a while, some of the big database vendors have had it, and it’s quite expensive. What we’re doing is building Debezium with a very pluggable architecture, so we’re able to plug in whatever database you use and bring those database changes into a streaming environment like Apache Kafka. With Debezium we have connectors for these different databases; we can point them at specific databases and their transaction logs, parse those transaction logs, convert them, and stream them to Apache Kafka.
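Pointing a connector at a database comes down to a small piece of configuration. As a rough example, registering the MySQL connector looks something like the following; the property names here follow the 0.x-era MySQL connector documentation, so double-check them against the current docs before relying on them.

```json
{
  "name": "customer-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.whitelist": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
```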
Debezium is a standalone connector, so you don’t have to run it in exactly this scenario; you can embed it in your own applications directly, but we’ve chosen to build on top of Kafka Connect. Have people heard of Kafka Connect? Raise your hand if you’ve heard of Kafka Connect. A couple of people. It’s a framework for getting data into Apache Kafka and getting data out of Kafka: a source into Kafka and a sink out of Kafka. It does things like manage high availability, consumer rebalancing, and offset positions in your source or sink for you, without having to add additional components or additional databases and more operational complexity to the system. Now we can start to look at the problems we described earlier through the lens of: we’ll just make the data available. We’re not going to try to implement all these APIs that try to do the same thing; we’re just going to make the data available and let the consumer decide what they want to do with the data and how they want to interpret it.
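As a rough illustration of what that consumer side can look like, here is a minimal sketch of a downstream service keeping its own local, denormalized copy of customer changes it reads from the change-event stream. The topic name and consumer group are hypothetical; real topic names depend on how the connector is configured, and a real service would persist the view rather than hold it in a map.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CustomerViewUpdater {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "recommendation-service");          // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, String> localView = new HashMap<>();           // stand-in for the service's own store

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dbserver1.inventory.customers")); // hypothetical topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Key is the row's primary key; value is the change event (before/after).
                    // Each consumer decides how to interpret and denormalize it locally.
                    localView.put(record.key(), record.value());
                }
            }
        }
    }
}
```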
Maybe they want to cache it locally and save the, I don’t know how many, billions of calls they would otherwise have to make over the network, solving that N+1 problem. We saw this slide here as an example of using change data capture to build materialized views of the data in the context of other services. Debezium.io, go check it out. I’m going to try to give a demo of it if things cooperate here; let’s see how this demo goes. Anybody going to sacrifice some data back there, Austin? Let’s see how this goes.
This is a live demo, but I’m going to have a little helper thing type it for me. The first thing we’re going to do is start up ZooKeeper, because we’re going to be putting these events into Apache Kafka and Kafka needs ZooKeeper. Everything we run is in Docker containers. It looks like ZooKeeper started up fine. Now we run Apache Kafka, start it up, look at the logs, and we should see it come up. Everything’s good so far. Now we’re going to create a MySQL database, also running in Docker. We’ve collocated a SQL script with this particular database image so that it automatically imports a sample database that we can use for demos like this. Wait for MySQL to come up, and it looks like it came up. Everything’s good. Now we’re going to create a client to the database. This is where my demo comes into the picture.
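For readers following along, the containers being started here are essentially the ones from the Debezium tutorial. Reconstructed from memory, with image versions, ports, and credentials being approximate, the commands look roughly like this:

```sh
# ZooKeeper (Kafka needs it)
docker run -d --name zookeeper -p 2181:2181 debezium/zookeeper

# Kafka, linked to ZooKeeper
docker run -d --name kafka -p 9092:9092 --link zookeeper:zookeeper debezium/kafka

# MySQL pre-loaded with the sample "inventory" database used for demos like this
docker run -d --name mysql -p 3306:3306 \
  -e MYSQL_ROOT_PASSWORD=debezium -e MYSQL_USER=mysqluser -e MYSQL_PASSWORD=mysqlpw \
  debezium/example-mysql

# A MySQL client pointed at that database
docker run -it --rm --name mysqlterm --link mysql mysql:5.7 \
  sh -c 'exec mysql -hmysql -uroot -pdebezium'
```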
What the hell’s going on?
No.
I was expecting the demo to work. Questions, if there’s any questions.
Flynn: Anybody? Any questions? Yep, there we go. All right, hang on just a moment.
Speaker 3: I don’t understand. What were you going to demo?
Christian Posta: Thank you. What I was going to demo is this: I was going to start up a MySQL database and show, using Debezium, how we connect to MySQL, start reading the binlogs, the MySQL transaction logs, and how we consume that, parse it, and write it into Apache Kafka. What it was supposed to do was generate a bunch of entries. Kafka messages are key-value, basically byte-oriented payloads, and the structure that Debezium creates when we parse the MySQL binlog is a JSON structure, where we capture things in both the key and the value of the Kafka message we send. For the key, we usually use the primary key of the table; here’s an example of what that would look like, where we have the schema identifying the primary key and then the primary key value itself.
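An illustrative, simplified key for a row in a customers table might look roughly like the following; the table, field names, and values here are representative sample data, not output captured from the demo.

```json
{
  "schema": {
    "type": "struct",
    "name": "dbserver1.inventory.customers.Key",
    "fields": [ { "field": "id", "type": "int32" } ]
  },
  "payload": { "id": 1004 }
}
```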
What actually gets sent out as the value is the data we’re capturing: the before and after data, and the before and after structure of the data. Because databases can change, we actually read both the DML, the statements that change the data, and the DDL, the statements that change the table definitions, and we keep that representation in memory so that we can explain in a little more detail what the columns are and what the column types are when we publish the change event that happens in the database. Here we have a before and an after, and these are the fields; if columns get added or deleted and so on, we can detect that. Then we see down here the payload: here’s a change event where we just added a row to the database.
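A corresponding value for an insert, trimmed to its essentials, looks roughly like this; the schema section describing the columns is omitted for brevity, and the row data is again representative sample data rather than demo output.

```json
{
  "payload": {
    "before": null,
    "after": {
      "id": 1004,
      "first_name": "Anne",
      "last_name": "Kretchmar",
      "email": "annek@noanswer.org"
    },
    "source": { "name": "dbserver1", "db": "inventory", "table": "customers" },
    "op": "c",
    "ts_ms": 1486500577691
  }
}
```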
The before is empty because this is a brand-new insert into the database. The demo was going to read the MySQL binlogs, publish them to Apache Kafka, and then show certain failure scenarios where we take Debezium down and start screwing with it. The data still gets written to the transaction log on the MySQL side, and when the change data capture system comes back up it knows where it left off and continues processing from there. It does things pretty fast, pretty much instantaneously, but the expectation this is built on is that this is not a strictly consistent system and that changes will be seen eventually.
Flynn: Back here. Back corner. Over here. Question about change data capture against managed databases, something like RDS: I’ve been trying to make this work against Postgres under RDS, and I’m wondering if you have any particular suggestions for environments that don’t allow arbitrary extensions.
Christian Posta: Right, in Postgres. We are working on a connector for Postgres right now. That slide that I put up … We have MySQL and MongoDB right now, and we have a PR for Postgres that we’re still trying to sort out, but there was a feature added to Postgres more recently, I forget the name of it, that I believe allows us to do this without a native C plugin. To your point, you can read the write-ahead logs from Postgres, but that requires a native plugin in the database, and in a cloud environment you can’t really do that. I believe there was something added more recently to Postgres that allows us to do that outside the constraints of having to embed a plugin. We can go look at the Debezium source code for that PR; I’m not writing that PR, so I’m not totally sure how they’re doing it, but I thought that’s what I heard.
Flynn: I think we have time for two more.
Speaker 5: If I were to use AWS, I could tie my events to any database write and trigger Lambda functions, and queue them through SQS or even put them in Kafka. How does this system compare to that? How does it fare better?
Christian Posta: It sounds like it’s very similar. In this world, or in Debezium specifically, we can’t connect today to Postgres instances running in RDS, just because that connector is still coming, but we can connect to RDS on the MySQL side. We can use Debezium to ingest events directly from those MySQL databases. Then you can put them wherever you like. We’ve built Debezium on top of Kafka and Kafka Connect, but like I said on one of those slides, the actual engine that does this is not tied to Kafka; we’re not using any Kafka APIs for the actual database connectors. You can take the connector and run it in your application directly, or run it anywhere; it’s a Java, JVM-based application. It would be very similar to how you might implement some of these things in AWS, without being tied to AWS specifically.
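As an aside for readers, “run the connector in your application directly” refers to Debezium’s embedded engine. The exact API has changed across versions, but in the 0.x era a sketch looked roughly like the following; all configuration values here are placeholders, and the class and method names should be checked against the version you are using.

```java
import java.util.concurrent.Executors;
import io.debezium.config.Configuration;
import io.debezium.embedded.EmbeddedEngine;

public class EmbeddedChangeCapture {
    public static void main(String[] args) {
        Configuration config = Configuration.create()
                .with("name", "customer-connector")
                .with("connector.class", "io.debezium.connector.mysql.MySqlConnector")
                // Offsets and schema history kept in local files instead of Kafka
                .with("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore")
                .with("offset.storage.file.filename", "/tmp/offsets.dat")
                .with("database.history", "io.debezium.relational.history.FileDatabaseHistory")
                .with("database.history.file.filename", "/tmp/dbhistory.dat")
                .with("database.hostname", "localhost")
                .with("database.port", "3306")
                .with("database.user", "debezium")
                .with("database.password", "dbz")
                .with("database.server.id", "85744")
                .with("database.server.name", "embedded-demo")
                .build();

        EmbeddedEngine engine = EmbeddedEngine.create()
                .using(config)
                .notifying(record -> System.out.println(record)) // handle each change event
                .build();

        Executors.newSingleThreadExecutor().execute(engine);     // the engine is a Runnable
    }
}
```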
Flynn: One last question before food.
Speaker 6: I was wondering, when you mentioned Domain Driven Design, is there CQRS in play here, where you’re reading the events and basically generating the read model?
Christian Posta: Yes, this would be a good solution for that. The question is around CQRS, which is basically separating read and write workloads into potentially different data models and different data systems. This sort of solution would help when, say, you have a very complicated write model against a certain database and you want to take advantage of ACID transactions and all that, but your reads are simpler and have different load characteristics. You can use this to publish the data over to the read side of the world, write it into the read databases, and denormalize it however you’d like. It would be a good solution for that.
All right. Well, I appreciate you guys sitting through some of the torture up here, but I’m around all day. I’ll try to go to the happy hour too and if you have any other questions, then come up and ask. Thank you.