As Uber has grown, the architecture that powers this platform has gone through many changes. From its simple roots as a PHP program, Uber has grown into a complex distributed system deployed across multiple datacenters using multiple databases and programming languages. This talk will cover the evolution of Uber's architecture and some of the systems we've built to handle the current scaling challenges.
Matt: Okay, nice. So thanks for having me out, everyone. I want to talk to you about scaling Uber, of course, but as a bit of background I want to demystify some of what I hear a lot of times. I go to a lot of technology conferences, and it seems like everyone’s got it super together and they have all these amazing systems that work really great. And all their problems are solved and they’ve reimplemented Mesos. Everyone’s reimplemented Mesos or something.
Everyone’s got a service discovery framework that they’ve developed. And it’s like all super nailed down and great. And wow. Ours is not. Not at all. Not even close. We struggle with all these same sort of things and I bet you, if you asked the people that have these amazing stories, you’ll of course find that it wasn’t always amazing. They went through terrible periods, and I’m pretty sure that that’s just how it goes. You’ve got to struggle with these things to get them to a good point.
And so please don’t think that we’ve gotten this all figured out. In fact, we’re just barely starting to figure it out now. If anything, it’s just lucky that it’s worked as well as it has. Anyway, so let’s get right to the fun numbers, because everyone likes to see the fun numbers. And these are the official numbers that I’m allowed to share. So the really exciting part about this thing is look at that engineers number. What is going on there?
Seventeen hundred engineers, holy cow. This is amazing. An interesting thing to note about this is that the growth rate is so fast that half of the engineering team has only been here for six months. So that is a weird environment to operate in. And so growth is really the fundamental constant to our whole operation. I’ve got a couple cool videos here that show one year of growth in some cities in China. And you can see, this is from one year in a major Chinese city, it went from nothing to all of the things lit up constantly.
So this is pretty much happening across the Uber organization, and particularly in engineering. You see a video like that and you think, “Wow, you guys got it all together,” but we certainly don’t have that same equivalent on the engineering side because this is a successful product that we have to support. And so we’re hiring engineers and we’re trying to support this incredible growth in the business, but sometimes you just can’t make software good that fast. It takes time.
And so we had to deal with this. I mean deal with, it’s a good problem to have, right? Successful product, successful business. But sometimes, man, there’s only so fast you can dig those software trenches. So it’s been a challenge. So let me just take you through some of the history, some of the major architectural decisions that we made. And just remember, we started out incredibly small, so small that at first we didn’t even write any software. In the earliest days of Uber, it was 100% outsourced.
Now it didn’t seem like it was a technology problem. It just seemed like, “Look, we’ve got this two-sided market. We’re trying to get off the ground.” There were a bunch of harder problems than this technology, so it was 100% outsourced. The back end and all the mobile clients were outsourced. And that worked for a while, until it started to become a successful business. And we got right to the business of making multiple services. The very first things that we did as in-house engineering, in addition to bringing the mobile phone apps in-house, were to build two services.
One was dispatch. We built that in Node.js, and then there’s this thing called API, which, it sounds like everyone ends up with a thing called API. And this is, still today, our monolith. We’ve got this thing called API that we’re desperately trying to break apart or whatever, same story. But that worked well at the time. It got us going, got the infrastructure in-house. The software engineering was in-house. And we’re starting to build a lot of things. And things are going okay for a while.
We decided we don’t really need this MongoDB in here because we have some scaling problems. And dispatch just needs ephemeral state anyway, state that only has to last for the duration of the trip. So we don’t actually even need disk drives involved. So we’re just sticking in a cluster of Redises. So that kind of fixes that problem. But we had this other problem, which is the API side, which is written not in one of these trendy, fancy new languages like Node.js. It’s serious grown-up stuff written in Python.
It’s a serious software engineering decision. No one got fired for writing a service in Python. And of course, you’re going to use an ORM, because who wants to handle all that SQL, right? Who has time for that? Well, funny story: you end up with a bunch of databases that no one can understand because they’re going through this crazy abstraction layer, and engineers growing like crazy and making unintentionally bad decisions just because there’s no one else to ask. They just had to get some code working, so they just started writing it.
Our back end, the Python side, the API side of the world, became very unreliable. It would break all the time. And dispatch has the property that it’s the part of the business that basically has to be up or Uber’s down. People cannot take trips. And it turns out that we can put an availability barrier there. People need to get home, right? There are no free users. No one is just checking their Uber because they’re bored or whatever.
People have to get home, and so we’ve isolated the components that basically have to be up into dispatch. And there’s this layer. We call it ON. It stands for Object Node, and it’s basically an availability cache between dispatch and the rest of the infrastructure. And this is important for many reasons, but it also establishes a pattern that we’ll see more of later. So then after a while, we started realizing uh-oh, we’re building a monolith.
I just read on the internet that monoliths are bad and we should super not do that, because otherwise we’ll have one of those. Apparently Twitter has a monolith and they keep blogging that it’s bad. So we definitely needed to get some non-monolith services. So we start doing this, and this is going well, and for all the obvious reasons, right? If each team supports its own service, then the teams can evolve independently. I’m only partially joking that we did it because Twitter said monoliths are bad.
But it’s definitely a tradeoff that’s not necessarily obvious, because as soon as we start to do this, things start to get more complicated. It’s harder to know what’s going on. It’s harder to debug problems. And while teams might be able to move independently, they might also step on each other independently, whereas if it were one gigantic program they would find these things out right away. So no, tradeoffs are hard, right?
So anyway, time moves forward, and it seems random but there’s a good reason that we totally switched our database. We needed to do some more advanced geospatial queries, and Postgres has this really, really cool PostGIS extension. So we’re doing a lot of geospatial stuff in Postgres, and so we switched the API over. It was a very, very prolonged, expensive engineering process to switch over to Postgres.
And this is fine. But then after a while we realized that Postgres performs even worse under the constraint of being behind this ORM we don’t understand. The cost of each connection, which we can’t fully manage because no one can understand the inner workings of this rapidly changing, massive ORM, is even worse on Postgres. So of course, then we decide let’s go back to MySQL.
And another thing you’ll notice is a lot of these dates are kind of early in the year. And that’s on purpose, because a funny thing about Uber is the busiest time of year is Halloween…or sorry, is New Year’s Eve and almost as busy is Halloween. And so basically, toward the end of the year, you can’t change anything because don’t break it before Halloween and New Year’s happen, which means that once Halloween and New Year’s do happen, then everyone relaxes and we can say, “Okay, okay. Now we can finally start rolling out all these things that we’ve been meaning to do for the last half of the year.”
It’s not a coincidence that major changes happen at the beginning of the year. So anyway, we’re starting to roll this new stuff. We’re getting separate services and not building on our monolith anymore. I mean not as much if we can help it, of course. Sometimes it’s just easier to add more code to the monolith, but we’re trying. We’re trying to break these things out. They got their own databases. That’s pretty good.
But things are sort of sharded vertically. Like this service needs a database and so we’ll give them a database. And we realized that particularly for our main trip service, which is in Postgres, we realized that all of that is not going to work come New Year’s Eve 2014. We can just project the graph forward and it is not going to make it. So we realize we’re going to need one of these sharding things that everyone invents, because every big company has to write their own sharding thing that sits on top of MySQL.
So we did that. And ours is called Schemaless. And it’s pretty cool. And it gives us a scalable, available layer of MySQLs. And the thing is, we have to finish before the end of the year, otherwise the business ends. And we did, so yay us. We managed to do that. So we get that going, and for new services now, we’re giving them a Schemaless cluster instead of a MySQL, for example.
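To make the idea of a sharding layer concrete, here is a minimal sketch of hash-based shard routing in Go. This is only an illustration of the general technique, not how Schemaless is actually implemented; the row key, shard count, and host names are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a row key to one of n logical shards. A layer like this lets
// application code address data by key while the infrastructure decides which
// MySQL host actually stores that shard.
func shardFor(rowKey string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(rowKey))
	return int(h.Sum32()) % numShards
}

func main() {
	// Hypothetical mapping from logical shard to a MySQL host.
	shards := []string{
		"mysql-a.internal:3306",
		"mysql-b.internal:3306",
		"mysql-c.internal:3306",
		"mysql-d.internal:3306",
	}
	tripID := "6a1f0b9e-trip-uuid"
	fmt.Println("trip", tripID, "lives on", shards[shardFor(tripID, len(shards))])
}
```

The point is just that callers address rows by key and the layer underneath decides which MySQL actually holds them, so you can add capacity without rewriting every service.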
But technology marches on, and meanwhile, this dispatch system that we wrote in 2011 is starting to show its age and we want to roll out some new products. We want to deliver food and boxes and all this other stuff. So we undertook a year-long rewrite of the dispatch system. And it finally rolled out sort of summer toward the end of last year. It kind of got fully rolled out just right before we stopped changing anything last year. And that introduced a bunch of other new technology that we’re going to talk about a little bit more in a minute, particularly around distributed system stuff that we Open Sourced and highly available cluster databases, like when you’re [inaudible] Cassandra.
So but of course, things keep moving. And so now we’re starting to write more services in Go. We’re starting to write stuff in Java. We’re contemplating moving stuff to the cloud. This whole time it’s been in our own data centers, so this is kind of new stuff for us. But in general, we’re trying to give people higher level abstractions that we just didn’t have the time to give them when we were growing super-fast and we just had to get stuff working to keep up with the growth and keep up with our competition, which is incredibly fierce and aggressive.
And we just have to deliver features quickly. We just did not have time to give our engineers reasonable abstractions or high enough level abstractions. So we’re finally, this year, able to spend some more time on that. So as you can imagine, through this rapid evolution we have accumulated a lot of what you might call technical debt, which is a funny word because everyone likes to talk about it, “Oh, technical debt. Too much technical debt. We need a [inaudible] to work on technical debt.”
But a funny thing is, I don’t think anyone actually agrees, for sure, what technical debt is. It just kind of means stuff you don’t like, I guess. I think it depends on who you talk to; everyone has a different idea of what technical debt is. For me, and for this discussion, technical debt means decisions that we made because they helped us survive, not because they were the best long-term decision. But they were decisions that were appropriate at the time.
But then the time moves very, very quickly on this timeline and we have to refactor some things. And so along this way, you can imagine, trying to add all these developers so quickly. Obviously, obviously we are going to do microservices. And so microservices, microservice, microservice. Okay, everyone’s got a lot of microservices and also it’s so fun when software engineers try to be all angry. Microservices, yeah. It’s just great. And then like…
So anyway, check out our graph. We’ve got one of these. Everyone has a graph like this that shows how many microservices they’re adding. And so over two years, we added, man, really a ton of microservices. So more or less 700 at this point, and now we’re adding multiple languages and it’s getting kind of complicated out there. And of course, everyone’s also got a dependency graph. It’s supposed to be really scary and look like a Hubble Space Telescope image or an exploding star nebula or something like that.
But the thing is, it’s an obvious tradeoff to do. If you’re trying to add productivity while adding engineers very, very quickly with minimal ramp-up time, microservices are an obvious thing that you would do. It allows teams to move more independently. And there are some, maybe, surprising costs and some things that I hadn’t realized, actually, at least when I first got involved in this kind of stuff at Uber, which is like a year and a half ago, which is the…I mean, I was already on board with the idea of microservices, but I guess I sort of assumed 10 was a good number, 20 maybe.
You can kind of… one person can know what they all do. That sounds like a good idea. But with an engineering team this large, that’s simply not possible. And we’ve definitely gotten to the point where we’re questioning whether, at some point, microservices should become immutable. Don’t change them. Actually, I think append-only microservices might be a feature as well as a bug. If you look at when Uber’s service is the most reliable, it is on the weekends.
Why? Because we’re not releasing code on the weekends, I mean not as much. Every time we go to change something, we’re going to risk breaking it. And so I don’t know. I think that append-only and immutable services sounds bad because there’s some kind of implied cost. You think, “Oh, it must be really expensive to keep all these old versions of things going.” But I really want to question that assumption.
And especially if you can drive that cost down, if you can drive the cost down so low that you can…you get on the good side of this equation where who cares if there are 700 services and you’re only using 20. But you’re not sure what the other 600 of them are doing, but they’re free. Why would you risk breaking them as long as you’ve got a reliable way to still move forward? It’s adding these new things that are where the real value is.
And so as long as you can safely keep them around, I don’t know, maybe come up with a way to garbage collect your services somehow. But I think it’s worth considering whether the cost is indeed that high of having a lot of services, or whether you can drive that cost down for sort of older stuff, to the point where you can just let it sit out there. So as I mentioned, our environment has accumulated a lot of complexity.
And one of the big reasons that we have this, or at least a perceived big reason, is we have a ton of different programming languages. So we’ve got Node.js, which all the dispatch stuff is written in. We’ve got Python, which kind of all the core services are written in. But basically, these days everyone is really excited about Go, and so we’re writing all kinds of new stuff in Go on both sides of that camp. All the machine learning stuff, the math stuff, is written in Java.
And so man, we’ve got a really complex mix here. And we have sort of microservices, I guess, to thank for the fact that we are able to actually do this. Somehow, we got four languages. We still have four languages running in production. I think only because we were able to just glue them together with RPCs. And so I know it sounds kind of scary, but I don’t know. Maybe it is, maybe it isn’t. I guess we’ll see.
But so when we were building the new dispatch service, which is kind of an interesting problem, it’s the part of Uber that has to be up or the system doesn’t work. We wrote this thing in Node.js, and as you know, Node.js “doesn’t scale” and it only uses one CPU at a time. Same problem as Python, right? So the challenge is, we’ve got this existing dispatch system all written in Node and we want to scale this thing. We want to rewrite it and add all these new features. And so scaling Node requires some sort of specialized frameworks and tooling.
And we have written some of those and Open Sourced them. And I just want to explain briefly here, just because I think it’s an interesting technique. Not that all of our services use this, but some of the most interesting ones do, or the dispatch ones do. They’re all interesting. Anyway, so we built this thing. So dispatch runs as a state machine. It has supply elements and demand elements that all have state, and there are various things that can cause transitions in those states.
And so that state exists in memory, in worker processes running in our various data centers. The thing we use to locate and shard out all that state is this thing called Ringpop. You can find that on the GitHub. There’s a bunch of docs. Anyway, it works like this. You want to write a service. It has some logic. You want to deploy this logic somehow. But the way that you scale Node, of course, is that you can’t add threads. You have to add instances across multiple machines.
And so you get a bunch of those things. And then how do you let these multiple instances running on multiple machines all act as the same logical thing and know about each other and distribute enough information so the state can be located and replicated? So you put this Ringpop library in these services, and then a consumer comes along and sends a request to a random healthy node, and as the service author you don’t have to understand what node it should be on or whatever.
You have this API that’s handle-or-forward. If the request happens to have landed on the right node, it gets evaluated there; otherwise it gets forwarded to the appropriate node where the request gets handled. So we use this for all kinds of things. We use this, for example, to park HTTP long-poll sockets. So we’ve got all the phones parked in a socket, and we want to be able to push messages from the back end, so we shard that by the user ID. So then if a service wants to push a message to a phone, it’s got this Ringpop cluster that it uses to find the user.
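To make the handle-or-forward pattern concrete, here is a minimal sketch in Go. This is only an illustration of the idea, not Ringpop’s actual API; the ring, the hashing, and the node addresses are all made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a toy stand-in for a Ringpop-style membership ring. In the real
// thing the node list is kept in sync by gossip.
type Ring struct {
	self  string
	nodes []string
}

// Owner picks which instance owns a key, for example a user ID.
func (r *Ring) Owner(key string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	sorted := append([]string(nil), r.nodes...)
	sort.Strings(sorted)
	return sorted[int(h.Sum32())%len(sorted)]
}

// HandleOrForward runs the request locally if this instance owns the key,
// otherwise it forwards to the owner. The caller never needs to know which.
func (r *Ring) HandleOrForward(key string, handle func(), forward func(owner string)) {
	if owner := r.Owner(key); owner == r.self {
		handle()
	} else {
		forward(owner)
	}
}

func main() {
	ring := &Ring{
		self:  "10.0.0.2:7000",
		nodes: []string{"10.0.0.1:7000", "10.0.0.2:7000", "10.0.0.3:7000"},
	}
	ring.HandleOrForward("user-42",
		func() { fmt.Println("push the message down the socket parked on this instance") },
		func(owner string) { fmt.Println("forward the push request to", owner) },
	)
}
```

The useful property is that a consumer can hit any healthy instance and the sharding is handled behind that one call.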
Similarly, we have a geospatial index that shards by S2 cell. And so as the supply and demand updates come in, they’re distributed across some Ringpop workers, sharded by S2 cell. Ringpop is kind of interesting. You just have to name-check a paper to get more gravitas: it uses this SWIM protocol. SWIM is pretty awesome. A lot of stuff uses SWIM, but if you don’t know how it works, it’s pretty neat.
The nodes scalably health-check each other. We don’t have a load balancer here. All of the nodes need to know about each other and coordinate, so there needs to be some kind of health check that happens, but it needs to be scalable, and they have to be able to detect failed nodes scalably. That’s why we went with SWIM. So a node will periodically ping a random node. If that doesn’t seem like it’s working, what it will do is ask two other nodes, “Hey, do you agree with me? Do you think that that first node is down? Because I kind of think that one’s broken.”
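Roughly, that suspicion step looks something like the sketch below. This is an illustrative Go sketch of SWIM-style indirect probing, not Ringpop’s actual implementation; both ping functions are placeholders.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// directPing is a placeholder for an RPC with a deadline asking the target
// directly whether it is alive.
func directPing(target string, timeout time.Duration) bool {
	return rand.Float32() > 0.5 // stand-in for a real network call
}

// indirectPing is a placeholder for asking a helper node to ping the target
// on our behalf and report back.
func indirectPing(helper, target string, timeout time.Duration) bool {
	return rand.Float32() > 0.5 // stand-in for a real network call
}

// looksDown implements the SWIM-style check: ping directly, and if that
// fails, ask k other members to try before deciding the target is faulty.
func looksDown(target string, others []string, k int) bool {
	if directPing(target, 200*time.Millisecond) {
		return false
	}
	for i := 0; i < k && i < len(others); i++ {
		if indirectPing(others[i], target, 200*time.Millisecond) {
			return false // someone else can still reach it
		}
	}
	return true // nobody could reach it; gossip that it should be evicted
}

func main() {
	helpers := []string{"10.0.0.3:7000", "10.0.0.4:7000"}
	if looksDown("10.0.0.2:7000", helpers, 2) {
		fmt.Println("10.0.0.2:7000 looks down, mark it faulty")
	} else {
		fmt.Println("10.0.0.2:7000 is still reachable")
	}
}
```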
And so then those two will try it and decide whether Instance 2 should be evicted. So anyway, we built some cool UI around Ringpop so you can figure out how healthy your cluster is and how it’s been doing over time and stuff like that. Anyway, that’s all on the GitHub. You can check that out. So here we are. We’re scaling this dispatch stuff and we’re doing all this Ringpop stuff, which is gossiping over HTTP. And this is working. It’s fine, except what we’re finding is that HTTP is incredibly expensive compared to the Redis protocol or some other protocol that is a lot, a lot simpler.
And furthermore, there are all these extra knobs and things you can twiddle in HTTP that we just don’t ever use, and that in fact are causing problems, because there are all these weird conventions about what status codes mean, and should you put something in a header or a query string, and how do you encode things, and can you have variable parts in the middle, and what method do you use? All this HTTP stuff that everyone says is so great, that you should be RESTful or whatever, unless you have really well-defined conventions and libraries and tooling to make all these RPCs go, I think it really just works against you.
But the main problem we had, though, was that HTTP was too slow. So we started to build our own RPC protocol, basically just to make Ringpop work, just so that it would be so efficient that we didn’t worry about the extra time from the gossip. So we wrote this thing called TChannel, and that’s also on the GitHub. And TChannel is kind of like a lightweight HTTP/2. It has a lot less flexibility and it’s specifically designed to be efficient to make forwarding decisions. And so we switched all of our Ringpop stuff to use TChannel.
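To see why a purpose-built framing protocol makes forwarding cheap, here is a toy Go sketch of the general idea: put the routing information (the destination service name) at the front of every frame so a router can act on it without parsing the payload. This is only an illustration, not TChannel’s real wire format.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeFrame writes a length-prefixed service name followed by the payload.
// A router only needs to read the first few bytes to know where to send it.
func encodeFrame(service string, payload []byte) []byte {
	var buf bytes.Buffer
	binary.Write(&buf, binary.BigEndian, uint16(len(service)))
	buf.WriteString(service)
	buf.Write(payload)
	return buf.Bytes()
}

// peekService reads just the service name and leaves the payload untouched.
func peekService(frame []byte) string {
	n := binary.BigEndian.Uint16(frame[:2])
	return string(frame[2 : 2+n])
}

func main() {
	frame := encodeFrame("trips", []byte(`{"rider":"42"}`))
	fmt.Println("route this frame to:", peekService(frame)) // no payload parsing needed
}
```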
And then, because it’s so much faster and simpler, we started switching some of the clients that talk to this thing from HTTP to TChannel. And I know the question everyone asks is, “What about the tooling? How will you know what it’s doing, and can you have curl?” And so we made a TChannel curl and we made a TChannel tcap thing. So that turned out actually to be not that big of a deal.
But the really big deal that we started to realize is that we were really unhappy with the way our RPCs were working. Specifically the performance of HTTP, but also the flexibility and conventions of how you do HTTP across a rapidly changing organization, as well as JSON. All these services are swapping JSON strings. And there were so many weird problems caused by the fact that it was JSON and there was no obvious way to validate whether you’re doing the right thing or not.
So we decided we’re going to have to…yeah, I already said that. We decided that we’re going to move to Thrift on top of TChannel as our primary way of doing RPCs. And along the way, we realized that Apache Thrift does not work very well at all in Node or Python, which are our two main languages, and so this is a real bummer. It seems like Thrift is all batteries included, you get a client, it’s going to be great. And in retrospect we probably should have just done Protobuf.
But this was before proto3 came out, and so it was like, I don’t know. But anyway, in the process, we’ve done what it turns out all big companies who use Thrift do, which is we wrote our own Thrift compiler. I think that’s just what you have to do: you write your own Thrift. Especially in Node and Python, it was a whole deployment toolchain hassle with the generated code that you get from Thrift, or from Protobuf for that matter. And so we made this library called thriftrw that lets you just work with native Thrift.
You just put a Thrift file in place, your instance starts up, and you can just serialize and deserialize Thrift objects, and that’s a huge win. It made everything much, much simpler. And so of course, we have to solve the problem of service discovery. And in true giving-a-talk-at-a-technology-conference form, I’m going to end with, “Look, we wrote our own.” But I’ll just show you the path of how we got there.
So this is kind of where we find ourselves. And actually, there’s still a whole bunch of stuff running in Uber in production right now that works just the exact same way, which is: we have haproxy everywhere, and we have a sort of SmartStack-alike thing that knows where all the services are and then periodically rewrites the haproxy [inaudible]. You want to make a connection. Service B wants to consume Service A. It just knows, somehow, through magic, that port 7000 is Service A.
And so in the code we say, “Hey, let’s hit localhost 7000,” and it finds you a Service A somehow. The haproxies talk to each other and they work it out. The problem we had with this is that the propagation time was really slow. And particularly, the… actually I think I have it right here. There we go. We had the problem that, because we were writing all this stuff in Node and Python, there’s the number of workers that we had to deploy: you have to have at least one per CPU, and you probably put more than that per machine because they’re not all going to be running at maximum all the time.
So on each machine, we’re running 40 instances of these things. It starts to be taxing to have these massive, massive haproxy files that need to all be coordinated and resynchronized. And this is kind of painful. And then also, we, as I said, I showed you that call graph before. That was made by crawling the code. That was not made by watching the system work. Turns out there’s no way to watch the system work. It just…if people complain that it’s not working, that’s how we know that it’s not working.
And then lots of people look at logs, and they’ll know about their own little thing and think, “I should have more log lines here,” or whatever. But it’s really, really hard to know who’s calling whom and how that whole thing is going to work. And so often we just DoS ourselves. Somebody changes a thing, and it turns out to have some weird bug, and it ends up just crushing some other service. And it’s just really hard to figure out who’s doing it. And of course we have cascading failures.
And so what we came to is that in this kind of environment, where things are moving so crazily fast that it’s hard to know what’s going on, we needed a lot more sophisticated traffic control as well as service discovery. And so we built a system out of Ringpop and TChannel that more or less looks like the haproxy model, but with a different process on each side, called hyperbahn. It’s also on the GitHub site. But basically, you swap out all the haproxies for these hyperbahn router instances. And it works basically the same way, except now we’re making Thrift requests and there’s built-in tracing and all this stuff.
But more importantly, the TChannel protocol is incredibly efficient for making forwarding decisions, and so all those forwards that we were doing before can now be done much faster. And even if Service B is, for example, running Ringpop, maybe it has a forwarding decision of its own to make, and because it’s TChannel end-to-end, that’s much more efficient. And of course, it’s a request-response sort of model, but these are all multiplexed over a stable set of TCP connections for performance.
So the traditional, the old way that we had all of our services was [inaudible] there. There’s this haproxy that runs on the machine. And it has an interesting set of problems, which is that if it ever broke or got misconfigured or something bad happened to it, we didn’t know what to do, because we just depended on this thing working all the time.
And in particular, we are also embracing failure testing and that was one of those things that everyone’s like, “You can’t. Don’t failure test the haproxy. Are you insane? The whole thing will stop working. It will be bad. You cannot take down haproxy.” So what we came to is we need to move some of the logic, some load balancing logic up into the service. Just enough to be able to get you to some other instance if your first choice is not working. And that is… this is kind of inspired by Twitter’s Finagle.
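Just enough load-balancing logic in the caller looks roughly like the sketch below: try one instance, and if it fails, walk the rest of the healthy list. This is only a sketch of the idea in Go, not our actual client library; the instance list and the call function are stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// call is a stand-in for making an RPC to one specific instance.
func call(instance, req string) (string, error) {
	if rand.Float32() < 0.3 { // pretend this instance is down or misbehaving
		return "", errors.New("instance unavailable: " + instance)
	}
	return "ok from " + instance, nil
}

// callWithFailover tries one instance and, if that fails, walks the rest of
// the healthy list. This is the small slice of load-balancing logic pulled up
// into the service, so one broken proxy or instance can't take the caller down.
func callWithFailover(instances []string, req string) (string, error) {
	start := rand.Intn(len(instances))
	var lastErr error
	for i := 0; i < len(instances); i++ {
		inst := instances[(start+i)%len(instances)]
		resp, err := call(inst, req)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return "", lastErr
}

func main() {
	resp, err := callWithFailover([]string{"a:7000", "b:7000", "c:7000"}, "GET /status")
	fmt.Println(resp, err)
}
```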
And speaking of Twitter, another thing that we added in the TChannel suite is baked-in Zipkin tracing, and so you can get a distributed trace across all your services, which turns out to be absolutely crucial, even though we somehow managed to live without it for this long. Because you can really tell what’s going on. This is… I just grabbed this this morning. The more things we get moved onto this platform, the more people are just like, “Really? We’re doing that? Oh, that’s so easy to fix.”
I had no idea. What are we spending? Why are we even calling that from here? I already have that in memory or whatever. So the more exposure to zipkin tracing we get, sort of the happier everybody is about the state of our infrastructure. So, it’s a pretty nice setup. Like I said, we got it all figured out. Now we have horizontal scalability and zipkin tracing and circuit breaking and rate limiting. Now we can do failure testing on our hyperbahn nodes.
And it has almost no configuration, which is good. And is as available as possible. But you might be thinking wow, you guys got it all figured out. But maybe there’s some problem. Surely it’s still kind of broken, right? And yeah. Because the overall latency in a system like this is greater than or equal to the slowest component. And so we’ve seen this stuff before, but this is a really sort of obviously bad graph, right?
You can have 700 services in production, and what are the chances that any given request is using, I don’t know, 5 or 10 of them? The more of those pieces you have, the more quickly you get put into the slow case. And so what we’re starting to do in the TChannel and hyperbahn world is to implement an idea that we got from this presentation from Jeff Dean, which is called backup requests with cross-server cancellation. And it works like this.
Let’s say that you have an RPC that you can send to either B1 or B2. It doesn’t matter which one you send it to. You’ll get the same answer. What you can do is you can send it to B1 and say, “Hey, by the way, I also sent this to B2.” And then some amount of time later, five milliseconds, whatever, some magic number of milliseconds later, you send the same RPC to B2 and say, “Well, by the way, I also sent this to B1.” And then whoever works on it first cancels it from the other one.
So that way, in the common case, you only do the work once. And if anything ever goes wrong, like B1 has some horrible GC pause or it gets randomly failure tested or I don’t know, has some weird bug and it gets stuck, anything at all, the most latency that you’re adding here is five milliseconds. The B2 will just run, try to cancel it from B1. B1 may or may not be even up or healthy. But either way, users don’t notice when these kind of latency problems or just faults happen.
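The shape of the technique, as a hedged Go sketch. This is just an illustration of backup requests with cancellation, not the TChannel claim implementation; the delays are made up and the in-process context cancellation stands in for the real cross-server messages.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// send stands in for sending the RPC to one backend. The context carries the
// cancellation: whichever backend answers first cancels the other's work.
func send(ctx context.Context, backend string, workTime time.Duration, result chan<- string) {
	select {
	case <-time.After(workTime): // pretend this is the backend doing the work
		select {
		case result <- "answer from " + backend:
		default: // the other backend already answered; this work is wasted but harmless
		}
	case <-ctx.Done(): // the request was claimed elsewhere; stop working
	}
}

// backupRequest sends the RPC to the primary, waits a few milliseconds, then
// hedges by sending the same RPC to the backup. First answer wins.
func backupRequest(primary, backup string) string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	result := make(chan string, 1)
	go send(ctx, primary, 50*time.Millisecond, result) // pretend the primary is stuck in a GC pause
	time.Sleep(5 * time.Millisecond)                   // the "magic number" of extra latency we tolerate
	go send(ctx, backup, 1*time.Millisecond, result)

	answer := <-result
	cancel() // cancel whichever backend hasn't finished yet
	return answer
}

func main() {
	fmt.Println(backupRequest("B1", "B2")) // worst case we added about 5ms; the user never notices
}
```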
So we implemented this in TChannel with this claim option, if you want to check that out. So really, though, the biggest journey in all this stuff for scaling Uber has not been the technology. It has been the culture. It has been an incredible challenge getting failure testing instilled as a requirement. Because most of our stuff, honestly, all of our legacy stuff, we would be sad if we ran failure testing on it. We just know. We just know it’s going to break. There’s no sense failure testing it.
We already know it’s going to break. And so getting it into everyone’s head that that’s actually not okay, that what we have to do is build stuff that we failure test as we build it, and then leave it in production that way. And related to that is that you have to have retries. Things need to be retriable. Otherwise, when things break you have nothing to do; you’re going to expose that to the user. Which means make most requests idempotent. Make them retriable. And that kind of blows people’s minds.
They’re like, “But they’re not idempotent requests.” And you have to rethink your architecture a little bit to make it so that you can retry things, but it’s worth it if you want scalability and availability. So another clever thing that we did to scale this whole thing up is we’re using the fact that we’ve got a mobile phone, which has got storage on it and is plugged into reliable power, to handle our data center failover for live trips.
So the trip is going on. The partner phone checks in every four seconds. So what we do is after a couple location updates, we encrypt a state digest and we send it back down to the partner phone. And then if the data center fails and the phone tries to make progress on the other data center, because the second data center has the same encrypted [inaudible] credentials, it can request the state digest and the trip can continue.
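A hedged sketch of that idea in Go, with AES-GCM standing in for whatever the real system uses; the digest contents and the key handling are made up for illustration.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// sealDigest encrypts a trip-state digest so it can ride along on the partner's
// phone without the phone being able to read or tamper with it. Any datacenter
// holding the same key can open it later and resume the trip.
func sealDigest(key, digest []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, digest, nil), nil
}

// openDigest is what the failover datacenter does when the phone checks in
// and hands the sealed digest back.
func openDigest(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce, body := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, body, nil)
}

func main() {
	key := make([]byte, 32) // shared between datacenters; all zeroes only for this demo
	sealed, _ := sealDigest(key, []byte(`{"trip":"abc","state":"en_route"}`))
	recovered, _ := openDigest(key, sealed)
	fmt.Println("resumed trip state:", string(recovered))
}
```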
And then we can fail the different cities over to anywhere without having to have replicated it everywhere. Anyway, that is all I have. Thank you so much for your time.