What does it take to run an active-active architecture? It boils down to data, and how you interact with it. The most common disaster recovery strategies I encounter with folks are either "hope and pray" on a single location for running an application, or "we have a backup facility we can switch to in a disaster" with an active-passive configuration. This is a shame, as running an active-active architecture is not science fiction, and while it takes some work, for many business use cases the pros greatly outweigh the cons.
The Devil's in the Data
If you work your architecture from the outside in, an active-active architecture is relatively easy to understand: you have more than one physical datacenter or cloud region where you run your application, you have your application running in each, and you have traffic management to load balance users to the closest, fastest and most available location for them. Where folks get stuck is in the database. How do you make sure a change made in one location gets propagated to the others? What happens if a user "bounces" from one facility to the next and attempts to read back data they just wrote? How do you avoid race conditions where user A at location 1 and user B at location 2 make a conflicting change to the state of the application?
The key to handling the scenarios above is understanding the business requirements of your data, and treating each type of data appropriately for an active-active architecture. If you try to "make all data available everywhere", you'll quickly run into either CAP theorem constraints or financial constraints on the high costs of this approach. Instead, let's break the bigger problem down into smaller, more tractable problems.
Three Classes of Data
For each type of data your application accesses or produces, let's classify it according to one of three types, using a banking metaphor for simplicity (and for easier explanation to your business peers):
1) Is this data like a bank account balance? It changes frequently, and it's incredibly important to make sure two conflicting changes don't "break" the balance. We don't want user A and user B making a withdrawal from two locations at the same time, thereby potentially overdrawing the account.
2) Is this data like the address and mailing information your bank keeps for your account? It changes, but not frequently, and during a change, it's OK if there's some short period of time where the multiple locations that host your application are out of sync for this data. They'll eventually gain consistency.
3) Is this data like your banking statements? Once produced, they don't change. They can be archived.
Account balance. Mailing address. Historic statement. Three categories, and for each, we have a data replication strategy that can support active-active. By categorizing the data and treating each with a different replication strategy, we're accomplishing two very important things:
1) We're being thoughtful about what data REALLY needs near real-time replication and race condition protection. By doing so, we're reducing the amount of overall data that needs high priority replication. The less data to replicate, the better and more efficiently replication works.
2) Rather than try to replicate all data with the same strategy, we're willing to use the right tool in the toolbox for each type of data, which means we'll save costs by using archival and eventually consistent strategies where they make sense.
For each class, we'll use an appropriate replication strategy:
1) For "account balance" information, we'll replicate changes as quickly as possible, and make sure our application is aware and capable of resolving change conflicts. This is certainly one of the hardest pieces to get right, but by reducing the scope of how much data we need to address with this strategy, the problem is much easier to solve than by trying to apply this strategy to all three categories of data.
2) For "mailing address" information, we'll use an eventually consistent replication strategy.
3) For "historic statements" information, we may choose NOT to replicate this to all of our facilities. We may revisit some old assumptions on what's important to save and what's not, and decide to discard some of this information. We may choose to simply regenerate the statements should they be lost. We may choose to bulk replicate this data at certain times of the day.
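One way to make this classification concrete is to record it in code, so that every dataset has to be assigned a class before it can be replicated. The following is a minimal Python sketch, not a prescribed implementation: the dataset names (`ledger_entries`, `customer_profiles`, `monthly_statements`) and the `strategy_for` helper are hypothetical, and the strategy strings simply paraphrase the list above.

```python
from enum import Enum


class DataClass(Enum):
    ACCOUNT_BALANCE = "account balance"        # changes often, conflicts must be resolved
    MAILING_ADDRESS = "mailing address"        # changes rarely, brief divergence is acceptable
    HISTORIC_STATEMENT = "historic statement"  # does not change once produced


# Replication strategy per class, paraphrasing the three strategies above.
REPLICATION_STRATEGY = {
    DataClass.ACCOUNT_BALANCE: "replicate immediately, resolve conflicts in the application",
    DataClass.MAILING_ADDRESS: "eventually consistent replication",
    DataClass.HISTORIC_STATEMENT: "bulk replicate, regenerate, or do not replicate",
}

# Hypothetical catalog: which class each of the application's datasets falls into.
DATASET_CLASSES = {
    "ledger_entries": DataClass.ACCOUNT_BALANCE,
    "customer_profiles": DataClass.MAILING_ADDRESS,
    "monthly_statements": DataClass.HISTORIC_STATEMENT,
}


def strategy_for(dataset: str) -> str:
    """Return the replication strategy for a dataset, failing loudly if unclassified."""
    try:
        return REPLICATION_STRATEGY[DATASET_CLASSES[dataset]]
    except KeyError:
        raise ValueError(f"dataset {dataset!r} has not been classified yet") from None
```

Failing on an unclassified dataset is a deliberate choice in this sketch: it forces the "which kind of data is this?" conversation to happen before new data quietly ends up in the most expensive replication tier.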
The Top Three Benefits
After we migrate an application to active-active, we can expect three key benefits:
1) Faster and more reliable failover in the event of a disaster. As opposed to an active-passive disaster recovery strategy, which depends on idle infrastructure coming up to full production speed in the event of a disaster, we avoid the risk of "we thought it was ready to handle production but something broke in between disaster recovery drills". In an active-active scenario, we're sending production traffic to each location all of the time, so we don't have an "idle to full capacity" ramp-up problem; instead, we'll have a "50% to 100%" (for two locations going to one) or "33% to 50%" (for three locations going to two) capacity challenge. We'll need to make sure we always have spare capacity in an active-active configuration to handle the additional load during a failure.
2) More flexibility for making changes to applications and infrastructure. During a site maintenance event, we can shift all traffic away from one facility, perform our maintenance, and then pull the traffic back, without downtime.
3) Improved user experience. By connecting users with an application instance that is near where they are in the world, they'll receive a better and faster user experience than if they were all sent to a single location.
There are others, but those are the biggest three.
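The "50% to 100%" and "33% to 50%" figures from the first benefit are simple arithmetic, and it can help to have that arithmetic written down when planning capacity. Here is a small Python sketch. It assumes identically sized sites and traffic that redistributes evenly across the survivors, which real deployments may not match; the function names are made up for illustration.

```python
def load_after_failure(sites: int, load_per_site: float, failed: int = 1) -> float:
    """Per-site load (as a fraction of one site's capacity) after `failed` sites
    go down and their traffic is spread evenly across the survivors."""
    survivors = sites - failed
    if survivors < 1:
        raise ValueError("at least one site must survive")
    return sites * load_per_site / survivors


def max_safe_load(sites: int, failed: int = 1) -> float:
    """Highest steady-state per-site load that still fits on the survivors."""
    if sites - failed < 1:
        raise ValueError("at least one site must survive")
    return (sites - failed) / sites
```

For example, `load_after_failure(2, 0.75)` returns 1.5: two sites each running at 75% would ask the lone survivor to carry 150% of its capacity. `max_safe_load(2)` returns 0.5, which is the ceiling each of two sites has to stay under to survive the loss of the other.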
The Top Three Drawbacks
Of course, it's not all sunshine and kittens, and it's not without work and investment:
1) You're going to need spare capacity available, and be diligent and have conviction about maintaining a safe buffer for spare capacity. If you run two facilities each at 75% load, you're going to have a big problem when one fails. Know your limits, and stick to them. Don't fall behind in investing in this capacity.
2) You're going to need to change your application. If you were looking for a "drop in solution" to active-active, this is not it. The most common scenario is you're going from a single database instance on a single database technology to three separate data services that treat each type of data with the appropriate data management strategy. The most common problem is "I used to be able to JOIN these two tables to get a result, but one was an 'account balance' type, and the other was a 'mailing address' type, and now they're stored separately". In this case, your application will need awareness to retrieve the data from each appropriate location, and "JOIN" in software.
This sounds hard and time consuming, but there's a great strategy for how to implement these changes. Presumably, your application accesses data using a core set of libraries (often an object-relational mapping, or ORM, approach). In these libraries, you'll want to make the changes for "what type of data is retrieved from where". This will help minimize the impact to the rest of the application in terms of the interfaces to retrieve data, and keep the "data load balancing" functionality in one location in your codebase. Add in helper functions as appropriate to implement the JOIN functionality you need that previously was done in the database. Keep in mind, if you have a LOT of JOINs across categories of data, this strategy as a whole may not be viable for you.
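As a rough illustration of a "JOIN in software" helper living in the data access layer, here is a Python sketch. Everything in it is hypothetical: `CustomerRepository`, `CustomerSummary`, and the two stores are stand-ins (any object with a dict-like `get` works), not the API of a particular ORM or database.

```python
from dataclasses import dataclass


@dataclass
class CustomerSummary:
    customer_id: str
    balance_cents: int
    mailing_address: str


class CustomerRepository:
    """The one place in the codebase that knows which store holds which data."""

    def __init__(self, balance_store, address_store):
        # Hypothetical stores: anything exposing a dict-like .get(key).
        self._balances = balance_store    # "account balance" class of data
        self._addresses = address_store   # "mailing address" class of data

    def summaries(self, customer_ids):
        """Stand-in for a former SQL JOIN of the balance and address tables."""
        results = []
        for cid in customer_ids:
            balance = self._balances.get(cid)
            address = self._addresses.get(cid)
            if balance is None or address is None:
                continue  # inner-join semantics: skip ids missing from either store
            results.append(CustomerSummary(cid, balance, address))
        return results
```

The rest of the application keeps calling `summaries(...)` and never learns that the two halves of each row now come from different data services. Note that this sketch looks keys up one at a time; a real helper would probably batch the lookups against each store.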
It's highly recommended you take a dark architecture approach to these modifications. For some period of time after migrating data to a new strategy, have your application perform reads and writes to BOTH the legacy and new implementations for a single data interaction use case. Compare the values and results, and log/alert when different, while returning the legacy result. This gives you operational experience with the new data strategy while minimizing risk, since the application is still using the legacy implementation for all functionality. After you have confidence that the two approaches are functionally equivalent, you can start using the values from the new implementation, and turn down the legacy implementation. For more details on dark architecture, see here: http://gigaom.com/2013/06/20/making-it-change-less-scary-using-dark-architecture/
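For the read side, the compare-and-log step might look something like the Python sketch below. The `dark_read` function and its parameters are invented for illustration; `legacy_read` and `new_read` are assumed to be callables that take a key and return a value from the respective implementation.

```python
import logging

logger = logging.getLogger("dark_read")


def dark_read(key, legacy_read, new_read, on_mismatch=None):
    """Read from both implementations, compare, and always return the legacy value."""
    legacy_value = legacy_read(key)
    try:
        new_value = new_read(key)
    except Exception:
        # A failure in the new path must never affect the caller.
        logger.exception("new implementation failed for key %r", key)
        return legacy_value
    if new_value != legacy_value:
        logger.warning("mismatch for key %r: legacy=%r new=%r",
                       key, legacy_value, new_value)
        if on_mismatch is not None:
            on_mismatch(key, legacy_value, new_value)  # e.g. raise an alert
    return legacy_value
```

Because the legacy value is returned in every branch, callers see no behavior change while the comparison runs. Writes would need the same treatment (write to both, compare results), which this sketch leaves out.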
3) You're going to be running multiple database and data storage implementations. There's a definite cost here in training your team, licensing, infrastructure, and having the monitoring and associated operational infrastructure required to run a dedicated strategy for each type of data. You'll have to understand for your business whether the benefits outweigh the costs.
There are others, but those are the biggest three.
Your Next Step
How important are uptime, infrastructure agility and user experience to your business? Are these investments worthwhile? Only you will be able to tell for your business. Start by quantifying the value of uptime (e.g., how much money do you lose, and how much damage does your brand suffer, during an outage?), the value of agility (e.g., how much faster could your team evolve your application if they had more flexibility to make changes anytime to production without affecting customers?), and the value of user experience (e.g., how much faster could your user experience be by splitting your infrastructure into multiple active-active locations, and how much would your users appreciate that improved experience?). Come up with those figures, compare them to the technical investments required above, and you'll have a pretty clear GO/NO-GO for evolving your application to active-active.



