
Legacy Modernization: Architecting Realtime Systems around a Mainframe


Summary

Jason Roberts and Sonia Mathew discuss architecting resilient real-time systems interacting with mainframes. They explain how Change Data Capture, Domain-Driven Design, Event-Driven Architecture, and Team Topologies were crucial for technical, organizational, and semantic decoupling. Learn their strategies for overcoming challenges with legacy systems and building a unified, scalable platform.

Bio

Jason Roberts is Lead Software Consultant @Thoughtworks, 15+ years in software development, Azure Solutions Architect Expert. Sonia Mathew is Director, Product Engineering @National Grid, 20+ years in tech.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Mathew: About six months back, our team was in a session, about 50 or 60 people in a PI planning session. We still do SAFe Agile at National Grid. This was a session with our primary business stakeholder. In the middle of the session, half the room got up and left. We found out that our major customer application had faced disruption. For the half of the room that stayed, that's why we are going to talk to you about architecting resilient real-time systems with mainframes and how you do that with change data capture, domain-driven design, and event-driven architecture.

My name is Sonia Mathew. I am Director of Engineering at National Grid.

Roberts: I'm Jason Roberts. I'm a Lead Software Consultant at Thoughtworks. I've been working with National Grid for about two and a half years. For two of those, I've been serving as the Technical Program Lead for this program for the 2.0 portal and with their mainframe, very deep in there.

Context-Setting

Just to set the context here, there's a lot going on, but ultimately this is a story about decoupling. Many kinds of decoupling: technical decoupling, organizational decoupling, and semantic decoupling. These four paradigms that we're going to talk about all work together in concert to make all that happen, and they're complementary to each other. A lot of you probably know what these things are already. Domain-driven design, that's a software development methodology focusing on modeling, specifically modeling data and modeling events that happen within a system. It gives us the tools to redefine and reshape the data, which is in the mainframe and hard to consume, to something that's more business-friendly, more developer-friendly, across the entire spectrum of that value chain.

Team topologies is an organizational design methodology. There's that word methodology again, that prescribes shaping teams according to different types. There are four, but the two that are important to us are stream-aligned, that's the one that actually delivers value through the product, and complicated subsystem team. I'll let you guess what that might be related to. Team topologies aligns well with domain-driven design, because those boundaries, those circles we draw around the domains and the teams, can be very similar or the same in many cases. It also gives us the tools to carve that out for the complicated subsystem teams. Event-driven architecture is a system topology. We have another through-point here, topology, using asynchronous messages, events, as a means for components to communicate.

Another decoupling mechanism, again, following our core themes. The advantage of events is that they don't have an implicit or explicit answer, like an API does. In a sense, they're a stronger abstraction, because you could have no response or many responses. You can't do that with an API, it's request-response. That's another decoupling point. The other consequence of event-driven architecture is that we're shifting from something that's strongly consistent to something that's eventually consistent. This is a specific tradeoff that we took, and we'll get more into that. That brings us to the pivot here, change data capture. Change data capture is a way to capture changes in a datastore, that's it.

Those two things can work in concert to create what we call a system-of-reference, that isn't the source of truth, isn't the system-of-record, but it can functionally act like one for any consuming applications. It can be designed around domains and domain-driven design, and have teams exist within those boundaries in a uniform, consistent way.

Unified Web Portal - Where Did Half the Room Go?

Let's talk first about the first version of Unified Web Portal.

Mathew: Where did half of the room go? What is the Unified Web Portal? This is a self-service website for National Grid customers. It provides key features. You can look at your bills, look at what you owe to National Grid, look at how to make payments, and maybe pick payment plans where appropriate. You can go about starting a utility service to your place of residence or business, stopping a service, or simply transferring service when you're moving houses. It also provides you the capability of looking at your energy usage patterns, including real-time energy usage, and making decisions about using energy efficiently. The word unified is interesting because National Grid is a regional utility provider. It delivers both gas and electric service.

Over a period of years, they've acquired a lot of companies and divested some, but through these acquisitions and providers, in order to maintain business continuity, we ended up with more than one system-of-record for billing. What you end up with is different portals. If you have a gas connection, you might need to go to one portal. If you have an electric connection, you might have to go to a different portal.

The first version, UWP 1.0, was an attempt to solve this problem amongst many others. The solution was to unify this data between multiple mainframes, using ETL batches from the mainframes into a unified data model in a SQL database, and from there into the SaaS platform of choice, which had its own datastore. Combining this data in an ETL led to a lot of data quality errors. Also, the batch nature of ETL meant it ran only a few times a day, which meant there were freshness issues with the data, and it could definitely be a single point of failure. ETLs are perfectly fine if you're looking at analytical data. However, in this case, we were dealing with operational data, and the batch nature of ETL was ill-suited to giving customers the most recent information about their accounts.

For the most critical use cases, applications would have to make synchronous calls to the mainframe, thus bypassing the two datastores it did ETLs to, because neither of them could guarantee freshness of data. What this resulted in is a highly elastic source tightly coupled to an inelastic source, plus a complex infrastructure under the hood. Due to the low frequency of updates, a Backend for Frontend API was required wherever data currency mattered, and for the highest-value use cases, data currency was required. The drawback with the BFF was that it was tightly coupled to the mainframe, so it inherited the mainframe's scaling concerns. It was also tightly coupled to the web use cases, making it very hard to reuse. An API platform was introduced. This provided some amount of decoupling from the web platform, but at the same time, it was using synchronous processes to achieve distributed transactions.

Roberts: That API integration platform ended up hosting many dozens of Backend for Frontends.

Mathew: This is an example of synchronous connections to the mainframes. The mainframes could be overwhelmed by a highly elastic source like the web, the result could be failures in the datastores, and sometimes those failures could result in all consuming applications failing. That's why half of the room left.

Roberts: That's one of the unintentional consequences of automation.

Mathew: Our solution also created a very siloed team structure. It fit with the architecture, where we had teams aligned to very specific skillsets. We had four siloed teams. This required a high amount of coordination and effort to release a single customer user experience. While the Unified Web Portal 1.0 achieved some of what it had set out to do, the way it was architected meant failures and interruptions to the portal, and this increased call center volumes. Customer experience and satisfaction were degraded as a result. Whether in case of failures or even to release new features, there was a high lead time due to organizational silos.

Roberts: That's where our story starts. It was obvious that things had to change. There were a lot of points of failure and points where customers could have a degraded experience. Looking at just the broad goals here, we're focusing on the technical goals on the left. Again, I said this is a story of decoupling, and that's really the name of the game. That also ties into the fact that we had lots of dependencies. That's the siloing that Sonia described, different teams delivering in different cadences with different ways of working.

Then there's also those third-party solutions we talked about, which come with their own set of requirements to integrate with. Then the third thing is, by doing this, we create an empowered engineering organization, because they're owning the end-to-end slice of all that work, despite the fact that they're ultimately dealing with this mainframe underneath. It turns out that we can make it so that that's not the most important thing. The business goals flow down from that. The theory is that we'll succeed on all these points: reduce our call center volumes, get rid of the licensing costs, and improve customer satisfaction and stability.

This is a bird's-eye view of the architecture. On the left there, we have the mainframe. We have change data capture configured on that mainframe, as that green pipe. Again, that's the core, that's the foundation that makes it possible. From the change data capture, we have a system that emits events into what are essentially Kafka topics. Technically, it's on Azure Event Hubs, but it uses the API, so same difference. The technology here I want to point out is not necessarily important. It's the strategies and the patterns that we use. You could replace Azure with AWS. Many things you can replace. Those topics flow into a set of background services and those background services populate a document database. All this is in the cloud. That document database then turns around and is hooked up to a series of APIs that can expose those domain objects through an API gateway and serve the applications, like web, or mobile, or a chatbot.
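As a rough sketch of what one of those background services might look like, assuming a Kafka-compatible client (kafkajs) and a document store with a MongoDB-style API (the topic, field, collection, and broker names here are invented for illustration, not the actual ones):

// Hypothetical names throughout; the real topic, database, and collection names differ.
import { Kafka } from "kafkajs";
import { MongoClient } from "mongodb";

const kafka = new Kafka({ clientId: "billing-account-processor", brokers: ["eventhubs-kafka-endpoint:9093"] });
const consumer = kafka.consumer({ groupId: "billing-account-processor" });
const mongo = new MongoClient("mongodb://localhost:27017");

async function run() {
  await mongo.connect();
  const accounts = mongo.db("system_of_reference").collection("billingAccounts");

  await consumer.connect();
  await consumer.subscribe({ topic: "cdc.billing-account", fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      // A CDC event carries the changed row; map it to the domain shape and upsert.
      const change = JSON.parse(message.value.toString());
      const entity = {
        accountId: change.ACCT_NO,        // cryptic source field mapped to a domain field
        balance: Number(change.CURR_BAL),
        updatedAt: new Date(),
      };
      await accounts.replaceOne({ accountId: entity.accountId }, entity, { upsert: true });
    },
  });
}

run().catch(console.error);

Because every change lands as an upsert keyed by the entity's identity, replaying the topic is safe and the document always reflects the latest state seen from the source system.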

Domain-Driven Design, with GraphQL

Let's talk about domain-driven design. Notice here I say with GraphQL, and I'll get into why that is. Domain-driven design is a methodology that focuses on modeling data and events and establishes a common language to design systems. The core of domain-driven design is the first D, domain. Let's talk about surfacing the data. This is just a broad stroke of the process here. One of the ways to design domain-driven design systems, is start with an activity called event storming. It's a workshopping activity that's used to identify some of the key constructs and artifacts that domain-driven design uses. Those are things like on the second line. Bounded context, we're going to talk about a lot. That has to do with a solution space, basically, that ties together various types of data. Those types of data being our entities.

In the next slide, I'm going to talk about some of the definitions in a little bit more detail, but for now, just add that to the queue. Commands are things that change. Events are things that happen. Once we know all that, again, the first two lines have happened, ideally, without any reference to a mainframe. We don't know what's in there. We just know that these are the pieces of data that are needed to make the portal work. Having done that, we started with a fresh canvas, and we then mapped the source to the target, and that's where the real fun begins. We have that mapping. On the top side, we can create application-neutral APIs to deliver experiences to consumers.

One of the things, again, I want to talk about here is these two terms. Some people may have a correct, incorrect, doesn't really matter, a conception of a bounded context being one-to-one with a microservice, and that's not true in our case. That's not how we think of it. What we consider a bounded context is much closer to the original definition as defined in Eric Evans' book. Here, as you can see, it's the delimited applicability of a particular model or data, and it gives team members, here's another through-point, teams, a clear and shared understanding of what has to be consistent and what can develop independently.

Again, consistency is going to be very important, and developing independently is also very important, because that speaks to our decoupling. Within bounded contexts, we represent entities. An entity is an object fundamentally defined not by its attributes, but by a thread of continuity and identity. I'm going to reiterate that continuity and identity are of utmost importance here. I'm going to build up a little bit about what we mean in our bounded context, and I want to start very simple. I told you what an entity is. You can imagine what an operation is, but it's a request, pay my bill. It's something that's going to cause, very likely, those entities to change. It's going to cause events in our system.

At a fundamental level, this is how we get data from our mainframe, or any legacy system, into a bounded context. There's change data capture, which we have as a black box here for now, intentionally, I don't want to get into that yet. It's data into an API, and that API has a database. In our case, it's a document collection, where it just puts them in there, and they're available in near real time, from when the change happens on the source system, for any application to query and make available to somebody on the portal.

I'm going to talk about how GraphQL fits into the picture. The main benefit for us is allowing us to avoid the proliferation of Backends for Frontends. You may recall that earlier, we talked about how we had this API integration platform. It was very big and unwieldy. It had many pre-prescribed layers, by virtue of being an out-of-the-box solution, and there were many dozens of Backend for Frontends designed there. One of those silos we talked about. To avoid that, that's where GraphQL comes in. I'm just going to talk specifically about the over-fetching and under-fetching problem, which is what GraphQL helps to solve. We knew that the system would be utilized by other applications. With GraphQL, we have a node, which corresponds to one of those bounded contexts in the graph, and a way to compose those nodes together into a super-graph. Why do we need to do that? The most naive implementation would look something like this. I have applications going through a gateway.

I have four APIs, those boxes there in the middle, that have things you might expect: my username, my email, payments that I've made. I can write an application that will call those endpoints and do something useful with the data for a user. What can end up happening, as we'll see, is if I don't solve a particular class of problems, I could get myself into trouble, which is why GraphQL is important here. This is an example of a type in GraphQL. It's, again, nothing to write home about: it's first name, last name, so on and so forth. I can write a query with GraphQL on the left there, and look up all the fields that are defined for a user. I get my information back.

Now, here, you could envision a scenario where I don't want all those fields, I only want two of them. If I'm a REST API, I don't have that choice. I just have to get all the data. GraphQL does not have that limitation. With REST, I would instead have to write a Backend for Frontend, which would mean either starting with a Backend for Frontend, which would misalign with the domain-driven nature of the system, or wrapping my domain-oriented APIs around a Backend for Frontend, or vice versa, which increases complexity and reduces maintainability, also not desired. GraphQL allows me to get around that.
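As a minimal sketch of that field selection, using the reference graphql-js library (the User fields and values are illustrative, not the real schema):

import { graphql, buildSchema } from "graphql";

// An illustrative schema; the real User type has more fields than this.
const schema = buildSchema(`
  type User {
    firstName: String
    lastName: String
    email: String
    phone: String
  }
  type Query {
    user: User
  }
`);

const rootValue = {
  user: () => ({ firstName: "Ada", lastName: "Lovelace", email: "ada@example.com", phone: "555-0100" }),
};

// The client asks for exactly the two fields it needs, and only those come back.
graphql({ schema, source: "{ user { firstName email } }", rootValue })
  .then((result) => console.log(result.data));
// logs { user: { firstName: 'Ada', email: 'ada@example.com' } }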

For those of you who are not familiar with GraphQL, you might be thinking that, ok, I'm writing this query, but when I call the API, is the backend really going to write a SQL query that pulls all the fields regardless, and I only get the benefit half the way? No, that's not true. GraphQL is written in such a way that each of those fields defines what's called a resolver. I could, for instance, write a resolver for my user type that says, get first name from a flat file and get last name from a document store. I would never do that, but you could do that. That's an example of the flexibility that would give you.
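A sketch of what per-field resolvers look like, again with graphql-js; the two in-memory maps stand in for the flat file and the document store in that example:

import { GraphQLObjectType, GraphQLSchema, GraphQLString, graphql } from "graphql";

// Stand-in data sources: imagine one is a flat file, the other a document store.
const flatFile: Record<string, string> = { "42": "Ada" };
const documentStore: Record<string, string> = { "42": "Lovelace" };

// Each field declares its own resolver, so GraphQL only runs the resolvers
// for the fields the query actually selects.
const UserType = new GraphQLObjectType({
  name: "User",
  fields: {
    firstName: { type: GraphQLString, resolve: (user) => flatFile[user.id] },
    lastName: { type: GraphQLString, resolve: (user) => documentStore[user.id] },
  },
});

const schema = new GraphQLSchema({
  query: new GraphQLObjectType({
    name: "Query",
    fields: {
      user: { type: UserType, resolve: () => ({ id: "42" }) },
    },
  }),
});

graphql({ schema, source: "{ user { firstName } }" })
  .then((result) => console.log(result.data)); // only the firstName resolver runs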

If that solves the over-fetching problem, I still have the under-fetching problem. I need to get all of the user's accounts. Because my APIs are shaped like the domains, I need to connect data from different ones. What I could do is force my developers who are writing the frontend to write code that looks like this. I get a user. I get the user's accounts. I construct a local version of the model. Then I loop through all the accounts that I was told about from my user, and get the details of that. Then it can get worse. Now I need payments, and I'm in a nested loop. If I have to do this a lot, which, because it's domain-driven, and the things are disconnected, you would have to do this a lot.
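That frontend code would look roughly like this sketch; the REST paths and field names are invented for illustration:

// The shape of client code you end up writing when each REST endpoint maps to one
// domain: fetch, then loop, then fetch again inside the loop.
async function loadUserPayments(userId: string) {
  const user = await (await fetch(`/api/users/${userId}`)).json();
  const model: any = { ...user, accounts: [] };

  for (const accountId of user.accountIds) {
    const account = await (await fetch(`/api/accounts/${accountId}`)).json();
    account.payments = [];
    for (const paymentId of account.paymentIds) {   // and now a nested loop
      account.payments.push(await (await fetch(`/api/payments/${paymentId}`)).json());
    }
    model.accounts.push(account);
  }
  return model;
}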

You think there's a better way? Yes. It's called stitching, in our case. There's actually two ways to do this in GraphQL, broadly speaking. For us, it's defining relationships between those different nodes, and this is what composes our super-graph. A user knows about account link, and account link knows about billing account. Then I can write a query like this on my frontend code, in my JavaScript, that says, give me all my user payments. Here's the query that gets the specific data I need. Now I'm writing one query per use case. It's a lot cleaner. It makes your frontend application code not have to worry about the relationships. It's all encapsulated within the graph itself. This is what we call the customer graph. This is an example of graph traversal. This is a bigger picture of the different types of APIs that are the nodes in our graph. In the example of the query I showed you on the last slide, what's happening is it traverses the graph through that blue path. It hits user.

Then user is going to go ahead and delegate the second level information to account link, account link to billing account, to payment. The application doesn't know any of this. It just knows that it has a query that it asked for this data. I could write a different query. I want to get my bank accounts. It takes that path. The neat thing about this too is that I don't have to start in user. I might write a component for React that says, here's a component that shows billing account information, without any user context. I could just hit billing accounts and get the information related to that billing account that I care about. That enables teams to, again, focus on the data that's relevant to the use cases they're implementing, and not so much about the technical details of what's underneath it, because we've created a system-of-reference, and not so much about the relationships of that data, more about how to use that data.
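Concretely, the single stitched query for the user-payments use case might look something like this sketch; the type and field names are illustrative rather than the actual customer graph schema:

// With the relationships stitched into the graph, the same use case becomes one query.
const USER_PAYMENTS_QUERY = `
  query UserPayments {
    user {
      firstName
      accountLinks {
        billingAccount {
          accountNumber
          payments {
            amount
            status
          }
        }
      }
    }
  }
`;

// Any GraphQL client will do; a plain POST to the gateway is enough.
async function loadUserPayments(): Promise<unknown> {
  const response = await fetch("/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: USER_PAYMENTS_QUERY }),
  });
  return (await response.json()).data;
}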

Team Topologies - Organizational Design is Architecture

That's a lot about GraphQL, domain-driven design. Let's talk about now how those domains fit with teams.

Mathew: Earlier, when we talked about the previous solution, one of the things that we talked about is silos in the teams and how the teams were structured. Organizational design is a part of the architecture. Successful software delivery is as much about scaling teams as it is about scaling hardware and software. Without the ability to scale teams, you are limited in how fast you can deliver. Creating a functional organization structure should go hand-in-hand with thinking about your technical architecture, because this is what encourages ownership and innovation, and it also provides teams with the necessary autonomy, guardrails, and tools that, once again, help with quality and speedy delivery.

With team topologies, we define three types of teams. The first is a stream-aligned team. These teams can focus on the specific business functionality that creates customer value. Enablement teams focus on the technical, operational functionality and frameworks that enable the other teams to operate with less friction and at speed. The third is the complicated subsystem team. This is a team that focuses on a very narrow, complicated part of the system. It's not vertically aligned, but it provides very critical technical capabilities that are needed by the other teams. This is how that team structure looks in the UWP 2.0 world. If you remember the features that we talked about when I was describing the portal, the ability to view and pay bills, start and stop service, or move, these are all aligned with the horizontal teams, which are the stream-aligned teams.

The vertical teams here are the enablement teams, which are the observability and the DevOps teams. These teams provide frameworks for your observability. They help with your CI/CD pipelines and with keeping up the infrastructure. They provide frameworks for the other teams, the stream-aligned teams or the complicated subsystem teams, to take up and implement in the appropriate way for their own teams. Here we establish the team boundaries as system boundaries. When we talk about complicated subsystem teams, the mainframe is a dependency for us, and understanding the mainframes is not trivial at all.

The architecture decoupled that dependency at the system level, but by aligning the teams the way we did, we have created an entire team that helps decouple that dependency as well. This particular team is able to focus on the integration points and the challenges of integration with the mainframe, while the rest of the teams can focus on customer value-aligned use cases in their subdomains without having to worry about the intricate details of how the mainframes work. This is how our team structures work with our subgraphs and use cases.

Each of these subgraphs defines use cases, which then go on to become the subdomains, which are our team boundaries here. You see the enterprise integration team, which removes that complexity from each of these teams. There are also some bounded contexts which are common between teams. We could make them their own subdomains, but because of the way the teams are structured, we have chosen to have one of the teams take those overlapping bounded contexts into its subdomain.

Roberts: The important point is that there's clear owners for these portions of functionality.

Change Data Capture - Creating a "System-of-Reference"

Mathew: How do we decouple ourselves from the mainframe? By creating a system-of-reference, and change data capture is the primary way of doing it. What is a system-of-reference, as Jason alluded to earlier? We are not the system of truth. The system of truth is still the system-of-record, which is the mainframes. The system-of-reference is not a proxy of the system-of-record either, and it isn't dependent on the schema constraints of the system-of-record. It is a near real-time replication of the system-of-record. We synchronize it through CDC events, and we model it using domain-driven design so that it is decoupled from the upstream structure and semantics, and other teams do not have to worry about what is in the system-of-record.

In the simplest form, you've seen this diagram earlier as well. It's change data capture. We have the APIs and the entity database within the bounded context, and the application querying those. However, that's not all of it. Opening up that black box: from change data capture, we publish topics to Kafka, and then the data processors provide an anti-corruption layer, the separation and the decoupling in this case. The mapping between what is in the system-of-record and what will go into the system-of-reference, which is then consumed by all consuming applications, is where that separation happens.

The left-hand side of this mapping talks to the mainframes. To understand any of the fields in a mainframe table, you need to have a deep understanding of what the mainframes are and how the system works. It's not trivial for someone new, or someone without years of that understanding, to come and look at this. On the other side, you look at the API naming convention, and it's very closely aligned to our functional domains, so you can look at each field, and these are meaningful names. As we said, the system-of-reference is not a proxy to the system-of-record, so this is only part of what is in the system-of-record, and it covers what we need for the system-of-reference to operate.
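A sketch of the kind of mapping a data processor performs; the mainframe field names here are invented stand-ins for the real, far more cryptic ones:

// Illustrative source shape: what a CDC event from the mainframe might carry.
interface MainframeBillingRecord {
  ACCT_NO: string;
  BILL_AMT_CUR: string;   // numeric value rendered as a string by CDC
  BILL_DT: string;        // yyyymmdd
  STAT_CD: string;        // single-letter status code
}

// Illustrative target shape: the business-friendly entity in the system-of-reference.
interface BillingAccountEntity {
  accountId: string;
  currentBalance: number;
  lastBilledOn: Date;
  status: "ACTIVE" | "CLOSED" | "UNKNOWN";
}

// The anti-corruption mapping: cryptic codes in, meaningful domain names out.
function toBillingAccount(record: MainframeBillingRecord): BillingAccountEntity {
  return {
    accountId: record.ACCT_NO,
    currentBalance: Number(record.BILL_AMT_CUR),
    lastBilledOn: new Date(
      `${record.BILL_DT.slice(0, 4)}-${record.BILL_DT.slice(4, 6)}-${record.BILL_DT.slice(6, 8)}`
    ),
    status: record.STAT_CD === "A" ? "ACTIVE" : record.STAT_CD === "C" ? "CLOSED" : "UNKNOWN",
  };
}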

Roberts: Another way to look at one of the goals of these efforts is to reduce the cognitive load of any of those individual teams, so that they don't have to know implicitly what all those things mean.

Mathew: How do we achieve scaling? We have horizontal scaling through K8s, and we have resilient messaging with Kafka and Event Hubs. We can scale the topics via partitions, to accommodate any kind of change volume that we have, and the same with the data processors. In theory, we can scale up, up, and up, as much as we need.

Roberts: Not just theory, we have.

Mathew: Why do we need to be able to scale? These numbers here are indicative of what we process on a daily basis in the system that we are running today.

Roberts: Except for the top one, that's total. The rest are daily.

Event-Driven Architecture

We've talked about change data capture and how it powers the system-of-reference for us, and in turn, that powers the applications. Now let's talk about the rest of the eventing business within the system. I'm going to start with the less complex part of this, which is data events in the system-of-reference, and by that, I mean within it. The key point here is that I don't mean the change data capture. I'm actually talking about something different now. I'm talking about once change data capture has persisted information inside of our system: because each of those bounded contexts owns its data and is the semi-authoritative source of it, there might be other bounded contexts that are interested in knowing when that changed. It's bounded context to bounded context so that there can be coordination around it. This is what that looks like. We're building out our view of what a bounded context is, and we're adding another chunk to it.

On the top there, I have a bounded context, which has its API and its entity database. Now I have an event processor. This is the outbox pattern. It's just another form of change data capture. It's the same pattern. In this case, it's on a smaller scale, per bounded context. It publishes an event that says, ok, something in my domain changed, not in the source system, but in mine. Anybody else who cares about it might need to look at it and can use that data to augment the data that it owns. The important thing to note there is that the other bounded context, the consuming one, doesn't own the data that it's hearing about, it just can use that data to do more interesting things. What kind of things? Performance is one. There might be a use case where I have a query that's very frequently written that always calls from one node to the other. In lieu of always making that call, it could cache that data in a secondary cache as a micro-optimization.
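A minimal sketch of the publishing side of that outbox pattern, assuming the API has already written an outbox document in the same transaction as the entity change (topic, database, and field names are illustrative):

import { Kafka } from "kafkajs";
import { MongoClient } from "mongodb";

const kafka = new Kafka({ clientId: "billing-outbox-publisher", brokers: ["broker:9092"] });
const producer = kafka.producer();
const mongo = new MongoClient("mongodb://localhost:27017");

async function publishPendingEvents() {
  const outbox = mongo.db("billing").collection("outbox");

  // Pick up domain events that have been persisted but not yet published.
  const pending = await outbox.find({ publishedAt: null }).limit(100).toArray();

  for (const event of pending) {
    await producer.send({
      topic: "billing-account.changed",   // an event in the language of the domain
      messages: [{ key: String(event.accountId), value: JSON.stringify(event.payload) }],
    });
    await outbox.updateOne({ _id: event._id }, { $set: { publishedAt: new Date() } });
  }
}

async function run() {
  await mongo.connect();
  await producer.connect();
  setInterval(() => publishPendingEvents().catch(console.error), 1000); // simple polling loop
}

run().catch(console.error);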

Another use case: there is data inside of the mainframe that amounts to what we call a derived value. It's never persisted. There's business logic there that can do the calculation of it. It's basically forcing us to fall back to that synchronous model that we were trying to avoid with 1.0. One way around that is this pattern, because we can have background services that can pre-compute it. The third reason is, operations may result in changes across multiple bounded contexts. That's not unexpected. There are events that can cascade to different types of data. That's totally something that you would expect in a complicated business application.

Now we get to the trickier part, operations or commands. This essentially represents a request that we are delegating on behalf of the user to the mainframe. Again, I'm going to go back to the example of making a payment. It should sound simple, but it's not super simple. What we have to do is make sure we track consistently our version of that request and the version that that request takes through the mainframe. What we do from a high level is we use the saga pattern, which is orchestration. We have state machines which define our distributed workflows, and that state is owned by a single component which is called the orchestrator.

Each workflow uses a pair of topics: one for the state change events it emits, as the authority that says, ok, the state of this workflow has changed, and one for reactions to those state events. That's the thing that decouples things that know about the mainframe from things that know about the business domain. Then there are components listening to those state change events which are part of the anti-corruption layer that was mentioned previously.

Let's look at how we model those workflows with our state machines. Again, going back to the example of making a payment. In the purple box there I have my orchestrator, and the workflow state machine looks like this. I'm starting in the requested state, which means that a user has clicked the button to submit their payment. That's going to cause an event to be emitted up here, right there, operation requested, a state change. That will go down to another component who's going to send back a response that says it succeeded or it failed. The state will transition to either scheduled or couldn't schedule. That status can be reported back to the applications through the API. That's not a complicated state machine.
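That make-payment state machine can be sketched in a few lines; the state and event names here are illustrative:

type PaymentState = "Requested" | "Scheduled" | "CouldNotSchedule";

interface PaymentWorkflow {
  paymentId: string;
  state: PaymentState;
}

type OutcomeEvent =
  | { type: "SchedulingSucceeded"; paymentId: string }
  | { type: "SchedulingFailed"; paymentId: string; reason: string };

// The orchestrator owns the state and is the only component allowed to transition it.
function transition(workflow: PaymentWorkflow, event: OutcomeEvent): PaymentWorkflow {
  if (workflow.state !== "Requested") return workflow; // ignore late or duplicate outcomes
  if (event.type === "SchedulingSucceeded") return { ...workflow, state: "Scheduled" };
  return { ...workflow, state: "CouldNotSchedule" };
}

// e.g. transition({ paymentId: "p-1", state: "Requested" }, { type: "SchedulingSucceeded", paymentId: "p-1" })
// yields { paymentId: "p-1", state: "Scheduled" }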

A little bit more about workflows and how modeling via workflows can be useful, and one of the reasons is that it can be composed. There's an example of a workflow use case on the portal where you can at the same time make a payment and add new bank account information to make that payment with. What you can do is write a workflow that composes those two. In its starting state it triggers the add bank workflow which can execute and also be developed and tested independently. Then, based on the success or failure of that, either fail the whole thing or do the next chunk which is make the payment, which is the one we already looked at.
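A sketch of that composition, where the outcome of the add-bank step decides whether the payment step runs at all (names are illustrative):

type ComposedState = "AddingBankAccount" | "MakingPayment" | "Completed" | "Failed";

type SubWorkflowResult =
  | { workflow: "AddBank"; outcome: "Succeeded" | "Failed" }
  | { workflow: "MakePayment"; outcome: "Succeeded" | "Failed" };

// Each sub-workflow can be developed and tested on its own; the composed workflow
// only reacts to their outcomes.
function composedTransition(state: ComposedState, result: SubWorkflowResult): ComposedState {
  if (state === "AddingBankAccount" && result.workflow === "AddBank") {
    return result.outcome === "Succeeded" ? "MakingPayment" : "Failed";
  }
  if (state === "MakingPayment" && result.workflow === "MakePayment") {
    return result.outcome === "Succeeded" ? "Completed" : "Failed";
  }
  return state; // ignore results that don't apply to the current step
}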

I can also take other actions for a given state transition. I may be interested in sending an email to the customer if the payment they tried to submit failed. Probably useful. Here's where it gets fun. This is what I call the anti-corruption saga. You see here, down at the bottom, we have our component which we call a context translator. That's because it's taking data from a bounded context and translating it to the API of the mainframe. It's then going to return via that outcome topic that I mentioned before.

Again, in our simple use case, success or failure. The interesting thing here is not only that this again is functioning as an anti-corruption layer and that these events are in the language of our domain, but also that there are going to be two signals that come back in this process. There's going to be the response that says, ok, here's the outcome of the operation. We're also going to see the change to the data come up through our change data capture into our entity database.
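A sketch of what a context translator might look like, assuming a Kafka-compatible client; the topics, mainframe endpoint, and field names are all invented for illustration:

import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "payment-context-translator", brokers: ["broker:9092"] });
const consumer = kafka.consumer({ groupId: "payment-context-translator" });
const producer = kafka.producer();

async function run() {
  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: "payment-workflow.state-changed" });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const event = JSON.parse(message.value!.toString());
      if (event.state !== "Requested") return; // only act on new requests

      // Translate the domain request into the mainframe-facing API call.
      const response = await fetch("https://mainframe-gateway.example/payments", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ ACCT_NO: event.accountId, PAY_AMT: event.amount }),
      });

      // Report the outcome back in the language of the domain.
      await producer.send({
        topic: "payment-workflow.outcomes",
        messages: [{
          key: event.paymentId,
          value: JSON.stringify({
            paymentId: event.paymentId,
            type: response.ok ? "SchedulingSucceeded" : "SchedulingFailed",
          }),
        }],
      });
    },
  });
}

run().catch(console.error);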

The reference data, our entity, will be eventually consistent. Our state machine here will be eventually consistent because it'll learn about the outcome. Those two things are different. It's essentially a race condition. This could land first, that could land first. What we do here is we have an aggregate that can reconcile those two things and present a unified state for that request to an application. I'll show you an example of that. This is what the statuses of payments look like on the mainframe. It starts out in a scheduled state. It can go to pending, then processing. Don't ask me what the difference between those two is. I never learned. This is one of the challenges with mainframes. It can also be canceled. Eventually, it reaches one of its terminal states, paid or canceled.

Again, this is our reference data. What we have to do is build an aggregate state. We have our workflow states up here in blue, and we have the reference state. This aggregate will look at, based on the availability of these things, what the true status of the delegated request is for the user. The important point here is the continuity that this brings. We talked about how entities are defined by identity and continuity. The identity makes sure that the reference aligns with the command, and the continuity makes sure that the status aligns with the customer's understanding of what they asked for. That's basically a bounded context.
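A sketch of that aggregate, combining whichever of the two signals has arrived; the mainframe statuses follow the ones just described, while the precedence rules and names are illustrative:

type WorkflowState = "Requested" | "Scheduled" | "CouldNotSchedule";
type ReferenceState = "SCHEDULED" | "PENDING" | "PROCESSING" | "PAID" | "CANCELED";

interface PaymentStatusView {
  paymentId: string;
  status: string;
}

function aggregateStatus(
  paymentId: string,
  workflow: WorkflowState | undefined,
  reference: ReferenceState | undefined
): PaymentStatusView {
  // Once the mainframe's copy has replicated through CDC, it is the stronger signal.
  if (reference === "PAID") return { paymentId, status: "Paid" };
  if (reference === "CANCELED") return { paymentId, status: "Canceled" };
  if (reference !== undefined) return { paymentId, status: "Scheduled" };

  // Otherwise fall back to what the workflow orchestrator knows so far.
  if (workflow === "CouldNotSchedule") return { paymentId, status: "Failed" };
  if (workflow === "Scheduled") return { paymentId, status: "Scheduled" };
  return { paymentId, status: "Submitted" };
}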

Again, not a single microservice, but a set of services that can collaborate and coordinate to present a consistent view of the system state, and knows how to manage the requests to some other system. That system doesn't always have to be a mainframe. You could employ a Strangler Fig pattern over time here, and eventually get rid of the mainframe. That's a much larger goal. This is how you would do that, is by taking these chunks out, abstracting them away from the mainframe, and then replicating that functionality in a way that, again, isn't tied to the constraints of those source systems.

Challenges

We'll talk a little bit about some of the challenges there, because there are a lot of these too. Event-driven architecture is hard. People don't understand it. It takes a paradigm shift most of the time. Unless you've already been doing this for a few years, grokking it is hard. While this may be obvious, it's extra painful if you don't design for observability first with event-driven architecture, because you have a lot of things going on, and a lot of ways things can go haywire. This is one of the drawbacks that's more specific to the fact that this is a mainframe, rather than any legacy system, let's say. That's because the mainframe does batch processing. The convenient example here is, every so often, it generates a bunch of bills for the customers. Pretty standard business operation. Because we're using change data capture, that means there's a huge chunk of bills that have just come into our data stream. What that ends up looking like is that my pipeline is now backed up, represented by that fat red arrow.

If I happen to have an orchestrator that's waiting on one of those statuses to change, it's now backed up because of the lag created by the batch. That's a thing to keep in mind when designing a system around batch. Customer graphs: I mentioned before that there are two general high-level approaches to creating a super-graph in GraphQL. We use stitching, but the other option is federation. There are tradeoffs there. In a non-federated super-graph, which is what we have, nodes in the graph with an outward relationship need to reference a common schema in order to provide those queries that I showed you back at the beginning, the example where you get your payments for a user.

In order for it to support and interpret all that with that whole query, it needs to know about any reachable node that it can get to. You can imagine with that graph that I showed you earlier that that can be a big shared schema. It is. It's not fun to maintain. It can lead to drift. This has to be a versioned component. It does give you some performance scenarios where calling node-to-node is advantageous. The other way to do it is federation. With federation, essentially your schema is computed at runtime, whereas with stitching, it's computed at build time.

With federation, I have, in this example, a user service, and it publishes its schema to a gateway, a GraphQL gateway specifically, not just any API gateway. It knows that account link exists, but that's all. It just defines how it extends account link. That information is published up. The maximum that it's ever going to know about is one more node. There's no shared schema. The tradeoff there is that there are some query situations where it becomes more expensive, and there's more infrastructure to host, because the gateway needs a form of persistence to hold that schema at runtime. You trade some performance and infrastructure convenience for maintainability.

Mathew: We solved a lot of the team structure problems with team topologies, but there were some challenges with that as well. While the enablement teams can provide the structure that really helps the other teams with their delivery, if not very carefully managed and planned, they could become a single point of failure. We had to carefully manage and plan for that. The other aspect is, the teams are organized by vertical slice, once again, into their own domains, which helps them stay out of the concerns of other teams. However, for the cross-cutting concerns, we don't really want each team to have their own way of solving them. We don't quite have a completely satisfying answer, but what we went with is communal ownership.

Along with that, automation to keep things updated: private libraries published as packages using semantic versioning, and Dependabot in consuming repositories to automatically test and integrate minor versions. A Tuesday type of model where new minor versions are automatically published every week. That helped with some of these cross-cutting concerns. The other aspect here was, because we did not want to do a big bang release, we incrementally replaced features of the website, which is why half of the room stayed and half had to leave.

In doing that, we introduced a hybrid web architecture. We used edge routing, but this meant making sure that both systems had context awareness of each other in order to present a seamless interface to the user. This also introduced a challenge in how we release our software. For that, we introduced release trains. This helped with the release management aspect of the overall ecosystem, and at the same time, allowed the vertical teams to autonomously release features when they were ready.

Conclusion - Restating Our Mission

Mathew: We have had a lot of challenges, but to refocus on what we started out with as our mission: our technical goals were decoupling, reducing dependencies on third-party solutions, and creating an empowered organization. That was to achieve the business goals of reduced software licensing cost, reduced call center volumes, and overall customer satisfaction through a stable platform. How did we achieve this? To reiterate what we have gone through in the presentation: change data capture was a fundamental strategy to establish a system-of-reference, and we achieved that through near real-time replication. Event-driven architecture is a natural fit, and it helped with decoupling the systems, attaining message durability, and building eventually consistent systems. Team topologies maps organizational design to system design, so we were able to isolate high complexity away from business use cases and align teams with value, while enablement teams helped with acceleration. Domain-driven design brought it all together.

Roberts: This highlights some of the relationships, some of the ways that these concepts overlap with each other and complement each other. Domain-driven design gives us the business value and the business language to focus on those use cases, and then we can use team topologies as an organizational design that follows through with that, to align those streams with those domains. Team topologies also gives us the complicated subsystem team, which ties into our event-driven architecture by being a decoupling mechanism, so that boundaries are clear and the language of the exchange can be defined concretely.

The fact that we're eventually consistent through event-driven architecture and change data capture gives us the ability to make sure that data is all replicated and served as fast as possible, while removing the mainframe's constraints from our application and future applications. Really, the sky is the limit after that, because we've built a lot of points of evolution where these things can evolve on their own without any of the constraints that existed previously with those Backend for Frontends, those three-way replications, those synchronous distributed transactions, and so on. That's our point of view, and we think this is a good solution. Just to circle back on the release trains, we released every two weeks, fairly consistently. This was incremental functional progress where we delivered real change. This product has now almost entirely replaced the 1.0 product in production.

 


 

Recorded at:

Jun 03, 2025
