2021-10-15

The Stateful Microservice

I did a quick search for the term and did not find anything concrete, so I thought I might as well publicly document my thoughts.

Photo of two elephants interacting in a friendly manner, from The Scientific American: Fact or Fiction?: Elephants Never Forget


Almost everyone doing microservices today will tell you that microservices need to be stateless. In another post of mine I explain that statelessness is not an end in and of itself; it is just a means to an end. The desired end is scalability and resilience, and statelessness is just one among possibly many ways of achieving them. Furthermore, I explain that statelessness in particular performs very badly, and that from an engineering standpoint it is a very cowardly solution. For details, please see michael.gr - On Stateless Microservices.

What remains to be shown is whether there exists an alternative.

Obviously, an alternative to the stateless microservice would be a stateful microservice, so what we are about to examine here is what a stateful microservice would be like, and how it would compare to a stateless microservice.

What is a stateful microservice


A stateful microservice maintains state for the purpose of expediting the processing of incoming requests, reducing overall server load (trading memory for processing power and data storage traffic), and achieving certain things that are difficult to achieve otherwise, such as server-initiated client updates.

The state kept by a stateful microservice can include:
  • State that has been obtained from the main data store and has possibly undergone expensive transformations. The benefit of maintaining such transient state within the microservice is that the data store does not need to be re-queried, and the possibly expensive transformations do not need to be repeated, with each incoming request; the loading and processing of the data only needs to happen once when the microservice starts, and to be repeated only in response to a notification from the system's messaging backbone that the original data in the main data store has changed.
  • State that does not exist in the main data store, and does not need to, because it is of a transient nature, for example information that is only needed during a user's visit to a web site and can be discarded afterwards. This can include information necessary for maintaining a session, such as the session token, and view-related information, such as which page (or pages) of the web site the user is currently viewing. View-related information may be useful for the server to have for various reasons, for example for the purpose of sending server-initiated client updates that are specific to the web page(s) being viewed.
  • State that may eventually be entered into the main data store but has not yet been entered due to various workflow demands or optimization concerns. For example, the user may be sequentially visiting each page of a wizard workflow, and entering information on each page, but this information should not be merged into the main data store unless the user first reaches the last page of the wizard workflow and confirms their actions.
From the above it should be obvious that a stateful microservice is necessarily session-oriented, meaning that it requires a specific client to talk to. Session-agnostic stateful microservices already exist, and we do not think of them as anything special; they are microservices that implement caches, containing information that is pertinent to not just one client, but to all clients. These microservices are already scalable and resilient because a cache can be trivially duplicated to an arbitrary degree and it can also be destroyed and trivially re-created from scratch.
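
As a rough illustration of the three kinds of state listed above, here is a minimal sketch in Python. Everything in it is hypothetical: the names `SessionState` and `StatefulService`, the `load_prices` callback standing in for a query against the main data store, and the `on_data_changed` method standing in for a notification from the messaging backbone. It is meant to show the idea, not a real implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class SessionState:
    """Hypothetical transient state of one stateful microservice instance."""
    session_token: str                                            # session-related
    current_page: Optional[str] = None                            # view-related
    price_list: Dict[str, float] = field(default_factory=dict)    # derived from the main data store
    wizard_draft: Dict[str, str] = field(default_factory=dict)    # not yet in the main data store


class StatefulService:
    def __init__(self, session_token: str, load_prices: Callable[[], Dict[str, float]]):
        self._load_prices = load_prices   # assumed: an expensive query plus transformation
        self.loads = 0                    # counts how often the data store was queried
        self.state = SessionState(session_token)
        self._reload()                    # load once, when the microservice starts

    def _reload(self) -> None:
        self.state.price_list = self._load_prices()
        self.loads += 1

    def on_data_changed(self) -> None:
        """Assumed to be called when the backbone reports a change in the main data store."""
        self._reload()

    def price_of(self, item: str) -> float:
        # Served from memory; the data store is not queried per request.
        return self.state.price_list[item]

    def wizard_step(self, page: str, values: Dict[str, str]) -> None:
        # Data entered on a wizard page stays in transient state for now.
        self.state.current_page = page
        self.state.wizard_draft.update(values)

    def wizard_confirm(self, main_store: Dict[str, str]) -> None:
        # Only on confirmation is the draft merged into the main data store.
        main_store.update(self.state.wizard_draft)
        self.state.wizard_draft = {}
```

In this sketch, repeated calls to `price_of` never touch the data store, and nothing reaches `main_store` until `wizard_confirm` is called.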

We now need to show how a stateful microservice can still be called a microservice.  

In a previous post of mine I examined what a microservice really is, and I came to the conclusion that it is simply a scalable and resilient module. (See michael.gr - So, what is a Microservice, anyway?) Even if you disagree with this definition, and you regard microservices as necessarily more than that, I hope you will at least agree that the purpose of statelessness in microservices is precisely to achieve scalability and resilience, so the definition of a microservice as a scalable and resilient module can serve as a working definition for the purposes of this discussion.

So, we need to show how stateful microservices can be scalable and resilient, just as their stateless counterparts are.

Scalability in stateful microservices can be achieved by means of a session-aware load balancing gateway which routes new session requests to the least busy server, and from that moment on keeps routing requests of that same session to the same server. Under such a scenario, rebalancing of the server farm can be achieved simply by killing microservices on overloaded servers and letting the resilience mechanism described next make things right.
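
The routing rule just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Gateway` class and its methods are made up for this post, "least busy" is approximated by the number of sessions assigned to each server, and load is never decremented when sessions end.

```python
class Gateway:
    """Hypothetical session-aware load balancer (in-memory sketch)."""

    def __init__(self, servers):
        self.load = {name: 0 for name in servers}   # server -> number of assigned sessions
        self.routes = {}                            # session id -> server

    def route(self, session_id):
        server = self.routes.get(session_id)
        # Unknown session, or its server is gone: pick the least busy server.
        if server is None or server not in self.load:
            server = min(self.load, key=self.load.get)
            self.routes[session_id] = server
            self.load[server] += 1
        return server

    def add_server(self, name):
        # A new server starts idle, so it attracts the next new sessions.
        self.load[name] = 0

    def remove_server(self, name):
        # Sessions of a removed server are re-routed on their next request.
        del self.load[name]
```

Note that a session whose server has disappeared is simply treated like a new session; making that safe is the job of the resilience mechanism.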

Resilience can be achieved by having each instance of a stateful microservice continuously persist its transient state, in an efficient manner, into a high-performance backup store which is accessible by all servers in the farm. Thus, if a microservice unexpectedly ceases to exist, it can be reconstructed from that backup on any other server. The trick, as we shall see, is that the backup is taken very efficiently, and in the event that the microservice needs to be reconstructed, the restoration from the backup is also done very efficiently.
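
To make the backup idea concrete, here is a minimal sketch in Python. The assumptions are mine: `BackupStore` is a plain dictionary standing in for a real persistent key-value store, `SessionService` and its shopping-cart state are invented for the example, and Python's `pickle` module is used as the binary serialization format. A real system would likely persist asynchronously and would pick its serialization format with more care.

```python
import pickle


class BackupStore:
    """Stand-in for a persistent key-value store shared by all servers."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, blob):
        self._blobs[key] = blob

    def get(self, key):
        return self._blobs.get(key)


class SessionService:
    """Hypothetical stateful microservice instance bound to one session."""

    def __init__(self, session_id, store):
        self.session_id = session_id
        self.store = store
        # On startup, restore from the backup if one exists for this session.
        blob = store.get(session_id)
        self.state = pickle.loads(blob) if blob is not None else {"cart": []}

    def handle(self, item):
        # Process the request, changing the transient state...
        self.state["cart"].append(item)
        # ...then back up the entire state as one blob, keyed by session id.
        self.store.put(self.session_id, pickle.dumps(self.state))
        return len(self.state["cart"])
```

If an instance disappears, constructing a new `SessionService` with the same session id and the same store picks up where the old one left off.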

In more detail, it works as follows:
  • When a client initially connects to the server farm, no session has been established yet, so the first request that it sends is sessionless.
  • The sessionless request arrives at a load-balancing gateway, which routes it to the least busy server in the farm. This mostly takes care of scalability, since we can always add more servers, which will initially be idle, but as requests for new sessions keep arriving, they are routed to the idle servers instead of the busy ones, so over time, the load distribution evens out.
  • The server that receives the sessionless request creates a new instance of a stateful microservice to handle that request, and the session is established between that microservice and the client.
  • From that moment on, any further incoming requests for that same session are routed by the session-aware load-balancing gateway to the same server, and the server delegates them to the same instance of the stateful microservice. (Alternatively, the microservice and the client may negotiate a direct persistent connection between the two, thus bypassing any middlemen from that moment on.)
  • The newly spawned stateful microservice registers with the messaging backbone of the system to receive notifications about system-wide events, so as to be able to keep its state always up to date.
  • The newly spawned stateful microservice loads whatever state it is going to need, and keeps that state in memory.
  • The microservice processes the request and sends back a response.
    • Possibly updating the main data store with information that must always be globally available and up to date, and causing system-wide notifications about these changes to be issued.
    • Possibly also changing its own transient state.
  • If the processing of the request resulted in any change in the transient state of the microservice:
    • The microservice serializes the entirety of its state into a binary blob.
    • The blob is written into a persistent key-value store, using the session id as the key.
      This persistent key-value store is used as a backup, meaning that it is written often, but it is never read unless something bad happens.
  • Continuous persistence of stateful microservices is not expected to pose a performance problem, because:
    • Serialization to and from a binary format performs much better than general-purpose serialization in textual markup like JSON or XML.
    • The size of the blob is expected to be relatively small (on the order of kilobytes).
    • Key-value stores tend to have very high performance characteristics.
    • The backup store can be physically separate from the main data store (even on a different network), thus avoiding contention.
    • The act of serializing an in-memory data structure into a single in-memory blob and then sending that blob as one piece into persistent storage is bound to perform far better than a series of operations to update a structured data store. (For one thing, there are no index updates.)
    • Persisting the blob can be done asynchronously and in parallel to the sending of responses, so it does not affect client-perceived latency.
  • For as long as the session does not expire, the stateful microservice can remain alive, continuing to serve requests efficiently, taking advantage of the transient state that it contains and keeps up-to-date. Contrast this with the stateless microservice approach, which requires that any request can be handled by any server, therefore each microservice must contain no state at all:
    • The processing of each request begins with zero knowledge of the state of the system, so persistent storage must always be queried to obtain state.
    • These queries represent overhead, and this overhead must be suffered in full before the request can be serviced, thus manifesting as latency to the client.
    • The results of these queries may not be cached, because they may at any moment be rendered out-of-date by any other microservice in the system.
  • An instance of a stateful microservice may prematurely cease to exist due to a number of reasons:
    • The microservice may be terminated on demand in order to rebalance the server farm.
    • The server hosting the microservice may become unavailable due to hardware failure.
    • The microservice may fall victim to the whim of the chaos monkey.
  • If for whatever reason a microservice ceases to exist, the load-balancing gateway discovers this either on its own, or when the next request arrives from the client.
    • If the gateway discovers it on its own:
      • The gateway finds the least busy server in the farm and requests it to spawn a new microservice instance for the same session that the old instance was handling.
    • If the gateway discovers it when the next request arrives:
      • The gateway does what it always does for incoming requests with no known microservice to handle them: it routes the request to the least busy server in the farm.
      • The server that receives the request sees that there is no microservice to handle requests for that session, so it creates a new one.
    • In either case:
      • The newly instantiated microservice checks, during initialization, whether the key-value store contains a backup for the current session, and discovers that it does.
      • The microservice restores its state from the backup key-value store.
      • Operation continues from that moment on as if nothing had happened.
    • Between the moment in time that a certain stateful microservice instance prematurely ceases to exist, and the moment in time that a new incarnation of that microservice is ready for showtime on a freshly assigned server, some events from the messaging backbone may be lost. To avoid inconsistencies in the state of the microservice, we must utilize a messaging backbone which is capable of replaying events. For example, if we use Kafka, then the stateful microservice can make sure to include among its persistent state what is known in Kafka terminology as the "consumer offset". Thus, when the microservice gets reconstructed, it asks Kafka for events starting at that offset, so Kafka replays any missed events before it starts sending new ones. Thus, we ensure that the state of the microservice is always up to date, even in the case of termination and reconstruction.
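
The replay step at the end of the list above can be sketched as follows. To keep the example self-contained it does not use a real Kafka client: `EventLog` is a hypothetical in-memory stand-in for a replayable messaging backbone, the `next_offset` field plays the role of the consumer offset, and the `Service` class with its running total is invented for illustration. The point it tries to show is that the offset is stored in the same blob as the rest of the state, so the two cannot drift apart.

```python
import pickle


class EventLog:
    """In-memory stand-in for a messaging backbone that can replay events."""

    def __init__(self):
        self._events = []

    def publish(self, event):
        self._events.append(event)

    def read_from(self, offset):
        # Returns (offset, event) pairs starting at the given offset.
        return list(enumerate(self._events))[offset:]


class Service:
    """Hypothetical stateful microservice that tracks its position in the log."""

    def __init__(self, session_id, log, backups):
        self.session_id = session_id
        self.log = log
        self.backups = backups   # dict standing in for the backup key-value store
        blob = backups.get(session_id)
        if blob is not None:
            self.state = pickle.loads(blob)          # includes the saved offset
        else:
            self.state = {"total": 0, "next_offset": 0}
        self.catch_up()                              # replay anything that was missed

    def catch_up(self):
        changed = False
        for offset, amount in self.log.read_from(self.state["next_offset"]):
            self.state["total"] += amount
            self.state["next_offset"] = offset + 1
            changed = True
        if changed:
            # State and offset are persisted together, in one blob.
            self.backups[self.session_id] = pickle.dumps(self.state)
```

In this sketch, an event published while no instance exists is picked up by the next incarnation during construction, and calling `catch_up` again does not apply it twice.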

Thus, stateful microservices can achieve not only scalability but also resilience.


