Dataflow Architecture: Derived Data Views and Eventual Consistency | by Caleb Lee | Oct 2024

Strong consistency guarantees that every read reflects the latest write. It ensures that all data views are updated immediately and accurately after a change. Strong consistency is typically associated with orchestration, since it often relies on a central coordinator to manage atomic updates across multiple data views: either everything updates at once, or nothing does. Such "over-engineering" may be warranted for systems where minor discrepancies can be disastrous, e.g. financial transactions, but not in our case.

Eventual consistency permits temporary discrepancies between data views, but given enough time, all views converge to the same state. This approach usually pairs with choreography, where each worker reacts to events independently and asynchronously, with no need for a central coordinator.

The asynchronous, loosely coupled design of the dataflow architecture is characterised by eventual consistency of data views, achieved through a choreography of materialisation logic.

And there are perks to that.

Perks: at the system level

Resilience to partial failures: The asynchrony of choreography is more robust against component failures and performance bottlenecks, since disruptions are contained locally. In contrast, orchestration can propagate failures across the system, amplifying the problem through tight coupling.

Simplified write path: Choreography also reduces the responsibility of the write path, shrinking the code surface area where bugs can corrupt the source of truth. Conversely, orchestration makes the write path more complex, and increasingly harder to maintain as the number of distinct data representations grows.

Perks: at the human level

The decentralised control logic of choreography lets different materialisation stages be developed, specialised, and maintained independently and concurrently.

The spreadsheet ideal

A reliable dataflow system is akin to a spreadsheet: when one cell changes, all related cells update instantly, with no manual effort required.

In an ideal dataflow system, we want the same effect: when an upstream data view changes, all dependent views update seamlessly. As with a spreadsheet, we shouldn't have to worry about how it works; it just should.

But guaranteeing this level of reliability in distributed systems is far from simple. Network partitions, service outages, and machine failures are the norm rather than the exception, and the concurrency in the ingestion pipeline only adds complexity.

Since the message queues in the ingestion pipeline provide reliability guarantees, deterministic retries can make transient faults look as though they never happened. To achieve that, our ingestion workers need to adopt the event-driven work ethic:

Pure functions have no free will

In computer science, pure functions exhibit determinism, meaning their behaviour is entirely predictable and repeatable.

They are ephemeral: here for a moment and gone the next, retaining no state beyond their lifespan. Naked they come, and naked they shall go. And from the immutable message inscribed at their birth, their legacy is determined. They always return the same output for the same input; everything unfolds exactly as predestined.

And that is exactly what we want our ingestion workers to be.
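As a minimal sketch of that predestination (the helper name and workout fields are illustrative, not from the original system), a pure materialisation step in Python might look like this:

```python
# A pure materialisation step: output depends only on the input message,
# never on clocks, globals, or external services.
def compute_fitness_metrics(workouts: list[dict]) -> dict:
    total_minutes = sum(w["duration_min"] for w in workouts)
    return {"workout_count": len(workouts), "total_minutes": total_minutes}

message = {"user_id": "u-123", "workouts": [{"duration_min": 30}, {"duration_min": 45}]}

# Running it today, tomorrow, or on the fifth retry yields the identical result.
assert compute_fitness_metrics(message["workouts"]) == {"workout_count": 2, "total_minutes": 75}
```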

Immutable inputs (statelessness)
This immutable message encapsulates all the necessary information, removing any dependency on external, changeable data. Essentially, we are passing data to the workers by value rather than by reference, so that processing a message tomorrow yields the same result as processing it today.
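A hedged sketch of what such a self-contained message might look like (the field names are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # immutable: the worker cannot mutate its own input
class WorkoutIngested:
    user_id: str
    workout_id: str
    duration_min: int
    recorded_at: str  # captured at publish time, not read from a clock later

def handle(msg: WorkoutIngested) -> dict:
    # The worker relies only on the message itself (pass by value),
    # so replaying it tomorrow gives the same result as processing it today.
    return {"user_id": msg.user_id, "total_minutes": msg.duration_min}
```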

Task isolation

To avoid concurrency issues, workers should not share mutable state.

Transitional state within a worker should be isolated, like local variables in pure functions, without relying on shared caches for intermediate computation.

It is also important to scope tasks independently, so that each worker handles tasks without sharing input or output spaces, allowing parallel execution without race conditions. E.g. scoping the user fitness profiling task by a particular user_id, since both the inputs (workouts) and the outputs (user fitness metrics) are tied to a single user.
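For illustration, here is one way that scoping could look, assuming a simple per-user store (the names are hypothetical):

```python
# Each task reads and writes only its own user's slice of the derived view,
# so tasks for different users can run in parallel without racing each other.
fitness_metrics: dict[str, dict] = {}  # stand-in for the user fitness metrics view

def profile_user_fitness(user_id: str, workouts: list[dict]) -> None:
    # Transitional state stays local, like a local variable in a pure function.
    total_minutes = sum(w["duration_min"] for w in workouts)
    # The output space is keyed by user_id, so no two user-scoped tasks collide.
    fitness_metrics[user_id] = {"total_minutes": total_minutes}
```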

Deterministic execution
Non-determinism can sneak in easily: using system clocks, relying on external data sources, or probabilistic/statistical algorithms that depend on random numbers can all lead to unpredictable outcomes. To prevent this, we embed all the "moving parts" (e.g. random seeds or timestamps) directly in the immutable message.
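A small sketch of that idea, assuming the publisher freezes the timestamp and random seed into the message (names are illustrative):

```python
import random

def build_message(user_id: str, now_iso: str, seed: int) -> dict:
    # The publisher pins the "moving parts" once, at publish time.
    return {"user_id": user_id, "created_at": now_iso, "seed": seed}

def process(message: dict) -> float:
    rng = random.Random(message["seed"])  # seeded from the message, not global state
    return rng.random()                   # e.g. a sampling step in a statistical model

msg = build_message("u-123", "2024-10-01T12:00:00Z", seed=42)
assert process(msg) == process(msg)       # every retry replays the same computation
```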

Deterministic ordering
Load balancing with message queues (multiple workers per queue) can result in out-of-order message processing when a message is retried after the next one has already been processed. E.g. out-of-order evaluation of user fitness challenge results appearing as 50% completion, then 70%, then back down to 60%, when it should increase monotonically. For operations that require sequential execution, such as inserting a record and then notifying a third-party service, out-of-order processing can break such causal dependencies.

At the application level, these sequential operations should either run synchronously on a single worker or be split into separate, sequential stages of materialisation.

At the ingestion pipeline level, we could assign just one worker per queue to ensure serialised processing that "blocks" until a retry succeeds. To maintain load balancing, use multiple queues with a consistent-hash exchange that routes messages based on the hash of the routing key. This achieves an effect similar to Kafka's hashed partition key approach.
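A rough sketch of key-based routing in that spirit (queue names are made up, and a real RabbitMQ setup would use the consistent-hash exchange plugin rather than hand-rolled hashing):

```python
import hashlib

QUEUES = ["ingest.q0", "ingest.q1", "ingest.q2", "ingest.q3"]  # one worker per queue

def queue_for(routing_key: str) -> str:
    # Hash the routing key (e.g. user_id) so all messages for one key land on
    # the same queue and are processed in order by its single worker.
    digest = hashlib.sha256(routing_key.encode()).hexdigest()
    return QUEUES[int(digest, 16) % len(QUEUES)]

assert queue_for("user-42") == queue_for("user-42")  # stable routing per key
```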

Idempotent outputs

Idempotence is the property that executing a piece of code multiple times always yields the same result, no matter how many times it is executed.

For example, a plain database "insert" operation is not idempotent, whereas an "insert if not exists" operation is.

This guarantees that you get the same outcome as if the worker had executed exactly once, regardless of how many retries it actually took.

Caveat: Note that unlike pure functions, the worker does not "return" an object in the programming sense. Instead, it overwrites a portion of the database. While this may look like a side effect, you can think of the overwrite as analogous to the immutable output of a pure function: once the worker commits the result, it reflects a final, unchangeable state.
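As a concrete sketch of such an idempotent overwrite (table and column names are illustrative), an upsert leaves the database in the same state no matter how many times the worker retries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fitness_metrics (user_id TEXT PRIMARY KEY, total_minutes INT)")

def commit_result(user_id: str, total_minutes: int) -> None:
    # "Insert if not exists, else overwrite" rather than a plain, non-idempotent INSERT.
    conn.execute(
        "INSERT INTO fitness_metrics (user_id, total_minutes) VALUES (?, ?) "
        "ON CONFLICT(user_id) DO UPDATE SET total_minutes = excluded.total_minutes",
        (user_id, total_minutes),
    )

commit_result("u-123", 75)
commit_result("u-123", 75)  # a retry: the end state is identical to a single execution
```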

Dataflow in client-side applications

Traditionally, we think of web/mobile apps as stateless clients talking to a central database. However, modern "single-page" frameworks have changed the game, offering "stateful" client-side interaction and persistent local storage.

This extends our dataflow architecture beyond the confines of a backend system into a multitude of client devices. Think of the on-device state (the "model" in model-view-controller) as a derived view of server state: the screen displays a materialised view of local on-device state, which in turn mirrors the central backend's state.

Push-based protocols like server-sent events and WebSockets take this analogy further, enabling servers to actively push updates to the client without relying on polling, delivering eventual consistency from end to end.
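As a hedged sketch of the server side of such a push (Flask and the updates_for event source are stand-ins, not part of the original stack), a server-sent events endpoint could stream each new materialised view to the client:

```python
import json
from flask import Flask, Response

app = Flask(__name__)

def updates_for(user_id: str):
    # Placeholder: in practice this would yield each change to the user's derived view.
    yield {"user_id": user_id, "total_minutes": 75}

@app.route("/stream/<user_id>")
def stream(user_id: str) -> Response:
    def events():
        for update in updates_for(user_id):
            yield f"data: {json.dumps(update)}\n\n"  # SSE wire format
    return Response(events(), mimetype="text/event-stream")
```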