Thinking about Syncing, Part 3: separation of concerns
In Part 1 we framed synchronization as exchanging information to allow clients to converge on a shared understanding of the world, specifically involving the merging of timelines.
In Part 2 we discussed several ways applications might do this: through snapshots (like Firefox Sync), change operations (like Google Docs), or revisions (like CouchDB).
We outlined a number of limitations of snapshot- and transformation-oriented approaches, and discovered that applications with certain requirements — offline operation, client-side crypto, etc. — might be best served by an approach that builds a concrete shared timeline between clients.
In this post we explore how the needs of a typical client application, and the data model that supports those needs, differ from the needs and corresponding data model of synchronization code. The UI usually wants to quickly examine the current state of a small slice of the world, while the synchronizer wants to reliably manage change over time. We will look at a way in which these two sets of concerns can be separated.
We will expand on the example given in Part 2, draw an analogy to DVCSes, and see some further examples of systems that separate the concerns of data management from the concerns of data consumers.
Events and tables
In Part 2 we saw a brief example of an eminently syncable data representation: the independent event, a stand-alone historical fact.
{"title": "Bohemian Rhapsody", "played": "2017-10-02T15:48:44Z"}
This is easy to sync because it can just be copied around: it doesn’t refer to anything (no identifiers to manage, and no identifiers introduced), it doesn’t reflect a change to existing data, and it can’t conflict with anything.
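As a rough illustration of why independent events are so easy to sync, here is a minimal sketch in which merging two devices' events is nothing more than a set union. The `merge_events` helper and the second song are made up for this example; they are not part of any real sync implementation.

```python
import json

def merge_events(local, remote):
    """Merge two collections of independent events by set union."""
    # Serialize with sorted keys so that identical events compare equal.
    canonical = {json.dumps(e, sort_keys=True) for e in local}
    canonical |= {json.dumps(e, sort_keys=True) for e in remote}
    merged = [json.loads(s) for s in canonical]
    # Sort only to give a stable order; the events do not depend on each other.
    return sorted(merged, key=lambda e: (e["played"], e["title"]))

phone = [{"title": "Bohemian Rhapsody", "played": "2017-10-02T15:48:44Z"}]
laptop = [
    {"title": "Bohemian Rhapsody", "played": "2017-10-02T15:48:44Z"},
    {"title": "Under Pressure", "played": "2017-10-02T15:54:10Z"},  # sample data
]

merged = merge_events(phone, laptop)
```

Merging in either direction, or merging twice, gives the same result, which is exactly the property that makes this kind of data cheap to copy around.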
But this isn’t everything we need for a typical app. Let’s take a step back and think about features, and how those relate to storage.
There’s implied and missing context in that event.
We want to be able to record that this user played a particular song — not just one with that title! — by that artist, in a playlist, on a device, and so we run into issues of identity, uniqueness, and reference to other entities. Modeling these things is non-trivial: like most application data it’s relational, even if it’s stored in a document database, and that means we need careful management of identifiers and consistency.
We want to record facts that change over time — star ratings and playlist memberships, renaming playlists, and so on. That means having some conception of updates and breaking of relations.
We need to handle deletions and creations (e.g., recreating a playlist with the same name) with care. And we need a way to permanently delete data; some guilty pleasures need to be forgotten.
These things need to sync, so the changes you make on your phone are reflected on your laptop.
A log-structured model is a good fit for this: we can record additions and changes as they happen, and merge our changes in when we sync. We can record new kinds of data easily.
But we’re not done yet: the front-end code has its own requirements.
We want to slice and dice this data to support the UI on all of a user’s devices. We want them to be able to quickly find the last ten songs they played, their top 20 most played songs, their top rated songs. They need to be able to browse their current playlists, search by artist or date, and sort the results.
It’s straightforward to see how our syncing requirements map to an event log, but these front-end retrieval tasks are expensive with a pure event model: play count is a sum aggregate, star rating is last-write-wins, last played is a max aggregate.
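To make those three aggregates concrete, here is a small sketch that answers them from a pure event log. The event shapes (`type`, `song`, `at`, `stars`) are assumptions invented for this example, with `song` standing in for a stable song identifier.

```python
from collections import defaultdict

# Hypothetical events, as they might appear in a merged log.
events = [
    {"type": "played", "song": "s1", "at": "2017-10-02T15:48:44Z"},
    {"type": "rated", "song": "s1", "stars": 3, "at": "2017-10-02T15:50:00Z"},
    {"type": "played", "song": "s2", "at": "2017-10-03T09:00:00Z"},
    {"type": "played", "song": "s1", "at": "2017-10-04T20:15:30Z"},
    {"type": "rated", "song": "s1", "stars": 5, "at": "2017-10-04T20:16:00Z"},
]

def summarize(events):
    play_count = defaultdict(int)  # sum aggregate
    last_played = {}               # max aggregate
    rating = {}                    # last-write-wins: song -> (at, stars)
    for e in events:
        song = e["song"]
        if e["type"] == "played":
            play_count[song] += 1
            last_played[song] = max(last_played.get(song, ""), e["at"])
        elif e["type"] == "rated":
            if song not in rating or e["at"] >= rating[song][0]:
                rating[song] = (e["at"], e["stars"])
    return {
        song: {
            "play_count": play_count[song],
            "last_played": last_played.get(song),
            "stars": rating[song][1] if song in rating else None,
        }
        for song in set(play_count) | set(rating)
    }

summary = summarize(events)
```

Every question here walks the entire log, which is fine for five events and painful for fifty thousand. That cost is what pushes the front end toward a different representation.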
Conversely, it’s easy to see how these front-end features map efficiently to a conventional tabular or object-based storage system, but it’s hard to implement a change-based or log-based syncing system on top of a table for playlists, a table for songs, etc. with SQL UPDATE queries. With in-place updates we must manually manage timestamps and change counters and tombstones.
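For a sense of what that manual bookkeeping looks like, here is a sketch using Python's built-in `sqlite3` module. The table layout, the column names (`changed_at`, `change_counter`, `is_deleted`), and the helper functions are all hypothetical; real apps vary widely in how they track this.

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE playlists (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        changed_at REAL NOT NULL,                  -- timestamp for conflict resolution
        change_counter INTEGER NOT NULL DEFAULT 0, -- bumped on every local change
        is_deleted INTEGER NOT NULL DEFAULT 0      -- tombstone instead of DELETE
    )
""")

def rename_playlist(db, playlist_id, new_name):
    # Every write path has to remember to bump the bookkeeping columns.
    db.execute(
        "UPDATE playlists SET name = ?, changed_at = ?, "
        "change_counter = change_counter + 1 WHERE id = ?",
        (new_name, time.time(), playlist_id),
    )

def delete_playlist(db, playlist_id):
    # A real DELETE would leave nothing behind to tell other devices about.
    db.execute(
        "UPDATE playlists SET is_deleted = 1, changed_at = ?, "
        "change_counter = change_counter + 1 WHERE id = ?",
        (time.time(), playlist_id),
    )

db.execute(
    "INSERT INTO playlists (id, name, changed_at) VALUES (1, 'Road trip', ?)",
    (time.time(),),
)
rename_playlist(db, 1, "Road trip 2017")
delete_playlist(db, 1)
row = db.execute(
    "SELECT name, change_counter, is_deleted FROM playlists WHERE id = 1"
).fetchone()
```

Note that the old name is gone after the rename. The table knows that something changed, and roughly when, but not what the change was.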
We’d like to be able to build new front-end features without affecting, or even fully understanding, how the supporting data is synchronized. And we’d like to be able to reuse or change our synchronization code, or extend the data that’s synced, without having to worry about the existing complex query needs of the UI. This is classic separation of concerns.
Our sync-related requirements are in tension with our ‘direct’ application requirements. Fortunately, our little music app isn’t the first to have to resolve these tensions.
An abridged history of version control
Most of us — developers, writers, musicians, and more — start out using unsophisticated tools to manage change: copying or zipping directories if we need to preserve an older version of some work.
If a developer needs to move some changes between those different directories, they manually copy files. A sophisticated user might point diff at the relevant files to produce a patch, edit it, and apply that patch with patch.
If you’ve ever had a file on your desktop called something like Essay (v1) final draft FINAL(2) EDITED (TO PRINT).pdf, then you’ve used this method.
We might jokingly call this snapshot-oriented version control.
Early version control systems improved on this somewhat. RCS tracked per-file versioning metadata in an adjacent ,v file in the filesystem. It turns out that the needs of file-oriented tools that use versioned data (source code, documentation, etc.) are very different from the needs of version control tools themselves, particularly at scale.
Build tools and IDEs want fast hierarchical access to the current state of your source code, but version control tools want to do things like quickly list every user who changed files in a directory, across moves and renames, in the last fifteen years. Your linter wants to find files missing a copyright header, but your coworker wants you to send her that debug commit that you never landed.
CVS and later VCSes split files and changes, beginning to record log files, to store deltas, and to support atomic operations over multiple files.
Modern version control systems like Git completely divorce the working tree — the files on disk — from the internal data model of the version control system itself. A Git repository can exist with no working tree at all, or with multiple working trees. The working tree is simply a checkout, a convenient instantiation of a particular state in the repository.
When a Git client updates one repository from another, it does so by replicating objects — internal representations of new files, trees, and commits — then advancing refs, then optionally rebasing local changes on top of the new remote changes, and finally optionally updating a working tree to match the new head. Git doesn’t send operations, it efficiently packs changes, and the remote repo doesn’t need to know anything about the local working tree.
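The "replicate objects, then advance refs" shape can be sketched with a toy content-addressed store. This is a deliberately simplified model and not Git's real object format or wire protocol: trees are inlined into commits, history is linear, and the `put`, `commit`, and `fetch` helpers are invented for illustration.

```python
import hashlib
import json

def put(store, obj):
    """Store an object under the hash of its content and return that id."""
    data = json.dumps(obj, sort_keys=True).encode()
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = obj
    return oid

def commit(store, tree, parent=None):
    return put(store, {"tree": tree, "parent": parent})

def fetch(local, remote, remote_head):
    """Copy the commits reachable from remote_head that local lacks."""
    copied = 0
    oid = remote_head
    # Assumes that if local has a commit, it also has all of its ancestors.
    while oid is not None and oid not in local:
        local[oid] = remote[oid]
        copied += 1
        oid = remote[oid]["parent"]
    return copied

remote_objects, local_objects = {}, {}
c1 = commit(remote_objects, {"README": "hello"})
local_objects.update(remote_objects)  # the clone starts out in sync
local_refs = {"origin/main": c1}

c2 = commit(remote_objects, {"README": "hello, world"}, parent=c1)
c3 = commit(remote_objects, {"README": "hello, world", "LICENSE": "MIT"}, parent=c2)

copied = fetch(local_objects, remote_objects, c3)  # replicate objects...
local_refs["origin/main"] = c3                      # ...then advance the ref
```

Nothing in `fetch` mentions files on disk. Updating a working tree would be a separate, optional step that reads from the object store.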
We can even produce different kinds of checkouts from a single Git repository. It’s easy to grab individual files at any point in history using git show, and check out only part of a tree with a sparse checkout. And users can extract different kinds of non-file data from these tools, too.
The concerns of the data management tool, and the concerns of the consumers of the data it manages, are very different. Modern DVCSes resolve this tension by using two different data representations, deriving each from the other when necessary. The internal data representation is the one that manages change, consisting of atomic commits arranged into branches. The secondary, derived representation — a tree of files — is the one typically consumed by user-facing applications.
Seeing similarities
Once we see these tensions, and the solution of not sharing a representation between consumers, we can see the same separation in other places — email clients, photo libraries, even the MVVM pattern.
Some good examples are found in the database industry itself.
A common configuration of SQLite writes changed pages to a Write-Ahead Log (WAL), rather than directly updating the main database file. This allows SQLite to get writes on disk cheaply and lets readers proceed while a write is in progress. The main database file is updated later by ‘replaying’ the WAL, a step SQLite calls a checkpoint. PostgreSQL supports log-shipping replication on top of its WAL; the main database format can be specialized to meet the needs of queries, rather than having to accommodate replication metadata.
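The SQLite behavior is easy to poke at from Python's built-in `sqlite3` module. The sketch below turns on WAL mode, commits a write, and then forces a checkpoint; the file name and table are made up for the example, and WAL requires a file-backed database rather than an in-memory one.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "music.db")
    db = sqlite3.connect(path)

    # The pragma returns the journal mode that is actually in effect.
    mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]

    db.execute("CREATE TABLE plays (title TEXT, played TEXT)")
    db.execute(
        "INSERT INTO plays VALUES (?, ?)",
        ("Bohemian Rhapsody", "2017-10-02T15:48:44Z"),
    )
    db.commit()

    # Committed changes sit in the side file until a checkpoint runs.
    wal_path = path + "-wal"
    wal_exists = os.path.exists(wal_path)

    # A checkpoint copies pages from the WAL into the main database file.
    # The first column of the result is 1 if the checkpoint was blocked.
    busy = db.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()[0]
    wal_size_after = os.path.getsize(wal_path) if os.path.exists(wal_path) else 0
    db.close()
```

In normal use SQLite checkpoints automatically; forcing one here just makes the two-step write visible.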
As with DVCSes, these internal models of change can also offer value to application code: git log is useful! CouchDB and PouchDB expose change feeds, the same data they use for reliable syncing, to application code. Datomic is structured around a transaction log that is the source of its current indexed state, encouraging applications to take advantage of long-lived persistent data.
DVCSes have tensions between log-centric and file-centric consumers, and they resolve them by deriving the working tree from the repository.
SQL databases have tensions between readers, writers, and replication, and they resolve them by (in very simple terms) deriving database tables and indices from a written log that is also available to replicate.
A document store like CouchDB has to act like a simple object store while also managing multi-master replication and conflict, and it does so by storing a tree of document revisions.
We can see that a similar separation of concerns can apply to client-side application storage. Applications should structure their writes as a log, and derive tabular or object-oriented representations from it.
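Here is one way that derivation might look for our music app's playlists: a minimal sketch that folds a log of operations into the current table-like state. The operation names and fields are hypothetical, and the log is assumed to have already been merged into a single agreed order.

```python
def replay(log):
    """Fold a log of playlist operations into a table of current rows."""
    playlists = {}  # playlist id -> {"name": ..., "songs": [...]}
    for op in log:
        kind, pid = op["op"], op["playlist"]
        if kind == "create":
            playlists[pid] = {"name": op["name"], "songs": []}
        elif kind == "rename":
            playlists[pid]["name"] = op["name"]
        elif kind == "add_song":
            playlists[pid]["songs"].append(op["song"])
        elif kind == "delete":
            del playlists[pid]
    return playlists

log = [
    {"op": "create", "playlist": "p1", "name": "Road trip"},
    {"op": "add_song", "playlist": "p1", "song": "s1"},
    {"op": "create", "playlist": "p2", "name": "Guilty pleasures"},
    {"op": "rename", "playlist": "p1", "name": "Road trip 2017"},
    {"op": "delete", "playlist": "p2"},
]

current = replay(log)        # what the UI reads
earlier = replay(log[:3])    # the same derivation at an earlier point
```

The UI only ever sees `current`, while the synchronizer only ever deals with `log`. The derived state can be thrown away and rebuilt at any time, including for an earlier point in history.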
Generalizing the argument: CQRS
Synchronization is not the only feature in tension: the needs of different parts of the application can be at odds with each other. They benefit from having different representations, too.
In a browser we might naturally store bookmarks as a tree in memory, or in a flat file, so they can be shown in folders. We also need fast textual search over the titles, for which we would use a full-text index. We want fast lookup by URL to check whether the current page is bookmarked, so we want some kind of indexed lookup there, perhaps a Bloom filter. We want bookmarks to share icons, so we need some way to store and identify those. We want to find the last five bookmarks the user created, so we need some kind of timestamp index. And of course we want to write a new bookmark quickly without updating all of those structures! Features grow over time, and data stretches to try to keep up.
This isn’t news; certainly not for anyone who works on big sites.
CQRS asserts that ‘command’ (writes) and ‘query’ (reads) are best handled with different data representations, even a different representation for each reader. Event Sourcing declares that what products want to do with data is going to change over time, and so we should record data as generally as we can — as events — in order to adapt. Even ad hoc log-oriented enterprises, those that haven’t had a consultant fill whiteboards with sagas and schemas, take for granted that different tools will build varied representations from the same log.
Client apps often do keep multiple representations of data, but in an ad hoc way: write-back in-memory caches, DB indices, and queries that update two tables. By structuring the application around a rich log, it becomes relatively straightforward to derive multiple specialized representations, and add to and change those representations over time.
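Returning to the bookmarks example, a sketch of this might keep several read-side views that are all fed from the same log. The `BookmarkViews` class and the event shapes are invented for illustration, and the word index is a naive stand-in for a real full-text index.

```python
from collections import defaultdict

class BookmarkViews:
    """Several read-side views, each fed from the same log of events."""

    def __init__(self):
        self.by_url = {}                 # "is this page bookmarked?"
        self.by_word = defaultdict(set)  # naive stand-in for full-text search
        self.recent = []                 # URLs, newest first

    def apply(self, event):
        url = event["url"]
        if event["type"] == "bookmark_added":
            if url in self.by_url:
                return  # assumption: a URL is bookmarked at most once
            self.by_url[url] = event
            for word in event["title"].lower().split():
                self.by_word[word].add(url)
            self.recent.insert(0, url)
        elif event["type"] == "bookmark_removed":
            removed = self.by_url.pop(url, None)
            if removed is not None:
                for word in removed["title"].lower().split():
                    self.by_word[word].discard(url)
                self.recent.remove(url)

log = [
    {"type": "bookmark_added", "url": "https://example.com/", "title": "Example Domain"},
    {"type": "bookmark_added", "url": "https://example.org/docs", "title": "Example Docs"},
    {"type": "bookmark_removed", "url": "https://example.com/"},
]

views = BookmarkViews()
for event in log:
    views.apply(event)
```

Adding a new view later, say a timestamp index, means writing one more projection and replaying the log into it. The write path does not change.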
This is not a new observation in the wider industry.
Leading analytics tools like Amplitude, and Mozilla’s own data platform, are designed as aggregators of immutable logs, constructing derived data sources to support various query systems, including SQL-based Redash queries.
If you talked to a data analysis engineer, and told them that you were going to drop raw event data on the floor as soon as today’s derived dashboards were compiled, rather than warehousing them to answer a different set of questions next week, they’d be horrified. Yet this is what we do when a client app turns a user action, like clicking a toolbar icon, into a decontextualized INSERT INTO bookmarks (url, title, date) …: we cut down a rich arrangement of data, data that is of interest to other features, into a single simple representation that’s specialized for a particular use. We can do better.
Conclusion
- It’s better to sync logs than state or changes, particularly when data is encrypted or devices go offline.
- Log-structured data is well understood outside of client apps: it’s the underpinning of DVCSes, databases, and some web apps, as well as being a core part of analytics systems like Amplitude.
- Different parts of larger client apps have differing query needs, and those needs are in tension, too.
- Understanding an app’s data as a log of changes, transformed into varied representations for specific uses, not only brings clarity to syncing, but also makes it easier to target those differing needs.
Still to come in this series: more details on merging; more concrete exploration of how an application might be structured around a log, from modeling the domain through to defining views; and discussion of the differences between event-structured and log-structured data.