An API for data that changes over time

Joseph Gentle

June 14, 2019

What do all these things have in common?

RSS feeds
Gamepads and MIDI devices
An email client
Filesystem watching (FSWatch, kqueue, inotify, etc)
Web based monitoring dashboards
CPU usage on your local machine
Kafka
RethinkDB Changefeeds
A Google Docs document
Contentful’s sync protocol
Syntax highlighting as I type in my editor, with red squiggly underlines for errors

All of these systems have data that changes over time. In each case, one system (the kernel, a network server, a database) authoritatively knows about some information (the filesystem, your email inbox). It needs to tell other systems about changes to that data.

But look at this list - all of these systems have completely different APIs. Filesystem watching works differently on every OS. RSS feeds poll. Email clients … well, email is its own mess (JMAP looks promising though).

Google’s APIs use a registered URL callback for change notifications. Kafka’s API queries from a specified numbered offset, with events returned as they’re available. Getting information about a running Linux system usually requires parsing pseudo-files in /proc. Can you fs watch these files? Who knows. Even inside the Linux kernel there's a handful of different APIs for observing changes depending on the system you're interacting with (epoll / inotify / aio / procfs / sysfs / etc). It's the same situation inside web browsers - we have DOM events (onfocus / onblur, etc). But the DOM also has MutationEvents and MutationObserver. getUserMedia and fetch use promises instead. MIDI gives you a stream of 3-byte messages to parse. And the Gamepad API is polled.

The fact that these systems all work differently is really silly. It reminds me of the time before we standardized on JSON over REST. Every application had its own protocol for fetching data. FTP and SMTP use a stateful text protocol. At the time Google’s systems all used RPC over protobuf. And then REST was born, and now you can access everything from weather forecasts to a user’s calendar to lists of exoplanets from NASA via REST.

I think we’ll look back on today in the same way, reflecting on how silly and inconvenient it is (was) for every API to use a different method of observing data changing over time.

I think we need 2 things:

A programmatic API in each language for accessing data that changes over time
A REST-equivalent network protocol for streaming data changes (or a REST extension)

You might be thinking, isn’t this problem solved with streams? Or observables? Or Kafka? No. Usually what I want my program to do is this:

Get some initial data
Get a stream of changes from that snapshot. These changes should be live (not polled), incremental and semantic. (E.g. Google Docs should say 'a' was inserted at document position 200, not send a new copy of the document with every keystroke.)
Reconnect to that stream without missing any changes.

Stream APIs usually make it hard to do 1 and 3. Pub-sub usually makes it impossible to do 3 (if you miss a message, what do you do?). Observables aren’t minimal — usually they send you the whole object with each update. As far as I can tell, GraphQL subscriptions are just typed streams — which is a pity, because they had a great opportunity to get this right.

One mental model for this is that I want a program to watch a state machine owned by a different program. The state machine could be owned by the kernel or a database, or a goroutine or something. It could live on another computer - or even on the blockchain or scuttlebutt. When I connect, the state machine is in some initial state. It then processes actions which move it from state to state. (Actions is a weird term - in other areas we call them operations, updates, transactions or diffs / patches).

If my application is interested in following along, I want that state machine to tell me:

A recent snapshot of the state
Each action performed by the state machine from that state, with enough detail that I can follow along locally.

When I reconnect, the state machine could either tell me all the actions I missed so I can replay them locally, or it could send me a new snapshot and we can go from there. (That said, sometimes it's important that we get the operations and not just a new snapshot.)
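Here's a rough sketch of that mental model in Python. Everything in it is made up for illustration: the `StateMachine` class, its `snapshot` and `ops_since` methods, and the two action types. The only thing that matters is the shape: a snapshot tagged with a version, plus a way to ask for the actions performed since that version.

```python
from dataclasses import dataclass, field


def apply_action(state, action):
    """Deterministic update function shared by the owner and its followers."""
    kind, index, *rest = action
    if kind == "insert":
        state.insert(index, rest[0])
    elif kind == "remove":
        del state[index]
    else:
        raise ValueError(f"unknown action {kind!r}")


@dataclass
class StateMachine:
    """Hypothetical owner of some state: a list of words changed by actions."""
    state: list = field(default_factory=list)
    log: list = field(default_factory=list)  # log[i] moves version i to i + 1

    @property
    def version(self):
        return len(self.log)

    def apply(self, action):
        apply_action(self.state, action)
        self.log.append(action)

    def snapshot(self):
        # A copy of the state, plus the version that copy corresponds to.
        return list(self.state), self.version

    def ops_since(self, version):
        # Every action performed after `version`, in order.
        return self.log[version:]


owner = StateMachine()
owner.apply(("insert", 0, "hello"))

# 1. A follower connects and gets a snapshot.
local, seen = owner.snapshot()

# 2. The follower is disconnected while the owner keeps changing.
owner.apply(("insert", 1, "world"))
owner.apply(("remove", 0))

# 3. On reconnect it replays only the actions it missed.
for action in owner.ops_since(seen):
    apply_action(local, action)
    seen += 1

print(local, seen)  # ['world'] 3
```

A real owner would eventually trim its log, at which point `ops_since` has to fail and the follower falls back to a fresh snapshot.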

With this, I can:

Re-render my app’s frontend when the data changes, without needing to poll or re-send everything over the network, or do diffing or anything like that.
Maintain a computed view that is only recalculated when the data itself changes. (Like compilation artefacts, or a blog post’s HTML — HTML should only be rerendered when the post’s content changes!)
Do local speculative writes. That allows realtime collaborative editing (like Google Docs).
Do monitoring and analytics off the changes.
Invalidate (& optionally repopulate) a cache
Build a secondary index that always stays up to date

One of the big advantages of having REST become a standard is that we’ve been able to build common libraries and infrastructure that work with any kind of data. We have caching, load balancing and CDN tools like nginx / Cloudflare. We have debugging tools like cURL and Paw. HTTP libraries exist in every language, and they interoperate beautifully. We should be able to do the same sort of thing with changing data - if there were a standard protocol for updates, we could have standard tools for all of the stuff in that list above! Streaming APIs like ZMQ / RabbitMQ / Redis Streams are too low level to write generic tools like that.

Time (versions) should be explicit

We need to talk about versions. To me, one of the big problems with lots of APIs for stuff like this today is that they’re missing an explicit notion of time. This conceptual bug shows up all over the place, and once you see it it's impossible to unsee. Props to Rich Hickey and Martin Kleppmann for informing my thinking on this.

The problem is that for data that changes over time, a fetched value is correct only at the precise time that it was fetched. Without re-fetching, or some other mechanism, it's impossible to tell when that value is no longer valid. It might have already changed by the time you receive the value - but you have no way to know without re-fetching and comparing. And even if you do re-fetch and compare, it might have changed in the intervening time then changed back.

If we add in the notion of explicit versions, this becomes much easier to think about. Imagine I make two queries (or syscalls or whatever). I learn first that x = 5 then y = 6. But from that alone I don't know anything about how those values relate across time! There might never have been a time where (x,y) = (5,6). If instead I learn that x = 5 at time 100, then y = 6 at time 100, I have two immutable facts. I know that at time 100, (x,y) = (5,6). I can ask follow up questions like what is z at time 100?. Or importantly, notify me when x changes after version 100.

These versions could be a single incrementing number (like SVN or Kafka), a version vector or an opaque string or a hash like git.
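To make the x and y example concrete, here is a toy store that keeps every write under a single incrementing version number. The `VersionedStore` class and its `read(key, at=...)` signature are assumptions made for this sketch, not any real database's API.

```python
import bisect


class VersionedStore:
    """Toy multi-version store: every write bumps one global version number."""

    def __init__(self):
        self.version = 0
        self.history = {}  # key -> ([versions], [values]) in write order

    def write(self, key, value):
        self.version += 1
        versions, values = self.history.setdefault(key, ([], []))
        versions.append(self.version)
        values.append(value)
        return self.version

    def read(self, key, at=None):
        """Return (value, version): the value of `key` as of version `at`."""
        at = self.version if at is None else at
        versions, values = self.history[key]
        i = bisect.bisect_right(versions, at)  # number of writes made at or before `at`
        if i == 0:
            raise KeyError(f"{key!r} did not exist at version {at}")
        return values[i - 1], at


db = VersionedStore()
db.write("x", 5)                # version 1
db.write("y", 1)                # version 2

x, v = db.read("x")             # x = 5 at version 2
db.write("x", 7)                # version 3
db.write("y", 6)                # version 4

y_now, _ = db.read("y")         # reading "whatever is current" gives 6
y_then, _ = db.read("y", at=v)  # reading at the version we saw x gives 1

print((x, y_now))   # (5, 6): a pair that never existed at any single version
print((x, y_then))  # (5, 1): what the store actually held at version 2
```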

This might seem like an academic problem, but having time (/ version information) be implicit instead of explicit hurts us in lots of ways.

For example, if I make two SQL queries, I have no way of knowing if the two query results are temporally coherent. The data I got back might have changed between queries. The SQL answer is to use transactions. Transactions force both queries to be answered from the same point in time. The problem with transactions is that they don’t compose:

I can’t use the results from two sequentially made transactions together, even if the data changes rarely.
I can’t make a SQL transaction across multiple databases.
If I have my data in PostgreSQL and an index to my data in Elasticsearch, I can’t make a query that fetches an ID from the index, then fetches / modifies the corresponding value in Postgres. The data might have changed in between the two queries. Or my Elasticsearch index might be behind the point in time of Postgres. I have no way to tell.
You can’t make a generic cache of query results using versionless transactions. Isn’t it weird that we have generic caches for HTTP (like Varnish or nginx) but nothing like that for most databases? The reason is that if you query keys A and B from a database, and the cache has A stored locally, it can’t return the cached value for A and just fetch B. The cache also can’t store B alongside the older result for A. Without versions, this problem is basically impossible to solve correctly in a general way. But we can solve it for HTTP because we have ETags.
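One way versions could make the cache problem tractable, sketched under a big assumption: suppose every fetch reported the range of versions over which the returned value is known to be valid (from the version that wrote it up to the store's version at read time). The `Versioned` record and `coherent_version` helper below are hypothetical. With that information a cache can check whether a cached A and a freshly fetched B ever held at the same time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: object
    valid_from: int  # version at which this value was written
    valid_to: int    # latest version at which it was known to be current


def coherent_version(*results):
    """Return a version at which all results held at once, or None."""
    lo = max(r.valid_from for r in results)
    hi = min(r.valid_to for r in results)
    return hi if lo <= hi else None


# A was written at version 3 and was still current when cached at version 10.
cached_a = Versioned("a1", valid_from=3, valid_to=10)

# B last changed at version 8 and was fetched when the store was at version 12.
fresh_b = Versioned("b1", valid_from=8, valid_to=12)
print(coherent_version(cached_a, fresh_b))  # 10: both held for versions 8..10

# If B instead changed at version 11, the cache can't prove A is still valid.
newer_b = Versioned("b2", valid_from=11, valid_to=12)
print(coherent_version(cached_a, newer_b))  # None: A must be re-fetched
```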

The caching problem is sort of solved by read only replicas - but I find it telling that read only replicas often need private APIs to work. The main API of most databases isn’t powerful enough to support a feature that the database itself needs to scale and function. (This is getting better though - Mongo / Postgres.)

Personally I think this problem alone is one of the core reasons behind the NoSQL movement. Our database APIs make it impossible to correctly implement caching, secondary indexing and computed views in separate processes. So SQL databases have to do everything in-process, and this in turn kills write performance - they have ever more work to do on each write. Developers have solved these performance problems by looking elsewhere.

It doesn’t have to be like this — I think we can have our cake and eat it too; we just need better APIs.

(Credit where credit is due - Riak, FoundationDB and CouchDB all provide version information in their fetch APIs. I still want better change feed APIs though.)

Minimal Viable Spec

What would a baseline API for data that changes over time look like?

The way I see it, we need 2 basic APIs:

fetch(query) -> data, version
subscribe(query, version) -> stream of (update, version) pairs. (Or maybe an error if the version is too old)

There are a lot of forms the version information could take - it could be a timestamp, a number, an opaque hash, or something else. It doesn’t really matter so long as it can be passed into subscribe calls.
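As a rough illustration, here is an in-memory toy with that shape. It is only a sketch and takes several liberties: the "query" is a single key, the stream is modelled as a callback, the store keeps a bounded log of recent writes, and each update carries the whole new value for brevity instead of a small semantic change. `Store`, `VersionTooOld` and the `keep` parameter are names invented for this example.

```python
from collections import deque


class VersionTooOld(Exception):
    pass


class Store:
    """Toy key-value store exposing fetch and subscribe."""

    def __init__(self, keep=100):
        self.data = {}
        self.version = 0
        self.log = deque(maxlen=keep)  # (version, key, value), recent writes only
        self.listeners = []            # (key, callback) pairs

    def set(self, key, value):
        self.version += 1
        self.data[key] = value
        self.log.append((self.version, key, value))
        for k, callback in self.listeners:
            if k == key:
                callback(("set", value), self.version)

    def fetch(self, key):
        return self.data.get(key), self.version

    def subscribe(self, key, version, callback):
        # The oldest version we can still catch a subscriber up from.
        oldest_known = self.log[0][0] - 1 if self.log else self.version
        if version < oldest_known:
            raise VersionTooOld(version)
        for v, k, value in self.log:  # replay anything the caller missed
            if v > version and k == key:
                callback(("set", value), v)
        self.listeners.append((key, callback))  # then stay live


store = Store(keep=2)
store.set("inbox", ["hi"])               # version 1
value, version = store.fetch("inbox")    # (["hi"], 1)
store.set("inbox", ["hi", "lunch?"])     # version 2, happens before we subscribe

seen = []
store.subscribe("inbox", version, lambda update, v: seen.append((update, v)))
store.set("inbox", [])                   # version 3, delivered live

print(seen)  # [(('set', ['hi', 'lunch?']), 2), (('set', []), 3)]
```

Because the subscription starts from the version returned by `fetch`, the write at version 2 is not lost even though it happened between the two calls. A subscriber that asks for a version older than the retained log gets `VersionTooOld` and has to fetch again.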

Interestingly, HTTP already has a fetch function with this API in the GET method. The server returns data and usually either a Last-Modified header or an ETag. But HTTP is missing a standard way to subscribe.

The update objects themselves should be small and semantic. The gold standard for operations is usually that they should express user intent. And I also believe we should have a MIME-type equivalent set of standard update functions (like JSON-patch).

Let's look at some examples:

For Google Docs, we can’t re-send the whole document with every keystroke. Not only would that be slow and wasteful, but it would make concurrent editing almost impossible. Instead Docs wants to send a semantic edit, like insert 'x' at position 4. With that we can update cursor positions correctly and handle concurrent edits from multiple users. Diffing isn't good enough here - if a document is aaaa and I have a cursor in the middle (aa|aa), inserting another a at the start or the end of the document has the same effect on the document. But those changes have different effects on my cursor position and speculative edits.
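The cursor example is small enough to run. In this sketch `insert` and `transform_cursor` are illustrative helpers, and the rule that an insert exactly at the cursor leaves it in place is just one common convention assumed here.

```python
def insert(doc, pos, text):
    return doc[:pos] + text + doc[pos:]


def transform_cursor(cursor, pos, text):
    """Shift a cursor to account for `text` having been inserted at `pos`.

    Inserts strictly before the cursor push it right; inserts at or after
    the cursor leave it alone.
    """
    return cursor + len(text) if pos < cursor else cursor


doc, cursor = "aaaa", 2  # aa|aa

at_start = insert(doc, 0, "a")
at_end = insert(doc, 4, "a")
print(at_start == at_end)  # True: both give "aaaaa", so a diff can't tell them apart

print(transform_cursor(cursor, 0, "a"))  # 3, i.e. aaa|aa
print(transform_cursor(cursor, 4, "a"))  # 2, i.e. aa|aaa
```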

The indie game Factorio uses a deterministic game update function. Both save games and the network protocol are streams of actions which modify the game state in a well-defined way (mine coal, place building, tick, etc). Each player applies the stream of actions to a local snapshot of the world. Note in this case the semantic content of the updates is totally application specific - I doubt any generic JSON-patch like type would be good enough for a game like this.

For something like a gamepad API, it's probably fine to just send the entire new state every time it changes. The gamepad state data is so small and diffing is so cheap and easy to implement that it doesn’t make much difference. Even versions feel like overkill here.

GraphQL subscriptions should work this way. GraphQL already allows me to define a schema and send a query with a shape that mirrors the schema. I want to know when the query result set changes. To do so I should be able to use the same query - but subscribe to the results instead of just fetching them. Under the hood GraphQL could send updates using JSON-patch or something like it. Then the client can locally update its view of the query. With this model we could also write tight integrations between that update format and frontend frameworks like Svelte. That would allow us to update only and exactly the DOM nodes that need to be changed as a result of the new data. This is not how GraphQL subscriptions work today. But in my opinion it should be!
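To give a feel for what the client side of that could look like, here is a hand-rolled sketch that applies a JSON-patch style update to a local copy of a query result. It covers only a small subset of JSON Patch (add, replace and remove on nested dicts and lists) and is not a conforming implementation; for example it doesn't complain when replace targets a missing key. The `apply_patch` function and the sample data are invented for illustration, and nothing here reflects how GraphQL actually behaves today.

```python
def apply_patch(doc, patch):
    """Apply a small subset of JSON Patch (add / replace / remove) in place."""
    for op in patch:
        kind = op["op"]
        if kind not in ("add", "replace", "remove"):
            raise ValueError(f"unsupported op {kind!r}")
        # Split the JSON Pointer path into tokens, undoing its two escapes.
        tokens = [
            t.replace("~1", "/").replace("~0", "~")
            for t in op["path"].split("/")[1:]
        ]
        if not tokens:
            raise ValueError("patching the document root is not supported here")

        # Walk down to the container that holds the final token.
        target = doc
        for token in tokens[:-1]:
            target = target[int(token)] if isinstance(target, list) else target[token]

        last = tokens[-1]
        if isinstance(target, list):
            index = len(target) if last == "-" else int(last)  # "-" means append
            if kind == "add":
                target.insert(index, op["value"])
            elif kind == "replace":
                target[index] = op["value"]
            else:
                del target[index]
        elif kind == "remove":
            del target[last]
        else:
            target[last] = op["value"]
    return doc


# The client's local view of a query result.
view = {"post": {"title": "Hello", "comments": [{"text": "first"}]}}

# A small update from the server instead of a whole new result.
update = [
    {"op": "replace", "path": "/post/title", "value": "Hello, world"},
    {"op": "add", "path": "/post/comments/-", "value": {"text": "second"}},
]
apply_patch(view, update)
print(view["post"]["title"])          # Hello, world
print(len(view["post"]["comments"]))  # 2
```

Because the update names exactly which paths changed, a frontend could use the same information to touch only the matching parts of the page.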

To make GraphQL and Svelte (and anything else) interoperate, we should define some standard update formats for structured data. Games like Factorio will always need to do their own thing, but the rest of us can and should use standard stuff. I’d love to see a Content-Type: for update formats. I can imagine one type for plain text updates, another for JSON (probably a few for JSON). Another type for rich text, that applications like Google Docs could use. I have nearly a decade of experience goofing around with realtime collaborative editing, and this API model would work perfectly with collaborative editors built on top of OT or CRDTs.

Coincidentally, I wrote this JSON operation type that also supports alternate embedded types and operational transform. And Jason Chen wrote this rich text type. There are also plenty of CRDT-compatible types floating around.

The API I described above is just one way to cut this cake. There’s plenty of alternate ways to write a good API for this sort of thing. Braid is another approach. There’s also a bunch of ancillary APIs which could be useful:

fetchAndSubscribe(query) -> data, version, stream of updates. This saves a round-trip in the common case, and saves re-sending the query.
getOps(query, fromVersion, toVersion / limit) -> list of updates. Useful for some applications
mutate(update, ifNotChangedSinceVersion) -> new version or conflict error

Mutate is interesting. By adding a version argument, we can reimplement atomic transactions on top of this API. It can support all the same semantics as SQL, but it could also work with caches and secondary indexes.
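Here's a minimal sketch of how a version argument gives you optimistic transactions. The `Store`, `Conflict` and `transact` names are made up for this example, and to keep it short `mutate` takes a whole new value where the API above takes an update.

```python
class Conflict(Exception):
    pass


class Store:
    """Toy single-value store with a version-checked write."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def fetch(self):
        return self.value, self.version

    def mutate(self, new_value, if_not_changed_since):
        # Reject the write if anything changed after the version the caller saw.
        if self.version != if_not_changed_since:
            raise Conflict(self.version)
        self.value = new_value
        self.version += 1
        return self.version


def transact(store, fn, retries=10):
    """Optimistic transaction: read, compute, write, and retry on conflict."""
    for _ in range(retries):
        value, version = store.fetch()
        try:
            return store.mutate(fn(value), version)
        except Conflict:
            continue
    raise Conflict("gave up after too many retries")


account = Store({"balance": 100})
_, stale = account.fetch()               # we read at version 0
account.mutate({"balance": 90}, stale)   # someone else writes first: version 1

try:
    account.mutate({"balance": 50}, stale)  # our write was computed from old data
except Conflict:
    # Re-read and recompute from the current value instead.
    transact(account, lambda acct: {"balance": acct["balance"] - 50})

print(account.fetch())  # ({'balance': 40}, 2)
```

The conflicting write is rejected instead of silently overwriting the other change, and the retry recomputes from the current balance.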

Having a way to generate version conflicts lets you build realtime collaborative editors with OT on top of this, using the same approach as Firepad. The algorithm is simple - put a retry loop with some OT magic in the middle, between the frontend application and database. Like this. It composes really well - with this model you can do realtime editing without support from your database.

Obviously not all data is mutable, and for data that is, it won’t necessarily make sense to funnel all mutations through a single function. But it’s a neat property! It’s also interesting to note that HTTP POST already supports doing this sort of thing with the If-Match / If-Unmodified-Since headers.

Standards

So to sum up, we need a standard for how we observe data that changes over time. We need:

A local programmatic API for kernels (and stuff like that)
A standard API we can use over the network. A REST equivalent, or a protocol that extends REST directly.

Both of these APIs should support:

Versions (or timestamps, ETags, or some equivalent)
A standard set of update operations, like Content-Type in HTTP but for modifications. Sending a fresh copy of all the data with each update is bad.
The ability to reconnect from some point in time

And we should use these APIs basically everywhere, from databases, to applications, and down into our kernels. Personally I’ve wasted too much of my professional life implementing and reimplementing code to do this. And because our industry builds this stuff from scratch each time, the implementations we have aren’t as good as they could be. Some have bugs (fs watching on macOS), some are hard to use (parsing sysfs files), some require polling (Contentful), some don’t allow you to reconnect to feeds (GraphQL, RethinkDB, most pubsub systems). Some don’t let you send small incremental updates (observables). The high quality tools we do have for building this sort of thing are too low level (streams, websockets, MQs, Kafka). The result is a total lack of interoperability and common tools for debugging, monitoring and scaling.

I don’t want to rubbish the systems that exist today — we’ve needed them to explore the space and figure out what good looks like. But having done that, I think we’re ready for a standard, simple, forward looking protocol for data that changes over time.

Whew.

By the way, I’m working to solve some problems in this space with Statecraft. But that's another blog post. ;)