Introducing Project Mentat, a flexible embedded knowledge store
Evolving storage is hard. Can we make it easier?
Edit, January 2017: to avoid confusion and to better follow Mozilla’s early-stage project naming guidelines, we’ve renamed Datomish to Project Mentat. This post has been altered to match.
For several months now, a small team at Mozilla has been exploring new ways of building a browser. We called that effort Tofino, and it’s now morphed into the Browser Futures Group.
As part of that, Nick Alexander and I have been working on a persistent embedded knowledge store called Project Mentat. Mentat is designed to ship in client applications, storing relational data on disk with a flexible schema.
It’s a little different to most of the storage systems you’re used to, so let’s start at the beginning and explain why. If you’re only interested in the what, skip down to just above the example code.
As we began building Tofino’s data layer, we observed a few things:
- We knew we’d need to store new types of data as our product goals shifted: page metadata, saved content, browsing activity, location. The set of actions the user can take, and the data they generate, is bound to grow over time. We didn’t (don’t!) know what these were in advance.
- We wanted to support front-end innovation without being gated on some storage developer writing a complicated migration. We’ve seen database evolution become a locus of significant complexity and risk — “here be dragons” — in several other applications. Ultimately it becomes easier to store data elsewhere (a new database, simple prefs files, a key-value table, or JSON on disk) than to properly integrate it into the existing database schema.
- As part of that front-end innovation, sometimes we’d have two different ‘forks’ growing the data model in different directions at once. That’s a difficult problem to address with a tool like SQLite.
- Front-end developers were interested in looser approaches to accessing stored data than specialized query endpoints: e.g., Lin Clark suggested that GraphQL might be a better fit. Only a month or two into building Tofino we already saw the number of API endpoints, parameters, and fields growing as we added features. Specialized API endpoints turn into ad hoc query languages.
- Syncability was a constant specter hovering at the back of our minds: getting the data model right for future syncing (or partial hosting on a service) was important.
Many of these concerns happen to be shared across other projects at Mozilla: Activity Stream, for example, also needs to store a growing set of page summary attributes for visited pages, and join those attributes against your main browsing history.
Nick and I started out supporting Tofino with a simple store in SQLite. We knew it had to adapt to an unknown set of use cases, so we decided to follow the principles of CQRS.
CQRS — Command Query Responsibility Segregation — recognizes that it’s hard to pick a single data storage model that works for all of your readers and writers… particularly the ones you don’t know about yet.
As you begin building an application, it’s easy to dive head-first into storing data to directly support your first user experience. As the experience changes, and new experiences are added, your single data model is pulled in diverging directions.
A common second-system response is to reactively aim for maximum generality. You build a single normalized super-flexible data model (or key-value store, or document store)… and soon you find that it’s expensive to query, complex to maintain, has designed-in capabilities that will never be used, and you still have tensions between different consumers.
The CQRS approach, at its root, is to separate the ‘command’ from the ‘query’: store a data model that’s very close to what the writer knows (typically a stream of events), and then materialize as many query-side data stores as you need to support your readers. When you need to support a new kind of fast read, you only need to do two things: figure out how to materialize a view from history, and figure out how to incrementally update it as new events arrive. You shouldn’t need to touch the base storage schema at all. When a consumer is ripped out of the product, you just throw away their materialized views.
Viewed through that lens, everything you do in a browser is an event with a context and a timestamp: “the user bookmarked page X at time T in session S”, “the user visited URL X at time T in session S for reason R, coming from visit V1”. Store everything you know, materialize everything you need.
We built that with SQLite.
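Concretely, that approach takes a shape something like the following. This is a minimal sketch, not Tofino’s actual schema: the table and column names are illustrative.

```js
// Illustrative only: a CQRS-style layout on SQLite, not Tofino's real schema.
// The write side is an append-only log of events.
const CREATE_EVENTS = `
  CREATE TABLE IF NOT EXISTS events (
    id        INTEGER PRIMARY KEY,
    timestamp INTEGER NOT NULL,   -- when it happened
    session   INTEGER NOT NULL,   -- which browsing session
    kind      TEXT NOT NULL,      -- e.g., 'visit' or 'bookmark'
    payload   TEXT NOT NULL       -- JSON: URL, title, referrer, reason…
  )`;

// The read side is one materialized view per consumer. Supporting a new
// kind of fast read means deriving a new table from the log, not
// changing the log itself.
const CREATE_LAST_VISITS = `
  CREATE TABLE IF NOT EXISTS last_visits (
    url        TEXT PRIMARY KEY,
    last_visit INTEGER NOT NULL
  )`;

// Applied in the same transaction that appends a 'visit' event, so the
// view stays incrementally up to date as new events arrive.
const UPDATE_LAST_VISIT = `
  INSERT OR REPLACE INTO last_visits (url, last_visit) VALUES (?, ?)`;
```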
This was a clear and flexible concept, and it allowed us to adapt, but the implementation in JS involved lots of boilerplate and was somewhat cumbersome to maintain manually: the programmer does the work of defining how events are stored, how they map to more efficient views for querying, and how tables are migrated when the schema changes. You can see this starting to get painful even early in Tofino’s evolution, even without data migrations.
Quite soon it became clear that a conventional embedded SQL database wasn’t a direct fit for a problem in which the schema grows organically — particularly not one in which multiple experimental interfaces might be sharing a database. Furthermore, being elbow-deep in SQL wasn’t second-nature for Tofino’s webby team, so the work of evolving storage fell to just a few of us. (Does any project ever have enough people to work on storage?) We began to look for alternatives.
We explored a range of existing solutions: key-value stores, graph databases, and document stores, as well as the usual relational databases. Each seemed to be missing some key feature.
Most good storage systems simply aren’t suitable for embedding in a client application. There are lots of great storage systems that run on the JVM and scale across clusters, but we need to run on your Windows tablet! At the other end of the spectrum, most webby storage libraries aren’t intended to scale to the amount of data we need to store. Most graph and key-value stores are missing one or more of full-text indexing (crucial for the content we handle), expressive querying, defined schemas, or the kinds of indexing we need (e.g., fast range queries over visit timestamps). ‘Easy’ storage systems of all stripes often neglect concurrency, or transactionality, or multiple consumers. And most don’t give much thought to how materialized views and caches would be built on top to address the tension between flexibility and speed.
We found a couple of solutions that seemed to have the right shape (which I’ll discuss below), but weren’t quite something we could ship. Datomic is a production-grade JVM-based clustered relational knowledge store. It’s great, as you’d expect from Cognitect, but it’s not open-source and we couldn’t feasibly embed it in a Mozilla product. DataScript is a ClojureScript implementation of Datomic’s ideas, but it’s intended for in-memory use, and we need persistent storage for our datoms.
Nick and I try to be responsible engineers, so we explored the cheap solution first: adding persistence to DataScript. We thought we might be able to leverage all of the work that went into DataScript, and just flush data to disk. It soon became apparent that we couldn’t resolve the impedance mismatch between a synchronous in-memory store and asynchronous persistence, and we had concerns about memory usage with large datasets. Project Mentat was born.
Mentat is built on top of SQLite, so it gets all of SQLite’s reliability and features: full-text search, transactionality, durable storage, and a small memory footprint.
On top of that we’ve layered ideas from DataScript and Datomic: a transaction log with first-class transactions so we can see and annotate a history of events without boilerplate; a first-class mutable schema, so we can easily grow the knowledge store in new directions and introspect it at runtime; Datalog for storage-agnostic querying; and an expressive strongly typed schema language.
Datalog queries are translated into SQL for execution, taking full advantage of both the application’s rich schema and SQLite’s fast indices and mature SQL query planner.
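To give a flavour of what that translation involves, here’s a rough sketch for a simple query. The attribute names match the example below, but the datoms table layout and the generated SQL shown here are illustrative guesses, not what Mentat actually emits:

```js
// A Datalog query: find the title of the page with a given URL.
const query = `
  [:find ?title .
   :in $ ?url
   :where
   [?page :page/url ?url]
   [?page :page/title ?title]]`;

// Conceptually, each :where clause constrains the store of
// (entity, attribute, value) facts, and shared variables like ?page
// become joins. The result is SQL shaped roughly like this:
const roughSQL = `
  SELECT t.value AS title
  FROM datoms AS u
  JOIN datoms AS t ON t.entity = u.entity
  WHERE u.attribute = 'page/url'
    AND u.value = ?
    AND t.attribute = 'page/title'`;
```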
You can see more comparisons between Project Mentat and those storage systems in the README.
A proper tutorial will take more space than this blog post allows, but here’s a brief example in JS. The example relies on a schema value; its precise syntax is beyond the scope of this post, so take the sketch below as illustrative rather than definitive:
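```js
// An illustrative schema: a list of attribute definitions in the same
// shape used when we extend the schema further down. The exact syntax
// here is a sketch, not a reference for Mentat's schema API.
let schema = {
  "name": "pages",
  "attributes": [
    {"name": "page/url",
     "type": "string",
     "cardinality": "one",
     "unique": "identity",        // URLs uniquely identify pages.
     "doc": "The page's URL."},
    {"name": "page/title",
     "type": "string",
     "cardinality": "one",
     "doc": "The page's title."}
  ]
};
```

With a schema defined, using the store looks a little like this: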
```js
// Open a database.
let db = await datomish.open("/tmp/testing.db");

// Make sure we have our current schema.
await db.ensureSchema(schema);

// Add some data. Note that we use a temporary ID (the real ID
// will be assigned by Mentat).
let txResult = await db.transact([
  {"db/id": datomish.tempid(),
   "page/url": "https://mozilla.org/",
   "page/title": "Mozilla"}
]);

// Let's extend our schema. In the real world this would
// typically happen across releases.
schema.attributes.push({"name": "page/visitedAt",
                        "type": "instant",
                        "cardinality": "many",
                        "doc": "A visit to the page."});
await db.ensureSchema(schema);

// Now we can make assertions with the new vocabulary
// about existing entities.
// Note that we simply let Mentat find which page
// we're talking about by URL -- the URL is a unique property
// -- so we just use a tempid again.
await db.transact([
  {"db/id": datomish.tempid(),
   "page/url": "https://mozilla.org/",
   "page/visitedAt": (new Date())}
]);

// When did we most recently visit this page?
let date = (await db.q(
  `[:find (max ?date) .
    :in $ ?url
    :where
    [?page :page/url ?url]
    [?page :page/visitedAt ?date]]`,
  {"inputs": {"url": "https://mozilla.org/"}}));
console.log("Most recent visit: " + date);
```
Project Mentat is implemented in ClojureScript, and currently runs on three platforms: Node, Firefox (using Sqlite.jsm), and the JVM. We use DataScript’s excellent parser (thanks to Nikita Prokopov, principal author of DataScript!).
Addition, January 2017: we are in the process of rewriting Mentat in Rust. More blog posts to follow!
Nick has just finished porting Tofino’s User Agent Service to use Mentat for storage, which is an important milestone for us, and a bigger example of Mentat in use if you’re looking for one.
What’s next?
We’re hoping to learn some lessons. We think we’ve built a system that makes good tradeoffs: Mentat delivers schema flexibility with minimal boilerplate, and achieves similar query speeds to an application-specific normalized schema. Even the storage space overhead is acceptable.
I’m sure Tofino will push our performance boundaries, and we have a few ideas about how to exploit Mentat’s schema flexibility to help the rest of the Tofino team continue to move quickly. It’s exciting to have a solution that we feel strikes a good balance between storage rigor and real-world flexibility, and I can’t wait to see where else it’ll be a good fit.
If you’d like to come along on this journey with us, feel free to take a look at the GitHub repo, come find us on Slack in #mentat, or drop me an email with any questions. Mentat isn’t yet complete, but the API is quite stable. If you’re adventurous, consider using it for your next Electron app or Firefox add-on (there’s an example in the GitHub repository)… and please do send us feedback and file issues!
Acknowledgements
Many thanks to Lina Cambridge, Grisha Kruglov, Joe Walker, Erik Rose, and Nicholas Alexander for reviewing drafts of this post.