A conceptual introduction to Project Mentat
This post is intended for a particular audience: developers who perhaps haven’t done lots of database or data representation work, want to get started working on Mentat, and are looking for a ‘quick fix’ of context.
You should have already read Introducing Project Mentat. This post might fill in some gaps for you.
This post covers, very briefly:
- What is a database?
- What is a schema?
- What is event sourcing?
- What is Datomic?
- What is SQLite?
- What is Project Mentat?
The answers are relatively brief and somewhat opinionated, but they should offer enough of a starting point for further research. Disagreement is welcome!
What is a database?
Databases of all kinds share a common goal: to provide persistent storage and querying to one or more applications. Beyond that, different tradeoffs yield a surprising variety of solutions.
The traditional “databases”, as most developers now understand the term, are relational SQL databases. The only interaction is via a textual query language, SQL. These databases are typically ACID: atomic, consistent, isolated, and durable.
Atomic means that a write (indeed, a related collection of writes) either happens in its entirety or doesn’t happen at all.
Consistent means that rules and conditions expressed in the database (e.g., foreign key constraints, NOT NULL constraints, type definitions) continue to apply at all times.
Isolated means that readers and writers don’t see each other while they’re working. Writes conceptually happen in one instant in time, and within a particular transaction you are isolated from those moments.
Durable means that once data is written it isn’t lost.
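To make atomicity concrete, here’s a minimal sketch using Python’s built-in sqlite3 module (the accounts table is invented for illustration): a transaction that fails partway through leaves no trace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # The connection as context manager wraps a transaction: it commits on
    # success and rolls back if an exception escapes the block.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'alice'")
        raise RuntimeError("simulated crash between the two writes")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'bob'")  # never reached
except RuntimeError:
    pass

# Atomicity: the half-finished transfer was rolled back, so no money vanished.
print(conn.execute("SELECT name, balance FROM accounts").fetchall())
# [('alice', 100), ('bob', 0)]
```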
Note that these properties have nothing to do with whether a DB uses SQL, is relational, or otherwise. However, so-called “NoSQL” databases often neglect some of these properties: it’s not uncommon for them to lose data that’s been confirmed as written (MongoDB being the most mocked example), not expose transaction primitives (or for their transactional guarantees to only apply within a single row or document), or not bother with consistency constraints at all.
Some have argued that these properties (and other characteristics of relational databases) are obsolete: for example, that in-database consistency constraints are better handled by business logic inside an application. Different databases draw these lines in different places.
You might have heard terms like “eventually consistent” used to describe distributed systems. (Eventual consistency simply means that, if you wait long enough, all readers will see the same last write.) Distributed systems are hard, and many hosted NoSQL databases are clustered in order to scale, forcing them to contend with the CAP theorem. We won’t dig any deeper into that, because for the purposes of this post we aren’t concerned with distributed storage.
Different databases have opinions about the kinds of data they store and the way they model it. Sometimes these opinions are so pervasive that we don’t really notice them.
- Relational databases store relations between entities. SQL databases model relations as tables with an arbitrary number of columns. Entities are rows in some table, and are identified by keys. Queries join relations (tables) to yield new relations. SQL databases are not ideal for storing graphs (graph traversal requires recursive joins, which are a relatively recent SQL feature; see the sketch after this list), documents, unstructured data, etc., though PostgreSQL is often good enough.
- Document databases store content without establishing an up-front explicit schema. They often use JSON or XML as a native data format. (‘Schemaless’ storage seems like a time saving, but see below.)
- Graph databases model data as links between nodes. Sometimes those links can themselves be annotated. Querying is via path traversals and conditionals, which can be very natural for some domains. Typically graph databases are designed so that graph operations like “find me related actors” are faster than they would be against a graph modeled in another kind of database.
- Geospatial databases focus on spatial coordinates as a primary way of finding things.
- Applications often use an ad hoc flat file as a database: it’s read into memory and flushed to disk when changed. Typically this is a knee-jerk response to a badly configured database (“I don’t need all that complex database stuff!”), or to a proliferation of independent databases, and it forces the application developer to manually choose when to flush, how to query data in memory, how to handle scaling, etc. Flat files are a good solution for data that rarely changes, isn’t concurrently modified, and is simple to query; configuration files are a good example. They’re a bad solution for data that changes frequently and needs to be read or written transactionally. Firefox has well-documented problems with session store.
- And so on.
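To make the recursive-join point concrete, here’s a minimal sketch in Python using the built-in sqlite3 module (SQLite has supported recursive common table expressions since 3.8.3; the edges table is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (parent TEXT, child TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")])

# Everything reachable from 'a': the table is joined to itself, recursively.
rows = conn.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 'a'
        UNION
        SELECT e.child FROM edges AS e JOIN reachable AS r ON e.parent = r.node
    )
    SELECT node FROM reachable ORDER BY node
""").fetchall()
print([node for (node,) in rows])  # ['a', 'b', 'c', 'd', 'e']
```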
What is a schema?
A description of your domain. A recipe for the shape of your data. Go read Martin Fowler’s take on schemaless storage.
Database schemas are also often where indices/indexes are described: relational databases conflate semantic, structural, and index descriptions of data into a single schema.
Relational databases (and others) use indexes to make queries fast. An index in a relational database is (usually) a copy of all or part of a table, stored in a different order, with metadata to facilitate finding the right values.
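Here’s a small sketch of that in Python with the built-in sqlite3 module (table and index names invented): EXPLAIN QUERY PLAN shows the planner switching from a full scan to an index search once the sorted copy exists. The exact plan text varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, title TEXT)")

# Without an index, a lookup by url has to scan the whole table.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT title FROM pages WHERE url = 'x'").fetchall()
print(plan)  # detail column reads something like 'SCAN pages'

# The index is a sorted copy of the url column, pointing back at the rows.
conn.execute("CREATE INDEX pages_url ON pages (url)")
plan = conn.execute("EXPLAIN QUERY PLAN SELECT title FROM pages WHERE url = 'x'").fetchall()
print(plan)  # detail column reads something like
             # 'SEARCH pages USING INDEX pages_url (url=?)'
```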
What is event sourcing?
Event sourcing is the idea that your application state is the end result of a sequence of events applied since some earlier state, and that the fundamental modeling construct is therefore to record the sequence of events directly, deriving other data structures from the event stream. This should feel very familiar to React/Redux developers.
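A minimal sketch of the idea in Python, with event shapes invented for illustration: the log is the source of truth, and both current and historical state are folds over it.

```python
from functools import reduce

# The event log is the primary record; everything else is derived from it.
events = [
    {"type": "page_visited", "url": "https://example.com"},
    {"type": "page_titled", "url": "https://example.com", "title": "Example"},
    {"type": "page_visited", "url": "https://example.com"},
]

def apply(state, event):
    """A reducer, in Redux terms: old state + one event -> new state."""
    page = dict(state.get(event["url"], {"visits": 0}))
    if event["type"] == "page_visited":
        page["visits"] += 1
    elif event["type"] == "page_titled":
        page["title"] = event["title"]
    return {**state, event["url"]: page}

# Current state is a fold over the whole log...
print(reduce(apply, events, {}))
# {'https://example.com': {'visits': 2, 'title': 'Example'}}

# ...and any historical state is a fold over a prefix of it.
print(reduce(apply, events[:1], {}))
# {'https://example.com': {'visits': 1}}
```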
Event sourcing is loosely related to CQRS, which is the idea that the readers and the writers in your system are best served by different data representations. Our approach takes a position on event sourcing but not on CQRS, though you’ll often see the two discussed together.
Again, go read Martin Fowler, and see this list of further reading.
What is Datomic?
Datomic is a closed-source database, written in Clojure and running on the JVM, and built and maintained by Cognitect. Datomic has a rich schema language, stores relational data (albeit a little more loosely than a SQL database does), and is distinguished by its attitude to time and change. The history of all changes is accessible to application code. The schema is similarly accessible. Schema definitions can evolve over time, with older definitions available just like older data. Applications can query past (and hypothetical future!) states of the system.
Datomic’s record of the data it stores is — unlike traditional relational databases — very aware of time and state, in a similar way to how Clojure makes explicit the distinction between values and identity in state.
All databases in consumer applications need to handle changing and growing data over time; Datomic includes this as part of its data model. By contrast, most databases entirely forget that changes ever took place, with changes only stored in a log for long enough to provide durability or replication: the stored data in the database itself at the current time is purely a snapshot, and applications that need to reflect time and change in their data model must do so explicitly. “Deletion” in Datomic is actually one of two things: retraction (stating that an earlier fact is no longer true) and excision (cutting out part of history as if it never took place, typically for legal or privacy reasons).
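To make that concrete: Datomic’s fundamental unit of storage is the datom, a five-tuple of entity, attribute, value, transaction, and an added/retracted flag. The following is a toy Python model of the idea (single-valued attributes only, and not Datomic’s actual API):

```python
# A datom: (entity, attribute, value, tx, added). Nothing is updated in
# place; a retraction is just a new datom with added=False. Excision, by
# contrast, would physically remove rows from this log.
datoms = [
    (1, "page/url",   "https://example.com", 100, True),
    (1, "page/title", "Exmaple",             100, True),
    (1, "page/title", "Exmaple",             101, False),  # retract the typo
    (1, "page/title", "Example",             101, True),   # assert the fix
]

def as_of(datoms, tx):
    """The database value as of transaction tx: replay assertions and
    retractions up to and including that point in time."""
    state = {}
    for e, a, v, t, added in datoms:
        if t > tx:
            break
        if added:
            state[(e, a)] = v
        else:
            state.pop((e, a), None)
    return state

print(as_of(datoms, 100))  # title reads 'Exmaple'
print(as_of(datoms, 101))  # title reads 'Example'; tx 100 is still queryable
```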
There are lots of other things that are a bit special about Datomic, like the distinction between peers and the transactor.
Read a conversational introduction (part 1, part 2), and watch Rich Hickey explain it.
Datomic is a service that runs alongside a broad array of existing storage systems (including AWS and Cassandra), using them to store index chunks.
What is SQLite?
SQLite is a very stable, quite fast, extraordinarily well-tested, embedded SQL database. Embedded (also called “serverless”) databases are no longer that common; most SQL databases — indeed, most ACID databases — are relatively large hosted servers like PostgreSQL, MySQL, etc.
We use SQLite extensively in Firefox, and Mozilla has a good relationship with its developers.
What is Project Mentat?
Mentat is an embedded datom store: essentially Datomic’s data model and schema interface expressed on top of a SQLite database.
Naturally, many of Datomic’s concepts — e.g., scaling reads by replicating index chunks to peers — don’t apply, and the concept of database-as-value is less relevant in an embedded system. But we preserve the ideas of a first-class transaction log, a domain-level schema, transaction listeners, and so on.
The principal advantages of Mentat in applications like Tofino are:
- It’s natural to grow the schema and make new relations between entities. Schemas change all the time in living applications.
- Schema modeling is done at the domain level (“a page can have multiple visits”) not at the storage level (“the visits table has a column with a non-unique, not null foreign key constraint that refers to the pages table”).
- Different parts of the application can cooperatively share a single database.
- The transaction log is available for querying (and for synchronization purposes).
- The query language makes it easy to express joins, particularly graph-like self joins that are very complex in SQL; a sketch follows this list. Here’s an introduction to the Datalog query language used by Datomic, DataScript, and Mentat.
- The architecture of the database makes it natural to address performance via materializing views and indexes, either inside or outside the database itself. For example, an attribute can be marked for full-text searching just by adding “:fulltext true” to the schema. Applications see every transaction as it occurs, and can thus build their own caches.
- Many of the mistakes made by developers adding ad hoc flexibility to a database (such as a “metadata” table containing strings, resulting in inefficient storage and slow querying) have been avoided: the schema itself offers enough flexibility that stringly-typed storage is unnecessary.
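To give a flavor of the schema-modeling and query-language points above, here’s a hedged sketch. The EDN schema and Datalog query are representative of Datomic/Mentat-style syntax, but the Python around them is just scaffolding, and the usage shown in comments is hypothetical rather than Mentat’s actual API.

```python
# Domain-level schema: "a page can have multiple visits", stated directly.
schema = """
[{:db/ident       :page/url
  :db/valueType   :db.type/string
  :db/cardinality :db.cardinality/one
  :db/unique      :db.unique/identity}
 {:db/ident       :page/visits
  :db/valueType   :db.type/ref
  :db/cardinality :db.cardinality/many}
 {:db/ident       :visit/date
  :db/valueType   :db.type/instant
  :db/cardinality :db.cardinality/one}]
"""

# Datalog query: URLs of pages visited since some instant. The joins are
# implicit in the shared variables ?page and ?visit; no JOIN clauses needed.
query = """
[:find ?url
 :in $ ?since
 :where
 [?page  :page/url    ?url]
 [?page  :page/visits ?visit]
 [?visit :visit/date  ?date]
 [(> ?date ?since)]]
"""

# Hypothetical usage, for flavor only (not Mentat's real API):
#   store = Store("places.db")
#   store.transact(schema)
#   store.q(query, since=some_timestamp)
```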
Mentat uses a combination of SQLite’s own ACID properties and sequential writes to achieve ACID guarantees (more or less).
It’s worth recognizing at this point that how Mentat stores data in SQLite is an implementation detail. We could split up our datoms table into pieces; we could store the transaction log in an ATTACHed database; we could even automatically derive traditional ‘wide’ database tables where appropriate. The abstraction boundary is quite opaque, and only the transactor and the query engine need to know about the details. Abstracting storage in this way is itself valuable: we can make significant changes in how Mentat is implemented without altering our API surface, and improvements under the surface are immediately available to all consumers.
What next?
This post covered some context, but doesn’t address exactly how Mentat is built. Some of the pages on the project wiki cover that, but another post might be forthcoming.