Preetam "recently" blogged about catena, a time-series metric store. There was another blog post about benchmarking boltdb by a Fog Creek engineer, also looking to write a time series database. This is something of a pattern in the Go community, which already boasts seriesly, InfluxDB, and prometheus; there are almost certainly others.

Time series data has been de rigueur at least since Etsy's seminal blog post on StatsD, though in reality that was just an inflection point. Time series modeling and graphing predate computer systems, but they have been a popular way of tracking and visualizing systems and networking data since at least the early 90s, with MRTG. A few factors are converging now to make these kinds of systems more important: "Big Data" is getting much, much bigger; virtualization and containerization have increased the number of independent "nodes" for a typical distributed application; and the economies of the cloud have put the brakes on the types of performance increases typically attributed to "Moore's Law."

This topic is relevant to my work at Datadog, and I've been thinking about it for a long time now. I've wanted to collect my thoughts somewhere for a while, because some of them are identical to those expressed in other recent projects, and others are antithetical to them. I figured this would make my input at worst interesting.

For a primer on this subject, please read Baron's Time-Series Database Requirements. There's a reason that most other recent articles cite it; it contains a brief but complete description of the problem, the requirements for many large-scale time-series users, and some surprises for the uninitiated. Some of the implications that are implicit in these requirements but perhaps not obvious to onlookers:

  • users of these systems over-collect to enable post-facto investigation into events/outages
  • because of this there is generally a ton of data, 99% of which is likely never read
  • the number of distinct timeseries can be very large
  • timeseries can range from being very dense (every second or more) to very sparse (a report run once a day/week/month)
  • the expectation is that reads on recent data are very low latency ("real time")
  • writes are inherently difficult to batch; if you are doing 1mm points/sec, chances are none of those are to the same series (translate this as file/row/column/key, depending on your data store)
  • reads must be efficient; most useful charts involve joins across dozens if not hundreds of series
  • data is collected at high resolution (a second or less) but eventually displayed at low resolution (minutes or more); despite this, these reads must be fast and accurate

Time-series Storage Layer

People often ask themselves:

  • do you optimise for writes?
  • do you optimise for reads?
  • what semantics must be supported by the store to enable the desiderata of the query layer?

The answer to the first two questions has to be yes: the storage layer must optimise for both writes and reads. You can't have a useful system if reads are not optimised, and you can't have a scalable system if writes are not optimised.

If you try to write multiple series to one indexed collection (think something like an SQL table with (key, timestamp, value) rows), then you're ranging across too much data when servicing reads. If you separate every series out into its own collection, you have possibly the world's worst write pattern for persisting to disk: high-frequency writes across all of your data.

This is why you can't measure your system's performance for 100m points by just batching up 10,000 points for each of 10,000 metrics: even with 1s-resolution reporting, it takes almost 3 hours for a series to accumulate 10,000 points; at the StatsD-popularized 10s resolution, it takes over a day. It's not a realistic write pattern.

Despite this, it's the desired write pattern. So people spool points in memory and try to flush them later in larger chunks. This is the idea behind LSM Tree stores, and it's why people are increasingly using them to solve this problem. I've seen a few basic approaches:

  • Use files (eg. RRD, Whisper)
  • Use an LSM tree backed store (eg. LevelDB, RocksDB, Cassandra)
  • Use a B-tree ordered k/v store (eg. BoltDB, LMDB)

These approaches have advantages and disadvantages. To explore how they work in practice, let's describe the problem a little differently:

A time-series is a vector of points.

There are only a few primary operations we have to deal with:

  • create a new vector
  • find a vector (varies)
  • append to a vector (O(1))
  • read from a vector (varies)
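
To make these operations concrete, here is a minimal in-memory sketch in Go. The names (Point, Store, and so on) are illustrative rather than taken from any particular project, and a real store would persist the vectors instead of holding them in a map; this only captures the shape of the problem.

    package tsdb

    // Point is a single observation in a series.
    type Point struct {
        Timestamp int64   // unix seconds (or finer, depending on resolution)
        Value     float64
    }

    // Store maps a series key to its vector of points. A real store has to
    // persist these vectors; this only captures the in-memory shape of the
    // problem.
    type Store struct {
        series map[string][]Point
    }

    func NewStore() *Store {
        return &Store{series: make(map[string][]Point)}
    }

    // Create makes a new, empty vector for a series.
    func (s *Store) Create(key string) {
        if _, ok := s.series[key]; !ok {
            s.series[key] = []Point{}
        }
    }

    // Find locates a vector; its cost varies with the backing structure
    // (hash table, B-tree, directory entry, ...).
    func (s *Store) Find(key string) ([]Point, bool) {
        v, ok := s.series[key]
        return v, ok
    }

    // Append adds a point to the end of a vector; amortised O(1).
    func (s *Store) Append(key string, p Point) {
        s.series[key] = append(s.series[key], p)
    }

    // Read returns the points in [from, to); its cost varies with layout.
    func (s *Store) Read(key string, from, to int64) []Point {
        var out []Point
        for _, p := range s.series[key] {
            if p.Timestamp >= from && p.Timestamp < to {
                out = append(out, p)
            }
        }
        return out
    }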

Using Files

Obviously all databases "use files" somehow, but the specific approach I am talking about here is the file-as-a-series approach. With this, you get highly efficient buffered appends and buffered linear reads for free. On reasonably sophisticated filesystems (zfs, ext4, xfs, likely others), directory entry lookups can be hash table based, which means they are O(1).
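As a rough sketch of what file-as-a-series appends look like (the 16-byte record layout and the one-file-per-series-key path are assumptions for illustration, not any particular project's format):

    package tsdb

    import (
        "encoding/binary"
        "math"
        "os"
        "path/filepath"
    )

    // appendPoint appends one fixed-width (timestamp, value) record to the
    // file backing a series: one file per series, opened in append mode, so
    // the write itself is a buffered O(1) append handled by the OS.
    func appendPoint(dir, series string, ts int64, value float64) error {
        path := filepath.Join(dir, series) // directory entry lookup finds the series
        f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
        if err != nil {
            return err
        }
        defer f.Close()

        var rec [16]byte
        binary.BigEndian.PutUint64(rec[0:8], uint64(ts))
        binary.BigEndian.PutUint64(rec[8:16], math.Float64bits(value))
        _, err = f.Write(rec[:])
        return err
    }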

Filesystems seem to be getting increasingly complex and difficult to understand, and tuning them is getting to be something of a black art. Although some of them were designed to handle millions of files, none of them are optimised specifically for this case. If your data is very sparse, the overhead for a small amount of data is significant (1 block, typically 4k). If your query ranges are small (and typically they are), you may be paging in mostly empty space when ranging across many timeseries that could have been packed together more efficiently. Finally, ranges across your entire dataset, which are vital to operations like backups and batch jobs, are slow, and it takes a long time to get all of your file handles in memory.

Using a tree-based store

With tree-based k/v stores, we have a few possible approaches. Keys can either identify a single series, a single point, or perhaps a time slice of a series. Regardless of your decision on your key structure or even your tree type, key lookups are going to be O(log n), which is potentially leaving a lot on the table.

At scales I'm interested in, the number of points is too high for a range over the keys to be anything but a CPU cache disaster, so k/v pairs that look anything like {series, timestamp} -> {point} simply won't work. Open source time-series stores in use at smaller scales may attempt to go this route, but I can say with a high degree of confidence that files are simpler and faster unless you regularly read a large percentage of your dataset. With O(log n) lookups, we don't want to add extra time to our reads as well.
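For reference, the key layout being discussed looks something like this, independent of which B-tree or LSM store sits underneath; the encoding is a sketch, but any such scheme has to sort by series first and timestamp second so a single series is a contiguous key range, and every lookup into that keyspace is still O(log n).

    package tsdb

    import "encoding/binary"

    // encodeKey builds a {series, timestamp} key that sorts by series id
    // first and timestamp second (assuming non-negative unix timestamps),
    // so a single series occupies one contiguous key range in a B-tree or
    // LSM store.
    func encodeKey(seriesID uint64, ts int64) []byte {
        key := make([]byte, 16)
        binary.BigEndian.PutUint64(key[0:8], seriesID)
        binary.BigEndian.PutUint64(key[8:16], uint64(ts))
        return key
    }

    // seriesRange returns the [start, end) key range covering a series
    // between from and to; reads become a range scan over these keys.
    func seriesRange(seriesID uint64, from, to int64) (start, end []byte) {
        return encodeKey(seriesID, from), encodeKey(seriesID, to)
    }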

That leaves us with writes to look at, which is where image and reality can start to blur. The COW approach used by both of these B-tree databases seems like it'd be a very poor fit for a huge volume of random writes; BoltDB creator Ben B. Johnson says as much. In practice it's a little less clear, but I think it's likely a wash. Both use mmap, and frequent tiny writes to most of the pages in your database will obviously dirty most of the pages in your database, causing some mighty I/O on fsync/msync.

Appends on LSMs might seem better, but the tree-merging compaction they use is a very intense operation, and I can't see how that complexity is irreducible for what amounts to storing and concatenating related vectors.

In general, I feel the tree approach puts efficient writes and efficient reads at odds with each other. Optimised local linear reads clash with the optimal write pattern of appending keys. Preferably, these would be complementary, or at worst orthogonal.

Query Semantics

Regarding the semantics that must be supported in the time-series store in service of the query language, I suspect that the only required semantics are that you can batch and parallelise reads; everything else is better served as an operation on top of a different store that narrows these reads, and better implemented outside the specialized, demanding constraints of time-series storage.

Dimensionality

Having thorough experience with systems that have and lack dimensionality, I have to say that it really is absolutely vital; at the "product" level, lacking support for it is a guarantee that something that does support it will eventually replace you. This is true of both companies attempting to enter the market and open source solutions.

Stuffing all your metadata into the metric key and then treating things as a key/value lookup isn't a particularly great model for querying; you want at least AND/OR logic across dimensions, e.g. az=us-east-1 AND role=db-master, version=2.1 OR version=2.3, and preferably a richer language. Graphite's metric name globbing system (system.cpu.az-us-east-1.role-db-master.host-*.avg) definitely isn't it.

But does the storage layer need to support any extra semantics to enable this?

Currently, dimensions are already implemented successfully on top of stores unaware of them; Prometheus does this with a separate index, for example. The general pattern is that queries across dimensions first look up the appropriate keys in an index (LevelDB, in Prometheus's case), identify the series involved, and then joins are done across those series by a query engine that supports all sorts of other aggregations.
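
A minimal sketch of that pattern, as my own illustration rather than Prometheus's implementation: an inverted index from dimension pairs to series IDs, where AND is set intersection and OR is set union. The resulting IDs are then handed to the storage layer as a batch of independent series reads, which is consistent with batched, parallelisable reads being the only semantics the store itself needs.

    package tsdb

    // index maps a "key=value" dimension pair to the set of series IDs
    // carrying it; this mirrors the separate-index pattern described above
    // (a real system would persist it in a k/v store rather than a map).
    type index map[string]map[uint64]struct{}

    func (ix index) add(dim string, seriesID uint64) {
        if ix[dim] == nil {
            ix[dim] = make(map[uint64]struct{})
        }
        ix[dim][seriesID] = struct{}{}
    }

    // and returns series IDs matching every dimension, e.g.
    // az=us-east-1 AND role=db-master.
    func (ix index) and(dims ...string) []uint64 {
        if len(dims) == 0 {
            return nil
        }
        var out []uint64
        for id := range ix[dims[0]] {
            match := true
            for _, d := range dims[1:] {
                if _, ok := ix[d][id]; !ok {
                    match = false
                    break
                }
            }
            if match {
                out = append(out, id)
            }
        }
        return out
    }

    // or returns series IDs matching any dimension, e.g.
    // version=2.1 OR version=2.3.
    func (ix index) or(dims ...string) []uint64 {
        seen := make(map[uint64]struct{})
        var out []uint64
        for _, d := range dims {
            for id := range ix[d] {
                if _, ok := seen[id]; !ok {
                    seen[id] = struct{}{}
                    out = append(out, id)
                }
            }
        }
        return out
    }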

It is perhaps at this level that some kind of cooperation would really bear fruit; some of the real-world cardinalities I've seen make me think it's unlikely related series could always be stored near each other on disk, but co-locating them on at least the same node is a distinct possibility.

In a dimension-aware store, the indexes involved in these queries will either point to tuples stored across potentially vast sectors of unrelated data, or contain the data themselves and be both enormous and slow to write to. While SSDs might smooth this, they still perform best when reading whole pages; locality still matters.

I suspect it will remain faster to do 1000 highly optimised reads than to do these kinds of ranges at scale.

I think there may be some clever maths to be discovered or combined here to save us from some of this madness, and do for time-series what HyperLogLog does for counting or what Bloom filters do for set membership, but I'm not that confident it will be generally useful. Google can get away with these tricks for Google Trends, but people generally test TSDBs with known data and will notice inaccuracies and query weirdness quickly.

Features like InfluxDB's continuous queries can alleviate this for known high-cardinality joins, but this particular implementation is expensive, and in practice these requirements can generally be aggregated into a single time-series manually without much effort.

Aggregation, Expiry, Etc.

Time-aggregation and data expiry are first-class citizens in round-robin databases, which are still really the open-source standard. There's rrdtool, an old mainstay in the industry, and there's graphite; specifically, its old "whisper" engine.

These have a few significant problems that make them unusable for some. RRDs are constant space, and a lot of legitimate time-series usage involves large numbers of sparse series, which end up wasting most of their pre-allocated space. Another problem is that they do compaction/aggregation on writes, which causes a lot of write amplification.

If you get the basic storage layer right, aggregation and expiry start to look a lot like dimensionality: they can be implemented asynchronously in a separate policy layer. It doesn't seem important that the actual storage engine is mindful of these; in fact, ignoring them is probably better for efficiency's sake.

If you are treating a time-series database as a holistic concept, you'll definitely want some kind of approach to this; there's simply no good way of combining thousands of 1s-resolution series across 6 months that doesn't involve a lot of processing power or waiting quite some time.
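
As a rough illustration of what that separate policy layer does (a sketch assuming sorted input, reusing the Point type from the first sketch): roll 1s points up into fixed-width buckets, such as per-minute averages, asynchronously over old data, so the storage engine only ever sees ordinary reads and writes.

    // downsample rolls already-sorted points up into fixed-width buckets
    // (e.g. 60s) and emits one averaged point per bucket, stamped at the
    // bucket start. Reuses the Point type from the earlier sketch.
    func downsample(points []Point, bucket int64) []Point {
        if len(points) == 0 || bucket <= 0 {
            return nil
        }
        var out []Point
        cur := points[0].Timestamp - points[0].Timestamp%bucket
        sum, n := 0.0, 0
        for _, p := range points {
            b := p.Timestamp - p.Timestamp%bucket
            if b != cur {
                out = append(out, Point{Timestamp: cur, Value: sum / float64(n)})
                cur, sum, n = b, 0, 0
            }
            sum += p.Value
            n++
        }
        return append(out, Point{Timestamp: cur, Value: sum / float64(n)})
    }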

Jun 30 2015