Choose SQL seems to have struck something of a nerve.

I agree with the underlying premise. I've written a few popular blog posts about using SQL and a few SQL-focused libraries in different languages.

I've also written a distributed key/value time-series database (on easy-mode, based on a pre-existing embedded storage library) which has nearly antipodal guarantees and semantics to a relational database.

So yes, if you're starting from zero, a relational database is almost certainly the Right Option™ because it's so flexible.

Still, despite the fact that I basically agree, I'm a little uncomfortable with Choose SQL because it doesn't have much substance. Pretty much every popular logical fallacy is well represented: straw man, appeal to authority, bandwagoning, etc. Classic problems with relational databases are waved away as YAGNI or "you can just pay someone else to solve these."

Instead of redressing this with quality research and reasoned argument, I'm going to share some observations (admittedly without any data) I've made on large data systems and struggles I've seen and had with both relational and non-relational databases.

Lots of this will sound very unequivocal, even though there are nearly always caveats. Generally, you should consider such advice either a) wrong or b) applying to anyone who doesn't already realize it doesn't apply to them.

on structure

The promise of "schemaless" is a fiction. Data structures are central to programming, and the way you structure your data will determine what you can do with it more than any other factor. This relationship is bidirectional; what you need to do with your data usually also ends up influencing how you structure it. All data sets have a schema, and if yours doesn't, then building that schema is likely to be the core of your product.

RDBMS indexes are key/value stores with transactional reads and writes that can be queried via SQL. All else being equal, this is an incredibly powerful mix of features. You will have a very hard time finding a non-relational database that gets you the same level of performance, data integrity, tooling, and query flexibility.
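
A small sketch of that mix, using Python's built-in sqlite3 module as a stand-in for a real RDBMS: an indexed key column gives you key/value point lookups, while transactions and ad hoc SQL come along for free.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

with conn:  # one transaction; rolled back automatically on error
    conn.executemany("INSERT INTO kv (k, v) VALUES (?, ?)",
                     [("user:1", "alice"), ("user:2", "bob")])

# key/value style point lookup, served by the primary-key index
print(conn.execute("SELECT v FROM kv WHERE k = ?", ("user:1",)).fetchone())

# ...and the same data is still open to arbitrary SQL
print(conn.execute("SELECT count(*) FROM kv WHERE v LIKE 'a%'").fetchone())
```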

Keys can make or break your schema, and most real-world relational schemas will still require some kind of ID key, because most records in a relational database will need to be referenced from somewhere else.

32-bit keys are verboten: they don't save meaningful resources for small datasets, and they aren't enough keyspace for larger ones. It doesn't matter that you'll "never have 4 billion foobars", because there are a bunch of other ways that contention could force you into distribution.

Avoid auto-increment/serial keys generated on insert, which are the default in lots of ORMs, as they're another impediment to distribution. If ordered keys are a requirement, use flake IDs; if they aren't, generate a key based on the natural uniqueness requirements in your schema.
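
For illustration, here's a rough sketch of a Snowflake-style 64-bit flake ID generator: roughly time-ordered, and generated with no coordination beyond handing each worker a small ID. The epoch and bit widths below are the commonly cited layout, not a prescription.

```python
import time
import threading

EPOCH_MS = 1288834974657  # arbitrary custom epoch, in milliseconds

class FlakeGen:
    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024   # 10-bit worker ID
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence per ms
                if self.seq == 0:                  # sequence exhausted; wait for next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            # 41 bits of timestamp | 10 bits of worker | 12 bits of sequence
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.seq

gen = FlakeGen(worker_id=1)
print(gen.next_id(), gen.next_id())
```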

Coordination is so difficult and so fraught that almost any effort spent avoiding it is worthwhile.

on size

There is confusion about what constitutes "big" data because it's become such a vacuous marketing term and because the definition of big shifts as hardware and software systems evolve. There are also different kinds of bigness; fine granularity and high cardinality will generally hurt sooner than sheer volume will. The answer to "what weighs more, 1GB of YouTube videos or 1GB of tweets?" is the tweets, because they're likely to result in far more read and write ops. If your data is generated by humans and you don't have 10+ million users, your data probably isn't big.
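
Some rough arithmetic behind that answer; the per-record sizes here are assumptions, not measurements.

```python
# 1GB of tweets is millions of small records to index, query, and update;
# 1GB of video is a handful of blobs written once and streamed back.
GB = 1 << 30
tweet_bytes = 300          # a tweet plus its metadata, roughly
video_bytes = 100 << 20    # a short ~100MB video

print(GB // tweet_bytes)   # ~3.6 million rows
print(GB // video_bytes)   # ~10 large blobs
```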

Before you scale out, know how far you can scale up, and where that tops out for cost effectiveness on your platform. EC2 instances, as of this writing, can scale up to ~2TB of memory and 48TB of SSD; they are pricey, but it may be worth it if you have the cash and also have something better for your top engineers to do.

If you still aren't sure what "big" is and your dataset is smaller than the entire web circa 2002, you probably do not meet the threshold. Making that call when you're on the edge of bigness is difficult, but making it from the start is nearly impossible because your schema and access patterns have not yet fully revealed themselves.

There are a lot of problems out there that can be solved on a single laptop as well as on a 5-node MR cluster, and with GNU parallel and a few worker boxes as well as on a 20-node MR cluster, but there's no particularly effective shorthand for an N-node MR cluster. Eventually, the overhead can't be avoided, and that's when a "natively clustered" solution starts to pay off.
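
As a sketch of the laptop end of that spectrum, here's a word count fanned out over local cores with multiprocessing; the input paths are hypothetical, and the point is how little machinery the small case needs.

```python
import glob
from collections import Counter
from multiprocessing import Pool

def count_words(path: str) -> Counter:
    with open(path, errors="ignore") as f:
        return Counter(f.read().split())

if __name__ == "__main__":
    files = glob.glob("logs/*.txt")            # hypothetical input files
    with Pool() as pool:
        parts = pool.map(count_words, files)   # the "map" step
    total = sum(parts, Counter())              # the "reduce" step
    print(total.most_common(10))
```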

on querying

JSON-style query languages (MongoDB is the major example) have some upsides: they compose very naturally from trees, which means they can be created, modified, analyzed, serialized, and otherwise -ized very easily without special-purpose tools. They are also inside-out (like C-style declarations), which makes them awful in a REPL. SQL composition is much more linear, and is far superior for ad hoc reporting and poking around.
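
To make the contrast concrete, here's the same filter sketched both ways: a nested, Mongo-flavoured query document (the $-operators are illustrative) versus a SQL string you can build up left to right.

```python
import json

# Tree-shaped: trivial to construct, inspect, and serialize programmatically.
json_query = {
    "status": {"$in": ["active", "trial"]},
    "signup_ts": {"$gt": 1489000000},
}

# Linear: reads (and appends) left to right, which is what you want at a prompt.
sql_query = (
    "SELECT id, status FROM users"
    " WHERE status IN ('active', 'trial')"
    " AND signup_ts > 1489000000"
    " ORDER BY signup_ts DESC LIMIT 10"
)

print(json.dumps(json_query, indent=2))
print(sql_query)
```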

I didn't realize there was a groundswell of backlash tarring Map/Reduce as hipster faff. If this is a real thing, then it's wrong. It's not hipster faff, it's a seminal academic paper with nearly 20,000 citations. There is a reason it's subtitled "Simplifying Data Processing on Large Clusters", though. It's a very labour-intensive way to process data, so the benefits must offset the costs.

As due diligence, you should probably start feeling some of that pain before you go there, because the overhead of running such systems can be high. Generally, distributed computation for your data size must be a physical-as-in-physics requirement, and the results of your queries should be many orders of magnitude smaller than the amount of data they access.

Dynamo-style free distribution ain't free, but that doesn't mean it's not potentially cheaper and easier than RDBMS distribution techniques (generally: sharding). If you can handle the drawbacks (restricted query/access patterns, varying degrees of throughput/consistency tradeoffs, locality opacity, and more!), the upsides can be significant.
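
For a sense of what "Dynamo-style" placement looks like, here's a toy consistent-hash ring where each key maps to the next N nodes clockwise; this is a sketch of the placement idea only, and real systems add virtual nodes, hinted handoff, read repair, and so on.

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # each node gets a position on the ring based on its hash
        self.points = sorted((h(n), n) for n in nodes)

    def owners(self, key: str):
        keys = [p for p, _ in self.points]
        start = bisect.bisect(keys, h(key)) % len(self.points)
        out = []
        for step in range(len(self.points)):   # walk clockwise from the key
            node = self.points[(start + step) % len(self.points)][1]
            if node not in out:
                out.append(node)
            if len(out) == self.replicas:
                break
        return out

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.owners("user:42"))  # the replica set that stores this key
```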

strengths and weaknesses

The primary strengths of relational databases are safety, storage efficiency, and application flexibility.

Non-relational databases tend to vary more than relational ones, but their strengths are generally one or more of: simplicity, native clustering for scaling and availability, and distribution.

Note that speed is missing. Speed has more to do with how a datastore can implement your schema than with any natural quality of a datastore itself. If you have key/value access, then pretty much any B+Tree implementation is going to look similar, whether that's LMDB or a PostgreSQL index.

Guarantees have costs. It feels all warm and fuzzy to have ACID guarantees, but that doesn't mean that your problem requires them, and there can be significant advantages to dropping them where appropriate.

Writes are not reads. Most relational databases are designed for write-once-read-many type loads, but increasingly machine-generated data is write-once-read-maybeduringanoutageorsomething; sadly the abbreviation there is the same. Cassandra and other LSM-style databases are specifically designed for write-heavy loads. LSM stands for "a technique to delay write amplification until it will cause an outage."
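
A very rough sketch of the LSM idea behind that joke: accept writes into an in-memory buffer, flush it as an immutable sorted run, and defer the expensive rewriting to later compaction (which this toy simply never does).

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # recent writes, cheap to accept
        self.sstables = []            # flushed, immutable sorted runs
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # flush: one sequential write, no rewriting of old data yet
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):   # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"), len(db.sstables))  # reads now have to consult multiple runs
```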

Suitability for "scalability" is something that is over-emphasized in the industry, which fetishizes large solutions regardless of quality. It sounds really cool to run a "10000 node cluster", because it's actually quite a difficult thing to do. Anyone who does that probably has a thing or two to teach you, even if the system is only that large because it is inefficient. It's just about as hard to manage a large system that is inefficient as it is to handle a large one that is efficient. In my experience, nothing you get out of the box is going to save you from issues as you start to scale out a cluster to a large number of machines.

Most open source relational databases do not have native clustering for failover and high availability, meaning that even if you are safe and have standbys and replicas, a primary outage is still going to cause you issues. Those that do have it either violate single-primary consistency guarantees or sacrifice latency (you know who you are), and multi-primary gets you HA and read/write scalability but not dataset scalability.

Most databases will eventually exhibit difficult-to-diagnose slow behaviour, regardless of type. In distributed databases like Dynamo, this slowness can spread across the cluster and impact availability in ways that are not obvious. Databases written in managed-memory languages will inevitably slow down due to GC, unless they manage all of their own memory on the sly (generally via memory-mapped files plus a custom allocator). For everything else, a compaction strategy will eventually go awry and do much the same thing, whether that's a vacuum or an LSM merge.

Finally, if you can find someone who has built the same type of system you want to build, ask them all the questions and listen to their answers. I have spoken to people who have worked on publicly available metrics products that compete with the one I work on, and to people who have created the in-house metrics products for enormous companies that you definitely know about, and not one of them has sounded the slightest bit alien. That kind of experience is invaluable.

Mar 22 2017