
Chapter 6. Key-Value Stores

Abstract

The key–value store model uses a key to locate a value of some kind. That value can be traditional data, BLOBs, files, or anything that can be put into computer storage. Likewise, the key can be accessed by hashing, indexing, brute-force scans, or any other appropriate method. This is the most primitive model for data retrieval short of a pile of unorganized data. It is not intended for queries, as all it can do is insert, update, delete, and retrieve (or return a “not found” message). The Berkeley database is the most popular open-source product of this type. Any meaning attached to the values comes from the host program as there is no schema to define anything.

Keywords

BDB (Berkeley database); BLOB; CLOB; key–value store; schema versus no schema

Introduction

A key–value store, also called an associative array, is a collection of (key, value) pairs that generalizes a simple array. The keys are unique within the collection and can be of any data type that can be tested for equality. This is the model behind the MapReduce family, but performance depends on how carefully the keys are designed; hashing becomes an important technique.

This model has only four basic operations (a minimal code sketch follows the list):

◆ Insert pairs into the collection.

◆ Delete pairs from the collection.

◆ Update the values of existing pairs.

◆ Find the value associated with a particular key. If there is no such pair, raise an exception or return a “not found” message.
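As promised, here is a minimal sketch of the four operations as a toy Python class. The class name, method names, and the dictionary backing are illustrative assumptions, not any product's API:

```python
class KeyValueStore:
    """Toy (key, value) store supporting the four basic operations."""

    def __init__(self):
        self._pairs = {}  # keys must be hashable and testable for equality

    def insert(self, key, value):
        if key in self._pairs:
            raise KeyError(f"duplicate key: {key!r}")
        self._pairs[key] = value

    def delete(self, key):
        del self._pairs[key]  # raises KeyError if the pair is absent

    def update(self, key, value):
        if key not in self._pairs:
            raise KeyError(f"no such key: {key!r}")
        self._pairs[key] = value

    def find(self, key):
        if key not in self._pairs:
            raise KeyError(f"not found: {key!r}")  # the "not found" message
        return self._pairs[key]
```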

6.1 Schema Versus No Schema

SQL started off as a project at IBM under the name SEQUEL, which stood for Structured English Query Language; the structure came from the DDL, the sublanguage in SQL that defines the schema. Files are not anything like tables; rows are not records; columns are not fields. Rows are made up of columns. Columns have a known data type with constraints and are scalar. There is no ordering among columns; you find a column by its name.

Likewise, tables are made up of rows. The rows have no ordering within a table; you find them with keys, not by physical position. There are table-level constraints and relationships. Finally, the unit of work is the whole schema, which has intertable constraints and relationships (CREATE ASSERTION, FOREIGN KEY, and PRIMARY KEY).

An empty table still has structure! Its columns are still strongly typed and its constraints are still enforced. It just works out that all table-level predicates are true for an empty set. An empty file is, well, nothing. A blank reel of magnetic tape is an empty file. But so is a disk directory entry that has a name and a NIL pointer or other end-of-file marker as its only data. All empty files look and behave the same way; you read them and immediately get an end-of-file flag raised.

Having no schema puts all of the data integrity (if any!) in the application. Likewise, the presentation layer has no way to know what will come back to it. These systems are optimized for retrieval and appending operations, and often offer little functionality beyond record storage. The safety and query power of SQL systems are replaced by better scalability and performance for certain data models.
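Since there is no schema, whatever integrity checking exists must be written into the application around the store. A minimal sketch of that burden, assuming a plain Python dictionary as the store and invented validation rules for our shoe records:

```python
# A plain dict stands in for a (key, value) store; the rules a schema
# would enforce must live in application code instead.
store = {}

def validate_shoe(value: dict) -> None:
    # Hand-rolled checks doing the work of DDL constraints.
    if not isinstance(value.get("style"), str):
        raise ValueError("style must be a string")
    size = value.get("size")
    if not isinstance(size, (int, float)) or not 1 <= size <= 24:
        raise ValueError("size must be a number between 1 and 24")

def put_shoe(key: str, value: dict) -> None:
    validate_shoe(value)  # every writer must remember to call this
    store[key] = value
```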

6.2 Query Versus Retrieval

NoSQL works with a huge quantity of data that does not need relational queries and integrity. The data can be structured, but that is a bonus, not a requirement. The classic example is an online store with a large inventory: when a customer gets to the website, we want to pull up parts of a web page with graphics and a short text description in HTML. Imagine an Internet shoe store as the example for the rest of this section.

This organization is particularly useful for statistical or real-time analysis of growing lists of elements, such as Twitter posts or the Internet server logs from a large group of users. There is no indexing or constraint checking; you deal with a huge pile of raw data. But it might not be clean.

6.3 Handling Keys

In this environment, how the keys are handled is vital. They have to be designed for the target data, not just assigned as sequential numbers. For our shoe store, we will have some internal product identifier, but it will probably be so obscure that no customer will know it when he or she comes to the website. But if we can construct an encoding that the user can understand, then life is easier.
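For example, a composite, human-readable key can encode attributes the customer actually knows. The make_key helper and its encoding of category, style, size, and width below are invented for illustration; they are not a standard scheme:

```python
def make_key(category: str, style: str, size: float, width: str) -> str:
    """Build a human-readable composite key such as 'BOOT-STEELTOE-10.5-EE'.

    A customer who knows the shoe can reconstruct the key;
    a sequential product number would tell them nothing.
    """
    return f"{category.upper()}-{style.upper()}-{size}-{width.upper()}"

key = make_key("boot", "steeltoe", 10.5, "ee")  # 'BOOT-STEELTOE-10.5-EE'
```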

Good key encodings are not always easy to find. As Jorge Luis Borges wrote in his essay “The Analytical Language of John Wilkins”:

These ambiguities, redundancies, and deficiencies recall those attributed by Dr. Franz Kuhn to a certain Chinese encyclopedia entitled Celestial Emporium of Benevolent Knowledge. On those remote pages it is written that animals are divided into (a) those that belong to the Emperor, (b) embalmed ones, (c) those that are trained, (d) suckling pigs, (e) mermaids, (f) fabulous ones, (g) stray dogs, (h) those that are included in this classification, (i) those that tremble as if they were mad, (j) innumerable ones, (k) those drawn with a very fine camel’s hair brush, (l) others, (m) those that have just broken a flower vase, (n) those that resemble flies from a distance.

A dropdown list with these categories would not be much help for finding shoes on our website. There is no obvious key. Many years ago, I worked with a shoe company that was doing a data warehouse. The manufacturing side wanted reports based on the physical properties of the shoes and the marketing side wanted marketing categories. Steel-toed work boots were one category for manufacturing. However, at that time, marketing was selling them to two distinct groups: construction workers with big feet and Goth girls with small feet. These two groups did not shop at the same stores.

The ideal situation is a key that can quickly find which physical commodity-hardware device holds the desired data. This is seldom possible in the real world unless your database deals with one subject area that has a classification system in place.

6.3.1 Berkeley DB

Berkeley DB (BDB), a software library, is the most widely used (key, value) database toolkit. Part of its popularity is that it can be embedded in so much other software. BDB is written in C with API bindings for C++, C#, PHP, Java, Perl, Python, Ruby, Tcl, Smalltalk, and most other programming languages. The keys and values can each be up to 4 gigabytes long, and a key can have multiple data items.

This product began at the University of California at Berkeley in the late 1980s. It changed hands until Oracle bought BDB in February 2006, which left its licensing a bit fragmented. While it is not relational, BDB has options for ACID transactions, a locking system enabling concurrent access to data, and a logging system for transactions and recovery. Oracle added support for SQL by including a version of SQLite. There is third-party support for PL/SQL in BDB via a commercial product named Metatranz StepSqlite.

A program accessing the database is free to decide how the data is to be stored in a record. BDB puts no constraints on the record’s data. We are back to the file model where the host program gives meaning to the data.
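Python's standard library offers the same style of programming through its dbm module (historically, one of its backends was BDB). A minimal sketch; note that the store accepts and returns only uninterpreted bytes:

```python
import dbm

# Open (creating if necessary) a file-backed (key, value) store.
with dbm.open("catalog", "c") as db:
    db[b"BOOT-STEELTOE-10.5-EE"] = b"Steel-toed work boot, size 10.5 EE"
    print(db[b"BOOT-STEELTOE-10.5-EE"])  # b'Steel-toed work boot, size 10.5 EE'
    # The store sees only raw bytes; the host program decides what
    # they mean, exactly as with BDB records.
```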

6.3.2 Access by Tree Indexing or Hashing

There is nothing in the (key, value) model that prohibits using the usual tree indexes or hashing for access. The main difference is that targets might be on other hardware drives in a large “platter farm” of commodity hardware.
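The trade-off between the two access paths can be sketched in a few lines of Python: a hash gives a one-probe point lookup but no useful key order, while a sorted structure (standing in here for a B-tree) also supports range scans. The sample keys are invented:

```python
import bisect

pairs = {"A10": "sandal", "B22": "boot", "B47": "loafer", "C05": "pump"}

# Hash access: one probe, but keys come back in no useful order.
value = pairs["B22"]

# Tree-style access: keep the keys sorted to support range scans,
# e.g., every key in the 'B' block, wherever its drive lives.
sorted_keys = sorted(pairs)
lo = bisect.bisect_left(sorted_keys, "B")
hi = bisect.bisect_left(sorted_keys, "C")
b_block = [(k, pairs[k]) for k in sorted_keys[lo:hi]]
# [('B22', 'boot'), ('B47', 'loafer')]
```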

6.4 Handling Values

SQL is a strongly typed language in which columns have a known data type, default, and constraints. I tell people that 85–90% of the real work in SQL is done in the DDL. Bad DDL will force the programmers to kludge corrections in the DML over and over. The queries will repeat constraints in predicates that could have been written once in the DDL and picked up by the optimizer from the schema information tables.

The (key, value) model does not say anything about the nature of the values. The data can be structured, but when it is, the data types are usually simple: exact and approximate numerics, strings, and perhaps temporal data. There are no defaults or constraints. The purpose of this model is the ability to store and retrieve great quantities of data, not to model the relationships among the elements. These are records in the old FORTRAN file-system sense, not rows from an RDBMS.

The advantage is that any host language program can use the data immediately. If you look at the SQL standards, each ANSI X3J language has rules for converting its data types to SQL data types. There are also indicators for handling nulls and passing other information between the host and the database.

6.4.1 Arbitrary Byte Arrays

Unstructured data is also kept in these products. The usual format is some kind of byte array: a BLOB, CLOB, or other large contiguous “chunk” of raw data. The host program decides what it means and how to use it. The (key, value) store simply wants to move it as fast as possible into the host.
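A sketch of that division of labor, using Python's pickle module to turn a structured object into an opaque byte array: the store only holds and returns the bytes, and the host program alone knows how to reconstruct the object. The shoe record and key are invented for illustration:

```python
import pickle

blob_store = {}  # stands in for any (key, value) store that holds raw bytes

shoe = {"style": "steel-toed boot", "sizes": [6, 10.5, 13]}
blob_store["shoe:1234"] = pickle.dumps(shoe)  # host -> opaque bytes

raw = blob_store["shoe:1234"]      # the store returns bytes, nothing more
restored = pickle.loads(raw)       # only the host knows it is a shoe record
```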

The fastest way to move data is to have it in primary storage rather than on secondary storage. Notice I am using “primary” rather than “core,” “RAM,” etc., and “secondary” rather than “disk,” “drum,” “RAID,” etc. The technology is changing too fast, and the line between primary (fast, direct addressing by processor) and secondary (slower, retrieved by other hardware) storage is getting blurred. My bet is that within a few years of 2015, SSDs (solid-state disks) will replace moving disks. This will make radical changes in how we think of computing:

◆ There will be no difference between primary and secondary storage speeds.

◆ Cores (formerly known as CPUs) will be so cheap that we will put them at all levels of the hardware. We will have mastered parallel programming.

◆ The result will be that a parallel table scan will be faster than using an index, and not that different from hashing.

In short, everything becomes an in-memory database.

As of April 2013, IBM was sending out press releases saying that the hard drive would soon be dead in the enterprise, and it put U.S. $1 billion into research to back up this prediction. IBM launched a line of flash-based storage systems, called FlashSystem, based on technologies IBM acquired when it purchased Texas Memory Systems in 2012. A set of FlashSystem units could be configured into a single rack capable of storing as much as 1 petabyte of data and producing 22 million IOPS (input/output operations per second). Getting the same level of storage and throughput from hard drives would require 315 racks of high-performance disks.

IBM also has hybrid flash-and-disk storage systems, including the IBM Storwize V7000, IBM System Storage DS8870, and IBM XIV Storage System. IBM realizes that not all systems would benefit from solid-state technologies right now; performance has to be a critical factor in a system's operations to justify the cost. As of 2013, generic hard drives cost about $2 per gigabyte, enterprise hard drives about $4 per gigabyte, and high-performance hard drives about $6 per gigabyte, while a solid-state disk costs about $10 per gigabyte.

6.4.2 Small Files of Known Structure

Since (key, value) stores are used for websites, one common value is a small file that holds a catalog page or product offering. XML and HTML are ideal for this: every browser can use them, hypertext links are easy, and they are simple text that is easy to update.
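A sketch of the idea, with an invented product key and HTML fragment: the store returns the fragment, and the web server drops it straight into the page:

```python
page_store = {}  # any (key, value) store will do; values are small HTML fragments

page_store["BOOT-STEELTOE-10.5-EE"] = (
    "<div class='product'>"
    "<img src='/img/steeltoe.jpg' alt='Steel-toed work boot'>"
    "<p>Steel-toed work boot, size 10.5 EE</p>"
    "</div>"
)

html = page_store["BOOT-STEELTOE-10.5-EE"]  # served straight into the catalog page
```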

6.5 Products

This is a quick list of products as of 2013, taken from Wikipedia, using their classifications.

Eventually consistent (key, value) stores:

◆ Apache Cassandra

◆ Dynamo

◆ Hibari

◆ OpenLink Virtuoso

◆ Project Voldemort

◆ Riak

Hierarchical (key, value) stores:

◆ GT.M

◆ InterSystems Caché

Cloud or hosted services:

◆ Freebase

◆ OpenLink Virtuoso

◆ Datastore on Google App Engine

◆ Amazon DynamoDB

◆ Cloudant Data Layer (CouchDB)

(key, value) cache in RAM:

◆ Memcached

◆ OpenLink Virtuoso

◆ Oracle Coherence

◆ Redis

◆ Nanolat Database

◆ Hazelcast

◆ Tuple space

◆ Velocity

◆ IBM WebSphere eXtreme Scale

◆ JBoss Infinispan

Solid-state or rotating-disk (key, value) stores:

◆ Aerospike

◆ BigTable

◆ CDB

◆ Couchbase Server

◆ Keyspace

◆ LevelDB

◆ MemcacheDB (using BDB)

◆ MongoDB

◆ OpenLink Virtuoso

◆ Tarantool

◆ Tokyo Cabinet

◆ Tuple space

◆ Oracle NoSQL Database

Ordered (key, value) stores:

◆ Berkeley DB

◆ IBM Informix C-ISAM

◆ InfinityDB

◆ MemcacheDB

◆ NDBM

Concluding Thoughts

Because this is one of the simplest and most general models for data retrieval, there are lots of products that use it. The real trick in this technique is the design of the keys.