Columnar Databases - Joe Celko's Complete Guide to NoSQL: What Every SQL Professional Needs to Know about NonRelational Databases (2014)

Joe Celko's Complete Guide to NoSQL: What Every SQL Professional Needs to Know about NonRelational Databases (2014)

Chapter 2. Columnar Databases

Abstract

This chapter discusses columnar databases, which use a nontraditional storage model. Columnar databases use traditional structured data and often run some version of SQL. The difference is in how they store the data. The traditional row-oriented approach is replaced by putting data in columns that can be assembled back into the familiar rows of an RDBMS model. Since columns are drawn from one and only one data type and domain, they can be compressed for faster access, and columns are easier to share among multiple users and distribute over modern redundant hardware storage systems, such as RAID.

Keywords

columnar data storage; hashing; MPP (massively parallel processing); multidimensional database (MDB); OLAP (online analytical processing); RAID (redundant array of independent disks); row-based optimizers; SMP (symmetric multiple processing); Sybase IQ; text compression

Introduction

Since the days of punch cards and magnetic tapes, files have been physically contiguous bytes that are accessed from start (open file) to finish (end-of-file flag = TRUE). Yes, the storage could be split up on a disk and the assorted data pages connected by pointer chains, but it is still the same model. Then the file is broken into records (more physically contiguous bytes), and records are broken into fields (still more physically contiguous bytes).

A file is processed in record by record (read/fetch next) or sequentially navigated in terms of a physical storage location (go to end of file, go back/forward n records, follow a pointer chain, etc.). There is no parallelism in this model. There is also an assumption of a physical ordering of the records within the file and an ordering of fields within the records. A lot of time and resources have been spent sorting records to make this access practical; you did not do random access on a magnetic tape and you could not do it with a deck of punch cards.

When we got to RDBMS and SQL, this file system model was still the dominant mindset. Even Dr. Codd fell prey to it. He first had to have a PRIMARY KEY in all tables, which corresponded to the sort order of a sequential file. Later, he realized that a key is a key and there is no need to make one of them special in RDBMS. However, SQL had already incorporated the old terminology and the early SQL engines were built on existing file systems, so it stuck.

The columnar model takes a fundamentally different approach. But it is one that works nicely with SQL and the relational model. In RDBMS, a table is an unordered set of rows that have exactly the same kind of rows. A row is an unordered set of columns all of the same kind, each of which holds scalar values drawn from a known domain. You access the columns by name, not by a physical position in the storage, but you have the "SELECT*" and other shorthand conventions to save typing.

The logical model is as follows: a table is a set of rows with one and only one structure; a row is a set of columns; a column is a scalar value drawn from one and only one domain. Storage usually follows this pattern with conventional file systems, using files for tables, records for rows, and fields for columns. But that has nothing to do with physical storage.

In the columnar model, we take a table and store each column in its own structure. Rows and tables are reassembled from these rows. Looking at the usual picture of a table, you can see why they are called vertical storage as opposed to horizontal storage models.

2.1 History

Column stores and inverted or transposed files are not that new. TAXIR was the first column-oriented database storage system that was built in 1969 for biology. Statistics Canada implemented the RAPID system in 1976 and used it for processing and retrieval of the Canadian Census of Population and Housing, as well as several other statistical applications. RAPID was shared with other statistical organizations throughout the world and used widely in the 1980s. It continued to be used by Statistics Canada until the 1990s.

For many years, Sybase IQ was the only commercially available columnar DBMS. However, when OLAP (online analytical processing) became popular, other products saw that they could take some of the techniques used for cubes and rollups and apply them to more general databases.

Given a column that has only one kind of simple data type in it, you can do a lot of compression. To reconstruct the rows, you have to break the pure set model and number the rows, as you did in a sequential file. Each row is reconstituted by looking at this row number. Contiguous row numbers with the same data value can be modeled as ranges: {start_position, end_position, data_value}. This is the simplest format of compression, but it is very powerful. Much data is encoded in discrete values, so there are often long runs of identical values in a column.

Then we can compress the data value using specialized routines built for just that domain and the data type it uses. It would be silly to have a domain token the same size or larger than the data value. The payoff comes with wider columns. Building a lookup table with a short token for a longer value is easy. Now the model changes to {start_position, end_position, domain_token} references {domain_token, data_value}.

For example, the area code in U.S. phone numbers use the regular expression of [2–9][0–9][0–9], which means that we can have at most 800 of them. Instead of three ASCII characters, we can use a SMALLINT or a BCD for the token and get a lookup table that is tiny. This example is not a big savings, but it adds up in terabyte databases. A stronger example would be city and town names in the United States; there are slightly over 30,000 municipalities as of 2012. You can store 65,535 tokens in a 2-byte integer; very few towns have a name of only two letters, and many of them are duplicates (before you ask, “Springfield” is the most common town name, which is why it is used on television shows, most famously The Simpsons). Likewise, a 2-byte integer can sequence almost 180 years of dates.

It also gives us a single place to add certain integrity constraints. For example, area codes 666 and 777 are valid areas codes but were not assigned as of 2012. Likewise, 555 is dummy phone exchange that you will see in movies; before all-digit dialing, “KLondike 5” was how you guaranteed the phone number could not be called. Listen for it in old movies, cartoons, and modern American television shows. You can simply leave invalid values out of the lookup table!

Freeform text compression is well known. For a database, we want a lossless compression—that means that when you decompress it, you get back everything you put in it. In music and video, we can trade some loss for speed or space without being hurt. The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ optimized for decompression speed and compression ratio, but compression can be slow. DEFLATE is used in PKZIP, gzip, and PNG. LZW (Lempel–Ziv–Welch) is used in GIF images. Also noteworthy are the LZR (LZ–Renau) methods, which serve as the basis of the Zip method.

LZ methods use a table of substitutions for repeated strings. The table itself is often Huffman encoded (e.g., SHRI, LZX). This works well for human languages because grammatical affixes and structure words (prepositions, conjunctions, articles, particle markers, etc.) make up the bulk of text. But encoding schemes also tend to follow patterns. A bank account number starts with the code for the American Bankers Association bank number in the United States. A product serial number includes a product line. Hierarchical and vector encoding schemes invite this kind of compression.

A lot of work has been done in recent years with minimal perfect hashing algorithms (see the “A Quick Look at Hashing” sidebar if you do not the technique). When the column is a relatively static set of strings, this is ideal. Any set of values, even those without common substrings, are hashed to a short, fixed-length data token that can be computed quickly.

The naive reaction to this model is that it is complicated, big, and slow. That is only half right; it is complicated compared to simple sequential field reading. But it is not slow. And it is not big.

At one point the computer processing unit (CPU) was the bottleneck, but then SMP (symmetric multiple processing), clusters, and MPP (massively parallel processing) gave us hundreds or thousands of CPUs running faster than was ever possible before. Boyle’s law is a bit of IT folklore that computer processing doubles in speed and halves in cost every 18 months. At the same time, data is getting bigger. A decade ago, 20 gigabytes was considered unmanageable; now small Internet startups are managing terabytes of data.

MPP uses many separate CPUs running in parallel to execute a single program. MPP is similar to SMP, but in SMP all the CPUs share the same memory, whereas in MPP systems, each CPU has its own memory. The trade-off is that MPP systems are more difficult to program because the processors have to communicate and coordinate with each other. On the other hand, SMP systems choke when all the CPUs attempt to access the same memory at once. However, these CPUs sit idle more often than not. This is due to the inability of the pre-CPU layers of memory—L2 (especially) and L1 caches—to throughput data rapidly. We are still moving complete records in a row-based model of data.

As an analogy, think about your personal electronics. If you want a few tracks from an album, it is cheaper to purchase only those tracks from iTunes. When you want most of the tracks, you will save money and time by purchasing the whole album, even though you only play some of the tracks.

A query is not the same kind of search-and-retrieval problem. The query either wants a column or it does not, and you know what that column is when you compile the query. The speed comes from the fact that assembling the rows with modern processor speeds and large primary storage is immensely faster than reading a single byte from a moving disk. In 2012, IBM was able to reduce the size of tables in DB2 to one-third or smaller of the original tables. This savings keeps paying off because the {domain_token, data_value} is all that is read and they can be put in faster storage, such as solid-state disk (SSD).

The major significant difference between columnar- and row-based storage is that all the columns of a table are not stored in data pages. This eliminates much of the metadata that is stored on a data page. This metadata varies from product to product, but the most common metadata is the displacement of the rows (records) from the front of each data page and perhaps some other metadata about the data page. Within that there is metadata for the displacement of the columns from the front of each row and if the column is null. This is how the SQL data management system knows where to place the read–write head on the disk and what to return.

In particular, when early SQL products put a VARCHAR(n) column into a row, they would be allocated in the order of the columns in the CREATE TABLE statement. But this meant they would allocate storage for the full n characters, even if the data is shorter. The kludge we used was to put all the VARCHAR(n) columns at the end of the row in the DDL manually. Today, DB2 and Oracle do this internally—rearrange the columns on output and hide it from the user.

The columnar model can compress data in place or store a pointer to a list of strings, such as first_name with a lookup table {1 = Aaron, 2 = Abe, 3 = Albert, …, 9999 = Zebadiah, …}. But now there is a trade-off; we can store VARCHAR(n) as fixed length to speed up searching and not be hurt because the lookup table is small. Each lookup value appears only once, so we still have a relatively small amount of storage.

Obviously, these metadata maps had to be updated as records are inserted or deleted on the data page. Deletion is easy; the rows that are removed can be flagged immediately and ignored when the column is read. Insertions can be added to the end of the column structure. While this works, it is nice to have clusters of identical values to keep the size small and make searching by the data values easier. There are utility programs to recover storage—columnar versions of the garbage-collection routines from other memory management systems.

Columnar databases can have indexes, but most of them do not. In effect, columnar storage itself is an index. The displacements also has to be handled by the indexes.

A Quick Look at Hashing

Indexing and pointer chains involve a physical search to locate data. Given a search key, you can traverse from one node in a pointer chain to another until you either find a node with the search value or find that it does not exist.

Hashing is a disk-access technique based on mathematics. You take the search key and put it into a formula. The formula returns a location in an array or lookup table. That location contains a physical storage address and you can then go directly to it to find the data in one disk access.

For larger amounts of data, hashing is much faster than indexing. Tree-structured indexes become deeper as the amount of data increases. The traversals eventually have to get to a leaf node in this tree to find the data. This can mean more disk accesses with larger amounts of data.

It is possible that two or more different search keys will produce the same hash key. This is called a collision or (more poetically) a hash clash. The search key can then be rehashed with another function; the second function if often a member of the same family of hashing functions, with some of the constants changed. There is proof for certain hashing functions that a maximum of five rehashings will produce unique results.

A hashing function that has no collisions is called a perfect hashing function. If the hashing function has no empty slots in the array, then it is minimal. A minimal perfect hashing function requires that the set of search values is fixed, so it can work for keywords in a compiler and similar data sets.

The basic tools for most hashing algorithms are:

Digit selection. Given a search key, we pull some of the digits from the number, perhaps rearraging them. If the population is randomly distributed, this is a good technique. This is actually used by department stores at the order pickup windows. The first letter of the last names clusters around certain letters (“S” for Smith; “J” for Jones, Johnson, etc., but almost nobody for “X”, “Y,” and “Z”); however, customer phone numbers are uniformly random in the last two digits.

Division. The mod (< key >, m) function with a prime number (m) can be very a good hash function (Lum et al., 1971). It will give you a result between (0, (m − 1)). The TOTAL and IMAGE/3000 databases came with a list of large primes to allocate hash tables.

Multiplication. This technique squares a key and then pulls out the middle digits. A five-digit key will produce a ten-digit square and you can use the middle three digits with good results. For example, 54,3212 = 2,950,771,041 and the middle digits are 077. The hash has to come from the middle or you will get too much clustering.

Folding. This technique pulls out continuous subsets of digits of size n and simply adds them. For example, given a five-digit key, you can add all five digits to get an answer in the range (0 ≤ hash ≤ 45). If you used pairs of digits, the range would be (0 ≤ hash ≤ 207). This is a weak technique, so exclusive-OR might be used instead of arithmetic and it is generally used with another technique.

Collision resolution techniques are also varied. The most common one is the use of buckets. A bucket is a hash table location that can hold more than one value. The two basic approaches are as follows:

Open address. This method tries to find a bucket in the hash table that is still open. The simplest approach is to start at the collision location and do a linear search for an open space from that point. Other similar techniques use more complicated functions than increments. Common functions are quadratics and pseudorandom number generators.

External chaining. You can add the new value to a linked list. This list might be in the main storage or have parts of it on disk. But with the right choice of the main storage table size, the overflow to disk can be less than 15% of the size of the hash table in the main storage. This method can be very effective.

2.2 How It Works

Since all values in a columnar store are of the same type and drawn from the same domain, calculating where the nth row is located is easy. All the columns are in the same order as they were in the original row, so to assemble the ith row, you go to the ith position in the relevant column stores and concatenate them. In the phone number example, go to area_codes, phone_exchange, and phone_nbr column stores and look for the ith records in each in parallel.

The area codes are relatively small, so they will come back first, then the exchanges, and finally the phone numbers. When I first saw this in the Sand (nee Nucleus) database, it was surprising. The test data was a set of over 5 million rows of Los Angeles, CA, public data and was set up to step through the data slowly for monitoring. The result set appeared on the screen of the test machine in columns, not rows. Columns did not materialize in the result set in left-to-right order, either!

2.3 Query Optimizations

Some columnar databases use row-based optimizers, which negates many of the benefits of columnar storage. They materialize “rows” (comprising of only the columns of the query, in effect doing selection and projection) early in the query execution and process them with a row-oriented optimizer. Column-based optimizers are able to divide the selection and projection functions into separate operations, which is a version of the MapReduce algorithms (these algorithms will be explained later).

The goal is to do as much with just the row numbers as possible before looking up the actual data values. If you can gather the columns in parallel, so much the better. Obviously, projection will come first since having the data in columns has already done this. But selection needs to be applied as soon as possible.

Notice that I have talked about columns being drawn from a domain. Most of the joins done in actual databases are equijoins, which means the columns in different tables are drawn from the same domain and matched on equal values. In particular, a PRIMARY KEYand its referencing FOREIGN KEY have to be in the same domain. The PRIMARY KEY column will contain unique values for its table, and the FOREIGN KEYs will probably be one to many.

We can add a table name to the columnar descriptor, making it into domain descriptors: {table_name, start_position, end_position, data_value}. This vector can be fairly compact; a schema will seldom have the 2-plus million tables that can be modeled with a simple integers. This structure makes certain joins into scans over a single domain structure. An index can locate the start of each table within the domain descriptor and access the tables involved in parallel.

2.4 Multiple Users and Hardware

One of the advantages of a columnar model is that if two or more users want to use a different subset of columns, they do not have to lock out each other. This design is made easier because of a disk storage method known as RAID (redundant array of independent disks, originally redundant array of inexpensive disks), which combines multiple disk drives into a logical unit. Data is stored in several patterns called levels that have different amounts of redundancy. The idea of the redundancy is that when one drive fails, the other drives can take over. When a replacement disk drive in put in the array, the data is replicated from the other disks in the array and the system is restored. The following are the various levels of RAID:

◆ RAID 0 (block-level striping without parity or mirroring) has no (or zero) redundancy. It provides improved performance and additional storage but no fault tolerance. It is a starting point for discussion.

◆ In RAID 1 (mirroring without parity or striping) data is written identically to two drives, thereby producing a mirrored set; the read request is serviced by either of the two drives containing the requested data, whichever one involves the least seek time plus rotational latency. This is also the pattern for Tandem’s nonstop computing model. Stopping the machine required a special command—“Ambush”—that has to catch both data flows at the same critical point, so they would not automatically restart.

◆ In RAID 10 (mirroring and striping) data is written in stripes across primary disks that have been mirrored to the secondary disks. A typical RAID 10 configuration consists of four drives: two for striping and two for mirroring. A RAID 10 configuration takes the best concepts of RAID 0 and RAID 1 and combines them.

◆ In RAID 2 (bit-level striping with dedicated Hamming-code parity) all disk spindle rotation is synchronized, and data is striped such that each sequential bit is on a different drive. Hamming-code parity is calculated across corresponding bits and stored on at least one parity drive. This theoretical RAID level is not used in practice.

◆ In RAID 3 (byte-level striping with dedicated parity) all disk spindle rotation is synchronized, and data is striped so each sequential byte is on a different drive. Parity is calculated across corresponding bytes and stored on a dedicated parity drive.

◆ RAID 4 (block-level striping with dedicated parity) is equivalent to RAID 5 except that all parity data is stored on a single drive. In this arrangement, files may be distributed between multiple drives. Each drive operates independently, allowing input/output (I/O) requests to be performed in parallel. Parallelism is a huge advantage for a database. Each session can access one copy of a heavily referenced table without locking or read head contention.

◆ RAID 5, RAID 6, and other patterns exist; many of them are marketing terms more than technology. The goal is to provide fault tolerance of drive failures, up to n disk drive failures or removals from the array. This makes larger RAID arrays practical, especially for high-availability systems. While this is nice for database people, we get more benefit from parallelism for queries.

2.5 Doing an ALTER Statement

ALTER statements change the structures in a schema. In the columnar model, ADD COLUMN and DROP COLUMN are pretty easy; a new columnar structure is created or an old one is removed from physical storage. In a row-oriented model, each row has to be compressed or expended with the alteration. The indexes will also have to be restructured.

Changing the data type is also harder in a traditional row-oriented database because of the same space problem. In the real world most of the alterations are to increase the physical storage of a column. Numbers are made greater, strings are made longer; only dates seem immune to expansion of a data value since the ISO-8601 has a fixed range of 0001-01-01 to 9999-12-31 in the standard.

In the columnar model, the changes are much easier. Copy the positional data into a new columnar descriptor and cast the old data value to the new data value. When you have the new columnar structure loaded, drop the old and add the new. None of the queries will have to be changed unless they have a data type–specific predicate (e.g., if a date became an integer, then foobar_date < = CURRENT_TIMESTAMP is not going to parse).

2.6 Data Warehouses and Columnar Databases

Data warehouses can move some workloads, when few columns are involved, to columnar databases for improved performance. Multidimensional databases (MDBs), or cubes, are separate physical structures that support very fast access to precomputed aggregate data. When a query asks for most columns of the MDB, the MDB will perform quite well relatively speaking.

The physical storage of these MDBs is a denormalized dimensional model that eliminates joins by storing the computations. However, MDBs get large and grow faster than expected as columns are added. The data in a MDB can be compressed in much the same way that it is in a columnar database, so it is relative easy to extract a subset of columns from a cube.

The best workload for columnar databases is queries that access less than all columns of the tables they use. In this case, less is more. The smaller the percentage of the row’s bytes needed, the better the performance difference with columnar databases.

Concluding Thoughts

A lot of important workloads are column selective, and therefore benefit tremendously from this model. Columnar databases perform well with larger amounts of data, large scans, and I/O-bound queries. While providing performance benefits, they also have unique abilities to compress their data.

Columnar databases have been around for a while and did very well in their niche. But they made a leap into the current market for two reasons. First, improved hardware, SSD in particular, made the differences between primary and secondary storage less clear. When there was a sharp difference between primary and secondary storage, compressing and decompressing data in and out of secondary storage was overhead. In SSD, there is no difference. The second factor is better algorithms. We have gotten really good at specialized compression on one hand, but we also have parallel algorithms that are designed for columnar data stores.

Reference

1. Lum VY, Yuen PST, Dodd M. Key to Address Transform Technique: A Fundamental Performance Study on Large Existing Formatted Files. Communications of the ACM 1971;228–239.