
Chapter 5. Columnar Databases

A columnar DBMS is an implementation of relational theory, but with a twist. The data storage layer does not contain records; it contains groupings of columns.

Because column lengths vary within a row, a small column with low cardinality or variability of values may reside completely within one block, while a longer column with high cardinality may take a thousand blocks. In columnar, all the same data—your data—is there. It’s just organized differently (automatically, by the DBMS).

The main reason why you would want to utilize a columnar approach is simply to speed up the native performance of analytic queries. There is no difference in the SQL or data access tool used to interface with the data or the logical data modeling.

Learn about the columnar orientation and how it can be effective for your needs. This is the orientation of most new analytic databases and is becoming an option on many leading DBMSs.

Keywords

database management; analytics; columnar databases; Vertica; analytic databases

One twist on the DBMS that fits in the post-operational architecture is the columnar database. A columnar database is a relational DBMS: it stores data in tables and accepts SQL for interaction. There is, however, a twist to the story from there.

Columnar databases reduce the primary bottleneck in the analytic queries from Chapter 3: I/O. I/O is a bottleneck because the layers of memory in front of the CPU—the L2 (especially) and L1 caches—cannot move data through fast enough. One of the primary reasons is that row-based systems, which are designed to process rows rather than columns, send complete records through those layers. The caches add a lot of value to CPU operation by reducing the time to access data, and removing this valuable component from processing would not be a desired goal of any system administrator. The time-consuming part of the operation is reading the data page in the first place. If you’re interested in just one column, it would be great to find a bunch of that column’s values on the page, without having to skip over the other columns that aren’t interesting.
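As a rough, hypothetical sketch of the difference (not modeled on any particular DBMS’s page format), the following compares the bytes a scan must touch to read one column under a row layout versus a column layout:

```python
# Toy comparison (hypothetical widths, not any particular DBMS):
# bytes a scan must touch to read ONE column under each layout.

ROWS = 1_000_000
COL_WIDTHS = [8, 40, 40, 100, 4, 8, 200]  # per-column widths in bytes
TARGET = 4                                # we only need column index 4

row_bytes = ROWS * sum(COL_WIDTHS)        # row store: whole records pass through
col_bytes = ROWS * COL_WIDTHS[TARGET]     # column store: just that column's pages

print(f"row layout:    {row_bytes / 1e6:,.0f} MB touched")
print(f"column layout: {col_bytes / 1e6:,.0f} MB touched")
print(f"advantage:     {row_bytes / col_bytes:.0f}x less data movement")
```

Under these assumed widths, the row layout drags 100 times more data through the I/O path to answer the same single-column question.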

While disk density has gone up significantly in the last 20 years or so, packing much more data into smaller spaces, I/O is still limited by the physical movement of the disk arm. Physics simply won’t allow such a small component to move much faster without the arm flying right off its handle (which could ultimately be why solid state disks become the norm over time).

Other than larger page sizes, our designs have not changed much over the years to accommodate this I/O-bound reality. We have tried to make the OLTP database more analytic with specialized indices, OLAP cubes, summary tables, and partitioning, but we would need hundreds of drives to keep one CPU truly busy in a robust, complex, well-utilized data warehouse environment. It’s not feasible. Incidentally, because of this bottleneck, random I/O has sped up much more slowly over the years than sequential I/O, which doesn’t require nearly as much head movement.

Columnar Operation

iTunes Demonstrates How Columnar Databases Work

iTunes has taught us that we can buy by the song and not necessarily the album. When I want just a few songs from an album, it’s cheaper to purchase from iTunes only the songs I want. This is analogous to the columnar database.

The major difference between columnar and row-based stores is that all the columns of a table are not stored successively in storage—in the data pages. This eliminates much of the metadata that is stored on a data page. See Figure 5.1. The columns are tied together as “rows” only in a catalog reference. This gives the RDBMS data manager a much finer grain of control. The need for indexes is greatly minimized in column-based systems, to the point of their not being offered in many.


FIGURE 5.1 Columnar data page.

Columnar databases know where all column values begin on the page with a single offset calculation from the beginning of the file. There is no value-level metadata. All columnar data stores keep the data in the same row order so that when records are pieced together, the correct concatenation of columns makes up the row. This way “Santa” (from the first name column file) is correctly matched with “Claus” (from the last name column file)—instead of with “Bunny,” for example. Columnar databases match values to rows according to the position of the value (i.e., the 3rd value in each column belongs to the 3rd row, etc.).
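A minimal sketch of that positional matching, with each column held as its own ordered sequence (the names, widths, and in-memory lists are all hypothetical stand-ins for column files):

```python
# Each column is stored as its own ordered sequence, in the same row order.
first_name = ["Santa", "Easter", "Jack"]
last_name  = ["Claus", "Bunny", "Frost"]

# Row i is reconstructed by taking the i-th value from each column,
# so "Santa" is matched with "Claus", never with "Bunny".
row_0 = (first_name[0], last_name[0])   # ("Santa", "Claus")

# With fixed-width values and no value-level metadata, the byte offset of
# the i-th value in a column file is a single calculation:
WIDTH = 20                              # assumed fixed value width in bytes

def offset_of(i):
    return i * WIDTH                    # start-of-file offset of row i's value
```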

To optimize the I/O operation, it is essential to read as much as possible in each I/O. Many page sizes have been dramatically expanded in columnar systems from the typical 8 K/16 K/32 K/64 K you find in row-based relational systems.

Columnar databases not only provide more data per I/O but also more “relevant” data per I/O, primarily because every value they read must be processed by the query and because the reads are less cluttered by page metadata kept for the DBMS’ use.

An I/O in a columnar database will only retrieve one column—a column interesting to the particular query in either a projection (select list) or selection (WHERE clause) capacity. The selection function starts first and gathers a list of record positions to be returned, which is then used (if the projected columns differ from the selection columns) to materialize the result set.
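A minimal sketch of that two-phase flow, with selection gathering positions first and materialization following by position (Python lists stand in for column files; all names and data are hypothetical):

```python
# Selection phase: scan only the WHERE-clause column, collecting row positions.
country = ["US", "UK", "US", "CN", "US"]    # hypothetical selection column
revenue = [100, 250, 75, 300, 50]           # hypothetical projected column

positions = [i for i, v in enumerate(country) if v == "US"]   # -> [0, 2, 4]

# Materialization phase: fetch only the projected column at those positions.
result = [revenue[i] for i in positions]                      # -> [100, 75, 50]
```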

In row-based databases, complete file scans mean I/O of data that is nonessential to the query, and this nonessential data can comprise a very large percentage of the I/O. Much more of the data in a columnar I/O is essential to the query. Columnar databases can therefore turn to full column scans much more readily than a row-based system would turn to a full table scan. Query time spent in the optimizer is reduced as well, since this decision is not nearly as critical or fatal, if still less than ideal, to the overall query time.

Compression

The high likelihood of the same value being referenced successively on disk opens up a variety of unique possibilities for compression. Compression also greatly aids the I/O problem, effectively offering up more relevant information in each I/O. Below are some types of columnar compression. Please note not every columnar database will necessarily have every type of compression.

Dictionary Encoding

For fields above a certain number of bytes (and for variable-length character fields), a separate dictionary structure is used to store the actual values along with tokens. These fixed-byte tokens are used in place of the values in the data page and allow the data page to continue to operate without value-level metadata. The higher the repeat level of the values,1 the more space will be saved.

For example, A=United States of America, B=People’s Republic of China, and C=United Kingdom could be in the dictionary, and when those are the column values, the A, B, and C are used in lieu of the actual values. See Figure 5.2. If there are 1,000,000 customers with only 50 possible values, the entire column could be stored in 4 megabytes (4 bytes per value). Some DBMSs implement dictionaries at a “file” level—on a somewhat smaller basis than the entire table.


FIGURE 5.2 Dictionary and data page snippet showing country column data on a data page.
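A minimal dictionary-encoding sketch in the spirit of Figure 5.2; the values are the country names from the example above, while the integer tokens are illustrative (a real DBMS would use fixed-byte tokens like the A, B, and C above):

```python
# Build a dictionary of distinct values -> fixed-size tokens, then store
# only the tokens in the column. Decoding reverses the lookup.
values = ["United States of America", "People's Republic of China",
          "United Kingdom", "United States of America"]

dictionary = {v: i for i, v in enumerate(sorted(set(values)))}  # value -> token
reverse    = {i: v for v, i in dictionary.items()}              # token -> value

encoded = [dictionary[v] for v in values]     # what the data page holds
decoded = [reverse[t] for t in encoded]       # reconstructed column
assert decoded == values
```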

Trim Compression

Insignificant leading and trailing zeroes, and insignificant trailing spaces from character fields, are compressed out, furthering the space savings.

Columnar Compression Versus Columnar Storage

Some of these compression techniques can be utilized by DBMSs that are not necessarily storing the data in columnar fashion. Except for run-length encoding, these techniques are concerned with the values and not necessarily the storage. Columnar stores are generally better at columnar compression, and at compression overall, because having the values physically together makes the compression easier to do.

Run-Length Encoding

When there are repeating values (e.g., many successive rows with the value of ‘12/25/13’ in the date container), columnar systems compress these easily using “run-length encoding,” which simply indicates the range of rows to which the value applies. For example, run-length encoding indicating that the value ‘12/25/13’ applies to rows 123 through 145 would be stored as:

‘12/25/13’ 123–145
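A sketch of the idea, assuming a simple (value, first_row, last_row) representation like the one shown above (the dates and run lengths are hypothetical):

```python
def rle_encode(column):
    """Collapse runs of repeated values into (value, first_row, last_row)."""
    runs, start = [], 0
    for i in range(1, len(column) + 1):
        if i == len(column) or column[i] != column[start]:
            runs.append((column[start], start, i - 1))
            start = i
    return runs

dates = ["12/24/13"] * 3 + ["12/25/13"] * 5 + ["12/26/13"] * 2
print(rle_encode(dates))
# [('12/24/13', 0, 2), ('12/25/13', 3, 7), ('12/26/13', 8, 9)]
```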

Delta Compression

Fields in a tight range of values can also benefit from storing only the offset (“delta”) from a set value. Some DBMSs calculate an average for a container and store only the offsets from that value in place of the field. Whereas the value itself might be a full-size integer, the offsets can be small integers, halving the space required. Compression methods like this lose their effectiveness when a variety of field types, such as those found in a typical row, need to be stored consecutively.
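A sketch of the delta idea, using the container-average variant described above (the container values and the claimed byte savings are hypothetical):

```python
# Store a per-container base value plus small offsets instead of full integers.
container = [100_003, 100_007, 99_998, 100_001]   # hypothetical 4-byte ints

base   = round(sum(container) / len(container))   # e.g., the container average
deltas = [v - base for v in container]            # small ints -> fewer bytes each

restored = [base + d for d in deltas]             # decoding adds the base back
assert restored == container
```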

The compression methods are usually applied automatically (if desired) to each column and can vary across the columns of a table based on the characteristics of the data in each column. Multiple methods can be used on a column as well—for example, run-length encoding on dictionary tokens. The compounding effect of compression in columnar databases is a tremendous improvement over the standard compression available to a strict row-based DBMS, due to the vastly increased likelihood of similar values appearing together in a column as opposed to across entire rows. In the near future, I expect that customers will be able to apply increased knowledge of compression techniques and have more capability to override the DBMS’s compression selections.
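As a sketch of that compounding, here dictionary encoding is applied first and run-length encoding is then applied to the token stream (hypothetical values; itertools.groupby stands in for the DBMS’s run detection):

```python
from itertools import groupby

values = ["United Kingdom"] * 4 + ["United States of America"] * 6

dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
tokens = [dictionary[v] for v in values]       # dictionary encoding first

# ...then run-length encode the token stream into (token, run_length) pairs.
runs = [(t, len(list(g))) for t, g in groupby(tokens)]
print(runs)                                    # [(0, 4), (1, 6)]
# Ten long strings reduce to a two-entry dictionary plus two small tuples.
```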

Workloads

Now that you have seen the physical differences of columnar databases, it should be clear that the workloads that will find their best platform in columnar are queries that access fewer than all the columns of the tables they touch. In this case, less is more. The smaller the percentage of the row’s bytes needed, the better the performance difference with columnar.

Columnar is optimal for queries that need a small percentage of the columns in the tables they access, but suboptimal when you need most of the columns, due to the overhead of stitching all of the columns back together to form the result sets.

DBMSs Adding Columnar Capabilities

Some relational DBMSs have added columnar capabilities, changing the fundamental structure of storing all of the columns of a record consecutively on disk. The innovation adds columnar abilities to a table, effectively mixing row structures, column structures, and multi-column structures directly in the DBMS. In conjunction with the DBMS’s partition elimination features, this creates a combined row- and column-elimination possibility.

It will be an option in your DBMS soon, if not now, to put any column into columnar storage (its own storage) or to collect it into a grouping of columns (multiple columns that share the same storage). It will be a key information management design principle to make these allocations wisely. Since the storage represents I/O, group columns that the workload will access together most of the time.

Columnar databases allow you to implement a data model free of the tuning and massaging that must occur to designs in row-based databases, such as making tables unnaturally narrow to simulate columnar efficiencies.

Workload Examples

Support for Managing the Customer

Identifying a customer across a company requires multiple multi-way joins across the large customer base.

To achieve the necessary company-wide customer profile, comprising activity across all products and channels, data is continually matched, reconciled, and summarized in an organization. Columnar databases are very useful in matching long records by just their keys, something that happens repeatedly in customer matching.

Once a holistic customer profile is determined, many benefits accrue, and accurate and comprehensive customer profiling is essential. The combination of high-volume, wide-record data with distributed data sets and profiles that need matching creates the most compelling industry case for columnar deployment.

In addition, with profiling, churn can be much better managed. Once the value of a customer has been determined, the corresponding appropriate level of activity can be undertaken to prevent that customer from churning. Churn patterns can be learned from the data, potential churn can be anticipated, and an intervention can be performed.

Cross-company Consolidation

Time to resolution is critical for these functions in a communications company, and anything that improves the time to resolution of a query across silos must be considered. Considering that the queries look across all the systems with wide transaction records, the analytic database (ADB) is the matching, profiling, and analysis engine of a modern communications company that understands the importance of information.

Fraud Prevention

Companies need to reduce the latency between analysis and the taking of action, and nowhere is this need more critical than in the area of fraud.

Fraud analysis, under a rules engine or using pattern analysis, is looking at a few variables in a sea of data.

Naturally, there are fraudulent patterns of activity. These patterns are continually being learned and tuned “after the fact” and returned to the systems that analyze transactions. However, fraud is not one-size-fits-all. Customer profiles must be considered in the process of taking remedial, counterbalancing actions. What is fraud for one customer is neither fraud nor unusual for another.

The scan-type queries that fraud engines, whether developed in-house or executed by a data mining tool, run continuously are ideal for columnar databases due to their need to pull a few specific columns out of a wide transaction record.
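As a concrete illustration of such a scan, the sketch below pulls a few columns of a wide transaction table out of a columnar file format. Parquet via pyarrow is my choice for illustration, not something the chapter prescribes, and the file name, column names, and threshold rule are all hypothetical:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Read only the few columns the fraud rules need; because the file format is
# columnar, the untouched columns of the wide record are never fetched.
table = pq.read_table(
    "transactions.parquet",                        # hypothetical wide table
    columns=["card_id", "amount", "merchant_id"],  # hypothetical column names
)

# Flag transactions over a threshold (a stand-in for a real fraud rule).
suspicious = table.filter(pc.greater(table["amount"], 10_000))
print(suspicious.num_rows)
```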

Columnar Conclusions

A table scan of all columns will perform best in a row-based database. However, just because a row-based DBMS decided to do a table scan does not mean one is necessary in a columnar database; this is hugely important when assessing a columnar database. Some row-based DBMSs will decide to table scan at 15% selectivity.2 In a columnar database, the same query would access only those vectors containing the interesting columns. Single-column scans can therefore be much faster in columnar.

For small databases, the performance differences are muted.

A single-record retrieval is better in a row orientation, especially when there is a high number of columns in the table. If the workload is mostly point queries of full records, you will be better off in row-based.

The bottom line is column selectivity. Where it is high, those queries will perform better in columnar. This also means that longer rows (which leave more room for queries to be column-selective) lend themselves to columnar.

Columnar databases provide a range of benefits to an environment that needs to expand the envelope to improve performance of the overall analytic workload. It is usually not difficult to find important workloads that are column-selective and that would therefore benefit tremendously from a columnar orientation. Columnar database benefits are amplified with larger amounts of data, large scans, and I/O-bound queries. In providing the performance benefits, columnar databases also have unique abilities to compress their data.

Action Plan

• Investigate whether your DBMS(s) have columnar capabilities.

• If they do, group columns that the workload will access together most of the time.

• Determine how much of each workload is column-selective, how column-selective they are, and the importance of those queries.

• Put column-selective, large (terabyte+) workloads on columnar databases or DBMSs with columnar capabilities.

• Keep an eye on the QR Code for this chapter. Initially, I will have it pointing to a comparison of three columnar approaches by vendors out there.

www.mcknightcg.com/bookch5


1Or, put another way, the lower the cardinality.

2If it’s estimated that >15% of the row’s bytes are needed.