
Joe Celko's Complete Guide to NoSQL: What Every SQL Professional Needs to Know about NonRelational Databases (2014)

Introduction

“Nothing is more difficult than to introduce a new order, because the innovator has for enemies all those who have done well under the old conditions and lukewarm defenders in those who may do well under the new.” —Niccolo Machiavelli

I have done a series of books for the Elsevier/Morgan Kaufmann imprint over the last few decades. They have almost all been about SQL and RDBMS. This book is an overview of what is being called Big Data, new SQL, or NoSQL in the trade press; we geeks love buzzwords! The first columnist or blogger to invent a meme that catches on will have a place in Wikipedia and might even get a book deal out of it.

Since SQL is the de facto dominant database model on Earth, anything different has to be positioned as a challenger. But what buzzwords can we use? We have had petabytes of data in SQL for years, so “Big Data” does not seem right. SQL has been evolving, with a new ANSI/ISO standard issued every five or so years, rather than “old SQL” suddenly changing into “new SQL” overnight. That last meme makes me think of New Coke® and does not inspire confidence.

Among the current crop of buzzwords, I like “NoSQL” the best because I read it as “N. O. SQL,” shorthand for “not only SQL” rather than “no SQL,” as it is often read. The “no SQL” reading implies that the last 40-plus years of database technology have done no good. Not true! Too often we SQL people, me especially, become the proverbial “kid with a hammer who thinks every problem is a nail” when we are doing IT. But it takes more than a hammer to build a house.

Some of the database tools we can use have been around for decades and even predate RDBMS. Some of the tools are new because technology made them possible. When you open your toolbox, consider all of the options and how they fit the task.

This survey book takes a quick look at the old technologies that you might not know or have forgotten. Then we get to the “new stuff” and why it exists. I am not so interested in hardware or going into particular software in depth. For one thing, I do not have the space and you can get a book with a narrow focus for yourself and your projects. Think of this book as a department-store catalog where you can go to get ideas and learn a few new buzzwords.

Please send corrections and comments to jcelko212@earthlink.net and look for feedback on the companion website (http://elsevierdirect.com/v2/companion.jsp?ISBN=9780124071926).

The following is a quick breakdown of what you can expect to find in this book:

Chapter 1: NoSQL and Transaction Processing. A queue of jobs read into a mainframe computer is still how the bulk of commercial data processing is done. Even transaction processing models finish with a batch job, now done with ETL tools, to load the databases. We need to understand both of these models and how they can be used with new technologies.

Chapter 2: Columnar Databases. Columnar databases use traditional structured data and often run some version of SQL; the difference is in how they store the data. The traditional row-oriented approach is replaced by putting data in columns that can be assembled back into the familiar rows of an RDBMS model. Since columns are drawn from one and only one data type and domain, they can be compressed and distributed over storage systems, such as RAID.
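To make the storage difference concrete, here is a minimal Python sketch of my own (not taken from any particular columnar product). It stores the same three rows column-wise and applies a simple run-length encoding, which works because each column holds one data type; the names and values are invented.

rows = [
    ("Smith", "OH", 500),
    ("Jones", "OH", 120),
    ("Celko", "TX", 500),
]

# Column-oriented layout: one homogeneous list per column.
columns = {
    "name":  [r[0] for r in rows],
    "state": [r[1] for r in rows],
    "total": [r[2] for r in rows],
}

def run_length_encode(col):
    """Compress a column as (value, count) pairs; effective on sorted, low-cardinality columns."""
    encoded = []
    for v in col:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

print(run_length_encode(sorted(columns["state"])))  # [('OH', 2), ('TX', 1)]

# Reassembling the familiar rows from the columns:
assert list(zip(columns["name"], columns["state"], columns["total"])) == rows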

Chapter 3: Graph Databases. Graph databases are based on graph theory, a branch of discrete mathematics. They model relationships among entities rather than doing computations and aggregations of the values of the attributes and retrievals based on those values.
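As a toy illustration of relationship-centered queries (my own sketch, not any product's API; the people and the follows relation are invented), a graph can be held as an adjacency list and queried by traversal rather than by aggregating column values:

from collections import deque

follows = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Dave"],
    "Carol": ["Dave"],
    "Dave":  [],
}

def reachable(graph, start):
    """Breadth-first traversal: every node reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

print(reachable(follows, "Alice"))  # {'Bob', 'Carol', 'Dave'}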

Chapter 4: MapReduce Model. The MapReduce model is the most popular of what is generally called NoSQL or Big Data in the IT trade press. It is intended for fast retrieval of large amounts of data from large file systems in parallel. These systems trade away some data integrity to get that speed and volume; their basic operations are simple and do little optimization. But a lot of applications are willing to make that trade-off.
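A minimal word-count sketch shows the two basic operations. This is a single-process toy of mine; real systems distribute the map and reduce phases over many machines, which is where the speed and volume come from.

from collections import defaultdict

documents = ["big data is big", "data flows fast"]

# Map: emit (key, value) pairs independently for each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted pairs by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: fold each key's values down to one result.
reduced = {key: sum(values) for key, values in grouped.items()}
print(reduced)  # {'big': 2, 'data': 2, 'is': 1, 'flows': 1, 'fast': 1}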

Chapter 5: Streaming Databases and Complex Events. The relational model and the prior traditional database systems assume that the tables are static during a query and that the result is also a static table. But streaming databases are built on a model of constantly flowing data—think of a river or a pipe of data moving in time. The best-known example of streaming data is stock and commodity trading done by software in subsecond trades. The system has to take actions based on events in this stream.
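As a toy sketch (mine; the ticker symbol, prices, and stop rule are invented), the program consumes events as they arrive and acts in mid-stream, instead of querying a table after the fact:

def ticks():
    """Stand-in for an endless feed of (symbol, price) events."""
    yield from [("XYZ", 10.0), ("XYZ", 10.4), ("XYZ", 9.1), ("XYZ", 9.0)]

def watch(stream, symbol, stop_price):
    for sym, price in stream:  # data arrives over time
        if sym == symbol and price <= stop_price:
            return f"SELL {symbol} at {price}"  # act on the event immediately
    return "no trigger"

print(watch(ticks(), "XYZ", 9.25))  # SELL XYZ at 9.1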

Chapter 6: Key–Value Stores. A key–value store is a collection of pairs, (< key >, < value >), that generalize a simple array. The keys are unique within the collection and can be of any data type that can be tested for equality. This is a form of the MapReduce family, but performance depends on how carefully the keys are designed. Hashing becomes an important technique.
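Python's built-in dict is itself a hashed key–value store, so a few lines are enough to show the whole contract; the "user:42" key scheme below is just an invented convention:

store = {}

store["user:42"] = {"name": "Celko", "lang": "SQL"}    # put
store["user:42"] = {"name": "Celko", "lang": "NoSQL"}  # keys are unique, so this overwrites
value = store.get("user:7")                            # get; None when the key is absent

# The key must be hashable and testable for equality; the value can be anything.
print(hash("user:42") % 8)  # which of 8 buckets (or shards) this key would land in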

Schema versus No Schema. SQL and all prior databases use a schema that defines their structure, constraints, defaults, and so forth. But there is overhead in using and maintaining a schema. Having no schema puts all of the data integrity (if any!) in the application. Likewise, the presentation layer has no way to know what will come back to it. These systems are optimized for retrieval; the safety and query power of SQL systems are traded for better scalability and retrieval performance.
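As a small sketch of where the integrity checks go when there is no schema, the check_order function below is a hypothetical application-side stand-in for the constraints a schema would have declared:

def check_order(doc):
    """The application, not the engine, enforces these rules."""
    assert isinstance(doc.get("order_nbr"), int), "order_nbr must be an integer"
    assert doc.get("qty", 0) > 0, "qty must be positive"
    return doc

check_order({"order_nbr": 123, "qty": 2})   # passes
# check_order({"order_nbr": "123"})         # would raise; nothing else will catch it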

Chapter 7: Textbases. The most important business data is not in databases or files; it is in text. It is in contracts, warranties, correspondence, manuals, and reference material. Text by its nature is fuzzy and bulky; traditional data is encoded to be precise and compact. Originally, textbases could only find documents, but with improved algorithms, we are getting to the point of reading and understanding the text.

Chapter 8: Geographical Data. Geographic information systems (GISs) are databases for geographical, geospatial, or spatiotemporal (space–time) data. This is more than cartography. We are not just trying to locate something on a map; we are trying to find quantities, densities, and contents of things within an area, changes over time, and so forth.

Chapter 9: Big Data and Cloud Computing. The term Big Data was invented by Forrester Research in a whitepaper, along with the four V buzzwords: volume, velocity, variety, and variability. It has come to apply to an environment that uses a mix of the database models we have discussed and tries to coordinate them.

Chapter 10: Biometrics, Fingerprints, and Specialized Databases. Biometrics fall outside commercial use. They identify a person as a biological entity rather than a commercial entity. We are now in the worlds of medicine and law enforcement. Eventually, however, biometrics may move into the commercial world as security becomes an issue and we are willing to trade privacy for security.

Chapter 11: Analytic Databases. The traditional SQL database is used for online transaction processing (OLTP); its purpose is to support daily business applications. Online analytical processing (OLAP) databases are built on the OLTP data, but their purpose is to run queries that deal with aggregations of data rather than individual transactions. They are analytical, not transactional.

Chapter 12: Multivalued or NFNF Databases. RDBMS is based on first normal form, which assumes that data is kept as scalar values in columns, which are in turn kept in rows that all share the same structure. The multivalued model allows the user to nest tables inside columns. These products have a niche market that is not well known to SQL programmers. There is an algebra for this data model that is just as sound as the relational model.
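Here is a small sketch of my own (the order layout is invented) of a “table inside a column,” the kind of nesting first normal form forbids, and of how unnesting flattens it back to 1NF rows:

order = {
    "order_nbr": 123,
    "customer":  "Wile E. Coyote",
    "items": [                      # a nested table, not a scalar column
        {"sku": "ROCKET", "qty": 1},
        {"sku": "ANVIL",  "qty": 2},
    ],
}

# "Unnesting" flattens it back toward the rows an RDBMS would store:
flat = [(order["order_nbr"], i["sku"], i["qty"]) for i in order["items"]]
print(flat)  # [(123, 'ROCKET', 1), (123, 'ANVIL', 2)]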

Chapter 13: Hierarchical and Network Database Systems. IMS and IDMS are the most important prerelational technologies that are still in wide use today. In fact, there is a good chance that IMS databases still hold more commercial data than SQL databases. These products still do the “heavy lifting” in banking, insurance, and large commercial applications on mainframe computers, and they use COBOL. They are great for situations that do not change much and need to move around a lot of data. Because so much data still lives in them, you have to at least know the basics of hierarchical and network database systems to get to the data to put it in a NoSQL tool.