NoSQL for Mere Mortals (2015)
Introduction
“Just when I think I have learned the way to live, life changes.”
—HUGH PRATHER
Databases are like television. There was a time in the history of both when you had few options to choose from and all the choices were disappointingly similar. Times have changed. The database management system is no longer synonymous with relational databases, and television is no longer limited to a handful of networks broadcasting indistinguishable programs.
Names like PostgreSQL, MySQL, Oracle, Microsoft SQL Server, and IBM DB2 are well known in the IT community, even among professionals outside the data management arena. Relational databases have been the choice of data management professionals for decades. They meet the needs of businesses tracking packages and account balances as well as scientists studying bacteria and human diseases. They keep data logically organized and easily retrieved. One of their most important characteristics is their ability to give multiple users a consistent view of data no matter how many changes are under way within the database.
Many of us in the database community thought we understood how to live with databases. Then life changed. Actually, the Internet changed. The Internet emerged from a military-sponsored network called ARPANET to become a platform for academic collaboration and eventually for commercial and personal use. The volume and types of data expanded. In addition to keeping our checking account balances, we want our computers to find the latest news, help with homework, and summarize reviews of new films. Now, many of us depend on the Internet to keep in touch with family, network with colleagues, and pursue professional education and development.
It is no surprise that such radical changes in data management requirements have led to radically new ways to manage data. The latest generation of data management tools is collectively known as NoSQL databases. The name reflects what these systems are not instead of what they are. We can attribute this to the well-earned dominance of relational databases, which use a language called SQL.
NoSQL databases fall into four broad categories: key-value, document, column family, and graph databases. (Search-oriented systems, such as Solr and Elasticsearch, are sometimes included in the extended family of NoSQL databases. They are outside the scope of this book.)
Key-value databases employ a simple model that enables you to store and look up a datum (also known as the value) using an identifier (also known as the key). Berkeley DB, released in the mid-1990s, was an early key-value database used in applications for which relational databases were not a good fit.
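To make the key-value model concrete, here is a minimal sketch in Python. It is an in-memory illustration only: the KeyValueStore class, its put, get, and delete methods, and the "cust:1001:..." key naming are hypothetical and are not the API of Berkeley DB or any other product.

```python
class KeyValueStore:
    """Hypothetical in-memory store: one value per key, looked up only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Storing under an existing key overwrites the previous value.
        self._data[key] = value

    def get(self, key, default=None):
        # A lookup needs the exact key; there is no query by value.
        return self._data.get(key, default)

    def delete(self, key):
        # Deleting a key that does not exist is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("cust:1001:name", "Jane Doe")    # key naming scheme is an assumption
store.put("cust:1001:balance", 250.75)

name = store.get("cust:1001:name")         # value stored under that key
missing = store.get("cust:9999:name")      # unknown key falls back to the default
```

The sketch shows the trade-off at small scale: storing and fetching by key is simple, but the store offers no way to ask which keys hold a given value.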
Document databases expand on the ideas of key-value databases to organize groups of key-value pairs into a logical structure known as a document. Document databases are high-performance, flexible data management systems that are increasingly used in a broad range of data management tasks.
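A document can be pictured as a group of key-value pairs stored and retrieved as a unit. The following Python sketch is an illustration under stated assumptions: the save and find helpers, the order documents, and their field names are all hypothetical, and real document databases provide their own storage and query interfaces.

```python
# Hypothetical in-memory collection: document id -> document (a dict of fields).
documents = {}


def save(doc_id, document):
    """Store a whole document under its id."""
    documents[doc_id] = document


def find(field, value):
    """Return every document whose top-level field equals the given value."""
    return [doc for doc in documents.values() if doc.get(field) == value]


# Documents in the same collection do not need identical fields.
save("order-1", {
    "customer": "Jane Doe",
    "status": "shipped",
    "items": [{"sku": "A100", "qty": 2}],
})
save("order-2", {
    "customer": "Raj Patel",
    "status": "pending",
    "gift_note": "Happy birthday",
})

shipped = find("status", "shipped")
```

Unlike the plain key-value model, the fields inside a document are visible to the database, so you can look up documents by their contents and not only by their identifier.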
Column family databases share superficial similarities with relational databases. The name of the first implementation of a column family database, Google Bigtable, hints at the connection to relational databases and their core data structure, the table. Column family databases are used for some of the largest and most demanding data-intensive applications.
Graph databases are well suited to modeling networks—that is, things connected to other things. The range of use cases spans computers communicating with other computers to people interacting with each other.
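The idea of things connected to other things can be sketched with a small adjacency structure. This Python example is only an illustration: the connect and reachable functions and the names in the network are hypothetical, and a real graph database supplies its own storage and query language.

```python
from collections import deque


def connect(graph, a, b):
    """Record an undirected connection between two nodes."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def reachable(graph, start):
    """Return every node connected to start, directly or through others."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


network = {}
connect(network, "alice", "bob")
connect(network, "bob", "carol")
connect(network, "dave", "erin")

alice_circle = reachable(network, "alice")
```

Here alice reaches carol through bob, while dave and erin form a separate group. Questions of this kind, which follow connections from node to node, are the ones graph databases are designed to answer.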
This is a dynamic time in database system research and development. We have well-established and widely used relational databases that are good fits for many data management problems. We have long-established alternatives, such as key-value databases, as well as more recent designs, including document, column family, and graph databases.
One of the disadvantages of this state of affairs is that decision making is more challenging. This book is designed to lessen that challenge. After reading this book, you should have an understanding of NoSQL options and when to use them.
Keep in mind that NoSQL databases are changing rapidly. By the time you read this, your favorite NoSQL database might have features not mentioned here. Watch for increasing support for transactions. How database management systems handle transactions is an important distinguishing feature of these systems. (If you are unfamiliar with transactions, don’t worry. You will soon know about them if you keep reading.)
Who Should Read This Book?
This book is designed for anyone interested in learning how to use NoSQL databases. Novice database developers, seasoned relational data modelers, and experienced NoSQL developers will find something of value in this book.
Novice developers will learn basic principles and design criteria of data management in the opening chapters of the book. You’ll also get a bit of data management history because, as we all know, history has a habit of repeating itself.
There are comparisons to relational databases throughout the book. If you are well versed in relational database design, these comparisons might help you quickly grasp and assess the value of NoSQL database features.
For those who have worked with some NoSQL databases, this book may help you get up to speed with other types of NoSQL databases. Key-value and document databases are widely used, but if you haven’t encountered column family or graph databases, then this book can help.
If you are comfortable working with a variety of NoSQL databases but want to know more about the internals of these distributed systems, this book is a starting place. You’ll become familiar with implementation features such as quorums, Bloom filters, and anti-entropy. The references will point you to resources to help you delve deeper if you’d like.
This book does not try to duplicate documentation available with NoSQL databases. There is no better place to learn how to insert data into a database than the documentation. On the other hand, documentation rarely has the level of explanation, discussion of pros and cons, and advice about best practices provided in a book such as NoSQL for Mere Mortals. Read this book as a complement to, not a replacement for, database documentation.
The Purpose of This Book
The purpose of this book is to help anyone with an interest in data use NoSQL databases to solve problems. The book is built on the assumption that the reader is not a seasoned database professional. If you are comfortable working with Excel, then you are ready for the topics covered in this book.
With this book, you’ll learn not only about NoSQL databases but also how to apply design principles and best practices to meet your data management requirements. This is a book that will take you into the internals of NoSQL database management systems to explain how distributed databases work and what to do (and not do) to build scalable, reliable applications.
The hallmark of this book is pragmatism. Everything in this book is designed to help you use NoSQL databases to solve problems. There is a bit of computer science theory scattered through the pages but only to provide more explanation about certain key topics. If you are well versed in theory, feel free to skip over it.
How to Read This Book
For those who are new to database systems, start with Chapters 1 and 2. These will provide sufficient background to read the other chapters.
If you are familiar with relational databases and their predecessors, you can skip Chapter 1. If you are already experienced with NoSQL, you could skip Chapter 2; however, it does discuss all four major types of NoSQL databases, so you might want to at least skim the sections on types you are less familiar with.
Everyone should read Part II. It is referenced throughout the other parts of the book. Parts III, IV, and V could be read in any order, but there are some references to content in earlier chapters. To achieve the best understanding of each type of NoSQL database, read all three chapters in each of Parts II, III, IV, and V.
Chapter 15 assumes familiarity with the content in the other chapters, but you might be able to skip parts on NoSQL databases you are sufficiently familiar with. If your goal is to understand how to choose between NoSQL options, be sure to read Chapter 15.
How This Book Is Organized
Here’s an overview of what you’ll find in each part and each chapter.
Part I: Introduction
NoSQL databases did not appear out of nowhere. This part provides a background on relational databases and earlier data management systems.
Chapter 1, “Different Databases for Different Requirements,” introduces relational databases and their precursor data management systems along with a discussion about today’s need for the alternative approaches provided by NoSQL databases.
Chapter 2, “Variety of NoSQL Databases,” explores key functionality in databases, challenges to implementing distributed databases, and the trade-offs you’ll find in different types of databases. The chapter includes an introduction to a series of case studies describing realistic applications of various NoSQL databases.
Part II: Key-Value Databases
In this part, you learn how to use key-value databases and how to avoid potential problems with them.
Chapter 3, “Introduction to Key-Value Databases,” provides an overview of the simplest of the NoSQL database types.
Chapter 4, “Key-Value Database Terminology,” introduces the vocabulary you need to understand the structure and function of key-value databases.
Chapter 5, “Designing for Key-Value Databases,” covers principles of designing key-value databases, the limitations of key-value databases, and design patterns used in key-value databases. The chapter concludes with a case study describing a realistic use case of key-value databases.
Part III: Document Databases
This part delves into the widely used document database and provides guidance on how to effectively implement document database applications.
Chapter 6, “Introduction to Document Databases,” describes the basic characteristics of document databases, introduces the concept of schemaless databases, and discusses basic operations on document databases.
Chapter 7, “Document Database Terminology,” acquaints you with the vocabulary of document databases.
Chapter 8, “Designing for Document Databases,” delves into the benefits of normalization and denormalization, planning for mutable documents, tips on indexing, as well as common design patterns. The chapter concludes with a case study using document databases for a business application.
Part IV: Column Family Databases
This part covers Big Data applications and the need for column family databases.
Chapter 9, “Introduction to Column Family Databases,” describes the Google Bigtable design, the differences among key-value, document, and column family databases, and the architectures used in column family databases.
Chapter 10, “Column Family Database Terminology,” introduces the vocabulary of column family databases. If you’ve always wondered “what is anti-entropy?” this chapter is for you.
Chapter 11, “Designing for Column Family Databases,” offers guidelines for designing tables, indexing, partitioning, and working with Big Data.
Part V: Graph Databases
This part covers graph databases and use cases where they are particularly appropriate.
Chapter 12, “Introduction to Graph Databases,” discusses graph and network modeling as well as the benefits of graph databases.
Chapter 13, “Graph Database Terminology,” introduces the vocabulary of graph theory, the branch of math underlying graph databases.
Chapter 14, “Designing for Graph Databases,” covers tips for graph database design, traps to watch for, and methods for querying a graph database. This chapter concludes with a case study of a graph database applied to a business problem.
Part VI: Choosing a Database for Your Application
This part deals with applying what you have learned in the rest of the book.
Chapter 15, “Guidelines for Selecting a Database,” builds on the previous chapters to outline factors that you should consider when selecting a database for your application.
Part VII: Appendices
Appendix A, “Answers to Chapter Review Questions,” contains the review questions at the end of each chapter along with answers.
Appendix B, “List of NoSQL Databases,” provides a nonexhaustive list of NoSQL databases, many of which are open source or otherwise free to use.
The Glossary contains definitions of NoSQL terminology used throughout the book.