NoSQL for Mere Mortals (2015)
Part I: Introduction
Chapter 1. Different Databases for Different Requirements
“There is nothing new in the world except the history you do not know.”
—HARRY S. TRUMAN
Topics Covered In This Chapter
Relational Database Design
Early Database Management Systems
The Relational Database Revolution
Motivations for Not Just/No SQL (NoSQL) Databases
The history of information technology is a story of increasingly faster computation and greater volumes of data storage. An important subplot of this story is the evolution of data management technologies. Anyone who started to work with data management systems in the past two decades might understandably assume that data management is synonymous with relational database management systems. It is not. Prior to the advent of the relational database management systems, such as Microsoft Access, Microsoft SQL Server, Oracle relational database, and IBM’s DB2, computer scientists and information technology professionals created a variety of data management systems based on different organizing principles. The data management community has recently taken on new types of data management problems that have prompted the development of new kinds of data management systems. These are collectively known as NoSQL databases.
NoSQL gets its name from SQL (pronounced “sequel”), which is a language used with most relational database management systems. The “No” in NoSQL can mean literally there is no SQL used in a database, or it can mean “not just SQL.” For our purposes, we will consider the non-SQL aspects of NoSQL databases.
If you are interested in learning more about SQL, see John L. Viescas and Michael J. Hernandez’s SQL Queries for Mere Mortals (Addison-Wesley, 2007).
This chapter introduces the basic concepts of data management and database systems. It begins with a discussion of early database management systems. The limitations of these early data management systems motivated the development of a new kind of database: the relational database. Relational databases were a major advance over earlier types of data management systems. For example, relational databases help avoid inconsistencies in data, known as data anomalies, that could be introduced in some data management systems under seemingly normal operating conditions. Relational databases have so successfully solved a wide range of data management problems that they have been widely used across industries and application areas.
Relational Database Design
Relational databases are well designed to support hundreds and even thousands of users simultaneously. Even large enterprises can support complex applications serving thousands of users. As businesses and researchers developed new types of applications designed for the Web, they realized that relational databases were not always meeting their needs.
Web applications may need to support tens of thousands of users or more. Some of the most important features of relational databases, such as ensuring anyone reading data will have a consistent view of the data, require time, storage, and computational resources. These types of features are vital to some applications.
For example, if you were to transfer $100 from your savings account to your checking account, it requires two steps: Deduct $100 from your savings account and add $100 to your checking account. If you were to read your account balances after the $100 was deducted from your savings account but before it was added to your checking account, you would appear to be missing $100. Relational databases can group a set of operations, like deducting from savings and adding to checking accounts, as a single operation. If you were to read your balances, you would see the balances either before or after the transfer—never in the middle of the set of operations.
Now consider an e-commerce application. Customers use a web application to select products from a vendor’s catalog. As you select products, they are added to a “shopping cart.” Of course, there is no literal shopping cart; it is a metaphor for a data structure that manages the customer’s selection. For this kind of data management operation, a fairly simple data structure will suffice. You would need a unique identifier for each customer and a list of items selected. (You might also want other details, such as the date and time an item was added to the cart so you can remove items after some period of inactivity, but we’ll ignore those additional details for now.)
A data model using key-value pairs would work well here. The unique customer ID would be the key, which is how you look up or find data. The values would be the list of items in the cart. Because there is no need to support operations like transfers between bank accounts, you do not need to support the additional data management features found in relational databases.
Different applications require different types of databases. This fact has driven the development of data management systems for decades. As you shall see, history repeats itself. Some of the features found in early database management systems appear again in some NoSQL databases. This fact is more than just an interesting coincidence. Relational databases largely displaced early types of data management systems because the relational model addresses limitations of early systems.
As you evaluate NoSQL databases, you should consider how the newer NoSQL databases address the limitations of relational databases as well as any limitations they have in common with earlier data management systems.
Early Database Management Systems
Early data management systems include file and database systems that were designed prior to the advent of relational databases in the 1970s. These include
• Flat file data management systems
• Hierarchical data management systems
• Network data management systems
Flat file–based systems were the earliest form of computerized data management. The hierarchical and network models improved on the flat file approach to data management.
Flat File Data Management Systems
A file is an organized set of data stored on a long-term storage medium such as a disk or, at the time, magnetic tape. At the time flat files were commonly used data management, but magnetic tape was also in widespread use. For this reason, early data management files had to accommodate the physical constraints of physical systems.
Organization of Flat File Data Management Systems
Although there are multiple ways of storing data on magnetic tape, this section just considers block storage for simplicity. Magnetic tape is a long, thin magnetized plastic material that was a popular means of recording audio from the 1950s to the 1970s. It was adapted to store digital data as well. A magnetic tape is divided into a series of blocks with gaps between them (see Figure 1.1). Data is written to blocks by recording heads in a tape drive. Data is read by moving the tape over heads as well.
Figure 1.1 Magnetic tapes store data in sequential blocks.
It is a relatively simple operation to start reading a tape at a particular block and then read the following blocks in sequence. This is known as sequential access to data. This method optimizes the amount of data read relative to the amount of movement of the tape. You can think of a block as a chunk of data that the tape drive reads. Blocks may contain data about multiple entities, such as people, products, and locations. If a business needs to track customers’ names, addresses, and phone numbers, it could use a file-based storage method. The programmers working on the project may decide to leave a fixed amount of storage space for each customer:
• Customer ID—10 characters
• Customer name—40 characters
• Customer address—100 characters
• Customer phone number—10 characters
To store each customer’s information, 160 characters are required. If a block on the tape is 800 characters long, you could store five customer records in each block (see Figure 1.2).
Figure 1.2 A block is a chunk of data read by tape or disk drive in a single read operation.
Random Access of Data
Sometimes it is necessary to access data on different parts of the tape. For example, looking up the addresses of several customers may require moving the tape to several different positions, some of which can be quite far from each other. This is known as random access (see Figure 1.3).
Figure 1.3 Random access to blocks on tape can take more time than sequential access because there can be more tape movement relative to the amount of data read.
Random access is more efficient on disk drives. Read-write heads of disk drives may need to move to be in the correct position to read a data block, but there is less movement than with tapes. Disk read-write heads only need to move at most the radius of the disk. Tape drives may need to move the full length of a tape to retrieve a data block.
Limitations of Flat File Data Management Systems
The programs that use flat files largely determine the organization of data. A team of developers, for example, may decide that they want to organize the file by customer record ordered by customer ID. This makes for efficient addition of new customers. As each new customer is created, the customer can be added to the end of the tape. If you need to produce a list of customers ordered by customer ID, you could start at the beginning of the tape and sequentially read each record. If you need to produce a customer list ordered alphabetically by customer name, it is a more difficult process. You could, for example, read all the data from tape to memory (assuming it would all fit) and then sort the records in memory.
One of the problems with file-based data management systems is that they can lead to duplicated data. Another team of developers may need customer data but want to organize records by customer name instead of customer ID. Another developer who needs access to customer data may not know other customer files exist or does not want to use someone else’s file because the structure of the file might change. If that were to happen, programmers have to update their programs to reflect the new structure.
If a programmer wrote a program that assumed the customer record was organized as described previously, then the program would expect to find the customer address to start at 51 characters after the start of the record; the first 10 characters would be taken up by the customer ID, and the following 40 by customer name. Now, imagine the programmers who designed the original file layout decided they needed 50 characters for a customer name. They changed the organization of the file to be
• Customer ID—10 characters
• Customer name—50 characters
• Customer address—100 characters
• Customer phone number—10 characters
They then created a new file with the new organization, copied the data from the original file to a new version, and replaced the old version with the new version. Programs designed to work with the original file format would start reading the customer address at character 51, which is now part of the customer name.
Another problem with flat file management is that it is difficult to share files that contain information that should be kept confidential from some users. An employee file that contains the names, addresses, phone numbers, employee IDs, and position title of all employees would be useful to a number of different parts of an organization. However, if the file also contained salary information, then that data should be accessed only by those who have a job responsibility that requires it, such as someone working on payroll or in human resources. In this scenario, the easiest solution may be to have two copies of the employee file: one with salary data and one without.
The proposed solution introduces another problem: The data in the two files may become inconsistent. If an employee moves and informs the human resources department of her new address, that department might update the file its employees use; that is, the one with salary information. Unless the employee or someone in human resources informs the person responsible for updating the other version of the employee file, the data in the two files will be inconsistent. One file will have the employee’s new address while the other file will contain the employee’s old address.
To summarize, the limitations of flat file data management systems include the following:
• It is inefficient to access data in any way other than by the way data is organized in the file; for example, by customer ID.
• Changes to file structure require changes to programs.
• Different kinds of data have different security requirements.
• Data may be stored in multiple files, making it difficult to maintain consistent sets of data.
Attempts to address the limitations of flat file data management systems led to the development of hierarchical data model and network data model systems.
Hierarchical Data Model Systems
One of the limitations of flat file–based data management systems is that they can be inefficient to search. Hierarchical data models address this problem by organizing data in a hierarchy of parent-child relationships.
Organization of Hierarchical Data Management Systems
A hierarchy starts with a root node that links to the top layer of data nodes or records. These top-layer records can have child records that contain additional data about the parent record. The logical organization is shown in Figure 1.4.
Figure 1.4 The hierarchical model is organized into a set of parent-child relations.
Consider the kind of data the loan department of a bank may track. It has customers and each customer has one or more loans. For each customer, the loan department would want to track the customer’s name, address, and phone number. For each loan, the loan department should track the amount of the loan, the interest rate, the date the loan was made, and the date the loan is due. Customers can have more than one loan at a time, and a loan might have multiple customers associated with it. Figure 1.5 shows the logical organization of such a database.
Figure 1.5 A hierarchical data model for a loan management database.
An advantage of the hierarchical model over flat files is that searching is more efficient. Rather than having to scan over all the data on a tape to search for a block of data, a program using a hierarchical model could scan just customer records in search of a particular customer’s loan record. Once the customer record is found, the program could search through the customer’s loans looking for the particular loan of interest.
Limitations of Hierarchical Data Management Systems
Hierarchical data management systems work well when the entities you are managing can be organized into parent-child relationships, specifically, one parent to one or more children. One customer with one loan is easily managed. One customer with three loans is easily managed. Two customers with one loan, such as two business partners taking out a short-term business loan, are not so easily represented.
In the case of two customers on the same loan, the hierarchical data management system would have to duplicate information about the loan under both customers. This creates three problems. First, it makes inefficient use of storage space to duplicate data.
In addition, like duplicated data in the case of flat file management systems, it can lead to inconsistent data if care is not taken to ensure that any changes are applied to all copies of the data.
Also, there is a potential for errors when aggregating data. For example, to find the total value of all outstanding loans, a programmer could not just read all loan records and add all loan amounts together. Because some loans have multiple copies, one for each customer, simply adding all copies of all loan records will sum to a total loan amount greater than the actual amount. The programmer must take steps to count each loan only once.
To address the limitations of hierarchical models, data management system designers turned to network data model systems.
Network Data Management Systems
A network data model is like a hierarchical data model in that it uses links between records; however, unlike hierarchical data models, you are not restricted to having one parent record. Also, unlike flat file data management systems and hierarchical data management systems, network data models have two essential components: a schema and the database itself.
Organization of Network Data Management Systems
A network is made up of data records linked together. The data records are known as nodes and the links are known as edges. The collection of nodes and edges is known as a graph. Network data models have two important constraints on how you use edges. Edges have a direction to them. This allows you to represent parent-child relations. Parent-child relations are also known as one-to-many relations (see Figure 1.6). Furthermore, network data models allow for multiple parents, such as two customers on a loan. It can also represent two customers with two loans without duplicating data. This is known as a many-to-many relation.
Figure 1.6 A parent-child relationship is represented by a directed edge.
Another constraint is that you cannot have cycles in the graph. That is, if you start at a node, follow a link to another node, then follow a link from that node, and so on, you will never return to the starting node. Graphs that have directed edges and no cycles are known as directed acyclic graphs (see Figure 1.7).
Figure 1.7 This graph has cycles and, therefore, is not a directed acyclic graph and not a model of a network data management system.
Additional constraints on which nodes can link to other nodes arise from the entities you are trying to model. For example, in a banking database, customers can have addresses, but loans and bank accounts do not. In a human resources database, employees can have positions in the organization, but departments cannot. The kinds of nodes that can link to other nodes are defined in a structure called a schema (see Figure 1.8).
Figure 1.8 A simple network schema shows which entities can link to other entities.
The other part of a network data management system is the database itself. This is where the actual data is stored according to the structure of the schema. One of the advances of network databases over previous approaches is that it became standardized in 1969 by the Conference on Data Systems Languages (CODASYL) Consortium. This standard became the basis for most implementations of network databases.
Limitations of Network Data Management Systems
The chief limitation of network databases is that they can be difficult to design and maintain. Depending on how nodes are linked, a program may need to traverse a large number of links to get to the node with the needed data. For example, if you must start at a customer record to get to a loan record and then to a loan payment history record, you must traverse two links from customer to loan payment history. As data models become more complex, the number of links and the length of paths can become substantially longer.
Also, if after a network database is deployed, a database designer determines another entity or node type is needed, then programs that access the network database will have to be updated. Adding nodes to the schema and the database changes the paths that programs must traverse to get to particular nodes.
Summary of Early Database Management Systems
Early database management systems include flat file, hierarchical, and network databases.
Flat file databases tend to keep data about a single entity together in a single record. This is a simple structure but can lead to duplicated data and inefficient retrieval. It is difficult to implement security controls to protect confidential data in flat file management systems.
Hierarchical data management systems allow for parent-child relationships. This can help avoid duplicating data about parents because only one copy of a parent record is needed. Because data is organized into different records, data retrieval can be more efficient. For example, searching for a customer in a loan database might require scanning all customer records, but at least the program does not have to scan over loan data as well.
Although hierarchical data management systems avoid some duplication found in flat file management systems, there is still a potential for duplicate data. This can occur, for example, in the case of a loan database that must model multiple customers on a single loan.
Network data management improves on hierarchical databases by allowing multiple parents. Network data management systems also incorporate schemas that define valid relations between node types. The ability to represent parent-child and many-to-many relations is an advantage over flat file and hierarchical data management systems.
The disadvantages of early database management systems include duplicate data, difficulty implementing security, inefficient searching, and difficulty maintaining program code to access databases. The reason programs have to change when the structure of the database changes is that there is no independence between the logical organization of a database and the way the data is physically stored on tape or disk. As you will see in the next section, the structural independence of the logical and physical organization of the database is a major advance in data management provided by relational database management systems.
The Relational Database Revolution
Although network and hierarchical data management systems improved on flat file data management systems, it was not until 1970 when E. F. Codd published a paper on the design of a new type of database that data management technology radically changed. There are many important aspects of relational database design that are improvements over previous data management models. Relational databases were based on a formal mathematical model that used relational algebra to describe data and their relations. Relational databases separated the logical organization of data structures from the physical storage of those structures. Codd and others developed rules for designing relational databases that eliminated the potential for some types of data anomalies, such as inconsistent data.
There are many aspects of relational databases that deserve in-depth review. This section, however, provides only a minimal, high-level review of key points. For more on relational databases, see Michael J. Hernandez’s Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design (Addison Wesley, 2003).
Relational Database Management Systems
A relational database management system is an application made up of multiple programs that manage data and allow users of the application to add, update, read, and delete data. Unlike flat file data stores where each time a new file for storage was created, a programmer had to develop a program to manipulate the data, relational database management systems are designed to use a common language to manipulate data. That language is called SQL and is standardized across relational database management systems. Although SQL is a language used with relational databases, it is sometimes used as shorthand for “relational” as in “SQL database” or “NoSQL.”
Most of the users of relational database management systems (RDBMSs) do not work directly with the database software. Instead, they work with applications created by software developers and those applications interact with the RDBMS. To explain relational databases, it helps to separately describe the features of the RDBMS and a typical database application.
Organization of Relational Database Management Systems
A relational database management system is a set of programs for managing data and programs that manipulate that data. The minimal requirements for implementing an RDBMS include four components:
• Storage management programs
• Memory management programs
• Data dictionary
• Query language
Together, these four components provide the core data management and data retrieval services of an RDBMS.
Storage Management Programs
Database systems store data persistently on disks and flash drives for long-term storage. Database storage may be directly attached to a server or other device running a database. For example, a laptop running the MySQL database can persistently store data on the local disk drive. In large enterprises, IT departments may offer shared storage. In such cases, large disk arrays are managed as a single resource and database servers can save data to and read data from these storage arrays (see Figure 1.9).
Figure 1.9 Local storage versus shared network storage.
Regardless of what type of storage system is used, the RDBMS has to track where each piece of data is stored. One of the drawbacks of tape-based storage was the need to sequentially search a tape to retrieve data. Disk and flash devices are not so restricted. This has allowed RDBMS designers to improve retrieval methods.
Like flat file–based data stores, RDBMSs, at the most basic level, read and write blocks of data. Disk technologies made it easier to create and use indexes to data. Indexes are data sets that contain location information about blocks of data saved by the database. Indexes are based on some attribute contained in the data, such as a customer ID or customer name. Indexes point to the location on disk or flash memory that contains the record holding information about the entity referenced in the index. For example, an index with the data “Smith, Jane 18277372” would indicate that the block of data with information about Jane Smith is located at disk position 18277372.
The storage management programs in an RDBMS do much more than keep track of the location of data. They can also optimize the placement of data on disks, compress data to save storage, and make copies of data blocks so data is not lost in case a data block on a disk goes bad.
Memory Management Programs
RDBMSs are also responsible for managing data in memory. Often, the size of data stored in a database is larger than available memory. The RDBMS memory management components are responsible for bringing and keeping data in memory as long as it is needed and deleting it when it is no longer needed or to make space for additional data. Because reading data from memory is order of magnitudes faster than reading it from disk, the overall performance of the RDBMS is highly influenced by its ability to use memory efficiently and effectively.
The data dictionary is the part of the RDBMS that keeps track of information about the structure of data stored in the database (see Figure 1.10). This includes information about multiple levels of database structures, including
Figure 1.10 Data structures managed by a data dictionary.
A schema is a collection of tables, views, indexes, and other structures that are all related to a set of data. Typically, you would have separate schemas for separate major types of applications, such as a schema for inventory of products, for accounts receivable, or for employees and their benefits.
Tables are structures that have data about entities. Entities describe a physical or logical thing related to the business or operation supported by an RDBMS. Entities for a human resources schema might include employees, managers, and departments. An inventory schema might include warehouses, products, and suppliers.
Tables are made up of columns. Columns contain a single unit of information. An employee table might contain the following: employee first name, last name, street address, city, state, zip code, date of birth, and salary. Columns have a type associated with them to indicate the kind of data that can be stored. First name, for example, may be character data, date of birth would be a date type, and salary would be some type of number or currency type.
Indexes, as described earlier, are data structures used by the RDBMS to improve the speed at which the RDBMS can retrieve data. An employee table would probably have an index on the employee’s last name to enable rapid lookup of employee data by last name.
Constraints are rules that further restrict the values of data that can go in a column. Data types associated with columns prevent the wrong type of data from being saved to a column. A program might mistakenly try to write a number to the employee first name column, but the database would prevent it. A negative number would be a valid number or currency value and allowed in the salary column. You could add a constraint to the salary column to specify that a salary must be greater than 0. Constraints are generally based on business rules about the entities and operations the data is representing.
Views are collections of related columns from one or more tables as well as values calculated for data in columns. Views can be used to restrict the data that a user sees. For example, if an employee table has salary information, you can create a view that includes all nonsalary columns from the employee table. Users who need access to employees’ names and addresses can use that view instead of the employee table. Views can also combine data from multiple tables, such as a table with employee names and a table with details about promotions all employees have received.
A query language in an RDBMS performs two types of operations: defining data structures and manipulating data. SQL is the query language of relational databases and includes statements for performing both types of operations.
SQL Data Definition Language
SQL includes statements that allow programmers to create and delete schemas, tables, views, indexes, constraints, and other data structures. It also provides statements for adding and removing columns from tables, and granting access to read or write tables. The following is a sample statement for creating a schema:
CREATE SCHEMA humresc
The following is a sample statement for creating a table:
CREATE TABLE employees (
The specifics of these statements are not important at this point, but they do show the declarative style of SQL. Instead of telling the computer how to create a data structure, such as creating a free block of data at a particular address, you tell the RDBMS what kind of data structure you want. In the first case, the statement creates a schema with the name humresc (short for human resources). In the second statement, a table called employee is created with eight columns. Varchar is a variable-length character type. The number with each varchar term is the maximum length of the column. Int, short for integer, indicates that the emp_id is an integer.
SQL Data Manipulation Language
Once you have a schema with tables, you can start to add data and manipulate it. The SQL data manipulation language includes statements for
• Inserting data
• Updating data
• Deleting data
• Selecting (that is, reading) data
The following is a sample INSERT statement for the employee table:
INSERT INTO employee (emp_id, first_name, last_name)
VALUES (1234, 'Jane', 'Smith')
This statement adds a row to the employee table with an emp_id of 1234, a first name of 'Jane', and a last name of 'Smith'. The other columns of the table would be NULL, a special data value used to indicate the column has no value specified.
Updating and deleting statements allow users to change values in existing rows or remove existing rows.
To read data from a database, use the SELECT statement. For example:
SELECT emp_id, first_name, last_name
would produce output such as
emp_id first_name last_name
1234 Jane Smith
The data manipulation data statements are capable of expressing complex operations and targeting specific rows of data using fairly complex logic in the SELECT, UPDATE, and DELETE statements.
Relational database management systems provide storage management, memory management, a data dictionary, and a query language. Although programmers and software developers may be comfortable working directly with SQL, database applications allow any computer user to work with relational databases.
Organization of Applications Using Relational Database Management Systems
Working with broad concepts, you can think of business applications that use relational databases as having three major components:
• A user interface
• Business logic
• Database code
The user interface is designed to support the workflow of users. For example, a person using a human resources application might need to look up an employee’s salary, change an employee’s position, or add a new employee. The user works with menus and other user interface abstractions to invoke data entry forms, update the data as needed, and save changes to the database. There is no exposure to SQL or to the RDBMS.
The business logic is the part of the program that performs calculations and checks business rules. A business rule, for example, might check the age of an employee to verify the employee is over 21 before assigning the position “bartender” to that employee. Business rules can be implemented in programming languages, such as Python, Visual Basic, or Java, or in SQL.
Database code is the set of SELECT, INSERT, UPDATE, and DELETE (and so on) statements that perform operations on the database. The statements correspond to the operations that users can perform through the user interface.
Database applications make the functionality of relational databases, and other types of databases, accessible to nonprogrammers.
Limitations of Relational Databases
Relational databases have been the dominant type of database used for database applications for decades. Relational databases addressed many of the limitations of flat file–based data stores, hierarchical databases, and network databases. With the advent of the Web, however, the limitations of relational databases became increasingly problematic.
Companies such as Google, LinkedIn, Yahoo!, and Amazon found that supporting large numbers of users on the Web was different from supporting much smaller numbers of business users, even those in large enterprises with thousands of users on a single database application.
Web application developers working with large volumes of data and extremely large numbers of users found they needed to support
• Large volumes of read and write operations
• Low latency response times
• High availability
These requirements were difficult to realize using relational databases. These were not the first database users who needed to improve performance. The problem is that techniques used in the past did not work at the scale of operations, users, and data that businesses now demanded. In the past, if a relational database was running slowly, it could be upgraded with more CPUs, additional memory, or faster storage devices. This is a costly option and works only to a point. There are limits to how many CPUs and memory can be supported in a single server. Database designers could redesign the database schema to use techniques that would improve performance but at the cost of increasing the risk of data anomalies. (These techniques are known as denormalization.)
Another option is to use multiple servers with a relational database. This is possible, but operating a single relational database management system over multiple servers is a complex operation. This makes long-term management difficult. There are also performance issues when supporting a series of operations that run on different servers but all have to complete successfully or all fail. These sets of operations that succeed or fail together are known as transactions. As the number of servers in a database cluster increases, the cost of implementing transactions increases.
In spite of these difficulties, some companies, such as Facebook, use the MySQL relational database for some of its operations. They, however, have a dedicated MySQL staff that are pushing and expanding the limits of MySQL. Most organizations do not have such resources. For those organizations, if relational databases are not meeting needs, then it may be time to consider a NoSQL database.
Motivations for Not Just/No SQL (NoSQL) Databases
Pressing real-world problems motivated the data management professionals and software designers who created NoSQL databases. Web applications serving tens of thousands or more users were difficult to implement with relational databases. Four characteristics of data management systems that are particularly important for large-scale data management tasks are
Depending on the needs of a particular application, some of these characteristics may be more important than others.
Scalability is the ability to efficiently meet the needs for varying workloads. For example, if there is a spike in traffic to a website, additional servers can be brought online to handle the additional load. When the spike subsides and traffic returns to normal, some of those additional servers can be shut down. Adding servers as needed is called scaling out.
When you work with relational databases, it is often challenging to scale out. Additional database software may be needed to manage multiple servers working as a single database system. Oracle, for example, offers Oracle Real Applications Clusters (RAC) for cluster-based databases. Additional database components can add complexity and cost to operations.
Alternatively, database administrators could choose to scale up, which is upgrading an existing database server to add additional processors, memory, network bandwidth, or other resources that would improve performance on a database management system or replacing an existing server with one with more CPUs, memory, and so on (see Figure 1.11).
Figure 1.11 Scaling up versus scaling out.
Scaling out is more flexible than scaling up. Servers can be added or removed as needed when scaling up. NoSQL databases are designed to utilize servers available in a cluster with minimal intervention by database administrators. As new servers are added or removed, the NoSQL database management system adjusts to use the new set of available servers. Scaling up by replacing a server requires migrating the database management to a new server. Scaling up by adding resources would not require a migration, but would likely require some downtime to add hardware to the database server.
The cost of database licenses is an obvious consideration for any business or organization. Commercial software vendors employ a variety of licensing models that include charging by the size of the server running the RDBMS, by the number of concurrent users on the database, or by the number of named users allowed to use the software. Each of these models presents challenges for users of the database system.
Web applications may have spikes in demand that increase the number of users utilizing a database at any time. Should users of the RDBMS pay for the number of peak users or the number of average users? How should they budget for RDBMS licenses when it is difficult to know how many users will be using the system six months or a year from now? Users of open source software avoid these issues. The software is free to use on as many servers of whatever size needed because open source developers do not typically charge fees to run their software. Fortunately for NoSQL database users, the major NoSQL databases are available as open source.
Third-party companies provide commercial support services for open source NoSQL databases so businesses can have software support as they do with commercial relational databases.
Relational database management systems are flexible in the range of problems that can be addressed using relational data models. Industries as different as banking, manufacturing, retail, energy, and health care all make use of relational databases. There is, however, another aspect of relational databases that is less flexible.
Database designers expect to know at the start of a project all the tables and columns that will be needed to support an application. It is also commonly assumed that most of the columns in a table will be needed by most of the rows. For example, all employees will have names and employee IDs. There are times that the problems modeled are less homogeneous than that.
Consider an e-commerce application that uses a database to track attributes of products. Computer products would have attributes such as CPU type, amount of memory, and disk size. Microwave ovens would have attributes such as size and power. A database designer could create separate tables for each type of product or define a table with as many different product attributes as she could imagine at the time she designs the database.
Unlike relational databases, some NoSQL databases do not require a fixed table structure. For example, in a document database, a program could dynamically add new attributes as needed without having to have a database designer alter the database design.
Refer to Chapter 2, “Distributed Systems and the Variety of NoSQL Databases,” for more information on working with a document database.
Many of us have come to expect websites and web applications to be available whenever we want to use them. If your favorite social media or e-commerce site were frequently down when you tried to use it, you would likely start looking for a new favorite.
NoSQL databases are designed to take advantage of multiple, low-cost servers. When one server fails or is taken out of service for maintenance, the other servers in the cluster can take on the entire workload (see Figure 1.12). Performance may be somewhat less, but the application will still be available. If a database is running on a single server and it fails, the application will become unavailable unless there is a backup server. Backup servers keep replicated copies of data from the primary server in case the primary server fails. If that happens, the backup can take on the workload that the primary server had been processing. This can be an inefficient configuration because a server is kept in reserve in the event of a failure but otherwise is not helping to process the workload.
Figure 1.12 High-availability NoSQL clusters run multiple servers. If one fails, the others can continue to support applications.
Database designers turned to NoSQL systems when existing RDBMSs failed to meet their needs. Scalability, cost, flexibility, and availability are increasingly important concerns for application developers, and their choice of database management systems reflects this.
Data management systems have evolved to meet changing application requirements subject to the constraints of the existing compute and storage technologies at their times. Early data management systems were based on records stored in flat files. These provided a basic capability of persistent storage of data, but suffered from a number of drawbacks, including slow search and retrieval operations, redundant data, and poor security. Hierarchical databases were an improvement over flat files. These systems allowed for parent-child relations between records. This helped reduce, but not eliminate, the potential for redundant data. Network databases further improved on hierarchical databases by allowing for multiple parent–multiple child relations. These are commonly known as many-to-many relations.
The development of relational databases represented a radical improvement over flat file, hierarchical, and network databases. Relational databases are based on a sound mathematical foundation. Rules for designing relational databases eliminate the potential for a range of data anomalies, such as inconsistent data. Relational databases virtually replaced all other types of data management systems in business applications.
In spite of the widespread successful use of relational databases, the exponential growth of e-commerce and social media led to the need for data management systems that were scalable, low cost, flexible, and highly available. Achieving some of these objectives with relational databases is possible in some cases, but often with difficulty and potentially high costs.
NoSQL databases were created to address the limitations of relational database management systems. NoSQL databases are unlikely to displace relational databases the way RDBMSs displaced flat file, hierarchical, and network databases. The two will likely complement each other and adapt features from each other as they both continue to be applied to increasingly complex and demanding applications.
Throughout this book, you will develop a case study around a set of applications needed by a transportation management company. The company, TransGlobal Transport and Shipping, is a fictional company with realistic requirements. As you examine each of the major types of NoSQL databases, you will consider how each can be applied to a specific application for TransGlobal Transport and Shipping.
The four major applications are
• Building a shipment order
• Managing customer manifests, or a detailed description of items in a shipment
• Maintaining a customer database
• Optimizing transportation routes
As you will see, different sets of requirements can demand different types of database systems. In this case, the four types of NoSQL databases will be used to meet the information management needs of TransGlobal Transport and Shipping.
Refer to Chapter 2, “Distributed Systems and the Variety of NoSQL Databases,” to learn more about the four types of NoSQL databases.
1. If the layout of records in a file data management system changes, what else must change?
2. What kind of relation is supported in a hierarchical data management system?
d. No relations are allowed.
3. What kind of relation is supported in network data management systems?
c. Both parent-child and many-to-many
d. No relations are allowed.
4. Give an example of a SQL data manipulation language statement.
5. Give an example of a SQL data definition language statement.
6. What is scaling up?
7. What is scaling out?
8. Are NoSQL databases likely to displace relational databases as relational databases displaced earlier types of data management systems?
9. Name four required components of a relational database management system (RDBMS).
10. Name three common major components of a database application.
11. Name four motivating factors for database designers and other IT professionals to develop and use NoSQL databases.
Codd, E. F. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13, no. 6 (June 1970).
Intuit. “A Timeline of Database History.” http://quickbase.intuit.com/articles/timeline-of-database-history.
Hernandez, Michael J. Data Design for Mere Mortals: A Hands-On Guide to Relational Database Design. Reading, MA: Addison-Wesley, 2007.
Viescas, John L., and Michael J. Hernandez. SQL Queries for Mere Mortals. Reading, MA: Addison-Wesley, 2007.