Answers to Chapter Review Questions - Appendices - NoSQL for Mere Mortals (2015)

Part VII: Appendices

A. Answers to Chapter Review Questions

This appendix provides answers to review questions throughout the book.

Chapter 1

1. If the layout of records in a flat file data management system changes, what else must change?

Answer: Data access programs will have to change.

2. What kind of relation is supported in a hierarchical data management system?

a. Parent-child

b. Many-to-many

c. Many-to-many-to-many

d. No relations are allowed.

Answer: A. Parent-child relations are supported in hierarchical data management systems.

3. What kind of relation is supported in network data management systems?

a. Parent-child

b. Many-to-many

c. Both parent-child and many-to-many

d. No relations are allowed

Answer: C. Both parent-child and many-to-many relations are supported in network data management systems.

4. Give an example of a SQL data manipulation language statement.

Answer: Examples could include INSERT, DELETE, UPDATE, and SELECT. The following is an INSERT example:

INSERT INTO employee (emp_id, first_name, last_name)
VALUES (1234, 'Jane', 'Smith')

5. Give an example of a SQL data definition language statement.

Answer: The CREATE TABLE statement is a data definition statement. The following is an example of CREATE TABLE:

CREATE TABLE employee (
    emp_id int,
    emp_first_name varchar(25),
    emp_last_name varchar(25),
    emp_address varchar(50),
    emp_city varchar(50),
    emp_state varchar(2),
    emp_zip varchar(5),
    emp_position_title varchar(30)
)

6. What is scaling up?

Answer: Scaling up entails upgrading an existing database server with additional processors, memory, network bandwidth, or other resources that improve the performance of a database management system. It could also entail replacing an existing server with one that has more CPUs, memory, and so forth.

7. What is scaling out?

Answer: Scaling out entails adding servers to a cluster.

8. Are NoSQL databases likely to displace relational databases as relational databases displaced earlier types of data management systems?

Answer: No, relational databases and NoSQL databases meet different types of needs.

9. Name four required components of a relational database management system (RDBMS).

Answer:

• Storage management programs

• Memory management programs

• Data dictionary

• Query language

10. Name three common major components of a database application.

Answer:

• A user interface

• Business logic

• Database code

11. Name four motivating factors for database designers and other IT professionals to develop and use NoSQL databases.

Answer:

• Scalability

• Cost

• Flexibility

• Availability

Chapter 2

1. What is a distributed system?

Answer: Systems that run on multiple servers are known as distributed systems.

2. Describe a two-phase commit. Does it help ensure consistency or availability?

Answer: A two-phase commit is a transaction that requires writing data to two separate locations. In the first phase of the operation, the database writes, or commits, the data to the disk of the primary server. In the second phase of the operation, the database writes data to the disk of the backup server.

Answer: Two-phase commits help ensure consistency.

3. What do the C and A in the CAP theorem stand for? Give an example of how designing for one of those properties can lead to difficulties in maintaining the other.

Answer: C stands for consistency; A stands for availability.

When using a two-phase commit, the database favors consistency but at the risk of the most recent data not being available for a brief period of time. While the two-phase commit is executing, other queries to the data are blocked. The updated data is unavailable until the two-phase commit finishes. This favors consistency over availability.

4. The E in BASE stands for eventually consistent. What does that mean?

Answer: E stands for eventually consistent, which means that some replicas might be inconsistent for some period of time but will become consistent at some point.

5. Describe monotonic write consistency. Why is it so important?

Answer: Monotonic write consistency ensures that if you were to issue several update commands, they would be executed in the order you issued them. This ensures that the results of a set of commands are predictable. Repeating the same commands with the same starting data will yield the same results.

6. How many values can be stored with a single key in a key-value database?

Answer: One.

7. What is a namespace? Why is it important in key-value databases?

Answer: A namespace is a collection of identifiers. Keys must be unique within a namespace.

8. How do document databases differ from key-value databases?

Answer: Instead of storing each attribute of an entity with a separate key, document databases store multiple attributes in a single document. Users can query and retrieve documents by filtering on key-value pairs within a document.

9. Describe two differences between document databases and relational databases.

Answer: Document databases do not require a fixed, predefined schema. Also, documents can have embedded documents and lists of multiple values within a document.

10. Name two data structures used in column family databases.

Answer: Columns and column families.

11. What are the two fundamental data structures in a graph database?

Answer: Nodes and relations, also known as vertices and edges.

12. You are assigned the task of building a database to model employees and who they work with in your company. The database must be able to answer queries such as how many employees does Employee A work with? And, does Employee A work with anyone who works with Employee B? Which type of NoSQL database would naturally fit with these requirements?

Answer: A graph database because these queries require working with relations between employees. Employees can be modeled as vertices, and the “works with” relation can be modeled as an edge.
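
The two example queries can be sketched with a plain adjacency structure. This is only an illustration, assuming a symmetric "works with" relation stored in a Python dictionary; the employee names and the helper functions `coworker_count` and `shares_coworker` are hypothetical and are not part of any graph database API.

```python
# Each employee (vertex) maps to the set of employees they work with (edges).
# The relation is assumed to be symmetric for this sketch.
works_with = {
    "A": {"C", "D"},
    "B": {"D"},
    "C": {"A"},
    "D": {"A", "B"},
}


def coworker_count(employee):
    """How many employees does this employee work with?"""
    return len(works_with.get(employee, set()))


def shares_coworker(first, second):
    """Does `first` work with anyone who also works with `second`?"""
    return bool(works_with.get(first, set()) & works_with.get(second, set()))
```

A dedicated graph database would typically answer such questions with its own traversal or query language rather than with application code like this.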

Chapter 3

1. How are associative arrays different from arrays?

Answer: An associative array is a data structure, like an array, but is not restricted to using integers as indexes or limiting values to the same type. Associative arrays generalize the idea of an ordered list indexed by an identifier to include arbitrary values for identifiers and values.

2. How can you use a cache to improve relational database performance?

Answer: An in-memory cache is an associative array. The values retrieved from the relational database could be stored in the cache by creating a key for each value stored. Programs that access customer data will typically check the cache first for data and if it is not found in the cache, the program will then query the database. Retrieving data from memory is faster than retrieving it from disk.
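
The check-the-cache-first pattern described above can be sketched as follows. This is a minimal illustration, assuming a Python dictionary as the in-memory cache; `FAKE_DATABASE` and `query_database` are hypothetical stand-ins for a real relational database and its query call.

```python
# Hypothetical stand-in for a relational table (assumption for illustration).
FAKE_DATABASE = {1982737: {"first_name": "Jane", "last_name": "Smith"}}

cache = {}          # in-memory associative array: key -> cached value
database_reads = 0  # counts how often the slower database path is taken


def query_database(customer_id):
    """Simulate a slower database lookup (hypothetical helper)."""
    global database_reads
    database_reads += 1
    return FAKE_DATABASE.get(customer_id)


def get_customer(customer_id):
    """Check the cache first; query the database only on a cache miss."""
    key = f"customer:{customer_id}"
    if key in cache:
        return cache[key]
    value = query_database(customer_id)
    if value is not None:
        cache[key] = value  # keep the value in memory for later reads
    return value


first = get_customer(1982737)   # cache miss: reads the database
second = get_customer(1982737)  # cache hit: no database read
```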

3. What is a namespace?

Answer: A namespace is a logical data structure for organizing key-value pairs. Keys must be unique within a namespace. Namespaces are sometimes called buckets.

4. Describe a way of constructing keys that captures some information about entities and attribute types.

Answer: A developer could use a key-naming convention that uses a table name, primary key value, and an attribute name to create a key, such as customer:1982737:firstName.
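
Such a convention is easy to wrap in small helper functions. The sketch below assumes ':' as the delimiter and that no component contains it; `make_key` and `parse_key` are illustrative names, not part of any key-value database.

```python
def make_key(entity, entity_id, attribute, delimiter=":"):
    """Join entity name, identifier, and attribute name into one key."""
    return delimiter.join([entity, str(entity_id), attribute])


def parse_key(key, delimiter=":"):
    """Split a key back into its three components (all returned as strings)."""
    entity, entity_id, attribute = key.split(delimiter)
    return entity, entity_id, attribute


key = make_key("customer", 1982737, "firstName")
```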

5. Name three common features of key-value databases.

Answer:

• Simplicity

• Speed

• Scalability

6. What is a hash function? Include important characteristics of hash functions in your definition.

Answer: A hash function is a function that can take an arbitrary string of characters and produce a (usually) unique, fixed-length string of characters. Hash functions map to what appear to be random outputs.

7. How can hash functions help distribute writes over multiple servers?

Answer: One way to take advantage of the hash value is to start by dividing the hash value by the number of servers. Sometimes the hash value will divide evenly by the number of servers and sometimes not. The remainder can be used to determine which of the servers should receive a write operation.
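
The remainder technique can be sketched in a few lines. This example assumes SHA-256 from Python's standard `hashlib` module as the hash function and servers numbered from 0; real key-value databases may use different hash functions and more elaborate schemes.

```python
import hashlib


def server_for_key(key, num_servers):
    """Pick a server by dividing the key's hash value and keeping the remainder."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)     # interpret the hex digest as an integer
    return hash_value % num_servers  # remainder is a server number 0..num_servers-1


target = server_for_key("customer:1982737:firstName", 8)
```

The same key always maps to the same server, so later reads can locate the value with the same calculation.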

8. What is one type of practical limitation on values stored in key-value databases?

Answer: Different implementations of key-value databases have different restrictions on values. For example, some key-value databases will typically have some limit on the size of values. Some might allow multiple megabytes in each value, but others might have smaller size limitations. Even in cases in which you can store extremely large values, you might run into performance problems that lead you to work with smaller data values.

9. How does the lack of a query language affect application developers using key-value databases?

Answer: Key-value databases do not support query languages for searching over values. Application developers can implement search operations in their applications. Alternatively, some key-value databases incorporate search functionality directly into the database.

10. How can a search system help improve the performance of applications that use key-value databases?

Answer: A built-in search system would index the string values stored in the database and create an index for rapid retrieval. Rather than search all values for a string, the search system keeps a list of words with the keys of each key-value pair in which that word appears.
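
The word-to-keys list described above is often called an inverted index. A minimal sketch follows, assuming values are plain strings split on whitespace; the sample keys and the `build_index` helper are hypothetical.

```python
from collections import defaultdict


def build_index(store):
    """Map each word to the set of keys whose values contain that word."""
    index = defaultdict(set)
    for key, value in store.items():
        for word in value.lower().split():
            index[word].add(key)
    return index


store = {
    "prod:1:desc": "large toaster oven",
    "prod:2:desc": "small toaster",
    "prod:3:desc": "large blender",
}
index = build_index(store)
matches = sorted(index["toaster"])  # keys found without scanning every value
```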

Chapter 4

1. What are data models? How do they differ from data structures?

Answer: Data models are abstractions that help organize the information conveyed by the data in databases. Data structures are well-defined data storage structures that are implemented using elements of underlying hardware, particularly random access memory and persistent data storage, such as hard drives and flash devices. Data models provide a level of organization and abstraction above data structures.

2. What is a partition?

Answer: A partition is a logical subdivision of a larger structure. Clusters, or groups of servers, can be organized into partitions. A partitioned cluster is a group of servers in which servers or instances of key-value database software running on servers are assigned to manage subsets of a database.

3. Define two types of clusters. Which type is typically used with key-value data stores?

Answer: Clusters may be loosely or tightly coupled. Loosely coupled clusters consist of fairly independent servers that complete many functions on their own with minimal coordination with other servers in the cluster. Tightly coupled clusters tend to have high levels of communication between servers. This is needed to support more coordinated operations, or calculations, on the cluster. Key-value clusters tend to be loosely coupled.

4. What are the advantages of having a large number of replicas? What are the disadvantages?

Answer: The more replicas you have, the less likely you will lose data; however, you might have lower performance with a large number of replicas.

It is possible for replicas to have different versions of data. All the versions will eventually be consistent, but sometimes they may be out of sync for short periods.

5. Why would you want to receive a response from more than one replica when reading a value from a key-value data store?

Answer: To minimize the risk of reading old, out-of-date data, you can specify the number of nodes that must respond with the same answer to a read request before a response is returned to the calling application.
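
The idea of requiring several matching responses can be sketched as follows. This assumes the responses from the replicas have already been collected into a list; `quorum_read` is a hypothetical helper, not a function of any particular key-value database.

```python
from collections import Counter


def quorum_read(replica_values, required_matches):
    """Return a value only if at least `required_matches` replicas agree on it."""
    if not replica_values:
        return None
    value, count = Counter(replica_values).most_common(1)[0]
    return value if count >= required_matches else None
```

For example, with responses `["v2", "v2", "v1"]` and two required matches, the read returns `"v2"`; if all three replicas disagree, no value is returned.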

6. Under what circumstances would you want to have a large number of replicas?

Answer: If you have little tolerance for losing data, a higher replica number is recommended.

7. Why are hash functions used with key-value databases?

Answer: Hash functions are generally designed to distribute inputs evenly over the set of all possible outputs. The output space can be quite large. No matter how similar your keys are, they are evenly distributed across the range of possible output values. The ranges of output values can be assigned to partitions and you can be reasonably assured that each partition will receive approximately the same amount of data.

8. What is a collision?

Answer: A collision occurs when two distinct inputs to a hash function produce the same output.

9. Describe one way to handle a collision so that no data is lost.

Answer: Instead of storing just one value, a hash table can store lists of values.
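
Storing a list per hash value is commonly known as chaining. The sketch below is a toy hash table, assuming Python's built-in `hash` function and a small fixed number of buckets; the class name `ChainedHashTable` is illustrative.

```python
class ChainedHashTable:
    """Toy hash table in which each bucket holds a list of (key, value) pairs."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # same key: replace its value
                return
        bucket.append((key, value))       # collision: keep both entries

    def get(self, key):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return None


# A single bucket forces every key to collide, yet no value is lost.
table = ChainedHashTable(num_buckets=1)
table.put("cust:1", "Jane")
table.put("cust:2", "Alan")
```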

10. Discuss the relation between speed of compression and the size of compressed data.

Answer: There is a trade-off between the speed of compression/decompression and the size of the compressed data. Faster compression algorithms can lead to larger compressed data than other, slower algorithms.

Chapter 5

1. Describe four characteristics of a well-designed key-naming convention.

Answer:

• Use meaningful and unambiguous naming components, such as ‘cust’ for customer or ‘inv’ for inventory.

• Use range-based components when you would like to retrieve ranges of values. Ranges include dates or integer counters.

• Use a common delimiter when appending components to make a key. The ‘:’ is a commonly used delimiter, but any character that will not otherwise appear in the key will work.

• Keep keys as short as possible without sacrificing the other characteristics mentioned in this list.

2. Name two types of restrictions key-value databases can place on keys.

Answer: Restrictions can be placed on key size and data types.

3. Describe the difference between range partitioning and hash partitioning.

Answer: Range partitioning works by grouping contiguous values and sending them to the same node in a cluster. Hash partitioning distributes values evenly across the cluster.
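
The two approaches can be contrasted in a short sketch. It assumes integer keys, nodes numbered from 0, range boundaries given as a sorted list, and SHA-256 as the hash function; both helper functions are hypothetical.

```python
import bisect
import hashlib


def range_partition(key, upper_bounds):
    """Send contiguous key ranges to the same node.

    With upper_bounds [100, 200]: keys below 100 go to node 0,
    keys 100-199 to node 1, and keys 200 and above to node 2.
    """
    return bisect.bisect_right(upper_bounds, key)


def hash_partition(key, num_nodes):
    """Spread keys over nodes using the remainder of a hash value."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


range_nodes = [range_partition(k, [100, 200]) for k in (5, 6, 150, 250)]
```

Neighboring keys such as 5 and 6 always share a node under range partitioning, whereas hash partitioning gives no such guarantee.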

4. How can structured data types help reduce read latency (that is, the time needed to retrieve a block of data from a disk)?

Answer: By storing commonly used values together in a list or other structure, you reduce the number of disk seeks that must be performed to read all the needed data. Key-value databases will usually store the entire structure together in a data block so there is no need to hash multiple keys and retrieve multiple data blocks.

5. Describe the Time to Live (TTL) key pattern.

Answer: A TTL parameter specifies a time that a key-value record is allowed to exist. The TTL pattern is sometimes useful with keys in a key-value database, especially when caching data in limited memory servers or keys are used to hold a resource for some specified period of time.
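
Many key-value databases provide TTL support natively. The sketch below only illustrates the idea in application code, assuming expiry is checked lazily when a key is read; the `TTLStore` class and the sample key are hypothetical.

```python
import time


class TTLStore:
    """Toy key-value store in which every key expires after a time to live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable clock, which makes expiry easy to test
        self._data = {}      # key -> (value, expiry time)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: remove the key and report a miss
            return None
        return value
```

A seat could be held for 30 seconds with `store.set("seat:A1:hold", "cust:4719364", 30)`; once the time has elapsed, `store.get("seat:A1:hold")` returns `None` and the seat is free again.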

6. Which design pattern provides some of the features of relational transactions?

Answer: Emulating tables.

7. When would you want to use the Aggregate pattern?

Answer: Aggregation is used to support different attributes for different subtypes of an entity. For example, a concert venue could have two subtypes, seated and nonseated.

8. What are enumerable keys?

Answer: Enumerable keys are keys that use counters or sequences to generate new keys. Enumerable keys are often created using an entity type prefix along with the generated number.

9. How can enumerable keys help with range queries?

Answer: A range of keys can be retrieved by using a loop to generate the set of keys between the lower and upper bounds. For example, a for loop starting at 1 and ending with 3 could be used to generate the following keys: ‘ticketLog:20140617:1’, ‘ticketLog:20140617:2’, and ‘ticketLog:20140617:3’.
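
The loop described above might look like the following sketch, assuming the prefix:date:counter key layout used in the example; `enumerate_keys` is an illustrative helper name.

```python
def enumerate_keys(prefix, date, lower, upper):
    """Generate every key between the lower and upper counter values, inclusive."""
    return [f"{prefix}:{date}:{n}" for n in range(lower, upper + 1)]


keys = enumerate_keys("ticketLog", "20140617", 1, 3)
```

Each generated key can then be looked up individually in the key-value database.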

10. How would you modify the design of TGTS Tracker to include a user’s preferred language in the configuration?

Answer: A language preference could be added to the customer value list. For example,

TrackerNS[cust:4719364] = {name:'Prime Machine, Inc.', currency:'USD', language:'EN'}

Chapter 6

1. Define a document with respect to document databases.

Answer: Documents in document databases are composed of a set of attribute tags and values. Developers can make up their own set of attribute tags; they are not constrained to a predefined set of tags for specifying structure.

2. Name two types of formats for storing data in a document database.

Answer: JSON and XML.

3. List at least three syntax rules for JSON objects.

Answer:

• Data is organized in key-value pairs, similar to key-value databases.

• Documents consist of name-value pairs separated by commas.

• Documents start with a { and end with a }.

• Names are strings, such as "customer_id" and "address".

• Values can be numbers, strings, Booleans (true or false), arrays, objects, or the null value.

• The values of arrays are listed within square brackets, such as [ and ].

• The values of objects are listed as key-value pairs within curly brackets, such as { and }.

4. Create a sample document for a small appliance with the following attributes: appliance ID, name, description, height, width, length, and shipping weight. Use the JSON format.

Answer:

{ "appliance ID": 132738,
  "name": "Toaster Model X",
  "description": "Large 4 bagel toaster",
  "height": "9 in",
  "width": "7.5 in",
  "length": "12 in",
  "shipping weight": "3.2 lbs"
}

5. Why are highly abstract entities often avoided when modeling document collections?

Answer: Highly abstract entities can lead to document collections with many subtypes. These subtypes will need type indicators to support the frequent filtering required when different document types are in the same collection. Large collections can lead to inefficient retrieval operations.

6. When is it reasonable to use highly abstract entities?

Answer: Abstract entities should be used when many of the queries used against a collection apply to all or many subtypes, for example, in a products document collection. Also, if there is a potential for the number of subtypes to grow into the tens or hundreds, it could become difficult to manage collections for all of those subtypes.

7. Using the db.books collection described in this chapter, write a command to insert a book to the collection. Use MongoDB syntax.

Answer:

db.books.insert( {"title": "Mother Night", "author": "Kurt Vonnegut, Jr."} )

8. Using the db.books collection described in this chapter, write a command to remove books by Isaac Asimov. Use MongoDB syntax.

Answer:

db.books.remove( {"author": "Isaac Asimov"} )

9. Using the db.books collection described in this chapter, write a command to retrieve all books with quantity greater than or equal to 20. Use MongoDB syntax.

Answer:

db.books.find( {"quantity" : {"$gte" : 20 }})

10. Which query operator is used to search for values in a single key?

Answer: The $in operator is used to search for any of several values in a single key.

Chapter 7

1. Describe how documents are analogous to rows in relational databases.

Answer: Documents are ordered sets of key-value pairs. Keys are used to reference particular values and are analogous to column names in relational tables. Values in a document database are analogous to values stored in a row of a relational database table.

2. Describe how collections are analogous to tables in relational databases.

Answer: Collections are sets of documents; tables are sets of rows. Both documents and rows have unique identifiers and may have other attributes as well.

3. Define a schema.

Answer: A schema is a formal specification of a database structure.

4. Why are document databases considered schemaless?

Answer: Document databases do not require data modelers to formally specify the structure of documents.

5. Why are document databases considered polymorphic?

Answer: A document database is polymorphic because the documents that exist in collections can have many different forms.

6. How does vertical partitioning differ from horizontal partitioning, or sharding?

Answer: Vertical partitioning is a technique for improving database performance by separating columns of a relational table into multiple separate tables. This technique is particularly useful when you have some columns that are frequently accessed and others that are not.

Horizontal partitioning is the process of dividing a database by documents in a document database or by rows in a relational database. These parts of the database, known as shards, are stored on separate servers.

7. What is a shard key?

Answer: A shard key is one or more keys or fields that exist in all documents in a collection that is used to separate documents into different partitions.

8. What is the purpose of the partitioning algorithm in sharding?

Answer: The partitioning algorithm determines how to distribute documents over shards. Common techniques include range, hash, and list partitioning.

9. What is normalization?

Answer: Database normalization is the process of organizing data into tables in such a way as to reduce the potential for data anomalies. An anomaly is an inconsistency in the data. Normalization reduces the amount of redundant data in the database.

10. Why would you want to denormalize collections in a document database?

Answer: Denormalization is used to improve performance over normalized versions of databases.

Chapter 8

1. What are the advantages of normalization?

Answer: Normalization reduces redundant data and mitigates the risk of data anomalies.

2. What are the advantages of denormalization?

Answer: Denormalization can improve query performance over more normalized models.

3. Why are joins such costly operations?

Answer: Joins retrieve data from multiple tables. Joins use loops, hashes, and merge operations. As the size of tables grows, these operations take longer as more data blocks need to be read. Indexes can improve performance, but they also require disk seeks to retrieve data blocks holding index data.

4. How do document database modelers avoid costly joins?

Answer: They use denormalized data models. The basic idea is that data models store data that is used together in a single data structure, such as a table in a relational database or in a document in a document database. This increases the likelihood that all the data in a document is in a single data block or at least adjacent data blocks.

5. How can adding data to a document cause more work for the I/O subsystem in addition to adding the data to a document?

Answer: Too much denormalization will lead to large documents that will likely lead to unnecessary data read from persistent storage.

6. How can you, as a document database modeler, help avoid that extra work mentioned in Question 5?

Answer: Let queries drive how you design documents. Try to include only fields that are frequently used together in documents. If you have two or more distinct sets of requirements for the same type of data, consider using two document collections each tailored to the different requirements.

7. Describe a situation where it would make sense to have many indexes on your document collections.

Answer: Read-heavy applications, especially those with ad hoc query requirements, might need many indexes. Business intelligence and other analytic applications can fall into this category. Read-heavy applications with ad hoc query requirements should have indexes on virtually all fields used to help filter results. For example, if it was common for users to query documents from a particular sales region or with order items in a certain product category, then the sales region and product category fields should be indexed.

8. What would cause you to minimize the number of indexes on your document collection?

Answer: Data modelers tend to try to minimize the number of indexes in write-heavy applications. Because indexes are data structures that must be created and updated, their use will consume CPU, persistent storage, and memory resources and increase the time needed to insert or update a document in the database.

9. Describe how to model a many-to-many relationship.

Answer: Many-to-many relationships are modeled using two collections—one for each type of entity. Each collection maintains a list of identifiers that reference related entities. For example, a document with course data would include an array of student IDs, and a student document would include a list of course IDs.
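
The course and student example can be sketched with two dictionaries standing in for the two collections. The identifiers, names, and helper functions below are hypothetical and only illustrate how each side holds references to the other.

```python
# Two "collections": each document lists the identifiers of related documents.
courses = {
    "C100": {"title": "Databases", "students": ["S1", "S2"]},
    "C200": {"title": "Networks", "students": ["S2"]},
}
students = {
    "S1": {"name": "Jane Smith", "courses": ["C100"]},
    "S2": {"name": "Alan Jones", "courses": ["C100", "C200"]},
}


def courses_for_student(student_id):
    """Follow the course IDs stored in a student document."""
    return [courses[c]["title"] for c in students[student_id]["courses"]]


def students_in_course(course_id):
    """Follow the student IDs stored in a course document."""
    return [students[s]["name"] for s in courses[course_id]["students"]]
```

Because the references are stored on both sides, application code must update both documents whenever a student enrolls in or drops a course.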

10. Describe three ways to model hierarchies in a document database.

Answer: Parent references, child references, listing all ancestors.

Chapter 9

1. Name at least three core features of Google BigTable.

Answer:

• Developers maintain dynamic control over columns.

• Data values are indexed by row identifier, column name, and a time stamp.

• Data modelers and developers have control over location of data.

• Reads and writes of a row are atomic.

• Rows are maintained in a sorted order.

2. Why are time stamps used in Google BigTable?

Answer: The time stamp orders versions of the column value. When a new value is written to a BigTable database, the old value is not overwritten. Instead, a new value is added along with a time stamp. The time stamp allows applications to determine the latest version of a column value.
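
The versioning behavior can be illustrated with a toy structure. This sketch assumes integer time stamps supplied by the caller and an in-memory dictionary; it is not BigTable's API, and the row and column names are made up.

```python
cells = {}  # (row key, column name) -> list of (time stamp, value) versions


def write(row, column, value, timestamp):
    """Add a new version; earlier versions are kept rather than overwritten."""
    cells.setdefault((row, column), []).append((timestamp, value))


def read_latest(row, column):
    """Return the value carrying the highest time stamp, or None if absent."""
    versions = cells.get((row, column))
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]


write("cust:1", "city", "Portland", timestamp=100)
write("cust:1", "city", "Seattle", timestamp=200)
```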

3. Identify one similarity between column family databases and key-value databases.

Answer: Column families are analogous to keyspaces in key-value databases. In both key-value databases and Cassandra, a keyspace is the outermost logical structure used by data modelers and developers.

4. Identify one similarity between column family databases and document databases.

Answer: Column family and document databases support similar types of querying that allow you to select subsets of data available in a row.

Column family databases, like document databases, do not require all columns in all rows. In both column family and document databases, columns or fields can be added as needed by developers.

5. Identify one similarity between column family databases and relational databases.

Answer: Both column family databases and relational databases use unique identifiers for rows of data. These are known as row keys in column family databases and as primary keys in relational databases. Both row keys and primary keys are indexed for rapid retrieval.

6. What types of Hadoop nodes are used by HBase?

Answer: Name nodes and data nodes.

7. Describe the essential characteristics of a peer-to-peer architecture.

Answer: Peer-to-peer architectures have only one type of node. Any node can assume responsibility for any service or task that must be run in the cluster.

8. Why does Cassandra use a gossip protocol to exchange server status information?

Answer: An “all-servers-to-all-other-servers” protocol can quickly increase the volume of traffic on the network and the amount of time each server has to dedicate to communicating with other servers. The number of messages sent is a function of the number of servers in the cluster. If N is the number of servers, then N × (N–1) is the number of messages needed to update all servers with information about all other servers. Gossip protocols are more efficient because one server can update another server about itself as well as all the servers it knows about.
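
The growth in message volume is easy to check with a small calculation. The function below simply evaluates N × (N–1); its name is illustrative.

```python
def all_to_all_messages(num_servers):
    """Messages needed when every server reports to every other server."""
    return num_servers * (num_servers - 1)


small_cluster = all_to_all_messages(10)   # 10 * 9
large_cluster = all_to_all_messages(100)  # 100 * 99
```

Increasing the cluster from 10 to 100 servers raises the count from 90 to 9,900 messages, a 110-fold increase for a 10-fold increase in servers.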

9. What is the purpose of the anti-entropy protocol used by Cassandra?

Answer: Anti-entropy algorithms correct inconsistencies between replicas.

10. When would you use a column family database instead of another type of NoSQL database?

Answer: Column family databases are appropriate choices for large-scale database deployments that require high levels of write performance, a large number of servers to meet expected workloads, or multi–data center availability.

Chapter 10

1. What is a keyspace? What is an analogous data structure in a relational database?

Answer: A keyspace is the top-level data structure in a column family database. It is top level in the sense that all other data structures you would create as a database designer are contained within a keyspace. A keyspace is analogous to a schema in a relational database.

2. How do columns in column family databases differ from columns in relational databases?

Answer: Columns in column families are dynamic. Columns in a relational database table are not as dynamic as in column family databases. Adding a column in a relational database requires changing its schema definition. Adding a column in a column family database just requires making a reference to it from a client application, for example, inserting a value to a column name.

3. When should columns be grouped together in a column family? When should they be in separate column families?

Answer: Columns that are frequently used together should be grouped in the same column family.

4. Describe how partitions are used in column family databases.

Answer: A partition is a logical subset of a database. Partitions are usually used to store a set of data based on some attribute of the data. Each node or server within a column family cluster maintains one or more partitions.

When a client application requests data, the request is routed to a server with the partition containing the requested data. A request could go to a central server in a master-slave architecture or to any server in a peer-to-peer architecture. In either case, the request is forwarded to the appropriate server.

5. What are the performance advantages of using a commit log?

Answer: Commit logs are append-only files that always write data to the end of the file. When database administrators dedicate a disk to a commit log, there are no other write processes competing to write data to the disk. This reduces the need for random seeks and reduces latency.

6. What are the advantages of using a Bloom filter?

Answer: A Bloom filter tests whether or not an element is a member of a set, such as a partition. A Bloom filter never returns a negative response unless the element is not in the set; it may, however, return a positive response in cases when the element is not in the set. Bloom filters are used to reduce the number of blocks read from disks or solid state devices.
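
A minimal Bloom filter can be sketched as follows. The sizes, the use of SHA-256 with a numeric seed to derive several hash positions, and the class itself are assumptions made for illustration; production implementations choose these parameters to control the false-positive rate.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(item))
```

A database can skip reading a block from disk whenever `might_contain` returns `False` for the requested row key.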

7. What factors should you consider when setting a consistency level?

Answer: A consistency level is set according to several, sometimes competing, requirements:

• How fast should write operations return a success status after saving data to persistent storage?

• Is it acceptable for two users to look up a set of columns by the same row ID and receive different data?

• If your application runs across multiple data centers and one of the data centers fails, must the remaining functioning data centers have the latest data?

• Can you tolerate some inconsistency in reads but need updates saved to two or more replicas?

8. What factors should you consider when setting a replication strategy?

Answer: One method uses the ring structure of a cluster. When data is written to a node, it is replicated to the two adjacent nodes in the cluster. The other method uses network topology to determine where to replicate data. For example, replicas may be created on different racks within a data center to ensure availability in the event of a rack failure.

9. Why are hash trees used in anti-entropy processes?

Answer: The naive way to compare replicas is to send a copy of one replica to the node storing another replica and compare the two. Even with high-write applications, much of the data sent from the source is the same as the data on the target node. Column family databases can exploit the fact that much of replica data may not change between anti-entropy checks. They do this by sending hashes of data instead of the data itself.
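
The hash-comparison idea can be sketched with a small hash tree. This example assumes both replicas are divided into the same number of blocks, that blocks are strings, and that SHA-256 is the hash function; all function names are hypothetical.

```python
import hashlib


def h(data):
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def leaf_hashes(blocks):
    """Hash each data block separately (the leaves of the tree)."""
    return [h(block) for block in blocks]


def root_hash(hashes):
    """Combine hashes pairwise, level by level, until a single root remains."""
    if not hashes:
        return h("")
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd-sized levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def differing_blocks(source_blocks, target_blocks):
    """Compare hashes, not data; return indexes of blocks that need repair."""
    source, target = leaf_hashes(source_blocks), leaf_hashes(target_blocks)
    if root_hash(source) == root_hash(target):
        return []  # one root comparison shows the replicas already agree
    return [i for i, (a, b) in enumerate(zip(source, target)) if a != b]
```

When the replicas match, only the root hashes need to be exchanged. A fuller implementation would walk down the tree to locate differences instead of comparing every leaf.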

10. What are the advantages of using a gossip protocol?

Answer: Instead of having every node communicate with every other node, it is more efficient to have nodes share information about themselves as well as other nodes from which they have received updates. This avoids the rapid increase in the number of messages that must be sent when compared with each node communicating with all other nodes.

11. Describe how hinted handoff can help improve the availability of write operations.

Answer: If a write operation is directed to a node that is unavailable, the operation can be redirected to another node, such as another replica node or a node designated to receive write operations when the target node is down.

The node receiving the redirected write message creates a data structure to store information about the write operation and where it should ultimately be sent. The hinted handoff process periodically checks the status of the target server and sends the write operation when the target is available.
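
The flow described above could be sketched as follows. The `Node` class and the `coordinated_write` and `replay_hints` functions are hypothetical names invented for this example; a real database would persist hints and expire them after some period, which this sketch omits.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target node, key, value) tuples held for down nodes

    def write(self, key, value):
        self.data[key] = value


def coordinated_write(target, fallback, key, value):
    """Write to `target`; if it is down, leave a hint on `fallback` instead."""
    if target.up:
        target.write(key, value)
    else:
        fallback.hints.append((target, key, value))


def replay_hints(node):
    """Deliver stored hints whose targets are back up; keep the rest."""
    remaining = []
    for target, key, value in node.hints:
        if target.up:
            target.write(key, value)
        else:
            remaining.append((target, key, value))
    node.hints = remaining
```

Calling `replay_hints` on a schedule plays the role of the periodic status check: the write succeeds from the client's point of view right away, and the target node catches up once it returns.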

Chapter 11

1. What is the role of end-user queries in column family database design?

Answer: Queries provide information needed to effectively design column family databases. The information includes entities, attributes of entities, query criteria, and derived values. It is users who determine the questions that will be asked of the database application and drive the data model design.

2. How can you avoid performing joins in column family databases?

Answer: Denormalization is used to avoid joins.

3. Why should entities be modeled in a single row?

Answer: Column family databases do not provide the same level of transaction control as relational databases. Typically, writes to a row are atomic. If you update several columns in a row, they will all be updated or none of them will be. If you need to update two separate tables, such as a product table and a book table, it is conceivable that the updates to the product table succeed but the updates to the book table do not. In such a case, you would be left with inconsistent data. Modeling an entity in a single row avoids this problem.

4. What is hotspotting, and why should it be avoided?

Answer: Hotspotting occurs when many operations are performed on a small number of servers. It is inefficient to direct an excessive amount of work at one or a few machines while there are others that are underutilized.

5. What are some disadvantages of using complex data structures as a column value?

Answer: Not all column family database features work well with complex data structures. Using separate columns for each attribute makes it easier to apply database features to the attributes. For example, creating separate columns for street, city, state, and zip means you can create secondary indexes on those values.

6. Describe three scenarios in which you should not use secondary indexes.

Answer:

• There are a small number of distinct values in a column.

• There are many unique values in a column.

• The column values are sparse.

7. What are the disadvantages of managing your own tables as indexes?

Answer: When using tables as indexes, you will be responsible for maintaining the indexes. You could update the index whenever there is a change to the base tables; for example, a customer makes a purchase. Alternatively, you could run a batch job at regular intervals to update the index tables.

Updating index tables at the same time you update the base tables keeps the indexes up to date at all times. A drawback of this approach is that your application will have to perform two write operations, one to the base table and one to the index table. This could lead to longer latencies during write operations.

Updating index tables with batch jobs has the advantage of not adding additional work to write operations. The obvious disadvantage is that there is a period of time when the data in the base tables and the indexes is out of synchronization.
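
Both approaches can be sketched with plain dictionaries standing in for tables. The table names, the `city` attribute, and the functions `save_customer` and `rebuild_index` are assumptions made for this illustration.

```python
customers = {}          # base table: customer_id -> row
customers_by_city = {}  # index table: city -> set of customer_ids


def save_customer(customer_id, row):
    """Write the base row and update the index table in the same operation."""
    old = customers.get(customer_id)
    if old is not None:
        # Remove the stale index entry before adding the new one.
        customers_by_city[old["city"]].discard(customer_id)
    customers[customer_id] = row                                       # write 1
    customers_by_city.setdefault(row["city"], set()).add(customer_id)  # write 2


def rebuild_index():
    """Batch alternative: recompute the whole index from the base table."""
    customers_by_city.clear()
    for customer_id, row in customers.items():
        customers_by_city.setdefault(row["city"], set()).add(customer_id)
```

`save_customer` shows the extra write that synchronous maintenance adds to every update. `rebuild_index` shows the batch approach, where the index reflects the base table only as of the last run.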

8. What are two types of statistics? What are they each used for?

Answer: Descriptive statistics are used for understanding the characteristics of your data. Predictive, or inferential, statistics are used for making predictions based on data.

9. What are two types of machine learning? What are they used for?

Answer: Unsupervised learning is useful for exploring large data sets with techniques such as clustering. Supervised learning techniques provide the means to learn from examples. These techniques can be used to create classifiers.

10. How is Spark different from MapReduce?

Answer: MapReduce writes much data to disk, whereas Spark makes more use of memory. MapReduce employs a fairly rigid computational model (map operation followed by reduce operation), whereas Spark allows for more general computational models.

Chapter 12

1. What are the two components of a graph?

Answer: Vertices and edges.

2. List at least three sample entities that can be modeled as vertices.

Answer:

• Cities

• Employees in a company

• Proteins

• Electrical circuits

• Junctions in a water line

• Organisms in an ecosystem

• Train stations

• A person infected with a contagious disease

3. List at least three sample relations that can be modeled as edges.

Answer:

• Roads connecting cities

• Employees working with other employees

• Proteins interacting with other proteins

• Electrical components linked to other electrical components

• Water lines connecting junctions

• Predators and prey in ecosystems

• Rail lines connecting train stations

• Disease transmission between an infected and uninfected person

4. What properties could you associate with a vertex representing a city?

Answer:

• City name

• Population

• Longitude and latitude

• Points of interest

5. What properties could you associate with an edge representing a highway between two cities?

Answer:

• Length

• Year built

• Maximum speed

6. Epidemiologists use graphs to model the spread of infection. What do vertices represent? What do edges represent?

Answer: Vertices represent people. Edges represent interactions between people, such as shaking hands or standing in close proximity.

7. Give an example of a part-of hierarchy.

Answer:

• Federal, state or provincial, and local governments

• The parts of a car, such as a piston that is part of an engine that is part of a car

8. How do graph databases avoid joins?

Answer: In a graph database, instead of performing joins, you follow edges from vertex to vertex.
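
As a small illustration, the sketch below stores a graph as an adjacency list and finds friends of friends by following edges for two hops. The data and the function name `friends_of_friends` are invented for the example; in a relational design, the same question would typically require joining a friendship table to itself.

```python
# Adjacency list: each vertex stores direct references to its neighbors.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}


def friends_of_friends(graph, person):
    """Follow edges two hops out from `person`, with no join on a shared key."""
    result = set()
    for friend in graph[person]:
        for fof in graph[friend]:
            if fof != person:
                result.add(fof)
    return result


print(sorted(friends_of_friends(friends, "alice")))  # ['dave', 'erin']
```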

9. How is a person-likes-post graph different from other graphs used as examples in this chapter?

Answer: This is an example of a bipartite graph.

10. Give an example of a business application that would use multiple types of edges (relations) between vertices.

Answer: A transportation company might want to consider road, rail, and air transportation between cities. Each has different options, such as time to deliver, cost, and government regulations.

Chapter 13

1. Define a vertex.

Answer: A vertex represents an entity marked with a unique identifier. A vertex can represent virtually any entity that has a relation with another entity.

2. Define an edge.

Answer: Edges define relationships between vertices.

3. List at least three examples in which you can use graphs to model the domains.

Answer:

• Transportation networks

• Social networks

• Spread of infectious diseases

• Electrical circuits

• Networks, such as the Internet

4. Give an example of when you would use a weighted graph.

Answer:

• In the case of highways, weight could be the distance between cities.

• In a social network, weight could be an indication of how frequently the two individuals post on each other’s walls or comment on each other’s posts.

5. Give an example of when you would use a directed graph.

Answer: In a family relations graph, there is a direction associated with a “parent of” relation.

6. What is the difference between order and size?

Answer: The order of a graph is the number of vertices in the graph. The size is the number of edges in a graph.
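
For a concrete case, consider a small graph invented for this example, with its vertices and edges held in Python sets:

```python
vertices = {"A", "B", "C", "D"}
edges = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("A", "C")}

order = len(vertices)  # number of vertices
size = len(edges)      # number of edges
print(order, size)     # 4 5
```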

7. Why is betweenness sometimes called a bottleneck measure?

Answer: Betweenness is a measure of how important a vertex is to connecting different parts of a graph. If all paths from one part of the network to another part must go through a single vertex, then it can become a bottleneck. Such vertices have high betweenness scores.

8. How would an epidemiologist use closeness to understand the spread of a disease?

Answer: Closeness is a property of a vertex that indicates how far the vertex is from all others in the graph. People (vertices) with high closeness scores have short paths to others in the network. Diseases can spread faster from people with high closeness scores than from those with low closeness scores.
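
One common way to compute closeness is to divide the number of other reachable vertices by the sum of the shortest-path distances to them, so that shorter paths give a higher score. The sketch below uses that definition, which is an assumption here since the answer above does not give a formula. It also assumes an unweighted graph, and the contact network is invented for illustration.

```python
from collections import deque


def closeness(graph, start):
    """Closeness of `start`: reachable vertices / sum of shortest-path hops."""
    dist = {start: 0}
    queue = deque([start])
    while queue:  # breadth-first search yields hop counts
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


contacts = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "d"],
    "d": ["c"],
}
print(closeness(contacts, "hub"))  # 0.8
print(closeness(contacts, "d"))
```

In this network, the person labeled `hub` scores higher than the person labeled `d`, so an infection starting at `hub` has shorter paths to everyone else.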

9. When would you use a multigraph?

Answer: Multigraphs are graphs with multiple edges between vertices. Multiple edges between cities could represent various shipping options, such as shipping by truck, train, or plane.

10. What is Dijkstra’s algorithm used for?

Answer: Dijkstra’s algorithm is used to find the shortest paths in a network.
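
A compact version of the algorithm using a priority queue is sketched below. The graph representation (a dictionary mapping each vertex to a list of neighbor and weight pairs) and the sample weights are assumptions made for the example; the algorithm requires non-negative edge weights.

```python
import heapq


def dijkstra(graph, source):
    """Return the shortest distance from `source` to every reachable vertex.

    `graph` maps each vertex to a list of (neighbor, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale entry; a shorter path to v was already found
        for w, weight in graph.get(v, []):
            candidate = d + weight
            if candidate < dist.get(w, float("inf")):
                dist[w] = candidate
                heapq.heappush(heap, (candidate, w))
    return dist


network = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 1)],
    "C": [("B", 2), ("D", 7)],
    "D": [],
}
print(dijkstra(network, "A"))  # {'A': 0, 'B': 3, 'C': 1, 'D': 4}
```

The direct edge from A to B has weight 4, but the path through C costs only 3, so the algorithm reports 3 as the shortest distance to B.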

Chapter 14

1. What is the benefit of mapping domain-specific queries into graph-specific queries?

Answer: Once you have your domain-specific queries mapped to graph-specific queries, you have the full range of graph query tools and graph algorithms available to you to analyze and explore your data.

2. Which is more like SQL, Cypher or Gremlin?

Answer: Cypher.

3. How is the MATCH statement like a SQL SELECT statement?

Answer: MATCH is used to retrieve data from a graph database, just as SELECT is used to retrieve data from a relational database. Like SELECT, MATCH supports filtering based on properties.

4. What are the inE and outE terms used for in Gremlin?

Answer: inE is a reference to incoming edges of a vertex; outE is a reference to outgoing edges of a vertex.

5. Which type of edge should be used for a nonsymmetrical relation, a directed or undirected edge?

Answer: A directed edge.

6. What is the difference between a declarative and a traversal query language for graph databases?

Answer: Declarative languages express what is to be retrieved; traversal languages specify how to retrieve data.

7. What is a depth-first search?

Answer: In a depth-first search, you start traversal at one vertex and select adjacent vertices. You then select the first vertex in that result set and select adjacent vertices to it. You continue to select the first vertex in the result set until there are no more edges to traverse. At that point, you visit the next vertex in the latest result set. If there are incident edges leading to other vertices, you visit those; otherwise, you continue to the next item in the latest result set. When you exhaust all vertices in the latest result set, you return to the result set selected prior to that and begin the process again.
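
The procedure can be expressed as a short recursive function. This is a minimal sketch over an adjacency list; the function name `depth_first` and the sample graph are invented for the example, and the `seen` set is an addition that keeps the traversal from revisiting vertices.

```python
def depth_first(graph, start):
    """Return vertices in the order a depth-first traversal visits them."""
    visited = []
    seen = set()

    def visit(v):
        seen.add(v)
        visited.append(v)
        for w in graph[v]:     # take the first adjacent vertex and go deep
            if w not in seen:  # skip vertices that were already visited
                visit(w)
        # Returning here goes back to the previous vertex's remaining neighbors.

    visit(start)
    return visited


tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}
print(depth_first(tree, "A"))  # ['A', 'B', 'D', 'E', 'C', 'F']
```

The traversal goes all the way down through B to D and E before it returns to A's second neighbor, C.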

8. What is a breadth-first search?

Answer: In a breadth-first search, you visit each of the vertices adjacent to the current vertex before visiting other vertices.
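
A queue-based sketch of this traversal follows. The function name `breadth_first` and the sample graph are invented for the example, and the `seen` set is an addition that prevents visiting a vertex twice.

```python
from collections import deque


def breadth_first(graph, start):
    """Return vertices in the order a breadth-first traversal visits them."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph[v]:  # visit every neighbor of v before going deeper
            if w not in seen:
                seen.add(w)
                visited.append(w)
                queue.append(w)
    return visited


levels = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}
print(breadth_first(levels, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```

Both of A's neighbors, B and C, are visited before any vertex two hops away.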

9. Why are cycles a potential problem when performing graph operations?

Answer: Cycles can lead to traversing the same vertices repeatedly. Keeping track of visited vertices is one way to avoid problems with cycles.

10. Why is scalability such an important consideration when working with graphs?

Answer: Graph operations can become slow as a graph grows. Scalability in graph databases must address growth in:

• Vertices and edges

• Number of users

• Number and size of properties on vertices and edges

Chapter 15

1. Name two use cases for key-value databases.

Answer:

• Caching data from relational databases to improve performance

• Tracking transient attributes in a web application, such as a shopping cart

2. Describe two reasons for choosing a key-value database for your application.

Answer:

• There is a need for variable attributes.

• The problem domain requires a relatively simple data model.

3. Name two use cases for document databases.

Answer:

• Content management systems

• Back-end support for mobile device applications

4. Describe two reasons for choosing a document database for your application.

Answer:

• There is a wide variety of query patterns.

• There is a need for flexible data structures.

5. Name two use cases for column family databases.

Answer:

• Collecting and analyzing log data from a large number of devices

• Analyzing customer characteristics to generate personalized offers

6. Describe two reasons for choosing a column family database for your application.

Answer:

• There is a need for multi–data center replication.

• There is the need to work with Big Data–scale volumes of data.

7. Name two use cases for graph databases.

Answer:

• Modeling computer networks

• Modeling social media networks

8. Describe two reasons for choosing a graph database for your application.

Answer:

• There is a need to model explicit relations between entities and rapidly traverse paths between entities.

• There is an affinity between the problem domain, such as transportation networks, and graphs.

9. Name two types of applications well suited for relational databases.

Answer:

• Transaction processing

• Data warehouses and data marts

10. Discuss the need for both NoSQL and relational databases in enterprise data management.

Answer: NoSQL and relational databases are complementary. Relational databases offer many features that protect the integrity of data and reduce the risk of data anomalies. Relational databases incur operational overhead providing these features. In some use cases, performance is more important than ensuring immediate consistency or supporting ACID transactions. In these cases, NoSQL databases may be the better solution. Choosing a database is a process of choosing the right tool for the job. The more varied your set of jobs, the more varied your toolkit.