Glossary - Appendices - NoSQL for Mere Mortals (2015)

NoSQL for Mere Mortals (2015)

Part VII: Appendices

C. Glossary

A

anti-entropy

Anti-entropy is the process of detecting differences in replicas. From a performance perspective, it is important to detect and resolve inconsistencies with a minimum amount of data exchange.

B

betweenness

Betweenness is a measure of how much of a bottleneck a given vertex is.

bipartite graph

A bipartite graph, or bigraph, is a graph with two distinct sets of vertices where each vertex in one set is only connected to vertices in the other set.

Bloom filter

A Bloom filter tests whether or not an element is a member of a set. Unlike a typical member function, the Bloom filter sometimes returns an incorrect answer. It could return a positive response in cases where the tested element is not a member of the set. This is known as a false positive. Bloom filters never return a negative response unless the element is not in the set.

C

CAP theorem

Also known as Brewer’s theorem after the computer scientist who introduced it, the CAP theorem states that distributed databases cannot have consistency (C), availability (A), and partition protection (P) all at the same time. Consistency, in this case, means consistent copies of data on different servers. Availability refers to providing a response to any query. Partition protection means a network that connects two or more servers is unable to transmit network packets between the servers.

closeness

Closeness is a property of a vertex that indicates how far the vertex is from all others in the graph.

cluster

A cluster is a set of servers configured to function together. Servers sometimes have differentiated functions and sometimes they do not.

collection

A collection is a group of documents. The documents within a collection are usually related to the same subject entity, such as employees, products, logged events, or customer profiles. It is possible to store unrelated documents in a collection, but this is not advised.

collision

A collision occurs when two distinct inputs to a hash function produce the same output. When it is difficult to find two inputs that map to the same hash function output, the hash function is known as collision resistant. If a hash table is not collision resistant or if you encounter one of those rare cases in which two inputs map to the same output, you will need a collision resolution strategy.

column

A column is the data structure for storing a single value in a column family database.

column family

A column family is a collection of related columns. Columns that are frequently used together should be grouped into the same column family.

commit log

Commit log files are used in databases to improve performance while ensuring recoverability. Instead of writing data immediately to their partition and disk block, column family databases can employ commit logs. These are append-only files that always write data to the end of the file. In the event of a failure, the database management system reads the commit log on recovery. Any entries in the commit log that have not been saved to partitions are written to appropriate partitions.

compression

Compression is a data management technique that uses repeating patterns in data to reduce the storage needed to hold the data. A compression algorithm for databases should perform compression and decompression operations as fast as possible. This often entails a trade-off between the speed of compression/decompression and the size of the compressed data. Faster compression algorithms can lead to larger compressed data than other, slower algorithms.

consistency level

Consistency level refers to the consistency between copies of data on different replicas. In the strictest sense, data is consistent only if all replicas have the same data. At the other end of the spectrum, you could consider the data “consistent” as long as it is persistently written to at least one replica. There are several intermediate levels as well.

D

degree

Degree is the number of edges linked to a vertex and is one way to measure the importance of any given vertex in a graph.

directed graph

A directed graph is one in which the edges have a specified direction from one vertex to another.

document

A document is a set of ordered key-value pairs. A key is a unique identifier used to look up a value. A value is an instance of any supported data type, such as a string, number, array, or list.

E

edge

An edge, also known as a link or arc, defines relationships between vertices or objects connecting vertices.

embedded document

An embedded document enables document database users to store related data in a single document. This allows the document database to avoid a process called joining in which data from one table, called the foreign key, is used to look up data in another table.

F

flow network

A flow network is a directed graph in which each edge has a capacity and each vertex has a set of incoming and outgoing edges. The sum of the capacity of incoming edges cannot be greater than the sum of the capacity of outgoing edges. The two exceptions to this rule are source and sink vertices. Sources have no inputs but have outputs, whereas sinks have inputs but no outputs.

G

gossip protocol

The gossip protocol is a protocol for sharing information between nodes in a cluster. Instead of having every node communicate with every other node, it is more efficient to have nodes share information about themselves as well as other nodes from which it has received updates.

graph traversal

Graph traversal is the process of visiting all vertices in a graph in a particular way. The purpose of this is usually to either set or read some property value in a graph.

H

hash function

A hash function is an algorithm that maps from an input, for example, a string of characters, to an output string. The size of the input can vary, but the size of the output is always the same.

hinted handoff

The hinted handoff is a protocol designed to preserve update information if a server is down. If a write operation is directed to a node that is unavailable, the operation can be redirected to another node, such as another replica node or a node designated to receive write operations when the target node is down.

The node receiving the redirected write message creates a data structure to store information about the write operation and where it should ultimately be sent. The hinted handoff periodically checks the status of the target server.

horizontal partitioning or sharding

Horizontal partitioning is the process of dividing a database by documents in a document database or by rows in a relational database.

I

intersection of graphs

The intersection of a graph is the set of vertices and edges that are common to both graphs.

isomorphism

Two graphs are considered isomorphic if for each vertex in the first graph, there is a corresponding vertex in the other graph. In addition, for each edge between a pair of vertices in the first graph, there is a corresponding edge between the corresponding vertices of the other graph.

K

key

In key-value databases, a key is a reference to a value. In relational databases, a key is also a way to reference a row in a table. Primary keys refer to the way of uniquely identifying a row. Foreign keys refer to keys stored in one table that are used to reference rows in other tables.

keyspace

A keyspace is the top-level data structure in a column family database. It is top level in the sense that all other data structures you would create as a database designer are contained within a keyspace. A keyspace is analogous to a schema in a relational database.

L

loop

A loop is an edge that connects a vertex to itself.

M

multigraph

A multigraph is a graph with multiple edges between vertices.

N

namespace

A namespace is a list of key-value pairs without duplicates for holding key-value pairs. A namespace could be an entire key-value database. The essential characteristic of a namespace is it is a collection of key-value pairs that has no duplicate keys. It is permissible to have duplicate values in a namespace.

O

order of a graph

The order of a graph is the number of vertices in the graph.

P

partition

A partition is a logical subset of a database. Partitions are usually used to store a set of data based on some attribute of the data.

partitioning

With respect to distributed databases, partitioning refers to splitting documents, tables, or graphs and distributing them to different servers.

With respect to the CAP theorem, partitioning refers to losing a network connection in a way that leaves parts of the network unreachable from some other parts.

path

A path through a graph is a set of vertices along with the edges between those vertices. The vertices in a graph are all different from each other. If edges are directed, the path is a directed path. If the graph is undirected, the paths in it are undirected paths.

polymorphic schema

A polymorphic schema is a database that allows for documents of different types and forms to exist in the same collection. This term is usually used with reference to document databases.

Q

query processor

The query processor takes as input queries and data about the documents, columns, or graphs in a database and produces a sequence of operations that retrieve the selected data. Key-value databases do not need query processors; they function by looking up values by keys.

R

replication

Replication is the process of saving multiple copies of data in your cluster. This provides for high availability of distributed databases.

ring

A ring is a logical structure for organizing partitions. A ring is a circular pattern in which each server or instance of key-value database software running on a server is linked to two adjacent servers or instances. Each server or instance is responsible for managing a range of data based on a partition key.

row key

A row key uniquely identifies a row in a column family. It serves some of the same purposes as a primary key in a relational database.

S

schemaless

Document databases do not require data modelers to formally specify the structure of documents. A formal structure specification is known as a schema. Relational databases do require schemas.

sharding or horizontal partitioning

Horizontal partitioning is the process of dividing a database by documents in a document database or by rows in a relational database.

size of a graph

The size of a graph is the number of edges in a graph.

U

undirected graph

An undirected graph is one in which the edges do not indicate a direction (such as from-to) between two vertices.

union of graphs

The union of graphs is the combined set of vertices and edges in a graph.

V

value

In key-value databases, a value is an object, typically a set of bytes, that has been associated with a key. Values can be integers, floating-point numbers, strings of characters, binary large objects (BLOBs), semistructured constructs such as JSON objects, images, audio, and just about any other data type you can represent as a series of bytes.

vertex

A vertex represents an entity marked with a unique identifier—analogous to a row key in a column family database or a primary key in a relational database. It should be noted that the term node is an acceptable replacement for vertex. However, this book only uses the latter to avoid confusion with the use of node to describe a service running in a cluster.

vertical partitioning

Vertical partitioning is a technique for improving database performance by storing columns of a relational table in separate, multiple partitions. This is especially helpful when only a subset of columns is read from a row of data and the columns are read from many rows.

W

weighted graph

A weighted graph is a graph in which each edge has a number assigned to it. The number can reflect a cost, a capacity, or some other measure of an edge. This is commonly used in optimization problems, such as finding the shortest path between vertices.