
Information Management: Strategies for Gaining a Competitive Advantage with Data (2014)

Chapter 12. Graph Databases

When Relationships are the Data

Over in a corner of the NoSQL world, hidden among the key-value stores, the document stores such as Couchbase and MongoDB, and the column stores such as Cassandra, lie the graph databases.

Before relational databases, there were network databases, which are actually quite similar in concept to graph databases. SQL and the relational world came along and were clearly a better fit for the workload of the day, which was largely oriented to working with numbers. Now what matters has expanded, and graph databases make a strong value proposition for their intended workload: highly connected data, including navigating social networks, configurations, and recommendations. With the high interest in those applications, it's a workload that is poised to expand tremendously.

Graph databases do not accept SQL. For example, Cypher is the language used with Neo4j. It contains the commands necessary to get nodes, traverse them, and return values, and it is simpler than SQL for traversing relationships to find values or test for their existence. Gremlin is another project for accessing graph databases. One very cool feature is the ability to limit the "degrees" that are searched in a query.

Neo Technology, a Swedish company, is the commercial sponsor of Neo4j, a leading graph database. The ACID-compliant Neo4j can hold up to tens of billions of nodes, tens of billions of relationships, and tens of billions of properties. Andreas Kollegger, Product Experience Designer at Neo Technology, noted there were 1,000 people participating in the community, with thousands of databases deployed at customers of all sizes. Other graph databases include STIG from Tagged and AllegroGraph from Franz. Objectivity's InfiniteGraph is an object-oriented graph database.

Keywords

database management; analytics; graph databases; Neo4J; Neo Technology; analytic databases; STIG; Tagged; AllegroGraph; Franz; Objectivity; Infinite Graph

There is at least one application of NoSQL technology to a workload that makes sense even when the data is not necessarily massive. DBMS models used to have a primary focus on the relationships amongst data. Interestingly, and perhaps ironically, the focus of the relationships in RDBMSs is generally between discrete things that were broken apart (normalized) in order to be stored using the relational model. In contrast, relationships in graph databases are used to relate just about anything, including different types of relationships for like things (people to people, etc.).

Precursors to the relational model, such as the hierarchical and, later, the network models, stored data with meticulous navigational information. Relational turned out to be a better fit for heterogeneous workloads on the data and took care of the relationship aspects with referential integrity. Further, relational is a good fit for types of data that are inherently tabular and static, such as invoices, order forms, and general ledgers.

Eventually, modern demands grew to require the model to fit the workload more closely to drive performance, and graph databases have emerged to, once again, emphasize data relationships. The types of data that people care to monetize these days are more complex and interconnected (i.e., real-world systems, which are by their nature highly interconnected), versus the form data one has in accounting, which for decades was perhaps the major driver behind much of the IT innovation that occurred when RDBMSs came to the fore in the 80s and 90s. Workloads, likewise, have emerged that emphasize relationships. When relationships are the important aspect of the data, graph databases shine. The data does not have to be petabytes, terabytes, or even hundreds of gigabytes for the graph database to provide significant performance benefits over other database technologies.

"Graph" is probably not the most apt term for this class of databases, as it conjures up an image of something with fixed rows and columns and predictable connections (e.g., graph paper). The older term, network, may be better, but regardless of the name, the category is a fit when relationships matter most.

The data itself can be homogeneous, such as all people and their relationships, as in a "social graph." For this, you can think of Twitter and how we're all connected. Fast checks of how many degrees apart we are, or list pulls of everyone within a certain number of degrees, are workloads that would require nested self-joins in a relational database, which is not a fast-path query. Those queries can be enormously complex, and DBMS optimizers' interpretations of them can be quite variable and costly as well.

Here’s example SQL for finding the author’s friends-of-friends:

SELECT pf1.person AS person, pf2.friend AS friend_of_friend
FROM personfriend pf1
INNER JOIN personfriend pf2
  ON pf1.friend = pf2.person
WHERE pf1.person = 'William McKnight'
  AND pf2.friend <> 'William McKnight'

Graph databases also yield very consistent execution times that do not depend on the total number of nodes in the graph, whereas relational database performance decreases as the total size of the data set increases.

Other NoSQL models are even less of a fit for the relationship workload due to their lack of support for relationships in the model itself. Their cost is exacerbated at deep levels of a graph; think friends of friends of friends of friends. Dealing with these exponentially expanding traversals in other NoSQL models will force you to accept functionality limitations. Many of these systems try to maintain the appearance of graph-like processing, but inevitably it is done in batches and is not real-time.

Why You Need to Do a Friends of Friends Query

One reason is to make accurate recommendations. According to James Fowler, a social scientist specializing in social networks, cooperation, political participation, and genopolitics, our second- and third-degree connections exert a great influence on our lives, which is very interesting because this includes people we may not even know! A second reason is that the hops can be between different types of things (i.e., heterogeneous hops). I recently met with a Neo4j customer whose typical queries are of degree eight. The highest I've heard of is a customer who is benchmarking queries with a degree of over 200.

I’ll get to the nature of the structure of the graph database that facilitates these kinds of queries later. Graph databases are more specialized in their best fit than most of the other structures here, so I will spend some more time describing their best workload, the relationship workload.

A node representing the author might look like CREATE (author {firstname: 'William', lastname: 'McKnight'}). This node is now available, and the author identifier is used later in the statement to attach relationships to it. For example, the clause CREATE (author)-[:WROTE {date: 2014}]->(book) creates a WROTE relationship from the author node to the book Information Management, the one you now hold in your hand or e-reader.
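Put together as a single runnable statement, the creation of these nodes and the relationship between them might look like the following sketch. The Person and Book labels and the book node's title property are illustrative assumptions, not the schema of any particular deployment.

// A minimal sketch: create the author and book nodes, then connect them.
// The Person and Book labels and the title property are assumptions.
CREATE (author:Person {firstname: 'William', lastname: 'McKnight'})
CREATE (book:Book {title: 'Information Management'})
CREATE (author)-[:WROTE {date: 2014}]->(book)
RETURN author, book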

Social graphs may be the most obvious example of a graph database use, but it is actually not the most dominant. Many graph implementations use heterogeneous nodes, as in Figure 12.2. If you can visualize it as a graph, it may make sense to be in a graph database. Graph analysis is relevant in many industries and in companies large and small. Consider some of these examples:

• The previously mentioned social networking. Sites like Facebook, Twitter, LinkedIn, etc. are all interested in network-generated social connections. These connections are the lifeblood of their concepts and their navigation must be done in real-time (under 200 ms) because this information is used for everything from online recommendations to user login to parcel routing, and is used thousands of times a day.

• Flight path analysis. Consider airlines visualizing all of the flights in and out of various airports and overlaying the route graph with costs, distances, and other factors to determine where to put new routes.

• Financial traffic analysis. What are the popular routes of money flow between institutions, accounts, and people? Where is fraud or money laundering about to happen?

• Transportation network analysis. Where do packages and, therefore, vehicles (all kinds) need to maneuver to optimally deliver, given the constraints of time and money?

• Continuing the idea of "connections," telecommunications companies sit on some very interesting connection data: who calls or texts whom, with frequency, time, and location dimensions.

• Website navigation analysis. How do visitors navigate the site? What pages do they traverse and in what order? This knowledge leads to better website design. The internet itself is a network of connected pages, which makes storage of any navigation of your own site, or other sites, fit for a graph database.

• Crime and terrorism are seldom carried out in a vacuum. How people and organizations are related around these topics is of interest to government agencies and to companies concerned with fraud, such as financial services companies.

• Propensity to purchase. When this is needed in order to make a fast next-best offer based on precalculated metrics, graph databases shine. They can be used to find who is similar to the person and who has a similar history of purchases and market baskets; a hypothetical Cypher sketch of such a query follows this list.
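As a rough illustration of that propensity-to-purchase idea, here is a hypothetical Cypher query over an assumed (:Person)-[:BOUGHT]->(:Product) model; the labels, relationship type, and property names are invented for the sketch.

// Recommend products bought by people whose purchase history overlaps mine,
// excluding anything I have already bought. The schema is hypothetical.
MATCH (me:Person {name: 'William McKnight'})-[:BOUGHT]->(p:Product)
      <-[:BOUGHT]-(other:Person)-[:BOUGHT]->(rec:Product)
WHERE NOT (me)-[:BOUGHT]->(rec)
RETURN rec.name AS recommendation, count(*) AS score
ORDER BY score DESC
LIMIT 10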


FIGURE 12.1 Graph database showing who called whom.


FIGURE 12.2 A graph with heterogeneous nodes.

There are also recommendation engines, geospatial analysis, product catalogs, biotechnology applications, genealogy research, and linguistic data, all of which are relationship-based.

Taking a broad view, you can see that relationships are everywhere. Once you start thinking this way, the possibilities for graph databases expand. Relationships are the foundation of graph databases, which are optimized around them.

Graph database benefits are amplified when this relationship type of workload spans more nodes. A leading graph database server, Neo4j, accepts up to tens of billions of nodes, tens of billions of relationships, and tens of billions of properties (and will soon be lifting these limits).

Consider a graph whose nodes all represent "friends": it forms one large cluster of people connected to all of their friends, who are connected to their friends, and so on. This inability to definitively draw boundaries that very few relationships cross (think of the Oprah effect1 on the Twitter graph) makes sharding difficult.

Should a graph exceed the capacity of a cluster, it can still be spread across database instances if the application builds in sharding logic. Sharding involves the use of a synthetic identifier to join records across database instances at the application level. How well this performs depends very much on the shape of the graph.

Some graphs will lend themselves to this.

Of course, not all graphs have such convenient boundaries. If your graph is large enough that it needs to be broken up and no natural boundaries exist, then the approach is much the same as one would use with a NoSQL store like MongoDB: create synthetic keys and relate records via the application layer, using those keys plus some application-level resolution algorithm.

Like other NoSQL databases, graph databases also excel at high availability, with features like master-slave replication, automatic failover with master re-election, and clustering with remote read-only slave nodes for disaster recovery.

Queries run fastest when the portions of the graph needed to satisfy them reside in main memory. A single graph database instance today can hold many billions of nodes, relationships, and properties; some graphs, therefore, will just be too big to fit into the main memory of a single instance. Ways of optimizing in such scenarios have evolved. For example, Neo Technology’s “cache sharding” relies on the natural cache affinity that results when one load balances across a cluster in a consistent manner. Here, queries are routed to instances in such a way that each instance ends up with a different part of the graph in memory, allowing queries to find their data in memory more often than not.

Terms

Graph – A collection of nodes and relationships

Node – The object of interest in a graph. A node contains a set of (name, value) pairs. There can be multiple types of nodes in a single graph. Nodes are essentially the records in a graph database.

Traversal – The navigation of a graph to determine connections, degrees of separation, and shortest paths.

Properties – Attributes of a node or relationship.

Relationships – Connections between nodes, with a direction. Relationships have properties as well.

Queries – Traversals of the graph.

Paths – Routes between nodes, including the traversed relationships.

While not applicable in every situation, these general guidelines will help you to choose when to use nodes and when to use relationships (a short Cypher sketch applying them follows the list):

• Use nodes to represent entities—that is, the things that are of interest to you in your domain.

• Use relationships to express the connections between entities and establish semantic context for each entity, thereby structuring the domain.

• Use node properties to represent entity attributes, plus any necessary entity metadata, such as timestamps, version numbers, etc.

• Use relationship properties to express the strength, weight, or quality of a relationship, plus any necessary relationship metadata, such as timestamps, version numbers, etc.
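Here is a small Cypher sketch applying these guidelines; the labels, relationship type, and property values are hypothetical.

// Entities become nodes with attribute properties (plus metadata such as a
// created timestamp); the connection becomes a relationship whose own
// properties carry its strength and metadata. All names are illustrative.
CREATE (alice:Person {name: 'Alice', created: '2014-01-15'})
CREATE (acme:Company {name: 'Acme Corp'})
CREATE (alice)-[:WORKS_FOR {since: 2011, role: 'Analyst'}]->(acme)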

Structure

Like nodes, relationships can have properties, such as the age of the relationship or its type (public, private).

Neo4j, for example, stores graph data spread across a number of store files. Each store file contains the data for a specific part of the graph (e.g. nodes, relationships, properties). The division of storage responsibilities—particularly the separation of graph structure from property data—exists for high performance graph traversals even if it means the user’s view of their graph and the actual records on disk are structurally dissimilar.

Graph databases do not accept SQL. For example, Cypher is the language used with Neo4j. It contains the commands necessary to get nodes, traverse them, and return values, and it is SQL-like in style while traversing relationships to find values or test for their existence. One very cool feature is the ability to limit the "degrees" that are searched in a query, as the sketch below shows. More about Cypher in the next section.
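For instance, a variable-length pattern can bound the traversal to a chosen number of degrees. In this sketch, the :Person label and the :KNOWS relationship type are assumptions for illustration.

// Find everyone within one to three hops (degrees) of the starting person.
// The :Person label and :KNOWS relationship type are assumptions.
MATCH (a:Person {name: 'William McKnight'})-[:KNOWS*1..3]-(b:Person)
RETURN DISTINCT b.name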

As alluded to in the examples above, graph traversal can focus on a specific path, finding an optimal path, or the importance (“centrality”) of a path.

In specific path analysis, the graph database moves out from a node in all directions looking for the relationship. The pattern repeats until the relationship is found or the edge of the graph is reached.

With optimal path analysis, the shortest path between two nodes is found. This yields the level of connectedness: the closer the nodes are in the graph, the closer the relationship.
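Cypher exposes this directly through its shortestPath function. In the sketch below, the :Person label, the names, and the 15-hop cap are illustrative assumptions.

// Shortest path between two people, capped at 15 hops; length(p) gives the
// degrees of separation. Labels and names are assumptions for the sketch.
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'}),
      p = shortestPath((a)-[*..15]-(b))
RETURN p, length(p) AS degrees_of_separation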

Modeling for a graph database is "whiteboard friendly." By this, I mean it is how a user would model the problem on a whiteboard. There is close affinity between the logical and physical graph models, which closes various project gaps, including those between requirements analysis, design, and the start of development, as well as that between the development of an application and the ability of that application to persist (write/store) data. Relational modeling requires more transformations from the real world to its implementations. There is also no need for the denormalization technique applied to relational models, which trades away some relatability in order to improve performance.


FIGURE 12.3 Modeling a user’s order history in a graph.
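In Cypher, a model like the one Figure 12.3 depicts can be written down almost exactly as it would appear on the whiteboard. The labels and relationship types below are assumptions about that figure rather than a transcription of it.

// A user places orders; each order contains products.
// User/Order/Product labels and PLACED/CONTAINS types are assumed.
CREATE (u:User {name: 'Alice'})
CREATE (o:Order {id: 1001, placed: '2014-03-02'})
CREATE (p:Product {sku: 'ABC-123', name: 'Widget'})
CREATE (u)-[:PLACED]->(o)
CREATE (o)-[:CONTAINS {quantity: 2}]->(p)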

Centrality Analysis

Finally, there is centrality analysis. Various measures of the centrality of a node have been defined in graph theory, which underlies the graph database. The higher the measure, the more “important” the node. Here are some different ways to measure centrality:

Degree centrality: This is simply the number of edges attached to the node. The more edges, relatively speaking within the graph, the more important the node. The nodes with more edges (i.e., the more "important" customers, products, etc.) typically look like "hubs" of activity if you were to visualize the graph. A plain-Cypher sketch of this measure follows these definitions.

Closeness centrality: This measure is the inverse of the sum of the lengths of the shortest paths to all other vertices. Closeness can be regarded as a measure of how quickly information can spread from the node to all other nodes. If a node has strong closeness centrality, it is in a position, with its relationships, to spread information quickly. These people (if the nodes in the graph are people) can be important influencers of the network.

Betweenness centrality: This indicates how many of the shortest paths between other nodes pass through the node, which represents that node's ability to make connections between other groups in the graph.

Eigenvector centrality: Finally, there is eigenvector centrality, which assigns scores to all nodes in the network indicating the importance of each node in the graph. Scores are assigned based on the principle that connections to high-scoring nodes contribute more to a node's score than equal connections to low-scoring nodes.
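As a concrete example of the simplest of these measures, degree centrality can be computed with plain Cypher; no particular labels are assumed in this sketch.

// Count the relationships attached to each node; the highest counts are the
// "hubs" of the graph.
MATCH (n)-[r]-()
RETURN n, count(r) AS degree
ORDER BY degree DESC
LIMIT 10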

LinkedIn’s Graph Database

LinkedIn has a neat feature which shows you a visualization of your LinkedIn network.

See http://inmaps.linkedinlabs.com/. This is an example of the power of a graph database.


FIGURE 12.4 Central portion of the author’s LinkedIn “InMap.”

Cypher, a Graph Database Language

To demonstrate some of the characteristics of a language used with a graph database, I’ll look at Cypher, the language for Neo4J—one of the leading graph databases.

Cypher is a pattern-matching query language. It uses declarative grammar with clauses, which is like SQL. It has aggregation, ordering, and limits and you can create, read, update, or delete. Most queries follow a pattern in which an index is used simply to find a starting node (or nodes), and the remainder of the traversal uses a combination of pointer chasing and pattern matching to search the data store.

Here is an example of Cypher commands to get a node, traverse a node and return all the friends of friends.

//get node with id 0
start a=node(0) return a

//traverse from node 1
start a=node(1) match (a)-->(b) return b

//return friends of friends of node 1
start a=node(1) match (a)--()--(c) return c

Here is an example in Cypher for pulling a list of “friend of friends”:

start n=node:People(name='William McKnight')
match (n)--()--(foaf) return foaf

Other examples of graph database servers are InfiniteGraph, AllegroGraph RDFStore, STIG from Tagged, and VertexDB.

Graph Database Models

Neo4j is considered a property graph, which is a "directed, labeled, attributed multigraph" that exposes three building blocks: nodes, typed relationships, and key-value properties on both nodes and relationships. Another graph model is RDF triples, which uses URI-centered subject-predicate-object triples as pioneered by the semantic web movement. Finally, there is the hypergraph, a generalized graph in which a relationship can connect an arbitrary number of nodes (compared to the more common binary graph models).

Hypergraph is a generalized graph model in which a relationship (called a hyper-edge) can connect any number of nodes. More specifically, while the property graph model permits a relationship to have a single start and end node, the hypergraph model allows any number of nodes at a relationship’s start and end.


FIGURE 12.5 A simple (directed) hypergraph.

Property graphs are the most pragmatic, giving properties to nodes to make them easier to work with.

Since I am choosing only timeless material for this book and these distinctions could change, it doesn't make sense to go too deeply into the differences among graph types, especially when I see the "best of" each becoming part of all graph databases in the future. Graph databases are, however, very stable in their architecture, and "graph databases" is sufficient as a category for purposes of allocating workload.

Those relationship-oriented workloads—described above plus their offshoots—find multiple benefits of being on a graph database as opposed to a relational database. These benefits include:

1. Query execution speed

2. Reduced application development cycles for developing the queries

3. Extensibility, scalability, and failover

As with other data stores described in this book, if you are after the power of performance, it behooves you to place, or move, workloads into the data stores that, primarily, deliver that performance. Performance is a powerful motivator for using the graph database for the relationship workload.

Action Plan

• Analyze workloads to see if what is really important in them is relationships.

• Analyze workloads to see if performance issues are predominantly related to navigating relationships.

• Consider the workloads discussed in this chapter, and ones that are similar, for graph databases.

www.mcknightcg.com/bookch12


1The Oprah effect is an expression referring to the effect that an appearance on The Oprah Winfrey Show, or an endorsement by Oprah Winfrey, can have on a business. Because the show reaches millions of viewers each week, a recommendation from Oprah can have a significant and often unexpected influence for a new or struggling business. Source: Investopedia