Indexing the data - Introduction to Neo4j - Neo4j in Action (2015)

Neo4j in Action (2015)

Part 1. Introduction to Neo4j

Chapter 5. Indexing the data

This chapter covers

· Creating and maintaining manual indexes

· Using schema indexing and auto-indexing

· Explore trade-offs when creating an indexing strategy

In the previous chapter you saw how easy it is to move between nodes in the Neo4j graph by traversing relationships. Moving between nodes allows you to quickly and easily find connected nodes, such as a person’s friends and the movies they like. Neo4j is optimized to make graph traversal fast, but reducing the number of nodes that needs to be traversed by knowing where to start is important, and it becomes increasingly so as the size of the data set increases.

To determine where to start in the graph, Neo4j uses indexing. An index in a relational database provides the ability to quickly and easily find rows in a table by the values of particular columns. Similarly, Neo4j indexing makes it easy to find nodes or relationships with particular property values. Unlike a relational database, Neo4j requires your application code to create and maintain index entries.

Because the application code takes on the responsibility for indexing, you need to give careful thought to your indexing strategy. Poor decisions about indexing can lead to poor performance or excessive disk use. In this chapter we’ll show you how to create, maintain, and use indexes with Neo4j. We’ll then explore why you should index and discuss the inevitable trade-offs you’ll need to make when creating an indexing strategy.

Before we start talking about the trade-offs, let’s look at how to create index entries.

5.1. Creating the index entry

The most commonly used option when working with an index is explicitly creating the index, then adding entries as nodes are created. Each index entry typically identifies a node or relationship property value. The index entry contains references to one or more nodes having a particular value for the property you’re indexing on.

Figure 5.1 shows that a node index such as this can be thought of as a value entry that relates to one or more pointers to nodes. In this case, we’re expecting email addresses to be unique, so we expect one primary email address for each user.

Figure 5.1. Index pointing to user nodes as values, using the email property as a key

In Neo4j, the IndexManager provides access to the index using a simple string as the index key. In our social networking application, a common starting point will be a user. To uniquely identify users, you’ll have them use their email addresses to log in. The code in listing 5.1 shows the creation of a new user and the creation of an index entry. This means you can quickly find the node representing the user via the email address. All available index operations are defined on the Index<Node> interface, an instance of which is accessed from Neo4j’s index manager component:

IndexManager indexManager = graphDB.index();

Index<Node> userIndex = indexManager.forNodes("users");

Note that you simply ask the Neo4j index manager component for the index with the required name. From the perspective of the code, it’s irrelevant whether the index exists; if it doesn’t exist, the index will be created when you ask for it.

Listing 5.1. Creating an index entry for a node using Neo4j API

First up, you create nodes using the Neo4j Core API and add properties (, ). Next, you create an index and give it the name users (, )—this name will become the unique identifier for this newly created index.

Finally, you add the node to the index . To add a node to the index, you need to provide three parameters:

· The node you want to index (personOne)

· The index key ("email")

· The indexed value (jsmith@example.org)

The index key and value are then stored by Lucene (Neo4j’s default indexing implementation), along with the reference to the person node, as illustrated in figure 5.1.

If you’d like to configure the Lucene index, Neo4j allows to you pass in various configuration options when the index is created. This can be done using the following method:

IndexManager.forNodes( String indexName, Map<String, String>

customConfiguration );

Map customConfiguration can contain any valid Lucene setting. For a list of Lucene configuration settings, see the Neo4j/Lucene documentation at http://docs.neo4j.org/chunked/stable/indexing-create-advanced.html.

Of course, the real value of indexing a node is being able to find the right node quickly and easily. That’s what we’re going to look at in the next section, where you’ll use the user email index to find user nodes.

5.2. Finding the user by their email

Having a network of users is great, but now it’s time to start letting users interact through a full-blown web application that allows them to connect to friends as well as store movie information.

In this web application, one of the first things you want to do is display a list of the users’ friends. Finding the friends of a user will involve traversing the friend relationships. But first you need to look up the node that represents the user that has logged in.

Since your users use their email addresses as login IDs and you index user nodes by email, it’s easy to look up a user that has logged in. You can then traverse outbound friend relationships to find their friends, as figure 5.2 illustrates.

Figure 5.2. Looking up a user node from the index by using the email property

To find a user by email, you need to obtain a reference to the node index identified by the index’s unique name; in this example you have an index called users. You can then get the index hits for a particular key and value. The next listing illustrates looking up the user node by emailvalue.

Listing 5.2. Finding a single user by index lookup using the email property

Once you’ve found the logged-in user’s node, it’s easy to find their friends using a quick and simple traversal with the Core Java API:

In this section, we’ve demonstrated the typical pattern for working with indexes in Neo4j—find a node by index lookup, then perform the traversal as usual. The code in listing 5.2 assumes that you’ll find at most one match for the user’s email address. Let’s now see how you can deal with multiple hits.

5.3. Dealing with more than one match

In the previous example, you indexed the email property for each user, and, assuming that email addresses are unique, you expected a single result from the index lookup operation.

That won’t always be the case. Let’s add another indexed property to the user nodes—age. Naturally, multiple user nodes can have the same value for the age property. The key in such an index can reference multiple user nodes as values, as figure 5.3 illustrates.

Figure 5.3. User nodes indexed by the age property, with each key potentially referencing multiple nodes

Earlier, we assumed that you’d always find at most one match for the provided user email address. If that’s not going to be the case, you’ll need to iterate over the index hits returned, as shown in the following listing, which retrieves all users who are 34.

Listing 5.3. Iterating through multiple results of an index lookup operation

Note

IndexHits is one-time iterable. Once spent, you can’t use it again. This is a constraint of Lucene. For more information, consult the Lucene documentation at http://lucene.apache.org.

There may be some cases where you want to find all the 34-year-old users; it’s more likely that you’ll want to find users in a range. Lucene supports range queries, and many other types of queries out of the box. Lucene details are beyond the scope of this book, but you can find all the details in the Lucene documentation (specifically at http://mng.bz/Lq6r).

Note

IndexHits should be closed after use. If you iterate through all the hits (as in the previous example), IndexHits will close automatically. If you don’t fully iterate through all IndexHits values, you’ll have to make sure you close it manually by calling IndexHits.close(). This is a constraint of Lucene; for more information, consult the Lucene documentation at http://lucene.apache.org.

Having seen how to create and query index entries, your next concern is how to deal with a change to the graph that requires a change to the index.

5.4. Dealing with changes to indexed data

The creation of index entries is rarely a case of create and forget. Even in this chapter’s simple example, you’d need to deal with users changing their email addresses or wanting to delete their accounts.

Because the indexing strategies employed are entirely at the discretion of the application developer, Neo4j can’t automatically update manually created indexes to reflect changes made to the graph itself. We’ll see examples later on of how you can achieve this kind of flexibility via other mechanisms, but for now we’ll concentrate on how you can deal with changes involving manually created indexes.

What happens when a user wants to update their email address? We already mentioned that all available index operations are defined on the Index<Node> interface:

Index<Node> userIndex = indexManager.forNodes("users");

The first thing you’ll probably notice when looking at the Index<Node> interface is that there are no update methods, and no way of modifying an existing index entry once you have looked it up. The solution is quite simple: “remove then add” equals “update” when dealing with Neo4j indexes. The following listing shows how this can be done using the Neo4j Core Java API.

Listing 5.4. Updating the index using sequential remove and add operations

Indexing relationships

You’ve seen how to create index entries that allow you to quickly look up nodes in the graph, but it’s also possible to index the relationships between nodes. This is much like indexing nodes, but you use relationship properties as index keys.

In our experience, it’s less common for relationship indexes to be good solutions. We’re not saying that relationship indexing is a poor practice, but if you find yourself adding lots of relationship indexes, it’s worth asking why.

Usually a graph will represent entities of some form as nodes, such as people, movies, and cities. The relationships then represent the relationships between those entities. The questions you then usually ask are about who these entities are connected to. Who likes the movie, who is friends with a specific person, or what transport routes leave from a city.

It’s not impossible to think of questions that might be answered by a relationship index, however. If you chose to represent the star ratings in the movie database with different relationships for different ratings, indexing the relationship would allow you to quickly find all the five-star reviews.

You can read more about relationship indexing in the Neo4j Manual: http://docs.neo4j.org/chunked/stable/indexing-relationships.html.

Later in this chapter we’ll take a more detailed look at the trade-offs associated with indexing, but first we’ll complete our tour of the indexing options by taking a look at auto-indexing.

5.5. Automatic indexing

So far in this chapter, we’ve focused on manually creating and maintaining index entries. If you’re coming from an RDBMS world, you may wonder why the database can’t do this work for you; after all, in the RDBMS world, you’d expect to declare which columns on a table you wanted indexed and to have the RDBMS then maintain the contents of that index as rows were inserted, updated, and deleted.

There are two ways that you can have Neo4j maintain indexes automatically—schema indexing and auto-indexing. Let’s take a look at schema indexing first.

5.5.1. Schema indexing

In version 2.0, Neo4j introduced the concept of schema indexing. Conceptually, this is similar to the traditional index-handling approaches used by relational databases.

Schema indexing works closely with the concept of node labels, which we introduced in chapter 3. Each schema index is specific to a label and a set of properties. For example, you can define an index for users on their name property, or an index for movies on the title and productionyear properties. All you have to do is to define the index, and thereafter, Neo4j will take responsibility for maintaining it. This means that when you create a node with a label and a property that matches one or more indexes, all indexes will be updated for that value. When you delete a node, all entries for that node will be removed from all relevant indexes automatically. The same applies when you update node properties.

Let’s take a look at a real example of this in the following listing.

Listing 5.5. Using schema indexes with Java API

This example uses two labels: one for movies, called MOVIE, and one for users, called USER. These are defined upfront .

Now you need to define the schema indexes you’re going to need. This is very similar to defining indexes in relational databases. The Java API for index creation is available under the GraphDatabaseService.schema() method. First, you create an index for the MOVIE label on thename property , then you do the same for the USER label . Note that this code is wrapped in its own transaction, separate from the remainder of the example. That’s because this code would typically only be done once upfront in the application. More importantly, you’ll get an exception if you try to do this all in one transaction—something along the lines of “org.neo4j.graphdb.Constraint-ViolationException: Cannot perform data updates in a transaction that has performed schema updates.”

Having defined the indexes, you can now continue to create some nodes. First you create a node for the movie “Michael Collins” with the MOVIE label . Because the movie’s label and property name match the index you’ve already defined, this node will automatically be indexed and be searchable by its name. Next, just for fun, you create a user whose name is “Michael Collins”, the same as the movie name , then commit the transaction.

Provided all has gone according to plan, you can realistically now expect that your nodes have been indexed. You can verify that by searching for the “Michael Collins” movie. To search the schema index, you use the GraphDatabaseService.findNodesByLabelAndProperty(...)method . As its name suggests, this method will perform the search across all nodes for the label and property values supplied. In this case, you want to search across the MOVIE label with a name property matching the value “Michael Collins”. The returned result is a Java Iterablecontaining the matching nodes. You’d expect exactly one result from this search. You have two nodes with the same name, “Michael Collins”, but only one has the MOVIE label, and this is the one you expect to get. The assertion proves that the result is correct .

Every node can have one or more labels attached, each with an index defined. If the node has multiple labels, Neo4j will ensure that all relevant indexes are updated as required. Here is an illustration.

Listing 5.6. Updating multiple schema indexes

In this example, you define indexes for two labels, one for regular users, USER, and one for admins, ADMIN . Then you create a single node that represents both a regular user and an admin, using two labels . You’d expect that this user would be searchable across normal users as well as admins, and the next two steps confirm that this is the case (, ).

So far you’ve seen how the index gets populated on node creation. We’ve also mentioned that schema indexes get updated when a node is deleted. The following code snippet shows this:

The schema indexing approach groups the indexed data by node label. This means that each index contains nodes with one label. If you need to automatically index all nodes on a given property (regardless of the node labels), you can instead use the auto-indexing feature.

5.5.2. Auto-indexing

To use auto-indexing, you need to tell Neo4j that you want to turn auto-indexing on for nodes, relationships, or both. But simply switching on auto-indexing doesn’t cause anything to happen. In larger data sets, indexing everything may not be practical; you’re potentially increasing storage requirements by a factor of two or more if every value is stored both in Neo4j storage and the index. There will also be performance overhead for every mutating operation because of the extra work of maintaining the index. As a result, Neo4j takes a more selective approach to indexing. Even with auto-indexing turned on, Neo4j will only maintain indexes of node or relationship properties it’s told to index.

How you configure auto-indexing depends on whether you’re running Neo4j in embedded mode or server mode (chapter 10 covers these two modes).

Configuring auto-indexing in standalone mode

To turn on auto-indexing for a standalone server, you’ll need to modify its configuration file with additional properties. The configuration file can be found at $NEO4J_SERVER/conf/neo4j.properties, and you’ll need to add the following two lines:

node_auto_indexing=true

relationship_auto_indexing=true

Specifying what you want to index requires the addition of the following two neo4j.properties:

node_keys_indexable=name, dateOfBirth

relationship_keys_indexable=type,name

Configuring auto-indexing in embedded mode

To turn on auto-indexing for Neo4j in embedded mode, you’ll need to pass in additional values when you create the graph database instance. The following values need to be included as part of the java.util.Map argument containing configuration:

Map<String, String> config = new HashMap<String, String>();

config.put( Config.NODE_AUTO_INDEXING, "true" );

config.put( Config.RELATIONSHIP_AUTO_INDEXING, "true" );

EmbeddedGraphDatabase graphDb =

new EmbeddedGraphDatabase("/var/neo4j/data", config );

Using the config map programmatically requires the addition of two properties, which contain a list of key names to index:

config.put( Config.NODE_KEYS_INDEXABLE, "name, dateOfBirth" );

config.put( Config.RELATIONSHIP_KEYS_INDEXABLE, "type, name" );

Using an automatically created index

Once you’ve added the appropriate configuration properties, the graph should be searchable through the auto-index. For example, given the preceding configuration, all relationships in the graph that have properties type or name, and all nodes in the graph that have properties name anddateOfBirth, should be searchable via the auto-index.

You’ve probably noticed that there’s no way to specify an index name in the configuration, so you can’t specify that you want to index node names in one index and date of birth in another. This is somewhat different from manual indexing, where an index name is always specified. This is because Neo4j has exactly one index for auto-indexed relationships and one index for auto-indexed nodes.

The following code shows how, having turned on auto-indexing of node properties with a key of name, you can access the index and use it to find nodes with a particular value for the name property:

Using the schema and auto-indexing facilities offered by Neo4j means less work for the application developer but also less control over what gets indexed. A naïve strategy of indexing all properties would have significant implications, for performance of graph mutation and also for disk space requirements. The following section will explore these trade-offs in greater detail.

5.6. The cost/benefit trade-off of indexing

Having seen the different approaches to indexing offered by Neo4j, we’ll turn our attention to the trade-offs you need to consider when determining which indexing strategy is right for your application. You’re probably starting to realize that Neo4j is extremely flexible when it comes to both indexing and the ways you can model your data.

For example, you could have modeled the social network data in a completely different fashion, which would have an impact on the indexing strategy. Figure 5.4 shows one alternative representation of the data, with intermediary nodes that represent Kate’s movies and friends.

Figure 5.4. Graph of a social network using intermediate nodes to differentiate between relationships to user and film nodes.

Is this a better way of representing the data for the social networking application? Unfortunately, the answer is that it depends on your application. Determining the best data representation and indexing strategy depends heavily on both the data that you want to store and the way that you want to interact with that data. Later in the book we’ll discuss the trade-offs between different graph representations, but for the moment we’ll focus on indexing.

The main trade-off when it comes to indexing is that the more indexing you have, the more disk space you’ll need, the greater the performance hit you’ll take on inserts and updates to the graph, and the more code you’ll usually need to manage the creation and updating of indexes. The advantage is enhanced performance if your indexing strategy can eliminate large portions of the graph from consideration when querying.

5.6.1. Performance benefit of indexing when querying

The graph in figure 5.5 shows how marked the difference is between a simple search of all user nodes in the graph and the use of an index. This graph shows the average time to look up a random user node.

Figure 5.5. Performance of node lookup using index compared to iterating through all nodes

The solid line shows that the average time taken to find a node, if you simply search all nodes, is directly proportional to the number of nodes in the graph. This is not too surprising; the more nodes you have, the more you’ll stumble blindly around the graph before you happen upon the user node you’re looking for.

In marked contrast, the dotted line shows the time taken to find the user node using an email index. While this number varies a little, the variation is small, with an average of 0.1 to 0.3 milliseconds per search. When the test tops out at 1 million users, the search of all nodes is on average one thousand times slower than finding the user with an index.

Notes on the lookup benchmark

Note that it’s considered bad practice to store the Neo4j ID externally for lookups because it’s possible that this ID may change in some circumstances. The node ID should be treated as an internal implementation detail of Neo4j and shouldn’t be stored in order to enable fast access.

The other way you can get at the nodes in the graph is via the getAllNodes method, which returns an Iterable that allows you to iterate through all the nodes in the graph. This can be a viable strategy for smaller datasets, but there’s a performance overhead to loading large parts of the graph each time you need to find a particular node.

The test uses getAllNodes to access all the user nodes. In a real application, this strategy would be even less performant than the test indicates, because it would encounter a significant number of nodes that were not of type USER.

As we keep saying, decisions around indexing are all about trade-offs. In the next section you’ll see that the performance gains for lookups can easily be negated by the overhead of the index.

5.6.2. Performance overhead of indexing when updating and inserting

Having achieved close to constant times for finding a user by using an index, you might wonder what sort of performance overhead you’re incurring for the extra index maintenance work you’re doing.

The graph in figure 5.6 shows the average time taken to create the user node as the number of users in the dataset increases. The time doesn’t grow much in relation to the number of users; however, the time taken to create a node and index for the email address takes roughly twice the time as creating the node alone.

Figure 5.6. Average time for storing user node with and without indexing

An extra hundredth of a millisecond probably won’t make much difference in this case, even with a million users, because it’s likely that you’ll create this user node once when the user registers and then update it infrequently. But this won’t be the case in all applications. Where data is updated or created more frequently than it’s read, thought should be given to the performance trade-off between fast updates and fast finds. It’s also likely that you’ll need to access your graph data in different ways, so you may have many indexes such as ZIP code, age, and sex, and each of these will further increase the performance overhead for inserts and updates.

In addition to the performance penalty of indexes, each index requires additional storage, and we’ll look at that very briefly next.

5.6.3. Storing the index

The process of indexing essentially involves creating small lookup tables (small compared to the data set), which allow fast access to the right place in the graph. This means that in addition to the performance overhead of writing out the extra index data, you’ll need more disk space to store all of this data.

5.7. Summary

In this chapter, you’ve seen that Neo4j provides flexible mechanisms that allow Neo4j applications to quickly focus in on the parts of the graph that are of interest. But with that flexibility comes the need to consider trade-offs when deciding what to index and what not to index.

This concludes part 1 of the book, in which we described all the basic concepts and techniques for dealing with graph data in the Neo4j database. In part 2, we’re going to take an in-depth look at some key concepts in Neo4j. First we’re going to learn about Cypher—Neo4j’s human graph query language!