
Part 3. Neo4j in Production

In the first two parts of the book, we covered the basics of Neo4j as well as how to use Neo4j from an application development perspective.

This final part shifts focus and looks at more operational areas, which need equally careful consideration. There are only two chapters in this part, but they cover quite a bit of ground, so strap yourself in!

Chapter 10 explores the two main usage modes in Neo4j—embedded and server—looking at the pros and cons of using each. With much of the book having focused on the use of the embedded option, this chapter provides guidance on and examples of how to take advantage of specific server features such as plugins and extensions to get the most out of Neo4j when running in this mode. Chapter 11 finishes off by taking you on a tour of the high-level Neo4j architecture and showing how to scale and configure Neo4j to be highly available, as well as how to back up and restore your Neo4j database.

Chapter 10. Neo4j: embedded versus server mode

This chapter covers

· The two main usage modes: embedded and server

· How to weigh the pros and cons of each mode

· Getting the most out of your server with Cypher, plugins, extensions, and streaming

Now that you have a good understanding of the approaches and practical techniques required to design and model your world in the Neo4j graph database, it’s time to look at the two main ways you can run Neo4j, namely in embedded or server mode. Before embarking on any serious Neo4j project, one of the first things you’ll need to do is make a decision about which mode you ultimately want to run in production. This choice will influence, among other things, what languages and architectural landscapes your application can operate within, so it’s an important consideration.

You’ll be pleased to know that, regardless of which mode you choose, everything you’ve learned so far is still applicable. The semantics around using each mode are quite different, and it’s important to understand what these are and how to use them appropriately, but the core principles remain the same.

In this chapter, we’ll cover why there are two modes, what the main differences are, and what the trade-offs, pros, cons, and implications are of using each mode. Let’s get into learning mode and get going!

10.1. Usage modes overview

When Neo4j was first released, it was aimed squarely at the Java-based world, and back then it only supported the embedded mode. Within the embedded mode setup, your Java application and new shiny Neo4j database were happily bundled together as a single deployable entity, and together they went forth to conquer the brave new world of interesting graph-based problems.

The broader capabilities and functionality of Neo4j, however, did not go unnoticed by other languages, which were also interested in being able to leverage and make use of this new graph database. Neo4j, although written in Java, is inherently just a JVM-based product. This means that, theoretically, any JVM-based language (provided the appropriate libraries or bindings can be found or written) can also make use of the Neo4j database. Thus, Neo4j’s reach naturally began to extend to other JVM-based languages as various libraries and bindings began to evolve and become available. But it was the need to operate in more network-friendly architectures and to support other non-JVM clients that was the primary driver behind the introduction of the server mode. With server mode, the Neo4j database runs in its own process, with clients talking to it via its dedicated HTTP-based REST API.

Figure 10.1 shows an overview of the two usage modes and the main ways in which Neo4j can be used by different clients through these modes.

Figure 10.1. Overview of Neo4j usage modes and the main integration options for clients

In embedded mode, Neo4j can be used by any client code that’s capable of running within a JVM. As figure 10.1 illustrates, you can use the embedded mode directly with pure Java clients, which make direct use of the core Neo4j libraries, or indirectly, through additional language-specific bindings and frameworks provided by various communities for other JVM-based languages.

In server mode, client code interacts with the Neo4j server via the HTTP protocol, specifically via a well-defined REST API, with additional options being available to extend this REST functionality when required. The API can be used directly by any HTTP-enabled client, or, to make development life a little easier, by using one of the remote REST client APIs available for a variety of different languages and frameworks. With the inherent network latency introduced in the server mode, performance is naturally not going to be as good as accessing the database using native code directly. To add more flexibility to the server offering, server plugins and unmanaged extensions can also be used to bolster performance and functionality. We’ll be covering server plugins and extensions later in this chapter.

Note

All of the coding examples you’ve seen in the book so far have used an embedded mode setup. Embedded mode is a great way to experiment with Neo4j, and it’s often used to explore what Neo4j is capable of. Even if you decide to use the server mode, there will still be opportunities to make use of native APIs on the server itself, and sometimes it may even be necessary. Additionally, with the REST API often just providing a façade over the raw embedded API, knowing and understanding exactly how the embedded APIs work can be very useful. Understanding embedded semantics and options will go a long way toward helping you get the most out of your setup, even if you opt for a server-based setup.

10.2. Embedded mode

All the examples we’ve looked at so far have used the embedded mode, so it makes sense for us to explore this approach in more depth first. To clarify, embedded mode does not refer to the embedding of the actual physical database on disk with your application, but rather to the embedding of the Neo4j engine (the classes and associated processes) that run and manage the Neo4j database directly.

10.2.1. Core Java integration

The most common embedding scenario involves embedding Neo4j directly within a Java-based application, as shown in figure 10.2.

Figure 10.2. Typical Java-embedded deployment scenario, where the Neo4j libraries are embedded in the Java application

The core Neo4j classes that are packaged and run within your application do more than simply act as a mechanism for funneling data backward and forward between the physical data store and your application. The classes themselves form an integral part of the whole database offering, handling, for example, all the logic and in-memory requirements necessary to perform traversals, queries, and so on. From an application architecture perspective, this is interesting because it means that the logic controlling your application, as well as that controlling the database, needs to be able to live in harmony within the same JVM space. The implications of this cohabitation are covered in section 10.4, but for now it’s enough to know that the embedded mode means that both your application and Neo4j code will be residing and operating within the same JVM.

Embedded mode requires that the appropriate libraries (JAR files) are bundled or made available to your application when it starts up. It’s then your application’s responsibility to gain access to the Neo4j database by instantiating an appropriate instance of the GraphDatabaseService interface. Your application can then use this reference to interact with Neo4j, using all of the APIs we’ve discussed so far. The EmbeddedGraphDatabase class is typically used for a single-machine setup; the HighlyAvailableGraphDatabase class is used for a multimachine setup. High Availability (HA) is covered in chapter 11.

Required libraries

If you’re using Maven within your project (http://maven.apache.org), the following listing shows all that’s required to import the appropriate embedded Neo4j libraries into your application.

Listing 10.1. Embedded Neo4j dependencies
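
A minimal sketch of such a dependency block, assuming the same 2.0.1 version used elsewhere in this chapter, might look like the following (the test-jar entry for neo4j-kernel is an assumption; it is what brings in the testing classes discussed shortly):

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-kernel</artifactId>
    <version>2.0.1</version>
    <type>test-jar</type>
    <scope>test</scope>
</dependency>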

Maven will ensure that all the required dependencies will also be downloaded for you as part of its transitive dependency management system. In other words, Maven will work out what other supporting libraries also need to be downloaded in addition to the main Neo4j libraries and ensure that these are retrieved for you.

If you don’t use Maven, you’ll need to download the appropriate zip/tarball from the Neo4j website and extract the necessary libraries found in the lib directory. The next listing shows the result of a Maven dependency:tree execution, showing what additional libraries Maven downloads when you request the core embedded libraries.

Listing 10.2. Dependency tree of core Neo4j embedded library
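
An abbreviated, purely illustrative sketch of such a tree follows; the exact artifacts and versions depend on the Neo4j version you declare, and the project coordinates on the first line are made up:

[INFO] com.example:embedded-neo4j-example:jar:1.0
[INFO] \- org.neo4j:neo4j:jar:2.0.1:compile
[INFO]    +- org.neo4j:neo4j-kernel:jar:2.0.1:compile
[INFO]    +- org.neo4j:neo4j-lucene-index:jar:2.0.1:compile
[INFO]    |  \- org.apache.lucene:lucene-core:jar:3.6.2:compile
[INFO]    +- org.neo4j:neo4j-cypher:jar:2.0.1:compile
[INFO]    \- org.neo4j:neo4j-graph-algo:jar:2.0.1:compile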

Gaining access to an embedded Neo4j database

Assuming you have all of the appropriate dependencies available to your code, the next listing details how to obtain a reference to an embedded database, and it also, being a responsible piece of code, provides a mechanism to ensure the database is shut down properly when the JVM exits.

Listing 10.3. Starting and stopping an embedded graph database
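
A minimal sketch of such code, using the GraphDatabaseFactory available in Neo4j 2.x (the store directory path is an assumption), might look like this:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class StartStopExample {
    public static void main(String[] args) {
        // Create (or open) the graph database held in the given directory
        final GraphDatabaseService graphDb =
                new GraphDatabaseFactory().newEmbeddedDatabase("data/social-network.db");

        // Register a JVM shutdown hook so the database is always shut down cleanly,
        // even if the process is interrupted (for example, with Ctrl-C)
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                graphDb.shutdown();
            }
        });

        // ... use graphDb here ...
    }
}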

It’s important to try to ensure Neo4j shuts down cleanly whenever your application exits. The shutdown code in listing 10.3 will be triggered even if you send a SIGINT (Ctrl-C) signal to try to terminate the process before it’s finished.

Failure to shut down cleanly could result in problems the next time you start up. Neo4j is able to detect unclean shutdowns and attempt a recovery, but this will result in a slower initial startup while the recovery is in progress, and it may sometimes cause other issues as well. When an unclean shutdown occurs, the following warning message can generally be seen in the logs the next time the database is started up again: non clean shutdown detected.

Testing in embedded mode

Testing forms an important part of any software development project. This book and most of the code detailed within are backed by unit tests proving and illustrating the scenarios and statements made in various chapters. In many cases, these tests make use of a very handy in-memory database implementation (org.neo4j.test.ImpermanentGraphDatabase) that has specifically been created with unit testing in mind.

The ImpermanentGraphDatabase class can be found in the test neo4j-kernel library (specified in listing 10.1). Using Neo4j with an embedded data store (see listing 10.3) results in the physical nodes and relationships being stored on disk. Using Neo4j with an impermanent data store, however, results in the data only being stored in memory rather than on the filesystem. You’re strongly encouraged to unit test your Neo4j code, and the ImpermanentGraphDatabase implementation provides a fantastic way to test or prove certain graphing scenarios within your problem domain.
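
As a rough sketch, assuming JUnit and the test-jar dependency from listing 10.1, a test might obtain an in-memory database like this:

import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.test.TestGraphDatabaseFactory;

import static org.junit.Assert.assertEquals;

public class ImpermanentDatabaseTest {

    private GraphDatabaseService graphDb;

    @Before
    public void setUp() {
        // Nothing is written to disk; the whole graph lives in memory
        graphDb = new TestGraphDatabaseFactory().newImpermanentDatabase();
    }

    @After
    public void tearDown() {
        graphDb.shutdown();
    }

    @Test
    public void createsANode() {
        try (Transaction tx = graphDb.beginTx()) {
            Node user = graphDb.createNode();
            user.setProperty("name", "Adam");
            assertEquals("Adam", user.getProperty("name"));
            tx.success();
        }
    }
}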

10.2.2. Other JVM-based integration

The Neo4j community is a diverse and active bunch, and it has already seen the creation of numerous language and framework bindings for using Neo4j in embedded mode with other languages such as Scala, JRuby, and others. This is typically accomplished through language-specific wrappers that adapt the Neo4j Core Java API into an API that can be used natively (and in some cases idiomatically) by the JVM language and framework in question. Figure 10.3 illustrates what these deployment scenarios typically involve.

Figure 10.3. Other JVM-based embedded deployment approaches, involving language-specific wrappers and drivers for Neo4j

What’s important here is that most of these wrappers merely front the Java-based Neo4j classes and API. This means that all the properties and semantics associated with the Java-based embedded version will typically also apply to a wrapper-based implementation, with any additional peculiarities introduced by the wrapper library itself also needing to be taken into account.

A listing of the latest language wrappers available can be found at http://neo4j.org/drivers/. At the time of writing, this list included the likes of JRuby, Django, JavaScript, Scala, and Clojure.

This concludes our initial foray into how the embedded mode works and what’s required to start making use of it. Next up is a similar exploration of the nature and workings of the server mode, before we move on to looking at the trade-offs, pros, and cons of both.

10.3. Server mode

Unlike embedded mode, running Neo4j in server mode involves having all the classes and logic to access and process interactions with the Neo4j database contained within its own dedicated process, completely separate from any clients wishing to use it. As with many other server-based setups, clients need some mechanism for interacting with the server process, and in the case of Neo4j, this is achieved by using the well-defined, yet extensible, HTTP-based REST API.

What is REST?

REST is sometimes seen as a bit of an overloaded term, but officially it stands for representational state transfer. In a nutshell, it can be thought of as an architectural style that embraces and takes advantage of the way in which the web operates and is structured, in most cases using HTTP as the vehicle of choice to help accomplish this. For the full theoretical definition and explanation of REST, refer to Roy Fielding’s doctoral dissertation, “Architectural Styles and the Design of Network-based Software Architectures” (2000), where this concept was originally proposed and first published (www.ics.uci.edu/~fielding/pubs/dissertation/top.htm).

By and large, web pages and the ways in which people interact with them follow a consistent pattern, both from the point of view of the user navigating them as well as the servers and supporting infrastructure behind them. As a user, you request pages (resources) by providing URLs. These pages are then returned with the data you requested, as well as additional links to other data or functionality associated with the resource. The server understands HTTP GET requests for retrieving data, and POST and PUT requests for modifying or creating new resources. Simply put, Fielding proposed that these same principles could be used for interacting with more general resources, such as those provided by application services (here, read web services). These resources could thus also be offered, navigated, and interacted with in a consistent and predictable manner by humans or systems if they were designed to take advantage of some of these principles that helped the web grow into what it is today.

RESTful web services (as opposed to SOAP-based ones, for example) are probably the de facto web service implementation nowadays, and Neo4j can be counted among them. Neo4j’s REST implementation makes use of JSON as the default data format and offers a service-oriented way of interacting with and manipulating the Neo4j resources (nodes and relationships).

For more information and a pragmatic treatment of the subject, see the excellent book REST in Practice by Jim Webber, Savas Parastatidis, and Ian Robinson (O’Reilly, 2010). Jim Webber played a very large part in the design and development of the Neo4j REST API as well.

10.3.1. Neo4j server overview

Figure 10.4 depicts the core components involved in running Neo4j in server mode, along with a client accessing the server using the standard, out-of-the-box REST API.

Figure 10.4. A typical Neo4j server setup with client access via the standard REST API

The Neo4j server is itself simply a JVM-based application. Under the covers, it provides its functionality by wrapping an appropriate instance of the GraphDatabaseService interface (EmbeddedGraphDatabase or HighlyAvailableGraphDatabase), exposing the functionality to the outside world through a well-defined REST interface.

To be able to listen to and react to REST requests made to the server by clients, all Neo4j server instances will start up an embedded web server (currently a Jetty server listening, by default, on port 7474).

Installing and Using Neo4j Server

To install the Neo4j server, you’ll need to download the appropriate tar or zip file for your particular OS, uncompress it, and then make use of the OS-specific scripts to start and stop the server. Appendix A details the exact steps required to do this.

From the client perspective, you don’t necessarily need any specific Neo4j libraries. As long as you can make HTTP requests, you’re good to go. If you’re in a Unix environment, the curl client will do this for you. However, the raw REST API is very verbose, and more often than not you’ll want some client libraries that can make your life a little easier by taking care of some of the more mundane low-level REST mappings. Using one of the appropriate remote API libraries (listed in section 10.3.4) could help immensely in this regard.

Curl and Java Client Examples

This chapter provides examples using two different clients:

· The curl command available in most Unix distributions. This client will demonstrate how to make calls directly to the low-level REST API.

· A Java client using the official Neo4j Java REST binding (neo4j-rest-graphdb) for the server REST API. This option shows how a remote client library can be used to simplify coding against the Neo4j REST API.

Again, if you’re using Maven, the following snippet can be added to your pom.xml file to pull the appropriate libraries in:

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-rest-graphdb</artifactId>
    <version>2.0.1</version>
</dependency>

10.3.2. Using the fine-grained Neo4j server REST API

The original fine-grained Neo4j REST API was designed from the ground up to be hypermedia driven and to be able to be “discovered” as you make use of its various aspects. Each request results in only the minimal amount of information being returned, with embedded links providing the mechanism for obtaining more information. The starting URL from which all parts of the REST API can be explored or derived is http://<domain-name>:<port>/db/data. This is sometimes referred to as the service root. Thus a Neo4j server running on the local host, on the default port 7474, would have a service root entry point of http://localhost:7474/db/data.

The following listing shows an example of the data used in a request and the associated response for the service root, highlighting some aspects to give you a feel for the discoverable nature of the API.

Listing 10.4. HTTP service root request and response
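
An abbreviated, illustrative sketch of the exchange for a 2.0-series server follows; the exact set of keys and URLs varies between versions:

GET http://localhost:7474/db/data/

{
  "extensions" : { },
  "node" : "http://localhost:7474/db/data/node",
  "node_index" : "http://localhost:7474/db/data/index/node",
  "relationship_index" : "http://localhost:7474/db/data/index/relationship",
  "extensions_info" : "http://localhost:7474/db/data/ext",
  "relationship_types" : "http://localhost:7474/db/data/relationship/types",
  "batch" : "http://localhost:7474/db/data/batch",
  "cypher" : "http://localhost:7474/db/data/cypher",
  "transaction" : "http://localhost:7474/db/data/transaction",
  "node_labels" : "http://localhost:7474/db/data/labels",
  "neo4j_version" : "2.0.1"
}

Each value is itself a URL that can be followed to discover more of the API.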

Because REST is accessible via the standard HTTP protocol, a whole raft of clients can now be catered for, and not only those able to operate on the JVM. The following snippet illustrates how the standard Unix curl client generated the HTTP request, the response to which was detailed in listing 10.4. This code snippet includes the setting of all the required headers:

curl -X GET -H "Accept: application/json" -H "Content-Type: application/json" \
     http://localhost:7474/db/data/

Suppose you wanted to explore chapter 9’s social networking graph from a particular starting point, such as getting more information about Adam (an example user in the system with the user ID adam001). You’d first need to identify your starting REST URL. You know that users in the system are indexed by their user IDs. Using listing 10.4 as your starting point, if you were using legacy indexes, you could use the base URL detailed against the node_index key as the starting point for constructing the final URL needed to look up Adam. Using schema-based indexing, you’d need to construct a URL using the node_labels key. The fine-grained request and associated response are shown in the following listing.

Listing 10.5. HTTP request and response for getting info about Adam via his userId

This response shows that the returned node, Adam, happens to be associated with node ID 0. It also includes all of the properties associated with this node. If you wanted to explore the relationships to or from this node, another call would be required. The URL associated with all_relationships could be used to retrieve this set of information, resulting in the response shown in the following listing.

Listing 10.6. HTTP request and response for all of Adam’s relationships

From these responses, you could continue exploring or discovering the graph as required with additional calls. Although this discoverable approach holds fast to the core principles behind REST, it makes the basic fine-grained REST API quite a chatty protocol, often requiring multiple over-the-wire requests to satisfy basic graph queries. This does not bode well for performance. In the following sections, you’ll see what kinds of approaches can be used to reduce the chattiness, but you should now have a good basic understanding of how the core REST API is designed and functions.

10.3.3. Using the Cypher Neo4j server REST API endpoint

You can, more often than not, use the Cypher REST endpoint to achieve the same result as you’d get from following the various fine-grained REST API calls from a client, but with fewer network hops and more control. As you’ll see moving forward, the use of Cypher is very much encouraged as a way to reduce some of the chattiness of your calls.

The following listing shows how you can execute a Cypher query over the REST API to get both the basic node info for Adam, plus a listing of all Adam’s relationships, in one single call—compared to the two separate calls demonstrated in the previous section with the raw fine-grained API.

Listing 10.7. Using Cypher via REST API to get Adam’s info, including all relationships
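
As a sketch (the exact query shape and response layout are assumptions), a single POST to the cypher endpoint could pull back the node and its relationships in one round trip:

POST http://localhost:7474/db/data/cypher

{
  "query"  : "MATCH (x:Person { userId: { uId } })-[r]-() RETURN x, collect(r) as rels",
  "params" : { "uId" : "adam001" }
}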

The benefit of using the Cypher endpoint is that you’re able to aggregate and consolidate the required data in a single execution, rather than requiring multiple individual network calls. Although listing 10.7 returned the raw node and relationships data, you could easily modify it to return only a subset of that data, reducing the size of the payload coming back over the wire. You could choose to only return Adam’s name, and the names of friends at the end of his IS_FRIEND_OF relationships, with the following query:

MATCH (x:Person { userId: { uId } })-[r:IS_FRIEND_OF]-(y)
RETURN x.name as name,
       collect(y.name) as friend_names

10.3.4. Using a remote client library to help access the Neo4j server

In section 10.3.2, you saw how the curl client can be used to operate with the fine-grained low-level REST API. You also probably appreciate that this could prove to be quite tedious if you have to parse the JSON, construct appropriate URLs, and then follow them around yourself. You additionally learned how to make use of the Cypher endpoint as another means of accessing the Neo4j server. As with the fine-grained API, however, this required you to fully understand the low-level queries being executed, including parsing and constructing JSON results.

You could choose to make use of one of the community-contributed Neo4j remote client libraries to provide a more developer-friendly interface that does a lot of this plumbing for you. Figure 10.5 depicts what such a setup would look like. Many languages and framework combinations are available—a listing of remote REST wrappers (as well as general purpose wrappers) can be found at http://www.neo4j.org/develop/drivers. At the time of writing, this list includes Java, .NET, Python, Django, PHP, JRuby, JavaScript, and Clojure.

Figure 10.5. Server-based deployment approach using remote REST client libraries

These libraries and frameworks make integrating with the various aspects of the REST API a far more pleasant experience than trying to deal with the raw API directly. Listing 10.8 shows how a Java client could make use of the java-rest-binding library (https://github.com/neo4j/java-rest-binding) to begin navigating through all of Adam’s relationships. This particular binding conveniently wraps the REST calls behind the well-known GraphDatabaseService API, which you’ve already encountered in the Neo4j embedded mode.

Listing 10.8. Java REST client using the java-rest-binding library
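
A rough sketch of such a client, using the RestGraphDatabase wrapper from the binding (the node ID lookup and the relationship handling shown here are assumptions), might look like this:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.rest.graphdb.RestGraphDatabase;

public class RestBindingExample {
    public static void main(String[] args) {
        // Looks like a normal GraphDatabaseService, but every call is translated
        // into one or more REST calls against the server
        GraphDatabaseService graphDb =
                new RestGraphDatabase("http://localhost:7474/db/data");

        try (Transaction tx = graphDb.beginTx()) {
            Node adam = graphDb.getNodeById(0);    // Adam happens to be node 0
            for (Relationship rel : adam.getRelationships()) {
                Node friend = rel.getOtherNode(adam);
                System.out.println(rel.getType().name() + " -> " + friend.getProperty("name"));
            }
            tx.success();
        }

        graphDb.shutdown();
    }
}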

Note

The neo4j-rest-graphdb binding provides two ways to interact with the server. The first is the RestAPIFacade interface that provides simple wrappers around the basic Neo4j REST API. The other approach (used in listing 10.8) is to use a GraphDatabaseService implementation that delegates to the appropriate REST API calls behind the scenes. Although this second approach can be very convenient in that it uses the GraphDatabaseService interface you’ve come to know and love, it must be stressed that you shouldn’t expect the same performance that you’ll get from the embedded mode. The additional network calls involved will make the Neo4j server calls perform less optimally than simple embedded calls. You’ll need to employ different tactics when using a GraphDatabaseService that’s talking to a Neo4j server over the network to get acceptable performance. We’ll be looking into these options later in the chapter.

10.3.5. Server plugins and unmanaged extensions

Neo4j provides two main mechanisms for avoiding the verbose and chatty nature of the basic REST API (aside from the Cypher REST endpoint): server plugins and unmanaged extensions, as highlighted in figure 10.6.

Figure 10.6. Accessing Neo4j via server plugins and unmanaged extensions

These two options allow you to write custom server-side code to supplement or enhance the existing REST API provided out of the box, and they’re sometimes compared to the stored procedures of the relational database world. They attempt to get around some of the performance limitations inherent in the backward-and-forward nature of the server mode by providing a way to offload some of the heavy-lifting logic to the server side, with only the final result needing to be sent back to the client. Unmanaged extensions, in particular, provide the opportunity to define a more domain-friendly REST API. You’ll be seeing examples of how to write server plugins and unmanaged extensions in sections 10.5.3 and 10.5.4 when we show you how to get the most out of the server setup. Before that, we’re going to look at comparisons between embedded and server modes.

10.4. Weighing the options

Now that you have a solid understanding of how the embedded and server modes work and differ from each other, we’ll move on to looking at pros and cons of each approach. You’ll see when it might make sense to choose one over the other, and the possible implications for these choices. Table 10.1 summarizes the main points we’ll be covering. Most of the points listed here can be broadly classified under architectural or performance considerations, with a few falling into the “other” category. We’ll begin to attack this comparison from those perspectives.

Table 10.1. Advantages and disadvantages of Neo4j embedded and server modes

Embedded mode

Advantages:

· Speed.

· Ability to take full advantage of all low-level APIs directly.

· Ability to operate in an HA setup.

Disadvantages:

· Language restrictions (only Java and JVM languages supported).

· Possible common library clashes with your application.

· Tight coupling of application process and Neo4j.

· Application may potentially impact the database’s performance and vice versa.

· Inability to scale Neo4j independently of the application.

Server mode

Advantages:

· Decoupled architecture: you can scale and manage Neo4j independently of the application.

· Larger set of client platforms supported (not only JVM-based).

· Multiple clients can use the database.

· Ability to operate in an HA setup.

Disadvantages:

· Awkward and cumbersome fine-grained REST API.

· Slower speed, though using REST streaming, Cypher, batching, server plugins, and extensions may help.

· Restricted to only being able to deal with JSON or HTML responses for the raw REST API at this point in time.

Neo4j has successfully been used in both embedded and server modes for startups and large corporations, including companies such as Adobe, Ebay, and GameSys, so you can rest assured that both approaches have been proven on both large and small scales. Neo Technology, the commercial backer of Neo4j, provides a list of customers (including case studies on some of them) who have successfully used Neo4j in various setups. For more information, see the Neo Technology site: http://www.neotechnology.com/customers.

10.4.1. Architectural considerations

One of the first things you should consider when embarking on any project is what the overall architectural requirements are, including what kind of clients will need to be supported.

For now we’ll ignore considerations such as whether or not you need HA. That’s not to say that HA isn’t important, but for the purposes of this section, it’s not part of the equation. HA is covered in chapter 11. Suffice it to say that both embedded and server modes do cater to HA.

Language considerations

When it comes to your project, if you have any specific language restrictions up front, this will naturally form one of the major factors driving your decision. The server mode can cater to a much larger set of client platforms compared to the embedded mode. With the embedded mode, you’re restricted to Java or one of the other supported JVM-based languages only; the server mode can deal with any client that can “talk” HTTP.

Separation of concerns: app concerns versus DB concerns

Choice of clients aside, one of the more fundamental items to consider is to what extent you need to be able to scale and manage your application separately from the Neo4j database.

To make this discussion more concrete, consider figure 10.7, which shows two possible ways in which you could choose to deploy the movie-based social network application from previous chapters: embedded and server modes. Let’s pretend that there was a requirement for a web application to be available for general users to interact with, as well as an administration section, or separate application, where authorized administrators could perform maintenance and housekeeping tasks, such as loading new movies, deleting old users, and so on.

Figure 10.7. Two possible deployment scenarios for the social network application: embedded and server modes

With the embedded mode, the lifecycle, memory, and processing capabilities of the application (social-movie.war) are tightly bound to those of the embedded Neo4j database; with the server mode, there are separate JVMs handling the application and the Neo4j database (JVM 1 and JVM 2). In the server version, the administration application is separated out into its own PHP-based web client, even further decoupling the main components and applications from one another.

In embedded mode, your application and Neo4j share the same JVM and therefore share the same Java heap; they’re subject to the same garbage collection (GC) cycle and will essentially live and die together. If your application causes or triggers GC in embedded mode, this may impact the performance of Neo4j. What may appear to be Neo4j reacting slowly may simply be Neo4j waiting for a GC pause to complete. There may also be cases when you’ll want to tune the Neo4j JVM and GC parameters separately from those of your application, as they may have fundamentally different usage patterns. Unfortunately, this isn’t possible with embedded mode. You’d need to find settings that could function reasonably well for both database and application needs combined. Having said that, as stated earlier, many companies have successfully embedded Neo4j in their applications, but obviously special consideration needs to be given to this area if you do. Chapter 11 goes into more detail about how to tune and maintain the JVM for optimal Neo4j use.

Additionally, because Neo4j is a Java-based application, it will have dependencies on other common libraries. Neo4j uses Lucene as its core indexing implementation (see listing 10.2). Although care has been taken to minimize these dependencies, if your application also makes use of any of these shared common libraries, you’ll need to ensure that only one appropriate version is included and used in the JVM at any given time; otherwise there may be unexpected results, either in your application or in the way the database behaves. This is generally not a big issue, but it’s something to be aware of.

Hardware considerations

Closely linked to the preceding separation of concerns issue are hardware considerations. For Neo4j to operate and function as efficiently as possible, it ideally needs a beefy machine with a lot of RAM and sufficiently fast disks (refer to chapter 11 for more details). If your setup is such that this isn’t possible—for example, if you have an existing limited application server box or a set of boxes that can’t be upgraded for whatever reason—then embedding Neo4j in your application may well bring your hardware to its knees.

In this case, one option may be to procure a new machine with sufficient memory where a Neo4j server instance could reside, leaving the application server boxes to deal with what would be considered a more typical application memory profile. If you can upgrade your machine to accommodate the additional Neo4j needs, the embedded approach could be a viable option.

10.4.2. Performance considerations

Performance is one of the key areas where the embedded and server modes differ. Neo4j in embedded mode will always outperform the server mode when doing a direct comparison of execution times for the same set of operations done via the native Java API as opposed to the REST API. This is due to the added latency and overhead associated with making calls over the network.

By way of an initial comparison, look at the results detailed in table 10.2, showing the time in milliseconds for embedded and server modes (with nodes per second in parentheses) taken to create one million new user nodes with a name property. Each new node was created in its own transaction (TX) using the raw Java API in embedded mode versus the raw REST API of the server mode.

Table 10.2. The initial results of embedded versus server mode performance when creating new nodes

Scenario  Description                           Embedded                      Server
1         1 TX per node (1,000,000 × 1)         168,815 ms (5,952 nodes/s)    2,380,140 ms (420 nodes/s)

* Run on a MacBook Pro with 16 GB of RAM and 1 TB SSD (with FileVault FS encryption turned on). The Neo4j server was run on the same local machine that the unit test was run on, and the unit test made use of the neo4j-rest-graphdb REST client library detailed earlier.

On the face of it, these numbers don’t make for very good reading, even for the embedded mode. Nearly three minutes to create 1 million nodes? Don’t despair just yet; this example was designed to prove a point. We could argue that the question should never be a simple case of which one is faster, but rather, given a scenario, what can be done to get the best performance, and whether that performance level is acceptable. A few adjustments to the way in which the operations are performed can have a drastic effect on performance.

Table 10.3 shows how the performance gets a lot better when you start to make better use of transactions (native transactions in embedded mode, and batches for server mode). This simple change has a big effect on the original numbers: up to 10 times faster for the embedded mode and 16 times faster for the server mode using batches.

Table 10.3. Extended results of embedded versus server mode performance when creating new nodes

Scenario  Description                           Embedded                      Server
1         1 TX per node (1,000,000 × 1)         168,815 ms (5,952 nodes/s)    2,380,140 ms (420 nodes/s)
2         1 TX for all nodes (1 × 1,000,000)    25,654 ms (40,000 nodes/s)    Took too long, hung
3         Batched TXs (20 × 50,000)             16,081 ms (62,500 nodes/s)    148,357 ms (6,756 nodes/s)

Whenever you’re presented with performance numbers, make sure you understand how the performance test was put together and what factors are in play—or not. To lay all of our cards on the table, listings 10.9 and 10.10 show the code used to perform these comparisons.

Listing 10.9. Code used for embedded performance test comparison
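
As a sketch of the embedded side (the store location, property values, and the 50,000-node batch size used for scenario 3 are assumptions), the test loop looked roughly like this:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class EmbeddedCreatePerformanceTest {

    private static final int TOTAL_NODES = 1_000_000;
    private static final int NODES_PER_TX = 50_000;   // scenario 3: 20 × 50,000

    public static void main(String[] args) {
        GraphDatabaseService graphDb =
                new GraphDatabaseFactory().newEmbeddedDatabase("data/perf-test.db");

        long start = System.currentTimeMillis();
        int created = 0;
        while (created < TOTAL_NODES) {
            // One transaction per batch of nodes
            try (Transaction tx = graphDb.beginTx()) {
                for (int i = 0; i < NODES_PER_TX && created < TOTAL_NODES; i++, created++) {
                    Node user = graphDb.createNode();
                    user.setProperty("name", "User " + created);
                }
                tx.success();
            }
        }
        System.out.println("Created " + created + " nodes in "
                + (System.currentTimeMillis() - start) + " ms");

        graphDb.shutdown();
    }
}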

Listing 10.10. Code used for server performance test (RAW API) comparison
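
A corresponding sketch for the server side uses the neo4j-rest-graphdb binding; apart from enabling the batch-transaction system property described below, the loop mirrors the embedded version:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.rest.graphdb.RestGraphDatabase;

public class ServerCreatePerformanceTest {

    private static final int TOTAL_NODES = 1_000_000;
    private static final int NODES_PER_TX = 50_000;

    public static void main(String[] args) {
        // Collect all REST calls made inside a transaction and send them as one batch
        System.setProperty("org.neo4j.rest.batch_transaction", "true");

        GraphDatabaseService graphDb =
                new RestGraphDatabase("http://localhost:7474/db/data");

        int created = 0;
        while (created < TOTAL_NODES) {
            try (Transaction tx = graphDb.beginTx()) {
                for (int i = 0; i < NODES_PER_TX && created < TOTAL_NODES; i++, created++) {
                    Node user = graphDb.createNode();
                    user.setProperty("name", "User " + created);
                }
                tx.success();   // the batch is sent when the transaction block completes
            }
        }

        graphDb.shutdown();
    }
}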

When a REST transaction is started, all subsequent REST calls are merely collected and held in memory until the transaction is marked as complete (that is, until the try-with-resources Transaction block completes). At this point, the whole collection of requests is sent over the network as a single batched request. It should be noted that this batching behavior is disabled by default; to turn it on, you need to set the system property org.neo4j.rest.batch_transaction=true, as done in the sketch above.

As a general rule, the server mode, and in particular the REST API, will suffer more in the area of performance, due to the network-based calls required, but this doesn’t mean that it should simply be avoided. There are quite a few alternative approaches and methods that can be used to improve the performance of Neo4j in server mode to the point where it performs at an acceptable level. Section 10.5 is dedicated to looking at this.

10.4.3. Other considerations

Besides the architectural and performance considerations already discussed, there are a couple of additional areas to consider that may also influence whether you choose to use embedded or server mode.

REST API: supported data exchange formats

The REST API currently only supports JSON and HTML as the data exchange formats. If you prefer XML or some other format, you’re out of luck. The only way to make use of a different data exchange format is to use unmanaged extensions (see section 10.5.4), which requires that you define your own REST interface for accessing and interacting with the database. You do lose the ability to make use of all of the prebuilt REST API functionality when you go down this route.

Transactions

Neo4j is a fully ACID-compliant database. When using the embedded mode, this is a fairly straightforward proposition, but with server mode, it can sometimes throw up a few interesting challenges.

In server mode, each HTTP request is treated as a single transaction by default. This means, for example, that you wouldn’t be able to create two nodes as part of a single transaction if all you had at your disposal was the raw REST API. (The raw REST API requires two separate HTTP POST requests, which would be treated as two separate transactions.)

Approaches for handling transaction scenarios in a server context include

· Using the Cypher endpoint

· Using the REST transactional endpoint

· Using server plugins or unmanaged extensions

· Using batch operations

Depending on your situation, you may be able to define a single Cypher statement that’s able to perform multiple operations in one go. Section 10.5.2 provides more details on using the Cypher endpoint. Additionally, there’s also a Transactional REST endpoint (http://docs.neo4j.org/chunked/stable/rest-api-transactional.html), which allows you to execute a series of Cypher statements within the scope of a transaction, over multiple HTTP requests. With this option, the client explicitly issues commit or rollback commands via dedicated REST endpoints.
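
For example, assuming a local server, a single request to the transactional endpoint’s commit URL could create two nodes atomically (the statements and parameters here are purely illustrative):

curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" \
     -d '{
           "statements" : [
             { "statement" : "CREATE (a:Person { userId: { id1 } })", "parameters" : { "id1" : "user001" } },
             { "statement" : "CREATE (b:Person { userId: { id2 } })", "parameters" : { "id2" : "user002" } }
           ]
         }' \
     http://localhost:7474/db/data/transaction/commit

Both CREATE statements either succeed or fail together, because they run inside the same server-side transaction.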

Server plugins and unmanaged extensions enable you to write server-side code that can take the data for creating the nodes in a single HTTP request, and then on the server side ensure they’re created as a single transaction (see sections 10.5.3 and 10.5.4 for examples). Neo4j also provides functionality called batch operations that allow you to send groups or batches of low-level REST instructions over a single HTTP call, where all of the batched instructions get treated as a single unit when executed on the server side. More information about batch operations can be found in the “Transactional HTTP endpoint” section of the Neo4j Manual, at http://neo4j.com/docs/stable/rest-api-transactional.html.

10.5. Getting the most out of the server mode

This final section highlights options for extracting the best possible performance out of the server mode under a given set of circumstances. As we’ve said before, the server mode will suffer more than the embedded mode in the area of performance, but by using some of these techniques and approaches, server mode performance can be brought to an acceptable level.

To fully understand and appreciate the differences between the approaches we’ll present, the remainder of this section will use our social network domain as an example to see how these different approaches result in different performance metrics. Let’s assume you have a requirement to be able to find and return the names of all of a user’s immediate friends, specifically friends whose names start with the letter J. In your system, there’s already a user set up, Adam, who has 600 immediate friends, 15 of whom have names starting with a J. (Adam’s userId is adam001; this user has been uniquely indexed and his node ID happens to be 0.)

You’ll be using two different clients for your experiments: the first is simply the Unix curl client, and the other is the java-rest-binding client. (You can find the curl scripts and JUnit client test classes that we used for our timings as part of the provided source code. Appendix B provides instructions for running these.)

Table 10.4 provides a template for you to fill in your findings as you go along.

Table 10.4. Performance metrics log: template

Scenario  Description          Server calls  Curl (cold)  Curl (warm)  Java binding (cold)  Java binding (warm)
1         Raw REST API
2         Cypher call
3         Server plugin
4         Unmanaged extension

· Time is stated in milliseconds.

· Timing for curl client is done using the bash time command.

· Timing for the Java REST binding is done using the Java (Spring) StopWatch.

· Cold = server first stopped then started before the first REST call was made.

· Warm = second call (same as previous) made directly after first without server restart.

Cold vs. warm timings

The first call (a cold call) made to the server includes the time required to perform one-off bootstrapping, caching, and initialization processing, which may not be present for subsequent (warm) calls. This means that cold calls will almost always be slower, and may also be slightly less predictable, than warm calls. We’ll provide both timings to give you the full picture.

10.5.1. Avoid fine-grained operations

As a general rule, when performing any kind of operations over a network, you should aim to minimize the number of network hops required to carry out that operation, and this same principle should also be applied to the manner in which the REST API is used. The raw, low-level REST API operations are very fine-grained, typically operating on a single node or relationship at any time. They can generate a lot of unnecessary network traffic if used inappropriately.

Rather than only using the discoverable low-level REST API for retrieving data, you should consider the following alternatives that may result in far fewer network calls and much better performance:

· Use the Cypher REST endpoint

· Use the Traversal REST endpoint

· Create a server plugin or unmanaged extension to return results

For creating data, consider these options:

· Use the mutating functionality in the Cypher API

· Use the REST batch API

· Create a server plugin or unmanaged extension to perform your task

If you use the raw REST API with its hypermedia-driven approach to get the information you need for the example scenario, you’ll need a total of 602 network calls in order to accomplish this. Assuming the Neo4j server is running locally on port 7474, you’d need to do the following:

· 1 GET on http://localhost:7474/db/data/node/0 to get the initial information on data and options available for Adam.

· 1 GET on http://localhost:7474/db/data/node/{nId}/relationships/all/IS_FRIEND_OF to get a listing of all the IS_FRIEND_OF relationships to and from Adam (see listing 10.6 for request/response details).

· 600 GETs on http://localhost:7474/db/data/node/{nId}/properties, one for each of the 600 relationship structures returned in the previous call. For each relationship, you’d use its end URI to construct the properties call, which returns all of the properties for the ending (friend) node. The JSON returned in each of these calls could then be parsed and used to only return those names starting with a J.

The updated performance metrics log is shown in table 10.5.

Table 10.5. Performance metrics log after scenario 1, raw REST API

Scenario  Description          Server calls  Curl (cold)  Curl (warm)  Java binding (cold)  Java binding (warm)
1         Raw REST API         602           5726 ms      5303 ms      1066 ms              917 ms
2         Cypher call
3         Server plugin
4         Unmanaged extension

10.5.2. Using Cypher

The Neo4j REST API makes a provision for you to run arbitrary Cypher statements against the server by posting the appropriate query or statement and parameters to the designated REST endpoint responsible for executing Cypher. Provided you have all the information you need to construct your query or statement upfront, this should require only one network call.

Using the response returned from requesting the service root URL, described in section 10.3.2, the URL defined against the cypher key (see listing 10.4) provides the entry point to use to send Cypher queries and statements to the server. The following snippet shows the service root response from listing 10.4:

...
"batch" : "http://localhost:7474/db/data/batch",
"cypher" : "http://localhost:7474/db/data/cypher",
...

Listings 10.11 and 10.12 show the corresponding request and response to execute a Cypher query to satisfy our example scenario.

Listing 10.11. Cypher REST request
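
A sketch of such a request, posting a friend-name query to the cypher endpoint (the query text, column alias, and filter are assumptions made to fit the J-friends scenario):

POST http://localhost:7474/db/data/cypher

{
  "query"  : "MATCH (x:Person { userId: { uId } })-[:IS_FRIEND_OF]-(y) WHERE y.name =~ 'J.*' RETURN collect(y.name) as j_friend_names",
  "params" : { "uId" : "adam001" }
}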

The actual Cypher query is specified in the query key in the JSON request, with any associated parameters provided in the params key as key-value pairs.

Listing 10.12. Cypher REST response
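
A sketch of the corresponding response (the friend names here are purely illustrative):

{
  "columns" : [ "j_friend_names" ],
  "data" : [ [ [ "Jack", "Jane", "John" ] ] ]
}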

The results are returned in a structure that defines the column headers and corresponding data values. You can see that the column name matches the RETURN statement specified in the Cypher request from listing 10.11, and that all the matches for this column heading are provided as an array of names under the data section in listing 10.12. The updated metrics log is shown in table 10.6.

Table 10.6. Performance metrics log after scenario 2, Cypher call

Scenario  Description          Server calls  Curl (cold)  Curl (warm)  Java binding (cold)  Java binding (warm)
1         Raw REST API         602           5726 ms      5303 ms      1066 ms              917 ms
2         Cypher call          1             740 ms       45 ms        104 ms               32 ms
3         Server plugin
4         Unmanaged extension

This approach has provided a dramatic improvement to the performance for the query.

10.5.3. Server plugins

Server plugins provide a mechanism for offloading some of the processing-intensive logic to the server rather than having to perform it all on the client, with multiple requests having to flow backward and forward to accomplish the same thing. Server plugins are sometimes compared to stored procedures in the relational database world.

Server plugins have specifically been designed to extend the existing REST API options returned for a node, relationship, or the global graph database. Recall that when you make a request for the detail of a particular node, you get a lot of options back, including an extensions key. See the following snippet for a recap:

{ ...
  "extensions" : { ... },
  "property" : "http://localhost:7474/db/data/node/0/properties/{key}",
  "self" : "http://localhost:7474/db/data/node/0",
  "data" : { "name" : "Adam" }
  ... }

This represents the list of extension points (server plugins available) for the node. Similar extension points will be available for relationships and the graph as a whole.

To write a server plugin, you need to first decide what it is you want to target or extend—the node, relationship, or graph database options—and then follow these steps:

1. Write a class that extends the ServerPlugin class.

2. Ensure that the fully qualified name of the server plugin class is listed in a file called org.neo4j.server.plugins.ServerPlugin.

3. Package the plugin class and this file into a JAR file, and place the JAR on the class path of the Neo4j server.

4. Access the functionality by discovering and then calling the appropriate REST URL.

The following listing shows a ServerPlugin class we created to extend the capabilities of a node by finding all friends with names starting with a J.

Listing 10.13. ServerPlugin class
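
The following is a sketch of such a plugin, assuming friendships use the IS_FRIEND_OF relationship type and friends carry a name property; the class and method names are illustrative, chosen to match the packaging file shown later in this section:

package com.manning.neo4jia.chapter10.serverplugin;

import java.util.ArrayList;
import java.util.List;

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.server.plugins.Description;
import org.neo4j.server.plugins.Name;
import org.neo4j.server.plugins.PluginTarget;
import org.neo4j.server.plugins.ServerPlugin;
import org.neo4j.server.plugins.Source;

public class JFriendNamesServerPlugin extends ServerPlugin {

    @Name("j_friend_names")
    @Description("Returns the names of immediate friends whose names start with J")
    @PluginTarget(Node.class)                    // exposed as an extension on every node
    public Iterable<String> getJFriendNames(@Source Node user) {
        List<String> names = new ArrayList<>();
        try (Transaction tx = user.getGraphDatabase().beginTx()) {
            for (Relationship friendship : user.getRelationships(
                    Direction.BOTH, DynamicRelationshipType.withName("IS_FRIEND_OF"))) {
                Node friend = friendship.getOtherNode(user);
                String name = (String) friend.getProperty("name", "");
                if (name.startsWith("J")) {
                    names.add(name);
                }
            }
            tx.success();
        }
        return names;
    }
}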

All Nodes are Equal

When a plugin targets a node, it targets all nodes. Even if logically you have different types of nodes defined within your database, such as user nodes and movie nodes, Neo4j will make this server plugin available on all nodes. Care should be taken to define server plugins that can be used across all nodes, or for some mechanism to be in place to ensure it gets executed on only the appropriate types of nodes.

Extending the ServerPlugin class will ensure that this class is picked up as a server option when the server starts. For each extension point required, a method should be created that specifies (via the @PluginTarget annotation) what the discovery point type is. This will be one of Node, Relationship, or GraphDatabaseService. Combined with the @Name annotation, this will determine where and under what name the additional REST endpoints are exposed in the overall REST API. A corresponding reference to the Node, Relationship, or GraphDatabaseService argument in the method itself will also be required so that this reference can be used to perform any functionality or logic that may be required.

The following listing shows how you could make an HTTP request to get the details for node 0 (Adam), as well as a portion of the resulting response, including a listing of what extensions are available.

Listing 10.14. Extension snippet of HTTP response for getting info on Adam node
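
Assuming the plugin sketched in listing 10.13, the relevant portion of the response might look roughly like this (the extension URL follows the /db/data/ext/{plugin}/node/{id}/{name} pattern):

{ ...
  "extensions" : {
    "JFriendNamesServerPlugin" : {
      "j_friend_names" : "http://localhost:7474/db/data/ext/JFriendNamesServerPlugin/node/0/j_friend_names"
    }
  },
  "self" : "http://localhost:7474/db/data/node/0",
  "data" : { "name" : "Adam" }
  ... }

The j_friend_names URL can then be invoked with an HTTP POST to execute the plugin against this node.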

As the numbered list of steps at the beginning of this section stated, a JAR file needs to be created that contains both the plugin class and the org.neo4j.server.plugins.ServerPlugin file, as shown in the following snippet (placed in the META-INF/services directory):

com.manning.neo4jia.chapter10.serverplugin.JFriendNamesServerPlugin

This JAR file should then be placed on the server class path. This is usually done by placing the JAR file in the plugins directory of wherever the server is installed.

The results of this server plugin execution, which under the covers is using the native embedded API, are detailed in table 10.7. Again, there’s quite a drastic improvement over the plain REST API.

Table 10.7. Performance metrics log after scenario 3, server plugin

Scenario  Description          Server calls  Curl (cold)  Curl (warm)  Java binding (cold)  Java binding (warm)
1         Raw REST API         602           5726 ms      5303 ms      1066 ms              917 ms
2         Cypher call          1             740 ms       45 ms        104 ms               32 ms
3         Server plugin        1             147 ms       20 ms        76 ms                16 ms
4         Unmanaged extension

4

Unmanaged extension

10.5.4. Unmanaged extensions

If you require complete control over your server-side code, then unmanaged extensions may be what you’re looking for. Unlike server plugins, which merely allow you to augment the existing REST API at specific points, unmanaged extensions essentially allow you to define your own domain-specific REST API. Instead of nodes and relationships, you can now deal in users and movies if you so choose.

Neo4j makes this possible by allowing you to deploy arbitrary JAX-RS (Java API for RESTful web services) classes to the server. JAX-RS provides a set of APIs that are supposed to make developing REST services a piece of cake for developers. Broadly speaking, you define a Java class, which, through a set of annotations, binds the class to a particular URL pattern and mount point within the Neo4j server. When this mount point is invoked, control is transferred to this class, which can have full access to the Neo4j graph database, allowing the class to perform whatever actions or functionality is required, returning the data in whatever format is desired. Though the protocol still needs to be over HTTP, the data format isn’t restricted to only JSON and HTML, as with the REST API and server plugins.

Warning!

Unmanaged extensions essentially give you unrestricted access to use and influence the resources of the Neo4j server. This is extremely powerful, but you need to be careful you don’t accidentally shoot yourself in the foot, so to speak. This could be done by consuming all of the JVM heap space while performing an expensive traversal of some sort. Provided you understand what you’re doing, this can be a powerful tool in your toolbox, but as the saying goes: With great power comes great responsibility!

The next listing shows an implementation of an unmanaged extension that looks up a user based on their name and then returns all of the user’s immediate friends whose name starts with a J.

Listing 10.15. An unmanaged extension
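
The listing below is a sketch under a few assumptions: user nodes carry a Person label and a userId property (as in the earlier Cypher examples), friendships use the IS_FRIEND_OF relationship type, and the package and class names are chosen to match the mapping shown after the listing:

package com.manning.neo4jia.chapter10.unmanagedext;

import java.util.ArrayList;
import java.util.List;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;

@Path("/example/user")
public class JFriendsResource {

    private final GraphDatabaseService graphDb;

    // Neo4j injects the GraphDatabaseService that the server wraps
    public JFriendsResource(@Context GraphDatabaseService graphDb) {
        this.graphDb = graphDb;
    }

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    @Path("/{userId}/jfriends")
    public Response getJFriendNames(@PathParam("userId") String userId) {
        List<String> names = new ArrayList<>();
        try (Transaction tx = graphDb.beginTx()) {
            for (Node user : graphDb.findNodesByLabelAndProperty(
                    DynamicLabel.label("Person"), "userId", userId)) {
                for (Relationship friendship : user.getRelationships(
                        DynamicRelationshipType.withName("IS_FRIEND_OF"))) {
                    String name = (String) friendship.getOtherNode(user).getProperty("name", "");
                    if (name.startsWith("J")) {
                        names.add(name);
                    }
                }
            }
            tx.success();
        }

        // Hand-rolled JSON array to keep the sketch dependency-free; a real
        // extension would more likely use a JSON library
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < names.size(); i++) {
            if (i > 0) {
                json.append(",");
            }
            json.append("\"").append(names.get(i)).append("\"");
        }
        json.append("]");
        return Response.ok(json.toString(), MediaType.APPLICATION_JSON).build();
    }
}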

As with server plugins, this class needs to be packaged in a JAR file and made available to the Neo4j server. By convention, this is done by placing the JAR file in the plugins directory of your Neo4j server.

Additionally, you’ll need to add and map any unmanaged extensions in the neo4j-server.properties file against the org.neo4j.server.thirdparty_jaxrs_classes key. (This can usually be found in the conf directory of the Neo4j server installation.) The mapping consists of defining the Java package that contains the extension classes. You map this to the base mount point as shown in the following snippet:

org.neo4j.server.thirdparty_jaxrs_classes=com.manning.neo4jia.chapter10.unmanagedext=/n4jia/unmanaged

This means that to execute the unmanaged extension and get all of Adam’s friends’ names starting with the letter J, you’d need to issue an HTTP GET against http://localhost:7474/n4jia/unmanaged/example/user/adam001/jfriends.

Table 10.8 shows how the unmanaged extension in listing 10.15 fares against the other approaches you’ve seen so far.

Table 10.8. Performance metrics log after scenario 4, unmanaged extension

Scenario  Description          Server calls  Curl (cold)  Curl (warm)  Java binding (cold)  Java binding (warm)
1         Raw REST API         602           5726 ms      5303 ms      1066 ms              917 ms
2         Cypher call          1             740 ms       45 ms        104 ms               32 ms
3         Server plugin        1             147 ms       20 ms        76 ms                16 ms
4         Unmanaged extension  1             158 ms       20 ms        119 ms               18 ms

Besides providing yet another mechanism for improving the performance of the server, unmanaged extensions provide benefits in allowing you to define a domain-specific REST API, as well as the ability to use whatever data interchange format—JSON, binary, text, or otherwise—you choose. We reiterate our warning, however, that this powerful tool needs to be managed carefully lest you inadvertently open Pandora’s box.

10.5.5. Streaming REST API

As of Neo4j version 1.8, the option to “stream” the JSON responses to REST requests has been introduced as another means for improving the performance of the Neo4j REST API. By default streaming is turned off.

At present, from the client’s perspective, all that’s required to have the results streamed back is to provide an additional header (X-Stream: true), as shown in figure 10.8.

Figure 10.8. Result of turning streaming on/off
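
For example, reusing the Cypher endpoint from earlier, a streamed request would look something like this (the query and empty parameter map are illustrative):

curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" \
     -H "X-Stream: true" \
     -d '{ "query" : "MATCH (n) RETURN n", "params" : { } }' \
     http://localhost:7474/db/data/cypher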

Note that this is still new functionality and it may be subject to change as clearer usage patterns emerge and the API evolves. We won’t go into too much detail at this time.

To demonstrate the additional improvement in performance gained by using the streaming API, we executed a Cypher query that returned all the nodes in the example database. We added an extra 120,000 nodes, resulting in a total of 120,602 nodes (including Adam and his friends) in the database. The streamed results came back in about 14 seconds for 149 MB, compared to 21 seconds for 168 MB unstreamed.

Note

The payload is smaller because the streamed results are compacted (whitespace removed) when being sent back. Streaming can also reduce the memory required by the server to perform this task. This is possible because the data requested is streamed directly back to the client as it’s read from the database, without having to be temporarily stored in node or relationship objects along the way.

10.6. Summary

You’ve seen how both the embedded and server modes provide the ability to access and interact with the Neo4j database, but the manner in which this is done is fundamentally different.

With the embedded mode, you have quite a cozy relationship with Neo4j, which is only available to Java and a select number of other JVM-based languages. Though you get direct access to all of Neo4j’s low-level APIs and can leverage the performance gains associated with this, you’re also required to share your resources (memory and the like) with Neo4j. This could potentially introduce additional overhead and complications to the management of your application, and that needs to be taken into account.

In the server mode, the Neo4j process is isolated and can be managed completely separately from that of the application, which is a big win in large distributed architectures. All the interactions, however, need to be done through the REST abstraction layer, which although it casts a wider net in terms of the clients it supports, has some performance implications that need to be understood and dealt with appropriately.

These performance issues are not insurmountable, with options like server plugins, extensions, streaming, and appropriate use of Cypher statements all helping to allow the server mode to operate at a better performance level.

The next logical question is how to run your Neo4j database in a production environment. You now have all the basics under your belt, but what will you need to take into account when you actually want to go live? The next, and final, chapter will cover this topic. You’ll learn about some of the operational considerations involved in using Neo4j in the real world.