Learning Cypher (2014)

Chapter 2. Filter, Aggregate, and Combine Results

In the previous chapter, we learned the basics of Cypher querying, including pattern matching. To use Cypher in real-world applications, you'll need a set of features, which we are going to cover in this chapter. These features include searching by text in the database, data aggregation for statistical analysis, pagination for performance improvements, or even simple filtering on an array of properties.

Filtering

Pattern matching is useful to describe the data we are looking for and how nodes and relations are organized in the graph database. However, we often need to filter data in more detail.

The book store – an example

In this chapter, we'll learn how to filter results using a real-world example: the book store. Just as we saw in the previous chapter, we have to define labels and relationships. A minimal set of labels are as follows:

· Book: This label includes all the books

· Person: This label includes authors, translators, reviewers, and so on

· Publisher: This label includes the publishers of books in the database

· User: This label includes the users of the website

A set of basic relationships are as follows:

· PublishedBy: This relationship is used to specify that a book was published by a publisher

· Votes: This relationship describes the relation between a user and a book, for example, how a book was rated by a user

Every book has the following properties:

· Title: This is the title of the book in string format

· Tags: This is an array of string tags useful for searching through the database based on topic, arguments, geographic regions, languages, and so on

Now, suppose we have already filled the database with a lot of books, and we are developing the website of the book store. In such applications, providing an effective search is very important because we can't make the user scroll through all the results on their own.

In the example code files that you can download from the Packt Publishing website at http://www.packtpub.com/support, you'll find an example database that contains 150 books. I used this database to perform the queries in this chapter. The graph of this example database can be seen in the following screenshot:

The book store – an example

Text search

In the previous chapter, we saw that we can search a node or a relationship with a given value of a property. Of course, this is very useful when you know that a certain node with a given value of a property exists; this is the case when a user is looking for a book and knows its title exactly. Consider the following query:

MATCH (b:Book { title: 'In Search of Lost Time' })

RETURN b

The same query can be written using the WHERE keyword, as shown in the following query:

MATCH (b:Book)

WHERE b.title = 'In Search of Lost Time'

RETURN b

In fact, for both queries, the result is as follows:

+----------------------------------------------------------+

| b |

+----------------------------------------------------------+

| Node[143]{title:"In Search of Lost Time",tags:["novel"]} |

+----------------------------------------------------------+

WHERE is the keyword that introduces a filter predicate; in this case, a filter based on the equality comparison of a property (Book.title) and a value (the text to search).

Note that the preceding two queries, although being quite different, have exactly the same effect and the same performance results because they are interpreted in the same way by the engine. This means that choosing either syntax is just a matter of taste. The latter is more similar to SQL. In fact, the creators of Cypher were inspired by SQL when they introduced the WHERE clause.

Working with regular expressions

Maybe in a real-world application, this kind of search would be performed in an advanced search section of a website. In fact, the common search by title function would look for titles that contain the text typed by the user. With Cypher, this can be done with regular expressions.

Regular expressions are widely used patterns that match expressions for strings. The following query will match all titles that contain the word Lost:

MATCH (b:Book)

WHERE b.title =~ '.*Lost.*'

RETURN b

Here, the special character . matches any character, while * matches any occurrence of the previous pattern, and once combined, .* matches any occurrence of any character. So, the regular expression .*Lost.* matches all strings that contain any character at the beginning (.*), then the word Lost and finally, any other character (.*). The result of this query is as follows:

+----------------------------------------------------------------+

| b |

+----------------------------------------------------------------+

| Node[143]{title:"In Search of Lost Time",tags:["novel"]} |

| Node[221]{title:"Love's Labour's Lost",tags:["comedy","drama"]}|

+----------------------------------------------------------------+

With regular expressions, of course, we can do much more. Just to remind us how regular expressions work, the following is a table of the most widely used patterns:

Pattern	Meaning	Example	Matched Strings
[ ]	This stands for any character included in the bracket or in the range specified	Te[sx]t [A-C]lt	Text Test Alt Blt Clt
*	This implies any occurrence of the preceding pattern, including no occurrence	Te[xs]*t	Text Test Texxt Tet
.	This implies any character, including none	Te.t	Text Test
?	This implies one or no occurrence of the preceding pattern	Te[xs]?t	Text Test Tet
\|	This stands for alternative matching	Te(xt\|ll)	Text Tell
\	This pattern escapes the following reserved character	Te\.t	Te.t

The regular expression standard used here is identical to the one used by Java in the official Java documentation at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html. There, you can find the full summary of the constructs supported.

The following is a list of some of the most widely used regular expression patterns:

Regular expression	Matches
([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})	E-mail
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-])\/?	An HTTP URL without a query string
<([a-z]+)([^<]+)(?:>(.)<\/\1>\|\s+\/>)	HTML tags
\d{5}	A ZIP code
\+?[\d\s]{3,}	A phone number

So, if we are looking for all titles that start with a given text (for example, Henry), we can write the following query:

MATCH (b:Book)

WHERE b.title =~ 'Henry.*'

RETURN b

The following query returns all the books that have tale, tales, or their respective title cases:

MATCH (b:Book)

WHERE b.title =~ '.*[Tt]ale(s)?.*'

RETURN b

The result is shown in the following output code:

+--------------------------------------------------------------+

| b |

+--------------------------------------------------------------+

| Node[3]{title:"Fairy tales"} |

| Node[31]{title:"The Canterbury Tales"} |

| Node[129]{title:"The Tale of Genji"} |

| Node[141]{title:"Tales"} |

| Node[245]{title:"The Winter's Tale",tags:["comedy","drama"]} |

+--------------------------------------------------------------+

By design, regular expressions are case sensitive. If you want to match a string independently by a letter's case, you can use the (?i) specifier. Place it at the beginning of the string to perform case-insensitive matching. The following query gives the same result as the previous one with respect to our database:

MATCH (b:Book)

WHERE b.title =~ '(?i).*tale(s)?.*'

RETURN b

Tip

You can find a lot of free resources online on regular expressions. Apart from the official Oracle documentation, you can find a lot of reading material on the following websites:

· http://www.rexegg.com/: This is a tutorial with links and examples

· http://www.regexplanet.com/advanced/java/index.html: This is a tester that uses the java.util.regex.Pattern implementation

· https://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/: This is a comprehensive cheat sheet

Escaping the text

In real-world applications, sometimes we have to use regular expression patterns dynamically; for example, we will use the following code snippet while working in Java:

String query = "(?i).*" + textToFind;

Map<String, Object> params = new HashMap<>();

params.put("query", query);

ExecutionResult result = engine

.execute("MATCH (b:Book) WHERE b.title =~ {query} RETURN b",params);

This code puts the textToFind string in a regular expression pattern, which is then passed to the query as a parameter.

The problem with this code is that if textToFind comes from the user input (as it's quite likely), a user could type in a valid regular expression pattern and still get an unexpected result. For example, the word 10$ would match only the titles that end with 10 and not all the titles that contain the word 10$.

In Java, escaping the text is easy, as shown in the following code snippet:

import java.util.regex.Pattern;

String query = "(?i).*" + Pattern.quote(textToFind);

Of course, every programming language or environment has its own functions to escape regular expressions. You should refer to their respective documentation.

Value comparisons

We often need to search items by comparing their values. For example, if we want all books published in 2012 and later, this can be done as follows:

MATCH (b:Book) -[r:PublishedBy]-> (:Publisher)

WHERE r.year >= 2012

RETURN b.title, r.year

This query matches all book nodes that have a relation with a publisher, filtering the books by comparing their year of publication, which must be greater than or equal to 2012. The result is shown in the following output code:

+-------------------------------------------+

| book.title | r.year |

+-------------------------------------------+

| "Akka Essentials" | 2012 |

| "Getting Started with Memcached" | 2013 |

| "Java EE 7 Developer Handbook" | 2013 |

+-------------------------------------------+

The >= is a comparison operator. The other comparison operators are as follows:

· <>: This is the different than operator

· <: This is the lesser than operator

· <=: This is the lesser than or equal to operator

· >: This is the greater than operator

Tip

You can find the complete list of supported operators and functions, each explained with an example, in the Appendix.

The IN predicate

The IN predicate allows us to specify more values for an expression. For example, consider the following query:

MATCH (b:Book) -[r:PublishedBy]-> (:Publisher)

WHERE r.year IN [2012, 2013]

RETURN b.title, r.year

This query matches all books published in 2012 or in 2013. It's a short alternative for the Boolean operator OR, which can be used as shown in the following query:

r.year = 2012 OR r.year = 2013

Boolean operators

The WHERE clause supports four Boolean operators. They are as follows:

· OR: This is the logical inclusive disjunction operator.

· AND: This is the logical conjunction operator; it returns TRUE only if both predicates are TRUE.

· XOR: This is the logical exclusive disjunction operator; it returns TRUE if only one of the predicates is TRUE. Use this if you don't want the results to satisfy both the predicates.

· NOT: This operator returns TRUE only if the following predicate is FALSE.

Working with collections

In Chapter 1, Querying Neo4j Effectively with Pattern Matching, we saw that Neo4j supports array parameters. So, how can we filter nodes and relationships by array values?

Collection predicates are functions with a predicate argument that return a Boolean value. Suppose you want to get all books that have a particular tag from a given dataset. With collection predicates, we can filter books based on collections.

For example, if we have tagged every book with an array of strings, the query will be as follows:

MATCH (b:Book)

WHERE ANY ( tag IN b.tags WHERE tag IN ['nosql','neo4j'] )

RETURN b.title,b.tags

The preceding query returns all books tagged with NoSQL, Neo4j, or both. The following result is obtained:

+----------------------------------------------------------------+

| b.title | b.tags |

+----------------------------------------------------------------+

| "Getting Started with NoSQL" | ["nosql"] |

| "Learning Cypher" | ["nosql","neo4j"] |

| "Instant MongoDB" | ["nosql","mongodb"] |

| "Ruby and MongoDB Web Development" | ["nosql","mongodb","ruby"]|

+----------------------------------------------------------------+

How does this query work? The keyword ANY, just as every predicate, introduces a rule with the following syntax:

item_identifier IN collection WHERE rule

In the case of ANY, at least one item in the collections must follow the rule. So, for every book, if there is any tag in the collection that satisfies the rule, it will be taken; otherwise, it will be discarded.

There are four collection predicates. They are as follows:

· ANY: This predicate returns TRUE if at least one item in the collection adheres to the expression

· ALL: This predicate returns TRUE if all items in the collection adhere to the rule

· NONE: This predicate returns TRUE if no item in the collection follows the rule

· SINGLE: This predicate returns TRUE if exactly one item follows the rule

For example, if we want books tagged only as Neo4j or NoSQL and nothing else, we have to change the previous query as follows:

MATCH (b:Book)

WHERE ALL( tag IN b.tags WHERE tag IN ['nosql','neo4j'] )

RETURN b.title, b.tags

Note that we have just replaced the ANY keyword with the ALL keyword, and we get the following result:

+--------------------------------------------------+

| b.title | b.tags |

+--------------------------------------------------+

| "Getting Started with NoSQL" | ["nosql"] |

| "Learning Cypher" | ["nosql","neo4j"] |

+--------------------------------------------------+

If we want all books tagged as NoSQL but not as Neo4j, we can use the NONE predicate as follows:

MATCH (b:Book)

WHERE ANY ( tag IN b.tags WHERE tag = 'nosql' )

AND NONE ( tag in b.tags WHERE tag = 'neo4j' )

RETURN b.title,b.tags

In fact, the result does not contain any book that is tagged Neo4j, as shown in the following output code:

+----------------------------------------------------------------+

| b.title | b.tags |

+----------------------------------------------------------------+

| "Getting Started with NoSQL" | ["nosql"] |

| "Instant MongoDB" | ["nosql","mongodb"] |

| "Ruby and MongoDB Web Development" | ["nosql","mongodb","ruby"]|

+----------------------------------------------------------------+

The SINGLE predicate is useful for performing the exclusive-OR operation among collection values. For example, if we want all books tagged either Ruby or MongoDB (we don't want too specific books), we will use the following code:

MATCH (b:Book)

WHERE SINGLE (tag IN b.tags WHERE tag IN ('ruby', ' mongodb ') )

RETURN b

The preceding query works because if we have a book tagged both Ruby and MongoDB, the SINGLE statement won't match because the specified predicate will return two tags. The result is shown in the following output code:

+-----------------------------------------+

| b.title | b.tags |

+-----------------------------------------+

| "Instant MongoDB" | ["nosql","mongodb"] |

+-----------------------------------------+

Paging results – LIMIT and SKIP

Even if you have never developed a web application, you would know why paging results is so important, for the following reasons:

· Loading a lot of data in a single HTTP request is very slow

· Showing a lot of data in a single page is very expensive in terms of bandwidth

Paging is necessary to avoid a lot of data from being loaded all together in a single query. For example, if the user searches for the string drama in the book database, the full result would likely be huge, that is, hundreds of books. Here, we can split the result in pages of twenty books; we can show twenty books in each page and have two buttons, next and previous, to switch the page forward or backward.

We get the first page with the following query:

MATCH (b:Book)

WHERE b.title =~ '(?i).*drama.*'

RETURN b

LIMIT 20

This will return the first twenty nodes with the Book label that have the title that matches the regular expression (?i).*drama.*. The (?i) part of the expression means that the regular expression is case insensitive, while the .* part of the expression matches any character type in the string. So, the regular expression matches every string that contains the word drama, independent of the character case.

Now, when the user clicks on the "next" button, we have to get the second chunk of data, skipping the first part. We must use the SKIP keyword as shown in the following query:

MATCH (b:Book)

WHERE b.title =~ '(?i).*drama.*'

RETURN b

SKIP 20

LIMIT 20

To get the third chunk of data, we have to skip past the first 40 items and so on for further pages; to show the page i, we must skip 20*i items. To avoid having to change the query every time, we can take advantage of the Cypher parameters, as shown in the following query:

MATCH (b:Book)

WHERE b.title =~ {query}

RETURN b

SKIP {skip}

LIMIT {limit}

The preceding query uses three parameters: query, skip, and limit. The first is the regular expression to find in the title, and the others are the number of items to skip and to return.

Using the REST API, a call using the paginated query will have the following payload:

{

"query" : "MATCH (b:Book) WHERE b.title =~ {query} RETURN b SKIP {skip} LIMIT {limit}",

"params" : {

"query" : "(?i).*drama.*",

"skip" : 20,

"limit": 20

}

Whereas in a Java-embedded database, the query can be used in a method as shown in the following code snippet:

import java.util.*;

import java.util.regex.Pattern;

import org.neo4j.cypher.javacompat.*;

import org.neo4j.graphdb.Node;

import org.neo4j.helpers.collection.IteratorUtil;

// ... more code

public List<Node> find(String text, int limit, int skip) {

final String query = "(?i).*" + Pattern.quote(text) + ".*";

Map<String, Object> params = new HashMap<>();

params.put("query", query);

params.put("limit", limit);

params.put("skip", skip);

final ExecutionResult result = engine

.execute("MATCH (b:Book) WHERE b.title =~ {query} RETURN b SKIP {skip} LIMIT {limit}", params);

List<BNode> ret = new LinkedList<>();

Iterator<Node> iterator = result.columnAs("b");

for(Node b : IteratorUtil.asIterable(iterator)) {

ret.add(b);

}

return ret;

}

This is the basic method to perform text searches in the database in order to show items in the graphical user interface. It takes three parameters: the text to find, the maximum number of items to return, and the number of items to skip. Then, it puts them in a map used as the parameter list in the following query:

MATCH (b:Book) WHERE b.title =~ {query} RETURN b SKIP {skip} LIMIT {limit}

Finally, the result is put in a list and returned. Maybe you are wondering if you can sort the results before paging the data. This is possible and important because you can use this feature, for example, to return the top ten most-voted books. This is the topic of the next section.

Sorting

If you have experience with SQL, then sorting with Cypher is exactly the same as sorting with SQL. We can use the ORDER BY clause to specify the columns to be used for sorting, as shown in the following query:

MATCH (b:Book)

WHERE ANY ( tag IN b.tags WHERE tag IN ['drama'] )

RETURN b.title

ORDER BY b.title

LIMIT 5

The preceding query looks for books tagged drama in the database, then sorts them by title, and returns the first five book entries found. We can note the following:

· The ORDER BY clause follows the RETURN clause

· This clause is above the LIMIT or SKIP clause so that we can sort the data before limiting our page

The result set is as follows:

+-----------------------------+

| b.title |

+-----------------------------+

| "A Lover's Complaint" |

| "A Midsummer Night's Dream" |

| "All's Well That Ends Well" |

| "Anthony and Cleopatra" |

| "As You Like It" |

+-----------------------------+

A descending sort

To sort inversely, just postpone the DESC clause to the ORDER BY clause, as shown as shown in the following query:

MATCH (b:Book)

WHERE ANY ( tag IN b.tags WHERE tag IN ['drama'] )

RETURN b.title

ORDER BY b.title DESC

LIMIT 5

It gives the following result:

+------------------------+

| b.title |

+------------------------+

| "Venus and Adonis" |

| "Twelfth Night" |

| "Troilus and Cressida" |

| "Titus Andronicus" |

| "Timon of Athens" |

+------------------------+

Tip

The result rows are different because the LIMIT clause is evaluated after the ORDER BY clause, so Cypher is limiting the result set to five items, which are already sorted. This is important in real-world applications because it allows us to both page data in a small result set and sort it however we want.

Dealing with null values using the COALESCE function

How are null values evaluated by the ORDER BY clause? They are always considered the largest values; so, they are the last values in case of an ascending sort, while they are the first values in case of a descending sort. This is useful while we are looking for data that is sorted in an ascending manner, and indeed that's likely to be the most common situation. But with a descending sort, we can get unexpected results.

The following query returns books from our database, putting the most recently published books at the top:

MATCH (b:Book)

OPTIONAL MATCH (b) -[p:PublishedBy]-> (c)

RETURN b.title, p.year

ORDER BY p.year DESC

LIMIT 5

We are using an OPTIONAL MATCH clause because we want to get all the books in the database, but as the database does not contain the publication year for some books, we get them at the first position:

+------------------------------+

| b.title | p.year |

+------------------------------+

| "Fairy tales" | <null> |

| "The Divine Comedy" | <null> |

| "Epic of Gilgamesh" | <null> |

| "Book of Job" | <null> |

| "The Art of Prolog" | <null> |

+------------------------------+

So, how to get the null values to stay at the bottom of the list and the most recent book at the top?

For this, COALESCE is the function we need. It takes a variable number of arguments and returns the first non-null value. A null value is returned only if all arguments are null.

Using the COALESCE function, we can write the preceding query in the ORDER BY clause as follows:

MATCH (b:Book)

OPTIONAL MATCH (b) -[p:PublishedBy]-> (c)

RETURN b.title, p.year

ORDER BY COALESCE(p.year, -5000) DESC

LIMIT 5

Here, we have instructed Cypher to sort the dataset by a value, that is, the year of publication. If it is a null value, it is considered as -5000. Now, we have the result we were expecting, as shown in the following output code:

+-------------------------------------------+

| b.title | p.year |

+-------------------------------------------+

| "Getting Started with Memcached" | 2013 |

| "Java EE 7 Developer Handbook" | 2013 |

| "Akka Essentials" | 2012 |

| "Learning Ext JS 3.2" | 2010 |

| "The Art of Prolog" | <null> |

+-------------------------------------------+

Aggregating results

In our application, we have users who rate books with a score from one to five. Now, we are going to query the database to get some aggregated information about book scores.

Counting matching rows or non-null values

Suppose that we want to know the number of users who have voted for a book. For this, we need to count the number of the vote relations between the users and that book, as shown in the following code snippet:

START b=node({id})

MATCH (b) <-[r:Vote]- (u:User)

RETURN COUNT(*) as votes

The only difference with the query patterns we already know is that here, we have used the COUNT function in the RETURN clause. With Cypher, the RETURN clause drives the aggregation of entities. In this case, as we have nothing else in the RETURN clause but the COUNTfunction, all the matching results are counted and the result is a single value. The result is as shown in the following output code:

+-------+

| votes |

+-------+

| 7 |

+-------+

1 row

The arguments inside the COUNT function can be the following:

· The * keyword is a placeholder that instructs the counter to count all matching rows

· A variable that makes the counter skip null values, missing properties, or missing matches

In the previous query, specifying the COUNT(r.score) score property will give the same result because in our database, we haven't got any null votes. However, we can have a book without any scores. Now, we are going to investigate two queries that differ only in the usage of the COUNT function. The first query is as follows:

MATCH (b:Book {title: "The Art of Prolog"})

OPTIONAL MATCH (b) <-[r:Votes]- (:User)

RETURN b, COUNT(r.score) as votes

The preceding query counts only the rows that have a non-null vote score. The following result is obtained:

+---------------------------------------------------------------+

| b | votes |

+---------------------------------------------------------------+

| Node[1037]{title:"The Art of Prolog",tags:["prolog"]} | 0 |

+---------------------------------------------------------------+

We have no votes here because no user has voted for this book. Now, look at the following query where I have changed COUNT(r.score) to COUNT(*):

MATCH (b:Book {title: "The Art of Prolog"})

OPTIONAL MATCH (b) <-[r:Votes]- (:User)

RETURN b, COUNT(*) as votes

This query gives one vote in the result, as shown in the following output code:

+---------------------------------------------------------------+

| b | votes |

+---------------------------------------------------------------+

| Node[1037]{title:"The Art of Prolog",tags:["prolog"]} | 1 |

+---------------------------------------------------------------+

In fact, we are using the COUNT(*) function to count all the rows that match the expression in the first row of the query. We have just one matching row because the presence of the votes relationship is optional. So, we have this matching row counted as one vote, which of course is wrong.

Tip

Be careful while using the COUNT(*) function versus the COUNT(variable) function, in conjunction with the OPTIONAL MATCH clause; you could get unexpected results!

Summation

If you are interested in the total score received by a book too, you can use the SUM function, as shown in the following query:

START b=node(5)

MATCH (b:Book) <-[r:Votes]- (:User)

RETURN COUNT(r.score) as votes, SUM(r.score) as total

The argument of the SUM function is the variable to summarize. The result of this query is as follows:

+---------------+

| votes | total |

+---------------+

| 5 | 14 |

+---------------+

Here, the null values will be ignored.

Average

The average score is the sum of all the scores given by the users on a book divided by the sum of the number of votes. The AVG function computes this for us, as shown in the following query:

START b=node(5)

MATCH (b:Book) <-[r:Votes]- (:User)

RETURN AVG(r.score) as avgScore

The argument of this function is the variable whose average we want to compute. Here too, the null values will be ignored.

Tip

The AVG function treats null values differently from zero. A zero is summed in the average and counted, while null values are just ignored. null values don't influence the average. For example, the average of 10 and a null value will be 10, while the average of 10 and 0 will be 5.

The result of the previous query is as follows:

+----------+

| avgScore |

+----------+

| 3.8 |

+----------+

Note that although our score data consisted of integer values, the AVG function returns a floating point value (double).

Maximum and minimum

Cypher provides the functions MAX and MIN to compute the largest and the smallest value in the property specified as argument, respectively. Consider the following query:

START b=node(5)

MATCH (b:Book) <-[r:Votes]- (:User)

RETURN MAX(r.score), MIN(r.score)

This query returns the largest and the smallest score vote given to the book. The result is as shown in the following output code:

+-----------------------------+

| MAX(r.score) | MIN(r.score) |

+-----------------------------+

| 5 | 3 |

+-----------------------------+

Again, null values are ignored.

Standard deviation

The standard deviation measures the variation from average for a set of values. In our case, it will predict how many voters agree about the average score. Consider the following query:

START b=node(5)

MATCH (b:Book) <-[r:Votes]- (:User)

RETURN AVG(r.score) as avgScore, STDEV(r.score) as stdDevScore

The preceding query returns the standard deviation with the average. The result is as follows:

+-------------------------------+

| avgScore | stdDevScore |

+-------------------------------+

| 3.8 | 0.8366600265340756 |

+-------------------------------+

The result tells us that the average is 3.8 and that users agree with the votes.

Cypher also provides other statistical aggregation functions. You can find explanations on them in the Appendix.

Collecting values in an array

If statistical functions provided by Cypher are not enough for your needs, you can collect all the values in an array so that you can easily process them with your preferred algorithm. For example, the following query returns all the score votes received for two books:

START b=node(5,6)

MATCH (b:Book) <-[r:Votes]- (:User)

RETURN b.title, COLLECT(r.score)

As you can see, we used the START keyword to instruct Cypher to start from two nodes, those with the ID 5 or ID 6, and for each of them, we got the collected values of scores. The result is shown in the following output code:

+----------------------------------------+

| b.title | COLLECT(r.score) |

+----------------------------------------+

| "Epic of Gilgamesh" | [5,4,3,4,1] |

| "The Divine Comedy" | [4,3,5,3,4] |

+----------------------------------------+

Grouping keys

To better explain how the RETURN function works with aggregation, let's try to remove the b.title column. The query would then be as follows:

START b=node(5,6)

MATCH (b:Book) <-[r:Votes]- (:User)

RETURN COLLECT(r.score)

The result is strongly different from the result of the preceding query, as shown in the following output code:

+-----------------------+

| COLLECT(r.score) |

+-----------------------+

| [3,3,5,2,1,4,3,1,1,3] |

+-----------------------+

In fact, the purpose of the b.title column in the previous query was to set a grouping key of the rows. By removing it, we instruct Cypher to collect everything together. This is quite different from SQL, where you have to explicitly specify grouping keys and return values. Cypher is more concise and I find it more intuitive than SQL.

Conditional expressions

Consider this scenario: we want to write a query that computes the average scores of books in the following two categories:

· Recent books: This category contains all books published in the year 2010 or later

· Other books: This category contains all other books published before 2010 or without a date of publication

We already know how to compute the average of a value in a group based on a property value, but here we have an arbitrary grouping based on a condition. We can use the CASE WHEN expression to express this condition. The query will be as follows:

MATCH (b:Book)<-[r:Votes]-(:User)

OPTIONAL MATCH (b) -[p:PublishedBy]-> (c)

RETURN

CASE WHEN p.year >= 2010 THEN 'Recent'

ELSE 'Old' END as category,

AVG(r.score)

In this query, we introduced a CASE WHEN expression in the RETURN clause. The expression evaluates to the Recent string if the year of publication is 2010 or later, and Old if otherwise. In the RETURN clause, we have the book category that results from the CASE WHENexpression and AVG(r.score); therefore, Cypher will group the score average by the book category. In fact, the result is as follows:

+-------------------------------+

| category | AVG(r.score) |

+-------------------------------+

| "Recent" | 4.4333333333333333 |

| "Old" | 4.9767123287671234 |

+-------------------------------+

There are two types of CASE WHEN expressions. The type we have just learned is the generic one. It is used when we have a set of alternative conditions to verify. If we just have a set of possible values, then we can use the simple form. Here is an example:

MATCH (b:Book)<-[r:Votes]-(:User)

OPTIONAL MATCH (b) -[p:PublishedBy]-> (c)

RETURN

CASE p.year % 2

WHEN 0 THEN 'Even'

WHEN 1 THEN 'Odd'

ELSE 'Unknown' END as parity,

AVG(r.score)

This query computes the score average grouped by the parity publication year. Here, the CASE WHEN statement is slightly different; we have an expression right after the CASE keyword. This expression is evaluated and compared to the values in the WHEN clauses. If one of them matches, the corresponding value expressed in the THEN clause is picked; otherwise, the ELSE value is returned. The result is as follows:

+-----------------------------+

| parity | AVG(r.score) |

+-----------------------------+

| "Odd" | 4.0 |

| "Even" | 3.4 |

| <null> | 4.4413793103448276 |

+-----------------------------+

You may wonder why do we have a null value instead of Unknown in the result. The reason is that if p.year is a null value, the expression in the case cannot be evaluated. To make null values fall in our condition, we will again resort to the COALESCE function. The query is as follows:

MATCH (b:Book)<-[r:Votes]-(:User)

OPTIONAL MATCH (b) -[p:PublishedBy]-> (c)

RETURN

CASE COALESCE(p.year % 2, -1)

WHEN 0 THEN 'Even'

WHEN 1 THEN 'Odd'

ELSE 'Unknown' END as parity,

AVG(r.score)

Separating query parts using WITH

Suppose you want the books sorted by average score, with highest score at the top; for this, we can write the following query:

MATCH (b:Book) <-[r:Votes]- (:User)

RETURN b, AVG(r.score) as avgScore

ORDER BY avgScore DESC

This query returns all books that have votes with their average score. However, that could be a huge number of books, almost the entire database. Suppose you are looking only for books with a score greater than or equal to 4. Where would you place the WHEREclause? Of course, you can't put it before the RETURN clause because the average score is computed there. We have to split the query in two: a part to compute the averages and a part for filtering and sorting.

The WITH keyword exists for this reason. The query is as follows:

MATCH (b:Book) <-[r:Votes]- (:User)

WITH b, AVG(r.score) as avgScore

WHERE avgScore >=4

RETURN b, avgScore

ORDER BY avgScore DESC

The result is as follows:

+---------------------------------------------+

| b | avgScore |

+---------------------------------------------+

| Node[171]{title:"Anna Karenina"} | 4.4 |

| Node[141]{title:"Tales"} | 4.0 |

+---------------------------------------------+

This example shows us how to re-use an existing query to perform further processing. We can perform the following tasks:

· Give an alias to computed values in the RETURN function

· Replace the RETURN keyword with the WITH keyword

· Append the new part to the previous using the variable declared in the WITH function

Splitting queries is useful to filter data, which is otherwise difficult. This is done by combining the WHERE function with the MATCH or OPTIONAL MATCH functions. For example, if we want the year of publication of the book with the best score, the following query can be used:

MATCH (b:Book) <-[r:Votes]- (:User)

WITH b, AVG(r.score) as avgScore

ORDER BY avgScore DESC

LIMIT 1

OPTIONAL MATCH (b) –[p:PublishedBy]-> ()

RETURN b.title, p.year

This query is understandable. In the first four lines, we are looking for the book with the best score. In the remaining part of the query, we get the year of publication.

The UNION statement

Suppose we want to know how many books and how many authors are stored in our database. We can perform two queries to get these values, but with the UNION statement, we can merge these data together. Just put the UNION keyword between two or more queries, as shown in the following query:

MATCH (b:Book)

RETURN 'Books' as type, COUNT(b) as cnt

UNION ALL

MATCH (a:Person)

RETURN 'Authors' as type, COUNT(a) as cnt

The result is as follows:

+-----------------+

| type | cnt |

+-----------------+

| "Books" | 150 |

| "Authors" | 142 |

+-----------------+

The only condition we must be careful of with the UNION statement is that the result set must have the same number of columns and the columns must have the same names.

You're perhaps wondering why we used the UNION ALL statement and not the UNION statement in the previous example. There is a subtle difference between them—the UNION statement removes duplicated rows after merging results, so it is slower.

Tip

If you don't care about duplicated rows, or you already know that your result set has no duplicates, use the UNION ALL statement because it faster than the UNION statement.

Summary

In this chapter, you learned some advanced techniques to query a Neo4j database. First of all, we used the WHERE statement to filter data. Text values can be filtered using regular expressions, which are very flexible and powerful tools to work with strings. Numbers can be filtered using mathematical operators and functions. Logical expressions can be built using Boolean operators (OR, AND, XOR, and NOT). Collections can be filtered using collection predicates.

Paging data is an important feature for a database, especially when it can be very large. You learned how to page data using the LIMIT and SKIP keywords. An important part of this chapter is about aggregating. You learned how to use the RETURN clause to aggregate data with the most common aggregation functions, especially COUNT, SUM, AVG, MAX, and MIN. Finally, we took a look at two useful features of this language: the WITH and UNION keywords. The WITH keyword is used to split queries in order to make them easier to write. TheUNION keyword is used to merge two or more result sets.

In the next chapter, you will learn how to create a database using Cypher writing clauses.