MongoDB Database Fundamentals - Pentaho Analytics for MongoDB (2014)

Pentaho Analytics for MongoDB (2014)

Chapter 2. MongoDB Database Fundamentals

The previous chapter introduced the document-oriented storage of MongoDB, and we learned that MongoDB JSON-style documents are built on two primary data structures, a nested collection of documents and arrays. This chapter elaborates on this topic and explains the MongoDB database concepts essential for learning how to query MongoDB data with Pentaho. We will start by discussing MongoDB database objects and design methods and then learn how to query the restored sample clickstream database using the mongo shell.

The following are the topics that we will cover in this chapter:

· MongoDB database objects

· Sample MongoDB clickstream database

· MongoDB data modeling

· MongoDB query methods

By the end of this chapter, you will have a foundational knowledge of MongoDB design and the primary query methods needed to start retrieving data from the sample database.

MongoDB database objects

We will begin our journey of understanding MongoDB database objects by comparing each object to its equivalent RDBMS object when applicable. There are some similarities between MongoDB and RDMBS objects, although the object names are different. MongoDB object names include databases, collections, documents, fields, and indexes.

The highest level container in a MongoDB server instance is a database. A database consists of one or more collections. A collection holds a grouping of documents just as a RDBMS table holds a grouping of records. A document is made up of one or more fields, which as you probably guessed, are similar to table fields (that is, columns). Relational databases define fields at the table level while document-oriented databases define fields at the document level. Additionally, as with a RDBMS, MongoDB allows you to create indexes on fields for better sorting and lookup performance.

The following figure shows a basic object relationship in a MongoDB database with one collection containing two documents and a total of three fields. The mandatory ObjectId fields, the unique keys for a collection, are left out intentionally to keep the illustration simple.

MongoDB database objects

Typically, all documents in a collection will be similar and serve a related purpose. However, you will notice in the previous figure that Document 2 has a third field, [referring_url], which does not exist in the Document 1. This highlights a major difference between MongoDB document collections and RDBMS tables that documents within the same collection can have different fields. In other words, MongoDB collections do not enforce a schema.

People often refer to this as having a dynamic schema or schema-less database. A dynamic schema gives developers the ability to define and manipulate the database on the fly by inserting new fields, such as the [referring_url] field, which do not already exist in other documents of the same collection. This level of schema flexibility allows developers to model documents to closely resemble software application objects. In reality, you will find most documents within a single collection sharing a common structure, thus making downstream reporting much easier.

Sample clickstream database objects

The sample pentaho database consists of a fictional data model and data used to capture all of the web events that occur when users interact with a web application. These online user interactions, often referred to as clickstream data, give marketers insight into website user behavior. The sample database contains three collections for capturing clickstream events by session.

The definition of a session can vary across companies and web applications; however, it will always have a defined start point and end point. We don't need to worry about how a session is defined. Instead, just know that a unique web session is identified in the database by the unique identifier field id_session, and all events that occur within a single session will all have the same id_session value.

The following table provides a list of objects that form the fundamental building blocks of a MongoDB database and the associated object names in the pentaho database.

MongoDB objects

Sample MongoDB object names

Database

pentaho

Collection

sessions, events, and sessions_events

Document

The sessions document fields (key-value pairs) are as follows:

· _id: 512d200e223e7f294d13607

· ip_address: 235.149.145.57

· id_session: DF636FB593684905B335982BEA3D967B

· browser: Chrome

· date_session: 2013-02-20T15:47:49Z

· duration_session: 18.24

· referring_url: www.xyz.com

The events document fields (key-value pairs) are as follows:

· _id: 512d1ff2223e7f294d13c0c6

· id_session: DF636FB593684905B335982BEA3D967B

· event: Signup Free Offer

The sessions_events document fields (key-value pairs) are as follows:

· _id: 512d200e223e7f294d13607

· ip_address: 235.149.145.57

· id_session: DF636FB593684905B335982BEA3D967B

· browser: Chrome

· date_session: 2013-02-20T15:47:49Z

· duration_session: 18.24

· referring_url: www.xyz.com

· event_data: [array]

· event: Visited Site

· event: Signup Free Offer

MongoDB data modeling

MongoDB data modeling is an advanced topic that would consume multiple chapters in this book if covered in depth. We will instead focus on a few key points regarding data modeling so that you will understand the implications for your report queries. These points touch on the strengths and weaknesses of embedded data models.

The most common data modeling decision in MongoDB is whether to normalize or denormalize related collections of data. Denormalized MongoDB models embed related data together in the same database documents while normalized models represent the same records with references between documents. This modeling decision will impact the method used to extract and query data using Pentaho, because the normalized method requires joining documents outside of MongoDB.

Tip

With normalized models, Pentaho Data Integration is used to resolve the reference joins.

The table in the previous section shows three collections in the pentaho database: sessions, events, and sessions_events. The first two collections, sessions and events, illustrate the concept of normalizing the clickstream data by separating it into two related collections with a reference key field in common. In addition to the two normalized collections, the sessions_events collection is included to illustrate the concept of denormalizing the clickstream data by combining the data into a single collection.

Normalized models

Because multiple clickstream events can occur within a single web session, we know that sessions have a one-to-many relationship with events. For example, during a single 20-minute web session, a user could invoke four events by visiting the website, watching a video, signing up for a free offer, and completing a sales lead form. These four events would always appear within the context of a single session and would share the same id_session reference.

The data resulting from the normalized model would include one new session document in the sessions collection and four new event documents in the events collection, as shown in the following figure:

Normalized models

Each event document is linked to the parent session document by the shared id_session reference field, whose values are highlighted in red.

This normalized model would be an efficient data model if we expect the number of events per session to be very large for a couple of reasons. The first reason is that the current version of MongoDB limits the maximum document size to 16 megabytes, so you will want to avoid data models that create extremely large documents. The second reason is that query performance can be negatively impacted by large data arrays that contain thousands of event values. This is not a concern for the clickstream dataset, because the number of events per session is small.

Denormalized models

The one-to-many relationship between sessions and events also gives us the option of embedding multiple events inside of a single session document. Embedding is accomplished by declaring a field to hold either an array of values or embedded documents known as subdocuments. The sessions_events collection is an example of embedding, because it embeds the event data into an array within a session document. The data resulting from our denormalized model includes four event values in the event_data array within thesessions_events collection as shown in the following figure:

Denormalized models

As you can see, we have the choice to keep the session and event data in separate collections, or alternatively, store both datasets inside a single collection. One important rule to keep in mind when you consider the two approaches is that the MongoDB query language does not support joins between collections. This rule makes embedded documents or data arrays better for querying, because the embedded relationship allows us to avoid expensive client-side joins between collections. In addition, the MongoDB query language supports a large number of powerful query operators for accessing documents by the contents of an array. A list of query operators can be found on the MongoDB documentation site at http://docs.mongodb.org/manual/reference/operator/.

To summarize, the following are a few key points to consider when deciding on a normalized or denormalized data model in MongoDB:

· The MongoDB query language does not support joins between collections

· Currently, the maximum document size is 16 megabytes

· Very large data arrays can negatively impact query performance

In our sample database, the number of clickstream events per session is expected to be small—within a modest range of only one to 20 per session. The denormalized model works well in this scenario, because it eliminates joins by keeping events and sessions in a single collection. However, both data modeling scenarios are provided in the pentaho MongoDB database to highlight the importance of having an analytics platform, such as Pentaho, to handle both normalized and denormalized data models.

MongoDB query methods

The MongoDB query language is built from the ground up as a powerful way to retrieve, process, and update documents. In this section, we are going to use the mongo shell interface to learn fundamental MongoDB query structures and practice querying documents from the pentaho database. We will build on this query knowledge in the next chapter by replacing the mongo shell with Pentaho for retrieving data from MongoDB.

Tip

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Query exercise 1

Let's connect to the mongo shell and execute some often used help commands:

1. Establish a connection to MongoDB using the mongo.exe shell:

2. C:\mongodb\bin\mongo.exe

3. Show the available databases:

4. show dbs

5. Show a list of general help methods:

6. help

7. Show help for database methods:

8. db.help()

9. Switch to the sample database, pentaho, and show the available collections:

10.use pentaho

11.show collections

Note

The console window output will display three collections in the pentaho database: events, sessions, and session_events.

12. Count the number of documents in the sessions collection:

13.db.runCommand( { count: 'sessions' } )

Note

The console window output will display, { "n":1000, "ok":1 }, indicating a count of 1000 documents.

14. Show help for sessions collection methods:

15.db.sessions.help()

Note

The console window output will show help for the session collection including the following find() methods which are often used to retrieve data:

· db.sessions.find(….).count()

· db.sessions.find(….).limit(n)

· db.sessions.find(….).sort(….)

Read operations

The MongoDB query language supports the four CRUD database operations. CRUD is an acronym for the following database operations: create, read, update, and delete. This book focuses only on read operations, which are used to retrieve documents from MongoDB for reporting and analysis. Read operations include all the operations needed to query MongoDB, and the find() method is the primary method used to retrieve MongodDB documents. The find() method is similar to the select statement in the SQL query language. MongoDB queries accept objects and arguments, using the following syntax for the find() method:

db.collection.find( <query>, <projection> )

You must first specify a database that you will be working with. Once you have selected a database, you can define a collection within your query using the db.collection object followed by an optional query argument and projection argument.

Note

The MongoDB query language supports the querying of only one collection per query. We will learn in later chapters how to use Pentaho Data Integration to get around this limitation.

The <query> argument specifies the criteria for which documents to retrieve within the collection. If you don't specify a query argument, all of the documents in the collection will be returned. The <projection> argument specifies what fields to return in your query results; likewise, if you don't specify a projection, all fields in the collection will be returned.

Query exercise 2

In this exercise, we give MongoDB queries a try using the pentaho database connection established in the previous exercise by performing the following steps:

1. Query the sessions collection and retrieve all of the documents and fields in the collection:

2. db.sessions.find ()

Note

Using the find() method without a query argument will return all of the documents in the collection. The console window output for this query will display all of the available fields in the first 20 documents.

3. Query the sessions collection to retrieve only the web sessions with a Chrome browser by adding equality criteria to the query argument:

4. db.sessions.find ( {browser: 'Chrome'} )

Note

The console window output will display all of the available fields in the first 20 documents that match the criteria of having a Chrome browser.

5. Restrict the fields returned by the query to only the id_session and browser fields by adding a projection argument:

6. db.sessions.find ( {browser: 'Chrome'}, {id_session: 1, browser: 1} )

Note

After the projection field name, you specify either 1 to include a field or a 0 to exclude a field. The console window outputs the id_session, browser, and the system default, _id, fields in the first 20 documents that have a Chrome browser.

Query operators

The find() method is a container for a more powerful way to create queries using the $query operator. The $query operator can be used to define document queries by comparison, logic, data type, array size, and others. For example, if we wanted to query the sessions collection for sessions that have a Chrome browser and session duration in minutes greater than or equal eighteen minutes, we would add the $gte comparison operator as follows:

db.sessions.find ( { browser: 'Chrome', duration_session: {$gte: 18} } )

By default, the two fields used to specify the filter criteria are joined with the and conjunction. If we want to query for sessions that have a Chrome browser or session duration greater than or equal eighteen minutes, we need to add the $or and $gte query operators. The $or operator evaluates two expressions and selects the documents that satisfy at least one of the expressions as follows.

db.sessions.find ( $or: [ { browser: 'Chrome'}, {duration_session: {$gte: 18}] } )

The examples of other comparison operators include $lt for less than, $lte for less than or equal, $gt for greater than, and $ne for not equal.

Querying arrays

Data arrays contain a set of values that are often used to filter MongoDB queries. When specifying equality criteria for multiple values in an array, MongoDB will look for an exact match of the array values and the order of those values. First, let's query for a single event in the array. You will recall that the sessions_events collection holds the event_data array containing clickstream events. To find all of the sessions that contain the Added Item To Cart event, we issue the following query against the event_data array:

db.sessions_events.find ( {"event_data.event": 'Added Item To Cart'})

Note

The console window will display the first 20 documents that contain this event along with any other events that occurred during the session.

The session with id_session = E8995C988FD3441AA3077DE435AFC3EC contains the following event sequence: Visited Site, Watched Video, Signup Newsletter, Added Item To Cart, and Completed Lead Form. In the next query, we will look for two events, Visited Site and Watched Video, and see how MongoDB handles equality criteria for multiple events. To find sessions where users visited the website and then watched a video in that exact order, we simply add both events to the query as follows:

db.sessions_events.find ( {"event_data.event": [ 'Visited Site', 'Watched Video' ] } )

Note

The console window will display only five documents that match the exact criteria of someone who visited the website and then watched a video. It does not include the session we identified with these two events, because the default filter on multiple events in an array is to look for an exact match.

So, filtering on a single value uses a contains equality match, while filtering on multiple values restricts the query to only exact matches. This is too restrictive for scenarios when you would like find documents that contain these two events along with other possible events that occurred in the same session. The MongoDB query language includes the $all query operator to expand the equality criteria and retrieve sessions that contain both of the specified events, shown as follows:

db.sessions_events.find ( {"event_data_event": {$all: [ 'Visited Site', 'Watched Video' ] } } )

Note

The console window will display many more documents that match the criteria of someone visiting the website and watching a video in addition to performing any other sequence of events in a web session.

Query exercise 3

In this exercise, you will query MongoDB to find web sessions where a user watched a video, signed up for a free offer, and added an item to their shopping cart. For this exercise, assume we don't care about the order of the events or occurrence of other events in the same session. The steps for this are as follows:

1. Establish a connection to MongoDB using the mongo.exe shell:

2. C:\mongodb\bin\mongo.exe

3. Switch to the pentaho database:

4. use pentaho

5. Query the sessions_events collection, and retrieve all of the documents and fields in the collection:

6. db.sessions_events.find ()

Note

The console window output will display all of the available fields in the first 20 documents.

7. Query the sessions_events collection to retrieve web sessions that contain the following three events: Added Item To Cart, Signup Free Offer, and Watched Video:

8. db.sessions_events.find ( {"event_data.event": {$all: ['Added Item To Cart', 'Signup Free Offer', 'Watched Video'] } } )

Note

The console window output will display all of the available fields in the first 20 documents that match your criteria.

9. Count the number of web sessions that contain the following three events: Added Item To Cart, Signup Free Offer, and Watched Video:

10.db.runCommand ( {count: 'sessions_events',

11.query: {"event_data.event": {$all: ['Added Item To Cart', 'Signup Free Offer', 'Watched Video"] } } })

Note

The console window output will display a count of 37 documents that match your criteria.

Summary

Congratulations! Now, you know how to query the clickstream database using the mongo shell! You can add any combination of events to your array query and gain a powerful insight into the sequences of events that occur with online user behavior. The next chapter expands on this query knowledge and introduces you to the Pentaho interface to query MongoDB.