Designing a Schema-less Data Model - Couchbase Essentials (2015)

Chapter 6. Designing a Schema-less Data Model

In this chapter, we're going to take a step back from how to program for a Couchbase database, and instead focus on design considerations for a Couchbase application. We touched on a few of the important design ideas in the previous chapters, but we'll now explore keys and documents in greater detail.

There is no right way to design a document-based application. This notion differs significantly from relational application design. If you're an experienced developer of RDBMS-based systems, you've likely undergone the process of converting a logical model to a highly normalized database design. NoSQL design is very different.

Proper document design is tightly coupled with both your logical model and your application use cases. Moreover, use-case-based document design will vary among document databases. Designing your Couchbase documents is not necessarily the same as designing your MongoDB documents.

Since Couchbase is a hybrid data store, we'll need to consider both key/value and document designs. Here, our key/value design might differ from a key/value design for a key/value database such as Redis. As you explore both key/value and document design, you'll learn about the specifics of Couchbase that affect design decisions.

Key design

With Couchbase, you can't have a document without a key. Therefore, it's clearly important to have a strategy for key design. How you choose to generate your keys will, generally, be partly preference driven and partly use case driven. We'll start by examining what basic requirements exist for keys in Couchbase.

Keys, metadata, and RAM

We saw previously that Couchbase keys are part of a document's metadata. This fact was revealed as we explored views that used the meta argument in map functions to retrieve document IDs for indexes. Prior to Couchbase Server 3.0, all keys were kept in memory even if their documents weren't. Therefore, longer keys required more memory. Since Couchbase performs best when documents are kept in memory, smaller keys mean more RAM available for documents.

In Couchbase Server 3.0, keeping metadata in memory is still the default behavior, but it is now tunable. While you're now able to evict metadata from memory along with the documents that have been ejected (chosen using a not-recently-used strategy), fetching metadata that is not in RAM is slower. Hence, the choice of key is still important for very large datasets.
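As a rough, back-of-the-envelope sketch of why key length matters at scale, the following JavaScript compares the memory consumed by key bytes alone for two key styles. The document count and both sample keys are made-up figures, and the calculation deliberately ignores the per-document metadata overhead that Couchbase adds on top of the key:

```javascript
// Hypothetical figures for illustration only.
const documentCount = 100000000; // 100 million documents

// Bytes needed to hold a key when encoded as UTF-8.
function keyBytes(key) {
  return Buffer.byteLength(key, "utf8");
}

// Total RAM used by key bytes alone (excludes all other metadata).
function totalKeyBytes(sampleKey, count) {
  return keyBytes(sampleKey) * count;
}

const guidKey = "3f2504e0-4f89-11d3-9a0c-0305e82c3301"; // 36-byte GUID-style key
const shortKey = "u::10482913";                          // 11-byte compact key

const saved =
  totalKeyBytes(guidKey, documentCount) - totalKeyBytes(shortKey, documentCount);
console.log(`Key bytes saved: ${(saved / 1e9).toFixed(1)} GB`);
```

The point is only the order of magnitude: a few dozen bytes per key becomes gigabytes across a very large dataset.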

Predictable keys

In the relational world, primary keys are almost always autoincremented integers. You generally don't care about that primary key, as you're more likely to access a row in a table by some secondary index. Of course, there are times when you'll display a record by its ID, but it's likely that you found that record's ID via some other lookup, such as a SELECT * statement or through a foreign-key-related record.

Couchbase documents are not very different. We're able to get to documents by secondary indexes or even nonconstrained relations. However, these lookups require use of the view API. While Couchbase provides more than satisfactory performance for view lookups, these queries will never be as fast as in-memory document fetches via the key/value API.

If you're designing an application that requires extremely fast performance, avoiding the view API might be desirable. If you need to boost performance via the key/value API, you might want to have predictable keys that your application can use. Creating predictable keys does require some thought, however.

For an example of how we can use predictable keys, let's consider the simple case of a system that pushes messages to a user in a manner similar to that of Twitter or Facebook. Our version will be simplified because we won't concern ourselves with distinguishing read from unread messages. We'll assume that the system regularly updates a view with new messages and refreshes that view during each update.

In such a system, we could start with a user document that has an array of messages:

{
  "username": "jpage",
  "passwordHash": "0123456",
  "messages": [
    "Hello!",
    "Great gig!"
  ]
}

If we wanted to use a predictable key to find a user's messages, we'd have to use a key that would be accessible via some attributes of the user data; for example, we could use the username as the key. Assuming a user has to log in with their username, the application would be able to provide the key to a key/value Get operation at some interval.

While in this case we would be able to bypass the need for a view to get our user messages, it's not an ideal design. One problem is that if a user changes their username, a new document would have to be created, because keys cannot be renamed. In practice, you could retrieve the original user document, create a copy under the new key, and remove the old document. However, this approach does risk leaving an orphaned document behind if the delete operation were to fail.
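The copy-then-delete sequence can be sketched as follows. A plain JavaScript Map stands in for the bucket, and the renameKey helper is hypothetical rather than an SDK call; a real implementation would also need a plan for the case where the final delete fails:

```javascript
// In-memory stand-in for a bucket (an assumption; not the Couchbase SDK).
const bucket = new Map();
bucket.set("jpage", { username: "jpage", messages: ["Hello!"] });

// Hypothetical helper: "rename" a key by copying the document to a new key,
// then deleting the original.
function renameKey(store, oldKey, newKey) {
  if (!store.has(oldKey)) throw new Error(`No document for key ${oldKey}`);
  if (store.has(newKey)) throw new Error(`Key ${newKey} is already in use`);

  const copy = { ...store.get(oldKey), username: newKey };
  store.set(newKey, copy); // step 1: create the copy under the new key
  store.delete(oldKey);    // step 2: remove the original (an orphan remains if this fails)
}

renameKey(bucket, "jpage", "jimmypage");
console.log([...bucket.keys()]);
```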

Another potential problem is that each time the messages are retrieved, the entire document will have to be retrieved. The key/value Get operations do not support projections (selecting a subset of a document's properties). As such, it's important to consider document size for the performance of Get. It's always faster to retrieve only the data needed by your application.

While a view could be used to return only a part of the document, such as its messages, remember that this is not good practice. Views should rarely emit document details to be used as-is. Instead, we can use predictable keys to create a simple access pattern and break our documents across multiple keys.

If we break our document into smaller documents with related keys, we could use one of two approaches for using keys. The first approach would be to store the key of the other documents within the parent document; for example, a user document might hold a key reference to a userMessages document.

A second approach would be to use a variant of the predictable parent key for each of the child keys. So, if the parent key were user::jpage, then the child keys could be of the user::profile::jpage or user::messages::jpage form. In this approach, the keys hold some form of taxonomy for the documents, and this can be used to discover the document type within a map function:

function(doc, meta) {
  var keyParts = meta.id.split("::");
  if (keyParts[0] == "user") {
    emit(...);
  }
}

The preceding approach does have the added benefit of letting you avoid the need to use a type property in your documents. This benefit is less about document size (since the key might be larger) and more about not being required to maintain an extra property in your documents.

In practice, including taxonomy in the key is largely a matter of preference. The performance difference during indexing will be nominal, that is, the difference between a string split operation and a string comparison operation. However, smaller keys require less RAM for metadata. If predictability is not important, the type property approach does potentially allow for less RAM use.
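On the application side, the taxonomy-in-the-key approach might look like the following sketch. The key format follows the user::messages::jpage form described above; the childKey and parseKey helpers, the Map standing in for a bucket, and the sample e-mail address are assumptions made for illustration:

```javascript
const SEPARATOR = "::";

// Build a child key such as "user::messages::jpage".
function childKey(type, section, id) {
  return [type, section, id].join(SEPARATOR);
}

// Split a key back into its parts; parent keys have no section.
function parseKey(key) {
  const parts = key.split(SEPARATOR);
  return parts.length === 3
    ? { type: parts[0], section: parts[1], id: parts[2] }
    : { type: parts[0], section: null, id: parts[parts.length - 1] };
}

// In-memory stand-in for a bucket (an assumption; not the Couchbase SDK).
const bucket = new Map([
  ["user::jpage", { passwordHash: "0123456" }],
  ["user::profile::jpage", { email: "jpage@example.com" }],
  ["user::messages::jpage", { messages: ["Hello!", "Great gig!"] }],
]);

// Knowing only the username, the application can compute the key and
// fetch just the small messages document.
const messagesDoc = bucket.get(childKey("user", "messages", "jpage"));
console.log(messagesDoc.messages);
```

Because the key is computed from data the application already holds, no view query is needed to locate the messages.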

Unpredictable keys

As you might have surmised, Couchbase Server does not provide a mechanism to generate keys. Therefore, it is up to your application to generate unique keys. There are a couple of different strategies you might employ in creating unique keys, but generally speaking, all you should be concerned with is maintaining uniqueness.

The most common means of generating keys is to use a globally (or universally) unique identifier, typically referred to as a GUID or UUID. Most modern programming platforms support GUID generation. When creating a document, you simply have to create a new GUID and use that value when calling add or set.
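A minimal sketch of this approach in JavaScript, assuming a Node.js runtime recent enough to provide crypto.randomUUID. The Map is an in-memory stand-in for a bucket and addDocument is a hypothetical helper that mimics add semantics (fail if the key already exists):

```javascript
const { randomUUID } = require("crypto");

// In-memory stand-in for a bucket (an assumption; not the Couchbase SDK).
const bucket = new Map();

// Hypothetical helper: store a document under a freshly generated UUID key.
function addDocument(store, doc) {
  const key = randomUUID();
  if (store.has(key)) throw new Error("Key already exists");
  store.set(key, doc);
  return key;
}

const key = addDocument(bucket, { type: "user", username: "jsmith" });
console.log(key);
```

The caller has to hold on to the returned key (or find it later through a view), since nothing about the document's content predicts it.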

Storing keys

It might seem strange to title a section as Storing keys, since you don't actually have a choice as to where keys are stored. However, it's important to note that storing the key inside the document is redundant and potentially invalid if not kept in sync.

The problem of including a key in the document tends to arise when using JSON serializers to create documents from business objects. Consider the following C# class:

public class User
{
    public string Id { get; set; }

    public string Username { get; set; }

    public string Email { get; set; }
}

What happens when this class is serialized into JSON? Most likely, the Id property will be included in the document. Assuming that you expect this property to map to the key of the document, you'll want to make sure that you ignore this property during serialization. Many JSON serializers provide a means to prevent a property from being serialized. In other cases, you may have to transform the object into an object without an Id property.

On the way out, you will likely want to map the Id property of your domain object to the key of the document. Newer SDKs, such as those for Java and .NET, provide this support out of the box. In other cases, you'll simply assign the key used during a Get key/value operation or the key discovered during a view query.

These approaches are illustrated in the following C# snippets. Note that these samples intentionally bypass the built-in JSON support to demonstrate the explicit mapping of Id properties to business objects:

public class User
{
    //Don't include this field when serializing
    [JsonIgnore]
    public string Id { get; set; }

    public string Username { get; set; }

    public string Email { get; set; }
}

var key = "12345";
var user = new User
{
    Id = key,
    Username = "jsmith",
    Email = "jsmith@example.com"
};

//Serialize the User instance to JSON
//The Id property will be ignored
var json = JsonConvert.SerializeObject(user);

//Insert the User JSON
bucket.Insert<string>(user.Id, json);

//Get the JSON string from the bucket
//(Get returns an operation result; the stored string is its Value)
var savedJson = bucket.Get<string>(key).Value;

//Deserialize the JSON back into a User instance
var savedUser = JsonConvert.DeserializeObject<User>(savedJson);

//The savedUser will have a null Id at this point
//Manually set the Id property to the key
savedUser.Id = key;

From this example, you might be wondering why you need to set the Id property of the savedUser instance to the key when you already know the key. The assumption here is that your application will somehow make use of this data object and attempt to access the Id value. Suppose you were making use of this object in an HTML templating engine. You could display an Edit User link using this code:

<a href="/Edit/@user.Id">Edit User</a>

Key restrictions

Regardless of which key strategy you choose, there are a couple of minor restrictions on keys; for example, they are strings no more than 250 bytes long. Also, you cannot use spaces in your keys, but as we saw in the earlier user::jpage examples, you may use characters such as punctuation to delimit a key.
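An application might check these two restrictions before attempting a write. The validateKey helper below is a hypothetical sketch, not part of any SDK. Note that the limit is measured in bytes, not characters, so multi-byte UTF-8 characters count for more than one:

```javascript
const MAX_KEY_BYTES = 250;

// Hypothetical helper: returns null when the key is acceptable,
// or a short description of the problem otherwise.
function validateKey(key) {
  if (typeof key !== "string" || key.length === 0) {
    return "key must be a non-empty string";
  }
  if (key.includes(" ")) {
    return "key must not contain spaces";
  }
  if (Buffer.byteLength(key, "utf8") > MAX_KEY_BYTES) {
    return "key must be no more than 250 bytes";
  }
  return null;
}

console.log(validateKey("user::messages::jpage")); // null, the key is fine
console.log(validateKey("user messages jpage"));   // rejected: contains spaces
```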

Document design

Document design is a more involved activity than key design. There are far more variables to consider when creating a document's schema. Some of these factors are specific to Couchbase. Others are generally applicable to document databases.

Denormalization

When designing a relational system, you typically start with a highly denormalized logical view of your entities. That view is then normalized into a physical model where the data is spread across several tables in an effort to minimize any possible data redundancy.

Similarly, you'll likely start your document design by creating a denormalized, logical model. With this approach, your design first considers the most complete document that your domain demands. For example, if you were building a blog, you might start with a blog document with nested posts. Within each post, there would be nested comments and tags:

{
  "type": "blog",
  "title": "John Zablocki's dllHell.net",
  "author": {
    "name": "john.zablocki",
    "email": "jz@example.com"
  },
  "posts": [
    {
      "title": "Couchbase Schema Design",
      "body": [
        "Couchbase schema design..."
      ],
      "date": "2015-01-05",
      "tags": [
        "couchbase",
        "nosql",
        "schema"
      ],
      "comments": [
        {
          "comment": "Thanks for the post.",
          "user": "jsmith"
        }
      ]
    },
    {
      "title": "Azure DocumentDB",
      "body": [
        "Using Azure DocumentDB..."
      ],
      "tags": [
        "azure",
        "nosql"
      ],
      "comments": [
        {
          "comment": "Thanks for the post.",
          "user": "jsmith"
        },
        {
          "comment": "Interesting.",
          "user": "jdoe"
        }
      ]
    }
  ]
}

Next, you'll create a physical model with a goal of minimizing normalization. In other words, you'll design a schema where related entities are broken apart only when necessary.

In the previous blog example, the normalized relational model would likely include separate tables for blogs, posts, comments, and tags. The process of creating a minimally normalized document model should follow from considering use cases for your application.

A good way to start in the case of a blog is to separate posts from the parent blog. This step of normalization is important because without it, every time a blog post is read, the blog, the post, and all sibling posts will be retrieved as well. Clearly, it's best to be able to retrieve a single blog post:

//key blog_johnzablockis_dllhellnet
{
  "type": "blog",
  "title": "John Zablocki's dllHell.net",
  "author": {
    "name": "john.zablocki",
    "email": "jz@example.com"
  }
}

//key post_couchbase_schema_design
{
  "blogId": "johnzablockis_dllhellnet",
  "title": "Couchbase Schema Design",
  "body": [
    "Couchbase schema design..."
  ],
  "date": "2015-01-05",
  "tags": [
    "couchbase",
    "nosql",
    "schema"
  ],
  "comments": [
    {
      "comment": "Thanks for the post.",
      "user": "jsmith"
    }
  ]
}

The separate blog and post documents here demonstrate how to link two documents together by placing the key of one document on the related document. In this case, the post document's blogId property refers to the blog document's key (here, the key without its blog_ type prefix).

If we assume that each blog has only one author, we have another decision to make—where to put the author details. One option is to separate the authors into their own documents. If we were to take this approach, then in order to show an author's name on a post, we'd either have to get to the author document through the blog document, or include a second ID reference on posts, which would be the author document's key.

Alternatively, there is a valid approach that involves keeping the author details within the blog, and (redundantly) including the name of the author in the post document. With this approach, we avoid the need to retrieve additional documents to simply add a name for display:

{
  "blogId": "johnzablockis_dllhellnet",
  "title": "Couchbase Schema Design",
  "author": "john.zablocki",
  "body": [
    "Couchbase schema design..."
  ]
}

Tip

If you're a relational developer, this last step probably feels a bit "dirty." That's common at first when moving to NoSQL. Remember that NoSQL databases exist to make programming against databases easier and to provide optimal performance. Denormalizing data by being redundant is a tool to achieve both of these goals. Moreover, you have likely optimized a relational model yourself by denormalizing a column or two to avoid an extra join or query. However, this approach does of course require some maintenance effort to ensure data integrity.

The decision as to whether to leave comments within posts will be discussed later in this chapter, as it is a more nuanced choice to make, compared to blogs and posts. As for tags, it would be very inefficient to normalize tags into their own documents as you might do with a relational system; each tag would require a Get operation.

One reason the normalized model works so well in relational systems is that SQL joins allow related data to be gathered from several different tables and presented as a single logical result. However, joins also create overhead for queries by increasing disk access operations.

Most NoSQL systems have forgone join support and rely heavily on the cache to support rapid retrieval of several documents when a somewhat normalized document is required. Couchbase is capable of tens of thousands of operations per second on a single node. As such, multi-get overhead is generally not a concern.

Object-to-document mappings

When you design an application, you tend to model your business or domain objects much more closely with your logical model than your physical model. This conflict leads to what is often termed as the object-relational impedance mismatch, which is another way of saying that it's hard to map your domain objects to your relational model.

With document stores, you have a much easier path from the object to the document. For starters, JSON (and its binary variant) is itself a notation for describing objects. Within a document, there is built-in support for nested collections, related properties, and basic property types. More importantly, pretty much all modern programming languages have JSON serializers.

It won't always make sense to store the entirety of a domain object graph in a single document, but as a general rule, it's useful to start with this design and to allow your use cases to dictate document separation.

Data types

With relational databases, there are numerous types that may be used to create a schema. From fixed-length to variable-length strings and floating-point numbers with various precisions, SQL systems support a great number of options. The situation is quite different with Couchbase Server.

As a document store relying heavily on JSON, Couchbase needs only the few types supported by JSON. These types are strings, numbers, Booleans, arrays, objects, and null. From these types, virtually any object graph can be stored as a document. Note that dates are not addressed by the JSON standard.
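To make the point about dates concrete, the following JavaScript sketch shows a date surviving a round trip only as a string. The post object is a made-up example; the behavior shown is that of the standard JSON.stringify and JSON.parse functions, and other platforms' serializers may choose a different date representation:

```javascript
// A hypothetical post with a real Date value.
const post = {
  title: "Couchbase Schema Design",
  date: new Date(Date.UTC(2015, 0, 5)),
};

// Serializing turns the Date into an ISO 8601 string.
const json = JSON.stringify(post);
console.log(json);

// Parsing does not turn it back: the value is now just a string.
const parsed = JSON.parse(json);
console.log(typeof parsed.date);

// A reviver function can restore the Date for properties known to hold dates.
const revived = JSON.parse(json, (name, value) =>
  name === "date" ? new Date(value) : value
);
console.log(revived.date instanceof Date);
```

Whatever convention you pick for dates, it becomes part of your implicit schema, so views and every application reading the documents must agree on it.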

Document separation

There is no golden rule as to when you should separate your document into smaller documents. We saw earlier in this chapter that performance considerations might lead us to do so, but there are also a few other reasons that don't necessarily involve speed of document retrieval.

One common reason for breaking a document into smaller documents is write contention. Consider a blog post and its comments. In the following abbreviated document, we can see that each comment on a post is stored as a nested object within a comments collection. While this is certainly a valid document design, there are situations where it might not be optimal:

{
  "title": "Couchbase Schema Design",
  "body": "Designing documents...",
  "type": "post",
  "comments": [
    { "message": "Great post", "user": "rplant" },
    { "message": "I learned a lot", "user": "jpjones" }
  ]
}

Consider a situation where a blog post is quite popular and is likely to generate hundreds (or even thousands) of comments in a short period of time. In this case, it is important to understand one aspect of Couchbase document retrieval: it's all or nothing.

When you perform a Get operation on a document, the entire document is returned. While Couchbase is quite fast and that document is likely coming from the RAM (not the disk), it still means every page view would pull back all comments. If you're not displaying those comments, then you're retrieving a potentially significant amount of data for no reason.

While you may not be concerned with the transfer of unused data when the document is retrieved, there is another consideration for such a document design. If hundreds or thousands of users are trying to update a document at the same time, there will be contention for that document. Using CAS for optimistic locking is certainly a requirement here, but CAS will only prevent stale updates; it won't minimize contention.
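The read-modify-write cycle that CAS imposes can be sketched with a small in-memory simulation. The FakeBucket class, its replaceWithCas method, and the addComment helper are all hypothetical stand-ins written for this illustration; they mimic the idea of a CAS check rather than reproduce any SDK's API:

```javascript
// In-memory stand-in that tracks a CAS value per key (an assumption).
class FakeBucket {
  constructor() {
    this.items = new Map();
    this.nextCas = 1;
  }
  set(key, value) {
    this.items.set(key, { json: JSON.stringify(value), cas: this.nextCas++ });
  }
  get(key) {
    const item = this.items.get(key);
    return { value: JSON.parse(item.json), cas: item.cas };
  }
  // The write succeeds only if nobody has changed the document since it was read.
  replaceWithCas(key, value, cas) {
    if (this.items.get(key).cas !== cas) return false;
    this.set(key, value);
    return true;
  }
}

// Append a comment, re-reading and retrying whenever the CAS check fails.
// Returns the number of attempts that were needed.
function addComment(bucket, postKey, comment, maxRetries = 5) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const { value: post, cas } = bucket.get(postKey);
    post.comments.push(comment);
    if (bucket.replaceWithCas(postKey, post, cas)) return attempt;
  }
  throw new Error("Too much contention; giving up");
}

const bucket = new FakeBucket();
const postKey = "post_couchbase_schema_design";
bucket.set(postKey, { title: "Couchbase Schema Design", comments: [] });

// Two writers read the same version of the post...
const stale = bucket.get(postKey);
addComment(bucket, postKey, { message: "Great post", user: "rplant" });

// ...so the slower writer's update is rejected and must be retried.
stale.value.comments.push({ message: "I learned a lot", user: "jpjones" });
const accepted = bucket.replaceWithCas(postKey, stale.value, stale.cas);
console.log(accepted); // false: the CAS value is stale

addComment(bucket, postKey, { message: "I learned a lot", user: "jpjones" });
console.log(bucket.get(postKey).value.comments.length);
```

Every rejected write costs another full read and write of the document, which is why a heavily commented post becomes a hot spot.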

As an alternative, you could separate each comment into its own small document. In doing so, you eliminate the need to perform CAS operations and keep the post document lean:

{
  "message": "Great post",
  "user": "rplant",
  "type": "comment",
  "postId": "couchbase_schema_design"
}

To find all comments associated with a given post, you can create a view where the index is on the postId property of the comment document:

function(doc, meta) {
  if (doc.type == "comment" && doc.postId) {
    emit(doc.postId, null);
  }
}

Again, whether this approach makes sense for your situation depends primarily on the needs of your particular application. If a post document were to get only a dozen or so comments, there is little need to worry about CAS impacting performance. There would also be little concern over document size and retrieving superfluous data.

Tip

Keep in mind that with the approach of breaking nested entities into separate documents, you're increasing the RAM requirements for metadata. Having a single document means having only one key and associated metadata values.

Another reason you might consider breaking a document into separate documents is about document access patterns. Recall that Couchbase Server keeps recently used documents in the memory whenever possible. Storing related data together in a single document means potentially storing unnecessary data in the RAM.

As an example of how to design for this scenario, consider an activity where you track customers and customer orders. The denormalized approach would be to have a customer document with a nested collection of orders:

{
  "username": "tyorke",
  "type": "customer",
  "orders": [
    { "description": "microphone", "date": "2014-11-02" },
    { "description": "drum machine", "date": "2014-11-02" }
  ]
}

The problem with this document design is that a customer is likely to spend more time visiting an online store than actually creating orders. If RAM is a constraint for your cluster, you should consider moving the orders into their own document. That way, the less frequently used order details are less likely to occupy RAM when resources are constrained:

{
  "type": "customerOrders",
  "customerId": "tyorke",
  "orders": [
    { "description": "microphone", "date": "2014-11-02" },
    { "description": "drum machine", "date": "2014-11-02" }
  ]
}

It should be clear from our brief discussion of document separation that the saying "no single size fits all" holds true here. Your application, more than well-defined academic rules, will dictate how to segment your documents.

Finally, it's worth noting that documents in Couchbase are limited to 20 megabytes in size. While in practice, this limitation is rarely an issue, it should be kept in mind if you decide to store binary data or other large structures. If you reach this limit, you might be forced to separate your documents regardless of the considerations discussed previously.

Object schemas

Although schema-less databases such as Couchbase don't impose any structure on your documents, it's likely that your application will. We've already discussed the advantages of document databases in terms of natural object mapping. Another benefit of this mapping is that your application effectively defines the schema for documents in your Couchbase buckets.

Allowing your datastore to be given a schema from the application layer is not unique to document databases. Over the past decade or so, it has become common to use ORM libraries with code-first database design. With this approach, you create a domain object and allow a certain tool to create your database schema from these objects.

In the .NET world, Entity Framework will allow you to define classes in C# and then generate a database from those entities. The tables will match the class names, and the columns will match the types and names of the properties. In Ruby, Active Record allows database schemas to be created from Ruby classes. Other frameworks have similar libraries.

Code-first tends to be implicit with document databases. Since each document you create was likely the result of serializing an object into JSON, that object defined the schema for the resultant document.

There are some caveats to allowing your objects to become your document schemas. Earlier in this chapter, we saw the problem of serializing an Id property into a document. You may also want to exclude other properties from being serialized.

If we consider the brewery and beer documents from the beer-sample bucket, we'd have a Beer class in our application that has a property referencing its brewery. This property would exist primarily for the purpose of navigation between related objects:

public class Beer
{
    public string Id { get; set; }

    public string Name { get; set; }

    public string Type { get; set; }

    public string BreweryId { get; set; }

    public Brewery Brewery { get; set; }
}

If we serialized the preceding C# class, we'd end up with a nested brewery. As we know, however, these documents are separated in the beer-sample bucket. To avoid this problem, you'll need to instruct your JSON serializer to ignore certain properties. In .NET, the JSON.NET library supports attributes for this purpose:

public class Beer
{
    [JsonIgnore]
    public string Id { get; set; }

    public string Name { get; set; }

    public string Type { get; set; }

    public string BreweryId { get; set; }

    [JsonIgnore]
    public Brewery Brewery { get; set; }
}

Schema-less structure changes

An important consideration when allowing your objects to create your document schemas is versioning. A big advantage of schema-less databases is that your data model is free to change without having to deal with relational-style schema changes. For example, dropping or adding a column might lock a table or require downtime for your SQL-database-backed application.

Just because you have the flexibility of no database-imposed schema, it does not mean you are free from schema change concerns. If you are using object-to-document mappings, you've effectively created a strongly typed document database. If your object changes, it may no longer match its document, and vice versa.

It's likely your platform's JSON serializer that will determine the impact of schema changes. If your document has a property that's no longer applicable to your object, deserialization could cause a runtime error. Similarly, serializing a changed object could create variations in documents of the same type, creating unintended view results.

One approach to addressing this problem is to add a version number property to your documents. With this approach, your application and your views may react differently to changes based on the version of the document being read or written.
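One way the version-number approach might look in application code is sketched below. The schema history (a single name property in version 1, split into firstName and lastName in version 2), the version property name, and the upgradeDocument helper are all assumptions invented for this illustration:

```javascript
const CURRENT_VERSION = 2;

// Upgrade steps keyed by the version they upgrade FROM (hypothetical history).
const upgrades = {
  1: (doc) => {
    // Version 1 stored a single "name"; version 2 splits it in two.
    const [firstName, ...rest] = doc.name.split(" ");
    const { name, ...others } = doc;
    return { ...others, firstName, lastName: rest.join(" "), version: 2 };
  },
};

// Bring a document read from the bucket up to the current version.
function upgradeDocument(doc) {
  // Documents written before versioning was introduced are treated as version 1.
  let current = { version: 1, ...doc };
  while (current.version < CURRENT_VERSION) {
    current = upgrades[current.version](current);
  }
  return current;
}

const oldDoc = { type: "user", name: "Wolfgang Mozart" };
console.log(upgradeDocument(oldDoc));
```

Upgrading on read like this lets old and new documents coexist; the upgraded document can be written back lazily the next time it is saved.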

Another approach is to validate and/or modify document schemas before making any application layer changes. It is possible to write a view to find all the unique document schemas in your bucket:

function (doc, meta) {
  if (doc.type) {
    var props = [];
    for (var prop in doc) {
      props.push(prop);
    }
    emit({ "type" : doc.type, "schema" : props.sort() }, null);
  }
}

In the preceding map function, we first check whether the document has a type property. This step is not required, but we assume that any document in the bucket related to an object has a type property associated with it. After this step, each of the properties of the document is pushed into an array.

The keys of this index are JSON objects that include the document type and the sorted set of properties from the document:

{
  "id": "becca",
  "key": {
    "type": "user",
    "schema": ["email", "firstName", "lastName", "type"]
  },
  "value": null
},
{
  "id": "hank",
  "key": {
    "type": "user",
    "schema": ["firstName", "lastName", "type"]
  },
  "value": null
},
{
  "id": "karen",
  "key": {
    "type": "user",
    "schema": ["firstName", "lastName", "type"]
  },
  "value": null
}

In its current state, this view is not completely useful. However, if you add a reduce function with the built-in _count function, and group the results by setting the group option to true, then you will get a list of all unique schemas and a count of documents with those schemas:

{
  "key": {
    "type": "user",
    "schema": ["email", "firstName", "lastName", "type"]
  },
  "value": 1
},
{
  "key": {
    "type": "user",
    "schema": ["firstName", "lastName", "type"]
  },
  "value": 2
}
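If you'd like to reason about this map-and-count behavior without a running cluster, it can be approximated locally. The sketch below is only a simulation: the map function is adapted to receive emit as a parameter, countBySchema is a hypothetical helper that imitates a _count reduce with grouping, and the property values in the three sample documents are placeholders:

```javascript
// The schema map function, adapted so emit is passed in rather than global.
function map(doc, meta, emit) {
  if (doc.type) {
    const props = Object.keys(doc);
    emit({ type: doc.type, schema: props.sort() }, null);
  }
}

// Imitates a _count reduce with group=true: one row per distinct key.
function countBySchema(docsById) {
  const counts = new Map();
  for (const [id, doc] of Object.entries(docsById)) {
    map(doc, { id }, (key) => {
      const groupKey = JSON.stringify(key);
      counts.set(groupKey, (counts.get(groupKey) || 0) + 1);
    });
  }
  return [...counts].map(([key, value]) => ({ key: JSON.parse(key), value }));
}

// Placeholder documents keyed by document ID.
const docs = {
  becca: { type: "user", firstName: "B", lastName: "X", email: "b@example.com" },
  hank: { type: "user", firstName: "H", lastName: "Y" },
  karen: { type: "user", firstName: "K", lastName: "Z" },
};

const result = countBySchema(docs);
console.log(JSON.stringify(result, null, 2));
```

Grouping works here because the property names are sorted before being emitted, so two documents with the same properties in a different order still produce identical keys.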

You can also easily write a view to locate documents with or without a particular property. If you want to find all user documents without an email property, you can use the following map function:

function(doc, meta) {
  if (doc.type == "user" && ! doc.email) {
    emit(null, null);
  }
}

These views demonstrate how to find information about document schemas. With this information, you can iterate over the results and update documents to have an updated schema.

Tip

If you're willing to lose the benefits of user-defined types, you should consider using dictionary structures in your application. Dictionaries map naturally to JSON and have less risk of breaking on schema mismatches.

Object and document properties

Another advantage of having document schemas derived from classes is that your documents will inherit names and data types from your objects. Generally speaking, this behavior should be acceptable. However, there are a couple of issues we need to be aware of.

Perhaps the most important consideration here is document property names. JSON became popular for data transfer in part due to its relative terseness when compared to XML. However, with no database-defined schema, Couchbase documents repeatedly include the same schema information across potentially billions of documents.

Long names take up more RAM and more disk space. While this is not an issue for smaller apps, large datasets may need to be optimized to have smaller property names. Fortunately, most JSON serializers support property name mapping. For example, in .NET a user class could be mapped as follows:

public class User
{
    [JsonProperty("fn")]
    public string FirstName { get; set; }

    [JsonProperty("ln")]
    public string LastName { get; set; }

    [JsonProperty("t")]
    public string Type { get; set; }
}

When this class is serialized, the result is a smaller document in which the property names are shortened and the values are unchanged:

{
  "fn": "Wolfgang",
  "ln": "Mozart",
  "t": "user"
}

It's also important to understand how your JSON serializer maps property types. While strings and numbers will be consistent, dates may not be consistent. Make sure you check your platform's JSON serialization behavior.

Document relationships

Another important design consideration is about dealing with document relationships. Throughout this chapter, we saw how to separate related documents but we haven't fully discussed how to work with related documents.

With document databases, the basic approach to handling relationships involves including the ID of a related document with the relating document. We've seen this design in the previous blog sample and in the beer-sample database, where beer documents include a brewery_id property. Again, this is a convention and there is no database constraint.

Without database-enforced referential integrity, your application layer will be responsible for enforcing data validity. Once again, views may be used to identify where deficiencies in data exist. For example, if we want to find all beer names whose brewery ID is invalid, we can simply iterate over the results of the collated view example in Chapter 4, Advanced Views, looking for beer names without a matching brewery.
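The integrity check itself is simple once the documents are in hand. The sketch below swaps the collated view for a plain in-memory scan, purely to show the logic; the sample documents and the findOrphanedBeers helper are hypothetical, with only the brewery_id property name taken from the beer-sample convention:

```javascript
// Hypothetical sample documents keyed by document ID.
const docs = new Map([
  ["brewery_a", { type: "brewery", name: "Brewery A" }],
  ["beer_1", { type: "beer", name: "Beer One", brewery_id: "brewery_a" }],
  ["beer_2", { type: "beer", name: "Beer Two", brewery_id: "brewery_missing" }],
]);

// Return every beer whose brewery_id does not resolve to a brewery document.
function findOrphanedBeers(store) {
  const orphans = [];
  for (const [id, doc] of store) {
    if (doc.type !== "beer") continue;
    const brewery = store.get(doc.brewery_id);
    if (!brewery || brewery.type !== "brewery") {
      orphans.push({ id, name: doc.name });
    }
  }
  return orphans;
}

console.log(findOrphanedBeers(docs));
```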

One of the advantages of relational constraints and joins is that your object-relational mapper is able to assemble your object graph for your application. Without formal relationships in Couchbase (or other document databases), your application will have to perform multiple queries to get related documents, and manually assemble your object graph.
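Manual assembly usually amounts to one extra lookup per relationship. In this sketch a Map stands in for the bucket, and the sample documents and the getBeerWithBrewery helper are assumptions made for illustration:

```javascript
// In-memory stand-in for a bucket (an assumption; not the Couchbase SDK).
const bucket = new Map([
  ["brewery_a", { type: "brewery", name: "Brewery A" }],
  ["beer_1", { type: "beer", name: "Beer One", brewery_id: "brewery_a" }],
]);

// Fetch a beer, then fetch its brewery, and combine them into one object.
function getBeerWithBrewery(store, beerKey) {
  const beer = store.get(beerKey);                     // first lookup
  if (!beer) return null;
  const brewery = store.get(beer.brewery_id) || null;  // second lookup
  return { id: beerKey, ...beer, brewery };
}

const beer = getBeerWithBrewery(bucket, "beer_1");
console.log(beer.name, "is brewed by", beer.brewery.name);
```

The assembled brewery property exists only in the application; as with the Beer class shown earlier, it should be excluded when the beer is serialized back into a document.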

Finalizing the schema

When designing relational systems, you often end up with some data being denormalized for performance or other reasons. As joins prove costly, a typical optimization step is to create flattened tables, where redundant columns are close to the data to which they're related.

With document databases, you'll likely end where you started, with a mostly denormalized document structure. Not only does a denormalized document store related data together, it also is likely to include properties from other documents that are not primary keys.

As an example of a denormalized relationship, consider the blog post and comment example. If comments are to be displayed with their respective authors, then either numerous lookups must be made to user documents, or some subset of author details must be stored redundantly with each comment.

As with relationships based on IDs, other properties might change in their primary location, forcing your application to know how to update the redundant records. If a user changes their username, not only user documents but also all comments by that user will need to be updated.
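The update fan-out that redundancy demands might be sketched as follows. A Map again stands in for the bucket, the document keys are made up, and changeUsername is a hypothetical helper; in a real application the affected comments would be located with a view rather than by scanning every document:

```javascript
// In-memory stand-in for a bucket (an assumption; not the Couchbase SDK).
const bucket = new Map([
  ["user_rplant", { type: "user", username: "rplant" }],
  ["comment_1", { type: "comment", user: "rplant", message: "Great post" }],
  ["comment_2", { type: "comment", user: "jpjones", message: "I learned a lot" }],
]);

// Update the user document, then every comment that redundantly stores the
// old username. Returns the number of comments that were rewritten.
function changeUsername(store, userKey, newUsername) {
  const user = store.get(userKey);
  const oldUsername = user.username;
  store.set(userKey, { ...user, username: newUsername });

  let updated = 0;
  for (const [key, doc] of store) {
    if (doc.type === "comment" && doc.user === oldUsername) {
      store.set(key, { ...doc, user: newUsername });
      updated++;
    }
  }
  return updated;
}

const updatedCount = changeUsername(bucket, "user_rplant", "robertplant");
console.log(`Updated ${updatedCount} comment(s)`);
```

Nothing makes these writes atomic as a group, so a failure partway through leaves some comments showing the old name until the update is retried.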

Summary

As we saw in this chapter, designing Couchbase documents is partly art and partly science. More than with relational systems and most other NoSQL systems, schema-less design in Couchbase requires great care, not least because Couchbase is a hybrid key/value and document store.

Many developers choose Couchbase for its performance. Designing a document-based system for scaling involves a unique set of constraints and concerns. Other developers choose Couchbase for its flexibility. Designing a document-based system for flexibility raises several unique considerations for applications.

Those developers who choose Couchbase for both its flexibility and its scalability have the added challenge of trying to tweak performance without sacrificing the flexibility of a document database.

It's always tempting to approach system design by sticking to what we know. It's important to remember that Couchbase is a truly unique system, and your document design will not necessarily seem obvious at first. However, you shouldn't be afraid to allow some parts of your design to feel relational and others to feel nonrelational.

In the next chapter, we're going to continue to explore application designs in a schema-less world. While creating a simple, Couchbase-based web application, we'll be able to work through several issues we explored in this chapter.