Part IV: Infrastructure

Chapter 14. The persistence layer

The possession of facts is knowledge; the use of them is wisdom.

—Thomas Jefferson

There was a time when most of the effort involved in designing and building a software system went into the data access layer. The role of the data model was crucial and central to the organization of the rest of the system. The data model—and more often than not, the relational data model—was the first and most important step on the agenda.

Is this step really no longer the most important step today in the building of a software system?

The advent of Domain-Driven Design (DDD) set the ground for a paradigm shift in which it’s the business, and no longer the data, that is the foundation of the software design. As discussed in Chapter 8, “Introducing the Domain Model,” when you do DDD you’re not necessarily using an object-oriented model. For example, you could have a functional model instead. In any case, the introduction of a conceptual model—whether it’s object-oriented or functional—makes the persistence of data a secondary aspect.

To the application’s eyes, the source of data is no longer the physical database. It is, instead, the logical model built on top of the business domain. For obvious reasons, the logical model has to be persisted. However, this becomes a simpler infrastructure concern, and persistence is not necessarily bound to relational database management systems (DBMSs).

In this chapter, we analyze the persistence layer—namely, the portion of the system’s infrastructure that deals with the persistence of data. We’ll first define what is expected to be in the persistence layer and then focus on the patterns and implementation details.

Portrait of a persistence layer

At some point, almost all software these days needs to access infrastructure components for reading data, presenting data back to the user, or saving results. With the advent of DDD, the segment of code that deals with reading and writing data that survives sessions has been segregated from the core of the system. The persistence layer is the name commonly used to refer just to the segment of code that knows about the nitty-gritty details of data access: connection strings, query languages, indexes, JSON data structures, and the like.

Responsibilities of the persistence layer

Let’s start by identifying the responsibilities of a persistence layer. The persistence layer is usually created as a class library and is referenced by the domain layer (specifically, domain services) as well as the application layer. In turn, the persistence layer references any data-access-specific technology, whether an Object/Relational Mapper (O/RM) such as Entity Framework or NHibernate, ADO.NET, a NoSQL database, or even external data services.

Saving permanent data

The persistence layer offers a bunch of classes that, first and foremost, know how to save data permanently. Permanent data is data processed by the application and available for reuse at a later time. Note that not every system today needs to write data of its own.

Sometimes you are called to write just a segment of a larger system—a bounded context of a top-level architecture. It might be that you are just given a set of URLs to get or save data. Yet, the system needs to have a sort of black hole where calls for data persistence end up. This is the persistence layer.

Handling transactions

Especially if you use the Domain Model pattern, you might need to perform write operations in the context of a transaction—sometimes even a distributed transaction. The persistence layer should be aware of the transactional needs of the application. However, the persistence layer should directly handle only transactions that form a unit of work around some data access.

More concretely, this means that the persistence layer should take care of updating multiple tables within the boundaries of an aggregate in a single unit of work. The persistence layer, however, should not be involved in the handling of broader transactions that involve other components and take place within an application service (or a saga if you use an event-driven architecture).

In a nutshell, the transactional responsibilities of the persistence layer don't exceed the boundaries of plain data access in the context of a data aggregate. Everything else should be handled at a higher level, whether through the Microsoft .NET TransactionScope class, distributed transactions, or step-by-step rollback/compensation policies within the use-case workflow.
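Here's a minimal sketch of this division of labor: a hypothetical application service coordinates the broader transaction through TransactionScope, while each repository persists only its own aggregate. All class and method names are illustrative:

using System.Transactions;

public class CheckoutService
{
    private readonly IOrderRepository _orderRepository;
    private readonly IInventoryRepository _inventoryRepository;

    public CheckoutService(IOrderRepository orders, IInventoryRepository inventory)
    {
        _orderRepository = orders;
        _inventoryRepository = inventory;
    }

    public void Checkout(Order order, Reservation reservation)
    {
        // The application service owns the use-case transaction;
        // each repository handles just its own unit of work inside it.
        using (var scope = new TransactionScope())
        {
            _orderRepository.Save(order);
            _inventoryRepository.Save(reservation);

            // If Complete is never called, everything rolls back on Dispose.
            scope.Complete();
        }
    }
}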

Reading permanent data

The persistence layer is in charge of reading data from any permanent store, either database tables, files, or HTTP services. Especially in a Command/Query Responsibility Segregation (CQRS) scenario, the persistence layer that focuses on reading data might be physically separated from the persistence layer that deals with commands and writes.

For performance reasons, it might be desirable for reading to be offloaded to distinct servers and leverage cached data. This is a key aspect for any system that needs to serve millions of pages on a monthly basis. Surprisingly, for such sites (think, for example, of news and media sites or airline and booking sites) caching is far more important than algorithms, gzipping results, grouping scripts at the bottom, or taking advantage of other tips commonly associated with the idea of improving web performance.

The persistence layer is also the ideal place to centralize some caching strategy for the content of the data source.

Design of a Repository pattern

Today, the persistence layer is traditionally implemented using the Repository pattern. A repository is a class where each method represents an action around the data source—whatever that happens to be. This said, the actual structure of the repository classes might vary quite a bit in different scenarios and applications. Let’s review the basics of the pattern first.

The Repository pattern

According to Martin Fowler, a repository is a component that mediates between the domain model and data-mapping layers using a collection-like interface for accessing domain objects. (See http://martinfowler.com/eaaCatalog/repository.html.) This definition seems to closely match the public interface of the root context object that most O/RM frameworks expose.

A repository is expected to wrap up as collections all persistent data to be managed. Collections can be queried and updated. In a nutshell, a repository is just the interfacing layer that separates the domain model (or, more generally, the business logic) from data stores.

While the definition from Fowler is broadly accepted, it's still a bit too abstract and doesn't drill down into concrete aspects of implementation. A repository impersonates the persistence layer and is expected to have the same responsibilities we just outlined. The repository performs data access using one or more specific data-access technologies, such as Entity Framework, NHibernate, ADO.NET, and so forth.

Figure 14-1 provides a view of how a repository fits in a layered architecture. Recall, as discussed in Chapter 8, that you should aim at having one repository class per aggregate.


FIGURE 14-1 Fitting a repository in a layered architecture.

Having repositories is a common practice today, though repositories might take slightly different forms in different applications. The benefits of having repositories can be summarized in the following points:

- Achieves separation of concerns

- Reduces the potential for duplicate data-access code

- Increases the amount of testable code in the application and domain layers by treating data-access code as an injectable component

In addition, a set of well-isolated repository classes lays the groundwork for some applications to be deployed with one of a few possible data-access layers targeting different databases.


Important

Honestly, we don't think that scenarios where the same application must be able to support radically different data sources in different installations are as common as many seem to think. Sure, there might be applications that need to be designed to support Oracle, SQL Server, or perhaps MySQL data stores. Overall, we believe that these applications are not common. It is much more common that applications use just one data source that might change over time as the product evolves. Having repositories surely helps when switching data sources, but this is not an everyday scenario.


The Unit of Work pattern

What’s the granularity of a repository? Would you use a single repository class for the entire data source? Or should you use a repository class for each aggregate? (As mentioned in Chapter 8, the term aggregate is specific to the Domain Model; however, it is a rather general concept for which you end up having some sort of implementation in most cases, even if you don’t use the Domain Model.)

Whether you envision the repository as the persistence layer of a single aggregate or as a single entry point for the entire data source, the implementation must be aware of the Unit of Work (UoW) pattern.

UoW is defined as the list of operations that form a business transaction. A component, like the repository, that supports the pattern coordinates the writing out of changes in a single physical transaction, including the resolution of concurrency problems. The definition is taken from http://martinfowler.com/eaaCatalog/unitOfWork.html.

At the end of the day, supporting units of work means enabling callers to arrange logical transactions by composing together operations exposed by the repository. A repository built on top of, say, Entity Framework will likely use the DbContext object to create a unit of work that will be transactionally persisted by calling the SaveChanges method.
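In a minimal sketch, composing a unit of work looks like this. YourDataContext, Orders, and OrderItems are placeholder names:

using (var db = new YourDataContext())
{
    // Both Add calls join the same unit of work tracked by the context.
    db.Orders.Add(newOrder);
    db.OrderItems.Add(newItem);

    // A single SaveChanges call persists everything in one
    // physical database transaction.
    db.SaveChanges();
}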

Repository pattern and CQRS

Today the design and implementation of repositories is bound to the supporting architecture of choice. If you do CQRS, you typically want to have repositories only in the command stack—one repository class per aggregate. The command stack repository will be limited to write methods (for example, Save) and one Get method capable of returning an aggregate by ID. We’ll detail the structure of such a command repository in the next section.

In a CQRS scenario, the read stack doesn’t typically need a set of repository classes. As discussed in Chapter 10, “Introducing CQRS,” and Chapter 11, “Implementing CQRS,” in CQRS the query stack is fairly thin and consists only of a data-access layer (for example, based on LINQ and Entity Framework) that returns data-transfer objects (DTOs) ready for the presentation layer. You just don’t need the extra complexity of an additional layer like repositories.

Repository pattern and Domain Model

When the supporting architecture is Domain Model, you have a single domain layer with no explicit separation between the command and query stack. We thoroughly discussed this architecture in Chapter 8 and Chapter 9, “Implementing the Domain Model.”

In a Domain Model scenario, like the one presented in Evans's book, you have a single repository per aggregate. The repository class will handle both queries and commands. The implementation of the aggregate repository, therefore, will have methods like Save and Delete as well as a bunch of query methods (depending on the specific needs of the domain) that return ad hoc DTOs as appropriate.


Important

At this point in the book, we wouldn’t be surprised to find out that readers are deeply debating the role of the Domain Model pattern. The question we want to raise is, does it really make sense any more to go with a single stack as in the Domain Model instead of a more flexible and lightweight CQRS architecture? We consider CQRS to be the state-of-the-art architecture today for any sort of bounded context that requires a bit of complexity beyond basic two-tier CRUD. In light of this, the only type of repository class you are going to write is what we call here the “command” repository. It’s the only repository you likely need today, whether you do CQRS or Domain Model. If you do Domain Model, however, you just add more query methods to the interface outlined next.


The interface of a command repository

Whether you use CQRS or Domain Model, you likely need a repository for write actions to be performed on the aggregates.

A repository is based on a generic interface; the interface is usually defined in the domain layer. The implementation of the interface, on the other hand, usually goes in a separate assembly part of the infrastructure layer. Here’s the interface that represents a common starting point for all repositories. Note that it might also be convenient to have a read-only property to access the UoW object:

public interface IRepository<TAggregate, in TKey> where TAggregate : IAggregateRoot
{
    TAggregate Get(TKey id);
    void Save(TAggregate aggregate);
    void Delete(TAggregate aggregate);
}

We have a few comments to make about this.

First, the IAggregateRoot marker interface to identify aggregate types is not strictly necessary, and you can even avoid it altogether if it turns out to be just a marker interface. You can just use class or new() in the where clause.

Second, using an aggregate type is highly recommended, but it is not infrequent to see repositories built for individual entities. This happens either because of a simple model that just doesn’t have significant aggregate boundaries or because the designer can’t see boundaries and ends up treating every domain entity as an aggregate.

Third, the Delete method seems to be an obvious requirement for a repository interface. However, this is true only if you look at repositories through the lens of CRUD and with a largely database-oriented mindset. In a real-world business domain, however, you don't delete anything. So if you really design your system paying full respect to the ubiquitous language—and build a ubiquitous language that faithfully reflects the business domain—you don't need a Delete method in all repositories. You likely have forms of logical deletions that end up being a variation of the Save method at the repository level.

Fourth, mostly the same can be said for the Save method, which encompasses the role that an Add method would have in a classic CRUD interface.

Finally, it makes sense to have a Get or FindById method that takes the aggregate ID and returns an aggregate instance.
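Pulling these observations together, here's a trimmed-down sketch of what a command repository for a sample Order aggregate might look like. The Order type, its IsCancelled property, and the interface name are illustrative:

// Optional marker; replaceable with a plain class or new() constraint.
public interface IAggregateRoot
{
}

public class Order : IAggregateRoot
{
    public int Id { get; set; }
    public bool IsCancelled { get; set; }   // Supports logical deletion
}

// No Delete method: logical deletions flow through Save.
public interface IOrderCommandRepository
{
    Order Get(int id);
    void Save(Order aggregate);
}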

Implementing repositories

The overall structure of a repository can be split into two main parts: query and update. In a CQRS solution, you might have two distinct sets of repositories (and for the query part, probably no repositories at all). Otherwise, the same class incorporates both query and update methods. As mentioned, you will generally have a repository class for each aggregate or relevant entity in your system.

The query side of repositories

Built around an aggregate type, a repository might return an entire graph of objects. Think, for example, of an Order type. When you implement the FindById method, what are you going to retrieve and return? Most likely, the order information—all details and information about the customer and products. Maybe not in this case, but this approach in general lays the ground for a potentially large graph to retrieve and return to upper layers.

What could be an alternate approach?

Options for prototyping query methods

The first option that comes to mind is having multiple query methods that address different scenarios, such as methods that return a different projection of the same data (fewer properties, fewer calculated properties, or both) or an incomplete graph of objects.

This option makes for a rather bloated repository class, at least for some of the aggregates. To cut down the number of methods, you can have a single method that accepts a predicate through which you specify the query criteria. Here's an example:

IEnumerable<TAggregate> FindBy(Expression<Func<TAggregate, bool>> predicate);
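For illustration, here's one possible Entity Framework implementation of such a method. The _database field is assumed to be a DbContext-derived class like the one shown later in this chapter:

public IEnumerable<TAggregate> FindBy(Expression<Func<TAggregate, bool>> predicate)
{
    // Set<TAggregate>() returns the DbSet for the aggregate type; the
    // predicate is translated to SQL by the LINQ provider, and ToList()
    // materializes the results before returning.
    return _database.Set<TAggregate>().Where(predicate).ToList();
}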

A deeper analysis, however, reveals that the major problem of building the query side of a repository is not getting query criteria but is in what you return. Unless you choose to have individual methods for each scenario, the best you can do is return IEnumerable<TAggregate> as shown earlier.

The point is that in general there's no guarantee that once you've called a repository query method you're done and have exactly the data you need to present back to the users. When complex business logic is involved, more filters might need to be applied to reach the desired subset of data for a given use-case. In doing so, and also depending on the structure of your code, you might need to create and maintain multiple data-transfer objects (DTOs) on the way to the presentation layer. Using DTOs also means using adapters and mapper classes to copy data from entities to DTOs, which means more complexity, at least as far as the number of classes (and unit tests) is concerned.

There's also a subtler point that goes against the use of predicates that return collections of aggregates. As an architect, you learn about the system to build from the semantics of the ubiquitous language. The ubiquitous language, though, also has a syntax made of specific verbs and nouns. When the domain expert says that the system must return all inbound invoices for more than $1,000 that have not been paid yet, she is actually referring to "inbound invoices" that are then filtered by "amount" and then further filtered by "payment state."

The following LINQ expression renders it clearly:

from invoice in Database.InboundInvoices.NotPaid()
where invoice.Total > 1000
select invoice;

The preceding code snippet is a plain description of the query to write, and it is fairly expressive because it is based on a number of custom extension methods. It involves neither multiple database queries nor multiple LINQ in-memory queries built on top of a single database query that initially returns all invoices.

What you see is an abstract query that can be built in a single place—a repository method—or across multiple layers, including repositories, the domain layer, and the application layer. The query finally gets executed only when data is actually required—typically, in the application layer.
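The NotPaid method in the snippet is not a built-in operator; it would be a custom extension method along the following lines. The InboundInvoice type and its IsPaid property are assumptions:

public static class InvoiceQueryExtensions
{
    // A composable filter: it returns a still-unexecuted IQueryable
    // that upper layers can refine before the query hits the database.
    public static IQueryable<InboundInvoice> NotPaid(this IQueryable<InboundInvoice> invoices)
    {
        return invoices.Where(invoice => !invoice.IsPaid);
    }
}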

In short, we see two main options for building the query side of a repository class:

- Use a repository class with as many query methods as required and prototyped as appropriate. This option has the potential to introduce some code duplication and, in general, increases the size of the codebase because of the DTOs you might need to create at some point.

- Have very few query methods that return IQueryable<TAggregate> objects for the upper layers of the code to close the query by indicating the actual projection of data. This would define the actual query to run against the database. This approach minimizes the database load as well as the number of DTOs involved and keeps the overall size of the codebase to a minimum.

Our personal sentiment is that there’s little value in having repositories for the query side of the system.


Note

Returning IQueryable types is what we have defined in past chapters as LET, or Layered Expression Trees. In particular, you can refer to Chapter 10 for the pros and cons of LET.


Asynchronous query methods

Starting with version 6, Entity Framework supports asynchronous query operations. The DbContext class is enriched with a list of async methods, such as ToListAsync, FirstOrDefaultAsync, and SaveChangesAsync. Therefore, you can query and save using the popular async and await keywords that were introduced with .NET 4.5. Here's an example:

public async Task<IList<ExpenseCategoryV2DTO>> GetExpenses()
{
    using (var db = new YourDataContext())
    {
        return await db.ExpenseCategories
            .AsNoTracking()
            .OrderBy(category => category.Title)
            .Select(category => new ExpenseCategoryV2DTO
            {
                Id = category.Id,
                Title = category.Title,
                DefaultAmount = category.DefaultAmount,
                Version = "Version 2"
            })
            .ToListAsync();
    }
}


Note

The AsNoTracking method configures the LINQ query so that returned entities are not tracked in the unit-of-work object, which is what happens by default when Entity Framework sits underneath LINQ.


Async database operations can be applied when long-running operations or network latencies might otherwise block the application. They are also an alternative to manually creating multiple threads just to reduce blocking; doing that by hand can leave you with a large number of threads and a large memory footprint. Async operations built into .NET, instead, don't hold threads blocked while waiting for operations to complete and so don't penalize the application itself.


Important

The DbContext class you use to control query or save operations (sync or async) is not thread-safe. To avoid issues and apparently weird exceptions, you should have only one operation per DbContext running at a time. You can still run multiple database operations in parallel as long as you use distinct instances of the DbContext class.
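For illustration, here's a sketch of two queries running in parallel, each on its own context instance. YourDataContext and its sets are the placeholder names used elsewhere in this chapter:

// Distinct DbContext instances: each task owns its own context,
// so no single context ever runs two operations at a time.
var ordersTask = Task.Run(() =>
{
    using (var db = new YourDataContext())
        return db.Orders.AsNoTracking().ToList();
});
var customersTask = Task.Run(() =>
{
    using (var db = new YourDataContext())
        return db.Customers.AsNoTracking().ToList();
});
await Task.WhenAll(ordersTask, customersTask);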


Returning IQueryable types

Returning IQueryable types from a repository instead of the results of an actual query against the database will give a lot more flexibility to the callers of repositories. This is a gain as well as a pain.

It’s a gain because it’s the actual consumer of the data who takes the responsibility of narrowing results to what’s actually needed. The idea is that you build the query by adding filters along the way and generate SQL code and execute the query only when it cannot be further deferred.

This approach can also be a pain in some situations. Repository callers are actually given the power of doing nearly everything without control. They can certainly use the power of IQueryable to add filters and minimize database workload while keeping the repositories lean and mean. However, they can also overuse the Include method and grow the graph of returned objects. Repository callers can implement very heavy queries against the database and filter results in memory. Finally, you should consider that the composition of queries occurs within the LINQ provider of the O/RM of choice. Not all LINQ providers have the same capabilities in terms of both performance and the ability to manage large queries. (In our experience, the LINQ provider in Entity Framework is by far the most reliable and powerful.)

The risk of running into such drawbacks is higher if the team writing repositories is not the same as the one writing upper layers and, of course, it is also a matter of experience.

A repository centered on IQueryable types can have the following layout:

public interface IRepository<TAggregate>
{
    IQueryable<TAggregate> All();   // Really worthwhile?
}

As you can see, however, the query side consists of just a single method. Ultimately, you rarely have a strict need for repositories in the query stack. This means you can avoid IRepository<T> entirely and build your queries starting from the data source context class (DbContext in Entity Framework). The data source context class offers as IQueryable<T> the full list of data objects to filter further.
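For example, the application layer can close a query started directly on the context. The Orders set and its properties are illustrative:

using (var db = new Database())
{
    // The query stays composable until ToList() forces SQL generation.
    var recentOrderIds = db.Orders
        .Where(o => o.Date >= DateTime.Today.AddDays(-30))
        .Select(o => o.Id)   // Project early to keep the result set small
        .ToList();
}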

Some don’t like IQueryable at all

In the community, there’s a lot of debate about the use of IQueryable in repositories. As mentioned, an approach based on IQueryable gives upper layers of the system more responsibilities about the query being built. Also, the query side of the system is tightly bound to LINQ providers. This latter point alone seems to be the major argument raised by those who don’t like IQueryable in the repository.

The IQueryable interface is not responsible for the actual execution of the query. All it does is describe queries to execute. Executing queries and materializing results is the task of a specific LINQ provider. LINQ providers are bound to the actual technology being used for data access.

As long as you use Entity Framework with Microsoft SQL Server, relying on LINQ for queries is not a problem at all. The same can be said if you use a different database, such as Oracle, whether through the official Oracle provider or the third-party DevArt provider. We used the NHibernate provider a few years ago, and it wasn’t very reliable at the time. But it likely has improved along the way.

Another objection to using IQueryable is that it puts a dependency on the LINQ infrastructure. Determining whether this is a problem or not is up to you and your team. The syntax of LINQ is part of C#, so having a dependency on it is not a problem at all. The point is something else—a dependency on the syntax of LINQ for database queries poses a dependency on some LINQ provider for that data source. We do recognize that in general there's no guarantee you can find reliable LINQ support for every possible data source. In this regard, a solution based on IQueryable might face serious problems when you have to switch it to a different data source.

But is this really going to happen? If it’s only a theoretical possibility, in our opinion you’d better ignore it and take advantage of the benefits of IQueryable for its level of performance and code cleanness. However, we like to suggest a test. What is your stance regarding the following statement?

A repository should offer an explicit and well-defined contract and avoid arbitrary querying.

If you have a strong opinion about that, you also have the best answer about the whole IQueryable story.

Persisting aggregates

Because the repository is also responsible for the persistence of aggregates, a common thought is that the command side of a repository must mimic the classic CRUD interface and offer Add, Update, and Delete methods.

As mentioned earlier, instead, in a realistic business domain you hardly have anything like Add or Delete. All you have is Save because "save" is a sufficiently neutral term that can be used for nearly any aggregate and is also familiar to developers. Most business jargon doesn't have a "save" verb; rather, you'll find something like "register," "issue," or maybe "update." However, by using Save, you'll likely stay close enough to the ubiquitous language and simplify development. It might be, instead, that on a specific repository class—not the base interface—you can have business-specific methods that logically correspond to delete and insertion methods. You always save the order, for example, whether it is a new order or an update to an existing order. You hardly ever delete the order, at least in the sense of removing it from the system. Instead, you likely cancel the order, which consists of setting a particular value on a particular field—in the end, an internal call to the Save method.
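As a sketch, such a business-specific method might look like the following inside the order repository; the IsCancelled property is an assumption:

public void Cancel(Order order)
{
    // Nothing is physically removed: cancellation is a state change
    // that flows through the ordinary Save path.
    order.IsCancelled = true;
    Save(order);
}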

Implementing the Unit of Work pattern

A repository class needs to have a dependency on some concrete database technology. If you use the Microsoft stack, the database technology is likely Entity Framework. Let’s have a look at the common implementation of a repository class:

public interface IOrderRepository : IRepository<Order, int>
{
    // Query methods
    ...

    // Command methods
    void Save(Order aggregate);
    ...

    // Transactional
    void Commit();
}

The concrete repository might look like the code shown here:

public class OrderRepository : IOrderRepository
{
    protected Database _database;

    public OrderRepository()
    {
        _database = new Database();
    }

    public void Save(Order order)
    {
        // Queues the change in the context; nothing hits the database yet
        _database.Orders.Add(order);
    }

    public void Commit()
    {
        // Persists all pending changes in a single physical transaction
        _database.SaveChanges();
    }
    ...
}

The Database class here is a custom class that inherits from DbContext and represents the entry point in Entity Framework. Having an instance of Database scoped at the repository level ensures that any action performed within the scope of the repository instance is treated within the same physical database transaction. This is precisely the essence of the UoW pattern.

You should also have a Commit method to attempt to commit the transaction at the DbContext level.

Atomic updates

In some simpler scenarios and, in general, whenever it works for you, you can have atomic updates that basically use a local instance of the database context scoped to the method rather than the class:

using (var db = new Database())
{
    db.Orders.Add(aggregate);
    db.SaveChanges();
}

The only difference between atomic and global methods is that atomic methods trigger their own database transaction and can’t be joined to other, wider transactions.

Storage technologies

An interesting change that occurred in recent years in software engineering is that more and more systems are using alternate storage technologies in addition to, or instead of, classic relational data stores. Let’s briefly recap the options you have. These options are relevant because they are the technologies you would use in the repository classes to perform data access.

Object/Relational mappers

Using O/RM tools is the most common option for repositories. An O/RM is essentially a productivity tool that greatly simplifies (and makes affordable for most projects) the writing of data-mapping code that persists an object-oriented model to a relational table and vice versa. For a system based on the Microsoft stack, today an obvious choice is Entity Framework. Another excellent choice—for a long time, it was the primary O/RM choice—is NHibernate. (See http://community.jboss.org/wiki/NHibernateForNET.)

Entity Framework and NHibernate are only the two most popular choices; many other O/RM choices exist from various vendors. A quick list is shown in Table 14-1. A more comprehensive list can be found at http://en.wikipedia.org/wiki/DataObjects.NET.


TABLE 14-1. A list of O/RM tools for Microsoft .NET

Choosing the O/RM to use can be as easy as it can be hard. Today, Entity Framework is the first option to consider. In our opinion, Entity Framework is not perfect, but it's really easy to use and understand; it has tooling and a lot of support and documentation. Beyond this, Entity Framework is like any other O/RM: you need to know it to use it properly. A developer can still create a disaster with Entity Framework. Among the mistakes we've seen (and made) with Entity Framework are using lazy loading when it is not appropriate and loading object graphs in memory that are too large without being aware of it. And, more than anything else, complaining about Entity Framework and its performance when it was, in the end, our own fault.

When it comes to choosing an O/RM, we suggest you consider the following points: productivity, architecture, and support.

Productivity mainly refers to the learning curve necessary to master the framework and how easy it turns out to be to maintain an application. Architecture refers to the capabilities of the framework to map domain models and the constraints it might impose. In addition, it refers to the set of features it supports, especially when it comes to performance against the underlying database, testability, and flexibility. Finally, support refers to supported databases, the ecosystem, the community, and the commitment of the vendor to grow and maintain the product over time.

Entity Framework and NHibernate are the most popular choices, and the saga about which of the two is the best O/RM in town is a never-ending story. We have successfully used both products. We used NHibernate extensively in years past, and we find ourselves using Entity Framework with both SQL Server and Oracle quite often now. If we drew a line to find a set of technical differences, we’d say that Entity Framework still has room for improvement and that the lack of batch commands, the lack of a second-level cache, and especially the lack of custom-type support are the biggest differences we’re aware of.

Are those differences enough to determine a winner?

No, we don’t think so. Those differences exist and can’t be denied, but honestly we think that they don’t dramatically affect the building of a persistence layer.

External services to read and write data

When a layered solution is mentioned, it is natural to think that the solution is the entire system. Instead, more often than many think, architects are called to write layered solutions that are only one bounded context in a larger system.

In this scenario, it is not surprising that you have no access to the physical database and that sometimes you don’t even see it. You know that data is somehow written and read from some place in some remote cloud; all you have is a set of URLs to call.

In this case, your repository doesn’t include any DbContext object, just some .NET API to arrange HTTP or network calls. Such calls might be atomic or chained, synchronous or asynchronous; sometimes multiple calls might even go in parallel:

var tasks = new Task<object>[2];
tasks[0] = Task<object>.Factory.StartNew(() => _downloader1.Find(...));
tasks[1] = Task<object>.Factory.StartNew(() => _downloader2.Find(...));

// Wait for both downloads to complete without blocking a thread
await Task.WhenAll(tasks);
var data1 = tasks[0].Result as SomeType1;
var data2 = tasks[1].Result as SomeType2[];

In the repositories, you use downloader and uploader components and run them asynchronously on separate threads using synchronization when required. Caching is even more important in repositories based on external services because it can save HTTP calls. Similarly, caching at the HTTP level through front-end proxy servers can increase performance and scalability for large and frequent reads.


Note

Probably the most common situations in which repositories are based on external services are mobile applications and single-page web applications. In particular, in single-page web applications you can use ad hoc JavaScript libraries (for example, breeze.js) to connect to a tailor-made Web API front end and perform reads and writes with minimal configuration and effort.


OData endpoints

OData, short for Open Data Protocol, is a web protocol that offers a unified approach for querying and manipulating remote data via CRUD operations. In addition, OData exposes metadata about the encapsulated content that clients can use to discover the type information and relationships between collections. Originally defined by Microsoft, the protocol is now being standardized at OASIS.

OData implements the same core idea as Open Database Connectivity (ODBC), except that it is not limited to SQL databases. You might want to check http://www.odata.org for more information about the protocol and details about the syntax. To form a quick idea about the protocol, consider the following URL:

/api/numbers?$top=20&$skip=10

The URL gets whatever the numbers endpoint returns—assumed to be a collection of objects—and then gets 20 items, skipping the first 10. OData defines the query string syntax in a standard way. This means that a server-side layer can be created to parse the query string and adjust the response.

In terms of repositories, OData helps arrange queries on top of remote services. In other words, if the remote service supports the OData protocol, you can send HTTP requests that build custom queries on top of exposed collections so that a filtered response can be returned as well as custom projections of data.

With ASP.NET Web API, you can easily create an OData endpoint around a data set. All you need to do is mark the method with the Queryable attribute and make it return an IQueryable type:

[Queryable]
public IQueryable<SomeType> Get()
{
    var someData = ...;
    return someData.AsQueryable();
}
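A client can then shape the response through standard OData query options appended to the endpoint URL. For example, a hypothetical request like the following filters, sorts, and pages entirely server-side (Total and Id are illustrative property names):

/api/sometypes?$filter=Total gt 1000&$orderby=Id desc&$top=10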

OData supports several serialization formats, including JSON.

Distributed memory cache

A repository is the place where you use caching if caching is required. This means that caching technologies are also part of the architect’s tool chest. There are many easy ways to cache data. In an ASP.NET application, you can use, for example, the built-in Cache object. This object works beautifully except that it is limited to a single machine and Internet Information Services (IIS) process.

More powerful memory caches are available that work across a distributed architecture. The most prominent example is Memcached, but several other commercial and open-source products exist, such as NCache, ScaleOut, and Redis. When it comes to using a cache—distributed or not—the pattern followed by the repository code is fairly common:

- Try to read from the cache.

- If no data is found, access the back-end store.

- Save data in the cache.

- Return the data.

For updates, the pattern is similar: you first update the back-end store and then the intermediate cache, although in some scenarios of extreme scalability it is acceptable that things occur in the reverse order.
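Here's a sketch of the read path of that pattern inside a repository method. The _cache component and the key scheme are assumptions, not the API of any specific product:

public Order FindById(int id)
{
    var key = "orders/" + id;

    // 1. Try to read from the cache.
    var cached = _cache.Get(key) as Order;
    if (cached != null)
        return cached;

    // 2. On a miss, access the back-end store.
    var order = _database.Orders.Find(id);

    // 3. Save the data in the cache for subsequent reads.
    if (order != null)
        _cache.Set(key, order);

    // 4. Return the data.
    return order;
}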

NoSQL data stores

Yet another storage technology for (some) repositories is NoSQL data stores. Such data stores are designed from the ground up just to store huge amounts of data—any data, of any size and complexity—across a possibly huge number of servers. They don't follow any fixed schema, meaning that they don't face any rigidity and let you evolve the system with great ease. The key benefits that NoSQL provides can be summarized as follows:

- Ability to deal with unstructured data

- Ability to support eventual consistency scenarios, which cuts off writing time and makes a write-intensive system easy to scale

- Independence from query language

- Natural sharding that doesn't require balance and predesign analysis of the involved tables and partition of data

A NoSQL data store requires a team to develop new skills, and it poses new challenges for the team both on the development side and with the setup and administrative part of the work. As we see it, NoSQL data stores are definitely an option to consider for some segments of the persistence layer. In smaller, simpler, or just specific bounded contexts, NoSQL storage can even be the primary storage technology. More often than not, though, we see NoSQL used in conjunction with more traditional forms of storage. This form of persistence has been given the quite fancy name of polyglot persistence.

Why should you consider nonrelational storage?

At present, cracks are starting to show in the otherwise solid granite wall of relational stores. The cracks are the result of the nature of a relational store and of some applications and their data. It’s becoming increasingly difficult to fit real data into the rigid schema of a relational model and, more often than not, a bit of redundancy helps save queries, thus making the application faster. No hype and no religion—it’s just the evolution of business and tools.

In our opinion, calling relational stores “dead” and replacing them tout-court with the pet NoSQL product of choice is not a savvy move. There are moments in life and business in which you must be ready to seize new opportunities far before it is clear how key they are. If you’re an architect and your business is building applications for clients, NoSQL stores are just another tool in your toolbox.

It’s a different story, instead, if your business is building and selling tools for software development. In this case, jumping on the NoSQL bandwagon is a smart move. NoSQL stores are useful in some business scenarios because of their inherent characteristics, but smart tooling is required to make them more and more appealing and useful. This is an opportunity to seize.

If you’re an architect who builds software for a client, your responsibility is understanding the mechanics of the domain and the characteristics of the data involved and then working out the best architecture possible.

Familiarizing yourself with NoSQL

For the past 40 years or so, we used relational data stores and managed to use them successfully in nearly all industry segments and business contexts. We also observed relational data productively employed in systems of nearly all sizes.

Relational data stores happily survived the advent of object orientation. Presented as a cutting-edge technology, object-based databases did not last long. When objects came up, relational stores had already gained too much market penetration to be seriously threatened by emerging object-based databases. Quite simply, development teams found it easier to build new object-oriented artifacts in which to wrap access to relational data stores and SQL queries. For decades, the industry didn't really see a need for nonrelational storage.

Are things different today?

Not-Only SQL

Originally, the NoSQL movement started as a sharp and somewhat rabid reaction to SQL and relational databases. At some point, the NoSQL movement and products hit the real world. The NoSQL acronym was then reworked to a more pragmatic “Not Only SQL.” In a nutshell, a NoSQL store doesn’t store data as records with a fixed schema of columns. It uses, instead, a looser schema in which a record is generically a document with its own structure. Each document is a standalone piece of data that can be queried by content or type. Adding a field to a document doesn’t require any work other than just saving a new copy of the document and maybe a bit of extra versioning work.

With NoSQL, there's no need to involve the IT department to make changes to the structure of stored data. NoSQL allows you to write a data-access layer with much more freedom and far fewer constraints than a traditional relational environment.

Is the increased agility in writing and deploying software a good reason to abandon relational stores?

Where are NoSQL stores being used?

As a matter of fact, an increasing number of companies are using NoSQL stores. A good question to ask is, “Where are NoSQL stores being used?” This is a much better question to ask than, “How can we take advantage of NoSQL stores?” As we see it, examining realistic technology use-cases to see if they match your own is preferable to blindly looking for reasons to use a given technology.

NoSQL stores are used in situations where you can recognize some of the following characteristics:

- Large—often, unpredictably large—volumes of data and possibly millions of users

- Thousands of queries per second

- Presence of unstructured/semi-structured data that might come in different forms, but still needs the same treatment (polymorphic data)

- Cloud computing and virtual hardware involved for extreme scalability needs

- Your database is a natural event source

If your project doesn’t match any of these conditions, you can hardly expect to find NoSQL particularly rewarding. Using NoSQL outside of such conditions might not be wrong, but it might just end up being a different way of doing the same old things.

Flavors of NoSQL

There’s not just one type of NoSQL data store. Under the umbrella of NoSQL, you can find quite a few different classes of data stores:

- Document/Object store This store saves and indexes objects and documents in much the same way as a relational system saves and retrieves records. Stored data can also be queried via associated metadata. The big difference with relational systems is that any stored object has its own schema and can be abstracted as a collection of properties.

- Graph store This store saves arbitrarily complex collections of objects. The main trait of these systems is that they support relationships between data elements.

- Key value store This store works like a huge dictionary made of two fields—key and value. Stored data is retrieved by key. Values are usually serialized as JSON data.

- Tabular store This store is based on concepts similar to relational stores. The big difference is the lack of normalization you find in tabular stores. Tabular data usually fulfills the first normal form of the relational model (which forbids repeating groups of data) but not the second normal form (which requires every non-key column to depend on the whole primary key). For more information, refer to http://en.wikipedia.org/wiki/First_normal_form and http://en.wikipedia.org/wiki/Second_normal_form.

NoSQL stores offer quite different sets of capabilities. Therefore, various segments of the application might need different NoSQL solutions. Examples of key-value stores are in-memory data containers such as Memcached. Examples of document stores are products often associated these days with the whole idea of NoSQL, such as CouchDB, MongoDB, and RavenDB. Let's find out more about document databases.

What you gain and what you lose

Relational databases have existed for at least 40 years, and everybody in the software industry is used to them. A few well-established relational databases exist and take up the vast majority of the world's demand for data storage, dwarfing any NoSQL solution. Far from being a dead-end technology, classic SQL technology improves over time, as the new column store feature in SQL Server 2014 demonstrates.

Yet, some new business scenarios emerge that push the power of relational databases to the limit. The challenge for architects is to resist the temptation to use trendy tools—and, instead, to use them just when, and if, appropriate for the particular business.

Downsides of the relational model in some of today’s scenarios

The major strengths of relational databases can be summarized as follows:

- Supports a standard data-access language (SQL)

- Table models are well understood, and the design and normalization process is well defined

In addition, the costs and risks associated with large development efforts and with large chunks of data need to be well understood. Gaining expertise in design, development, optimization, and administration is relatively easy, and an ecosystem of tools exists for nearly any necessity.

On the downside, really data-intensive applications treating millions of rows might become problematic because relational databases are optimized for specific scenarios, such as small-but-frequent read/write transactions and large batch transactions with infrequent write access.

In general, it is correct to say that when relational databases grow big, handling them—reads and writes—can become problematic. However, how big must a database grow before it becomes so expensive to handle that you want to consider alternative solutions? Fairly big indeed. Relational tables certainly don't prevent scaling out, except that they introduce the extra costs of data sharding.

In a relational environment, the issues with massive reads and writes are mostly related to the cost of managing indexes in large tables with millions of rows. Relational databases also add overhead even for simple reads that join and group data. Integrity, both transactional and referential, is a great thing, but it can be overkill for some applications. Relational databases require the flattening of complex real-world objects into a columnar sequence of data and vice versa. The negative points can be summarized as follows:

- Limited support for complex base types in both reading and writing via SQL (hence the need for an O/RM).

- Knowledge of the database structure is required to create ad hoc queries.

- Indexing over a large number of records (in the order of millions of rows) becomes slow.

In general, a relational structure might not be the ideal choice for serving pages that model unstructured data that cannot be reduced to an efficient schema of rows and columns and to serve binary content from within high-traffic websites.

Relational databases still work great, but in some very special scenarios (lots of reads and writes, unstructured data and no need for strict consistency) different choices for storage are welcome. It’s easy to map those aspects to social networks. If you’re building a social network, you definitely need to look into polyglot persistence. When heterogeneous storage is taken into account, NoSQL storage is likely one of the options—and the most compelling one.

Eventual consistency

There are two major features of a NoSQL store: the ability to handle schemaless data and eventual consistency. Together these features offer a persistence layer that is easy to modify, supports CQRS scenarios well, and facilitates scalability. Eventual consistency is a critical attribute from a pure business perspective.

Eventual consistency means that a read issued right after a write isn't guaranteed to return the data just written. Most NoSQL systems are eventually consistent in the sense that they guarantee that if no updates are made to a given object for a sufficient period of time, a query returns what the last command has written.

In most cases, eventual consistency isn't an issue at all. You generally need to be as consistent as possible within the bounded context. You don't really need any level of consistency across bounded contexts. As the system grows, regardless of the technology, you can't expect full consistency. Within the bounded context, though, there are business scenarios in which eventual consistency is not acceptable. As an example, consider a banking application that lets a user withdraw an amount of money from an account. If two operations for the same amount occur, you must be able to recognize them as distinct operations and take no risk that the second is taken as a repetition of the first. Without the full ACID consistency of a relational store, this could be a problem.

There’s a simple test to see whether eventual consistency is an issue. How would you consider a scenario in which a command writes some data, but a successive read returns stale data? If it’s absolutely crucial that you’re constantly able to read back what has just been written, you have two options:

- Avoid NoSQL databases.

- Configure the NoSQL database to be consistent.

Let’s see eventual consistency in action in a sample document database such as RavenDB.

Eventual consistency in RavenDB

Consider the following code snippet, which assumes the use of RavenDB—a popular .NET NoSQL database. For more information, visit http://ravendb.net.

// You store an object in the database
DocumentSession.Store(yourDocument);
DocumentSession.SaveChanges();

// Try to read back the content just saved
var results = DocumentSession
    .Query<SomeDocumentType>()
    .Where(doc => doc.Id == id)
    .ToList();

The effect of eventual consistency—the default configuration in RavenDB—is that what you read back might not match what you've just written.

As far as RavenDB is concerned, writing on the store and updating indexes used by the query engine are distinct operations. Index updates occur as scheduled operations. That misalignment doesn’t last more than a few seconds if there are no other updates to the same object taking place in the meantime.

Scared? Just worried? Or thinking of dropping all the wicked NoSQL stuff for the rest of your career? Well, there are programmatic tools to control eventual consistency. Here’s a possible way to force ACID consistency in RavenDB:

var _instance = new EmbeddableDocumentStore { ConnectionStringName = "RavenDB" };
_instance.Conventions.DefaultQueryingConsistency =
ConsistencyOptions.AlwaysWaitForNonStaleResultsAsOfLastWrite;

When you do this, though, that read doesn’t return until the index has been updated. A trivial read, therefore, might take a few seconds to complete.

There are better ways to wait for indexes to update. For example, you can determine the type of consistency a query needs at query time, as in the following example:

using (var session = store.OpenSession())
{
    var query = session.Query<Person>()
        .Customize(c => c.WaitForNonStaleResultsAsOfLastWrite())
        .Where(p => /* condition */);
}

In this example, the WaitForNonStaleResultsAsOfLastWrite query customization is telling the server to wait for the relevant index to have indexed the last document written and ignore any documents hitting the server after the query has been issued. This helps in certain scenarios with a high write ratio, where indexes are constantly stale.

There are many other WaitForNonStaleResultsXxxx methods in RavenDB that solve different scenarios. Another possibility is to fully embrace eventual consistency, ask the server if the returned results are stale, and behave accordingly:

using (var session = store.OpenSession())
{
    RavenQueryStatistics stats;
    var query = session.Query<Person>()
        .Statistics(out stats)
        .Where(p => /* condition */);
}

In this example, you ask the server to also return query statistics that inform you whether or not returned results are stale. You then take the best possible action based on the scenario.


Note

From this, you see that NoSQL stores fill a niche in the industry. They do pose challenges, though. NoSQL addresses some architecture issues while neglecting others. Eventual consistency is an excellent example of trade-offs in the software industry. You can happily concede to eventual consistency to gain performance and scalability. And you can probably do that in more situations than one might think at first. But if you need full ACID consistency, you should not use NoSQL at all.


Polyglot persistence by example

Today there are at least two options for a persistence layer: classic relational storage and polyglot persistence. Polyglot persistence is a fancy term that simply refers to using the most appropriate persistence layer for each operation.

Let’s consider a polyglot-persistence scenario for a generic e-commerce system. In such a system, you reasonably might need to save the following information:

- Customers, orders, payments, shipments, products, and all that relates to a business transaction

- Preferences of users discovered as they navigate through the catalog of products or resulting from running business intelligence on top of their transactions

- Documents representing invoices, maps of physical shops to find the nearest one to the user, directions, pictures, receipts of delivery, and so on

- Detailed logs of any user's activity: products he viewed, bought, commented on, reviewed, or liked

- Graph of users who bought the same and similar products, similar products, other products the user might be interested in buying, and users in the same geographical area

Saving any such pieces of information to a single relational or NoSQL store is possible, but probably not ideal. Polyglot persistence consists of mixing together various stores and picking the right one for each type of information. For example, you can store customers and orders to a SQL Server data store and access it through Entity Framework or any other O/RM framework. Entity Framework, in particular, can be used also to transparently access an instance of a Microsoft Azure SQL Database in the cloud.

User preferences can go to Azure Table storage and be accessed through an ad hoc layer that consumes Azure JSON endpoints. Documents can go to, say, RavenDB or MongoDB and be consumed through the dedicated .NET API. The user's history can be saved to a column store such as Cassandra. Cassandra is unique in that it associates a key with a varying number of name/value pairs. The nice thing is that any row stored to Cassandra can have a completely different structure than other records. The effect is similar to having a name/value dictionary where the value is a collection of name/value pairs. Cassandra can be accessed from .NET using an ad hoc framework like FluentCassandra (http://fluentcassandra.com). Finally, hierarchical information can be stored to a graph database such as Neo4j through the .NET client.

The overall cost of polyglot persistence might be significant at times. For the most part, you end up working with .NET clients that are relatively straightforward to use. Yet, when a lot of data is involved and performance issues show up, heterogeneous skills are required and a shortage or lack of documentation is the norm rather than the exception.

Planning a sound choice

Eventual consistency is the first relevant factor that might cause you to rule out the use of NoSQL products. Another factor is when the data to represent is so intricate, large, growing, and frequently updated that a relational table would grow too fast, posing other issues such as sharding and caching.

When you have to make a decision, which other aspects and parameters should you consider?

The “why-not?” factor

To put things into perspective, let’s clean up any bias and try to honestly answer the following simple question: “In the context of the project, are you satisfied with relational storage?” Here’s a table of possible answers.


TABLE 14-2. Are you satisfied with relational storage?

Relational vs. NoSQL is definitely an architectural choice; it’s a choice that’s hard to make, critical, and not to be indefinitely delayed. As we see things, the “Yes, but” answer is the only one that opens up a whole new world of opportunities without carrying the project away from sane pragmatism.

But where does that answer lead the project?

The “Yes, but” answer indicates that, overall, relational storage works for you, but the engine doesn’t sound completely clean; you perceive that some grains of sand are left in the gears. Should you consider moving away from the consolidated, well-established, and comfortable relational world? And why should you change? Compelling reasons are wanted!

Beyond the aforementioned discriminants, one compelling reason is the hefty license fees you (or your customers) are asked to pay for most commercial relational databases. Other than that, we don’t think there are compelling reasons to choose NoSQL as the sole approach for storage.

NoSQL is gaining ground and momentum. At the moment, though, there’s no clear perception of why this is happening. But it is happening. We’re led to believe that it’s a mix of curiosity and love for innovation, a sort of “Why-not?” factor, that pushes more and more companies to try out a NoSQL store in a real-world system. The mechanics, therefore, are simple: by trying NoSQL out, an increasing number of companies figure out that in more and more scenarios they can drop relational databases in its favor.

This is slowly eroding the pillars of a world that couldn’t have looked more solid and untouchable only a couple of years ago.

If you’re clueless about what to do, we suggest you look into three parameters that relate to the nature of the data you handle: characteristics, volatility, and growth.

Characteristics of the data

Is the data the application will handle homogeneous? In this context, homogeneity refers to the degree to which conceptual data can be partitioned into formal tables of homogeneous entities. You should be aware that not all entities captured in requirements can always and easily be constrained within a tabular model and expressed as a combination of rows and columns.

This is a general point—things have always been like this. However, for some reason and for many years, architects preferred to normalize (actually, flatten) data to tabular schemas rather than look into, say, object-based stores. We believe that the primary reason has been the lack of (commercially) valid alternatives, which are now available.

The relational model is more than adequate if the data is naturally homogeneous or if it lends itself well to being flattened into a tabular schema. The second scenario, though, opens a window for alternative solutions that the savvy architect should be ready to spot and investigate further. Having to work with data that is hard to bend into a table scores a point in favor of NoSQL, even though architects have been able to massage nontabular data into tables for years.

Volatility of the schema

How often is the schema of data expected to change over time and across installations of the system for different customers? If the requirements churn of the system introduces new data types to persist, you must ensure that the database can support them as well.

A NoSQL store is much better suited than a classic relational store to handle originally homogeneous data that, at some point, turns into more heterogeneous data. In a relational model, such adjustments can be accomplished only by altering the structure of the (production) tables. As you might understand, this is never a small change.
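To make the contrast concrete, here is a minimal sketch that uses Json.NET’s JObject as a stand-in for schemaless documents. The product attributes are invented for illustration; the point is that old and new document shapes coexist with no ALTER TABLE step.

using System;
using Newtonsoft.Json.Linq;

class SchemaVolatilityDemo
{
    static void Main()
    {
        // In a relational store, adding a Voltage attribute to products
        // would require something like
        //     ALTER TABLE Products ADD Voltage int NULL;
        // run against every production installation.

        // In a document store, documents with and without the new
        // attributes simply coexist side by side.
        var oldProduct = JObject.Parse(
            @"{ 'Id': 'p42', 'Name': 'Kettle', 'Price': 29.90 }");
        var newProduct = JObject.Parse(
            @"{ 'Id': 'p43', 'Name': 'Kettle EU', 'Price': 31.90,
                'Voltage': 220, 'PlugType': 'C' }");

        // Reading code checks for the extra attribute only when present.
        Console.WriteLine((object)newProduct["Voltage"] ?? "n/a");  // 220
        Console.WriteLine((object)oldProduct["Voltage"] ?? "n/a");  // n/a
    }
}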

Growth of the data

A data store might grow in size to address new requirements or just because of an increased number of users. When this happens, a relational database might encounter serious performance issues, especially when the rate of reads and writes is high. Growing delays in the application are typically the first sign that the database is getting into trouble.

When the amount of data grows to the point that it might become a problem, the database must be scaled in some way. The simplest option is vertical scaling (a bigger server with more memory and CPU), except that it has the drawback of just moving the bottleneck forward without actually removing it. Another option is horizontal scaling through clustering, which distributes data and load across multiple servers.

Most commercial relational databases do offer horizontal scaling, but these solutions can be very complex to set up and quite expensive. Modern NoSQL products, by contrast, have been designed from the ground up to scale horizontally and to sustain intensive read and write operations.

Summary

There’s no application and no reasonable bounded context that doesn’t have to read and/or write data. These tasks are performed by the persistence layer, which is ideally the only place in the application where connection strings and URLs are known and managed.

A lot has changed in recent years as far as persistence is concerned. First and foremost, reads and writes might not take place over the same data store. Second, the data store is not necessarily a database. If it is a database, there’s sometimes no necessity for stored procedures and a separate database-level layer of SQL code to access data.
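As a minimal sketch of that last point, assuming a CQRS-style separation, the persistence layer might expose distinct write and read repositories that target different stores. All interface and class names here are hypothetical.

using System.Collections.Generic;

// Simplified domain entity; in a real system it lives in the domain layer.
public class Order
{
    public int Id { get; set; }
    public decimal Total { get; set; }
}

// Plain DTO returned to the presentation layer by the query side.
public class OrderSummary
{
    public int Id { get; set; }
    public string CustomerName { get; set; }
    public decimal Total { get; set; }
}

// The command side writes to the system of record, for example a
// relational database reached through an O/RM.
public interface IOrderWriteRepository
{
    void Save(Order order);
}

// The query side reads from a store optimized for display, possibly a
// different database or even a NoSQL store.
public interface IOrderReadRepository
{
    IEnumerable<OrderSummary> GetRecentOrders(int customerId);
}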

The final aspect that has changed is the nature of the database, whether it has to be relational, NoSQL, or polyglot. Ultimately, the point isn’t to pronounce relational stores dead and replace them with the NoSQL product of choice. The point is to understand the mechanics of the system and characteristics of the data and work out the best possible architecture. In our opinion, the most foreseeable concrete application for NoSQL stores is as an event store in the context of an event-sourcing architecture.

For systems where there isn’t a simple yes or no answer to whether you should use relational stores, the best you can do is consider polyglot persistence. Instead of forcing a choice between NoSQL and relational databases, you could look into a storage layer that combines the strengths of the two and tackles each type of problem in the most appropriate way.

Within the context of an enterprise, you should use different storage technologies to store different types of data. This is especially true in a service-oriented architecture. In this case, each service might have its own storage layer. There would be no reason to unify storage under a single technology or product. Polyglot persistence does require that you learn different storage technologies and products. As a training cost, though, it’s a reasonable investment.

Finishing with a smile

To generate a smile or two at the end of this chapter, we resort to rephrasing a few popular sayings. To start, we’d love to resurrect the popular Maslow’s hammer law. Nearly everybody knows it, but few people connect it to Abraham Maslow. In its commonly used short formulation, the law says, “If all you have is a hammer, everything looks like a nail.” Here is a really profound way of rephrasing it:

If all you know is SQL, all data looks relational.

Lily Tomlin (an American actress and comedian) said, “If love is the answer, would you please rephrase the question?” We think this can be applied to data access as the following:

If too many records are the answer, would you please rephrase the query?

The popular Murphy’s law can also be rephrased for databases:

If you design a database such that a developer can put incorrect data into it, eventually some developer will do that.