Information Management: Strategies for Gaining a Competitive Advantage with Data (2014)

Chapter 9. Data Virtualization

The Perpetual Short-Term Solution

Accessing data in multiple places across the information ecosystem makes sense for edge and one-off workloads. Data virtualization could be essential to managing redundancy.

Keywords

business intelligence; data virtualization; master data management; big data; data warehouse; self-service business intelligence

One of the premises of this book is that information will, by necessity, be spread throughout the organization into heterogeneous data stores. While the allocation of data to a platform should be ideal for the majority of its usage, obviously the selection can never be perfect for all uses. Therein lies a dilemma. Physical commingling of data on a platform will always yield results with the highest performance.1 The tradeoff is the cost of redundant, disparate, and uncontrolled data. It’s more than the cost of storage. People costs, error corrections, and faulty decision making because of disparate, errant data are far more costly.

One technology, or capability if you will, that addresses this dilemma by providing query capabilities across platforms is data virtualization. Data virtualization is middleware that has less to do with the rendering of the result set and everything to do with the platforms in which the data resides—those platforms we are selecting from throughout this book. For these reasons, I discuss it in its own chapter and don’t include it in the business intelligence chapter (Chapter 15).

With permission, parts of this chapter have been pulled from “Data Virtualization” by William McKnight, published on BeyeNETWORK.com. Read the full article at: http://www.b-eye-network.com/view/15868.

The History of Data Virtualization

Data virtualization is not new; the capability has been around for a long time. Historically, it underperformed expectations and was used to circumvent moving data into the data warehouse that really should have been physically cohabiting with the other data warehouse data. The resulting cross-platform queries, built into the technical architecture, tended to be slow and brought disrepute to the notion of a well-performing cross-platform query, a notion needed much more today as organizations have a wider variety of heterogeneous platforms to choose from.

In the heyday of data warehouse building, data virtualization was pushed too hard and too fast, and it got the proverbial black eye for trying to create the unworkable “virtual data warehouse.” Rather than doing the difficult work of consolidation, this style of warehouse left the pieces scattered around the organization and imagined a layer over all of them, stitching together result sets on the fly. In most cases, this was unworkable. The “hard part” of consolidation, ideally supported by governance, was required. It still is. You would not refrain from building a data warehouse and simply rely on data virtualization. For the reasons mentioned in Chapter 6, a data warehouse is still required. However, there are times today when data virtualization can be a short-term solution or a long-term solution.

Fortunately, the technology has caught up and deserves a second look. There are stand-alone data virtualization tools, and some degree of data virtualization is built into many business intelligence tools and even platforms. The latter tend to be more limited in scope, focused on the close technology partners of the host platform.

Data virtualization is an enabler of the No-Reference Architecture.

Controlling Your Information Asset

Every organization should expect chaos over the next decade while working to get the important asset of information under control. Technical architecture will drag behind the delivery of information to users. Much of data delivery will be custom. Data virtualization will be useful in this custom, creative delivery scenario. It has the ability to provide a single view of data that is spread across the organization. This can simplify access, and the consumer won’t have to know the architectural underpinnings. These are the short-term solution aspects to data virtualization.

How long is the short term? Term is a vague concept in this era of technological advancement, but nonetheless it is important to have time frames in mind for solutions in order to make wise investments. In the case of data virtualization, the short term is until the architecture supports a physical view through integration or until it becomes evident that data will remain consciously fragmented and data virtualization becomes the long-term solution.

The right architectural answer may be to not centralize everything, and the right business answer may be to not take the extended time to design and develop solutions in the three distinct technologies—business intelligence, data warehousing, and data integration—required for replicating the data.

Leveraging Data Virtualization

While many programs like the data warehouse, CRM, ERP, big data, and sales and service support systems will be obvious epicenters of company data, there is the inevitable cross-system query that must be done periodically or regularly. Many organizations start out by leveraging their data virtualization investment to rapidly produce operational and regulatory reports that require data from heterogeneous sources.

Further queries spawn from this beginning, since data virtualization offers the only bird’s-eye view into the entire data ecosystem (structured and unstructured), seamless access to all of the data stores that have been identified to the virtualization tool (including NoSQL stores and cloud-managed stores), and a federated query engine.

As middleware, data virtualization utilizes two primary objects: views and data services. The virtualization platform consists of components that perform development, run-time, and management functions. The first component is the integrated development environment. The second is a server environment, and the third is the management environment. These combine to transform data into consistent forms for use.
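
To make the view concept concrete, here is a minimal Python sketch, not any vendor’s API, that simulates a virtual view: two heterogeneous sources (an in-memory SQLite table standing in for a relational store and a list of JSON-like records standing in for a document store) are joined only at the moment the view is queried. The table, columns, and function names are hypothetical.

```python
import sqlite3
import pandas as pd

# Source 1: a relational store (simulated with an in-memory SQLite database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, customer_id INT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 101, 250.00), (2, 102, 75.50), (3, 101, 19.99)])

# Source 2: a document-style source (simulated with JSON-like records).
customer_docs = [
    {"customer_id": 101, "name": "Mary Smith", "segment": "Gold"},
    {"customer_id": 102, "name": "John Doe", "segment": "Silver"},
]

def customer_order_view():
    """A 'virtual view': federate and join the two sources at query time."""
    orders = pd.read_sql_query("SELECT * FROM orders", conn)
    customers = pd.DataFrame(customer_docs)
    return orders.merge(customers, on="customer_id")

# Nothing is persisted; the result set exists only when the view is queried.
print(customer_order_view())
```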

Integrated Business Intelligence

Data virtualization is used primarily for providing integrated business intelligence, something formerly associated only with the data warehouse. Data virtualization provides a means to extend the data warehouse concept into data not immediately under the control of the physical data warehouse. Data warehouses in many organizations have reached their limits in terms of major known data additions to the platform. To provide the functionality the organization needs, virtualizing the rest of the data is necessary.

Pfizer and Data Virtualization

Pfizer had an “information sharing challenge” with applications that “don’t talk to each other.” They implemented data virtualization without sacrificing the architectural concepts of an information factory. Source data was left in place, yet all PharmSci data was “sourced” into a single reporting schema accessible by all front-end tools and users. The fact that data virtualization has the ability to cache data from a virtual view as a file or insert the data into a database table (via a trigger) adds significant value to the solution.

—Dr. Michael C. Linhares, Ph.D. and Research Fellow

The Pfizer head of the Business Information Systems team has a nice quote in the book “Data Virtualization” (Davis and Eve, 2011), which sums up much of data virtualization’s benefits: “With data virtualization, we have the flexibility to decide which approach is optimal for us: to allow direct access to published views, … to use caching, … or to use stored procedures to write to a separate database to further improve performance or ensure around the clock availability of the data.” Sounds like flexibility worth adding to any shop.

Data virtualization brings value to the seams of our enterprise—those gaps between the data warehouses, data marts, operational databases, master data hubs, big data hubs, and query tools. It is being delivered as a stand-alone tool as well as extensions to other technology platforms, like business intelligence tools, ETL tools, and enterprise service buses.

Data virtualization of the future will bring intelligent harmonization of data under a single vision. It’s not quite there yet, but is doubtless the target of research and investment due to the escalating trends of competitive pressures, company ability, and system heterogeneity.

Using Data Virtualization

Data virtualization is a class of stand-alone tools as well as a significant capability added to many other tools, mostly database systems. It is important enough to understand discretely in order to get a full handle on information architecture. I will use “tools” generically to refer to a data virtualization tool or to the capability within a separate tool.

Something virtual does not normally physically exist, but based on the judgment of the tool and the capacity of its cache, the tool may actually physically create the desired structure. Regardless, at some point it has to “join” the data from the heterogeneous sources and make it available. Virtualization refers to querying data that is not guaranteed to reside in a single physical data store, but may as a result of caching. Check how intelligent the caching mechanisms of your chosen virtualization platform are. Some will rival the temperature-sensitivity capabilities of a robust DBMS.
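
To illustrate the caching idea, the sketch below wraps a federated query in a simple time-to-live cache written in Python. Real virtualization platforms apply far more intelligent, usage-aware policies; the view name, TTL, and stand-in fetch function here are assumptions for illustration only.

```python
import time

_CACHE = {}               # view name -> (timestamp, rows)
CACHE_TTL_SECONDS = 300   # assumed freshness window

def fetch_from_sources():
    # Stand-in for an expensive cross-platform federated join.
    return [("Mary Smith", 269.99), ("John Doe", 75.50)]

def query_view(view_name="customer_order_view"):
    entry = _CACHE.get(view_name)
    if entry and time.time() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]                      # serve from the cache
    rows = fetch_from_sources()              # otherwise hit the sources...
    _CACHE[view_name] = (time.time(), rows)  # ...and refresh the cache
    return rows

print(query_view())  # first call federates; calls within the TTL hit the cache
```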

Checklist of Data Virtualization Tool Requirements

1. Data stores it can access

2. Intelligence in determining data for its cache

3. Optimizer’s ability to manage the multiple optimizers of the data stores

4. User management

5. Security management

6. Load balancing

7. User interface

While some tools provide a user interface, which is useful for the odd queries you may want to run, combining data virtualization with a business intelligence tool (or using the capability built into a business intelligence tool) gets virtualization into the tools that users are already accustomed to for their single-system queries.

Use Cases for Data Virtualization

Given the heterogeneous information management architecture, the goal of eliminating unnecessary redundancy, and the capabilities of data virtualization, we land data in its best spot to succeed and go from there.

Data virtualization is not a materialized view, which is always a physical structure.

Data Virtualization Use Cases

Composite Software, a prominent data virtualization vendor and part of Cisco Systems, organizes the data virtualization use cases as follows:

• BI data federation

• Data warehouse extensions

• Enterprise data virtualization layer

• Big data integration

• Cloud data integration

There is some obvious overlap between these. For example, most data warehouses are built for business intelligence, so extending the warehouse virtually also provides data federation for BI. This form of virtualization is helpful in augmenting warehouse data with data that doesn’t make it to the warehouse in the traditional sense but nonetheless is made available as part of the warehouse platform. Big data integration refers to the integration of data in Hadoop, NoSQL systems, large data warehouses, and data warehouse appliances. Finally, the cloud is presented as a large integration challenge that is met by data virtualization.

Master Data Management

Master Data Management (MDM), discussed in Chapter 7, is built for governing data and distributing that data. The distribution of MDM data has a significant architectural aspect to it. MDM data does not have to be physically distributed to a similar structure residing in the target system that wants the data. Depending on the frequency of access and the concurrency requirements on the MDM hub itself, MDM data can stay in the hub and be joined to data sets far and wide in the No-Reference Architecture. Ultimately, MDM will be the highest-value use of data virtualization.

When the structure you wish to join MDM (relational) data with is not relational, you may create a separate relational store for use with the nonrelational data (which would still necessitate data virtualization) or you can utilize the main MDM hub for data virtualization.

Data virtualization is not synchronization, which is keeping two separate data stores consistent in data content without much time delay.

MDM data can increase the value of Hadoop data immensely. Hadoop is going to have low-granularity, transaction-like data without information, other than perhaps a key, about company dimensions like customer. Those transactions can be analyzed “on the face” of the transaction, which has some value, but bringing in information not found in the transaction, yet relevant to it, is much more valuable.

If you are analyzing people’s movements across the store based on sensor devices, it is helpful to know a generic person’s pattern, but it is more helpful to know that it is Mary Smith, who lives at 213 Main Street (an upscale geography), has a lifetime value that puts her in the second decile, has two kids, and prefers Nike clothing. You can design the store layout based on the former, but you can also make targeted offers based on the latter.
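
As a hedged illustration of this enrichment pattern, the following Python sketch joins key-only sensor events to hypothetical MDM customer attributes that echo the Mary Smith example; the column names and values are invented.

```python
import pandas as pd

# Sensor events carry only a customer key ("on the face" of the transaction).
events = pd.DataFrame([
    {"customer_id": 101, "aisle": "footwear", "dwell_seconds": 95},
    {"customer_id": 101, "aisle": "apparel",  "dwell_seconds": 40},
])

# Customer dimension attributes mastered in the MDM hub (hypothetical values).
mdm_customers = pd.DataFrame([
    {"customer_id": 101, "name": "Mary Smith", "ltv_decile": 2,
     "preferred_brand": "Nike"},
])

# The query-time join adds the dimensional context needed for targeted offers.
enriched = events.merge(mdm_customers, on="customer_id", how="left")
print(enriched[["name", "aisle", "dwell_seconds", "preferred_brand"]])
```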

A similar analogy applies to MDM and Data Stream Processing (Chapter 8), which has to do with real-time data analysis. Analyzing the stream (by human or machine) along with the dimensional view provided by MDM means you can customize the analysis to the customer, product characteristics, and location characteristics. While a $10,000 credit card charge may raise an alert for some, it is commonplace for others. Such limits and patterns can be crystallized in MDM for the customers and utilized in taking the next best action as a result of the transaction.
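
The sketch below, again hypothetical Python rather than any stream processing product, shows the idea: the same $10,000 charge raises an alert for one customer and not another, based on per-customer limits mastered in MDM.

```python
# Per-customer alert thresholds mastered in MDM (hypothetical values).
mdm_limits = {101: 2000.00, 102: 50000.00}
DEFAULT_LIMIT = 5000.00

def next_best_action(event):
    """Customize the alert decision using the customer's mastered limit."""
    limit = mdm_limits.get(event["customer_id"], DEFAULT_LIMIT)
    return "raise alert" if event["amount"] > limit else "approve"

stream = [
    {"customer_id": 101, "amount": 10000.00},  # unusual for this customer
    {"customer_id": 102, "amount": 10000.00},  # commonplace for this one
]
for event in stream:
    print(event["customer_id"], next_best_action(event))
```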

Data virtualization cannot provide transactional integrity across multiple systems, so data virtualization is not for changing data. It is for accessing data.

Because data streams are not relational, the MDM hub could be synchronized to a relational store dedicated to virtualization with the stream processing, or the stream processing could utilize the MDM hub directly. The decision would be made based on the volume of usage of the hub, the physical proximity of the hub, and the concurrency requirements on the hub.

Adding numerous synchronization requirements to the architecture by adding numerous hubs can add undue overhead to the environment. Fortunately, most MDM subscribers today are relational and have a place in the data store for the data to be synchronized to.

Mergers and Acquisitions

In the case of a merger or acquisition (M&A), there are immediately redundant systems that will take months to years to combine. Yet also immediately there are reporting requirements across the newly merged entity. Data virtualization can provide those reports across the new enterprise. If the respective data warehouses are at different levels of maturity, are on different database technologies or different clouds, or differ in terms of being relational or not, it does not matter. Obviously, these factors also make the time to combine the platforms longer, and possibly not even something that will be planned.
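
The following Python sketch shows the harmonization problem in miniature: two warehouses with different column names are mapped to a common shape and unioned into one merged-enterprise result at query time. The schemas and the mapping are invented for illustration; a virtualization tool would perform this mapping in its view definitions.

```python
import pandas as pd

# Company A's warehouse uses one naming convention...
company_a = pd.DataFrame([{"cust_name": "Acme Corp", "rev_usd": 1200000}])

# ...while the acquired Company B's warehouse uses another.
company_b = pd.DataFrame([{"client": "Beta LLC", "annual_revenue": 850000}])

# Map both schemas to a common shape, then union them for the merged report.
unified = pd.concat([
    company_a.rename(columns={"cust_name": "customer", "rev_usd": "revenue"}),
    company_b.rename(columns={"client": "customer", "annual_revenue": "revenue"}),
], ignore_index=True)

print(unified)  # one cross-enterprise result set, no physical consolidation
```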

Data virtualization has the ability to perform transformation on its data, but—as with data integration—the more transformation, the less performance. With virtualization happening at the time of data access, such degradations are magnified. Approach data virtualization that requires heavy transformation and CPU-intensive work with caution.

I am also using M&A as a proxy for combining various internal, non-M&A based fiefdoms of information management into a single report or query. While the act of an M&A may be an obvious use case, companies can require M&A-like cross-system reporting at any moment. This is especially relevant when little information management direction has been set in the organization and a chaotic information environment has evolved.

Temporary Permanent Solution

The need to deliver business intelligence per requirements may outweigh your ability to perform the ETL/data integration required to physically commingle all the data needed for the BI. Data integration is usually the most work-intensive aspect of any business intelligence requirement.

As you set up the information management organization for performing the two large categories of work required—development and support—you will need to consider where you draw the line. Development is usually subjected to much more rigorous justification material, rigorous prioritization, and project plans or agile setup. Support is commonly First In, First Out (FIFO), queue-based work that is of low scope (estimated to be less than 20 person-hours of effort).

Having run these organizations, I’ve become used to doing quick, and pretty valid, estimations of work effort. I know that when the work involves data integration, it likely exceeds my line between development and support, so I put that in the development category. Unfortunately, now that we can no longer rely on a row-based, scale-up, costly data warehouse to meet most needs, quite often we’ve placed data in the architecture in disparate places.

With data virtualization, the business intelligence can be provided much more rapidly because we are removing the data integration step. However, performance may suffer a bit compared to a physically commingled data set.

If you look at 100 BI requirements, perhaps 15 will still be interesting in 6 months. The other 85 requirements that you fulfill serve a short-term need. This does not diminish their importance. They still need to be done, but it does mean they may not need to be quite as ruggedized. One study by the Data Warehousing Institute2 showed the average change to a data warehouse takes 7–8 weeks.

At Pfizer, cross-system BI requirements will be met initially through data virtualization in the manner I described. After a certain number of months, if the requirement is still interesting, they will then look into whether the performance is adequate or should be improved with data integration.

You need data virtualization capability in the shop in order to compete like this. Talk about being agile (Chapter 16)! If you accept the premise of a heterogeneous environment, data virtualization is fundamental. I would also put cloud computing (Chapter 13), data governance, and enterprise data integration capabilities in that category.

Stand-alone virtualization tools have a very broad base of data stores they can connect to, while virtualization embedded in tools providing capabilities like data storage, data integration, and business intelligence tends to extend only to other technology within that vendor’s close-knit set of products and partners. You need to decide if independent virtualization capabilities are necessary. The more capabilities, the more flexibility and ability to deliver you will have.

Simplifying Data Access

Abstracting complexity away from the end user is part of self-service business intelligence, discussed in Chapter 15. Once you have crossed the bridge to data virtualization, virtual structures are as accessible as physical structures: just as users can reach physical structures and data sources, so can they reach virtual ones. This increases the data access possibilities exponentially.

Data virtualization also abstracts the many ways and APIs to access some of the islands of data in the organization such as ODBC, JDBC, JMS, Java Packages, SAP MAPIs, etc. Security can also be managed at the virtual layer using LDAP or Active Directory.
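
As a sketch of what this abstraction buys the consumer, the Python fragment below queries a published virtual view through a single ODBC interface using the pyodbc package. The DSN, credentials, and view name are hypothetical and depend entirely on how your virtualization server and ODBC drivers are configured.

```python
import pyodbc

# The DSN and view name below are hypothetical; they would be defined in the
# virtualization server and the local ODBC configuration.
conn = pyodbc.connect("DSN=virtualization_server;UID=report_user;PWD=secret")
cursor = conn.cursor()

# The consumer queries a published virtual view like any ordinary table, with
# authentication and authorization enforced at the virtual layer.
cursor.execute("SELECT customer, revenue FROM sales_by_customer_view")
for row in cursor.fetchall():
    print(row.customer, row.revenue)

conn.close()
```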

When Not to Do Data Virtualization

There may be some reports that perform well enough and are run infrequently enough that it may make sense to virtualize them, but there is another perspective and that is auditability. Going back to a point in time for a virtual (multi-platform) query is difficult to impossible. It requires that all systems involved keep all historical data consistently. I would not trust my Sarbanes–Oxley compliance reports or reports that produce numbers for Wall Street to data virtualization. Physicalize those.

Similarly, when performance is king, as I’ve mentioned, you’ll achieve better performance with physical instantiation of all the data into a single data set. Mobile applications in particular (covered in Chapter 15) need advanced levels of performance, given the nature of their use.

While I have somewhat championed data virtualization in this chapter, these guidelines should serve as some guardrails as to its limitations. It is not a cure-all for bad design, bad data quality, or bad architecture.

Combining with Historical Data

In systems without the temperature sensitivity that automatically routes data to colder, cheaper storage, many shops will, at some point, purposefully route older data to slower mediums. The best case for this data is that it is never needed again. However, it is kept around because it might be needed. If that time comes, the data is often joined with data in hot storage to form an analysis of a longer stretch of data than what is available in hot storage alone. The cross-platform query capabilities of data virtualization are ideal for such a case.
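
A minimal Python sketch of this hot-plus-cold pattern follows, with both stores simulated by in-memory SQLite databases; a virtualization tool would federate the real current and archival platforms, but the query-time union is the same idea.

```python
import sqlite3
import pandas as pd

hot = sqlite3.connect(":memory:")    # stand-in for the fast, current store
cold = sqlite3.connect(":memory:")   # stand-in for the slower archive medium

for db, rows in ((hot, [("2014-06-01", 500.00)]),
                 (cold, [("2011-02-15", 120.00)])):
    db.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Cross-medium query: union the two result sets into one longer history.
history = pd.concat([
    pd.read_sql_query("SELECT * FROM sales", hot),
    pd.read_sql_query("SELECT * FROM sales", cold),
], ignore_index=True)

print(history.sort_values("sale_date"))
```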

This also serves as a proxy for the “odd query” for data across mediums. This is the query that you do not want to spend a lot of time optimizing because it is infrequent. The users know that and understand.

Action Plan

• Determine existing data virtualization capabilities within current tools

• Acquire the needed data virtualization capabilities if necessary

• Utilize data virtualization to solve business intelligence requirements requiring cross-system data

• Analyze your process for data platform selection to ensure proper fit of data to platform for most (never all) of its uses

www.mcknightcg.com/bookch9

Reference

1. Davis JR, Eve R. Data Virtualization: Going Beyond Traditional Data Integration to Achieve Business Agility. U.S.: Nine Five One Press; 2011. p. 176.


1A consideration for cloud computing as well, covered in Chapter 13.

2Business Intelligence Benchmark Report 2011.