Information Management: Strategies for Gaining a Competitive Advantage with Data (2014)

Chapter 6. Data Warehouses and Appliances

Prebuilt analytical machines, up and running quickly, can siphon off analytic workloads, including the data warehouse itself. The data warehouse is the heart of the information-centric organization and remains a fixture in the environment for reasons discussed in this chapter. Practices for implementing a data warehouse are discussed as well. Select data warehouse appliances and data appliances are discussed to further convey their distinction and place in the ecosystem.

Keywords

business intelligence; data warehouse; data warehouse appliance; analytic appliance

Please note that in this chapter, by necessity, I will break custom and name a few vendors. The variety of approaches for analytic databases demands some representative deployed examples to explain the concepts.

Data Warehousing

If your data warehouse is under-delivering to the enterprise or if somehow you have not deployed one, you have the opportunity to deploy or shore up this valuable company resource. As a matter of fact, of all the concepts in this book, the data warehouse would be the first entity to bring to standard. There are innumerable subtleties and varieties in architecture and methods. Many are appropriate in context of the situation and the requirements. I will certainly delve into my data warehouse architecture proclivities later in the chapter.

First, I want to explain that the data warehouse is an example of an analytic database. It is not meant to be operational in the sense of running the business. It is meant to deliver on the analytics discussed in Chapter 3 and on reporting, whether straightforward basic reports or deep and complex predictive analytics.

In this context, the data warehouse will be the generalized, multi-use, multi-source analytic database for which there may or may not be direct user access. This data warehouse distributes data to other analytic stores, frequently called data marts. The data warehouse of this book sits in the No-Reference architecture at the same level as many of the other analytic stores. It is possible for an enterprise to have multiple data warehouses by this definition.

You should primarily make sure the data warehouse is well suited as a cleansing,1 distribution, and history management system. Beyond that, be prepared to supplement the data warehouse with other analytic stores best designed for the intended use of the data. At the least, where possible, you want the smaller analytic stores to procure their data from the data warehouse, which is held to a data quality standard. Many times the non-data warehouse analytic stores will be larger than the data warehouse itself, will contain other data, and cannot be sourced from the data warehouse.

Companies have already begun to enter the “long tail” with the enterprise data warehouse. Its functions are becoming a steady but less interesting part of the information workload. While the data warehouse is still a dominant fixture, analytic workloads are finding their way to platforms in the marketplace more appropriate to the analytic workload.

When information needs were primarily reporting from operational data, the data warehouse was the center of the known information management universe by a long shot. While we entertained the notion that we were doing analytics in the data warehouse, competitive pressures have trained the spotlight on what true analytics is all about. Many warehouses have proven not to be up to the task.

And it’s analytics, not reporting, that is forming the basis of competition today. Rearview-mirror reporting can support operational needs and pay for a data warehouse by virtue of being “essential” in running the applications that feed it. However, the large payback from information undoubtedly comes in the form of analytics.

Regarding the platform for the data warehouse, if the analytics do not weigh down a data warehouse housed on a non-analytic DBMS (not an appliance, not columnar, all HDD storage),2 big data volumes will. As companies advance their capabilities to utilize every piece of information, they are striving to get all information under management.

One site’s data warehouse is another site’s data mart is another site’s data warehouse appliance or analytic database. Get your terminology straight internally, but find a way to map to a consistent usage of the terms for maximum effectiveness.

All these analytic stores need to go somewhere. They need a platform. This is the set of choices for that platform:

1. Hadoop – good for unstructured and semi-structured data per Chapter 11, a poor choice for a structured-data data warehouse

2. Data Warehouse Appliance – described in this chapter, tailor-made for the structured-data analytic store; can be the data warehouse platform for small-scale or mid-market warehouses; highly scalable

3. Data Appliance – described in this chapter; highly scalable; it wants your operational workload, too; can be the data warehouse platform for small-scale or mid-market warehouses

4. Specialized RDBMS – Some specialized RDBMSs are not sold as appliances, so I’m putting them in a separate category; good for the analytic workload; many of these are columnar, which is covered in Chapter 5

There is a platform that is “best” for any given analytic workload. While I’m at it, there is one vendor solution that is best as well. However, you can merge workloads, creating new workloads, so a workload is not really a fixed unit of analysis. The “not best” possibilities in a modern enterprise are very large for any defined workload. As one analytic leader told me last month, “There are a million ways that I can manage this data.” That statement is not just hyperbole!

The goal is to get as close as possible to the best solution without losing too much value by delaying. “Analysis paralysis,” as they say, is alive and well in enterprise architecture decisions! This book is meant to cut through the delays and get you into the right category as soon as possible. You can cut out many cycles with this information and set about making progress quickly. So how do we define that “best”/good/good enough architecture?

When making product decisions for an analytic environment, the database platform is of utmost importance. It should be chosen with care and with active discernment of the issues and marketing messages. The landscape for this selection has changed. It is now much more of a value-based proposition. The architectural offerings have moved beyond the traditional Massively Parallel Processing (MPP) or Clustered Symmetric Multiprocessing, which have been the standards for many years.

The platform decision could come about based on new initiatives. However, it is equally viable to reassess the platform when your current one is going to require major new investment or simply is not reaching the scale that you require. It is also prudent for every shop to periodically reevaluate the marketplace to make sure that the current direction is the right one in light of the new possibilities. Now is a time when the possibilities, with data warehouse appliances, merit such a reevaluation.

You will create a culture around your selected platform. You will hire and train your team to support it. It will become the primary driver for hardware and other software selections. Your team will attend user group meetings and interact with others using the DBMS for similar purposes. You will hire consultancy on the DBMS and you will research how to most effectively exploit the technology. You will need vendor support and you will want the vendor to be adding relevant features and capabilities to the DBMS that are needed for data warehousing in the future. Evaluate candidate platforms against criteria such as the following; a scoring sketch follows the list.

• Scalable – The solution should be scalable in both performance capacity and incremental data volume growth. Make sure the proposed solution scales in a near-linear fashion and behaves consistently with growth in all of: database size, number of concurrent users, and complexity of queries. Understand additional hardware and software required for each of the incremental uses.

• Powerful – Designed for complex decision support activity in an advanced workload management environment. Check on the maturity of the optimizer for supporting every type of query with good performance and for determining the best execution plan based on changing data demographics. Check on conditional parallelism and what the causes are of variations in the parallelism deployed. Check on dynamic and controllable prioritization of resources for queries.

• Manageable – The solution should be manageable through minimal support tasks requiring DBA/System Administrator intervention. There should be no need for the proverbial army of DBAs to support an environment. It should provide a single point of control to simplify system administration. You should be able to create and implement new tables and indexes at will.

• Extensible – Provide flexible database design and system architecture that keeps pace with evolving business requirements and leverages existing investment in hardware and applications. What is required to add and delete columns? What is the impact of repartitioning tables?

• Interoperable – Integrated access to the web, internal networks, and corporate mainframes.

• Recoverable – In the event of component failure, the system must keep serving the business. It also should allow the business to selectively recover the data to points in time – and provide an easy-to-use mechanism for doing this quickly.

• Affordable – The proposed solution (hardware, software, services) should provide a relatively low total cost of ownership (TCO) over a multi-year period.

• Flexible – Provides optimal performance across the full range of normalized, star, and hybrid data schemas with large numbers of tables. Look for proven ability to support multiple applications from different business units, leveraging data that is integrated across business functions and subject areas.

• Robust in Database Management System Features and Functions – DBA productivity tools, monitoring features, parallel utilities, a robust query optimizer, locking schemes, security methodology, intra-query parallel implementation for all possible access paths, chargeback and accounting features, and remote maintenance capabilities.
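
To make the comparison concrete, here is a minimal sketch of how criteria like these might be rolled into a weighted scorecard during platform evaluation. The criteria names, weights, and scores below are illustrative assumptions only; substitute the ones that matter to your shop.

# Illustrative weighted scorecard for platform evaluation.
# Weights and scores are made-up placeholders; replace with your own.
criteria_weights = {
    "scalable": 0.20, "powerful": 0.15, "manageable": 0.10, "extensible": 0.10,
    "interoperable": 0.05, "recoverable": 0.10, "affordable": 0.15,
    "flexible": 0.05, "robust_features": 0.10,
}
assert abs(sum(criteria_weights.values()) - 1.0) < 1e-9  # weights total 100%

# Each vendor scored 1-10 per criterion (hypothetical numbers).
vendor_scores = {
    "Vendor A": {"scalable": 9, "powerful": 8, "manageable": 6, "extensible": 7,
                 "interoperable": 8, "recoverable": 9, "affordable": 5,
                 "flexible": 7, "robust_features": 8},
    "Vendor B": {"scalable": 7, "powerful": 7, "manageable": 9, "extensible": 8,
                 "interoperable": 7, "recoverable": 7, "affordable": 9,
                 "flexible": 8, "robust_features": 6},
}

def weighted_total(scores, weights):
    """Sum of score times weight across all criteria."""
    return sum(scores[c] * w for c, w in weights.items())

for vendor, scores in sorted(vendor_scores.items(),
                             key=lambda kv: -weighted_total(kv[1], criteria_weights)):
    print(f"{vendor}: {weighted_total(scores, criteria_weights):.2f}")

Per the negotiating tips later in this chapter, keep the weightings themselves internal; share only the criteria with vendors.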

And then there’s price. For the cost of storing data (i.e., a certain number of terabytes)—which is a lousy way of analyzing a workload3—Hadoop is going to be the cheapest, followed by the data warehouse appliance, data appliance, or specialized RDBMS, followed by the traditional relational database management system. Functionality follows price.

There are many pitfalls in making this determination, not the least of which is the tendency to “over-specify” a platform to account for contingencies.4 Many appliances sell in a stepwise manner such that you buy the “1 terabyte” or “10 terabyte,” etc. model. Over-speccing an appliance can quickly obviate its cost advantage over the RDBMS.

What are we procuring for?

I used to recommend the procurement of a system that will not have to be altered, in terms of hardware, for 3 years. The pace of change has increased and I now recommend procuring for a period of about 1 year. In other words, you should not have to add CPU, memory (RAM), disk, or network components for 1 year. This is a good balance point between going through the organizational trouble to procure and add components and keeping your powder dry for market change, advancement, and workload evolution. If your system is in the cloud, providing rapid elasticity, this concern is passed to the cloud provider, but you still need to be on top of this.

Since the price of these systems is highly subject to change, I will seek to share specific information on pricing in this chapter’s QR Code.

Tips for Getting the Best Deal

• Care about what matters to your shop

• Do all evaluations onsite

• Use your data in the evaluations

• Competition is key; do not tell a vendor they have been selected until after commercial terms are agreed

• Don’t share weightings with vendor

• Maintenance runs approximately 20% and is negotiable

• Care about Lifetime Cost!

• Offer to be a public reference, it’s a useful bargaining chip

• Long-term relationship is important

• Don’t try to shaft the vendor, spread good karma

• Best time to negotiate (especially with a public company) is quarter- and year-end

Even if your store is in the “cloud,” you MUST choose it wisely. Being “in the cloud” does not absolve you from data platform selection.

Data Warehouse Architecture

No matter where you put your data warehouse, your decision points have only begun if you say you are doing an “Inmon”5 or “Kimball”6 data warehouse.7 As a matter of fact, this “decision” can be a bit of a red herring and can impede progress by giving a false illusion of having everything under control.

Here are some of the post-Inmon/Kimball decisions that must be made for a data warehouse:

• Build out scope

• Business involvement

• Data access options and manner of selection – by use, by enterprise, by category

• Data retention and archival

• Definition of data marts – units of work or physical expansiveness

• Granularity of data capture

• Integration strategy – virtual and physical

• Metadata handling

• Modeling technique(s)

• Need, utility, and physical nature of data marts

• Operational reporting and monitoring

• Persistence, need, and physical nature of data staging

• Physical instantiation of operational data stores – single-source, multi-source

• Program development team engineering

• Source work effort distribution – source team, data warehouse team, shared

• Technology selection process – framework, best-of-breed

• Use of ETL tool in ETL processes

• Use of operational data stores for source systems – selective, complete

The best approach is an enterprise architecture that includes a strong data warehouse built with an agile development methodology (with some up-front work and use of standards), broadly defined enterprises, and virtualization techniques used to unite multiple enterprises.

The Data Warehouse Appliance

The data warehouse appliance (DWA) is a preconfigured machine for an analytic workload. Despite the name “data warehouse” appliance, as mentioned before, the DWA only makes good sense for a general-purpose data warehouse when the Achilles heels of the DWA, discussed later, are acceptable.

Non-data warehouse analytic workloads, however, are usually totally appropriate for a data warehouse appliance, a data appliance, or a specialized analytic store.

The data warehouse machine preconfiguration concept is not new. Teradata, Britton Lee, and Sequent are late 1980s examples of this approach. Hardware and software vendors have commonly preconfigured hardware, OS, DBMS, and storage to relieve the client of those tedious commodity tasks (as well as the risk of not getting them right). Well-worn combinations such as those put forward for TPC benchmarks are available from either the hardware or software vendor (to “ready prompt”) in most cases. However, some of the new aspects of the modern data warehouse appliance are the use of commodity components and open source DBMSs (or, in vendor terms, DBMS alternatives) for a low total cost of ownership. These open source DBMSs provide a starting point for basic database functionality, and the appliance vendors focus on data warehouse-related functionality enhancements.

The current parallel RDBMS developments had their origins in 1980s university research on data partitioning and join performance. Resultant data movement or duplication to bring together result sets can be problematic, especially when gigabytes of data need to be scanned.

Analytic platforms, including data warehouses, have gone through phases over the years, including Symmetric Multiprocessing, Clustering, and Massively Parallel Processing.

Symmetric Multiprocessing

One of the early forms of parallel processing was Symmetric Multiprocessing or SMP. The programming paradigm was the same as that for uniprocessors. However, multiple CPUs could share the load using one or more of the forms of parallelism. A Least Recently Used (LRU) cache, kept in each CPU, makes this option more viable. SMP tended to hit a saturation point around 32–64 CPUs, when the fixed bandwidth of the bus became a bottleneck.
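
As an aside on the caching just mentioned, the least-recently-used policy itself is simple to express. Here is a minimal, hypothetical Python sketch of an LRU cache that evicts whichever entry has gone unused the longest; it is an illustration of the policy only, not how a CPU cache or DBMS buffer pool is actually built.

from collections import OrderedDict

class LRUCache:
    """Toy least-recently-used cache: evicts the least recently accessed key."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)          # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

cache = LRUCache(capacity=2)
cache.put("block1", "...data...")
cache.put("block2", "...data...")
cache.get("block1")                 # block1 becomes most recently used
cache.put("block3", "...data...")   # evicts block2, the least recently used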

Clustering

Clustering became a way to scale beyond the single node by using an interconnect to link several intact nodes with their own CPUs, bus, and RAM. The set of nodes is considered a “cluster.” The disks could either be available to all nodes or dedicated to a node. These models are called “shared disk” and “shared nothing” respectively. This was great for fault tolerance and scalability, but eventually the interconnects, with their fixed bandwidth, became the bottleneck around 16–32 nodes.

Massively Parallel Processing

Massively Parallel Processing, MPP, is essentially a large cluster with more I/O bandwidth. There can be up to thousands of processors in MPP. The nodes are still either shared disk or shared nothing. The interconnect is usually in a “mesh” pattern, with nodes directly connected to many other nodes through the interconnect. The MPP interconnect is also faster than the Clustered SMP interconnect. Remote memory cache, called NUMA (non-uniform memory access), was a variant introduced to MPP. DBMSs quickly adapted their software for MPP, and while the interconnect bottleneck was eroding, MPP became a management challenge. And it is expensive.
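
To picture the shared-nothing idea in miniature, the sketch below (a simplification in Python, not any vendor's implementation) hash-distributes rows so that each node owns its own slice of the data, scans and aggregates its slice independently, and a coordinating step merges the small partial results.

from collections import defaultdict

NUM_NODES = 4

def node_for(key):
    """Hash distribution: each row lands on exactly one node (shared nothing)."""
    return hash(key) % NUM_NODES

# Distribute rows of (customer_id, amount) across node-local storage.
node_storage = defaultdict(list)
rows = [(101, 50.0), (102, 75.0), (103, 20.0), (104, 10.0), (101, 30.0)]
for customer_id, amount in rows:
    node_storage[node_for(customer_id)].append((customer_id, amount))

# Each node scans and aggregates only its own slice (in parallel in a real system).
partial_sums = []
for node in range(NUM_NODES):
    local_total = sum(amount for _, amount in node_storage[node])
    partial_sums.append(local_total)

# A coordinating step merges the small partial results into the final answer.
print("total sales:", sum(partial_sums))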

We can see that each step was an evolutionary advancement on the previous step.

A data warehouse appliance is preconfigured hardware, operating system, DBMS, and storage, plus the proprietary software that makes them all work together.

It became apparent that a real analytic workload could go beyond MPP with a strong price-performance push. Thus the data warehouse appliance was born. I call it “MPP Plus” because each appliance is based on MPP, but adds to it in divergent ways. I’ll now showcase 3 data warehouse appliances (actually 1 DWA and 2 families of DWAs) to demonstrate the “plus” in “MPP plus.” These are only meant as examples of the divergence of approach in data warehouse appliances.

IBM Netezza Appliances

IBM’s Netezza is a leading data warehouse appliance family. Its preconfigurations range from 1 TB to multi-petabytes of (compressed) data. Netezza’s philosophy is that parallelism is a good thing, and it takes parallelism to a new level. It utilizes an SMP node and up to thousands of single-CPU SPUs (“snippet processing units”) configured in an MPP arrangement in the overall architecture, referred to as “AMPP” for asymmetric massively parallel processing. The SPUs are connected by Gigabit Ethernet, which serves the function of the interconnect.

There are hundreds of SPUs in a rack. Each rack fully populated contains a few terabytes. The racks stand over 6’ tall. The DBMS is a derivative of Postgres, the open source DBMS, but has been significantly altered to take advantage of the performance of the architecture.

What’s inside are multi-way host CPUs and a Linux Operating System. These are commodity class components and this is where the cost savings to a customer come from. Cost savings can also come from the lowered staff requirements for the DBA/System Administration roles. The use of commodity components is one important introduction from Netezza.

The architecture is shared nothing, but there is a major twist. The I/O module is placed adjacent to the CPU. The disk is directly attached to the SPU processing module. More importantly, logic is added to the CPU with a Field Programmable Gate Array (FPGA) that performs record selection and projection, processes usually reserved for much later in the query cycle in other systems. The FPGA and CPU are physically connected to the disk drive. This is the real key to Netezza query performance success—filtering at the disk level. This logic, combined with the physical proximity, creates an environment that will move data the least distance to satisfy a query. The SMP host will perform final aggregation and any merge sort required.
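
A rough way to picture filtering at the disk level (purely illustrative Python; the real logic lives in the FPGA hardware) is that each storage unit applies the selection predicate and column projection as data streams off disk, so only qualifying columns of qualifying rows ever travel to the host.

# Rows as read off one storage unit's disk: (customer_id, region, amount, notes)
spu_rows = [
    (101, "WEST", 50.0, "..."),
    (102, "EAST", 75.0, "..."),
    (103, "WEST", 20.0, "..."),
]

def spu_scan(rows, predicate, projection):
    """Selection and projection applied at the storage unit, before any data moves."""
    for row in rows:
        if predicate(row):
            yield projection(row)

# Only (customer_id, amount) for WEST rows ever leaves the storage layer.
to_host = list(spu_scan(spu_rows,
                        predicate=lambda r: r[1] == "WEST",
                        projection=lambda r: (r[0], r[2])))

# The SMP host performs the final aggregation on the much smaller stream.
print("WEST total:", sum(amount for _, amount in to_host))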

Enough logic is currently in the FPGA to make a real difference in the performance of most queries. However, there is still upside with Netezza as more functionality can be added over time to the FPGA.

All tables are striped across all SPUs and no indexes are necessary. Indexes are one of the traditional options that are not provided with Netezza. All queries are highly parallel table scans. Netezza will clearly optimize the larger scans more. Netezza does provide use of highly automated “zone map” and materialized view functionality for fast processing of short and/or tactical queries.
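
Zone maps are easy to picture: the system keeps the minimum and maximum value of a column for each block of rows, so a scan can skip whole blocks that cannot possibly contain matches, with no index required. A minimal, hypothetical sketch:

# Each storage block keeps a tiny (min, max) summary of the order_date column.
blocks = [
    {"min": "2013-01-01", "max": "2013-03-31", "rows": ["...Q1 rows..."]},
    {"min": "2013-04-01", "max": "2013-06-30", "rows": ["...Q2 rows..."]},
    {"min": "2013-07-01", "max": "2013-09-30", "rows": ["...Q3 rows..."]},
]

def blocks_to_scan(blocks, lo, hi):
    """Skip any block whose min/max range cannot overlap the query range."""
    return [b for b in blocks if not (b["max"] < lo or b["min"] > hi)]

# A query on May orders touches only the second block; no index required.
print(len(blocks_to_scan(blocks, "2013-05-01", "2013-05-31")), "block(s) scanned")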

Teradata Data Warehouse Appliances

Teradata has taken its database management system and rolled it into a family of data warehouse appliances.

Teradata appliances utilize the Teradata DBMS. Throughout Teradata’s existence, its solutions have been at the forefront of innovation in managing the tradeoffs in designing systems for data loading, querying, and other competing demands.

One of the keys is that all database functions in Teradata (table scan, index scan, joins, sorts, insert, delete, update, load, and all utilities) are done in parallel all of the time. All units of parallelism participate in each database action. There is no conditional parallelism within Teradata.

Also of special note is the table scan. One of Teradata’s main features is a technique called synchronous scan, which allows scan requests to “piggy back” onto scans already in process. Maximum concurrency is achieved through maximum leverage of every scan. Teradata keeps a detailed enough profile of the data under management that scans efficiently scan only the limited storage where query results might be found.8
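
The piggy-backing idea can be sketched simply. In the toy version below (an illustration of the concept, not Teradata's implementation), a query that arrives mid-scan joins at the current block, reads to the end, and then wraps around to cover the blocks it missed, so one pass over the disk can satisfy several queries.

def shared_scan(num_blocks, join_positions):
    """For each query (identified by the block at which it joins the in-flight scan),
    return the order in which it sees blocks: current position to the end, then wrap."""
    schedules = {}
    for pos in join_positions:
        schedules[pos] = list(range(pos, num_blocks)) + list(range(0, pos))
    return schedules

# Query A starts a scan at block 0; queries B and C arrive mid-scan and piggy-back.
for start, order in shared_scan(num_blocks=6, join_positions=[0, 2, 4]).items():
    print(f"query joining at block {start} reads blocks {order}")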

The Teradata optimizer intelligently runs steps in a query in parallel wherever possible. For example, for a 3-table join requiring 3 table scans, Teradata starts all three scans in parallel. When the scans of tables B and C finish, it begins joining B and C while the scan of table A completes.
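
A toy version of that inter-step parallelism (hypothetical, not the Teradata optimizer) can be expressed with concurrent futures: all three scans are launched at once, and the join of B and C proceeds as soon as those two scans return, without waiting on the scan of A.

from concurrent.futures import ThreadPoolExecutor
import time

def scan(table, seconds):
    """Stand-in for a parallel table scan; returns the 'scanned' rows."""
    time.sleep(seconds)
    return f"rows({table})"

with ThreadPoolExecutor() as pool:
    scan_a = pool.submit(scan, "A", 0.3)   # slowest scan
    scan_b = pool.submit(scan, "B", 0.1)
    scan_c = pool.submit(scan, "C", 0.1)

    # Join B and C as soon as both finish, while the scan of A is still running.
    bc = f"join({scan_b.result()}, {scan_c.result()})"
    final = f"join({scan_a.result()}, {bc})"

print(final)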

Teradata systems do not share memory or disk across the nodes—the collections of CPUs, memory, and bus. Sharing disk and/or memory creates overhead. Sharing nothing minimizes disk access bottlenecks.

The Teradata BYNET, the node-to-node interconnect, which scales linearly to over a thousand nodes, has fault-tolerant characteristics that were designed specifically for a parallel processing environment.

Continual feeding without table-level locks can be done with Teradata utilities, using multiple feeders at any point in time. And again, the impact of the data load on the resources is customizable. The process ensures no input data is missed regardless of the allocation.

Teradata appliances use the same DBMS as Teradata’s other platforms.

The Teradata Data Warehouse Appliance

The Teradata Data Warehouse Appliance is the Teradata appliance family flagship product. With four MPP nodes per cabinet and scaling to many cabinets with over a dozen terabytes each, the Teradata Data Warehouse Appliance can manage up to hundreds of terabytes.

The Teradata Data Warehouse appliance can begin at 2 terabytes of fully redundant user data on 2 nodes and grow, node-by-node if necessary, up to dozens of nodes. The nodes can be provided with Capacity on Demand as well, which means the capacity can be configured into the system unlicensed until it is needed.

The Teradata Data Mart9 Appliance

The Teradata Data Mart Appliance is a single node, single cabinet design with a total user data capacity of single-digit terabytes. It is a more limited capacity equivalent of the Teradata Data Warehouse Appliance, limited to that single node. So, likewise, it is suitable for the characteristics of an analytic workload that will not exceed this limit.

It should be noted that a single node environment comes with the potential for downtime in the unlikely event that the node fails—there is no other node to cover for the failure.

The Teradata Extreme Data Appliance

The Teradata Extreme Data Appliance is part of the Teradata appliance family. It outscales even the Teradata Active Enterprise Data Warehouse (Active EDW) machine, into the petabytes. A system of this size would have lower concurrent access requirements because the access is spread out across the large volume of data. The Teradata Extreme Data Appliance is designed with this reality in mind.

It is designed for high-volume data capture such as that found in clickstream capture, call detail records, high-end POS, scientific analysis, sensor data, and any other specialized system where the performance of straightforward, nonconcurrent analytical queries is the overriding selection factor. It also will serve as a surrogate for near-line archival strategies that move interesting data to slow retrieval systems. The Extreme Data Appliance will keep this data online.

ParAccel Analytic Database

As a final example of how vendors are delivering data warehouse appliances, here is a little information about the ParAccel Analytic Platform from Actian Corporation.

ParAccel has elements of many of the above platform categories in one platform.

ParAccel is a columnar database (discussed in Chapter 5). Being columnar with extensive compression, which packs the data down on disk, strongly minimizes the I/O bottleneck found in many of the contenders for the analytic workload.

ParAccel architecture is shared-nothing massively-parallel, the scalable architecture for the vast majority of the world’s largest databases.

Another aspect of ParAccel is that it allows for full SQL. It also allows for third-party library functions and user-defined functions. Together, these abilities allow a ParAccel user to do their analytics “in database,” utilizing and growing the leverageable power of the database engine and keeping the analysis close to the data. These functions include Monte Carlo, univariate statistics, (multiple) regression, time series, and many more. This is most of the functionality of dedicated data mining software.
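
To convey what “in database” means in practice, here is a minimal sketch, using generic SQL against SQLite purely as a stand-in (this is not ParAccel syntax or its function library), in which a simple linear regression is reduced to aggregates computed inside the engine, so only a handful of numbers, rather than every row, returns to the analyst.

# Illustrative sketch: the table, columns, and data are made up for the example.
import sqlite3  # stand-in engine; a real deployment would use the warehouse's driver

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)])

# One pass over the table inside the engine returns just five aggregates.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(ad_spend), SUM(revenue), "
    "SUM(ad_spend * ad_spend), SUM(ad_spend * revenue) FROM sales"
).fetchone()

# Closed-form least-squares fit from the aggregates, computed client-side.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(f"revenue is approximately {slope:.2f} * ad_spend + {intercept:.2f}")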

Perhaps the feature that makes it work best for analytics is its unique accommodation of Hadoop. Without the need to replicate Hadoop’s enormous data, ParAccel treats Hadoop’s data like its own. With a special connector, ParAccel is able to see and utilize Hadoop data directly. The queries it executes in Hadoop utilize fully parallelized MapReduce. This supports the information architecture, suggested below, of utilizing Hadoop for big data, ParAccel for analytics, and the data warehouse for operational support. It leverages Hadoop fully without performance overhead.

Connectors to Teradata and ODBC sources also make it possible to see and utilize other data of interest where the analytics will be performed.

ParAccel offers “parallel pipelining,” which fully utilizes the spool space without pausing when a step in the processing is complete. ParAccel is a compiled architecture on scale-out commodity hardware. With in-memory and cloud options, a growing blue-chip customer base, and, most importantly, a rich feature base for analytics and integration with Hadoop, ParAccel is built to contend for the analytic workload.

Achilles Heels of the Data Warehouse Appliance

The discussion of data warehouse appliances would not be fully wrapped up without noting some common Achilles heels. I want to be sure you understand them. I have said the (non-appliance) RDBMS is generally the most expensive form of storage. It requires more manpower than the appliance and is slower. I’ve said the appliance is “MPP Plus” and the installation is easier.

Data warehouse appliances’ Achilles heels include:

• Concurrency issues

• Not designed to play the role of a feeding system to downstream stores

• Restrictions in upgrading; need to upgrade stepwise

• Lack of openness of the platform10

• Lack of redeployability of the platform to broad types of workloads

• False confidence in “throwing data at the platform” working out well

These are not uniformly true for all appliances, of course. However, these are very important to understand, not only in potentially eliminating them as the data warehouse platform, but also in making sure the “mart” workload is not encroaching on these limitations.

Data Appliances and the Use of Memory

Finally, there is an even newer class of appliances, which don’t conform well to a single label. I’ll call them Data Appliances since they are intended to support operational as well as analytical workloads, unlike the ill-named Data Warehouse Appliance. These Data Appliances are machines that run relational databases. The Oracle Exadata Machine11 and SAP HANA are the most prominent examples and are likely to be centerpiece examples of this approach to data management for decades to come.

While they both approach the challenge of data storage from a non-HDD perspective, they do it differently. HANA is an all in-memory appliance (disk is used for failover), whereas Exadata runs the Oracle DBMS on an intelligently combined SSD and in-memory platform with high amounts of SSD cache.

Innovation has occurred in multiple areas such as:

• Hardware innovations such as multi-core processors that truly maximize the value of hardware

• Faster analytics—faster access to complex calculations that can be utilized in support of immediate and appropriate business action

• The price of memory has precipitously dropped

The result of the innovation is the ability to use “more of” something that we have used for quite some time, but only for a very small slice of data—memory.

In-memory capabilities will be the corporate standard for the near future, especially for traditional databases, where disk I/O is the bottleneck. In-memory based systems do not have disk I/O. Access to databases in main memory is up to 10,000 times faster than access from storage drives. Near-future blade servers will have up to 500 gigabytes of RAM. Already systems are being sold with up to 50 terabytes of main memory. Compression techniques can make this effectively 10x – 20x that size.
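
A quick back-of-envelope calculation with the figures cited above shows how compression stretches a large memory footprint:

# Back-of-envelope sizing using the figures cited in the text.
raw_memory_tb = 50                               # main memory in a large configuration
compression_low, compression_high = 10, 20       # effective compression range

effective_low = raw_memory_tb * compression_low    # 500 TB of user data
effective_high = raw_memory_tb * compression_high  # 1,000 TB, i.e., a petabyte
print(f"{raw_memory_tb} TB of RAM holds roughly {effective_low}-{effective_high} TB of data")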

As with columnar capabilities, I also expect in-memory to be a much more prominent storage alternative in all major DBMSs. DBMSs will offer HDD, SSD, and in-memory storage.

SAP BusinessObjects introduced in-memory databases in 2006 and was the first major vendor to deliver in-memory technology for BI applications. There are currently many database systems that rely primarily on main memory for data storage, and few vendors would claim to be absent any in-memory capabilities or a roadmap with strong near-term in-memory capabilities.

While you should never use a data warehouse appliance for an operational workload, a data appliance will be much more readily adopted for an operational environment than as an analytic store. From an analytic store perspective, the data appliance may be a “piggyback” decision in the enterprise, useful for workloads where real-time, delay-free access to data can be utilized and delays would have a measurable negative impact on the business. Keep an eye on the QR Code for updates on in-memory systems.

Action Plan

1. Sketch your current post-operational environment, showing how the data warehouse is juxtaposed with data warehouse appliances, relational marts, Hadoop, and column databases

2. Analyze where the data warehouse, if it’s on a traditional relational DBMS, has workloads that are not performing as well as needed; consider moving them off to a data warehouse appliance

3. Consider if a data warehouse appliance is appropriate for housing the data warehouse

4. Make sure the data warehouse is, at the very least, cleansing and distributing data and storing history

5. Is a data appliance on the horizon for handling a lot of enterprise storage needs and could that include the analytic workload?

6. Put those workloads where real-time, delay-free access to data can be utilized, and where delays will have measurable negative impact on the business, into systems with a large component of in-memory storage

www.mcknightcg.com/bookch6


1Per the standard from Chapter 4 on Data Quality.

2This applies to implementations of DBMS that have these capabilities, but the capabilities are not enabled.

3Cost per gigabyte per second is a better technical measure.

4I’ve had clients who have had 3 levels of management increase the spec “just in case”—a problem somewhat mitigated by cloud architectures.

5Bill Inmon, the “father of the data warehouse” and a proponent of up-front planning.

6Ralph Kimball, author of “The Data Warehouse Toolkit,” known for dimensional modeling and expanding the data warehouses incrementally.

7Most common vendor mistruth is a variant of “just throw all your data on our platform and everything will be fine.”

8“Teradata Intelligent Scanning.”

9Data “Mart” (vs. Warehouse) is a product label only and is meant to address scale of the project and not refer to the “polar opposite” of a Data Warehouse.

10I didn’t want to say “proprietary” but many will for this Achilles heel; “commodity components” aside, some configurations make third-party support more difficult.

11Occasionally, Oracle will reject the appliance label for Exadata, saying it is an “engineered system.”