Information Management: Strategies for Gaining a Competitive Advantage with Data (2014)

Chapter 8. Data Stream Processing

When Storing the Data Happens Later

Processing data as it occurs in the environment is an excellent approach for high velocity data when time is of the essence.

Keywords

business intelligence; data stream processing; complex event processing; real-time business intelligence; event stream processing

This is the real chapter on real-time access. It doesn’t get much more real-time than sub-millisecond in-memory processing with Data Stream Processing. Data Stream Processing (DSP)1 can hardly be considered a data store alongside the data warehouses, analytical appliances, columnar databases, big data stores, etc. described in this book since it doesn’t actually store data. However, it is a data processing platform. Data is only stored in data stores for processing later anyway, so if we can process without the storage, we can skip the storage.

You will often hear DSP associated with data that is not stored. While the data is not stored in order to facilitate the DSP itself, quite possibly we will use DSP because it is real-time and handles complex (multiple-stream) analysis, and still store the data in a database for its other processing needs. Sometimes, however, the value of the data is primarily to serve DSP, and at hundreds of thousands of events per second, storing the data may not be needed.

The decision to process data with DSP and the decision to store the data in a database/file are separate decisions.

When it is stored, data processed by DSP is increasingly a best fit for a Hadoop system, given its high velocity and typically unstructured nature.

With permission, parts of this chapter have been pulled from “Stream Processing” by William McKnight, published on BeyeNETWORK.com. Read the full article at: http://www.b-eye-network.com/view/15968.

The information management paradigm of the past decade has been to capture data in databases to make it accessible to decision makers and knowledge workers. To reduce query loads and processing demands on operational systems, data from operational systems is typically transferred into databases where it becomes part of the information value chain.

In the value chain that leads from data availability to business results (data→information→knowledge→action→results), making data available to end users is just a part of the cycle. The real goal of information management is to achieve business results through data.

The information value chain is data→information→knowledge→action→results

One technology that has made information more accessible, albeit in a different paradigm, is stream processing. Stream processing, along with other forms of business intelligence, is known as operational business intelligence because it happens to data in operational systems. With stream processing, data is processed before it is stored (if it is ever stored), not after.

Data velocity can severely tax a store-then-process solution, thereby limiting the availability of that data for business intelligence and business results. Stream processing brings the processing into real-time and eliminates data load cycles (at least for that processing). It also eliminates the manual intervention that would be required if the data were just made available, but not processed.

Stream processing is often the only way to process high velocity and high volume data effectively.

Since all data can contribute to the information management value chain and speed is essential, many companies are turning to processing directly on the data stream with a process-first approach. It is more important to process data than to store it, and with data stream processing, multiple streams can be analyzed at once. It has become very common to spread the operational workload around by processing different transactions of the same type in multiple systems. With stream processing, all of these streams can be processed at the same time. The added complexity of multiple streams is sometimes referred to as complex event processing (CEP),2 which is a form of DSP.

Many organizations have redundant processing systems. Think of the systems financial services organizations have in place to process trades. Several trades executed by a customer could be processed on different systems all over the world. Without CEP, it would be impossible to draw real-time inferences from these trades by the same customer.

CEP is looking across the streams for opportunities and threats.

Uses of Data Stream Processing

Wash trades, which happen when an investor simultaneously sells and buys shares in order to artificially increase trading volume and thus the stock price, are illegal, as are many other simultaneous trades meant to overload a system and catch it unaware. It is not just good business practice and deterring fraud that are driving organizations to put mechanisms in place to stop fraudulent trades. Regulations have also been created that impose this type of governance on financial services organizations.

According to the Wall Street Journal, “U.S. regulators are investigating whether high-frequency traders are routinely distorting stock and futures markets by illegally acting as buyer and seller in the same transactions, according to people familiar with the probes. Investigators also are looking at the two primary exchange operators that handle such trades, CME Group Inc., and IntercontinentalExchange Inc., the Atlanta company that in December agreed to purchase NYSE Euronext for $8.2 billion, the people said. Regulators are concerned the exchanges’ systems aren’t sophisticated enough to flag or stop wash trades, the people said” (Patterson et al., 2013).

CME Group, for example, famously has DSP technology: TIBCO StreamBase (Schmerken, 2009). It will need to convince US regulators that it is using it well to prevent wash trades.

Financial services and capital markets companies are leading the charge into data stream processing, specifically complex event processing.

CEP allows a financial institution to look “across” all of its streams/systems/trades simultaneously for patterns that can trigger an automated action. For example, it can look at a trader’s complete set of trades for the last 5 minutes or for the last 30 trades, regardless of when they occurred. CEP can then look for the same security symbol in multiple trades, or for whatever indications in the transaction data should trigger action. Another common use is looking at overall trader volume to determine whether the trades should be allowed. Individuals can be grouped by association to determine fraud as well.
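As an illustration, here is a minimal sketch in a CQL-style streaming dialect. The Trades stream, its columns, and the exact window syntax are assumptions for illustration; products vary. The first query watches a time-based window, the second a row-based, per-trader window:

-- Traders touching the same symbol more than three times in the last 5 minutes
SELECT trader_id, symbol, COUNT(*) AS trade_count
FROM Trades [RANGE 5 MINUTES]
GROUP BY trader_id, symbol
HAVING COUNT(*) > 3;

-- The same pattern over each trader's last 30 trades, regardless of when they occurred
SELECT trader_id, symbol, COUNT(*) AS trade_count
FROM Trades [PARTITION BY trader_id ROWS 30]
GROUP BY trader_id, symbol
HAVING COUNT(*) > 3;

Either query runs continuously, emitting matches the moment the pattern completes rather than after a load cycle.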

With customer profiling (see the chapter on Master Data Management), it can be determined what the normal behaviors of an individual and a peer group are. Hidden links between accounts and individuals can be discovered. For example, it may be determined that a broker has multiple accounts in the names of several connected people.

This happens outside of DSP, but is used as supporting master data that triggers proper action within DSP.

Such actions resulting from DSP may include:

1. Cancelling the trade(s); after all, they are not completely processed yet

2. Postponing the trade(s) until a manual review is completed

3. Allowing the trade(s)

The primary actions you would seek from DSP (including CEP) are 1 and 2 above. If 100% of transactions fell to 3, it would not be a good use of DSP. DSP is useful when immediate actions are necessary. When possible, you should allow the transactions to hit the database(s) from which they would be picked up, for example, by Master Data Management for distribution; it is more efficient.

The manual review will be used to determine intent and develop a case backed by data. The process needs to be executed quickly and efficiently.

A transaction contains various points of information that are interesting to feed to a Master Data Management hub (see Chapter 7): trade count, trade volume, updates to the trading profile, and an association between the trader and the securities. Once this data gets into MDM, it is available for immediate distribution to the organization.

Every organization has data streams. It’s data “in play” in the organization, having been entered into an input channel and in the process of moving to its next organizational landing spot, if any. Two fundamental approaches are taken with stream processing:

1. Gather the intelligence from the transaction

2. Use the transaction to trigger a business activity

Any data that needs to be acted on immediately should be considered for DSP. High-velocity actionable data is a good candidate for stream processing. In large companies with high-velocity/high-volume transaction data, the successful real-time intervention that stream processing makes possible easily translates into savings of millions of dollars per year.

Health-care companies can analyze procedures and symptoms and bring useful feedback into the network in real time. For example, they can watch for potentially incompatible drugs being distributed together in an emergency situation.

The stream can often be visualized with the DSP tool.

Retailers can make smart next-best offers to their customers based on what the shopper is currently (this moment) experiencing.

Manufacturers can detect anomalous activity on the production line such as changes in tempo, temperature changes, and inventory drawdown.

The U.S. Department of Homeland Security uses DSP to monitor and act on security threats by looking at patterns across multiple systems, including analyzing multiple streams of video for suspicious simultaneous activity. DSP is also being used on the battlefield to spot impending danger based on the paths of numerous objects.

Many types of organizations, including financial services organizations, can analyze streams from suppliers to immediately take up the best (least expensive) offers.

High-volume trading can also use DSP to take advantage of opportune market conditions when every sub-millisecond counts.

Candidate workflows for data stream processing should be analyzed through the lens of the need for real-time processing. There are limitations as to what “other” data can be brought to bear on the transaction and used in the analysis. Integration cannot be physical. After all, these streams have high velocity, and real-time analysis has to be done on a continuous basis. As the worlds covered in this book (MDM, Hadoop, data virtualization, analytic databases, stream processing, etc.) collide, I expect that the MDM hub, containing highly summarized business subject-area information, will be the entity utilized with stream processing, possibly with data virtualization (Chapter 9).

Once other data is incorporated into the analysis, organizations have more to consider, from a risk perspective, regarding whether a transaction should or should not be allowed. While some transactions may appear to be risky, many others could appear suspicious only due to the lack of an up-to-date customer profile, which increasingly will reside in an MDM store. Regardless, the point is that it is very possible to combine stream processing with master data through data virtualization.

Data Stream Processing Brings Power

The business world is changing so fast. Those who can embrace the technological advances, regardless of company size, can certainly level the playing field. DSP brings this kind of power to all who embrace it.

For example, trading algorithms on Wall Street used to be developed over time and utilized for months or quarters. Today, traders need to update and change their algorithms much more rapidly. DSP allows these algorithms to be developed and applied much more quickly and effectively. The key to algorithmic success is knowledge of DSP and the application of “data science,” rather than being a Tier 1 bank. Also, similar to offer analysis, a trading firm can determine the best venue and method for executing its orders.

Stream SQL Extensions

Stream processing offers an SQL-like programming interface that extends SQL with time-series and pattern-matching syntax, allowing analysis over “the last n minutes” or “the last n rows” of data. Processing a “moving window” of cross-system, low-latency, high-velocity data to deliver microsecond insight is what stream processing is all about.

If you’re used to SQL, you may be surprised at the conceptual similarities between a database and a stream from an SQL perspective. Essentially, the stream is treated as a table (referred to as a Stream Object) and queried with extensions to basic SQL.

An important area of SQL extension is time series processing.
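As a minimal sketch, assuming a CQL-style dialect and a hypothetical Quotes stream with symbol and price columns, a time-series query reads much like SQL over a table, except that a window clause bounds the data and the result is re-evaluated continuously as tuples arrive:

-- 3-minute moving average of ORCL, recomputed as new quotes stream in
SELECT AVG(price) AS moving_avg
FROM Quotes [RANGE 3 MINUTES]
WHERE symbol = 'ORCL';

The window, not the query, is what distinguishes this from ordinary SQL: drop the [RANGE 3 MINUTES] clause and it would be a one-time table query.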

The high-volume trading customer’s decision rule example could look like this:

WHEN ORCL price moves outside of its moving average by 4%

AND My portfolio moves up by 2%

AND IBM price moves up by 5% OR ORCL price moves down by 5%

ALL WITHIN any 3-minute interval

THEN buy 100 shares of ORCL and sell 100 shares of IBM

This example utilizes multiple streams (My Portfolio, ORCL share price, IBM share price). It includes the time series (within any 3-minute interval). The logic comprises events in multiple streams and the action to be done is automated: buy ORCL, sell IBM. Of course, this is pseudocode and DSP must actually send the buy and sell transactions to the trading system.
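In a CQL-style dialect, the rule might be sketched as a join across a window on each stream. All stream and column names here are hypothetical, and the derived percentage-change columns are assumed to be computed upstream:

-- Hypothetical per-stream summaries of the last 3 minutes
SELECT 'BUY 100 ORCL, SELL 100 IBM' AS action
FROM OrclMoves [RANGE 3 MINUTES] AS o,
     IbmMoves [RANGE 3 MINUTES] AS i,
     PortfolioMoves [RANGE 3 MINUTES] AS p
WHERE ABS(o.pct_vs_moving_avg) > 4
  AND p.pct_change > 2
  AND (i.pct_change > 5 OR o.pct_change < -5);

Each emitted tuple would then be routed to the trading system as the actual buy and sell orders.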

Imagine holding binoculars to watch the fish pass in several streams simultaneously. This represents the time orientation of DSP, since you can only see the last few seconds of activity in the lens.

A Fraud Detection System (FDS) could look like this:

IF the requested loan amount exceeds $100,000

AND a denied loan application exists for a similar name and address

ALL WITHIN any 2-hour interval

THEN display denial message on dashboard

In this case, you are using the time-series features of the StreamSQL extensions to encompass low-velocity data (loan applications) across a larger time period (2 hours). The only limitation on the amount of data within the analysis is the size of memory.
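A sketch of the same rule in a CQL-style dialect follows; LoanRequests and DeniedApplications are hypothetical streams, and similarity() stands in for whatever fuzzy-matching function a given product provides:

-- Large requests resembling a denial seen within the last 2 hours
SELECT r.name, r.address, r.amount
FROM LoanRequests [RANGE 2 HOURS] AS r,
     DeniedApplications [RANGE 2 HOURS] AS d
WHERE r.amount > 100000
  AND similarity(r.name, d.name) > 0.9
  AND similarity(r.address, d.address) > 0.9;

Each match would drive the denial message on the dashboard.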

Smartphones currently pulse their location every few seconds. If your company’s app has location services turned on, you can use this to your advantage to monitor users’ locations and cross-reference them with current promotions that may be of interest to the customer.

A Location Detection system might look like this:

IF I have an active promotion

AND we can detect customer location

AND customer is within 5 miles of a retail location

AND has purchased in the promotion’s category in the last year

THEN send a text message with the promotion

Tracking people is one thing, and a very profitable one at that. Tracking most everything else is referred to as the “internet of things.” Based on its ability to look across streams and generate activity, DSP will play a huge role in this emerging movement.

In Conclusion

Like the columnar databases and data virtualization discussed in this book, DSP may or may not stand alone as a technology. Increasingly, it is being embedded in other technologies and packaged applications, some described in this book and others in the category of Business Process Management (BPM). Regardless, DSP merits attention: each shop, especially the information management architecture office, should be well aware of the possibilities of DSP as a unique domain and put it to use when real-time and cross-stream processing matters, whether or not data storage matters.

Action Plan

• Analyze workloads to determine which could provide more value if they were truly real-time

• Analyze the opportunities to look across streams of data to create immediate and automated actions

• Analyze the current toolset to determine what DSP capabilities you may already have

www.mcknightcg.com/bookch8

References

1. Patterson S, Strasburg J, Trindle J. Wash trades scrutinized. Wall Street Journal. March 2013. <http://online.wsj.com/article/SB10001424127887323639604578366491497070204.html>.

2. Schmerken I. CME Group picks StreamBase’s CEP platform for options pricing. Wall Street & Technology. June 2009. <http://www.wallstreetandtech.com/electronic-trading/cme-group-picks-streambases-cep-platform/21810083>.


1. Also known as Event Stream Processing.

2. Also known as Concurrent Stream Processing.