
Chapter 9. Big Data and Cloud Computing

Abstract

Big Data is largely a buzzword in IT right now. It was coined by Forrester Research as a wrapper around existing data mining, data management, and other extensions of existing technology on current hardware. The goal is to use mixed tools on larger volumes of several different forms of data brought together under one roof. Along with this approach to data, we are also concerned with cloud computing, a public or private Internet-based network that replaces the traditional hardwired network within a company.

Keywords

Forrester Research; V-list; cloud computing; Big Data; data mining

Introduction

The term Big Data was invented by Forrester Research, along with the four V buzzwords (volume, velocity, variety, variability), in a white paper. It has come to describe an environment that uses a mix of the database models we have discussed so far and tries to coordinate them.

There is a “Dilbert” cartoon where the pointy-haired boss announces, “Big Data lives in the Cloud. It knows what we do” (http://dilbert.com/strips/comic/2012-07-29/). His level of understanding is not as bad as usual for this character. Forrester Research created a definition with the catchy buzz phrase “the four V’s: volume, velocity, variety, variability” that sells the fad. Notice that value, veracity, validation, and verification are not on Forrester’s V-list.

The first V is volume, but that is not new. We have had terabyte and petabyte SQL databases for years; just look at Wal-Mart’s data warehouses. A survey in 2013 from IDC claims that the volume of data under management by the year 2020 will be 44 times greater than what was managed in 2009. But this does not mean that the data will be concentrated.

The second V is velocity. Data arrives faster than ever before, thanks to improved communication systems. In 2013, Austin, TX, was picked by Google as the second U.S. city to get its fiber-optic network. I get a daily summary of my checking account transactions; only the paper checks that I mail take longer than an hour to clear. In the 1970s, the Federal Reserve was proud of 24-hour turnaround.

The third V is variety. The sources of data have increased. Anyone with a cellphone, tablet, or home computer is a data source today. One of the problems we have had in the database world is that COBOL programmers came to tiered architectures with a mindset that assumes a monolithic application: the computation, data management, and data presentation are all in one module. This is why SQL dialects still have string functions to convert temporal data into local presentation formats, to put currency symbols and punctuation in money, and so forth. Today, there is no way to tell what device the data will be displayed on or who the end user will be. We need loosely coupled modules with strong cohesion more than ever.
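As a minimal sketch of that separation (in Python; the function names and locale are illustrative assumptions, not anything from this chapter), the data layer keeps neutral types and only the presentation edge applies local formats:

import locale
from datetime import date
from decimal import Decimal

def fetch_order():
    # Data layer: neutral types only; ISO dates and exact decimals,
    # with no currency symbols or display punctuation baked in.
    return {"ordered_on": date(2013, 7, 4), "total": Decimal("1234.50")}

def present_order(order, loc="en_US.UTF-8"):
    # Presentation layer: all formatting decisions live here, at the edge,
    # so the same data can serve a phone, a tablet, or a printed report.
    locale.setlocale(locale.LC_ALL, loc)  # assumes this locale is installed
    return {"ordered_on": order["ordered_on"].strftime("%x"),
            "total": locale.currency(order["total"], grouping=True)}

print(present_order(fetch_order()))  # e.g., {'ordered_on': '07/04/2013', 'total': '$1,234.50'}

Swap the locale argument and only the presentation changes; the stored data never does.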

The fourth V is variability. Forrester meant this to refer to the variety of data formats. We are not using just simple structured data to get that volume. In terms of pure byte count, video is easily the largest source in my house. We gave up satellite and cable television and only watch Internet shows and DVDs. Twitter, emails, and other social network tools are also huge. Markup languages are everywhere and getting specialized. This is why ETL tools are selling so well.

Think of variability as applying to both volume and velocity. Television marketing companies know that they will get a very busy switchboard when they broadcast a sale. They might underestimate what the volume or velocity will be, but they know it is coming. We do not always have that luxury; a catastrophe at one point in the system can cascade. Imagine that one of your major distribution centers was in Chelyabinsk, Russia, when the meteor hit on February 15, 2013. The more centralized your system, the more damage a single event can do. If that was your only distribution center, you could be out of business.

The other mindset problem is management and administration with Big Data. If we cannot use traditional tools on Big Data, then what do we do about data modeling, database administration, data quality, data governance, and database programming?

The purpose of Big Data, or at least the sales pitch, is that we can use large amounts of data to get useful insights that will help an enterprise. But much like agile programming, it can become an excuse for bad practices. You still need to have some idea of, say, the data quality. In statistics, we say “sample size does not overcome sample bias,” or that a small, random herd of cattle is better than a large, diseased herd.

In more traditional databases (the small herd), people will see and clean some of the data, but most raw Big Data is never even eyeballed because there is simply too much of it (the large herd). A quality problem has to be huge and/or concentrated to be visible in that volume of data.
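A quick simulation makes the herd analogy concrete (a Python sketch; the population numbers are invented for illustration):

import random
import statistics

random.seed(2013)
# An illustrative population: one million animal weights, true mean near 500.
population = [random.gauss(500, 50) for _ in range(1_000_000)]
# Small but random: 100 animals picked blindly estimate the mean well.
small_random_herd = random.sample(population, 100)
# Large but biased: weighing only the 100,000 heaviest animals.
large_diseased_herd = sorted(population)[-100_000:]
print(statistics.mean(small_random_herd))    # close to 500
print(statistics.mean(large_diseased_herd))  # far above 500; size did not help

No amount of extra volume in the biased sample moves its estimate toward the truth.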

Even worse, the data is often generated by automated machinery without any human intervention. Ideally, the generating process cleans the data as it goes. But one of the rules of systemantics is that fail-safe systems fail by failing to fail safely (Gall, 1977).

In fairness to Big Data, you should not assume that “traditional data” has been following best practices either. I will argue that good data practices still apply, but that they have to be adapted to the Big Data model.

9.1 Objections to Big Data and the Cloud

“Nothing is more difficult than to introduce a new order, because the innovator has for enemies all those who have done well under the old conditions and lukewarm defenders in those who may do well under the new.” —Niccolo Machiavelli

Old Nick was right, as usual. As with any new IT meme, there are objections to it. The objections are usually valid. We have an investment in the old equipment and want to milk it for everything we can get out of it. But more than that, our mindset is comfortable with the old terms, old abstractions, and known procedures. The classic list of objections is outlined in the following sections.

9.1.1 Cloud Computing Is a Fad

Of course it is a fad! Everything in IT starts as a fad: structured programming, RDBMS, data warehouses, and so on. The trick is to filter the hype from the good parts. While Dilbert’s pointy-haired boss thinks, “If we accept Big Data into our servers, we will be saved from bankruptcy! Let us pay!”, you might want to be more rational.

If it is a fad, it is a very popular one. Your online banking, Amazon purchases, social media, eBay, and email are already in a cloud for you. Apple and Google have been keen to embrace cloud computing, affirming the idea that this is a technology revolution with longevity. Cloud computing is a developing trend, not a passing one.

9.1.2 Cloud Computing Is Not as Secure as In-house Data Servers

This is true for some shops. I did defense contract work in the Cold War era when we had lots of security in the hardware and the software. But very few of us have armed military personnel watching our disk drives. Most shops do not use encryption on everything. High security is a very different world.

But you have to develop protection tools on your side. Do not leave unencrypted data in the cloud, or on your in-house servers either; data on in-house servers gets hacked, too. Read about the T.J. Maxx scandal, or whatever the “security breach du jour” happens to be. This will sound obvious, but do not put encryption keys in the cloud with the data they encrypt. Do not concentrate data at one site; spread it across separate server locations, so that when one site is compromised, you can switch over to another.
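Here is a minimal sketch of that advice in Python, assuming the third-party cryptography package; the point is that only ciphertext ever leaves your premises and the key never does:

# pip install cryptography
from cryptography.fernet import Fernet

# The key stays on YOUR side (a local key store, an HSM, a vault);
# never upload it to the same cloud that holds the data it encrypts.
key = Fernet.generate_key()
cipher = Fernet(key)
record = b"account=12345;balance=99.10"
ciphertext = cipher.encrypt(record)     # only this ciphertext goes to the cloud
# ... hand ciphertext to your provider's upload client here ...
roundtrip = cipher.decrypt(ciphertext)  # decryption happens back on your side
assert roundtrip == record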

9.1.3 Cloud Computing Is Costly

Yes, there are initial costs in the switch to the cloud. But there are trade-offs to make up for it. You will not need a staff to handle the in-house servers. Personnel are the biggest expense in any technological field. We are back to a classic trade-off. But if you are starting a new business, it can be much cheaper to buy cloud space instead of your own hardware.

9.1.4 Cloud Computing Is Complicated

Who cares? You are probably buying it as a service from a provider. As the buyer, your job is to pick the right kind of cloud computing for your company. The goal is to keep the technical side as simple as possible for your staff, and for your users! A specialized company can afford to hire specialized personnel. This is the same reason that you buy a software package that has a team of lawyers behind it.

9.1.5 Cloud Computing Is Meant for Big Companies

Actually, you might avoid having to obtain costly software licenses and skilled personnel when you are a small company. There are many examples of very small companies going to the cloud so they could reach their users. If you are successful, then you can move off the cloud; if you fail, the cost of failure is minimized.

9.1.6 Changes Are Only Technical

Did the automobile simply replace the horse? No. The design of cities for automobiles is not the same as for horses. The Orson Welles classic movie The Magnificent Ambersons (1942) ought to be required viewing for nerds. The story is set in the period when the rise of the automobile changed American culture. It is not just technology; it is also culture.

Let me leave you with this question: How does your staff use the company’s internal resources when they are in the cloud? If there is an onsite server problem, you can walk down the hall and see the hardware. If there is a cloud problem, you cannot walk down the hall, and your user is still mad.

There are no purely technical changes today; the lawyers always get involved. My favorite example (of both the upside and the downside of cloud computing) was a site run by Kyle Godwin to track local high school and college sports in the Midwest. He put his business data on Megaupload, a file-sharing site that was shut down by the Department of Justice (DOJ) for software piracy in January 2012. When he tried to get his data back, the DOJ blocked him, claiming it was not his data. His case is being handled by the Electronic Frontier Foundation (EFF) as of April 2013.

There are some legal questions about the ownership of the data, so you need to be sure of your contract. Some of this has been decided for emails on servers, but it is still full of open issues.

Using the Cloud

Let me quote from an article by Pablo Valerio (2013):

In case you need to make sure your data is properly identified as yours, and to avoid any possible dispute, the next time you negotiate an agreement with a cloud provider, you’d be wise to include these provisions in the contract:

◆ Clearly specify the process, duration, and ways the data will be returned to you, at any time in the contract duration.

◆ Also specify the format in which your data should be returned, usually the format in which the data was stored in the first place.

◆ Establish a time limit, usually in days, by which the data must be fully returned to your organization.

◆ Clearly establish your claims of ownership of the data stored, and that you don’t waive any rights on your property or copyright.

◆ Sometimes we just accept service agreements (where we just agree on the conditions set forth by the provider) without realizing the potential problems. I seriously recommend consulting an attorney.

9.1.7 If the Internet Goes Down, the Cloud Becomes Useless

This is true; the Internet connection is a point of vulnerability. Netflix took losses when its hosting provider, Amazon Web Services (AWS), went down on multiple occasions.

This was also true of power supplies for large data centers. National Data Corporation in Atlanta, GA, did credit card processing decades ago. Their response to their first major power failure was to route power lines from separate substations. When both substations failed during a freak ice storm, they added a third power line and a huge battery backup.

If the whole Internet is down, that might mean the end of the world as we know it. But you can have a backup connection with another provider. This is a technical issue, and you need to frame it in terms of the cost of an outage to the company. For example, one of my video download sites lost part of its dubbed anime; they are the only source for the series I was watching, so I was willing to wait a day for it to come back up. But when a clothing site went down, I simply placed my gift certificate order with another site.
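A backup connection is also cheap to exploit in code. A sketch (Python standard library; the provider URLs are hypothetical) of failing over to a second provider:

import urllib.request

# Hypothetical endpoints hosted with two different providers.
ENDPOINTS = ["https://primary.example.com/data",
             "https://backup.example.net/data"]

def fetch_with_failover(urls=ENDPOINTS, timeout=5):
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()   # first provider that answers wins
        except OSError as err:           # URLError and timeouts are OSErrors
            last_error = err             # remember it and try the next provider
    raise last_error                     # every provider failed: a real outage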

9.2 Big Data and Data Mining

Data mining as we know it today began with data warehousing (a previous IT fad). Data warehousing concentrated summary data in a format that was more useful for statistical analysis and reporting. This led to large volumes of data arranged in star and snowflake schema models, ROLAP, MOLAP, and other OLAP variants.

The data warehouse is denormalized, does not expect to handle transactions, and has a known data flow. But it is still structured data. Big Data is not structured and contains a variety of data types. If you can pull structured data out of the mix, there are already tools for it.

9.2.1 Big Data for Nontraditional Analysis

More and more, governments and corporations are monitoring your tweets and Facebook posts for more complex purposes than simple statistical analysis. U.S. News and World Report ran a story in 2013 about the IRS collecting a “huge volume” of personal data about taxpayers (Satran, 2013). The IRS will mix this new data with Social Security numbers, credit card transactions, and the health records it will police under ObamaCare to create robo-audits done by machines. The movie Minority Report (2002, Steven Spielberg, based on a Philip K. Dick short story) predicts a near future where a “precrime” police division uses mutants to arrest people for crimes they have not yet committed. You are simply assumed guilty, with no further proof.

Dean Silverman, the IRS’s senior advisor to the commissioner, said the IRS is going to devote time to scrutinizing even your Amazon.com purchases. This is not new; RapLeaf is a data-mining company that has been caught harvesting personal data from social networks such as Facebook and MySpace in violation of user privacy agreements. The slogan on their website is “Real-Time Data on 80% of U.S. Emails.” The gimmick is that processing unstructured data from social networks is not easy. You need a tool like IBM’s Watson to read it and try to understand it.

In May 2013, the Government Accountability Office (GAO) found that the IRS has serious IT security problems. The IRS had addressed only 58 of the 118 system security-related recommendations the GAO made in previous audits, and the follow-up audit found that, of those 58 “resolved” items, 13 had not been fully resolved. Right now, the IRS is not in compliance with its own policies. It is not likely that its Big Data analytics will succeed, especially when it has to start tracking ObamaCare compliance and penalizing citizens who do not buy health insurance.

In 2010, Macy’s department stores were still using Excel spreadsheets to analyze customer data. By 2013, Macys.com was using tens of millions of terabytes of information every day, including social media, store transactions, and other feeds, in a system of Big Data analytics. They credit this with a major boost in store sales.

Kroger CEO David Dillon has called Big Data analytics his “secret weapon” in fending off other grocery competitors. The grocery business works on fast turnaround, low profit margins, and insanely complicated inventory problems. Any small improvement is vital.

Big retail chains (Sears, Target, Macy’s, Wal-Mart, etc.) want to react to market demand in near-real time. Among the goals:

◆ Dynamic pricing and allocation as goods fall in and out of fashion. The obvious case is seasonal merchandise; Christmas trees do not sell well in July. But they need finer adjustments. For example, at what price should they sell what kind of swimwear in which part of the country in July?

◆ Cross-selling the customer at the cash register. This means that the customer data has to be processed no slower than a credit card swipe.

◆ Tighter inventory control to avoid overstocking. The challenge is to combine external data, such as weather reports or social media, with the internal data retailers already collect. The weather report tells us when and how many umbrellas to send to Chicago, IL; the social media can tell us what kind of umbrellas we should send (see the sketch after this list).
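The following sketch (Python; the rule, multiplier, and feed values are all invented for illustration) shows the shape of that umbrella decision, joining an external forecast to internal stock counts:

def umbrellas_to_ship(rain_probability, on_hand, baseline=500):
    # Hypothetical rule: scale expected demand by the forecast,
    # then subtract what the store already holds.
    expected_demand = int(baseline * rain_probability * 2)  # illustrative multiplier
    return max(expected_demand - on_hand, 0)

# Pretend the weather feed says an 80% chance of rain in Chicago
# and the internal inventory system reports 150 umbrellas on hand.
print(umbrellas_to_ship(rain_probability=0.8, on_hand=150))  # 650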

The retail chains’ enemy is the online, dot-com retailers: clicks versus bricks. They use Big Data too, but use it differently. Amazon.com invented the modern customer recommendation retail model. Initially, it was crude and needed tuning. My favorite personal experience was being assured that other customers who bought an obscure math book also liked a particular brand of casual pants (I wear suits)! Trust me, I was the only buyer of that book on Earth that month. Today, my Netflix and Amazon recommendations are mostly for things I have already read, seen, or bought. This means that my profiles are correctly adjusted today. When I get a recommendation I have not seen before, I have confidence in it.
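A toy version of such a recommender (a sketch only; Amazon’s actual model is proprietary) shows why a single obscure purchase produces odd pairings: with one co-buyer, every coincidence counts as a signal.

from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets: exactly one customer bought both
# the obscure math book and the casual pants.
baskets = [{"obscure math book", "casual pants"},
           {"casual pants", "socks"},
           {"socks", "shirt"}]

co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1   # count the pair in both directions
        co_counts[(b, a)] += 1

def recommend(item, k=1):
    # Rank the other items by how often they co-occur with `item`.
    scores = {b: n for (a, b), n in co_counts.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("obscure math book"))  # ['casual pants']: one buyer is "evidence"

More purchase history drowns out such coincidences, which is why mature profiles mostly recommend things you already know.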

9.2.2 Big Data for Systems Consolidation

The Arkansas Department of Human Services (DHS) has more than 30 discrete system silos in an aging architecture. There has been no “total view” of a client. They are trying to install a new system that will bring the state’s social programs together, including Medicaid, the Supplemental Nutrition Assistance Program (SNAP), and the State Children’s Health Insurance Program (SCHIP). This consolidation is going to have to cross multiple agencies, so there will be political problems as well as technical ones.

The goal is to give clients a single point of access to all the benefit programs available to them, and a single place to report any change of circumstances once to all the agencies, regardless of how many benefit programs the client uses.

In a similar move, the Illinois DHS decided to digitize thousands of documents and manage them in a Big Data model. In 2010, the DHS had more than 100 million pieces of paper stored in case files at local offices and warehouses throughout the state. You do not immediately put everything in the cloud; it is too costly and there is too much of it. Instead, the agency decided to start with three basic forms that deal with the applications and the chronological case records stored as PDF files. The state is using the IBM Enterprise Content Management Big Data technologies. When a customer contacts the agency, a caseworker goes through a series of questions and inputs the responses into an online form. Based on the information provided, the system determines program eligibility, assigns metadata, and stores the electronic forms in a central repository. Caseworker time spent retrieving information has gone from days to just seconds, which has been a big boost to customer service. Doug Kasamis, CIO at DHS, said “the system paid for itself in three months.”
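A minimal sketch of that intake flow (Python; the eligibility thresholds, field names, and repository are hypothetical stand-ins, not the actual Illinois system):

def determine_programs(answers):
    # Hypothetical eligibility rules driven by the caseworker's form.
    programs = []
    if answers["monthly_income"] < 1500:
        programs.append("SNAP")
    if answers["has_children"] and answers["monthly_income"] < 2500:
        programs.append("SCHIP")
    return programs

def file_case(answers, pdf_bytes, repository):
    # Metadata assigned at intake is what turns retrieval from days
    # into seconds: the repository is searched, not a paper warehouse.
    metadata = {"client_id": answers["client_id"],
                "programs": determine_programs(answers),
                "form": "application"}
    repository.append((metadata, pdf_bytes))
    return metadata["programs"]

repository = []
answers = {"client_id": 1, "monthly_income": 1200, "has_children": True}
print(file_case(answers, b"%PDF-1.4 ...", repository))  # ['SNAP', 'SCHIP']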

Concluding Thoughts

A survey at the start of 2013 by Big Data cloud services provider Infochimps found that 81% of respondents listed Big Data/advanced analytics projects as a top-five 2013 IT priority. However, respondents also reported that 55% of Big Data projects do not get completed and that many others fall short of their objectives. We grab the new fad first, then prioritize business use cases. According to Gartner Research’s “Hype Cycle,” Big Data had reached its “peak of inflated expectations” by January 2013 (Gartner Research, 2010–2013). This is exactly what happened when the IT fad du jour was data warehouses. The failures were from the same causes, too: overreaching scope, silos of data that could not be integrated, and management failures.

But people trusted data warehouses because they were not exposed to the outside world. In mid-2013, we began to find out just how much surveillance the Obama administration conducts on Americans through the PRISM program. That surveillance is done by using the cloud to monitor emails, social networks, Twitter, and almost everything else. The result has been a loss of confidence in Big Data where privacy is concerned.

References

1. Adams, S. (2012). Dilbert. http://dilbert.com/strips/comic/2012-07-29/.

2. Gall, J. (1977). Systemantics: How Systems Work and Especially How They Fail. Orlando, FL: Quadrangle.

3. Gartner Research. (2010–2013). Hype cycles. http://www.gartner.com/technology/research/hype-cycles/.

4. McClatchy-Tribune Information Services. (2013). Illinois DHS digitizes forms, leverages mainframe technology. http://cloud-computing.tmcnet.com/news/2013/03/18/6999006.htm.

5. Satran, R. (2013). IRS high-tech tools track your digital footprints. http://money.usnews.com/money/personal-finance/mutual-funds/articles/2013/04/04/irs-high-tech-tools-track-your-digital-footprints.

6. Valerio, P. (2012). How to decide what (& what not) to put in the cloud. http://www.techpageone.com/technology/how-to-decide-what-what-not-to-put-in-the-cloud/#.Uhe1PdJOOSo.

7. Valerio, P. (2013). Staking ownership of cloud data. http://www.techpageone.com/technology/staking-ownership-of-cloud-data/#.Uhe1QtJOOSo.