
Chapter 6. The Intersection of Big Data, Mobile, and Cloud Computing

Boon to App Development

The intersection of Big Data, mobile, and cloud computing has created the perfect storm for the development of innovative new applications. Mobile visualization applications are providing dynamic insight into sales, marketing, and financial data from virtually anywhere. Cloud computing services are making it possible to store virtually unlimited amounts of data and access scalable, low-cost data processing functionality at the push of a button.

This chapter covers the latest advances in Big Data, cloud computing, and mobile technologies. From choosing the right technologies for your application to putting together your team, you’ll learn everything you need to know to build scalable Big Data Applications (BDAs) that solve real business problems.

How to Get Started with Big Data

Building your own BDA need not be hard. Historically, the investment, expertise, and time required to get a new Big Data project up and running prevented many companies and entrepreneurs from getting into the Big Data space. Now, however, Big Data is accessible to nearly anyone, and building your own BDA is well within reach.

Mobile devices mean the insights from Big Data are readily available, while cloud services make it possible to analyze large quantities of data quickly and at a much lower cost than was traditionally possible. Let’s first take a look at the latest cloud-based approaches to working with Big Data.

The Latest Cloud-Based Big Data Technologies

Building Big Data applications traditionally meant keeping data in expensive, local data warehouses and installing and maintaining complex software. Two major developments mean that this is no longer the case.

First, the widespread availability of high-speed broadband means that it is easier to move data from one place to another. No longer must data produced locally be analyzed locally. It can be moved to the cloud for analysis.

Second, more and more of today’s applications are cloud-based. That means more data is being produced and stored in the cloud. Increasing numbers of entrepreneurs are building new BDAs to help companies analyze cloud-based data such as e-commerce transactions and web application performance data.

The latest cloud-based Big Data technologies have evolved at three distinct layers—infrastructure, platform, and application.

Infrastructure

At the lowest level, Infrastructure as a Service (IaaS) offerings such as Amazon Elastic Compute Cloud (EC2), Simple Storage Service (S3), and Google Cloud Storage make it easy to store data in the cloud.
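
To make this concrete, here is a minimal sketch of storing and retrieving a file with S3, assuming Python and the boto3 AWS SDK; the bucket and file names are hypothetical placeholders.

```python
# A minimal sketch of cloud storage via IaaS, using the AWS SDK for
# Python (boto3). Bucket name and file paths are placeholders.
import boto3

s3 = boto3.client("s3")  # credentials come from your AWS configuration

# Upload a local data file to S3...
s3.upload_file("sales_2014.csv", "my-analytics-bucket", "raw/sales_2014.csv")

# ...and read it back later, from anywhere.
obj = s3.get_object(Bucket="my-analytics-bucket", Key="raw/sales_2014.csv")
data = obj["Body"].read()
print(len(data), "bytes retrieved")
```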

Cloud-based infrastructure services enable immense scalability without the up-front investments in storage and computing infrastructure that are normally required. Perhaps more than any other company, Amazon pioneered the public cloud space with its Amazon Web Services (AWS) offering, which you’ll read about in more detail in the next section.

Providers such as AT&T, Google, IBM, Microsoft, and Rackspace have continued to expand their cloud infrastructure offerings. IBM has recently become more aggressive, perhaps growing concerned about the rapid growth of AWS. In 2013, IBM acquired SoftLayer Technologies for $2 billion to expand its cloud services offering. SoftLayer was doing an estimated $800M to $1B in revenue annually prior to the acquisition.1

Amazon, Google, Microsoft, and Rackspace have been competing particularly aggressively at the infrastructure level, repeatedly cutting prices on both computing and storage services. Between them, the four companies cut prices some 26 times in 2013, according to cloud management firm RightScale, with Amazon making the most price cuts. This bodes well for Big Data because it means the cost of storing and analyzing large quantities of data in the cloud continues to decrease. In many cases, working with Big Data in the cloud is cheaper than doing so locally.

Platform

At the middle layer, Platform as a Service (PaaS) solutions offer a combination of storage and computing services while providing more application-specific capabilities. Such capabilities mean that application developers can spend less time worrying about the low-level details of how to scale underlying infrastructure and more time focusing on the unique aspects of their applications.

After getting off to a somewhat late start, Microsoft has continued to expand its Microsoft Azure cloud offering. Microsoft Azure HDInsight is an Apache Hadoop offering in the cloud, which enables Big Data users to spin Hadoop clusters up and down on demand.

Google offers Google Compute Engine and Google App Engine, in addition to Google Fusion Tables, a cloud-based storage, analysis, and presentation solution for rapidly working with immense quantities of table-based data in the cloud. More recently, Google introduced Google Cloud Dataflow, which Google is positioning as a successor to the MapReduce approach. Unlike MapReduce, which takes a batch-based approach to processing data, Cloud Dataflow can handle both batch and streaming data.

Newer entrants like Qubole—founded by former Big Data leaders at Facebook—and VMware—with its Cloud Foundry offering—combine the best of existing cloud-based infrastructure services with more advanced data and storage management capabilities. Salesforce.com is also important to mention, since the company pioneered the cloud-based application space back in 1999. The company offers its Force.com platform. While it is well-suited for applications related to customer relationship management (CRM), the platform does not yet offer capabilities specifically designed for BDAs.

Application

Software as a Service (SaaS) BDAs exist at the highest level of the cloud stack. They are ready to use out of the box, with no complex, expensive infrastructure to set up or software to install and manage. In this area, Salesforce.com has expanded its offerings through a series of acquisitions including Radian6 and Buddy Media. It now offers cloud-based social, data, and analytics applications.

Newer entrants like AppDynamics, BloomReach, Content Analytics, New Relic, and Rocket Fuel all deal with large quantities of cloud-based data. Both AppDynamics and New Relic take data from cloud-based applications and provide insights to improve application performance. BloomReach and Content Analytics use Big Data to improve search discoverability for e-commerce sites. Rocket Fuel uses Big Data to optimize the ads it shows to consumers.

The application layer is likely to see continued growth over the next few years as companies seek to make Big Data accessible to an ever broader audience. For end-users, cloud-based BDAs provide a powerful way to reap the benefits of Big Data for specific verticals or business areas without incurring the setup time and costs traditionally required when starting from scratch.

Tip Employing cloud-based applications is an economical and powerful way to reap the benefits of Big Data intake and analysis at the business and end-user level.

With many areas not yet addressed by today’s BDAs, there is ample opportunity to build your own, and plenty of technologies can help you get started fast. In that regard, perhaps no offering is better known than AWS.

Amazon Web Services (AWS)

Almost no cloud-based platform has had more success than AWS. Until recently, common wisdom was that data that originated locally would be analyzed locally, using on-site computer infrastructure, while data that originated in the cloud would be stored and analyzed there. This was in large part due to the time-consuming nature of moving massive amounts of data from the on-site infrastructure to the cloud so that it could be analyzed.

But all that is changing. The availability of very high bandwidth connections to the cloud and the ease of scaling computing resources up and down in the cloud means that more and more Big Data applications are moving to the cloud or at a minimum making use of the cloud when on-site systems are at capacity.

But before you explore some of the ways in which businesses are using Amazon Web Services for their Big Data needs, first take a step back and look at how AWS became so popular. Although it seems like AWS became a juggernaut virtually overnight, Amazon actually introduced the service in 2003. By 2006, Amazon had commercialized the service and was on its way to becoming the powerhouse that it is today. According to some analyst estimates, AWS accounted for some $3.8 billion of Amazon’s revenue in 2013 and is expected to account for a whopping $6.2 billion in 2014.2

How did an online bookseller turn into one of the leading providers of public cloud services? Amazon had developed an infrastructure that allowed it to scale virtual servers and storage resources up and down on demand, a critical capability to address the significant spikes in demand the company saw during holiday shopping seasons. Amazon made the storage and computing capacity—quite often excess capacity—that it had originally built to support the demands consumers were placing on its retail web site available to business users.

Of course, cloud-based infrastructure had existed previously in the form of web hosting from providers like GoDaddy, Verio, and others. But Amazon made it truly easy for users to get started by signing up for an account online using a credit card. Amazon made it equally easy to add more storage (known as S3 for Simple Storage Service) and computing power (known as EC2 for Elastic Compute Cloud).

Notably, Amazon’s offering provides storage and computing in a utility style model. Users pay only for what they actually consume. That makes it possible to buy a few hours of processing when necessary, rather than committing to a months- or years-long purchase of an entire server. Users can try one server, known as an instance, run it for however long they need, and turn it off when they don’t need it anymore. They can bring up bigger or smaller instances based on their computing needs. And all this can be done in a few minutes if not seconds.
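
To make the utility model concrete, here is a rough sketch of launching and terminating an instance programmatically, assuming Python and the boto3 AWS SDK; the AMI ID and instance type are hypothetical placeholders.

```python
# A sketch of the utility model: start an instance, use it, turn it off.
# Assumes boto3 and a configured AWS account; the AMI ID is a placeholder.
import boto3

ec2 = boto3.resource("ec2")

# Launch a single small instance on demand.
instances = ec2.create_instances(
    ImageId="ami-12345678",   # placeholder AMI
    InstanceType="m1.small",
    MinCount=1,
    MaxCount=1,
)
instance = instances[0]
instance.wait_until_running()
print("running:", instance.id)

# ... run your analysis job against the instance ...

# Stop paying the moment you're done.
instance.terminate()
```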

Amazon’s no-lock-in offering combined with the ease of getting started really sets it apart from the hosting providers that came before it. What’s more, users figure that if the infrastructure is good enough to support Amazon’s huge e-commerce web site, it is more than good enough to support their own applications.

And Amazon’s pricing is hard to beat. Because of the scale of its infrastructure, Amazon buys immense amounts of hardware and bandwidth. As a result, it has tremendous purchasing power, which in turn enabled it to lower its prices repeatedly. Not only do AWS users get to leverage the capabilities of a flexible cloud-based infrastructure, they also get to take advantage of Amazon’s purchasing power as if it were their own. Plus, they don’t have to worry about all of the infrastructure requirements that typically go along with maintaining physical servers: power, cooling, network capacity, physical security, and handling hard-drive and system failures. With AWS, all of that comes with the service.

As a result of this flexibility, AWS has seen tremendous adoption. Leading tech companies like Netflix and Dropbox run on top of AWS. But AWS adoption doesn’t stop there. Pharmaceutical maker Pfizer uses AWS when its need for high-performance computing (HPC) services outstrips the capacity the company has available on-site. This is a great example of cloud-based Big Data services in action because the HPC workloads used for drug discovery and analysis are often spiky in nature. From time to time, immense amounts of computing power are required for pharmaceutical analytics; the cloud is an ideal way to support such spikes in demand without requiring a long-term purchase of computing resources.

In another example, entertainment company Ticketmaster uses AWS for ticket pricing optimization. Ticketmaster needed to be able to adjust ticket prices but did not want to make the costly up-front investment typically required to support such an application. In addition, as with Pfizer, Ticketmaster has highly variable demand. Using AWS, the company was able to deploy a cloud-based MapReduce and storage capability. According to Amazon, taking a cloud-based approach reduced infrastructure management time by three hours a day and reduced costs by some 80 percent.

Over time, Amazon has added more and more services to its cloud offering, especially services that are useful for BDAs. Amazon Elastic MapReduce (EMR) is a cloud-based service for running MapReduce jobs in the cloud.

MapReduce is a well-known Big Data approach for processing large data sets: the data is typically segmented into chunks that are distributed over multiple servers or instances for processing, and the results are then put back together to produce the final insights. MapReduce was originally designed by Google in 2004. Apache has popularized the approach with its open source Hadoop MapReduce offering, and companies like Cloudera, Hortonworks, and MapR have made the approach even more viable with commercial offerings. Hadoop and the MapReduce approach have become the mainstay of Big Data due to their ability to store, distribute, and process vast amounts of data.
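
To illustrate the idea, here is a toy word-count example in plain Python. A real cluster would run the map phase on many machines and shuffle the intermediate pairs to reducers, but the two phases work the same way.

```python
# A toy illustration of the MapReduce idea in plain Python:
# map each chunk of text to (word, 1) pairs, then reduce by key.
from collections import defaultdict
from itertools import chain

chunks = [
    "big data in the cloud",
    "mobile meets big data",
    "cloud and mobile",
]

def map_phase(chunk):
    # Emit a (key, value) pair for every word in the chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Sum the values for each key to produce the final counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# On a cluster, each chunk would be mapped on a different machine and
# the pairs shuffled to reducers; here we simply chain the results.
print(reduce_phase(chain.from_iterable(map_phase(c) for c in chunks)))
```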

With EMR, users do not need to perform the often laborious and time-consuming setup tasks required to create a MapReduce infrastructure. Moreover, if additional compute resources are needed to complete MapReduce jobs in a timely fashion, users can simply spin up more instances. Perhaps most importantly, users don’t need to commit to a huge upfront investment to get their very own MapReduce infrastructure. Instead of buying the hardware and network infrastructure needed for a typical 50- or 100-node computing cluster, users can simply spin up as many instances as they need and then turn them off when their analysis is done.
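
As a rough sketch of how little setup this requires, the following assumes Python and the boto3 AWS SDK; the cluster name, EMR release, and instance settings are hypothetical placeholders.

```python
# A sketch of spinning up an EMR cluster on demand with boto3.
# Cluster size, names, and releases are hypothetical placeholders.
import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="wordcount-cluster",
    ReleaseLabel="emr-4.0.0",      # placeholder EMR release
    Instances={
        "MasterInstanceType": "m1.large",
        "SlaveInstanceType": "m1.large",
        "InstanceCount": 10,       # 10 nodes instead of buying 10 servers
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down when the job ends
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("cluster started:", response["JobFlowId"])
```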

Amazon’s solution doesn’t stop there. In the last few years the company has also introduced an ultra low-cost data archiving solution called Glacier. While Glacier does not enable access to data in real-time, it is one-tenth the price or less of traditional archiving solutions. This gives those on the fence yet another reason to move their data to the cloud.
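
One common pattern is to let an S3 lifecycle rule migrate cold data to Glacier automatically. Here is a minimal sketch, assuming boto3; the bucket name, prefix, and rule settings are hypothetical placeholders.

```python
# A sketch of archiving cold data: a lifecycle rule that moves objects
# under the "archive/" prefix to Glacier after 30 days. Assumes boto3;
# the bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "archive/"},
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```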

To further expand its offerings, Amazon introduced Kinesis, a managed cloud service for processing very large quantities of streaming data. Unlike EMR and MapReduce, which are batch-based, Kinesis can work with streaming data such as tweet streams, application events, inventory data, and even stock trading data in real time.
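
As a sketch of what pushing events into a stream looks like, the following assumes Python, the boto3 AWS SDK, and an existing Kinesis stream; the stream name and event fields are hypothetical placeholders.

```python
# A sketch of pushing streaming events into Kinesis as they occur.
# Assumes boto3 and an existing stream; names are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"symbol": "AMZN", "price": 312.01, "ts": "2014-06-30T14:05:00Z"}

kinesis.put_record(
    StreamName="trade-events",                 # placeholder stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["symbol"],              # keeps a symbol's events ordered
)
```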

Note Amazon is continually introducing new Big Data and cloud-related capabilities in Amazon Web Services. Kinesis is the company’s latest offering, designed to handle large quantities of streaming data like tweets, stock quotes, and application events.

No longer is it the case that only data that originated in the cloud should be analyzed in the cloud. Now and in the years ahead, low-cost storage, flexible, scalable computing power, and pre-built services like EMR and Kinesis combined with high bandwidth access to public cloud infrastructure like AWS will provide more compelling reasons to perform Big Data analysis in the cloud.

AWS started out with limited customer support and a lack of enterprise-grade service level agreement (SLA) options. Now the company provides enterprise-grade availability. Companies like Netflix, Dropbox, Pfizer, Ticketmaster, and others rely on AWS to run their mission-critical applications. Some analysts project that AWS will grow from several billion dollars in annual revenue to $15 to $20 billion annually in just the next few years. To do so, it’s highly likely the company will continue to expand its enterprise cloud and Big Data offerings.

AWS provides a compelling alternative to traditional infrastructure, enables the rapid launch of new Big Data Services, and provides on-demand scalability. It’s also a good option for overflow capacity when it comes to supporting the demands of large-scale analytics applications. What we have seen so far in terms of services offered and near continuous price reductions is just the beginning. There are many more cloud-based Big Data services yet to come, with all of the inherent affordability and scalability benefits.

Public and Private Cloud Approaches to Big Data

Cloud services come in multiple forms—public, private, and hybrid. Private clouds provide the same kind of on-demand scalability that public clouds provide, but they are designed to be used by one organization. Private clouds essentially cordon off infrastructure so that data stored on that infrastructure is separate from data belonging to other organizations.

Organizations can run their own private cloud internally on their own physical hardware and software. They can also use private clouds that are deployed on top of major cloud services such as AWS. This combination often offers the best of both worlds—the flexibility, scalability, and reduced up-front investment of the cloud combined with the security that organizations require to protect their most sensitive data.

Note Running a private cloud on top of a major service like AWS provides flexibility, low cost, and all the benefits of public clouds but adds the security that many of today’s businesses demand.

Some organizations choose to deploy a combination of private cloud services on their own in-house hardware and software while using public cloud services to meet spikes in demand. This hybrid cloud approach is particularly well-suited for Big Data applications. Organizations can have the control and peace of mind that comes from running their core analytics applications on their own infrastructure while knowing that additional capacity will immediately be available from a public cloud provider when needed.

The compute-intensive parts of many BDAs such as pricing optimization, route planning, and pharmaceutical research do not need to be run all the time. Rather, they require an immense amount of computing power for a relatively short amount of time in order to produce results from large quantities of data. Thus, these applications are perfectly suited to a hybrid cloud approach.

What’s more, such applications can often take advantage of the lowest available pricing for computing power. In the case of AWS, spot instances are priced based on market supply and demand. The spot instance buyer offers a price per hour and if the market can support that price, the buyer receives the instance. Spot instance prices are typically much lower than the equivalent prices for on-demand instances.
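
As a sketch of how a spot request works in practice, the following assumes boto3; the bid price, AMI, and instance count are hypothetical placeholders.

```python
# A sketch of bidding for low-cost spot capacity with boto3.
# The bid price and AMI are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.request_spot_instances(
    SpotPrice="0.05",              # maximum price per instance-hour, in USD
    InstanceCount=20,              # a burst of cheap workers for one job
    LaunchSpecification={
        "ImageId": "ami-12345678", # placeholder AMI
        "InstanceType": "m1.large",
    },
)
for req in response["SpotInstanceRequests"]:
    print("spot request:", req["SpotInstanceRequestId"])
```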

The downside of spot instances is that they can go away without notice if market prices for instances go up. This can happen if the cloud services provider sees a significant increase in demand, such as during a holiday period. But the transient nature of Big Data processing combined with the distributed nature of the MapReduce approach means that any results lost due to an instance going away can simply be re-generated on another instance.

The more instances over which Big Data processing is distributed, the less data is at stake on any one instance. By combining a distributed architectural approach with the low-cost nature of spot instances, companies can often significantly reduce their Big Data related costs. By layering a virtual private cloud on top of public cloud services, organizations can get the security they need for their Big Data Applications at the same time.

How Big Data Is Changing Mobile

Perhaps no area is as interesting as the intersection of Big Data and mobile. Mobile means that Big Data is accessible anytime, anywhere. What used to require software installed on a desktop computer can now be accessed via a tablet or a smartphone. What’s more, Big Data applications that used to require expensive custom hardware and complex software are now available on everyday mobile devices. You can monitor your heart rate, sleep patterns, and even your EKG using a smartphone, some add-on hardware, and an easy-to-install mobile application. You can combine this with the cloud to upload and share all of the data you collect.

Big Data and mobile are coming together not only for the display and analysis of data but also for the collection of data. In a well-known technological movement called the Internet of Things (IoT), smartphones and other low-cost devices are rapidly expanding the number of sensors available for collecting data. Even data such as traffic information, which used to require complex, expensive sensor networks, can now be gathered and displayed using a smartphone and a mobile app.

Fitness is another area where Big Data is changing mobile. Whereas we used to track our fitness activities using desktop software applications, if we tracked them at all, now wearable devices like Fitbit are making it easy to track how many steps we’ve walked in a given day. Combined with mobile applications and analytics, we can now be alerted if we haven’t been active enough. We can even virtually compete with others using applications like Strava combined with low-cost GPS devices. Strava identifies popular segments of cycling routes and lets individual cyclists compare their performance on those segments. This enables riders to have a social riding experience with their friends, and even with well-known professional cyclists, when they can’t ride together. These riders can then analyze their performance and use data to improve.

At the doctor’s office, applications like Practice Fusion are enabling doctors to gather and view patient data for Electronic Health Records (EHR) and to analyze that data right on a tablet that they can easily carry with them. No longer does data collection have to be a multiple-step process of writing down notes or recording them as audio files, and then transferring or transcribing those notes into digital form. Now, doctors can record and view patient information “in the moment” and compare such information to historical patient data or to benchmark data. No more waiting—the intersection of Big Data and mobile means real time is a reality.

But the applications for real-time, mobile Big Data don’t stop there. Google Glass, a futuristic-looking wearable device that both records and presents information, is changing the way we view the world. No longer do you need to refer to your smartphone to get more information about a product or see where you are on a map. Glass integrates real-time information, visualization, and cloud services, all available as an enhancement to how you see the world.

In the business world, Big Data is changing mobile when it comes to applications like fleet routing, package delivery, sales, marketing, and operations. Analytics applications specifically optimized for mobile devices make it easy to evaluate operational performance while on the go. Financial data, web site performance data, and supply chain data, among many other sources, can all easily be accessed and visualized from tablets and smartphones. Not only do these devices have extremely capable graphics hardware, but their touch-screen interfaces make it easy to interact with visualizations. Compute-intensive data processing can be done in the cloud, with the results displayed anywhere, anytime.

Big Data is having a big impact in the fields of logistics and transportation as well. Airplane engine data is downloaded in real time to monitor performance. Major automotive manufacturers such as Ford have created connected car initiatives to make car data readily accessible. As a result, developers will be able to build new mobile applications that leverage car data.

Such car applications, once the domain of only the most advanced car racing teams, are already becoming widely available. Opening up access to the wealth of information produced by automobiles means that car manufacturers don’t have to be the only producers of interesting new car applications. Entrepreneurs with ideas for compelling car-related applications will be able to create those applications directly.

This platform approach is not limited to cars of course. It makes sense for virtually any organization that has a rich source of data it wants to open up so that others can leverage it to create compelling new applications.

Fitness, healthcare, sales, and logistics are just a few of the many areas where mobile, cloud, and Big Data are coming together to produce a variety of compelling new applications. Big Data is changing mobile—and mobile is changing Big Data.

How to Build Your Own Big Data Applications

You’ve taken a look at the different kinds of applications you can build with Big Data and the infrastructure available to support your endeavor. Now you’ll dig into the details of building your own BDA, including how to identify your business need.

As noted at the start of this chapter, building a BDA need not be hard. The investment, expertise, and time that historically kept many companies and entrepreneurs out of the Big Data space are no longer barriers, and building a BDA is well within reach.

To build your own BDA, you need five things:

· A business need

· One or more sources of data

· The right technologies for storage, computing, and visualization

· A compelling way to present insights to your users

· An engineering team that can work with large-scale data

Defining the Business Need

In my work with clients, I often find that companies start with Big Data technology rather than first identifying their business needs. As a result they spend months or even years building technology for technology’s sake rather than solving a specific business problem.

One way to address this common pitfall is to get business users and technologists together very early in the process. Quite often these individuals are in silos; by breaking down those silos and getting them together sooner rather than later, you are far more likely to solve a well-defined business need.

For entrepreneurs—those working independently and those working in larger organizations—Big Data presents some amazing new opportunities. You can focus on a specific vertical such as healthcare, education, transportation, or finance, or you can build a general-purpose analytics application that makes Big Data more accessible. Uber is one example of a company that is using Big Data and mobile to disrupt the transportation and logistics markets. Meanwhile, Platfora is a new cloud-based visualization application. With ever-increasing interest in Big Data, the timing could not be better for building your own compelling application.

Tip Now is the time to build your own Big Data Application. For one thing, you’ll learn a lot. For another, you can bet that your top competitors are hard at work building applications of their own.

Identifying Data Sources

Depending on the type of application you’re building, data can come in many forms. For example, my company, Content Analytics, develops a product that analyzes entire web sites. Data comes from crawling tens of millions of web pages, much like Google does. The company’s product then extracts relevant data such as pricing, brand segmentation, and content richness information from each of the web pages.

While identifying your data sources, it’s also important to determine whether those data sources are unstructured, structured, or semi-structured. As an example, some web pages contain text and images almost exclusively. This is considered unstructured data.

Many e-commerce web sites contain pricing, product, and brand information. However, such data, as presented on the web pages, is not available in a well-known structured form from a database. As a result, the data has some structure but is not truly well-structured, making it relatively difficult to parse. This is considered semi-structured data.

Finally, data such as customer contact information stored in a database is a good example of structured data. The type of data you’ll be working with will have a big impact on how you choose to store that data and how much you have to process the data before you can analyze it.
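
To make the distinction concrete, here is a minimal sketch of turning a semi-structured product page into a structured record, assuming Python and the BeautifulSoup library; the HTML snippet and CSS classes are hypothetical.

```python
# A sketch of extracting structure from a semi-structured product page.
# The HTML and class names are hypothetical; assumes BeautifulSoup
# (pip install beautifulsoup4).
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="title">Acme Coffee Maker</h2>
  <span class="price">$49.99</span>
  <span class="brand">Acme</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
record = {
    "title": soup.select_one(".title").get_text(strip=True),
    "price": float(soup.select_one(".price").get_text(strip=True).lstrip("$")),
    "brand": soup.select_one(".brand").get_text(strip=True),
}
print(record)  # now a structured row, ready for a database
```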

Data can also come from internal sources. One telecommunications client we worked with came up with the idea of developing a new Big Data field-management application. The application would bring together data about customer support requests with field technician availability and even customer account value to determine where technicians should go next. It could also incorporate mobile/location information to remind technicians to bring the proper equipment with them when they arrive at each service call. Major telecommunications companies have upwards of 75,000 field technicians. As a result, using data to deploy those technicians efficiently could potentially save a company hundreds of millions of dollars on an annual basis.

In another example, data can come from a company’s web site. Site performance data and visitor information are both available, but need to be stored, analyzed, and presented in a compelling way. It’s not enough just to store the log files a web site generates. Companies like AppDynamics and New Relic have created entire new businesses around application performance monitoring, while products like Adobe Marketing Cloud and Google Analytics convert massive amounts of web site traffic data into insights about who’s visiting your site, when, and for how long.

Finally, don’t forget public data sources. From economic data to population data, employment numbers and regulatory filings, an immense amount of data is available. The challenge is that it is often difficult to access.

Real estate information web site Zillow built an entire business from difficult-to-access real estate sale price information. Such information was accessible in public records, but those records were often stored in paper format in local town clerks’ offices. Even when such records were stored electronically, they often were not accessible online. Zillow turned this hard-to-access data into a central repository in standardized form accessible via an easy-to-use web interface. Consumers can enter an address and get an estimated price based on sales data for the surrounding area. Public data can be an invaluable resource. Sometimes you just have to do the work necessary to obtain it, organize it, and present it.

Once you have identified both the business need and your sources and types of data, you’re ready to start designing your Big Data Application. A typical Big Data Application has three tiers: an infrastructure tier to store the data, an analytics tier to extract meaning from the data, and a presentation tier to deliver the results.

Choosing the Right Infrastructure

For most startups today, the decision to store and process their Big Data in the cloud is an easy one. The cloud offers a low up-front time and cost investment and on-demand scalability. In terms of how to store the data, there are typically three options. You can use traditional SQL (Structured Query Language) database infrastructure. This includes products like MySQL, Oracle, PostgreSQL, and DB2. To a certain scale, MySQL works remarkably well and offers an unbeatable price: it’s free. It also has the benefit of running on commodity hardware. Cloud-based versions of these databases are also available. Amazon, for example, offers its Relational Database Service (RDS), a pre-built, cloud-based data store with features like redundancy built in.
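
To illustrate the schema-first nature of SQL storage, here is a minimal sketch using Python’s built-in sqlite3 module in place of a server database like MySQL; the schema is hypothetical, but the pattern is the same for any SQL database.

```python
# A sketch of structured, schema-first storage, using Python's built-in
# sqlite3 in place of a server database like MySQL. The schema is
# hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        email TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)",
             ("Jane Doe", "jane@example.com"))

# The schema must exist before the data can: that is the structure a
# traditional database demands up-front.
for row in conn.execute("SELECT name, email FROM customers"):
    print(row)
```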

However, storing your data in a traditional database carries with it a range of challenges, not least of which is that the data has to be well structured before it can be stored. For many applications, the database tends to become the bottleneck fairly quickly. Too many tables that keep growing with too many rows, a tendency to scale up—that is, to add more processing, storage, and memory capacity on a single machine rather than splitting data across multiple machines—and poorly written SQL queries can cause even mid-size databases to grind to a halt. When it comes to Big Data, the application itself is often the least complex part—the data store and the approaches used to access it are often the most critical aspect.

At the other end of the spectrum are NoSQL databases. NoSQL databases overcome many of the limitations of traditional databases—the complexity of splitting data across multiple servers and the need to have structure defined up-front. However, they suffer from some severe limitations of their own. Specifically, they have limited support for traditional SQL query-based approaches to storing and retrieving data. Recently, NoSQL databases have added more SQL-like capabilities to address these limitations, and they certainly excel at storing large quantities of unstructured data, such as documents and other forms of text. There are a number of products available in this area, but the most well-known (and best-funded) is MongoDB.
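
By contrast with the SQL sketch above, a document store accepts records with no up-front schema at all. Here is a minimal sketch assuming Python, the pymongo driver, and a MongoDB server running locally; the database and field names are hypothetical.

```python
# A sketch of schema-less document storage with MongoDB's Python driver,
# pymongo. Assumes a local MongoDB server; names are placeholders.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
docs = client["analytics"]["pages"]

# Documents in one collection need not share a schema.
docs.insert_one({"url": "http://example.com/a", "text": "big data..."})
docs.insert_one({"url": "http://example.com/b", "price": 49.99,
                 "brand": "Acme"})

# Query by whatever fields a given document happens to have.
print(docs.find_one({"brand": "Acme"}))
```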

Finally, there is what has become the mainstream Big Data solution, Hadoop. The Apache Hadoop distribution is available for free, and the design of the technology makes it easy to distribute data across many machines for analysis. This means that instead of having to rely on one database located on one or a few machines, data can be spread across dozens if not hundreds or thousands of machines in a cluster for analysis. For those needing commercial support, more advanced management capabilities, and faster or near-real time analytics, commercial solutions are available from Cloudera, Hortonworks, MapR, and other vendors.

If your application is cloud-based, you can either configure your servers for Hadoop and MapReduce or, depending on your vendor, use a pre-built cloud-based solution. Amazon Elastic MapReduce (EMR) is one such service that is easy to use. Amazon also offers the AWS Data Pipeline service, which helps manage data flows to and from disparate locations. For example, Data Pipeline could automatically move log files from a cluster of server instances to the Amazon S3 storage service and then launch an EMR job to process the data.

The downside of using such pre-built, platform-specific solutions is that it makes it somewhat harder to move to another cloud platform or a custom configuration should you want to do so later. But in many cases, the benefits of using such solutions, such as the speed with which you can get started and the reduced maintenance overhead, often outweigh the potential downsides.

Note Keep in mind that when you go with a specific platform, like Amazon’s version of MapReduce called EMR, you may have trouble migrating the data and all the work you put into creating a solution to another platform. Often, the benefits outweigh the risks, but give it some thought before you commit a lot of resources.

Presenting Insights to Customers

Once you have an approach to collecting, storing, and analyzing your data, you’ll need a way to show the results to your users, customers, or clients and enable them to see the results and interact with the data. In many cases, their initial questions will beget more questions, so they will need a way to iterate on their analysis.

In terms of your application, you could design everything from scratch, but it makes a lot more sense to leverage the large number of pre-built modules, tools, and services available. Suppose your application needs to support the display of data on a map and the ability for users to interact with that data. You could build an entire mapping solution from scratch, but it is much more effective to make use of pre-built solutions.

If you’re building a web-based data analytics service, CartoDB, BatchGeo, and Google Fusion Tables are cloud-based data presentation and mapping solutions that make it easy to integrate geographic visualization into your application. Instead of writing a lot of custom visualization support, you simply upload your data in a supported format to one of those services and integrate a little bit of code into your application to enable the selected service to display the resulting visualization.
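
As a sketch of what that upload step involves, the following converts rows of location data into GeoJSON, a format such mapping services commonly accept; the data points and file name are hypothetical.

```python
# A sketch of preparing location data for a cloud mapping service:
# converting rows of lat/lng readings into GeoJSON. Data is hypothetical.
import json

deliveries = [
    {"stop": "Warehouse", "lat": 37.7749, "lng": -122.4194},
    {"stop": "Customer A", "lat": 37.8044, "lng": -122.2712},
]

geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": [d["lng"], d["lat"]]},
            "properties": {"name": d["stop"]},
        }
        for d in deliveries
    ],
}

with open("deliveries.geojson", "w") as f:
    json.dump(geojson, f)  # upload this file to the mapping service
```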

Such cloud-based mapping visualization solutions are also particularly useful for mobile applications. If you’re building a fleet routing application, for example, using pre-built modules means you can focus on building the best possible algorithms for routing your fleet and the easiest-to-use application interface for your drivers, rather than having to worry about how to display detailed map data.

The downside to using such services is that your data is accessible to those services. While they may be contractually obligated not to use your data other than to show visualizations on your site, there is some added security risk when it comes to confidential data, and it’s important to weigh the tradeoffs between ease of use and security when deciding to use any third-party cloud-based visualization solution.

If your use case demands it, you can instead integrate open source or proprietary geographic visualization software directly into your application. CartoDB, for example, is available as an open source solution as well.

Pre-built modules are available to support a number of other visualization capabilities as well. Highcharts is one solution that makes it easy to add great-looking charts to your application.

Depending on the nature of your application, more or less custom presentation development work will be required. My company, Content Analytics, uses a combination of custom visualization capabilities and pre-built solutions like Highcharts to provide complete visualization support in its product. This approach means the application’s users get the best of both worlds: a specialized user interface for viewing data about online marketing effectiveness combined with elegant visualizations that display and chart the data in compelling and easy-to-understand ways.

Another approach to visualization is to focus on data analysis in your application and leave the presentation to products designed for visualization, such as Tableau, QlikView, and Spotfire, which have extensive capabilities. If the competitive advantage for your application comes from your data and/or the analytics algorithms and approaches you develop, it may make the most sense to focus on those, make it easy to export data from your application, and use one of these products for data presentation. However, as a broader set of users, especially those without traditional data science and analytics backgrounds, integrate Big Data into their daily work, they will expect visualization capabilities to be built directly into Big Data Applications.

Gathering the Engineering Team

Building a compelling Big Data Application requires a wide range of skills. On your team, you’ll need:

· Back-end software developers who can build a scalable application that supports very large quantities of data.

· Front-end engineers who can build compelling user interfaces with high attention to detail.

· Data analysts who evaluate the data and develop the algorithms used to correlate, summarize, and gain insights from the data. They may also suggest the most appropriate visualization approaches based on their experiences working with and presenting complex data sets.

· User interface designers who design your application’s interface. You may need an overall designer who is responsible for your application’s look and feel, combined with individual designers familiar with the interface norms for specific desktop, tablet, and mobile devices. For example, if you are building an iPhone interface for your application, you will need a designer familiar not only with mobile application design but also with the iPhone-specific design styles for icons, windows, buttons, and other common application elements.

· DevOps (Development-Operations) engineers who can bridge the worlds of software coding and IT operations. They configure and maintain your application and its infrastructure, whether it is cloud-based or on-premise. DevOps engineers usually have a mix of systems administration and development expertise, enabling them to make direct changes to an application’s code if necessary, while building the scripts and other system configurations necessary to operate and run your application.

· A product or program manager whose mission it is to focus on the customer’s needs. With Big Data Applications it is all too easy to get lost in technology and data rather than staying focused on the insights that customers and users actually need.

Combine the right team with the right data sources and technologies, and you’ll be able to build an incredibly compelling Big Data Application. From fleet management to marketing analytics, from education to healthcare, the possibilities for new Big Data Applications are almost unlimited.

____________________

1http://www.forbes.com/sites/bruceupbin/2013/06/04/ibm-buys-privately-held-softlayer-for-2-billion/

2http://www.informationweek.com/cloud/infrastructure-as-a-service/amazons-cloud-revenues-examined/d/d-id/1108058?