Understanding the Big Data Toolset - Appendixes - Microsoft Business Intelligence Tools for Excel Analysts (2014)

Microsoft Business Intelligence Tools for Excel Analysts (2014)

PART IV: Appendixes

Appendix A: Understanding the Big Data Toolset

Appendix B: Considerations for Delivering Mobile BI

APPENDIX A: Understanding the Big Data Toolset

Big data refers to a broad set of tools that are designed to work with large data volumes. These tools are gaining popularity because they solve the limitations of traditional tools by scaling up on hardware and sharing the workload across many linked computers. Using this divide-and-conquer approach allows big data technologies to accomplish tasks that are not feasible otherwise.

When it comes to analytics, there are some special considerations regarding big data. You need an easy way to access and navigate platforms without the need to spend precious time learning new programming languages such as Java. In the past few years several tools have come available that offer a big data SQL solution: a tool that sits on top of a big data platform and allows you to write SQL queries against the big data engine.

In this appendix we review the big data SQL tools on the market now, giving you a sense of the big data landscape. You’ll also discover how you can connect to those tools from Excel.

Big Data SQL Offerings

There are several big data offerings that support a relatively easy-to-use SQL engine on top of Hadoop or another similar-style platform.

note.eps We cover the talked-about entrants into the market as of this writing. Note, though, that this space is quite new and many companies are jumping into it almost daily.

Most of these technologies have a similar approach to big data SQL. They scale up on hardware and use a divide-and-conquer approach to break big data volumes into small chunks and distribute the processing across multiple nodes. The main differentiator among the various technologies is whether or not they are Hadoop-based. For example, Amazon Redshift does not sit on top of Hadoop, but instead uses a different technology that allows similar scalability as Hadoop. On the other hand, Hortonworks and Cloudera are tools that are installed on top of Hadoop and take advantage of its scalability features.

Amazon Redshift

Provider: Amazon under the Amazon Web Services platform

Web Site: http://aws.amazon.com/redshift

Platforms: This tool only runs in the cloud on Amazon Web Services.

Technology Overview: Amazon Redshift is a fully managed cloud-based data warehouse service. You can think of it as a kind of DropBox for databases ranging from a few hundred gigabytes to a petabyte or more. The technology Redshift uses is built to make it easy to scale up on hardware as data needs grow. It also includes compression and a columnar-based design that is best suited for analytics queries and can support very large data volumes.

Pros: Amazon Redshift is a production-grade tool that delivers on its promises. Getting up and running with it is easy, and managing it is even easier.

Cons: It is only available in the cloud with no option for on-site installation. Connecting to the Amazon Redshift via Microsoft tools is a challenge from behind the firewall. Although, most big data SQL tools that are in the cloud have this same challenge.

Hortonworks Hive

Provider: Hortonworks (partners with Microsoft)

Web Site: http://hortonworks.com

Platforms: This tool is available on Hadoop and can be cloud- or on-premises-based.

Technology Overview: Hive is the open-source SQL implementation on Hadoop. Up to recently it did not allow real-time SQL queries. However, this has changed with the Stinger initiative, which has the objective of configuring Hive to allow real-time SQL queries.

Pros: On-premises or cloud installation. Allowing on-premises installation is critical if your organization has a no-cloud policy.

Cons: As of this writing, Hive is still relatively new and the real-time engine is in Beta release.

Cloudera Impala

Provider: Cloudera

Web Site: www.cloudera.com

Platforms: This tool is a real-time SQL engine that was developed by Cloudera and sits on the Hadoop platform.

Technology Overview: Impala is a proprietary SQL engine that is designed for analytics purposes and high scalability. It can read traditional Hadoop file formats and can span to multiple Hadoop nodes. It uses C++ instead of Java for performance and doesn’t translate SQL into MapReduce.

Pros: On-premises or cloud installation. Allowing on-premises installation is critical if your organization has a no-cloud policy.

Cons: Does not allow incremental update of data, forcing a complete rebuild when underlying data changes. This is a pretty big limitation for large datasets as it can take quite a bit of time to reload.

sb_button.tif Defining MapReduce

Many of the big data toolsets offer MapReduce, a software framework originally developed by Google to improve its indexing algorithms and heuristics. The idea behind MapReduce is that large amounts of unstructured data can be processed in lots of smaller parallel nodes across many processers or stand-alone machines. This effectively distributes the process loads such that it allows programmers to handle massive amounts of data much faster and without communication or server failures. MapReduce is most commonly used for data mining, analysis of large financial systems, and data-intensive scientific simulations.


Provider: IBM

Web Site: www.ibm.com/software/data/bigdata

Platforms: This tool is an SQL engine that was developed by IBM and sits on the IBM-variant of the Hadoop platform called BigInsights.

Technology Overview: Big SQL is a proprietary SQL engine that is designed for analytics purposes. According to IBM, Big SQL takes the SQL syntax submitted by the user and translates it to individual MapReduce jobs. Big SQL can also support real-time queries over a single node only.

Pros: On-premises or cloud installation. Allowing on-premises installation is critical if your organization has a no-cloud policy.

Cons: Does not allow real-time SQL queries against multiple nodes because it relies on MapReduce jobs for that.

Google BigQuery

Provider: Google

Web Site: http://cloud.google.com/products/bigquery

Platforms: This tool is a real-time SQL engine that was developed by Google using its proprietary technology.

Technology Overview: Google BigQuery is a proprietary SQL engine that is designed for analytics purposes and high scalability. BigQuery is the public version of Google’s own Dremel query service that Google has used for years to track device installation and analyze spam. Dealing in read-only datasets, Google BigQuery allows programmers to use SQL-like queries to extract and analyze billions of rows at a time.

Pros: Is capable and easy to administer.

Cons: Cloud-only installation, which can be limiting for some organizations. Per Google’s description, “append-only” implies some limitation on updates of historical data, which could be limiting for analytics datasets.

Facebook Presto SQL

Provider: Facebook

Web Site: http://prestodb.io

Platforms: This tool is an open-source SQL engine that was developed by Facebook and sits on top of the open-source Hadoop platform.

Technology Overview: Presto SQL was developed by Facebook to address the latency limitations of MapReduce jobs and allow interactive queries against large datasets stored in Hadoop.

Pros: On-premises or cloud installation. Allowing on-premises installation is critical if your organization has a no-cloud policy.

Cons: New tool relative to the others in this chapter.

Defining a Big Data Connection

For analytics purposes, your job will mainly involve accessing data from a big data platform. Loading the data into a platform requires specialized skills and assistance from your system administrators. However, after the data is loaded into the platform, you can access it via your analytics tools, including Excel.

Most big data tools allow you to access data via ODBC or JDBC drivers. With Microsoft tools, you use ODBC drivers. The first step in connecting to the platform is to create your ODBC connection.

note.eps Before you can connect to your platform, make sure that you have the drivers installed on your machine. Each tool has its own requirements for an ODBC driver. For example, Amazon Redshift requires you to install the PostgreSQL ODBC driver to connect to one of its clusters. You can find the proper driver for each on the tool’s Web site.

After you have installed the proper driver, follow these steps to connect to your platform:

1. Open the Data Sources (ODBC) Administrator dialog box and click the System DSN tab (see Figure A-1).

tip.eps If you don’t know where ODBC is installed on your computer, choose Start → Search, and type ODBC in the search bar.

You may already have data sources defined because of other tools that use ODBC connections. In our case we have two Redshift connections defined.


Figure A-1: ODBC Data Source Administrator dialog box.

2. Click the Add button to add a new ODBC connection.

A window opens with a list of available drivers on your computer, as shown in Figure A-2.


Figure A-2: Select a driver from the available listed.

3. Select the driver you installed.

4. Click Finish.

The Setup window opens, where you define your connection (see Figure A-3).

This window is specific to the driver you’re using and the tool you’re connecting to. You get the connection information from your big data administrator.


Figure A-3: Driver Setup window.

5. Enter your connection information, click the Test button to test your connection, and then click Save.

Your connection is now ready to be used in your analytics application.

sb_button.tif Special cloud considerations

When connecting to a cloud installation, you need to be aware of your network’s firewall settings. This issue only comes into play when you have long running queries (30 minutes or more) as may be the case with big data queries. By default, firewall settings may kill TCP/IP connections that run long. This causes big data tools to not return the result set to the client tool. Getting around this issue requires you to override this default option by setting the TCP/IP Keep Alive setting to Yes. This option varies by driver and some of them do not allow it (as is the case with the PostgreSQL ODBC driver). Big data tools are still maturing, and most vendors are working on making it easier for you to connect to their platforms.

Connecting to Big Data Tools with Excel

After defining your ODBC connection, you can access your big data tool by following these steps with Excel:

1. From the Data tab, click the Connections button.

The Excel Workbook Connections dialog box opens.

2. Click the Add button at the top of the window.

The Existing Connections window opens.

3. Click the Browse for More button.

The Select Data Source window opens.

4. Click the New Source button.

The Data Connection Wizard opens (see Figure A-4).


Figure A-4: Data Connection Wizard.

5. Select ODBC DSN and click Next.

The Connect to ODBS Data Source screen opens, with a list of ODBC connections that have been defined on your computer (see Figure A-5).


Figure A-5: Data Connection Wizard ODBC data sources.

6. Select your ODBC connection and click Next.

The Select Database and Table screen opens (see Figure A-6).


Figure A-6: The Select Database and Table screen.

7. Select a table that you want to access and click Finish.

The Workbook Connections dialog box opens again with the connection you just defined.

8. Click Close at the bottom of the dialog box to return to the main workbook.

Modifying your connection

With a defined connection, Excel only allows you to connect to a table or a view. You can, however, change the SQL that has been built by the wizard to create a more sophisticated SQL query that would better suit your reporting needs. To accomplish this, follow these steps:

1. On the Data tab, click the Connections button.

The Workbook Connections dialog box opens.

2. Select your defined connection and click the Properties button.

The Connection Properties dialog box opens.

3. Click the Definition tab (see Figure A-7).


Figure A-7: The Connection Properties dialog box.

4. Modify the SQL in the Command Text box.

Figure A-7 shows how we changed the SQL to return the top 10 rows only for that table, but you can get as sophisticated as you want with your query (provided your big data SQL tool supports the syntax you’re trying to use).

5. When you’re done, click OK to save your changes.

Using your connection

Now that you have defined the connection in Excel, you can start using it.

1. From the Data tab, click the Existing Connections button.

The Existing Connections window opens.

2. Select your connection and click the Open button.

The Import Data dialog box opens (see Figure A-8).


Figure A-8: The Import Data dialog box.

3. Select the type of view you want to create and click OK.

Excel links the table to the query you defined in your connection. You can refresh this data as needed.

caution.eps Be careful not to return large amounts of data in your query as your source could have large data volumes.