Data Science For Dummies (2016)

Part 4

Computing for Data Science

Chapter 17

Doing Data Science with Excel and KNIME

IN THIS CHAPTER

· Using Excel to examine your data

· Formatting and summarizing data in Excel

· Automating tasks in Excel

· Improving your business with KNIME

In this day and age, when it seems like every organization is reliant upon cloud-based applications, standard installable desktop applications are fewer and farther between. Nonetheless, there are still a few programs out there that you can install on your computer and use for data science tasks. In this chapter, I explain how you can use Microsoft Excel to perform some basic tasks to help simplify your project work in data science. I also introduce a free, open-source analytics platform called KNIME and discuss how you can use it to perform advanced data science tasks without having to learn how to code.

Making Life Easier with Excel

Microsoft Excel holds a special place among data science tools. It was originally designed to act as a simple spreadsheet. Over time, however, it has become the people’s choice in data analysis software. In response to user demands, Microsoft has added more and more analysis and visualization tools with every release. As Excel advances, so do its data munging and data science capabilities. Excel 2013 includes easy-to-use tools for charting, pivot tables, and macros. It also supports scripting in Visual Basic for Applications (VBA) so that you can design scripts to automate repeatable tasks.

The benefit of using Excel in a data science capacity is that it offers a fast and easy way to get up close and personal with your data. If you want to browse every data point in your dataset, you can quickly and easily do this using Excel. Most data scientists start in Excel and eventually add other tools and platforms when they find themselves pushing against the boundaries of the tasks Excel is designed to do. Still, even the best data scientists out there keep Excel as an important tool in their tool belt. When working in data science, you might not use Excel every day, but knowing how to use it can make your job easier.

Remember: Although you have many different tools available to you when you want to see your data as one big forest, Excel is a great first choice when you need to look at the trees. Excel attempts to be many different things to many different kinds of users. Its functionality is well-compartmentalized, to avoid overwhelming new users, while still providing power users with the more advanced functionality they crave. In the following sections, I show you how you can use Excel to quickly get to know your data. I also introduce Excel pivot tables and macros, and tell you how you can use those to greatly simplify your data cleanup and analysis tasks.

Using Excel to quickly get to know your data

If you’re just starting off with an unfamiliar dataset and you need to spot patterns or trends as quickly as possible, use Excel. Excel offers effective features for exactly these purposes. Its main features for a quick-and-dirty data analysis are

· Filters: Filters are useful for hiding all records that are irrelevant to the analysis at hand.

· Conditional formatting: Specify a condition, and Excel flags records that meet that condition. By using conditional formatting, you can easily detect outliers and trends in your tabular datasets.

· Charts: Charts have long been used to visually detect outliers and trends in data, so charting is an integral part of almost all data science analyses.

To see how these features work in action, consider the sample dataset shown in Figure 17-1, which tracks sales figures for three employees over six months.

FIGURE 17-1: The full dataset that tracks employee sales performance.

Filtering in Excel

To narrow your view of your dataset to only the data that matters for your analysis, use Excel filters to filter irrelevant data out of the data view. Simply select the data and click the Home tab’s Sort & Filter button, and then choose Filter from the options that appear. A little drop-down option then appears in the header row of the selected data so that you can select the classes of records you want to have filtered from the selection. Using the Excel Filter functionality allows you to quickly and easily sort or restrict your view to only the subsets of the data that interest you the most.

Take another look at the full dataset shown in Figure 17-1. Say you want to view only data related to Abbie’s sales figures. If you select all records in the Salesperson column and then activate the filter functionality (as just described), you can use the drop-down menu that appears to show only the records for Abbie, as shown in Figure 17-2. When filtered, the table is reduced from 18 rows to only 6 rows. With this particular example, that change doesn’t seem so dramatic, but when you have hundreds, thousands, or even a million rows, this feature comes in very handy.

FIGURE 17-2: The sales performance dataset, filtered to show only Abbie’s records.

Warning: Excel lets you store only up to 1,048,576 rows per worksheet.
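
If you want to see the same idea outside of Excel, the filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are hypothetical (only Abbie’s corrected March figure of 20,818.77 comes from this chapter), and the function name filter_by_salesperson is made up for the example.

```python
# Hypothetical sample rows: (salesperson, month, total_sales).
# Only Abbie's March value appears in the chapter; every other number is invented.
records = [
    ("Abbie", "Jan", 19500.00),
    ("Abbie", "Feb", 20100.50),
    ("Abbie", "Mar", 20818.77),
    ("Brian", "Jan", 15200.00),
    ("Brian", "Feb", 15850.25),
    ("Brian", "Mar", 16010.00),
    ("Chris", "Jan", 9800.00),
    ("Chris", "Feb", 10400.75),
    ("Chris", "Mar", 11250.00),
]


def filter_by_salesperson(rows, name):
    """Return only the rows whose salesperson column equals `name`."""
    return [row for row in rows if row[0] == name]


abbie_rows = filter_by_salesperson(records, "Abbie")
for salesperson, month, total in abbie_rows:
    print(f"{salesperson:<6} {month}  {total:>10,.2f}")
```

As in Excel, the original rows are left untouched; the filter only narrows what you look at.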

Conditional formatting to spot outliers and trends in tabular data

To quickly spot outliers in your tabular data, use Excel’s Conditional Formatting feature. Imagine that, because of a data entry error, Abbie’s March total sales show $208,187.70 when the figure was supposed to be only $20,818.77. You’re not quite sure where the error is located, but you know that it must be significant because the figures seem off by roughly $187,000.

To quickly show such an outlier, select all records in the Total Sales column and then click the Conditional Formatting button on the Ribbon’s Home tab. When the button’s menu appears, choose the Data Bars option. Doing so displays the red data bar scales shown in Figure 17-3. With data bars turned on, the bar in the $208,187.70 cell is so much larger than any of the others that you can easily see the error.

FIGURE 17-3: Spotting outliers in a tabular dataset with conditional formatting data bars.
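
The logic behind data bars is simple enough to sketch in Python: each bar is drawn in proportion to the largest value in the column, so one mistyped figure dwarfs everything else. The sketch below makes some assumptions: the column values are hypothetical apart from the mistyped $208,187.70, and the rule of flagging anything more than five times the column median is an arbitrary threshold I picked for illustration, not something Excel does.

```python
from statistics import median

# Hypothetical Total Sales column; 208187.70 is the mistyped value from the text.
total_sales = [19500.00, 20100.50, 208187.70, 15200.00, 15850.25, 16010.00]


def data_bar(value, largest, width=40):
    """Scale a value to a text bar relative to the column maximum."""
    return "#" * round(width * value / largest)


def flag_outliers(values, ratio=5.0):
    """Flag values more than `ratio` times the column median (assumed threshold)."""
    mid = median(values)
    return [v for v in values if v > ratio * mid]


largest = max(total_sales)
for value in total_sales:
    print(f"{value:>12,.2f} {data_bar(value, largest)}")

suspects = flag_outliers(total_sales)
print("Suspect values:", suspects)
```

The printed bars make the same point as the figure: one bar fills the full width while the rest barely register.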

If you want to quickly discover patterns in your tabular data, you can choose the Color Scales option (rather than the Data Bars option) from the Conditional Formatting menu. After correcting Abbie’s March Total Sales figure to $20,818.77, select all cells in the Total Sales column and then activate the Color Scales version of conditional formatting. Doing so displays the result shown in Figure 17-4. From the red-white-blue heat map, you can see that Abbie has the highest sales total and that Brian has been selling more than Chris. (Okay, you can’t see the red-white-blue in my black-and-white figures, but you can see the light-versus-dark contrast.)

FIGURE 17-4: Spotting outliers in a tabular dataset with color scales.

Excel charting to visually identify outliers and trends

Excel’s Charting tool gives you an incredibly easy way to visually identify both outliers and trends in your data. An XY (scatter) chart of the original dataset (refer to Figure 17-1) yields the scatter plot shown in Figure 17-5. As you can see, the outlier is overwhelmingly obvious when the data is plotted on a scatter chart.

FIGURE 17-5: Excel XY (scatter) plots provide a simple way to visually detect outliers.

Alternatively, if you want to visually detect trends in a dataset, you can use Excel’s Line Chart feature. The data from Figure 17-4 is shown as a line chart in Figure 17-6.

FIGURE 17-6: Excel line charts make it easy to visually detect trends in data.

As you can clearly see from the figure, Chris’s sales performance is low. He’s in last place among the three salespeople, but he’s gaining momentum. Because he seems to be improving, maybe management would want to wait a few months before making any firing decisions based on sales performance data.
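
A line chart lets you eyeball momentum; if you want a number to go with it, one option is the least-squares slope of each salesperson’s monthly totals. The following Python sketch assumes hypothetical six-month series (the chapter’s actual figures live in the images), and trend_slope is a helper name chosen for this example.

```python
def trend_slope(values):
    """Least-squares slope of values against their position (0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical six-month sales series, not taken from the chapter's figures.
monthly_sales = {
    "Abbie": [19500, 20100, 20800, 20600, 21000, 21200],
    "Brian": [15200, 15800, 16000, 15900, 16100, 16300],
    "Chris": [9800, 10400, 11200, 12100, 13000, 14100],
}

slopes = {name: trend_slope(series) for name, series in monthly_sales.items()}
for name, slope in slopes.items():
    print(f"{name}: average change of {slope:,.0f} per month")
```

With these made-up numbers, Chris has the lowest totals but the steepest slope, which is the "last place but gaining momentum" pattern described above.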

Reformatting and summarizing with pivot tables

Excel includes the pivot table to make it easier for users to extract valuable insights from large sets of spreadsheet data. If you want to generate insights by quickly restructuring or reclassifying your data, use a pivot table. One of the main differences between a traditional spreadsheet and a dataset is that spreadsheets tend to be wide (with a lot of columns) and datasets tend to be long (with a lot of rows). Figure 17-7 clearly shows the difference between a long dataset and a wide spreadsheet.

FIGURE 17-7: A long dataset and a wide spreadsheet.

The way that Excel is designed leads many users to intuitively prefer the wide format, which makes sense because it’s a spreadsheet application. To counter this preference, however, Excel offers the pivot table feature so that you can quickly convert between long and wide formats. You can also use pivot tables to quickly calculate subtotals and summary calculations on your newly formatted and rearranged data tables.

Tip: Creating pivot tables is easy: Just select all cells that make up the table that you seek to analyze. Then click the PivotTable button on the Insert tab. This opens the Create PivotTable dialog box, where you can define where you want Excel to construct the pivot table. Click OK, and Excel automatically generates a PivotTable Fields interface on the page you’ve specified. From this interface, you can specify the fields you want to include in the pivot table and how you want them to be laid out.

The table shown in Figure 17-8 was constructed using the long-format sales performance data shown in Figure 17-7. It’s an example of the simplest possible pivot table that can be constructed, but even at that, it automatically calculates subtotals for each column, and those subtotals automatically update when you make changes to the data. What’s more, pivot tables come with pivot charts — data plots that automatically change when you make changes to the pivot table filters based on the criteria you’re evaluating.

FIGURE 17-8: Creating a wide data table from the long dataset via a pivot table.
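
To make the long-to-wide reshaping concrete, here is a minimal Python sketch of what a simple pivot table does: it turns one-record-per-row data into a month-by-salesperson grid and adds a subtotal for each column. The rows are hypothetical, and the pivot function is my own illustration, not a description of how Excel implements pivot tables.

```python
from collections import defaultdict

# Hypothetical long-format rows: one (salesperson, month, total_sales) record each.
long_rows = [
    ("Abbie", "Jan", 19500.0), ("Abbie", "Feb", 20100.0),
    ("Brian", "Jan", 15200.0), ("Brian", "Feb", 15800.0),
    ("Chris", "Jan", 9800.0), ("Chris", "Feb", 10400.0),
]


def pivot(rows):
    """Reshape long rows into {month: {salesperson: total}} plus per-salesperson subtotals."""
    wide = defaultdict(dict)
    subtotals = defaultdict(float)
    for salesperson, month, total in rows:
        wide[month][salesperson] = wide[month].get(salesperson, 0.0) + total
        subtotals[salesperson] += total
    return dict(wide), dict(subtotals)


wide, subtotals = pivot(long_rows)
names = sorted(subtotals)
print("Month  " + "".join(f"{n:>12}" for n in names))
for month, cells in wide.items():
    print(f"{month:<7}" + "".join(f"{cells.get(n, 0.0):>12,.2f}" for n in names))
print("Total  " + "".join(f"{subtotals[n]:>12,.2f}" for n in names))
```

Rerunning pivot after you change a row recomputes the subtotals, which loosely mirrors how a refreshed pivot table picks up changes to its source data.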

Automating Excel tasks with macros

Within Excel, macros act as a set of functions and commands that you can use to automate tasks. If you want to save time (and hassle) by automating Excel tasks that you routinely repeat, use macros.

Macros are pre-scripted routines written in Visual Basic for Applications (VBA). You can use macros to decrease the amount of manual processing you need to do when working with data in Excel.

To access macros, first activate Excel’s Developer tab from within the Options menu on the File tab. (When the Options menu opens, choose Customize Ribbon from your choices on the left, and then click to select the Developer check box in the column on the right.) Using the Developer tab, you can record a macro, import one that was created by someone else, or code your own in VBA.

To illustrate macros in action, imagine that you have a column of values and you want to insert an empty cell between each one of the values, as shown in Figure 17-9. Excel has no easy, out-of-the-box way to make this insertion. Using Excel macros, however, you can ask Excel to record you while you step through the process one time, and then assign a key command to this recording to create the macro. After you create the macro, every time you need to repeat the same task in the future, just run the macro by pressing the key command, and the script then performs all the required steps for you.

FIGURE 17-9: Using a macro to insert empty cells between values.

Tip: When you record a macro, it will record in Absolute mode by default. If you want it to record the macro in Relative mode instead, you’ll need to select the Use Relative References option before recording the macro:

· Relative: Every action and movement you make is recorded as relative to the cell that was selected when you began the recording. When you run the macro in the future, it will run in reference to the cell that’s selected, acting as though that cell were the same cell you had initially selected when you recorded the macro.

· Absolute: After you start recording the macro, every action and movement you make is repeated when you run the macro in the future, and those actions or movements are not made in any relative reference to whatever cell was active when you started recording. The macro routine is repeated exactly as you recorded it.

In the preceding example, the macro was recorded in Relative mode. This enables the macro to be run repeatedly, anywhere, and on top of the results from any preceding macro runs. Because the macro recorded only one iteration of the process, recording it in Absolute mode would have caused every run to keep adding an empty cell between only the first and second values. It would not operate on any cells other than the ones it was recorded upon.
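
The difference between the two recording modes is easier to see in a toy model. The Python sketch below is not VBA and does not touch Excel; it treats a column as a list and assumes the recorded steps were "insert an empty cell below the active cell, then move down two cells." The function names and the sample values are hypothetical.

```python
def run_relative(cells, active):
    """One run of a Relative-mode recording: insert a blank below the active cell."""
    cells.insert(active + 1, "")
    return active + 2  # the next value becomes the active cell


def run_absolute(cells):
    """One run of an Absolute-mode recording: always insert below the first cell."""
    cells.insert(1, "")


relative_column = [10, 20, 30, 40]
active = 0
for _ in range(3):
    active = run_relative(relative_column, active)

absolute_column = [10, 20, 30, 40]
for _ in range(3):
    run_absolute(absolute_column)

print(relative_column)  # blanks spread between every pair of values
print(absolute_column)  # blanks pile up between the first and second values
```

Three runs in Relative mode space out the whole column, whereas three runs in Absolute mode stack every blank in the same spot.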

Warning: Macro commands are not entered into Excel’s Undo stack. If you use a macro to change or delete data, you’re stuck with that change.

Tip: Test your macros first and save your worksheets before using them so that you can revert to the saved file if something goes wrong.

Excel power users often graduate to programming their own macros using VBA. Because VBA is a full-fledged programming language, the possibilities from pairing Excel with VBA are almost endless. Still, ask yourself this question: If you’re going to invest time in learning a programming language, do you need to work within the confines of Excel’s spreadsheet structure? If not, you might consider learning a scientific computing language, like R or Python. These open-source languages have a more user-friendly syntax and are much more flexible and powerful.

Using KNIME for Advanced Data Analytics

If you don’t know how to code but still want the benefits that custom predictive analytics has to offer, you can download and install KNIME and use its visual environment to access these features. KNIME offers services, solutions, and open-source software to fulfill the advanced analytics requirements of today’s data-driven business enterprise. The company’s purpose is to provide an open platform that meets the data-mining and analytics needs of the masses.

If you want data-mining software that you can install on your PC and use for predictive analytics, look no further than KNIME Analytics Platform. KNIME is easy to use, so even beginners who don’t know how to code can use the program. For more advanced users, however, KNIME offers plug-ins that can be used to integrate Weka’s preconstructed analysis modules or to run R and Python scripts from within the application. Beginners and advanced users alike can use KNIME predictive analytics to

· Upsell and cross-sell: Build cross-selling and upselling models that enable you to increase sales by making optimal recommendations of other products that customers are likely to be interested in purchasing as well.

· Churn reduction: Mine customer data and identify which customers you’re most likely to lose and why.

· Sentiment and network analysis: Analyze the sentiment of people and organizations in your social networks, to help identify which areas of your business are performing well and which ones may need some work.

· Energy usage prediction and auditing: Perform time series analyses and build regression models from energy usage data.

If you’re curious about KNIME and how you can use it to increase revenues and streamline business workflows, the good news is that KNIME’s Analytics Platform is available for free. In the following sections, I discuss how KNIME can be used to reduce churn, perform sentiment analysis, and predict energy usage.

Reducing customer churn via KNIME

I talk a bit about customer churn in Chapter 20, where I spell out how you can use clustering techniques to maintain customer loyalty. Within KNIME, you use the k-means algorithm to perform this type of churn analysis. (For more on the k-means algorithm, see Chapter 6.) If you want to use KNIME to help reduce customer churn, it offers a cool churn analysis workflow (see www.knime.org/knime-applications/churn-analysis) that shows exactly how to use the platform to perform this type of analysis.
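
KNIME lets you run k-means without writing code, but if you’re curious about what the algorithm does under the hood, the following Python sketch shows the basic loop: assign each customer to the nearest centroid, and then move each centroid to the mean of its cluster. Treat it as an illustration only. It is not KNIME’s implementation, the customer features (logins per month and support tickets) are hypothetical, and seeding the centroids with two fixed customers is an assumption I made to keep the result reproducible.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((p - c) ** 2 for p, c in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Move each centroid to its cluster mean; keep it in place if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (logins per month, support tickets filed).
customers = [(2, 9), (3, 8), (1, 10), (20, 1), (22, 0), (19, 2)]

centroids, clusters = kmeans(customers, [customers[0], customers[3]])
for centroid, cluster in zip(centroids, clusters):
    print(f"centroid {centroid}: {cluster}")
```

With this made-up data, the low-login, high-ticket customers land in one cluster, which is the kind of segment you might examine first for churn risk.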

Using KNIME to make the most of your social data

You can use sentiment analysis (as I discuss in Chapter 20) to monitor and detect, early on, your customer satisfaction rates. KNIME offers a social media clustering workflow (see www.knime.org/knime-applications/social-media-sentiment-analysis) and a text processing plug-in (see https://tech.knime.org/knime-text-processing) that you can use against your social media data to keep a close eye on how customers and potential customers are feeling about your brand and its offerings.

In Chapter 20, I also discuss the benefits that strategic social networking can bring to your business. If you want to identify and forge alliances with thought leaders and online social influencers who operate in your target niche, use KNIME’s Social Media Leader/Follower Analysis workflow and network analysis plug-in at www.knime.org/knime-applications/social-media-leaderfollower-analysis.

Using KNIME for good environmental stewardship

Energy usage predictions and audits are essential to responsible energy planning. Within KNIME, you use time series analysis and autoregressive modeling to generate a predictive model from historical data and data trends. (For more on time series analysis and regression models, see Chapter 5.) If you want to use KNIME to build predictive models for energy usage, you can use the Energy Usage Prediction (Time Series Prediction) workflow (see www.knime.org/knime-applications/energy-usage-prediction) that the platform provides at its public workflow server (see www.knime.org/example-workflows).
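
To give a feel for what autoregressive modeling means, here is a minimal Python sketch of the simplest case, a first-order model in which each reading is predicted from the one before it. It is not the KNIME workflow. The readings are hypothetical values that I constructed to follow an exact pattern, and fit_ar1 is a helper name invented for this example.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = intercept + slope * y[t-1]."""
    previous, current = series[:-1], series[1:]
    n = len(previous)
    mean_prev = sum(previous) / n
    mean_curr = sum(current) / n
    covariance = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(previous, current))
    variance = sum((p - mean_prev) ** 2 for p in previous)
    slope = covariance / variance
    intercept = mean_curr - slope * mean_prev
    return intercept, slope


# Hypothetical daily energy readings in kWh, built so each value is 60 + 0.5 * the previous one.
usage_kwh = [100.0, 110.0, 115.0, 117.5, 118.75]

intercept, slope = fit_ar1(usage_kwh)
forecast = intercept + slope * usage_kwh[-1]
print(f"intercept={intercept:.2f}, slope={slope:.2f}, next reading={forecast:.3f} kWh")
```

Real energy data is noisier and usually calls for more lags and seasonal terms, which is where a prebuilt workflow earns its keep.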