Microsoft Business Intelligence Tools for Excel Analysts (2014)
PART I: Leveraging Excel for Business Intelligence
Chapter 7: Using the Power Query Add-In
In This Chapter
· Installing Power Query
· Learning about Power Query basics
· Understanding transformation actions
· Connecting to a wide array of data sources
· Creating and using Power Query functions
In information management, ETL refers to three separate functions:
· Extraction: Reading of data from a specified source and extracting a desired subset of data.
· Transformation: Cleaning, shaping, and aggregating of data to convert it to the desired structure.
· Load: Importing or writing of the resulting data to a target location.
You've probably been manually performing ETL processes for years — although you might not think of it that way. Whenever you pull data from a source location, manipulate that data, and integrate it into your reporting, you're performing ETL.
In an attempt to develop robust and reusable ETL processes, Microsoft released the Power Query Add-In. Power Query enhances the ETL experience by offering a mechanism to extract data from a wide variety of sources, perform complex transformations on that data, and then load the data into a workbook or the internal Data Model.
In this chapter, you see how the Power Query Add-In works and discover some of the innovative ways you can use it to help save time and automate the steps needed to ensure that clean data is imported into your reporting models.
Installing and Activating the Power Query Add-In
The Power Query Add-In does not come with Excel out of the box. Your administrator might have installed it for you. If you see the Power Query tab on the Ribbon — look ahead to Figure 7-1 — you already have it. If you don't see it, you need to download and install Power Query.
Downloading the Power Query Add-In
As of this writing, the Power Query Add-In is only available if you have one of the following editions of Office or Excel:
· Office 2010 Professional Plus: Available for purchase through any retailer
· Office 2013 Professional Plus: Available through volume licensing only
· Office 365 Pro Plus: Available with an ongoing subscription to Office365.com
· Excel 2013 Stand-alone Edition: Available for purchase through any retailer
If you have one of these editions, you can install and activate the Power Query Add-In. Type Excel Power Query Add-In into your favorite search engine to find the free installation package. Note that Microsoft offers Power Query for both Excel 2010 and Excel 2013 in both 32- and 64-bit platforms. Be sure to download the version that matches your version of Excel as well as the platform your PC is running.
After it’s installed, you need to activate the add-in by following these steps:
1. Choose File → Options.
2. Select the Add-Ins option on the left, select COM Add-Ins from the Manage drop-down list, and click Go.
3. Look for Power Query for Excel in the list of available COM Add-Ins. Select the check box next to each one of these options and then click OK.
4. Close and restart Excel.
The Power Query tab is now on the Ribbon, as shown in Figure 7-1.
Figure 7-1: The Power Query Add-In is on its own tab on the Ribbon.
Power Query Basics
The first step in using Power Query is to extract data from a wide variety of sources — the extraction part of ETL. For example, you might want to know how many radio stations in Maryland are dedicated to sports. You can find this data by pulling data from the Web through Power Query.
Searching for source data
To start searching for data, follow these steps:
1. Click the Power Query tab.
2. On the Get External Data group, click the Online Search button.
There are many types of external data sources you can choose here (you can find out about them in the section “Power Query Connection Types” later in this chapter). Searching online is a basic feature of Power Query, so for now, click the Online Search button to pull data.
Power Query opens the Online Search pane, where you can enter a search term.
3. Enter your search term.
For example, if you're interested in the number of radio stations that cover sports, you enter Maryland radio stations.
Power Query presents a list of Web sources that match your search term (see Figure 7-2).
4. Hover your mouse over a result to get a preview of the data shown on the left.
Figure 7-2: Hovering your mouse over any result shows you a preview of the data.
5. When you find a data source that contains what you need, click Edit at the bottom of the preview window.
The Query Editor window opens at this point, which allows you to select options (see the next section).
Whether connecting to a Web site, a database, or any other data source, Power Query always starts you off with a dedicated pane that shows the available data sources. You can hover your mouse over any data source to preview the contents. You can then click Edit at the bottom of the preview window to transform the data.
Shaping the selected source data
When you choose to edit a data source, Power Query opens a Query Editor window that contains its own Ribbon; a preview pane on the left, which shows a preview of the data; and a Query Settings pane on the right (see Figure 7-3). This is your primary workbench. Here, you can apply certain actions to shape, clean, and transform the data to suit your needs — the transform part of ETL.
Figure 7-3: The Query Editor window allows you to shape, clean, and transform your data.
The idea is to work with each column shown in the Query Editor, applying the necessary actions that leave you with the data and structure you need.
1. Right-click any column and choose an action.
We cover all the actions you can take in the section “Understanding Column and Table Actions” later in this chapter. For now, right-click the Format column and select Remove Other Columns to find out how many sport stations are in Maryland.
Now only one column (the Format column) in the date preview is showing.
2. Right-click the Format column and select Insert Custom Column.
An additional column is added.
The Insert Custom Column dialog box shown in Figure 7-4 opens.
Figure 7-4: Add a new custom column that contains the number 1 for all rows.
3. Type 1 in the Custom Column Formula box, and click OK.
Each row in the new column is now tagged with a 1, specifying that each row counts as one station. You can now aggregate the data to sum your newly created custom column for each station format.
You now see the two columns shown in Figure 7-5.
The default name given to any new column you add is Custom. You can rename your newly inserted columns by right-clicking the column header and selecting Rename.
4. Right-click the Format column and select Group By.
The Group By dialog box opens, as shown in Figure 7-6.
Figure 7-5: Power Query adds your custom column specifying that each row counts as one station.
Figure 7-6: Group by the Format column and sum the Custom column into an aggregate column called Count of Stations.
5. Select Format from the Group By drop-down menu, enter Count of Stations in the New Column Name box, select Sum from the Operation drop-down menu, and select Custom from the Column drop-down menu. Click OK when you’re done.
Power Query then groups by the Format column and sums the Custom column into an aggregate column called Count of Stations.
Now there is one row for each radio station format. Each row shows the count of radio stations.
6. Click the Sort command on the Query Editor Ribbon to sort by the Count of Stations column.
You get the results shown in Figure 7-7.
Now you know how many radio stations Maryland has dedicated to sports: 9.
Figure 7-7: The final table after grouping shows that Maryland has nine radio stations dedicated to sports.
Understanding query steps
With a few clicks, you searched the Internet, found some base data, and manipulated that data to suit your needs. This is what Power Query is all about — it enables you to easily pull, filter, and reshape data without the need for any programmatic coding skills. Power Query does all the legwork using its own formula language (also known as M language). Each action you take when working with Power Query results in a line of code that is written into a query step. Query steps are embedded M code that allow your actions to be repeated each time you refresh your Power Query data.
The query steps are listed in the Applied Steps section of the Query Settings pane (see Figure 7-8). Each query step represents an action you took to get to a final data table.
Figure 7-8: Check out query steps in the Applied Steps section of the Query Settings pane.
Click any step and the underlying M code appears in the Power Query formula bar.
If you don’t see a formula bar in the Query Editor, click the View tab on the Query Editor Ribbon, then place a check in the Formula Bar check box.
When you click a query step, the data shown in the preview pane is a preview of what the data looked like up to and including the step you clicked. For example, if you click the step before the GroupedRows step, you see what the data looked like before you applied grouping.
Managing query steps
You can right-click any step to see a menu of options for managing your query steps. Figure 7-9 shows the following options:
· Edit Settings: Edit the arguments or parameters that define the selected step.
· Rename: Give the selected step a meaningful name.
· Delete: Remove the selected step. Be aware that removing a step can cause errors if subsequent steps depend on the deleted step.
· Delete Until End: Remove the selected step and all following steps.
· Move Up: Move the selected step up in the order of steps.
· Move Down: Move the selected step down in the order of steps.
Figure 7-9: Right-click any query step to edit, delete, or move the step.
Viewing the Advanced Editor
Power Query lets you view and edit your query’s embedded M code directly. While in the Query Editor window, click the Advanced Editor button on the View tab. The Advanced Editor window opens, as shown in Figure 7-10.
Figure 7-10: The Advanced Editor window.
The Advanced Editor window is little more than a space for you to type your own M code. The M language is a fairly robust one, allowing for many advanced actions that you can’t take through the Query Editor. You can directly code your own steps.
If you’re interested in learning more about M language and coding your own steps, start with the M Language specification guide by Microsoft. To get it, type Microsoft Power Query Formula Language Specification into your favorite search engine.
Outputting your query results
When you’ve configured your Power Query data, you can output the results. This would be the load part of ETL.
At the bottom of the Query Settings pane (see Figure 7-11), choose how you want to output your data:
· Load to Worksheet: This is the default choice. It outputs the results as a table in a workbook.
· Load to Data Model: This choice outputs the data to the internal Data Model that you can integrate into a Power Pivot report.
Then click the Home tab and click the Apply & Close button. At this point, the Query Editor window closes, and depending on the location you have chosen, your results are loaded to either an Excel table in the current workbook or the internal Data Model.
Figure 7-11: At the bottom of the Query Settings pane, select whether you want your query to be output to a new worksheet or the internal Data Model.
Refreshing Power Query data
It’s important to note that Power Query data is not in any way connected to the source data used to extract it. A Power Query data table is merely a snapshot. In other words, as the source data changes, Power Query does not automatically keep up with the changes. You need to take steps to refresh your query.
Manually refresh a single Power Query output
You can manually refresh your Power Query outputs no matter how you chose to load the results:
· Load to an Excel worksheet: Find the worksheet that contains your Power Query output, click anywhere within the table, and then click the Refresh button on the Query tab. You can also right-click anywhere inside the table and select Refresh.
· Loaded to the internal Data Model: Open the Power Pivot window, select your Power Query data, and then click the Refresh button on the Home tab.
Set up automatic refresh
You can configure your data sources to automatically refresh your Power Query data. To do so, follow these steps:
1. On the Data tab, click the Connections button.
The Workbook Connections dialog box opens.
2. Select the Power Query data connection you want to refresh and click the Properties button.
3. In the Properties dialog box, click the Usage tab.
4. Set the options to refresh the chosen data connection every X minutes or the option to refresh the data connection when the Excel workbook is opened.
· Refresh Every X Minutes: Excel automatically refreshes the chosen data connection a specified number of minutes. This refreshes all tables associated with that connection.
· Refresh Data When Opening the File: Excel automatically refreshes the chosen data connection upon opening the workbook. This refreshes all tables associated with that connection as soon as the workbook is opened.
5. Click OK to confirm your changes.
Managing existing queries
You can see all the Power Query outputs in your workbook by clicking the Workbook button on the Power Query Ribbon. The Workbook Queries pane opens as shown in Figure 7-12.
Figure 7-12: The Workbook Queries pane enables you to edit, delete, and manage the queries in your workbook.
Right-click the desired query to take any one of the following actions:
· Edit: Opens the Query Editor, allowing you to modify the query name, description, and query steps.
· Refresh: Refreshes the data in the query.
· Duplicate: Creates a copy of the query.
· Reference: Creates a new query that references the output of the original query.
· Delete: Deletes the selected query.
· Merge: Merges the selected query with another query in the workbook by matching specified columns.
· Append: Appends the results of another query in the workbook to the selected query.
· Share: Publishes and shares the selected query via a Power BI server, which your IT department sets up and manages.
· Show the Peek: Shows a preview of the data.
Understanding Column and Table Actions
In the beginning of this chapter, you discovered how to transform data by applying actions to certain columns. Those actions are just a few of the many column-level and table-level transformations you can take. This section lists the various actions available and explains what each does.
Column level actions
Right-click any column in the Query Editor to open a context menu listing the actions you can take (see Figure 7-13).
Figure 7-13: Right-click any column to see the column-level actions you can use to transform the data.
To apply certain actions to multiple columns at one time, select the target columns before right-clicking.
Table 7-1 shows what each action is meant to accomplish, and which actions are available with multiple columns.
Many of the column actions have corresponding buttons on the Query Editor’s Ribbon.
Table 7-1: Column-Level Actions
Available When Selecting Multiple Columns
Removes the selected column from the Power Query data.
Remove Other Columns
Removes all non-selected columns from the Power Query data.
Use First Row As Headers
Replaces each table header name by the values in the first row of each column.
Creates a duplicate of the selected column as a new column placed at the far right of the table. The name given to the new column is Copy of X, where X is the name of the original column.
Splits the column into multiple columns either by a specified delimiter or by a specified number of characters.
Removes all rows from the selected column where the values duplicate earlier values. The row with the first occurrence of a value is not removed.
Removes rows containing errors in the selected column.
Replaces one value with another value in the selected column.
Fills empty cells in the column with the value of the first non-empty cell above them.
Changes the data type of the selected column to any of these types: Binary, Date, Date/Time, Date/Time/Timezone, Duration, Logical, Number, Text, Time, or Using Locale (localizes data types to the country you specify).
Changes the way values in the column are rendered. You can choose from the following options: Lowercase, Uppercase, Capitalize Each Word, Trim, Clean, JSON, or XML. If the values in the column are date/time values, the options are Date, Time, Day, Month, Year, or Day of Week. If the values in the column are number values, the options are Round, Absolute Value, Factorial, Base-10 Logarithm, Natural Logarithm, Power, or Square Root.
Insert Custom Column
Inserts a new column after the last column of the table. The values in the new column are determined by selecting the type of column to insert: Custom (you define the values) or Index (Power Query adds a sequential list of numbers).
Insert Index Column
Adds a column with a sequential list of index numbers starting from 0.
Aggregates data by row values. For example, you can Group by State and either count the number of cities in each state or sum the population of each state.
Transposes the selected columns from column-oriented to row-oriented, or vice versa.
Moves the selected column to a different location in the table. You have the choice of moving the column Left, Right, To Beginning, or To End.
Renames the selected column to a name you specify.
Navigates to the contents of the column. This is used with tables that contain metadata representing embedded information.
Add as New Query
Creates a new query with the contents of the column. This is done by referencing the original query in the new one. The name of the new query is the same as the column header of the selected column.
While in the Query Editor, Power Query allows you to apply certain actions to the entire data table. You can see the available table-level actions by clicking the table icon (see Figure 7-14).
Figure 7-14: Click the table icon in the upper left-hand corner of the Query Editor preview pane to see the table-level actions you can use to transform the data.
Table 7-2 lists each table-level action along with its primary purpose.
Many of the table actions have corresponding commands on the Query Editor’s Ribbon.
Table 7-2: Table-Level Actions
Use First Row As Headers
Replaces each table header name by the values in the first row of each column.
Removes all rows from where the values in the selected columns duplicate earlier values. The row with the first occurrence of a value set is not removed.
Removes rows containing errors in the currently selected columns.
Insert Custom Column
Inserts a new column after the last column of the table. The values in the new column are determined by the value or formula you define.
Insert Index Column
Adds a new column containing a sequential list of index numbers starting from 0.
Keep Top Rows
Removes all but the top N number of rows. You specify the number threshold.
Keep Top 100 Rows
Removes all but the top 100 rows.
Keep Range of Rows
Removes all rows except those that fall within a range you specify.
Remove Top Rows
Remove top N rows from the table.
Remove Alternate Rows
Remove alternate rows from the table starting at first row, specifying the number of rows to remove and the number of rows to keep.
Creates a new query that merges the current table with another query in the workbook by matching specified columns.
Creates a new query that appends the results of another query in the workbook to the current table.
Power Query Connection Types
Microsoft has invested a great deal of time and resources in ensuring that Power Query has the ability to connect to a wide array of data sources. Whether you need to pull data from an online search, an external Web site, a file, a database system, SharePoint, Facebook, or a big data source such as Hadoop, Power Query can accommodate most, if not all, your source data needs.
The available connection types are in the Get External Data group on the Power Query tab. As shown in Figure 7-15, Power Query offers these categories of connection types:
· Online Search: Pulls data from a Web source based on a specified search term.
· From Web: Pulls data from a Web site based on a specified URL.
· From File: Pulls data from a specified flat file or folder.
· From Database: Pulls data from a relational database system.
· From Other Sources: Pulls data from a wide array of inter-company, cloud, and big data sources. This category includes an option to start with a Blank Query in the event you want to write your own M code from scratch.
· From Table: Pull data from a defined Excel table within the existing workbook.
Figure 7-15: Power Query has the ability to connect to a wide array of data sources, from simple text files to big data sources such as Hadoop.
After you select a connection type you enter the parameters Power Query requires in a series of dialog boxes. Such parameters Power Query needs to connect to a data source are file path, URL, server name, and credentials.
Each connection type requires a unique set of parameters, so each of its dialog boxes will be different. Luckily, Power Query rarely needs more than a handful of parameters to connect to any one data source, so the dialog boxes are relatively intuitive and hassle-free.
Power Query saves the data source parameters for each data source connection you have used. You can view, edit, or delete any of the data source connections by clicking the Data Source Settings button on the Power Query tab. In the Data Source Settings dialog box, select any connection and you can edit or delete it; see Figure 7-16.
Figure 7-16: The Data Source Settings dialog box enables you to edit or delete previously used data connections.
Deleting a connection does not delete any of its associated data you may have already loaded in your workbook or internal Data Model. However, when you try to refresh the data, Power Query won’t have any of the connection parameters (because you deleted them), so it asks you again for the connection parameters.
Creating and Using Power Query Functions
Earlier in this chapter, we discussed how Power Query uses its own formula language known as M. When you connect to a data source and apply transformations to that data, Power Query diligently saves your actions as M code behind the scenes in query steps. This allows for your transformation steps to be repeated when you refresh the data in your query.
With a little knowledge, you can leverage the M language to extend the capabilities of Power Query with your own custom functions. Custom functions come in handy when you frequently need to apply business-specific calculations or perform complex transformations involving conditional testing with If … Then … Else logic.
In this section we walk you through the basics of building your own custom functions.
Creating and using a basic custom function
When building a custom function for Power Query, you’re creating a query and manipulating its M code to return a desired result. That result can be an array, a data table, or a single value.
In this section, you build a basic mathematical function that calculates profit. This function takes a revenue amount and a cost amount and outputs a profit amount using a basic mathematical operation:
Revenue - Cost = Profit
For basic functions such as this one, you can start with a blank query and enter the needed M code from scratch:
1. From the Power Query tab, choose From Other Data Sources → Blank Query.
The Query Editor window opens.
2. In the Query Editor Ribbon, click the View tab and click the Advanced Editor button.
3. In the Advanced Editor window, replace the starter syntax with the following code in the code box:
let Profit = (Revenue, Cost)=>
The first line tells Power Query that this function needs two parameters: Revenue and Cost. The second line tells Power Query to subtract the Cost parameter from the Revenue parameter. The last line of the code tells Power Query to return the result.
Power Query doesn’t care what you name the functions as long as the names start with a letter and don’t include any spaces.
Figure 7-17 shows what the code looks like in the Advanced Editor window.
Figure 7-17: Enter your custom code in the Advanced Editor window.
4. Click Done to close the Advanced Editor window.
5. In the Query Settings pane, change the name of the query in the Name input box.
Give your function a descriptive name — for example, FunctionProfit — as opposed to Query1.
6. On the Home tab of the Query Editor, click the Apply & Close button.
Power Query creates a seemingly useless table and adds the query to the Workbook Queries pane, as shown in Figure 7-18. Unfortunately, there is no way to create a function without creating an associated table, so as worthless as the table seems, you can’t delete it.
Figure 7-18: Your function is ready to use.
You can now use this function in other queries that contain revenue and cost fields. For example, Figure 7-19 shows a query with a field called Sales Amount and a field called Equipment Cost. You can use your newly created function to calculate profit using these two fields.
Figure 7-19: A table with Sales Amount and Equipment Cost fields.
Right-click any column header and select Insert Custom Column. In the Insert Custom Column dialog box (see Figure 7-20), you can invoke your function by name, passing the Sales Amount and Equipment Cost fields as parameters separated by a comma.
Figure 7-20: Use the Insert Custom Column dialog box to invoke your function.
Click OK, and Power Query triggers the function for each row in the data table. Figure 7-21 shows the newly created Custom column with the returned profit calculation. You can, of course, rename this field to indicate what the numbers represent (right-click the field header and select the Rename option).
Figure 7-21: Power Query triggers the function and returns a result for each row in the table.
Power Query functions apply only to the workbook in which they reside. If you start a new workbook, you need to re-create your functions in that new workbook.
Advanced function example: Combining all Excel files in a directory into one table
When building a basic function like a profit function, it’s no big deal to start from a blank query and enter all the code from scratch. But for more complex functions, it’s generally smarter to build a starter query via Query Editor, and then manipulate the M code to accomplish what you need.
For example, imagine you have a set of Excel files in a directory (see Figure 7-22). These files all contain a worksheet called MySheet that holds tables of data. The tables in each file have the same structure but need to be combined into one file. This is a common task that you’ve probably faced at one time or another. Without a solid knowledge of Excel VBA programming, this task typically entails opening each file, copying the data on the MySheet tab, and then pasting the data into single workbook.
Figure 7-22: Imagine you have the task of combining the data in all the Excel files in this directory into one table.
Power Query can make short work of this task, but it requires a bit of direction via a custom function. It would be difficult for most anyone to start from a blank query and type the M code for the relatively complex function needed for this endeavor. Instead, you can build a starter query via Query Editor, and then wrap the query in a function.
Follow these steps:
1. On the Power Query tab, choose From File → From Excel.
2. Browse to the directory that contains all the Excel files and select one of them.
3. In the Navigator pane (shown in Figure 7-23), select the sheet holding the data that needs to be consolidated and then click Edit to open the Query Editor.
4. Use the Query Editor to apply some basic transformation actions.
For example, you can designate the first row as a column header and remove any unneeded columns.
5. After you’ve applied the needed transformations, click the Advanced Editor button on the View tab.
The Advanced Editor window opens with the bulk of the code for your function already created, as shown in Figure 7-24. Power Query hard-coded the file path and the filename for the Excel file you originally selected. The idea is to wrap this starter code in a function that will pass a dynamic file path and filename.
Figure 7-23: Select one of the Excel files in the target directory and navigate to the sheet holding the data that needs to be consolidated.
Figure 7-24: Open the Advanced Editor to see the starter code.
6. Wrap the entire block of code with your function tags, specifying that this function requires two parameters: FilePath and FileName.
The hard-coded file path and filename have been replaced with their respective parameters, as shown in Figure 7-25.
7. Close the Advanced Editor.
8. In the Query Settings pane, change the name of the query in the Name box. Give your function a descriptive name (in this scenario, fGetMyFiles).
Figure 7-25: Replace the hard-coded file path and file name with your dynamic parameters.
9. Click the Apply & Close button on the Home tab.
At this point, you can use the custom function on all the files in the target directory.
10. On the Power Query tab, choose From File → From Folder to start a connection to the directory that contains all the Excel files.
11. In the From Folder dialog box, provide Power Query with the file path of the target directory.
The Query Editor window opens to show you a table similar to the one in Figure 7-26. This table contains a record for each file in the chosen directory. The Folder Path and Name columns supply the function with the needed FilePath and FileName parameters.
Figure 7-26: Create a new query using the From Folder connection type to retrieve a table of all the files in the target directory.
12. Right-click any column header and select Insert Custom Column.
13. In the Insert Custom Column dialog box (see Figure 7-27), invoke the function and pass the Folder Path and Name fields as parameters separated by a comma. Click OK.
Power Query triggers the function for each row in the data table. The function itself grabs the data from each file and returns a table array. Figure 7-28 shows the newly created Custom column with a returned table array for each file.
Figure 7-27: Use the Insert Custom Column dialog box to invoke the function.
Figure 7-28: Power Query triggers the function and returns a table array for each file in the directory.
14. Click the Custom column header to see a list of fields included in each table array (see Figure 7-29). Select which fields in the table array to show, click the Expand radio button, and then click OK.
Figure 7-29: Click the Custom column header to expand the table arrays.
With each table array expanded, Power Query shows the columns pulled from each Excel file and adds the detailed records to the data preview. Figure 7-30 shows the data preview for the final combined table.
15. Click the Apply & Close button to output the combined table.
Figure 7-30: The final combined view.
Don’t lose track of the fact that this relatively complex task was facilitated by a custom function. For all the steps required to accomplish this task, very little effort was actually expended on writing the code for the function. Power Query wrote the code for the core functionality, and you simply wrapped that code into a function.