Importing and Cleaning Data - Analyzing Data with Excel - Microsoft Excel 2016 BIBLE (2016)

Microsoft Excel 2016 BIBLE (2016)

Part V
Analyzing Data with Excel

Excel is a superb data-analysis tool — if you know how to extract the information you really need. In this part, you'll learn how to obtain, clean up, and analyze data in Excel. As you'll see, many of the data-analysis capabilities in Excel are both surprisingly powerful and easy to use.

In This Part

1. Chapter 32
Importing and Cleaning Data

2. Chapter 33
Introducing Pivot Tables

3. Chapter 34
Analyzing Data with Pivot Tables

4. Chapter 35
Performing Spreadsheet What-If Analysis

5. Chapter 36
Analyzing Data Using Goal Seeking and Solver

6. Chapter 37
Analyzing Data with the Analysis ToolPak

7. Chapter 38
Working with Get & Transform

Chapter 32
Importing and Cleaning Data

IN THIS CHAPTER

1. Importing data into Excel

2. Manipulating and cleaning data

3. Using the new Flash Fill feature to extract and concatenate data

4. Reviewing a checklist for data cleaning

5. Exporting data to other formats

Data is everywhere. For example, if you run a website, you're collecting data continually and you may not even know it. Every visit to your site generates information that is stored in a file on your server. This file contains lots of useful information, if you take the time to examine it.

That's just one example of data collection. Virtually every automated system collects data and stores it. Most of the time, the system that collects the data is also equipped to verify and analyze the data — but not always. And, of course, data is collected manually. A nonautomated telephone survey is a good example.

Excel is a good tool for analyzing data, and it's often used to summarize the information and ­display it in the form of tables and charts. But often, the data that's collected isn't perfect. For one reason or another, it needs to be cleaned up before it can be analyzed.

One common use for Excel is as a tool to clean up data. Cleaning up data involves getting raw data into a worksheet and then manipulating it so it conforms to various requirements. In the process, the data will be made consistent so it can be properly analyzed.

This chapter describes various ways to get data into a worksheet and provides some tips to help you clean it up.

Importing Data

Before you can do anything with data, you must get it into a worksheet. Excel is able to import most common text file formats and can retrieve data from websites.

Importing from a file

This section describes file types that Excel can open directly, using the File image Open ­command. Figure 32.1 shows the list of file filter options you can specify in the Open dialog box.

Open dialog box displaying the list of file filter options with a cursor pointing the highlighted Worksheets (*.xlsx',*.xlsm',*.xlxb',.xls).

Figure 32.1 Filtering by file extension in the Open dialog box.

Spreadsheet file formats

In addition to the current file formats (XLSX, XLSM, XLSB, XLTX, XLTM, and XLAM), Excel 2016 can open workbook files from all previous versions of Excel:

· XLS: Binary files created by Excel 4, Excel 95, Excel 97, Excel 2000, Excel 2002, and Excel 2003

· XLM: Binary files that contain Excel 4 macros (no data)

· XLT: Binary files for an Excel template

· XLA: Binary files for an Excel add-in

Excel can also open one file format created by other spreadsheet products: ODS, the OpenDocument spreadsheet format. ODS files are produced by a variety of “open” software, including Google Drive, OpenOffice, LibreOffice, StarOffice, and several others.

Note

Excel does not support Lotus 1-2-3 files, Quattro Pro files, or Microsoft Works files.

Database file formats

Excel 2016 can open the following database file formats:

· Access files: These files have various extensions, including .mdb and .accdb.

· dBase files: Produced by dBase III and dBase IV. Excel does not support dBase II files.

In addition, Excel supports various types of database connections that enable you to access data selectively. For example, you can perform a query on a large database to retrieve only the records you need (rather than the entire database).

Text file formats

A text file contains raw characters, with no formatting. Excel can open most types of text files:

· CSV: Comma-separated values. Columns are delimited with a comma, and rows are delimited with a carriage return.

· TXT: Columns are delimited with a tab, and rows are delimited with a carriage return.

· PRN: Columns are delimited with multiple space characters, and rows are delimited with a carriage return. Excel imports this type of file into a single column.

· DIF: The file format originally used by the VisiCalc spreadsheet. Rarely used.

· SYLK: The file format originally used by Multiplan. Rarely used.

Most of these text file types have variants. For example, text files produced on a Mac have different end-of-row characters. Excel can usually handle the variants without a problem.

When you attempt to open a text file in Excel, the Text Import Wizard might kick in to help you specify how you want the data to be retrieved.

Tip

To bypass the Text Import Wizard, press Shift while you click the Open button in the Open dialog box.

When Excel Can't Open a File

If Excel doesn't support a particular file format, don't be too quick to give up. It's likely that others have had the same problem as you. Try searching the Web for the file extension, plus the word Excel. It's possible that a file converter is available, or maybe someone has figured out how to use an intermediary program to open the file and export it into a format that Excel recognizes.

Importing HTML files

Excel can open most HTML files, which can be stored on your local drive or on a web server. Choose File image Open and locate the HTML file. If the file is on a web server, you'll need to copy the URL and paste it into the File Name field in the Open dialog box.

The way the HTML code renders in Excel varies considerably. Sometimes the HTML file may look exactly as it does in a browser. Other times it may bear little resemblance, especially if the HTML file uses Cascading Style Sheets (CSS) for layout.

In some cases, you can access data on the Web by using the Get & Transform feature. I discuss this topic in Chapter 38, “Working with Get & Transform.”

Importing XML files

XML (Extensible Markup Language) is a text file format suitable for structured data. Data is enclosed in tags, which also serve to describe the data.

Excel can open XML files, and simple files will display with little or no effort. Complex XML files will require some work, however. A discussion of this topic is beyond the scope of this book. You'll find information about getting data from XML files in Excel's Help system and online.

Importing a text file into a specified range

If you need to insert a text file into a specific range in a worksheet, you might think that your only choice is to import the text into a new workbook and then to copy the data and paste it into the range where you want it to appear. However, you can do it in a more direct way.

Figure 32.2 shows a small CSV file. The following instructions describe how to import this file, named monthly.csv, beginning at cell C3.

A small CSV file presented through Notepad. Listed are months from January to December with five two-digit numbers in random order.

Figure 32.2 This CSV file will be imported into a range.

1. Choose Data image Get External Data image From Text. The Import Text File dialog box appears.

2. Navigate to the folder that contains the text file.

3. Select the file from the list, and then click the Import button. The Text Import Wizard appears.

4. Use the Text Import Wizard to specify the way the data will be imported. For a CSV file, specify Delimited, with a Comma Delimiter.

5. Click the Finish button. The Import Data dialog box, shown in Figure 32.3, appears.Import Data dialog box with selected radio buttons for Table under Select how you want to view this data in your workbook and Existing Worksheet with option =$C$3 under Where do you want to put the data?

Figure 32.3 Using the Import Data dialog box to import a CSV file.

6. Click the Properties button. The External Data Range Properties dialog box appears.

7. Deselect the Save Query Definition check box, and click OK to return to the Import Data dialog box.

8. In the Import Data dialog box, specify the location for the imported data. It can be a cell in an existing worksheet or a new worksheet.

9. Click OK, and Excel imports the data (see Figure 32.4).

Worksheet with a list of months in column C and data values in columns D, E, F, G, and H from the CSV file.

Figure 32.4 This range contains data imported directly from a CSV file.

Note

You can ignore step 7 if the data you're importing will be changing. By saving the query definition, you can quickly update the imported data by right-clicking any cell in the range and choosing Refresh Data.

Copying and pasting data

If all else fails, you can try standard copy-and-paste techniques. If you can copy data from an application (for example, a word-processing program or a document displayed in PDF viewer), there's a good chance you can paste it into an Excel workbook. For best results, try pasting using the Home image Clipboard image Paste image Paste Special command, and try various paste options listed. Usually, pasted data will require some cleanup.

Data Cleanup Techniques

This section discusses a variety of techniques that you can use to clean up data in a worksheet.

imageChapter 11, “Creating Formulas That Manipulate Text,” contains additional examples of text-related formulas that may be helpful when cleaning data.

Removing duplicate rows

If data is compiled from multiple sources, it may contain duplicate rows. Most of the time, you want to eliminate the duplicates. In the past, removing duplicate data was essentially a manual task — although it could be automated by using a confusing advanced filter technique. But now removing duplicate rows is easy, thanks to Excel's Remove Duplicates command (introduced in Excel 2007).

Start by moving the cell cursor to any cell within your data range. Choose Data image Data Tools image Remove Duplicates, and the Remove Duplicates dialog box, shown in Figure 32.5, appears.

Remove Duplicates dialog with buttons for Select All and Unselect All. Below is a box listing six column headers with checked boxes. Box for My data has headers option has a check mark.

Figure 32.5 Use the Remove Duplicates dialog box to delete duplicate rows.

Note

If your data is in a table, you can also use Table Tools image Design image Tools image Remove Duplicates. These two commands work the same.

The Remove Duplicates dialog box lists all the columns in your data range or table. Place a check mark next to the columns that you want to be included in the duplicate search. Most of the time, you'll want to select all the columns, which is the default. Click OK, and Excel weeds out the duplicate rows and displays a message that tells you how many duplicates it removed. It would be nice if Excel gave you the option to change your mind, but it doesn't. If Excel deleted too many rows, you can undo the procedure by clicking Undo (or by pressing Ctrl+Z).

When you select all columns in the Remove Duplicates dialog box, Excel will delete a row only if the content of every column is duplicated. In some situations, you may not care about matching some columns, so you would deselect those columns in the Remove Duplicates dialog box. For example, if each row has a unique ID code, Excel would never find any duplicate rows. So you'd want to uncheck that column in the Remove Duplicates dialog box.

When duplicate rows are found, the first row is kept and subsequent duplicate rows are deleted.

Caution

Duplicate values are determined by the value displayed in the cell — not necessarily the value stored in the cell. For example, assume that two cells contain the same date. One of the dates is formatted to display as 5/15/2016, and the other is formatted to display as May 15, 2016. When removing duplicates, Excel considers these dates to be different. Similarly, values that are formatted differently are considered to be different. So $1,209.32 is not the same as 1209.32. Therefore, you might want to apply formatting to entire columns to ensure that duplicate rows are not overlooked just because of a formatting difference.

Identifying duplicate rows

If you would like to identify duplicate rows so you can examine them without automatically deleting them, here's another method. Unlike the technique described in the previous section, this method looks at actual values, not formatted values.

Create a formula to the right of your data that concatenates each of the cells to the left. The formulas that follow assume that the data is in columns A:F.

Enter this formula into cell G2:

=A2&B2&C2&D2&E2&F2

Add another formula in cell H2. This formula displays the number of times a value in column G occurs:

=COUNTIF(G:G,G2)

Copy these formulas down the column for each row of your data.

Column H displays the number of occurrences of that row. Unduplicated rows will display 1. Duplicated rows will display a number that corresponds to the number of times that row appears.

Figure 32.6 shows a simple example. If you don't care about a particular column, just omit it from the formula in column G. For example, if you want to find duplicates regardless of the Status column, just omit D2 from the concatenating formula.

Image described by surrounding text.

Figure 32.6 Using formulas to identify duplicate rows.

Don't Forget About Get & Transform

The best new feature in Excel 2016 is built-in support for Get & Transform (formerly known as Power Query). In the past, this powerful tool was in the form of an add-in and worked only with enterprise versions of Excel. Now it's built into all desktop versions of Excel 2016.

You can use Get & Transform to import data from a variety of sources. It also has several features to help you clean up and transform data that's stored in an Excel table. Hint: Select any cell in your table and choose Data image Get & Transform image From Table.

Drop-down menu of From Other Sources option under New Query drop-down in the Data tab on the Ribbon of Excel 2016. Cursor points the From Web option.

For more information, refer to Chapter 38.

Splitting text

When importing data, you might find that multiple values are imported into a single column. Figure 32.7 shows an example of this type of import problem.

Image described by caption and surrounding text.

Figure 32.7 The imported data was put in one column rather than multiple columns.

Tip

I used a fixed-width font (Courier New) to display the data in Figure 32.7. With the default font, it was not apparent that the data lined up nicely in fixed-width columns.

If the text is all the same length (as in the example), you might be able to write a series of formulas that extract the information to separate columns. The LEFT, RIGHT, and MID functions are useful for this task.

imageSee Chapter 11 for examples of formulas that extract characters from text.

You should also be aware that Excel offers two nonformula methods to assist in splitting data so it occupies multiple columns: Text to Columns and Flash Fill.

Using Text to Columns

The Text to Columns command can parse strings into their component parts.

First, make sure that the column that contains the data to be split up has enough empty columns to the right to accommodate the extracted data. Then select the data to be parsed and choose Data image Data Tools image Text to Columns. Excel displays the Convert Text to Columns Wizard, which consists of a series of dialog boxes that walk you through the steps to convert a single column of data into multiple columns. Figure 32.8 shows the initial step, in which you choose the type of data:

Convert Text to Columns Wizard dialog box over a worksheet. Radio button for Fixed Width under Original Data Type is selected. A preview of the selected data is located below.

Figure 32.8 The first dialog box in the Convert Text to Columns Wizard.

· Delimited: The data to be split is separated by delimiters such as commas, spaces, slashes, or other characters.

· Fixed Width: Each component occupies the same number of characters.

Make your choice and click Next to move on to step 2, which depends on the choice you made in step 1.

If you're working with delimited data, specify the delimiting character or characters. You'll see a preview of the result. If you're working with fixed-width data, specify the column breaks directly in the preview window.

When you're satisfied with the column breaks, click Next to move to step 3. In this step you can click a column in the preview window and specify general formatting for the column. Click Finish, and Excel splits the data as specified.

Using Flash Fill

The Text to Columns Wizard works well for many types of data. But sometimes you'll encounter data that can't be parsed by that wizard. For example, the Text to Columns Wizard is useless if you have variable-width data that doesn't have delimiters. In such a case, the Flash Fill feature might save the day. But keep in mind that Flash Fill works successfully only when the data is very consistent. Flash Fill was introduced in Excel 2013.

Flash Fill uses pattern recognition to extract data (and also concatenate data). Just enter a few examples in a column that's adjacent to the data and choose Data image Data Tools image Flash Fill (or press Ctrl+E). Excel analyzes the examples and attempts to fill in the remaining cells. If Excel didn't recognize the pattern you had in mind, press Ctrl+Z, add another example or two, and try again.

Figure 32.9 shows a worksheet with some text in a single column. The goal is to extract the numeric value from each text string and put the number into a separate cell. The Text to Columns Wizard can't do it because the space delimiters aren't consistent. It might be possible to write an array formula, but it would be complicated.

Image described by surrounding text.

Figure 32.9 The goal is to extract the numbers in column A.

imageThis workbook, which also includes other Flash Fill examples, is available on this book's website at www.wiley.com/go/excel2016bible. The filename is flash fill demo.xlsx.

To try using Flash Fill, activate cell B1 and type the first number (20). Move to B2, and type the second number (6). Can Flash Fill identify the remaining numbers and fill them in? Choose Data image Data Tools image Flash Fill (or press Ctrl+E), and Excel fills in the remaining cells in a flash. Figure 32.10 shows the result.

Worksheet displaying column A with data similar to those in Figure 32.9 and column B with incorrect numbers compared to those found in the data per cell.

Figure 32.10 Using manually entered examples in B1 and B2, Excel's Flash Fill feature makes some incorrect guesses.

As you see, Excel identified most of the values. Accuracy increases if you provide more examples. For example, provide an example of a decimal number. Delete the suggested values in Column B, enter 3.12 in cell B6, and press Ctrl+E. This time, Flash Fill gets all of them correct (see Figure 32.11).

Worksheet displaying column A with data similar to those in Figure 32.9 and column B with correct numbers compared to those found in the data per cell.

Figure 32.11 After you enter an example of a decimal number, Excel gets all of correct.

This simple example demonstrates two important points:

· You must examine your data carefully after using Flash Fill. Just because the first few rows are correct, you can't assume that Flash Fill worked correctly for all rows.

· Flash Fill increases accuracy when you provide more examples.

Figure 32.12 shows another example, names in column A. The goal is to extract the first, last, and middle name (if it has one). In column B Flash Fill successfully gets all the first names using only two examples (Mark and Tim). Plus, it successfully extracted all the last names (column C), using Russell and Colman. Extracting the middle names or initials (column D) eluded me until I provided examples that included a space on either side of the middle name).

Image described by surrounding text.

Figure 32.12 Using Flash Fill to split names.

imageFor a reliable formula-based solution for splitting names, see Chapter 11.

To summarize, Excel's Flash Fill is an interesting idea, but it works reliably only if the data is very consistent. Even when you think it worked correctly, make sure you examine the results carefully. And think twice before trusting it with important data because there's no way to document the way the data was extracted. But the main limitation is that (unlike formulas) Flash Fill is not a dynamic technique. If your data changes, the flash-filled column does not update.

Note

You can also use the Flash Fill feature to create new data from multiple columns. Just provide a few examples of the way you want the data combined, and Excel will figure out the pattern and fill in the column. Using Flash Fill to create data seems to work much better than using it to extract data. But then again, it's also easier to create formulas to create data from existing columns.

Changing the case of text

Often, you'll want to make text in a column consistent, in terms of case. Excel provides no direct way to change the case of text, but it's easy to do with formulas (see the sidebar “Transforming Data with Formulas”).

The three relevant functions are

· UPPER: Converts the text to ALL UPPERCASE.

· LOWER: Converts the text to all lowercase.

· PROPER: Converts the text to Proper Case. (The first letter in each word is capitalized, as in a proper name.)

These functions are quite straightforward. They operate only on alphabetic characters and just ignore all other characters and return them unchanged.

If you use the PROPER function, you'll probably need to do some additional cleanup to handle exceptions. Following are examples of transformations that you probably would consider incorrect:

· The letter following an apostrophe is always capitalized (for example, Don'T). This is done, apparently, to handle names like O'Reilly.

· The PROPER function doesn't handle names with an embedded capital letter, such as McDonald.

· “Minor” words such as and and the are always capitalized. For example, some people would prefer that the third word in United States Of America not be capitalized.

Often, you can correct some of these problems by using Find and Replace.

Transforming Data with Formulas

Many of the data cleanup examples in this chapter describe how to use formulas and functions to transform data in some way. For example, you can use the UPPER function to transform text into uppercase. When the data is transformed, you'll have two columns: the original data and the transformed data. Almost always you'll want to replace the original data with the transformed data. Here's how to do it:

1. Insert a new temporary column for formulas to transform the original data.

2. Create your formulas in the temporary column, and make sure that the formulas do what they were intended to do.

3. Select the formula cells.

4. Choose Home image Clipboard image Copy (or press Ctrl+C).

5. Select the original data cells.

6. Choose Home image Clipboard image Paste image Values (V).

This procedure replaces the original data with the transformed data. Then you can delete the temporary column that holds the formulas.

Removing extra spaces

It's usually a good idea to ensure that data doesn't have extra spaces. It's impossible to spot a space character at the end of a text string. Extra spaces can cause lots of problems, especially when you need to compare text strings. The text July is not the same as the text July with a space appended to the end. The first is four characters long, and the second is five characters long.

Create a formula that uses the TRIM function to remove all leading and trailing spaces and to replace multiple spaces with a single space. This example uses the TRIM function. The formula returns Fourth Quarter Earnings (with no excess spaces):

=TRIM(" Fourth Quarter Earnings ")

Data that is imported from a web page often contains a different type of space: a nonbreaking space, indicated by &nbsp in HTML code. In Excel, this character can be generated by this formula:

=CHAR(160)

You can use a formula like this to replace those spaces with normal spaces:

=SUBSTITUTE(A2,CHAR(160)," ")

Or use this formula to replace the nonbreaking space character with normal spaces and to remove excess spaces:

=TRIM(SUBSTITUTE(A2,CHAR(160)," "))

Removing strange characters

Often, data imported into an Excel worksheet contains strange (sometimes unprintable) characters. You can use the CLEAN function to remove all nonprinting characters from a string. If the data is in cell A2, this formula will do the job:

=CLEAN(A2)

Note

The CLEAN function can miss some nonprinting Unicode characters. It's programmed to remove the first 32 nonprinting characters in the 7-bit ASCII code. Consult the Excel Help system for information on how to remove the nonprinting Unicode characters. (Search Help for the CLEAN function.)

Converting values

In some cases you may need to convert values from one system to another. For example, you may import a file that has values in fluid ounces, and they need to be expressed in milliliters. Excel's handy CONVERT function can perform that and many other conversions.

If cell A2 contains a value in ounces, the following formula converts it to milliliters:

=CONVERT(A2,"oz","ml")

This function is extremely versatile and can handle most common measurement units in the following categories: weight and mass, distance, time, pressure, force, energy, power, magnetism, temperature, volume, liquid, area, bits and bytes, and speed.

imageSee Chapter 16, “Miscellaneous Calculations,” for more examples that use the CONVERT function.

Excel can also convert between number bases. You may import a file that contains hexadecimal values, and you need to convert them to decimal. Use the HEX2DEC function to perform this conversion. For example, the following formula returns 1,279, the decimal equivalent of its hex argument.

=HEX2DEC("4FF")

Excel can also convert from binary to decimal (BIN2DEC) and from octal to decimal (OCT2DEC).

Functions that convert from decimal to another number base are DEC2HEX, DEC2BIN, and DEC2OCT.

Excel 2013 introduced a new function, BASE, that converts a decimal number to any number base. Note that there is not a function that works in the opposite direction. Excel does not provide a function that converts any number base to decimal. You're limited to binary, octal, and hexadecimal.

Classifying values

Often, you may have values that need to be classified into a group. For example, if you have ages of people, you might want to classify them into groups such as 17 or younger, 18–24, 25–34, and so on.

The easiest way to perform this classification is with a lookup table. Figure 32.13 shows ages in column A and classifications in column B. Column B uses the lookup table in D2:E9. The formula in cell B2 is

=VLOOKUP(A2,$D$2:$E$9,2)

Image described by surrounding text.

Figure 32.13 Using a lookup table to classify ages into age ranges.

This formula was copied to the cells below.

You can also use a lookup table for nonnumeric data. Figure 32.14 shows a lookup table that is used to assign a region to a state.

Worksheet with columns A and B listing states in the United States and their corresponding regions. Column D and E list similar data alphabetically.

Figure 32.14 Using a lookup table to assign a region for a state.

The two-column lookup table is in the range D2:E52. The formula in cell B2, which was copied to the cells below, is

=VLOOKUP(A2,$D$2:$E$52,2,FALSE)

imageA workbook that contains the examples in this section is available on this book's website at www.wiley.com/go/excel2016bible. The filename is classifying data.xlsx.

Tip

A side benefit is that the VLOOKUP function will return FALSE if an exact match is not found — a good way to spot misspelled states, in this example. Using FALSE as the last argument in the function indicates that an exact match is required.

Joining columns

To combine data in two more columns, you can usually use the concatenation operator (&) in a formula. For example, the following formula combines the contents of cells A1, B1, and C1:

=A1&B1&C1

Often, you'll need to insert spaces between the cells — for example, if the columns contain a title, first name, and last name. Concatenating using the formulas above would produce something like Mr.ThomasJones. To add spaces (to produce Mr. Thomas Jones), modify the formula:

=A1&" "&B1&" "&C1

You can also use the Flash Fill feature (discussed earlier in this chapter) to join columns without using formulas. Just provide an example or two in an adjacent column, and press Ctrl+E. Excel will perform the concatenation for the other rows.

Rearranging columns

If you need to rearrange the columns in a worksheet, you could insert a blank column and then drag another column into the new blank column. But the moved column leaves a gap, which you need to delete.

Here's an easier way:

1. Click the column header of the column you want to move.

2. Choose Home image Clipboard image Cut.

3. Click the column header to the right of where you want the column to go.

4. Right-click and choose Insert Cut Cells from the shortcut menu.

Repeat these steps until the columns are in the order you want.

Randomizing the rows

If you need to arrange the rows in random order, here's a quick way to do it. In the column to the right of the data, insert this formula into the first cell and copy it down:

=RAND()

Then sort the data using this column as the sort key. The rows will be in random order, and you can delete the column.

Extracting a filename from a URL

In some cases, you may have a list of URLs and need to extract only the filename. The following formula returns the filename from a URL. Assume cell A2 contains this URL:

http://example.com/assets/images/horse.jpg

The following formula returns horse.jpg:

=RIGHT(A2,LEN(A2)-FIND("*",SUBSTITUTE(A2,"/","*",LEN(A2)-LEN(SUBSTITUTE(A2,"/","")))))

This formula returns all text that follows the last slash character. If cell A2 doesn't contain a slash character, the formula returns an error.

To extract the URL without the filename, use this formula:

=LEFT(A2,FIND("*",SUBSTITUTE(A2,"/","*",LEN(A2)-LEN(SUBSTITUTE(A2,"/","")))))

Note

This type of extraction is a good use for the Flash Fill feature. (See “Splitting text,” earlier in this chapter.)

Matching text in a list

You may have some data that you need to check against another list. For example, you may want to identify the data rows in which data in a particular column appears in a different list. Figure 32.15 shows a simple example. The data is in columns A:C. The goal is to identify the rows in which the Member Num appears in the Resigned Members list, in column F. These rows can then be deleted.

Image described by caption and surrounding text.

Figure 32.15 The goal is to identify member numbers that are in the resigned members list in column F.

imageThis workbook, named match names.xlsx, is available on this book's website at www.wiley.com/go/excel2016bible.

Here's a formula, entered into cell D2 and copied down, that will do the job:

=IF(COUNTIF($F$2:$F$22,B2)>0,"Resigned","" )

This formula displays the word Resigned if the Member Num in column B is found in the Resigned Members list. If the Member number is not found, it returns an empty string. If the list is sorted by column D, the rows for all resigned members will appear together and can be quickly deleted.

This technique can be adapted to other types of list-matching tasks.

Changing vertical data to horizontal data

Figure 32.16 shows a common type of data layout that you might see when importing a file. Each record consists of three consecutive cells in a single column: Name, Department, and Location. The goal is to convert this data so each record appears in three columns.

Image described by caption.

Figure 32.16 Vertical data that needs to be converted to three columns.

There are several ways to convert this type of data, but here's a method that's fairly easy. It requires a small amount of setup, but the work is done with a single formula, which is copied to a range.

Start by creating some numeric vertical and horizontal “headers,” as shown in Figure 32.17. Column C contains numbers that correspond to the first row of each data item (in this case, the Name). In this example, I put the following values in column C: 1, 4, 7, 10, 13, 16, and 19. You can use a simple formula to generate this series of numbers.

Image described by caption and surrounding text.

Figure 32.17 Headers that are used to convert the vertical data into rows.

The horizontal range of headers consists of consecutive integers, starting with 1. In this example, each record contains three cells of data, so the horizontal header contains 1, 2, and 3.

imageThis workbook, named vertical data.xlsx, is available on this book's website at www.wiley.com/go/excel2016bible

Here's the formula that goes into cell D2:

=OFFSET($A$1,$C2+D$1-2,0)

Copy this formula across to the next two columns and down to the next six rows. The result is shown in Figure 32.18.

Worksheet presenting the vertical data in column A and the categorized data in columns D, E, and F listing names, departments, and office branches.

Figure 32.18 A single formula transforms the vertical data into rows.

You can easily adapt this technique to work with vertical data that contains a different number of rows. For example, if each record contained ten rows of data, the column C header values would be 1, 11, 21, 31, and so on. The horizontal headers would consist of values 1 through 10 rather than 1 through 3.

Notice that the formula uses an absolute reference to cell A1. That reference won't change when the formula is copied, so all the formulas use cell A1 as the base. If the data begins in a different cell, change $A$1 to the address of the first cell.

The formula also uses “mixed” referencing in the second argument of the OFFSET function. The C2 reference has a dollar sign in front of C, so column C is the absolute part of the reference. In the D1 reference, the dollar sign is before the 1, so row 1 is the absolute part of the reference.

imageSee Chapter 10, “Introducing Formulas and Functions,” for more about using mixed references in formulas.

Filling gaps in an imported report

When you import data, you can sometimes end up with a worksheet that looks something like the one shown in Figure 32.19. This type of report formatting is common. As you can see, an entry in column A applies to several rows of data. If you sort this type of list, the missing data messes things up, and you can no longer tell who sold what when.

Worksheet with columns A to D labeled Sales Rep, Month, Units Sold, and Amount. Data are grouped by sales representative from January to March. Only rows 3,6, 9, and 12 are with complete data.

Figure 32.19 This report contains gaps in the Sales Rep column.

If the report is small, you can enter the missing cell values manually or by using a series of Home image Editing image Fill image Down commands (or the Ctrl+D shortcut). But if you have a large list that's in this format, here's a better way:

1. Select the range that has the gaps (A3:A14, in this example).

2. Choose Home image Editing image Find & Select image Go to Special. The Go to Special dialog box appears.

3. Select the Blanks option and click OK. This action selects the blank cells in the original selection.

4. In the formula bar, type an equal sign (=) followed by the address of the first cell with an entry in the column (=A3, in this example), and press Ctrl+Enter.

5. Reselect the original range and press Ctrl+C to copy the selection.

6. Choose Home image Clipboard image Paste image Paste Values to convert the formulas to values.

After you complete these steps, the gaps are filled in with the correct information, and your worksheet looks similar to the one shown in Figure 32.20.

Worksheet similar to that in Figure 32.20 now with gaps filled in Sales Rep column.

Figure 32.20 The gaps are gone, and this list can now be sorted.

imageThis workbook, named fill in gaps.xlsx, is available on this book's website at www.wiley.com/go/excel2016bible.

Checking spelling

If you use a word-processing program, you probably take advantage of its spell checker feature. Spelling mistakes can be embarrassing when they appear in a text document, but they can cause serious problems when they occur within your data. For example, if you tabulate data by month, a misspelled month name will make it appear that a year has 13 months.

To access the Excel spell checker, choose Review image Proofing image Spelling, or press F7. To check the spelling in just a particular range, select the range before you activate the spell checker.

If the spell checker finds any words it doesn't recognize as correct, it displays the Spelling dialog box. The options are fairly self-explanatory.

imageSee Chapter 31, “Making Your Worksheets Error Free,” for more about the Spelling dialog box.

Replacing or removing text in cells

You may need to systematically replace (or remove) certain characters in a column of data. For example, you may need to replace all backslash characters with forward slash characters. In many cases, you can use Excel's Find and Replace dialog box to accomplish this task. To remove text using the Find and Replace dialog box, just leave the Replace With field empty.

In other situations, you may need a formula-based solution. Consider the data shown in Figure 32.21. The goal is to replace the second hyphen character with a colon for the part numbers in Column A. Using Find and Replace wouldn't work because there isn't a way to specify that only the second hyphen should be replaced.

Worksheet with column A listing part numbers with hyphens and column B listing modified numbers with colons replacing the second hyphens in each number.

Figure 32.21 To replace only the second hyphen in these cells, Find and Replace is not an option.

In this case, the solution is a fairly simple formula that replaces the second occurrence of a hyphen with a colon:

=SUBSTITUTE(A2,"-",":",2)

To remove the second occurrence of a hyphen, just omit the third argument for the SUBSTITUTE function:

=SUBSTITUTE(A2,"-",,2)

This is another example where Flash Fill can also do the job.

Note

If you've worked with programming languages, you may be familiar with the concept of regular expressions. A regular expression is a way to match strings of text using concise (and often confusing) codes. Excel doesn't support regular expressions, but if you search the web you'll find ways to incorporate regular expressions in VBA, plus a few add-ins that provide this feature in the workbook environment.

Try PUP for Data Cleaning

My Power Utility Pak add-in consists of more than 50 general-purpose Excel utilities — including several tools that can assist you when cleaning up data. The one I use most often is called Text Tools.

Text Tools dialog box with higlighted Remove by postion under Operation and a cursor pointing Character position2 in the drop-down list. Close and Apply buttons are on the top right.

Text Tools has options to

· Change the case of text: Uppercase, lowercase, proper case, sentence case, and toggle case.

· Add text to cells: Specify text to add before the first character, after the last character, or after a specific character number.

· Remove by position: Specify the number of characters to remove and the location within the cell.

· Remove spaces: Leading spaces, trailing spaces, excess spaces, or all spaces.

· Delete characters: Nonprinting, alpha, nonalpha, numeric, or nonnumeric.

Text Tools is a stay-on-top dialog box, so it's always handy. It operates directly on the text, so no formulas are required. It's fast, and it even has an undo feature.

You can download a free 30-day trial from my website (www.spreadsheetpage.com). Or use the coupon in the back of the book to purchase a discounted license.

Adding text to cells

If you need to add text to a cell, one solution is to use a new column of formulas. Here are some examples:

· The following formula adds ID: and a space to the beginning of a cell:

="ID: "&A2

· The following formula adds .mp3 to the end of a cell:

=A2&".mp3"

· The following formula inserts a hyphen after the third character in a cell:

=LEFT(A2,3)&"-"&RIGHT(A2,LEN(A2)-3)

You can also use the Flash Fill feature to add text to cells.

Fixing trailing minus signs

Imported data sometimes displays negative values with a trailing minus sign. For example, a negative value may appear as 3,498– rather than the more common –3,498. Excel does not convert these values. In fact, it considers them to be nonnumeric text.

The solution is so simple it may even surprise you:

1. Select the data that has the trailing minus signs. The selection can also include positive values.

2. Choose Data image Data Tools image Text to Columns. The Text to Columns dialog box appears.

3. Click Finish.

This procedure works because of a default setting in the Advanced Text Import Settings dialog box (which you don't even see, normally). To display this dialog box, shown in Figure 32.22, go to step 3 in the Text to Columns Wizard dialog box and click Advanced.

Advanced Text Import Settings dialog box with two drop-downs for Decimal Separator and Thousands Separator and a Reset button below.

Figure 32.22 The Trailing Minus for Negative Numbers option makes it easy to fix trailing minus signs in a range of data.

A Data Cleaning Checklist

This section contains a list of items that could cause problems with data. Not all these are relevant to every set of data.

· Does each column have a unique and descriptive header?

· Is each column of data formatted consistently?

· Did you check for duplicate or missing rows?

· For text data, are the words consistent in terms of case?

· Did you check for spelling errors?

· Does the data contain any extra spaces?

· Are the columns arranged in the proper (or logical) order?

· Are any cells blank that shouldn't be blank?

· Did you correct any trailing minus signs?

· Are the columns wide enough to display all data?

Exporting Data

This chapter began with a section on importing data, so it's only appropriate to end it with a discussion of exporting data to a file that's not a standard Excel file.

Exporting to a text file

When you choose File image Save As, the Save As dialog box lets you choose from a variety of text file formats. The three types are

· CSV: Comma-separated value files

· TXT: Tab delimited files

· PRN: Formatted text

I discuss these file types in the sections that follow.

CSV files

When you export a worksheet to a CSV file, the data is saved as displayed. In other words, if a cell contains 12.8312344 but is formatted to display with two decimal places, the value will be saved as 12.83.

Cells are delimited with a comma character, and rows are delimited with a carriage return and line feed.

Note

If you export a file using the Mac variant, rows are delimited with a carriage return only (no line feed character).

Note that if a cell contains a comma, the cell value is saved within quotation marks. If a cell contains a quotation mark character, that character appears twice.

TXT files

Exporting a workbook to a TXT file is almost identical to the CSV file format described earlier. The only difference is that cells are separated by a tab character rather than a comma.

If your worksheet contains any Unicode characters, you should export the file using the Unicode variant. Otherwise, Unicode characters will be saved as question mark characters.

PRN files

A PRN file is very much like a printed image of the worksheet. The cells are separated by multiple space characters. Also, a line is limited to 240 characters. If a line exceeds that limit, the remainder appears on the next line. PRN files are rarely used.

Exporting to other file formats

Excel also lets you save your work in several other formats:

· Data Interchange Format: These files have a .dif extension. Not used very often.

· Symbolic Link: These files have an .sylk extension. Not used very often.

· Portable Document Format: These files have a .pdf extension. This is a common “read-only” file format.

· XML Paper Specification Document: These files have an .xps extension. Microsoft's alternative to PDF files. Not used very often.

· Web Page: These files have an .htm extension. Often, saving a file as a workbook will generate a directory of ancillary files required to render the page accurately.

· OpenDocument Spreadsheet: These files have an .ods extension. They're compatible with various open source spreadsheet programs.