Scrub-a-Dub-Dub: Cleaning Data - Where’s the Beef - Excel Data Analysis For Dummies, 2nd Edition (2014)

Excel Data Analysis For Dummies, 2nd Edition (2014)

Part I. Where's the Beef?

Chapter 3. Scrub-a-Dub-Dub: Cleaning Data

In This Chapter

arrow Editing an imported workbook

arrow Cleaning data with text functions

arrow Keeping data clean with validation

You will greatly benefit from exploring the techniques often necessary for cleaning up and rearranging workbook data. You know why? Because almost always the data that you start with — especially data that you import from other programs — will be pretty disorganized and dirty. Getting your data into a clean form makes it easier to work with and analyze the data.

Editing Your Imported Workbook

I start this discussion with some basic workbook editing techniques. If you take a look at the workbook shown in Figure 3-1, you see that the data, although neatly formatted, doesn't appear as an Excel table. The workbook shown in Figure 3-1, for example, includes blank columns and rows. The workbook also uses some columns that are inadequately sized. The width for column I, for example, is too small to display the values stored there. (That’s why those #s appear.)

You will often encounter situations like this. The workbook shown in Figure 3-1, for example, has actually been imported from QuickBooks. You can use several workbook-editing techniques to clean up a workbook. In the following sections, I give you a rundown of the most useful ones.

image

Figure 3-1: This worksheet needs to clean up its act.

Delete unnecessary columns

To delete unnecessary columns (these might be blank columns or columns that store data that you don't need), click the column letter to select the column. Then choose the Home tab’s Delete command.

image You can select multiple columns for multiple deletions by holding down the Ctrl key and then individually clicking column letters.

Delete unnecessary rows

To delete unnecessary rows, you follow the same steps that you do to delete unnecessary columns. Just click the row number and then choose the Home tab’s Delete command. To delete multiple rows, hold down the Ctrl key and then select the row numbers for each of the rows that you want to delete. After making your selections, choose the Home tab’s Delete command.

Resize columns

To resize (enlarge the width of) a column so that its contents clearly show, double-click the column letter box's right corner or click AutoFit Column Width on the Format button’s drop-down (Home tab). For example, in Figure 3-2, column H is too narrow to displays its values. Excel displays several pound signs (########) in the cells in column H to indicate the column is too narrow to adequately display its values.

image

Figure 3-2: Column H needs to gain a little weight.

Just double-click the column letter label, and Excel resizes the column so that it's wide enough to display the values or labels stored in that column. Check out Figure 3-3 to see how Excel has resized the width of column H to display its values.

image

Figure 3-3: Ah … now you can see the data.

You can also resize a column by selecting it and then choosing the Home tab’s Format⇒Column Width command. When Excel displays the Column Width dialog box, as shown in Figure 3-4, you can enter a larger value into the Column Width text box and then click OK. The value that you enter is the number of characters that can fit in a column.

image

Figure 3-4: Set column width here.

image For you manually inclined fiddlers, you can also resize a column by clicking and dragging the left corner of the column letter label box. You can resize the column to any width by dragging this border.

image In Excel 2007 and Excel 2010, select the column and use the Home tab’s Format⇒Width command to display the Column Width dialog box and change the column width.

Resize rows

You can resize rows like you resize columns. Just select the row number label and then choose the Home tab’s Format⇒Row Height command. When Excel displays the Row Height dialog box, as shown in Figure 3-5, you can enter a larger value into the Row Height text box.

image

Figure 3-5: Set row height here.

image Row height is measured in points. (A point equals 1/72 of an inch.)

image In Excel 2007 and Excel 2010, select the row and use the Home tab’s Format⇒Row Height command to display the Row Height dialog box and change the row height.

Erase unneeded cell contents

To erase the contents of a range that contains unneeded data, select the worksheet range and then choose the Home tab’s Clear⇒Clear All command. Excel erases both the contents of the cells in the selected range and any formatting assigned to those cells.

Format numeric values

To change the formatting of values in a workbook that you want to analyze, first select the range of what you want to reformat. Then choose the Home tab’s Number command. When Excel displays the Format Cells dialog box, as shown in Figure 3-6, choose from its tabs to change the formatting of the selected range. For example, use choices from the Number tab to assign numeric formatting to values in the selected range. You use options from the Alignment tab to change the way the text and values are positioned in the cell, from the Font tab to choose the font used for values and labels in the selected range, and from the Border tab to assign cell border borders to the selected range.

image

Figure 3-6: Format numeric values here.

image The buttons and boxes that appear just above the Number command button provide for several, convenient, one-click formatting options. For example, you can click the command button marked with the currency symbol to format the selected range using the accounting format.

Copying worksheet data

To copy worksheet data, first select the data that you want to duplicate. You can copy a single cell or range of cells. Choose the Home tab’s Copy command and then select the range into which you want to place the copied data. Remember: You can select a single cell or a range of cells. Then choose the Home tab’s Paste command.

You can also copy worksheet ranges by dragging the mouse. To do this, select the worksheet range that you want to copy. Then hold down the Ctrl key and drag the range border.

Moving worksheet data

To move worksheet data to some new location, select the range that stores the data. Choose the Home tab’s Cut command and click the cell in the upper-left corner of the range into which you want to move the worksheet data. Then choose the Home tab’s Paste command.

You can also move worksheet ranges by dragging the mouse. To do this, select the worksheet range that you want to copy and then drag the range border.

Replacing data in fields

One of the most common commands that I find myself using to clean up a list is the Home tab’s Find & Select command. To use this command, first select the column with the data that you want to clean by clicking that column’s letter. Next choose Find & Select⇒Replace so that Excel displays the Find and Replace dialog box, as shown in Figure 3-7.

image

Figure 3-7: Keep data in its place with the Find and Replace dialog box.

Enter the incorrect text that you want to find in the Find What text box and then enter the correct text in the Replace With text box. Then click the Replace All button to fix the incorrect text.

Cleaning Data with Text Functions

One of the common problems with data that you import is that your text labels aren't quite right. For example, you might find yourself with the city, state, and ZIP code information that's part of an address stored in a single cell rather than in three separate cells. Or, you might find that same information stored in three separate cells when you want the data stored in a single cell. You might also find that pieces of information that you want stored as labels instead are stored as values and vice versa.

What’s the big deal, Steve?

Just to give you a quick idea of what I mean here, take a look at Figures 3-8 and 3-9. Okay, this is fake data, sure. But the examples show a common situation. The list information shown in Figure 3-8 uses unnecessarily lengthy product names, goofs up some customer names by appending store numbers to customer names, and then puts all of the city and state information into one field. Yuk.

image

Figure 3-8: Good worksheet data; tough to analyze.

image

Figure 3-9: Much better: Rearranged worksheet data that’s easy to analyze.

In Figure 3-9, see how I rearrange this information so that it’s much more easily sorted and filtered. For example, the PRODUCT2 field abbreviates the product name by changing Big Bob’s Guide to to just BBgt. The store names are essentially edited down to just the first word in the store name — an easy change that enables you to see sales for Bean’s Tackle, Mac’s Shack, and Steve’s Charters. The ADDRESS information is split into two fields: CITY and STATE.

image Here’s one other important point about Figure 3-9: The rearrangement shown in Figure 3-9 makes it possible to cross-tabulate the data using a pivot table (something I talk more about in Chapter 4).

The answer to some of your problems

All the editing performed in Figure 3-9 is performed using text functions, so here, I discuss these babies.

image You can grab a ZIP file from the companion website that includes most of the Excel workbooks shown in the pages of this book. I mention this because if you’re really curious about how text functions are used in Figure 3-9, you can grab the actual workbook and check out the formulas. The ZIP file is available at this book's companion website. See the Introduction for more on how to access the website.

Excel provides two dozen text functions that enable you to manipulate text strings in ways to easily rearrange and manipulate the data that you import into an Excel workbook. In the following paragraphs, I explain how to use the primary text functions.

image If you’ve just read the word function and you’re scratching your head, you might want to review the contents of the Appendix.

By the way, I skip discussions of three text functions that I don’t think you’ll have occasion to use for scrubbing data: BAHTEXT (rewrites values using Thai characters); CHAR (returns the character represented by an American National Standards Institute [ANSI] code number); and CODE (returns the ANSI code represented by character). To get descriptions of these other text functions, click the down arrow button next to the AutoSum function on the Home tab and choose More Functions from the drop-down list Excel displays. When Excel displays the Insert Function dialog box, select the Text entry from the Or Select A Category box, and then scroll through the list of text functions that Excel displays in the Select a Function box until you see the function that you have a question for — most likely, the function that I incorrectly assume you don’t need information about.

image In Excel 2007 or Excel 2010, you choose the Home tab’s choose Insert⇒Function to display the Insert Function dialog box.

The CLEAN function

Using the CLEAN function removes nonprintable characters text. For example, if the text labels shown in a column are using crazy nonprintable characters that end up showing as solid blocks or goofy symbols, you can use the CLEAN function to clean up this text. The cleaned-up text can be stored in another column. You can then work with the cleaned text column.

The CLEAN function uses the following syntax:

CLEAN(text)

The text argument is the text string or a reference to the cell holding the text string that you want to clean. For example, to clean the text stored in Cell A1, use the following syntax:

CLEAN(A1)

The CONCATENATE function

The CONCATENATE function combines, or joins chunks of text into a single text string. The CONCATENATE function uses the following syntax:

CONCATENATE(text1,text2,text3,…)

The text1, text2, text3, and so on arguments are the chunks of text that you want to combine into a single string. For example, if the city, state, and ZIP code were stored in fields named city, state, and zip, you could create a single text string that stores this information by using the following syntax:

CONCATENATE(city,state,zip)

If city were Redmond, state were WA, and zip were 98052, this function returns this text string:

RedmondWA98052

The smashed-together nature of the concatenated city, state, and ZIP code information isn’t a typographical mistake, by the way. To concatenate this information but include spaces, you need to include spaces as function arguments. For example, the following syntax:

CONCATENATE("Redmond", " ","WA", " ","98052")

returns the text string

Redmond WA 98052

The EXACT function

The EXACT function compares two text strings. If the two text strings are exactly the same, the EXACT function returns the logical value for true, which is 1. If the two text strings differ in any way, the EXACT function returns the logical value for false, which is 0. The EXACT function is case-sensitive. For example, Redmond spelled with a capital R differs from redmond spelled with a lowercase r.

The EXACT function uses the following syntax:

EXACT(text1,text2)

The text1 and text2 arguments are the text strings that you want to compare. For example, to check whether the two strings "Redmond" and "redmond" are the same, use the following formula:

EXACT("Redmond","redmond")

This function returns the logical value for false, 0, because these two text strings don't match exactly. One begins with an uppercase R and the other begins with a lowercase r.

The FIND function

The FIND function finds the starting character position of one text string within another text string. For example, if you want to know at what position within a text string the two-letter state abbreviation WA starts, you could use the FIND function.

The FIND function uses the following syntax:

FIND(find_text,within_text,start_num)

The find_text argument is the text that you’re looking for. The within_text argument identifies where or what you’re searching. The start_num argument tells Excel at what point within the string it should begin its search. For example, to find at what point the two-letter state abbreviation WA begins in the string Redmond WA 98052, use the following formula:

FIND("WA","Redmond WA 98052",1)

The function returns the value 9 because WA begins at the ninth position (because spaces are counted).

The start_num function argument is optional. If you omit this argument, Excel begins searching at the very beginning of the string.

The FIXED function

The FIXED function rounds a value to specified precision and then converts the rounded value to text. The function uses the following syntax:

FIXED(number,decimals,no_commas)

The number argument supplies the value that you want to round and convert to text. The optional decimals argument tells Excel how many places to the right of the decimal point that you want to round. The optional no_commas argument needs to be either 1 (if you want commas) or 0 (if you don’t want commas) in the returned text.

For example, to round to a whole number and convert to text the value 1234.56789, use the following formula:

FIXED(1234.56789,0,1)

The function returns the text 1,235.

The LEFT function

The LEFT function returns a specified number of characters from the left end of a text string. The function uses the following syntax:

LEFT(text,num_chars)

The text argument either supplies the text string or references the cell holding the text string. The optional num_chars argument tells Excel how many characters to grab.

For example, to grab the leftmost seven characters from the text string Redmond WA, use the following formula:

LEFT("Redmond WA",7)

The function returns the text Redmond.

The LEN function

The LEN function counts the number of characters in a text string. The function uses the following syntax:

LEN(text)

The text argument either supplies the text string that you want to measure or references the cell holding the text string. For example, to measure the length of the text string in cell I81, use the following formula:

LEN(I81)

If cell I81 holds the text string Semper fidelis, the function returns the value 14. Spaces are counted as characters, too.

The LOWER function

The LOWER function returns an all-lowercase version of a text string. The function uses the following syntax:

LOWER(text)

The text argument either supplies the text string that you want to convert or references the cell holding the text string. For example, to convert the text string PROFESSIONAL to professional, use the following formula:

LOWER("PROFESSIONAL")

The function returns professional.

The MID function

The MID function returns a chunk of text in the middle of text string. The function uses the following syntax:

MID(text,start_num,num_char)

The text argument either supplies the text string from which you grab some text fragment or it references the cell holding the text string. The start_num argument tells Excel where the text fragment starts that you want to grab. The num_char argument tells Excel how long the text fragment is. For example, to grab the text fragment tac from the text string tic tac toe, use the following formula:

=MID("tic tac toe",5,3)

The function returns tac.

The PROPER function

The PROPER function capitalizes the first letter in every word in a text string. The function uses the following syntax:

PROPER(text)

The text argument either supplies the text string or references the cell holding the text string. For example, to capitalize the initial letters in the text string ambassador kennedy, use the following formula:

PROPER("ambassador kennedy")

The function returns the text string Ambassador Kennedy.

The REPLACE function

The REPLACE function replaces a portion of a text string. The function uses the following syntax:

REPLACE(old_text,start_num,num_chars,new_text)

The old_text argument, which is case-sensitive, either supplies the text string from which you grab some text fragment or it references the cell holding the text string. The start_num argument, which is the starting position, tells Excel where the text starts that you want to replace. The num_chars argument tells Excel the length of the text fragment (how many characters) that you want to replace. The new_text argument, also case-sensitive, tells Excel what new text you want to use to replace the old text. For example, to replace the name Chamberlain with the name Churchill in the text string Mr. Chamberlain, use the following formula:

REPLACE("Mr. Chamberlain",5,11,"Churchill")

The function returns the text string Mr. Churchill.

The REPT function

The REPT function repeats a text string. The function uses the following syntax:

REPT(text,number_times)

The text argument either supplies the text string or references the cell holding the text string. The number_times argument tells Excel how many times you want to repeat the text. For example, the following formula:

REPT("Walla",2")

returns the text string WallaWalla.

The RIGHT function

The RIGHT function returns a specified number of characters from the right end of a text string. The function uses the following syntax:

RIGHT(text,num_chars)

The text argument either supplies the text string that you want to manipulate or references the cell holding the text string. The num_chars argument tells Excel how many characters to grab.

For example, to grab the rightmost two characters from the text string Redmond WA, use the following formula:

RIGHT("Redmond WA",2)

The function returns the text WA.

The SEARCH function

The SEARCH function calculates the starting position of a text fragment within a text string. The function uses the following syntax:

SEARCH(find_text,within_text,start_num)

The find_text argument tells Excel what text fragment you’re looking for. The within_text argument tells Excel what text string that you want to search. The start_num argument tells Excel where to start its search. The start_num argument is optional. If you leave it blank, Excel starts the search at beginning of the within_text string.

For example, to identify the position at which the text fragment Churchill starts in the text string Mr. Churchill, use the following formula:

SEARCH("Churchill","Mr. Churchill",1)

The function returns the value 5.

The SUBSTITUTE function

The SUBSTITUTE function replaces occurrences of text in a text string. The function uses the following syntax:

SUBSTITUTE(text,old_text,new_text,instances)

The text argument tells Excel what text string you want to edit by replacing some text fragment. The old_text argument identifies the to-be-replaced text fragment. The new_text supplies the new replacement text.

As an example of how the SUBSTITUTE function works, suppose that you need to replace the word Senator with the word President in the text string Senator Obama.

SUBSTITUTE("Senator Obama","Senator","President")

The function returns the text string President Obama.

The instances argument is optional, but you can use it to tell Excel for which instance of old_text you want to make the substitution. For example, the function

SUBSTITUTE("Senator Senator","Senator","President",1)

returns the text string President Senator Obama.

The function

SUBSTITUTE("Senator Senator Obama","Senator","President",2)

returns the text string Senator President Obama.

If you leave the instances argument blank, Excel replaces each occurrence of the old_text with the new_text. For example, the function

SUBSTITUTE("Senator Senator Obama","Senator","President")

returns the text string Senator Senator Obama.

The T function

The T function returns its argument if the argument is text. If the argument isn’t text, the function returns nothing. The function uses the following syntax:

T(value)

For example, the formula T(123) returns nothing because 123 is a value. The formula T("Seattle") returns Seattle because Seattle is a text string.

The TEXT function

The TEXT function formats a value and then returns the value as text. The function uses the following syntax:

TEXT(value,format_text)

The value argument is the value that you want formatted and returned as text. The format_text argument is a text string that shows the currency symbol and placement, commas, and decimal places that you want. For example, the formula

=TEXT(1234.5678,"$##,###.00")

returns the text $1,234.57.

image Note that the function rounds the value.

The TRIM function

The TRIM function removes extra spaces from the right end of a text string. The function uses the following syntax:

TRIM(text)

The text argument is the text string or, more likely, a reference to the cell holding the text string.

The UPPER function

The UPPER function returns an all-uppercase version of a text string. The function uses the following syntax:

UPPER(text)

The text argument either supplies the text string that you want to convert or it references the cell holding the text string. For example, to convert the text string professional to PROFESSIONAL, you can use the following formula:

UPPER("professional")

The function returns the text string PROFESSIONAL.

The VALUE function

The VALUE function converts a text string that looks like a value to a value. The function uses the following syntax:

VALUE(text)

The text argument either supplies the text string that you want to convert or it references the cell holding the text string. For example, to convert the text string $123,456.78 — assume that this isn’t a value but a text string — you can use the following formula:

VALUE("$123,456.78")

The function returns the value 123456.78.

Converting text function formulas to text

You might need to know how to convert a formula — such as a formula that uses a text function — to the label or value that it returns. For example, suppose you find yourself with a worksheet full of text-function-based formulas because you used the text functions to clean up the list data. And now you want to just work with labels and values.

You can convert formulas to the labels and values that they return by selecting the worksheet range that holds the formulas, choosing the Home tab’s Copy command, and then choosing the Home tab’s Paste⇒Paste Values command without deselecting the currently selected range. Note that to get to the Paste submenu, you need to click the lower half of the Paste command button.

Using Validation to Keep Data Clean

One useful command related to this business of keeping your data clean is the Data Validation command. Use this command to describe what information can be entered into a cell. The command also enables you to supply messages that give data input information and error messages that attempt to help someone correct data entry errors.

To use Data Validation, follow these steps:

1. Select the worksheet range where the to-be-validated data will go.

You can do this by dragging your mouse or by using the navigation keys.

2. Choose the Data tab’s Data Validation command to tell Excel that you want to set up data validation for the selected range.

Excel displays the Data Validation dialog box, as shown in Figure 3-10.

image

Figure 3-10: Keep data clean with the Data Validation dialog box.

3. On the Settings tab of the Data Validation dialog box, use the Validation Criteria text boxes to describe what is valid data.

Use choices from the Allow drop-down list box, for example, to supply what types of information can go into the range: whole numbers, decimal numbers, values from the list, valid dates, valid times, text of a particular length, and so on.

Use choices from the Data drop-down list box to further define your validation criteria. The Data drop-down list box provides several comparisons that can be made as part of the validation: between, not between, equal to, not equal to, greater than, and so on.

Refine the validation criteria, if necessary, using any of the other drop-down list boxes available. Note: The other validation criteria options depend on what you enter into the Allow and Data drop-down list boxes. For example, as shown in Figure 3-11, if you indicate that you want to allow only whole numbers between a particular range of minimum and maximum values, Excel provides Minimum and Maximum text boxes for you to enter or define the range. However, if you select other entries from the Allow or Data drop-down list boxes, you see other text boxes appearing on the Settings tab. In other words, Excel customizes the Settings tab depending on the kind of validation criteria that you define.

4. Fine-tune the validation.

After you describe the validation criteria, either select or deselect (clear) the Ignore Blank check box to indicate whether blank cells are allowed.

5. (Optional) Consider expanding the scope of the data validation.

Select the Apply These Changes to All Other Cells with the Same Settings check box to indicate whether the validation criteria should be expanded to other similar cells.

image Click the Clear All button, and Excel clears (removes) the validation criteria.

image

Figure 3-11: Create a data entry instruction message.

6. Provide an input message from the Input Message tab of the Data Validation dialog box.

The Input Message tab, as shown in Figure 3-11, enables you to tell Excel to display a small message when a cell with specified data validation is selected. To create the input message, you enter a title for the message into the Title text box and message text into the Input Message text box. Make sure that the Show Input Message When Cell Is Selected check box is selected. Look at Figure 3-12 to see how the Input Message entered in Figure 3-11 looks on the workbook.

image

Figure 3-12: A data entry instruction message is helpful.

7. Provide an error message from the Error Alert tab of the Data Validation dialog box. (See Figure 3-13.)

You can also supply an error message that Excel displays when someone attempts to enter invalid data. To create an error message, first verify that the Show Error Alert After Invalid Data Is Entered check box is selected. Then use the Style drop-down list box to select what Excel should do when it encounters invalid data: Stop the data entry on the user without the incorrect data entry, or simply display an informational message after the data has been entered.

Just like creating an input message, enter the error message title into the Title text box. Then enter the full text of the error message into the Error Message text box. In Figure 3-13, you can see a completed Error Alert tab. Check out Figure 3-14 for how the error message appears after a user enters invalid data.

image Curious about the options in the Style drop-down list box (as shown in Figure 3-13)? The style of the error alert determines what command buttons the error message presents when someone attempts to enter bad data. If the error style is Stop, the error message box displays Retry and Cancel command buttons. If the error style is Warning, the error message box displays Yes, No, and Cancel command buttons. If the error style is Informational, the error message box displays OK and Cancel command buttons.

image

Figure 3-13: Create an annoying data entry error message.

image

Figure 3-14: Britney would be proud, you dunderhead.