
Chapter 7. Sorting and Searching

IN THIS CHAPTER

§ Popular sorting algorithms

§ Analyzing the complexity of algorithms

§ Search algorithms

7.1 Introduction

Entire books have been written on sorting and searching with computers. We introduce the topic here only to stress, once again, that writing programs is not the target of computer science; solving problems efficiently and effectively with the limited resources found in a computer is the real goal.

It turns out that computers spend a tremendous amount of time sorting. Just as we discussed different algorithms for computing prime numbers, we will now discuss three basic, comparison-based sorting algorithms. None of these are truly efficient. Efficient comparison-based sorting is beyond the scope of this introductory text. Additionally, we introduce a radix sort, one that capitalizes on the nature of the elements stored rather than on individual comparisons between elements.

The sorting problem is described as follows:

Given a list of elements provided as input in any arbitrary order, these elements having an established ordinal value, namely a collating sequence, reorder them so that they appear according to their ordinal value from lowest to highest.

GEM OF WISDOM

A common conjecture is that computers around the world spend the majority of their time sorting. Hence, it is difficult to talk much about computer science without talking about sorting. There are many sorting approaches.

For example, consider the following list of numbers as input: 5, 3, 7, 5, 2, 9. A sorted output corresponding to this input is: 2, 3, 5, 5, 7, 9.

In the following subsections, we will describe three comparison-based sorting algorithms and briefly compare them to demonstrate how to determine which one is the best. The implementation of each sorting algorithm will be presented in the context of grades for a final exam in a programming class. We want to provide a sorted list of the final scores, shown as percentages, to the students given an unsorted list. In each case, we describe the algorithm in plain language and then provide a corresponding Ruby implementation. Once the algorithms are presented, we discuss how we measure the notion of “best.”

7.1.1 Selection Sort

Selection sort is the simplest to explain and the most intuitive. Imagine you have a deck of cards in your hand, and they have numbers on them. If you wanted to sort them, one easy way is to just select the smallest number in the deck and bring it to the top. Now repeat the process for all cards other than the one that you just did. If you repeated this process until the entire deck was selected, you would end up with a sorted deck of cards. The algorithm just described is selection sort.

The selection sort algorithm is formally defined as follows:

1. Start with the entire list marked as unprocessed.

2. Find the smallest element in the yet unprocessed list; swap it with the element that is in the first position of the unprocessed list; reset the unprocessed list starting with the second element.

3. Repeat step 2 for an additional n – 2 times for the remaining n – 1 numbers in the list. After n – 1 iterations, the nth element, by definition, is the largest and is in the correct location.

We’ve already discussed arrays, so our Ruby code will first initialize an array and populate it with randomly generated numbers. The rand(x) function, where x is an integer, returns a randomly generated integer in the range [0, x – 1], that is, from 0 up to but not including x.

The Ruby code for the selection sort is given in Example 7-1.

Example 7-1. Code for selection sort

1 # Code for selection sort

2 # 35 students in our class

3 NUM_STUDENTS = 35

4 # Max grade of 100%

5 MAX_GRADE = 100

6 num_compare = 0

7 arr = Array.new(NUM_STUDENTS)

8

9 # Randomly populate arr

10 for i in (0..NUM_STUDENTS - 1)

11 # Maximum possible grade is 100%; rand(5) returns possible values 0-4, so we must add 1 to MAX_GRADE

12 arr[i] = rand(MAX_GRADE + 1)

13 end

14

15 # Output current values of arr

16 puts "Input list:"

17 for i in (0..NUM_STUDENTS - 1)

18 puts "arr[" + i.to_s + "] ==> " + arr[i].to_s

19 end

20

21 # Now let's use a selection sort. We first find the lowest number in the

22 # array and then we move it to the beginning of the list

23 for i in (0..NUM_STUDENTS - 2)

24 min_pos = i

25 for j in (i + 1)..(NUM_STUDENTS - 1)

26 num_compare = num_compare + 1

27 if (arr[j] < arr[min_pos])

28 min_pos = j

29 end

30 end

31 # Knowing the min, swap with current first element (at position i)

32 temp = arr[i]

33 arr[i] = arr[min_pos]

34 arr[min_pos] = temp

35 end

36

37 # Now output the sorted array

38 puts "Sorted list:"

39 for i in (0..NUM_STUDENTS - 1)

40 puts "arr[" + i.to_s + "] ==> " + arr[i].to_s

41 end

42

43 puts "Number of Comparisons ==> " + num_compare.to_s

§ Lines 3 and 5 declare important constants that represent the problem. If the number of students in the class changes, we have to change only one constant.

§ Line 7 initializes an array called arr that will hold the randomly generated numbers and ultimately the sorted list.

§ Lines 10–13 step through the array arr and initialize each element to a randomly generated number in the range [0, MAX_GRADE].

§ Lines 17–19 output the initial list so that you can examine its contents. Comment this out if you want to try a large set of numbers to sort.

§ Line 23 is where the real work begins.

§ Lines 23–35, the outer loop, ensure that we repeat the core of step 2 a total of n – 1 times.

§ Line 24 is the first line of finding the minimum value in the list. We set the first position of the unprocessed list to min_pos.

§ Lines 25–30 iterate through the rest of the unprocessed list to find a value smaller than the item located at position min_pos. If we find such a value, as in line 27, we update the value of min_pos as in line 28. Once we have found the minimum value, we perform the latter part of step 2 and swap it with the first position in the unprocessed list. The outer loop repeats until the entire list is sorted.

§ Line 26 counts the number of comparisons performed and is simply here for pedagogical purposes to determine the best sorting algorithm.

§ Lines 38–43 output the sorted list and the number of comparisons.

GEM OF WISDOM

Selection sort works by repeatedly finding the lowest remaining number and bringing it to the top. Selection sort is explained first since intuitively it is the easiest to understand. If you are confused by Example 7-1, come back to it after a break. Please do not just skip past it and hope that the rest of the chapter gets easier. It does not.

7.1.2 Insertion Sort

Insertion sort is a little trickier than selection sort. Imagine once again that you have a deck of cards and that you are given an additional card to add to this deck. You could start at the top of your deck and look for the right place to insert your new card. If you started with only one card and gradually built the deck, you would always have a sorted deck.

The insertion sort algorithm is formally defined as follows:

Step 1: Consider only the first element, and thus, our list is sorted.

Step 2: Consider the next element; insert that element into the proper position in the already-sorted list.

Step 3: Repeat this process of adding one new number for all n numbers.

The Ruby code for an insertion sort is given in Example 7-2.

GEM OF WISDOM

Insertion sort works by leaving the first element alone and declaring it as a sorted list of size 1. The next element is inserted into the right position in our newly sorted list (either above or below the element we started with). We continue by taking each new element and inserting it in the right position in our list. By the end, all of our insertions result in a single sorted list.

Example 7-2. Code for insertion sort

1 # Code for insertion sort

2 # Declare useful constants

3 NUM_STUDENTS = 35

4 MAX_GRADE = 100

5 num_compare = 0

6 arr = Array.new(NUM_STUDENTS)

7

8 # Randomly populate arr

9

10 for i in (0..NUM_STUDENTS - 1)

11 arr[i] = rand(MAX_GRADE + 1)

12 end

13

14 # Output randomly generated array

15 puts "Input array:"

16 for i in (0..NUM_STUDENTS - 1)

17 puts "arr[" + i.to_s + "] ==> " + arr[i].to_s

18 end

19

20 # Now let's use an insertion sort

21 # Insert lowest number in the array at the right place in the array

22 for i in (0..NUM_STUDENTS - 1)

23 # Now start at current bottom and move toward arr[i]

24 j = i

25 done = false

26 while ((j > 0) and (! done))

27 num_compare = num_compare + 1

28 # If the bottom value is lower than values above it, swap it until it

29 # lands in a place where it is not lower than the next item above it

30 if (arr[j] < arr[j - 1])

31 temp = arr[j - 1]

32 arr[j - 1] = arr[j]

33 arr[j] = temp

34 else

35 done = true

36 end

37 j = j - 1

38 end

39 end

40

41 # Now output the sorted array

42 puts "Sorted array:"

43 for i in (0..NUM_STUDENTS - 1)

44 puts "arr[" + i.to_s + "] ==> " + arr[i].to_s

45 end

46 puts "Number of Comparisons ==> " + num_compare.to_s

§ Lines 22–39 contain the core outer loop that inserts the next number in the list into the right place.

§ Lines 26–38 contain the inner loop that swaps numbers starting at the beginning of the unsorted list until the number falls into the right place.

§ Once the number is in the right place, the flag done is set to true in line 35.

7.1.3 Bubble Sort

Bubble sort is based on percolation; that is, elements successively percolate to the right order by swapping neighboring elements. This is like continuously and repetitively comparing adjacent pairs of cards within your deck.

The bubble sort uses two relatively straightforward loops. The outer loop ensures that the core process in the inner loop is repeated n – 1 times. The core process is to loop through the list and, for any successive elements in the list, check the following: if the value we are currently examining is larger than the next member of the list, simply swap those two values. Thus, each value will fall down the list into its proper place. Essentially, small values “bubble” to the top of the list, hence the name “bubble sort.”

The bubble sort algorithm is formally defined as follows:

Step 1: Loop through all entries of the list.

Step 2: Compare each entry to all successive entries and swap entries if they are out of order.

Step 3: Repeat this process a total of n – 1 times.

The Ruby code is given in Example 7-3. An efficiency optimization (not shown in the example) terminates the processing once a complete pass makes no swaps. This does not change the worst-case complexity of the sort, but it typically improves the running time in practice.
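As a rough sketch (not one of the book’s numbered examples), the swap-detection optimization might look like the following. Unlike Example 7-3, this version compares adjacent elements, and it stops as soon as a complete pass makes no swaps; the array arr and the counter num_compare are assumed to be set up as in Example 7-3.

swapped = true
pass = 0
while swapped
  swapped = false
  # One pass of adjacent comparisons; each pass pushes the largest
  # remaining value toward the end of the array
  for j in (0..NUM_STUDENTS - 2 - pass)
    num_compare = num_compare + 1
    if (arr[j] > arr[j + 1])
      temp = arr[j]
      arr[j] = arr[j + 1]
      arr[j + 1] = temp
      swapped = true
    end
  end
  pass = pass + 1
end

On an already sorted list, the first pass makes no swaps, so the processing ends after a single pass through the list.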

GEM OF WISDOM

Bubble sort is a little tricky. It is not how people would likely sort. The premise is that if we repeatedly place successive elements in order, eventually the smallest element will bubble up to the top. It is clever and sometimes is more efficient than the other algorithms we have discussed. So it is worth knowing. Take some time and step through this code.

Example 7-3. Code for bubble sort

1 # Code for bubble sort

2 NUM_STUDENTS = 35

3 # Max grade of 100%

4 MAX_GRADE = 100

5 num_compare = 0

6 arr = Array.new(NUM_STUDENTS)

7

8 # Randomly put some final exam grades into arr

9

10 for i in (0..NUM_STUDENTS - 1)

11 arr[i] = rand(MAX_GRADE + 1)

12 end

13

14 # Output randomly generated array

15 puts "Input array:"

16 for i in (0..NUM_STUDENTS - 1)

17 puts "arr[" + i.to_s + "] ==> " + arr[i].to_s

18 end

19

20 # Now let's use bubble sort. Swap pairs iteratively as we loop through the

21 # array from the beginning of the array to the second-to-last value

22 for i in (0..NUM_STUDENTS - 2)

23 # From arr[i + 1] to the end of the array

24 for j in ((i + 1)..NUM_STUDENTS - 1)

25 num_compare = num_compare + 1

26 # If the first value is greater than the second value, swap them

27 if (arr[i] > arr[j])

28 temp = arr[j]

29 arr[j] = arr[i]

30 arr[i] = temp

31 end

32 end

33 end

34

35 # Now output the sorted array

36 puts "Sorted array:"

37 for i in (0..NUM_STUDENTS - 1)

38 puts "arr[" + i.to_s + "] ==> " + arr[i].to_s

39 end

40 puts "Number of Comparisons ==> " + num_compare.to_s

§ Lines 22–33 contain the core algorithm.

§ Lines 24–32 contain the inner loop that swaps all elements that are larger than their next successive element.

7.1.4 Radix Sort

The radix sort is very different from the others. The sorting algorithms we have discussed compare the entire number with other numbers in the list and ultimately make a decision as to where an element belongs based on its number. The radix sort works by sorting the list by each successive digit. The idea is that if we first sort all the units or ones digits in a list and then sort all the tens digits and so on, ultimately, when we run out of digits, we will have a sorted list. Sorting by a single digit can be done by running one of the three sorting algorithms we have discussed. It can also be done by storing all values that match the digit in an array. We use this method so that the algorithm ends up looking very different from the algorithms we have already discussed.

Let’s make sure we are clear about the idea of sorting values one digit at a time.

Consider a list of values:

47

21

90

Now let’s sort them based only on their rightmost digit. The rightmost digits are 7, 1, 0. We can sort these as 0, 1, 7. Now let’s look at our list:

90

21

47

It is clearly not in sorted order (but at least the rightmost digit is nicely sorted).

Now we move on to the next digit. It is 9, 2, 4. Sorting this, we obtain 2, 4, 9. Here is the list:

21

47

90

It is now sorted. You may wonder why we start at the rightmost digit. The reason is that we know every number has at least one digit, so we can start there. Some numbers may have more digits than others, so we have to start at the right and work our way to the left. Now let’s consider the use of a hash. For the same example, we start with:

47

21

90

GEM OF WISDOM

In radix sort, unlike the other sorting algorithms discussed, no comparison of elements is made. Instead, radix sort repeatedly sorts elements digit by digit, commencing from the rightmost digit until the last digit is done. Radix sort illustrates that there are numerous unique approaches to the sorting problem; thus, investigate alternatives rather than simply selecting the first solution that comes to mind.

Now let’s make a hash bucket for each possible digit, so we have a bucket for 0, a bucket for 1, and so on, up to a bucket for 9.

We read the rightmost digit of each number and put the number into the corresponding bucket. This results in:

0 → 90

1 → 21

7 → 47

We now read the buckets in order from 0 to 9 and output all values in the bucket to continue the sort. This yields:

90

21

47

This is the same place we were at when we sorted the rightmost digit. This works because we process the hash buckets in order from 0 to 9. Now we repopulate our hash buckets with the tens digit. We obtain:

2 → 21

4 → 47

9 → 90

Reading the buckets in order gives us our sorted result of 21, 47, 90.

To review, we are building a hash of the following form:

0 → [Array of matching values for the digit 0]

1 → [Array of matching values for the digit 1]

...

9 → [Array of matching values for the digit 9]

It can be seen that our hash of 10 entries (one for each digit) points to an array of matches for that specific digit. Note that this works because we know we are sorting only one digit at a time, and we know the full set of valid digits.
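As a brief illustration (a sketch, not one of the book’s numbered examples), the rightmost-digit pass over the three-number list can be written directly in Ruby using exactly this hash-of-arrays structure:

buckets = Hash.new()
for d in 0..9
  buckets[d.to_s] = Array.new()
end
# Place each number (stored as a string) into the bucket for its rightmost digit
["47", "21", "90"].each do |num|
  buckets[num[-1]].push(num)
end
# buckets now maps "0" => ["90"], "1" => ["21"], "7" => ["47"],
# with every other bucket left empty
puts buckets.values.flatten.inspect   # => ["90", "21", "47"]

Reading the buckets in order from 0 to 9 reproduces the intermediate list 90, 21, 47 from the walkthrough above.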

The code for radix sort is shown in Example 7-4.

Example 7-4. Code for radix sort

1 # Code for radix sort

2 NUM_STUDENTS = 35

3 MAX_GRADE = 100

4 arr = Array.new(NUM_STUDENTS)

5

6 # Randomly put some grades into the array *as strings*

7 for i in (0..NUM_STUDENTS - 1)

8 arr[i] = rand(MAX_GRADE + 1).to_s

9 end

10

11 # Output array and find the maximum number of digits in the generated array

12 puts "Input array: "

13 max_length = 0

14 for i in (0..NUM_STUDENTS - 1)

15 puts "arr[" + i.to_s + "] ==> " + arr[i]

16 if arr[i].length > max_length

17 max_length = arr[i].length

18 end

19 end

20 puts "Max length ==> " + max_length.to_s

21

22 # Add 0 padding based on the max length, simplifying the sort algorithm

23 for i in (0..NUM_STUDENTS - 1)

24 arr[i] = arr[i].rjust(max_length, "0")

25 end

26

27 # Now let's use a radix sort. Go through each digit and

28 # add each element to an array corresponding to the digits.

29 for i in (0..max_length - 1)

30 # Clear out and reset the bucket

31 buckets = Hash.new()

32 for j in 0..9

33 buckets[j.to_s] = Array.new()

34 end

35

36 # Add each number to its respective digit bucket

37 for j in 0..NUM_STUDENTS - 1

38 num = arr[j]

39 digit = num[max_length - 1 - i]

40 buckets[digit].push(num)

41 end

42 # Flatten the buckets into a one-dimensional array

43 arr = buckets.values.flatten

44 end

45

46 # Now output the sorted array

47 puts "Sorted array:"

48 for i in (0..NUM_STUDENTS - 1)

49 puts "arr[" + i.to_s + "] ==> " + arr[i].to_s

50 end

§ Lines 2–9 initialize the list to be sorted as we have described in all the other sorts. Notice that we are storing the numbers as strings, rather than integers, so we can easily access the individual digits of the number. The list is initialized with random values.

§ One addition is a loop, on lines 23–25, that right-justifies the array elements in the list using the Ruby rjust function. This pads the numbers in the list with leading zeros.

Since we are going to loop through the entries in the list digit by digit, it is crucial that all numbers contain the same number of digits. Padding with zeros in the front of the number is the best way to ensure that all numbers are of the same length.
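For example, a short irb session (illustrative only) shows how rjust pads with leading zeros while leaving full-length values untouched:

irb(main):001:0> "47".rjust(3, "0")
=> "047"
irb(main):002:0> "100".rjust(3, "0")
=> "100"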

§ Lines 29–44 contain the outer loop that processes the list one digit at a time. For each digit, the entire list will be traversed.

§ On lines 31–34, we reset the hash named buckets that we discussed in the description of the algorithm.

§ Lines 37–41 are the inner loop that adds each number to its corresponding bucket.

§ Line 43 uses two functions, values and flatten, to give a new array representing the values sorted according to the current digit we are processing. The values function returns an array of all the values in a hash, which is analogous to the keys function discussed in Section 6.3, “Hashes.” The flatten function takes a two-dimensional array and returns a one-dimensional array with the same elements, as shown in the following irb session:

irb(main):001:0> arr = [[1, 2], [3, 4], [5, 6]]

=> [[1, 2], [3, 4], [5, 6]]

irb(main):002:0> arr.flatten

=> [1, 2, 3, 4, 5, 6]

7.2 Complexity Analysis

To evaluate an algorithm, a common approach is to analyze its complexity. That is, we essentially count the number of steps involved in executing the algorithm.

An intuitive explanation of complexity analysis is the following. We caution you that our explanation is clearly an oversimplification, but it suffices for our purposes. Given a certain input size, assuming that to process a single element takes one unit of time, how many units of time are involved in processing n elements of input? This is essentially what complexity analysis attempts to answer. As can be seen, it is unnecessary to determine exactly the computer time involved in each step; instead, we simply determine the number of logical steps that occur in a given algorithm. In reality, we can have families of steps (say, one family is addition and subtraction, the other multiplication and division). We then count how many steps of each family are required.

A simple example is to evaluate the complexity of computing a² + ab + b² for a large number, say, n, of pairs of a and b. Computing directly, we need 3n multiplications and 2n additions. However, we can compute the same expression using (a + b)² – ab, which can be done in 2n multiplications and 2n additions (note that we consider addition and subtraction to be in the same family of steps). Thus, the second expression is better than the original. For very large values of n, this may make a significant difference in computation time. This is a very simple example, but it provides a background for our discussions of the complexity of sorting algorithms.
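As a tiny illustrative Ruby check of this counting argument (not part of the book’s numbered examples), both expressions below produce the same value; the comments count the operations for a single pair a, b, so over n pairs the counts scale as just described.

a = 7
b = 3
# Direct form: 3 multiplications and 2 additions
direct = a * a + a * b + b * b
# Rewritten form: 1 addition, then 2 multiplications and 1 subtraction
s = a + b
clever = s * s - a * b
puts direct == clever   # => true (both are 79)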

In complexity analysis, we forgo constants; thus, the distinction between n and n – 1 in terms of complexity is nonexistent. Moreover, we often assume that all computations are of the same family of operations. In terms of our complexity analysis, it does not matter whether the list shrinks or grows; for simplicity, assume it shrinks.

Now consider the three presented comparison-based sorting algorithms. For all, the outer loop has n steps, and for the inner loop the size of the list shrinks by one with each pass. So the first time it takes n steps, the next time n – 1, the next time n – 2, and so on. Thus, the number of steps is:

n + (n – 1) + (n – 2) + ... + 3 + 2 = 2 + 3 + ... + (n – 2) + (n – 1) + n

If you add 1 to this rewriting of the sum, it becomes a well-known arithmetic series whose total is n(n + 1)/2. So the total number of steps for these sorts is n(n + 1)/2 – 1. (For example, with n = 5, the sum 2 + 3 + 4 + 5 = 14 is exactly 5(5 + 1)/2 – 1.)

Clearly, for any n > 0, n(n + 1)/2 – 1 is less than n²; however, the complexity is still considered roughly n². The official notation is O(n²) and is pronounced “on the order of n squared” or “big oh of n squared.” The reason the complexity is O(n²) is that complexity is only an approximation, and the dominant portion of n(n + 1)/2 – 1 is n²/2, which, ignoring the constant factor, is simply n². For our purposes, if you grasp the concept of the dominant portion to determine complexity, you are ahead of the game.

For the purposes of complexity analysis, cn², where c is a constant, is considered O(n²) for any finite c. For actual computations, however, the value of c may be important. Again, refer to a book on complexity theory to understand, in detail, this important concept. For a reading list on algorithms and complexity, see Appendix A.

As an aside, order computation typically involves best-, average-, and worst-case analyses. For the selection and bubble sort algorithms presented, the best-, average-, and worst-case analyses are the same, since regardless of the initial ordering of the list, the processing is identical. As previously mentioned, there is an implementation of bubble sort that checks for no swapping with potential early termination. In such a case, the best-case analysis, which occurs when the initial list is already sorted, is O(n).

Now let’s turn to insertion sort, which is somewhat trickier to analyze. Here we are finding the rightmost position at which to insert the value. For an already sorted list, that position is found immediately, and we end up with only n steps! Thus, the best-case analysis for insertion sort is O(n). However, if the list is precisely the opposite of sorted, namely, in descending order, we must process until the end of the list for each step. Thus, once again, we end up with roughly n(n + 1)/2 steps, and the worst-case analysis for insertion sort is O(n²). It turns out that the average-case analysis is likewise O(n²).

The radix sort works in O(dn), where d is the number of digits that must be processed and n is the number of entries that are to be sorted. Hence, it should run much faster than the other examples. It should be noted that other algorithms—quicksort, mergesort, and heapsort—all run in O(n log(n)) time. The radix sort might at first appear to be faster than these, but it depends on how many digits are processed. A 64-bit integer might require processing each bit as a digit. Hence, the runtime where d = 64 will be O(64n). This might sound good, but n log(n) time will be faster where n < 2^64. So, in comparing the three sorts as presented, the average- and worst-case analyses for the comparison-based sorts are O(n²), while the radix sort can vary from linear time (sorting values with a single bit) to an arbitrarily large amount of time, as the number of digits is theoretically not constrained.
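To make the comparison concrete, the following back-of-the-envelope Ruby calculation (illustrative only) contrasts the two step counts for a list of one million 64-digit (binary) values:

n = 1_000_000
radix_steps = 64 * n                    # 64,000,000 digit operations
n_log_n     = (n * Math.log2(n)).round  # roughly 19,900,000 comparisons
puts radix_steps
puts n_log_n

For this n, the O(n log(n)) sorts come out well ahead, exactly as the n < 2^64 argument predicts.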

7.3 Searching

Searching a list of names or numbers is another very common computer science task. There are many search algorithms, but the key in developing a search algorithm is to determine which type of candidate search process matches the particular need. The following are the two questions (parameters) that affect our search algorithm selection:

§ Is the list we are searching sorted or unsorted?

§ Are the searched list elements unique, or are there duplicate values within the list?

For simplicity, we illustrate the search process using only a unique element list. That is, our implementation assumes that there are no duplicate values. We then discuss what needs to be modified in the algorithm and corresponding Ruby implementation to support duplicate values. Given the level of programming sophistication you now possess, we forgo presenting the only slightly modified implementation that supports duplicate values and leave it as an exercise for you. Once again, we revisit the final exam grade example we used in the sections on sorting.

We now discuss two types of searches. The first, used for an unsorted list, is called a linear search; the second, used for an ordered (sorted) list, is called a binary search.

7.3.1 Linear Search

Consider the problem of finding a number or a name, or more accurately, its position, in an unsorted list of unique elements. The simplest means to accomplish this is to visit each element in the list and check whether the element in the list matches the sought-after value. This is called a linear or sequential search since, in the worst case, the entire list must be searched in a linear fashion (one item after another). This occurs when the sought-after value either is in the last position or is absent from the list. Obviously, the average case requires searching half the list since the sought-after value can be found equally likely anywhere in the list, and the best case occurs when the sought-after value is the first element in the list. The algorithm is as follows:

1. For every element in the list, check whether the element is equal to the value to be found.

2. If the element looked for is found, then the position where the element is found is returned. Otherwise, continue to the next element in the list.

Continue the search until either the element looked for is found or the end of the list is reached.

A Ruby implementation for unique element linear search is provided in Example 7-5.

Example 7-5. Code for linear search

1 # Code for linear search

2 NUM_STUDENTS = 35

3 MAX_GRADE = 100

4 arr = Array.new(NUM_STUDENTS)

5 value_to_find = 8

6 i = 1

7 found = false

8

9 # Randomly put some student grades into arr

10 for i in (0..NUM_STUDENTS - 1)

11 arr[i] = rand(MAX_GRADE + 1)

12 end

13

14 puts "Input List:"

15 for i in (0..NUM_STUDENTS - 1)

16 puts "arr[" + i.to_s + "] ==> " + arr[i].to_s

17 end

18 i = 0

19 # Loop over the list until it ends or we have found our value

20 while ((i < NUM_STUDENTS) and (not found))

21 # We found it :)

22 if (arr[i] == value_to_find)

23 puts "Found " + value_to_find.to_s + " at position " + i.to_s + " of

the list."

24 found = true

25 end

26 i = i + 1

27 end

28

29 # If we haven't found the value at this point, it doesn't exist in our list

30 if (not found)

31 puts "There is no " + value_to_find.to_s + " in the list."

32 end

Consider now the case of an unsorted list with potentially duplicate elements. In this case, it is necessary to check each and every element in the list, since an element matching the sought-after value does not imply completion of the search process. Thus, the only difference between this algorithm and the unique element linear search algorithm is that we continue through the entire list without terminating the loop if a matching element is found.

§ In line 5, we initialize the sought-after value. Clearly the user would be prompted with some nice box to fill in, but we do not want to get distracted with user-interface issues. Ultimately the user fills in the nice box, pulls down a value list, or clicks on a radio button, and a variable such as value_to_find will be initialized.

§ Next, a flag called found is set to false on line 7. This is used so that the search will terminate when the sought-after value is indeed found.

§ In lines 10–12, the list is initialized and filled with some random values.

§ The key loop starts at line 20, where the list is traversed one item at a time. Each time, the comparison in line 22 tests to determine if the element in the array matches the sought-after value the user is trying to find. If the value matches, a message is displayed, the flag is set to true, and the loop terminates.

§ Finally, on line 30, a check is made to determine if the element was not found—in other words, is absent from the list. If this is the case, the value in the found flag will remain false. If it is still false after traversing the entire list, this means that no value in the list matched the sought-after value, and the user is notified.

7.3.2 Binary Search

Binary search operates on an ordered set of numbers. The idea is to search the list much as you would search a phone book, which is ordered from A to Z. Ideally, you initially search the halfway point in the phone book. For example, if the phone book had 99 names, ideally you would initially look at name number 50. Let’s say it starts with an M, since M is the 13th letter in the alphabet. If the name we are looking for starts with anything from A to M, for example, “Fred,” we find the halfway point between those beginning with A and those beginning with M. If, on the other hand, we are searching for a name farther down the alphabet than those names that start with M, for example, “Sam,” we find the halfway point between those beginning with M and those beginning with Z. Each time, we find the precise middle element of those elements left to be searched and repeat the process. We terminate the process once the sought-after value is found or when the remaining list of elements to search consists of only one value.

By following this process, each time we compare, we reduce half of the search space. This halving process results in a tremendous savings in terms of the number of comparisons made. A linear search must compare each element in the list; a binary search reduces the search space in half each time. Thus, 2x = n is the equation needed to determine how many comparisons (x) are needed to find a sought-after value in an n element list. Solving for x, we obtain x = log2(n).
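A quick check in Ruby (illustrative only) confirms the count of halvings for lists of various sizes:

puts Math.log2(1_048_576)        # => 20.0, so at most 20 comparisons
puts Math.log2(1_000_000).ceil   # => 20 for a list of one million elements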

GEM OF WISDOM

Binary search is one of the finest examples of computer science helping to make software work smart instead of just working hard. A linear search of 1 million elements takes on average half a million comparisons. A binary search takes 20. That is an average savings of 499,980 comparisons! So think before you code.

Instead of an O(n) algorithm needed to find an element using a linear search, we now have an O(log2(n)) search algorithm. Now, for example, consider a sorted list with 1,048,576 elements. For a linear search, on average, we would need to compare 524,288 elements against the sought-after value, but we may need to perform a total of 1,048,576 comparisons.

In contrast, in a binary search we are guaranteed to search using only log2(1,048,576) = 20 comparisons. Instead of 524,288 comparisons in the average case, the binary search algorithm requires only 20. That is, the number of comparisons required by the binary search algorithm is less than 0.004% of the expected number of comparisons needed by the linear search algorithm. As an aside, O(log2(n)) and O(log(n)) are equivalent, since they differ strictly by a constant factor. Hence, generally speaking, the simpler O(log(n)) notation is preferred. Of course, binary search is possible only if the original list is sorted. The binary search explanation given earlier is for a unique element list only. However, before presenting the modification needed for potential duplicate elements, a few remarks regarding the use of binary search must be made:

§ First, binary search assumes an ordered list. If the list is unordered, it must be sorted prior to the search, or a binary search won’t work. Sorting involves a greater time complexity than searching. Thus, if the search will occur rarely, it might not be wise to sort the list. On the other hand, searches often occur frequently and the updating of the list occurs infrequently. Thus, in such cases, always sort (order) the list and then use binary search.

§ The average- and worst-case search times for binary search are O(log(n)), while the average- and worst-case search times for linear search are O(n). What is interesting, however, is that unlike for linear search, where, in practice, the worst-case search time is double that of the average case, for binary search both times are roughly identical in practice.

In Example 7-6, we present a Ruby implementation of unique element binary search. Note that we introduce on line 16 a built-in Ruby feature to check if a value is already present within an array and on line 22 to sort an array in place.

Example 7-6. Code for binary search

1 # Code for binary search

2 NUM_STUDENTS = 30

3 MAX_GRADE = 100

4 arr = Array.new(NUM_STUDENTS)

5 # The value we are looking for

6 value_to_find = 7

7 low = 0

8 high = NUM_STUDENTS - 1

9 middle = (low + high) / 2

10 found = false

11

12 # Randomly put some exam grades into the array

13 for i in (0..NUM_STUDENTS - 1)

14 new_value = rand(MAX_GRADE + 1)

15 # make sure the new value is unique

16 while (arr.include?(new_value))

17 new_value = rand(MAX_GRADE + 1)

18 end

19 arr[i] = new_value

20 end

21 # Sort the array (with Ruby's built-in sort)

22 arr.sort!

23

24 print "Input List: "

25 for i in (0..NUM_STUDENTS - 1)

26 puts "arr[" + i.to_s + "] ==> " + arr[i].to_s

27 end

28

29 while ((low <= high) and (not found))

30 middle = (low + high) / 2

31 # We found it :)

32 if arr[middle] == value_to_find

33 puts "Found grade " + value_to_find.to_s + "% at position " + middle.to_s

34 found = true

35 end

36

37 # If the value should be lower than middle, search the lower half,

38 # otherwise, search the upper half

39 if (arr[middle] < value_to_find)

40 low = middle + 1

41 else

42 high = middle - 1

43 end

44 end

45

46 if (not found)

47 puts "There is no grade of " + value_to_find.to_s + "% in the list."

48 end

We use these features to simplify the code illustrated. Note that this type of abstraction, namely, the use of the built-in encapsulated feature, simplifies software development, increases readability, and simplifies software maintenance. Its use is paramount in practice. It should be understood that this book uses Ruby as a tool for learning concepts of computer science and basic programming, and not as an attempt to teach all the capabilities of the Ruby interpreter. See the additional reading list in Appendix A if you are interested in exploring additional built-in features of Ruby.
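For reference, a short irb session (illustrative only) shows the two built-in features used in Example 7-6, include? and sort!:

irb(main):001:0> arr = [5, 3, 7]
=> [5, 3, 7]
irb(main):002:0> arr.include?(3)
=> true
irb(main):003:0> arr.sort!
=> [3, 5, 7]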

§ Line 9 computes the first middle range.

§ Lines 29–44 implement the key loop that keeps cutting the search space down by half.

§ Line 32 is the comparison to the sought-after value. If we find the element, we are done and we update the same found flag that was used for the linear search. If the value is less than the middle we update the high side of the range, and if it is greater we update the low side of the range. At the end, we verify that the sought-after value was indeed found.

As with the linear search algorithm, the modification required to support possible duplicate values is relatively minimal for the binary search algorithm. Since binary search requires an ordered list, if a sought-after value is found, then all duplicates must be adjacent to the position just found.

Thus, to find all duplicates, positions immediately preceding and following the current position are checked for as long as the sought-after value is found. That is, in succession, adjacent positions earlier and earlier in the list are checked while the stored element value equals that which is sought after, and similarly for later and later positions. Again, the implementation of this change is left as an exercise to the reader.

7.4 Summary

We discussed the various sorting schemes, both comparison-based and non-comparison-based, and the strengths of each. We introduced the field of complexity analysis, a tool used to quantify the performance of an algorithm. Finally, we discussed searching and provided a real-world example of searching techniques.

Specifically, we described four elementary sorting algorithms. The first three presented—selection, insertion, and bubble sort—are comparison based. That is, elements are compared against one another to determine a sorted ordering. The fourth algorithm, radix sort, differs substantially. Instead of comparing elements, radix sort orders the elements according to their representation, starting from the rightmost digit. Once all digits of the element representation are processed, the list is sorted. The complexity of these sorting algorithms is presented in Table 7-1, where n represents the number of elements in the list and k represents the number of digits needed to represent the largest element. The best-case scenario occurs when the original list is already sorted; the worst-case scenario occurs when the original list is in reverse order; and the average-case scenario represents an original random ordering.

Table 7-1. Sort algorithm complexity summary

                 Best case    Worst case    Average case
Selection sort   O(n²)        O(n²)         O(n²)
Insertion sort   O(n)         O(n²)         O(n²)
Bubble sort      O(n²)        O(n²)         O(n²)
Radix sort       O(kn)        O(kn)         O(kn)

Likewise, we presented two searching algorithms: linear and binary search. Linear search can be used to search any list, whereas binary search requires a sorted list. The complexity of these searching algorithms is presented in Table 7-2, where n represents the number of elements in the list searched. The best-case scenario occurs when the first element encountered is the element sought; the worst-case scenario occurs when the sought after element is missing; and the average-case scenario represents a random list.

Table 7-2. Search algorithm complexity summary

                 Best case    Worst case    Average case
Linear search    O(1)         O(n)          O(n)
Binary search    O(1)         O(log(n))     O(log(n))

7.4.1 Key Concepts

§ Sorting is a common problem that occurs in many places in computer science. We focus primarily on comparison-based sorting, where we simply compare the items to determine the order. Radix sort sorts numbers without directly comparing them.

§ Searching can be done naively by linearly searching through a list, but if the list is sorted, we can take advantage of binary search to improve performance.

§ When discussing algorithm performance, computer scientists use complexity analysis.

7.4.2 Key Definitions

§ Comparison-based sort: A sorting method that relies on directly comparing the elements.

§ Complexity analysis: A mathematical way of analyzing an algorithm’s performance.

7.5 Exercises

1. Radix sort, as presented, works for integers. Modify the algorithm in Example 7-4 to sort English names.

2. For each input sequence provided in the following list, state which presented comparison-based sort or sorts would require the fewest steps. Explain why.

a. 5 2 4 3 1

b. 1 2 3 4 5

c. 5 4 3 2 1

3. You are provided a lengthy unsorted list and told to search it.

a. Which search algorithm would you use?

b. If you were told that you will need to search the list many times, would your search strategy change? If so, how?

c. At which point would you change your approach if you were to change it?

4. The complexity of the comparison-based sorting algorithms presented, in the average case, is O(n²). Design a comparison-based sorting algorithm with a lower complexity. What is the underlying premise that lowers its complexity?

5. Generate a 100-element list containing integers in the range 0–10. Sort the list with selection sort and with radix sort.

a. Which is faster? Why?

b. Try this again, but with 10,000 elements. Note the relative difference. Why does it exist?