Programming Problems in Java: A Primer for The Technical Interview (2014)
Chapter 9. Selection
Selection is the last type of search we investigate.
In programming, we define the median of an array in terms of the sorted order of the array. The median of an array of n values is the value of the element at index n∕2 in sorted order. That is if an array of values is ordered, the median is the value of the middle element. Although this definition is not the same as the statistical median, it is commonly useful in programming.
Selection is a generalization of the problem of finding the median. Concretely, selection is the problem of finding the kth median of an array A of n elements. That is the element x such that there are at least k - 1 elements in A which are less than or equal to x, and n - k elements of A greater than or equal to x
As is often the case with algorithms, it is just as easy to handle the general case of selection as it is to handle the specific case of finding the median. In the following we will see that doing so does not harm run time efficiency. It is just as efficient to find the element in the kth position of a sorted array as it is to find the median of an array when the values in the array can be unbounded.
In this section we will provide a progression of solutions for the problem of selection. We begin with selection through sorting, which is a simple but inefficient solution to the problem. We then look at a randomized selection algorithm, and improve it to deterministic bounds. We then conclude with a selection algorithm designed to operate over large data sets, a problem commonly asked at Google and Microsoft.
9.1 Selection through sort
The simplest approach to selection is to pre-sort the array, and return the element that resides in the kthposition. Using the built-inArraysṡortutility function, the method is simple to implement.
Listing 9.1: Selection by Sorting
int select(int[] array, int k) {
Arrays.sort(array);
return array[k];
}
An implementation of the median from this method is simple. Recall both that arrays are 0-indexed and, from our definition above, that we want the rightmost of two middle elements.
Listing 9.2: Median by Sorting
int median(int[] array) {
return select(array, (array.length - 1) / 2);
}
This method always returns the correct answer, and it is hard to improve in terms of brevity. However it requires a full sort of the data. This turns out to be inefficient, as selection doesn’t require the full sort order. Instead we only need to know whether or not there are k smaller elements in the array. In order to improve the run time, we first look at the problem of finding a proper pivot for which to partition instead of pre-sorting.
9.2 The pivot and partition operation
In binary search, we were able to cut the search space in half by knowing that every element with an index lower than the probe’s was smaller than the probe, and every element with an index higher than the probe’s was larger than the probe. Finding a good pivot element will allow similar narrowing of the search space in selection.
This is achieved by pivoting an array on an element that is close to the median. Given a value and an array, the pivot operation leaves the array in a state where every element of the array smaller than the pivot precedes it, and every element in the array larger than the pivot succeeds it.
There are many methods of implementing a partition on an array. One could use a list and push back larger values and push front smaller values.
The following listing partitions the sub-array defined by the offset and length on the value provided. The algorithm used initializes the array with the pivot element in the first position. It then enumerates the array from right to left. When an entry with value less than the pivot is detected, the tail of the array is shifted to the right of the pivot. That smaller value is then stored in the empty location created by the shift.
The return value is an index into the array. Every element of the array which is less than or equal to the pivot has a lower index than the return value. Every element that is greater than the pivot value has an index no less than the return value.
Listing 9.3: Partition on an Element
int partition(int[] array,
int pos,
int offset,
int length) {
if (length <= 1) {
return offset;
}
int val = array[pos];
array[pos] = array[offset];
array[offset] = val;
int left = offset;
int right = offset + length - 1;
while (left < right) {
if (val < array[right]) {
right--;
} else {
int tmp = array[left + 1];
array[left + 1] = array[left];
array[left] = tmp;
if (right != (left + 1)) {
array[left] = array[right];
array[right] = tmp;
}
left++;
}
}
return left;
}
The use of the left starting position and the size on which to pivot will be helpful in the sequel. We can now implement and analyze selection by choosing an arbitrary pivot.
9.3 Selection with a pivot
We begin with a rule to choose a pivot. We suppose a rule is declared as below. The parameters define the sub-array from which to choose a pivot.
Listing 9.4: Definition of Find Pivot Index
abstract int pivot_index(int[] array,
int offset,
int size);
Given a rule to choose a pivot, implementing selection can be done recursively. A partition operation first provides the sort location of the pivot element. If the desired locationkis lower than the location of the pivot, we exclude all elements of the array from the pivot to the end. If the desired location is higher than the location of the pivot, we exclude all elements with indices smaller than the pivot and updatek. We finally recursively apply selection to the range of the array not excluded.
Listing 9.5: Selection with Pivot
int select_with_pivot(int[] array,
int offset,
int length,
int k) {
int index = pivot_index(array, offset, length);
int pos = partition(array,
index,
offset,
length);
if (pos == k) {
return array[k];
}
if (pos > k) {
return select_with_pivot(array,
offset,
pos - offset,
k);
}
return select_with_pivot(array,
pos + 1,
length
- (pos + 1 - offset),
k);
}
int select_with_pivot(int[] array, int k) {
return select_with_pivot(array, 0,
array.length,
k);
}
This implementation would have expected linear running time over input provided uniformly at random. The reason is that we can expect to exclude half the size of the list with every pivot. However the run time is not guaranteed because the pivot may not provide a strong separation of the search space. In the worst case, where the array is sorted this implementation runs in quadratic time. The problem is the choice of element on which to pivot.
To provide a linear running time selection algorithm, we chose as a pivot the median of the medians.
9.4 Median of medians
The problem at hand is to guarantee that the choice of the pivot element separates the array into two pieces each with length at least some constant fraction of the original array. This could be done if we could always choose the median. But that is our goal, and we do not have that tool available. None-the-less, it is possible to choose an approximation of the median by reducing this original problem to a smaller problem. The following solution is called the median-of-medians or median-of-three .
The reduction works by first bucketing the array into sub-arrays of constant length. From each sub-array we can find a median value, which is simple if each bucket is of a constant length. A short listing for length of three makes this clear.
Listing 9.6: Median of 3
int median_of_3(int[] array,
int offset,
int length) {
int[] sub = Arrays.copyOfRange(array,
offset,
offset + length);
Arrays.sort(sub);
return sub[(length - 1) / 2];
}
Choosing the median value of these medians will give us a value that is guaranteed to not be too far from the original actual median of the array. Analysis shows that the pivot chosen is approximately the median. Hence the running time of this process is linear to the length of the original arrays.
Implementing this algorithm is an exercise in recursion. We first bucket the array into sub-arrays of length 3, calculate the median of these three elements, and generate a new array consisting of these medians. Afterward we recursively find the median of the new array until it is small enough to decide in constant time.
Note the implementation operates on values, and then afterward find the index of the discovered value. The code can be changed to operate on indices, but doing so comes at the loss of some readability.
Listing 9.7: Median of Medians
int find_index(int[] array, int offset, int val) {
int index = offset;
while (val != array[index]) {
index++;
}
return index;
}
int median_of_medians(int[] array,
int offset,
int length) {
if (length <= 3) {
return find_index(array,
offset,
median_of_3(array,
offset,
length));
}
int num = (length / 3)
+ ((length % 3) > 0 ? 1 : 0);
int[] medians = new int[num];
for (int step = 0; step < num; step++) {
int index = step ⋆ 3;
medians[step] =
median_of_3(array,
offset + index,
Math.min(3, length - index));
}
return find_index(array,
offset,
median(medians));
}
With this we have an optimally efficient implementation of the selection algorithm. Interviewers often ask for pieces of the above and allow the candidate to continue through the problem as time permits.
A common generalization asks for an implementation of selection over a large data set.
9.5 Selection on large data
A difficult interview question, and usually a follow-up to a discussion of the selection problem, is to find the median on a large amount of data. The problem is stated that the data is too large to fit in memory and that it is desirable to avoid iterating through all the data in sequence.
To overcome the memory limitation problem we break the data up into separate arrays, requiring the candidate to operate over many small lists. The problem then reduces to finding a candidate median in a sub-array and determining that element’s position in the overall data. If the median is not an element of one list, search must continue on the next array. As we can see, this question has become both a selection problem and a binary search problem.
In the listings in this section we provide a solution for finding the median of an arbitrary number of arrays. We first search the first array, and failing to find the median value the next is searched.
For clarity we separate out a number of initialization and ancillary functions.
As with binary search over multiple lists, a set of bounds for search must be associated with each list. To accomplish this, we define a class that holds a reference to a list, an offset associated with that list, and the length of each array to search. Together, the offset and length determine the lower and upper bounds for binary search. These items are calledArrayBounds.
Listing 9.8: Definition of ArrayBounds in Selection on Large Data
class ArrayBounds {
int[] array;
int offset;
int length;
ArrayBounds(int[] array,
int offset,
int length) {
this.array = Arrays.copyOf(array,
array.length);
this.offset = offset;
this.length = length;
}
}
We initialize the list ofArrayBoundsobjects using the following method.
Listing 9.9: Initialization for Selection on Large Data
int initialize_data(int[][] arrays,
List<ArrayBounds> data) {
int total_size = 0;
for (int[] array : arrays) {
data.add(new ArrayBounds(array,
0,
array.length));
total_size += array.length;
}
return total_size;
}
In our solution we must distinguish between the position of an element in an array and its index. The reason for this is that if we find that a value is in the first position of n lists, it is also in the nth overall position. However the sum of the indices is 0. We distinguish between 0-based indices and 1-based position counters by variable name.
We shall also see that we must pivot over a value that may not be in the array. In order to properly update the search bounds, we must be able to know whether or not the pivot operation encountered the pivot value. Pivot information is stored in the following data type. Thepositionentry is the position of the pivot andfoundis whether or not the pivot value was encountered in partitioning.
Listing 9.10: Definition of Pivot in Selection on Large Data
class Pivot {
int position;
boolean found;
Pivot(int position, boolean found) {
this.position = position;
this.found = found;
}
}
nth_elementfinds the nth element in the array and rearranges the elements within the range in such a way that elements to the left of n are lesser and elements to the right of n are greater than n.
Listing 9.11: Find the nth Element in an Array
void nth_element(int[] data,
int begin,
int end,
int k) {
int endIndex = end - 1;
while (begin < endIndex) {
int i = begin;
int j = endIndex;
int mid = data[(i + j) / 2];
while (i < j) {
if (data[i] >= mid) {
int temp = data[j];
data[j] = data[i];
data[i] = temp;
j--;
} else {
i++;
}
}
if (data[i] > mid) {
i--;
}
if (k <= i) {
endIndex = i;
} else {
begin = i + 1;
}
}
}
With the pre-amble concluded, the following listing is the main loop of a solution for the median of large data. It takes each array in turn and determines if the median is a member of that list. This is accomplished by doing a binary search over the bounds of each list and finding the location of that element’s value within the remaining lists.
Listing 9.12: Median of Large Data
int median(int[][] arrays) {
List<ArrayBounds> data =
new ArrayList<ArrayBounds>();
int total_size = initialize_data(arrays, data);
int median_pos = (total_size - 1) / 2 + 1;
int pos_offset = 0;
for (int data_index = 0;
data_index < data.size();
data_index++) {
ArrayBounds tuple = data.get(data_index);
while (0 != tuple.length) {
int mid_index = tuple.offset
+ tuple.length / 2;
nth_element(tuple.array,
tuple.offset,
tuple.offset + tuple.length,
mid_index);
int value = tuple.array[mid_index];
int mid_pos = pos_offset + mid_index + 1;
List<Pivot> pivots = new ArrayList<Pivot>();
pivots.add(new Pivot(mid_index, true));
partition_array_on_pivot(data_index,
value,
data,
pivots,
mid_pos);
if (mid_pos == median_pos) {
return value;
}
update_search_bounds(median_pos,
mid_pos,
data_index,
data,
pivots);
}
pos_offset += tuple.offset;
}
ArrayBounds tuple = data.get(data.size() - 1);
return tuple.array[tuple.offset];
}
Two methods used in the solution have not yet been introduced:partition_array_on_pivotandupdate_search_bounds. Thepartition_array_on_pivotmethod pivots all other lists on the value being considered for the median in the current list. It uses the followingpartition_around_valuemethod to partition an array around a given element.
Listing 9.13: Partition an Array Around a Given Value
int partition_around_value(int[] array,
int begin,
int end,
int value) {
while (begin != end) {
while (value > array[begin]) {
begin++;
if (begin == end) {
return begin;
}
}
do {
end--;
if (begin == end) {
return begin;
}
} while (value <= array[end]);
int temp = array[begin];
array[begin] = array[end];
array[end] = temp;
}
return begin;
}
partition_array_on_pivotfirst separates values above and below the pivot value. It then checks the minimum value of that partition to see if the pivot is a member of the upper partition.
Listing 9.14: Partition Array on Pivot
void partition_array_on_pivot(
int index,
int value,
List<ArrayBounds> data,
List<Pivot> pivots,
Integer next_pos) {
for (int offset = 1;
(offset + index) < data.size();
offset++) {
int data_index = index + offset;
ArrayBounds tuple = data.get(data_index);
int begin = tuple.offset;
int end = begin + tuple.length;
int pos = partition_around_value(
tuple.array,
begin,
end,
value);
boolean found_value = false;
if (end != pos) {
nth_element(tuple.array, pos, end, pos);
if (value == tuple.array[pos]) {
found_value = true;
}
}
pivots.add(new Pivot(pos, found_value));
Pivot last = pivots.get(pivots.size() - 1);
next_pos += (tuple.offset
+ last.position
+ (last.found ? 1 : 0));
}
}
The second method isupdate_search_boundswhich updates the search bounds for every list. The update is based on whether the over-all median is less than or greater than the considered value. It is during this update that the binary search is affected on the current list.
Listing 9.15: Update Search Bounds After Partition
void update_search_bounds(int target_pos,
int current_pos,
int data_index,
List<ArrayBounds> data,
List<Pivot> pivots) {
for (int offset = 0;
(offset + data_index) < data.size();
offset++) {
ArrayBounds tuple = data.get(
data_index + offset);
if (current_pos <= target_pos) {
tuple.length -= (pivots.get(offset).position
- tuple.offset);
tuple.offset = pivots.get(offset).position;
if (0 == offset) {
tuple.offset++;
tuple.length--;
}
} else {
tuple.length = (pivots.get(offset).position
- tuple.offset);
}
}
}
All together, these methods provide a complete solution. The main idea of combining selection and binary search is presented in the main loop. And it is that which is often the insight required for solving this problem.