Question: Please find out the least k numbers out of n numbers. For example, if given the 8 numbers 4, 5, 1, 6, 2, 7, 3 and 8, please return the least 4 numbers 1, 2, 3 and 4.
Analysis: The naïve solution is sort the n input numbers increasingly, and the least k numbers should be the first k numbers. Since it needs to sort, its time complexity is O(nlogn). Interviewers will ask us explore more efficient solutions.
Solution 1: O(nlogk) time efficiency, be suitable for data with huge size
A data container with capacity k is firstly created to store the least k numbers, and then a number is read out of the n input numbers at each time. If the container has less than k numbers, the number read at current round (denoted as num) is inserted into container directly. If it contains k numbers already, num cannot be inserted directly any more. However, it may replace an existing number in the container. We get the maximum number of the k numbers in the container, and compare it with num. If num is less than the maximum number, we replace the maximum number with num. Otherwise we discard num, since we already have k numbers in the container which are all less than num and it cannot be one of the least k numbers.
Three steps may be required when a number is read and the container is full: The first step is to find the maximum number, secondly we may delete the maximum number, and at last we may insert a new number. The second and third steps are optional, which depend on whether the number read at current round is greater than the maximum number in container or not. If we implement the data container as a binary tree, it costs O(logk) time for these three steps. Therefore, the overall time complexity is O(nlogk) for n input numbers.
We have different choices for the data container. Since we need to get the maximum number out of k numbers, it intuitively might a maximum heap. In a maximum heap, its root is always greater than its children, so it costs O(1) time to get the maximum number. However, it takes O(logk) time to insert and delete a number.
We have to write a lot of code for a maximum heap, and it is too difficult in the dozensofminute interview. We can also implement it as a redblack tree. A redblack tree classifies its nodes into red and black categories, and assure that it is somewhat balanced based on a set of rules. Therefore, it costs O(logk) time to find, insert and delete a number. The classes set and multiset in STL are all based on redblack trees. We may use data containers in STL directly if our interviewers are not against it. The following sample code is based on the multiset in STL:
typedef multiset<int, greater<int> > intSet;
typedef multiset<int, greater<int> >::iterator setIterator;
void GetLeastNumbers(const vector<int>& data, intSet& leastNumbers, int k)
{
leastNumbers.clear();
if(k < 1  data.size() < k)
return;
vector<int>::const_iterator iter = data.begin();
for(; iter != data.end(); ++ iter)
{
if((leastNumbers.size()) < k)
leastNumbers.insert(*iter);
else
{
setIterator iterGreatest = leastNumbers.begin();
if(*iter < *(leastNumbers.begin()))
{
leastNumbers.erase(iterGreatest);
leastNumbers.insert(*iter);
}
}
}
}
Solution 2: O(n) time efficiency, be suitable only when we can reorder the input
We can also utilize the function Partition in quick sort to solve this problem with a hypothesis. It assumes that n input numbers are contained in an array. If it takes the kth number as a pilot to partition the input array, all of numbers less than the kth number should be at the left side and other greater ones should be at the right side. The k numbers at the left side are the least k numbers after the partition. We can develop the following code according to this solution:
void GetLeastNumbers(int* input, int n, int* output, int k)
{
if(input == NULL  output == NULL  k > n  n <= 0  k <= 0)
return;
int start = 0;
int end = n  1;
int index = Partition(input, n, start, end);
while(index != k  1)
{
if(index > k  1)
{
end = index  1;
index = Partition(input, n, start, end);
}
else
{
start = index + 1;
index = Partition(input, n, start, end);
}
}
for(int i = 0; i < k; ++i)
output[i] = input[i];
}
Comparison between two solutions
The second solution based on the function Partition costs only O(n) time, so it is more efficient than the first one. However, it has two obvious limitations: One limitation is that it needs to load all input numbers into an array, and the other is that we have to reorder the input numbers.
Even though the first takes more time, the second solution does have the two limitations as the first one. It is not required to reorder the input numbers (data in the code above). We read a number from data at each round, and all write operations are taken in the container leastNumbers. It does not require loading all input number into memory at one time, so it is suitable for hugesize data. Supposing our interview asks us get the least k numbers from a hugesize input. Obviously we cannot load all data with huge size into limited memory at one time. We can read a number from auxiliary space (such as disk) at each round with the first solution, and determine whether we need to insert it into the container leastNumbers. It works once memory can accommodate leastNumbers, so it is especially works when n is huge and k is small.
The characters of these two solutions can be summarized in Table 1:
First Solution

Second Solution
 
Time complexity

O(n*logk)

O(n)

Reorder input numbers?

No

Yes

Suitable for hugesize data?

Yes

No

Table 1: Pros and cons of two solutions
Since each solution has its own pros and cons, candidates had better to ask interviews for more requirements and details to choose the most suitable solution, including the input data size and whether it is allowed to reorder the input numbers.
The discussion about this problem is included in my book <Coding Interviews: Questions, Analysis & Solutions>, with some revisions. You may find the details of this book on Amazon.com, or Apress.
The author Harry He owns all the rights of this post. If you are going to use part of or the whole of this ariticle in your blog or webpages, please add a reference to http://codercareer.blogspot.com/. If you are going to use it in your books, please contact me (zhedahht@gmail.com) . Thanks.
The discussion about this problem is included in my book <Coding Interviews: Questions, Analysis & Solutions>, with some revisions. You may find the details of this book on Amazon.com, or Apress.
The author Harry He owns all the rights of this post. If you are going to use part of or the whole of this ariticle in your blog or webpages, please add a reference to http://codercareer.blogspot.com/. If you are going to use it in your books, please contact me (zhedahht@gmail.com) . Thanks.
the second solution is not O(n), that is O(n*n).
ReplyDeleteThe second solution is based on the Partition function, which is also the base of quick sort. The complexity of Partition is O(n), whose details is shown in http://en.wikipedia.org/wiki/Quicksort.
ReplyDeletefor the second approach, O(n) is the expected time complexity and for the worst case complexity, it's O(n^2)
ReplyDeleteonly if you use median of medians you can get O(n) in worst case
ReplyDeletehttp://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm__Median_of_Medians_algorithm
if k << n
ReplyDeleteyou can use variation of bubble sort with outer loop running k times and the inner loop runs for n times.
private static int getMax(int [] array, int left, int right){
ReplyDeleteif (left=rightmax? leftmax:rightmax;
}
return array[left];
}
public static void findtheleast(int [] array, int number){
int max=getMax(array,0,array.length1);
int [] tmp=new int [max+1];
for (int i=0;i<array.length;++i){
tmp[array[i]]++;
}
int i=0;
while (i<=number){
if (tmp[i]!=0){
System.out.print(i+",");
}
i++;
}
System.out.println();
}
What a wonderful post. I would like to say that this post is really very informational post. I really like it. data analysis service
ReplyDelete