Question: How
to get the median from a stream of numbers at any time? The median is the middle
value of the sorted numbers. If the count of numbers is even, the median is defined as the
average of the two numbers in the middle.
Analysis: Since
numbers come from a stream, the count of numbers is dynamic, and increases over
time. If a data container is defined for the numbers from a stream, new numbers
will be inserted into the container when they are deserialized. Let us find an
appropriate data structure for such a data container.
An array is the simplest choice. The array should be kept sorted,
because we are going to get its median. Even though it only costs O(lgn) time to find the insertion
position with binary search, it costs O(n) time to insert a number into a sorted array, because O(n) numbers have to be moved if there are n numbers in the array. Getting the median is very
efficient, since it only takes O(1) time to access a
number in an array by index.
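The sorted-array option can be sketched as follows. This is a minimal illustration, assuming the numbers are doubles stored in a std::vector; the helper names InsertSorted and MedianOfSorted are hypothetical and not part of the original solution.

```cpp
#include <algorithm>  // std::upper_bound
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds num while keeping `sorted` in ascending order.
void InsertSorted(std::vector<double>& sorted, double num)
{
    // Binary search finds the insertion position in O(lgn) comparisons...
    std::vector<double>::iterator pos =
        std::upper_bound(sorted.begin(), sorted.end(), num);
    // ...but insert() shifts every later element, which costs O(n).
    sorted.insert(pos, num);
}

// Hypothetical helper: O(1) median lookup by index.
double MedianOfSorted(const std::vector<double>& sorted)
{
    assert(!sorted.empty());
    const std::size_t n = sorted.size();
    // For odd n both indices coincide; for even n they are the two middle ones.
    return (sorted[(n - 1) / 2] + sorted[n / 2]) / 2.0;
}
```

Inserting 5, 1, 3 and 10 in that order leaves the vector as 1, 3, 5, 10, so the two middle indices select 3 and 5.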
A sorted list is
another choice. It takes O(n) time to
find the appropriate position to insert a new number. Additionally, the time to
get the median can be optimized to O(1) if we keep two pointers which point
to the central one or two elements.
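One way to keep the central pointers up to date is sketched below. It is only an illustration under some assumptions: the numbers are doubles in a std::list, a single iterator tracks the lower of the two middle elements (the upper one is simply its successor), and the class name SortedListMedian is hypothetical.

```cpp
#include <cassert>
#include <iterator>  // std::next
#include <list>

// Hypothetical class: a sorted list plus an iterator to the lower middle element.
// Copying is not handled in this sketch (the iterator would point into the old list).
class SortedListMedian
{
public:
    void Insert(double num)
    {
        // Linear scan for the first element greater than num: O(n).
        std::list<double>::iterator pos = numbers.begin();
        while(pos != numbers.end() && !(num < *pos))
            ++pos;

        const bool wasEmpty = numbers.empty();
        const bool countWasOdd = (numbers.size() % 2 == 1);
        const bool beforeMiddle = !wasEmpty && num < *middle;
        numbers.insert(pos, num);

        if(wasEmpty)
            middle = numbers.begin();
        else if(beforeMiddle && countWasOdd)
            --middle;  // the lower middle moves one step to the left
        else if(!beforeMiddle && !countWasOdd)
            ++middle;  // the lower middle moves one step to the right
    }

    double GetMedian() const
    {
        assert(!numbers.empty());
        if(numbers.size() % 2 == 1)
            return *middle;
        return (*middle + *std::next(middle)) / 2.0;
    }

private:
    std::list<double> numbers;
    std::list<double>::iterator middle;  // lower of the one or two middle elements
};
```

The iterator moves by at most one step per insertion, so the median stays available in O(1) time while the insertion itself remains O(n).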
A better choice is a binary search tree, because
it only costs O(lgn) on average to
insert a new node. However, the time complexity is O(n) in the worst case, when numbers are inserted in sorted
(increasing or decreasing) order. To get the median from a binary
search tree, each node needs auxiliary data recording the number of nodes in its
subtree. Getting the median node also takes O(lgn) time on average, but O(n) time in the worst case.
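The role of the per-node counts can be sketched like this. It is a minimal, unbalanced binary search tree under stated assumptions: values are doubles, nodes are never freed, and the names CountedNode, InsertCounted, Select and MedianOfTree are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node type: each node records the size of the subtree rooted at it.
struct CountedNode
{
    double value;
    std::size_t count;  // number of nodes in this subtree, including itself
    CountedNode* left;
    CountedNode* right;
};

std::size_t Count(const CountedNode* node)
{
    return node == nullptr ? 0 : node->count;
}

// Plain BST insertion that updates the counts on the way down.
// Nodes are never freed in this sketch.
CountedNode* InsertCounted(CountedNode* root, double num)
{
    if(root == nullptr)
        return new CountedNode{num, 1, nullptr, nullptr};
    ++root->count;
    if(num < root->value)
        root->left = InsertCounted(root->left, num);
    else
        root->right = InsertCounted(root->right, num);
    return root;
}

// Returns the k-th smallest value (0-based) by comparing k with the left subtree size.
double Select(const CountedNode* root, std::size_t k)
{
    assert(root != nullptr && k < root->count);
    while(true)
    {
        const std::size_t leftCount = Count(root->left);
        if(k < leftCount)
            root = root->left;
        else if(k == leftCount)
            return root->value;
        else
        {
            k -= leftCount + 1;  // skip the left subtree and the current node
            root = root->right;
        }
    }
}

double MedianOfTree(const CountedNode* root)
{
    assert(root != nullptr);
    const std::size_t n = root->count;
    return (Select(root, (n - 1) / 2) + Select(root, n / 2)) / 2.0;
}
```

Each step of Select descends one level, so the cost of a lookup is proportional to the height of the tree, which is where the O(lgn) average and O(n) worst case come from.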
We may utilize a balanced binary search tree, such as an AVL tree, to avoid the
worst cases. Usually the balance factor of a node in an AVL tree is the height
difference between its right subtree and left subtree. We may modify it a little
here: define the balance factor as the difference in the number of nodes
between its right subtree and left subtree. It costs O(lgn) time to insert a new node into an AVL tree, and O(1) time to get the
median in all cases.
An AVL tree is efficient, but unfortunately it is not implemented
in the libraries of the most common programming languages. It is also very
difficult for candidates to implement the left/right rotations of AVL trees in
the dozens of minutes available during an interview. Let us look for better solutions.
As shown in Figure 1, if all numbers are sorted, the numbers
which are related to the median are indexed by P1 and P2. If the count of numbers
is odd, P1 and P2 point to the same central number. If the count is even, P1
and P2 point to the two numbers in the middle.
The median can be read or calculated from the numbers pointed to by
P1 and P2. Notice that all numbers are divided into two parts. The numbers
in the first half are no greater than the numbers in the second half. Moreover, the
number indexed by P1 is the greatest number in the first half, and the number
indexed by P2 is the least one in the second half.
Figure 1: Numbers are divided into two parts by the one or two numbers at their center.
Therefore, numbers in the first half are inserted into a max
heap, and numbers in the second half are inserted into a min heap. It costs
O(lgn) time to insert a number into a
heap. Since the median can be read or calculated from the roots of the min heap and the max heap, getting it only takes O(1) time.
Table 1 compares the solutions above with a sorted array, a
sorted list, a binary search tree, an AVL tree, as well as a min heap and a max
heap.
| Type for Data Container | Time to Insert | Time to Get Median |
|---|---|---|
| Sorted Array | O(n) | O(1) |
| Sorted List | O(n) | O(1) |
| Binary Search Tree | O(lgn) on average, O(n) for the worst cases | O(lgn) on average, O(n) for the worst cases |
| AVL | O(lgn) | O(1) |
| Max Heap and Min Heap | O(lgn) | O(1) |

Table 1: Summary of solutions with a sorted array, a
sorted list, a binary search tree, an AVL tree, as well as a min heap and a max
heap.
Let us consider the implementation details. All numbers
should be evenly divided into two parts, so the counts of numbers in the min heap and
the max heap should differ by 1 at most. To achieve such a division, a new number is
inserted into the min heap if the count of existing numbers is even; otherwise it
is inserted into the max heap.
We should also make sure that the numbers in the max heap
are no greater than the numbers in the min heap. Suppose the count of existing
numbers is even, so a new number will be inserted into the min heap. If the new
number is less than some numbers in the max heap, inserting it directly would violate our rule that all
numbers in the min heap should be greater than or equal to the numbers in the max heap.
In such a case, we can insert the new number into the max
heap first, and then pop the greatest number from the max heap, and push it
into the min heap. Since the number pushed into the min heap is the former
greatest number in the max heap, all numbers in the min heap are still greater than or equal to the
numbers in the max heap, which now contains the newly inserted number.
The situation is similar when the count of existing numbers
is odd and the new number to be inserted is greater than some numbers in the
min heap. Please analyze the insertion process carefully by yourself.
The following is sample code in C++. The STL provides the container adaptor priority_queue, but we can also build heaps directly on vectors with the functions push_heap and pop_heap. The comparison functors less and greater are employed for max heaps and
min heaps respectively.
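#include <algorithm>   // push_heap, pop_heap
#include <cstddef>     // size_t
#include <functional>  // less, greater
#include <stdexcept>   // runtime_error
#include <vector>

using namespace std;

template<typename T> class DynamicArray
{
public:
    void Insert(T num)
    {
        // Even count so far: the min heap (second half) grows by one.
        if(((minHeap.size() + maxHeap.size()) & 1) == 0)
        {
            // If num belongs to the first half, put it into the max heap
            // and move the greatest number of the max heap instead.
            if(maxHeap.size() > 0 && num < maxHeap[0])
            {
                maxHeap.push_back(num);
                push_heap(maxHeap.begin(), maxHeap.end(), less<T>());

                num = maxHeap[0];

                pop_heap(maxHeap.begin(), maxHeap.end(), less<T>());
                maxHeap.pop_back();
            }

            minHeap.push_back(num);
            push_heap(minHeap.begin(), minHeap.end(), greater<T>());
        }
        // Odd count so far: the max heap (first half) grows by one.
        else
        {
            // If num belongs to the second half, put it into the min heap
            // and move the least number of the min heap instead.
            if(minHeap.size() > 0 && minHeap[0] < num)
            {
                minHeap.push_back(num);
                push_heap(minHeap.begin(), minHeap.end(), greater<T>());

                num = minHeap[0];

                pop_heap(minHeap.begin(), minHeap.end(), greater<T>());
                minHeap.pop_back();
            }

            maxHeap.push_back(num);
            push_heap(maxHeap.begin(), maxHeap.end(), less<T>());
        }
    }

    T GetMedian() const
    {
        size_t size = minHeap.size() + maxHeap.size();
        if(size == 0)
            throw runtime_error("No numbers are available");

        // Odd count: the min heap holds the extra number, so its root is the median.
        if((size & 1) == 1)
            return minHeap[0];

        // Even count: average the two roots (truncates if T is an integer type).
        return (minHeap[0] + maxHeap[0]) / 2;
    }

private:
    vector<T> minHeap;  // second half, least number at minHeap[0]
    vector<T> maxHeap;  // first half, greatest number at maxHeap[0]
};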
In the code above, function Insert is
used to insert a new number deserialized from a stream, and GetMedian is used to get the median of the existing numbers dynamically.
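For comparison, the same two-heap strategy can be sketched with std::priority_queue instead of raw vectors. This is only an illustration, not the code from the book: the class name StreamMedian is hypothetical, the numbers are assumed to be doubles, and every insertion is routed through the opposite heap unconditionally, which is a simplification of the conditional check above.

```cpp
#include <cassert>
#include <functional>  // std::greater
#include <queue>       // std::priority_queue
#include <vector>

// Hypothetical class: two-heap median built on std::priority_queue.
class StreamMedian
{
public:
    void Insert(double num)
    {
        if((lower.size() + upper.size()) % 2 == 0)
        {
            // Even count: the min heap (second half) must grow by one.
            // Pass num through the max heap so that the value moved up
            // is the greatest of the first half and num.
            lower.push(num);
            upper.push(lower.top());
            lower.pop();
        }
        else
        {
            // Odd count: the max heap (first half) must grow by one.
            upper.push(num);
            lower.push(upper.top());
            upper.pop();
        }
    }

    double GetMedian() const
    {
        assert(!upper.empty());
        if(upper.size() > lower.size())
            return upper.top();  // odd count: the extra number is the median
        return (lower.top() + upper.top()) / 2.0;
    }

private:
    std::priority_queue<double> lower;  // max heap, first half
    std::priority_queue<double, std::vector<double>,
                        std::greater<double> > upper;  // min heap, second half
};
```

Each insertion here performs two pushes and one pop, so it does slightly more work than the conditional version, but it is still O(lgn) and the code is shorter.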
The discussion about this problem is included in my book <Coding Interviews: Questions, Analysis & Solutions>, with some revisions. You may find the details of this book on Amazon.com, or Apress.
The author Harry He owns all the rights of this post. If you are going
to use part of or the whole of this article in your blog
or webpages, please add a reference to http://codercareer.blogspot.com/. If you are going to use it in your books, please contact him via zhedahht@gmail.com. Thanks.
For BST, the median can be accessed in just O(1) time by keeping a pointer to the current median. During insertions and deletions, keep track of the current median's position and check whether it might change due to the insertion/deletion. In general, two insertions on the left of the median cause the median to move one place to the left. Two deletions on the left cause it to move one place to the right, and so on. This can be captured by keeping a variable left_shift: insertleft => left_shift++, deleteleft => left_shift--, insertright => left_shift--, deleteright => left_shift++. When left_shift == 2, move the median one place to the left and set left_shift = 0; when left_shift == -2, move the median one place to the right and set left_shift = 0.
The author Gurmeet Singh owns all the rights of this post. If you are going to use part of or the whole of this comment in your blog or webpages, please add a reference to http://chocolovey.blogspot.com/
i dont think you own the rights to this post, do you ?
lol