Yesterday Joe Celko posted on microsoft.public.sqlserver.programming about how to write an elegant query for Weighted median. He managed to get the correct results but always ended up with ugly code.
Joe had a feeling an elegant query existed but was not seeing it. Anyway, since Celko’s books have helped me in the past, I thought I should help him now.
Consider this sample data
The most common approach to calculate the median value I have seen is this
It’s a great method and it works in almost all cases. Yes, in almost all cases. There are circumstances where SQL Server will not give the correct result. The reason for me writing this blog post is that I suddenly got this result in return from inner query.
As you can see, the difference calculated by a-b suddenly doesn’t match! The simple answer is that SQL Server is working on sets and doesn’t care about the physical order of rows in a table. You can provide a logical order, and we did in this case, but still the result is wrong because of the duplicate x values. SQL can choose which row of the three to pick first when calculating the a and b value.
How can we overcome this behaviour, so that the median calculation will work no matter if table is a cluster or a heap? It’s not that hard. See this query.
Yes it works in all cases! But why? It is all about the math. See same table below where I now have added my calculated y column.
The first result has an even number of sample rows and the second result has an odd number of sample rows. The b column is the one used in my query too. Understanding the math, you can see that for even number of rows, the two rows of interest is the ones with {0, 2} for y value, and with odd number of rows, the one row of interest is the {1} for y value. The trick I use here is that there can’t be three values {0, 1, 2} because there can’t be both odd number and even number of rows in the source data.
And now how to do the weighted median? Well, we follow the same approach as above and write this piece of code. The trick here is to get all rows with same value where the median rows resides, {1} or {0, 2}.
I hope you will understand the simplicity of the algorithm.