# Efficiently Computing Kendall's Tau

Typically, when people talk about correlation, they are referring to Pearson's product-moment coefficient:

$$\rho_{X,Y}={E[(X-\mu_X)(Y-\mu_Y)] \over \sigma_X\sigma_Y}$$

The Pearson coefficient is 1 if the datasets have a perfectly positive linear relationship and -1 if they have a perfectly negative linear relationship. But what if our data has a clear positive relationship, but it’s not linear? Or what if our data isn’t even numeric and doesn’t have a meaningful way of computing the average, $\mu$, or standard deviation, $\sigma$?

In these cases, Kendall’s Tau is a useful way of measuring the correlation since it only requires that we have a total ordering for each of our datasets. For each pair of observations, $(x_1, y_1)$ and $(x_2, y_2)$, we call the pair concordant if: $$x_1 < x_2 \text{ and } y_1 < y_2$$ $$\text{or}$$ $$x_1 > x_2 \text{ and } y_1 > y_2$$ …and we call the pair discordant if: $$x_1 < x_2 \text{ and } y_1 > y_2$$ $$\text{or}$$ $$x_1 > x_2 \text{ and } y_1 < y_2$$ If $x_1 = x_2 \text{ or } y_1 = y_2$, the pair is neither concordant nor discordant.

Kendall’s Tau is then defined as: $$\tau = \frac{n_c-n_d}{\frac{1}{2} n (n-1) }$$ Where $n_c$ is the number of concordant pairs and $n_d$ is the number of discordant pairs. Since $n (n-1) / 2$ is the total number of pairs, this value ranges from -1 to 1.

Unfortunately, this approach doesn’t deal well with tied values. Consider the following set of $(x,y)$ observations: $$(1,1), (1,1), (2,2), (3,3)$$ There’s a perfectly positive linear relationship between X and Y, but only 5 of the 6 pairs are concordant. For this case we want to use the $\tau_B$ modified version:

$$\tau_B = \frac{n_c-n_d}{\sqrt{(n_0-n_1)(n_0-n_2)}}$$

…where:

$$n_0 = n(n-1)/2$$ $$n_1 = \text{Number of pairs with tied values in } X$$ $$n_2 = \text{Number of pairs with tied values in } Y$$
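As a sanity check, apply this to the tied example above: $n = 4$, so $n_0 = 6$; five pairs are concordant and none are discordant; and the duplicated observation contributes one pair tied in $X$ and one pair tied in $Y$, so $n_1 = n_2 = 1$:

$$\tau_B = \frac{5 - 0}{\sqrt{(6-1)(6-1)}} = \frac{5}{5} = 1$$

…which recovers the perfect correlation we expect.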

## Computing Naively

We can compute $\tau_B$ in $O(n^{2})$ by looking at every pair of observations and tallying the number of concordant, discordant, and tied pairs. Once we have the tallies, we’ll apply the formula:
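The post's original examples were in Clojure; as a rough Python sketch of this final step, suppose the tallies arrive as a map of counts (the function and key names here are my own):

```python
import math

def tau_b_from_tallies(tallies, n):
    """Apply the tau-B formula given a dict of pair-relation counts.

    A pair tied in both X and Y counts toward both n1 and n2.
    """
    n0 = n * (n - 1) // 2
    nc = tallies.get("concordant", 0)
    nd = tallies.get("discordant", 0)
    n1 = tallies.get("tied-x", 0) + tallies.get("tied-both", 0)
    n2 = tallies.get("tied-y", 0) + tallies.get("tied-both", 0)
    return (nc - nd) / math.sqrt((n0 - n1) * (n0 - n2))
```

For the four tied observations above, `tau_b_from_tallies({"concordant": 5, "tied-both": 1}, 4)` gives `1.0`.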

For a given pair of observations, we’ll construct a map describing which tallies it will contribute to:
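In Python (standing in for the post's Clojure), a sketch of that classification might look like the following; the tally names are my own:

```python
def pair_relations(p1, p2):
    """Classify one pair of (x, y) observations by the tally it contributes to."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and y1 == y2:
        return {"tied-both": 1}
    if x1 == x2:
        return {"tied-x": 1}
    if y1 == y2:
        return {"tied-y": 1}
    # Concordant when both coordinates order the same way, discordant otherwise.
    if (x1 < x2) == (y1 < y2):
        return {"concordant": 1}
    return {"discordant": 1}
```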

Now we need a way of generating every pair:
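In Python this is a one-liner via the standard library (a sketch; the original presumably built the pairs in Clojure):

```python
from itertools import combinations

def all_pairs(observations):
    """Every unordered pair of observations: n(n-1)/2 pairs in total."""
    return combinations(observations, 2)
```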

Finally, we put it all together by computing the relations tally for each pair and combining them using `merge-with`:
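A self-contained Python sketch of the whole naive computation, where `Counter.update` plays the role of Clojure's `merge-with +` (all names are mine):

```python
from collections import Counter
from itertools import combinations
import math

def pair_relations(p1, p2):
    """Classify one pair of (x, y) observations by the tally it contributes to."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and y1 == y2:
        return {"tied-both": 1}
    if x1 == x2:
        return {"tied-x": 1}
    if y1 == y2:
        return {"tied-y": 1}
    if (x1 < x2) == (y1 < y2):
        return {"concordant": 1}
    return {"discordant": 1}

def kendall_tau_b_naive(xs, ys):
    """O(n^2) tau-B: tally every pair, then apply the formula."""
    obs = list(zip(xs, ys))
    n = len(obs)
    tallies = Counter()
    for p1, p2 in combinations(obs, 2):
        tallies.update(pair_relations(p1, p2))  # merge the per-pair tallies
    n0 = n * (n - 1) // 2
    n1 = tallies["tied-x"] + tallies["tied-both"]  # pairs tied in x
    n2 = tallies["tied-y"] + tallies["tied-both"]  # pairs tied in y
    nc, nd = tallies["concordant"], tallies["discordant"]
    return (nc - nd) / math.sqrt((n0 - n1) * (n0 - n2))
```

Running it on the tied example, `kendall_tau_b_naive([1, 1, 2, 3], [1, 1, 2, 3])` returns `1.0`.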

## Knight’s Algorithm

In 1966, William R. Knight was a visiting statistician at the Fisheries Research Board of Canada. He wrote:

> The problem of calculating Kendall’s tau arose while attempting to evaluate species associations in catches by the Canadian east coast offshore fishery. Sample sizes ranging up to 400 were common, making manual calculations out of the question; indeed, an initial program using an asymptotically inefficient method proved expensively slow.

Necessity is the mother of invention, so he came up with a clever algorithm for computing Kendall’s Tau in $O(n \log{n})$ which he published in his paper entitled “A Computer Method for Calculating Kendall’s Tau with Ungrouped Data”.

First, sort the observations by their $x$ values using your favorite $O(n \log{n})$ algorithm. Next, sort that sorted list by the $y$ values using a slightly modified merge sort that keeps track of the size of the swaps it had to perform.

Recall that merge sort works as follows:

1. Divide the unsorted list into $n$ sublists, each containing 1 element (a list of 1 element is considered sorted).
2. Repeatedly merge sublists to produce new sublists until there is only 1 sublist remaining. This will be the sorted list. (description from Wikipedia)

The trick is performed when merging sublists. The list was originally sorted by $x$ values, so whenever an element from the second sublist is smaller than the next element from the first sublist we know that the corresponding observation is discordant with however many elements remain in the first sublist.

We can implement this modified merge sort by first handling the case of merging two sorted sequences:
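In Python rather than the post's Clojure, the merge-and-count step might be sketched like this (names are mine; ties go to the left sequence so equal values are never counted as discordant):

```python
def merge_count(left, right):
    """Merge two sorted sequences, counting discordant pairs.

    Whenever an element of `right` precedes the remaining elements of
    `left`, it is discordant with all of them (the list was pre-sorted by x).
    """
    merged, discords = [], 0
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            discords += len(left) - i  # discordant with everything left in `left`
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, discords
```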

Now, we can do the full merge sort by applying that function to piece sizes that double until the whole collection is covered by a single sorted piece:
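A bottom-up Python sketch of that pass, with the merge step included so the snippet runs on its own:

```python
def merge_count(left, right):
    """Merge two sorted sequences, counting discordant pairs."""
    merged, discords = [], 0
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            discords += len(left) - i
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, discords

def sort_and_count(vals):
    """Bottom-up merge sort; returns (sorted list, discordant-pair count)."""
    vals, total, width = list(vals), 0, 1
    n = len(vals)
    while width < n:
        merged = []
        # Merge adjacent pieces of size `width`, doubling each pass.
        for start in range(0, n, 2 * width):
            piece, d = merge_count(vals[start:start + width],
                                   vals[start + width:start + 2 * width])
            merged.extend(piece)
            total += d
        vals, width = merged, width * 2
    return vals, total
```

For example, `sort_and_count([3, 2, 1])` returns `([1, 2, 3], 3)`: every one of the three pairs required a swap.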

The only thing we are missing now is the tallies of tied pairs. We could use `clojure.core/frequencies`, but Knight’s original paper alludes to a different way which takes advantage of the fact that at different stages of the algorithm we have the list sorted by $X$ and then $Y$. Most implementations do something like:
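One way to express that in Python: once a sequence is sorted, each run of $t$ equal values contributes $t(t-1)/2$ tied pairs, so a single linear scan suffices (a sketch; the helper name is my own):

```python
def count_tied_pairs(sorted_vals):
    """Sum t*(t-1)/2 over each run of t equal values in a sorted sequence."""
    total, run = 0, 1
    for a, b in zip(sorted_vals, sorted_vals[1:]):
        if a == b:
            run += 1
        else:
            total += run * (run - 1) // 2
            run = 1
    return total + run * (run - 1) // 2
```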

Now we have all the pieces, so we just have to put them together:
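A complete Python sketch of Knight's algorithm assembled from those pieces (the original was in Clojure; all names here are mine). One subtlety: the merge sort only sees discordances and swaps, so $n_c - n_d$ is recovered from the identity $n_c + n_d = n_0 - n_1 - n_2 + n_3$, where $n_3$ is the number of pairs tied in both $X$ and $Y$:

```python
import math

def merge_count(left, right):
    """Merge two sorted sequences, counting discordant pairs."""
    merged, discords = [], 0
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            discords += len(left) - i
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, discords

def sort_and_count(vals):
    """Bottom-up merge sort; returns (sorted list, discordant-pair count)."""
    vals, total, width = list(vals), 0, 1
    n = len(vals)
    while width < n:
        merged = []
        for start in range(0, n, 2 * width):
            piece, d = merge_count(vals[start:start + width],
                                   vals[start + width:start + 2 * width])
            merged.extend(piece)
            total += d
        vals, width = merged, width * 2
    return vals, total

def count_tied_pairs(sorted_vals):
    """Sum t*(t-1)/2 over each run of t equal values in a sorted sequence."""
    total, run = 0, 1
    for a, b in zip(sorted_vals, sorted_vals[1:]):
        if a == b:
            run += 1
        else:
            total += run * (run - 1) // 2
            run = 1
    return total + run * (run - 1) // 2

def kendall_tau_b(xs, ys):
    """O(n log n) tau-B via Knight's algorithm."""
    obs = sorted(zip(xs, ys))                    # sort by x, ties broken by y
    n = len(obs)
    n0 = n * (n - 1) // 2
    n1 = count_tied_pairs([x for x, _ in obs])   # pairs tied in x
    n3 = count_tied_pairs(obs)                   # pairs tied in both x and y
    sorted_ys, nd = sort_and_count([y for _, y in obs])
    n2 = count_tied_pairs(sorted_ys)             # pairs tied in y
    nc_minus_nd = n0 - n1 - n2 + n3 - 2 * nd
    return nc_minus_nd / math.sqrt((n0 - n1) * (n0 - n2))
```

The secondary sort by $y$ when breaking $x$-ties matters: it guarantees the merge sort never counts a pair with tied $x$ values as discordant.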

## Conclusion

There are certainly many things I would write differently above if I were really trying for performance. The goal here was to clearly illustrate the algorithm and maintain its asymptotic run-time characteristics.

Also, I recently submitted a patch to the Apache Commons Math library that contains an implementation of this in pure Java if that’s your thing.

I think this algorithm is a clever little gem and I really enjoyed learning it. Deconstructing a familiar algorithm like merge sort and utilizing its internal operations for some other purpose is a neat approach that I’ll definitely keep in my algorithmic toolbox.