Significance Test for Kendall's Tau-b
A variation of the definition of the Kendall correlation coefficient is necessary in order to deal with data samples with tied ranks. It is known as Kendall’s tau-b coefficient and is more effective in determining whether two non-parametric data samples with ties are correlated.
Formally, Kendall’s tau-b replaces the denominator of the original definition with the product of the square roots of N1 and N2, the numbers of data pairs not tied in each of the two target features.
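Written out, the definition takes the form below. The symbols n_c and n_d are notation assumed here for the numbers of concordant and discordant pairs, which form the numerator as in the original Kendall coefficient; N1 and N2 are the counts of pairs not tied in the first and second feature respectively.

```latex
\tau_B = \frac{n_c - n_d}{\sqrt{N_1}\,\sqrt{N_2}}
```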
In the context of our previous example based on the data set survey, N1 would be the number of student pairs with different smoking habits, whereas N2 would be the number of student pairs with different exercise practice levels. We say the two variables Exer and Smoke in the data set are uncorrelated if their correlation coefficient is zero.
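As a minimal sketch of how the pair counts enter the coefficient, the snippet below computes tau-b by hand for two short vectors with ties and compares it with the value from cor. The vectors x and y are made-up toy data, and the names nc, nd, N1 and N2 are chosen for illustration only.

```r
# Toy data with tied values in both variables (hypothetical, for illustration).
x <- c(1, 2, 2, 3, 3, 3, 4)
y <- c(1, 1, 2, 2, 3, 1, 3)

# Enumerate all pairs of observations (i < j) and take the differences.
pairs <- combn(length(x), 2)
dx <- x[pairs[1, ]] - x[pairs[2, ]]
dy <- y[pairs[1, ]] - y[pairs[2, ]]

nc <- sum(dx * dy > 0)   # concordant pairs
nd <- sum(dx * dy < 0)   # discordant pairs
N1 <- sum(dx != 0)       # pairs not tied in x
N2 <- sum(dy != 0)       # pairs not tied in y

# Kendall's tau-b from the pair counts.
tau.b <- (nc - nd) / (sqrt(N1) * sqrt(N2))
tau.b

# The built-in computation, for comparison.
cor(x, y, method="kendall")
```

The two printed values should agree, since cor with method="kendall" applies the same tie adjustment.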
As before, the function cor shows that the Kendall correlation coefficient between Exer and Smoke is 0.083547.
> library(MASS)
> smoke <- as.numeric(factor(survey$Smoke,
+ levels=c("Never","Occas","Regul","Heavy")))
> exer <- as.numeric(factor(survey$Exer,
+ levels=c("None","Some","Freq")))
> m <- cbind(exer, smoke)
> cor(m, method="kendall", use="pairwise")
exer smoke
exer 1.000000 0.083547
smoke 0.083547 1.000000
In order to decide whether the variables are uncorrelated, we test the null hypothesis that τB = 0. The alternative hypothesis is that the variables are correlated, and τB is non-zero.
To test the null hypothesis, we apply the function cor.test and find a p-value of 0.1709. Hence we do not reject the null hypothesis that the variables are uncorrelated at the 0.05 significance level.
> cor.test(exer, smoke, method="kendall")

Kendall’s rank correlation tau
data: exer and smoke
z = 1.3694, p-value = 0.1709
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
0.083547
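The reported p-value can be checked against the z statistic, assuming the test is two-sided and uses the normal approximation, which is consistent with the z value and the alternative hypothesis shown in the output. The short sketch below reproduces the decision at the 0.05 level; the variable names are illustrative.

```r
# z statistic as reported in the cor.test output above.
z <- 1.3694

# Two-sided p-value under the standard normal approximation.
p <- 2 * pnorm(-abs(z))
round(p, 4)

# Reject the null hypothesis only if p falls below the significance level.
alpha <- 0.05
p < alpha
```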
The same can be achieved with the latest version of rpudplus. We begin with the function rpucor, which now uses Kendall’s tau-b to compute the correlation coefficient, and comes up with the same correlation coefficient as the function cor.
> rpucor(m, method="kendall", use="pairwise")
exer smoke
exer 1.000000 0.083547
smoke 0.083547 1.000000
attr(,"method")
[1] "kendall"
attr(,"use")
[1] "pairwise.complete.obs"
Alternatively, we can now use the function rpucor.test, save its result as rt, and extract Kendall’s tau-b from the estimate component of the result.
> rt$estimate
exer smoke
exer 1.000000 0.083547
smoke 0.083547 1.000000
We can find the p-value of each correlation estimate in the p.value component.
To decide whether to reject the null hypothesis that the variables are uncorrelated, we compare the p-values against 0.05. Since the comparison result at ["exer", "smoke"] is FALSE, we do not reject the null hypothesis that Exer and Smoke are uncorrelated at the 0.05 significance level.
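For readers without rpudplus, the same kind of matrix comparison can be sketched in base R by looping cor.test over column pairs. This is a stand-in for illustration, not how rpucor.test works internally; the matrix d and its column names are made-up data.

```r
# Hypothetical data: 20 observations of three ordinal variables with ties.
set.seed(1)
d <- matrix(sample(1:4, 60, replace=TRUE), nrow=20,
            dimnames=list(NULL, c("a", "b", "c")))

# Fill a symmetric matrix of pairwise Kendall p-values.
k <- ncol(d)
p <- matrix(NA_real_, k, k, dimnames=list(colnames(d), colnames(d)))
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    # exact=FALSE requests the normal approximation, appropriate with ties.
    pv <- cor.test(d[, i], d[, j], method="kendall", exact=FALSE)$p.value
    p[i, j] <- pv
    p[j, i] <- pv
  }
}

# TRUE marks pairs whose null hypothesis is rejected at the 0.05 level;
# the diagonal is left as NA because a variable is not tested against itself.
p < 0.05
```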
To evaluate performance, we create random sample data drawn from 16 ordered symbols, and find that the function cor takes more than one and a half minutes to compute Kendall’s coefficient for 4000 observations in 24 features on an AMD Phenom II X4 CPU. At the time of writing, the function cor.test does not yet support multivariate data for comparison.
> test.data <- function(dim, num, seed=17) {
+ set.seed(seed)
+ matrix(sample(1:16, dim*num, replace=TRUE), nrow=num)
+ }
> m <- test.data(24, 4000)
>
> system.time(cor(m, method="kendall"))
user system elapsed
95.610 0.010 95.614
The same task would take merely 0.5 seconds for rpucor on an NVIDIA GTX 460 GPU.
It takes a similarly short amount of time for rpucor.test.
Exercises
- Determine which pairs of variables in the data set USJudgeRatings are correlated, using Kendall’s method at the 0.05 significance level.
- Determine which pairs of variables in the data set ToothGrowth are correlated, using Kendall’s method at the 0.05 significance level. (Hint: transform its dichotomous variable supp into numeric type before proceeding.)