Figure 1: Spearman correlation heat map with correlation coefficient and significance levels based on the mtcars data set.
Figure 1: Spearman correlation heat map with correlation coefficient and significance levels based on the mtcars data set.

In a recent paper we included data from a survey we conducted. During the publication process, one of the reviewers asked for a more in depth statistical analysis of the data set. He (she?) explicitly expressed a special interest in correlating the variables of the survey in order to spot any interesting correlations. This posed a number of problems:

  1. The data set has 35 variables. This translates into a huge correlation matrix.

  2. There are many ways to correlate variables. In my field the Pearson product-moment correlation coefficient (Pearson's r) and the Spearman's rank correlation coefficient (Spearman's rho) are the most common ones. Which one should I choose and why?

  3. Correlations by themselves are not very useful. You, most likely, want to have at least some information about the statistical significance of the correlation.

The Data Set

For the sake of demonstration we will use the mtcars data set, provided by the stats package. The data set includes 11 variables with 32 observations, hence rendering the task a bit more manageable.

Table 1: The first five (of 32) observations of the mtcars data set.
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Table 2: Description of the variable labels of the mtcars data set.
axis description
mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (lb/1000)
qsec 1/4 mile time
vs V/S
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburetors

Choosing a Suitable Correlation Algorithm

As most papers in my field use either Pearson's r or Spearman's rho I began my research there. As it turns out, there are two types of algorithms: parametric and non-parametric. Parametric algorithms depend on some assumptions about the data set. In case of Pearson's r you need to know about the distribution, scaling and dependency of the variables of the data set.

Spearman's rho, on the other hand, is a non-parametric algorithm. It compares two ranked variables of either ordinal, interval or ratio type and describes how well they can be described using a monotonic function. No further assumptions on the data set are necessary which is handy if you do not know too much about the distribution of your data.

Let's Get Started!

R, a statistical computing programming language, is being used for the statistical analysis with some additional libraries: ggplot2 for plotting, Hmisc to create a correlation matrix, reshape2 to meld the data frame as well as stats to provide the mtcars data set.

library(ggplot2)
library(reshape2)
library(Hmisc)
library(stats)

The first thing is to calculate the correlation of the individual variables to another. As we have our data already in a data frame, merely a call of the rcorr function of the Hmisc library is necessary. This creates a new list with two entries: ”r” the correlation coefficients and ”P” the significance levels.

rcorr Computes a matrix of Pearson's r or Spearman's rho rank correlation coefficients for all possible pairs of columns of a matrix. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. Ranks are computed using efficient algorithms, using midranks for ties.
-- R Documentation / rcorr {Hmisc}

d <- mtcars
cormatrix = rcorr(as.matrix(d), type='spearman')
Table 3: Resulting correlation matrix - correlation coefficients (cormatrix$r)
mpg cyl disp hp drat wt qsec vs am gear carb
mpg 1.00 -0.91 -0.91 -0.89 0.65 -0.89 0.47 0.71 0.56 0.54 -0.66
cyl -0.91 1.00 0.93 0.90 -0.68 0.86 -0.57 -0.81 -0.52 -0.56 0.58
disp -0.91 0.93 1.00 0.85 -0.68 0.90 -0.46 -0.72 -0.62 -0.59 0.54
hp -0.89 0.90 0.85 1.00 -0.52 0.77 -0.67 -0.75 -0.36 -0.33 0.73
drat 0.65 -0.68 -0.68 -0.52 1.00 -0.75 0.09 0.45 0.69 0.74 -0.13
wt -0.89 0.86 0.90 0.77 -0.75 1.00 -0.23 -0.59 -0.74 -0.68 0.50
qsec 0.47 -0.57 -0.46 -0.67 0.09 -0.23 1.00 0.79 -0.20 -0.15 -0.66
vs 0.71 -0.81 -0.72 -0.75 0.45 -0.59 0.79 1.00 0.17 0.28 -0.63
am 0.56 -0.52 -0.62 -0.36 0.69 -0.74 -0.20 0.17 1.00 0.81 -0.06
gear 0.54 -0.56 -0.59 -0.33 0.74 -0.68 -0.15 0.28 0.81 1.00 0.11
carb -0.66 0.58 0.54 0.73 -0.13 0.50 -0.66 -0.63 -0.06 0.11 1.00
Table 4: Resulting correlation matrix - significance levels (cormatrix$P)
mpg cyl disp hp drat wt qsec vs am gear carb
mpg NA 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00
cyl 0.00 NA 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
disp 0.00 0.00 NA 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00
hp 0.00 0.00 0.00 NA 0.00 0.00 0.00 0.00 0.04 0.06 0.00
drat 0.00 0.00 0.00 0.00 NA 0.00 0.62 0.01 0.00 0.00 0.49
wt 0.00 0.00 0.00 0.00 0.00 NA 0.21 0.00 0.00 0.00 0.00
qsec 0.01 0.00 0.01 0.00 0.62 0.21 NA 0.00 0.26 0.42 0.00
vs 0.00 0.00 0.00 0.00 0.01 0.00 0.00 NA 0.36 0.12 0.00
am 0.00 0.00 0.00 0.04 0.00 0.00 0.26 0.36 NA 0.00 0.73
gear 0.00 0.00 0.00 0.06 0.00 0.00 0.42 0.12 0.00 NA 0.53
carb 0.00 0.00 0.00 0.00 0.49 0.00 0.00 0.00 0.73 0.53 NA

The correlation coefficients can be plotted using a heat map representation. ggplot2 provides the geom_tile geometric object for this purpose. In order to plot the correlation matrix we need to meld the data frame first.

cordata = melt(cormatrix$r)
ggplot(cordata, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() + xlab("") + ylab("")
Figure 2: Spearman correlation heat map based on the mtcars data set. No information about significance levels is included
Figure 2: Spearman correlation heat map based on the mtcars data set. No information about significance levels is included

Figure 2 shows the resulting heat map, plotted by ggplot2. However, some key data is missing. There is no information about the significance levels nor does the plot include the numeric values of Spearman's rho. In order to efficiently use the little space per tile, I wrote a little function ("abbreviateSTR") which abbreviates the values. The resulting strings are stored in a label column. This column can then be plotted onto the corresponding tile. To further enhance the distinction between significant (P > 0.05) and insignificant correlations, a red, semi transparent, ”X” is being overlaid onto insignificant tiles. The resulting plot is shown in figure 1.

Conclusion

By using the non-parametric Spearman algorithm it is possible to create a preliminary overview of the correlations within a data set. This information can be plotted utilizing a heat map representation in conjunction with significance levels as well as numeric values of Spearman's rho. The resulting plot is a comprehensible representation of the data set, which allows quick identification of significant correlations.

Looking at the heat map, one could get the impression that the gross horsepower has a strong negative correlation (r ≈ -0.89, P < 0.01) with the gasoline consumption. Who would have thought?

Complete R Script

#!/usr/bin/env Rscript

library(ggplot2)
library(reshape2)
library(Hmisc)
library(stats)

abbreviateSTR <- function(value, prefix){  # format string more concisely
  lst = c()
  for (item in value) {
    if (is.nan(item) || is.na(item)) { # if item is NaN return empty string
      lst <- c(lst, '')
      next
    }
    item <- round(item, 2) # round to two digits
    if (item == 0) { # if rounding results in 0 clarify
      item = '<.01'
    }
    item <- as.character(item)
    item <- sub("(^[0])+", "", item)    # remove leading 0: 0.05 -> .05
    item <- sub("(^-[0])+", "-", item)  # remove leading -0: -0.05 -> -.05
    lst <- c(lst, paste(prefix, item, sep = ""))
  }
  return(lst)
}

d <- mtcars

cormatrix = rcorr(as.matrix(d), type='spearman')
cordata = melt(cormatrix$r)
cordata$labelr = abbreviateSTR(melt(cormatrix$r)$value, 'r')
cordata$labelP = abbreviateSTR(melt(cormatrix$P)$value, 'P')
cordata$label = paste(cordata$labelr, "\n", 
                      cordata$labelP, sep = "")
cordata$strike = ""
cordata$strike[cormatrix$P > 0.05] = "X"

txtsize <- par('din')[2] / 2
ggplot(cordata, aes(x=Var1, y=Var2, fill=value)) + geom_tile() + 
  theme(axis.text.x = element_text(angle=90, hjust=TRUE)) +
  xlab("") + ylab("") + 
  geom_text(label=cordata$label, size=txtsize) + 
  geom_text(label=cordata$strike, size=txtsize * 4, color="red", alpha=0.4)

Comments

comments powered by Disqus