Friday, October 12, 2012

Visualizing correlation matrices

The correlation is one of the most common and most useful statistics. A correlation is a single number that describes the degree of relationship between two variables. The function corrcoef provided by numpy returns a matrix R of correlation coefficients calculated from an input matrix X whose rows are variables and whose columns are observations. Each element of the matrix R represents the correlation between two variables and it is computed as

$$R_{XY} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$$

where cov(X,Y) is the covariance between X and Y, and σX and σY are their standard deviations. If N is the number of variables, then R is an N-by-N matrix. When the number of variables is large, we need a way to visualize R. The following snippet uses a pseudocolor plot to visualize R:
from numpy import corrcoef, sum, log, arange
from numpy.random import rand
from pylab import pcolor, show, colorbar, xticks, yticks

# generating some uncorrelated data
data = rand(10,100) # each row represents a variable

# creating correlation between the variables
# variable 2 is correlated with all the other variables
data[2,:] = sum(data,0)
# variable 4 is correlated with variable 8
data[4,:] = log(data[8,:])*0.5

# plotting the correlation matrix
R = corrcoef(data)
pcolor(R)
colorbar()
yticks(arange(0.5,10.5),range(0,10))
xticks(arange(0.5,10.5),range(0,10))
show()
The result should be as follows:

[Figure: pseudocolor plot of the 10-by-10 correlation matrix R, with a colorbar]


As expected, the correlation coefficients for variable 2 are higher than the others, and we observe a strong correlation between variables 4 and 8.
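To make the definition of the correlation coefficient concrete, here is a minimal sketch that builds R directly from covariances and standard deviations and compares it with the output of corrcoef. The helper name correlation_matrix and the fixed random seed are my own choices for illustration; they are not part of numpy or of the snippet above.

```python
import numpy as np

def correlation_matrix(data):
    """Correlation matrix of the rows of `data`, computed from the definition.

    Hypothetical helper for illustration: rows are variables,
    columns are observations (the same layout corrcoef expects).
    """
    data = np.asarray(data, dtype=float)
    n_obs = data.shape[1]
    # subtract each variable's mean from its observations
    centered = data - data.mean(axis=1, keepdims=True)
    # sample covariance between every pair of variables
    cov = centered @ centered.T / (n_obs - 1)
    # standard deviations are the square roots of the diagonal
    std = np.sqrt(np.diag(cov))
    # divide each covariance by the product of the two standard deviations
    return cov / np.outer(std, std)

# same kind of data as in the post (seed chosen arbitrarily)
rng = np.random.default_rng(0)
data = rng.random((10, 100))
data[2, :] = data.sum(axis=0)
data[4, :] = np.log(data[8, :]) * 0.5

R_manual = correlation_matrix(data)
R_numpy = np.corrcoef(data)

# the two matrices should agree up to floating-point error
print(np.allclose(R_manual, R_numpy))
```

The normalization of the covariance (dividing by the number of observations minus one, or by the number of observations) cancels out in the ratio, so either choice gives the same R.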

10 comments:

  1. Don't use the jet colormap!

    http://www.jwave.vt.edu/~rkriz/Projects/create_color_table/color_07.pdf

    https://abandonmatlab.wordpress.com/2011/05/07/lets-talk-colormaps/

    http://cresspahl.blogspot.com/2012/03/expanded-control-of-octaves-colormap.html

    I think the hot colormap would be a better choice here

  2. In some cases, Hinton diagrams can be far more useful. See http://www.scipy.org/Cookbook/Matplotlib/HintonDiagrams

  3. hey,

    i get a strange error when running the script:


    /Users/xxx/src/matplotlib/lib/matplotlib/backends/backend_macosx.pyc in draw_quad_mesh(self, gc, master_transform, meshWidth, meshHeight, coordinates, offsets, offsetTrans, facecolors, antialiased, showedges)
    98 facecolors,
    99 antialiased,
    --> 100 showedges)
    101
    102 def new_gc(self):

    "only length-1 arrays can be converted to Python scalars"

    also, the colorbar is not visible
    what to do?

  4. which version of matplotlib/python are you using?

  5. hey,

    i'm using Python 2.7.3 and matplotlib '1.2.x' on os x.
    btw: if i leave out the colorbar command the error doesn't show up.

  6. hello again.

    actually, i dont know why i had this unstable version installed.
    i used pip to install the stable 1.1.1 version and now it works like a charm.

    thanks for the fast reply and keep up the good work here :)

  7. I like the correlation example and will try that later on some of my data. It is also cool that we use the same theme on blogger. /Magnus

    1. Thanks Magnus. I like this theme because it's simple. If you're interested in matrix visualization don't forget to try Hinton diagrams also.
