## Friday, September 14, 2012

### Boxplot with matplotlib

A boxplot (also known as a box-and-whisker diagram) is a way of summarizing a set of data measured on an interval scale. In this post I will show how to make a boxplot with pylab using a dataset that contains the monthly totals of the number of new cases of measles, mumps, and chicken pox for New York City during the years 1931-1971. The data was extracted from the Hipel-McLeod Time Series Datasets Collection and you can download it from here in the matlab format.
Let's make a box plot of the monthly distribution of chicken pox cases:
```from pylab import *

NYCdiseases = loadmat('NYCDiseases.mat') # it's a matlab file

# multiple box plots on one figure
# Chickenpox cases by month
figure(1)
# NYCdiseases['chickenPox'] is a matrix
# with 30 rows (1 per year) and 12 columns (1 per month)
boxplot(NYCdiseases['chickenPox'])
labels = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
xticks(range(1,13),labels, rotation=15)
xlabel('Month')
ylabel('Chickenpox cases')
title('Chickenpox cases in NYC 1931-1971')
```
The result should be as follows:

On each box, the central mark is the median, the edges of the box are the lower hinge (defined as the 25th percentile) and the upper hinge (the 75th percentile), the whiskers extend to the most extreme data points not considered outliers, these ones are plotted individually.
Using the graph we can compare the range and distribution of the chickenpox cases for each month. We can observe that March and April are the month with the highest number of cases but also the ones with the greatest variability. We can compare the distribution of the three diseases in the same way:
```# building the data matrix
data = [NYCdiseases['measles'],
NYCdiseases['mumps'], NYCdiseases['chickenPox']]

figure(2)
boxplot(data)
xticks([1,2,3],('measles','mumps','chickenPox'), rotation=15)
ylabel('Monthly cases')
title('Contagious childhood disease in NYC 1931-1971')

show()
```
And this is the result:

Here, we can observe that the chicken pox distribution has the median higher than the other diseases. The mumps distribution seems to have small variability compared to the other ones and the measles distribution has a low median but a very high number of outliers.