Let's make a box plot of the monthly distribution of chickenpox cases:
    from pylab import *
    from scipy.io import loadmat

    NYCdiseases = loadmat('NYCDiseases.mat')  # it's a MATLAB file

    # multiple box plots on one figure
    # Chickenpox cases by month
    figure(1)
    # NYCdiseases['chickenPox'] is a matrix
    # with one row per year and 12 columns (one per month),
    # so boxplot draws one box per column, i.e. per month
    boxplot(NYCdiseases['chickenPox'])
    labels = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
    xticks(range(1, 13), labels, rotation=15)
    xlabel('Month')
    ylabel('Chickenpox cases')
    title('Chickenpox cases in NYC 1931-1971')

The result should be as follows:
On each box, the central mark is the median, and the edges of the box are the lower hinge (defined as the 25th percentile) and the upper hinge (the 75th percentile). The whiskers extend to the most extreme data points not considered outliers; by default, matplotlib treats a point as an outlier when it lies more than 1.5 times the interquartile range beyond a box edge. Outliers are plotted individually.
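To make the anatomy of a box concrete, here is a minimal sketch that computes the same quantities by hand with NumPy. The function name `box_stats` and the sample values are made up for illustration, and the 1.5 factor is assumed to match the plotting default; it is not meant as a reproduction of matplotlib's internals.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Return the quantities a box plot draws for a 1-D sample.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(values, dtype=float).ravel()
    # Lower hinge, median and upper hinge (25th, 50th, 75th percentiles)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    # Points beyond these fences are treated as outliers
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    outliers = np.sort(x[(x < lower_fence) | (x > upper_fence)])
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        # Whiskers stop at the most extreme points that are not outliers
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers,
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up values, not the NYC data
stats = box_stats(sample)
for key, value in stats.items():
    print(key, value)
```

In this made-up sample, the value 100 falls outside the upper fence, so the upper whisker stops at 9 and 100 would be drawn as an individual point.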
Using the graph, we can compare the range and distribution of the chickenpox cases for each month. We can observe that March and April are the months with the highest number of cases, but also the ones with the greatest variability. We can compare the distribution of the three diseases in the same way:
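    # building the data: one 1-D vector of monthly counts per disease
    # (each year-by-month matrix is flattened with ravel(), since
    # boxplot expects a sequence of vectors, one per box)
    data = [NYCdiseases['measles'].ravel(),
            NYCdiseases['mumps'].ravel(),
            NYCdiseases['chickenPox'].ravel()]
    figure(2)
    boxplot(data)
    xticks([1, 2, 3], ('measles', 'mumps', 'chickenPox'), rotation=15)
    ylabel('Monthly cases')
    title('Contagious childhood disease in NYC 1931-1971')
    show()

And this is the result: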
Here, we can observe that the chickenpox distribution has a higher median than the other two diseases. The mumps distribution seems to have small variability compared to the others, and the measles distribution has a low median but a very high number of outliers.
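Observations like these can also be checked numerically rather than read off the plot. The sketch below computes the median, the interquartile range and the number of outliers for each group. The helper `summarize` and the demo numbers are hypothetical, and the 1.5 × IQR outlier rule is assumed to match the one used for the plot.

```python
import numpy as np

def summarize(groups, whis=1.5):
    """Median, IQR and outlier count for each named sample.

    Hypothetical helper for illustration only.
    """
    summary = {}
    for name, values in groups.items():
        # ravel() lets the helper accept a year-by-month matrix as well
        x = np.asarray(values, dtype=float).ravel()
        q1, median, q3 = np.percentile(x, [25, 50, 75])
        iqr = q3 - q1
        is_outlier = (x < q1 - whis * iqr) | (x > q3 + whis * iqr)
        summary[name] = {
            "median": median,
            "iqr": iqr,
            "n_outliers": int(is_outlier.sum()),
        }
    return summary

# Made-up numbers, only to show the call; they are not the NYC data
demo = {
    "a": [10, 12, 11, 13, 12, 90],
    "b": [20, 21, 22, 23, 24, 25],
}
result = summarize(demo)
for name, row in result.items():
    print(name, row)
```

With the real data, the dictionary could be built from the matrices loaded earlier, for example `{'measles': NYCdiseases['measles'], 'mumps': NYCdiseases['mumps'], 'chickenPox': NYCdiseases['chickenPox']}`.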