Wednesday, July 27, 2011

PCA and image compression with numpy

In the previous post we have seen the princomp function. This function performs principal components analysis (PCA) on the n-by-p data matrix and uses all the p principal component to computed the principal component scores. In this new post, we will see a modified version of the princomp where the representation of the original data in the in the principal component space is computed with less than p principal components:
from numpy import mean,cov,cumsum,dot,linalg,size,flipud

def princomp(A,numpc=0):
 # computing eigenvalues and eigenvectors of covariance matrix
 M = (A-mean(A.T,axis=1)).T # subtract the mean (along columns)
 [latent,coeff] = linalg.eig(cov(M))
 p = size(coeff,axis=1)
 idx = argsort(latent) # sorting the eigenvalues
 idx = idx[::-1]       # in ascending order
 # sorting eigenvectors according to the sorted eigenvalues
 coeff = coeff[:,idx]
 latent = latent[idx] # sorting eigenvalues
 if numpc < p and numpc >= 0:
  coeff = coeff[:,range(numpc)] # cutting some PCs if needed
 score = dot(coeff.T,M) # projection of the data in the new space
 return coeff,score,latent
The following code uses the new version of the princomp to compute the PCA of a matrix that represents an image in gray scale. The PCA is computed ten times with an increasing number of principal components. The script show the images reconstructed using less than 50 principal components (out of 200).
from pylab import imread,subplot,imshow,title,gray,figure,show,NullLocator
A = imread('shakira.jpg') # load an image
A = mean(A,2) # to get a 2-D array
full_pc = size(A,axis=1) # numbers of all the principal components
i = 1
dist = []
for numpc in range(0,full_pc+10,10): # 0 10 20 ... full_pc
 coeff, score, latent = princomp(A,numpc)
 Ar = dot(coeff,score).T+mean(A,axis=0) # image reconstruction
 # difference in Frobenius norm
 # showing the pics reconstructed with less than 50 PCs
 if numpc <= 50:
  ax = subplot(2,3,i,frame_on=False)
  ax.xaxis.set_major_locator(NullLocator()) # remove ticks
  i += 1 
  title('PCs # '+str(numpc))

title('numpc FULL')
The resulting images:

We can see that 40 principal components are enough to reconstruct the original image.

At the end of this experiment, we can plot the distance of the reconstructed images from the original image in Frobenius norm (red curve) and the cumulative sum of the eigenvalues (blue curve). Recall that the cumulative sum of the eigenvalues shows the level of variance accounted by each of the corresponding eigenvectors. On the x axis there is the number of eigenvalues/eigenvectors used.
from pylab import plot,axis
perc = cumsum(latent)/sum(latent)
dist = dist/max(dist)

Friday, July 22, 2011

Principal Component Analysis with numpy

The following function is a three-line implementation of the Principal Component Analysis (PCA). It is inspired by the function princomp of the matlab's statistics toolbox.
from numpy import mean,cov,double,cumsum,dot,linalg,array,rank
from pylab import plot,subplot,axis,stem,show,figure

def princomp(A):
 """ performs principal components analysis 
     (PCA) on the n-by-p data matrix A
     Rows of A correspond to observations, columns to variables. 

 Returns :  
  coeff :
    is a p-by-p matrix, each column containing coefficients 
    for one principal component.
  score : 
    the principal component scores; that is, the representation 
    of A in the principal component space. Rows of SCORE 
    correspond to observations, columns to components.
  latent : 
    a vector containing the eigenvalues 
    of the covariance matrix of A.
 # computing eigenvalues and eigenvectors of covariance matrix
 M = (A-mean(A.T,axis=1)).T # subtract the mean (along columns)
 [latent,coeff] = linalg.eig(cov(M)) # attention:not always sorted
 score = dot(coeff.T,M) # projection of the data in the new space
 return coeff,score,latent
(In this other post you can find an updated version of this function).
In the following test a 2D dataset wil be used. The result of this test is a plot with the two principal components (dashed lines), the original data (blue dots) and the new data (red stars). As we expected the first principal component describes the direction of maximum variance and the second is orthogonal to the first.
A = array([ [2.4,0.7,2.9,2.2,3.0,2.7,1.6,1.1,1.6,0.9],
            [2.5,0.5,2.2,1.9,3.1,2.3,2,1,1.5,1.1] ])

coeff, score, latent = princomp(A.T)

# every eigenvector describe the direction
# of a principal component.
m = mean(A,axis=1)
plot([0, -coeff[0,0]*2]+m[0], [0, -coeff[0,1]*2]+m[1],'--k')
plot([0, coeff[1,0]*2]+m[0], [0, coeff[1,1]*2]+m[1],'--k')
plot(A[0,:],A[1,:],'ob') # the data
# new data

In this second example princomp(.) is tested on a 4D dataset. In this example the matrix of the data is rank deficient and only the first two components are necessary to bring the information of the entry dataset.
A = array([[-1, 1, 2, 2],
           [-2, 3, 1, 0],
           [ 4, 0, 3,-1]],dtype=double)

coeff, score, latent = princomp(A)
perc = cumsum(latent)/sum(latent)
# the following plot shows that first two components 
# account for 100% of the variance.
print 'the principal component scores'
print score.T # only the first two columns are nonzero
print 'The rank of A is'
print rank(A)  # indeed, the rank of A is 2

Coefficients for principal components
[[  1.464140e+00   1.588382e+00   0.000000e+00  -4.440892e-16]
 [  2.768170e+00  -1.292503e+00  -2.775557e-17   6.557254e-16]
 [ -4.232310e+00  -2.958795e-01   1.110223e-16  -3.747002e-16]]
The rank of A is

Monday, July 18, 2011

Jukowsi Transformation

The following code applies the Jukowsi transformation to various circles on the complex plane.
from pylab import plot,show,axis,subplot
import cmath
from numpy import *

def cplxcircm(radius,center):
 """Returns an array z where every z[i] is a point 
    on the circle's circumference on the complex plane"""
 teta = linspace(0,pi*2,150)
 z = []
 for a in teta:
  # rotating of radius+0j of a radians and translation
  zc = (radius+0j)*cmath.exp(complex(0,a)) - center
 return array( z, dtype=complex )

def jukowsi(z,L):
 return z+L/z # asimmetric Jukowski transformation

# let's try the transformation on three different circles
z = cplxcircm(1,complex(0.5,0)) #1st circle
w = jukowsi(z,1) # applying the trasformation

z = cplxcircm(0.8,complex(0.4,0.3))
w = jukowsi(z,0.188) 

z = cplxcircm(1.1,complex(0.1,0.0))
w = jukowsi(z,1)
The script shows the following figure. Every subplot shows two curves on the complex plane (x real values, y imaginary values), the blue curve is the starting curve and the red is the transformation.

Thursday, July 14, 2011

Polynomial curve fitting

We have seen already how to a fit a given set of points minimizing an error function, now we will see how to find a fitting polynomial for the data using the function polyfit provided by numpy:
from numpy import *
import pylab

# data to fit
x = random.rand(6)
y = random.rand(6)

# fit the data with a 4th degree polynomial
z4 = polyfit(x, y, 4) 
p4 = poly1d(z4) # construct the polynomial 

z5 = polyfit(x, y, 5)
p5 = poly1d(z5)

xx = linspace(0, 1, 100)
pylab.plot(x, y, 'o', xx, p4(xx),'-g', xx, p5(xx),'-b')
pylab.legend(['data to fit', '4th degree poly', '5th degree poly'])

Let's see the two polynomials:

Tuesday, July 12, 2011

Dice rolling experiment

If we roll a die a large number of times, and we compute the mean and variance, as exaplained here, we’d expect to obtain a mean = 3.5 and a variance = 2.916. Let's simulate that with a script:
import pylab
import math
# Rolling the die 1000 times
v = pylab.randint(1,7,size=(1000))

print 'mean',pylab.mean(v)
print 'variance',pylab.var(v)

pylab.hist(v, bins=6) # histogram of the outcoming
Here's the result:
mean 3.435
variance 2.781775

Monday, July 11, 2011

Prime factor decomposition of a number

The following function compute the prime factors (PFs) of an integer.
from math import floor

def factors(n):
 result = []
 for i in range(2,n+1): # test all integers between 2 and n
  s = 0;
  while n/i == floor(n/float(i)): # is n/i an integer?
   n = n/float(i)
   s += 1
  if s > 0:
   for k in range(s):
    result.append(i) # i is a pf s times
   if n == 1:
    return result

# test
print factors(90)
The result will be
[2, 3, 3, 5]
This means that 90 is equal to 2*3*3*5.

Wednesday, July 6, 2011

How to use reflection

A simple code snippet that use reflection to print names of attributes, methods and doc strings:
class Obj:
 """ An object that use reflection """

 def __init__(self,name):
  """ the constructor of this object """ = name

 def print_methods(self):
  """ print all the methods of this object and their doc string"""
  print '\n* Methods *'
  for names in dir(self):
   attr = getattr(self,names)
   if callable(attr):
    print names,':',attr.__doc__

 def print_attributes(self):
  """ print all the attributes of this object and their value """
  print '* Attributes *'
  for names in dir(self):
   attr = getattr(self,names)
   if not callable(attr):
    print names,':',attr

 def print_all(self):
  """ calls all the methods of this object """
  for names in dir(self):
   attr = getattr(self,names)
   if callable(attr) and names != 'print_all' and names != '__init__':
    attr() # calling the method

o = Obj('the my object')
Which gives the following output:
* Attributes *
__doc__ :  An object that use reflection 
__module__ : __main__
name : the my object

* Methods *
__init__ :  the constructor of this object 
print_all :  calls all the methods of this object 
print_my_attributes :  print all the attributes of this object 
print_my_methods :  print all the methods of this object 

Monday, July 4, 2011

How to plot biorhythm

The following script plot the biorhythm of a person born in 14/3/1988 in a range of 20 days.
from datetime import date
import matplotlib.dates
from pylab import *
from numpy import array,sin,pi

t0 = date(1988,3,14).toordinal()
t1 =
t = array(range((t1-10),(t1+10))) # range of 20 days

y = 100*[sin(2*pi*(t-t0)/23),  # Physical
         sin(2*pi*(t-t0)/28),  # Emotional
         sin(2*pi*(t-t0)/33)]; # Intellectual

# converting ordinals to date
label = []
for p in t:

fig = figure()
ax = fig.gca()
# adding a legend
legend(['Physical', 'Emotional', 'Intellectual'])
# formatting the dates on the x axis

The resulting graph:

Friday, July 1, 2011

Approximating pi

This script uses the following formula
to approximate the value of pi with a fixed number of correct digits.
import math
def pi(digits):
 k = 0
 pi = 0.0
 e = 1.0
 tol = pow(10,-digits) # is minimum error that we want to accept
 while e > tol:
  pi_old = pi
  pi += (2*pow(-1,k)*pow(3,.5-k))/(2*k+1)
  e = abs(pi-pi_old) # current error
  k += 1
  print '%0.16f %20.16f' % (pi,e)
 print '\nerror',e,'\niterations ',k
 print '%0.16f' % pi,'result'
 print '%0.16f' % math.pi,'real pi'

pi(8) # approximating pi with 8 correcet digits
During the execution you can see the current approximated value on the left column and the difference with the approximation at the previous step on the right column. At the end, the difference between the approximated value and the value provided by the math python module will be printed.
$ python 
3.4641016151377544   3.4641016151377544
3.0792014356780038   0.3849001794597506
3.1561814715699539   0.0769800358919501
3.1378528915956800   0.0183285799742738
3.1426047456630846   0.0047518540674045
3.1413087854628832   0.0012959602002014
3.1416743126988376   0.0003655272359544
3.1415687159417840   0.0001055967570536
3.1415997738115058   0.0000310578697218
3.1415905109380802   0.0000092628734256
3.1415933045030817   0.0000027935650015
3.1415924542876463   0.0000008502154354
3.1415927150203800   0.0000002607327336
3.1415926345473140   0.0000000804730660
3.1415926595217138   0.0000000249743999
3.1415926517339976   0.0000000077877162

error 7.78771624965e-09 
iterations  16
3.1415926517339976 result
3.1415926535897931 real pi