Friday, July 20, 2012

Distribution fitting with scipy

Distribution fitting is the procedure of selecting a statistical distribution that best fits to a dataset generated by some random process. In this post we will see how to fit a distribution using the techniques implemented in the Scipy library.
This is the first snippet:
from scipy.stats import norm
from numpy import linspace
from pylab import plot,show,hist,figure,title

# picking 150 of from a normal distrubution
# with mean 0 and standard deviation 1
samp = norm.rvs(loc=0,scale=1,size=150) 

param = norm.fit(samp) # distribution fitting

# now, param[0] and param[1] are the mean and 
# the standard deviation of the fitted distribution
x = linspace(-5,5,100)
# fitted distribution
pdf_fitted = norm.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = norm.pdf(x)

title('Normal distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The result should be as follows


In the code above a dataset of 150 samples have been created using a normal distribution with mean 0 and standar deviation 1, then a fitting procedure have been applied on the data. In the figure we can see the original distribution (blue curve) and the fitted distribution (red curve) and we can observe that they are really similar.
Let's do the same with a Rayleigh distribution:
from scipy.stats import norm,rayleigh

samp = rayleigh.rvs(loc=5,scale=2,size=150) # samples generation

param = rayleigh.fit(samp) # distribution fitting

x = linspace(5,13,100)
# fitted distribution
pdf_fitted = rayleigh.pdf(x,loc=param[0],scale=param[1])
# original distribution
pdf = rayleigh.pdf(x,loc=5,scale=2)

title('Rayleigh distribution')
plot(x,pdf_fitted,'r-',x,pdf,'b-')
hist(samp,normed=1,alpha=.3)
show()
The resulting plot:


As expected, the two distributions are very close.

8 comments:

  1. or you could plug your samples into http://zunzun.com/ :D

    ReplyDelete
  2. The actual direct link would be:

    http://zunzun.com/StatisticalDistributions/1/

    ReplyDelete
  3. Hurray! Been missing Glowing Python posts. Happy to see a new one, learn something new.

    ReplyDelete
  4. I think, this does nothing else than calculating the mean and standard deviation of samp:
    >>> samp = norm.rvs(loc=0,scale=1,size=150)
    >>> param = norm.fit(samp)
    >>> mu = np.mean(samp)
    >>> sigma = np.std(samp)
    >>> mu==param[0]
    True
    >>> sigma==param[1]
    True
    >>>

    ReplyDelete
    Replies
    1. According to the scipy documentation it should perform a Maximum Likelihood Estimate.

      Delete
  5. If it helps, some code for doing this w/o normalizing, which plots the gaussian fit over the real histogram:

    from scipy.stats import norm
    from numpy import linspace
    from pylab import plot,show,hist

    def PlotHistNorm(data, log=False):
    # distribution fitting
    param = norm.fit(data)
    mean = param[0]
    sd = param[1]

    #Set large limits
    xlims = [-6*sd+mean, 6*sd+mean]

    #Plot histogram
    histdata = hist(data,bins=12,alpha=.3,log=log)

    #Generate X points
    x = linspace(xlims[0],xlims[1],500)

    #Get Y points via Normal PDF with fitted parameters
    pdf_fitted = norm.pdf(x,loc=mean,scale=sd)

    #Get histogram data, in this case bin edges
    xh = [0.5 * (histdata[1][r] + histdata[1][r+1]) for r in xrange(len(histdata[1])-1)]

    #Get bin width from this
    binwidth = (max(xh) - min(xh)) / len(histdata[1])

    #Scale the fitted PDF by area of the histogram
    pdf_fitted = pdf_fitted * (len(data) * binwidth)

    #Plot PDF
    plot(x,pdf_fitted,'r-')

    ReplyDelete
    Replies
    1. this code doesn't work….

      Delete
  6. This comment has been removed by the author.

    ReplyDelete