Tuesday, April 22, 2014

Parameter selection with Cross-Validation

Most pattern recognition techniques have one or more free parameters, and choosing them for a given classification problem is often not a trivial task. In real applications we only have access to a finite set of examples, usually smaller than we would like, and we need to test our model on samples not seen during the training process. A model that merely memorized the samples it has seen would achieve a very good score on them, but would fail to predict unseen data. This situation is called overfitting, and to avoid it we need to apply an appropriate validation procedure to select the parameters. A tool that can help us solve this problem is Cross-Validation (CV). The idea behind CV is simple: the data are split into train and test sets several consecutive times, and the average of the prediction scores obtained with the different sets is used as the evaluation of the classifier.
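As a minimal self-contained sketch of this idea (the classifier and the number of folds below are arbitrary choices for the illustration), the function cross_val_score does the repeated splitting and scoring for us, and we just average the results:

from sklearn.datasets import load_digits
from sklearn.naive_bayes import BernoulliNB
from sklearn.cross_validation import cross_val_score

data = load_digits()
# split the data into 5 folds, train on 4 of them and test on the remaining one each time
scores = cross_val_score(BernoulliNB(), data.data, data.target, cv=5)
print 'score on each fold:', scores
print 'averaged score: %0.5f' % scores.mean()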
Let's see a simple example where the smoothing parameter of a Bayesian classifier is selected using the capabilities of the scikit-learn library.
To begin, we load one of the test datasets provided by scikit-learn (the same one used here) and we hold out 33% of the samples for the final evaluation:
from sklearn.datasets import load_digits
data = load_digits()
from sklearn.cross_validation import train_test_split
X,X_test,y,y_test = train_test_split(data.data,data.target,
                                     test_size=.33,
                                     random_state=1899)
Now we import the classifier we want to use (a Bernoulli Naive Bayes in this case), specify a set of values for the parameter we want to choose, and run a grid search:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
# test the model for alpha = 0.1, 0.2, ..., 1.0
parameters = [{'alpha': np.linspace(0.1, 1, 10)}]

from sklearn.grid_search import GridSearchCV
clf = GridSearchCV(BernoulliNB(), parameters, cv=10, scoring='f1')
clf.fit(X,y) # running the grid search
The grid search has evaluated the classifier with CV for each value specified for the parameter alpha. We can visualize the results as follows:
from pylab import *
# collect mean score, score standard deviation and alpha for each point of the grid
res = zip(*[(f1m, f1s.std(), p['alpha']) 
            for p, f1m, f1s in clf.grid_scores_])
subplot(2,1,1)
plot(res[2],res[0],'-o') # average F1-score for each alpha
subplot(2,1,2)
plot(res[2],res[1],'-o') # standard deviation of the F1-score for each alpha
show()

The plots above show the average score (top) and the standard deviation of the score (bottom) for each value of alpha used. Looking at the graphs, it seems plausible that a small alpha could be a good choice.
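The same information can also be read off numerically from the fitted grid search object, which stores the best mean score and the corresponding parameters found during the search:

print 'best mean F1-score in CV: %0.5f' % clf.best_score_
print 'best parameters:', clf.best_params_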
We can also see that using the alpha value that gave us the best results in CV on the test set we held out at the beginning gives results similar to the ones obtained during the CV stage:
from sklearn.metrics import f1_score
print 'Best alpha in CV = %0.01f' % clf.best_params_['alpha']
final = f1_score(y_test,clf.best_estimator_.predict(X_test))
print 'F1-score on the final testset: %0.5f' % final
Best alpha in CV = 0.1
F1-score on the final testset: 0.85861
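As a side note, since GridSearchCV refits the best estimator on the whole training set by default (refit=True), the same predictions could be obtained by calling predict directly on the grid search object:

predictions = clf.predict(X_test) # equivalent to clf.best_estimator_.predict(X_test)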