{
 "metadata": {
  "name": "tutorial"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "An Introduction to machine learning with scikit-learn"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Section contents"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this section, we introduce the machine learning\n",
      "vocabulary that we use throughout scikit-learn and give a\n",
      "simple learning example."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Machine learning: the problem setting"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In general, a learning problem considers a set of n\n",
      "samples of\n",
      "data and tries to predict properties of unknown data. If each sample is\n",
      "more than a single number, for instance a multi-dimensional entry\n",
      "(aka multivariate\n",
      "data), it is said to have several attributes,\n",
      "or features."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can separate learning problems into a few large categories:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "supervised learning,\n",
      "in which the data comes with additional attributes that we want to predict\n",
      "(:ref:`Click here <supervised-learning>`\n",
      "to go to the Scikit-Learn supervised learning page). This problem\n",
      "can be either:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "classification:\n",
      "samples belong to two or more classes and we\n",
      "want to learn from already labeled data how to predict the class\n",
      "of unlabeled data. An example of a classification problem would\n",
      "be the digit recognition example, in which the aim is to assign\n",
      "each input vector to one of a finite number of discrete\n",
      "categories."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "regression:\n",
      "if the desired output consists of one or more\n",
      "continuous variables, then the task is called regression. An\n",
      "example of a regression problem would be the prediction of the\n",
      "length of a salmon as a function of its age and weight."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "unsupervised learning,\n",
      "in which the training data consists of a set of input vectors x\n",
      "without any corresponding target values. The goal in such problems\n",
      "may be to discover groups of similar examples within the data, which\n",
      "is called clustering,\n",
      "or to determine the distribution of data within the input space, known as\n",
      "density estimation, or\n",
      "to project the data from a high-dimensional space down to two or three\n",
      "dimensions for the purpose of visualization\n",
      "(:ref:`Click here <unsupervised-learning>`\n",
      "to go to the Scikit-Learn unsupervised learning page)."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Training set and testing set"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Machine learning is about learning some properties of a data set\n",
      "and applying them to new data. This is why a common practice for\n",
      "evaluating an algorithm in machine learning is to split the data\n",
      "at hand into two sets: one that we call a training set, on which\n",
      "we learn data properties, and one that we call a testing set,\n",
      "on which we test these properties."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Loading an example dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "scikit-learn comes with a few standard datasets, for instance the\n",
      "iris and digits\n",
      "datasets for classification and the boston house prices dataset for regression:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import datasets\n",
      "iris = datasets.load_iris()\n",
      "digits = datasets.load_digits()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A dataset is a dictionary-like object that holds all the data and some\n",
      "metadata about the data. This data is stored in the .data member,\n",
      "which is an (n_samples, n_features) array. In the case of a supervised\n",
      "problem, the response variables are stored in the .target member. More\n",
      "details on the different datasets can be found in the :ref:`dedicated\n",
      "section <datasets>`."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For instance, in the case of the digits dataset, digits.data gives\n",
      "access to the features that can be used to classify the digits samples:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print(digits.data)  # doctest: +NORMALIZE_WHITESPACE"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "and digits.target gives the ground truth for the digits dataset, that\n",
      "is, the number corresponding to each digit image that we are trying to\n",
      "learn:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "digits.target"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Shape of the data arrays"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The data is always a 2D array of shape (n_samples, n_features), although\n",
      "the original data may have had a different shape. In the case of the\n",
      "digits, each original sample is an image of shape (8, 8) and can be\n",
      "accessed using:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "digits.images[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The :ref:`simple example on this dataset\n",
      "<example_plot_digits_classification.py>` illustrates how, starting\n",
      "from the original problem, one can shape the data for consumption in\n",
      "scikit-learn."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Learning and Predicting"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the case of the digits dataset, the task is to predict the value of a\n",
      "hand-written digit from an image. We are given samples of each of the 10\n",
      "possible classes, on which we fit an\n",
      "estimator to be able to predict\n",
      "the labels corresponding to new data."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In scikit-learn, an estimator is just a plain Python class that\n",
      "implements the methods fit(X, Y) and predict(T)."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "An example of an estimator is the class sklearn.svm.SVC, which\n",
      "implements Support Vector Classification. The\n",
      "constructor of an estimator takes as arguments the parameters of the\n",
      "model, but for the time being, we will consider the estimator as a black\n",
      "box:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import svm\n",
      "clf = svm.SVC(gamma=0.001, C=100.)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Choosing the parameters of the model"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In this example we set the value of gamma manually. It is possible\n",
      "to automatically find good values for the parameters by using tools\n",
      "such as :ref:`grid search <grid_search>` and :ref:`cross validation\n",
      "<cross_validation>`."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We call our estimator instance clf as it is a classifier. It now must\n",
      "be fitted to the data, that is, it must learn from the data. This is\n",
      "done by passing our training set to the fit method. As a training\n",
      "set, let us use all the images of our dataset apart from the last\n",
      "one:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf.fit(digits.data[:-1], digits.target[:-1])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we can predict new values. In particular, we can ask the\n",
      "classifier what the digit of our last image in the digits dataset is,\n",
      "which we have not used to train the classifier:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf.predict(digits.data[-1:])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The corresponding image is the following:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As you can see, it is a challenging task: the images are of poor\n",
      "resolution. Do you agree with the classifier?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A complete example of this classification problem is available as an\n",
      "example that you can run and study:\n",
      ":ref:`example_plot_digits_classification.py`."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Model persistence"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It is possible to save a model in scikit-learn by using Python's built-in\n",
      "persistence module, namely pickle:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import svm\n",
      "from sklearn import datasets\n",
      "clf = svm.SVC()\n",
      "iris = datasets.load_iris()\n",
      "X, y = iris.data, iris.target\n",
      "clf.fit(X, y)\n",
      "import pickle\n",
      "s = pickle.dumps(clf)\n",
      "clf2 = pickle.loads(s)\n",
      "clf2.predict(X[0:1])\n",
      "y[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the specific case of scikit-learn, it may be more interesting to use\n",
      "joblib's replacement of pickle (joblib.dump & joblib.load),\n",
      "which is more efficient on big data, but can only pickle to the disk\n",
      "and not to a string:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import joblib\n",
      "joblib.dump(clf, 'filename.pkl') # doctest: +SKIP"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}