.. _introduction:

An introduction to machine learning with scikit-learn
=======================================================================

.. topic:: Section contents

    In this section, we introduce the `machine learning
    <http://en.wikipedia.org/wiki/Machine_learning>`_
    vocabulary that we use throughout `scikit-learn` and give a
    simple learning example.


Machine learning: the problem setting
---------------------------------------

In general, a learning problem considers a set of n
`samples <http://en.wikipedia.org/wiki/Sample_(statistics)>`_ of
data and then tries to predict properties of unknown data. If each sample is
more than a single number and, for instance, a multi-dimensional entry
(aka `multivariate <http://en.wikipedia.org/wiki/Multivariate_random_variable>`_
data), it is said to have several attributes,
or **features**.

We can separate learning problems in a few large categories (a short
sketch after this list illustrates each one):

* `supervised learning <http://en.wikipedia.org/wiki/Supervised_learning>`_,
  in which the data comes with additional attributes that we want to predict
  (:ref:`Click here <supervised-learning>`
  to go to the scikit-learn supervised learning page). This problem
  can be either:

  * `classification
    <http://en.wikipedia.org/wiki/Classification_in_machine_learning>`_:
    samples belong to two or more classes and we
    want to learn from already labeled data how to predict the class
    of unlabeled data. An example of a classification problem would
    be the digit recognition example, in which the aim is to assign
    each input vector to one of a finite number of discrete
    categories.

  * `regression <http://en.wikipedia.org/wiki/Regression_analysis>`_:
    if the desired output consists of one or more
    continuous variables, then the task is called *regression*. An
    example of a regression problem would be the prediction of the
    length of a salmon as a function of its age and weight.

* `unsupervised learning <http://en.wikipedia.org/wiki/Unsupervised_learning>`_,
  in which the training data consists of a set of input vectors x
  without any corresponding target values. The goal in such problems
  may be to discover groups of similar examples within the data, where
  it is called `clustering <http://en.wikipedia.org/wiki/Cluster_analysis>`_,
  or to determine the distribution of data within the input space, known as
  `density estimation <http://en.wikipedia.org/wiki/Density_estimation>`_, or
  to project the data from a high-dimensional space down to two or three
  dimensions for the purpose of *visualization*
  (:ref:`Click here <unsupervised-learning>`
  to go to the scikit-learn unsupervised learning page).
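
For concreteness, here is a minimal sketch pairing each category with a
scikit-learn estimator; the particular classes chosen below are only
illustrative examples of each family::

    >>> from sklearn import svm, linear_model, cluster
    >>> clf = svm.SVC()                        # classification (supervised)
    >>> reg = linear_model.LinearRegression()  # regression (supervised)
    >>> km = cluster.KMeans(n_clusters=3)      # clustering (unsupervised)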
59 | ||||
|
60 | .. topic:: Training set and testing set | |||
|
61 | ||||
|
62 | Machine learning is about learning some properties of a data set | |||
|
63 | and applying them to new data. This is why a common practice in | |||
|
64 | machine learning to evaluate an algorithm is to split the data | |||
|
65 | at hand in two sets, one that we call a **training set** on which | |||
|
66 | we learn data properties, and one that we call a **testing set**, | |||
|
67 | on which we test these properties. | |||
|
68 | ||||
|
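A minimal sketch of such a split, using plain NumPy slicing on the
digits dataset that the next section introduces properly; the 90/10
ratio below is an arbitrary illustrative choice::

    >>> from sklearn import datasets
    >>> digits = datasets.load_digits()
    >>> n_train = int(0.9 * len(digits.data))   # keep 90% for training
    >>> X_train, y_train = digits.data[:n_train], digits.target[:n_train]
    >>> X_test, y_test = digits.data[n_train:], digits.target[n_train:]
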
.. _loading_example_dataset:

Loading an example dataset
--------------------------

`scikit-learn` comes with a few standard datasets, for instance the
`iris <http://en.wikipedia.org/wiki/Iris_flower_data_set>`_ and `digits
<http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits>`_
datasets for classification and the `Boston house prices dataset
<http://archive.ics.uci.edu/ml/datasets/Housing>`_ for regression::

    >>> from sklearn import datasets
    >>> iris = datasets.load_iris()
    >>> digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some
metadata about the data. This data is stored in the ``.data`` member,
which is a ``n_samples, n_features`` array. In the case of a supervised
problem, the response variables to predict are stored in the ``.target``
member. More details on the different datasets can be found in the
:ref:`dedicated section <datasets>`.
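
As a quick check, one can inspect these members; assuming the ``iris``
and ``digits`` objects loaded above, the shapes shown are those of the
bundled datasets::

    >>> iris.data.shape
    (150, 4)
    >>> digits.data.shape
    (1797, 64)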
90 | ||||
|
91 | For instance, in the case of the digits dataset, ``digits.data`` gives | |||
|
92 | access to the features that can be used to classify the digits samples:: | |||
|
93 | ||||
|
94 | >>> print digits.data # doctest: +NORMALIZE_WHITESPACE | |||
|
95 | [[ 0. 0. 5. ..., 0. 0. 0.] | |||
|
96 | [ 0. 0. 0. ..., 10. 0. 0.] | |||
|
97 | [ 0. 0. 0. ..., 16. 9. 0.] | |||
|
98 | ..., | |||
|
99 | [ 0. 0. 1. ..., 6. 0. 0.] | |||
|
100 | [ 0. 0. 2. ..., 12. 0. 0.] | |||
|
101 | [ 0. 0. 10. ..., 12. 1. 0.]] | |||
|
102 | ||||
|
and ``digits.target`` gives the ground truth for the digits dataset, that
is, the number corresponding to each digit image that we are trying to
learn::

    >>> digits.target
    array([0, 1, 2, ..., 8, 9, 8])

.. topic:: Shape of the data arrays

    The data is always a 2D array, ``n_samples, n_features``, although
    the original data may have had a different shape. In the case of the
    digits, each original sample is an image of shape ``8, 8`` and can be
    accessed using::

        >>> digits.images[0]
        array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
               [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
               [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
               [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
               [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
               [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
               [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
               [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])

    The :ref:`simple example on this dataset
    <example_plot_digits_classification.py>` illustrates how, starting
    from the original problem, one can shape the data for consumption in
    scikit-learn, as sketched below.

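A minimal sketch of that reshaping step, flattening each ``8, 8`` image
of the ``digits`` dataset loaded above into a 64-dimensional feature
vector (the variable names here are our own choice for illustration)::

    >>> n_samples = len(digits.images)
    >>> data = digits.images.reshape((n_samples, -1))
    >>> data.shape
    (1797, 64)
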
Learning and predicting
------------------------

In the case of the digits dataset, the task is to predict the value of a
hand-written digit from an image. We are given samples of each of the 10
possible classes, on which we *fit* an
`estimator <http://en.wikipedia.org/wiki/Estimator>`_ to be able to *predict*
the labels corresponding to new data.

In `scikit-learn`, an **estimator** is just a plain Python object that
implements the methods ``fit(X, y)`` and ``predict(T)``.

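To make that interface concrete, here is a minimal and deliberately naive
sketch of such an object; the class name and its constant-prediction
strategy are purely illustrative and not part of scikit-learn::

    >>> import numpy as np
    >>> class MajorityClassifier(object):
    ...     """Predict the most frequent label seen during fit."""
    ...     def fit(self, X, y):
    ...         self.majority_ = np.bincount(y).argmax()
    ...         return self
    ...     def predict(self, T):
    ...         return np.repeat(self.majority_, len(T))
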
An example of an estimator is the class ``sklearn.svm.SVC``, which
implements `Support Vector Classification
<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The
constructor of an estimator takes as arguments the parameters of the
model, but for the time being, we will consider the estimator as a black
box::

    >>> from sklearn import svm
    >>> clf = svm.SVC(gamma=0.001, C=100.)

.. topic:: Choosing the parameters of the model

    In this example we set the value of ``gamma`` manually. It is possible
    to automatically find good values for the parameters by using tools
    such as :ref:`grid search <grid_search>` and :ref:`cross validation
    <cross_validation>`, as sketched below.

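A minimal sketch of such a search, assuming a scikit-learn version where
``GridSearchCV`` lives in ``sklearn.model_selection`` (older releases ship
it in another module); the candidate grid below is an arbitrary
illustration::

    >>> from sklearn.model_selection import GridSearchCV  # doctest: +SKIP
    >>> param_grid = {'gamma': [0.0001, 0.001, 0.01], 'C': [1., 100.]}
    >>> search = GridSearchCV(svm.SVC(), param_grid)       # doctest: +SKIP
    >>> search.fit(digits.data, digits.target)             # doctest: +SKIP
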
We call our estimator instance `clf`, as it is a classifier. It now must
be fitted to the data, that is, it must *learn* from the data. This is
done by passing our training set to the ``fit`` method. As a training
set, let us use all the images of our dataset apart from the last
one::

    >>> clf.fit(digits.data[:-1], digits.target[:-1])
    SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
      gamma=0.001, kernel='rbf', probability=False, scale_C=True,
      shrinking=True, tol=0.001)

172 | ||||
|
173 | Now you can predict new values, in particular, we can ask to the | |||
|
174 | classifier what is the digit of our last image in the `digits` dataset, | |||
|
175 | which we have not used to train the classifier:: | |||
|
176 | ||||
|
177 | >>> clf.predict(digits.data[-1]) | |||
|
178 | array([ 8.]) | |||
|
179 | ||||
|
The corresponding image is the following:

.. image:: ../../auto_examples/tutorial/images/plot_digits_last_image_1.png
    :target: ../../auto_examples/tutorial/plot_digits_last_image.html
    :align: center
    :scale: 50

As you can see, it is a challenging task: the images are of poor
resolution. Do you agree with the classifier?
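
If you want to look at that image yourself, a minimal sketch using
``matplotlib`` (assumed to be installed; it is not required by
scikit-learn itself) is::

    >>> import matplotlib.pyplot as plt                    # doctest: +SKIP
    >>> plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)  # doctest: +SKIP
    >>> plt.show()                                         # doctest: +SKIP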
189 | ||||
|
190 | A complete example of this classification problem is available as an | |||
|
191 | example that you can run and study: | |||
|
192 | :ref:`example_plot_digits_classification.py`. | |||
|
193 | ||||
|
194 | ||||
|
Model persistence
-----------------

It is possible to save a model in scikit-learn by using Python's built-in
persistence module, namely `pickle <http://docs.python.org/library/pickle.html>`_::

    >>> from sklearn import svm
    >>> from sklearn import datasets
    >>> clf = svm.SVC()
    >>> iris = datasets.load_iris()
    >>> X, y = iris.data, iris.target
    >>> clf.fit(X, y)
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.25,
      kernel='rbf', probability=False, scale_C=True, shrinking=True, tol=0.001)

    >>> import pickle
    >>> s = pickle.dumps(clf)
    >>> clf2 = pickle.loads(s)
    >>> clf2.predict(X[0])
    array([ 0.])
    >>> y[0]
    0

In the specific case of scikit-learn, it may be more interesting to use
joblib's replacement for pickle (``joblib.dump`` & ``joblib.load``),
which is more efficient on big data but can only pickle to the disk
and not to a string::

    >>> from sklearn.externals import joblib
    >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
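
Later, you can load back the pickled model, possibly in another Python
process (assuming the same file path as above)::

    >>> clf = joblib.load('filename.pkl') # doctest: +SKIP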