@@ -0,0 +1,25 b'' | |||
|
1 | import os | |
|
2 | import errno | |
|
3 | import subprocess | |
|
4 | import nose.tools as nt | |
|
5 | ||
|
6 | test_rst_fname = 'tests/tutorial.rst.ref' | |
|
7 | ref_ipynb_fname = 'tests/tutorial.ipynb.ref' | |
|
8 | test_generate_ipynb_fname = 'tests/tutorial.ipynb' | |
|
9 | ||
|
10 | ||
|
11 | def clean_dir(): | |
|
12 | "Remove generated ipynb file created during conversion" | |
|
13 | try: | |
|
14 | os.unlink(test_generate_ipynb_fname) | |
|
15 | except OSError as e: | |
|
16 | if e.errno != errno.ENOENT: | |
|
17 | raise | |
|
18 | ||
|
19 | ||
|
20 | @nt.with_setup(clean_dir, clean_dir) | |
|
21 | def test_command_line(): | |
|
22 | with open(ref_ipynb_fname, 'rb') as f: | |
|
23 | ref_output = f.read() | |
|
24 | output = subprocess.check_output(['./rst2ipynb.py', test_rst_fname]) | |
|
25 | nt.assert_equal(ref_output, output) |
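The `clean_dir` helper above deletes the generated notebook and tolerates the case where the file is already gone. A minimal sketch of that pattern, written so that it also runs on Python 3, might look like the following. The name `remove_if_exists` and the temporary file are hypothetical and used only for illustration; they are not part of this changeset.

```python
import errno
import os
import tempfile


def remove_if_exists(path):
    """Delete path, ignoring the error raised when it is already absent."""
    try:
        os.unlink(path)
    except OSError as e:  # the 'as' form is valid on Python 2.6+ and Python 3
        if e.errno != errno.ENOENT:
            raise  # any other failure (e.g. permissions) is still reported


# Illustration only: create a throwaway file, then remove it twice.
fd, tmp_path = tempfile.mkstemp(suffix=".ipynb")
os.close(fd)
remove_if_exists(tmp_path)  # removes the file
remove_if_exists(tmp_path)  # second call is a no-op instead of an error
file_gone = not os.path.exists(tmp_path)
```

Because the helper is safe to call when nothing exists, it can serve as both the setup and the teardown function, which is how `clean_dir` is passed to `nt.with_setup` in the test.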
@@ -0,0 +1,396 b'' | |||
|
1 | { | |
|
2 | "metadata": {}, | |
|
3 | "nbformat": 3, | |
|
4 | "worksheets": [ | |
|
5 | { | |
|
6 | "cells": [ | |
|
7 | { | |
|
8 | "cell_type": "heading", | |
|
9 | "level": 1, | |
|
10 | "source": [ | |
|
11 | "An Introduction to machine learning with scikit-learn" | |
|
12 | ] | |
|
13 | }, | |
|
14 | { | |
|
15 | "cell_type": "heading", | |
|
16 | "level": 1, | |
|
17 | "source": [ | |
|
18 | "Section contents" | |
|
19 | ] | |
|
20 | }, | |
|
21 | { | |
|
22 | "cell_type": "markdown", | |
|
23 | "source": [ | |
|
24 | "In this section, we introduce the machine learning", | |
|
25 | "vocabulary that we use through-out scikit-learn and give a", | |
|
26 | "simple learning example." | |
|
27 | ] | |
|
28 | }, | |
|
29 | { | |
|
30 | "cell_type": "heading", | |
|
31 | "level": 2, | |
|
32 | "source": [ | |
|
33 | "Machine learning: the problem setting" | |
|
34 | ] | |
|
35 | }, | |
|
36 | { | |
|
37 | "cell_type": "markdown", | |
|
38 | "source": [ | |
|
39 | "In general, a learning problem considers a set of n", | |
|
40 | "samples of", | |
|
41 | "data and tries to predict properties of unknown data. If each sample is", | |
|
42 | "more than a single number, and for instance a multi-dimensional entry", | |
|
43 | "(aka multivariate", | |
|
44 | "data), it is said to have several attributes,", | |
|
45 | "or features." | |
|
46 | ] | |
|
47 | }, | |
|
48 | { | |
|
49 | "cell_type": "markdown", | |
|
50 | "source": [ | |
|
51 | "We can separate learning problems in a few large categories:" | |
|
52 | ] | |
|
53 | }, | |
|
54 | { | |
|
55 | "cell_type": "markdown", | |
|
56 | "source": [ | |
|
57 | "supervised learning,", | |
|
58 | "in which the data comes with additional attributes that we want to predict", | |
|
59 | "(:ref:`Click here <supervised-learning>`", | |
|
60 | "to go to the Scikit-Learn supervised learning page). This problem", | |
|
61 | "can be either:" | |
|
62 | ] | |
|
63 | }, | |
|
64 | { | |
|
65 | "cell_type": "markdown", | |
|
66 | "source": [ | |
|
67 | "classification:", | |
|
68 | "samples belong to two or more classes and we", | |
|
69 | "want to learn from already labeled data how to predict the class", | |
|
70 | "of unlabeled data. An example of classification problem would", | |
|
71 | "be the digit recognition example, in which the aim is to assign", | |
|
72 | "each input vector to one of a finite number of discrete", | |
|
73 | "categories." | |
|
74 | ] | |
|
75 | }, | |
|
76 | { | |
|
77 | "cell_type": "markdown", | |
|
78 | "source": [ | |
|
79 | "regression:", | |
|
80 | "if the desired output consists of one or more", | |
|
81 | "continuous variables, then the task is called regression. An", | |
|
82 | "example of a regression problem would be the prediction of the", | |
|
83 | "length of a salmon as a function of its age and weight." | |
|
84 | ] | |
|
85 | }, | |
|
86 | { | |
|
87 | "cell_type": "markdown", | |
|
88 | "source": [ | |
|
89 | "unsupervised learning,", | |
|
90 | "in which the training data consists of a set of input vectors x", | |
|
91 | "without any corresponding target values. The goal in such problems", | |
|
92 | "may be to discover groups of similar examples within the data, where", | |
|
93 | "it is called clustering,", | |
|
94 | "or to determine the distribution of data within the input space, known as", | |
|
95 | "density estimation, or", | |
|
96 | "to project the data from a high-dimensional space down to two or three", | |
|
97 | "dimensions for the purpose of visualization", | |
|
98 | "(:ref:`Click here <unsupervised-learning>`", | |
|
99 | "to go to the Scikit-Learn unsupervised learning page)." | |
|
100 | ] | |
|
101 | }, | |
|
102 | { | |
|
103 | "cell_type": "heading", | |
|
104 | "level": 2, | |
|
105 | "source": [ | |
|
106 | "Training set and testing set" | |
|
107 | ] | |
|
108 | }, | |
|
109 | { | |
|
110 | "cell_type": "markdown", | |
|
111 | "source": [ | |
|
112 | "Machine learning is about learning some properties of a data set", | |
|
113 | "and applying them to new data. This is why a common practice in", | |
|
114 | "machine learning to evaluate an algorithm is to split the data", | |
|
115 | "at hand in two sets, one that we call a training set on which", | |
|
116 | "we learn data properties, and one that we call a testing set,", | |
|
117 | "on which we test these properties." | |
|
118 | ] | |
|
119 | }, | |
|
120 | { | |
|
121 | "cell_type": "heading", | |
|
122 | "level": 2, | |
|
123 | "source": [ | |
|
124 | "Loading an example dataset" | |
|
125 | ] | |
|
126 | }, | |
|
127 | { | |
|
128 | "cell_type": "markdown", | |
|
129 | "source": [ | |
|
130 | "scikit-learn comes with a few standard datasets, for instance the", | |
|
131 | "iris and digits", | |
|
132 | "datasets for classification and the boston house prices dataset for regression.:" | |
|
133 | ] | |
|
134 | }, | |
|
135 | { | |
|
136 | "cell_type": "code", | |
|
137 | "collapsed": false, | |
|
138 | "input": [ | |
|
139 | "from sklearn import datasets", | |
|
140 | "iris = datasets.load_iris()", | |
|
141 | "digits = datasets.load_digits()" | |
|
142 | ], | |
|
143 | "language": "python", | |
|
144 | "outputs": [] | |
|
145 | }, | |
|
146 | { | |
|
147 | "cell_type": "markdown", | |
|
148 | "source": [ | |
|
149 | "A dataset is a dictionary-like object that holds all the data and some", | |
|
150 | "metadata about the data. This data is stored in the .data member,", | |
|
151 | "which is an n_samples, n_features array. In the case of a supervised", | |
|
152 | "problem, the response variables are stored in the .target member. More", | |
|
153 | "details on the different datasets can be found in the :ref:`dedicated", | |
|
154 | "section <datasets>`." | |
|
155 | ] | |
|
156 | }, | |
|
157 | { | |
|
158 | "cell_type": "markdown", | |
|
159 | "source": [ | |
|
160 | "For instance, in the case of the digits dataset, digits.data gives", | |
|
161 | "access to the features that can be used to classify the digits samples:" | |
|
162 | ] | |
|
163 | }, | |
|
164 | { | |
|
165 | "cell_type": "code", | |
|
166 | "collapsed": false, | |
|
167 | "input": [ | |
|
168 | "print digits.data # doctest: +NORMALIZE_WHITESPACE" | |
|
169 | ], | |
|
170 | "language": "python", | |
|
171 | "outputs": [] | |
|
172 | }, | |
|
173 | { | |
|
174 | "cell_type": "markdown", | |
|
175 | "source": [ | |
|
176 | "and digits.target gives the ground truth for the digit dataset, that", | |
|
177 | "is the number corresponding to each digit image that we are trying to", | |
|
178 | "learn:" | |
|
179 | ] | |
|
180 | }, | |
|
181 | { | |
|
182 | "cell_type": "code", | |
|
183 | "collapsed": false, | |
|
184 | "input": [ | |
|
185 | "digits.target" | |
|
186 | ], | |
|
187 | "language": "python", | |
|
188 | "outputs": [] | |
|
189 | }, | |
|
190 | { | |
|
191 | "cell_type": "heading", | |
|
192 | "level": 2, | |
|
193 | "source": [ | |
|
194 | "Shape of the data arrays" | |
|
195 | ] | |
|
196 | }, | |
|
197 | { | |
|
198 | "cell_type": "markdown", | |
|
199 | "source": [ | |
|
200 | "The data is always a 2D array, n_samples, n_features, although", | |
|
201 | "the original data may have had a different shape. In the case of the", | |
|
202 | "digits, each original sample is an image of shape 8, 8 and can be", | |
|
203 | "accessed using:" | |
|
204 | ] | |
|
205 | }, | |
|
206 | { | |
|
207 | "cell_type": "code", | |
|
208 | "collapsed": false, | |
|
209 | "input": [ | |
|
210 | "digits.images[0]" | |
|
211 | ], | |
|
212 | "language": "python", | |
|
213 | "outputs": [] | |
|
214 | }, | |
|
215 | { | |
|
216 | "cell_type": "markdown", | |
|
217 | "source": [ | |
|
218 | "The :ref:`simple example on this dataset", | |
|
219 | "<example_plot_digits_classification.py>` illustrates how starting", | |
|
220 | "from the original problem one can shape the data for consumption in", | |
|
221 | "the scikit-learn." | |
|
222 | ] | |
|
223 | }, | |
|
224 | { | |
|
225 | "cell_type": "heading", | |
|
226 | "level": 2, | |
|
227 | "source": [ | |
|
228 | "Learning and Predicting" | |
|
229 | ] | |
|
230 | }, | |
|
231 | { | |
|
232 | "cell_type": "markdown", | |
|
233 | "source": [ | |
|
234 | "In the case of the digits dataset, the task is to predict the value of a", | |
|
235 | "hand-written digit from an image. We are given samples of each of the 10", | |
|
236 | "possible classes on which we fit an", | |
|
237 | "estimator to be able to predict", | |
|
238 | "the labels corresponding to new data." | |
|
239 | ] | |
|
240 | }, | |
|
241 | { | |
|
242 | "cell_type": "markdown", | |
|
243 | "source": [ | |
|
244 | "In scikit-learn, an estimator is just a plain Python class that", | |
|
245 | "implements the methods fit(X, Y) and predict(T)." | |
|
246 | ] | |
|
247 | }, | |
|
248 | { | |
|
249 | "cell_type": "markdown", | |
|
250 | "source": [ | |
|
251 | "An example of estimator is the class sklearn.svm.SVC that", | |
|
252 | "implements Support Vector Classification. The", | |
|
253 | "constructor of an estimator takes as arguments the parameters of the", | |
|
254 | "model, but for the time being, we will consider the estimator as a black", | |
|
255 | "box:" | |
|
256 | ] | |
|
257 | }, | |
|
258 | { | |
|
259 | "cell_type": "code", | |
|
260 | "collapsed": false, | |
|
261 | "input": [ | |
|
262 | "from sklearn import svm", | |
|
263 | "clf = svm.SVC(gamma=0.001, C=100.)" | |
|
264 | ], | |
|
265 | "language": "python", | |
|
266 | "outputs": [] | |
|
267 | }, | |
|
268 | { | |
|
269 | "cell_type": "heading", | |
|
270 | "level": 2, | |
|
271 | "source": [ | |
|
272 | "Choosing the parameters of the model" | |
|
273 | ] | |
|
274 | }, | |
|
275 | { | |
|
276 | "cell_type": "markdown", | |
|
277 | "source": [ | |
|
278 | "In this example we set the value of gamma manually. It is possible", | |
|
279 | "to automatically find good values for the parameters by using tools", | |
|
280 | "such as :ref:`grid search <grid_search>` and :ref:`cross validation", | |
|
281 | "<cross_validation>`." | |
|
282 | ] | |
|
283 | }, | |
|
284 | { | |
|
285 | "cell_type": "markdown", | |
|
286 | "source": [ | |
|
287 | "We call our estimator instance clf as it is a classifier. It now must", | |
|
288 | "be fitted to the model, that is, it must learn from the model. This is", | |
|
289 | "done by passing our training set to the fit method. As a training", | |
|
290 | "set, let us use all the images of our dataset apart from the last", | |
|
291 | "one:" | |
|
292 | ] | |
|
293 | }, | |
|
294 | { | |
|
295 | "cell_type": "code", | |
|
296 | "collapsed": false, | |
|
297 | "input": [ | |
|
298 | "clf.fit(digits.data[:-1], digits.target[:-1])" | |
|
299 | ], | |
|
300 | "language": "python", | |
|
301 | "outputs": [] | |
|
302 | }, | |
|
303 | { | |
|
304 | "cell_type": "markdown", | |
|
305 | "source": [ | |
|
306 | "Now you can predict new values. In particular, we can ask the", | |
|
307 | "classifier what is the digit of our last image in the digits dataset,", | |
|
308 | "which we have not used to train the classifier:" | |
|
309 | ] | |
|
310 | }, | |
|
311 | { | |
|
312 | "cell_type": "code", | |
|
313 | "collapsed": false, | |
|
314 | "input": [ | |
|
315 | "clf.predict(digits.data[-1])" | |
|
316 | ], | |
|
317 | "language": "python", | |
|
318 | "outputs": [] | |
|
319 | }, | |
|
320 | { | |
|
321 | "cell_type": "markdown", | |
|
322 | "source": [ | |
|
323 | "The corresponding image is the following:" | |
|
324 | ] | |
|
325 | }, | |
|
326 | { | |
|
327 | "cell_type": "markdown", | |
|
328 | "source": [ | |
|
329 | "As you can see, it is a challenging task: the images are of poor", | |
|
330 | "resolution. Do you agree with the classifier?" | |
|
331 | ] | |
|
332 | }, | |
|
333 | { | |
|
334 | "cell_type": "markdown", | |
|
335 | "source": [ | |
|
336 | "A complete example of this classification problem is available as an", | |
|
337 | "example that you can run and study:", | |
|
338 | ":ref:`example_plot_digits_classification.py`." | |
|
339 | ] | |
|
340 | }, | |
|
341 | { | |
|
342 | "cell_type": "heading", | |
|
343 | "level": 2, | |
|
344 | "source": [ | |
|
345 | "Model persistence" | |
|
346 | ] | |
|
347 | }, | |
|
348 | { | |
|
349 | "cell_type": "markdown", | |
|
350 | "source": [ | |
|
351 | "It is possible to save a model in the scikit by using Python's built-in", | |
|
352 | "persistence model, namely pickle:" | |
|
353 | ] | |
|
354 | }, | |
|
355 | { | |
|
356 | "cell_type": "code", | |
|
357 | "collapsed": false, | |
|
358 | "input": [ | |
|
359 | "from sklearn import svm", | |
|
360 | "from sklearn import datasets", | |
|
361 | "clf = svm.SVC()", | |
|
362 | "iris = datasets.load_iris()", | |
|
363 | "X, y = iris.data, iris.target", | |
|
364 | "clf.fit(X, y)", | |
|
365 | "import pickle", | |
|
366 | "s = pickle.dumps(clf)", | |
|
367 | "clf2 = pickle.loads(s)", | |
|
368 | "clf2.predict(X[0])", | |
|
369 | "y[0]" | |
|
370 | ], | |
|
371 | "language": "python", | |
|
372 | "outputs": [] | |
|
373 | }, | |
|
374 | { | |
|
375 | "cell_type": "markdown", | |
|
376 | "source": [ | |
|
377 | "In the specific case of the scikit, it may be more interesting to use", | |
|
378 | "joblib's replacement of pickle (joblib.dump & joblib.load),", | |
|
379 | "which is more efficient on big data, but can only pickle to the disk", | |
|
380 | "and not to a string:" | |
|
381 | ] | |
|
382 | }, | |
|
383 | { | |
|
384 | "cell_type": "code", | |
|
385 | "collapsed": false, | |
|
386 | "input": [ | |
|
387 | "from sklearn.externals import joblib", | |
|
388 | "joblib.dump(clf, 'filename.pkl') # doctest: +SKIP" | |
|
389 | ], | |
|
390 | "language": "python", | |
|
391 | "outputs": [] | |
|
392 | } | |
|
393 | ] | |
|
394 | } | |
|
395 | ] | |
|
396 | } No newline at end of file |
|
1 | NO CONTENT: file renamed from tutorial.rst to tests/tutorial.rst.ref |
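The reference notebook above uses the nbformat 3 layout: a top-level `worksheets` list whose entries hold `cells`, with code kept under `input` and text under `source`. As a rough illustration of how such a file can be inspected, the sketch below parses a tiny hand-written notebook with the same keys and tallies its cell types. The sample JSON and the `summarize_cells` helper are assumptions made for this example and are not part of the changeset.

```python
import json
from collections import Counter

# Hand-written sample that mirrors the key names seen in the reference file.
notebook_json = """
{
 "metadata": {},
 "nbformat": 3,
 "worksheets": [
  {"cells": [
    {"cell_type": "heading", "level": 1, "source": ["Title"]},
    {"cell_type": "markdown", "source": ["Some text"]},
    {"cell_type": "code", "collapsed": false, "input": ["x = 1"],
     "language": "python", "outputs": []}
  ]}
 ]
}
"""


def summarize_cells(nb):
    """Return (cell-type counts, list of code lines) for an nbformat 3 dict."""
    counts = Counter()
    code_lines = []
    for worksheet in nb.get("worksheets", []):
        for cell in worksheet.get("cells", []):
            counts[cell["cell_type"]] += 1
            if cell["cell_type"] == "code":
                # In nbformat 3, code cells store their text under "input".
                code_lines.extend(cell.get("input", []))
    return counts, code_lines


nb = json.loads(notebook_json)
counts, code_lines = summarize_cells(nb)
```

A similar parse-then-compare step could make a reference check less sensitive to whitespace than a byte-for-byte comparison, although that is a design choice the test in this changeset does not make.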