@@ -0,0 +1,25 @@
import os
import errno
import subprocess
import nose.tools as nt

test_rst_fname = 'tests/tutorial.rst.ref'
ref_ipynb_fname = 'tests/tutorial.ipynb.ref'
test_generate_ipynb_fname = 'tests/tutorial.ipynb'


def clean_dir():
    "Remove generated ipynb file created during conversion"
    try:
        os.unlink(test_generate_ipynb_fname)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise


@nt.with_setup(clean_dir, clean_dir)
def test_command_line():
    with open(ref_ipynb_fname, 'rb') as f:
        ref_output = f.read()
    output = subprocess.check_output(['./rst2ipynb.py', test_rst_fname])
    nt.assert_equal(ref_output, output)
@@ -0,0 +1,396 @@
{
 "metadata": {},
 "nbformat": 3,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "source": [
      "An Introduction to machine learning with scikit-learn"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "source": [
      "Section contents"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "In this section, we introduce the machine learning",
      "vocabulary that we use through-out scikit-learn and give a",
      "simple learning example."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "source": [
      "Machine learning: the problem setting"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "In general, a learning problem considers a set of n",
      "samples of",
      "data and try to predict properties of unknown data. If each sample is",
      "more than a single number, and for instance a multi-dimensional entry",
      "(aka multivariate",
      "data), is it said to have several attributes,",
      "or features."
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "We can separate learning problems in a few large categories:"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "supervised learning,",
      "in which the data comes with additional attributes that we want to predict",
      "(:ref:`Click here <supervised-learning>`",
      "to go to the Scikit-Learn supervised learning page).This problem",
      "can be either:"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "classification:",
      "samples belong to two or more classes and we",
      "want to learn from already labeled data how to predict the class",
      "of unlabeled data. An example of classification problem would",
      "be the digit recognition example, in which the aim is to assign",
      "each input vector to one of a finite number of discrete",
      "categories."
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "regression:",
      "if the desired output consists of one or more",
      "continuous variables, then the task is called regression. An",
      "example of a regression problem would be the prediction of the",
      "length of a salmon as a function of its age and weight."
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "unsupervised learning,",
      "in which the training data consists of a set of input vectors x",
      "without any corresponding target values. The goal in such problems",
      "may be to discover groups of similar examples within the data, where",
      "it is called clustering,",
      "or to determine the distribution of data within the input space, known as",
      "density estimation, or",
      "to project the data from a high-dimensional space down to two or thee",
      "dimensions for the purpose of visualization",
      "(:ref:`Click here <unsupervised-learning>`",
      "to go to the Scikit-Learn unsupervised learning page)."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "source": [
      "Training set and testing set"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "Machine learning is about learning some properties of a data set",
      "and applying them to new data. This is why a common practice in",
      "machine learning to evaluate an algorithm is to split the data",
      "at hand in two sets, one that we call a training set on which",
      "we learn data properties, and one that we call a testing set,",
      "on which we test these properties."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "source": [
      "Loading an example dataset"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "scikit-learn comes with a few standard datasets, for instance the",
      "iris and digits",
      "datasets for classification and the boston house prices dataset for regression.:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import datasets",
      "iris = datasets.load_iris()",
      "digits = datasets.load_digits()"
     ],
     "language": "python",
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "source": [
      "A dataset is a dictionary-like object that holds all the data and some",
      "metadata about the data. This data is stored in the .data member,",
      "which is a n_samples, n_features array. In the case of supervised",
      "problem, explanatory variables are stored in the .target member. More",
      "details on the different datasets can be found in the :ref:`dedicated",
      "section <datasets>`."
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "For instance, in the case of the digits dataset, digits.data gives",
      "access to the features that can be used to classify the digits samples:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print digits.data # doctest: +NORMALIZE_WHITESPACE"
     ],
     "language": "python",
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "source": [
      "and digits.target gives the ground truth for the digit dataset, that",
      "is the number corresponding to each digit image that we are trying to",
      "learn:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "digits.target"
     ],
     "language": "python",
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "source": [
      "Shape of the data arrays"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "The data is always a 2D array, n_samples, n_features, although",
      "the original data may have had a different shape. In the case of the",
      "digits, each original sample is an image of shape 8, 8 and can be",
      "accessed using:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "digits.images[0]"
     ],
     "language": "python",
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "source": [
      "The :ref:`simple example on this dataset",
      "<example_plot_digits_classification.py>` illustrates how starting",
      "from the original problem one can shape the data for consumption in",
      "the scikit-learn."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "source": [
      "Learning and Predicting"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "In the case of the digits dataset, the task is to predict the value of a",
      "hand-written digit from an image. We are given samples of each of the 10",
      "possible classes on which we fit an",
      "estimator to be able to predict",
      "the labels corresponding to new data."
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "In scikit-learn, an estimator is just a plain Python class that",
      "implements the methods fit(X, Y) and predict(T)."
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "An example of estimator is the class sklearn.svm.SVC that",
      "implements Support Vector Classification. The",
      "constructor of an estimator takes as arguments the parameters of the",
      "model, but for the time being, we will consider the estimator as a black",
      "box:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import svm",
      "clf = svm.SVC(gamma=0.001, C=100.)"
     ],
     "language": "python",
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "source": [
      "Choosing the parameters of the model"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "In this example we set the value of gamma manually. It is possible",
      "to automatically find good values for the parameters by using tools",
      "such as :ref:`grid search <grid_search>` and :ref:`cross validation",
      "<cross_validation>`."
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "We call our estimator instance clf as it is a classifier. It now must",
      "be fitted to the model, that is, it must learn from the model. This is",
      "done by passing our training set to the fit method. As a training",
      "set, let us use all the images of our dataset apart from the last",
      "one:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf.fit(digits.data[:-1], digits.target[:-1])"
     ],
     "language": "python",
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "source": [
      "Now you can predict new values, in particular, we can ask to the",
      "classifier what is the digit of our last image in the digits dataset,",
      "which we have not used to train the classifier:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "clf.predict(digits.data[-1])"
     ],
     "language": "python",
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "source": [
      "The corresponding image is the following:"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "As you can see, it is a challenging task: the images are of poor",
      "resolution. Do you agree with the classifier?"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "A complete example of this classification problem is available as an",
      "example that you can run and study:",
      ":ref:`example_plot_digits_classification.py`."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "source": [
      "Model persistence"
     ]
    },
    {
     "cell_type": "markdown",
     "source": [
      "It is possible to save a model in the scikit by using Python's built-in",
      "persistence model, namely pickle:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn import svm",
      "from sklearn import datasets",
      "clf = svm.SVC()",
      "iris = datasets.load_iris()",
      "X, y = iris.data, iris.target",
      "clf.fit(X, y)",
      "import pickle",
      "s = pickle.dumps(clf)",
      "clf2 = pickle.loads(s)",
      "clf2.predict(X[0])",
      "y[0]"
     ],
     "language": "python",
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "source": [
      "In the specific case of the scikit, it may be more interesting to use",
      "joblib's replacement of pickle (joblib.dump & joblib.load),",
      "which is more efficient on big data, but can only pickle to the disk",
      "and not to a string:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.externals import joblib",
      "joblib.dump(clf, 'filename.pkl') # doctest: +SKIP"
     ],
     "language": "python",
     "outputs": []
    }
   ]
  }
 ]
}
\ No newline at end of file
NO CONTENT: file renamed from tutorial.rst to tests/tutorial.rst.ref