.. _introduction:

An Introduction to machine learning with scikit-learn
=======================================================================

.. topic:: Section contents

    In this section, we introduce the `machine learning
    <http://en.wikipedia.org/wiki/Machine_learning>`_
    vocabulary that we use throughout `scikit-learn` and give a
    simple learning example.


Machine learning: the problem setting
---------------------------------------

In general, a learning problem considers a set of n
`samples <http://en.wikipedia.org/wiki/Sample_(statistics)>`_ of
data and then tries to predict properties of unknown data. If each sample is
more than a single number, for instance a multi-dimensional entry
(aka `multivariate <http://en.wikipedia.org/wiki/Multivariate_random_variable>`_
data), it is said to have several attributes,
or **features**.

We can separate learning problems into a few large categories:

* `supervised learning <http://en.wikipedia.org/wiki/Supervised_learning>`_,
  in which the data comes with additional attributes that we want to predict
  (:ref:`Click here <supervised-learning>`
  to go to the scikit-learn supervised learning page). This problem
  can be either:

  * `classification
    <http://en.wikipedia.org/wiki/Classification_in_machine_learning>`_:
    samples belong to two or more classes and we
    want to learn from already labeled data how to predict the class
    of unlabeled data. An example of a classification problem would
    be handwritten digit recognition, in which the aim is to assign
    each input vector to one of a finite number of discrete
    categories.

  * `regression <http://en.wikipedia.org/wiki/Regression_analysis>`_:
    if the desired output consists of one or more
    continuous variables, then the task is called *regression*. An
    example of a regression problem would be the prediction of the
    length of a salmon as a function of its age and weight.

* `unsupervised learning <http://en.wikipedia.org/wiki/Unsupervised_learning>`_,
  in which the training data consists of a set of input vectors x
  without any corresponding target values. The goal in such problems
  may be to discover groups of similar examples within the data, in which
  case it is called `clustering <http://en.wikipedia.org/wiki/Cluster_analysis>`_,
  or to determine the distribution of data within the input space, known as
  `density estimation <http://en.wikipedia.org/wiki/Density_estimation>`_, or
  to project the data from a high-dimensional space down to two or three
  dimensions for the purpose of *visualization*
  (:ref:`Click here <unsupervised-learning>`
  to go to the scikit-learn unsupervised learning page).

.. topic:: Training set and testing set

    Machine learning is about learning some properties of a data set
    and applying them to new data. This is why a common practice in
    machine learning to evaluate an algorithm is to split the data
    at hand into two sets, one that we call a **training set**, on which
    we learn data properties, and one that we call a **testing set**,
    on which we test these properties. A minimal sketch of such a
    split is given below.

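As an illustration only (the array contents, the variable names and the
9/1 split ratio are all arbitrary choices, not part of the original
tutorial), such a split can be done with plain NumPy slicing::

    >>> import numpy as np
    >>> X = np.arange(20).reshape((10, 2))   # hypothetical data: 10 samples, 2 features
    >>> y = np.arange(10)                    # hypothetical labels, one per sample
    >>> X_train, X_test = X[:9], X[9:]       # first 9 samples for training
    >>> y_train, y_test = y[:9], y[9:]       # last sample held out for testing
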
.. _loading_example_dataset:

Loading an example dataset
--------------------------

`scikit-learn` comes with a few standard datasets, for instance the
`iris <http://en.wikipedia.org/wiki/Iris_flower_data_set>`_ and `digits
<http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits>`_
datasets for classification and the `Boston house prices dataset
<http://archive.ics.uci.edu/ml/datasets/Housing>`_ for regression::

    >>> from sklearn import datasets
    >>> iris = datasets.load_iris()
    >>> digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some
metadata about the data. This data is stored in the ``.data`` member,
which is a ``n_samples, n_features`` array. In the case of supervised
problems, one or more response variables are stored in the ``.target``
member. More details on the different datasets can be found in the
:ref:`dedicated section <datasets>`.

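The shapes of these arrays make the layout explicit (a quick check added
here for illustration; the digits dataset has 1797 samples with 64
features each)::

    >>> digits.data.shape
    (1797, 64)
    >>> digits.target.shape
    (1797,)
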
For instance, in the case of the digits dataset, ``digits.data`` gives
access to the features that can be used to classify the digits samples::

    >>> print(digits.data)  # doctest: +NORMALIZE_WHITESPACE
    [[  0.   0.   5. ...,   0.   0.   0.]
     [  0.   0.   0. ...,  10.   0.   0.]
     [  0.   0.   0. ...,  16.   9.   0.]
     ...,
     [  0.   0.   1. ...,   6.   0.   0.]
     [  0.   0.   2. ...,  12.   0.   0.]
     [  0.   0.  10. ...,  12.   1.   0.]]

and ``digits.target`` gives the ground truth for the digits dataset, that
is the number corresponding to each digit image that we are trying to
learn::

    >>> digits.target
    array([0, 1, 2, ..., 8, 9, 8])

.. topic:: Shape of the data arrays

    The data is always a 2D array, `n_samples, n_features`, although
    the original data may have had a different shape. In the case of the
    digits, each original sample is an image of shape `8, 8` and can be
    accessed using::

        >>> digits.images[0]
        array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
               [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
               [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
               [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
               [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
               [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
               [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
               [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])

    The :ref:`simple example on this dataset
    <example_plot_digits_classification.py>` illustrates how, starting
    from the original problem, one can shape the data for consumption in
    `scikit-learn`.

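    A brief sketch of that reshaping step (added here for illustration;
    the ``-1`` lets NumPy infer the number of samples)::

        >>> data = digits.images.reshape((digits.images.shape[0], -1))
        >>> data.shape
        (1797, 64)
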

Learning and predicting
------------------------

In the case of the digits dataset, the task is to predict the value of a
hand-written digit from an image. We are given samples of each of the 10
possible classes, on which we *fit* an
`estimator <http://en.wikipedia.org/wiki/Estimator>`_ to be able to *predict*
the labels corresponding to new data.

In `scikit-learn`, an **estimator** is just a plain Python class that
implements the methods ``fit(X, y)`` and ``predict(T)``.

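To make this concrete, here is a toy estimator written for this section
only (it is not part of scikit-learn): it implements ``fit`` and
``predict`` by always predicting the most frequent training label::

    >>> import numpy as np
    >>> class MajorityClassifier(object):
    ...     """Toy estimator: predicts the most common training label."""
    ...     def fit(self, X, y):
    ...         self.majority_ = np.bincount(y).argmax()  # most frequent label
    ...         return self
    ...     def predict(self, T):
    ...         return np.repeat(self.majority_, len(T))  # one prediction per row
    >>> toy = MajorityClassifier().fit(np.zeros((3, 2)), np.array([0, 1, 1]))
    >>> toy.predict(np.zeros((2, 2)))
    array([1, 1])
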
An example of an estimator is the class ``sklearn.svm.SVC``, which
implements `Support Vector Classification
<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The
constructor of an estimator takes as arguments the parameters of the
model, but for the time being, we will consider the estimator as a black
box::

    >>> from sklearn import svm
    >>> clf = svm.SVC(gamma=0.001, C=100.)

.. topic:: Choosing the parameters of the model

    In this example we set the value of ``gamma`` manually. It is possible
    to automatically find good values for the parameters by using tools
    such as :ref:`grid search <grid_search>` and :ref:`cross validation
    <cross_validation>`, as sketched just below.

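A rough sketch of such an automatic search (hedged: the grid values are
arbitrary, and the import path of ``GridSearchCV`` has changed across
scikit-learn versions, so it may need adjusting)::

    >>> from sklearn.grid_search import GridSearchCV   # doctest: +SKIP
    >>> param_grid = {'gamma': [0.0001, 0.001, 0.01],
    ...               'C': [1., 10., 100.]}
    >>> search = GridSearchCV(svm.SVC(), param_grid)   # doctest: +SKIP
    >>> search.fit(digits.data, digits.target)         # doctest: +SKIP
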
We call our estimator instance `clf`, as it is a classifier. It now must
be fitted to the data, that is, it must *learn* from the data. This is
done by passing our training set to the ``fit`` method. As a training
set, let us use all the images of our dataset apart from the last
one::

    >>> clf.fit(digits.data[:-1], digits.target[:-1])
    SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
      gamma=0.001, kernel='rbf', probability=False, scale_C=True,
      shrinking=True, tol=0.001)

Now you can predict new values. In particular, we can ask the classifier
what the digit of our last image in the `digits` dataset is, an image
which we have not used to train the classifier::

    >>> clf.predict(digits.data[-1])
    array([ 8.])

The corresponding image is the following:

.. image:: ../../auto_examples/tutorial/images/plot_digits_last_image_1.png
    :target: ../../auto_examples/tutorial/plot_digits_last_image.html
    :align: center
    :scale: 50

As you can see, it is a challenging task: the images are of poor
resolution. Do you agree with the classifier?

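One quick way to check (a one-liner added for illustration) is to compare
the prediction with the stored ground truth for that image::

    >>> digits.target[-1]
    8
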
A complete example of this classification problem is available as an
example that you can run and study:
:ref:`example_plot_digits_classification.py`.


Model persistence
-----------------

It is possible to save a model in scikit-learn by using Python's built-in
persistence module, namely `pickle <http://docs.python.org/library/pickle.html>`_::

    >>> from sklearn import svm
    >>> from sklearn import datasets
    >>> clf = svm.SVC()
    >>> iris = datasets.load_iris()
    >>> X, y = iris.data, iris.target
    >>> clf.fit(X, y)
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.25,
      kernel='rbf', probability=False, scale_C=True, shrinking=True, tol=0.001)

    >>> import pickle
    >>> s = pickle.dumps(clf)
    >>> clf2 = pickle.loads(s)
    >>> clf2.predict(X[0])
    array([ 0.])
    >>> y[0]
    0

In the specific case of scikit-learn, it may be more interesting to use
joblib's replacement for pickle (``joblib.dump`` & ``joblib.load``),
which is more efficient on big data but can only pickle to the disk
and not to a string::

    >>> from sklearn.externals import joblib
    >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP

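Later you can load the pickled model back, possibly in another Python
process (a short counterpart added for completeness; it assumes the
``filename.pkl`` written above exists on disk)::

    >>> clf = joblib.load('filename.pkl') # doctest: +SKIP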