.. _introduction:

An introduction to machine learning with scikit-learn
=======================================================================

.. topic:: Section contents

    In this section, we introduce the `machine learning
    <http://en.wikipedia.org/wiki/Machine_learning>`_
    vocabulary that we use throughout `scikit-learn` and give a
    simple learning example.


Machine learning: the problem setting
---------------------------------------

In general, a learning problem considers a set of n
`samples <http://en.wikipedia.org/wiki/Sample_(statistics)>`_ of
data and then tries to predict properties of unknown data. If each sample is
more than a single number and, for instance, a multi-dimensional entry
(aka `multivariate <http://en.wikipedia.org/wiki/Multivariate_random_variable>`_
data), it is said to have several attributes,
or **features**.

We can separate learning problems in a few large categories (a short
sketch after this list illustrates each one):

* `supervised learning <http://en.wikipedia.org/wiki/Supervised_learning>`_,
  in which the data comes with additional attributes that we want to predict
  (:ref:`Click here <supervised-learning>`
  to go to the scikit-learn supervised learning page). This problem
  can be either:

  * `classification
    <http://en.wikipedia.org/wiki/Classification_in_machine_learning>`_:
    samples belong to two or more classes and we
    want to learn from already labeled data how to predict the class
    of unlabeled data. An example of a classification problem would
    be the digit recognition example, in which the aim is to assign
    each input vector to one of a finite number of discrete
    categories.

  * `regression <http://en.wikipedia.org/wiki/Regression_analysis>`_:
    if the desired output consists of one or more
    continuous variables, then the task is called *regression*. An
    example of a regression problem would be the prediction of the
    length of a salmon as a function of its age and weight.

* `unsupervised learning <http://en.wikipedia.org/wiki/Unsupervised_learning>`_,
  in which the training data consists of a set of input vectors x
  without any corresponding target values. The goal in such problems
  may be to discover groups of similar examples within the data, where
  it is called `clustering <http://en.wikipedia.org/wiki/Cluster_analysis>`_,
  or to determine the distribution of data within the input space, known as
  `density estimation <http://en.wikipedia.org/wiki/Density_estimation>`_, or
  to project the data from a high-dimensional space down to two or three
  dimensions for the purpose of *visualization*
  (:ref:`Click here <unsupervised-learning>`
  to go to the scikit-learn unsupervised learning page).
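
For concreteness, here is a minimal sketch pairing each category with a
scikit-learn estimator; the particular classes chosen below are only
illustrative examples of each family::

    >>> from sklearn import svm, linear_model, cluster
    >>> clf = svm.SVC()                        # classification (supervised)
    >>> reg = linear_model.LinearRegression()  # regression (supervised)
    >>> km = cluster.KMeans(n_clusters=3)      # clustering (unsupervised)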
59 | ||||
|
60 | .. topic:: Training set and testing set | |||
|
61 | ||||
|
62 | Machine learning is about learning some properties of a data set | |||
|
63 | and applying them to new data. This is why a common practice in | |||
|
64 | machine learning to evaluate an algorithm is to split the data | |||
|
65 | at hand in two sets, one that we call a **training set** on which | |||
|
66 | we learn data properties, and one that we call a **testing set**, | |||
|
67 | on which we test these properties. | |||
|
68 | ||||
|
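A minimal sketch of such a split, using plain NumPy slicing on the
digits dataset that the next section introduces properly; the 90/10
ratio below is an arbitrary illustrative choice::

    >>> from sklearn import datasets
    >>> digits = datasets.load_digits()
    >>> n_train = int(0.9 * len(digits.data))   # keep 90% for training
    >>> X_train, y_train = digits.data[:n_train], digits.target[:n_train]
    >>> X_test, y_test = digits.data[n_train:], digits.target[n_train:]
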
.. _loading_example_dataset:

Loading an example dataset
--------------------------

`scikit-learn` comes with a few standard datasets, for instance the
`iris <http://en.wikipedia.org/wiki/Iris_flower_data_set>`_ and `digits
<http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits>`_
datasets for classification and the `Boston house prices dataset
<http://archive.ics.uci.edu/ml/datasets/Housing>`_ for regression::

    >>> from sklearn import datasets
    >>> iris = datasets.load_iris()
    >>> digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some
metadata about the data. This data is stored in the ``.data`` member,
which is a ``n_samples, n_features`` array. In the case of a supervised
problem, the response variables to predict are stored in the ``.target``
member. More details on the different datasets can be found in the
:ref:`dedicated section <datasets>`.
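
As a quick check, one can inspect these members; assuming the ``iris``
and ``digits`` objects loaded above, the shapes shown are those of the
bundled datasets::

    >>> iris.data.shape
    (150, 4)
    >>> digits.data.shape
    (1797, 64)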
90 | ||||
|
91 | For instance, in the case of the digits dataset, ``digits.data`` gives | |||
|
92 | access to the features that can be used to classify the digits samples:: | |||
|
93 | ||||
|
94 | >>> print digits.data # doctest: +NORMALIZE_WHITESPACE | |||
|
95 | [[ 0. 0. 5. ..., 0. 0. 0.] | |||
|
96 | [ 0. 0. 0. ..., 10. 0. 0.] | |||
|
97 | [ 0. 0. 0. ..., 16. 9. 0.] | |||
|
98 | ..., | |||
|
99 | [ 0. 0. 1. ..., 6. 0. 0.] | |||
|
100 | [ 0. 0. 2. ..., 12. 0. 0.] | |||
|
101 | [ 0. 0. 10. ..., 12. 1. 0.]] | |||
|
102 | ||||
|
and ``digits.target`` gives the ground truth for the digits dataset, that
is, the number corresponding to each digit image that we are trying to
learn::

    >>> digits.target
    array([0, 1, 2, ..., 8, 9, 8])

.. topic:: Shape of the data arrays

    The data is always a 2D array, ``n_samples, n_features``, although
    the original data may have had a different shape. In the case of the
    digits, each original sample is an image of shape ``8, 8`` and can be
    accessed using::

        >>> digits.images[0]
        array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
               [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
               [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
               [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
               [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
               [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
               [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
               [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])

    The :ref:`simple example on this dataset
    <example_plot_digits_classification.py>` illustrates how, starting
    from the original problem, one can shape the data for consumption in
    scikit-learn, as sketched below.

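A minimal sketch of that reshaping step, flattening each ``8, 8`` image
of the ``digits`` dataset loaded above into a 64-dimensional feature
vector (the variable names here are our own choice for illustration)::

    >>> n_samples = len(digits.images)
    >>> data = digits.images.reshape((n_samples, -1))
    >>> data.shape
    (1797, 64)
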
Learning and predicting
------------------------

In the case of the digits dataset, the task is to predict the value of a
hand-written digit from an image. We are given samples of each of the 10
possible classes, on which we *fit* an
`estimator <http://en.wikipedia.org/wiki/Estimator>`_ to be able to *predict*
the labels corresponding to new data.

In `scikit-learn`, an **estimator** is just a plain Python object that
implements the methods ``fit(X, y)`` and ``predict(T)``.

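To make that interface concrete, here is a minimal and deliberately naive
sketch of such an object; the class name and its constant-prediction
strategy are purely illustrative and not part of scikit-learn::

    >>> import numpy as np
    >>> class MajorityClassifier(object):
    ...     """Predict the most frequent label seen during fit."""
    ...     def fit(self, X, y):
    ...         self.majority_ = np.bincount(y).argmax()
    ...         return self
    ...     def predict(self, T):
    ...         return np.repeat(self.majority_, len(T))
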
An example of an estimator is the class ``sklearn.svm.SVC``, which
implements `Support Vector Classification
<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The
constructor of an estimator takes as arguments the parameters of the
model, but for the time being, we will consider the estimator as a black
box::

    >>> from sklearn import svm
    >>> clf = svm.SVC(gamma=0.001, C=100.)

.. topic:: Choosing the parameters of the model

    In this example we set the value of ``gamma`` manually. It is possible
    to automatically find good values for the parameters by using tools
    such as :ref:`grid search <grid_search>` and :ref:`cross validation
    <cross_validation>`, as sketched below.

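A minimal sketch of such a search, assuming a scikit-learn version where
``GridSearchCV`` lives in ``sklearn.model_selection`` (older releases ship
it in another module); the candidate grid below is an arbitrary
illustration::

    >>> from sklearn.model_selection import GridSearchCV  # doctest: +SKIP
    >>> param_grid = {'gamma': [0.0001, 0.001, 0.01], 'C': [1., 100.]}
    >>> search = GridSearchCV(svm.SVC(), param_grid)       # doctest: +SKIP
    >>> search.fit(digits.data, digits.target)             # doctest: +SKIP
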
We call our estimator instance `clf`, as it is a classifier. It now must
be fitted to the data, that is, it must *learn* from the data. This is
done by passing our training set to the ``fit`` method. As a training
set, let us use all the images of our dataset apart from the last
one::

    >>> clf.fit(digits.data[:-1], digits.target[:-1])
    SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
      gamma=0.001, kernel='rbf', probability=False, scale_C=True,
      shrinking=True, tol=0.001)

172 | ||||
|
173 | Now you can predict new values, in particular, we can ask to the | |||
|
174 | classifier what is the digit of our last image in the `digits` dataset, | |||
|
175 | which we have not used to train the classifier:: | |||
|
176 | ||||
|
177 | >>> clf.predict(digits.data[-1]) | |||
|
178 | array([ 8.]) | |||
|
179 | ||||
|
The corresponding image is the following:

.. image:: ../../auto_examples/tutorial/images/plot_digits_last_image_1.png
    :target: ../../auto_examples/tutorial/plot_digits_last_image.html
    :align: center
    :scale: 50

As you can see, it is a challenging task: the images are of poor
resolution. Do you agree with the classifier?
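
If you want to look at that image yourself, a minimal sketch using
``matplotlib`` (assumed to be installed; it is not required by
scikit-learn itself) is::

    >>> import matplotlib.pyplot as plt                    # doctest: +SKIP
    >>> plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)  # doctest: +SKIP
    >>> plt.show()                                         # doctest: +SKIP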
189 | ||||
|
190 | A complete example of this classification problem is available as an | |||
|
191 | example that you can run and study: | |||
|
192 | :ref:`example_plot_digits_classification.py`. | |||
|
193 | ||||
|
194 | ||||
|
Model persistence
-----------------

It is possible to save a model in scikit-learn by using Python's built-in
persistence module, namely `pickle <http://docs.python.org/library/pickle.html>`_::

    >>> from sklearn import svm
    >>> from sklearn import datasets
    >>> clf = svm.SVC()
    >>> iris = datasets.load_iris()
    >>> X, y = iris.data, iris.target
    >>> clf.fit(X, y)
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.25,
      kernel='rbf', probability=False, scale_C=True, shrinking=True, tol=0.001)

    >>> import pickle
    >>> s = pickle.dumps(clf)
    >>> clf2 = pickle.loads(s)
    >>> clf2.predict(X[0])
    array([ 0.])
    >>> y[0]
    0

In the specific case of scikit-learn, it may be more interesting to use
joblib's replacement for pickle (``joblib.dump`` & ``joblib.load``),
which is more efficient on big data but can only pickle to the disk
and not to a string::

    >>> from sklearn.externals import joblib
    >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
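
Later, you can load back the pickled model, possibly in another Python
process (assuming the same file path as above)::

    >>> clf = joblib.load('filename.pkl') # doctest: +SKIP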