upstream/ipython Commit - r6236:5b22c19d

.. _introduction:

An Introduction to machine learning with scikit-learn
=======================================================================

.. topic:: Section contents

    In this section, we introduce the `machine learning
    <http://en.wikipedia.org/wiki/Machine_learning>`_
    vocabulary that we use throughout `scikit-learn` and give a
    simple learning example.

Machine learning: the problem setting
---------------------------------------

In general, a learning problem considers a set of n
`samples <http://en.wikipedia.org/wiki/Sample_(statistics)>`_ of
data and tries to predict properties of unknown data. If each sample is
more than a single number, for instance a multi-dimensional entry
(aka `multivariate <http://en.wikipedia.org/wiki/Multivariate_random_variable>`_
data), it is said to have several attributes,
or **features**.

We can separate learning problems into a few large categories:

* `supervised learning <http://en.wikipedia.org/wiki/Supervised_learning>`_,
  in which the data comes with additional attributes that we want to predict
  (:ref:`Click here <supervised-learning>`
  to go to the Scikit-Learn supervised learning page). This problem
  can be either:

  * `classification
    <http://en.wikipedia.org/wiki/Classification_in_machine_learning>`_:
    samples belong to two or more classes and we
    want to learn from already labeled data how to predict the class
    of unlabeled data. An example of a classification problem would
    be the digit recognition example, in which the aim is to assign
    each input vector to one of a finite number of discrete
    categories.

  * `regression <http://en.wikipedia.org/wiki/Regression_analysis>`_:
    if the desired output consists of one or more
    continuous variables, then the task is called *regression*. An
    example of a regression problem would be the prediction of the
    length of a salmon as a function of its age and weight.

* `unsupervised learning <http://en.wikipedia.org/wiki/Unsupervised_learning>`_,
  in which the training data consists of a set of input vectors x
  without any corresponding target values. The goal in such problems
  may be to discover groups of similar examples within the data, which
  is called `clustering <http://en.wikipedia.org/wiki/Cluster_analysis>`_,
  or to determine the distribution of data within the input space, known as
  `density estimation <http://en.wikipedia.org/wiki/Density_estimation>`_, or
  to project the data from a high-dimensional space down to two or three
  dimensions for the purpose of *visualization*
  (:ref:`Click here <unsupervised-learning>`
  to go to the Scikit-Learn unsupervised learning page).

.. topic:: Training set and testing set

    Machine learning is about learning some properties of a data set
    and applying them to new data. This is why a common practice for
    evaluating an algorithm in machine learning is to split the data
    at hand into two sets: one that we call a **training set**, on which
    we learn data properties, and one that we call a **testing set**,
    on which we test these properties.
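
The split described above can be sketched in a few lines. This is only a
minimal illustration using NumPy; the toy data, the 25% test fraction, the
fixed seed and the names ``X_train`` and ``X_test`` are assumptions made for
the example and are not prescribed by this tutorial.

```python
import numpy as np

# Toy data: 20 samples with 3 features each, and one label per sample.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.randint(0, 2, size=20)

# Shuffle the sample indices so the split does not depend on data order.
indices = rng.permutation(len(X))
n_test = len(X) // 4  # hold out 25% of the samples (an arbitrary choice)

test_idx, train_idx = indices[:n_test], indices[n_test:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```

The estimator would then be fitted on ``X_train, y_train`` only, and its
predictions compared against ``y_test``.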

.. _loading_example_dataset:

Loading an example dataset
--------------------------

`scikit-learn` comes with a few standard datasets, for instance the
`iris <http://en.wikipedia.org/wiki/Iris_flower_data_set>`_ and `digits
<http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits>`_
datasets for classification and the `boston house prices dataset
<http://archive.ics.uci.edu/ml/datasets/Housing>`_ for regression::

    >>> from sklearn import datasets
    >>> iris = datasets.load_iris()
    >>> digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some
metadata about the data. This data is stored in the ``.data`` member,
which is a ``n_samples, n_features`` array. In the case of a supervised
problem, the response variables (the values to predict) are stored in the
``.target`` member. More
details on the different datasets can be found in the :ref:`dedicated
section <datasets>`.
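
As a quick check of the layout described above, the following sketch loads
the digits dataset and inspects the shapes of its two main members. It
assumes a working `scikit-learn` installation; nothing here goes beyond the
``.data`` and ``.target`` members already mentioned.

```python
from sklearn import datasets

digits = datasets.load_digits()

# Each row of ``data`` is one sample; each column is one feature.
n_samples, n_features = digits.data.shape
print(n_samples, n_features)

# ``target`` holds one value to predict per sample.
print(digits.target.shape)

# The object is dictionary-like: members are also reachable by key.
same_object = digits["data"] is digits.data
print(same_object)
```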

For instance, in the case of the digits dataset, ``digits.data`` gives
access to the features that can be used to classify the digits samples::

    >>> print(digits.data)  # doctest: +NORMALIZE_WHITESPACE
    [[  0.   0.   5. ...,   0.   0.   0.]
     [  0.   0.   0. ...,  10.   0.   0.]
     [  0.   0.   0. ...,  16.   9.   0.]
     ...,
     [  0.   0.   1. ...,   6.   0.   0.]
     [  0.   0.   2. ...,  12.   0.   0.]
     [  0.   0.  10. ...,  12.   1.   0.]]

and ``digits.target`` gives the ground truth for the digit dataset, that
is, the number corresponding to each digit image that we are trying to
learn::

    >>> digits.target
    array([0, 1, 2, ..., 8, 9, 8])

.. topic:: Shape of the data arrays

    The data is always a 2D array, ``n_samples, n_features``, although
    the original data may have had a different shape. In the case of the
    digits, each original sample is an image of shape ``8, 8`` and can be
    accessed using::

      >>> digits.images[0]
      array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
             [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
             [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
             [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
             [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
             [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
             [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
             [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])

    The :ref:`simple example on this dataset
    <example_plot_digits_classification.py>` illustrates how, starting
    from the original problem, one can shape the data for consumption in
    `scikit-learn`.
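
Going from a stack of images to the 2D ``n_samples, n_features`` layout is a
single reshape. The sketch below uses a small made-up array as a stand-in
for ``digits.images`` (the array contents and the name ``images`` are
assumptions for illustration), so that it runs with NumPy alone.

```python
import numpy as np

# A stand-in for ``digits.images``: 5 hypothetical images of 8 x 8 pixels.
images = np.arange(5 * 8 * 8, dtype=float).reshape(5, 8, 8)

# Flatten each image into one row of 64 features.
n_samples = len(images)
data = images.reshape((n_samples, -1))

print(data.shape)

# The first row holds the pixels of the first image, read row by row.
first_row_matches = np.array_equal(data[0], images[0].ravel())
print(first_row_matches)
```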

Learning and Predicting
------------------------

In the case of the digits dataset, the task is to predict the value of a
hand-written digit from an image. We are given samples of each of the 10
possible classes, on which we *fit* an
`estimator <http://en.wikipedia.org/wiki/Estimator>`_ to be able to *predict*
the labels corresponding to new data.

In `scikit-learn`, an **estimator** is just a plain Python class that
implements the methods ``fit(X, Y)`` and ``predict(T)``.
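
To make the ``fit``/``predict`` contract concrete, here is a deliberately
trivial estimator. ``MajorityClassifier`` is a hypothetical class written for
this illustration and is not part of `scikit-learn`; it only shows the shape
of the interface, under the assumption that ``X`` and ``T`` are 2D arrays
and ``Y`` holds one label per row of ``X``.

```python
import numpy as np


class MajorityClassifier:
    """Toy estimator (hypothetical, not part of scikit-learn)."""

    def fit(self, X, Y):
        # "Learning" here is just remembering the most frequent label.
        labels, counts = np.unique(Y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]
        return self

    def predict(self, T):
        # Predict the remembered label for every row of T.
        return np.full(len(T), self.majority_)


X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([1, 0, 1, 1])

clf = MajorityClassifier().fit(X, Y)
predictions = clf.predict(np.array([[10.0], [20.0]]))
print(predictions)
```

Real estimators learn far more than a single label, but they are driven in
the same way: parameters go to the constructor, data goes to ``fit``, and new
samples go to ``predict``.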

An example of an estimator is the class ``sklearn.svm.SVC``, which
implements `Support Vector Classification
<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The
constructor of an estimator takes as arguments the parameters of the
model, but for the time being, we will consider the estimator as a black
box::

    >>> from sklearn import svm
    >>> clf = svm.SVC(gamma=0.001, C=100.)

.. topic:: Choosing the parameters of the model

    In this example we set the value of ``gamma`` manually. It is possible
    to automatically find good values for the parameters by using tools
    such as :ref:`grid search <grid_search>` and :ref:`cross validation
    <cross_validation>`.
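
A grid search over ``gamma`` and ``C`` might look like the sketch below. It
assumes a recent `scikit-learn` release in which ``GridSearchCV`` lives in
``sklearn.model_selection`` (the import path may differ in older versions);
the candidate values, the 500-sample subset and the 3-fold setting are
arbitrary choices for illustration, not recommendations.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
# Use a subset of the samples to keep the search quick.
X, y = digits.data[:500], digits.target[:500]

# Candidate values are illustrative only.
param_grid = {"gamma": [0.0001, 0.001, 0.01], "C": [1.0, 100.0]}

# Each parameter combination is scored with 3-fold cross validation.
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```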

We call our estimator instance ``clf``, as it is a classifier. It now must
be fitted to the data, that is, it must *learn* from the data. This is
done by passing our training set to the ``fit`` method. As a training
set, let us use all the images of our dataset apart from the last
one::

    >>> clf.fit(digits.data[:-1], digits.target[:-1])
    SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
      gamma=0.001, kernel='rbf', probability=False, scale_C=True,
      shrinking=True, tol=0.001)

Now we can predict new values. In particular, we can ask the
classifier which digit is shown in the last image of the ``digits`` dataset,
which we have not used to train the classifier::

    >>> clf.predict(digits.data[-1:])
    array([ 8.])

The corresponding image is the following:

.. image:: ../../auto_examples/tutorial/images/plot_digits_last_image_1.png
    :target: ../../auto_examples/tutorial/plot_digits_last_image.html
    :align: center
    :scale: 50

As you can see, it is a challenging task: the images are of poor
resolution. Do you agree with the classifier?

A complete example of this classification problem is available as an
example that you can run and study:
:ref:`example_plot_digits_classification.py`.
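
A single prediction says little about how well the classifier works. As a
rough sketch (not the linked example itself), one can train on the first half
of the digits and measure the fraction of correct predictions on the second
half. The half-and-half split and the name ``accuracy`` are assumptions made
for this illustration.

```python
from sklearn import datasets, svm

digits = datasets.load_digits()
n_samples = len(digits.data)
half = n_samples // 2

# Fit on the first half of the samples only.
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:half], digits.target[:half])

# Predict the second half, which the classifier has never seen.
predicted = clf.predict(digits.data[half:])

# Fraction of held-out samples whose predicted digit matches the truth.
accuracy = (predicted == digits.target[half:]).mean()
print(accuracy)
```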

Model persistence
-----------------

It is possible to save a model in `scikit-learn` by using Python's built-in
persistence module, namely `pickle <http://docs.python.org/library/pickle.html>`_::

    >>> from sklearn import svm
    >>> from sklearn import datasets
    >>> clf = svm.SVC()
    >>> iris = datasets.load_iris()
    >>> X, y = iris.data, iris.target
    >>> clf.fit(X, y)
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.25,
      kernel='rbf', probability=False, scale_C=True, shrinking=True, tol=0.001)

    >>> import pickle
    >>> s = pickle.dumps(clf)
    >>> clf2 = pickle.loads(s)
    >>> clf2.predict(X[0:1])
    array([ 0.])
    >>> y[0]
    0

In the specific case of `scikit-learn`, it may be more interesting to use
joblib's replacement of pickle (``joblib.dump`` & ``joblib.load``),
which is more efficient on big data, but can only pickle to the disk
and not to a string::

    >>> from sklearn.externals import joblib
    >>> joblib.dump(clf, 'filename.pkl')  # doctest: +SKIP
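
A full save-and-reload round trip with joblib could be sketched as follows.
This assumes the standalone ``joblib`` package is importable, since recent
`scikit-learn` releases no longer ship ``sklearn.externals.joblib``; the
temporary directory is an assumption made so that the example leaves no file
behind.

```python
import os
import tempfile

import joblib  # assumption: the standalone joblib package is installed
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC().fit(X, y)

# joblib writes to a file on disk, so use a temporary directory here.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "filename.pkl")
    joblib.dump(clf, path)
    clf2 = joblib.load(path)

# The reloaded model should give the same predictions as the original.
same = (clf2.predict(X) == clf.predict(X)).all()
print(same)
```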

