##// END OF EJS Templates
update ref pyfile
Matthias BUSSONNIER -
Show More
@@ -1,1181 +1,1181 b''
1 1 ## An Introduction to the Scientific Python Ecosystem
2 2
3 3 # While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary length integers, etc) and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library have facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulations (an essential building block of virtually all technical computing), nor any data visualization facilities.
4 4 #
5 5 # In particular, Python lists are very flexible containers that can be nested arbitrarily deep and which can hold any Python object in them, but they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices. In contrast, much of our modern heritage of scientific computing has been built on top of libraries written in the Fortran language, which has native support for vectors and matrices as well as a library of mathematical functions that can efficiently operate on entire arrays at once.
6 6
7 7 ### Scientific Python: a collaboration of projects built by scientists
8 8
9 9 # The scientific community has developed a set of related Python libraries that provide powerful array facilities, linear algebra, numerical algorithms, data visualization and more. In this appendix, we will briefly outline the tools most frequently used for this purpose, that make "Scientific Python" something far more powerful than the Python language alone.
10 10 #
11 11 # For reasons of space, we can only describe in some detail the central Numpy library, but below we provide links to the websites of each project where you can read their documentation in more detail.
12 12 #
13 13 # First, let's look at an overview of the basic tools that most scientists use in daily research with Python. The core of this ecosystem is composed of:
14 14 #
15 15 # * Numpy: the basic library that most others depend on, it provides a powerful array type that can represent multidmensional datasets of many different kinds and that supports arithmetic operations. Numpy also provides a library of common mathematical functions, basic linear algebra, random number generation and Fast Fourier Transforms. Numpy can be found at [numpy.scipy.org](http://numpy.scipy.org)
16 16 #
17 17 # * Scipy: a large collection of numerical algorithms that operate on numpy arrays and provide facilities for many common tasks in scientific computing, including dense and sparse linear algebra support, optimization, special functions, statistics, n-dimensional image processing, signal processing and more. Scipy can be found at [scipy.org](http://scipy.org).
18 18 #
19 19 # * Matplotlib: a data visualization library with a strong focus on producing high-quality output, it supports a variety of common scientific plot types in two and three dimensions, with precise control over the final output and format for publication-quality results. Matplotlib can also be controlled interactively allowing graphical manipulation of your data (zooming, panning, etc) and can be used with most modern user interface toolkits. It can be found at [matplotlib.sf.net](http://matplotlib.sf.net).
20 20 #
21 21 # * IPython: while not strictly scientific in nature, IPython is the interactive environment in which many scientists spend their time. IPython provides a powerful Python shell that integrates tightly with Matplotlib and with easy access to the files and operating system, and which can execute in a terminal or in a graphical Qt console. IPython also has a web-based notebook interface that can combine code with text, mathematical expressions, figures and multimedia. It can be found at [ipython.org](http://ipython.org).
22 22 #
23 23 # While each of these tools can be installed separately, in our opinion the most convenient way today of accessing them (especially on Windows and Mac computers) is to install the [Free Edition of the Enthought Python Distribution](http://www.enthought.com/products/epd_free.php) which contain all the above. Other free alternatives on Windows (but not on Macs) are [Python(x,y)](http://code.google.com/p/pythonxy) and [ Christoph Gohlke's packages page](http://www.lfd.uci.edu/~gohlke/pythonlibs).
24 24 #
25 25 # These four 'core' libraries are in practice complemented by a number of other tools for more specialized work. We will briefly list here the ones that we think are the most commonly needed:
26 26 #
27 27 # * Sympy: a symbolic manipulation tool that turns a Python session into a computer algebra system. It integrates with the IPython notebook, rendering results in properly typeset mathematical notation. [sympy.org](http://sympy.org).
28 28 #
29 29 # * Mayavi: sophisticated 3d data visualization; [code.enthought.com/projects/mayavi](http://code.enthought.com/projects/mayavi).
30 30 #
31 31 # * Cython: a bridge language between Python and C, useful both to optimize performance bottlenecks in Python and to access C libraries directly; [cython.org](http://cython.org).
32 32 #
33 33 # * Pandas: high-performance data structures and data analysis tools, with powerful data alignment and structural manipulation capabilities; [pandas.pydata.org](http://pandas.pydata.org).
34 34 #
35 35 # * Statsmodels: statistical data exploration and model estimation; [statsmodels.sourceforge.net](http://statsmodels.sourceforge.net).
36 36 #
37 37 # * Scikit-learn: general purpose machine learning algorithms with a common interface; [scikit-learn.org](http://scikit-learn.org).
38 38 #
39 39 # * Scikits-image: image processing toolbox; [scikits-image.org](http://scikits-image.org).
40 40 #
41 41 # * NetworkX: analysis of complex networks (in the graph theoretical sense); [networkx.lanl.gov](http://networkx.lanl.gov).
42 42 #
43 43 # * PyTables: management of hierarchical datasets using the industry-standard HDF5 format; [www.pytables.org](http://www.pytables.org).
44 44 #
45 45 # Beyond these, for any specific problem you should look on the internet first, before starting to write code from scratch. There's a good chance that someone, somewhere, has written an open source library that you can use for part or all of your problem.
46 46
47 47 ### A note about the examples below
48 48
49 49 # In all subsequent examples, you will see blocks of input code, followed by the results of the code if the code generated output. This output may include text, graphics and other result objects. These blocks of input can be pasted into your interactive IPython session or notebook for you to execute. In the print version of this document, a thin vertical bar on the left of the blocks of input and output shows which blocks go together.
50 50 #
51 51 # If you are reading this text as an actual IPython notebook, you can press `Shift-Enter` or use the 'play' button on the toolbar (right-pointing triangle) to execute each block of code, known as a 'cell' in IPython:
52 52
53 53 # In[71]:
54 54 # This is a block of code, below you'll see its output
55 55 print "Welcome to the world of scientific computing with Python!"
56 56
57 57 # Out[71]:
58 58 # Welcome to the world of scientific computing with Python!
59 59 #
60 60 ## Motivation: the trapezoidal rule
61 61
62 62 # In subsequent sections we'll provide a basic introduction to the nuts and bolts of the basic scientific python tools; but we'll first motivate it with a brief example that illustrates what you can do in a few lines with these tools. For this, we will use the simple problem of approximating a definite integral with the trapezoid rule:
63 63 #
64 64 # $$
65 65 # \int_{a}^{b} f(x)\, dx \approx \frac{1}{2} \sum_{k=1}^{N} \left( x_{k} - x_{k-1} \right) \left( f(x_{k}) + f(x_{k-1}) \right).
66 66 # $$
67 67 #
68 68 # Our task will be to compute this formula for a function such as:
69 69 #
70 70 # $$
71 71 # f(x) = (x-3)(x-5)(x-7)+85
72 72 # $$
73 73 #
74 74 # integrated between $a=1$ and $b=9$.
75 75 #
76 76 # First, we define the function and sample it evenly between 0 and 10 at 200 points:
77 77
78 78 # In[1]:
79 79 def f(x):
80 80 return (x-3)*(x-5)*(x-7)+85
81 81
82 82 import numpy as np
83 83 x = np.linspace(0, 10, 200)
84 84 y = f(x)
85 85
86 86 # We select $a$ and $b$, our integration limits, and we take only a few points in that region to illustrate the error behavior of the trapezoid approximation:
87 87
88 88 # In[2]:
89 89 a, b = 1, 9
90 90 xint = x[logical_and(x>=a, x<=b)][::30]
91 91 yint = y[logical_and(x>=a, x<=b)][::30]
92 92
93 93 # Let's plot both the function and the area below it in the trapezoid approximation:
94 94
95 95 # In[3]:
96 96 import matplotlib.pyplot as plt
97 97 plt.plot(x, y, lw=2)
98 98 plt.axis([0, 10, 0, 140])
99 99 plt.fill_between(xint, 0, yint, facecolor='gray', alpha=0.4)
100 100 plt.text(0.5 * (a + b), 30,r"$\int_a^b f(x)dx$", horizontalalignment='center', fontsize=20);
101 101
102 102 # Out[3]:
103 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_00.svg
103 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_00.svg
104 104
105 105 # Compute the integral both at high accuracy and with the trapezoid approximation
106 106
107 107 # In[4]:
108 108 from scipy.integrate import quad, trapz
109 109 integral, error = quad(f, 1, 9)
110 110 trap_integral = trapz(yint, xint)
111 111 print "The integral is: %g +/- %.1e" % (integral, error)
112 112 print "The trapezoid approximation with", len(xint), "points is:", trap_integral
113 113 print "The absolute error is:", abs(integral - trap_integral)
114 114
115 115 # Out[4]:
116 116 # The integral is: 680 +/- 7.5e-12
117 117 # The trapezoid approximation with 6 points is: 621.286411141
118 118 # The absolute error is: 58.7135888589
119 119 #
120 120 # This simple example showed us how, combining the numpy, scipy and matplotlib libraries we can provide an illustration of a standard method in elementary calculus with just a few lines of code. We will now discuss with more detail the basic usage of these tools.
121 121
122 122 ## NumPy arrays: the right data structure for scientific computing
123 123
124 124 ### Basics of Numpy arrays
125 125
126 126 # We now turn our attention to the Numpy library, which forms the base layer for the entire 'scipy ecosystem'. Once you have installed numpy, you can import it as
127 127
128 128 # In[5]:
129 129 import numpy
130 130
131 131 # though in this book we will use the common shorthand
132 132
133 133 # In[6]:
134 134 import numpy as np
135 135
136 136 # As mentioned above, the main object provided by numpy is a powerful array. We'll start by exploring how the numpy array differs from Python lists. We start by creating a simple list and an array with the same contents of the list:
137 137
138 138 # In[7]:
139 139 lst = [10, 20, 30, 40]
140 140 arr = np.array([10, 20, 30, 40])
141 141
142 142 # Elements of a one-dimensional array are accessed with the same syntax as a list:
143 143
144 144 # In[8]:
145 145 lst[0]
146 146
147 147 # Out[8]:
148 148 # 10
149 149
150 150
151 151 # In[9]:
152 152 arr[0]
153 153
154 154 # Out[9]:
155 155 # 10
156 156
157 157
158 158 # In[10]:
159 159 arr[-1]
160 160
161 161 # Out[10]:
162 162 # 40
163 163
164 164
165 165 # In[11]:
166 166 arr[2:]
167 167
168 168 # Out[11]:
169 169 # array([30, 40])
170 170
171 171
172 172 # The first difference to note between lists and arrays is that arrays are *homogeneous*; i.e. all elements of an array must be of the same type. In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string:
173 173
174 174 # In[12]:
175 175 lst[-1] = 'a string inside a list'
176 176 lst
177 177
178 178 # Out[12]:
179 179 # [10, 20, 30, 'a string inside a list']
180 180
181 181
182 182 # but the same can not be done with an array, as we get an error message:
183 183
184 184 # In[13]:
185 185 arr[-1] = 'a string inside an array'
186 186
187 187 # Out[13]:
188 188 ---------------------------------------------------------------------------
189 189 ValueError Traceback (most recent call last)
190 190 /home/fperez/teach/book-math-labtool/<ipython-input-13-29c0bfa5fa8a> in <module>()
191 191 ----> 1 arr[-1] = 'a string inside an array'
192 192
193 193 ValueError: invalid literal for long() with base 10: 'a string inside an array'
194 194
195 195 # The information about the type of an array is contained in its *dtype* attribute:
196 196
197 197 # In[14]:
198 198 arr.dtype
199 199
200 200 # Out[14]:
201 201 # dtype('int32')
202 202
203 203
204 204 # Once an array has been created, its dtype is fixed and it can only store elements of the same type. For this example where the dtype is integer, if we store a floating point number it will be automatically converted into an integer:
205 205
206 206 # In[15]:
207 207 arr[-1] = 1.234
208 208 arr
209 209
210 210 # Out[15]:
211 211 # array([10, 20, 30, 1])
212 212
213 213
214 214 # Above we created an array from an existing list; now let us now see other ways in which we can create arrays, which we'll illustrate next. A common need is to have an array initialized with a constant value, and very often this value is 0 or 1 (suitable as starting value for additive and multiplicative loops respectively); `zeros` creates arrays of all zeros, with any desired dtype:
215 215
216 216 # In[16]:
217 217 np.zeros(5, float)
218 218
219 219 # Out[16]:
220 220 # array([ 0., 0., 0., 0., 0.])
221 221
222 222
223 223 # In[17]:
224 224 np.zeros(3, int)
225 225
226 226 # Out[17]:
227 227 # array([0, 0, 0])
228 228
229 229
230 230 # In[18]:
231 231 np.zeros(3, complex)
232 232
233 233 # Out[18]:
234 234 # array([ 0.+0.j, 0.+0.j, 0.+0.j])
235 235
236 236
237 237 # and similarly for `ones`:
238 238
239 239 # In[19]:
240 240 print '5 ones:', np.ones(5)
241 241
242 242 # Out[19]:
243 243 # 5 ones: [ 1. 1. 1. 1. 1.]
244 244 #
245 245 # If we want an array initialized with an arbitrary value, we can create an empty array and then use the fill method to put the value we want into the array:
246 246
247 247 # In[20]:
248 248 a = empty(4)
249 249 a.fill(5.5)
250 250 a
251 251
252 252 # Out[20]:
253 253 # array([ 5.5, 5.5, 5.5, 5.5])
254 254
255 255
256 256 # Numpy also offers the `arange` function, which works like the builtin `range` but returns an array instead of a list:
257 257
258 258 # In[21]:
259 259 np.arange(5)
260 260
261 261 # Out[21]:
262 262 # array([0, 1, 2, 3, 4])
263 263
264 264
265 265 # and the `linspace` and `logspace` functions to create linearly and logarithmically-spaced grids respectively, with a fixed number of points and including both ends of the specified interval:
266 266
267 267 # In[22]:
268 268 print "A linear grid between 0 and 1:", np.linspace(0, 1, 5)
269 269 print "A logarithmic grid between 10**1 and 10**4: ", np.logspace(1, 4, 4)
270 270
271 271 # Out[22]:
272 272 # A linear grid between 0 and 1: [ 0. 0.25 0.5 0.75 1. ]
273 273 # A logarithmic grid between 10**1 and 10**4: [ 10. 100. 1000. 10000.]
274 274 #
275 275 # Finally, it is often useful to create arrays with random numbers that follow a specific distribution. The `np.random` module contains a number of functions that can be used to this effect, for example this will produce an array of 5 random samples taken from a standard normal distribution (0 mean and variance 1):
276 276
277 277 # In[23]:
278 278 np.random.randn(5)
279 279
280 280 # Out[23]:
281 281 # array([-0.08633343, -0.67375434, 1.00589536, 0.87081651, 1.65597822])
282 282
283 283
284 284 # whereas this will also give 5 samples, but from a normal distribution with a mean of 10 and a variance of 3:
285 285
286 286 # In[24]:
287 287 norm10 = np.random.normal(10, 3, 5)
288 288 norm10
289 289
290 290 # Out[24]:
291 291 # array([ 8.94879575, 5.53038269, 8.24847281, 12.14944165, 11.56209294])
292 292
293 293
294 294 ### Indexing with other arrays
295 295
296 296 # Above we saw how to index arrays with single numbers and slices, just like Python lists. But arrays allow for a more sophisticated kind of indexing which is very powerful: you can index an array with another array, and in particular with an array of boolean values. This is particluarly useful to extract information from an array that matches a certain condition.
297 297 #
298 298 # Consider for example that in the array `norm10` we want to replace all values above 9 with the value 0. We can do so by first finding the *mask* that indicates where this condition is true or false:
299 299
300 300 # In[25]:
301 301 mask = norm10 > 9
302 302 mask
303 303
304 304 # Out[25]:
305 305 # array([False, False, False, True, True], dtype=bool)
306 306
307 307
308 308 # Now that we have this mask, we can use it to either read those values or to reset them to 0:
309 309
310 310 # In[26]:
311 311 print 'Values above 9:', norm10[mask]
312 312
313 313 # Out[26]:
314 314 # Values above 9: [ 12.14944165 11.56209294]
315 315 #
316 316 # In[27]:
317 317 print 'Resetting all values above 9 to 0...'
318 318 norm10[mask] = 0
319 319 print norm10
320 320
321 321 # Out[27]:
322 322 # Resetting all values above 9 to 0...
323 323 # [ 8.94879575 5.53038269 8.24847281 0. 0. ]
324 324 #
325 325 ### Arrays with more than one dimension
326 326
327 327 # Up until now all our examples have used one-dimensional arrays. But Numpy can create arrays of aribtrary dimensions, and all the methods illustrated in the previous section work with more than one dimension. For example, a list of lists can be used to initialize a two dimensional array:
328 328
329 329 # In[28]:
330 330 lst2 = [[1, 2], [3, 4]]
331 331 arr2 = np.array([[1, 2], [3, 4]])
332 332 arr2
333 333
334 334 # Out[28]:
335 335 # array([[1, 2],
336 336 # [3, 4]])
337 337
338 338
339 339 # With two-dimensional arrays we start seeing the power of numpy: while a nested list can be indexed using repeatedly the `[ ]` operator, multidimensional arrays support a much more natural indexing syntax with a single `[ ]` and a set of indices separated by commas:
340 340
341 341 # In[29]:
342 342 print lst2[0][1]
343 343 print arr2[0,1]
344 344
345 345 # Out[29]:
346 346 # 2
347 347 # 2
348 348 #
349 349 # Most of the array creation functions listed above can be used with more than one dimension, for example:
350 350
351 351 # In[30]:
352 352 np.zeros((2,3))
353 353
354 354 # Out[30]:
355 355 # array([[ 0., 0., 0.],
356 356 # [ 0., 0., 0.]])
357 357
358 358
359 359 # In[31]:
360 360 np.random.normal(10, 3, (2, 4))
361 361
362 362 # Out[31]:
363 363 # array([[ 11.26788826, 4.29619866, 11.09346496, 9.73861307],
364 364 # [ 10.54025996, 9.5146268 , 10.80367214, 13.62204505]])
365 365
366 366
367 367 # In fact, the shape of an array can be changed at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is:
368 368
369 369 # In[32]:
370 370 arr = np.arange(8).reshape(2,4)
371 371 print arr
372 372
373 373 # Out[32]:
374 374 # [[0 1 2 3]
375 375 # [4 5 6 7]]
376 376 #
377 377 # With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions (using the same array as above):
378 378
379 379 # In[33]:
380 380 print 'Slicing in the second row:', arr[1, 2:4]
381 381 print 'All rows, third column :', arr[:, 2]
382 382
383 383 # Out[33]:
384 384 # Slicing in the second row: [6 7]
385 385 # All rows, third column : [2 6]
386 386 #
387 387 # If you only provide one index, then you will get an array with one less dimension containing that row:
388 388
389 389 # In[34]:
390 390 print 'First row: ', arr[0]
391 391 print 'Second row: ', arr[1]
392 392
393 393 # Out[34]:
394 394 # First row: [0 1 2 3]
395 395 # Second row: [4 5 6 7]
396 396 #
397 397 # Now that we have seen how to create arrays with more than one dimension, it's a good idea to look at some of the most useful properties and methods that arrays have. The following provide basic information about the size, shape and data in the array:
398 398
399 399 # In[35]:
400 400 print 'Data type :', arr.dtype
401 401 print 'Total number of elements :', arr.size
402 402 print 'Number of dimensions :', arr.ndim
403 403 print 'Shape (dimensionality) :', arr.shape
404 404 print 'Memory used (in bytes) :', arr.nbytes
405 405
406 406 # Out[35]:
407 407 # Data type : int32
408 408 # Total number of elements : 8
409 409 # Number of dimensions : 2
410 410 # Shape (dimensionality) : (2, 4)
411 411 # Memory used (in bytes) : 32
412 412 #
413 413 # Arrays also have many useful methods, some especially useful ones are:
414 414
415 415 # In[36]:
416 416 print 'Minimum and maximum :', arr.min(), arr.max()
417 417 print 'Sum and product of all elements :', arr.sum(), arr.prod()
418 418 print 'Mean and standard deviation :', arr.mean(), arr.std()
419 419
420 420 # Out[36]:
421 421 # Minimum and maximum : 0 7
422 422 # Sum and product of all elements : 28 0
423 423 # Mean and standard deviation : 3.5 2.29128784748
424 424 #
425 425 # For these methods, the above operations area all computed on all the elements of the array. But for a multidimensional array, it's possible to do the computation along a single dimension, by passing the `axis` parameter; for example:
426 426
427 427 # In[37]:
428 428 print 'For the following array:\n', arr
429 429 print 'The sum of elements along the rows is :', arr.sum(axis=1)
430 430 print 'The sum of elements along the columns is :', arr.sum(axis=0)
431 431
432 432 # Out[37]:
433 433 # For the following array:
434 434 # [[0 1 2 3]
435 435 # [4 5 6 7]]
436 436 # The sum of elements along the rows is : [ 6 22]
437 437 # The sum of elements along the columns is : [ 4 6 8 10]
438 438 #
439 # As you can see in this example, the value of the `axis` parameter is the dimension which will be *consumed* once the operation has been carried out. This is why to sum along the rows we use `axis=0`.
439 # As you can see in this example, the value of the `axis` parameter is the dimension which will be *consumed* once the operation has been carried out. This is why to sum along the rows we use `axis=0`.
440 440 #
441 441 # This can be easily illustrated with an example that has more dimensions; we create an array with 4 dimensions and shape `(3,4,5,6)` and sum along the axis number 2 (i.e. the *third* axis, since in Python all counts are 0-based). That consumes the dimension whose length was 5, leaving us with a new array that has shape `(3,4,6)`:
442 442
443 443 # In[38]:
444 444 np.zeros((3,4,5,6)).sum(2).shape
445 445
446 446 # Out[38]:
447 447 # (3, 4, 6)
448 448
449 449
450 450 # Another widely used property of arrays is the `.T` attribute, which allows you to access the transpose of the array:
451 451
452 452 # In[39]:
453 453 print 'Array:\n', arr
454 454 print 'Transpose:\n', arr.T
455 455
456 456 # Out[39]:
457 457 # Array:
458 458 # [[0 1 2 3]
459 459 # [4 5 6 7]]
460 460 # Transpose:
461 461 # [[0 4]
462 462 # [1 5]
463 463 # [2 6]
464 464 # [3 7]]
465 465 #
466 466 # We don't have time here to look at all the methods and properties of arrays, here's a complete list. Simply try exploring some of these IPython to learn more, or read their description in the full Numpy documentation:
467 467 #
468 468 # arr.T arr.copy arr.getfield arr.put arr.squeeze
469 469 # arr.all arr.ctypes arr.imag arr.ravel arr.std
470 470 # arr.any arr.cumprod arr.item arr.real arr.strides
471 471 # arr.argmax arr.cumsum arr.itemset arr.repeat arr.sum
472 472 # arr.argmin arr.data arr.itemsize arr.reshape arr.swapaxes
473 473 # arr.argsort arr.diagonal arr.max arr.resize arr.take
474 474 # arr.astype arr.dot arr.mean arr.round arr.tofile
475 475 # arr.base arr.dtype arr.min arr.searchsorted arr.tolist
476 476 # arr.byteswap arr.dump arr.nbytes arr.setasflat arr.tostring
477 477 # arr.choose arr.dumps arr.ndim arr.setfield arr.trace
478 478 # arr.clip arr.fill arr.newbyteorder arr.setflags arr.transpose
479 479 # arr.compress arr.flags arr.nonzero arr.shape arr.var
480 480 # arr.conj arr.flat arr.prod arr.size arr.view
481 481 # arr.conjugate arr.flatten arr.ptp arr.sort
482 482
483 483 ### Operating with arrays
484 484
485 485 # Arrays support all regular arithmetic operators, and the numpy library also contains a complete collection of basic mathematical functions that operate on arrays. It is important to remember that in general, all operations with arrays are applied *element-wise*, i.e., are applied to all the elements of the array at the same time. Consider for example:
486 486
487 487 # In[40]:
488 488 arr1 = np.arange(4)
489 489 arr2 = np.arange(10, 14)
490 490 print arr1, '+', arr2, '=', arr1+arr2
491 491
492 492 # Out[40]:
493 493 # [0 1 2 3] + [10 11 12 13] = [10 12 14 16]
494 494 #
495 495 # Importantly, you must remember that even the multiplication operator is by default applied element-wise, it is *not* the matrix multiplication from linear algebra (as is the case in Matlab, for example):
496 496
497 497 # In[41]:
498 498 print arr1, '*', arr2, '=', arr1*arr2
499 499
500 500 # Out[41]:
501 501 # [0 1 2 3] * [10 11 12 13] = [ 0 11 24 39]
502 502 #
503 503 # While this means that in principle arrays must always match in their dimensionality in order for an operation to be valid, numpy will *broadcast* dimensions when possible. For example, suppose that you want to add the number 1.5 to `arr1`; the following would be a valid way to do it:
504 504
505 505 # In[42]:
506 506 arr1 + 1.5*np.ones(4)
507 507
508 508 # Out[42]:
509 509 # array([ 1.5, 2.5, 3.5, 4.5])
510 510
511 511
512 512 # But thanks to numpy's broadcasting rules, the following is equally valid:
513 513
514 514 # In[43]:
515 515 arr1 + 1.5
516 516
517 517 # Out[43]:
518 518 # array([ 1.5, 2.5, 3.5, 4.5])
519 519
520 520
521 521 # In this case, numpy looked at both operands and saw that the first (`arr1`) was a one-dimensional array of length 4 and the second was a scalar, considered a zero-dimensional object. The broadcasting rules allow numpy to:
522 522 #
523 523 # * *create* new dimensions of length 1 (since this doesn't change the size of the array)
524 524 # * 'stretch' a dimension of length 1 that needs to be matched to a dimension of a different size.
525 525 #
526 526 # So in the above example, the scalar 1.5 is effectively:
527 527 #
528 528 # * first 'promoted' to a 1-dimensional array of length 1
529 529 # * then, this array is 'stretched' to length 4 to match the dimension of `arr1`.
530 530 #
531 531 # After these two operations are complete, the addition can proceed as now both operands are one-dimensional arrays of length 4.
532 532 #
533 533 # This broadcasting behavior is in practice enormously powerful, especially because when numpy broadcasts to create new dimensions or to 'stretch' existing ones, it doesn't actually replicate the data. In the example above the operation is carried *as if* the 1.5 was a 1-d array with 1.5 in all of its entries, but no actual array was ever created. This can save lots of memory in cases when the arrays in question are large and can have significant performance implications.
534 534 #
535 535 # The general rule is: when operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward, creating dimensions of length 1 as needed. Two dimensions are considered compatible when
536 536 #
537 537 # * they are equal to begin with, or
538 538 # * one of them is 1; in this case numpy will do the 'stretching' to make them equal.
539 539 #
540 540 # If these conditions are not met, a `ValueError: frames are not aligned` exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.
541 541
542 542 # This shows how the broadcasting rules work in several dimensions:
543 543
544 544 # In[44]:
545 545 b = np.array([2, 3, 4, 5])
546 546 print arr, '\n\n+', b , '\n----------------\n', arr + b
547 547
548 548 # Out[44]:
549 549 # [[0 1 2 3]
550 550 # [4 5 6 7]]
551 551 #
552 552 # + [2 3 4 5]
553 553 # ----------------
554 554 # [[ 2 4 6 8]
555 555 # [ 6 8 10 12]]
556 556 #
557 557 # Now, how could you use broadcasting to say add `[4, 6]` along the rows to `arr` above? Simply performing the direct addition will produce the error we previously mentioned:
558 558
559 559 # In[45]:
560 560 c = np.array([4, 6])
561 561 arr + c
562 562
563 563 # Out[45]:
564 564 ---------------------------------------------------------------------------
565 565 ValueError Traceback (most recent call last)
566 566 /home/fperez/teach/book-math-labtool/<ipython-input-45-62aa20ac1980> in <module>()
567 567 1 c = np.array([4, 6])
568 568 ----> 2 arr + c
569 569
570 570 ValueError: operands could not be broadcast together with shapes (2,4) (2)
571 571
572 572 # According to the rules above, the array `c` would need to have a *trailing* dimension of 1 for the broadcasting to work. It turns out that numpy allows you to 'inject' new dimensions anywhere into an array on the fly, by indexing it with the special object `np.newaxis`:
573 573
574 574 # In[46]:
575 575 (c[:, np.newaxis]).shape
576 576
577 577 # Out[46]:
578 578 # (2, 1)
579 579
580 580
581 581 # This is exactly what we need, and indeed it works:
582 582
583 583 # In[47]:
584 584 arr + c[:, np.newaxis]
585 585
586 586 # Out[47]:
587 587 # array([[ 4, 5, 6, 7],
588 588 # [10, 11, 12, 13]])
589 589
590 590
591 591 # For the full broadcasting rules, please see the official Numpy docs, which describe them in detail and with more complex examples.
592 592
593 593 # As we mentioned before, Numpy ships with a full complement of mathematical functions that work on entire arrays, including logarithms, exponentials, trigonometric and hyperbolic trigonometric functions, etc. Furthermore, scipy ships a rich special function library in the `scipy.special` module that includes Bessel, Airy, Fresnel, Laguerre and other classical special functions. For example, sampling the sine function at 100 points between $0$ and $2\pi$ is as simple as:
594 594
595 595 # In[48]:
596 596 x = np.linspace(0, 2*np.pi, 100)
597 597 y = np.sin(x)
598 598
599 599 ### Linear algebra in numpy
600 600
601 601 # Numpy ships with a basic linear algebra library, and all arrays have a `dot` method whose behavior is that of the scalar dot product when its arguments are vectors (one-dimensional arrays) and the traditional matrix multiplication when one or both of its arguments are two-dimensional arrays:
602 602
603 603 # In[49]:
604 604 v1 = np.array([2, 3, 4])
605 605 v2 = np.array([1, 0, 1])
606 606 print v1, '.', v2, '=', v1.dot(v2)
607 607
608 608 # Out[49]:
609 609 # [2 3 4] . [1 0 1] = 6
610 610 #
611 611 # Here is a regular matrix-vector multiplication, note that the array `v1` should be viewed as a *column* vector in traditional linear algebra notation; numpy makes no distinction between row and column vectors and simply verifies that the dimensions match the required rules of matrix multiplication, in this case we have a $2 \times 3$ matrix multiplied by a 3-vector, which produces a 2-vector:
612 612
613 613 # In[50]:
614 614 A = np.arange(6).reshape(2, 3)
615 615 print A, 'x', v1, '=', A.dot(v1)
616 616
617 617 # Out[50]:
618 618 # [[0 1 2]
619 619 # [3 4 5]] x [2 3 4] = [11 38]
620 620 #
621 621 # For matrix-matrix multiplication, the same dimension-matching rules must be satisfied, e.g. consider the difference between $A \times A^T$:
622 622
623 623 # In[51]:
624 624 print A.dot(A.T)
625 625
626 626 # Out[51]:
627 627 # [[ 5 14]
628 628 # [14 50]]
629 629 #
630 630 # and $A^T \times A$:
631 631
632 632 # In[52]:
633 633 print A.T.dot(A)
634 634
635 635 # Out[52]:
636 636 # [[ 9 12 15]
637 637 # [12 17 22]
638 638 # [15 22 29]]
639 639 #
640 640 # Furthermore, the `numpy.linalg` module includes additional functionality such as determinants, matrix norms, Cholesky, eigenvalue and singular value decompositions, etc. For even more linear algebra tools, `scipy.linalg` contains the majority of the tools in the classic LAPACK libraries as well as functions to operate on sparse matrices. We refer the reader to the Numpy and Scipy documentations for additional details on these.
641 641
642 642 ### Reading and writing arrays to disk
643 643
644 644 # Numpy lets you read and write arrays into files in a number of ways. In order to use these tools well, it is critical to understand the difference between a *text* and a *binary* file containing numerical data. In a text file, the number $\pi$ could be written as "3.141592653589793", for example: a string of digits that a human can read, with in this case 15 decimal digits. In contrast, that same number written to a binary file would be encoded as 8 characters (bytes) that are not readable by a human but which contain the exact same data that the variable `pi` had in the computer's memory.
645 645 #
646 646 # The tradeoffs between the two modes are thus:
647 647 #
648 648 # * Text mode: occupies more space, precision can be lost (if not all digits are written to disk), but is readable and editable by hand with a text editor. Can *only* be used for one- and two-dimensional arrays.
649 649 #
650 650 # * Binary mode: compact and exact representation of the data in memory, can't be read or edited by hand. Arrays of any size and dimensionality can be saved and read without loss of information.
651 651 #
652 652 # First, let's see how to read and write arrays in text mode. The `np.savetxt` function saves an array to a text file, with options to control the precision, separators and even adding a header:
653 653
654 654 # In[53]:
655 655 arr = np.arange(10).reshape(2, 5)
656 656 np.savetxt('test.out', arr, fmt='%.2e', header="My dataset")
657 657 !cat test.out
658 658
659 659 # Out[53]:
660 660 # # My dataset
661 661 # 0.00e+00 1.00e+00 2.00e+00 3.00e+00 4.00e+00
662 662 # 5.00e+00 6.00e+00 7.00e+00 8.00e+00 9.00e+00
663 663 #
664 664 # And this same type of file can then be read with the matching `np.loadtxt` function:
665 665
666 666 # In[54]:
667 667 arr2 = np.loadtxt('test.out')
668 668 print arr2
669 669
670 670 # Out[54]:
671 671 # [[ 0. 1. 2. 3. 4.]
672 672 # [ 5. 6. 7. 8. 9.]]
673 673 #
674 674 # For binary data, Numpy provides the `np.save` and `np.savez` routines. The first saves a single array to a file with `.npy` extension, while the latter can be used to save a *group* of arrays into a single file with `.npz` extension. The files created with these routines can then be read with the `np.load` function.
675 675 #
676 676 # Let us first see how to use the simpler `np.save` function to save a single array:
677 677
678 678 # In[55]:
679 679 np.save('test.npy', arr2)
680 680 # Now we read this back
681 681 arr2n = np.load('test.npy')
682 682 # Let's see if any element is non-zero in the difference.
683 683 # A value of True would be a problem.
684 684 print 'Any differences?', np.any(arr2-arr2n)
685 685
686 686 # Out[55]:
687 687 # Any differences? False
688 688 #
689 689 # Now let us see how the `np.savez` function works. You give it a filename and either a sequence of arrays or a set of keywords. In the first mode, the function will auotmatically name the saved arrays in the archive as `arr_0`, `arr_1`, etc:
690 690
691 691 # In[56]:
692 692 np.savez('test.npz', arr, arr2)
693 693 arrays = np.load('test.npz')
694 694 arrays.files
695 695
696 696 # Out[56]:
697 697 # ['arr_1', 'arr_0']
698 698
699 699
700 700 # Alternatively, we can explicitly choose how to name the arrays we save:
701 701
702 702 # In[57]:
703 703 np.savez('test.npz', array1=arr, array2=arr2)
704 704 arrays = np.load('test.npz')
705 705 arrays.files
706 706
707 707 # Out[57]:
708 708 # ['array2', 'array1']
709 709
710 710
711 711 # The object returned by `np.load` from an `.npz` file works like a dictionary, though you can also access its constituent files by attribute using its special `.f` field; this is best illustrated with an example with the `arrays` object from above:
712 712
713 713 # In[58]:
714 714 print 'First row of first array:', arrays['array1'][0]
715 715 # This is an equivalent way to get the same field
716 716 print 'First row of first array:', arrays.f.array1[0]
717 717
718 718 # Out[58]:
719 719 # First row of first array: [0 1 2 3 4]
720 720 # First row of first array: [0 1 2 3 4]
721 721 #
722 722 # This `.npz` format is a very convenient way to package compactly and without loss of information, into a single file, a group of related arrays that pertain to a specific problem. At some point, however, the complexity of your dataset may be such that the optimal approach is to use one of the standard formats in scientific data processing that have been designed to handle complex datasets, such as NetCDF or HDF5.
723 723 #
724 724 # Fortunately, there are tools for manipulating these formats in Python, and for storing data in other ways such as databases. A complete discussion of the possibilities is beyond the scope of this discussion, but of particular interest for scientific users we at least mention the following:
725 725 #
726 726 # * The `scipy.io` module contains routines to read and write Matlab files in `.mat` format and files in the NetCDF format that is widely used in certain scientific disciplines.
727 727 #
728 728 # * For manipulating files in the HDF5 format, there are two excellent options in Python: The PyTables project offers a high-level, object oriented approach to manipulating HDF5 datasets, while the h5py project offers a more direct mapping to the standard HDF5 library interface. Both are excellent tools; if you need to work with HDF5 datasets you should read some of their documentation and examples and decide which approach is a better match for your needs.
729 729
730 730 ## High quality data visualization with Matplotlib
731 731
732 732 # The [matplotlib](http://matplotlib.sf.net) library is a powerful tool capable of producing complex publication-quality figures with fine layout control in two and three dimensions; here we will only provide a minimal self-contained introduction to its usage that covers the functionality needed for the rest of the book. We encourage the reader to read the tutorials included with the matplotlib documentation as well as to browse its extensive gallery of examples that include source code.
733 733 #
734 734 # Just as we typically use the shorthand `np` for Numpy, we will use `plt` for the `matplotlib.pyplot` module where the easy-to-use plotting functions reside (the library contains a rich object-oriented architecture that we don't have the space to discuss here):
735 735
736 736 # In[59]:
737 737 import matplotlib.pyplot as plt
738 738
739 739 # The most frequently used function is simply called `plot`, here is how you can make a simple plot of $\sin(x)$ for $x \in [0, 2\pi]$ with labels and a grid (we use the semicolon in the last line to suppress the display of some information that is unnecessary right now):
740 740
741 741 # In[60]:
742 742 x = np.linspace(0, 2*np.pi)
743 743 y = np.sin(x)
744 744 plt.plot(x,y, label='sin(x)')
745 745 plt.legend()
746 746 plt.grid()
747 747 plt.title('Harmonic')
748 748 plt.xlabel('x')
749 749 plt.ylabel('y');
750 750
751 751 # Out[60]:
752 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_01.svg
752 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_01.svg
753 753
754 754 # You can control the style, color and other properties of the markers, for example:
755 755
756 756 # In[61]:
757 757 plt.plot(x, y, linewidth=2);
758 758
759 759 # Out[61]:
760 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_02.svg
760 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_02.svg
761 761
762 762 # In[62]:
763 763 plt.plot(x, y, 'o', markersize=5, color='r');
764 764
765 765 # Out[62]:
766 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_03.svg
766 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_03.svg
767 767
768 768 # We will now see how to create a few other common plot types, such as a simple error plot:
769 769
770 770 # In[63]:
771 771 # example data
772 772 x = np.arange(0.1, 4, 0.5)
773 773 y = np.exp(-x)
774 774
775 775 # example variable error bar values
776 776 yerr = 0.1 + 0.2*np.sqrt(x)
777 777 xerr = 0.1 + yerr
778 778
779 779 # First illustrate basic pyplot interface, using defaults where possible.
780 780 plt.figure()
781 781 plt.errorbar(x, y, xerr=0.2, yerr=0.4)
782 782 plt.title("Simplest errorbars, 0.2 in x, 0.4 in y");
783 783
784 784 # Out[63]:
785 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_04.svg
785 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_04.svg
786 786
787 787 # A simple log plot
788 788
789 789 # In[64]:
790 790 x = np.linspace(-5, 5)
791 791 y = np.exp(-x**2)
792 792 plt.semilogy(x, y);
793 793
794 794 # Out[64]:
795 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_05.svg
795 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_05.svg
796 796
797 797 # A histogram annotated with text inside the plot, using the `text` function:
798 798
799 799 # In[65]:
800 800 mu, sigma = 100, 15
801 801 x = mu + sigma * np.random.randn(10000)
802 802
803 803 # the histogram of the data
804 804 n, bins, patches = plt.hist(x, 50, normed=1, facecolor='g', alpha=0.75)
805 805
806 806 plt.xlabel('Smarts')
807 807 plt.ylabel('Probability')
808 808 plt.title('Histogram of IQ')
809 809 # This will put a text fragment at the position given:
810 810 plt.text(55, .027, r'$\mu=100,\ \sigma=15$', fontsize=14)
811 811 plt.axis([40, 160, 0, 0.03])
812 812 plt.grid(True)
813 813
814 814 # Out[65]:
815 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_06.svg
815 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_06.svg
816 816
817 817 ### Image display
818 818
819 819 # The `imshow` command can display single or multi-channel images. A simple array of random numbers, plotted in grayscale:
820 820
821 821 # In[66]:
822 822 from matplotlib import cm
823 823 plt.imshow(np.random.rand(5, 10), cmap=cm.gray, interpolation='nearest');
824 824
825 825 # Out[66]:
826 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_07.svg
826 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_07.svg
827 827
828 828 # A real photograph is a multichannel image, `imshow` interprets it correctly:
829 829
830 830 # In[67]:
831 831 img = plt.imread('stinkbug.png')
832 832 print 'Dimensions of the array img:', img.shape
833 833 plt.imshow(img);
834 834
835 835 # Out[67]:
836 836 # Dimensions of the array img: (375, 500, 3)
837 837 #
838 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_08.svg
838 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_08.svg
839 839
840 840 ### Simple 3d plotting with matplotlib
841 841
842 842 # Note that you must execute at least once in your session:
843 843
844 844 # In[68]:
845 845 from mpl_toolkits.mplot3d import Axes3D
846 846
847 847 # One this has been done, you can create 3d axes with the `projection='3d'` keyword to `add_subplot`:
848 848 #
849 849 # fig = plt.figure()
850 850 # fig.add_subplot(<other arguments here>, projection='3d')
851 851
852 852 # A simple surface plot:
853 853
854 854 # In[72]:
855 855 from mpl_toolkits.mplot3d.axes3d import Axes3D
856 856 from matplotlib import cm
857 857
858 858 fig = plt.figure()
859 859 ax = fig.add_subplot(1, 1, 1, projection='3d')
860 860 X = np.arange(-5, 5, 0.25)
861 861 Y = np.arange(-5, 5, 0.25)
862 862 X, Y = np.meshgrid(X, Y)
863 863 R = np.sqrt(X**2 + Y**2)
864 864 Z = np.sin(R)
865 865 surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.jet,
866 866 linewidth=0, antialiased=False)
867 867 ax.set_zlim3d(-1.01, 1.01);
868 868
869 869 # Out[72]:
870 # image file: tests/ipynbref/IntroNumPy.orig_files/IntroNumPy.orig_fig_09.svg
870 # image file: tests/ipynbref/IntroNumPy_orig_files/IntroNumPy_orig_fig_09.svg
871 871
872 872 ## IPython: a powerful interactive environment
873 873
874 874 # A key component of the everyday workflow of most scientific computing environments is a good interactive environment, that is, a system in which you can execute small amounts of code and view the results immediately, combining both printing out data and opening graphical visualizations. All modern systems for scientific computing, commercial and open source, include such functionality.
875 875 #
876 876 # Out of the box, Python also offers a simple interactive shell with very limited capabilities. But just like the scientific community built Numpy to provide arrays suited for scientific work (since Pytyhon's lists aren't optimal for this task), it has also developed an interactive environment much more sophisticated than the built-in one. The [IPython project](http://ipython.org) offers a set of tools to make productive use of the Python language, all the while working interactively and with immedate feedback on your results. The basic tools that IPython provides are:
877 877 #
878 878 # 1. A powerful terminal shell, with many features designed to increase the fluidity and productivity of everyday scientific workflows, including:
879 879 #
880 880 # * rich introspection of all objects and variables including easy access to the source code of any function
881 881 # * powerful and extensible tab completion of variables and filenames,
882 882 # * tight integration with matplotlib, supporting interactive figures that don't block the terminal,
883 883 # * direct access to the filesystem and underlying operating system,
884 884 # * an extensible system for shell-like commands called 'magics' that reduce the work needed to perform many common tasks,
885 885 # * tools for easily running, timing, profiling and debugging your codes,
886 886 # * syntax highlighted error messages with much more detail than the default Python ones,
887 887 # * logging and access to all previous history of inputs, including across sessions
888 888 #
889 889 # 2. A Qt console that provides the look and feel of a terminal, but adds support for inline figures, graphical calltips, a persistent session that can survive crashes (even segfaults) of the kernel process, and more.
890 890 #
891 891 # 3. A web-based notebook that can execute code and also contain rich text and figures, mathematical equations and arbitrary HTML. This notebook presents a document-like view with cells where code is executed but that can be edited in-place, reordered, mixed with explanatory text and figures, etc.
892 892 #
893 893 # 4. A high-performance, low-latency system for parallel computing that supports the control of a cluster of IPython engines communicating over a network, with optimizations that minimize unnecessary copying of large objects (especially numpy arrays).
894 894 #
895 895 # We will now discuss the highlights of the tools 1-3 above so that you can make them an effective part of your workflow. The topic of parallel computing is beyond the scope of this document, but we encourage you to read the extensive [documentation](http://ipython.org/ipython-doc/rel-0.12.1/parallel/index.html) and [tutorials](http://minrk.github.com/scipy-tutorial-2011/) on this available on the IPython website.
896 896
897 897 ### The IPython terminal
898 898
899 899 # You can start IPython at the terminal simply by typing:
900 900 #
901 901 # $ ipython
902 902 #
903 903 # which will provide you some basic information about how to get started and will then open a prompt labeled `In [1]:` for you to start typing. Here we type $2^{64}$ and Python computes the result for us in exact arithmetic, returning it as `Out[1]`:
904 904 #
905 905 # $ ipython
906 906 # Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
907 907 # Type "copyright", "credits" or "license" for more information.
908 908 #
909 909 # IPython 0.13.dev -- An enhanced Interactive Python.
910 910 # ? -> Introduction and overview of IPython's features.
911 911 # %quickref -> Quick reference.
912 912 # help -> Python's own help system.
913 913 # object? -> Details about 'object', use 'object??' for extra details.
914 914 #
915 915 # In [1]: 2**64
916 916 # Out[1]: 18446744073709551616L
917 917 #
918 918 # The first thing you should know about IPython is that all your inputs and outputs are saved. There are two variables named `In` and `Out` which are filled as you work with your results. Furthermore, all outputs are also saved to auto-created variables of the form `_NN` where `NN` is the prompt number, and inputs to `_iNN`. This allows you to recover quickly the result of a prior computation by referring to its number even if you forgot to store it as a variable. For example, later on in the above session you can do:
919 919 #
920 920 # In [6]: print _1
921 921 # 18446744073709551616
922 922
923 923 # We strongly recommend that you take a few minutes to read at least the basic introduction provided by the `?` command, and keep in mind that the `%quickref` command at all times can be used as a quick reference "cheat sheet" of the most frequently used features of IPython.
924 924 #
925 925 # At the IPython prompt, any valid Python code that you type will be executed similarly to the default Python shell (though often with more informative feedback). But since IPython is a *superset* of the default Python shell; let's have a brief look at some of its additional functionality.
926 926
927 927 # **Object introspection**
928 928 #
929 929 # A simple `?` command provides a general introduction to IPython, but as indicated in the banner above, you can use the `?` syntax to ask for details about any object. For example, if we type `_1?`, IPython will print the following details about this variable:
930 930 #
931 931 # In [14]: _1?
932 932 # Type: long
933 933 # Base Class: <type 'long'>
934 934 # String Form:18446744073709551616
935 935 # Namespace: Interactive
936 936 # Docstring:
937 937 # long(x[, base]) -> integer
938 938 #
939 939 # Convert a string or number to a long integer, if possible. A floating
940 940 #
941 941 # [etc... snipped for brevity]
942 942 #
943 943 # If you add a second `?` and for any oobject `x` type `x??`, IPython will try to provide an even more detailed analsysi of the object, including its syntax-highlighted source code when it can be found. It's possible that `x??` returns the same information as `x?`, but in many cases `x??` will indeed provide additional details.
944 944 #
945 945 # Finally, the `?` syntax is also useful to search *namespaces* with wildcards. Suppose you are wondering if there is any function in Numpy that may do text-related things; with `np.*txt*?`, IPython will print all the names in the `np` namespace (our Numpy shorthand) that have 'txt' anywhere in their name:
946 946 #
947 947 # In [17]: np.*txt*?
948 948 # np.genfromtxt
949 949 # np.loadtxt
950 950 # np.mafromtxt
951 951 # np.ndfromtxt
952 952 # np.recfromtxt
953 953 # np.savetxt
954 954
955 955 # **Tab completion**
956 956 #
957 957 # IPython makes the tab key work extra hard for you as a way to rapidly inspect objects and libraries. Whenever you have typed something at the prompt, by hitting the `<tab>` key IPython will try to complete the rest of the line. For this, IPython will analyze the text you had so far and try to search for Python data or files that may match the context you have already provided.
958 958 #
959 959 # For example, if you type `np.load` and hit the <tab> key, you'll see:
960 960 #
961 961 # In [21]: np.load<TAB HERE>
962 962 # np.load np.loads np.loadtxt
963 963 #
964 964 # so you can quickly find all the load-related functionality in numpy. Tab completion works even for function arguments, for example consider this function definition:
965 965 #
966 966 # In [20]: def f(x, frobinate=False):
967 967 # ....: if frobinate:
968 968 # ....: return x**2
969 969 # ....:
970 970 #
971 971 # If you now use the `<tab>` key after having typed 'fro' you'll get all valid Python completions, but those marked with `=` at the end are known to be keywords of your function:
972 972 #
973 973 # In [21]: f(2, fro<TAB HERE>
974 974 # frobinate= frombuffer fromfunction frompyfunc fromstring
975 975 # from fromfile fromiter fromregex frozenset
976 976 #
977 977 # at this point you can add the `b` letter and hit `<tab>` once more, and IPython will finish the line for you:
978 978 #
979 979 # In [21]: f(2, frobinate=
980 980 #
981 981 # As a beginner, simply get into the habit of using `<tab>` after most objects; it should quickly become second nature as you will see how helps keep a fluid workflow and discover useful information. Later on you can also customize this behavior by writing your own completion code, if you so desire.
982 982
983 983 # **Matplotlib integration**
984 984 #
985 985 # One of the most useful features of IPython for scientists is its tight integration with matplotlib: at the terminal IPython lets you open matplotlib figures without blocking your typing (which is what happens if you try to do the same thing at the default Python shell), and in the Qt console and notebook you can even view your figures embedded in your workspace next to the code that created them.
986 986 #
987 987 # The matplotlib support can be either activated when you start IPython by passing the `--pylab` flag, or at any point later in your session by using the `%pylab` command. If you start IPython with `--pylab`, you'll see something like this (note the extra message about pylab):
988 988 #
989 989 # $ ipython --pylab
990 990 # Python 2.7.2+ (default, Oct 4 2011, 20:03:08)
991 991 # Type "copyright", "credits" or "license" for more information.
992 992 #
993 993 # IPython 0.13.dev -- An enhanced Interactive Python.
994 994 # ? -> Introduction and overview of IPython's features.
995 995 # %quickref -> Quick reference.
996 996 # help -> Python's own help system.
997 997 # object? -> Details about 'object', use 'object??' for extra details.
998 998 #
999 999 # Welcome to pylab, a matplotlib-based Python environment [backend: Qt4Agg].
1000 1000 # For more information, type 'help(pylab)'.
1001 1001 #
1002 1002 # In [1]:
1003 1003 #
1004 1004 # Furthermore, IPython will import `numpy` with the `np` shorthand, `matplotlib.pyplot` as `plt`, and it will also load all of the numpy and pyplot top-level names so that you can directly type something like:
1005 1005 #
1006 1006 # In [1]: x = linspace(0, 2*pi, 200)
1007 1007 #
1008 1008 # In [2]: plot(x, sin(x))
1009 1009 # Out[2]: [<matplotlib.lines.Line2D at 0x9e7c16c>]
1010 1010 #
1011 1011 # instead of having to prefix each call with its full signature (as we have been doing in the examples thus far):
1012 1012 #
1013 1013 # In [3]: x = np.linspace(0, 2*np.pi, 200)
1014 1014 #
1015 1015 # In [4]: plt.plot(x, np.sin(x))
1016 1016 # Out[4]: [<matplotlib.lines.Line2D at 0x9e900ac>]
1017 1017 #
1018 1018 # This shorthand notation can be a huge time-saver when working interactively (it's a few characters but you are likely to type them hundreds of times in a session). But we should note that as you develop persistent scripts and notebooks meant for reuse, it's best to get in the habit of using the longer notation (known as *fully qualified names* as it's clearer where things come from and it makes for more robust, readable and maintainable code in the long run).
1019 1019
1020 1020 # **Access to the operating system and files**
1021 1021 #
1022 1022 # In IPython, you can type `ls` to see your files or `cd` to change directories, just like you would at a regular system prompt:
1023 1023 #
1024 1024 # In [2]: cd tests
1025 1025 # /home/fperez/ipython/nbconvert/tests
1026 1026 #
1027 1027 # In [3]: ls test.*
1028 1028 # test.aux test.html test.ipynb test.log test.out test.pdf test.rst test.tex
1029 1029 #
1030 1030 # Furthermore, if you use the `!` at the beginning of a line, any commands you pass afterwards go directly to the operating system:
1031 1031 #
1032 1032 # In [4]: !echo "Hello IPython"
1033 1033 # Hello IPython
1034 1034 #
1035 1035 # IPython offers a useful twist in this feature: it will substitute in the command the value of any *Python* variable you may have if you prepend it with a `$` sign:
1036 1036 #
1037 1037 # In [5]: message = 'IPython interpolates from Python to the shell'
1038 1038 #
1039 1039 # In [6]: !echo $message
1040 1040 # IPython interpolates from Python to the shell
1041 1041 #
1042 1042 # This feature can be extremely useful, as it lets you combine the power and clarity of Python for complex logic with the immediacy and familiarity of many shell commands. Additionally, if you start the line with *two* `$$` signs, the output of the command will be automatically captured as a list of lines, e.g.:
1043 1043 #
1044 1044 # In [10]: !!ls test.*
1045 1045 # Out[10]:
1046 1046 # ['test.aux',
1047 1047 # 'test.html',
1048 1048 # 'test.ipynb',
1049 1049 # 'test.log',
1050 1050 # 'test.out',
1051 1051 # 'test.pdf',
1052 1052 # 'test.rst',
1053 1053 # 'test.tex']
1054 1054 #
1055 1055 # As explained above, you can now use this as the variable `_10`. If you directly want to capture the output of a system command to a Python variable, you can use the syntax `=!`:
1056 1056 #
1057 1057 # In [11]: testfiles =! ls test.*
1058 1058 #
1059 1059 # In [12]: print testfiles
1060 1060 # ['test.aux', 'test.html', 'test.ipynb', 'test.log', 'test.out', 'test.pdf', 'test.rst', 'test.tex']
1061 1061 #
1062 1062 # Finally, the special `%alias` command lets you define names that are shorthands for system commands, so that you can type them without having to prefix them via `!` explicitly (for example, `ls` is an alias that has been predefined for you at startup).
1063 1063
1064 1064 # **Magic commands**
1065 1065 #
1066 1066 # IPython has a system for special commands, called 'magics', that let you control IPython itself and perform many common tasks with a more shell-like syntax: it uses spaces for delimiting arguments, flags can be set with dashes and all arguments are treated as strings, so no additional quoting is required. This kind of syntax is invalid in the Python language but very convenient for interactive typing (less parentheses, commans and quoting everywhere); IPython distinguishes the two by detecting lines that start with the `%` character.
1067 1067 #
1068 1068 # You can learn more about the magic system by simply typing `%magic` at the prompt, which will give you a short description plus the documentation on *all* available magics. If you want to see only a listing of existing magics, you can use `%lsmagic`:
1069 1069 #
1070 1070 # In [4]: lsmagic
1071 1071 # Available magic functions:
1072 1072 # %alias %autocall %autoindent %automagic %bookmark %c %cd %colors %config %cpaste
1073 1073 # %debug %dhist %dirs %doctest_mode %ds %ed %edit %env %gui %hist %history
1074 1074 # %install_default_config %install_ext %install_profiles %load_ext %loadpy %logoff %logon
1075 1075 # %logstart %logstate %logstop %lsmagic %macro %magic %notebook %page %paste %pastebin
1076 1076 # %pd %pdb %pdef %pdoc %pfile %pinfo %pinfo2 %pop %popd %pprint %precision %profile
1077 1077 # %prun %psearch %psource %pushd %pwd %pycat %pylab %quickref %recall %rehashx
1078 1078 # %reload_ext %rep %rerun %reset %reset_selective %run %save %sc %stop %store %sx %tb
1079 1079 # %time %timeit %unalias %unload_ext %who %who_ls %whos %xdel %xmode
1080 1080 #
1081 1081 # Automagic is ON, % prefix NOT needed for magic functions.
1082 1082 #
1083 1083 # Note how the example above omitted the eplicit `%` marker and simply uses `lsmagic`. As long as the 'automagic' feature is on (which it is by default), you can omit the `%` marker as long as there is no ambiguity with a Python variable of the same name.
1084 1084
1085 1085 # **Running your code**
1086 1086 #
1087 1087 # While it's easy to type a few lines of code in IPython, for any long-lived work you should keep your codes in Python scripts (or in IPython notebooks, see below). Consider that you have a script, in this case trivially simple for the sake of brevity, named `simple.py`:
1088 1088 #
1089 1089 # In [12]: !cat simple.py
1090 1090 # import numpy as np
1091 1091 #
1092 1092 # x = np.random.normal(size=100)
1093 1093 #
1094 1094 # print 'First elment of x:', x[0]
1095 1095 #
1096 1096 # The typical workflow with IPython is to use the `%run` magic to execute your script (you can omit the .py extension if you want). When you run it, the script will execute just as if it had been run at the system prompt with `python simple.py` (though since modules don't get re-executed on new imports by Python, all system initialization is essentially free, which can have a significant run time impact in some cases):
1097 1097 #
1098 1098 # In [13]: run simple
1099 1099 # First elment of x: -1.55872256289
1100 1100 #
1101 1101 # Once it completes, all variables defined in it become available for you to use interactively:
1102 1102 #
1103 1103 # In [14]: x.shape
1104 1104 # Out[14]: (100,)
1105 1105 #
1106 1106 # This allows you to plot data, try out ideas, etc, in a `%run`/interact/edit cycle that can be very productive. As you start understanding your problem better you can refine your script further, incrementally improving it based on the work you do at the IPython prompt. At any point you can use the `%hist` magic to print out your history without prompts, so that you can copy useful fragments back into the script.
1107 1107 #
1108 1108 # By default, `%run` executes scripts in a completely empty namespace, to better mimic how they would execute at the system prompt with plain Python. But if you use the `-i` flag, the script will also see your interactively defined variables. This lets you edit in a script larger amounts of code that still behave as if you had typed them at the IPython prompt.
1109 1109 #
1110 1110 # You can also get a summary of the time taken by your script with the `-t` flag; consider a different script `randsvd.py` that takes a bit longer to run:
1111 1111 #
1112 1112 # In [21]: run -t randsvd.py
1113 1113 #
1114 1114 # IPython CPU timings (estimated):
1115 1115 # User : 0.38 s.
1116 1116 # System : 0.04 s.
1117 1117 # Wall time: 0.34 s.
1118 1118 #
1119 1119 # `User` is the time spent by the computer executing your code, while `System` is the time the operating system had to work on your behalf, doing things like memory allocation that are needed by your code but that you didn't explicitly program and that happen inside the kernel. The `Wall time` is the time on a 'clock on the wall' between the start and end of your program.
1120 1120 #
1121 1121 # If `Wall > User+System`, your code is most likely waiting idle for certain periods. That could be waiting for data to arrive from a remote source or perhaps because the operating system has to swap large amounts of virtual memory. If you know that your code doesn't explicitly wait for remote data to arrive, you should investigate further to identify possible ways of improving the performance profile.
1122 1122 #
1123 1123 # If you only want to time how long a single statement takes, you don't need to put it into a script as you can use the `%timeit` magic, which uses Python's `timeit` module to very carefully measure timig data; `timeit` can measure even short statements that execute extremely fast:
1124 1124 #
1125 1125 # In [27]: %timeit a=1
1126 1126 # 10000000 loops, best of 3: 23 ns per loop
1127 1127 #
1128 1128 # and for code that runs longer, it automatically adjusts so the overall measurement doesn't take too long:
1129 1129 #
1130 1130 # In [28]: %timeit np.linalg.svd(x)
1131 1131 # 1 loops, best of 3: 310 ms per loop
1132 1132 #
1133 1133 # The `%run` magic still has more options for debugging and profiling data; you should read its documentation for many useful details (as always, just type `%run?`).
1134 1134
1135 1135 ### The graphical Qt console
1136 1136
1137 1137 # If you type at the system prompt (see the IPython website for installation details, as this requires some additional libraries):
1138 1138 #
1139 1139 # $ ipython qtconsole
1140 1140 #
1141 1141 # instead of opening in a terminal as before, IPython will start a graphical console that at first sight appears just like a terminal, but which is in fact much more capable than a text-only terminal. This is a specialized terminal designed for interactive scientific work, and it supports full multi-line editing with color highlighting and graphical calltips for functions, it can keep multiple IPython sessions open simultaneously in tabs, and when scripts run it can display the figures inline directly in the work area.
1142 1142 #
1143 1143 # <center><img src="ipython_qtconsole2.png" width=400px></center>
1144 1144
1145 1145 # % This cell is for the pdflatex output only
1146 1146 # \begin{figure}[htbp]
1147 1147 # \centering
1148 1148 # \includegraphics[width=3in]{ipython_qtconsole2.png}
1149 1149 # \caption{The IPython Qt console: a lightweight terminal for scientific exploration, with code, results and graphics in a soingle environment.}
1150 1150 # \end{figure}
1151 1151
1152 1152 # The Qt console accepts the same `--pylab` startup flags as the terminal, but you can additionally supply the value `--pylab inline`, which enables the support for inline graphics shown in the figure. This is ideal for keeping all the code and figures in the same session, given that the console can save the output of your entire session to HTML or PDF.
1153 1153 #
1154 1154 # Since the Qt console makes it far more convenient than the terminal to edit blocks of code with multiple lines, in this environment it's worth knowing about the `%loadpy` magic function. `%loadpy` takes a path to a local file or remote URL, fetches its contents, and puts it in the work area for you to further edit and execute. It can be an extremely fast and convenient way of loading code from local disk or remote examples from sites such as the [Matplotlib gallery](http://matplotlib.sourceforge.net/gallery.html).
1155 1155 #
1156 1156 # Other than its enhanced capabilities for code and graphics, all of the features of IPython we've explained before remain functional in this graphical console.
1157 1157
1158 1158 ### The IPython Notebook
1159 1159
1160 1160 # The third way to interact with IPython, in addition to the terminal and graphical Qt console, is a powerful web interface called the "IPython Notebook". If you run at the system console (you can omit the `pylab` flags if you don't need plotting support):
1161 1161 #
1162 1162 # $ ipython notebook --pylab inline
1163 1163 #
1164 1164 # IPython will start a process that runs a web server in your local machine and to which a web browser can connect. The Notebook is a workspace that lets you execute code in blocks called 'cells' and displays any results and figures, but which can also contain arbitrary text (including LaTeX-formatted mathematical expressions) and any rich media that a modern web browser is capable of displaying.
1165 1165 #
1166 1166 # <center><img src="ipython-notebook-specgram-2.png" width=400px></center>
1167 1167
1168 1168 # % This cell is for the pdflatex output only
1169 1169 # \begin{figure}[htbp]
1170 1170 # \centering
1171 1171 # \includegraphics[width=3in]{ipython-notebook-specgram-2.png}
1172 1172 # \caption{The IPython Notebook: text, equations, code, results, graphics and other multimedia in an open format for scientific exploration and collaboration}
1173 1173 # \end{figure}
1174 1174
1175 1175 # In fact, this document was written as a Notebook, and only exported to LaTeX for printing. Inside of each cell, all the features of IPython that we have discussed before remain functional, since ultimately this web client is communicating with the same IPython code that runs in the terminal. But this interface is a much more rich and powerful environment for maintaining long-term "live and executable" scientific documents.
1176 1176 #
1177 1177 # Notebook environments have existed in commercial systems like Mathematica(TM) and Maple(TM) for a long time; in the open source world the [Sage](http://sagemath.org) project blazed this particular trail starting in 2006, and now we bring all the features that have made IPython such a widely used tool to a Notebook model.
1178 1178 #
1179 1179 # Since the Notebook runs as a web application, it is possible to configure it for remote access, letting you run your computations on a persistent server close to your data, which you can then access remotely from any browser-equipped computer. We encourage you to read the extensive documentation provided by the IPython project for details on how to do this and many more features of the notebook.
1180 1180 #
1181 1181 # Finally, as we said earlier, IPython also has a high-level and easy to use set of libraries for parallel computing, that let you control (interactively if desired) not just one IPython but an entire cluster of 'IPython engines'. Unfortunately a detailed discussion of these tools is beyond the scope of this text, but should you need to parallelize your analysis codes, a quick read of the tutorials and examples provided at the IPython site may prove fruitful.
General Comments 0
You need to be logged in to leave comments. Login now