{
"metadata": {
"name": "IntroNumPy"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"An Introduction to the Scientific Python Ecosystem"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary length integers, etc) and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library have facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulations (an essential building block of virtually all technical computing), nor any data visualization facilities.\n",
"\n",
"In particular, Python lists are very flexible containers that can be nested arbitrarily deep and which can hold any Python object in them, but they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices. In contrast, much of our modern heritage of scientific computing has been built on top of libraries written in the Fortran language, which has native support for vectors and matrices as well as a library of mathematical functions that can efficiently operate on entire arrays at once."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Scientific Python: a collaboration of projects built by scientists"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scientific community has developed a set of related Python libraries that provide powerful array facilities, linear algebra, numerical algorithms, data visualization and more. In this appendix, we will briefly outline the tools most frequently used for this purpose, that make \"Scientific Python\" something far more powerful than the Python language alone.\n",
"\n",
"For reasons of space, we can only describe in some detail the central Numpy library, but below we provide links to the websites of each project where you can read their documentation in more detail.\n",
"\n",
"First, let's look at an overview of the basic tools that most scientists use in daily research with Python. The core of this ecosystem is composed of:\n",
"\n",
"* Numpy: the basic library that most others depend on, it provides a powerful array type that can represent multidmensional datasets of many different kinds and that supports arithmetic operations. Numpy also provides a library of common mathematical functions, basic linear algebra, random number generation and Fast Fourier Transforms. Numpy can be found at [numpy.scipy.org](http://numpy.scipy.org)\n",
"\n",
"* Scipy: a large collection of numerical algorithms that operate on numpy arrays and provide facilities for many common tasks in scientific computing, including dense and sparse linear algebra support, optimization, special functions, statistics, n-dimensional image processing, signal processing and more. Scipy can be found at [scipy.org](http://scipy.org).\n",
"\n",
"* Matplotlib: a data visualization library with a strong focus on producing high-quality output, it supports a variety of common scientific plot types in two and three dimensions, with precise control over the final output and format for publication-quality results. Matplotlib can also be controlled interactively allowing graphical manipulation of your data (zooming, panning, etc) and can be used with most modern user interface toolkits. It can be found at [matplotlib.sf.net](http://matplotlib.sf.net).\n",
"\n",
"* IPython: while not strictly scientific in nature, IPython is the interactive environment in which many scientists spend their time. IPython provides a powerful Python shell that integrates tightly with Matplotlib and with easy access to the files and operating system, and which can execute in a terminal or in a graphical Qt console. IPython also has a web-based notebook interface that can combine code with text, mathematical expressions, figures and multimedia. It can be found at [ipython.org](http://ipython.org).\n",
"\n",
"While each of these tools can be installed separately, in our opinion the most convenient way today of accessing them (especially on Windows and Mac computers) is to install the [Free Edition of the Enthought Python Distribution](http://www.enthought.com/products/epd_free.php) which contain all the above. Other free alternatives on Windows (but not on Macs) are [Python(x,y)](http://code.google.com/p/pythonxy) and [ Christoph Gohlke's packages page](http://www.lfd.uci.edu/~gohlke/pythonlibs).\n",
"\n",
"These four 'core' libraries are in practice complemented by a number of other tools for more specialized work. We will briefly list here the ones that we think are the most commonly needed:\n",
"\n",
"* Sympy: a symbolic manipulation tool that turns a Python session into a computer algebra system. It integrates with the IPython notebook, rendering results in properly typeset mathematical notation. [sympy.org](http://sympy.org).\n",
"\n",
"* Mayavi: sophisticated 3d data visualization; [code.enthought.com/projects/mayavi](http://code.enthought.com/projects/mayavi).\n",
"\n",
"* Cython: a bridge language between Python and C, useful both to optimize performance bottlenecks in Python and to access C libraries directly; [cython.org](http://cython.org).\n",
"\n",
"* Pandas: high-performance data structures and data analysis tools, with powerful data alignment and structural manipulation capabilities; [pandas.pydata.org](http://pandas.pydata.org).\n",
"\n",
"* Statsmodels: statistical data exploration and model estimation; [statsmodels.sourceforge.net](http://statsmodels.sourceforge.net).\n",
"\n",
"* Scikit-learn: general purpose machine learning algorithms with a common interface; [scikit-learn.org](http://scikit-learn.org).\n",
"\n",
"* Scikits-image: image processing toolbox; [scikits-image.org](http://scikits-image.org).\n",
"\n",
"* NetworkX: analysis of complex networks (in the graph theoretical sense); [networkx.lanl.gov](http://networkx.lanl.gov).\n",
"\n",
"* PyTables: management of hierarchical datasets using the industry-standard HDF5 format; [www.pytables.org](http://www.pytables.org).\n",
"\n",
"Beyond these, for any specific problem you should look on the internet first, before starting to write code from scratch. There's a good chance that someone, somewhere, has written an open source library that you can use for part or all of your problem."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"A note about the examples below"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In all subsequent examples, you will see blocks of input code, followed by the results of the code if the code generated output. This output may include text, graphics and other result objects. These blocks of input can be pasted into your interactive IPython session or notebook for you to execute. In the print version of this document, a thin vertical bar on the left of the blocks of input and output shows which blocks go together.\n",
"\n",
"If you are reading this text as an actual IPython notebook, you can press `Shift-Enter` or use the 'play' button on the toolbar (right-pointing triangle) to execute each block of code, known as a 'cell' in IPython:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# This is a block of code, below you'll see its output\n",
"print \"Welcome to the world of scientific computing with Python!\""
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Welcome to the world of scientific computing with Python!\n"
]
}
],
"prompt_number": 71
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Motivation: the trapezoidal rule"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In subsequent sections we'll provide a basic introduction to the nuts and bolts of the basic scientific python tools; but we'll first motivate it with a brief example that illustrates what you can do in a few lines with these tools. For this, we will use the simple problem of approximating a definite integral with the trapezoid rule:\n",
"\n",
"$$\n",
"\\int_{a}^{b} f(x)\\, dx \\approx \\frac{1}{2} \\sum_{k=1}^{N} \\left( x_{k} - x_{k-1} \\right) \\left( f(x_{k}) + f(x_{k-1}) \\right).\n",
"$$\n",
"\n",
"Our task will be to compute this formula for a function such as:\n",
"\n",
"$$\n",
"f(x) = (x-3)(x-5)(x-7)+85\n",
"$$\n",
"\n",
"integrated between $a=1$ and $b=9$.\n",
"\n",
"First, we define the function and sample it evenly between 0 and 10 at 200 points:"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"def f(x):\n",
" return (x-3)*(x-5)*(x-7)+85\n",
"\n",
"import numpy as np\n",
"x = np.linspace(0, 10, 200)\n",
"y = f(x)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We select $a$ and $b$, our integration limits, and we take only a few points in that region to illustrate the error behavior of the trapezoid approximation:"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"a, b = 1, 9\n",
"xint = x[logical_and(x>=a, x<=b)][::30]\n",
"yint = y[logical_and(x>=a, x<=b)][::30]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's plot both the function and the area below it in the trapezoid approximation:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import matplotlib.pyplot as plt\n",
"plt.plot(x, y, lw=2)\n",
"plt.axis([0, 10, 0, 140])\n",
"plt.fill_between(xint, 0, yint, facecolor='gray', alpha=0.4)\n",
"plt.text(0.5 * (a + b), 30,r\"$\\int_a^b f(x)dx$\", horizontalalignment='center', fontsize=20);"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the integral both at high accuracy and with the trapezoid approximation"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from scipy.integrate import quad, trapz\n",
"integral, error = quad(f, 1, 9)\n",
"trap_integral = trapz(yint, xint)\n",
"print \"The integral is: %g +/- %.1e\" % (integral, error)\n",
"print \"The trapezoid approximation with\", len(xint), \"points is:\", trap_integral\n",
"print \"The absolute error is:\", abs(integral - trap_integral)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"The integral is: 680 +/- 7.5e-12\n",
"The trapezoid approximation with 6 points is: 621.286411141\n",
"The absolute error is: 58.7135888589\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This simple example showed us how, combining the numpy, scipy and matplotlib libraries we can provide an illustration of a standard method in elementary calculus with just a few lines of code. We will now discuss with more detail the basic usage of these tools."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"NumPy arrays: the right data structure for scientific computing"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Basics of Numpy arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now turn our attention to the Numpy library, which forms the base layer for the entire 'scipy ecosystem'. Once you have installed numpy, you can import it as"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"import numpy"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"though in this book we will use the common shorthand"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"import numpy as np"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, the main object provided by numpy is a powerful array. We'll start by exploring how the numpy array differs from Python lists. We start by creating a simple list and an array with the same contents of the list:"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"lst = [10, 20, 30, 40]\n",
"arr = np.array([10, 20, 30, 40])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Elements of a one-dimensional array are accessed with the same syntax as a list:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lst[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 8,
"text": [
"10"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 9,
"text": [
"10"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr[-1]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 10,
"text": [
"40"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr[2:]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 11,
"text": [
"array([30, 40])"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first difference to note between lists and arrays is that arrays are *homogeneous*; i.e. all elements of an array must be of the same type. In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lst[-1] = 'a string inside a list'\n",
"lst"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 12,
"text": [
"[10, 20, 30, 'a string inside a list']"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"but the same can not be done with an array, as we get an error message:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr[-1] = 'a string inside an array'"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "invalid literal for long() with base 10: 'a string inside an array'",
"output_type": "pyerr",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m/home/fperez/teach/book-math-labtool/\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0marr\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;33m-\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'a string inside an array'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mValueError\u001b[0m: invalid literal for long() with base 10: 'a string inside an array'"
]
}
],
"prompt_number": 13
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The information about the type of an array is contained in its *dtype* attribute:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr.dtype"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 14,
"text": [
"dtype('int32')"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once an array has been created, its dtype is fixed and it can only store elements of the same type. For this example where the dtype is integer, if we store a floating point number it will be automatically converted into an integer:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr[-1] = 1.234\n",
"arr"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 15,
"text": [
"array([10, 20, 30, 1])"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above we created an array from an existing list; now let us now see other ways in which we can create arrays, which we'll illustrate next. A common need is to have an array initialized with a constant value, and very often this value is 0 or 1 (suitable as starting value for additive and multiplicative loops respectively); `zeros` creates arrays of all zeros, with any desired dtype:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.zeros(5, float)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 16,
"text": [
"array([ 0., 0., 0., 0., 0.])"
]
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.zeros(3, int)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 17,
"text": [
"array([0, 0, 0])"
]
}
],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.zeros(3, complex)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 18,
"text": [
"array([ 0.+0.j, 0.+0.j, 0.+0.j])"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and similarly for `ones`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print '5 ones:', np.ones(5)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"5 ones: [ 1. 1. 1. 1. 1.]\n"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want an array initialized with an arbitrary value, we can create an empty array and then use the fill method to put the value we want into the array:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"a = empty(4)\n",
"a.fill(5.5)\n",
"a"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 20,
"text": [
"array([ 5.5, 5.5, 5.5, 5.5])"
]
}
],
"prompt_number": 20
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numpy also offers the `arange` function, which works like the builtin `range` but returns an array instead of a list:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.arange(5)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 21,
"text": [
"array([0, 1, 2, 3, 4])"
]
}
],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and the `linspace` and `logspace` functions to create linearly and logarithmically-spaced grids respectively, with a fixed number of points and including both ends of the specified interval:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print \"A linear grid between 0 and 1:\", np.linspace(0, 1, 5)\n",
"print \"A logarithmic grid between 10**1 and 10**4: \", np.logspace(1, 4, 4)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"A linear grid between 0 and 1: [ 0. 0.25 0.5 0.75 1. ]\n",
"A logarithmic grid between 10**1 and 10**4: [ 10. 100. 1000. 10000.]\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, it is often useful to create arrays with random numbers that follow a specific distribution. The `np.random` module contains a number of functions that can be used to this effect, for example this will produce an array of 5 random samples taken from a standard normal distribution (0 mean and variance 1):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.random.randn(5)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 23,
"text": [
"array([-0.08633343, -0.67375434, 1.00589536, 0.87081651, 1.65597822])"
]
}
],
"prompt_number": 23
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"whereas this will also give 5 samples, but from a normal distribution with a mean of 10 and a variance of 3:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"norm10 = np.random.normal(10, 3, 5)\n",
"norm10"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 24,
"text": [
"array([ 8.94879575, 5.53038269, 8.24847281, 12.14944165, 11.56209294])"
]
}
],
"prompt_number": 24
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Indexing with other arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above we saw how to index arrays with single numbers and slices, just like Python lists. But arrays allow for a more sophisticated kind of indexing which is very powerful: you can index an array with another array, and in particular with an array of boolean values. This is particluarly useful to extract information from an array that matches a certain condition.\n",
"\n",
"Consider for example that in the array `norm10` we want to replace all values above 9 with the value 0. We can do so by first finding the *mask* that indicates where this condition is true or false:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"mask = norm10 > 9\n",
"mask"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 25,
"text": [
"array([False, False, False, True, True], dtype=bool)"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have this mask, we can use it to either read those values or to reset them to 0:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Values above 9:', norm10[mask]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Values above 9: [ 12.14944165 11.56209294]\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Resetting all values above 9 to 0...'\n",
"norm10[mask] = 0\n",
"print norm10"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Resetting all values above 9 to 0...\n",
"[ 8.94879575 5.53038269 8.24847281 0. 0. ]\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Arrays with more than one dimension"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Up until now all our examples have used one-dimensional arrays. But Numpy can create arrays of aribtrary dimensions, and all the methods illustrated in the previous section work with more than one dimension. For example, a list of lists can be used to initialize a two dimensional array:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lst2 = [[1, 2], [3, 4]]\n",
"arr2 = np.array([[1, 2], [3, 4]])\n",
"arr2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 28,
"text": [
"array([[1, 2],\n",
" [3, 4]])"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With two-dimensional arrays we start seeing the power of numpy: while a nested list can be indexed using repeatedly the `[ ]` operator, multidimensional arrays support a much more natural indexing syntax with a single `[ ]` and a set of indices separated by commas:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print lst2[0][1]\n",
"print arr2[0,1]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"2\n",
"2\n"
]
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the array creation functions listed above can be used with more than one dimension, for example:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.zeros((2,3))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 30,
"text": [
"array([[ 0., 0., 0.],\n",
" [ 0., 0., 0.]])"
]
}
],
"prompt_number": 30
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.random.normal(10, 3, (2, 4))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 31,
"text": [
"array([[ 11.26788826, 4.29619866, 11.09346496, 9.73861307],\n",
" [ 10.54025996, 9.5146268 , 10.80367214, 13.62204505]])"
]
}
],
"prompt_number": 31
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In fact, the shape of an array can be changed at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr = np.arange(8).reshape(2,4)\n",
"print arr"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[0 1 2 3]\n",
" [4 5 6 7]]\n"
]
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions (using the same array as above):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Slicing in the second row:', arr[1, 2:4]\n",
"print 'All rows, third column :', arr[:, 2]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Slicing in the second row: [6 7]\n",
"All rows, third column : [2 6]\n"
]
}
],
"prompt_number": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you only provide one index, then you will get an array with one less dimension containing that row:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'First row: ', arr[0]\n",
"print 'Second row: ', arr[1]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"First row: [0 1 2 3]\n",
"Second row: [4 5 6 7]\n"
]
}
],
"prompt_number": 34
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have seen how to create arrays with more than one dimension, it's a good idea to look at some of the most useful properties and methods that arrays have. The following provide basic information about the size, shape and data in the array:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Data type :', arr.dtype\n",
"print 'Total number of elements :', arr.size\n",
"print 'Number of dimensions :', arr.ndim\n",
"print 'Shape (dimensionality) :', arr.shape\n",
"print 'Memory used (in bytes) :', arr.nbytes"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Data type : int32\n",
"Total number of elements : 8\n",
"Number of dimensions : 2\n",
"Shape (dimensionality) : (2, 4)\n",
"Memory used (in bytes) : 32\n"
]
}
],
"prompt_number": 35
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Arrays also have many useful methods, some especially useful ones are:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Minimum and maximum :', arr.min(), arr.max()\n",
"print 'Sum and product of all elements :', arr.sum(), arr.prod()\n",
"print 'Mean and standard deviation :', arr.mean(), arr.std()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Minimum and maximum : 0 7\n",
"Sum and product of all elements : 28 0\n",
"Mean and standard deviation : 3.5 2.29128784748\n"
]
}
],
"prompt_number": 36
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For these methods, the above operations area all computed on all the elements of the array. But for a multidimensional array, it's possible to do the computation along a single dimension, by passing the `axis` parameter; for example:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'For the following array:\\n', arr\n",
"print 'The sum of elements along the rows is :', arr.sum(axis=1)\n",
"print 'The sum of elements along the columns is :', arr.sum(axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"For the following array:\n",
"[[0 1 2 3]\n",
" [4 5 6 7]]\n",
"The sum of elements along the rows is : [ 6 22]\n",
"The sum of elements along the columns is : [ 4 6 8 10]\n"
]
}
],
"prompt_number": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see in this example, the value of the `axis` parameter is the dimension which will be *consumed* once the operation has been carried out. This is why to sum along the rows we use `axis=0`.\n",
"\n",
"This can be easily illustrated with an example that has more dimensions; we create an array with 4 dimensions and shape `(3,4,5,6)` and sum along the axis number 2 (i.e. the *third* axis, since in Python all counts are 0-based). That consumes the dimension whose length was 5, leaving us with a new array that has shape `(3,4,6)`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.zeros((3,4,5,6)).sum(2).shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 38,
"text": [
"(3, 4, 6)"
]
}
],
"prompt_number": 38
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another widely used property of arrays is the `.T` attribute, which allows you to access the transpose of the array:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Array:\\n', arr\n",
"print 'Transpose:\\n', arr.T"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Array:\n",
"[[0 1 2 3]\n",
" [4 5 6 7]]\n",
"Transpose:\n",
"[[0 4]\n",
" [1 5]\n",
" [2 6]\n",
" [3 7]]\n"
]
}
],
"prompt_number": 39
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We don't have time here to look at all the methods and properties of arrays, here's a complete list. Simply try exploring some of these IPython to learn more, or read their description in the full Numpy documentation:\n",
"\n",
" arr.T arr.copy arr.getfield arr.put arr.squeeze\n",
" arr.all arr.ctypes arr.imag arr.ravel arr.std\n",
" arr.any arr.cumprod arr.item arr.real arr.strides\n",
" arr.argmax arr.cumsum arr.itemset arr.repeat arr.sum\n",
" arr.argmin arr.data arr.itemsize arr.reshape arr.swapaxes\n",
" arr.argsort arr.diagonal arr.max arr.resize arr.take\n",
" arr.astype arr.dot arr.mean arr.round arr.tofile\n",
" arr.base arr.dtype arr.min arr.searchsorted arr.tolist\n",
" arr.byteswap arr.dump arr.nbytes arr.setasflat arr.tostring\n",
" arr.choose arr.dumps arr.ndim arr.setfield arr.trace\n",
" arr.clip arr.fill arr.newbyteorder arr.setflags arr.transpose\n",
" arr.compress arr.flags arr.nonzero arr.shape arr.var\n",
" arr.conj arr.flat arr.prod arr.size arr.view\n",
" arr.conjugate arr.flatten arr.ptp arr.sort "
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Operating with arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Arrays support all regular arithmetic operators, and the numpy library also contains a complete collection of basic mathematical functions that operate on arrays. It is important to remember that in general, all operations with arrays are applied *element-wise*, i.e., are applied to all the elements of the array at the same time. Consider for example:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr1 = np.arange(4)\n",
"arr2 = np.arange(10, 14)\n",
"print arr1, '+', arr2, '=', arr1+arr2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[0 1 2 3] + [10 11 12 13] = [10 12 14 16]\n"
]
}
],
"prompt_number": 40
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Importantly, you must remember that even the multiplication operator is by default applied element-wise, it is *not* the matrix multiplication from linear algebra (as is the case in Matlab, for example):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print arr1, '*', arr2, '=', arr1*arr2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[0 1 2 3] * [10 11 12 13] = [ 0 11 24 39]\n"
]
}
],
"prompt_number": 41
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While this means that in principle arrays must always match in their dimensionality in order for an operation to be valid, numpy will *broadcast* dimensions when possible. For example, suppose that you want to add the number 1.5 to `arr1`; the following would be a valid way to do it:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr1 + 1.5*np.ones(4)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 42,
"text": [
"array([ 1.5, 2.5, 3.5, 4.5])"
]
}
],
"prompt_number": 42
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But thanks to numpy's broadcasting rules, the following is equally valid:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr1 + 1.5"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 43,
"text": [
"array([ 1.5, 2.5, 3.5, 4.5])"
]
}
],
"prompt_number": 43
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, numpy looked at both operands and saw that the first (`arr1`) was a one-dimensional array of length 4 and the second was a scalar, considered a zero-dimensional object. The broadcasting rules allow numpy to:\n",
"\n",
"* *create* new dimensions of length 1 (since this doesn't change the size of the array)\n",
"* 'stretch' a dimension of length 1 that needs to be matched to a dimension of a different size.\n",
"\n",
"So in the above example, the scalar 1.5 is effectively:\n",
"\n",
"* first 'promoted' to a 1-dimensional array of length 1\n",
"* then, this array is 'stretched' to length 4 to match the dimension of `arr1`.\n",
"\n",
"After these two operations are complete, the addition can proceed as now both operands are one-dimensional arrays of length 4.\n",
"\n",
"This broadcasting behavior is in practice enormously powerful, especially because when numpy broadcasts to create new dimensions or to 'stretch' existing ones, it doesn't actually replicate the data. In the example above the operation is carried *as if* the 1.5 was a 1-d array with 1.5 in all of its entries, but no actual array was ever created. This can save lots of memory in cases when the arrays in question are large and can have significant performance implications.\n",
"\n",
"The general rule is: when operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward, creating dimensions of length 1 as needed. Two dimensions are considered compatible when\n",
"\n",
"* they are equal to begin with, or\n",
"* one of them is 1; in this case numpy will do the 'stretching' to make them equal.\n",
"\n",
"If these conditions are not met, a `ValueError: frames are not aligned` exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This shows how the broadcasting rules work in several dimensions:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"b = np.array([2, 3, 4, 5])\n",
"print arr, '\\n\\n+', b , '\\n----------------\\n', arr + b"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[0 1 2 3]\n",
" [4 5 6 7]] \n",
"\n",
"+ [2 3 4 5] \n",
"----------------\n",
"[[ 2 4 6 8]\n",
" [ 6 8 10 12]]\n"
]
}
],
"prompt_number": 44
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, how could you use broadcasting to say add `[4, 6]` along the rows to `arr` above? Simply performing the direct addition will produce the error we previously mentioned:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"c = np.array([4, 6])\n",
"arr + c"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "ValueError",
"evalue": "operands could not be broadcast together with shapes (2,4) (2) ",
"output_type": "pyerr",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[1;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m/home/fperez/teach/book-math-labtool/\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0marray\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1;33m,\u001b[0m \u001b[1;36m6\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0marr\u001b[0m \u001b[1;33m+\u001b[0m \u001b[0mc\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mValueError\u001b[0m: operands could not be broadcast together with shapes (2,4) (2) "
]
}
],
"prompt_number": 45
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"According to the rules above, the array `c` would need to have a *trailing* dimension of 1 for the broadcasting to work. It turns out that numpy allows you to 'inject' new dimensions anywhere into an array on the fly, by indexing it with the special object `np.newaxis`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"(c[:, np.newaxis]).shape"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 46,
"text": [
"(2, 1)"
]
}
],
"prompt_number": 46
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is exactly what we need, and indeed it works:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr + c[:, np.newaxis]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 47,
"text": [
"array([[ 4, 5, 6, 7],\n",
" [10, 11, 12, 13]])"
]
}
],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the full broadcasting rules, please see the official Numpy docs, which describe them in detail and with more complex examples."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we mentioned before, Numpy ships with a full complement of mathematical functions that work on entire arrays, including logarithms, exponentials, trigonometric and hyperbolic trigonometric functions, etc. Furthermore, scipy ships a rich special function library in the `scipy.special` module that includes Bessel, Airy, Fresnel, Laguerre and other classical special functions. For example, sampling the sine function at 100 points between $0$ and $2\\pi$ is as simple as:"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"x = np.linspace(0, 2*np.pi, 100)\n",
"y = np.sin(x)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 48
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Linear algebra in numpy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numpy ships with a basic linear algebra library, and all arrays have a `dot` method whose behavior is that of the scalar dot product when its arguments are vectors (one-dimensional arrays) and the traditional matrix multiplication when one or both of its arguments are two-dimensional arrays:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"v1 = np.array([2, 3, 4])\n",
"v2 = np.array([1, 0, 1])\n",
"print v1, '.', v2, '=', v1.dot(v2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[2 3 4] . [1 0 1] = 6\n"
]
}
],
"prompt_number": 49
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a regular matrix-vector multiplication, note that the array `v1` should be viewed as a *column* vector in traditional linear algebra notation; numpy makes no distinction between row and column vectors and simply verifies that the dimensions match the required rules of matrix multiplication, in this case we have a $2 \\times 3$ matrix multiplied by a 3-vector, which produces a 2-vector:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"A = np.arange(6).reshape(2, 3)\n",
"print A, 'x', v1, '=', A.dot(v1)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[0 1 2]\n",
" [3 4 5]] x [2 3 4] = [11 38]\n"
]
}
],
"prompt_number": 50
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For matrix-matrix multiplication, the same dimension-matching rules must be satisfied, e.g. consider the difference between $A \\times A^T$:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print A.dot(A.T)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 5 14]\n",
" [14 50]]\n"
]
}
],
"prompt_number": 51
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and $A^T \\times A$:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print A.T.dot(A)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 9 12 15]\n",
" [12 17 22]\n",
" [15 22 29]]\n"
]
}
],
"prompt_number": 52
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Furthermore, the `numpy.linalg` module includes additional functionality such as determinants, matrix norms, Cholesky, eigenvalue and singular value decompositions, etc. For even more linear algebra tools, `scipy.linalg` contains the majority of the tools in the classic LAPACK libraries as well as functions to operate on sparse matrices. We refer the reader to the Numpy and Scipy documentations for additional details on these."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Reading and writing arrays to disk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numpy lets you read and write arrays into files in a number of ways. In order to use these tools well, it is critical to understand the difference between a *text* and a *binary* file containing numerical data. In a text file, the number $\\pi$ could be written as \"3.141592653589793\", for example: a string of digits that a human can read, with in this case 15 decimal digits. In contrast, that same number written to a binary file would be encoded as 8 characters (bytes) that are not readable by a human but which contain the exact same data that the variable `pi` had in the computer's memory. \n",
"\n",
"The tradeoffs between the two modes are thus:\n",
"\n",
"* Text mode: occupies more space, precision can be lost (if not all digits are written to disk), but is readable and editable by hand with a text editor. Can *only* be used for one- and two-dimensional arrays.\n",
"\n",
"* Binary mode: compact and exact representation of the data in memory, can't be read or edited by hand. Arrays of any size and dimensionality can be saved and read without loss of information.\n",
"\n",
"First, let's see how to read and write arrays in text mode. The `np.savetxt` function saves an array to a text file, with options to control the precision, separators and even adding a header:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr = np.arange(10).reshape(2, 5)\n",
"np.savetxt('test.out', arr, fmt='%.2e', header=\"My dataset\")\n",
"!cat test.out"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"# My dataset\n",
"0.00e+00 1.00e+00 2.00e+00 3.00e+00 4.00e+00\n",
"5.00e+00 6.00e+00 7.00e+00 8.00e+00 9.00e+00\n"
]
}
],
"prompt_number": 53
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And this same type of file can then be read with the matching `np.loadtxt` function:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"arr2 = np.loadtxt('test.out')\n",
"print arr2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 0. 1. 2. 3. 4.]\n",
" [ 5. 6. 7. 8. 9.]]\n"
]
}
],
"prompt_number": 54
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For binary data, Numpy provides the `np.save` and `np.savez` routines. The first saves a single array to a file with `.npy` extension, while the latter can be used to save a *group* of arrays into a single file with `.npz` extension. The files created with these routines can then be read with the `np.load` function.\n",
"\n",
"Let us first see how to use the simpler `np.save` function to save a single array:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.save('test.npy', arr2)\n",
"# Now we read this back\n",
"arr2n = np.load('test.npy')\n",
"# Let's see if any element is non-zero in the difference.\n",
"# A value of True would be a problem.\n",
"print 'Any differences?', np.any(arr2-arr2n)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Any differences? False\n"
]
}
],
"prompt_number": 55
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let us see how the `np.savez` function works. You give it a filename and either a sequence of arrays or a set of keywords. In the first mode, the function will auotmatically name the saved arrays in the archive as `arr_0`, `arr_1`, etc:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.savez('test.npz', arr, arr2)\n",
"arrays = np.load('test.npz')\n",
"arrays.files"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 56,
"text": [
"['arr_1', 'arr_0']"
]
}
],
"prompt_number": 56
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we can explicitly choose how to name the arrays we save:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.savez('test.npz', array1=arr, array2=arr2)\n",
"arrays = np.load('test.npz')\n",
"arrays.files"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 57,
"text": [
"['array2', 'array1']"
]
}
],
"prompt_number": 57
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The object returned by `np.load` from an `.npz` file works like a dictionary, though you can also access its constituent files by attribute using its special `.f` field; this is best illustrated with an example with the `arrays` object from above:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'First row of first array:', arrays['array1'][0]\n",
"# This is an equivalent way to get the same field\n",
"print 'First row of first array:', arrays.f.array1[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"First row of first array: [0 1 2 3 4]\n",
"First row of first array: [0 1 2 3 4]\n"
]
}
],
"prompt_number": 58
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This `.npz` format is a very convenient way to package compactly and without loss of information, into a single file, a group of related arrays that pertain to a specific problem. At some point, however, the complexity of your dataset may be such that the optimal approach is to use one of the standard formats in scientific data processing that have been designed to handle complex datasets, such as NetCDF or HDF5. \n",
"\n",
"Fortunately, there are tools for manipulating these formats in Python, and for storing data in other ways such as databases. A complete discussion of the possibilities is beyond the scope of this discussion, but of particular interest for scientific users we at least mention the following:\n",
"\n",
"* The `scipy.io` module contains routines to read and write Matlab files in `.mat` format and files in the NetCDF format that is widely used in certain scientific disciplines.\n",
"\n",
"* For manipulating files in the HDF5 format, there are two excellent options in Python: The PyTables project offers a high-level, object oriented approach to manipulating HDF5 datasets, while the h5py project offers a more direct mapping to the standard HDF5 library interface. Both are excellent tools; if you need to work with HDF5 datasets you should read some of their documentation and examples and decide which approach is a better match for your needs."
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"High quality data visualization with Matplotlib"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [matplotlib](http://matplotlib.sf.net) library is a powerful tool capable of producing complex publication-quality figures with fine layout control in two and three dimensions; here we will only provide a minimal self-contained introduction to its usage that covers the functionality needed for the rest of the book. We encourage the reader to read the tutorials included with the matplotlib documentation as well as to browse its extensive gallery of examples that include source code.\n",
"\n",
"Just as we typically use the shorthand `np` for Numpy, we will use `plt` for the `matplotlib.pyplot` module where the easy-to-use plotting functions reside (the library contains a rich object-oriented architecture that we don't have the space to discuss here):"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"import matplotlib.pyplot as plt"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 59
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most frequently used function is simply called `plot`, here is how you can make a simple plot of $\\sin(x)$ for $x \\in [0, 2\\pi]$ with labels and a grid (we use the semicolon in the last line to suppress the display of some information that is unnecessary right now):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"x = np.linspace(0, 2*np.pi)\n",
"y = np.sin(x)\n",
"plt.plot(x,y, label='sin(x)')\n",
"plt.legend()\n",
"plt.grid()\n",
"plt.title('Harmonic')\n",
"plt.xlabel('x')\n",
"plt.ylabel('y');"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"prompt_number": 60
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can control the style, color and other properties of the markers, for example:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plt.plot(x, y, linewidth=2);"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"prompt_number": 61
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plt.plot(x, y, 'o', markersize=5, color='r');"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"prompt_number": 62
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will now see how to create a few other common plot types, such as a simple error plot:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# example data\n",
"x = np.arange(0.1, 4, 0.5)\n",
"y = np.exp(-x)\n",
"\n",
"# example variable error bar values\n",
"yerr = 0.1 + 0.2*np.sqrt(x)\n",
"xerr = 0.1 + yerr\n",
"\n",
"# First illustrate basic pyplot interface, using defaults where possible.\n",
"plt.figure()\n",
"plt.errorbar(x, y, xerr=0.2, yerr=0.4)\n",
"plt.title(\"Simplest errorbars, 0.2 in x, 0.4 in y\");"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"prompt_number": 63
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A simple log plot"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"x = np.linspace(-5, 5)\n",
"y = np.exp(-x**2)\n",
"plt.semilogy(x, y);"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"prompt_number": 64
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A histogram annotated with text inside the plot, using the `text` function:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"mu, sigma = 100, 15\n",
"x = mu + sigma * np.random.randn(10000)\n",
"\n",
"# the histogram of the data\n",
"n, bins, patches = plt.hist(x, 50, normed=1, facecolor='g', alpha=0.75)\n",
"\n",
"plt.xlabel('Smarts')\n",
"plt.ylabel('Probability')\n",
"plt.title('Histogram of IQ')\n",
"# This will put a text fragment at the position given:\n",
"plt.text(55, .027, r'$\\mu=100,\\ \\sigma=15$', fontsize=14)\n",
"plt.axis([40, 160, 0, 0.03])\n",
"plt.grid(True)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"prompt_number": 65
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Image display"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `imshow` command can display single or multi-channel images. A simple array of random numbers, plotted in grayscale:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from matplotlib import cm\n",
"plt.imshow(np.random.rand(5, 10), cmap=cm.gray, interpolation='nearest');"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"prompt_number": 66
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A real photograph is a multichannel image, `imshow` interprets it correctly:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"img = plt.imread('stinkbug.png')\n",
"print 'Dimensions of the array img:', img.shape\n",
"plt.imshow(img);"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Dimensions of the array img: (375, 500, 3)\n"
]
},
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"\n"
]
}
],
"prompt_number": 67
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Simple 3d plotting with matplotlib"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that you must execute at least once in your session:"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"from mpl_toolkits.mplot3d import Axes3D"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 68
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One this has been done, you can create 3d axes with the `projection='3d'` keyword to `add_subplot`:\n",
"\n",
" fig = plt.figure()\n",
" fig.add_subplot(, projection='3d')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A simple surface plot:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from mpl_toolkits.mplot3d.axes3d import Axes3D\n",
"from matplotlib import cm\n",
"\n",
"fig = plt.figure()\n",
"ax = fig.add_subplot(1, 1, 1, projection='3d')\n",
"X = np.arange(-5, 5, 0.25)\n",
"Y = np.arange(-5, 5, 0.25)\n",
"X, Y = np.meshgrid(X, Y)\n",
"R = np.sqrt(X**2 + Y**2)\n",
"Z = np.sin(R)\n",
"surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.jet,\n",
" linewidth=0, antialiased=False)\n",
"ax.set_zlim3d(-1.01, 1.01);"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "display_data",
"svg": [
"\n",
"\n",
"\n",
"