A bit of Python

Python is available for your favorite platform on the downloads page (or you can choose the Anaconda distribution instead). You can use the Python interpreter interactively by typing python at a terminal window. However, for data analysis we recommend IPython, which is a nicer front end to Python (it works with either a regular Python install or Anaconda). If you have installed it (or are using department machines), it can be invoked with:

ipython

To quit, press Ctrl-D.

To run python code in a file code.py, either type

run code.py

in the ipython interpreter, or

python code.py

at the unix command line.

In addition to Python statements and expressions, IPython lets you type shell commands and its own special "magic" commands, and it integrates well with matplotlib, a widely used Python plotting library.

You can use Python as a calculator. For example, what is the value of $(100\cdot 2 - 12^2) / 7 \cdot 5 + 2$?

  In [301]: (100*2 - 12**2) / 7*5 + 2
  Out[301]: 42
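The integer result in the session above is what an interpreter with integer division prints (Python 2 behaves this way). Assuming you are running Python 3, the / operator always performs true division and returns a float, while // performs floor division. A minimal sketch of the difference, using the same expression:

```python
# In Python 3, / is true division and always returns a float.
true_div = (100*2 - 12**2) / 7*5 + 2

# // is floor division; with integer operands the result stays an integer.
floor_div = (100*2 - 12**2) // 7*5 + 2

print(true_div)   # 42.0
print(floor_div)  # 42
```

Both values are numerically equal here because 56 is divisible by 7; for other inputs floor division discards the fractional part.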

In order to compute something like $\sin(\pi/2)$ we first need to import the math module:

In [303]: import math
In [304]: math.sin(math.pi/2)
Out[304]: 1.0

How do I find out what other mathematical functions are available?

  help("math")
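Besides reading the help text, you can list the module's names programmatically with dir. The following is only a small sample of what the math module offers; the particular functions chosen here are for illustration:

```python
import math

# A few of the functions the math module provides.
print(math.sqrt(16))    # 4.0
print(math.floor(2.7))  # 2
print(math.exp(0))      # 1.0

# dir() returns every name defined in the module; skip the private ones.
names = [n for n in dir(math) if not n.startswith("_")]
print("sin" in names)   # True
```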

Linear algebra in Python

Can I work with vectors and matrices in python?

Of course! Every data analysis tool worth its bytes should. The numpy package provides the required magic.

Vectors and matrices are all represented as numpy arrays. First, some vectors:

In [1]: import numpy as np
 
In [2]: x = np.array([1,1])

We can multiply a vector by a scalar:

In [3]: x * 2
Out[3]: array([2, 2])

And we can add vectors:

In [4]: x + np.array([1,0])
Out[4]: array([2, 1])
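Note that these operators behave differently on plain Python lists, which is one reason to convert data to numpy arrays first. A short comparison sketch:

```python
import numpy as np

x = np.array([1, 1])

# On arrays, * and + act element by element.
scaled = x * 2                     # array([2, 2])
summed = x + np.array([1, 0])      # array([2, 1])

# On lists, * repeats the list and + concatenates.
repeated = [1, 1] * 2              # [1, 1, 1, 1]
joined = [1, 1] + [1, 0]           # [1, 1, 1, 0]

print(scaled, summed)
print(repeated, joined)
```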

After we introduce matrices, we'll show how to do inner products.

Let's create an array that represents the following matrix: \[\left ( \begin{array}{cc} 1 & 2\\ 4 & 3\\ 5 & 6 \end{array} \right ) \]

In [18]: X = np.array([[1,2], [4,3], [5,6]])
 
In [19]: X
Out[19]: 
array([[1, 2],
       [4, 3],
       [5, 6]])

We'll think of $X$ as the feature matrix of a machine learning dataset. To access a row of the matrix (the features of a single example in the dataset; indexing starts at 0, so this is the first example):

In [20]: X[0]
Out[20]: array([1, 2])

To access a column of the matrix (a single feature):

In [21]: X[:,0]
Out[21]: array([1, 4, 5])

Let's construct a weight vector for a linear classifier:

In [20]: w = np.array([1,-1])

We can easily compute the dot/inner product of a row of $X$ with the weight vector:

In [21]: np.inner(w, X[0])
Out[21]: -1

We can even compute the inner products for all the rows of the matrix all at once:

In [22]: np.inner(w, X)
Out[22]: array([-1,  1, -1])
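The same scores can be written as a matrix-vector product. The sketch below also turns the scores into class labels by taking their sign; that decision rule is an assumption made for illustration and is not defined in the text above.

```python
import numpy as np

X = np.array([[1, 2], [4, 3], [5, 6]])   # one example per row
w = np.array([1, -1])                    # weight vector

# Three equivalent ways to compute the inner product of w with every row.
scores_inner = np.inner(w, X)
scores_dot = np.dot(X, w)
scores_matmul = X @ w

print(scores_matmul)        # [-1  1 -1]

# Assumed decision rule: predict +1 for a positive score, -1 for a negative one.
labels = np.sign(scores_matmul)
print(labels)               # [-1  1 -1]
```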

Let's construct another matrix, this time a $2 \times 3$ matrix filled with 2s:

In [33]: A = np.ones((2,3)) * 2
 
In [34]: A
Out[34]: 
array([[ 2.,  2.,  2.],
       [ 2.,  2.,  2.]])

Let's look for a way to compute the matrix product $A \times X$. Our first guess would be to try the multiplication operator, since we saw above that we can multiply a matrix by a scalar:

In [36]: A * X
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-578836d375ce> in <module>()
----> 1 A * X

So it didn't work: the shapes (2,3) and (3,2) cannot be broadcast together. The * operator performs component-wise multiplication, so it applies to arrays of the same (or broadcast-compatible) shape, and it is not the matrix product. Instead, use the numpy function dot to do matrix multiplication:

In [37]: np.dot(A, X)
Out[37]: 
array([[ 20.,  22.],
       [ 20.,  22.]])

It turns out that dot is also a method of arrays, so you can also write:

In [38]: A.dot(X)
Out[38]: 
array([[ 20.,  22.],
       [ 20.,  22.]])
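Assuming Python 3.5 or later and a reasonably recent numpy, the @ operator is another way to write the matrix product. The sketch below checks the shapes involved and contrasts @ with the component-wise * operator:

```python
import numpy as np

A = np.ones((2, 3)) * 2
X = np.array([[1, 2], [4, 3], [5, 6]])

# A (2x3) times X (3x2): the inner dimensions match, so the product is 2x2.
P = A @ X
print(P.shape)                        # (2, 2)
print(np.array_equal(P, A.dot(X)))    # True

# * is component-wise, so it needs operands of compatible shape.
E = A * A
print(E.shape)                        # (2, 3)
```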

An array is transposed with the transpose method or, equivalently, the T attribute:

In [39]: A.transpose()
Out[39]: 
array([[ 2.,  2.],
       [ 2.,  2.],
       [ 2.,  2.]])
 
In [41]: A.T
Out[41]: 
array([[ 2.,  2.],
       [ 2.,  2.],
       [ 2.,  2.]])
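A quick way to convince yourself what the transpose does is to look at shapes. The sketch below also checks the identity $(AX)^T = X^T A^T$ on the matrices used in this section:

```python
import numpy as np

A = np.ones((2, 3)) * 2
X = np.array([[1, 2], [4, 3], [5, 6]])

# Transposing swaps the two dimensions.
print(A.shape, A.T.shape)     # (2, 3) (3, 2)

# The transpose of a product is the product of the transposes in reverse order.
lhs = (A @ X).T
rhs = X.T @ A.T
print(np.array_equal(lhs, rhs))   # True
```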

And let's take a look at all the methods and attributes that an array has (in IPython, type A. and press the Tab key):

In [39]: A.
A.T             A.cumsum        A.min           A.shape
A.all           A.data          A.nbytes        A.size
A.any           A.diagonal      A.ndim          A.sort
A.argmax        A.dot           A.newbyteorder  A.squeeze
A.argmin        A.dtype         A.nonzero       A.std
A.argpartition  A.dump          A.partition     A.strides
A.argsort       A.dumps         A.prod          A.sum
A.astype        A.fill          A.ptp           A.swapaxes
A.base          A.flags         A.put           A.take
A.byteswap      A.flat          A.ravel         A.tobytes
A.choose        A.flatten       A.real          A.tofile
A.clip          A.getfield      A.repeat        A.tolist
A.compress      A.imag          A.reshape       A.tostring
A.conj          A.item          A.resize        A.trace
A.conjugate     A.itemset       A.round         A.transpose
A.copy          A.itemsize      A.searchsorted  A.var
A.ctypes        A.max           A.setfield      A.view
A.cumprod       A.mean          A.setflags  

Elements and sub-matrices are easily extracted:

In [42]: X
Out[42]: 
array([[1, 2],
       [4, 3],
       [5, 6]])
 
In [43]: X[0,0]
Out[43]: 1
 
In [44]: X[-1,-1]
Out[44]: 6
 
In [46]: X[0:2, 0:2]
Out[46]: 
array([[1, 2],
       [4, 3]])
 
# my favorite way of indexing:  using a list (or array) of row indices!
In [47]: X[ [0,2] ]
Out[47]: 
array([[1, 2],
       [5, 6]])
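The same idea extends to columns, and an array of booleans can be used as an index as well. The sketch below is illustrative only; the threshold used for the boolean mask is an arbitrary assumption:

```python
import numpy as np

X = np.array([[1, 2], [4, 3], [5, 6]])

# Select rows 0 and 2 with a list of indices.
rows = X[[0, 2]]              # [[1, 2], [5, 6]]

# Select column 1 but keep the result two-dimensional.
col = X[:, [1]]               # [[2], [3], [6]]

# Boolean mask: keep the rows whose first feature is greater than 1.
mask = X[:, 0] > 1            # [False, True, True]
selected = X[mask]            # [[4, 3], [5, 6]]

print(rows)
print(col.shape)              # (3, 1)
print(selected)
```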

How do I find the inverse of a matrix?

In [2]: z = np.array([[2,1,1],[1,2,2],[2,3,4]])
 
In [3]: z
Out[3]: 
array([[2, 1, 1],
       [1, 2, 2],
       [2, 3, 4]])
 
In [4]: np.linalg.inv(z)
Out[4]: 
array([[ 0.66666667, -0.33333333,  0.        ],
       [ 0.        ,  2.        , -1.        ],
       [-0.33333333, -1.33333333,  1.        ]])
 
In [5]: np.dot(z, np.linalg.inv(z))
Out[5]: 
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
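Because the inverse is computed in floating point, the product is only approximately the identity, so it is safer to compare with np.allclose than with exact equality. If the goal is to solve a linear system, np.linalg.solve avoids forming the inverse at all. A sketch, where the right-hand side b is a hypothetical example:

```python
import numpy as np

z = np.array([[2, 1, 1], [1, 2, 2], [2, 3, 4]])
z_inv = np.linalg.inv(z)

# Compare with the identity up to floating-point tolerance.
print(np.allclose(z @ z_inv, np.eye(3)))   # True

# Hypothetical right-hand side: the second column of z.
b = np.array([1, 2, 3])

# Solve z x = b directly, without computing the inverse.
x = np.linalg.solve(z, b)
print(np.allclose(x, [0, 1, 0]))           # True
```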

Plotting

Let's get on to that all-important step of visualizing data. We will be using the matplotlib Python package for that. Let's start by plotting the function $f(x) = x^2$.

First, let's generate the numbers. Well, there are tons of ways to do so. Python has some nifty syntax for generating lists. Watch this! A list comprehension!!

In [9]: f = [i**2 for i in range(10)]
 
In [10]: f
Out[10]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

There's an alternative way of doing this using numpy:

In [12]: f = np.arange(10)**2
 
In [13]: f
Out[13]: array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])
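The two approaches produce the same numbers, but the numpy version supports further element-wise arithmetic directly, whereas a list needs another comprehension. A small comparison sketch (the variable names are chosen here for illustration):

```python
import numpy as np

squares_list = [i**2 for i in range(10)]    # plain Python list
squares_array = np.arange(10)**2            # numpy array

# Same values either way.
print(squares_array.tolist() == squares_list)   # True

# Adding 1 to every element: direct on the array, a comprehension on the list.
shifted_array = squares_array + 1
shifted_list = [s + 1 for s in squares_list]
print(shifted_array.tolist() == shifted_list)   # True
```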

To plot the data, first import the pyplot module.

In [6]: import matplotlib.pyplot as plt
 
In [7]: plt.plot(range(10), f, 'ob')
Out[7]: [<matplotlib.lines.Line2D at 0x10549b590>]

To actually display the plot, call:

In [8]: plt.show()

As an alternative, you can put matplotlib in interactive mode before plotting using the command plt.ion(). Also note that plotting functions accept either Python lists or numpy arrays.

We can add a second plot to the same axes by calling plot again. Here we plot the line $y = x$ over the same range, using red diamonds:

In [16]: plt.plot(range(10), range(10), 'dr')
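The plotting steps can also be collected into a standalone script. The sketch below makes a few assumptions that are not part of the session above: it selects the non-interactive Agg backend so that no window is required, uses the explicit figure/axes interface, adds a legend, and writes the image into an in-memory buffer instead of calling plt.show().

```python
import io

import matplotlib
matplotlib.use("Agg")   # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
f = x**2

fig, ax = plt.subplots()
ax.plot(x, f, "ob", label="x squared")   # blue circles
ax.plot(x, x, "dr", label="x")           # red diamonds on the same axes
ax.set_xlabel("x")
ax.legend()

# Save the figure as PNG data in memory; pass a filename instead to write a file.
buf = io.BytesIO()
fig.savefig(buf, format="png")
png_bytes = buf.getvalue()
plt.close(fig)
```

In an interactive IPython session you would typically skip the backend selection and call plt.show() instead of saving.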