# Course Overview

## Objectives

Charles Mingus: Making the simple complicated is commonplace; making the complicated simple, awesomely simple, that's creativity.

• Develop intuition for finding patterns in data.
• Learn concepts and algorithms in machine learning research.
• Learn to program in python.
• Implement many common machine learning algorithms in python.
• Write python code to apply algorithms to various data sets.
• Use python to analyze and visualize results.
• Learn to write scientific papers in LaTeX with graphics from python.

## Materials

• Website http://www.cs.colostate.edu/~cs545 for on-campus and on-line students
  • Announcements
  • Schedule
• Bishop's Pattern Recognition and Machine Learning
• On-line material for learning python and LaTeX

## Assignments

• About 7 assignments. Same ones for on and off-campus students.
  • Implement a machine learning algorithm in python
  • Apply it to a given data set
  • Write a report in LaTeX describing
    • data and machine learning problem being addressed,
    • method you followed, including summary of python code and resources you used,
    • results described with figures and explanations in text,
    • discussion of the results, including your observations and questions you have,
    • attempts to answer some of the questions if you have time with further experiments
• The last assignment will be of your own design.

# The Power of Statistics and Visualization

Simple analysis and dynamic visualization can lead to tremendous insight with little effort. For example, take the time to watch the talk by Hans Rosling at ted.com.

The first two steps to a data analysis problem should always be:

1. read data into python, summarize and visualize the data until you have a sense of
   • number of samples and attributes
   • types of attributes and ranges of values
   • any missing attribute values
2. list the initial set of questions to be answered about the data and guess at possible answers (hypotheses), based on your summaries and visualizations.

These steps are often ignored, or not given enough attention.
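The checks in step 1 can be sketched with numpy alone. This is a minimal illustration, assuming a small made-up table in which np.nan marks a missing value; the array `data` and its numbers are hypothetical and do not come from any course data set.

```python
import numpy as np

# Hypothetical data: 4 samples (rows) by 3 attributes (columns).
# np.nan marks a missing value (an assumption of this sketch).
data = np.array([[5.1, 3.5, 1.0],
                 [4.9, np.nan, 0.0],
                 [6.2, 2.9, 1.0],
                 [5.9, 3.0, np.nan]])

n_samples, n_attributes = data.shape

# Count missing values in each attribute (column).
n_missing = np.isnan(data).sum(axis=0)

# Range of each attribute, ignoring missing values.
lowest = np.nanmin(data, axis=0)
highest = np.nanmax(data, axis=0)

print("samples: %d, attributes: %d" % (n_samples, n_attributes))
print(n_missing)
print(lowest)
print(highest)
```

With real data the same calls apply once the file has been read into an array.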

# Examples of Kinds of Problems We Will Study

• Given a bunch of images of hand-drawn digits, learn a mapping from image to digit (0 - 9) and test it on new images. (supervised learning: classification)
• Given a set of attributes (measurements) of a lot of automobiles, including their miles-per-gallon (mpg), learn a function of all attributes except mpg that predicts the mpg. (supervised learning: prediction)
• Given the definition of a game, learn a function of the current game position that produces the best next move. (reinforcement learning)
• Given gene expression data for a bunch of samples, find groupings that are common among the different treatments. (unsupervised learning)

# Introduction to Python

## Why Python?

What do you (I) want in a programming and computing environment?

• Concise, intuitive programming language.
• Ability to play with data and computation ideas.
• Data persistence.
• Matrix, vector structures and operations are easy.
• Interpreter, using same language as programs.
• Functional and object-oriented programming style
• Rich, easy-to-use visualization
• Fast computation.
• Large community of users and developers.
• Free

Python satisfies all of these, except perhaps fast computation, and it is getting faster all the time. (See Python Speed)

Python

• is an open-source language and environment for computing and graphics
• was started in 1989 by Guido van Rossum at CWI in the Netherlands
• is a multi-paradigm programming language
• has dynamic typing
• has garbage collection
• is easily extensible in C and C++
• is available for Unix, Windows, and MacOS systems
• is the language of choice for many researchers in machine learning and statistics
• is the required language in many job ads (see http://www.python.org/community/jobs/)

## Installing and Running Python

Python is installed on our department's systems.

On our systems, enter the ipython interactive environment with

ipython

To quit, type control-d

To run python code in file code.py, either type

run code.py

in ipython, or type

python code.py

at the unix command prompt.

While in ipython, you may type python statements or expressions, which are evaluated, or ipython commands. See the video tutorial on using ipython, in five parts, by Jeff Rush for help getting started with ipython.

Documentation is immediately available for many things. For example:

> ipython
Python 2.7.3 (default, Jul 24 2012, 11:41:40)

IPython 0.12 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: list?
Type:		type
Base Class:	<type 'type'>
String Form:	<type 'list'>
Namespace:	Python builtin
Docstring:
list() -> new list
list(sequence) -> new list initialized from sequence's items

In [2]: help(list)
Help on class list in module __builtin__:

class list(object)
|  list() -> new list
|  list(sequence) -> new list initialized from sequence's items
|
|  Methods defined here:
|
|
.
.
.
|
|  sort(...)
|      L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*;
|      cmp(x, y) -> -1, 0, 1
|
|  ----------------------------------------------------------------------
|  Data and other attributes defined here:
|
|  __hash__ = None
|
|  __new__ = <built-in method __new__ of type object at 0x7ffd65be0580>
|      T.__new__(S, ...) -> a new object with type S, a subtype of T
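
The help text for sort says it is a stable sort that works in place. As a small illustration with made-up lists (`words` and `numbers` are hypothetical), the sketch below uses only the key and reverse arguments; the cmp argument shown above exists in Python 2 only.

```python
words = ["pear", "fig", "apple", "kiwi"]

# sort() rearranges the list itself and returns None.
result = words.sort(key=len)

print(result)   # None
print(words)    # ['fig', 'pear', 'kiwi', 'apple']; 'pear' stays ahead of 'kiwi' (stable)

numbers = [3, 1, 2]
numbers.sort(reverse=True)
print(numbers)  # [3, 2, 1]
```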

What is the value of $(100\cdot 2 - 12^2) / 7 \cdot 5 + 2$?

In [301]: (100*2 - 12**2) / 7*5 + 2
Out[301]: 42
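
The result depends on two rules worth checking by hand: `/` and `*` have the same precedence and are evaluated left to right, and in the Python 2.7 session above `/` between two integers is integer division. The sketch below uses `//`, which is integer division in both Python 2 and Python 3; in Python 3 the plain `/` would give 42.0 instead.

```python
numerator = 100*2 - 12**2          # 200 - 144 = 56

# / and * bind equally and group left to right: (56 // 7) * 5 + 2
left_to_right = numerator // 7 * 5 + 2

# Parentheses change the grouping: 56 // (7 * 5) + 2
grouped = numerator // (7 * 5) + 2

print(left_to_right)   # 42
print(grouped)         # 3
```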

What is the value of $\sin(\pi/2)$?

In [302]: sin(pi/2)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)

/s/parsons/e/fac/anderson/public_html/cs545/dokuwiki-2009-12-25/data/pages/notes/<ipython console> in <module>()

NameError: name 'sin' is not defined

In [303]: import math
In [304]: math.sin(math.pi/2)
Out[304]: 1.0

How do I find out what other trigonometric functions are available?

help("math")
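
Besides reading the help text, the names defined in a module can be listed with dir. A minimal sketch, checking for a few trigonometric names that the math module provides:

```python
import math

# Public names in the math module (those not starting with an underscore).
names = [name for name in dir(math) if not name.startswith("_")]

# Keep the trigonometric functions we are curious about.
trig = [name for name in ["sin", "cos", "tan", "asin", "acos", "atan"]
        if name in names]
print(trig)

print(math.cos(0.0))          # 1.0
print(math.sin(math.pi / 2))  # 1.0
```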

## Plotting

Let's get on to that all important step of visualizing data. We will be using the matplotlib python module to do our visualizing. Let's start by plotting the values 0, 2, 4, 6, …, 20.

First, generate the numbers. Well, there are tons of ways to do so. Let's start with a while loop.

In [16]: nums = []

In [17]: i = 0

In [18]: while i <= 20:
....:     nums = nums + [i]
....:     i = i + 2
....:
....:

In [19]: nums
Out[19]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

To plot them, first import the module we want to use.

In [20]: import matplotlib.pyplot as plt

In [21]: plt.ion()

In [22]: plt.plot(nums)
Out[22]: [<matplotlib.lines.Line2D object at 0x335c190>]

You should see a plot window showing a straight line that rises from 0 to 20.

Or, you can use a for loop to step through each element of a list.

In [23]: nums = []

In [24]: for i in [0,1,2,3,4,5,6,7,8,9,10]:
....:     nums = nums + [i*2]
....:
....:

In [25]: nums
Out[25]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

That's a lot of numbers to type. Whenever you find yourself typing a lot, there is very likely a quick solution.

In [28]: range(10)
Out[28]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [29]: range(11)
Out[29]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [30]: nums = []

In [31]: for i in range(11):
....:     nums = nums + [i*2]
....:
....:

In [32]: nums
Out[32]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

But, really, we hardly ever need for or while loops. Watch this! A list comprehension!!

In [33]: nums = [i*2 for i in range(11)]

In [34]: nums
Out[34]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
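
The same list can be built in other ways too. Here is a short sketch of two variations: a comprehension with a filter, and range with a step argument. The range call is wrapped in list(...) so the code also runs under Python 3, where range no longer returns a list.

```python
# Keep only the even numbers from 0 to 20.
evens_filtered = [i for i in range(21) if i % 2 == 0]

# range(start, stop, step) counts by twos directly; stop is excluded.
evens_stepped = list(range(0, 21, 2))

print(evens_filtered)
print(evens_stepped)
```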

Why don't we just multiply range(11) by 2?

In [35]: range(11)*2
Out[35]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Multiplication of lists is a duplication operator.
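
Repetition with * works for any list. A small sketch follows; note that under Python 3, range(11) must first be converted with list(...), since a range object there does not support *.

```python
short = [1, 2, 3]

# * repeats the list; it does not multiply the elements.
repeated = short * 2

# Element-wise doubling of a plain list needs a comprehension.
doubled = [2 * value for value in short]

print(repeated)  # [1, 2, 3, 1, 2, 3]
print(doubled)   # [2, 4, 6]
```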

But, this does work as expected if we first convert the list to a numpy array. Don't forget to import the numpy module first.

In [36]: import numpy as np

In [37]: np.array(range(11))
Out[37]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [38]: np.array(range(11)) * 2
Out[38]: array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

We do this so often there is a numpy function to do this.

In [39]: np.arange(11)
Out[39]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [40]: np.arange(11)*2
Out[40]: array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20])
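
np.arange also accepts start, stop, and step arguments, so the even numbers can be generated without the multiplication. A minimal sketch:

```python
import numpy as np

# start=0, stop=21 (excluded), step=2
evens = np.arange(0, 21, 2)
print(evens)

# Same values as doubling the first eleven integers.
same = np.array_equal(evens, np.arange(11) * 2)
print(same)  # True
```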

The smart python developers set up the plotting functions to accept either lists or numpy arrays, so a fast way of doing our plot is

In [41]: plt.plot(np.arange(11)*2)

We could also include $x$ values:

In [45]: plt.clf()

In [46]: x = np.arange(11) * 0.1

In [47]: plt.plot(x,x*2)

to see the line $y = 2x$ drawn for $x$ from 0 to 1.

We can add a second plot to the same axes by calling plot again without the call to clf().

In [47]: plt.plot(x,x**2)
Out[47]: [<matplotlib.lines.Line2D object at 0x3608990>]
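
When several curves share the same axes, labels and a legend help tell them apart. This is a sketch under some assumptions: the Agg backend is selected so the script runs without a display, and the output file name lines.png in the system's temporary directory is only an example.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; draw to a file instead
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(11) * 0.1

plt.clf()
plt.plot(x, x * 2, label="2x")
plt.plot(x, x ** 2, label="x squared")   # second call adds to the same axes
plt.xlabel("x")
plt.legend()

# Hypothetical output location, chosen only for this example.
out_path = os.path.join(tempfile.gettempdir(), "lines.png")
plt.savefig(out_path)

n_lines = len(plt.gca().get_lines())
print(n_lines)  # 2
```

In an interactive ipython session with plt.ion(), the backend line and the savefig call can be dropped and the figure appears in a window.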

## Matrices in Python

Can I work with vectors and matrices in python?

Of course! No data analysis tool is worth the bytes it burns if it doesn't. The python numpy module provides the magic to work with matrices as ndarrays.

We have several ways to create an array. For example, make a $3 \times 2$ array by doing

In [17]: import numpy as np

In [18]: m = np.array([[1,2], [3,4], [5,6]])

In [19]: m
Out[19]:
array([[1, 2],
[3, 4],
[5, 6]])

The reshape method returns a reshaped array and leaves m itself unchanged:

In [22]: m.reshape(1,6)
Out[22]: array([[1, 2, 3, 4, 5, 6]])

In [23]: m
Out[23]:
array([[1, 2],
[3, 4],
[5, 6]])

To change m itself, assign the result back to m or use the resize method, which modifies the array in place.

In [24]: m = m.reshape(1,6)

In [7]: m
Out[7]: array([[1, 2, 3, 4, 5, 6]])

In [8]: m.resize((2,3))

In [9]: m
Out[9]:
array([[1, 2, 3],
[4, 5, 6]])
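
One caution about reshape: the new array usually shares its data with the original rather than holding a separate copy, so writing into one changes the other. A minimal sketch follows; call .copy() when an independent array is needed.

```python
import numpy as np

m = np.array([[1, 2], [3, 4], [5, 6]])
r = m.reshape(1, 6)

print(m.shape)                  # (3, 2): m keeps its shape
print(np.shares_memory(m, r))   # True: r is a view of m's data

r[0, 0] = 99                    # writing through the view...
print(m[0, 0])                  # 99: ...changes m as well

independent = m.reshape(1, 6).copy()
independent[0, 0] = 1
print(m[0, 0])                  # still 99
```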

We could use the numpy function arange followed by reshape:

In [26]: np.arange(10)
Out[26]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [27]: np.arange(10).reshape(2,5)
Out[27]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])

Want to know more? Try

np.reshape?

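Let's make the matrices $a = \begin{bmatrix} 2 & 2 & 2 \\ 2 & 2 & 2 \\ 2 & 2 & 2 \end{bmatrix}$ and $b = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$. We can use resize for the first one, and resize or reshape for the second. (The resize method changes the array it is applied to; the np.resize function used here and reshape both return a new array.)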

In [65]: a = np.resize(2,(3,3))

In [66]: a
Out[66]:
array([[2, 2, 2],
[2, 2, 2],
[2, 2, 2]])

In [67]: b = np.resize(np.arange(9)+1,(3,3))

In [68]: b
Out[68]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
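
np.resize fills the requested shape by repeating its input as often as needed, which is why a single 2 becomes nine 2s. Two other numpy calls that build the same matrices are shown here as a sketch for comparison:

```python
import numpy as np

a = np.resize(2, (3, 3))
b = np.resize(np.arange(9) + 1, (3, 3))

a_alt = np.full((3, 3), 2)              # constant-filled array
b_alt = np.arange(1, 10).reshape(3, 3)  # 1..9 arranged row by row

print(np.array_equal(a, a_alt))  # True
print(np.array_equal(b, b_alt))  # True

# Repetition in action: three values stretched over two rows.
stretched = np.resize(np.arange(3), (2, 3))
print(stretched)
```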

What is the value of $a * b$?

In [69]: a * b
Out[69]:
array([[ 2,  4,  6],
[ 8, 10, 12],
[14, 16, 18]])

The * operator does a component-wise multiplication. Use the numpy function dot to do matrix multiplication.

In [70]: np.dot(a,b)
Out[70]:
array([[24, 30, 36],
[24, 30, 36],
[24, 30, 36]])
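
In Python 3.5 and later, with a numpy version that supports it, the @ operator is another way to write the matrix product of two-dimensional arrays. The sketch below compares the three operations; it does not apply to the Python 2.7 session shown above.

```python
import numpy as np

a = np.full((3, 3), 2)
b = np.arange(1, 10).reshape(3, 3)

elementwise = a * b         # multiplies matching entries
product = np.dot(a, b)      # rows of a times columns of b
product_at = a @ b          # same matrix product, operator form

print(elementwise)
print(product)
print(np.array_equal(product, product_at))  # True
```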

An array is transposed by

In [75]: b.transpose()
Out[75]:
array([[1, 4, 7],
[2, 5, 8],
[3, 6, 9]])

In [76]: b.T
Out[76]:
array([[1, 4, 7],
[2, 5, 8],
[3, 6, 9]])

In [77]: np.transpose(b)
Out[77]:
array([[1, 4, 7],
[2, 5, 8],
[3, 6, 9]])

What is $a b^T$?

In [78]: np.dot(a,b.T)
Out[78]:
array([[12, 30, 48],
[12, 30, 48],
[12, 30, 48]])

Elements and sub-matrices are easily extracted. Given the previous assignment of $b$,

In [79]: b
Out[79]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

In [80]: b[0,0]
Out[80]: 1

In [81]: b[0,1]
Out[81]: 2

In [82]: b[0:1,0:1]
Out[82]: array([[1]])

In [83]: b[0:2,0:1]
Out[83]:
array([[1],
[4]])

In [84]: b[0:2,1:2]
Out[84]:
array([[2],
[5]])

In [85]: b[0:2,1:3]
Out[85]:
array([[2, 3],
[5, 6]])
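
Slices follow the usual python rule that the start index is included and the stop index is excluded, and a bare : selects everything along that dimension. A short sketch with the same b:

```python
import numpy as np

b = np.arange(1, 10).reshape(3, 3)

second_row = b[1, :]        # [4 5 6]
third_column = b[:, 2]      # one-dimensional: [3 6 9]
lower_right = b[1:, 1:]     # 2 x 2 block from row 1 and column 1 onward

print(second_row)
print(third_column)
print(lower_right)
print(b[0:2, 1:3].shape)    # (2, 2)
```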

Let's multiply $a$ by the first column of $b$.

In [89]: a
Out[89]:
array([[2, 2, 2],
[2, 2, 2],
[2, 2, 2]])

In [91]: b
Out[91]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

In [92]: b[:,0]
Out[92]: array([1, 4, 7])

In [93]: np.dot(a,b[:,0])
Out[93]: array([24, 24, 24])

What happened? This should be a 3×3 times 3×1 operation, resulting in a 3×1 matrix, but the answer is a one-dimensional array of three values, displayed as a row.

While b is two-dimensional, b[:,0] is one-dimensional. It is the first column, but as a vector. To keep the two-dimensional nature, use b[:,0:1] or just b[:,:1].

In [94]: b[:,:1]
Out[94]:
array([[1],
[4],
[7]])

In [95]: np.dot(a,b[:,:1])
Out[95]:
array([[24],
[24],
[24]])
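
Checking the shape attribute makes the difference visible. A minimal sketch follows; reshape(-1, 1) is one more way to turn a one-dimensional array into a column, where -1 asks numpy to infer that dimension.

```python
import numpy as np

a = np.full((3, 3), 2)
b = np.arange(1, 10).reshape(3, 3)

print(b[:, 0].shape)               # (3,)   one-dimensional
print(b[:, :1].shape)              # (3, 1) two-dimensional column
print(np.dot(a, b[:, 0]).shape)    # (3,)
print(np.dot(a, b[:, :1]).shape)   # (3, 1)

column = b[:, 0].reshape(-1, 1)    # -1 lets numpy work out the length
print(column.shape)                # (3, 1)
```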

How do I find the inverse of a matrix? Search the net for numpy inverse.

In [2]: z = np.array([[2,1,1],[1,2,2],[2,3,4]])

In [3]: z
Out[3]:
array([[2, 1, 1],
[1, 2, 2],
[2, 3, 4]])

In [4]: np.linalg.inv(z)
Out[4]:
array([[ 0.66666667, -0.33333333,  0.        ],
[ 0.        ,  2.        , -1.        ],
[-0.33333333, -1.33333333,  1.        ]])

In [5]: np.dot(z, np.linalg.inv(z))
Out[5]:
array([[ 1.,  0.,  0.],
[ 0.,  1.,  0.],
[ 0.,  0.,  1.]])
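
Floating-point round-off means the product is not always exactly the identity, so np.allclose is a safer check than ==. When the goal is to solve $z x = y$, np.linalg.solve does so without forming the inverse. This is a sketch; the right-hand side y is made up for illustration.

```python
import numpy as np

z = np.array([[2, 1, 1], [1, 2, 2], [2, 3, 4]])
z_inv = np.linalg.inv(z)

# Compare with the identity up to round-off.
is_identity = np.allclose(np.dot(z, z_inv), np.eye(3))
print(is_identity)   # True

y = np.array([4, 5, 9])      # hypothetical right-hand side
x = np.linalg.solve(z, y)    # solves z x = y
print(x)
```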