Charles Mingus: *Making the simple complicated is commonplace; making the complicated
simple, awesomely simple, that's creativity.*

- Develop intuition for finding patterns in data.
- Learn concepts and algorithms in machine learning research.
- Learn to program in python.
- Implement many common machine learning algorithms in python.
- Write python code to apply algorithms to various data sets.
- Use python to analyze and visualize results.
- Learn to write scientific papers in LaTeX with graphics from python.
- Learn to do machine learning research by asking and answering questions about your data and results.

- Website http://www.cs.colostate.edu/~cs545 for on-campus and on-line students
    - Announcements
    - Schedule
    - Discussions: you post questions, answers, comments. Must register.

- Bishop's *Pattern Recognition and Machine Learning*
- On-line material for learning python and latex

- About 7 assignments. Same ones for on and off-campus students.
    - Implement a machine learning algorithm in python
    - Apply it to a given data set
    - Write a report in Latex describing
        - data and machine learning problem being addressed,
        - method you followed, including summary of python code and resources you used,
        - results described with figures and explanations in text,
        - discussion of the results, including your observations and questions you have,
        - attempts to answer some of the questions if you have time with further experiments
- The last assignment will be of your own design.

Simple analysis and dynamic visualization can lead to tremendous insight with little effort. For example, take the time to watch the video at ted.com by Hans Rosling:

http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html

The first two steps to a data analysis problem should always be:

- read data into python, summarize and visualize the data until you have a sense of
    - number of samples and attributes
    - types of attributes and ranges of values
    - any missing attribute values
- list the initial set of questions to be answered about the data and guess at possible answers (hypotheses), based on your summaries and visualizations.

These steps are often ignored, or not given enough attention.
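As a rough sketch of the first step, the following reads a tiny table and reports the number of samples and attributes, the range of each attribute, and how many values are missing. The data, the column names, and the use of an empty field to mark a missing value are all made up for illustration; a real data set would be read from a file instead of a string.

```python
import csv
import io

# Hypothetical data standing in for a real file; an empty field marks a missing value.
raw = io.StringIO(
    "mpg,cylinders,horsepower\n"
    "18.0,8,130\n"
    "15.0,8,\n"
    "24.0,4,95\n"
)

rows = list(csv.DictReader(raw))
attributes = list(rows[0].keys())
print("samples:", len(rows), "attributes:", len(attributes))

# For each attribute, count missing values and find the range of the present ones.
summary = {}
for name in attributes:
    values = [row[name] for row in rows]
    present = [float(v) for v in values if v != ""]
    summary[name] = {
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
    }
    print(name, summary[name])
```

Only after a summary like this does it make sense to write down questions and guesses about the data.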

- Given a bunch of images of hand-drawn digits, learn a mapping from image to digit (0 - 9) and test it on new images. (supervised learning: classification)
- Given a set of attributes (measurements) of a lot of automobiles, including their miles-per-gallon (mpg), learn a function of all attributes except mpg that predicts the mpg. (supervised learning: prediction)
- Given the definition of a game, learn a function of the current game position that produces the best next move. (reinforcement learning)
- Given gene expression data for a bunch of samples, find groupings that are common among the different treatments. (unsupervised learning)

What do you (I) want in a programming and computing environment?

- Concise, intuitive programming language.
- Ability to *play* with data and computation ideas.
- Data persistence.
- Matrix, vector structures and operations are easy.
- Interpreter, using same language as programs.
- Functional and object-oriented programming style

- Rich, easy-to-use visualization
- Fast computation.
- Large community of users and developers.
- Free

Python satisfies all of these, except perhaps *fast computation*, but it is getting
faster all the time. (See Python Speed)

Python

- is an open-source language and environment for computing and graphics
- was started in 1989 by Guido van Rossum at CWI in the Netherlands
- is a multi-paradigm programming language
- has dynamic typing
- has garbage collection
- is easily extensible in C and C++
- is available for Unix, Windows, and MacOS systems
- is the language of choice for many researchers in machine learning and statistics, and
- is the required language in many job ads (see http://www.python.org/community/jobs/)

Python is installed on our department's systems.

You may download and install python on your own computer by following instructions at http://www.python.org

On our systems, enter the *ipython* interactive environment with

ipython

To quit, type control-d

To run python code in file *code.py*, either type

run code.py

in *ipython*, or type

python code.py

at the unix command prompt.

When in *ipython*, you may type python statements or expressions,
which are evaluated, or *ipython* commands. See the video
tutorial on using ipython, in five parts by Jeff Rush, for help
getting started with *ipython*.

Documentation is immediately available for many things. For example:

    > ipython
    Python 2.7.3 (default, Jul 24 2012, 11:41:40)
    Type "copyright", "credits" or "license" for more information.

    IPython 0.12 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]: list?
    Type:       type
    Base Class: <type 'type'>
    String Form:<type 'list'>
    Namespace:  Python builtin
    Docstring:
    list() -> new list
    list(sequence) -> new list initialized from sequence's items

    In [2]: help(list)
    Help on class list in module __builtin__:

    class list(object)
     |  list() -> new list
     |  list(sequence) -> new list initialized from sequence's items
     |
     |  Methods defined here:
     |
     |  __add__(...)
     |      x.__add__(y) <==> x+y
     |
     .
     .
     .
     |
     |  sort(...)
     |      L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*;
     |      cmp(x, y) -> -1, 0, 1
     |
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |
     |  __hash__ = None
     |
     |  __new__ = <built-in method __new__ of type object at 0x7ffd65be0580>
     |      T.__new__(S, ...) -> a new object with type S, a subtype of T

What is the value of the following expression?

    In [301]: (100*2 - 12**2) / 7*5 + 2
    Out[301]: 42
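The session above was run with Python 2.7, where `/` between two integers truncates. If you are running Python 3 (an assumption about your setup, not something these notes require), the same expression gives a float, and `//` is the operator that reproduces the integer result. In both versions `/` and `*` are evaluated left to right, so `56 / 7*5` means `(56 / 7) * 5`.

```python
# Python 3: "/" is true division, so the result is a float.
true_div = (100*2 - 12**2) / 7*5 + 2

# "//" is floor division, which gives the integer result seen in the Python 2 session.
floor_div = (100*2 - 12**2) // 7*5 + 2

print(true_div)   # 42.0
print(floor_div)  # 42
```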

What is the value of sin(π/2)?

    In [302]: sin(pi/2)
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)

    /s/parsons/e/fac/anderson/public_html/cs545/dokuwiki-2009-12-25/data/pages/notes/<ipython console> in <module>()

    NameError: name 'sin' is not defined

    In [303]: import math

    In [304]: math.sin(math.pi/2)
    Out[304]: 1.0

How do I find out what other trigonometric functions are available?

help("math")
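If you would rather get the names as a list than page through the help text, `dir` works on modules too. The sketch below filters the names in `math` by their usual prefixes; the prefix list is my own choice for illustration, not an official category of the module.

```python
import math

# Names in the math module that do not start with an underscore.
public_names = [name for name in dir(math) if not name.startswith("_")]

# Keep names that begin with a standard trigonometric prefix
# (this also picks up the hyperbolic variants and atan2).
prefixes = ("sin", "cos", "tan", "asin", "acos", "atan")
trig = [name for name in public_names if name.startswith(prefixes)]
print(trig)
```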

Let's get on to that all important step of visualizing data. We will be using the *matplotlib* python module to do our visualizing. Let's start by plotting the values 0, 2, 4, 6, …, 20.

First, generate the numbers. There are many ways to do so. We start with a while loop.

    In [16]: nums = []

    In [17]: i = 0

    In [18]: while i <= 20:
       ....:     nums = nums + [i]
       ....:     i = i + 2
       ....:
       ....:

    In [19]: nums
    Out[19]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

To plot them, first import the module we want to use.

    In [20]: import matplotlib.pyplot as plt

    In [21]: plt.ion()

    In [22]: plt.plot(nums)
    Out[22]: [<matplotlib.lines.Line2D object at 0x335c190>]

Or, you can use a for loop to step through each element of a list.

    In [23]: nums = []

    In [24]: for i in [0,1,2,3,4,5,6,7,8,9,10]:
       ....:     nums = nums + [i*2]
       ....:
       ....:

    In [25]: nums
    Out[25]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

That's a lot of numbers to type. Whenever you find yourself typing a lot, there is very likely a quick solution.

    In [28]: range(10)
    Out[28]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    In [29]: range(11)
    Out[29]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    In [30]: nums = []

    In [31]: for i in range(11):
       ....:     nums = nums + [i*2]
       ....:
       ....:

    In [32]: nums
    Out[32]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

But, really, we hardly ever need for or while loops. Watch this! A list comprehension!!

    In [33]: nums = [i*2 for i in range(11)]

    In [34]: nums
    Out[34]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
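One more way, sketched here assuming Python 3: `range` accepts a start, a stop, and a step, so the even numbers can be produced directly. In Python 3 `range` no longer returns a list, so wrap it in `list()` to see the values.

```python
# In Python 3, range() is lazy; list() materializes the values.
evens_from_step = list(range(0, 21, 2))

# The list comprehension from the session above, unchanged.
evens_from_comprehension = [i * 2 for i in range(11)]

print(evens_from_step)
print(evens_from_step == evens_from_comprehension)  # True
```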

Why don't we just multiply *range(11)* by 2?

    In [35]: range(11)*2
    Out[35]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Multiplication of lists is a duplication operator.
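A caution if you try this under Python 3 (again an assumption about your setup): there `range(11)` is not a list, and multiplying it by an integer fails. Converting to a list first gives the same duplication as above. A small sketch with a shorter range:

```python
# Repeating a list with * works the same way in Python 3.
doubled_list = list(range(3)) * 2
print(doubled_list)  # [0, 1, 2, 0, 1, 2]

# A bare range object does not support *, so this raises TypeError.
try:
    range(3) * 2
    range_supports_mul = True
except TypeError:
    range_supports_mul = False
print(range_supports_mul)  # False
```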

But, this does work as expected if we first convert the list to a *numpy* array. Don't forget to import the *numpy* module first.

    In [36]: import numpy as np

    In [37]: np.array(range(11))
    Out[37]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

    In [38]: np.array(range(11)) * 2
    Out[38]: array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

We do this so often there is a *numpy* function to do this.

    In [39]: np.arange(11)
    Out[39]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

    In [40]: np.arange(11)*2
    Out[40]: array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20])
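Like `range`, `np.arange` also takes a start, stop, and step, so the multiplication is optional. The sketch below shows that form, plus `np.linspace`, which may be handier for evenly spaced floating-point values because it takes the number of points rather than a step.

```python
import numpy as np

evens = np.arange(0, 21, 2)        # start, stop (exclusive), step
same = np.arange(11) * 2

print(evens)
print(np.array_equal(evens, same))  # True

# linspace includes both endpoints and takes the number of points.
x = np.linspace(0.0, 1.0, 11)
print(np.allclose(x, np.arange(11) * 0.1))  # True
```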

The smart python developers set up the plotting functions to accept either lists or *numpy* arrays, so a fast way of doing our plot is

In [41]: plt.plot(np.arange(11)*2)

We could also include x values:

    In [45]: plt.clf()

    In [46]: x = np.arange(11) * 0.1

    In [47]: plt.plot(x,x*2)

to see the values of 2x plotted against x.

We can add a second plot to the same axes by calling *plot* again without the call to *clf()*.

    In [47]: plt.plot(x,x**2)
    Out[47]: [<matplotlib.lines.Line2D object at 0x3608990>]
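The same two curves can be drawn from a script rather than an interactive session. This is only a sketch: it assumes no display is available (hence the `Agg` backend), and the output file name and the labels are hypothetical choices.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(11) * 0.1

fig, ax = plt.subplots()
ax.plot(x, x * 2, label="2x")
ax.plot(x, x ** 2, label="x squared")  # a second call adds to the same axes
ax.set_xlabel("x")
ax.legend()

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "two_curves.png")
fig.savefig(out_path)
```

Saving to a file is also how you would get a figure into a LaTeX report.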

Can I work with vectors and matrices in python?

Of course! No data analysis tool is worth the bytes it burns if it
doesn't. The python *numpy* module provides the magic to work with matrices as
*ndarray*'s.

We have several ways to create an array. For example, make a 3×2 array by doing

    In [17]: import numpy as np

    In [18]: m = np.array([[1,2], [3,4], [5,6]])

    In [19]: m
    Out[19]:
    array([[1, 2],
           [3, 4],
           [5, 6]])

This array can be copied and reshaped by

    In [22]: m.reshape(1,6)
    Out[22]: array([[1, 2, 3, 4, 5, 6]])

    In [23]: m
    Out[23]:
    array([[1, 2],
           [3, 4],
           [5, 6]])

To change m, you must assign it or use resize.

    In [24]: m = m.reshape(1,6)

    In [7]: m
    Out[7]: array([[1, 2, 3, 4, 5, 6]])

    In [8]: m.resize((2,3))

    In [9]: m
    Out[9]:
    array([[1, 2, 3],
           [4, 5, 6]])

We could use the *numpy* function *arange* followed by *reshape*:

    In [26]: np.arange(10)
    Out[26]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

    In [27]: np.arange(10).reshape(2,5)
    Out[27]:
    array([[0, 1, 2, 3, 4],
           [5, 6, 7, 8, 9]])

Want to know more? Try

np.reshape?
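To make the difference between the two methods concrete, here is a small sketch that checks shapes instead of printing whole arrays. It also uses `-1` as a dimension, which asks *numpy* to work that dimension out from the total number of elements.

```python
import numpy as np

m = np.array([[1, 2], [3, 4], [5, 6]])

flat = m.reshape(1, 6)      # returns a new array object; m keeps its shape
print(m.shape, flat.shape)  # (3, 2) (1, 6)

m.resize((2, 3))            # changes m itself and returns None
print(m.shape)              # (2, 3)

# -1 lets numpy infer the second dimension from the 10 elements.
auto = np.arange(10).reshape(2, -1)
print(auto.shape)           # (2, 5)
```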

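Let's make two 3×3 matrices: `a`, with every element equal to 2, and `b`,
containing the integers 1 through 9 in row order.
We can use *resize* for the first one, and *resize* or *reshape*
for the second. (The *resize* method changes the array it is applied to;
*reshape* and the function *np.resize* make a new version.)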

    In [65]: a = np.resize(2,(3,3))

    In [66]: a
    Out[66]:
    array([[2, 2, 2],
           [2, 2, 2],
           [2, 2, 2]])

    In [67]: b = np.resize(np.arange(9)+1,(3,3))

    In [68]: b
    Out[68]:
    array([[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]])

What is the value of `a * b`?

    In [69]: a * b
    Out[69]:
    array([[ 2,  4,  6],
           [ 8, 10, 12],
           [14, 16, 18]])

The `*` operator does a component-wise multiplication. Use the
*numpy* function *dot* to do matrix multiplication.

    In [70]: np.dot(a,b)
    Out[70]:
    array([[24, 30, 36],
           [24, 30, 36],
           [24, 30, 36]])
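If you are on Python 3.5 or later with a reasonably recent *numpy* (an assumption; the session above is Python 2.7), the `@` operator is another way to write the matrix product. The sketch below builds `a` and `b` with `np.full` and `np.arange`, which is a different construction from the `np.resize` calls above but gives the same matrices.

```python
import numpy as np

a = np.full((3, 3), 2)               # 3x3 matrix of 2s
b = np.arange(1, 10).reshape(3, 3)   # 1 through 9 in row order

elementwise = a * b      # component-wise product
matrix_product = a @ b   # matrix product, same as np.dot(a, b) for 2-D arrays

print(elementwise)
print(matrix_product)
print(np.array_equal(matrix_product, np.dot(a, b)))  # True
```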

An array is transposed by

    In [75]: b.transpose()
    Out[75]:
    array([[1, 4, 7],
           [2, 5, 8],
           [3, 6, 9]])

    In [76]: b.T
    Out[76]:
    array([[1, 4, 7],
           [2, 5, 8],
           [3, 6, 9]])

    In [77]: np.transpose(b)
    Out[77]:
    array([[1, 4, 7],
           [2, 5, 8],
           [3, 6, 9]])

What is the matrix product of `a` and the transpose of `b`?

    In [78]: np.dot(a,b.T)
    Out[78]:
    array([[12, 30, 48],
           [12, 30, 48],
           [12, 30, 48]])

Elements and sub-matrices are easily extracted. Given the previous assignment of `b`,

    In [79]: b
    Out[79]:
    array([[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]])

    In [80]: b[0,0]
    Out[80]: 1

    In [81]: b[0,1]
    Out[81]: 2

    In [82]: b[0:1,0:1]
    Out[82]: array([[1]])

    In [83]: b[0:2,0:1]
    Out[83]:
    array([[1],
           [4]])

    In [84]: b[0:2,1:2]
    Out[84]:
    array([[2],
           [5]])

    In [85]: b[0:2,1:3]
    Out[85]:
    array([[2, 3],
           [5, 6]])

Let's multiply `a` by the first column of `b`.

    In [89]: a
    Out[89]:
    array([[2, 2, 2],
           [2, 2, 2],
           [2, 2, 2]])

    In [91]: b
    Out[91]:
    array([[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]])

    In [92]: b[:,0]
    Out[92]: array([1, 4, 7])

    In [93]: np.dot(a,b[:,0])
    Out[93]: array([24, 24, 24])

What happened? This should be a 3×3 times 3×1 operation, resulting in a 3×1 matrix, but the answer looks like a 1×3 matrix.

While `b` is two-dimensional, `b[:,0]` is one-dimensional. It is the
first column, but as a vector, so the product is also a one-dimensional
vector of three values. To keep the two-dimensional nature,
use `b[:,0:1]` or just `b[:,:1]`.

    In [94]: b[:,:1]
    Out[94]:
    array([[1],
           [4],
           [7]])

    In [95]: np.dot(a,b[:,:1])
    Out[95]:
    array([[24],
           [24],
           [24]])
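When in doubt about dimensions, look at the `shape` attribute. The sketch below repeats the two products and prints only their shapes. It rebuilds `a` and `b` with `np.full` and `np.arange` so that it runs on its own.

```python
import numpy as np

a = np.full((3, 3), 2)
b = np.arange(1, 10).reshape(3, 3)

print(b[:, 0].shape)              # (3,)   one-dimensional
print(b[:, :1].shape)             # (3, 1) two-dimensional column
print(np.dot(a, b[:, 0]).shape)   # (3,)
print(np.dot(a, b[:, :1]).shape)  # (3, 1)

# A one-dimensional result can also be turned into a column afterwards.
column = np.dot(a, b[:, 0]).reshape(-1, 1)
print(column.shape)               # (3, 1)
```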

How do I find the inverse of a matrix? Search the net for *numpy
inverse*.

    In [2]: z = np.array([[2,1,1],[1,2,2],[2,3,4]])

    In [3]: z
    Out[3]:
    array([[2, 1, 1],
           [1, 2, 2],
           [2, 3, 4]])

    In [4]: np.linalg.inv(z)
    Out[4]:
    array([[ 0.66666667, -0.33333333,  0.        ],
           [ 0.        ,  2.        , -1.        ],
           [-0.33333333, -1.33333333,  1.        ]])

    In [5]: np.dot(z, np.linalg.inv(z))
    Out[5]:
    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])
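The printed identity matrix hides a little floating-point round-off, so in code it is safer to compare with `np.allclose` than with `==`. The sketch below does that, and also shows `np.linalg.solve`, which solves a linear system without forming the inverse explicitly. The right-hand side vector is made up for illustration, and `@` assumes Python 3.5 or later.

```python
import numpy as np

z = np.array([[2, 1, 1], [1, 2, 2], [2, 3, 4]])
z_inv = np.linalg.inv(z)

# Round-off means the product is only approximately the identity.
print(np.allclose(z @ z_inv, np.eye(3)))  # True

# Hypothetical right-hand side, chosen only for illustration.
rhs = np.array([1.0, 2.0, 3.0])
x = np.linalg.solve(z, rhs)
print(np.allclose(z @ x, rhs))            # True
```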