$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

Assignment 1: Linear Regression

Steve Kommrusch

Overview

In this assignment for CS480 (taught by professor Chuck Anderson) I implement 3 functions that allow me to create a linear model to utilize various OECD country statistics to predict the 'life satisfaction' reported by the country's citizens. Details are discussed further in the 'Data' section below, but the main idea is to find a weight vector that minimizes the sum of squared errors between a linear combination of the inputs (GDP per capita, unemployment, life expectancy, etc) and 'life satisfaction', which is reported on a 1-10 scale.

Method

To set up the OECD data for this linear regression problem, I gathered data for 37 countries and set $t_n$ (the target value to predict) to be the life satisfaction average for citizens in country $n$, and set the vector $\xv_n$ to be the vector of statistics on country $n$ that would be interesting to consider in the problem (see the Data section for more details on the data used). Then the goal is to find a linear model that minimizes the least squared error for all 37 predictions of the target variable. The linear model $g$ is defined relative a set of scalar weights provided in a vector $\wv$ and given by the formula: $$ g(\xv; \wv) = \xv^T \wv $$ Hence, the goal is to find $\wv_{\mbox{best}}$ to minimize the least squared error: $$ \wv_{\mbox{best}} = \argmin{\wv} \sum_{n=1}^{37} (t_n - \xv_n^T \wv)^2 $$

If we gather the $t_n$ values into a vector $T$ and gather the $\xv_n^T$ vectors to be rows of a matrix called $\Xv$, then the minimal weights can be computed with: $$ \begin{align*} \wv &= (\Xv^T \Xv)^{-1} \Xv^T \Tv \end{align*} $$

A general python command which is the same as the above equation but will work even when $(\Xv^T \Xv)$ is not invertable is np.linalg.lstsq, which is used in the train function:

w = np.linalg.lstsq(np.dot(X.T,X), np.dot(X.T, T))

The 3 functions and their return types used for this assignment are:

  • model = train(X,T)
  • predict = use(model,X)
  • error = rmse(predict,T)

Where X is a two-dimensional matrix (np.array) with each row containing one data sample, and T is a two-dimensional matrix of one column containing the target values for each sample in X. The function call train(X,T) returns a python dictionary with the keys means, stds, and w.

In [1]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
def train (X,T):
    means=np.mean(X,axis=0)
    stds=np.std(X,axis=0)
    Xplus1=np.ones((X.shape[0],X.shape[1]+1)) # Add unit column for bias offset in model
    Xplus1[:,1:]=(X-means)/stds # adjust all variable inputs to model to 0 mean and unit variance
    w = np.linalg.lstsq(np.dot(Xplus1.T,Xplus1), np.dot(Xplus1.T, T))[0]
    return {'means':means,'stds':stds,'w':w}
In [3]:
def use (model,X):
    Xplus1=np.ones((X.shape[0],X.shape[1]+1))
    Xplus1[:,1:]=(X-model['means'])/model['stds']
    return np.dot(Xplus1,model['w'])
In [4]:
def rmse (predict,T):
    # both args have shape equal to number of samples
    return np.sqrt(np.mean((predict-T)**2,axis=0))

Data

I chose to explore what factors are best used to predict life satisfaction for the countries in the OECD (Organization for Economic Cooperation and Development). The data I chose for this comes from 2 sources under the stats.oecd.org site.

The source pages provide an option to create a comma separated value file (csv), and then I ran the long perl line below to create my input data file.

In [5]:
# Perl line to process CSV into data: cat *.csv | perl -e '@cols=("Life satisfaction","GDP per capita","Gini","GDP growth","Public social expenditure","Life expectancy","unemployment","Years in education","leisure and personal care","Feeling safe walking alone at night","Air pollution","Homicide rate"); while (<>) { @line = split /,/; foreach $col (@cols) {if (@line[1] =~ /$col/) {$dat{@line[3]}{$col}=@line[12]}; if ((@line[3] =~ /$col/) && (@line[7] =~ /Total/)) {$dat{@line[1]}{$col}=@line[14]}}}; print join(",",(@cols,"Country Name")),"\n"; foreach $c (sort keys %dat) {foreach $col (@cols) {print "$dat{$c}{$col},"} print "$c\n"}' | grep -v ,, > OECD.txt

The 'grep' command below shows line 1 of the OECD.txt file and statistics for 10 of the 37 countries that I found data for.

In [6]:
! grep t OECD.txt
Life satisfaction,GDP per capita,Gini,GDP growth,Public social expenditure,Life expectancy,unemployment,Years in education,leisure and personal care,Feeling safe walking alone at night,Air pollution,Homicide rate,Voter turnout,developing regulations,Country Name
7.3,44971.47015,0.326,2.735528595,18.982,82.2,1.32,19.2,14.35,62.6,6,0.8,93,2.7,"Australia"
7.1,46171.4264,0.276,0.353471066,28.414,81.2,1.53,17.1,14.55,81.2,15,0.4,75,1.3,"Austria"
5.6,26901.6372,0.338,2.906537167,16.26,77.3,3.32,17.8,14.9,67.2,9,4.8,64,2.8,"Estonia"
5.8,35015.25637,0.327,-0.443878227,28.64,82.8,7.79,16.8,14.89,59.3,18,0.8,75,1.5,"Italy"
7.3,47634.7553,0.278,1.011128405,24.736,81.4,2.98,17.9,15.9,80.5,17,0.8,75,1.3,"Netherlands"
5.1,28382.46899,0.338,0.905797803,25.21,80.8,8.28,17.4,14.72,69.2,10,0.9,56,1.2,"Portugal"
4.9,13146.40287,0.67,1.524841997,8.740899231,56.8,14.37,15.4,14.73,39.8,14,9.6,73,1.6,"South Africa"
7.6,57246.29124,0.285,1.889476944,19.39,82.9,1.71,17.4,15.01,87.4,17,0.5,48,2.6,"Switzerland"
6.5,39709.45234,0.351,2.940253236,21.738,81.1,2.22,16.7,14.87,77.8,11,0.2,66,2.9,"United Kingdom"
6.9,54353.19247,0.401,2.427798266,19.225,78.8,1.42,17.1,14.47,73.9,11,5.2,67,3.2,"United States"

I load the numerical data and country names into the variables shown below after sorting them based on life satisfaction for more meaningful data plots later in this document.

In [7]:
data = np.loadtxt('OECD.txt', usecols=range(12), delimiter=',', skiprows=1)
countries = np.loadtxt('OECD.txt', usecols=range(12,13), delimiter=',', skiprows=1, dtype='S')
countries = countries[data[:,0].argsort()]   # Sort the countries by increasing life satisfaction
data = data[data[:,0].argsort()]  # Sort the data rows by increasing life satisfaction
data[:,1] = data[:,1]/1000 # Scale GDP/capita by 1000 to help with cleaner graph below.
T=np.array(data[:,0])    # The target for training is the 0th data collumn (life satisfaction)
X=np.array(data); X[:,0] = X[:,4]**2;  # The data to train with is columns 1-11 from file.
                             # replace column 0 with square of column 4 (public social spending)

Note above that the target $T$ is set to the original column 0 of the data and then I replace column 0 in the $X$ array with the square of column 4. Column 4 is the data for 'public social expenditure' as a percentage of GDP, and one might expect there to be an ideal maximum for this expenditure (neither 0 percent nor 100 percent), hence the inclusion of a quadratic term could allow for the best solution to involve a parabolic peak. One might argue the Gini coefficient would have a similar peak (neither fully equal incomes nor highly unequal incomes may be ideal), but that idea is not explored here.

In [8]:
X.shape
Out[8]:
(37, 12)

In the colnames list setup given below, comments explain each column of data in the $\Xv$ matrix. Note that when data was available for multiple years, the most recent year in the OECD data would be used. The most recent year for any data in the source download was 2015.

In [9]:
colnames=list(("Bias",                  # The is the life satisfaction from the model if all
                                        # of the below variables for a country are at the mean.
          "Life satisfaction",          # Average subjective well-being in country on 1-10 scale,
                                        # with 10 being the highest life satisfaction.
          "GDP per capita in $1000s",   # Gross domestic product per capita in $1000s.
          "Gini (income inequality)",   # A gini of 0 means all income is equally shared
                                        # in the population.  A gini of 1 means one individual
                                        # has all the wealth.
          "GDP growth",                 # Year-to-year GDP growth as a percentage of GDP.
          "Public social expenditure",  # Government spending on health, pensions, disability,
                                        # housing assistance, etc as percentage of GDP.
                                        # (Does not include education, military, etc)
          "Life expectancy",            # Average life expectancy in years
          "Unemployment percentage",    # Percent of population that is long-term unemployed
          "Years in education",         # Average years of education 
          "Time for liesure",           # Average hours per day for leisure and personal care
                                        # (includes sleep, eating, hobbies, with friends, etc)
          "Feel safe walk alone night", # Percent of population that feels safe walking alone
                                        # at night.
          "Air pollution",              # PM10 concentrations in micrograms/m^3 in urban areas
          "Homicide Rate"))             # Homicide rate per 100,000 persons per year

Below is the list of countries that had data for all the columns in the $\Xv$ matrix. The list is sorted from the country with the lowest subjective well-being rating to the highest. The printing below adds a '--' indicator to every 5th country which is helpful when viewing the graphs later since the grid adds a vertical line for every 5th country.

In [10]:
for i in range(countries.shape[0]): 
    print("%s %2d: %s" % ("  " if (i%5) else "--",i,"".join(map(chr,countries[i][1:-1]))))
--  0: 
    1: 
    2: 
    3: 
    4: 
--  5: 
    6: 
    7: 
    8: 
    9: 
-- 10: 
   11: 
   12: 
   13: 
   14: 
-- 15: 
   16: 
   17: 
   18: 
   19: 
-- 20: 
   21: 
   22: 
   23: 
   24: 
-- 25: 
   26: 
   27: 
   28: 
   29: 
-- 30: 
   31: 
   32: 
   33: 
   34: 
-- 35: 
   36: 
In [11]:
plt.figure(figsize=(10,20))
nrow,ncol = data.shape
for c in range(ncol):
    plt.subplot(6,2, c+1)
    plt.plot(data[:,c],'.-')
    plt.ylabel(colnames[c+1])
    plt.xlabel("Country number in increasing life satisfaction")
    plt.grid(True)

Note that the data points are plotted with blue dots, allowing one to find the data for a given country with some ease. For example, the United States is country #22 out of 37 in the sorted life satisfaction list, and in the "Time for liesure" graph above, country #22 citizens have an average of 14.5 hours per day for sleeping, eating, socializing, and general liesure.

Results

When using the linear regression functions on the input data, the root-mean-square-error (sample standard deviation) is 0.42 (on a scale of 1 to 10).

In [12]:
model=train(X,T)
predict=use(model,X)
error=rmse(predict,T)
error
Out[12]:
0.42349934254133248

The plot of the actual life satisfaction relative to the model prediction is shown below. One can see that there are, not surprisingly, factors related to life satisfaction not accounted for in our numerical model, but the corellation is fairly accurate.

In [13]:
plt.plot(T,'o-',label="Actual data")
plt.plot(predict,'o-',label="Model prediction")
plt.ylabel("Life satisfaction")
plt.xlabel("Country number in increasing life satisfaction")
plt.legend(loc='lower right')
plt.grid(True);

Below is the list of weights discovered by the linear regression model after the data was normalized in the 'train' function.

In [14]:
colnames[1]="Public social expenditure squared"
for i in range(len(colnames)): print("%33s: %.3f" % (colnames[i],model['w'][i]))
                             Bias: 6.522
Public social expenditure squared: -0.137
         GDP per capita in $1000s: 0.304
         Gini (income inequality): 0.109
                       GDP growth: -0.090
        Public social expenditure: 0.021
                  Life expectancy: 0.145
          Unemployment percentage: -0.251
               Years in education: 0.268
                 Time for liesure: 0.094
       Feel safe walk alone night: 0.221
                    Air pollution: -0.024
                    Homicide Rate: 0.261

Of all the factors, the one that contributes most to the prediction of life satisfaction is GDP per capita, followed by years in education and homicide rate. Note that in this multivariable model, homicide rate actually adds in positively to the life satisfaction evaluation. As a 2-variable correlation, visible in the 'homicide rate' graph above, homicide rate in a country is not well corellated with higher life satisfaction (the countries with the highest homicide rates are near the middle of the series sorted on life satisfaction), but the covariance of the variables results in a positive contribution in the 12-variable model. In other words, presumably GDP/capita, unemployment, years in education, and the other factors combined to predict too low a happiness level for countries with high homicide rates, so the homicide rate needed to be added in as a correction.

Below we change the model to use only the variables that had weights with absolute values above 0.2 in the original model to see how well the model predicts with fewer input data. (Note that public social expenditure did not meet this criterion, but the availability of the squared term was seen as more useful to the model than the linear term (weight of -0.137 vs 0.021).

In [15]:
selected=np.array((1,6,7,9,11))
Xselected=X[:,selected]
model=train(Xselected,T)
predict=use(model,Xselected)
error=rmse(predict,T)
error
Out[15]:
0.44359272096742863

The RSME rose from 0.42 to 0.44, still a rather accurate prediction, and the graph comparing the model to actual data is very similar to the more complex model.

In [16]:
plt.plot(T,'o-',label="Actual data")
plt.plot(predict,'o-',label="Model prediction")
plt.ylabel("Life satisfaction")
plt.xlabel("Country number in increasing life satisfaction")
plt.legend(loc='lower right')
plt.grid(True);
In [17]:
print("%33s: %.3f" % (colnames[0],model['w'][0]))
for i in range(selected.shape[0]): 
    print("%33s: %.3f" % (colnames[selected[i]+1],model['w'][i+1]))
                             Bias: 6.522
         GDP per capita in $1000s: 0.314
          Unemployment percentage: -0.239
               Years in education: 0.258
       Feel safe walk alone night: 0.245
                    Homicide Rate: 0.274

In summary, out of a dozen various metrics on quality of life in a country, the 5 most important variables to predicting life satisfaction in the OECD countries are, in order, GDP per capita, homicide rate, years in education, feeling safe walking alone at night, and the unemployment rate.

Grading

Your notebook will be run and graded automatically. Test this grading process by first downloading A1grader.tar and extract A1grader.py from it. After running all of the above cells in your notebook, run the code in the following cell to demonstrate an example grading session. You should see a perfect score of 100/100 if your functions are defined correctly.

In [18]:
%run -i "A1grader.py"
20/20 points. 'means' values are correct.
20/20 points. 'stds' values are correct.
20/20 points. 'w' values are correct.
20/20 points. Values returned by 'use' are correct.
20/20 points. rmse() is correct.
notebooks Grade is 100/100