{
 "cells": [
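  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cross Validation\n",
    "\n",
    "As a first step we need a classifier and some data to evaluate it on. Note that this notebook uses the `sklearn.cross_validation` module, which was deprecated in scikit-learn 0.18 and removed in 0.20; in newer versions its functionality lives in `sklearn.model_selection`, where the splitter classes have a different interface."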
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn import cross_validation\n",
    "from sklearn import metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scikit-learn ships with several datasets that are ready to use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_breast_cancer\n",
    "data = load_breast_cancer()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A scikit-learn dataset is a container object whose most useful attributes are:\n",
    "  * `data`, the feature matrix to learn from,\n",
    "  * `target`, the classification labels,\n",
    "  * `target_names`, the meaning of the labels,\n",
    "  * `feature_names`, the meaning of the features, and\n",
    "  * `DESCR`, the full description of the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['malignant', 'benign'], \n",
       "      dtype='<U9')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',\n",
       "       'mean smoothness', 'mean compactness', 'mean concavity',\n",
       "       'mean concave points', 'mean symmetry', 'mean fractal dimension',\n",
       "       'radius error', 'texture error', 'perimeter error', 'area error',\n",
       "       'smoothness error', 'compactness error', 'concavity error',\n",
       "       'concave points error', 'symmetry error', 'fractal dimension error',\n",
       "       'worst radius', 'worst texture', 'worst perimeter', 'worst area',\n",
       "       'worst smoothness', 'worst compactness', 'worst concavity',\n",
       "       'worst concave points', 'worst symmetry', 'worst fractal dimension'], \n",
       "      dtype='<U23')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X = data.data\n",
    "y = data.target\n",
    "data.target_names\n",
    "data.feature_names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
       "          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
       "          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
       "          verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Hold out 40% of the examples for testing; random_state fixes the split.\n",
    "X_train, X_test, y_train, y_test = cross_validation.train_test_split(\n",
    "    X, y, test_size=0.4, random_state=0)\n",
    "classifier = LogisticRegression()\n",
    "classifier.fit(X_train, y_train)\n",
    "y_pred = classifier.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's compute the accuracy of our predictions (in two different ways):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9692982456140351"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "0.9692982456140351"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
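   ],
   "source": [
    "# Count the positions where prediction and label agree, then normalize.\n",
    "len(np.where(np.equal(y_pred, y_test))[0]) / len(y_test)\n",
    "\n",
    "# Equivalent: the comparison yields booleans, which sum as 0/1.\n",
    "np.sum(y_pred == y_test) / len(y_test)"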
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can do the same using scikit-learn:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9692982456140351"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "metrics.accuracy_score(y_test, y_pred)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's compute accuracy using cross-validation instead:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.93913043,  0.93913043,  0.97345133,  0.95575221,  0.96460177])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cross_validation.cross_val_score(classifier, X, y, cv=5, scoring='accuracy')\n"
   ]
  },
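  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make explicit what the call above computes, here is a minimal sketch that runs the five folds by hand. It is an illustration under assumptions, not scikit-learn's actual implementation: it is written against `sklearn.model_selection` (the successor of `sklearn.cross_validation`, so it needs scikit-learn 0.18 or later), it assumes stratified folds, it reloads the data so that the cell is self-contained, and the names ending in `_demo` are ours. For each fold, a fresh copy of the classifier is trained on the other four folds and scored on the held-out one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.base import clone\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "\n",
    "# Reload the data so that this cell does not depend on earlier ones.\n",
    "X_demo, y_demo = load_breast_cancer(return_X_y=True)\n",
    "model_demo = LogisticRegression(solver='liblinear')\n",
    "folds_demo = StratifiedKFold(n_splits=5)\n",
    "\n",
    "fold_scores = []\n",
    "for train_idx, test_idx in folds_demo.split(X_demo, y_demo):\n",
    "    # Train an unfitted copy on four folds, score it on the fifth.\n",
    "    fold_model = clone(model_demo)\n",
    "    fold_model.fit(X_demo[train_idx], y_demo[train_idx])\n",
    "    fold_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))\n",
    "\n",
    "# One accuracy value per fold.\n",
    "print(np.array(fold_scores))"
   ]
  },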
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can score with other metrics as well, such as the area under the ROC curve:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.99418605,  0.99192506,  0.99731724,  0.98222669,  0.99664655])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cross_validation.cross_val_score(classifier, X, y, cv=5, scoring='roc_auc')\n"
   ]
  },
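  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, obtain the cross-validated predictions first and then compute accuracy over all examples at once. This gives a single pooled score instead of one score per fold:"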
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.95430579964850615"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_predict = cross_validation.cross_val_predict(classifier, X, y, cv=5)\n",
    "metrics.accuracy_score(y, y_predict)"
   ]
  },
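  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pooled accuracy is close to, but in general not identical to, the mean of the per-fold accuracies: pooling weighs every example equally, while averaging weighs every fold equally, and the folds need not have the same size. The sketch below prints both numbers side by side. It assumes scikit-learn 0.18 or later (`sklearn.model_selection`), reloads the data to stay self-contained, and uses `_demo` names of our own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.model_selection import cross_val_predict, cross_val_score\n",
    "\n",
    "X_demo, y_demo = load_breast_cancer(return_X_y=True)\n",
    "model_demo = LogisticRegression(solver='liblinear', random_state=0)\n",
    "\n",
    "# One accuracy value per fold.\n",
    "per_fold = cross_val_score(model_demo, X_demo, y_demo, cv=5, scoring='accuracy')\n",
    "\n",
    "# One prediction per example, each made while that example was held out.\n",
    "pooled_pred = cross_val_predict(model_demo, X_demo, y_demo, cv=5)\n",
    "pooled = accuracy_score(y_demo, pooled_pred)\n",
    "\n",
    "print('mean of per-fold accuracies:', per_fold.mean())\n",
    "print('pooled accuracy:', pooled)"
   ]
  },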
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's an alternative way of doing cross-validation. We first divide the data into folds explicitly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = cross_validation.StratifiedKFold(y, 5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using this division of data into folds we can run cross-validation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.95430579964850615"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_predict = cross_validation.cross_val_predict(classifier, X, y, cv=cv)\n",
    "metrics.accuracy_score(y, y_predict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see how the examples were divided into folds by looking at the `test_folds` attribute, which holds the index of the test fold of each example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
      " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1\n",
      " 0 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0\n",
      " 0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0\n",
      " 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 2 2 1 1 1 1 2 1 1 2 2 2 1 2\n",
      " 1 2 1 1 1 2 1 1 2 2 1 2 2 2 2 1 2 2 2 1 2 1 2 1 1 2 1 2 2 2 2 1 1 2 2 1 1\n",
      " 1 2 1 1 1 1 1 2 2 1 1 2 1 1 2 2 1 2 1 1 1 1 2 1 1 1 1 1 2 1 2 2 2 3 3 3 3\n",
      " 3 3 3 3 3 3 3 1 1 1 1 1 1 3 1 3 1 1 3 1 1 3 1 3 3 1 1 1 1 1 2 2 2 2 2 2 2\n",
      " 2 3 2 2 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 3 2 3 2 2 2 2 3 3 3 2 2\n",
      " 2 2 3 2 3 2 3 2 2 2 3 2 2 2 2 2 2 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 2 3 3\n",
      " 3 2 3 3 2 2 2 2 2 3 2 2 2 2 3 3 3 3 3 4 3 3 4 4 3 3 3 3 3 3 4 3 3 3 3 3 3\n",
      " 3 4 3 3 3 3 3 4 3 3 4 3 3 3 3 3 3 3 3 3 3 3 3 4 3 4 4 3 4 3 3 3 3 3 4 3 3\n",
      " 4 3 4 3 3 4 3 4 3 3 3 3 3 3 3 3 4 4 3 3 3 3 3 3 4 3 3 3 3 3 3 4 4 4 4 4 4\n",
      " 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4\n",
      " 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4\n",
      " 4 4 4 4 4 4 4 4 4 4 4 4 4 4]\n"
     ]
    }
   ],
   "source": [
    "print(cv.test_folds)"
   ]
  },
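  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The folds follow the order of the examples in the dataset, which is not ideal if that order is not random. Let's shuffle the examples before dividing them into folds:"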
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[3 1 0 3 2 4 3 4 1 2 2 4 1 4 0 0 3 0 1 2 3 1 2 4 2 0 1 1 2 0 3 4 0 1 1 0 1\n",
      " 4 4 3 2 3 3 1 0 1 4 4 2 4 1 2 3 1 0 0 2 2 2 3 0 4 1 4 1 4 0 3 1 2 1 0 4 0\n",
      " 0 0 2 2 1 4 0 1 0 3 3 0 4 3 3 0 4 0 3 3 0 3 2 1 1 4 2 0 3 1 1 4 4 2 2 0 3\n",
      " 1 0 2 2 4 4 0 1 2 3 4 1 4 1 3 1 0 4 4 3 1 2 0 2 2 0 4 0 3 4 2 2 3 1 2 4 4\n",
      " 1 0 0 3 2 3 2 0 4 2 0 0 2 3 0 1 1 2 2 2 2 1 4 0 3 3 3 0 0 3 2 2 3 2 0 3 1\n",
      " 2 1 0 4 0 4 4 2 3 4 3 0 2 0 1 0 0 3 1 4 4 4 2 1 2 3 1 0 3 3 4 1 1 3 2 4 0\n",
      " 0 3 1 2 2 2 3 0 3 3 4 0 1 1 0 1 3 2 3 4 1 1 2 1 3 3 0 3 2 2 2 2 3 4 4 1 4\n",
      " 0 1 3 1 3 0 4 4 1 3 0 4 2 2 2 3 2 2 1 1 0 1 1 1 2 4 2 0 3 3 1 2 3 1 4 0 1\n",
      " 0 4 2 1 3 2 1 2 4 0 4 0 4 2 2 4 4 1 1 2 3 1 1 1 4 4 3 4 4 3 1 0 4 4 2 3 2\n",
      " 4 3 4 4 3 0 1 0 0 2 3 2 4 0 0 3 2 4 1 0 3 4 0 4 3 1 3 4 4 4 4 0 2 3 3 2 4\n",
      " 3 1 0 2 1 4 4 3 1 4 3 4 1 1 2 4 2 2 1 1 3 0 2 3 1 2 2 0 2 4 0 1 0 0 2 1 0\n",
      " 3 3 3 0 0 4 1 0 1 2 0 0 3 0 4 1 0 1 3 3 0 0 2 1 3 0 3 0 1 3 1 3 3 2 4 0 0\n",
      " 0 0 3 0 3 2 1 3 1 3 3 2 1 0 4 0 2 2 4 0 4 2 1 0 4 2 2 0 0 3 4 2 3 2 4 1 2\n",
      " 1 4 3 3 3 1 0 4 0 1 2 4 4 1 3 4 1 2 0 0 4 2 1 3 3 3 1 0 2 2 0 1 4 3 0 0 1\n",
      " 4 4 1 3 4 1 1 4 3 4 3 3 2 0 3 4 2 3 4 1 0 1 4 4 3 1 1 2 0 3 1 4 4 2 4 4 1\n",
      " 1 2 0 1 2 2 0 2 4 4 2 0 3 2]\n"
     ]
    }
   ],
   "source": [
    "cv = cross_validation.StratifiedKFold(y, 5, shuffle=True)\n",
    "print(cv.test_folds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you divide the data into folds again, you will get a different assignment each time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[3 1 0 2 1 4 0 0 4 4 0 4 2 3 3 4 2 0 4 1 2 4 3 3 1 0 1 4 3 3 3 2 0 1 1 0 3\n",
      " 4 1 3 0 0 0 2 0 3 0 4 4 0 0 0 0 4 1 4 3 2 2 1 1 4 2 4 4 2 4 1 1 3 1 1 3 2\n",
      " 4 4 3 3 3 0 4 0 0 4 3 1 0 0 1 1 1 0 0 0 1 2 1 4 1 4 0 1 2 2 2 1 4 3 2 2 4\n",
      " 3 2 3 4 2 1 3 2 0 1 3 4 4 2 2 3 2 2 2 3 0 3 4 1 0 1 0 1 3 3 4 2 2 0 1 4 0\n",
      " 0 2 2 1 3 3 3 0 0 0 2 1 1 0 4 0 1 4 1 0 0 1 3 3 3 2 0 3 0 0 4 0 4 2 0 4 1\n",
      " 2 4 4 3 4 0 1 1 4 0 3 2 2 2 3 2 0 2 3 0 0 1 2 0 2 1 0 4 0 1 2 3 3 2 0 2 1\n",
      " 4 3 0 1 1 1 2 3 0 1 3 2 0 2 4 1 0 0 0 0 4 3 2 4 4 3 4 3 4 4 3 1 0 2 4 1 0\n",
      " 4 3 2 1 1 0 2 2 3 0 4 4 4 3 4 0 3 2 1 2 4 2 4 1 2 1 1 1 1 3 1 2 3 1 0 4 4\n",
      " 3 4 2 2 1 0 4 0 1 3 0 1 1 1 4 3 0 4 2 1 0 1 0 4 2 3 4 4 4 2 2 2 0 1 3 0 4\n",
      " 3 1 3 2 4 3 1 3 3 4 0 1 1 3 0 2 4 0 4 4 3 3 3 4 0 0 4 2 3 0 0 0 4 1 1 2 4\n",
      " 1 4 1 2 4 3 3 2 1 4 4 1 4 1 4 1 4 4 2 4 2 0 2 3 2 2 3 2 3 1 2 2 4 3 0 1 2\n",
      " 4 2 2 2 3 3 3 4 0 3 2 4 3 2 2 2 0 0 3 3 4 0 3 3 1 1 2 0 4 2 3 1 0 3 3 4 0\n",
      " 1 1 4 1 4 2 4 3 2 4 2 3 1 0 3 2 2 3 1 0 2 3 1 2 0 4 3 0 0 3 2 2 4 3 3 3 2\n",
      " 4 1 1 0 1 2 1 4 2 1 1 3 2 0 4 3 0 1 2 2 1 4 1 3 0 1 3 3 2 3 1 3 4 1 3 1 4\n",
      " 3 3 2 4 2 2 1 2 0 0 2 2 2 0 3 4 0 4 0 4 1 4 4 1 0 2 1 0 2 3 0 0 1 4 1 0 0\n",
      " 0 3 1 3 0 1 4 1 3 3 1 2 0 1]\n"
     ]
    }
   ],
   "source": [
    "# No random_state is given, so every call produces a new shuffle.\n",
    "cv = cross_validation.StratifiedKFold(y, 5, shuffle=True)\n",
    "print(cv.test_folds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get the same division into folds every time, fix the seed of the random number generator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv = cross_validation.StratifiedKFold(y, 5, shuffle=True, random_state=0)\n",
    "# random_state sets the seed for the random number generator."
   ]
  }
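  ,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In `sklearn.model_selection` the splitter classes no longer take `y` in the constructor and, as far as we know, do not expose a `test_folds` attribute. The following sketch, which assumes scikit-learn 0.18 or later, rebuilds the same kind of fold assignment from `split()` and checks that a fixed `random_state` reproduces it; the two assignments should come out identical. The helper `fold_assignment` and the `_demo` names are hypothetical, introduced only for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.datasets import load_breast_cancer\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "\n",
    "X_demo, y_demo = load_breast_cancer(return_X_y=True)\n",
    "\n",
    "def fold_assignment(splitter, X, y):\n",
    "    # For each example, record the index of the fold in which it is tested.\n",
    "    folds = np.empty(len(y), dtype=int)\n",
    "    for fold_index, (_, test_idx) in enumerate(splitter.split(X, y)):\n",
    "        folds[test_idx] = fold_index\n",
    "    return folds\n",
    "\n",
    "# Two independent splitters with the same seed.\n",
    "first = fold_assignment(\n",
    "    StratifiedKFold(n_splits=5, shuffle=True, random_state=0), X_demo, y_demo)\n",
    "second = fold_assignment(\n",
    "    StratifiedKFold(n_splits=5, shuffle=True, random_state=0), X_demo, y_demo)\n",
    "\n",
    "print(first[:20])\n",
    "print(np.array_equal(first, second))"
   ]
  }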
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}