{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Natural Language Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sources:\n", "\n", " * [Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit](http://www.nltk.org/book/) by Steven Bird, Ewan Klein, and Edward Loper\n", " * [NLP Tutorial Using Python NLTK (Simple Examples)](https://likegeeks.com/nlp-tutorial-using-python-nltk/)\n", " * [NLTK: Natural Language Toolkit Documentation](http://www.nltk.org/)\n", " * [Demos by the Cognitive Computation Group](http://cogcomp.cs.illinois.edu/page/demos) at the University of Illinois, Champaign-Urbana\n", " \n", "The NLTK package can be installed by\n", "\n", " conda install nltk" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First step in any natural language work...get some text. The book provides several large text samples that you can download with the following commands. A window will pop up from which you can select various [corpora](https://en.wikipedia.org/wiki/Text_corpus). Select \"book\" and download it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "nltk.download() # download text examples from \"book\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's assume this as been done. Now the material for the book can be used by doing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk.book as bk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cool! All of *Moby Dick*! Let's take a look at the text. The text objects can be treated as lists, so" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bk.text1[:8]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bk.text1[1000:1010]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very common way to characterize some text is by a count of the times\n", "a word appears in the text. The text objects have a number of useful\n", "methods, such as" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bk.text1.count('whale')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now write a python function to count the frequency of\n", "occurrence of a number of words. NLTK provides a way to do this for\n", "all of the words in a text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordCounts = bk.FreqDist(bk.text1)\n", "len(wordCounts)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(wordCounts)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordCounts.most_common()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need a plot! We can do it ourselves, or use the `plot` method already defined for the `FreqDist` object.|" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10,10))\n", "wordCounts.plot(50);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But, where is ''Ishmael'' and ''Starbuck''?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordCounts['Ishmael'], wordCounts.freq('Ishmael')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordCounts['Starbuck'], wordCounts.freq('Starbuck')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Who is mentioned more in *Moby Dick*?\n", "\n", "Usually we will want to ignore short words, including punctuation, and\n", "words that occur a very small number of times. We can use a list\n", "comprehension to build this list, but first need a way to create a\n", "list of words without repetitions. Python `set` to the rescue." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(bk.text1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(bk.text1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So each word is used, on average, ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(bk.text1) / len(set(bk.text1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... about 13 times. Now, let's step through each word in the set and only\n", "keep ones that are longer than 8 characters and that appear more than\n", "20 times, and let's sort the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sigWords = sorted([word for word in set(bk.text1) if len(word) > 8 and wordCounts[word] > 20])\n", "len(sigWords)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*sigWords) # prints all on one line. Easier to see than just evaluating sigWords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another common technique is to look for *collocations*, or pairs of\n", "words that appear together more often than expected by considering the\n", "number of times each word appears. Yep, NLTK has a method for that." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bk.text1.collocations()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using text from a URL" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = urllib.request.urlopen('https://www.gutenberg.org/files/46/46-h/46-h.htm')\n", "html = response.read()\n", "# print(html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract the text from the html, we will use [Beautiful Soup](https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup = BeautifulSoup(html,'html.parser')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup(['script', 'style'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for script in soup([\"script\", \"style\"]):\n", " script.extract()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = soup.get_text(strip=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text[:100]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens = [t for t in text.split()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(tokens)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens[:20]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "freq = nltk.FreqDist(tokens)\n", "freq.plot(20, cumulative=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stopwords = nltk.corpus.stopwords.words('english')\n", "stopwords[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens_no_stopwords = [token for token in tokens if token not in stopwords]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(tokens), len(tokens_no_stopwords)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "freq = nltk.FreqDist(tokens_no_stopwords)\n", "freq.plot(20, cumulative=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenize into sentences and words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentences = nltk.tokenize.sent_tokenize(text)\n", "print(len(sentences))\n", "sentences[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get rid of those \\n's and \\r's." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentences = nltk.tokenize.sent_tokenize(text.replace('\\n', '').replace('\\r', ''))\n", "print(len(sentences))\n", "sentences[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also tokenize into words, in a better way than what we did above." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words = nltk.tokenize.word_tokenize(text.replace('\\n', ''))\n", "print(len(words))\n", "words[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We often want the part-of-speech (POS) for each word to analyze the structure of sentences." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nltk.pos_tag(words[1000:1010])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get rid of stopwords again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words_no_stopwords = [word for word in words if word not in stopwords]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(words_no_stopwords)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "freq = nltk.FreqDist(words_no_stopwords)\n", "freq.plot(20);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's remove words that are single characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words_no_stopwords = [word for word in words if word not in stopwords and len(word) > 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "freq = nltk.FreqDist(words_no_stopwords)\n", "freq.plot(20);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "freq.most_common()[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Compare texts\n", "\n", "Let's come up with a very simple way to compare texts and apply it to see how it rates the similarities among two Dicken's books and two Wilde books." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_text(url):\n", " response = urllib.request.urlopen(url)\n", " html = response.read()\n", " soup = BeautifulSoup(html,'html.parser')\n", " for script in soup([\"script\", \"style\"]):\n", " script.extract()\n", " text = soup.get_text(strip=True)\n", " words = nltk.tokenize.word_tokenize(text)\n", " stopwords = nltk.corpus.stopwords.words('english')\n", " words_no_stopwords = [word for word in words if word not in stopwords]\n", " freq = nltk.FreqDist(words_no_stopwords)\n", " commonWordsCounts = freq.most_common()[:500]\n", " return [word for (word, count) in commonWordsCounts if len(word) > 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dickens1 = load_text('http://www.gutenberg.org/cache/epub/730/pg730.txt')\n", "dickens2 = load_text('http://www.gutenberg.org/files/786/786-0.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wilde1 = load_text('http://www.gutenberg.org/files/790/790-0.txt')\n", "wilde2 = load_text('http://www.gutenberg.org/cache/epub/14522/pg14522.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many words in the 500 most common do two Dickens books have in common?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(dickens1) & set(dickens2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and two Wilde books?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(wilde1) & set(wilde2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about the first Dickens book compared to the two Wilde books?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(dickens1) & set(wilde1)), len(set(dickens1) & set(wilde2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and the second Dickens book compared to the two Wilde books?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(dickens2) & set(wilde1)), len(set(dickens2) & set(wilde2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*dickens1)\n", "print(*dickens2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*wilde1)\n", "print(*wilde2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Better Pre-processing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "wpt = nltk.WordPunctTokenizer()\n", "stop_words = nltk.corpus.stopwords.words('english')\n", "\n", "def simplify(doc):\n", " # lower case and remove special characters\\whitespaces\n", " doc = re.sub(r'[^a-zA-Z\\s]', '', doc, re.I|re.A)\n", " doc = doc.lower()\n", " doc = doc.strip()\n", " # tokenize document\n", " tokens = wpt.tokenize(doc)\n", " # filter stopwords out of document\n", " filtered_tokens = [token for token in tokens if token not in stop_words]\n", " # re-create document from filtered tokens\n", " doc = ' '.join(filtered_tokens)\n", " return doc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "simplify('Hello, Mary, wouldn''t you like to skip this class?')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_text(url):\n", " response = urllib.request.urlopen(url)\n", " html = response.read()\n", " soup = BeautifulSoup(html,'html.parser')\n", " for script in soup([\"script\", \"style\"]):\n", " script.extract()\n", " text = soup.get_text(strip=True)\n", " text = simplify(text)\n", " words = nltk.tokenize.word_tokenize(text)\n", " stopwords = nltk.corpus.stopwords.words('english')\n", " words_no_stopwords = [word for word in words if word not in stopwords]\n", " freq = nltk.FreqDist(words_no_stopwords)\n", " commonWordsCounts = freq.most_common()[:500]\n", " return [word for (word, count) in commonWordsCounts if len(word) > 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*wilde1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wilde1 = load_text('http://www.gutenberg.org/cache/epub/301/pg301.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(*wilde1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dickens1 = load_text('http://www.gutenberg.org/cache/epub/730/pg730.txt')\n", "dickens2 = load_text('http://www.gutenberg.org/files/786/786-0.txt')" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "wilde1 = load_text('http://www.gutenberg.org/files/790/790-0.txt')\n", "wilde2 = load_text('http://www.gutenberg.org/cache/epub/14522/pg14522.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_in_common = []\n", "for txt1 in [dickens1, dickens2, wilde1, wilde2]:\n", " for txt2 in [dickens1, dickens2, wilde1, wilde2]:\n", " n_in_common.append(len(set(txt1) & set(txt2)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_in_common" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.array(n_in_common).reshape(4,4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "pd.DataFrame(np.array(n_in_common).reshape(4,4), \n", " index=['dickens1', 'dickens2', 'wilde1', 'wilde2'],\n", " columns=['dickens1', 'dickens2', 'wilde1', 'wilde2'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }