{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Natural Language Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sources:\n", "\n", " * [Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit](http://www.nltk.org/book/) by Steven Bird, Ewan Klein, and Edward Loper\n", " * [NLP Tutorial Using Python NLTK (Simple Examples)](https://likegeeks.com/nlp-tutorial-using-python-nltk/)\n", " * [NLTK: Natural Language Toolkit Documentation](http://www.nltk.org/)\n", " * [Demos by the Cognitive Computation Group](http://cogcomp.cs.illinois.edu/page/demos) at the University of Illinois, Champaign-Urbana\n", " \n", "The NLTK package can be installed by\n", "\n", " conda install nltk" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First step in any natural language work...get some text. The book provides several large text samples that you can download with the following commands. A window will pop up from which you can select various [corpora](https://en.wikipedia.org/wiki/Text_corpus). Select \"book\" and download it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "nltk.download() # download text examples from \"book\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's assume this as been done. Now the material for the book can be used by doing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk.book as bk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cool! All of *Moby Dick*! Let's take a look at the text. The text objects can be treated as lists, so" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bk.text1[:8]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bk.text1[1000:1010]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A very common way to characterize some text is by a count of the times\n", "a word appears in the text. The text objects have a number of useful\n", "methods, such as" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bk.text1.count('whale')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now write a python function to count the frequency of\n", "occurrence of a number of words. NLTK provides a way to do this for\n", "all of the words in a text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordCounts = bk.FreqDist(bk.text1)\n", "len(wordCounts)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(wordCounts)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordCounts.most_common()[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need a plot! We can do it ourselves, or use the `plot` method already defined for the `FreqDist` object.|" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10,10))\n", "wordCounts.plot(50);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But, where is ''Ishmael'' and ''Starbuck''?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordCounts['Ishmael'], wordCounts.freq('Ishmael')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordCounts['Starbuck'], wordCounts.freq('Starbuck')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Who is mentioned more in *Moby Dick*?\n", "\n", "Usually we will want to ignore short words, including punctuation, and\n", "words that occur a very small number of times. We can use a list\n", "comprehension to build this list, but first need a way to create a\n", "list of words without repetitions. Python `set` to the rescue." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(bk.text1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(bk.text1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So each word is used, on average, ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(bk.text1) / len(set(bk.text1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "... about 13 times. Now, let's step through each word in the set and only\n", "keep ones that are longer than 8 characters and that appear more than\n", "20 times, and let's sort the result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sigWords = sorted([word for word in set(bk.text1) if len(word) > 8 and wordCounts[word] > 20])\n", "len(sigWords)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*sigWords) # prints all on one line. Easier to see than just evaluating sigWords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another common technique is to look for *collocations*, or pairs of\n", "words that appear together more often than expected by considering the\n", "number of times each word appears. Yep, NLTK has a method for that." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bk.text1.collocations()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using text from a URL" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = urllib.request.urlopen('https://www.gutenberg.org/files/46/46-h/46-h.htm')\n", "html = response.read()\n", "# print(html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To extract the text from the html, we will use [Beautiful Soup](https://medium.freecodecamp.org/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup = BeautifulSoup(html,'html.parser')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup(['script', 'style'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for script in soup([\"script\", \"style\"]):\n", " script.extract()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = soup.get_text(strip=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text[:100]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens = [t for t in text.split()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(tokens)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens[:20]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "freq = nltk.FreqDist(tokens)\n", "freq.plot(20, cumulative=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stopwords = nltk.corpus.stopwords.words('english')\n", "stopwords[:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens_no_stopwords = [token for token in tokens if token not in stopwords]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(tokens), len(tokens_no_stopwords)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "freq = nltk.FreqDist(tokens_no_stopwords)\n", "freq.plot(20, cumulative=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenize into sentences and words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentences = nltk.tokenize.sent_tokenize(text)\n", "print(len(sentences))\n", "sentences[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get rid of those \\n's and \\r's." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentences = nltk.tokenize.sent_tokenize(text.replace('\\n', '').replace('\\r', ''))\n", "print(len(sentences))\n", "sentences[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also tokenize into words, in a better way than what we did above." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words = nltk.tokenize.word_tokenize(text.replace('\\n', ''))\n", "print(len(words))\n", "words[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We often want the part-of-speech (POS) for each word to analyze the structure of sentences." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nltk.pos_tag(words[1000:1010])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get rid of stopwords again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words_no_stopwords = [word for word in words if word not in stopwords]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(words_no_stopwords)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "freq = nltk.FreqDist(words_no_stopwords)\n", "freq.plot(20);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's remove words that are single characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "words_no_stopwords = [word for word in words if word not in stopwords and len(word) > 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "freq = nltk.FreqDist(words_no_stopwords)\n", "freq.plot(20);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "freq.most_common()[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Compare texts\n", "\n", "Let's come up with a very simple way to compare texts and apply it to see how it rates the similarities among two Dicken's books and two Wilde books." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_text(url):\n", " response = urllib.request.urlopen(url)\n", " html = response.read()\n", " soup = BeautifulSoup(html,'html.parser')\n", " for script in soup([\"script\", \"style\"]):\n", " script.extract()\n", " text = soup.get_text(strip=True)\n", " words = nltk.tokenize.word_tokenize(text)\n", " stopwords = nltk.corpus.stopwords.words('english')\n", " words_no_stopwords = [word for word in words if word not in stopwords]\n", " freq = nltk.FreqDist(words_no_stopwords)\n", " commonWordsCounts = freq.most_common()[:500]\n", " return [word for (word, count) in commonWordsCounts if len(word) > 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dickens1 = load_text('http://www.gutenberg.org/cache/epub/730/pg730.txt')\n", "dickens2 = load_text('http://www.gutenberg.org/files/786/786-0.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wilde1 = load_text('http://www.gutenberg.org/files/790/790-0.txt')\n", "wilde2 = load_text('http://www.gutenberg.org/cache/epub/14522/pg14522.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many words in the 500 most common do two Dickens books have in common?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(dickens1) & set(dickens2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and two Wilde books?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(wilde1) & set(wilde2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How about the first Dickens book compared to the two Wilde books?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(dickens1) & set(wilde1)), len(set(dickens1) & set(wilde2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and the second Dickens book compared to the two Wilde books?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(set(dickens2) & set(wilde1)), len(set(dickens2) & set(wilde2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*dickens1)\n", "print(*dickens2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*wilde1)\n", "print(*wilde2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Better Pre-processing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "wpt = nltk.WordPunctTokenizer()\n", "stop_words = nltk.corpus.stopwords.words('english')\n", "\n", "def simplify(doc):\n", " # lower case and remove special characters\\whitespaces\n", " doc = re.sub(r'[^a-zA-Z\\s]', '', doc, re.I|re.A)\n", " doc = doc.lower()\n", " doc = doc.strip()\n", " # tokenize document\n", " tokens = wpt.tokenize(doc)\n", " # filter stopwords out of document\n", " filtered_tokens = [token for token in tokens if token not in stop_words]\n", " # re-create document from filtered tokens\n", " doc = ' '.join(filtered_tokens)\n", " return doc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "simplify('Hello, Mary, wouldn''t you like to skip this class?')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def load_text(url):\n", " response = urllib.request.urlopen(url)\n", " html = response.read()\n", " soup = BeautifulSoup(html,'html.parser')\n", " for script in soup([\"script\", \"style\"]):\n", " script.extract()\n", " text = soup.get_text(strip=True)\n", " text = simplify(text)\n", " words = nltk.tokenize.word_tokenize(text)\n", " stopwords = nltk.corpus.stopwords.words('english')\n", " words_no_stopwords = [word for word in words if word not in stopwords]\n", " freq = nltk.FreqDist(words_no_stopwords)\n", " commonWordsCounts = freq.most_common()[:500]\n", " return [word for (word, count) in commonWordsCounts if len(word) > 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(*wilde1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wilde1 = load_text('http://www.gutenberg.org/cache/epub/301/pg301.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(*wilde1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dickens1 = load_text('http://www.gutenberg.org/cache/epub/730/pg730.txt')\n", "dickens2 = load_text('http://www.gutenberg.org/files/786/786-0.txt')" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "wilde1 = load_text('http://www.gutenberg.org/files/790/790-0.txt')\n", "wilde2 = load_text('http://www.gutenberg.org/cache/epub/14522/pg14522.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_in_common = []\n", "for txt1 in [dickens1, dickens2, wilde1, wilde2]:\n", " for txt2 in [dickens1, dickens2, wilde1, wilde2]:\n", " n_in_common.append(len(set(txt1) & set(txt2)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "n_in_common" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.array(n_in_common).reshape(4,4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "pd.DataFrame(np.array(n_in_common).reshape(4,4), \n", " index=['dickens1', 'dickens2', 'wilde1', 'wilde2'],\n", " columns=['dickens1', 'dickens2', 'wilde1', 'wilde2'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }