{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "NLP with Transformers with the T5 Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "In this assignment you will learn how to apply the T5 pre-trained model to three tasks\n", "1. summarization,\n", "2. translation,\n", "3. grammar checking." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will require installation of `pytorch` and the `transformers` package. You should already have `pytorch` installed. To install `transformers`, you can use\n", "\n", " pip install transformers\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then run the following code cell. The first time it is run, the t5-base model will be downloaded." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import transformers as tr\n", "\n", "# initialize the model architecture and weights\n", "model = tr.T5ForConditionalGeneration.from_pretrained(\"t5-base\")\n", "# initialize the model tokenizer\n", "tokenizer = tr.T5Tokenizer.from_pretrained(\"t5-base\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you will use `model` and `tokenizer` to perform the above tasks. Here are some examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarize\n", "\n", "First, let's summarize this text with at most 100 words." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "text = \"\"\"\n", "Julia was designed from the beginning for high performance. Julia programs compile to efficient \n", "native code for multiple platforms via LLVM.\n", "Julia is dynamically typed, feels like a scripting language, and has good support for interactive use.\n", "Reproducible environments make it possible to recreate the same Julia environment every time, \n", "across platforms, with pre-built binaries.\n", "Julia uses multiple dispatch as a paradigm, making it easy to express many object-oriented \n", "and functional programming patterns. The talk on the Unreasonable Effectiveness of Multiple \n", "Dispatch explains why it works so well.\n", "Julia provides asynchronous I/O, metaprogramming, debugging, logging, profiling, a package manager, \n", "and more. One can build entire Applications and Microservices in Julia.\n", "Julia is an open source project with over 1,000 contributors. It is made available under the \n", "MIT license. The source code is available on GitHub.\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "929" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text)" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor([[21603, 10, 18618, 47, 876, 45, 8, 1849, 21, 306,\n", " 821, 5, 18618, 1356, 2890, 699, 12, 2918, 4262, 1081,\n", " 21, 1317, 5357, 1009, 3, 10376, 12623, 5, 18618, 19,\n", " 4896, 1427, 686, 26, 6, 4227, 114, 3, 9, 4943,\n", " 53, 1612, 6, 11, 65, 207, 380, 21, 6076, 169,\n", " 5, 419, 1409, 4817, 2317, 8258, 143, 34, 487, 12,\n", " 23952, 8, 337, 18618, 1164, 334, 97, 6, 640, 5357,\n", " 6, 28, 554, 18, 16152, 2701, 5414, 5, 18618, 2284,\n", " 1317, 17648, 38, 3, 9, 20491, 6, 492, 34, 514,\n", " 12, 3980, 186, 3735, 18, 9442, 11, 5014, 6020, 4264,\n", " 5, 37, 1350, 30, 8, 597, 864, 739, 179, 18652,\n", " 655, 13, 16821, 3, 23664, 14547, 3, 9453, 572, 34,\n", " 930, 78, 168, 5, 18618, 795, 3, 9, 30373, 27,\n", " 87, 667, 6, 10531, 7050, 53, 6, 20, 14588, 3896,\n", " 6, 3, 12578, 6, 9639, 53, 6, 3, 9, 2642,\n", " 2743, 6, 11, 72, 5, 555, 54, 918, 1297, 15148,\n", " 11, 5893, 5114, 7, 16, 18618, 5, 18618, 19, 46,\n", " 539, 1391, 516, 28, 147, 11668, 13932, 7, 5, 94,\n", " 19, 263, 347, 365, 8, 3, 12604, 3344, 5, 37,\n", " 1391, 1081, 19, 347, 30, 3, 30516, 5, 1]])" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inputs = tokenizer.encode(\"summarize: \" + text, return_tensors=\"pt\", max_length=512, truncation=True)\n", "inputs" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "outputs = model.generate(inputs, max_length=100, min_length=10, length_penalty=1.0, num_beams=4,\n", " num_return_sequences=3)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([[ 0, 18618, 19, 4896, 1427, 686, 26, 6, 4227, 114,\n", " 3, 9, 4943, 53, 1612, 6, 11, 65, 207, 380,\n", " 21, 6076, 169, 3, 5, 18618, 795, 3, 9, 30373,\n", " 27, 87, 667, 6, 10531, 7050, 53, 6, 20, 14588,\n", " 3896, 6, 3, 12578, 6, 9639, 53, 6, 3, 9,\n", " 2642, 2743, 6, 11, 72, 3, 5, 8, 1391, 1081,\n", " 19, 347, 30, 3, 30516, 365, 8, 3, 12604, 3344,\n", " 3, 5, 1],\n", " [ 0, 18618, 19, 4896, 1427, 686, 26, 6, 4227, 114,\n", " 3, 9, 4943, 53, 1612, 6, 11, 65, 207, 380,\n", " 21, 6076, 169, 3, 5, 18618, 795, 3, 9, 30373,\n", " 27, 87, 667, 6, 10531, 7050, 53, 6, 20, 14588,\n", " 3896, 6, 3, 12578, 6, 9639, 53, 6, 3, 9,\n", " 2642, 2743, 6, 11, 72, 3, 5, 34, 19, 46,\n", " 539, 1391, 516, 28, 147, 11668, 13932, 7, 3, 5,\n", " 1, 0, 0],\n", " [ 0, 18618, 19, 4896, 1427, 686, 26, 6, 4227, 114,\n", " 3, 9, 4943, 53, 1612, 6, 11, 65, 207, 380,\n", " 21, 6076, 169, 3, 5, 18618, 795, 3, 9, 30373,\n", " 27, 87, 667, 6, 10531, 7050, 53, 6, 20, 14588,\n", " 3896, 6, 3, 12578, 6, 9639, 53, 6, 3, 9,\n", " 2642, 2743, 6, 11, 72, 3, 5, 18618, 19, 46,\n", " 539, 1391, 516, 28, 147, 11668, 13932, 7, 3, 5,\n", " 1, 0, 0]])\n", "torch.Size([3, 73])\n" ] } ], "source": [ "print(outputs)\n", "print(outputs.shape)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Result 1\n", "Julia is dynamically typed, feels like a scripting language, and has good support for interactive use. Julia provides asynchronous I/O, metaprogramming, debugging, logging, profiling, a package manager, and more. the source code is available on GitHub under the MIT license.\n", "\n", "Result 2\n", "Julia is dynamically typed, feels like a scripting language, and has good support for interactive use. Julia provides asynchronous I/O, metaprogramming, debugging, logging, profiling, a package manager, and more. it is an open source project with over 1,000 contributors.\n", "\n", "Result 3\n", "Julia is dynamically typed, feels like a scripting language, and has good support for interactive use. Julia provides asynchronous I/O, metaprogramming, debugging, logging, profiling, a package manager, and more. Julia is an open source project with over 1,000 contributors.\n" ] } ], "source": [ "for i in range(outputs.shape[0]):\n", " print('\\nResult', i + 1)\n", " print(tokenizer.decode(outputs[i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Translation\n", "\n", "Now let's translate the text to German." ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Result 1\n", "Julia wurde von Anfang an für hohe Leistung entwickelt. Julia-Programme kompilieren zu effizientem nativen Code für mehrere Plattformen über LLVM. Julia ist dynamisch getippt, fühlt sich wie eine Skriptsprache und hat gute Unterstützung für interaktive Verwendung. Reproduzierbare Umgebungen ermöglichen es, die gleiche Julia-Umgebung jedes Mal, über Plattformen hinweg, mit vorgefertigten Binärdateien zu erschaffen. Julia\n", "\n", "Result 2\n", "Julia wurde von Anfang an für hohe Performance entwickelt. Julia-Programme kompilieren zu effizientem nativen Code für mehrere Plattformen über LLVM. Julia ist dynamisch getippt, fühlt sich wie eine Skriptsprache und hat gute Unterstützung für interaktive Verwendung. Reproduzierbare Umgebungen ermöglichen es, die gleiche Julia-Umgebung jedes Mal, über Plattformen hinweg, mit vorgefertigten Binärdateien zu erschaffen. Julia verwendet\n", "\n", "Result 3\n", "Julia wurde von Anfang an für hohe Leistung entwickelt. Julia-Programme kompilieren zu effizientem nativem Code für mehrere Plattformen über LLVM. Julia ist dynamisch getippt, fühlt sich wie eine Skriptsprache und hat gute Unterstützung für interaktive Verwendung. Reproduzierbare Umgebungen ermöglichen es, die gleiche Julia-Umgebung jedes Mal, über Plattformen hinweg, mit vorgefertigten Binärdateien zu erschaffen. Julia\n" ] } ], "source": [ "inputs = tokenizer.encode('translate English to German: ' + text, return_tensors='pt',\n", " max_length=512, truncation=True)\n", "outputs = model.generate(inputs, max_length=1500, min_length=20, length_penalty=1.0, num_beams=10,\n", " num_return_sequences=3)\n", "\n", "for i in range(outputs.shape[0]):\n", " print('\\nResult', i + 1)\n", " print(tokenizer.decode(outputs[i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grammar Checker\n", "\n", "Now to check some grammar." ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "unacceptable\n" ] } ], "source": [ "sentence = 'This sentence do not be grammatical.'\n", "inputs = tokenizer.encode('cola sentence: ' + sentence, return_tensors='pt')\n", "outputs = model.generate(inputs)\n", "print(tokenizer.decode(outputs[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Requirements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarization\n", "\n", "Cut and paste a news story that has at least five paragraphs that describes the recent news about the COVID-19 vaccine developed by the University of Oxford University and AstraZenec. Try at least three values for each of the parameters:\n", "\n", "* `max_length`,\n", "* `min_length`,\n", "* `length_penalty`, and\n", "* `num_beams`.\n", "\n", "Copy and paste into a markdown cell what you consider to be the best summarization of the news article. Also, describe the effects of these four parameters on the results with at least four sentences.\n", "\n", "[This article](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb) will help you understand a bit more about these parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Translation\n", "\n", "Try translating the first paragraph of your news story into German. Use `num_return_sequences=5` and translate the German back to English using [translate.google.com](https://translate.google.com/). Experiment with at least three values for the above four parameters. Using the google translations, describe which German translation is best, and which parameter values led to its generation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grammar Checker\n", "\n", "Write a for loop that checks the grammatical correctness of each sentence in a list of sentences. Apply it to the first paragraph of your news article. Describe the results.\n", "\n", "Now modify at least three of the sentences in your paragraph to make the sentences grammatically incorrect and repeat the analysis of all sentences. Describe the results. Are your grammatically incorrect sentences correctly identified?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Extra Credit\n", "\n", "Read some of the on-line documentation and examples that describe how to fine-tune the T5 model to do better English to German and German to English translations. Try fine-tuning the T5 model we use here on example translations. Does it perform better?\n", "\n", "Warning: This will take a lot of time to figure out. First try to find examples on-line of training to fine-tune the model." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }