Text Processing: Analyzing The Works of Shakespeare

This is an HTML version of the Jupyter Notebook. Download the zip file with all of the necessary files here.

In this notebook, we will analyze word frequency of all of the works of Shakespeare!

We will:

- Read all of Shakespeare's works from a text file

- Do a word frequency analysis: total word counts and total unique words

- Compute the 30 most frequently used words in Shakespeare's works, not including glue words or stop words such as "the" and "and"

To read a file, we use a context manager. The most common way to use one is the with statement: open() returns a file object, and the with statement closes it automatically when the block ends. Use the following code to open the attached file "shakespeare.txt":

with open(filename, "r") as file:
    text = file.read()

The above code reads the entire works of Shakespeare into a single string! The same code can read a file of any size, as long as its contents fit in memory; the only limitation is the RAM on your computer. Given enough RAM, even a very large file such as a text dump of Wikipedia could be read this way.

In [1]:
with open("shakespeare.txt", "r") as file:
    text = file.read()
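
As a small illustration of the memory point above, here is a sketch that reads a file both ways: all at once with read(), and one line at a time by looping over the file object. The temporary file sample.txt and its contents are stand-ins invented for this example so that it runs without shakespeare.txt.

```python
import os
import tempfile

# Hypothetical stand-in text so the sketch runs without shakespeare.txt.
sample = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write(sample)

    # read() loads the whole file into memory as one string.
    with open(path, "r", encoding="utf-8") as file:
        whole_text = file.read()

    # Looping over the file object yields one line at a time instead,
    # which keeps memory use low for very large files.
    line_count = 0
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            line_count += 1

print(file.closed)  # True: the with block closed the file for us
print(line_count)
```

For a file the size of shakespeare.txt, read() is perfectly fine; the line-by-line loop only matters when the file is too large to hold in memory.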

Use the type() function to determine the type of text.

In [ ]:

Use slicing to OUTPUT the first 500 characters (from the play Coriolanus).

Notice the newline \n character. Now print the string.

In [ ]:
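
The difference between outputting a string and printing it can be seen on a short stand-in string. The sketch below uses an invented two-line sample rather than the real text, and slices 20 characters instead of 500.

```python
# Hypothetical stand-in for the text read from shakespeare.txt.
sample = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

print(type(sample))      # <class 'str'>

first_20 = sample[:20]   # slicing returns a new, shorter string
print(repr(first_20))    # repr shows the newline as the two characters \n
print(first_20)          # print renders the newline as an actual line break
```

In a notebook, evaluating first_20 as the last line of a cell displays it the way repr does, which is why the \n shows up there.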

Find the text of a famous Shakespeare quote: "To be, or not to be, that is the question:". Do this by calling the find method, which returns the index at which the quote starts.

In [ ]:
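
Here is a minimal sketch of how find behaves, using a short invented sample instead of the full text. Note that find returns -1 rather than raising an error when the search string is not present.

```python
# Hypothetical stand-in for the full Shakespeare text.
sample = "HAMLET.\nTo be, or not to be, that is the question:\nWhether 'tis nobler"

exact = sample.find("To be, or not to be")       # index of the first match
wrong_case = sample.find("to be, or not to be")  # -1: capitalization differs

print(exact, wrong_case)
print(sample[exact:exact + 19])  # slice the match back out of the text
```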

You may run into some difficulties with the previous problem. This is because the text you're looking for needs to match Shakespeare's text exactly, including punctuation and capitalization. Let's normalize the text by lowercasing it and removing all punctuation, including newline characters.

Do this by defining a function remove_punctuation which accepts a string of text and returns the string with all punctuation and newline characters removed.

In [8]:
def remove_punctuation(txt):
    """Convert text to lower case. Use the function replace(replacedtext,newtext) to 
    remove \n and punctuations from text: punc = [".",":",",",";","'",'"', "!","?"]
    Then return text."""
    # lower text by calling the method lower() on the string
    # replace \n with a SPACE
    # loop through punc and replace each punctuation with empty string ""
    # remember to return the text!

In [ ]:

Test your remove_punctuation function on a small text sample.

In [ ]:
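
One possible way to fill in the skeleton is sketched below. It follows the comments in the cell above step by step; it is only one of several reasonable solutions, and the test sentence is invented for illustration.

```python
def remove_punctuation(txt):
    """Lowercase txt, turn newlines into spaces, and drop punctuation."""
    punc = [".", ":", ",", ";", "'", '"', "!", "?"]
    txt = txt.lower()              # strings are immutable, so reassign
    txt = txt.replace("\n", " ")   # newline becomes a SPACE, not ""
    for mark in punc:
        txt = txt.replace(mark, "")
    return txt

cleaned = remove_punctuation("To be, or not to be:\nthat is the question!")
print(cleaned)  # to be or not to be that is the question
```

Replacing the newline with a space rather than an empty string keeps the last word of one line from merging with the first word of the next. Removing apostrophes does merge contractions, so "don't" becomes "dont".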

Call remove_punctuation on the Shakespeare text. Now find your famous Shakespeare quote again. Print out approximately 300 characters after your quote.

In [ ]:
In [ ]:
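
After normalizing the text, the search string has to be normalized in the same way, or it will still fail to match. The sketch below shows this on an invented sample; remove_punctuation is redefined here only so the example is self-contained.

```python
def remove_punctuation(txt):
    """Stand-in for the function defined earlier in the notebook."""
    txt = txt.lower().replace("\n", " ")
    for mark in [".", ":", ",", ";", "'", '"', "!", "?"]:
        txt = txt.replace(mark, "")
    return txt

# Hypothetical stand-in for the full Shakespeare text.
sample = "HAMLET.\nTo be, or not to be, that is the question:\nWhether 'tis nobler in the mind"
clean = remove_punctuation(sample)

# Normalize the query too, so it matches the normalized text.
query = remove_punctuation("To be, or not to be, that is the question:")
start = clean.find(query)
end = start + len(query)

print(start)
print(clean[end:end + 300])  # slicing past the end of a string is safe
```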

Compute the total number of characters of all of the works of Shakespeare.

In [ ]:

Approximately how many words are there in the entire works of Shakespeare? Use split(). Recall that split() returns a list of all the words in a string; by default, split() splits on whitespace.

In [ ]:
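
The two counts above can be sketched on a short invented sample: len gives the number of characters, and the length of the list returned by split() gives the number of words.

```python
# Hypothetical stand-in for the normalized Shakespeare text.
sample = "to be or not to be\nthat  is the question"

print(len(sample))      # counts every character, including spaces and \n

words = sample.split()  # no argument: split on any run of whitespace
print(words)
print(len(words))       # 10

# Passing an explicit separator changes the behaviour: two consecutive
# spaces now produce an empty string in the result.
print("that  is".split(" "))  # ['that', '', 'is']
```

Calling split() with no argument is usually what we want for word counts, because it ignores repeated spaces and newlines.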

Write the function frequency below. The function takes an input string text and returns a dictionary whose keys are the words and whose values are the word counts.

For example, frequency("baa baa black sheep") returns the dictionary {"baa":2, "black":1, "sheep":1}.

In [24]:
def frequency(text):
    """Returns dictionary of word:counts key-value pairs."""
    # create empty dictionary
    # create list of individual words of all of Shakespeare's texts(Hint: Call split())
    # loop through list of words
    # add to dictionary. Remember to check to see if a word is in the dictionary first. 

    # return the dictionary

Test your frequency function on a small text sample.

In [ ]:
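
One possible implementation of frequency, following the comments in the skeleton above, is sketched here. It is not the only way to write it.

```python
def frequency(text):
    """Return a dictionary mapping each word in text to its count."""
    counts = {}
    for word in text.split():
        if word in counts:
            counts[word] += 1   # seen before: increment the count
        else:
            counts[word] = 1    # first occurrence: start at 1
    return counts

print(frequency("baa baa black sheep"))  # {'baa': 2, 'black': 1, 'sheep': 1}
```

The membership check is needed because counts[word] += 1 raises a KeyError when the word is not yet a key in the dictionary.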

Write the function text_stats which accepts a string input and returns a tuple of (total words, total unique words). Hint: Call the remove_punctuation and frequency functions above.

In [26]:
def text_stats(text):
    """Given a text, returns a TUPLE of total words and total unique words.
    Remember to call remove_punctuation and frequency above."""
    # call remove_punctuation function
    # call frequency function
    # compute the total words
    # compute total unique words
    # return the tuple

Test your text_stats function on a small text sample.

In [ ]:
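
A possible text_stats is sketched below. The two helper functions are repeated in compact form only so that the example runs on its own; in the notebook you would call the versions you wrote above.

```python
def remove_punctuation(txt):
    """Stand-in for the function defined earlier in the notebook."""
    txt = txt.lower().replace("\n", " ")
    for mark in [".", ":", ",", ";", "'", '"', "!", "?"]:
        txt = txt.replace(mark, "")
    return txt

def frequency(text):
    """Stand-in for the function defined earlier in the notebook."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def text_stats(text):
    """Return a tuple of (total words, total unique words)."""
    counts = frequency(remove_punctuation(text))
    total_words = sum(counts.values())  # every occurrence of every word
    unique_words = len(counts)          # one key per distinct word
    return total_words, unique_words

print(text_stats("Baa, baa, black sheep!"))  # (4, 3)
```

Because the text is lowercased first, "Baa" and "baa" count as the same word.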

Now call text_stats on Shakespeare's text. How many unique words are there in all of his works?

In [ ]:

The collections module already provides a class called Counter, a subclass of dictionary whose contents are equivalent to the dictionary returned by the frequency function implemented above.

Here's how to create a Counter object.

from collections import Counter

text = remove_punctuation(text)
counter = Counter(text.split())

Use == to check that the Counter object is equivalent to the dictionary object returned by the frequency function above.

In [1]:
from collections import Counter

# the frequency function we implemented above already exists
# as the Counter class. Counter has a nice method, most_common,
# that returns the most commonly occurring words in a text. 
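
The equality check can be sketched on a small invented sample. The frequency function is repeated here in compact form only so the example is self-contained.

```python
from collections import Counter

def frequency(text):
    """Stand-in for the function defined earlier in the notebook."""
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

sample = "baa baa black sheep"
counter = Counter(sample.split())

print(counter == frequency(sample))  # True: same keys and same counts
print(counter["wolf"])               # 0: a missing key does not raise KeyError
```

One practical difference from a plain dictionary is visible in the last line: looking up a word that never occurs returns 0.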

Counter has a nice method called most_common(n) which accepts an integer n and returns the n most commonly occurring words together with their counts. Call most_common to see the 20 most frequently occurring words in all of Shakespeare's works.

In [ ]:
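
To show the shape of what most_common returns, here is a short sketch on an invented sentence, asking for the top 2 words instead of the top 20.

```python
from collections import Counter

# Hypothetical stand-in for the normalized Shakespeare text.
counter = Counter("the cat and the dog and the bird".split())

top_two = counter.most_common(2)
print(top_two)  # [('the', 3), ('and', 2)]
```

The result is a list of (word, count) tuples ordered from the highest count to the lowest.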

The above words are high-frequency words, but many of them do not contribute to the content of the text. These are called glue words or stop words. Search engines, for example, may ignore such stop words when matching articles or sites to a search, since they say little about the content of those articles or sites.

Follow the comments below to read in a tab-separated file of stop words and add those words to a list.

In [2]:
with open("stopwords.txt", 'r') as file:
    # create an empty list stop_words
    # loop through each line of file (line is a str of words separated by tabs, '\t')
    #     call split() on line passing in "\t" as separator
    #     loop through each word of line
    #           append word to list, don't forget to call strip() to strip away any newline character
    #           at the end of each line as well as any leading/trailing spaces
    pass  # placeholder so the cell runs; replace it with your code
In [ ]:
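
The steps in the comments above can be sketched as follows. Since the contents of stopwords.txt are not shown here, the example uses io.StringIO with invented tab-separated lines as a stand-in for the opened file; with the real file you would use open("stopwords.txt", "r") instead.

```python
import io

# Hypothetical stand-in for open("stopwords.txt", "r"); the real
# file's contents are an assumption here.
fake_file = io.StringIO("a\table\tabout\n the \tand\tof\n")

stop_words = []
with fake_file as file:
    for line in file:                    # each line still ends with "\n"
        for word in line.split("\t"):    # split the line on tabs
            stop_words.append(word.strip())  # drop "\n" and stray spaces

print(stop_words)  # ['a', 'able', 'about', 'the', 'and', 'of']
```

Without strip(), the last word of each line would keep its trailing newline ("about\n") and would never match a key in the counter.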

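Finally, loop through each of the stop words in the list created above and remove them from the counter. The Counter object (a subclass of dictionary) has a pop(key) method which removes an entry given its key. Note that pop(key) raises a KeyError if the key is not present; pop(key, None) avoids this for stop words that never occur in the text.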

The counter object now contains the most frequently used words in Shakespeare's works NOT including the stop words.

In [ ]:
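
Removing stop words from a counter can be sketched on a small invented example. The word list and the sentence are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical stand-ins for the Shakespeare counter and the stop word list.
counter = Counter("the king and the queen and the king".split())
stop_words = ["the", "and", "of"]  # "of" does not occur in the counter

for word in stop_words:
    # The default of None avoids a KeyError for words that never occur.
    counter.pop(word, None)

print(counter.most_common(30))  # [('king', 2), ('queen', 1)]
```

most_common(30) returns at most 30 entries, so on a small counter like this one it simply returns everything that is left.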

Now print out the list of the top 30 words in all of Shakespeare's works. The most_common method can

In [ ]:
In [ ]: