July 2016

A picture is worth a thousand words

Or how to make word clouds with Python

Full code on


This post shows a simple example of how to build word clouds from your favorite text. It uses the Python library word_cloud, which makes this really easy and even lets the output take any shape you like, as long as you provide a suitable stencil. As you'll see below, in this post a picture is actually worth two thousand words by default, with an option to change that to whatever number you want.

The source text

For the source text I picked "The Possessed", a novel by Russian writer Fyodor Dostoyevsky, one of my favorite authors. I got the full text from Project Gutenberg, a great resource for free books. By the way, I think Project Gutenberg provides a great service to the community and I would very much encourage you to support them. I downloaded the plain text (UTF-8) version of the book using wget as shown below.

			
wget http://www.gutenberg.org/ebooks/8117.txt.utf-8 -O ThePossessed.txt
		

A wrapper function to word_cloud

word_cloud is really easy to use and doesn't leave much for the user to worry about. It even comes with its own set of stop words. If you are not familiar with the concept, "stop words" are simply the most common words in a language. These are typically uninformative words, such as "the" or "and", and they are removed during preprocessing in many Natural Language Processing (NLP) applications. Our word cloud is no exception. We don't want stop words to dominate the output and hide the less common but more informative words that define the input document.
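To make the idea concrete, here is a minimal sketch of stop word removal in plain Python. The tiny stop word set and the sample sentence are made up for illustration; real lists, like the ones used later in this post, are much longer.

```python
# A tiny, hypothetical stop word list, for illustration only.
DEMO_STOP_WORDS = {"the", "and", "a", "of", "to", "was", "in"}


def remove_stop_words(words, stop_words):
    """Keep only the words that are not in stop_words (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]


sentence = "The storm was the talk of the town"
print(remove_stop_words(sentence.split(), DEMO_STOP_WORDS))
# ['storm', 'talk', 'town']
```

Only the three words that carry some meaning survive, and those are the ones we want a word cloud to show.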

I wrote a simple wrapper function called make_word_cloud around the library's main class, WordCloud. make_word_cloud takes the input text, a mask file to define the shape, and a name for the output .png file. Additionally, it allows you to provide your own list of stop words instead of the one provided by word_cloud. You can also add some extra stop words manually, in case you spot some uninteresting words creeping into your cloud. Here's make_word_cloud, together with the required imports.

				  
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS


def make_word_cloud(input_text, mask, output_file_name, stopwords=None, extra_stopwords=None,
                    bckgrd_color="white", max_words=2000):
    """Generate a word cloud and write it to a .png file.

    Keyword arguments:
        input_text -- path to plain text file
        mask -- path to mask image
        output_file_name -- string name of the output .png file
        stopwords -- iterable of stop word strings (default None, which
                     falls back to word_cloud's built-in STOPWORDS)
        extra_stopwords -- list of extra stop word strings (default None)
        bckgrd_color -- background color (default "white")
        max_words -- maximum number of words (default 2000)
    """

    # Read the whole text
    with open(input_text) as f:
        text = f.read()
    # Load the mask image
    mask = np.array(Image.open(mask))
    # Load stop word list, falling back to word_cloud's own list if none is given
    stopwords = set(STOPWORDS if stopwords is None else stopwords)
    # Add extra stop words if provided
    if extra_stopwords is not None:
        stopwords.update(extra_stopwords)
    # Call WordCloud
    wc = WordCloud(background_color=bckgrd_color, max_words=max_words,
                   mask=mask, stopwords=stopwords)

    # Generate word cloud
    wc.generate(text)
    # Write to file
    wc.to_file(output_file_name)
			

Generating a word cloud

This is all we need to make our word cloud. We can call the wrapper function and use the list of stop words provided by word_cloud. In the original Russian the title of the novel means "Demons", so I thought a picture of a devil would be appropriate to define the shape of the word cloud.

				
make_word_cloud(input_text="ThePossessed.txt", mask="devil_stencil.jpg",
                output_file_name="ThePossessed.png", stopwords=STOPWORDS)
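
If you don't have a stencil image at hand, a mask can also be built directly as a NumPy array. As far as I know, word_cloud treats pure white (255) entries of the mask as off limits and draws words everywhere else. The sketch below builds a simple circular mask under that assumption; the function name make_circle_mask and the sizes are my own choices for illustration, not part of the library.

```python
import numpy as np


def make_circle_mask(size=400):
    """Return a size x size uint8 array: 0 inside a centered circle, 255 outside."""
    y, x = np.ogrid[:size, :size]
    center = (size - 1) / 2.0
    radius = size * 0.45
    # True for every pixel that lies outside the circle
    outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2
    # White (255) marks the area where no words should be drawn
    return (255 * outside).astype(np.uint8)


circle_mask = make_circle_mask()
print(circle_mask.shape, circle_mask.dtype)
```

An array like this can be passed straight to WordCloud through its mask argument. Since make_word_cloud expects a path to an image, you would first have to save the array to a file, for example with PIL's Image.fromarray(circle_mask).save("circle_stencil.png"), where the file name is again just an example.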
			

word cloud

Looking at the output image, the result is a nicely balanced word cloud with cool colors that fills the shape we chose. The cloud is rather disappointing when it comes to informative power, though. We can see some character names and a few interesting words, but the most prominent words are still quite generic and uninformative.
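
One way to check this without squinting at the image is to count the words that survive the stop word filter. The helper below is a rough sketch of that idea. The name top_remaining_words and the simple regular expression used to split words are my own assumptions, and word_cloud's internal tokenization may well differ, so treat the counts as indicative only.

```python
import re
from collections import Counter


def top_remaining_words(text, stop_words, n=10):
    """Return the n most frequent words left after removing stop words."""
    stop_words = {w.lower() for w in stop_words}
    # Lowercase the text and split it into runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


# Made-up sample text, just to show the output format
sample = "Well, one said yes and one said no, and then one went away."
print(top_remaining_words(sample, {"and", "then"}, n=2))
# [('one', 3), ('said', 2)]
```

Running something like this on the full text of the book with a given stop word list shows which generic words are still getting through.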

A better stop word collection

The obvious next step is to get a better stop word collection. One of the most popular open source NLP libraries for Python is the Natural Language Toolkit (NLTK), so let's try its stop word collection.

			
from nltk.corpus import stopwords
nltk_sw = stopwords.words('english')

make_word_cloud(input_text="ThePossessed.txt", mask="devil_stencil.jpg",
                output_file_name="ThePossessed_nltk_sw.png", stopwords=nltk_sw)
		

word cloud

No improvement here. The result looks really similar and is still dominated by common, uninformative words. As you can see below, both the word_cloud and the NLTK stop word collections include fewer than 200 words. Time to look for longer lists.

			
print(len(STOPWORDS))
# 183
print(len(nltk_sw))
# 153

A little googling around reveals a more extensive stop word list compiled by the IR Multilingual Resources at UniNE (University of Neuchâtel, Switzerland). Their English stop word list includes 571 words, so it looks like a good bet.

				
wget http://members.unine.ch/jacques.savoy/clef/englishST.txt
			

Since we downloaded this list as a text file, it takes a little more work to get it into Python. This little function will do the trick.

					
def load_words_from_file(path_to_file):
    """Read a text file and return its words as a list."""
    with open(path_to_file, 'r') as f:
        # Split every line on whitespace and collect all the words
        return [word for line in f for word in line.split()]
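
A quick way to convince yourself that the loader behaves as expected is to run it on a small throwaway file. The sketch below repeats the function so that it can run on its own. The file contents are made up and only mimic a list of words, with a blank line and two words on one line thrown in.

```python
import os
import tempfile


def load_words_from_file(path_to_file):
    """Read a text file and return its words as a list."""
    with open(path_to_file, "r") as f:
        return [word for line in f for word in line.split()]


# Write a small, made-up word list to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nabout above\n\nacross\n")
    tmp_path = tmp.name

words = load_words_from_file(tmp_path)
os.remove(tmp_path)  # clean up the temporary file
print(words)
# ['a', 'about', 'above', 'across']
```

Blank lines are skipped and several words on one line are split apart, so the exact layout of the downloaded file should not matter much.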
			

And now we are ready to give this longer list a try. Let's see how well it works.

			
# Load stop words
UniNE_sw = load_words_from_file('englishST.txt')

make_word_cloud(input_text="ThePossessed.txt", mask="devil_stencil.jpg",
                output_file_name="ThePossessed_UniNE.png", stopwords=UniNE_sw)
			

word cloud

Much better. We got rid of most of the common, uninformative words and managed to get a pretty interesting word cloud. And thanks to the word_cloud Python library this was really easy. It is a nice go-to resource for your word cloud needs.

Happy "word clouding"!