This post shows a simple example of how to build word clouds from your
favorite text. It uses the Python library
For the source text I picked "The Possessed", a novel by Russian writer Fyodor
Dostoyevsky, one of my favorite auhors. I got the full text from
Project Gutenberg, a
great resource for free books. By the way, I think Project Gutenberg provides a
great service to the community and I would very much encourage you to support them.
I downloaded the plain text (UTF-8) version of the book using
wget http://www.gutenberg.org/ebooks/8117.txt.utf-8 -O ThePossessed.txt
I wrote a simple wrapper function called
from PIL import Image import numpy as np import matplotlib.pyplot as plt from wordcloud import WordCloud, STOPWORDS def make_word_cloud(input_text, mask, output_file_name, stopwords=None, extra_stopwords=None, bckgrd_color = "white", max_words=2000): """Generate a word cloud and write it to .png file. Keyword arguments: input_text -- path to plain text file mask -- path to mask image output_file_name -- string name to output .png file stopwords -- list of stop word strings (default None) extra_stopwords -- list of extra stop word strings (default None) bckgrd_color -- background color (default "white") max_words -- maximum number of words (default 2000) """ # Read the whole text text = open(input_text).read() # Load the mask image mask = np.array(Image.open(mask)) # Load stop word list stopwords = set(stopwords) # Add extra stop words if provided if extra_stopwords is not None: [stopwords.add(word) for word in extra_stopwords] # Call WordCloud wc = WordCloud(background_color=bckgrd_color, max_words=max_words, mask=mask, stopwords=stopwords) # Generate word cloud wc.generate(text) # Write to file wc.to_file(output_file_name)
This is all we need to make our word cloud. We can call the wrapper function
and use the list of stop words provided by
make_word_cloud(input_text="ThePossessed.txt", mask="devil_stencil.jpg", output_file_name="ThePossessed.png", stopwords=STOPWORDS)
Looking at the output image, the result is a nicely balanced word cloud with cool colors and making up the shape we chose. The cloud is rather disapointing when it comes to the informative power, though. We can see some character names and a few interesting words, but the most important words are still quite generic and uninformative.
The obvious next step is to get a better stop word collection. One of the most popular NLP open source Python libraries is the Natural Language Toolkit (NLTK). So let's try their stop word collection.
from nltk.corpus import stopwords nltk_sw = stopwords.words('english') make_word_cloud(input_text="ThePossessed.txt", mask="devil_stencil.jpg", output_file_name="ThePossessed_nltk_sw.png", stopwords=nltk_sw)
No improvement here. The result looks really similar and it is still dominated by
common, uninformative words. As you can see here below, both the
print len(STOPWORDS) # 183 print len(nltk_sw) # 153
A little googling around reveals a more extensive stop word list compiled by the IR Multilingual Resources at UniNE (University of Neuchâtel, Switzerland). Their English stop word list includes 571 words, so it looks like a good bet.
Since we downloaded this list as a text file it will require a little more work to get into Python. This little function will do the trick.
def load_words_from_file(path_to_file): """Read text file return list of words.""" sw_list =  with open(path_to_file, 'r') as f: [sw_list.append(word) for line in f for word in line.split()] return sw_list
And now we are ready to give this longer list a try. Let's see how well it works.
# Load stop words UniNE_sw = load_words_from_file('englishST.txt') make_word_cloud(input_text="ThePossessed.txt", mask="devil_stencil.jpg", output_file_name="ThePossessed_UniNE.png", stopwords=UniNE_sw)
Much better. We got rid of most common and uninformative words and managed
to get a pretty interesting word cloud. And thanks to the
Happy "word clouding"!