Text Visualization
A word cloud is one of the most popular techniques for visualizing text data. Wordle is a website that provides an online service for generating word clouds. In Python, we can use the wordcloud module created by Andreas Mueller to generate beautiful word clouds.
Here we demonstrate how to make a word cloud of the Constitution of the United States.
To use the wordcloud module, we first need to install it by running the following command from the (Mac) terminal:
sudo pip install git+git://github.com/amueller/word_cloud.git
Data Collection
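# Python 2 script: urllib2 is the Python 2 HTTP library.
import re, urllib2, wordcloud
url='http://www.concourt.am/armenian/legal_resources/world_constitutions/constit/usa/usa----e.htm'
# Download the page and strip anything that looks like an HTML tag.
t = re.sub('<[^<]+?>', '', urllib2.urlopen(url).read())
# Lowercase the text and keep only runs of the letters a-z.
def words(text): return re.findall('[a-z]+', text.lower())
tokens = words(t)
# Rejoin the tokens into one cleaned string for the word cloud.
text = ' '.join(tokens)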
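The cleaning steps above (strip tags, lowercase, keep only letters) can be tried without any network access. The following is a minimal Python 3 sketch, not the original script: the helper names `strip_tags` and `tokenize` and the inline `sample_html` string are assumptions made for illustration. One deliberate variation is that tags are replaced with a space rather than an empty string, so that words separated only by a tag do not fuse into a single token.

```python
import re

def strip_tags(html):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r'<[^<]+?>', ' ', html)

def tokenize(text):
    """Lowercase the text and return runs of the letters a-z."""
    return re.findall(r'[a-z]+', text.lower())

# Hypothetical stand-in for the downloaded page.
sample_html = ('<html><body><h1>We the People</h1>'
               '<p>of the United States, in Order to form '
               'a more perfect Union</p></body></html>')

tokens = tokenize(strip_tags(sample_html))
clean_text = ' '.join(tokens)
print(clean_text)
```

A regular expression is a rough way to remove markup; for pages with scripts, comments, or entities, a real HTML parser is usually the safer choice.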
Create Word Cloud
width, height = 800, 500
font_path = '/Users/csid/Desktop/DroidSansMono.ttf'
# Separate the text into a list of (word, frequency) pairs.
word_freqs = wordcloud.process_text(text)
# Compute the position of the words.
elements = wordcloud.fit_words(word_freqs, font_path=font_path, width=width, height=height,
                               margin=5, ranks_only=False, prefer_horiz=0.90)
# Draw the positioned words to a PNG file.
wordcloud.draw(elements, file_name='/Users/csid/Desktop/constitution.png',
               font_path=font_path, width=width, height=height, scale=1,
               bg_color=(0, 0, 0))
The code above produces the following figure:
It turns out that the words "state" and "shall" are used very frequently in the Constitution once the stopwords are removed. This is consistent with our expectation. After all, a constitution defines the rules that the country (and its states) should follow.
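To check this kind of observation numerically, the word frequencies can be counted directly after dropping stopwords. The sketch below uses only the Python 3 standard library. The function name `word_frequencies`, the tiny `STOPWORDS` set, and the one-sentence `sample` excerpt are assumptions chosen for illustration, and this is not necessarily how the wordcloud module counts words internally.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {'the', 'of', 'be', 'from', 'by', 'for', 'and', 'have'}

def word_frequencies(text, stopwords):
    """Count lowercase a-z tokens that are not in the stopword set."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Short excerpt used only to exercise the function.
sample = ("The Senate of the United States shall be composed of two Senators "
          "from each State, chosen by the Legislature thereof, for six Years; "
          "and each Senator shall have one Vote.")

freqs = word_frequencies(sample, STOPWORDS)
for word, count in freqs.most_common(5):
    print(word, count)
```

Running the same function on the full cleaned text would give the counts that determine which words appear largest in the cloud.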