Clustering Countries by Constitutions

Countries differ in many ways: culture, geographical location and climate, and available natural resources, to name just a few. Among these differences, social scientists are generally most interested in institutional ones.

Qualitative comparative studies of countries are more common than quantitative ones, because collecting data across countries is challenging. In this chapter, however, I will demonstrate how we can quantitatively compare the political systems of more than a hundred countries at zero cost.

Collect constitution text across countries

Again, we import several modules to facilitate our analysis. A new module we want to explore in this chapter is AffinityPropagation, a clustering algorithm based on message passing proposed by Frey and Dueck in 2007. You can read their paper for technical details. Compared with other popular clustering algorithms such as K-means, an outstanding advantage of Affinity Propagation is that you do not need to specify the number of clusters in advance; the algorithm infers it from the data, guided by a preference value.

import re, sys, urllib2
import math
import numpy as np
import pylab as plt
from bs4 import BeautifulSoup
from collections import Counter
from sklearn.cluster import AffinityPropagation

You can download the English versions of the constitutions of 118 countries from this site. To do this, I first retrieve the URLs of these constitutions:

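# s holds the base URL of the site and must be defined beforehand;
# words() is the tokenizer defined further below.
urls = {}
ad = s +'consts2l.htm'
soup = BeautifulSoup(urllib2.urlopen(ad).read())
for a in soup.body.find_all('a', href=True):
    u = a['href']
    if u[-6:]=='-e.htm':  # keep only links to English versions
        if words(a.string)[0] == 'english' or words(a.string)[0] == 'in':
            # generic link text: take the country name from the URL path
            urls[str(u.split('/')[0].title())]= s+u
        else:
            # otherwise use the first word of the link text
            urls[str(words(a.string)[0].title())]= s+u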

The code above gives you a dictionary of URLs for the constitution texts.

Using these web addresses, we can fetch the text of the constitutions and package it into a dictionary with country names as keys and text data as values:

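def words(text): return re.findall('[a-z]+', text.lower())

W = {}  # country name -> word-frequency table
for country in urls:
    n = urls.keys().index(country)
    flushPrint(n)  # show progress
    if n==92:
        continue  # assumption: skip this entry
    url = urls[country]
    h = urllib2.urlopen(url).read()
    d = re.sub('<[^<]+?>', '', h)  # strip HTML tags
    w = words(d)
    W[country] = Counter(w)  # word -> number of occurrences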

See this chapter for the flushPrint function, which is used to track the progress of the code. The download speed depends on your Internet connection; in my case it took about ten minutes to download all the text data.

Note that here we use the Counter class in the collections module to store word-frequency tables instead of raw text. This allows us to treat a constitution as a vector of weights (frequencies) and paves the way for the clustering task that follows.
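The path from a downloaded page to a word-frequency table can be illustrated without any network access. The following is a minimal sketch on a made-up HTML fragment (the string `html` and its contents are assumptions for illustration only). It replaces tags with a space rather than an empty string, a small deviation from the code above that keeps words on either side of a tag from being glued together.

```python
import re
from collections import Counter

def words(text):
    """Lowercase the text and return its runs of ASCII letters."""
    return re.findall('[a-z]+', text.lower())

# A made-up HTML fragment standing in for a downloaded constitution page.
html = "<html><body><h1>Article 1</h1><p>The State is a republic. The people rule.</p></body></html>"

plain = re.sub('<[^<]+?>', ' ', html)   # replace every tag with a space
freq = Counter(words(plain))            # word -> number of occurrences

print(freq.most_common(3))
```

A Counter returns 0 for a word it has never seen, so `freq['monarchy']` is valid even though the word does not occur in the fragment.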

Calculate text similarity

Now we have a dictionary W in which keys are the names of countries and values are word-frequency tables. We can calculate the cosine similarity between each pair of constitutions and construct a similarity matrix.

Before calculating the cosine similarity, we need to turn our word-frequency tables into vectors such that all constitutions are represented by vectors of the same length.

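# collect the vocabulary used across all constitutions
all_items = set([])
for i in W:
    for j in W[i].keys():
        if j not in all_items:
            all_items.add(j)
# one vector per country; a Counter returns 0 for words it does not contain
V = {}
for i in W:
    V[i]=[W[i][j] for j in all_items]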

V is a dictionary with country names as keys and vectors as values. Every vector has length 19,985, the size of the combined vocabulary. In other words, each country (constitution) can be understood as a point in the same high-dimensional (19,985-dimensional) feature space. Our next task is to measure how close these points are to each other. Among the various measures available, we choose cosine similarity.
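The alignment step can be illustrated on a toy input. This is a minimal sketch with two made-up word-frequency tables (the names `W_toy`, `V_toy`, and `vocab`, and the hypothetical countries 'A' and 'B', are not part of the original code). It sorts the vocabulary so that the order of the vector components is fixed and easy to inspect.

```python
from collections import Counter

# Two made-up word-frequency tables keyed by hypothetical country names.
W_toy = {
    'A': Counter({'state': 3, 'people': 1}),
    'B': Counter({'state': 1, 'king': 2}),
}

# Shared vocabulary in a fixed (sorted) order so every vector uses the same axes.
vocab = sorted(set(word for table in W_toy.values() for word in table))

# Counter returns 0 for absent words, so missing entries become zeros.
V_toy = {name: [table[word] for word in vocab] for name, table in W_toy.items()}

print(vocab)
print(V_toy)
```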

def cosine(v1, v2):
    dp = sum(n1 * n2 for n1, n2 in zip(v1, v2) )
    m1 = math.sqrt(sum(n ** 2 for n in v1))
    m2 = math.sqrt(sum(n ** 2 for n in v2))
    return dp / (m1 * m2)

countries = [i for i in W.keys() if i!='Litva' and i!='Uk'] # remove outlier data
l = len(countries)
s = np.zeros(shape=(l,l))
for i in range(l):
    for j in range(l):
        s[i,j] = cosine(V[countries[i]],V[countries[j]])

The code above gives a similarity matrix s (dimension = 115 x 115) like the left panel of the following figure:

The values in the matrix range from 0.7 to 1. The higher the value, the more similar the corresponding pair of countries. Diagonal elements are always 1, as they represent the cosine similarity between each country and itself.
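The pairwise computation can also be written as a single matrix product. The following is a minimal sketch on a made-up 3 x 4 count matrix (the array `X` and the helper name `cosine_matrix` are illustrative assumptions). It scales each row to unit length, so that the product of the matrix with its transpose holds all pairwise cosine similarities.

```python
import numpy as np

def cosine_matrix(X):
    """Return the matrix of pairwise cosine similarities between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # length of each row vector
    unit = X / norms                                  # rows scaled to length 1
    return unit.dot(unit.T)                           # entry (i, j) = cosine of rows i and j

# Toy word counts: three "documents" over a four-word vocabulary (made-up numbers).
X = [[2, 1, 0, 0],
     [4, 2, 0, 0],   # same direction as row 0
     [0, 0, 3, 1]]   # shares no words with rows 0 and 1
S = cosine_matrix(X)
print(S)
```

Rows that point in the same direction get a similarity of 1 regardless of their length, which is why a long and a short constitution with the same word proportions count as identical under this measure.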

The most similar pair of countries is Serbia and Yugoslavia, and the most different pair is Azer and Oceania. These pairs can be identified with the following code:

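from numpy import unravel_index
# most different pair: position of the smallest entry
x,y = unravel_index(s.argmin(), s.shape)

# most similar pair: zero the diagonal first so that self-similarity (1) is ignored
np.fill_diagonal(s, 0)
x,y = unravel_index(s.argmax(), s.shape)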

Remember to restore (recalculate) the matrix s after running the code above, because np.fill_diagonal sets the diagonal elements to 0 in place before the maximum is located.
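One way to avoid rebuilding the matrix is to search masked copies instead of changing it in place. The following is a minimal sketch on a made-up 3 x 3 matrix (the array `S` and the variable names are assumptions for illustration); the original matrix is left untouched.

```python
import numpy as np

# Made-up symmetric similarity matrix with ones on the diagonal.
S = np.array([[1.00, 0.90, 0.75],
              [0.90, 1.00, 0.80],
              [0.75, 0.80, 1.00]])

off_diag = ~np.eye(len(S), dtype=bool)   # True everywhere except on the diagonal

# Most different pair: smallest off-diagonal entry.
lo = np.where(off_diag, S, np.inf)
i_min, j_min = np.unravel_index(lo.argmin(), S.shape)

# Most similar pair: largest off-diagonal entry.
hi = np.where(off_diag, S, -np.inf)
i_max, j_max = np.unravel_index(hi.argmax(), S.shape)

print((i_min, j_min), (i_max, j_max))
```

Because np.where builds new arrays, the diagonal of `S` still holds ones afterwards.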

Affinity propagation algorithm

Given an N x N similarity matrix, there are many ways to cluster the N entities. Here we use the Affinity Propagation algorithm.

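af = AffinityPropagation(preference=0.89, affinity="precomputed")
labels = af.fit(s).labels_  # one cluster label per country
# sort the (country, label) pairs by label
dd = sorted(zip(countries,labels),key=lambda x:x[1])
datatable=[['Country','Constitution Category']]
for i in dd:
    datatable.append([i[0], i[1]])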

The code above creates a list called datatable, in which countries are matched with their cluster labels. There are 8 clusters in total.
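To see how the estimator behaves on a precomputed similarity matrix, it can help to run it on a small input first. The following is a minimal sketch with a made-up 6 x 6 matrix in which items 0 to 2 resemble each other and items 3 to 5 resemble each other (the matrix, the item names, and the preference value 0.5 are all assumptions chosen for illustration, not values from this chapter).

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Made-up similarity matrix: items 0-2 resemble each other, as do items 3-5.
names = ['a', 'b', 'c', 'd', 'e', 'f']
S = np.full((6, 6), 0.2)
S[:3, :3] = 0.9
S[3:, 3:] = 0.9
np.fill_diagonal(S, 1.0)

# With affinity="precomputed" the estimator treats S as similarities, not features.
# `preference` (0.5 here, an arbitrary choice) is placed on the diagonal;
# larger values tend to yield more clusters.
af = AffinityPropagation(preference=0.5, affinity="precomputed")
labels = af.fit(S).labels_   # one integer label per row of S

for name, label in sorted(zip(names, labels), key=lambda x: x[1]):
    print(name, label)
```

Rerunning the sketch with different preference values is a quick way to get a feel for how this parameter controls the number of clusters.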

You can also use the following code to explore how the clustering algorithm detects order in the seemingly random similarity matrix:

SortedCountries=zip(*sorted(datatable[1:],key=lambda x:x[1]))[0]
permutation1=[countries.index(i) for i in SortedCountries]
rank1 = np.argsort(permutation1)
rank2 = np.argsort(permutation2)
tt1=np.where(t1>similarityThreshold, 1, 0)
tt2=np.where(t2>similarityThreshold, 1, 0)
fig = plt.figure(figsize=(10, 5),facecolor='white')

ax = fig.add_subplot(121)
plt.title('before clustering')
ax = fig.add_subplot(122)
plt.title('after clustering')

The code above generates the two panels of the similarity matrix. In the right panel, the rows and columns are sorted according to their group labels. To make the figure more informative, we use a threshold to convert the similarity matrix into an adjacency matrix containing only 0 and 1 values.
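The reordering and thresholding can be tried on a small example. The following is a minimal sketch on a made-up 4 x 4 matrix with made-up labels (the arrays, the names `order`, `A_before`, and `A_after`, and the threshold of 0.8 are assumptions for illustration). Plotting is left out; either adjacency matrix could be passed to plt.imshow.

```python
import numpy as np

# Made-up similarity matrix for four items and made-up cluster labels.
S = np.array([[1.0, 0.7, 0.9, 0.7],
              [0.7, 1.0, 0.7, 0.9],
              [0.9, 0.7, 1.0, 0.7],
              [0.7, 0.9, 0.7, 1.0]])
labels = np.array([0, 1, 0, 1])

order = np.argsort(labels, kind='stable')     # indices that group items by label
S_sorted = S[np.ix_(order, order)]            # reorder rows and columns together

threshold = 0.8                               # hypothetical cut-off
A_before = (S > threshold).astype(int)        # adjacency matrix in the original order
A_after = (S_sorted > threshold).astype(int)  # adjacency matrix grouped by cluster

print(A_before)
print(A_after)
```

After sorting, members of the same cluster sit next to each other, so the ones gather in square blocks along the diagonal.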

Visualize the results

GeoChart is an API provided by Google for displaying maps of countries and regions. We can write Python scripts that create HTML files connecting to GeoChart, as follows.

    <script type='text/javascript' src=''></script>
    <script type='text/javascript'>
     google.load('visualization', '1', {'packages': ['geochart']});

      function drawRegionsMap() {
        var data = google.visualization.arrayToDataTable(

        var options = {};
        options['dataMode'] = 'regions';
        options['colorAxis'] = { minValue : 0, maxValue : 8, colors :
        options['backgroundColor'] = '#FFFFFF';
        options['datalessRegionColor'] = '#E5E5E5';
        var chart = new google.visualization.GeoChart(document.getElementById('chart_div'));
        chart.draw(data, options);
    <div id="chart_div" style="width: 900px; height: 500px;"></div>
open( '/Users/csid/Desktop/countries.html', "w" ).write(htmltext)
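The way a Python list ends up inside the JavaScript call can be sketched as follows. This is a minimal illustration, not the template used above: the helper name `build_geochart_html` and the two sample rows are assumptions, and the page uses the current Google Charts loader (`loader.js` with `google.charts.load`) rather than the older `google.load` call. Depending on the data, Google may additionally require an API key, which is not covered here.

```python
import json

def build_geochart_html(datatable):
    """Return an HTML page that draws `datatable` as a GeoChart region map."""
    header, body = datatable[0], datatable[1:]
    # int() guards against NumPy integer labels, which json cannot serialize.
    rows = [list(header)] + [[str(name), int(label)] for name, label in body]
    data_js = json.dumps(rows)  # nested list -> JavaScript array literal
    return """<html>
<head>
<script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
<script type="text/javascript">
  google.charts.load('current', {'packages': ['geochart']});
  google.charts.setOnLoadCallback(drawRegionsMap);
  function drawRegionsMap() {
    var data = google.visualization.arrayToDataTable(%s);
    var options = {backgroundColor: '#FFFFFF', datalessRegionColor: '#E5E5E5'};
    var chart = new google.visualization.GeoChart(document.getElementById('chart_div'));
    chart.draw(data, options);
  }
</script>
</head>
<body><div id="chart_div" style="width: 900px; height: 500px;"></div></body>
</html>""" % data_js

# Two made-up rows; in the chapter this would be the datatable built earlier.
html = build_geochart_html([['Country', 'Constitution Category'],
                            ['China', 1], ['India', 2]])
```

The returned string can then be written to a file and opened in a browser.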

The following screenshot shows the rendered interactive map (when you move your mouse over a country, its name and cluster label are shown). Countries with similar constitution texts are displayed in the same color.

Some questions worth exploring immediately arise from a closer look at this figure:

  • Do countries close to each other tend to share similar constitutions? We can pick out several examples from Eastern Asia (China, Laos, and Vietnam), South Asia (India and Pakistan), and Eastern Europe;

  • Afghanistan and Iraq resemble each other, whereas Iran and Turkey resemble each other;

  • Ukraine and Chechnya are clustered into the same group. You cannot find Chechnya on the map, however, because it has already become a part of Russia. This is a surprising result that relates to the current (2014) news about Crimea, Ukraine. Can text analysis of constitutions help us understand geopolitics?
