Processing the Log Files of Stack Exchange
Please cite the following paper if you feel that your research benefits from the code and ideas in this chapter:
The Stack Exchange database dump
Stack Exchange is a network of question and answer (Q&A) sites. As of 2016, the network comprises more than 160 sites covering diverse topics. The Stack Exchange database dump is publicly accessible and is updated every couple of months. You can download the latest data dump from here; it contains the questions and answers posted by millions of users.
In this chapter I analyze the data dump from January 2014, which contains the log files of 110 sites. The smallest site, italian.stackexchange.com, was launched in November 2013 and had 374 users, 194 questions, and 387 answers. The largest site, stackoverflow.com, was launched in July 2008 and had 2,728,224 users, 6,474,687 questions, and 11,540,788 answers as of the date of data collection.
Cleaning data
The total size of the downloaded and unzipped data set is 46 GB. The files look like this on my MacBook:
There are 110 folders, each corresponding to a Q&A site. Each folder holds several files in XML or text format. We are mainly interested in the Posts.xml file, which contains all the asking and answering records of a site.
So now we want to iterate over the 110 folders and pick out the Posts.xml file from each one. This can be done by calling the listdir function from the os module to get all the folder names. Its use is very simple, as shown below:
import pylab as plt
import numpy as np
from os import listdir
import statsmodels.api as sm
from datetime import datetime
import sys, collections

path = '.../stackexchange/unzip/'  # root folder of the unzipped data dump
sites = [f for f in listdir(path) if f[-1] == 'm']  # keep only the site folders
Here we observe that all folder names end with "m" (the last letter of "com"), so we use the if condition to filter out unwanted hidden files.
If you are also using a Jupyter notebook, you may want a function to trace progress when analyzing big data sets, on which the code may run longer than it takes to brew (or even finish) a cup of tea. The following code prints the value of a preset counter and keeps overwriting the printed characters, i.e., you will see a rolling number instead of an ever-growing list of numbers unfolding down your Jupyter notebook interface.
def flushPrint(variable):
    sys.stdout.write('\r')             # return the cursor to the start of the line
    sys.stdout.write('%s' % variable)  # overwrite with the current value
    sys.stdout.flush()                 # force immediate display
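As a quick usage sketch (the loop below is a stand-in for any long-running task), you will see a single rolling counter instead of a hundred printed lines:

# print a rolling counter once every 1,000 iterations
for n in range(100000):
    if n % 1000 == 0:
        flushPrint(n)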
To clean the data, we define a function that parses each record in a given Posts.xml file, determines whether it is a question (label = 1) or an answer (label = 2), and gets the time stamp of the record. We create a defaultdict object, a dictionary that supplies a default value for missing keys. We use dates as keys and two-element lists as values: the first element counts the questions posted on that date and the second counts the answers.
We repeat this for each of the 110 sites (folders) and then collect the resulting dictionaries into a bigger dictionary with site names as keys.
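As a minimal illustration of this defaultdict pattern (the date below is made up):

d = collections.defaultdict(lambda: [0, 0])  # missing keys default to [0, 0]
d['2014-01-01'][0] += 1   # one new question on this date
d['2014-01-01'][1] += 2   # two new answers on this date
print(dict(d))            # {'2014-01-01': [1, 2]}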
Now let's roll.
def dailyQA(site):
    # date -> [number of questions, number of answers] posted on that date
    F = collections.defaultdict(lambda: [0, 0])
    filename = path + site + '/Posts.xml'
    with open(filename, 'r') as f:
        for line in f:
            try:
                label = line.split('PostTypeId=')[1][1:2]    # '1' or '2'
                day = line.split('CreationDate=')[1][1:11]   # 'YYYY-MM-DD'
                if label == '1':
                    F[day][0] += 1   # a question
                if label == '2':
                    F[day][1] += 1   # an answer
            except IndexError:
                pass   # skip lines without these attributes
    return F
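To see why this string slicing works, consider a simplified, hypothetical record from Posts.xml (real records carry many more attributes):

line = '<row Id="4" PostTypeId="1" CreationDate="2010-07-15T19:07:26.410" />'
line.split('PostTypeId=')[1][1:2]     # -> '1', the character after the quote
line.split('CreationDate=')[1][1:11]  # -> '2010-07-15', the date part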
F = {}
for n, site in enumerate(sites):
    flushPrint(n)            # show progress while parsing the 110 sites
    F[site] = dailyQA(site)
My laptop (a dual-core MacBook Pro purchased in 2014, with 16 GB of memory) takes about two minutes to run the above code: enough time to brew a cup of tea.
We can take a look at the nested dictionary (named "F") we've got. As mentioned, each key is a site name and each value is a dictionary that maps dates to the numbers of newly added questions and answers.
F['photo.stackexchange.com'].items()[:10]
[('2011-07-12', [9, 26]),
('2013-07-27', [6, 9]),
('2011-07-13', [5, 17]),
('2013-10-08', [7, 14]),
('2011-04-30', [5, 17]),
('2011-07-10', [7, 16]),
('2012-12-25', [7, 17]),
('2013-04-19', [6, 18]),
('2013-02-20', [13, 28]),
('2011-07-11', [13, 49])]
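Note that the keys come out in arbitrary dictionary order; to inspect the earliest days chronologically, sort the items first:

sorted(F['photo.stackexchange.com'].items())[:3]  # the three earliest days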
Visualize the temporal evolution of sites
Visualization not only adds a lot of fun to the data analysis but also brings inspiration for research. By analyzing the increase of the number of questions and answers, we observe that

$\Delta Q \sim Q^{\eta},$

in which $Q$ is the cumulative number of questions, $\Delta Q$ is the daily increase of questions, and $\eta$ indicates the growing speed. If $\eta$ is positive, questions are going through an accelerating increase; otherwise, the growth trend is slowing down. (The number of answers is always proportional to the number of questions, so we only need to estimate $\eta$ for questions.)
It would be interesting to estimate the value of $\eta$ for each site first and then plot the growth trends in green and red to indicate accelerating and slowing-down growth, respectively.
The following code uses OLS regression (see this chapter for the code) in the log-log plot to estimate the value of $\eta$.
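If you do not have OLSRegressFit at hand, here is a minimal sketch, assuming it wraps sm.OLS from the statsmodels module imported above and returns the intercept, the slope, and $R^2$:

def OLSRegressFit(x, y):
    X = sm.add_constant(x)        # design matrix with an intercept column
    res = sm.OLS(y, X).fit()      # ordinary least squares fit
    constant, slope = res.params  # intercept and slope
    return constant, slope, res.rsquared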
E = {}
for i in sites:
    f = F[i]
    days = sorted(f.keys())
    q, a = np.array([f[day] for day in days]).T  # daily questions and answers
    x = np.log(np.cumsum(q) + 1)   # log cumulative number of questions
    y = np.log(q + 1)              # log daily increase of questions
    constant, eta, r2 = OLSRegressFit(x, y)
    E[i] = eta
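As a quick sanity check (not part of the original analysis), we can count how many sites fall into each regime:

accelerating = sum(1 for i in E if E[i] > 0)
print(accelerating, len(E) - accelerating)  # accelerating vs. slowing-down sites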
And now we can plot the evolution of monthly added questions (the lower bound of the bands) and answers (the upper bound of the bands). We create two figures: in one we show the sites with a positive $\eta$, and in the other the sites with a negative $\eta$.
def plotMonth(site, ax, col):
    # month -> [questions, answers] added in that month
    M = collections.defaultdict(lambda: np.array([0, 0]))
    f = F[site]
    for i in f:
        M[i[:7]] += np.array(f[i])   # aggregate days into 'YYYY-MM'
    ms = sorted(M.keys())[1:-1]      # drop the incomplete first and last months
    if len(ms) > 3:
        x, y = np.array([M[i] for i in ms]).T
        mm = [datetime.strptime(j, '%Y-%m') for j in ms]
        ax.vlines(mm[0], x[0], y[0], color=col, linestyle='-')
        ax.fill_between(mm, x, y, color=col, alpha=0.1)
        ax.plot(mm, x, color='white', linestyle='-', marker='', alpha=0.1)
        ax.plot(mm, y, color='white', linestyle='-', marker='', alpha=0.1)
from matplotlib.dates import YearLocator
years = YearLocator()
fig = plt.figure(figsize=(12, 5))
ax1 = fig.add_subplot(1,2,1)
plt.text(datetime.strptime('2013-03','%Y-%m'),10**5,'A',size=14)
ax2 = fig.add_subplot(1,2,2)
plt.text(datetime.strptime('2013-03','%Y-%m'),10**5,'B',size=14)
for i in sites:
    if E[i] > 0:
        plotMonth(i, ax1, 'green')     # accelerating growth
    else:
        plotMonth(i, ax2, '#E0542E')   # slowing-down growth
ax1.set_title(r'$\eta>0$')
ax1.set_yscale('log')
ax1.set_ylim(1,10**6)
ax1.set_xlabel(r'$Time$', size=14)
ax1.set_ylabel(r'$Monthly\,Q\,&\,A$', size=14)
ax1.xaxis.set_major_locator(years)
ax2.set_title(r'$\eta<0$')
ax2.set_yscale('log')
ax2.set_ylim(1,10**6)
ax2.set_xlabel(r'$Time$', size=14)
ax2.set_ylabel(r'$Monthly\,Q\,&\,A$', size=14)
ax2.xaxis.set_major_locator(years)
plt.tight_layout()
plt.show()
#plt.savefig('/Users/csid/Desktop/growth.png')
# export the site-level summary (for plotdevice)
f = open('/Users/csid/Desktop/SEsites.txt', 'w')
for i in F:
    age = len(F[i])                          # number of active days
    nq, na = sum(np.array(F[i].values()))    # total questions and answers
    f.write(i + '\t' + str(age) + '\t' + str(nq) + '\t' + str(na) + '\n')
f.close()
The above code gives the following figure, which clearly shows the difference between sustainable and unsustainable growth of Q&A sites. This visualization paves the way for analyzing the underlying mechanisms behind the observed differences.