Detecting Influential Papers in Citation Networks

PageRank algorithm

PageRank is an algorithm used by Google to rank Web pages according to their "authority". It simulates an invisible "flow" of authority through the directed hyperlink network and calculates the converged authority score of each node, i.e., its PageRank value.
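
To build intuition, here is a minimal power-iteration sketch of PageRank on a toy three-node graph. The graph and the damping factor of 0.85 are illustrative assumptions, not part of the APS analysis below:

# A hypothetical directed graph: each node maps to the nodes it links to
links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
d = 0.85                                       # damping factor, the common convention
N = len(links)
pr = dict((n, 1.0 / N) for n in links)         # start from a uniform distribution
for _ in range(50):                            # iterate until the scores stabilize
    new = dict((n, (1 - d) / N) for n in links)
    for src in links:
        share = d * pr[src] / len(links[src])  # authority flows along out-links
        for tgt in links[src]:
            new[tgt] += share
    pr = new
print(pr)                                      # larger value = more "authoritative" node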

Considering the similarity between citation networks and hyperlink networks, it is natural to apply PageRank to evaluate the influence of papers. In fact, as noted in the original paper, PageRank itself was inspired by citation analysis.

This algorithm can be applied to any unweighted network to rank the influence of nodes. In this chapter I will show how to calculate the PageRank values of papers in a citation network, but the same logic can also be used to mine opinion leaders in social networks or elsewhere.

The APS citation data

Several academic institutions provide public access to the citation data of their journals, such as the APS and arXiv datasets in physics and the AMiner datasets in computer science. Here we use the APS citation network as an example. You can obtain a copy by emailing [email protected] to request the data and explaining what you will use it for.

The APS data set contains two parts. The first is a collection of JSON files, one per paper and organized by journal, containing the metadata of papers, including DOI, author names, publication year, number of pages, etc. The other is a CSV file containing all citation links from paper A to paper B (B is cited by A). The data set under investigation includes about 530 K papers and 6 M citation links (the exact counts are given when we construct the network below).
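
For orientation, a metadata record looks roughly like the following. The values here are made up, but the field names ('id', 'date', 'numPages', 'authors') match those used by the parsing code below, and each line of the citation CSV is simply a pair of DOIs:

# A hypothetical metadata record (values invented for illustration)
{"id": "10.1103/PhysRev.000.0000",
 "date": "1965-11-15",
 "numPages": 6,
 "authors": [{"name": "A. Author"}, {"name": "B. Author"}]}

# A hypothetical line of the citation CSV: the first paper cites the second
# 10.1103/PhysRevA.0.0000,10.1103/PhysRev.000.0000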

Descriptive analysis

Again, we need to import a collection of modules to facilitate our analysis.

import re
import sys
from collections import defaultdict
import numpy as np
import pylab as plt
import matplotlib.cm as cm
import statsmodels.api as sm
from os import listdir
import json
import networkx as nx
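
The scripts below call a small helper named flushPrint to report progress on a single line. It is not provided by any of the modules above, so here is a minimal definition, assuming it should simply overwrite the current console line:

def flushPrint(s):
    # overwrite the current console line with the latest progress message
    sys.stdout.write('\r%s' % s)
    sys.stdout.flush()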

To obtain descriptive statistics, we want to calculate the reference length (number of cited papers) and the impact (number of citations received) of each paper, and integrate this information into the metadata of papers.

The following code counts, for each paper, how many papers it cites and how many papers cite it:

# citation counts
C = defaultdict(lambda: [0, 0])  # paper doi : [cites n papers, is cited by n papers]
citationCounter = 0
with open('/Users/lingfeiw/Documents/bigdata/aps/aps-dataset-citations-2013/aps-dataset-citations-2013.csv','r') as f:
    for line in f:
        citationCounter += 1
        if citationCounter % 100000 == 0:
            flushPrint(citationCounter // 100000)  # progress report every 100 K links
        x, y = line.strip().split(',')  # x cites y
        C[x][0] += 1  # x's reference list grows by one
        C[y][1] += 1  # y receives one more citation
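
A quick sanity check: the number of distinct DOIs in C is the number of papers appearing in citation records, and citationCounter should equal the number of lines in the CSV:

print(len(C), citationCounter)  # distinct papers seen in citation records, total links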

Now we can integrate this information into the metadata of papers:

P = {}  # journal name : {paper doi : [year, nAuthor, length, ncite, ncited]}
paperCounter = 0
path='/Users/lingfeiw/Documents/bigdata/aps/aps-dataset-metadata-2013/'
for journal in listdir(path):
    if journal == '.DS_Store':  # skip macOS folder metadata
        continue
    ad2 = path + journal + '/'
    J = {}
    for folder in listdir(ad2):  # all subfolders of the journal
        if folder == '.DS_Store':
            continue
        ad1 = ad2 + folder + '/'
        flushPrint(journal + '_' + folder)
        for paper in listdir(ad1):  # all JSON files in the subfolder
            if paper == '.DS_Store':
                continue
            ad0 = ad1 + paper
            try:
                with open(ad0, 'rb') as e:
                    paperCounter += 1
                    j = json.loads(e.read())
                    nAuthor = len(j['authors'])
                    date = str(j['date'])
                    year = int(date[:4])
                    length = int(j['numPages'])
                    doi = str(j['id'])
                    ncite, ncited = C[doi]
                    J[doi] = [year, nAuthor, length, ncite, ncited]
            except Exception:  # skip records with missing or malformed fields
                pass
    P[journal] = J
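
As a quick check, we can report the total number of parsed papers and the count per journal (journal keys follow the directory names; 'RMP', for instance, appears in the plotting code below):

print(paperCounter)  # total number of parsed papers
for j in P:
    print(j, len(P[j]))  # papers per journal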

Using this information, we can plot the yearly growth in the number of papers over the past century:

Y = {}  # journal name : {year : number of papers}
for j in P:
    J = defaultdict(lambda: 0)
    for x, y in P[j].items():
        year, nAuthor, length, ncite, ncited = y
        J[year] += 1
    Y[j] = J

journals = sorted([(sorted(Y[j].keys())[0], j) for j in Y])  # journals sorted by first year of publication
fig = plt.figure(figsize=(12, 5), facecolor='white')
cmap = cm.get_cmap('Accent_r', len(journals))
for n, val in enumerate(journals):
    firstyear, name = val
    year, stat = np.array(sorted(Y[name].items())).T  # years and yearly paper counts
    plt.plot(year, stat, marker='', linestyle='-', color=cmap(n), label=name + '_' + str(firstyear))
plt.legend(loc=2, fontsize=10)
plt.xlabel('Year')
plt.ylabel('N of papers')

The code above produces the following figure:

We can also systematically investigate the temporal evolution of paper properties:


def yearlystat(variable):
    # return, for each journal, the yearly mean and standard deviation of a given variable
    v = ['year', 'nAuthor', 'length', 'ncite', 'ncited']
    n = v.index(variable)
    Y = {}
    for j in P:
        M = {}
        J = defaultdict(lambda: [])
        for x, y in P[j].items():
            if y[n] > 0:  # ignore zero values
                J[y[0]].append(y[n])  # group by publication year
        for year in J:
            M[year] = (np.mean(J[year]), np.std(J[year]))
        Y[j] = M
    return Y

D = yearlystat('ncite')
E = yearlystat('nAuthor')
F = yearlystat('length')
G = yearlystat('ncited')

def plotV(dic, ylab, label, X_label):
    # plot the yearly mean of a variable for every journal, with a band of +/- std/2
    cmap = cm.get_cmap('Accent_r', len(journals))
    for n, val in enumerate(journals):
        firstyear, name = val
        if name == 'RMP':  # RMP is excluded from these panels
            continue
        d = dic[name]
        year, mean, std = np.array(sorted([(year, d[year][0], d[year][1]) for year in d])).T
        plt.plot(year, mean, marker='', linestyle='-', color=cmap(n), label=name + '_' + str(firstyear))
        plt.fill_between(year, mean - std / 2.0, mean + std / 2.0, color=cmap(n), alpha=0.05)
    if label == True:
        lg = plt.legend(loc=2, fontsize=10, ncol=3)
        lg.draw_frame(False)
    if X_label == True:
        plt.xlabel('Year')
    plt.ylabel(ylab)
    plt.ylim(0, 40)
    plt.xlim(1890, 2013)

fig = plt.figure(figsize=(10, 10),facecolor='white')
#
ax = fig.add_subplot(411)
plotV(D,'Reference length',True,False)
ax = fig.add_subplot(412)
plotV(E,'N of authors',False,False)
ax = fig.add_subplot(413)
plotV(F,'Paper length',False,False)
ax = fig.add_subplot(414)
plotV(G,'Impact',False,True)
#plt.savefig('/Users/lingfeiw/Desktop/APS_stat.png')

[Figure APS_stat.png: yearly trends in reference length, number of authors, paper length, and impact across APS journals]

We observe that the length of papers did not increase significantly over the past century, but papers now cite more references, are written by more authors, and on average have a smaller impact.

Construct the citation network

We now construct a directed network using the networkx module:

G = nx.DiGraph()
citationCounter = 0
with open('/Users/lingfeiw/Documents/bigdata/aps/aps-dataset-citations-2013/aps-dataset-citations-2013.csv','r') as f:
    for line in f:
        citationCounter += 1
        if citationCounter % 100000 == 0:
            flushPrint(citationCounter // 100000)  # progress report every 100 K links
        x, y = line.strip().split(',')  # x cites y
        G.add_edge(x, y)  # edges point from the citing paper to the cited paper

len(G.nodes()),len(G.edges())

This network has 531,480 nodes and 6,039,994 edges.
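
Since edges point from citing to cited papers, the out-degree of a node is its reference count and the in-degree its citation count. Both should agree with the counters collected earlier (the DOI picked here is arbitrary):

doi = list(G.nodes())[0]              # pick an arbitrary paper
print(G.out_degree(doi), C[doi][0])   # number of references
print(G.in_degree(doi), C[doi][1])    # number of citations received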

Calculate PageRank values

The networkx module provides the PageRank algorithm, so there is no need to reinvent the wheel. Applying it takes a single line of code:

pg = nx.pagerank(G)
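
By default, nx.pagerank uses a damping factor (alpha) of 0.85 and iterates until the scores converge; these settings can also be written out explicitly:

pg = nx.pagerank(G, alpha=0.85, max_iter=100, tol=1e-06)  # the defaults, made explicit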

After obtaining the PageRank scores of the papers (nodes), we sort the result and list the ten most influential papers:

top10 = [doi for doi, score in sorted(pg.items(), key=lambda x: -x[1])[:10]]
Author | Year | Title | Journal
W. Kohn & L. J. Sham | 1965 | Self-Consistent Equations Including Exchange and Correlation Effects | Physical Review
P. Hohenberg & W. Kohn | 1964 | Inhomogeneous Electron Gas | Physical Review
J. Bardeen, L. N. Cooper, and J. R. Schrieffer | 1957 | Theory of Superconductivity | Physical Review
A. Einstein, B. Podolsky, and N. Rosen | 1935 | Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? | Physical Review
J. P. Perdew and Alex Zunger | 1981 | Self-interaction correction to density-functional approximations for many-electron systems | Physical Review B
S. Chandrasekhar | 1943 | Stochastic Problems in Physics and Astronomy | Reviews of Modern Physics
John P. Perdew, Kieron Burke, and Matthias Ernzerhof | 1996 | Generalized Gradient Approximation Made Simple | Physical Review Letters
P. W. Anderson | 1958 | Absence of Diffusion in Certain Random Lattices | Physical Review
Steven Weinberg | 1967 | A Model of Leptons | Physical Review Letters
U. Fano | 1961 | Effects of Configuration Interaction on Intensities and Phase Shifts | Physical Review

Finally, here is the code that generates the data behind the Processing visualization on this chapter's cover: the citation links among the 100 most influential papers.

top = set([i for i, j in sorted(pg.items(), key=lambda x: -x[1])[:100]])  # the 100 highest-ranked DOIs
with open('/Users/csid/Desktop/citations.txt', 'w') as f:
    for i, j in G.edges():
        if i in top and j in top:  # keep only the links among the top papers
            f.write(str(i) + '\t' + str(j) + '\n')
