# Detecting Influential Papers in Citation Networks

## PageRank algorithm

PageRank is the algorithm used by Google to rank Web pages according to their "authority". It simulates an invisible "flow" of authority in the directed hyperlink network and computes the converged authority score of each node, i.e., its PageRank value.

Given the similarity between citation networks and hyperlink networks, it is natural to apply PageRank to evaluate the influence of papers. In fact, as the original paper notes, PageRank itself was inspired by citation analysis.

The algorithm can be applied to any unweighted network to rank the influence of nodes. In this chapter I will show how to calculate the PageRank values of papers in a citation network, but the same logic can also be used to mine opinion leaders in social networks or elsewhere.
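Before touching the APS data, the idea can be sketched on a tiny hypothetical citation network (the five "papers" A through E are made up for illustration). The heavily cited paper accumulates authority from its citers and ranks highest:

```python
import networkx as nx

# Toy citation network: an edge X -> Y means "X cites Y".
toy = nx.DiGraph()
toy.add_edges_from([
    ('A', 'D'), ('B', 'D'), ('C', 'D'),  # D is cited by three papers
    ('D', 'E'),                          # D itself cites E
    ('E', 'A'),                          # E cites A, closing the loop
])
scores = nx.pagerank(toy, alpha=0.85)  # authority flows along citations
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Note that PageRank is more than citation counting: D ranks first here, but E, cited only once (by the authoritative D), scores far above the uncited A, B, and C.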

## The APS citation data

Several academic institutions provide public access to the citation data of their journals, such as the APS and arXiv datasets in physics and the AMiner datasets in computer science. Here we use the APS citation network as an example. You can obtain the same copy by emailing [email protected] to request the data and explaining what you will use them for.

The APS data set contains two subsets. The first is a collection of JSON files containing the metadata of papers, including DOI, author names, publication year, number of pages, etc. The other is a CSV file containing all citation links from paper A to paper B (B is cited by A). The citation data set under investigation includes about 530 K papers and 6 M citations.
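As a rough sketch of what the two subsets look like, here is a hypothetical CSV line and a hypothetical metadata record. The field names (`id`, `date`, `numPages`, `authors`) are those consumed by the code later in this chapter; the sample values are illustrative, not taken from the actual files:

```python
import json

# Hypothetical citation line: "citing_doi,cited_doi".
csv_line = '10.1103/PhysRev.140.A1133,10.1103/PhysRev.136.B864\n'
citing, cited = csv_line.strip().split(',')  # citing paper cites cited paper

# Hypothetical metadata record with the fields used later in this chapter.
record = json.loads("""{
  "id": "10.1103/PhysRev.140.A1133",
  "date": "1965-11-15",
  "numPages": 6,
  "authors": [{"name": "W. Kohn"}, {"name": "L. J. Sham"}]
}""")
year = int(str(record['date'])[:4])   # publication year from the date string
n_authors = len(record['authors'])    # number of authors
print(citing, cited, year, n_authors)
```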

## Descriptive analysis

Again, we need to import a collection of modules to facilitate our analysis.

```
import re
import sys
import json
from os import listdir
from collections import defaultdict
import numpy as np
import pylab as plt
import matplotlib.cm as cm
import statsmodels.api as sm
import networkx as nx

def flushPrint(s):
    # helper used below: print progress on a single, constantly refreshed line
    sys.stdout.write('\r%s' % s)
    sys.stdout.flush()
```

To obtain descriptive results, we want to calculate the reference length and impact (number of citations) of papers, and integrate this information into the meta data of papers.

The following codes will calculate the reference length and impact (number of citations) of papers:

```
# citation counts
C = defaultdict(lambda: [0, 0])  # paper doi: [cites n papers, is cited by n papers]
citationCounter = 0
with open('/Users/lingfeiw/Documents/bigdata/aps/aps-dataset-citations-2013/aps-dataset-citations-2013.csv', 'r') as f:
    for line in f:
        citationCounter += 1
        if citationCounter % 100000 == 0:
            flushPrint(citationCounter / 100000)
        x, y = line.strip().split(',')  # x cites y
        C[x][0] += 1
        C[y][1] += 1
```
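The counting logic can be checked on a few in-memory lines (the single-letter DOIs are made up for the check):

```python
from collections import defaultdict

C = defaultdict(lambda: [0, 0])  # doi -> [n references made, n citations received]
toy_lines = ['a,b', 'a,c', 'b,c']  # 'a,b' means paper a cites paper b
for line in toy_lines:
    x, y = line.strip().split(',')
    C[x][0] += 1  # x made one more reference
    C[y][1] += 1  # y received one more citation
print(dict(C))  # a cites two papers; c is cited twice
```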

Now we can integrate this information into the metadata of papers:

```
P = defaultdict(lambda: [])
paperCounter = 0
path = '/Users/lingfeiw/Documents/bigdata/aps/aps-dataset-metadata-2013/'
for journal in listdir(path):
    if journal == '.DS_Store':
        continue
    ad2 = path + journal + '/'
    J = defaultdict(lambda: [])
    for folder in listdir(ad2):  # all subfolders
        if folder == '.DS_Store':
            continue
        ad1 = ad2 + folder + '/'
        flushPrint(journal + '_' + folder)
        for paper in listdir(ad1):  # all json files in subfolders
            if paper == '.DS_Store':
                continue
            ad0 = ad1 + paper
            try:
                with open(ad0, 'rb') as e:
                    paperCounter += 1
                    j = json.loads(e.read())
                    nAuthor = len(j['authors'])
                    date = str(j['date'])
                    year = int(date[:4])
                    length = int(j['numPages'])
                    doi = str(j['id'])
                    ncite, ncited = C[doi]
                    J[doi] = [year, nAuthor, length, ncite, ncited]
            except:
                pass
    P[journal] = J
```

Using this information, we can plot the yearly increase in the number of papers over the past century:

```
Y = {}
for j in P:
    J = defaultdict(lambda: 0)
    for x, y in P[j].items():
        year, nAuthor, length, ncite, ncited = y
        J[year] += 1
    Y[j] = J
journals = sorted([(sorted(Y[j].keys())[0], j) for j in Y])
fig = plt.figure(figsize=(12, 5), facecolor='white')
cmap = cm.get_cmap('Accent_r', len(journals))
for n, val in enumerate(journals):
    firstyear, name = val
    year, stat = np.array(sorted(Y[name].items())).T
    plt.plot(year, stat, marker='', linestyle='-', color=cmap(n), label=name + '_' + str(firstyear))
plt.legend(loc=2, fontsize=10)
plt.xlabel('Year')
plt.ylabel('N of papers')
```

The code above produces the following figure:

We can also systematically investigate the temporal evolution of paper properties:

```
def yearlystat(variable):
    v = ['year', 'nAuthor', 'length', 'ncite', 'ncited']
    n = v.index(variable)
    Y = {}
    for j in P:
        M = {}
        J = defaultdict(lambda: [])
        for x, y in P[j].items():
            if y[n] > 0:
                J[y[0]].append(y[n])
        for year in J:
            M[year] = (np.mean(J[year]), np.std(J[year]))
        Y[j] = M
    return Y

D = yearlystat('ncite')
E = yearlystat('nAuthor')
F = yearlystat('length')
G = yearlystat('ncited')

def plotV(dic, ylab, label, X_label):
    cmap = cm.get_cmap('Accent_r', len(journals))
    for n, val in enumerate(journals):
        firstyear, name = val
        if name == 'RMP':
            continue
        d = dic[name]
        year, mean, std = np.array(sorted([(year, d[year][0], d[year][1]) for year in d])).T
        plt.plot(year, mean, marker='', linestyle='-', color=cmap(n), label=name + '_' + str(firstyear))
        plt.fill_between(year, mean - std / 2.0, mean + std / 2.0, color=cmap(n), alpha=0.05)
    if label == True:
        lg = plt.legend(loc=2, fontsize=10, ncol=3)
        lg.draw_frame(False)
    if X_label == True:
        plt.xlabel('Year')
    plt.ylabel(ylab)
    plt.ylim(0, 40)
    plt.xlim(1890, 2013)

fig = plt.figure(figsize=(10, 10), facecolor='white')
ax = fig.add_subplot(411)
plotV(D, 'Reference length', True, False)
ax = fig.add_subplot(412)
plotV(E, 'N of authors', False, False)
ax = fig.add_subplot(413)
plotV(F, 'Paper length', False, False)
ax = fig.add_subplot(414)
plotV(G, 'Impact', False, True)
```

We observe that the length of papers has not increased significantly over the past century, but papers cite more references, are contributed by more authors, and on average have smaller impact.

## Construct citation network

We now construct a network using the **networkx** module as follows:

```
G = nx.DiGraph()
citationCounter = 0
with open('/Users/lingfeiw/Documents/bigdata/aps/aps-dataset-citations-2013/aps-dataset-citations-2013.csv', 'r') as f:
    for line in f:
        citationCounter += 1
        if citationCounter % 100000 == 0:
            flushPrint(citationCounter / 100000)
        x, y = line.strip().split(',')  # x cites y
        G.add_edge(x, y)
len(G.nodes()), len(G.edges())
```

This network has 531,480 nodes and 6,039,994 edges.
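On a `DiGraph` built this way, a paper's citation count is its in-degree and its reference count is its out-degree, which gives a quick sanity check of the edge direction (toy single-letter DOIs, made up for the check):

```python
import networkx as nx

g = nx.DiGraph()
for x, y in [('a', 'c'), ('b', 'c'), ('c', 'd')]:  # x cites y
    g.add_edge(x, y)
# in-degree = times cited, out-degree = references made
print(g.in_degree('c'), g.out_degree('c'))
```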

## Calculate PageRank values

The **networkx** module provides the PageRank algorithm, so there is no need to reinvent the wheel. Applying the algorithm takes a single line of code:

```
pg = nx.pagerank(G)
```
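The call above uses networkx's defaults; the damping factor `alpha` (the probability of following a citation rather than jumping to a random paper) defaults to 0.85, and `max_iter` and `tol` control when the power iteration stops. A minimal sketch on a toy three-paper graph:

```python
import networkx as nx

g = nx.DiGraph([('a', 'c'), ('b', 'c'), ('c', 'a')])  # x -> y means x cites y
# alpha is the damping factor; max_iter and tol bound the power iteration.
pr = nx.pagerank(g, alpha=0.85, max_iter=100, tol=1e-6)
print(sorted(pr, key=pr.get, reverse=True))  # papers from most to least influential
```

On the full APS network a lower `tol` simply means more iterations; the default values are usually fine.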

After obtaining the PageRank scores of the papers (nodes), we sort the result and list the ten most influential papers:

```
top10 = [doi for doi, score in sorted(pg.items(), key=lambda x: -x[1])[:10]]
```

Author | Year | Title | Journal |
---|---|---|---|
W. Kohn & L. J. Sham | 1965 | Self-Consistent Equations Including Exchange and Correlation Effects | Physical Review |
P. Hohenberg & W. Kohn | 1964 | Inhomogeneous Electron Gas | Physical Review |
J. Bardeen, L. N. Cooper, and J. R. Schrieffer | 1957 | Theory of Superconductivity | Physical Review |
A. Einstein, B. Podolsky, and N. Rosen | 1935 | Can Quantum-Mechanical Description of Physical Reality Be Considered Complete? | Physical Review |
J. P. Perdew and Alex Zunger | 1981 | Self-interaction correction to density-functional approximations for many-electron systems | Physical Review B |
S. Chandrasekhar | 1943 | Stochastic Problems in Physics and Astronomy | Reviews of Modern Physics |
John P. Perdew, Kieron Burke, and Matthias Ernzerhof | 1996 | Generalized Gradient Approximation Made Simple | Physical Review Letters |
P. W. Anderson | 1958 | Absence of Diffusion in Certain Random Lattices | Physical Review |
Steven Weinberg | 1967 | A Model of Leptons | Physical Review Letters |
U. Fano | 1961 | Effects of Configuration Interaction on Intensities and Phase Shifts | Physical Review |

Finally, here is the code to generate the data for the Processing visualization on the cover of this chapter, i.e., the connections among the 100 most influential papers.

```
top = set([i for i, j in sorted(pg.items(), key=lambda x: -x[1])[:100]])
with open('/Users/csid/Desktop/citations.txt', 'w') as f:
    for i, j in G.edges():
        if i in top and j in top:
            f.write(str(i) + '\t' + str(j) + '\n')
```