Retrieving Raw Data from Figures
Let's take a picture
More often than not, we are interested in data sets analyzed in published papers but cannot access them, either because the data are poorly archived or because a polite email requesting them never gets a reply. In that case, we can still retrieve the raw data from a picture.
In the hallway of CBIE (located in Matthews Hall at Arizona State University), the research center where I worked as a post-doc researcher for two years from 2014 to 2015, there are many posters on studies of social norms and institutions.
The above poster seems interesting, and I am particularly interested in its experimental results. I want to see what these data points look like in a log-log plot.
So I took a low-quality photo using my $90 Windows phone (Lumia) (note that this was in 2014):
Great, we've got it. Now it only takes a few more steps to see the real data.
We will use functions in the ndimage module in scipy to do some image analysis.
import numpy as np
import pylab as plt
from scipy import ndimage
The first step is to import the photo into Python and turn it into a mathematical object, i.e., a matrix. A color image is simply a matrix in which each element (pixel) is a three-value array. The three values correspond to the intensities of red, green, and blue. Combining these values gives the color of the pixel as the human eye sees it.
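# ndimage.imread was removed in SciPy 1.2; pylab's imread also returns the photo as an array
j = plt.imread("/Users/csid/Desktop/figure.jpg")
# keep rows 100..729 and columns 300..1239 (the area inside the plot frame)
j = j[100:730,300:1240]
plt.imshow(j)
plt.show()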
We trim the photo along the frame. The size of the original matrix (photo) was 916 (height) x 1632 (width). After trimming its size is 630 (height) x 940 (width).
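To make the matrix view concrete, here is a minimal sketch that uses a synthetic array as a stand-in for the photo (the real file is not needed). The array name `photo` and the single red pixel are assumptions for illustration only; the sizes and the crop window are the ones quoted above.

```python
import numpy as np

# Synthetic stand-in for the photo: 916 rows (height) x 1632 columns (width) x 3 channels (R, G, B).
photo = np.zeros((916, 1632, 3), dtype=np.uint8)
photo[200, 500] = (255, 0, 0)  # one pure-red pixel at row 200, column 500

# Rows are indexed first, then columns, so this keeps rows 100..729 and columns 300..1239.
cropped = photo[100:730, 300:1240]

print(photo.shape)    # (916, 1632, 3)
print(cropped.shape)  # (630, 940, 3)

# The red pixel moved by the crop offsets: row 200 - 100, column 500 - 300.
print(cropped[100, 200])
```

Note that the first index is the vertical position (row) and the second is the horizontal position (column), which is why the height comes before the width in the sizes above.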
I use a Gaussian filter to smooth the edges of the data points and turn them into disconnected "clusters". Note that we will not be able to separate and identify data points within these clusters with code as simple as that shown here. After this, I convert the color image into a black-and-white version. Basically, this step replaces the three-value elements of the matrix with 0/1 values.
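# smooth the image with a Gaussian filter (sigma = 4)
b = ndimage.gaussian_filter(j,4)
# threshold (dark pixels have values below 150) and label connected clusters;
# c is the labeled array, d is the number of clusters found
c,d = ndimage.label(b<150)
im = plt.imshow(c)
plt.show()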
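The effect of smoothing before thresholding can be seen on a tiny synthetic image. This is only a sketch under assumptions: the image, the three square "markers", and sigma = 2 are made up for illustration, while the threshold of 150 is the one used above. Two markers are placed two pixels apart to show how the blur merges neighbors into one cluster, which is the limitation mentioned in the text.

```python
import numpy as np
from scipy import ndimage

# White background (255) with three dark square markers (0). All values are hypothetical.
img = np.full((60, 80), 255.0)
img[10:17, 10:17] = 0.0   # marker A, isolated
img[30:37, 40:47] = 0.0   # marker B
img[30:37, 49:56] = 0.0   # marker C, separated from B by a two-pixel gap

# Thresholding the raw image keeps all three markers apart.
_, n_raw = ndimage.label(img < 150)

# Blurring first darkens the narrow gap between B and C below the threshold,
# so the two markers fuse into a single cluster.
blurred = ndimage.gaussian_filter(img, sigma=2)
labels, n_blurred = ndimage.label(blurred < 150)

print(n_raw, n_blurred)  # 3 2
```

The blur is useful for closing small holes and ragged edges in a noisy photo, but the same mechanism is why overlapping or nearly touching data points end up counted once.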
Using the handy functions provided by the ndimage module we are able to detect all disconnected clusters in a given matrix and find the locations of their centers.
# one (row slice, column slice) bounding box per labeled cluster
sliced = ndimage.find_objects(c)
data = []
for i in sliced:
    sliceX = i[1]  # column range (horizontal axis)
    sliceY = i[0]  # row range (vertical axis)
    # center of the bounding box
    x = np.abs(sliceX.stop - sliceX.start)/2 + min(sliceX.stop,sliceX.start)
    y = np.abs(sliceY.stop - sliceY.start)/2 + min(sliceY.stop,sliceY.start)
    xlim=im.axes.get_xlim()
    ylim=im.axes.get_ylim()
    # image rows grow downward, so flip y to make it grow upward
    data.append([x,ylim[0]-y])
    # draw the bounding box
    plt.plot([sliceX.start, sliceX.start, sliceX.stop, sliceX.stop, sliceX.start],
             [sliceY.start, sliceY.stop, sliceY.stop, sliceY.start, sliceY.start],
             color="c")
plt.xlim(xlim)
plt.ylim(ylim)
plt.imshow(c)
plt.show()
The following figure shows that we have successfully detected almost all clusters.
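As a small check of the "bounding box center" idea, the sketch below runs on a hand-made mask with two rectangles (the mask and its shapes are assumptions for illustration). It also compares the result with `ndimage.center_of_mass`, which is a different technique from the one used in this post: it averages the pixel positions of each cluster instead of taking the middle of its bounding box. Here the middle is computed between the first and last pixel indices, which is half a pixel smaller than the `(start + stop)/2` midpoint used in the code above.

```python
import numpy as np
from scipy import ndimage

# Hypothetical binary mask with two rectangular clusters.
mask = np.zeros((10, 12), dtype=bool)
mask[1:4, 2:5] = True    # 3x3 square: rows 1..3, columns 2..4
mask[6:8, 7:11] = True   # 2x4 rectangle: rows 6..7, columns 7..10

labels, n = ndimage.label(mask)

# Bounding-box centers, stored as (x, y) = (column, row).
centers = []
for row_slice, col_slice in ndimage.find_objects(labels):
    cy = (row_slice.start + row_slice.stop - 1) / 2
    cx = (col_slice.start + col_slice.stop - 1) / 2
    centers.append((cx, cy))

# Centers of mass, returned by SciPy as (row, column) pairs.
com = ndimage.center_of_mass(mask, labels, index=np.arange(1, n + 1))

print(centers)  # [(3.0, 2.0), (8.5, 6.5)]
print(com)
```

For symmetric markers the two estimates coincide; for clusters made of several merged points, the center of mass leans toward the denser part, while the bounding-box center does not.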
The only thing remaining is to take the coordinates of the centers of the rectangles shown in the above figure (contained in the "data" list) and rescale the axes using the real values shown in the original photo.
fig = plt.figure(figsize=(9.4, 6.3),facecolor='white')
x,y=np.array(data).T
# rescale pixels to data units: 940 px wide = 0.3, 630 px high = 3000
x=x*0.3/940.0
y=y*3000.0/630
plt.plot(x,y,'bD',markersize=6)
plt.xlim(0,0.3)
plt.ylim(0,3000)
plt.xlabel('inequality',fontsize=14)
plt.ylabel('length of game',fontsize=14)
plt.savefig('/Users/lingfeiw/Desktop/raw.png')
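The rescaling above is a linear map from pixel positions to axis values. A minimal sketch of that map as a reusable function follows; the function name `pixel_to_data` and its parameters are hypothetical, and it assumes that the cropped image edges coincide exactly with the axis limits, that both axes are linear, and that the row index is counted from the top of the image.

```python
import math

def pixel_to_data(px, py, width_px=940, height_px=630,
                  x_range=(0.0, 0.3), y_range=(0.0, 3000.0)):
    """Map a pixel position (column px, row py) to data coordinates.

    Assumes the image edges coincide with the axis limits and the axes are linear.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    x = x_min + (px / width_px) * (x_max - x_min)
    # Rows are counted from the top, data values from the bottom, so flip the row index.
    y = y_min + ((height_px - py) / height_px) * (y_max - y_min)
    return x, y

# The middle of the image should land in the middle of both axis ranges.
x_mid, y_mid = pixel_to_data(470, 315)
print(x_mid, y_mid)
```

If the photo is tilted or the crop does not match the frame exactly, this simple map picks up a systematic error, which is one source of the imperfections in the retrieved data.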
The above figure shows the data we retrieved. They are far from perfect because information is lost through photo distortion and image analysis, but they are probably good enough for meta-analysis. For example, we can test whether the log-log plot of these data gives a stronger Pearson correlation (see this chapter for the code for OLS regression):
fig = plt.figure(figsize=(9.4*1.5, 1.5*6.3/2.0),facecolor='white')
# left panel: original scale
ax = fig.add_subplot(121)
OLSRegressPlot(x,y,'b','inequality','length of game')
# right panel: log-log scale
ax = fig.add_subplot(122)
OLSRegressPlot(np.log(x),np.log(y),'b','log inequality','log length of game')
See this chapter for the code of the regression plot. We know that the square of Pearson's correlation coefficient is the same as the R-squared in simple linear regression. Therefore, the above figure shows that the logarithmic transformation of the data does not give a better result. Hmm... not really good news after this long journey, but at least our concern was addressed and we are happy.
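The identity between the squared Pearson correlation and R-squared can be checked numerically. The sketch below uses synthetic data (the random numbers are an assumption and have nothing to do with the poster), fits a straight line with `np.polyfit`, and compares the two quantities; it then repeats the correlation on the log-transformed values, which is the comparison made in the figure above.

```python
import numpy as np

# Synthetic data for illustration only; not the values retrieved from the poster.
rng = np.random.default_rng(0)
x = rng.uniform(0.01, 0.3, size=50)
y = 1000 + 5000 * x + rng.normal(0, 200, size=50)

# Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

# Simple linear regression y = slope * x + intercept, then R-squared = 1 - SS_res / SS_tot.
slope, intercept = np.polyfit(x, y, 1)
ss_res = np.sum((y - (slope * x + intercept)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(r ** 2, r_squared)  # the two values agree up to floating-point error

# The same correlation on the log-log scale, to compare the two fits.
r_log = np.corrcoef(np.log(x), np.log(y))[0, 1]
print(r ** 2, r_log ** 2)
```

Whichever scale yields the larger squared correlation is the one on which a straight line describes the data better, which is how the two panels above are compared.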