Python for Basic Data Analysis

The Python modules most frequently used by data scientists include 1) numpy, a module for scientific computing; 2) matplotlib, a plotting library that produces high-quality figures (its functions can also be imported through pylab); 3) scipy, a collection of science and engineering modules. Sometimes you may also need random to generate random data or do random sampling, statsmodels for statistical models, and scikit-learn for machine learning.

import numpy as np
import pylab as plt
import random, datetime
import statsmodels.api as sm
from scipy.stats import norm

If you are using Jupyter Notebook, you can also add the following magic command

%matplotlib inline

to get figures to show up inline, which is much more convenient than opening a new window for each figure, as matplotlib would normally do.

Variable types

There are three basic types of data: str for strings, int for integers, and float for floating-point numbers. You can convert one type to another with the built-in functions str(), int(), and float().
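
A minimal sketch of such conversions is given here. The variable names and values are only illustrative.

```python
# Convert between the three basic types with the built-in constructors.
s = "42"                 # a string holding digits
n = int(s)               # str -> int
x = float(s)             # str -> float
t = str(3.14)            # float -> str
k = int(3.99)            # float -> int truncates toward zero, it does not round

print(n, x, t, k)
print(type(n).__name__, type(x).__name__, type(t).__name__)

# int() rejects a string that contains a decimal point; go through float() first.
m = int(float("3.99"))
```

Note that int() drops the fractional part; use round() if you want the nearest integer instead.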


Data structures

There are four frequently used data structures to be familiar with. A list is very handy for small tasks but is less flexible in computation: you always need to apply a function to each of its elements with a for loop or a comprehension. Another drawback of a list is that to find an element by its value, Python has to scan the entire list. To greatly speed up computation you can convert a list into a numpy array, on which you can directly apply element-wise and matrix operations.

One of my favorite structures is the dictionary. It lets you handle very big and complicated data sets, as long as they do not exceed your memory. Each element in a dictionary is a pair of "key" and "value". Values can be any type of data, including strings, integers, floats, lists, and dictionaries themselves; keys must be hashable, such as strings, numbers, or tuples. Finally, a set is basically a dictionary with only keys. The common limitation of dictionaries and sets is that you cannot index their elements by position or sort them in place. That is the price of being able to pull out any value so quickly by its key: both are hash tables.

l = [1,2,3,3]
a = np.array(l)
d = {'a':1,'b':2,'c':3}
s = set([1,2,3,3])
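
The differences between the four structures can be sketched as follows. The names values and record, and the data in them, are made up for illustration.

```python
import numpy as np

values = [1, 2, 3, 3]

# A list needs an explicit loop (or comprehension) for element-wise math.
doubled_list = [v * 2 for v in values]

# An array applies the operation to every element at once.
arr = np.array(values)
doubled_arr = arr * 2

# A dictionary maps keys to values; values can be nested containers.
record = {'name': 'a', 'scores': [1, 2, 3], 'meta': {'year': 2014}}
year = record['meta']['year']      # lookup by key, no scan needed

# A set keeps only unique elements and supports fast membership tests.
unique = set(values)
has_three = 3 in unique

# To see keys in a chosen order, sort them into a new list.
ordered_keys = sorted(record)
```

Sorting the keys produces a separate list; the dictionary itself is left unchanged.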

Applying functions

To define a function, you specify its name, the operation, and the return value. There are two ways to apply a function to all the elements of a list: call it inside a list comprehension, or use the map function to map it over the elements. The while statement provides a handy way to apply a function conditionally, i.e., the loop keeps applying the function and stops as soon as its condition no longer holds.

def squarePlus(n):
    return n**2 + 1


r = [squarePlus(i) for i in range(10)]

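# in Python 3, map returns a lazy iterator, so wrap it in list()
r = list(map(squarePlus, range(10)))

# apply the function repeatedly until the condition i<10 fails
r = []
i = 0
while i<10:
    r.append(squarePlus(i))
    i += 1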
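
The stopping condition of a while loop does not have to be a counter. As a small sketch, the loop below keeps applying squarePlus until the result would exceed a threshold; the name limit and its value are assumptions chosen for illustration.

```python
def squarePlus(n):
    return n**2 + 1

limit = 50        # hypothetical stopping threshold
r = []
i = 0
# Keep collecting results while they stay at or below the threshold.
while squarePlus(i) <= limit:
    r.append(squarePlus(i))
    i += 1

print(r)
```

The loop ends at the first i whose result is larger than limit, so the length of r is not known in advance.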

Read and write

Do not try to load an entire data file into memory when you only need part of it. Whether the file is 1 MB or 1 GB, holding all of it in memory costs time and space. Using the with statement you can open a file, iterate over it line by line to get the parts you want and/or update a counter (sometimes you may just want to count the number of lines in a file), and have the file closed automatically at the end of the block. In this way, you can work through TB-level data on your laptop, since only one line is held in memory at a time.

When exporting data, make sure the elements in each line are strings.

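data = []
with open('.../xxx.txt','r') as f:
    for line in f:
        # split each comma-separated line into a list of fields
        line = line.strip().split(',')
        data.append(line)

# open in text mode ("w"), since we write str rather than bytes
with open(".../xxx.txt", "w") as f:
    for i in data:
        f.write('\t'.join(map(str,i)) + '\n')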
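
To make the line-by-line pattern concrete, here is a self-contained sketch that writes a tiny tab-separated file to a temporary folder and then streams it back, counting lines and summing one column. The file name example.txt and the rows are made up for illustration.

```python
import os
import tempfile

# Write a small tab-separated file to a temporary location (illustrative data).
rows = [[1, 'a', 2.5], [2, 'b', 3.5], [3, 'c', 4.5]]
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as f:
    for row in rows:
        # every element must be a string before joining
        f.write('\t'.join(map(str, row)) + '\n')

# Stream the file back one line at a time, keeping only what is needed.
n_lines = 0
total = 0.0
with open(path, 'r') as f:
    for line in f:
        fields = line.strip().split('\t')
        n_lines += 1
        total += float(fields[2])   # third column, converted from str

print(n_lines, total)
```

Only the counter and the running sum stay in memory, so the same loop works for files of any size.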

Statistical analysis examples

Here I give three simple examples of popular statistical tools in social science: ordinary least squares regression, distribution fitting, and time series visualization.

To conduct a regression, we first generate a simulated data set:

x = np.random.randn(50)

This gives 50 numbers drawn from a standard normal distribution with mean 0 and variance 1.

y = np.random.randn(50) + 3*x

This gives 50 new numbers that depend linearly on the previous 50 numbers but also contain random noise.
The following code fits the regression line and plots it on top of the data.

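def OLSRegressPlot(x,y,col,xlab,ylab):
    # fit y = constant + beta*x by ordinary least squares
    xx = sm.add_constant(x, prepend=True)
    res = sm.OLS(y,xx).fit()
    constant, beta = res.params
    r2 = res.rsquared
    lab = r'$slope = %.2f, \,R^2 = %.2f$' %(beta,r2)
    plt.scatter(x,y,s=60,facecolors='none', edgecolors=col)
    plt.plot(x,constant + x*beta,"red",label=lab)
    plt.legend(loc = 'upper left',fontsize=16)
    # label the axes with the xlab and ylab arguments
    plt.xlabel(xlab,size=16)
    plt.ylabel(ylab,size=16)

fig = plt.figure(figsize=(7, 7),facecolor='white')
# example call; the color and axis labels are arbitrary choices
OLSRegressPlot(x,y,'b',r'$x$',r'$y$')
plt.show()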
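
As a cross-check on the fitted slope, the same simple regression can be sketched with numpy alone. This uses np.polyfit instead of statsmodels, and the fixed random seed is an assumption added only to make the run reproducible.

```python
import numpy as np

rng = np.random.RandomState(0)     # fixed seed, an assumption for reproducibility
x = rng.randn(50)
y = rng.randn(50) + 3 * x

# Degree-1 least-squares fit; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, 1)

# R^2 for a simple linear regression is the squared correlation of x and y.
r2 = np.corrcoef(x, y)[0, 1] ** 2
print('slope = %.2f, intercept = %.2f, R^2 = %.2f' % (slope, intercept, r2))
```

The slope should come out close to 3 and the intercept close to 0, since that is how y was constructed.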

There is more than one way to generate simulated data that follow a given distribution. You can also use the norm object in the scipy.stats module to draw a collection of normally distributed data points.

data = norm.rvs(10.0, 2.5, size=5000)

You can fit this data set, or any other real data set at hand, to a normal distribution using

mu, std = norm.fit(data)

The following code plots the histogram of the data and also shows the fitted normal distribution curve.

fig = plt.figure(figsize=(7, 7),facecolor='white')
# density=True normalizes the histogram (it replaces the removed normed=True)
plt.hist(data, bins=25, density=True, alpha=0.6, color='g')
x = np.linspace(min(data), max(data), 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'r', linewidth=2)
title = r"$\mu = %.2f, \,  \sigma = %.2f$" % (mu, std)
plt.title(title, size=16)
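
For a normal distribution, the maximum-likelihood estimates are simply the sample mean and the standard deviation computed with ddof=0. The sketch below checks that norm.fit agrees with numpy on simulated data; the random_state seed is an assumption added for reproducibility.

```python
import numpy as np
from scipy.stats import norm

# Simulated data with true mean 10.0 and true standard deviation 2.5.
data = norm.rvs(10.0, 2.5, size=5000, random_state=0)  # seed is an assumption

# Maximum-likelihood fit of a normal distribution.
mu, std = norm.fit(data)

# Compare with the sample mean and the ddof=0 standard deviation.
same_mu = np.isclose(mu, np.mean(data))
same_std = np.isclose(std, np.std(data))
print(mu, std, same_mu, same_std)
```

With 5000 points, both estimates should land close to the true values used to generate the data.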

The modules not only make life easier by providing powerful functions for analyzing and visualizing data, but also give you access to some interesting data sets. For example, the finance module in older versions of matplotlib gave access to the daily opening and closing prices of stocks listed on the U.S. market through Yahoo Finance. The following code shows how to pull them out and generate a candlestick chart. Note that matplotlib.finance was deprecated in matplotlib 2.0 and later removed, so this code only runs on older installations.

from matplotlib.dates import DateFormatter, WeekdayLocator, DayLocator, MONDAY
# matplotlib.finance is only available in older matplotlib releases
from matplotlib.finance import quotes_historical_yahoo, candlestick

date1 = (2014, 2, 1)
date2 = (2014, 5, 1)
quotes = quotes_historical_yahoo('INTC', date1, date2)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(1,1,1)
candlestick(ax, quotes, width=0.8, colorup='green', colordown='r', alpha=0.8)
mondays = WeekdayLocator(MONDAY)    # major ticks on the mondays
alldays = DayLocator()              # minor ticks on the days
weekFormatter = DateFormatter('%b %d')  # e.g., Jan 12
# attach the locators and formatter defined above to the x axis
ax.xaxis.set_major_locator(mondays)
ax.xaxis.set_minor_locator(alldays)
ax.xaxis.set_major_formatter(weekFormatter)
plt.setp( plt.gca().get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title(r'$Intel \,Corporation \,Stock \,Price$',size=16)
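
If the finance helpers are not available in your matplotlib installation, a candlestick chart can be sketched with plain matplotlib calls. This is a swapped-in approach, not the candlestick function used above, and the prices are synthetic numbers generated for illustration, not real market data.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')              # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Synthetic daily prices (made-up numbers, not real market data).
rng = np.random.RandomState(0)
n_days = 30
close = 25 + np.cumsum(rng.randn(n_days) * 0.3)
open_ = np.concatenate(([25.0], close[:-1]))       # open at the previous close
high = np.maximum(open_, close) + 0.2
low = np.minimum(open_, close) - 0.2

days = np.arange(n_days)
# Green when the price closed at or above the open, red otherwise.
colors = ['green' if c >= o else 'red' for o, c in zip(open_, close)]

fig, ax = plt.subplots(figsize=(7, 7))
# Thin vertical line from low to high, thick bar from open to close.
ax.vlines(days, low, high, colors=colors, linewidth=1)
ax.bar(days, close - open_, bottom=open_, width=0.8, color=colors, alpha=0.8)
ax.set_xlabel('day')
ax.set_ylabel('price')
```

Each day is drawn as one bar for the open-to-close range plus one line for the low-to-high range, which is all a candlestick is.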
