# Python for Basic Data Analysis

## Popular modules

The most frequently used Python modules among data scientists include 1) **numpy**, a module for scientific computing; 2) **matplotlib**, a plotting library that produces high-quality figures (it can also be imported as pylab); and 3) **scipy**, a collection of science and engineering modules. Sometimes you may also need **random** to generate random data or do random sampling, **statsmodels** for statistical models, and **scikit-learn** to do machine learning.

```
import numpy as np
import pylab as plt
import random, datetime
import statsmodels.api as sm
from scipy.stats import norm
```

If you are using a Jupyter notebook, you can also add the following line

```
%matplotlib inline
```

to get figures to show up inline, which is far more convenient than opening a new window for each figure, as *matplotlib* normally does.

## Variable types

There are three basic types of data: **str** for strings, **int** for integers, and **float** for floats. Using the following script you can convert one type to another.

```
str(3)
int('5')
float('7.1')
```

## Data structures

There are four frequently used data structures to be familiar with. 1) A **list** is very handy for small tasks but is less flexible in computation - you always need to apply a function to each of its elements using a for loop. And a nightmare about lists is that to retrieve a given element you may have to go over the entire list. To greatly speed up computation you can convert a list into an 2) **array**, on which you can directly apply various matrix manipulations. One of my favorite structures is the 3) **dictionary**; it allows you to handle very big and complicated data sets - as long as they do not hit the limit of your memory. Each element in a dictionary is a pair of "key" and "value", and you can put any type of data there, including strings, integers, floats, lists, and dictionaries themselves. Finally, a 4) **set** is basically a dictionary with only keys. The common limitation of dictionaries and sets is that you cannot ask their elements to line up in an order you like - that is also why you can pull any value out so quickly by indexing its key: they are hash tables.

```
l = [1,2,3,3]
a = np.array(l)
d = {'a':1,'b':2,'c':3}
s = set([1,2,3,3])
```
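To see how these structures behave differently, here is a small sketch (the variable names are my own): an array supports vectorized arithmetic without a loop, a dictionary answers key lookups directly via hashing, and a set silently drops duplicates.

```
import numpy as np

l = [1, 2, 3, 3]
doubled_list = [x * 2 for x in l]   # lists need an explicit loop
a = np.array(l)
doubled_array = a * 2               # arrays apply the operation elementwise

d = {'a': 1, 'b': 2, 'c': 3}
value = d['b']                      # direct hash lookup, no scanning

s = set([1, 2, 3, 3])               # duplicates collapse: {1, 2, 3}
```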

## Applying functions

To define a function, you specify the name, the operation, and the return value. There are two ways to apply the function to all the elements in a list: you can either call it inside a list comprehension or use the *map* function to map it over the list elements. The *while* syntax provides a handy way to apply a function conditionally, i.e., the repeated application of the function terminates once a predetermined condition is met.

```
def squarePlus(n):
    return n**2 + 1

squarePlus(2)
r = [squarePlus(i) for i in range(10)]
r = map(squarePlus, range(10))
r = []
i = 0
while i < 10:
    r.append(squarePlus(i))
    i += 1
```
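One caveat: in Python 3, *map* returns a lazy iterator rather than a list, so wrap it in `list()` if you need the results materialized right away:

```
def squarePlus(n):
    return n**2 + 1

r = list(map(squarePlus, range(10)))   # materialize the iterator into a list
```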

## Read and write

Do not try to load an entire data file into memory - whether it is 1 MB or 1 GB, doing so always slows down the computation. Using the *with* syntax you can open a file, get the parts you want and/or update a counter (sometimes you may just want to count the number of lines in a file), and have the file closed automatically. In this way, you can manage TB-level data on your laptop.

When exporting data, make sure the elements in each line are strings.

```
data = []
with open('.../xxx.txt', 'r') as f:
    for line in f:
        line = line.strip().split(',')
        data.append(line)

with open('.../xxx.txt', 'w') as f:
    for i in data:
        f.write('\t'.join(map(str, i)) + '\n')
```
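As a minimal sketch of the counter idea mentioned above (the file name and contents here are invented for the example), you can count lines without ever holding the file in memory:

```
# create a small sample file just for the demonstration
with open('sample.txt', 'w') as f:
    f.write('1,2\n3,4\n5,6\n')

# iterate line by line, updating a counter; the file is never fully loaded
count = 0
with open('sample.txt', 'r') as f:
    for line in f:
        count += 1
print(count)   # 3
```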

## Statistical analysis examples

Here I give three simple examples of popular statistical tools in social science: ordinary least squares regression, distribution fitting, and time series visualization.

To conduct a regression, we first generate a simulated data set:

```
x = np.random.randn(50)
```

This gives 50 numbers following a standard normal distribution, with mean 0 and variance 1.

```
y = np.random.randn(50) + 3*x
```

This gives 50 new numbers that depend linearly on the previous 50 numbers but also contain random fluctuation.

The following code fits the regression line and plots it on top of the data.

```
def OLSRegressPlot(x, y, col, xlab, ylab):
    xx = sm.add_constant(x, prepend=True)
    res = sm.OLS(y, xx).fit()
    constant, beta = res.params
    r2 = res.rsquared
    lab = r'$slope = %.2f, \,R^2 = %.2f$' % (beta, r2)
    plt.scatter(x, y, s=60, facecolors='none', edgecolors=col)
    plt.plot(x, constant + x*beta, "red", label=lab)
    plt.legend(loc='upper left', fontsize=16)
    plt.xlabel(xlab, size=16)
    plt.ylabel(ylab, size=16)

fig = plt.figure(figsize=(7, 7), facecolor='white')
OLSRegressPlot(x,y,'RoyalBlue',r'$x$',r'$y$')
plt.show()
```

There is more than one way to generate simulated data that satisfies a given distribution. You can also use the *norm* object in the *scipy.stats* module to get a collection of normally distributed data points.

```
data = norm.rvs(10.0, 2.5, size=5000)
```

You can fit this data set, or any other real data set at hand, to a normal distribution using

```
mu, std = norm.fit(data)
```
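A quick sanity check that *norm.fit* recovers the parameters used to generate the sample (the seed is again my addition, for reproducibility):

```
import numpy as np
from scipy.stats import norm

np.random.seed(0)
data = norm.rvs(10.0, 2.5, size=5000)  # true mean 10.0, true std 2.5
mu, std = norm.fit(data)               # maximum-likelihood estimates
print(mu, std)                         # should be close to 10.0 and 2.5
```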

The following code plots the histogram of the data and also shows the fitted normal distribution curve.

```
fig = plt.figure(figsize=(7, 7),facecolor='white')
plt.hist(data, bins=25, density=True, alpha=0.6, color='g')  # 'normed' was renamed 'density' in newer matplotlib
x = np.linspace(min(data), max(data), 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'r', linewidth=2)
title = r"$\mu = %.2f, \, \sigma = %.2f$" % (mu, std)
plt.title(title,size=16)
plt.show()
```

The modules not only make life easier by providing powerful functions for analyzing and visualizing data, but also give you access to some interesting data sets. For example, the *finance* module in *matplotlib* gives you access to the daily opening and closing stock prices of all stocks on the U.S. market (note that this module has been removed from recent versions of *matplotlib*, so the example below requires an older release). The following code shows you how to pull the prices out and generate a candlestick chart.

```
from matplotlib.dates import WeekdayLocator, DayLocator, DateFormatter, MONDAY
from matplotlib.finance import quotes_historical_yahoo, candlestick
date1 = (2014, 2, 1)
date2 = (2014, 5, 1)
quotes = quotes_historical_yahoo('INTC', date1, date2)
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(1,1,1)
candlestick(ax, quotes, width=0.8, colorup='green', colordown='r', alpha=0.8)
mondays = WeekdayLocator(MONDAY) # major ticks on the mondays
alldays = DayLocator() # minor ticks on the days
weekFormatter = DateFormatter('%b %d') # e.g., Jan 12
ax.xaxis.set_major_locator(mondays)
ax.xaxis.set_minor_locator(alldays)
ax.xaxis.set_major_formatter(weekFormatter)
ax.autoscale_view()
plt.setp( plt.gca().get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title(r'$Intel \,Corporation \,Stock \,Price$',size=16)
fig.subplots_adjust(bottom=0.2)
plt.show()
```