Pandas data analysis for the SQL Guy

Here are some beginning examples of using the excellent Python data analysis library pandas.

Coming from a database background, I found I spent a fair amount of time looking up how to do things that would be second nature in SQL.  The intent of this blog is to shortcut that learning curve.

Setup

The easiest way to get running with pandas (as of 08/13) is the Anaconda download at https://store.continuum.io/.  This installs Python and all the requisite libraries.

I have been using the IPython command shell to use pandas interactively, and it works well.  Start it from the command line with ipython --pylab.

Save your work using the %save command (this means you don’t have to start from scratch each session):

    %save <filename> <startline>-<endline>

Run the saved file with:

    %run <filename>

Beyond the online documentation, Python for Data Analysis (http://www.amazon.com/books/dp/1449319793) is very good.

Data for examples:

Sample data:  EOD stock data from EODData.com.

The data is staged in a nasdaq_daily.csv file that looks like this:

AACC,2012-01-02,3.91,3.91,3.91,3.91,0
AAME,2012-01-02,1.97,1.97,1.97,1.97,0
AAON,2012-01-02,20.49,20.49,20.49,20.49,0
AAPL,2012-01-02,405,405,405,405,0
AAWW,2012-01-02,38.43,38.43,38.43,38.43,0

Imported into a DB (SQLite), the stocks table looks like this:

CREATE TABLE nasdaq_daily (
  ticker text,
  date text,
  open real,
  high real,
  low real,
  close real,
  volume real
)

The second table contains Twitter data that looks like this:

1,goog,2013-06-20,488
2,vod,2013-06-30,1

Imported into a DB, the table looks like this:

CREATE TABLE `twitter_data` (
  id INTEGER primary key,
  ticker CHAR(5) NOT NULL,
  date DATETIME NOT NULL,
  count INTEGER
)

--------------------------------------------------

# set up the environment and import data

from pandas import DataFrame, Series
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import csv data
stocks = pd.read_csv('nasdaq_daily.csv', names=['ticker', 'date', 'open', 'high', 'low', 'close', 'volume'])

# Import database data
import sqlite3
import pandas.io.sql as sql
con = sqlite3.connect('stockdata.db')
tweets = sql.read_frame('select * from twitter_data', con)
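(read_frame was the pandas API as of this writing; in newer pandas versions it has been replaced by pd.read_sql, which also takes a params argument for parameterized queries, the bind-variable equivalent of a SQL WHERE clause. A minimal sketch, using an in-memory SQLite database in place of stockdata.db:)

```python
import sqlite3
import pandas as pd

# In-memory stand-in for stockdata.db
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE twitter_data (id INTEGER PRIMARY KEY, '
            'ticker CHAR(5) NOT NULL, date DATETIME NOT NULL, count INTEGER)')
con.executemany('INSERT INTO twitter_data VALUES (?, ?, ?, ?)',
                [(1, 'goog', '2013-06-20', 488), (2, 'vod', '2013-06-30', 1)])

# select * from twitter_data where ticker = 'goog'
goog_tweets = pd.read_sql('select * from twitter_data where ticker = ?',
                          con, params=('goog',))
```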

# Start of data manipulation

# select * from stocks where ticker = 'GOOG'
goog = stocks[stocks.ticker == 'GOOG']
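A WHERE clause with AND/OR translates to boolean masks combined with & and | (the parentheses are required because of operator precedence). A sketch, with a small inline frame standing in for the csv data:

```python
import pandas as pd

# Small inline stand-in for the csv data
stocks = pd.DataFrame({'ticker': ['AAPL', 'AAON', 'AACC'],
                       'close': [405.0, 20.49, 3.91]})

# select * from stocks where ticker = 'AAPL' and close > 400
expensive = stocks[(stocks.ticker == 'AAPL') & (stocks.close > 400)]
```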

# select * from stocks where ticker in ('GOOG', 'VOD', 'AAPL', 'CSCO', 'AMGN', 'QCOM', 'MSFT', 'AMZN', 'CMCSA', 'INTC')
# .copy() avoids working on a view of stocks, since we modify mystocks below
mystocks = stocks[stocks.ticker.isin(['GOOG', 'VOD', 'AAPL', 'CSCO', 'AMGN', 'QCOM', 'MSFT', 'AMZN', 'CMCSA', 'INTC'])].copy()

# update stocks set ticker = lower(ticker)
mystocks['ticker'] = mystocks['ticker'].str.lower()

# select * from stocks s join twitter_data t on s.ticker = t.ticker and s.date = t.date
merged = pd.merge(tweets, mystocks)  # by default, merge joins on all same-named columns; keys can also be specified explicitly
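To spell the join keys out explicitly, and to get the equivalent of a LEFT JOIN rather than the default inner join, pass on= and how= to pd.merge. A sketch with inline frames in place of the real tables:

```python
import pandas as pd

stocks = pd.DataFrame({'ticker': ['goog', 'vod', 'aapl'],
                       'close': [880.0, 29.0, 405.0]})
tweets = pd.DataFrame({'ticker': ['goog', 'vod'],
                       'count': [488, 1]})

# select * from stocks s left join twitter_data t on s.ticker = t.ticker
left = pd.merge(stocks, tweets, on='ticker', how='left')
# rows with no tweet data get NaN in the count column, like SQL NULLs
```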

# select date, count, close from <previous join>
merged = merged[['date', 'count', 'close']]
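ORDER BY and LIMIT map to sort_values and head (sort_values is the name in newer pandas versions; older releases spelled it sort). A sketch with an inline frame in place of the join result:

```python
import pandas as pd

merged = pd.DataFrame({'date': ['2013-06-20', '2013-06-30', '2013-06-25'],
                       'count': [488, 1, 77],
                       'close': [880.0, 29.0, 870.0]})

# select date, count, close from merged order by count desc limit 2
top2 = merged.sort_values('count', ascending=False).head(2)
```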

# select ticker, avg(close), min(close), max(close)
# from stocks
# group by ticker
stocks[['ticker', 'close']].groupby('ticker').agg([np.mean, np.min, np.max])
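A HAVING clause becomes an ordinary filter applied to the aggregated result. A sketch with an inline frame standing in for the stocks table:

```python
import pandas as pd

stocks = pd.DataFrame({'ticker': ['AAPL', 'AAPL', 'AACC', 'AACC'],
                       'close': [405.0, 411.0, 3.91, 3.95]})

# select ticker, avg(close) from stocks
# group by ticker
# having avg(close) > 100
avg_close = stocks.groupby('ticker')['close'].mean()
big = avg_close[avg_close > 100]
```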