Here’s some beginning examples for using the excellent python data analysis libary pandas.
Coming from database background, I found I spent a fair amount of time looking up how to do things that would be second nature in SQL. The intent of this blog is to shortcut this learning curve.
Easiest way to get running with pandas (as of 08/13) is from the Anaconda download at https://store.continuum.io/. This installs python and all the requisite libraries.
I have been using the ipython command shell to use pandas interactively, and it works well. Start from command line ipython –pylab.
Save work using %save command (this means you don’t have to start from scratch each session):
%save <filename> <startline>-<endline>
Run saved file as
Beyond the online documentation, Python for Data Analysis (http://www.amazon.com/books/dp/1449319793) is very good.
Data for examples:
Sample data: EOD stock data from EODData.com.
Data staged in nasdqa_daily.csv file that looks like this:
Imported into a DB (Sqlite), the stocks table looks like
CREATE TABLE nasdaq_daily (
The second table contains twitter data that looks like this:
Imported into a DB, the table looks like this:
CREATE TABLE `twitter_data` (
id INTEGER primary key,
ticker CHAR(5) NOT NULL,
date DATETIME NOT NULL,
# set up the environment and import data
from pandas import DataFrame, Series
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import csv data
stocks = pd.read_csv(“nasdaq_daily.csv”, names=[‘ticker’, ‘date’, ‘open’, ‘high’, ‘low’, ‘close’, ‘volume’])
# Import database data
import pandas.io.sql as sql
con = sqlite3.connect(‘stockdata.db’)
tweets = sql.read_frame(‘select * from twitter_data’, con)
#start of data manipulation
# select * from stocks where ticker = ‘GOOG’
goog = stocks[stocks.ticker == ‘goog’]
# select * from stocks where ticker in (‘GOOG’, ‘VOD’, ‘AAPL’, ‘CSCO’, ‘AMGN’, ‘QCOM’, ‘MSFT’, ‘AMZN’, ‘CMCSA’, ‘INTC’ )
mystocks = stocks[stocks.ticker.isin( [‘GOOG’, ‘VOD’, ‘AAPL’, ‘CSCO’, ‘AMGN’, ‘QCOM’, ‘MSFT’, ‘AMZN’, ‘CMCSA’, ‘INTC’ ] ) ];
# update stocks set ticker = lower(ticker)
mystocks[‘ticker’] = mystocks[‘ticker’].map(lambda x: x.lower())
# select * from stocks s join twitter_data t on s.ticker = t.ticker and s.date = t.date
merged = pd.merge(tweets, mystocks) # assumes same named columns in two tables are joined — can also specify
#select date, count, close from <previous join>
merged = merged[[‘date’, ‘count’, ‘close’]]
#select ticker, avg(close), min(close), max(close)
#group by ticker
stocks[[‘ticker’, ‘close’]].groupby(‘ticker’).agg([np.mean, np.min, np.max])