Intro to Pandas DataFrames


DataFrames are a lovely way to store data. They’re essentially labeled tables (matrices whose columns can each hold a different type of data), which makes them a great option when you want to keep track of rows and columns by label. The Pandas library also comes with a handy host of functions that let you work with your DataFrames in smart ways (e.g., computing means or split-apply-combine operations).

The Pandas website has a lot of great documentation to help you get started. However, even after reading the official tutorials, I was stumped on some problems using DataFrames. This lesson is based directly on how I solved those problems, so hopefully some will find it helpful.

Creating DataFrames

You can instantiate DataFrames without any data, or with data from any number of sources:

import pandas as pd
import numpy as np

blank = pd.DataFrame()

movies = pd.DataFrame(
    np.zeros((4, 4)),
    index=['Forrest Gump', 'Scanners', '2010: Odyssey Two', 'Fern Gully'],
    columns=['Date Released', 'Box Office Gross', 'IMDB Score', 'Tomatometer'],
)

zeros = pd.DataFrame(np.zeros((3,5)))
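Two other common sources are a dict of columns and a list of row records. Here is a minimal sketch of both; the names from_dict and from_records and the tiny values in them are just placeholders for illustration.

```python
import pandas as pd

# From a dict of columns: keys become column names, values become the column data.
from_dict = pd.DataFrame(
    {"IMDB Score": [8.8, 6.8], "Tomatometer": [0.72, 0.80]},
    index=["Forrest Gump", "Scanners"],
)

# From a list of records: each dict is one row, and missing keys become NaN.
from_records = pd.DataFrame(
    [{"title": "Forrest Gump", "year": 1994}, {"title": "Scanners"}]
)

print(from_dict)
print(from_records)
```

The dict form is handy when you already have whole columns; the record form fits data that arrives one row at a time.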


Got a CSV file you want to read? No problem!

probedata = pd.read_csv('12probe20cm.csv')
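If you want to try read_csv without a file on disk, it also accepts a file-like object. The sketch below feeds it a small in-memory CSV; the rows are made up for illustration and are not taken from the real probe data.

```python
import io
import pandas as pd

# A tiny made-up CSV; the first line is treated as the header row.
csv_text = """Group,Target Zone %
control,25.0
control,35.0
treated,50.0
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Check what was parsed: column types and the first few rows.
print(sample.dtypes)
print(sample.head())
```

Looking at dtypes and head() right after loading is a quick way to catch columns that were read as text when you expected numbers.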


Slicing DataFrames

You can slice DataFrames much like an array, selecting columns by label, and you can also filter rows with more advanced criteria such as boolean conditions:

movies['Date Released'] = ['1994','1981','1984','1992']
movies['Box Office Gross'] = [50,12,8,16]
movies['IMDB Score'] = [8.8,6.8,6.8,6.4]
movies['Tomatometer'] = [.72,.80,.66,.71]



movies[['IMDB Score','Tomatometer']]
movies[movies['IMDB Score']<(movies['Tomatometer']*10)]
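Beyond bracket selection, .loc (label-based) and .iloc (position-based) let you pick rows and columns together. Here is a small sketch that rebuilds part of the movies table so it runs on its own; the thresholds in the last line are arbitrary.

```python
import pandas as pd

movies = pd.DataFrame(
    {
        "Box Office Gross": [50, 12, 8, 16],
        "IMDB Score": [8.8, 6.8, 6.8, 6.4],
        "Tomatometer": [0.72, 0.80, 0.66, 0.71],
    },
    index=["Forrest Gump", "Scanners", "2010: Odyssey Two", "Fern Gully"],
)

# Label-based: a whole row, then a single cell by row and column label.
scanners = movies.loc["Scanners"]
gross = movies.loc["Fern Gully", "Box Office Gross"]

# Position-based: first two rows, last two columns.
corner = movies.iloc[:2, -2:]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
picks = movies[(movies["IMDB Score"] > 6.5) & (movies["Box Office Gross"] < 20)]
print(picks)
```

The parentheses around each condition matter because & binds more tightly than the comparison operators in Python.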

Adding new elements

Columns and rows can be added relatively easily.

Columns can be added by assigning a Series or a single-column DataFrame to a new column name. Pandas matches the values up by index label, so just make sure that the indices are compatible!

favlist = pd.DataFrame([True,True,True,True],index=movies.index)

movies['Childhood Top 10?'] = favlist
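To see why compatible indices matter, here is a small sketch of what happens when they only partly match. The frame and flags names and their values are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"score": [8.8, 6.8, 6.4]}, index=["a", "b", "c"])

# This Series is in a different order and has no entry for "c".
flags = pd.Series([True, False], index=["b", "a"])

# Values are matched by index label, not by position.
frame["flag"] = flags
print(frame)
```

Row "a" gets False and row "b" gets True despite the reversed order, while row "c" ends up as NaN because no label matched.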


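Rows can be added with the append function on a DataFrame, which takes a second DataFrame with matching columns and a suitable index and adds it to the end of the first. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; in current versions, use pd.concat instead.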

wildwildwest = pd.DataFrame(index=['Wild Wild West'], columns=movies.columns)

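wildwildwest.iloc[0] = ['1999', 2, 4.8,.17,False]

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
movies = pd.concat([movies, wildwildwest])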
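For a single row there is also a shortcut: assigning to a label that does not exist yet through .loc adds the row in place. This is a sketch on a cut-down, two-column version of the movies table, not the full example above.

```python
import pandas as pd

movies = pd.DataFrame(
    {"Date Released": ["1994", "1981"], "IMDB Score": [8.8, 6.8]},
    index=["Forrest Gump", "Scanners"],
)

# Assigning to a new label enlarges the DataFrame by one row.
# The values must be listed in the same order as movies.columns.
movies.loc["Wild Wild West"] = ["1999", 4.8]
print(movies)
```

This is convenient for one-off additions; when adding many rows, building them first and combining them in one step is usually the better fit.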


The concat function is also very useful. It can handle DataFrames with different indices and/or columns, and its join and axis arguments let you choose how the indices and columns are combined.
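Here is a minimal sketch of those options on two small frames that only partly overlap. The left and right frames are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"B": [5, 6], "C": [7, 8]}, index=["y", "z"])

# Stack rows. join="outer" (the default) keeps every column and fills gaps with NaN.
outer_rows = pd.concat([left, right], join="outer")

# join="inner" keeps only the columns that both frames share.
inner_rows = pd.concat([left, right], join="inner")

# axis=1 places the frames side by side, aligning on the row index instead.
side_by_side = pd.concat([left, right], axis=1, join="inner")

print(outer_rows)
print(inner_rows)
print(side_by_side)
```

Note that stacking rows keeps the original index labels, so "y" appears twice in the row-wise results; passing ignore_index=True gives a fresh 0..n-1 index instead.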


The Amazing GroupBy

The groupby method (DataFrame.groupby) is a great tool that lets you process your data in many different ways without having to get fancy or write any loops.

Essentially, it carries out three different steps: it splits the data into groups, applies a function to each group, and combines the results.
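To make the three steps (split, apply, combine) concrete, here is a sketch that does them by hand and then compares the result with groupby. The column names mirror the ones used in this lesson, but the numbers are made up.

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Group": ["control", "control", "treated", "treated"],
        "Target Zone %": [20.0, 30.0, 40.0, 60.0],
    }
)

# 1. Split: collect the rows that belong to each group.
pieces = {name: data[data["Group"] == name] for name in data["Group"].unique()}

# 2. Apply: compute a summary (here, the mean) for each piece.
means = {name: piece["Target Zone %"].mean() for name, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one object.
by_hand = pd.Series(means, name="Target Zone %")

# groupby performs the same three steps in a single expression.
with_groupby = data.groupby("Group")["Target Zone %"].mean()

print(by_hand)
print(with_groupby)
```

Both approaches give the same per-group means; groupby just saves you the bookkeeping.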

In my work, this function was extremely useful when I wanted to obtain the mean time spent in the target zone for each of my treatment groups:

grouped = probedata.groupby(['Group']).mean()

import matplotlib.pyplot as plt
import seaborn as sns

grouped[['Zone 1 %','Target Zone %','Zone 3 %','Zone 4 %']].plot(kind='bar')
plt.title('Time spent in Zone')
plt.ylabel('Time (sec)')
locs, labels = plt.xticks()