In this lesson, we will learn to use loc[], head(), info(), describe(), shape, columns, index, groupby(), mean(), agg(), and size() to explore and summarize dataframes with pandas.

Today, we will be working with real data about the world, combined from multiple sources by the Gapminder foundation. Gapminder is an independent Swedish organization that fights devastating misconceptions about global development. They also promote a fact-based world view through the production of free teaching and data exploration resources. Insights from the Gapminder data sources have been popularized through the efforts of public health professor Hans Rosling, and it is highly recommended to check out his entertaining videos, most famously The best stats you have ever seen. Before we start exploring the data, we recommend taking this 5-10 min quiz to see how knowledgeable (or ignorant) you are about the world. Then we will learn how to dive deeper into this data using Python!
We will be studying the demographic and socioeconomic conditions of countries around the world. The dataset is stored as a comma separated value (CSV) file. Each row holds the information for a single country at a particular year, and the columns represent:
Column | Description |
---|---|
country | Country name |
year | Year of observation |
population | Population in the country at each year |
region | Continent the country belongs to |
sub_region | Sub region the country belongs to |
income_group | Income group as specified by the World Bank |
life_expectancy | The average number of years a newborn child would live if mortality patterns were to stay the same |
income | GDP per capita (in USD) adjusted for differences in purchasing power |
children_per_woman | Number of children born to each woman |
child_mortality | Deaths of children under 5 years of age |
pop_density | Average number of people per km2 |
co2_per_capita | CO2 emissions from fossil fuels (tonnes per capita) |
years_in_school_men | Average number of years attending primary, secondary, and tertiary school for men aged 25-36 |
years_in_school_women | Average number of years attending primary, secondary, and tertiary school for women aged 25-36 |
To read the data into Python,
we are going to use a function called read_csv
from the Python package pandas.
As mentioned previously,
Python packages are a bit like phone apps:
they are not essential to the core Python library,
but provide domain-specific functionality.
To use a package,
it first needs to be imported.
# pandas is given the nickname `pd`
import pandas as pd
pandas can read CSV files saved on the computer or directly from a URL. Here, we read data that we have compiled from Gapminder and uploaded to our GitHub repository.
url = 'https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/go-pages/data/world-data-Gapminder.csv'
world_data = pd.read_csv(url)
To view the dataframe that pandas created,
type world_data
in a cell and run it,
just as when viewing the content of any variable in Python.
world_data
This is how a dataframe is displayed in the Jupyter notebook.
The Jupyter notebook displays pandas dataframes in a tabular format,
and adds cosmetic conveniences such as the bold font type for the column and row names,
the alternating grey and white zebra stripes for the rows,
and highlighting of the row the mouse pointer hovers over.
The increasing numbers on the far left are the dataframe's index, or row names.
These are not present in the CSV file,
but were added by pandas
to easily distinguish between the rows.
A dataframe is the representation of data in a tabular format,
similar to how data is often arranged in spreadsheets.
The data is rectangular,
meaning that all rows have the same number of columns
and all columns have the same number of rows.
As mentioned in the previous lectures,
when our data is arranged in a tidy format,
the columns can be referred to as the "features" or "variables" of the data,
while each row represents an individual "observation".
Dataframes are the standard data structure for most tabular data,
and what we will use for data wrangling, statistics and plotting.
A dataframe can be created by hand,
but most commonly they are generated by an input function,
such as read_csv()
,
when importing spreadsheet data from your hard drive (or the web).
As can be seen above,
the default is to display the first and last five rows
and truncate everything in between,
as indicated by the ellipsis (...
).
If we wanted to display only the first 5 lines,
we could use the head()
method.
world_data.head()
Methods are very similar to functions;
the main difference is that they belong to an object
(the method head()
belongs to the dataframe world_data).
Methods operate on the object they belong to,
which is why we can call a method with empty parentheses, without passing any arguments.
Compare this with the function type()
that was introduced previously.
type(world_data)
Here,
the world_data
variable is explicitly passed as an argument to type()
.
An immediately tangible advantage with methods is that they simplify tab completion.
Just type the name of the dataframe,
a period,
and then hit tab to see all the relevant methods for that dataframe
instead of fumbling around with all the available functions in Python
(there are quite a few!)
and figuring out which ones operate on dataframes and which do not.
Methods also facilitate readability when chaining many operations together,
which will be shown in detail later.
The columns in a dataframe can contain data of different types,
e.g. integers, floats, and objects (which include strings, lists, dictionaries, and more).
General information about the dataframe
(including the column data types)
can be obtained with the info()
method.
world_data.info()
The information includes the total number of rows and columns, the number of non-null observations, the column data types, and the memory (RAM) usage. The number of non-null observations is not the same for all columns, which means that some columns contain null (or NA) values, indicating that there is missing data for some observations. The column data type indicates which type of data is stored in that column, and approximately corresponds to the following Python types:

pandas dtype | Python type |
---|---|
object, category | str |
bool | bool |
int64 | int |
float64 | float |
Note that an object
could contain different types,
e.g. str
or list
.
Also note that there can be exceptions to the schema above,
but it is a useful general guide.
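To inspect just the data types of every column, without the rest of the info() output, dataframes also have a dtypes attribute (shown here as a quick check on the dataframe loaded above).

# The data type of each column
world_data.dtypes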
After reading in the data into a dataframe,
head()
and info()
are two of the most useful methods
to get an idea of the structure of this dataframe.
There are a few additional methods
that can facilitate the understanding of what a dataframe contains:
Content:

- world_data.head(n) - shows the first n rows
- world_data.tail(n) - shows the last n rows

Summary:

- world_data.info() - column names and data types, the number of non-null observations, and memory consumption
- world_data.describe() - summary statistics for each column

The suffixed parentheses indicate that the method is being called,
which means that there is a computation carried out
when we execute the code.
Parameters can be put inside the parentheses
to change the behavior of the method.
For example,
head(10)
tells the head()
method to show the first ten rows of the dataframe,
instead of the default first five.
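tail() works analogously; for example, the last three rows can be shown like this.

world_data.tail(3)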
In addition to methods that compute values on demand,
dataframes also store pre-calculated values, accessible with the same dot-syntax.
Values stored like this are often frequently accessed,
and it saves time to store the value directly instead of recomputing it every time it is needed.
For example,
every time pandas
creates a dataframe,
the number of rows and columns is computed and stored in the shape
attribute.
Some useful pre-computed values are shown below.
Names:

- world_data.columns - the names of the columns
- world_data.index - the names of the rows (referred to as the index in pandas)

Size:

- world_data.shape - the number of rows and columns stored as a tuple
- world_data.shape[0] - the number of rows
- world_data.shape[1] - the number of columns

In shape[0], the [0] part accesses the first element of the tuple via indexing; this is not the same as passing a number to head(), which changes how a calculation happens. Generally, anything accessible via the dot-syntax is an attribute of the dataframe (including methods).
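As a quick illustration of these attributes on the dataframe above:

world_data.shape

world_data.columns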
Challenge

Based on the output of world_data.info(), can you answer the following questions?

- What is the class of the object world_data?
- How many rows and how many columns are in this object?
- Why is there not the same number of rows (observations) for each column?
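One way to check the answers, following the solution-cell pattern used later in this lesson:

# Challenge solutions
# 1. `world_data` is a pandas dataframe
type(world_data)

# 2. The number of rows and columns
world_data.shape

# 3. Columns with missing (NA) values have fewer non-null observations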
When using data from an online source,
it is good practice to keep a copy stored locally on your computer
in case you want to do offline analyses,
the online version of the file changes,
or the file is taken down.
To save a local copy,
the data could be downloaded manually
or the current world_data
dataframe could be saved to disk as a CSV-file with to_csv()
.
world_data.to_csv('world-data.csv', index=False)
# `index=False` because the index (the numbered row names)
# was generated automatically when pandas loaded the file
# and this information does not need to be saved
Since the data is now saved locally, the next time this notebook is opened, it could be loaded from the local path instead of downloading it from the URL.
world_data = pd.read_csv('world-data.csv')
world_data.head()
The world data dataframe has rows and columns, which means it has two dimensions. We can "subset" the dataframe and extract data only from a single column by using its name inside brackets. pandas recognizes the column names in the dataframe, so tab autocompletion can be used when typing out the column name.
world_data['year'].head()
The name of the column and its data type are shown at the bottom.
Remember that the numbers on the left are the index of the dataframe,
which was added by pandas
upon importing the data.
You could also select a column with the dot-syntax world_data.year,
but using brackets is clearer,
so this tutorial will stick to that.
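For reference, the dot-syntax gives the same result as the bracket selection above.

world_data.year.head()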
To select multiple columns,
the column names can be passed as a list inside the brackets
(so there will be double brackets,
one for the dataframe indexing and one for the list).
world_data[['country', 'year']].head()
The output is displayed a bit differently this time.
The reason is that when there was only one column pandas
technically returned a Series
,
not a Dataframe
.
This can be confirmed by using type
as previously.
type(world_data['year'])
type(world_data[['country', 'year']])
Every column in a dataframe is a Series
and pandas glues them together to form a Dataframe
.
There can be performance benefits to working with Series
,
but pandas often takes care of conversions between these two object types under the hood,
so this introductory tutorial will not make any further distinction between a Series
and a Dataframe
.
Many of the analysis techniques used here will apply to both series and dataframes.
Selecting with single brackets ([]
) is a shortcut for common operations,
such as selecting columns by labels as above.
For more flexible and robust row and column selection,
the more verbose loc[<rows>, <columns>]
syntax can be used
(.loc
stands for "location").
world_data.loc[[0, 2, 4], ['country', 'year']]
# Although methods usually have trailing parentheses,
# square brackets are used with `loc[]` to stay
# consistent with the indexing with square brackets in general in Python
# (e.g. lists and Numpy arrays)
A single number can be selected,
which returns that value (an integer in this case),
rather than a Dataframe
or Series
with one value.
world_data.loc[4, 'year']
type(world_data.loc[4, 'year'])
To select all rows,
but only a subset of columns,
the colon character (:
) can be used.
world_data.loc[:, ['country', 'year']].head() # head() is used to limit the length of the output
The same syntax can be used to select all columns, but only a subset of rows.
world_data.loc[[3, 4], :]
When selecting all columns,
the :
could be left out as a convenience.
world_data.loc[[3, 4]]
It is also possible to select slices of rows and column labels.
world_data.loc[2:4, 'country':'region']
It is important to realize that loc[]
selects rows and columns by their labels.
To instead select by row or column position,
use iloc[]
(integer location).
world_data.iloc[[2, 3, 4], [0, 1, 2]]
The index of world_data
consists of consecutive integers,
so in this case selecting from the index by labels or position will return the same rows.
As will be shown later,
an index could also consist of text names,
just like the columns.
While selecting slices by label is inclusive of both the start and end, selecting slices by position is inclusive of the start but exclusive of the end position, just like when slicing in lists.
world_data.iloc[2:5, :4] # `iloc[2:5]` gives the same result as `loc[2:4]` above
Selecting slices of row positions is a common operation, and has thus been given a shortcut syntax with single brackets.
world_data[2:5]
Challenge

1. Extract the 200th and 201st row of the world_data dataset and assign the resulting dataframe to a new variable name (world_data_200_201). Remember that Python indexing starts at 0!
2. How can you get the same result as from world_data.head() by using row slices instead of the head() method?
3. There are at least three distinct ways to extract the last row of the dataframe. Which can you find?
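One possible set of solutions (remember that loc[] slices are inclusive of the end label, while position slices exclude the end):

# Challenge solutions
# 1. The 200th and 201st rows sit at index labels 199 and 200
world_data_200_201 = world_data.loc[199:200]

# 2. A row slice with single brackets excludes the end position
world_data[0:5]

# 3. Three ways of extracting the last row
world_data.tail(1)
world_data.iloc[[-1]]
world_data[-1:]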
The describe()
method was mentioned above
as a way of retrieving summary statistics of a dataframe.
Together with info()
and head()
,
this is often a good place to start exploratory data analysis,
as it gives a helpful overview of the numeric variables in the dataset.
world_data.describe()
A common next step would be to plot the data to explore relationships between different variables, but before getting into plotting in the next lecture, we will elaborate on the dataframe object and several of its common operations.
An often desired operation is to select a subset of rows matching a criterion, e.g. which observations have a life expectancy above 83 years. To do this, the "greater than" comparison operator that was introduced previously can be used to filter the relevant rows.
world_data['life_expectancy'] > 83
The result is a boolean array with one value for every row in the dataframe
indicating whether it is True
or False
that this row has a value above 83 in the column life_expectancy
.
To find out how many observations there are matching this condition,
the sum()
method can be used
since each True
will be 1
and each False
will be 0
.
above_83_bool = world_data['life_expectancy'] > 83
above_83_bool.sum()
Instead of assigning to the intermediate variable above_83_bool
,
we can use methods directly on the resulting boolean series
by surrounding it with parentheses.
(world_data['life_expectancy'] > 83).sum()
The boolean array can be used to select only those rows from the dataframe that meet the specified condition.
world_data[world_data['life_expectancy'] > 83]
As before, this can be combined with selection of a particular set of columns.
world_data.loc[world_data['life_expectancy'] > 83, ['country', 'year', 'life_expectancy']]
A single expression can be used to filter for several criteria,
either matching all criteria with the &
operator,
or any criteria with the |
.
These special operators are used instead of and
and or
to make sure that the comparison occurs for each row in the dataframe.
Parentheses are added to indicate the priority of the comparisons.
world_data.loc[(world_data['sub_region'] == 'Northern Europe') & (world_data['year'] == 1879), ['sub_region', 'country', 'year']]
To increase readability, long statements can be put on multiple rows. Anything that is within parentheses or brackets can be continued on the next row. When inside a bracket or parenthesis, the indentation is not significant to the Python interpreter, but it is recommended to align code in meaningful ways, to make it more readable.
world_data.loc[(world_data['sub_region'] == 'Northern Europe') &
(world_data['year'] == 1879),
['sub_region', 'country', 'year']]
Above,
we assumed that 'Northern Europe'
was a value within the sub_region
column.
When we don't know which values exist in a column,
the unique()
method can reveal them.
world_data['sub_region'].unique()
With the |
operator, rows matching either of the supplied criteria are returned.
world_data.loc[(world_data['year'] == 1800) |
(world_data['year'] == 1801) ,
['country', 'year']].head()
Additional useful ways of subsetting the data include between()
,
which checks if a numerical value is within a given range,
and isin()
,
which checks if a value is contained in a given list.
# `unique` is used to show that only the relevant items are returned
world_data.loc[world_data['year'].between(2000, 2015), 'year'].unique()
world_data.loc[world_data['region'].isin(['Africa', 'Asia', 'Americas']), 'region'].unique()
A frequent operation when working with data is to create new columns based on the values in existing columns. For example, to find the total income in a country, we could multiply the income per person by the population:
world_data['population_income'] = world_data['income'] * world_data['population']
world_data[['population', 'income', 'population_income']].head()
Challenge

1. Subset world_data to include observations from 1995 to 2001. Check that the dimensions of the resulting dataframe are 1253 x 15.
2. Subset the data to include only observations from year 2000 and onwards, from all regions except 'Asia', and retain only the columns country, year, and sub_region. The dimensions of the resulting dataframe should be 2508 x 3.
# Challenge solutions
# 1.
world_data.loc[world_data['year'].between(1995, 2001)].shape
# 2.
world_data.loc[(world_data['year'] >= 2000) &
(world_data['region'] != 'Asia'),
['country', 'year', 'sub_region']].shape
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some operation on each group, and combine the results into a single table.
Image credit Jake VanderPlas
pandas facilitates this workflow through the use of groupby()
to split data,
and summary/aggregation functions such as mean()
,
which collapses each group into a single-row summary of that group.
When the mean is computed,
the default behavior is to ignore NA values.
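As a minimal illustration of this NA handling (using a small made-up series, not the Gapminder data):

import numpy as np

# The NaN is ignored: (1.0 + 3.0) / 2 = 2.0
pd.Series([1.0, np.nan, 3.0]).mean()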
The arguments to groupby()
are column names that reference categorical variables
by which the summary statistics should be calculated.
world_data.groupby('region')['population'].sum()
The output is a series that is indexed with the grouped variable (the region) and the result of the aggregation (the total population) as the values.
These population numbers are abnormally high because the summary was made over all the years in the dataframe, instead of for a single year. To view only the data from one year, we can use what we learnt previously to filter the dataframe for observations in 2018 only. Compare these results to the picture in the world ignorance survey that placed 4 billion people in Asia and 1 billion in each of the other regions.
world_data_2018 = world_data.loc[world_data['year'] == 2018]
world_data_2018.groupby('region')['population'].sum()
Individual countries can be selected from the resulting series using loc[]
.
region_pop = world_data_2018.groupby('region')['population'].sum()
region_pop.loc[['Asia', 'Europe']]
As a shortcut,
loc[]
can be omitted when indexing a series.
This is similar to selecting columns from a dataframe with just []
.
region_pop[['Asia', 'Europe']]
This indexing can be used to normalize the population numbers to the region of interest.
region_pop_2018 = world_data_2018.groupby('region')['population'].sum()
region_pop_2018 / region_pop_2018['Europe']
There are six times as many people living in Asia as in Europe.
Groups can also be created from multiple columns, e.g. it could be interesting to compare how densely populated countries are on average in different income brackets around the world.
world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()
Note that income_group
is an ordinal variable,
i.e. a categorical variable with an inherent order to it.
pandas has not listed the values of that variable in the order we would expect
(low, lower-middle, upper-middle, high).
The order of a categorical variable can be specified in the dataframe,
using the top level pandas function Categorical()
.
# Reassign in the main dataframe since we will use more than just the 2018 data later
world_data['income_group'] = (
pd.Categorical(world_data['income_group'], ordered=True,
categories=['Low', 'Lower middle', 'Upper middle', 'High'])
)
# Need to recreate the 2018 dataframe since the categorical was changed in the main frame
world_data_2018 = world_data.loc[world_data['year'] == 2018]
world_data_2018['income_group'].dtype
world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()
Now the values appear in the order we would expect. The value for Asia in the high income bracket looks suspiciously high. It would be interesting to see which countries were averaged to that value.
world_data_2018.loc[(world_data_2018['region'] == 'Asia') &
                    (world_data_2018['income_group'] == 'High'),
['country', 'pop_density']]
Extreme values, such as the city-state Singapore, can heavily skew averages and it could be a good idea to use a more robust statistics such as the median instead.
world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()
The returned series has an index that is a combination of the columns region
and income_group,
and is referred to as a MultiIndex.
The same syntax as previously can be used
to select rows on the outermost level (here, the region).
med_density_2018 = world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()
med_density_2018[['Africa', 'Americas']]
To select specific values from both levels of the MultiIndex
,
a list of tuples can be passed to loc[]
.
med_density_2018.loc[[('Africa', 'High'), ('Americas', 'High')]]
To select only the low income values from all regions,
the xs()
(cross section) method can be used.
med_density_2018.xs('Low', level='income_group')
The names and values of the index levels can be seen by inspecting the index object.
med_density_2018.index
Although MultiIndexes
offer succinct and fast ways to access data,
they also require memorization of additional syntax
and are, strictly speaking, not essential unless speed is of particular concern.
It can therefore be easier to reset the index,
so that all values are stored in columns.
med_density_2018_res = med_density_2018.reset_index()
med_density_2018_res
After resetting the index,
the same comparison syntax introduced earlier can be used instead of xs()
or passing lists of tuples to loc[]
.
med_density_2018_low = med_density_2018_res.loc[med_density_2018_res['income_group'] == 'Low']
med_density_2018_low
reset_index()
grants the freedom of not having to work with indexes,
but it is still worth keeping in mind that selecting on an index level with xs()
can be orders of magnitude faster than using boolean comparisons on large dataframes.
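In a notebook, the %timeit magic offers a rough way to check this claim (a sketch; exact timings vary, and the difference mainly shows on large dataframes):

%timeit med_density_2018.xs('Low', level='income_group')
%timeit med_density_2018_res.loc[med_density_2018_res['income_group'] == 'Low']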
The opposite operation, creating an index from existing columns,
can be performed with set_index()
on any column (or combination of columns) that creates an index with unique values.
med_density_2018_low.set_index(['region', 'income_group'])
Challenge

1. What is the highest population density in each region?
2. The low income group for the Americas had the same population density for both the mean and the median. This could mean that there are few observations in this group. List all the low income countries in the Americas.
# Challenge solutions
# 1.
world_data_2018.groupby('region')['pop_density'].max()
# 2.
world_data_2018.loc[(world_data_2018['region'] == 'Americas') & (world_data_2018['income_group'] == 'Low'), ['country', 'pop_density']]
Since the same grouped dataframe will be used in multiple code chunks below, we can assign it to a new variable instead of typing out the grouping expression each time.
grouped_world_data = world_data_2018.groupby(['region', 'sub_region'])
grouped_world_data['life_expectancy'].mean()
Instead of using the mean()
or sum()
methods directly,
the more general agg()
method could be called
to aggregate by any existing aggregation functions.
The equivalent to the mean()
method would be to call agg()
and specify 'mean'
.
grouped_world_data['life_expectancy'].agg('mean')
This general approach is more flexible and powerful,
since multiple aggregation functions can be applied in the same line of code
by passing them as a list to agg()
.
For instance,
the standard deviation and mean could be computed in the same call:
grouped_world_data['life_expectancy'].agg(['mean', 'std'])
The returned output is in this case a dataframe,
and the row MultiIndex
is indicated in bold font.
By passing a dictionary to .agg()
it is possible to apply different aggregations to the different columns.
Long code statements can be broken down into multiple lines
if they are enclosed by parentheses, brackets, or braces,
something that will be described in detail later.
grouped_world_data[['population', 'income']].agg(
{'population': 'sum',
'income': ['min', 'median', 'max']
}
)
There are plenty of aggregation methods available in pandas
(e.g. sem
, mad
, sum
),
most of which can be seen at the end of this section in the pandas
documentation,
or explored using tab-complete on the grouped dataframe.
# This is a side note if there are issues with tab completion
# Tab completion might only work like this:
# find_agg_methods = grouped_world_data['weight']
# find_agg_methods.<tab>
Even if a function is not part of the pandas
library,
it can be passed to agg()
.
import numpy as np
grouped_world_data['pop_density'].agg(np.mean)
Any function can be passed like this, including functions you create yourself.
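For example, a custom function computing the range (the difference between the highest and lowest value) of each group could be passed to agg(). The function name value_range is our own invention here, not part of pandas:

# A custom aggregation function: the spread of values within each group
def value_range(series):
    return series.max() - series.min()

grouped_world_data['pop_density'].agg(value_range)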
Challenge

1. What's the mean life expectancy for each income group in 2018?
2. What's the min, median, and max life expectancies for each income group within each region?
# Challenge solutions
# 1.
world_data_2018.groupby('income_group')['life_expectancy'].mean()
# 2.
world_data_2018.groupby(['region', 'income_group'])['life_expectancy'].agg(['min', 'median', 'max'])
size() to summarize categorical data

When working with data,
we commonly want to know the number of observations present for each categorical variable.
For this,
pandas provides the size()
method.
For example,
to find the number of observations per region
(in this case unique countries during year 2018):
world_data_2018.groupby('region').size()
size()
can also be used when grouping on multiple variables.
world_data_2018.groupby(['region', 'income_group']).size()
If there are many groups,
size()
is not that useful on its own.
For example,
it is difficult to quickly find the five sub-regions with the most countries.
world_data_2018.groupby('sub_region').size()
Since there are many rows in this output,
it would be beneficial to sort the table values and display the largest groups first.
This is easy to do with the sort_values()
method.
world_data_2018.groupby('sub_region').size().sort_values()
That's better, but it would be even more helpful to display the largest groups on top. In other words, the output should be arranged in descending order.
world_data_2018.groupby('sub_region').size().sort_values(ascending=False).head(5)
Looks good!
By now, the code statement has grown quite long because many methods have been chained together. It can be tricky to keep track of what is going on in long method chains. To make the code more readable, it can be broken up into multiple lines by adding a surrounding parenthesis.
(world_data_2018
.groupby('sub_region')
.size()
.sort_values(ascending=False)
.head(5)
)
This looks neater and makes long method chains easier to read. There is no absolute rule for when to break code into multiple lines, but always try to write code that is easy for collaborators to understand. Remember that your most common collaborator is a future version of yourself!
pandas has a convenience method for returning the top results, so the values don't need to be sorted explicitly.
(world_data_2018
.groupby(['sub_region'])
.size()
.nlargest() # the default is 5
)
To include more attributes about these countries,
add those columns to groupby()
.
(world_data_2018
.groupby(['region', 'sub_region'])
.size()
.nlargest() # the default is 5
)
Challenge

1. How many countries are there in each income group worldwide?
2. Assign the variable name world_data_2015 to a dataframe containing only the values from year 2015 (e.g. the same way as world_data_2018 was created).
3. a. For those countries where women went to school longer than men, how many are there in each income group?
   b. Do the same as above but for countries where men went to school longer than women. What does this distribution tell you?
# Challenge solutions
# 1.
world_data_2018.groupby('income_group').size()
# 2
world_data_2015 = world_data.loc[world_data['year'] == 2015]
# 3a
world_data_2015.loc[world_data_2015['years_in_school_men'] < world_data_2015['years_in_school_women']].groupby('income_group').size()
# 3b
world_data_2015.loc[world_data_2015['years_in_school_men'] > world_data_2015['years_in_school_women']].groupby('income_group').size()
dropna()
removes both explicit NaN
values
and values that pandas assumed to be NaN,
,
such as the non-numeric values in the life_expectancy column.
world_data_2018.dropna()
Instead of dropping observations that have NaN values in any column,
a subset of columns can be considered.
a subset of columns can be considered.
world_data_2018.dropna(subset=['life_expectancy'])
Non-numeric values can also be coerced into explicit NaN
values
via the to_numeric()
top level function.
pd.to_numeric(world_data_2018['life_expectancy'], errors='coerce')
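To see how many values could not be converted, the resulting NaN values could be counted (a hypothetical check on the coerced series):

# Count values that could not be converted to numbers
pd.to_numeric(world_data_2018['life_expectancy'], errors='coerce').isna().sum()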