Data analyses with pandas dataframes


Learning objectives

  • Describe what a dataframe is.
  • Load external data from a .csv file into a dataframe with pandas.
  • Summarize the contents of a dataframe with pandas.
  • Learn to use dataframe methods and attributes such as loc[], head(), info(), describe(), shape, columns, and index.
  • Learn to clean dirty data.
  • Understand the split-apply-combine concept for data analysis.
    • Use groupby(), mean(), agg() and size() to apply this technique.

Lesson outline

  • Manipulating and analyzing data with pandas
    • Data set background (10 min)
    • What are dataframes (15 min)
    • Data wrangling with pandas (40 min)
  • Cleaning data (20 min)
  • Split-apply-combine techniques in pandas
    • Using mean() to summarize categorical data (20 min)
    • Using size() to summarize categorical data (15 min)

Dataset background

Today, we will be working with real data about the world combined from multiple sources by the Gapminder foundation. Gapminder is an independent Swedish organization that fights devastating misconceptions about global development. They also promote a fact-based world view through the production of free teaching and data exploration resources. Insights from the Gapminder data sources have been popularized through the efforts of public health professor Hans Rosling. It is highly recommended to check out his entertaining videos, most famously The best stats you have ever seen. Before we start exploring the data, we recommend taking this 5-10 min quiz to see how knowledgeable (or ignorant) you are about the world. Then we will learn how to dive deeper into this data using Python!

The dataset we will be working with contains statistics about each country in the world over time. It is stored as a comma separated value (CSV) file. Each row holds the observations for a single country during a single year, and the columns represent:

Column                 Description
country                Country name
year                   Year of observation
population             Population in the country at each year
region                 Continent the country belongs to
sub_region             Sub regions as defined by the UN
income_group           Income group as specified by the World Bank
life_expectancy        The average number of years a newborn child would live
                       if mortality patterns were to stay the same
income                 GDP per capita (in USD) adjusted
                       for differences in purchasing power
children_per_woman     Number of children born to each woman
child_mortality        Deaths of children under 5 years of age per 1000 live births
pop_density            Average number of people per km2
co2_per_capita         CO2 emissions from fossil fuels (tonnes per capita)
years_in_school_men    Average number of years attending primary, secondary,
                       and tertiary school for 25-36 years old men
years_in_school_women  Average number of years attending primary, secondary,
                       and tertiary school for 25-36 years old women

To read the data into Python, we are going to use a function called read_csv from the Python package pandas. As mentioned previously, Python packages are a bit like phone apps: they are not essential to the core Python library, but provide domain-specific functionality. To use a package, it first needs to be imported.

In [1]:
# pandas is given the nickname `pd`
import pandas as pd

pandas can read CSV-files saved on the computer or directly from a URL. Here, we read data that we have compiled from Gapminder and uploaded to our GitHub repository.

In [3]:
url = 'https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv'
world_data = pd.read_csv(url)

To view the dataframe that pandas created, type world_data in a cell and run it, just as when viewing the content of any variable in Python.

In [5]:
world_data
Out[5]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
0 Afghanistan 1800 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
1 Afghanistan 1801 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
2 Afghanistan 1802 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
3 Afghanistan 1803 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
4 Afghanistan 1804 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
5 Afghanistan 1805 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
6 Afghanistan 1806 3280000 Asia Southern Asia Low 28.1 603 7.00 470.0 NaN NaN NaN NaN
7 Afghanistan 1807 3280000 Asia Southern Asia Low 28.1 603 7.00 470.0 NaN NaN NaN NaN
8 Afghanistan 1808 3280000 Asia Southern Asia Low 28.1 603 7.00 470.0 NaN NaN NaN NaN
9 Afghanistan 1809 3280000 Asia Southern Asia Low 28.1 603 7.00 470.0 NaN NaN NaN NaN
10 Afghanistan 1810 3280000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
11 Afghanistan 1811 3280000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
12 Afghanistan 1812 3280000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
13 Afghanistan 1813 3280000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
14 Afghanistan 1814 3290000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
15 Afghanistan 1815 3290000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
16 Afghanistan 1816 3300000 Asia Southern Asia Low 28.1 604 7.00 471.0 NaN NaN NaN NaN
17 Afghanistan 1817 3300000 Asia Southern Asia Low 28.0 604 7.00 471.0 NaN NaN NaN NaN
18 Afghanistan 1818 3310000 Asia Southern Asia Low 28.0 604 7.00 471.0 NaN NaN NaN NaN
19 Afghanistan 1819 3320000 Asia Southern Asia Low 28.0 604 7.00 471.0 NaN NaN NaN NaN
20 Afghanistan 1820 3320000 Asia Southern Asia Low 28.0 604 7.00 471.0 NaN NaN NaN NaN
21 Afghanistan 1821 3330000 Asia Southern Asia Low 28.0 607 7.00 471.0 NaN NaN NaN NaN
22 Afghanistan 1822 3340000 Asia Southern Asia Low 28.0 609 7.00 471.0 NaN NaN NaN NaN
23 Afghanistan 1823 3350000 Asia Southern Asia Low 28.0 611 7.00 471.0 NaN NaN NaN NaN
24 Afghanistan 1824 3360000 Asia Southern Asia Low 28.0 613 7.00 471.0 NaN NaN NaN NaN
25 Afghanistan 1825 3380000 Asia Southern Asia Low 27.9 615 7.00 471.0 NaN NaN NaN NaN
26 Afghanistan 1826 3390000 Asia Southern Asia Low 27.9 617 7.00 473.0 NaN NaN NaN NaN
27 Afghanistan 1827 3400000 Asia Southern Asia Low 27.9 619 7.00 473.0 NaN NaN NaN NaN
28 Afghanistan 1828 3420000 Asia Southern Asia Low 27.9 621 7.00 473.0 NaN NaN NaN NaN
29 Afghanistan 1829 3430000 Asia Southern Asia Low 27.9 623 7.00 473.0 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38952 Zimbabwe 1989 9900000 Africa Sub-Saharan Africa Low 62.7 2490 5.37 73.9 25.6 1.630 7.61 6.01
38953 Zimbabwe 1990 10200000 Africa Sub-Saharan Africa Low 61.7 2590 5.18 75.2 26.3 1.540 7.74 6.16
38954 Zimbabwe 1991 10400000 Africa Sub-Saharan Africa Low 61.0 2670 5.00 77.4 27.0 1.530 7.88 6.31
38955 Zimbabwe 1992 10700000 Africa Sub-Saharan Africa Low 59.4 2370 4.84 80.2 27.6 1.590 8.01 6.46
38956 Zimbabwe 1993 10900000 Africa Sub-Saharan Africa Low 57.6 2350 4.69 83.4 28.2 1.500 8.14 6.61
38957 Zimbabwe 1994 11100000 Africa Sub-Saharan Africa Low 55.8 2520 4.56 86.8 28.7 1.600 8.28 6.76
38958 Zimbabwe 1995 11300000 Africa Sub-Saharan Africa Low 53.7 2480 4.43 90.1 29.3 1.340 8.41 6.92
38959 Zimbabwe 1996 11500000 Africa Sub-Saharan Africa Low 52.2 2690 4.33 92.8 29.8 1.300 8.54 7.07
38960 Zimbabwe 1997 11700000 Africa Sub-Saharan Africa Low 50.8 2710 4.24 94.7 30.3 1.230 8.67 7.23
38961 Zimbabwe 1998 11900000 Africa Sub-Saharan Africa Low 49.1 2750 4.16 95.9 30.7 1.200 8.80 7.39
38962 Zimbabwe 1999 12100000 Africa Sub-Saharan Africa Low 47.8 2690 4.10 96.4 31.2 1.310 8.93 7.55
38963 Zimbabwe 2000 12200000 Africa Sub-Saharan Africa Low 46.7 2570 4.06 96.8 31.6 1.140 9.07 7.71
38964 Zimbabwe 2001 12400000 Africa Sub-Saharan Africa Low 46.2 2580 4.02 97.1 32.0 1.020 9.20 7.87
38965 Zimbabwe 2002 12500000 Africa Sub-Saharan Africa Low 45.6 2320 4.00 97.7 32.3 0.957 9.33 8.03
38966 Zimbabwe 2003 12600000 Africa Sub-Saharan Africa Low 45.3 1910 3.99 98.2 32.7 0.843 9.47 8.20
38967 Zimbabwe 2004 12800000 Africa Sub-Saharan Africa Low 45.1 1780 3.98 99.0 33.0 0.742 9.60 8.36
38968 Zimbabwe 2005 12900000 Africa Sub-Saharan Africa Low 45.3 1650 3.99 99.7 33.4 0.832 9.73 8.53
38969 Zimbabwe 2006 13100000 Africa Sub-Saharan Africa Low 45.7 1580 3.99 100.0 33.9 0.796 9.87 8.69
38970 Zimbabwe 2007 13300000 Africa Sub-Saharan Africa Low 46.4 1490 4.00 100.0 34.5 0.742 10.00 8.86
38971 Zimbabwe 2008 13600000 Africa Sub-Saharan Africa Low 46.7 1210 4.01 98.0 35.0 0.573 10.10 9.03
38972 Zimbabwe 2009 13800000 Africa Sub-Saharan Africa Low 47.5 1290 4.02 94.9 35.7 0.406 10.30 9.19
38973 Zimbabwe 2010 14100000 Africa Sub-Saharan Africa Low 49.6 1460 4.03 89.9 36.4 0.552 10.40 9.36
38974 Zimbabwe 2011 14400000 Africa Sub-Saharan Africa Low 51.9 1660 4.02 83.8 37.2 0.665 10.50 9.53
38975 Zimbabwe 2012 14700000 Africa Sub-Saharan Africa Low 54.1 1850 4.00 76.0 38.0 0.530 10.70 9.70
38976 Zimbabwe 2013 15100000 Africa Sub-Saharan Africa Low 55.6 1900 3.96 70.0 38.9 0.776 10.80 9.86
38977 Zimbabwe 2014 15400000 Africa Sub-Saharan Africa Low 57.0 1910 3.90 64.3 39.8 0.780 10.90 10.00
38978 Zimbabwe 2015 15800000 Africa Sub-Saharan Africa Low 58.3 1890 3.84 59.9 40.8 NaN 11.10 10.20
38979 Zimbabwe 2016 16200000 Africa Sub-Saharan Africa Low 59.3 1860 3.76 56.4 41.7 NaN NaN NaN
38980 Zimbabwe 2017 16500000 Africa Sub-Saharan Africa Low 59.8 1910 3.68 56.8 42.7 NaN NaN NaN
38981 Zimbabwe 2018 16900000 Africa Sub-Saharan Africa Low 60.2 1950 3.61 55.5 43.7 NaN NaN NaN

38982 rows × 14 columns

This is how a dataframe is displayed in the Jupyter notebook. The Jupyter notebook displays pandas dataframes in a tabular format, and adds cosmetic conveniences such as the bold font type for the column and row names, the alternating grey and white zebra stripes for the rows, and highlighting of the row the mouse pointer hovers over. The increasing numbers on the far left are the dataframe's index, or row names. These are not present in the CSV-file, but were added by pandas to easily distinguish between the rows.

What are dataframes?

A dataframe is the representation of data in a tabular format, similar to how data is often arranged in spreadsheets. The data is rectangular, meaning that all rows have the same number of columns and all columns have the same number of rows. As mentioned in the previous lectures, when our data is arranged in a tidy format, the columns can be referred to as the "features" or "variables" of the data, while each row represents an individual "observation". Dataframes are the standard data structure for most tabular data, and what we will use for data wrangling, statistics and plotting. A dataframe can be created by hand, but most commonly they are generated by an input function, such as read_csv(), when importing spreadsheet data from your hard drive (or the web).
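
As a minimal sketch of the "by hand" route (with made-up values), a dictionary mapping column names to lists of values can be passed to the DataFrame constructor:

In [ ]:
# Each dictionary key becomes a column name and each list a column of values
pd.DataFrame({'country': ['Sweden', 'Canada'],
              'population': [10000000, 37000000]})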

As can be seen above, the default is to display the first and last 30 rows and truncate everything in between, as indicated by the ellipsis (...). If we wanted to display only the first five rows, we could use the head() method.

In [6]:
world_data.head()
Out[6]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
0 Afghanistan 1800 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
1 Afghanistan 1801 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
2 Afghanistan 1802 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
3 Afghanistan 1803 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
4 Afghanistan 1804 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN

Methods are very similar to functions; the main difference is that they belong to an object (the method head() belongs to the dataframe world_data). Methods operate on the object they belong to, which is why we can call the method with empty parentheses, without any arguments. Compare this with the function type() that was introduced previously.

In [7]:
type(world_data)
Out[7]:
pandas.core.frame.DataFrame

Here, the world_data variable is explicitly passed as an argument to type(). An immediately tangible advantage of methods is that they simplify tab completion: just type the name of the dataframe, a period, and then hit tab to see all the relevant methods for that dataframe, instead of fumbling around with all the available functions in Python (there's quite a few!) and figuring out which ones operate on dataframes and which do not. Methods also facilitate readability when chaining many operations together, which will be shown in detail later.

The columns in a dataframe can contain data of different types, e.g. integers, floats, and objects (which include strings, lists, dictionaries, and more). General information about the dataframe (including the column data types) can be obtained with the info() method.

In [8]:
world_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38982 entries, 0 to 38981
Data columns (total 14 columns):
country                  38982 non-null object
year                     38982 non-null int64
population               38982 non-null int64
region                   38982 non-null object
sub_region               38982 non-null object
income_group             38982 non-null object
life_expectancy          38982 non-null float64
income                   38982 non-null int64
children_per_woman       38982 non-null float64
child_mortality          38980 non-null float64
pop_density              12282 non-null float64
co2_per_capita           16285 non-null float64
years_in_school_men      8188 non-null float64
years_in_school_women    8188 non-null float64
dtypes: float64(7), int64(3), object(4)
memory usage: 4.2+ MB

The information includes the total number of rows and columns, the number of non-null observations, the column data types, and the memory (RAM) usage. The number of non-null observations is not the same for all columns, which means that some columns contain null (or NA) values, indicating that there is missing data for some observations. The column data type indicates which type of data is stored in that column, and corresponds approximately to the following schema:

  • Categorical/Qualitative
    • Nominal (labels, e.g. 'red', 'green', 'blue')
      • object, category
    • Ordinal (labels with order, e.g. 'Jan', 'Feb', 'Mar')
      • object, category, int
    • Binary (only two outcomes, e.g. True or False)
      • bool
  • Quantitative/Numerical
    • Discrete (whole numbers, often counting, e.g. number of children)
      • int
    • Continuous (measured values with decimals, e.g. weight)
      • float

Note that an object column can contain different types, e.g. str or list. Also note that there can be exceptions to the schema above, but it is a useful general guide.
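
To see these types for our data without the rest of the info() output, the dtypes attribute lists the data type of each column:

In [ ]:
# The data type of every column; a single column's type is
# available via e.g. `world_data['year'].dtype`
world_data.dtypes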

After reading the data into a dataframe, head() and info() are two of the most useful methods for getting an idea of its structure. There are a few additional methods that can facilitate the understanding of what a dataframe contains:

  • Content:

    • world_data.head(n) - shows the first n rows
    • world_data.tail(n) - shows the last n rows
  • Summary:

    • world_data.info() - the column names, their data types and number of non-null observations, and the memory consumption
    • world_data.describe() - summary statistics for each column

The suffixed parentheses indicate that the method is being called, which means that there is a computation carried out when we execute the code. Parameters can be put inside this parentheses to change the behavior of the method. For example, head(10) tells the head() method to show the first ten rows of the dataframe, instead of the default first five.
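
For example (output not shown here):

In [ ]:
# Show the first ten rows instead of the default five
world_data.head(10)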

In addition to methods that compute values on demand, dataframes can also have pre-calculated values stored with the same dot-syntax. Values stored like this are often frequently accessed, and it saves time to store the value directly instead of recomputing it every time it is needed. For example, every time pandas creates a dataframe, the number of rows and columns is computed and stored in the shape attribute. Some useful pre-computed values are shown below.

  • Names:

    • world_data.columns - the names of the columns
    • world_data.index - the names of the rows (referred to as the index in pandas)
  • Size:

    • world_data.shape - the number of rows and columns stored as a tuple
    • world_data.shape[0] - the number of rows
    • world_data.shape[1] - the number of columns

In shape[0], the [0] part uses indexing to access the first element of the tuple; this is not the same as passing a number to head(), which changes how a computation happens. Generally, anything accessible via the dot-syntax is an attribute of the dataframe (including methods).
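
Trying a couple of these on our dataframe (print() is used so that more than the last value is displayed):

In [ ]:
print(world_data.shape)    # the number of rows and columns as a tuple
print(world_data.shape[0]) # just the number of rows
print(world_data.columns)  # the column names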

Challenge

Based on the output of world_data.info(), can you answer the following questions?

  • What is the class of the object world_data?
  • How many rows and how many columns are in this object?
  • Why is there not the same number of rows (observations) for each column?

Saving dataframes locally

When using data from an online source, it is good practice to keep a copy stored locally on your computer in case you want to do offline analyses, the online version of the file changes, or the file is taken down. To save a local copy, the data could be downloaded manually or the current world_data dataframe could be saved to disk as a CSV-file with to_csv().

In [ ]:
world_data.to_csv('world-data.csv', index=False)
# `index=False` because the index (the numbered row names)
# was generated automatically when pandas loaded the file
# and this information does not need to be saved

Since the data is now saved locally, the next time this notebook is opened, it could be loaded from the local path instead of downloading it from the URL.

In [10]:
world_data = pd.read_csv('world-data.csv')
world_data.head()
Out[10]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
0 Afghanistan 1800 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
1 Afghanistan 1801 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
2 Afghanistan 1802 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
3 Afghanistan 1803 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
4 Afghanistan 1804 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN

Indexing and subsetting dataframes

The world_data dataframe has rows and columns, which means it has two dimensions. We can "subset" the dataframe and extract data only from a single column by using its name inside brackets. pandas recognizes the column names in the dataframe, so tab autocompletion can be used when typing out the column name.

In [11]:
world_data['year'].head()
Out[11]:
0    1800
1    1801
2    1802
3    1803
4    1804
Name: year, dtype: int64

The name of the column and its data type are shown at the bottom. Remember that the numbers on the left are the index of the dataframe, which was added by pandas upon importing the data. You could also select a column with the dot-syntax world_data.year, but using brackets is clearer, so this tutorial will stick to that. To select multiple columns, the column names can be passed as a list inside the brackets (so there will be double brackets: one for the dataframe indexing and one for the list).

In [13]:
world_data[['country', 'year']].head()
Out[13]:
country year
0 Afghanistan 1800
1 Afghanistan 1801
2 Afghanistan 1802
3 Afghanistan 1803
4 Afghanistan 1804

The output is displayed a bit differently this time. The reason is that when there was only one column, pandas technically returned a Series, not a DataFrame. This can be confirmed by using type() as previously.

In [14]:
type(world_data['year'])
Out[14]:
pandas.core.series.Series
In [15]:
type(world_data[['country', 'year']])
Out[15]:
pandas.core.frame.DataFrame

Every column in a dataframe is a Series, and pandas glues them together to form a DataFrame. There can be performance benefits to working with Series, but pandas often takes care of conversions between these two object types under the hood, so this introductory tutorial will not make any further distinction between a Series and a DataFrame. Many of the analysis techniques used here will apply to both Series and dataframes.
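
If an explicit conversion is ever needed, a Series has a to_frame() method that wraps it in a single-column dataframe; a minimal illustration:

In [ ]:
# Wrap a single-column Series in a DataFrame explicitly
type(world_data['year'].to_frame())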

Selecting with single brackets ([]) is a shortcut for common operations, such as selecting columns by labels as above. For more flexible and robust row and column selection, the more verbose loc[<rows>, <columns>] syntax can be used (.loc stands for "location").

In [ ]:
world_data.loc[[0, 2, 4], ['country', 'year']]
# Although methods usually have trailing parentheses,
# square brackets are used with `loc[]` to stay
# consistent with square-bracket indexing in general in Python
# (e.g. lists and NumPy arrays)

A single value can be selected, which returns just that value (an integer in this case), rather than a DataFrame or Series with one element.

In [17]:
world_data.loc[4, 'year']
Out[17]:
1804
In [18]:
type(world_data.loc[4, 'year'])
Out[18]:
numpy.int64

To select all rows, but only a subset of columns, the colon character (:) can be used.

In [19]:
world_data.loc[:, ['country', 'year']].head() # head() is used to limit the length of the output
Out[19]:
country year
0 Afghanistan 1800
1 Afghanistan 1801
2 Afghanistan 1802
3 Afghanistan 1803
4 Afghanistan 1804

The same syntax can be used to select all columns, but only a subset of rows.

In [20]:
world_data.loc[[3, 4], :]
Out[20]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
3 Afghanistan 1803 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
4 Afghanistan 1804 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN

When selecting all columns, the : could be left out as a convenience.

In [21]:
world_data.loc[[3, 4]]
Out[21]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
3 Afghanistan 1803 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
4 Afghanistan 1804 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN

It is also possible to select slices of rows and column labels.

In [22]:
world_data.loc[2:4, 'country':'region']
Out[22]:
country year population region
2 Afghanistan 1802 3280000 Asia
3 Afghanistan 1803 3280000 Asia
4 Afghanistan 1804 3280000 Asia

It is important to realize that loc[] selects rows and columns by their labels. To instead select by row or column position, use iloc[] (integer location).

In [23]:
world_data.iloc[[2, 3, 4], [0, 1, 2]]
Out[23]:
country year population
2 Afghanistan 1802 3280000
3 Afghanistan 1803 3280000
4 Afghanistan 1804 3280000

The index of world_data consists of consecutive integers, so in this case selecting from the index by labels or position will return the same rows. As will be shown later, an index could also consist of text names, just like the columns.

While selecting slices by label is inclusive of both the start and end, selecting slices by position is inclusive of the start but exclusive of the end position, just like when slicing in lists.

In [24]:
world_data.iloc[2:5, :4] # `iloc[2:5]` gives the same result as `loc[2:4]` above
Out[24]:
country year population region
2 Afghanistan 1802 3280000 Asia
3 Afghanistan 1803 3280000 Asia
4 Afghanistan 1804 3280000 Asia

Selecting slices of row positions is a common operation, and has thus been given a shortcut syntax with single brackets.

In [25]:
world_data[2:5]
Out[25]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
2 Afghanistan 1802 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
3 Afghanistan 1803 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN
4 Afghanistan 1804 3280000 Asia Southern Asia Low 28.2 603 7.0 469.0 NaN NaN NaN NaN

Challenge

  1. Extract the 200th and 201st row of the world_data dataset and assign the resulting dataframe to a new variable name (world_data_200_201). Remember that Python indexing starts at 0!

  2. How can you get the same result as from world_data.head() by using row slices instead of the head() method?

  3. There are at least three distinct ways to extract the last row of the dataframe. Which can you find?
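
One possible set of solutions is sketched below (for the last question there are more valid approaches than the ones shown):

In [ ]:
# Possible challenge solutions

# 1. With positional indexing starting at 0, the 200th and 201st
# rows sit at positions 199 and 200
world_data_200_201 = world_data[199:201]

# 2. A row slice equivalent to `head()`
world_data[0:5]

# 3. Three ways of extracting the last row
world_data.tail(1)
world_data[-1:]
world_data.iloc[[-1]]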

Filtering observations

The describe() method was mentioned above as a way of retrieving summary statistics of a dataframe. Together with info() and head(), this is often a good place to start exploratory data analysis, as it gives a helpful overview of the numeric variables in the data set.

In [26]:
world_data.describe()
Out[26]:
year population life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
count 38982.000000 3.898200e+04 38982.000000 38982.000000 38982.000000 38980.000000 12282.000000 16285.000000 8188.000000 8188.000000
mean 1909.000000 1.422075e+07 43.073468 4527.128033 5.384391 292.050891 120.900572 3.236894 7.681019 6.948334
std 63.220006 6.722423e+07 16.219216 9753.116041 1.642597 161.562290 382.454242 6.079257 3.185983 3.876399
min 1800.000000 1.250000e+04 1.000000 247.000000 1.120000 1.950000 0.502000 0.000000 0.900000 0.210000
25% 1854.000000 5.060000e+05 31.200000 876.000000 4.550000 141.000000 14.800000 0.188000 5.160000 3.620000
50% 1909.000000 2.140000e+06 35.500000 1450.000000 5.910000 361.000000 46.000000 0.944000 7.650000 6.980000
75% 1964.000000 6.870000e+06 55.600000 3520.000000 6.630000 420.000000 110.000000 4.020000 10.100000 9.980000
max 2018.000000 1.420000e+09 84.200000 178000.000000 8.870000 756.000000 8270.000000 101.000000 15.300000 15.700000

A common next step would be to plot the data to explore relationships between different variables, but before getting into plotting in the next lecture, we will elaborate on the dataframe object and several of its common operations.

An often desired operation is to select a subset of rows matching a criterion, e.g. which observations have a life expectancy above 83 years. To do this, the "greater than" comparison operator that was introduced previously can be used to filter the relevant rows.

In [27]:
world_data['life_expectancy'] > 83
Out[27]:
0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
38952    False
38953    False
38954    False
38955    False
38956    False
38957    False
38958    False
38959    False
38960    False
38961    False
38962    False
38963    False
38964    False
38965    False
38966    False
38967    False
38968    False
38969    False
38970    False
38971    False
38972    False
38973    False
38974    False
38975    False
38976    False
38977    False
38978    False
38979    False
38980    False
38981    False
Name: life_expectancy, Length: 38982, dtype: bool

The result is a boolean array with one value for every row in the dataframe indicating whether it is True or False that this row has a value above 83 in the column life_expectancy. To find out how many observations there are matching this condition, the sum() method can be used, since each True will be 1 and each False will be 0.

In [28]:
above_83_bool = world_data['life_expectancy'] > 83
above_83_bool.sum()
Out[28]:
20

Instead of assigning to the intermediate variable above_83_bool, we can use methods directly on the resulting boolean series by surrounding it with parentheses.

In [29]:
(world_data['life_expectancy'] > 83).sum()
Out[29]:
20

The boolean array can be used to select only those rows from the dataframe that meet the specified condition.

In [30]:
world_data[world_data['life_expectancy'] > 83]
Out[30]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
17513 Japan 2012 128000000 Asia Eastern Asia High 83.2 36400 1.40 3.00 352.0 9.58 14.8 15.2
17514 Japan 2013 128000000 Asia Eastern Asia High 83.4 37100 1.42 2.90 352.0 9.71 14.9 15.3
17515 Japan 2014 128000000 Asia Eastern Asia High 83.6 37300 1.43 2.80 352.0 9.47 15.0 15.4
17516 Japan 2015 128000000 Asia Eastern Asia High 83.8 37800 1.44 3.00 351.0 NaN 15.1 15.5
17517 Japan 2016 128000000 Asia Eastern Asia High 83.9 38200 1.46 2.70 350.0 NaN NaN NaN
17518 Japan 2017 127000000 Asia Eastern Asia High 84.0 38600 1.47 2.83 350.0 NaN NaN NaN
17519 Japan 2018 127000000 Asia Eastern Asia High 84.2 39100 1.48 2.76 349.0 NaN NaN NaN
30653 Singapore 2012 5270000 Asia South-eastern Asia High 83.2 76000 1.26 2.80 7530.0 6.90 13.6 13.3
30654 Singapore 2013 5360000 Asia South-eastern Asia High 83.2 78500 1.25 2.70 7660.0 10.40 13.7 13.5
30655 Singapore 2014 5450000 Asia South-eastern Asia High 83.4 80300 1.25 2.70 7780.0 10.30 13.8 13.7
30656 Singapore 2015 5540000 Asia South-eastern Asia High 83.6 80900 1.24 2.70 7910.0 NaN 14.0 13.8
30657 Singapore 2016 5620000 Asia South-eastern Asia High 83.7 81400 1.25 2.80 8030.0 NaN NaN NaN
30658 Singapore 2017 5710000 Asia South-eastern Asia High 83.8 82600 1.25 2.58 8160.0 NaN NaN NaN
30659 Singapore 2018 5790000 Asia South-eastern Asia High 84.0 83900 1.26 2.52 8270.0 NaN NaN NaN
32410 Spain 2017 46400000 Europe Southern Europe High 83.1 34000 1.38 3.14 92.9 NaN NaN NaN
32411 Spain 2018 46400000 Europe Southern Europe High 83.2 34700 1.39 3.02 93.0 NaN NaN NaN
33722 Switzerland 2015 8320000 Europe Western Europe High 83.1 56500 1.54 4.10 211.0 NaN 14.6 14.4
33723 Switzerland 2016 8400000 Europe Western Europe High 83.1 56600 1.55 4.10 213.0 NaN NaN NaN
33724 Switzerland 2017 8480000 Europe Western Europe High 83.3 56900 1.55 3.86 214.0 NaN NaN NaN
33725 Switzerland 2018 8540000 Europe Western Europe High 83.5 57100 1.55 3.75 216.0 NaN NaN NaN

As before, this can be combined with selection of a particular set of columns.

In [31]:
world_data.loc[world_data['life_expectancy'] > 83, ['country', 'year', 'life_expectancy']]
Out[31]:
country year life_expectancy
17513 Japan 2012 83.2
17514 Japan 2013 83.4
17515 Japan 2014 83.6
17516 Japan 2015 83.8
17517 Japan 2016 83.9
17518 Japan 2017 84.0
17519 Japan 2018 84.2
30653 Singapore 2012 83.2
30654 Singapore 2013 83.2
30655 Singapore 2014 83.4
30656 Singapore 2015 83.6
30657 Singapore 2016 83.7
30658 Singapore 2017 83.8
30659 Singapore 2018 84.0
32410 Spain 2017 83.1
32411 Spain 2018 83.2
33722 Switzerland 2015 83.1
33723 Switzerland 2016 83.1
33724 Switzerland 2017 83.3
33725 Switzerland 2018 83.5

A single expression can be used to filter for several criteria, either matching all criteria with the & operator, or any of the criteria with the | operator. These special operators are used instead of and and or to make sure that the comparison occurs separately for each row in the dataframe. Parentheses are added to indicate the priority of the comparisons.

In [33]:
world_data.loc[(world_data['sub_region'] == 'Northern Europe') & (world_data['year'] == 1879), ['sub_region', 'country', 'year']]
Out[33]:
sub_region country year
9496 Northern Europe Denmark 1879
11248 Northern Europe Estonia 1879
11905 Northern Europe Finland 1879
15409 Northern Europe Iceland 1879
16504 Northern Europe Ireland 1879
19132 Northern Europe Latvia 1879
20227 Northern Europe Lithuania 1879
25921 Northern Europe Norway 1879
33367 Northern Europe Sweden 1879
36871 Northern Europe United Kingdom 1879

To increase readability, long statements can be put on multiple rows. Anything that is within parentheses or brackets can be continued on the next row. When inside a bracket or parenthesis, the indentation is not significant to the Python interpreter, but it is recommended to align code in meaningful ways, to make it more readable.

In [ ]:
world_data.loc[(world_data['sub_region'] == 'Northern Europe') &
               (world_data['year'] == 1879),
               ['sub_region', 'country', 'year']]

Above, we assumed that 'Northern Europe' was a value within the sub_region column. When we don't know which values exist in a column, the unique() method can reveal them.

In [34]:
world_data['sub_region'].unique()
Out[34]:
array(['Southern Asia', 'Southern Europe', 'Northern Africa',
       'Sub-Saharan Africa', 'Latin America and the Caribbean',
       'Western Asia', 'Australia and New Zealand', 'Western Europe',
       'Eastern Europe', 'South-eastern Asia', 'Northern America',
       'Eastern Asia', 'Northern Europe', 'Melanesia', 'Central Asia',
       'Micronesia', 'Polynesia'], dtype=object)

With the | operator, rows matching either of the supplied criteria are returned.

In [ ]:
world_data.loc[(world_data['year'] == 1800) |
               (world_data['year'] == 1801) ,
               ['country', 'year']].head()

Additional useful ways of subsetting the data include between(), which checks whether a numerical value falls within a given range, and isin(), which checks whether a value is contained in a given list.

In [ ]:
# `unique` is used to show that only the relevant items are returned
world_data.loc[world_data['year'].between(2000, 2015), 'year'].unique()
In [37]:
world_data.loc[world_data['region'].isin(['Africa', 'Asia', 'Americas']), 'region'].unique()
Out[37]:
array(['Asia', 'Africa', 'Americas'], dtype=object)

Creating new columns

A frequent operation when working with data is to create new columns based on the values in existing columns. For example, to find the total income in a country, we could multiply the income per person by the population:

In [38]:
world_data['population_income'] = world_data['income'] * world_data['population']
world_data[['population', 'income', 'population_income']].head()
Out[38]:
population income population_income
0 3280000 603 1977840000
1 3280000 603 1977840000
2 3280000 603 1977840000
3 3280000 603 1977840000
4 3280000 603 1977840000

Challenge

  1. Subset world_data to include observations from 1995 to 2001. Check that the dimensions of the resulting dataframe are 1246 x 15.

  2. Subset the data to include only observations from year 2000 onwards, from all regions except 'Asia', and retain only the columns country, year, and sub_region. The dimensions of the resulting dataframe should be 2489 x 3.

In [39]:
# Challenge solutions

# 1.
world_data.loc[world_data['year'].between(1995, 2001)].shape

# 2.
world_data.loc[(world_data['year'] >= 2000) &
               (world_data['region'] != 'Asia'),
               ['country', 'year', 'sub_region']].shape
Out[39]:
(2489, 3)

Split-apply-combine techniques in pandas

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some operation on each group, and combine the results into a single table.

Image credit Jake VanderPlas

pandas facilitates this workflow through the use of groupby() to split data, and summary/aggregation functions such as mean(), which collapses each group into a single-row summary of that group. When the mean is computed, the default behavior is to ignore NA values. The arguments to groupby() are column names that reference categorical variables by which the summary statistics should be calculated.

In [55]:
world_data.groupby('region')['population'].sum()
Out[55]:
region
Africa       59192998600
Americas     63837885500
Asia        330133218800
Europe       98766930400
Oceania       2422277600
Name: population, dtype: int64

The output is a series indexed by the grouped variable (the region), with the result of the aggregation (the total population) as the values.

These population numbers are abnormally high because the summary was made over all the years in the dataframe, instead of for a single year. To view only the data from this year, we can use what we learnt previously to filter the dataframe for observations in 2018 only. Compare these results to the picture in the world ignorance survey that placed 4 billion people in Asia and 1 billion in each of the other regions.

In [56]:
world_data_2018 = world_data.loc[world_data['year'] == 2018]
world_data_2018.groupby('region')['population'].sum()
Out[56]:
region
Africa      1286388200
Americas    1010688000
Asia        4514211000
Europe       742109000
Oceania       40212000
Name: population, dtype: int64

Individual regions can be selected from the resulting series using loc[].

In [57]:
pop_by_region = world_data_2018.groupby('region')['population'].sum()
pop_by_region.loc[['Asia', 'Europe']]
Out[57]:
region
Asia      4514211000
Europe     742109000
Name: population, dtype: int64

As a shortcut, loc[] can be omitted when indexing a series. This is similar to selecting columns from a dataframe with just [].

In [58]:
pop_by_region[['Asia', 'Europe']]
Out[58]:
region
Asia      4514211000
Europe     742109000
Name: population, dtype: int64

This indexing can be used to normalize the population numbers to the region of interest.

In [59]:
region_pop_2018 = world_data_2018.groupby('region')['population'].sum()
region_pop_2018 / region_pop_2018['Europe']
Out[59]:
region
Africa      1.733422
Americas    1.361913
Asia        6.082949
Europe      1.000000
Oceania     0.054186
Name: population, dtype: float64

There are six times as many people living in Asia as in Europe.

Groups can also be created from multiple columns, e.g. it could be interesting to compare how densely populated countries are on average in different income brackets around the world.

In [60]:
world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()
Out[60]:
region    income_group
Africa    High             207.000000
          Low              118.640741
          Lower middle      69.331250
          Upper middle      94.457500
Americas  High             136.426000
          Low              403.000000
          Lower middle     113.950000
          Upper middle      92.931875
Asia      High            1121.654545
          Low              115.866667
          Lower middle     262.606471
          Upper middle     235.447692
Europe    High             176.563214
          Lower middle      99.500000
          Upper middle      67.832222
Oceania   High              10.610000
          Lower middle      52.500000
          Upper middle      90.266667
Name: pop_density, dtype: float64

Note that income_group is an ordinal variable, i.e. a categorical variable with an inherent order to it. pandas has not listed the values of that variable in the order we would expect (low, lower-middle, upper-middle, high). The order of a categorical variable can be specified in the dataframe, using the top level pandas function Categorical().

In [61]:
# Reassign in the main dataframe since we will use more than just the 2018 data later
world_data['income_group'] = (
    pd.Categorical(world_data['income_group'], ordered=True,
                   categories=['Low', 'Lower middle', 'Upper middle', 'High'])
)

# Need to recreate the 2018 dataframe since the categorical was changed in the main frame
world_data_2018 = world_data.loc[world_data['year'] == 2018]
world_data_2018['income_group'].dtype
Out[61]:
CategoricalDtype(categories=['Low', 'Lower middle', 'Upper middle', 'High'], ordered=True)
In [62]:
world_data_2018.groupby(['region', 'income_group'])['pop_density'].mean()
Out[62]:
region    income_group
Africa    Low              118.640741
          Lower middle      69.331250
          Upper middle      94.457500
          High             207.000000
Americas  Low              403.000000
          Lower middle     113.950000
          Upper middle      92.931875
          High             136.426000
Asia      Low              115.866667
          Lower middle     262.606471
          Upper middle     235.447692
          High            1121.654545
Europe    Lower middle      99.500000
          Upper middle      67.832222
          High             176.563214
Oceania   Lower middle      52.500000
          Upper middle      90.266667
          High              10.610000
Name: pop_density, dtype: float64

Now the values appear in the order we would expect. The value for Asia in the high income bracket looks suspiciously high. It would be interesting to see which countries were averaged to that value.

In [63]:
world_data_2018.loc[(world_data_2018['region'] == 'Asia') &
                    (world_data_2018['income_group'] == 'High'),
                    ['country', 'pop_density']]
Out[63]:
country pop_density
2627 Bahrain 2060.0
9197 Cyprus 129.0
16862 Israel 391.0
17519 Japan 349.0
18614 Kuwait 236.0
26279 Oman 15.6
28469 Qatar 232.0
29564 Saudi Arabia 15.6
30659 Singapore 8270.0
31973 South Korea 526.0
36791 United Arab Emirates 114.0

Extreme values, such as the city-state Singapore, can heavily skew averages, so it could be a good idea to use a more robust statistic, such as the median, instead.

In [64]:
world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()
Out[64]:
region    income_group
Africa    Low              66.70
          Lower middle     74.75
          Upper middle     12.81
          High            207.00
Americas  Low             403.00
          Lower middle     68.20
          Upper middle     55.95
          High             37.80
Asia      Low              82.35
          Lower middle     92.00
          Upper middle    106.00
          High            236.00
Europe    Lower middle     99.50
          Upper middle     68.70
          High            109.50
Oceania   Lower middle     22.70
          Upper middle     69.90
          High             10.61
Name: pop_density, dtype: float64

The returned series has an index that is a combination of the columns region and income_group, referred to as a MultiIndex. The same syntax as previously can be used to select rows on the region level.

In [65]:
med_density_2018 = world_data_2018.groupby(['region', 'income_group'])['pop_density'].median()
med_density_2018[['Africa', 'Americas']]
Out[65]:
region    income_group
Africa    Low              66.70
          Lower middle     74.75
          Upper middle     12.81
          High            207.00
Americas  Low             403.00
          Lower middle     68.20
          Upper middle     55.95
          High             37.80
Name: pop_density, dtype: float64

To select specific values from both levels of the MultiIndex, a list of tuples can be passed to loc[].

In [66]:
med_density_2018.loc[[('Africa', 'High'), ('Americas', 'High')]]
Out[66]:
region    income_group
Africa    High            207.0
Americas  High             37.8
Name: pop_density, dtype: float64

To select only the low income values from all regions, the xs() (cross section) method can be used.

In [67]:
med_density_2018.xs('Low', level='income_group')
Out[67]:
region
Africa       66.70
Americas    403.00
Asia         82.35
Name: pop_density, dtype: float64

The names and values of the index levels can be seen by inspecting the index object.

In [68]:
med_density_2018.index
Out[68]:
MultiIndex(levels=[['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'], ['Low', 'Lower middle', 'Upper middle', 'High']],
           labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4], [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3]],
           names=['region', 'income_group'])

Although MultiIndexes offer succinct and fast ways to access data, they also require memorization of additional syntax and are, strictly speaking, not essential unless speed is of particular concern. It can therefore be easier to reset the index, so that all values are stored in columns.

In [69]:
med_density_2018_res = med_density_2018.reset_index()
med_density_2018_res
Out[69]:
region income_group pop_density
0 Africa Low 66.70
1 Africa Lower middle 74.75
2 Africa Upper middle 12.81
3 Africa High 207.00
4 Americas Low 403.00
5 Americas Lower middle 68.20
6 Americas Upper middle 55.95
7 Americas High 37.80
8 Asia Low 82.35
9 Asia Lower middle 92.00
10 Asia Upper middle 106.00
11 Asia High 236.00
12 Europe Lower middle 99.50
13 Europe Upper middle 68.70
14 Europe High 109.50
15 Oceania Lower middle 22.70
16 Oceania Upper middle 69.90
17 Oceania High 10.61

After resetting the index, the same comparison syntax introduced earlier can be used instead of xs() or passing lists of tuples to loc[].

In [70]:
med_density_2018_low = med_density_2018_res.loc[med_density_2018_res['income_group'] == 'Low']
med_density_2018_low
Out[70]:
region income_group pop_density
0 Africa Low 66.70
4 Americas Low 403.00
8 Asia Low 82.35

reset_index() grants the freedom of not having to work with indexes, but it is still worth keeping in mind that selecting on an index level with xs() can be orders of magnitude faster than using boolean comparisons on large dataframes.
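
A rough way to check this yourself is the %timeit magic in Jupyter; the exact numbers will vary with hardware and data size:

In [ ]:
# Compare index-based and boolean-mask selection; on a table this small
# the difference is modest, but it grows with dataframe size
%timeit med_density_2018.xs('Low', level='income_group')
%timeit med_density_2018_res.loc[med_density_2018_res['income_group'] == 'Low']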

The opposite operation, creating an index from existing columns, can be performed with set_index() on any column (or combination of columns) that creates an index with unique values.

In [71]:
med_density_2018_low.set_index(['region', 'income_group'])
Out[71]:
pop_density
region income_group
Africa Low 66.70
Americas Low 403.00
Asia Low 82.35

Challenge

  1. What is the highest population density in each region?

  2. The low income group for the Americas had the same population density for both the mean and the median. This could mean that there are few observations in this group. List all the low income countries in the Americas.

In [72]:
# Challenge solutions

# 1.
world_data_2018.groupby('region')['pop_density'].max()
Out[72]:
region
Africa       625.0
Americas     666.0
Asia        8270.0
Europe      1350.0
Oceania      151.0
Name: pop_density, dtype: float64
In [73]:
# 2.
world_data_2018.loc[(world_data_2018['region'] == 'Americas') &
                    (world_data_2018['income_group'] == 'Low'),
                    ['country', 'pop_density']]
Out[73]:
country pop_density
14891 Haiti 403.0

Multiple aggregations on grouped data

Since the same grouped dataframe will be used in multiple code chunks below, we can assign it to a new variable instead of typing out the grouping expression each time.

In [74]:
grouped_world_data = world_data_2018.groupby(['region', 'sub_region'])
grouped_world_data['life_expectancy'].mean()
Out[74]:
region    sub_region                     
Africa    Northern Africa                    74.716667
          Sub-Saharan Africa                 63.682609
Americas  Latin America and the Caribbean    75.600000
          Northern America                   80.650000
Asia      Central Asia                       71.340000
          Eastern Asia                       76.440000
          South-eastern Asia                 73.630000
          Southern Asia                      72.211111
          Western Asia                       76.122222
Europe    Eastern Europe                     75.110000
          Northern Europe                    80.140000
          Southern Europe                    79.466667
          Western Europe                     82.100000
Oceania   Australia and New Zealand          82.350000
          Melanesia                          63.700000
          Micronesia                         62.200000
          Polynesia                          71.550000
Name: life_expectancy, dtype: float64

Instead of using the mean() or sum() methods directly, the more general agg() method can be called to aggregate with any existing aggregation function. The equivalent of the mean() method would be to call agg() and specify 'mean'.

In [75]:
grouped_world_data['life_expectancy'].agg('mean')
Out[75]:
region    sub_region                     
Africa    Northern Africa                    74.716667
          Sub-Saharan Africa                 63.682609
Americas  Latin America and the Caribbean    75.600000
          Northern America                   80.650000
Asia      Central Asia                       71.340000
          Eastern Asia                       76.440000
          South-eastern Asia                 73.630000
          Southern Asia                      72.211111
          Western Asia                       76.122222
Europe    Eastern Europe                     75.110000
          Northern Europe                    80.140000
          Southern Europe                    79.466667
          Western Europe                     82.100000
Oceania   Australia and New Zealand          82.350000
          Melanesia                          63.700000
          Micronesia                         62.200000
          Polynesia                          71.550000
Name: life_expectancy, dtype: float64

This general approach is more flexible and powerful, since multiple aggregation functions can be applied in the same line of code by passing them as a list to agg(). For instance, the standard deviation and mean could be computed in the same call:

In [76]:
grouped_world_data['life_expectancy'].agg(['mean', 'std'])
Out[76]:
mean std
region sub_region
Africa Northern Africa 74.716667 3.510793
Sub-Saharan Africa 63.682609 4.540108
Americas Latin America and the Caribbean 75.600000 3.721559
Northern America 80.650000 2.192031
Asia Central Asia 71.340000 0.808084
Eastern Asia 76.440000 6.566430
South-eastern Asia 73.630000 4.835298
Southern Asia 72.211111 6.426983
Western Asia 76.122222 4.585214
Europe Eastern Europe 75.110000 2.711478
Northern Europe 80.140000 2.958678
Southern Europe 79.466667 2.694889
Western Europe 82.100000 0.804156
Oceania Australia and New Zealand 82.350000 0.777817
Melanesia 63.700000 1.961292
Micronesia 62.200000 NaN
Polynesia 71.550000 1.202082

The returned output is in this case a dataframe and the column MultiIndex is indicated in bold font.

By passing a dictionary to .agg() it is possible to apply different aggregations to the different columns. As was shown earlier, long code statements can be broken into multiple lines if they are enclosed by parentheses, brackets, or braces.

In [77]:
grouped_world_data[['population', 'income']].agg(
    {'population': 'sum',
     'income': ['min', 'median', 'max']
    }
)
Out[77]:
population income
sum min median max
region sub_region
Africa Northern Africa 237270000 4440 11200 18300
Sub-Saharan Africa 1049118200 629 1985 27500
Americas Latin America and the Caribbean 646688000 1710 13700 30300
Northern America 364000000 43800 49350 54900
Asia Central Asia 71890000 2920 6690 24200
Eastern Asia 1626920000 1390 16000 39100
South-eastern Asia 655870000 1490 7255 83900
Southern Asia 1887261000 1870 6890 17400
Western Asia 272270000 2430 20750 121000
Europe Eastern Europe 291970000 5330 24100 32300
Northern Europe 104478000 25500 43450 65600
Southern Europe 151681000 12100 24050 37900
Western Europe 193980000 39000 45200 99000
Oceania Australia and New Zealand 29550000 36400 41100 45800
Melanesia 10237000 2110 2850 9420
Micronesia 118000 1890 1890 1890
Polynesia 307000 5500 5725 5950

There are plenty of aggregation methods available in pandas (e.g. sem, mad, sum), most of which can be seen at the end of this section in the pandas documentation, or explored using tab-complete on the grouped dataframe.

In [ ]:
# This is a side note if there are issues with tab completion
# Tab completion might only work like this:
# find_agg_methods = grouped_world_data['pop_density']
# find_agg_methods.<tab>

Even if a function is not part of the pandas library, it can be passed to agg().

In [79]:
import numpy as np

grouped_world_data['pop_density'].agg(np.mean)
Out[79]:
region    sub_region                     
Africa    Northern Africa                     50.113333
          Sub-Saharan Africa                 108.143043
Americas  Latin America and the Caribbean    126.558966
          Northern America                    19.880000
Asia      Central Asia                        38.504000
          Eastern Asia                       248.202000
          South-eastern Asia                 961.110000
          Southern Asia                      460.388889
          Western Asia                       298.355556
Europe    Eastern Europe                      88.629000
          Northern Europe                     64.897000
          Southern Europe                    202.166667
          Western Europe                     256.000000
Oceania   Australia and New Zealand           10.610000
          Melanesia                           28.475000
          Micronesia                         146.000000
          Polynesia                          110.450000
Name: pop_density, dtype: float64

Any function can be passed like this, including functions you create yourself.
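
As a minimal sketch, a custom function computing the range of values within each group (value_range is a made-up helper, not part of the lesson material) could be passed to agg() in the same way:

In [ ]:
# A sketch: a user-defined aggregation function passed to agg()
def value_range(series):
    """Return the difference between the largest and smallest value."""
    return series.max() - series.min()

grouped_world_data['life_expectancy'].agg(value_range)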

Challenge

  1. What's the mean life expectancy for each income group in 2018?

  2. What's the min, median, and max life expectancies for each income group within each region?

In [80]:
# Challenge solutions

# 1.
world_data_2018.groupby('income_group')['life_expectancy'].mean()
Out[80]:
income_group
Low             63.744118
Lower middle    69.053488
Upper middle    74.283673
High            79.919231
Name: life_expectancy, dtype: float64
In [81]:
# 2.
world_data_2018.groupby(['region', 'income_group'])['life_expectancy'].agg(['min', 'median', 'max'])
Out[81]:
                          min  median   max
region   income_group
Africa   Low             51.6   62.50  68.3
         Lower middle    51.1   66.35  78.0
         Upper middle    63.5   67.10  77.9
         High            74.2   74.20  74.2
Americas Low             64.5   64.50  64.5
         Lower middle    73.1   74.90  78.7
         Upper middle    68.2   75.80  81.4
         High            73.4   77.60  82.2
Asia     Low             58.7   70.45  72.2
         Lower middle    67.9   71.50  77.8
         Upper middle    68.0   76.50  80.5
         High            76.9   80.70  84.2
Europe   Lower middle    72.3   72.35  72.4
         Upper middle    71.1   75.50  78.0
         High            75.1   81.30  83.5
Oceania  Lower middle    61.1   62.90  64.3
         Upper middle    65.8   70.70  72.4
         High            81.8   82.35  82.9

Additional sections (time permitting)

Using size() to summarize categorical data

When working with data, we commonly want to know the number of observations present for each categorical variable. For this, pandas provides the size() method. For example, to find the number of observations per region (in this case unique countries during year 2018):

In [82]:
world_data_2018.groupby('region').size()
Out[82]:
region
Africa      52
Americas    31
Asia        47
Europe      39
Oceania      9
dtype: int64

size() can also be used when grouping on multiple variables.

In [83]:
world_data_2018.groupby(['region', 'income_group']).size()
Out[83]:
region    income_group
Africa    Low             27
          Lower middle    16
          Upper middle     8
          High             1
Americas  Low              1
          Lower middle     4
          Upper middle    16
          High            10
Asia      Low              6
          Lower middle    17
          Upper middle    13
          High            11
Europe    Lower middle     2
          Upper middle     9
          High            28
Oceania   Lower middle     4
          Upper middle     3
          High             2
dtype: int64
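
As an aside, the multi-level output above can be pivoted into a more compact table with unstack(), which moves the innermost index level into columns; a minimal sketch:

In [ ]:
# A sketch: pivot the income_group level into columns
world_data_2018.groupby(['region', 'income_group']).size().unstack()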

If there are many groups, size() is not that useful on its own. For example, it is difficult to quickly find the five sub-regions with the most countries in the output below.

In [84]:
world_data_2018.groupby('sub_region').size()
Out[84]:
sub_region
Australia and New Zealand           2
Central Asia                        5
Eastern Asia                        5
Eastern Europe                     10
Latin America and the Caribbean    29
Melanesia                           4
Micronesia                          1
Northern Africa                     6
Northern America                    2
Northern Europe                    10
Polynesia                           2
South-eastern Asia                 10
Southern Asia                       9
Southern Europe                    12
Sub-Saharan Africa                 46
Western Asia                       18
Western Europe                      7
dtype: int64

Since there are many rows in this output, it would be beneficial to sort the table values to see which sub-regions contain the most countries. This is easy to do with the sort_values() method.

In [85]:
world_data_2018.groupby('sub_region').size().sort_values()
Out[85]:
sub_region
Micronesia                          1
Australia and New Zealand           2
Polynesia                           2
Northern America                    2
Melanesia                           4
Eastern Asia                        5
Central Asia                        5
Northern Africa                     6
Western Europe                      7
Southern Asia                       9
Northern Europe                    10
South-eastern Asia                 10
Eastern Europe                     10
Southern Europe                    12
Western Asia                       18
Latin America and the Caribbean    29
Sub-Saharan Africa                 46
dtype: int64

That's better, but it would be even more helpful to display the sub-regions with the most countries on top. In other words, the output should be arranged in descending order.

In [86]:
world_data_2018.groupby('sub_region').size().sort_values(ascending=False).head(5)
Out[86]:
sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

Looks good!
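
As an aside, counting the values of a single column can also be done with the value_counts() method, which sorts the counts in descending order by default; a minimal sketch equivalent to the chain above:

In [ ]:
# A sketch: the same counts per sub-region via value_counts()
world_data_2018['sub_region'].value_counts().head(5)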

Method chaining

By now, the code statement has grown quite long because many methods have been chained together. It can be tricky to keep track of what is going on in long method chains. To make the code more readable, it can be broken up into multiple lines by adding surrounding parentheses.

In [87]:
(world_data_2018
     .groupby('sub_region')
     .size()
     .sort_values(ascending=False)
     .head(5)
)
Out[87]:
sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

This looks neater and makes long method chains easier to read. There is no absolute rule for when to break code into multiple lines, but always try to write code that is easy for collaborators to understand. Remember that your most common collaborator is a future version of yourself!

pandas has a convenience method, nlargest(), for returning the largest values, so the values don't need to be sorted explicitly.

In [88]:
(world_data_2018
     .groupby(['sub_region'])
     .size()
     .nlargest()  # the default is 5
)
Out[88]:
sub_region
Sub-Saharan Africa                 46
Latin America and the Caribbean    29
Western Asia                       18
Southern Europe                    12
Eastern Europe                     10
dtype: int64

To include more attributes about these sub-regions, such as the region they belong to, add those columns to groupby().

In [89]:
(world_data_2018
     .groupby(['region', 'sub_region'])
     .size()
     .nlargest()  # the default is 5
)
Out[89]:
region    sub_region                     
Africa    Sub-Saharan Africa                 46
Americas  Latin America and the Caribbean    29
Asia      Western Asia                       18
Europe    Southern Europe                    12
Asia      South-eastern Asia                 10
dtype: int64
Before the challenge below, let's remind ourselves which columns are available by viewing the first few rows of the full dataframe.

In [90]:
world_data.head()
Out[90]:
       country  year  population region     sub_region income_group  \
0  Afghanistan  1800     3280000   Asia  Southern Asia          Low
1  Afghanistan  1801     3280000   Asia  Southern Asia          Low
2  Afghanistan  1802     3280000   Asia  Southern Asia          Low
3  Afghanistan  1803     3280000   Asia  Southern Asia          Low
4  Afghanistan  1804     3280000   Asia  Southern Asia          Low

   life_expectancy  income  children_per_woman  child_mortality  pop_density  \
0             28.2     603                 7.0            469.0          NaN
1             28.2     603                 7.0            469.0          NaN
2             28.2     603                 7.0            469.0          NaN
3             28.2     603                 7.0            469.0          NaN
4             28.2     603                 7.0            469.0          NaN

   co2_per_capita  years_in_school_men  years_in_school_women  \
0             NaN                  NaN                    NaN
1             NaN                  NaN                    NaN
2             NaN                  NaN                    NaN
3             NaN                  NaN                    NaN
4             NaN                  NaN                    NaN

   population_income
0         1977840000
1         1977840000
2         1977840000
3         1977840000
4         1977840000

Challenge

  1. How many countries are there in each income group worldwide?

  2. Assign the variable name world_data_2015 to a dataframe containing only the values from year 2015 (e.g. the same way as world_data_2018 was created).

  3. a. For those countries where women went to school longer than men, how many are there in each income group?
     b. Do the same as above, but for countries where men went to school longer than women. What does this distribution tell you?

In [91]:
# Challenge solutions
# 1.
world_data_2018.groupby('income_group').size()
Out[91]:
income_group
Low             34
Lower middle    43
Upper middle    49
High            52
dtype: int64
In [92]:
# 2
world_data_2015 = world_data.loc[world_data['year'] == 2015]
In [93]:
# 3a
(world_data_2015
     .loc[world_data_2015['years_in_school_men'] < world_data_2015['years_in_school_women']]
     .groupby('income_group')
     .size()
)
Out[93]:
income_group
Low              0
Lower middle    14
Upper middle    33
High            47
dtype: int64
In [94]:
# 3b
(world_data_2015
     .loc[world_data_2015['years_in_school_men'] > world_data_2015['years_in_school_women']]
     .groupby('income_group')
     .size()
)
Out[94]:
income_group
Low             34
Lower middle    29
Upper middle    11
High             5
dtype: int64

Data cleaning tips

dropna() removes rows that contain NaN values, both explicit NaN values and values that pandas inferred to be NaN, such as the non-numeric values in the life_expectancy column.

In [ ]:
world_data_2018.dropna()

Instead of dropping observations that have NaN values in any column, a subset of columns can be considered.

In [ ]:
world_data_2018.dropna(subset=['life_expectancy'])

Non-numeric values can also be coerced into explicit NaN values via the top-level function to_numeric().

In [ ]:
pd.to_numeric(world_data_2018['life_expectancy'], errors='coerce')
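
A common follow-up pattern is to replace the column with its coerced version and then drop the rows that became NaN; a minimal sketch, assuming the cleaning context above (cleaned is a made-up variable name):

In [ ]:
# A sketch: coerce non-numeric entries to NaN, then drop those rows
cleaned = world_data_2018.assign(
    life_expectancy=pd.to_numeric(
        world_data_2018['life_expectancy'], errors='coerce')
).dropna(subset=['life_expectancy'])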