Exploring data graphically


Learning Objectives

  • Learn how to plot with matplotlib.
  • Set universal plot settings.
  • Produce scatter plots, line plots, and histograms using seaborn and matplotlib.
  • Understand how to graphically explore relationships between variables.
  • Apply grids for faceting in seaborn.
  • Use seaborn grids with matplotlib functions.

Lesson outline

  • Data visualization with matplotlib and seaborn (10 min)
    • Intro to plotting with matplotlib
    • Visualizing one quantitative variable with multiple categorical variables (50 min)
    • Visualizing the relationship of two quantitative variables with multiple categorical variables (40 min)

Short review from yesterday

Do you remember how to:

1 - Read in the data from the csv file from yesterday?

In [10]:
import pandas as pd

world_data = pd.read_csv('https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv')

# If saved locally yesterday:
# world_data = pd.read_csv("world_data.csv")

2 - How to select only the columns 'country' and 'year' from the dataframe?

In [12]:
world_data[['country', 'year']].head() #head just to limit output
Out[12]:
       country  year
0  Afghanistan  1800
1  Afghanistan  1801
2  Afghanistan  1802
3  Afghanistan  1803
4  Afghanistan  1804

3 - How to select a few rows together with the columns above?

In [13]:
world_data.loc[[1, 13, 24], ['country', 'year']]
Out[13]:
        country  year
1   Afghanistan  1801
13  Afghanistan  1813
24  Afghanistan  1824

4 - How to select only data from year 1995?

In [14]:
world_data.loc[world_data['year'] == 1995]
Out[14]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
195 Afghanistan 1995 17100000 Asia Southern Asia Low 51.1 881 7.61 150.0 26.20 0.0727 2.56 0.49
414 Albania 1995 3110000 Europe Southern Europe Upper middle 74.1 4130 2.59 32.9 113.00 0.6720 9.31 9.07
633 Algeria 1995 28900000 Africa Northern Africa Upper middle 72.3 9300 3.45 43.3 12.10 3.3000 5.67 4.84
852 Angola 1995 14300000 Africa Sub-Saharan Africa Lower middle 52.0 2970 6.92 223.0 11.40 0.7690 4.89 3.05
1071 Antigua and Barbuda 1995 73600 Americas Latin America and the Caribbean High 74.4 16500 2.21 19.5 167.00 3.7400 10.50 11.40
1290 Argentina 1995 35000000 Americas Latin America and the Caribbean High 73.1 13900 2.76 24.3 12.80 3.6600 9.53 10.00
1509 Armenia 1995 3220000 Asia Western Asia Upper middle 69.3 2170 1.80 38.7 113.00 1.0600 10.10 10.20
1728 Australia 1995 18100000 Oceania Australia and New Zealand High 78.3 30400 1.82 7.0 2.35 15.6000 11.80 11.80
1947 Austria 1995 7990000 Europe Western Europe High 76.7 33700 1.42 6.8 97.00 7.4800 10.90 10.50
2166 Azerbaijan 1995 7780000 Asia Western Asia Upper middle 64.9 3320 2.58 94.1 94.10 4.2900 10.70 10.30
2385 Bahamas 1995 280000 Americas Latin America and the Caribbean High 70.6 22100 2.51 18.7 28.00 6.0100 10.00 10.40
2604 Bahrain 1995 564000 Asia Western Asia High 70.3 43500 3.10 18.1 742.00 26.3000 7.41 7.54
2823 Bangladesh 1995 119000000 Asia Southern Asia Lower middle 61.7 1440 3.73 114.0 912.00 0.1920 4.34 2.75
3042 Barbados 1995 265000 Americas Latin America and the Caribbean High 73.7 12400 1.73 14.7 616.00 3.1300 7.53 7.82
3261 Belarus 1995 10100000 Europe Eastern Europe Upper middle 68.3 5450 1.47 15.7 50.00 5.9900 11.10 11.60
3480 Belgium 1995 10200000 Europe Western Europe High 76.9 32700 1.61 7.6 336.00 11.0000 11.40 11.60
3699 Belize 1995 207000 Americas Latin America and the Caribbean Upper middle 70.7 6210 4.11 29.5 9.07 1.8200 7.17 6.63
3918 Benin 1995 5910000 Africa Sub-Saharan Africa Low 56.5 1520 6.36 158.0 52.40 0.2250 3.62 1.48
4137 Bhutan 1995 515000 Asia Southern Asia Lower middle 62.9 2900 4.60 101.0 13.50 0.4840 4.41 1.78
4356 Bolivia 1995 7570000 Americas Latin America and the Caribbean Lower middle 64.3 4110 4.58 101.0 6.98 1.3000 8.11 6.64
4575 Bosnia and Herzegovina 1995 3840000 Europe Southern Europe Upper middle 68.9 1830 1.71 14.2 75.40 0.8920 8.46 7.82
4794 Botswana 1995 1570000 Africa Sub-Saharan Africa Upper middle 56.4 8900 3.95 71.6 2.77 1.9400 5.11 5.59
5013 Brazil 1995 162000000 Americas Latin America and the Caribbean Upper middle 69.7 11100 2.50 49.1 19.40 1.5900 5.99 6.46
5232 Bulgaria 1995 8380000 Europe Eastern Europe Upper middle 71.0 8450 1.34 19.2 77.20 6.9200 10.70 11.20
5451 Burkina Faso 1995 10100000 Africa Sub-Saharan Africa Low 50.7 869 6.84 195.0 36.90 0.0621 1.86 0.91
5670 Burundi 1995 5960000 Africa Sub-Saharan Africa Low 47.0 870 7.29 169.0 232.00 0.0400 3.45 2.31
5889 Cambodia 1995 10700000 Asia South-eastern Asia Lower middle 58.2 1100 4.69 120.0 60.40 0.1460 4.97 3.36
6108 Cameroon 1995 13500000 Africa Sub-Saharan Africa Lower middle 56.5 2260 5.98 166.0 28.50 0.3140 6.45 4.36
6327 Canada 1995 29300000 Americas Northern America High 78.0 32200 1.64 6.9 3.23 15.9000 13.60 13.60
6546 Central African Republic 1995 3350000 Africa Sub-Saharan Africa Low 46.2 858 5.62 175.0 5.38 0.0700 4.92 2.43
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32607 Sri Lanka 1995 18200000 Asia Southern Asia Lower middle 72.2 4510 2.29 20.3 291.00 0.3240 8.57 8.44
32826 Sudan 1995 24100000 Africa Northern Africa Lower middle 60.4 1960 5.83 120.0 13.70 0.1780 5.70 3.37
33045 Suriname 1995 444000 Americas Latin America and the Caribbean Upper middle 70.3 9620 3.01 39.7 2.84 4.6500 7.25 6.89
33264 Swaziland 1995 961000 Africa Sub-Saharan Africa Lower middle 59.4 5610 4.80 83.7 55.90 0.4730 6.94 6.80
33483 Sweden 1995 8840000 Europe Northern Europe High 78.8 31100 1.73 4.8 21.50 6.2400 12.20 12.40
33702 Switzerland 1995 7020000 Europe Western Europe High 78.5 45900 1.52 6.4 178.00 5.5900 12.20 11.40
33921 Syria 1995 14300000 Asia Western Asia Low 72.1 4890 4.51 29.6 78.10 2.9000 7.18 5.05
34140 Tajikistan 1995 5760000 Asia Central Asia Low 63.9 1270 4.59 119.0 41.20 0.4250 10.50 9.52
34359 Tanzania 1995 30000000 Africa Sub-Saharan Africa Low 52.9 1370 5.88 164.0 33.80 0.1190 5.77 4.57
34578 Thailand 1995 59500000 Asia South-eastern Asia Upper middle 70.9 9380 1.87 29.2 116.00 2.7100 7.30 6.95
34797 Timor-Leste 1995 871000 Asia South-eastern Asia Lower middle 60.9 1560 6.38 139.0 58.60 NaN 5.49 4.09
35016 Togo 1995 4270000 Africa Sub-Saharan Africa Low 57.2 1200 5.76 134.0 78.60 0.2230 5.03 2.32
35235 Tonga 1995 96100 Oceania Polynesia Upper middle 68.4 4260 4.45 18.8 133.00 0.9920 9.61 9.49
35454 Trinidad and Tobago 1995 1260000 Americas Latin America and the Caribbean High 69.3 12900 1.96 28.0 245.00 13.6000 9.81 10.10
35673 Tunisia 1995 9110000 Africa Northern Africa Lower middle 73.0 6130 2.61 44.7 58.70 1.7300 8.20 5.28
35892 Turkey 1995 58500000 Asia Western Asia Upper middle 70.9 12300 2.76 54.8 76.00 2.9400 7.62 5.59
36111 Turkmenistan 1995 4210000 Asia Central Asia Upper middle 63.1 4600 3.51 87.5 8.95 8.0800 11.20 10.90
36330 Uganda 1995 20600000 Africa Sub-Saharan Africa Low 47.0 931 7.02 171.0 103.00 0.0457 5.72 3.69
36549 Ukraine 1995 50900000 Europe Eastern Europe Lower middle 66.6 5060 1.41 20.3 87.90 8.7600 11.20 11.50
36768 United Arab Emirates 1995 2450000 Asia Western Asia High 73.5 102000 3.42 13.1 29.30 28.8000 9.00 9.20
36987 United Kingdom 1995 58000000 Europe Northern Europe High 76.6 28600 1.76 7.2 240.00 9.2800 12.10 12.00
37206 United States 1995 266000000 Americas Northern America High 75.9 39500 1.98 9.5 29.00 19.3000 13.40 13.40
37425 Uruguay 1995 3220000 Americas Latin America and the Caribbean High 73.5 11500 2.40 20.8 18.40 1.4200 8.70 9.31
37644 Uzbekistan 1995 22900000 Asia Central Asia Lower middle 66.2 2240 3.53 70.5 53.70 4.5200 10.50 10.20
37863 Vanuatu 1995 168000 Oceania Melanesia Lower middle 62.3 2610 4.73 30.4 13.80 0.3920 6.70 5.86
38082 Venezuela 1995 22200000 Americas Latin America and the Caribbean Upper middle 73.0 15300 3.08 26.2 25.20 6.0100 7.87 8.22
38301 Vietnam 1995 75200000 Asia South-eastern Asia Lower middle 69.5 2040 2.71 39.0 243.00 0.3870 7.23 6.63
38520 Yemen 1995 15300000 Asia Western Asia Low 60.5 3530 7.53 112.0 29.00 0.6830 4.71 0.95
38739 Zambia 1995 9140000 Africa Sub-Saharan Africa Lower middle 46.5 2030 6.19 177.0 12.30 0.2380 6.73 5.13
38958 Zimbabwe 1995 11300000 Africa Sub-Saharan Africa Low 53.7 2480 4.43 90.1 29.30 1.3400 8.41 6.92

178 rows × 14 columns

5 - Select only the rows where the region is Asia or Africa.

In [15]:
world_data.loc[world_data['region'].isin(['Asia', 'Africa'])]
Out[15]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
0 Afghanistan 1800 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
1 Afghanistan 1801 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
2 Afghanistan 1802 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
3 Afghanistan 1803 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
4 Afghanistan 1804 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
5 Afghanistan 1805 3280000 Asia Southern Asia Low 28.2 603 7.00 469.0 NaN NaN NaN NaN
6 Afghanistan 1806 3280000 Asia Southern Asia Low 28.1 603 7.00 470.0 NaN NaN NaN NaN
7 Afghanistan 1807 3280000 Asia Southern Asia Low 28.1 603 7.00 470.0 NaN NaN NaN NaN
8 Afghanistan 1808 3280000 Asia Southern Asia Low 28.1 603 7.00 470.0 NaN NaN NaN NaN
9 Afghanistan 1809 3280000 Asia Southern Asia Low 28.1 603 7.00 470.0 NaN NaN NaN NaN
10 Afghanistan 1810 3280000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
11 Afghanistan 1811 3280000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
12 Afghanistan 1812 3280000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
13 Afghanistan 1813 3280000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
14 Afghanistan 1814 3290000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
15 Afghanistan 1815 3290000 Asia Southern Asia Low 28.1 604 7.00 470.0 NaN NaN NaN NaN
16 Afghanistan 1816 3300000 Asia Southern Asia Low 28.1 604 7.00 471.0 NaN NaN NaN NaN
17 Afghanistan 1817 3300000 Asia Southern Asia Low 28.0 604 7.00 471.0 NaN NaN NaN NaN
18 Afghanistan 1818 3310000 Asia Southern Asia Low 28.0 604 7.00 471.0 NaN NaN NaN NaN
19 Afghanistan 1819 3320000 Asia Southern Asia Low 28.0 604 7.00 471.0 NaN NaN NaN NaN
20 Afghanistan 1820 3320000 Asia Southern Asia Low 28.0 604 7.00 471.0 NaN NaN NaN NaN
21 Afghanistan 1821 3330000 Asia Southern Asia Low 28.0 607 7.00 471.0 NaN NaN NaN NaN
22 Afghanistan 1822 3340000 Asia Southern Asia Low 28.0 609 7.00 471.0 NaN NaN NaN NaN
23 Afghanistan 1823 3350000 Asia Southern Asia Low 28.0 611 7.00 471.0 NaN NaN NaN NaN
24 Afghanistan 1824 3360000 Asia Southern Asia Low 28.0 613 7.00 471.0 NaN NaN NaN NaN
25 Afghanistan 1825 3380000 Asia Southern Asia Low 27.9 615 7.00 471.0 NaN NaN NaN NaN
26 Afghanistan 1826 3390000 Asia Southern Asia Low 27.9 617 7.00 473.0 NaN NaN NaN NaN
27 Afghanistan 1827 3400000 Asia Southern Asia Low 27.9 619 7.00 473.0 NaN NaN NaN NaN
28 Afghanistan 1828 3420000 Asia Southern Asia Low 27.9 621 7.00 473.0 NaN NaN NaN NaN
29 Afghanistan 1829 3430000 Asia Southern Asia Low 27.9 623 7.00 473.0 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38952 Zimbabwe 1989 9900000 Africa Sub-Saharan Africa Low 62.7 2490 5.37 73.9 25.6 1.630 7.61 6.01
38953 Zimbabwe 1990 10200000 Africa Sub-Saharan Africa Low 61.7 2590 5.18 75.2 26.3 1.540 7.74 6.16
38954 Zimbabwe 1991 10400000 Africa Sub-Saharan Africa Low 61.0 2670 5.00 77.4 27.0 1.530 7.88 6.31
38955 Zimbabwe 1992 10700000 Africa Sub-Saharan Africa Low 59.4 2370 4.84 80.2 27.6 1.590 8.01 6.46
38956 Zimbabwe 1993 10900000 Africa Sub-Saharan Africa Low 57.6 2350 4.69 83.4 28.2 1.500 8.14 6.61
38957 Zimbabwe 1994 11100000 Africa Sub-Saharan Africa Low 55.8 2520 4.56 86.8 28.7 1.600 8.28 6.76
38958 Zimbabwe 1995 11300000 Africa Sub-Saharan Africa Low 53.7 2480 4.43 90.1 29.3 1.340 8.41 6.92
38959 Zimbabwe 1996 11500000 Africa Sub-Saharan Africa Low 52.2 2690 4.33 92.8 29.8 1.300 8.54 7.07
38960 Zimbabwe 1997 11700000 Africa Sub-Saharan Africa Low 50.8 2710 4.24 94.7 30.3 1.230 8.67 7.23
38961 Zimbabwe 1998 11900000 Africa Sub-Saharan Africa Low 49.1 2750 4.16 95.9 30.7 1.200 8.80 7.39
38962 Zimbabwe 1999 12100000 Africa Sub-Saharan Africa Low 47.8 2690 4.10 96.4 31.2 1.310 8.93 7.55
38963 Zimbabwe 2000 12200000 Africa Sub-Saharan Africa Low 46.7 2570 4.06 96.8 31.6 1.140 9.07 7.71
38964 Zimbabwe 2001 12400000 Africa Sub-Saharan Africa Low 46.2 2580 4.02 97.1 32.0 1.020 9.20 7.87
38965 Zimbabwe 2002 12500000 Africa Sub-Saharan Africa Low 45.6 2320 4.00 97.7 32.3 0.957 9.33 8.03
38966 Zimbabwe 2003 12600000 Africa Sub-Saharan Africa Low 45.3 1910 3.99 98.2 32.7 0.843 9.47 8.20
38967 Zimbabwe 2004 12800000 Africa Sub-Saharan Africa Low 45.1 1780 3.98 99.0 33.0 0.742 9.60 8.36
38968 Zimbabwe 2005 12900000 Africa Sub-Saharan Africa Low 45.3 1650 3.99 99.7 33.4 0.832 9.73 8.53
38969 Zimbabwe 2006 13100000 Africa Sub-Saharan Africa Low 45.7 1580 3.99 100.0 33.9 0.796 9.87 8.69
38970 Zimbabwe 2007 13300000 Africa Sub-Saharan Africa Low 46.4 1490 4.00 100.0 34.5 0.742 10.00 8.86
38971 Zimbabwe 2008 13600000 Africa Sub-Saharan Africa Low 46.7 1210 4.01 98.0 35.0 0.573 10.10 9.03
38972 Zimbabwe 2009 13800000 Africa Sub-Saharan Africa Low 47.5 1290 4.02 94.9 35.7 0.406 10.30 9.19
38973 Zimbabwe 2010 14100000 Africa Sub-Saharan Africa Low 49.6 1460 4.03 89.9 36.4 0.552 10.40 9.36
38974 Zimbabwe 2011 14400000 Africa Sub-Saharan Africa Low 51.9 1660 4.02 83.8 37.2 0.665 10.50 9.53
38975 Zimbabwe 2012 14700000 Africa Sub-Saharan Africa Low 54.1 1850 4.00 76.0 38.0 0.530 10.70 9.70
38976 Zimbabwe 2013 15100000 Africa Sub-Saharan Africa Low 55.6 1900 3.96 70.0 38.9 0.776 10.80 9.86
38977 Zimbabwe 2014 15400000 Africa Sub-Saharan Africa Low 57.0 1910 3.90 64.3 39.8 0.780 10.90 10.00
38978 Zimbabwe 2015 15800000 Africa Sub-Saharan Africa Low 58.3 1890 3.84 59.9 40.8 NaN 11.10 10.20
38979 Zimbabwe 2016 16200000 Africa Sub-Saharan Africa Low 59.3 1860 3.76 56.4 41.7 NaN NaN NaN
38980 Zimbabwe 2017 16500000 Africa Sub-Saharan Africa Low 59.8 1910 3.68 56.8 42.7 NaN NaN NaN
38981 Zimbabwe 2018 16900000 Africa Sub-Saharan Africa Low 60.2 1950 3.61 55.5 43.7 NaN NaN NaN

21681 rows × 14 columns

6 - Calculate the total population in each region

In [16]:
world_data.groupby('region')['population'].sum()
Out[16]:
region
Africa       59192998600
Americas     63837885500
Asia        330133218800
Europe       98766930400
Oceania       2422277600
Name: population, dtype: int64

7 - Get the number of countries in each region for the year 2018.

In [17]:
world_data.loc[world_data['year'] == 2018].groupby('region').size()
Out[17]:
region
Africa      52
Americas    31
Asia        47
Europe      39
Oceania      9
dtype: int64

Introduction to plotting

The human visual system is one of the most advanced apparatuses for detecting patterns and it allows for quick exploration of complex visual relationships. Data visualization is therefore a quick, efficient way of unearthing clues to interesting features in the data that can later be investigated in a robust, quantitative manner. Visualizations are also unparalleled in communicating insights drawn from data. For these reasons, it is important to possess the skills to graphically represent the data in a way that is efficient for humans to process.

There are many plotting packages in Python, making it possible to create diverse visualizations such as interactive web graphics, 3D animations, statistical visualizations, and map-based plots. When starting out, it can be helpful to find an example of a plot similar to the one you want to create, and then copy and modify its code. Examples of plots can be found in many excellent online Python plotting galleries, such as those from matplotlib, seaborn, and the Python graph gallery.

Our focus will be on two of the most useful packages for creating publication quality visualizations: matplotlib, which is a robust, detail-oriented, low level plotting interface, and seaborn, which provides high level functions on top of matplotlib and allows the plotting calls to be expressed in terms of what is being explored in the underlying data rather than what graphical elements to add to the plot. The high-level figures created by seaborn can be configured via the matplotlib parameters, so learning these packages in tandem is useful.

In [1]:
%matplotlib inline
# Note that this only needs to be done once, before the first plot in a notebook;
# all subsequent plots will show up as expected.

To facilitate our understanding of plotting concepts, the initial examples here will not include dataframes, but instead have simple lists holding just a few data points. To create a line plot, the plot() function from matplotlib.pyplot can be used.

In [2]:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]
plt.plot(x, y)
Out[2]:
[<matplotlib.lines.Line2D at 0x7f4dc6a43dd8>]

Using plot() like this is not very explicit and a few things happen "under the hood", e.g. a figure is automatically created and it is assumed that the plot should go into the currently active region of this figure. This gives little control over exactly where to place the plots within a figure and how to make modifications to the plot after creating it, e.g. adding a title or labeling the axes.

To facilitate modifications to the plot, it is recommended to use the object oriented plotting interface in matplotlib, where an empty figure and at least one axes object are explicitly created before a plot is added. The figure and its axes objects are assigned to variable names, which are then used for plotting. In matplotlib, an axes object refers to what you would often call a subplot colloquially and it is named "axes" because it consists of an x-axis and a y-axis by default.

In [3]:
fig, ax = plt.subplots()

Calling subplots() returns two objects, the figure and its axes object. Plots can be added to the axes object of the figure by using the name we assigned to the returned axes object (ax by convention).

In [4]:
fig, ax = plt.subplots()
ax.plot(x, y)
Out[4]:
[<matplotlib.lines.Line2D at 0x7f4dc69e1668>]

To create a scatter plot, use scatter() instead of plot().

In [5]:
fig, ax = plt.subplots()
ax.scatter(x, y)
Out[5]:
<matplotlib.collections.PathCollection at 0x7f4dc6912f98>

Plots can also be combined in the same axes. The line style and marker color can be changed to make the elements in the combined plot easier to tell apart.

In [6]:
fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')
Out[6]:
[<matplotlib.lines.Line2D at 0x7f4dc68f9d30>]

And plot elements can be resized.

In [7]:
fig, ax = plt.subplots()
ax.scatter(x, y, color='red', s=100)
ax.plot(x, y, linestyle='dashed', linewidth=3)
Out[7]:
[<matplotlib.lines.Line2D at 0x7f4dc685afd0>]

It is common to modify the plot after creating it, e.g. adding a title or labeling the axes.

In [8]:
fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')

ax.set_title('Line and scatter plot')
ax.set_xlabel('Measurement X')
Out[8]:
Text(0.5,0,'Measurement X')

The scatter and line plots can easily be separated into two subplots within the same figure, by telling plt.subplots to create a figure with one row and two columns (so two subplots side by side). This returns two axes objects, one for each subplot, which we assign to the variable names ax1 and ax2.

In [9]:
fig, (ax1, ax2) = plt.subplots(1, 2)
# The default is (1, 1), that's why it does not need
# to be specified with only one subplot
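
For grids with more than one row, the same call returns a NumPy array of axes objects instead of a flat pair. The following is a minimal sketch of this; the 2 x 2 layout and the subplot titles are illustrative assumptions and not part of the lesson material.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]

# A 2 x 2 grid: `axes` is a NumPy array holding one axes object per subplot
fig, axes = plt.subplots(2, 2, figsize=(6, 6))

# `axes.flat` iterates over the subplots row by row
for i, ax in enumerate(axes.flat):
    ax.plot(x, y)
    ax.set_title(f'Subplot {i + 1}')

fig.tight_layout()
```

A single subplot can also be selected by its row and column position, e.g. axes[0, 1] for the top right one.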

To prevent plot elements, such as the axis tick labels, from overlapping, the tight_layout() method can be used.

In [10]:
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.tight_layout()

The figure size can easily be controlled when it is created.

In [ ]:
# `figsize` refers to the size of the figure in inches when printed
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
fig.tight_layout()

Bringing it all together to separate the line and scatter plot.

In [12]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(x, y, color='red')
ax2.plot(x, y, linestyle='dashed')

ax1.set_title('Scatter plot')
ax2.set_title('Line plot')
fig.tight_layout()

Challenge 1

  1. There is a plethora of colors available to use in matplotlib. Change the color of the line and the dots in the figure using your favorite color from this list.
  2. Use the documentation to change the styling of the line in the line plot and the type of marker used in the scatter plot (you might need to search online to figure this out).

Saving plots

Figures can be saved by calling the savefig() method and specifying the name of the file to create. The resolution of the figure can be controlled by the dpi parameter.

In [ ]:
fig.savefig('scatter-and-line.png', dpi=300)

In the JupyterLab file browser, you can see that a new image file has been created. A PDF file can be saved by changing the extension in the specified file name. Since PDF is a vector file format, a resolution does not need to be specified: lines and text stay sharp at any zoom level.

In [14]:
fig.savefig('scatter-and-line.pdf')
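
As a small check that the file extension alone selects the output format, the sketch below saves one figure twice and reads back the first bytes of each file. The temporary directory and the file names are assumptions made for this illustration, so that nothing is written to the working directory.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 2, 4, 3])

# Write into a temporary directory so the working directory is left untouched
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'example.png')
pdf_path = os.path.join(out_dir, 'example.pdf')

fig.savefig(png_path, dpi=300)  # raster format: dpi sets the pixel resolution
fig.savefig(pdf_path)           # vector format, selected by the .pdf extension

# Read the first bytes of each file to see which format was written
with open(png_path, 'rb') as f:
    png_header = f.read(4)
with open(pdf_path, 'rb') as f:
    pdf_header = f.read(4)
print(png_header, pdf_header)
```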

This concludes the customization section. The concepts taught here will be applied in the next section on how to choose a suitable plot type for data sets with many observations.

Plotting dataframes

If the dataframe from the previous lecture is not loaded, read it in first.

In [ ]:
import pandas as pd

# world_data = pd.read_csv('../world-data-gapminder.csv')
# If not saved to disk yesterday
url = 'https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv'
world_data = pd.read_csv(url)

We can use scatter() with the data parameter to plot columns from the dataframe.

In [16]:
fig, ax = plt.subplots()
ax.scatter(x='year', y='population', data=world_data)
Out[16]:
<matplotlib.collections.PathCollection at 0x7f4dbe565208>

This graph is not immediately intuitive because one dot has been added for every country in every year. To instead see how the world's total population has changed over the years, the populations of all countries need to be summed for each year. This can be done using the dataframe techniques from the previous lecture.

In [17]:
# One could also do `as_index=False` with `groupby()`
world_pop = world_data.groupby('year')['population'].sum().reset_index()

fig, ax = plt.subplots()
ax.scatter(x='year', y='population', data=world_pop)
Out[17]:
<matplotlib.collections.PathCollection at 0x7f4dbde57a90>

This plot shows that the world population has been steadily increasing since the 1800s and dramatically picked up pace in the 1950s.
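
The comment in the code above mentions as_index=False as an alternative to reset_index(). A minimal sketch of the two variants side by side, using a small hypothetical dataframe (toy) in place of world_data:

```python
import pandas as pd

# Hypothetical miniature of the lesson's dataframe
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [1800, 1801, 1800, 1801],
    'population': [10, 11, 20, 22],
})

# Variant used in the lesson: sum per year, then move `year` from the index to a column
pop_reset = toy.groupby('year')['population'].sum().reset_index()

# Alternative: ask groupby() to keep `year` as a regular column from the start
pop_as_index = toy.groupby('year', as_index=False)['population'].sum()

print(pop_reset)
print(pop_as_index)
```

Both variants give a dataframe with the columns year and population, which is the layout that the data parameter of scatter() expects here.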

It is possible to use matplotlib in this way to explore visual relationships in dataframes. However, it is a little cumbersome already with these simple examples, and it gets more complicated once we want to include more variables. For example, stratifying the data in subplots based on region and income level would involve writing double loops and keeping track of plot layout and grouping variables manually. The Python package seaborn is designed for effectively exploring data visually without getting bogged down in technical plotting details.
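
To make the manual bookkeeping concrete, here is a rough sketch of what the matplotlib-only approach could look like. The dataframe toy and its values are hypothetical stand-ins that only borrow the column names from world_data, and the layout choices are one option among many.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature dataset with the same column names as the lesson's dataframe
toy = pd.DataFrame({
    'year': [1800, 1900, 2000] * 4,
    'region': ['Asia'] * 6 + ['Europe'] * 6,
    'income_group': (['Low'] * 3 + ['High'] * 3) * 2,
    'population': [5, 9, 30, 2, 3, 6, 4, 6, 7, 3, 5, 8],
})

regions = sorted(toy['region'].unique())
income_groups = sorted(toy['income_group'].unique())

# One row per region and one column per income group;
# squeeze=False keeps `axes` two-dimensional even for small grids
fig, axes = plt.subplots(len(regions), len(income_groups),
                         sharex=True, sharey=True, squeeze=False, figsize=(6, 5))

# The double loop: subset the data and pick the matching subplot by hand
for row, region in enumerate(regions):
    for col, income_group in enumerate(income_groups):
        subset = toy.loc[(toy['region'] == region)
                         & (toy['income_group'] == income_group)]
        ax = axes[row, col]
        ax.plot(subset['year'], subset['population'])
        ax.set_title(f'{region} | {income_group}')

fig.tight_layout()
```

Every additional grouping variable adds another loop or another lookup, which is the kind of overhead that seaborn takes care of.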

Visual data exploration with seaborn

When visually exploring data with lots of variables, it is in many cases easier to think in terms of what is to be explored in the data, rather than what graphical elements are to be added to the plot. For example, instead of instructing the computer to "go through a dataframe and plot any observations of country X in blue, any observations of country Y in red, etc", it is easier to just type "color the data by country". There are many benefits to using a so-called descriptive syntax instead of an imperative one.

Facilitating semantic mappings of data variables to graphical elements is one of the goals of the seaborn plotting package. Thanks to its functional way of interfacing with data, only minimal changes are required if the underlying data change or if the type of plot used for the visualization is switched. seaborn provides a language that facilitates thinking about data in ways that are conducive to exploratory data analysis and allows for the creation of publication quality plots with minimal adjustments and tweaking.

The seaborn syntax was introduced briefly already in the introductory lecture and it is similar to how matplotlib plots dataframes. For example, to make the same scatter plot as above:

In [18]:
import seaborn as sns

sns.scatterplot(x='year', y='population', data=world_pop)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db4dfee80>

In addition to providing a data-centric syntax, seaborn also facilitates visualization of common statistical aggregations. For example, when creating a line plot in seaborn, the default is to aggregate and average all observations with the same value on the x-axis, and to create a shaded region representing the 95% confidence interval of this average.

In [19]:
sns.lineplot(x='year', y='population', data=world_data)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db46d5e48>

In this case, it would be more appropriate to have the shaded area describe the variation in the data, such as the standard deviation, rather than an inference about the reproducibility, such as the default 95% CI.

In [20]:
sns.lineplot(x='year', y='population', data=world_data, ci='sd')
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4dbde0a668>

To change from showing the average world population per country and year to showing the total population for all countries per year, the estimator parameter can be used. Here, the shaded area is also removed with ci=None.

In [21]:
# The `estimator` parameter is currently non-functional for sns.scatterplot, but will be added soon
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db4341eb8>
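
To make these aggregations concrete, the sketch below computes the per-year mean, standard deviation, and sum with pandas on a small hypothetical dataframe (toy, not the lesson's data). These are the kinds of summaries that the default estimator, ci='sd', and estimator='sum' correspond to; the bootstrapped 95% confidence interval is not reproduced here.

```python
import pandas as pd

# Hypothetical miniature of the lesson's dataframe: two countries, two years
toy = pd.DataFrame({
    'country': ['A', 'B', 'A', 'B'],
    'year': [1800, 1800, 1801, 1801],
    'population': [10, 30, 12, 36],
})

# One row per year: the average over countries, the spread, and the total
per_year = toy.groupby('year')['population'].agg(['mean', 'std', 'sum'])
print(per_year)
```

The mean column is what the default line would follow, while the sum column is the per-year total across all countries.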

Changing graph aesthetics

Before continuing the exploration of the world population data, let's discuss how to customize the appearance of our plots. The object returned by lineplot() is a matplotlib axes, so all configuration available through matplotlib can be applied to it by first assigning it to a variable name (ax by convention).

In [22]:
ax = sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
ax.set_title('World population since the 1800s', fontsize=16)
ax.set_xlabel('Year', fontsize=12)
Out[22]:
Text(0.5,0,'Year')

In addition to all the customization available through the standard matplotlib syntax, seaborn also offers its own functions for changing the appearance of the plots.

In essence, these functions are shortcuts to change several matplotlib parameters simultaneously. For example, a more effective approach than setting individual font sizes or colors of graphical elements is to set the overall size and style for all graphs.

In [ ]:
sns.set(context='talk', style='darkgrid', palette='pastel')
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)

These functions are analogous to making changes in the settings menu of a graphical program, and they apply to all following plots.

Challenge 2

Find out which styles and contexts are available in seaborn. Try some of them out and choose your favorite style and context. Hint: This information is available both through the built-in and the online documentation.

For the rest of this tutorial, the ticks style will be used.

In [24]:
sns.set(context='notebook', style='ticks', font_scale=1.4)
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db224e630>

For styles that include the frame around the plot, there is a special seaborn function to remove the top- and rightmost borders (again by modifying matplotlib parameters under the hood).

In [26]:
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
sns.despine()

If the style options exposed through seaborn are not sufficient, it is possible to change all plot parameters directly through the matplotlib rc and style interfaces.
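
A minimal sketch of these two matplotlib interfaces follows. The specific parameter values and the choice of the 'ggplot' style sheet are arbitrary and only meant as an illustration.

```python
import matplotlib.pyplot as plt

# rc interface: a global change that affects every figure created from now on
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['lines.linewidth'] = 3

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 2, 4, 3])

# style interface: here the style sheet only applies inside the `with` block
with plt.style.context('ggplot'):
    fig_styled, ax_styled = plt.subplots()
    ax_styled.plot([1, 2, 3, 4], [1, 2, 4, 3])

# Names of the style sheets bundled with the installed matplotlib version
print(plt.style.available)
```

Calling plt.rcdefaults() restores matplotlib's default parameters if the global changes are no longer wanted.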

Exploring relationships between two quantitative variables

As mentioned above, the main strength of a descriptive plotting syntax lies in describing the plot appearance in human-friendly vocabulary and having the computer assign variables to graphical objects accordingly. For example, to plot subsets of the data in different colors, the hue parameter can be used.

In [27]:
sns.lineplot(x='year', y='population', hue='income_group',
            data=world_data, ci=None, estimator='sum')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db21721d0>

This stratification of the income groups reveals that the population growth has been the fastest in middle income countries.

The plot can be made more accessible (especially to those with color vision deficiency) by changing the style of each line instead of only relying on color to separate them.

In [28]:
sns.lineplot(x='year', y='population', hue='income_group', style='income_group',
            data=world_data, ci=None, estimator='sum')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db21bed30>

Just like in the previous lecture, the values of the ordinal variable income_group are not listed in an intuitive order. A custom order can easily be specified by passing a list to the hue_order parameter, but this would have to be done for every plot. A more effective approach is to encode the order in the dataframe itself, using the top level pandas function Categorical().

In [29]:
world_data['income_group'] = (
    pd.Categorical(world_data['income_group'], ordered=True,
                   categories=['Low', 'Lower middle', 'Upper middle', 'High'])
)
world_data['income_group'].dtype
Out[29]:
CategoricalDtype(categories=['Low', 'Lower middle', 'Upper middle', 'High'], ordered=True)
In [30]:
sns.lineplot(x='year', y='population', hue='income_group', style='income_group',
             data=world_data, ci=None, estimator='sum')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db20abe48>

The legend now lists the colors in the expected order. This modification also ensures that when making plots with income groups on the x- or y-axis, they will be plotted in the expected order.

Conditioning quantitative relationships on qualitative variables

It is difficult to explore multiple categorical relationships within one single plot. For example, to see how the income groups compare within each region, the hue and style variables could be used for different variables, but this makes the plot dense and difficult to interpret.

In [31]:
sns.lineplot(x='year', y='population', hue='income_group', style='region',
            data=world_data, ci=None, estimator='sum')
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db2148978>

An effective approach for exploring multiple categorical variables in a data set is to plot so-called "small multiples" of the data, where the same type of plot is used for different subsets of the data. These subplots are drawn in rows and columns forming a grid pattern, and can be referred to as a "facet", "lattice" or "trellis" plot.

Visualizing categorical variables in this manner is a key step in exploratory data analysis, and thus seaborn has a dedicated plot function for this, called relplot() (for "relational plot" since it visualizes the relationships between numerical variables). The syntax of relplot() is very similar to that of lineplot(), but we need to specify that the kind of plot we want is a line plot.

In [32]:
# Create the same plot as above
sns.relplot(x='year', y='population', hue='income_group', style='income_group', kind='line',
            data=world_data, ci=None, estimator='sum')
Out[32]:
<seaborn.axisgrid.FacetGrid at 0x7f4db1fe77f0>

The region variable can now be mapped to different facets/subplots in a grid pattern.

In [33]:
# TODO switch this to some more interesting column if I have time
sns.relplot(x='year', y='population', data=world_data, estimator='sum',
            kind='line', hue='income_group', col='region', ci=None)
Out[33]:
<seaborn.axisgrid.FacetGrid at 0x7f4db1f4fd30>

It's a little hard to see because the figure is very wide and has been shrunk to fit in the notebook. To avoid this, relplot() can use the col_wrap parameter to distribute the plots over several rows. The height and aspect parameters can be used to set the height and width of each facet.

In [34]:
sns.relplot(x='year', y='population', data=world_data, estimator='sum',
            kind='line', hue='income_group', col='region', ci=None,
           col_wrap=3, height=2.5, aspect=1.3)
Out[34]:
<seaborn.axisgrid.FacetGrid at 0x7f4db1d1c470>

Faceting the plot by region reveals that the largest absolute population increase occurred among middle income countries in Asia. We will soon take a closer look at which countries these are.

The returned object from relplot() is a grid (a special kind of figure) with many axes, and can therefore not be placed within a preexisting figure. It is saved just as any matplotlib figure with savefig(), but has some special methods for easily changing the aesthetics of each axes.

In [35]:
g = sns.relplot(x='year', y='population', data=world_data, estimator='sum',
                kind='line', hue='income_group', col='region', ci=None,
                col_wrap=3, height=2.5, aspect=1.3)

g.set_titles('{col_name}', y=0.95)
g.set_axis_labels(y_var='Population', x_var='Year')
g.savefig('grid-figure.png')

Remember that names such as fig, ax, and here g, are only by convention, and any variable name could have been used.
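
Since the grid holds ordinary matplotlib axes, regular matplotlib methods can be applied to every facet. The following is a minimal sketch on made-up data (the `toy` dataframe and its values are assumptions, not part of the lesson's data set), looping over the axes of the grid.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two regions
toy = pd.DataFrame({
    'year': [2000, 2001, 2002] * 2,
    'population': [10, 12, 15, 100, 130, 170],
    'region': ['X'] * 3 + ['Y'] * 3,
})

g = sns.relplot(x='year', y='population', col='region', data=toy, height=2.5)

# g.axes is a numpy array of matplotlib axes; .flat iterates over all facets
for ax in g.axes.flat:
    ax.set_yscale('log')  # any matplotlib axes method can be used here
    ax.grid(True)
```

The same loop can be used for other per-axes changes, such as rotating tick labels or adding reference lines.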

We might want the color to indicate income group, but draw separate lines for each country. To do this, we can set units='country' and estimator=None, which means that nothing is aggregated and one line is drawn per country using the raw values.

In [36]:
sns.relplot(x='year', y='population', data=world_data, estimator=None, units='country',
            kind='line', hue='income_group', col='region', ci=None,
           col_wrap=3, height=2.5, aspect=1.3)
Out[36]:
<seaborn.axisgrid.FacetGrid at 0x7f4db19822b0>
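
To illustrate the difference between aggregating and plotting raw values, the sketch below uses pandas on a made-up dataframe (all names and numbers are assumptions for illustration). The grouped sum corresponds roughly to the values drawn with estimator='sum' and hue='income_group', while the unaggregated table corresponds to what estimator=None with units='country' draws.

```python
import pandas as pd

# Hypothetical toy data: three countries, two years
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'income_group': ['Low', 'Low', 'High', 'Low', 'Low', 'High'],
    'year': [2000, 2000, 2000, 2001, 2001, 2001],
    'population': [10, 20, 5, 12, 22, 6],
})

# Aggregated: one value per income group and year (one line per group)
summed = toy.groupby(['income_group', 'year'])['population'].sum()
print(summed)

# Not aggregated: one value per country and year (one line per country)
per_country = toy.set_index(['country', 'year'])['population']
print(per_country)
```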

Two countries in Asia stand out in terms of total population. To find out which ones they are, we can filter the data.

In [37]:
world_data.loc[world_data['year'] == 2018].nlargest(8, 'population')
Out[37]:
country year population region sub_region income_group life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women
7226 China 2018 1420000000 Asia Eastern Asia Upper middle 76.9 16000 1.64 9.95 151.0 NaN NaN NaN
15767 India 2018 1350000000 Asia Southern Asia Lower middle 69.1 6890 2.28 41.10 455.0 NaN NaN NaN
37229 United States 2018 327000000 Americas Northern America High 79.1 54900 1.90 6.06 35.7 NaN NaN NaN
15986 Indonesia 2018 267000000 Asia South-eastern Asia Lower middle 72.0 11700 2.31 25.00 147.0 NaN NaN NaN
5036 Brazil 2018 211000000 Americas Latin America and the Caribbean Upper middle 75.7 14300 1.70 14.20 25.2 NaN NaN NaN
26498 Pakistan 2018 201000000 Asia Southern Asia Lower middle 68.0 5220 3.35 76.80 260.0 NaN NaN NaN
25622 Nigeria 2018 196000000 Africa Sub-Saharan Africa Lower middle 66.1 5570 5.39 97.90 215.0 NaN NaN NaN
2846 Bangladesh 2018 166000000 Asia Southern Asia Lower middle 73.4 3720 2.05 32.00 1280.0 NaN NaN NaN

Challenge 3

  1. To find out the total amount of CO2 released into the atmosphere, use the co2_per_capita and population columns to create a new column: co2_total.
  2. Plot the total CO2 per year for the world.
  3. Plot the total CO2 per year for the world and for each region.
  4. Create a faceted plot comparing total CO2 levels across income groups and regions.
In [ ]:
# Challenge 3 solutions

# 1.
world_data['co2_total'] = world_data['co2_per_capita'] * world_data['population']

# 2.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum')

# 3.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum', hue='region')

# 4.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum',
            hue='income_group', col='region', col_wrap=3, height=4)

# Discuss what these plots tell us:
# The world's total CO2 emissions are rapidly increasing. Europe and the Americas were the highest emitters for
# many years, but have recently been overtaken by Asia, which now produces around twice the amount of CO2 compared
# to Europe and the Americas. But don't forget that we saw in the last lecture that the population in Asia is
# 5-6 times bigger than in Europe and the Americas!

# It is important to look at the total production from a country, because change within that single country has big
# potential of reaching many people. Not plotted here, but also important, is to explore which countries are high in
# CO2 per capita, since these might have more room to reduce their production. Of course, reality is more complicated.
# Some countries might import goods that demand high CO2 production in their manufacturing country instead of producing
# the goods themselves, so they "sponsor" the production in another country but would not show up high in this list.

To continue exploring the CO2 emissions we started to look at in the last challenge, let's use the other type of plot for comparing quantitative variables: scatterplot(). This is the default in the relplot() function, so we don't need to specify kind='scatter'.

As mentioned in the discussion above, in addition to considering the total amount of CO2 produced per country, it can be insightful to explore the CO2 produced per citizen.

In [40]:
sns.relplot(x='co2_total', y='co2_per_capita', data=world_data)
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x7f4db14f2dd8>

This looks funky, and not quite as expected... The reason is that we have plotted multiple data points per country, one for each year! This is confusing since we don't know which dot belongs to which year, and it is probably not the plot we wanted to create. Instead, we can filter the data to focus on a specific year. Unfortunately, there are no CO2 measurements available for the last few years. To find out which years have CO2 measurements, we can drop the NAs in co2_per_capita and look at the min and max values of the year column.

In [41]:
world_data.dropna(subset=['co2_per_capita'])['year'].agg(['min', 'max'])
Out[41]:
min    1800
max    2014
Name: year, dtype: int64

Now we can subset the data for the latest available year with CO2 measurements, which is 2014.

In [ ]:
world_data_2014 = world_data.loc[world_data['year'] == 2014]
sns.relplot(x='co2_total', y='income', data=world_data_2014)

This reveals that a few countries in the world stand out clearly from the rest, and that one country is rather high in both measurements.

Just as before, it is possible to map plot semantics and facet the plot according to variables in the data set. scatterplot() can also scale the dot size according to a variable in the data set.

In [43]:
# `sizes` controls the min and max size of the dots
sns.relplot(x='co2_total', y='co2_per_capita', hue='income_group', size='population',
            data=world_data_2014, sizes=(40, 400))
Out[43]:
<seaborn.axisgrid.FacetGrid at 0x7f4db0409f98>

Unsurprisingly, some of the countries with high total CO2 emissions are also among the most populous countries. The trends in different regions can now be easily compared by faceting the data by region.

In [44]:
sns.relplot(x='co2_total', y='co2_per_capita', hue='income_group', size='population',
            data=world_data_2014, sizes=(40, 400), col='region', col_wrap=3, height=4)
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x7f4db03f4f98>

Already here we can get a pretty good idea of which countries some of these are. The high-emission middle income countries in Asia are likely China and India, while the American country high in both total emissions and emissions per capita must be the USA. However, some observations are harder to resolve, such as which countries have high co2_per_capita in Asia and the Americas.

Challenge 4

Let's use some of the aggregation methods from yesterday to complement the plots we have just made.

  1. Find out which 10 countries had the highest CO2 emissions per capita in 2014.
  2. Find out which 10 countries had the highest total CO2 emissions in 2014.
  3. Which 10 countries have produced the most CO2 in total since 1800?
In [47]:
# Challenge 4 solutions

# 1.
world_data_2014.nlargest(10, 'co2_per_capita')

# 2.
world_data_2014.nlargest(10, 'co2_total')

# 3.
world_data.groupby('country')['co2_total'].sum().nlargest(10)

In addition to what we observed above, an interesting aspect to explore is how the relationship between per capita and total CO2 emissions has changed over time for different income groups. As we have seen before, changes over time can be explored in a line graph, but if we instead subset certain years from the data and create a facet for each year, we can see the spread at each point in time.

In [48]:
world_data_1920_2018 = world_data.loc[world_data['year'].isin([1920, 1940, 1960, 1980, 2000, 2014])]

sns.relplot(x='co2_total', y='co2_per_capita', col='year', hue='income_group',
            data=world_data_1920_2018, col_wrap=3, height=3.5)
Out[48]:
<seaborn.axisgrid.FacetGrid at 0x7f4db01d3978>

How to know which relationships to start exploring?

In the exercises above, we chose suitable variables to illustrate the plotting concepts. Often when doing EDA, it will not be as easy to know what comparison to start with. Unless you have good reason to look at a particular relationship, starting by plotting the pairwise relationships of all quantitative variables can be helpful.

In [ ]:
# Use 2014 data since we know that there are CO2 measurements in that year
# This might take some time
sns.pairplot(world_data_2014)

The year column is not that insightful since there is only one year in the data. Removing that column gives more space for the rest of the plots.

In [ ]:
sns.pairplot(world_data_2014.drop(columns='year'))

Each plot on the diagonal is a histogram showing the distribution of a single variable. The scatter plots below the diagonal show the relationship between two numerical variables. The scatter plots above the diagonal are mirror images of those below the diagonal.

Plotting all pairwise relationships can provide clues for what to explore next. For example, the relationship between total CO2 and CO2 per capita that we explored above can also be seen here, as can the one between child mortality and children per woman. It is possible to quantify the strength of these relationships by computing the Pearson correlation coefficients between columns.

In [51]:
world_data_2014.drop(columns='year').corr()
Out[51]:
population life_expectancy income children_per_woman child_mortality pop_density co2_per_capita years_in_school_men years_in_school_women co2_total
population 1.000000 0.020899 -0.039127 -0.075136 -0.012679 0.010329 0.009876 -0.012609 -0.055508 0.810722
life_expectancy 0.020899 1.000000 0.656187 -0.799298 -0.874404 0.177470 0.466554 0.726919 0.732383 0.117341
income -0.039127 0.656187 1.000000 -0.530189 -0.550647 0.277383 0.807494 0.581746 0.582572 0.097359
children_per_woman -0.075136 -0.799298 -0.530189 1.000000 0.876623 -0.144019 -0.430218 -0.751975 -0.784130 -0.148606
child_mortality -0.012679 -0.874404 -0.550647 0.876623 1.000000 -0.126336 -0.442394 -0.789018 -0.818036 -0.122293
pop_density 0.010329 0.177470 0.277383 -0.144019 -0.126336 1.000000 0.120080 0.084184 0.080018 -0.010954
co2_per_capita 0.009876 0.466554 0.807494 -0.430218 -0.442394 0.120080 1.000000 0.441900 0.454274 0.159584
years_in_school_men -0.012609 0.726919 0.581746 -0.751975 -0.789018 0.084184 0.441900 1.000000 0.964648 0.122927
years_in_school_women -0.055508 0.732383 0.582572 -0.784130 -0.818036 0.080018 0.454274 0.964648 1.000000 0.088188
co2_total 0.810722 0.117341 0.097359 -0.148606 -0.122293 -0.010954 0.159584 0.122927 0.088188 1.000000
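
Each entry in a table like the one above is a Pearson correlation coefficient. As a rough sketch of what this number means, the example below computes it by hand for two made-up columns (the values are assumptions, not taken from the gapminder data) and compares the result with the pandas `corr()` method.

```python
import numpy as np
import pandas as pd

# Made-up values for two columns
toy = pd.DataFrame({
    'income': [900.0, 2500.0, 7000.0, 16000.0, 42000.0],
    'life_expectancy': [55.0, 62.0, 70.0, 75.0, 81.0],
})

# Deviations from the mean of each column
x = toy['income'] - toy['income'].mean()
y = toy['life_expectancy'] - toy['life_expectancy'].mean()

# Pearson's r: co-variation of the two columns scaled by the spread of each
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy['income'].corr(toy['life_expectancy'])

print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1, and it only captures linear relationships.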

With so much data, it is slow for us to process all the information as numbers in a table. A higher bandwidth operation is to let our brain interpret colors representing the strength of the relationships in a heatmap.

In [ ]:
sns.heatmap(world_data_2014.drop(columns='year').corr())

The heatmap can be made more informative by changing to a diverging colormap, which is generally recommended when there is a natural central value (such as 0 in our case). Optionally, the heatmap can be annotated with the correlation coefficients.

In [ ]:
fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(world_data_2014.drop(columns='year').corr(), annot=True, ax=ax, cmap='coolwarm')
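
One possible refinement, sketched here on random made-up data rather than the lesson's dataframe, is to pin the neutral color of the diverging colormap to the central value with the `center` parameter of `sns.heatmap()`, and to fix the color scale to the full range of the correlation coefficient with `vmin` and `vmax`. The column names and data below are assumptions for illustration only.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Random made-up data, only used to get a small correlation matrix
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=['a', 'b', 'c', 'd'])
toy['d'] = -toy['a'] + 0.1 * toy['d']  # make one pair strongly negatively correlated

corr = toy.corr()

# center=0 puts the neutral color at zero; vmin/vmax span the possible range of r
ax = sns.heatmap(corr, cmap='coolwarm', center=0, vmin=-1, vmax=1, annot=True)
```

With these settings, equally strong positive and negative correlations get equally saturated colors, which makes heatmaps from different data sets easier to compare.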

There are more formal ways of interrogating variable interactions and their potential causality (such as regressions), but these are outside the scope of this lecture. However, the pairwise scatter plot and correlation coefficient matrix are quick means to get an informative overview of how the dataframe columns relate to each other.

Let's zoom in on the relationship between income and life expectancy, which appears to be quite strong.

In [ ]:
# TODO Make this a challenge where they learn how to find things on stackoverflow
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014)

This relationship appears to be log linear and can be visualized with the x-axis set to log-scale.

Challenge

  1. Find out how to change the x-axis to be log-scaled. Search online for how to change the scale of a matplotlib axes object. Remember that seaborn functions such as scatterplot() return matplotlib axes objects, so all matplotlib functions that modify the axes will work on this plot. Good sites to use are the documentation pages for the respective package, and Stack Overflow. However, it is often fastest to type a well-chosen query into your favorite search engine.
  2. In the logged plot, color the dots according to the region of the observation.
In [55]:
# Challenge solutions
# 1.
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014)
ax.set_xscale('log')
In [56]:
# Challenge solutions
# 2.
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014, hue='region')
ax.set_xscale('log')

Another interesting relationship we could see from the pairplot is how child mortality relates to how many children are born per woman. We can filter out years of the data and look at how the relationship has changed over time using the same approach as for the CO2 data.

In [57]:
world_data_1920_2018 = world_data.loc[world_data['year'].isin([1920, 1940, 1960, 1980, 2000, 2018])]

sns.relplot(x='children_per_woman', y='child_mortality', col='year', hue='income_group',
            data=world_data_1920_2018, col_wrap=3, height=3.5)
Out[57]:
<seaborn.axisgrid.FacetGrid at 0x7f4d980e26a0>

A common misconception is that saving poor children will lead to overpopulation. However, we can see that lower child mortality is correlated with smaller family sizes. As more children survive, parents feel more secure with a smaller family size. Reducing poverty is also related to these variables, since most high income countries are found in the lower left corner of the plots (remember that the income group is classified based on income in 2018, and not on the income in each year plotted above).

It is important to note that from a plot like this, it is not possible to tell causation, just correlation. However, in the gapminder video library there are a few videos on this topic (including this and this one), discussing how reducing poverty can help slow down population growth through decreased family sizes. Current estimates suggest that the world population will stabilize around 11 billion people and that the average number of children per woman will be close to two worldwide in the year 2100.

Exploring a single quantitative variable across multiple levels of a categorical variable

When exploring a single quantitative variable, we can choose between plotting every data point (e.g. categorical scatterplots such as swarm plots and strip plots), an approximation of the distribution (e.g. histograms and violinplots), or distribution statistics such as measures of central tendency (e.g. boxplots and barplots).

A good place to start is to visualize the variable's distribution with distplot(). Let's look at life expectancy during 2018 using this technique.

In [ ]:
world_data_2018 = world_data.loc[world_data['year'] == 2018]
sns.distplot(world_data_2018['life_expectancy'])

The line represents a KDE (kernel density estimate), as seen previously in the grouped pairplot. Conceptually, this is similar to a smoothed histogram.
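
To give an idea of what a KDE is, the sketch below builds one by hand with numpy: a small Gaussian bump is placed on every observation and the bumps are averaged. This is a conceptual illustration under simplifying assumptions, not the exact routine seaborn uses; the helper name `gaussian_kde_sketch` and the data values are made up.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    # Distance from every grid point to every sample, in units of the bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

life_exp = [52.0, 55.5, 61.0, 70.2, 71.8, 74.0, 79.5, 81.1, 82.3]  # made-up values
grid = np.linspace(30, 105, 751)

narrow = gaussian_kde_sketch(life_exp, grid, bandwidth=1)  # wiggly curve
wide = gaussian_kde_sketch(life_exp, grid, bandwidth=5)    # smooth curve

# Both curves are densities, so the area under each is close to 1
step = grid[1] - grid[0]
print(narrow.sum() * step, wide.sum() * step)
```

A small bandwidth follows individual observations closely, while a large bandwidth smooths over them, much like choosing many or few bins in a histogram.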

distplot() can be customized to increase the number of bins and the bandwidth of the kernel. These are both calculated according to heuristics for what should be good numbers for the underlying data, but it is good to know how to change them.

In [59]:
sns.distplot(world_data_2018['life_expectancy'], bins=30, rug=True,
             kde_kws={'bw':1, 'color':'black'})
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d98067e10>

The rug plot along the x-axis shows exactly where each data point resides. To compare distributions between the levels of a categorical variable, violinplots are often used. These consist of two KDEs mirrored across a midline.

In [60]:
sns.violinplot(x='life_expectancy', y='income_group', data=world_data_2018)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d98f25828>

Since income_group was defined as an ordered categorical variable previously, this order is preserved when distributing the income groups along the y-axis.

There is notable variation in life expectancy between income groups: people in wealthier countries live longer. This variation contributes to the multimodality seen in the first distribution plot of the life expectancy for all countries in the world. However, there is also large overlap between income groups and variation within the groups, so more variables than just income affect life expectancy.

Dissecting multimodal distributions in this manner, to find underlying variables that explain why a distribution appears to consist of several smaller distributions, is common practice during EDA. It looks like some income groups, e.g. "High", still consist of multimodal distributions. To explore these further, faceting can be used just as before. The categorical equivalent of relplot() is catplot() (categorical plot).

In [61]:
sns.catplot(x='life_expectancy', y='income_group', data=world_data_2018)
Out[61]:
<seaborn.axisgrid.FacetGrid at 0x7f4d93fcb4a8>

The default in catplot() is to create a stripplot, a categorical scatterplot where the dots are randomly jittered to reduce overlap. This is fast to create, but it is sometimes hard to see how many dots are in a group because some of them still overlap. A more ordered approach is to create another type of categorical scatterplot, called a swarmplot, where the dots are positioned to avoid overlap.

In [62]:
sns.catplot(x='life_expectancy', y='income_group', data=world_data_2018, kind='swarm')
Out[62]:
<seaborn.axisgrid.FacetGrid at 0x7f4d980ba668>

The swarmplot communicates the shape of the distribution more clearly than the stripplot. Here, we can see the same bimodality in the high income group as in the violinplot, which was hard to see in the stripplot.

A drawback is that swarmplots can be slow to create for large datasets. For really large datasets, even a stripplot is slow, and it is necessary to approximate the distributions (e.g. with a violinplot) or show distribution statistics (e.g. with a boxplot) instead of showing each observation.
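
As a small sketch of what "distribution statistics" means for a boxplot, the quartiles that define the box and its center line can be computed per group with pandas. The dataframe below is made up for illustration and is not part of the lesson's data.

```python
import pandas as pd

# Hypothetical toy data: five countries in each of two income groups
toy = pd.DataFrame({
    'income_group': ['Low'] * 5 + ['High'] * 5,
    'life_expectancy': [55.0, 58.0, 60.0, 63.0, 66.0,
                        76.0, 79.0, 80.0, 82.0, 84.0],
})

# The box spans the 25th to 75th percentile, the line inside it is the median
quartiles = (toy.groupby('income_group')['life_expectancy']
                .quantile([0.25, 0.5, 0.75])
                .unstack())
print(quartiles)
```

Three numbers per group are cheap to compute and draw regardless of how many observations each group contains, which is why this scales well to large datasets.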

We can use faceting to find out if regional differences are related to income level.

In [ ]:
# TODO Will update this to look prettier
sns.catplot(x='life_expectancy', y='region', data=world_data_2014, kind='box',
            col='income_group', col_wrap=2)

The variable levels are automatically ordered and it is easy to see how life expectancy generally grows with higher average income. In contrast to a line plot with the average change over time, we can here see how the distribution itself changes, not just the average. While countries in general have increased their life expectancy, differences can be seen in how they have done it: Europe and the Americas have gone from a mix of high and low life expectancy levels to tighter distributions where all countries have high life expectancy, whereas Africa has transitioned from most countries having low life expectancy to a wide range of life expectancies depending on the country.

In [ ]:
# If both columns can be interpreted as numerical,
# the `orient` keyword can be added to be explicit
sns.catplot(x='life_expectancy', y='year', orient='horizontal', data=world_data_1920_2018, kind='violin',
            col='region', col_wrap=3, color='lightgrey')

Let's explore how much of the variation during the transition in African life expectancy can be explained by geographically close regions performing differently. First, let's find out how many sub-regions there are in each region.

In [65]:
world_data_1920_2018.groupby('region')['sub_region'].nunique()
Out[65]:
region
Africa      2
Americas    2
Asia        5
Europe      4
Oceania     4
Name: sub_region, dtype: int64

There are two sub-regions in Africa; let's find out which ones.

In [66]:
world_data_1920_2018.groupby('region')['sub_region'].unique()
Out[66]:
region
Africa                  [Northern Africa, Sub-Saharan Africa]
Americas    [Latin America and the Caribbean, Northern Ame...
Asia        [Southern Asia, Western Asia, South-eastern As...
Europe      [Southern Europe, Western Europe, Eastern Euro...
Oceania     [Australia and New Zealand, Melanesia, Microne...
Name: sub_region, dtype: object

Let's see if Sub-Saharan and Northern Africa have developed differently when it comes to life expectancy.

In [67]:
# The split parameter saves some space and looks slick
africa = world_data_1920_2018.loc[world_data_1920_2018['region'] == 'Africa']
sns.catplot(x='life_expectancy', y='year', orient='horizontal', data=africa, kind='violin',
            hue='sub_region', palette='pastel', split=True)
Out[67]:
<seaborn.axisgrid.FacetGrid at 0x7f4d93955f28>

For the last challenge, we will explore how an education indicator varies between men and women.

In [68]:
world_data.dropna(subset=['years_in_school_women'])['year'].agg(['min', 'max'])
Out[68]:
min    1970
max    2015
Name: year, dtype: int64

Challenge

  1. Subset the dataframe for the years 1975, 1995, and 2015.
  2. Make a new column with the ratio of women's to men's years in school.
  3. Plot this ratio over time for the different regions and income groups.
In [69]:
# Challenge solutions
# 1.
world_data_1970_2015 = world_data.loc[world_data['year'].isin([1975, 1995, 2015])].copy()
In [70]:
# 2.
world_data_1970_2015['women_men_school_ratio'] = world_data_1970_2015['years_in_school_women'] / world_data_1970_2015['years_in_school_men']
# world_data_1970_2015['women_men_school_ratio']
In [71]:
# 3a.
sns.catplot(y='women_men_school_ratio', x='year', data=world_data_1970_2015, hue='region', dodge=True, kind='point')
Out[71]:
<seaborn.axisgrid.FacetGrid at 0x7f4d980d9f60>
In [72]:
# 3b.
sns.catplot(y='women_men_school_ratio', x='year', data=world_data_1970_2015, hue='income_group', dodge=True, kind='point')
Out[72]:
<seaborn.axisgrid.FacetGrid at 0x7f4d93878710>