Exploring data graphically¶

Learning Objectives¶

Learn how to plot with matplotlib.
Set universal plot settings.
Produce scatter plots, line plots, and histograms using seaborn and matplotlib.
Understand how to graphically explore relationships between variables.
Apply grids for faceting in seaborn.
Use seaborn grids with matplotlib functions.

Lesson outline¶

Data visualization with matplotlib and seaborn (10 min)
- Intro to plotting with matplotlib
- Visualizing one quantitative variable with multiple categorical variables (50 min)
- Visualizing the relationship of two quantitative variables with multiple categorical variables (40 min)

Short review from yesterday¶

Do you remember how to:

1 - Read in the data from the csv file from yesterday?

import pandas as pd

world_data = pd.read_csv('https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv')

# If saved locally yesterday:
# world_data = pd.read_csv("world_data.csv")

2 - How to select only the columns 'country' and 'year' from the dataframe?

world_data[['country', 'year']].head() #head just to limit output

3 - How to select a few rows together with the columns above?

world_data.loc[[1, 13, 24], ['country', 'year']]

4 - How to select only data from year 1995?

world_data.loc[world_data['year'] == 1995]

5 - Select only the rows where the region is Asia or Africa.

world_data.loc[world_data['region'].isin(['Asia', 'Africa'])]

6 - Calculate the total population in each region

world_data.groupby('region')['population'].sum()

region
Africa       59192998600
Americas     63837885500
Asia        330133218800
Europe       98766930400
Oceania       2422277600
Name: population, dtype: int64

7 - Get the number of countries in each region for the year 2018.

world_data.loc[world_data['year'] == 2018].groupby('region').size()

region
Africa      52
Americas    31
Asia        47
Europe      39
Oceania      9
dtype: int64
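The review steps above can also be tried on a small made-up table when the CSV file is not at hand. The sketch below is only an illustration: the numbers and country names are invented, and only the column names match the real data.

```python
import pandas as pd

# Made-up numbers, for illustration only
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'region': ['Asia', 'Asia', 'Europe', 'Asia', 'Asia', 'Europe'],
    'year': [1995, 1995, 1995, 2018, 2018, 2018],
    'population': [10, 20, 30, 15, 25, 35],
})

# Boolean mask: keep only the rows from 2018
toy_2018 = toy.loc[toy['year'] == 2018]

# Total population and number of countries per region in 2018
pop_per_region = toy_2018.groupby('region')['population'].sum()
countries_per_region = toy_2018.groupby('region').size()
```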

Introduction to plotting¶

The human visual system is one of the most advanced apparatuses for detecting patterns and it allows for quick exploration of complex visual relationships. Data visualization is therefore a quick, efficient way of unearthing clues to interesting features in the data that can later be investigated in a robust, quantitative manner. Visualizations are also unparalleled in communicating insights drawn from data. For these reasons, it is important to possess the skills to graphically represent the data in a way that is efficient for humans to process.

There are many plotting packages in Python, making it possible to create diverse visualizations such as interactive web graphics, 3D animations, statistical visualizations, and map-based plots. When starting out, it can be helpful to find an example of a plot similar to the one you want to create and then copy and modify its code. Examples of plots can be found in many excellent online Python plotting galleries, such as those from matplotlib, seaborn, and the Python graph gallery.

Our focus will be on two of the most useful packages for creating publication quality visualizations: matplotlib, which is a robust, detail-oriented, low level plotting interface, and seaborn, which provides high level functions on top of matplotlib and allows the plotting calls to be expressed in terms of what is being explored in the underlying data rather than what graphical elements to add to the plot. The high-level figures created by seaborn can be configured via the matplotlib parameters, so learning these packages in tandem is useful.

%matplotlib inline
# Note that this will only need to be done the first time you create a plot in a notebook
# all subsequent plots will show up as expected.

To facilitate our understanding of plotting concepts, the initial examples here will not include dataframes, but instead have simple lists holding just a few data points. To create a line plot, the plot() function from matplotlib.pyplot can be used.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]
plt.plot(x, y)

[<matplotlib.lines.Line2D at 0x7f4dc6a43dd8>]

Using plot() like this is not very explicit and a few things happen "under the hood", e.g. a figure is automatically created and it is assumed that the plot should go into the currently active region of this figure. This gives little control over exactly where to place the plots within a figure and how to make modifications to the plot after creating it, e.g. adding a title or labeling the axes.

To facilitate modifications to the plot, it is recommended to use the object oriented plotting interface in matplotlib, where an empty figure and at least one axes object is explicitly created before a plot is added to it. This figure and its axes objects are assigned to variable names which are then used for plotting. In matplotlib, an axes object refers to what you would often call a subplot colloquially and it is named "axes" because it consists of an x-axis and a y-axis by default.

fig, ax = plt.subplots()

Calling subplots() returns two objects, the figure and its axes object. Plots can be added to the axes object of the figure by using the name we assigned to the returned axes object (ax by convention).

fig, ax = plt.subplots()
ax.plot(x, y)

[<matplotlib.lines.Line2D at 0x7f4dc69e1668>]
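To make the relationship between the figure, its axes, and the plotted line a little more concrete, the short sketch below inspects the objects after plotting. It only uses the same x and y lists as above; the variable name `lines` is an arbitrary choice.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]

fig, ax = plt.subplots()
lines = ax.plot(x, y)  # plot() returns a list with the line(s) it created

# The figure keeps track of its axes, and each axes
# keeps track of what has been drawn on it
print(fig.axes[0] is ax)        # True
print(ax.lines[0] is lines[0])  # True
```

Since the axes object holds everything that was drawn on it, later modifications can always be made through `ax`.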

To create a scatter plot, use scatter() instead of plot().

fig, ax = plt.subplots()
ax.scatter(x, y)

<matplotlib.collections.PathCollection at 0x7f4dc6912f98>

Plots can also be combined together in the same axes. The line style and marker color can be changed to facilitate viewing the elements in the combined plot.

fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')

[<matplotlib.lines.Line2D at 0x7f4dc68f9d30>]

And plot elements can be resized.

fig, ax = plt.subplots()
ax.scatter(x, y, color='red', s=100)
ax.plot(x, y, linestyle='dashed', linewidth=3)

[<matplotlib.lines.Line2D at 0x7f4dc685afd0>]

It is common to modify the plot after creating it, e.g. adding a title or labeling the axes.

fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')

ax.set_title('Line and scatter plot')
ax.set_xlabel('Measurement X')

Text(0.5,0,'Measurement X')
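When several properties are changed at once, the axes method set() can be a convenient alternative to calling one set_*() method per property. The sketch below repeats the plot above; the y-axis label 'Measurement Y' is made up for the illustration.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]

fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')

# Set several properties in one call instead of one set_*() call per property
ax.set(title='Line and scatter plot',
       xlabel='Measurement X',
       ylabel='Measurement Y')
```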

The scatter and line plots can easily be separated into two subplots within the same figure, by telling plt.subplots to create a figure with one row and two columns (so two subplots side by side). This returns two axes objects, one for each subplot, which we assign to the variable names ax1 and ax2.

fig, (ax1, ax2) = plt.subplots(1, 2)
# The default is (1, 1), that's why it does not need
# to be specified with only one subplot

To prevent plot elements, such as the axis tick labels, from overlapping, the tight_layout() method can be used.

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.tight_layout()

The figure size can easily be controlled when it is created.

# `figsize` refers to the size of the figure in inches when printed
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
fig.tight_layout()

Bringing it all together to separate the line and scatter plot.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(x, y, color='red')
ax2.plot(x, y, linestyle='dashed')

ax1.set_title('Scatter plot')
ax2.set_title('Line plot')
fig.tight_layout()

Challenge 1¶

There is a plethora of colors available to use in matplotlib. Change the color of the line and the dots in the figure using your favorite color from this list.

Use the documentation to change the styling of the line in the line plot and the type of marker used in the scatter plot (you might need to search online to figure this out).

Saving plots¶

Figures can be saved by calling the savefig() method and specifying the name of the file to create. The resolution of the figure can be controlled by the dpi parameter.

fig.savefig('scatter-and-line.png', dpi=300)

In the JupyterLab file browser, you can see that a new image file has been created. A PDF file can be saved by changing the extension in the specified file name. Since PDF is a vector file format, the plot elements stay sharp at any zoom level and a resolution does not need to be specified.

fig.savefig('scatter-and-line.pdf')
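For raster formats such as PNG, the size of the saved image in pixels is the figure size in inches multiplied by the dpi. The sketch below checks this by saving to an in-memory buffer instead of a file; the buffer is used here only to avoid creating another file, and the figure size and dpi are arbitrary choices.

```python
import io

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))  # width and height in inches
ax.plot([1, 2, 3, 4], [1, 2, 4, 3])

# Save to an in-memory buffer, just to inspect the result
buffer = io.BytesIO()
fig.savefig(buffer, format='png', dpi=100)
buffer.seek(0)

# Read the image back as an array of pixels
image = plt.imread(buffer, format='png')
height_px, width_px = image.shape[:2]
print(width_px, height_px)  # 800 400, i.e. 8 x 100 and 4 x 100
```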

This concludes the customization section. The concepts taught here will be applied in the next section on how to choose a suitable plot type for data sets with many observations.

Plotting dataframes¶

If the dataframe from the previous lecture is not loaded, read it in first.

import pandas as pd

# world_data = pd.read_csv('../world-data-gapminder.csv')
# If not saved to disk yesterday
url = 'https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv'
world_data = pd.read_csv(url)

We can use scatter() with the data parameter to plot columns from the dataframe.

fig, ax = plt.subplots()
ax.scatter(x='year', y='population', data=world_data)

<matplotlib.collections.PathCollection at 0x7f4dbe565208>
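The data parameter is not specific to dataframes: the strings passed to x and y are simply looked up in whatever object is given as data. As a minimal sketch with made-up values, a plain dictionary works as well.

```python
import matplotlib.pyplot as plt

# Made-up values; a dictionary stands in for the dataframe here
measurements = {'year': [2000, 2005, 2010], 'population': [1.0, 2.0, 4.0]}

fig, ax = plt.subplots()
# The strings are looked up in `data`, so this is the same as
# ax.scatter(measurements['year'], measurements['population'])
points = ax.scatter(x='year', y='population', data=measurements)
```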

The reason this graph is not immediately intuitive is that one scatter dot has been added for each year for every country. To instead see how the world's total population has changed over the years, the population from each country for each year needs to be summed together. This can be done using the dataframe techniques from the previous lecture.

# One could also do `as_index=False` with `groupby()`
world_pop = world_data.groupby('year')['population'].sum().reset_index()

fig, ax = plt.subplots()
ax.scatter(x='year', y='population', data=world_pop)

<matplotlib.collections.PathCollection at 0x7f4dbde57a90>
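The difference between reset_index() and as_index=False, mentioned in the code comment above, is easier to see on a tiny table. The numbers below are made up for the illustration.

```python
import pandas as pd

# Made-up numbers: two countries observed in two years
toy = pd.DataFrame({
    'country': ['A', 'B', 'A', 'B'],
    'year': [1900, 1900, 2000, 2000],
    'population': [1, 2, 10, 20],
})

# Summing per year gives a Series with `year` as its index ...
per_year = toy.groupby('year')['population'].sum()

# ... and reset_index() turns it back into a dataframe with a `year` column
with_reset = per_year.reset_index()

# as_index=False gives a dataframe with the same columns in one step
without_index = toy.groupby('year', as_index=False)['population'].sum()
```

Either form has `year` as a regular column, which is what allows it to be referred to by name when plotting.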

This plot shows that the world population has been steadily increasing since the 1800s and dramatically picked up pace in the 1950s.

It is possible to use matplotlib in this way to explore visual relationships in dataframes. However, it is a little cumbersome already with these simple examples and it will get more complicated once we want to include more variables, e.g. stratifying the data in subplots based on region and income level would involve writing double loops and keeping track of plot layout and grouping variables manually. The Python package seaborn is designed for effectively exploring data visually without getting bogged down in technical plotting details.
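To give an idea of the manual bookkeeping involved, here is a rough sketch of splitting a plot into one subplot per region using only matplotlib and pandas. The table is made up (two hypothetical regions with three observations each), and stratifying by a second variable would require a second, nested loop.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers for two hypothetical regions
toy = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'year': [1900, 1950, 2000, 1900, 1950, 2000],
    'population': [1, 2, 4, 2, 3, 9],
})

regions = toy['region'].unique()
fig, axes = plt.subplots(1, len(regions), figsize=(8, 3), sharey=True)

# One subplot per region: subsetting, titles and layout are all handled by hand
for ax, region in zip(axes, regions):
    subset = toy.loc[toy['region'] == region]
    ax.plot(subset['year'], subset['population'])
    ax.set_title(region)
    ax.set_xlabel('year')

axes[0].set_ylabel('population')
fig.tight_layout()
```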

Visual data exploration with `seaborn`¶

When visually exploring data with lots of variables, it is in many cases easier to think in terms of what is to be explored in the data, rather than what graphical elements are to be added to the plot. For example, instead of instructing the computer to "go through a dataframe and plot any observations of country X in blue, any observations of country Y in red, etc", it is easier to just type "color the data by country". There are many benefits to using a so-called descriptive syntax, instead of an imperative one.

Facilitating semantic mappings of data variables to graphical elements is one of the goals of the seaborn plotting package. Thanks to its functional way of interfacing with data, only minimal changes are required if the underlying data change or if the type of plot used for the visualization is switched. seaborn provides a language that facilitates thinking about data in ways that are conducive for exploratory data analysis and allows for the creation of publication quality plots with minimal adjustments and tweaking.

The seaborn syntax was introduced briefly already in the introductory lecture and it is similar to how matplotlib plots dataframes. For example, to make the same scatter plot as above:

import seaborn as sns

sns.scatterplot(x='year', y='population', data=world_pop)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4db4dfee80>

In addition to providing a data-centric syntax, seaborn also facilitates visualization of common statistical aggregations. For example, when creating a line plot in seaborn, the default is to aggregate and average all observations with the same value on the x-axis, and to create a shaded region representing the 95% confidence interval for these observations.

sns.lineplot(x='year', y='population', data=world_data)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4db46d5e48>

In this case, it would be more appropriate to have the shaded area describe the variation in the data, such as the standard deviation, rather than an inference about the reproducibility, such as the default 95% CI.

sns.lineplot(x='year', y='population', data=world_data, ci='sd')

<matplotlib.axes._subplots.AxesSubplot at 0x7f4dbde0a668>

To change from showing the average world population per country and year to showing the total population for all countries per year, the estimator parameter can be used. Here, the shaded area is also removed with ci=None.

# The `estimator` parameter is currently non-functional for sns.scatterplot, but will be added soon
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4db4341eb8>
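The quantities behind the three line plots above can be approximated with a plain pandas aggregation, which may help when interpreting them. The sketch below uses a made-up table with three countries and two years; seaborn's internal computation may differ in its details.

```python
import pandas as pd

# Made-up numbers: three countries observed in two years
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C'],
    'year': [1900, 1900, 1900, 2000, 2000, 2000],
    'population': [1, 2, 3, 10, 20, 30],
})

# mean: the value lineplot() draws by default
# std:  the spread behind the shaded area with ci='sd'
# sum:  the value drawn with estimator='sum'
summary = toy.groupby('year')['population'].agg(['mean', 'std', 'sum'])
print(summary)
```

For the year 1900 in this table, the mean is 2, the standard deviation is 1 and the sum is 6.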

Changing graph aesthetics¶

Before continuing the exploration of the world population data, let's discuss how to customize the appearance of our plots. The returned object is a matplotlib axes, so all configuration available through matplotlib can be applied to the returned object by first assigning it to a variable name (ax by convention).

ax = sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
ax.set_title('World population since the 1800s', fontsize=16)
ax.set_xlabel('Year', fontsize=12)

Text(0.5,0,'Year')

In addition to all the customization available through the standard matplotlib syntax, seaborn also offers its own functions for changing the appearance of the plots.

In essence, these functions are shortcuts to change several matplotlib parameters simultaneously. For example, a more effective approach than setting individual font sizes or colors of graphical elements is to set the overall size and style for all graphs.

sns.set(context='talk', style='darkgrid', palette='pastel')
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)

These functions are analogous to making changes in the settings menu of a graphical program and they will apply to all following plots.
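To see that a seaborn style is just a collection of matplotlib parameters, the function axes_style() can be used to look at the parameters a style would set, without applying them. The two parameters compared in this small sketch were picked arbitrarily.

```python
import seaborn as sns

# axes_style() returns the matplotlib parameters that a style would set,
# without applying them to any plot
darkgrid = sns.axes_style('darkgrid')
ticks = sns.axes_style('ticks')

# Compare a couple of parameters between the two styles
for name in ['axes.grid', 'axes.facecolor']:
    print(name, darkgrid[name], ticks[name])
```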

Challenge 2¶
Find out which styles and contexts are available in seaborn. Try some of them out and choose your favorite style and context. Hint: This information is available both through the built-in and the online documentation.

For the rest of this tutorial, the ticks style will be used.

sns.set(context='notebook', style='ticks', font_scale=1.4)
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4db224e630>

For styles that include the frame around the plot, there is a special seaborn function to remove the top- and rightmost borders (again by modifying matplotlib parameters under the hood).

sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
sns.despine()

If the style options exposed through seaborn are not sufficient, it is possible to change all plot parameters directly through the matplotlib rc and style interfaces.
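As a minimal sketch of the matplotlib rc interface, the example below changes two parameters temporarily with rc_context(). The parameter values are arbitrary; assigning to plt.rcParams directly would instead change them for all following plots.

```python
import matplotlib.pyplot as plt

width_before = plt.rcParams['lines.linewidth']

# rc_context() applies the settings only inside the `with` block
with plt.rc_context({'lines.linewidth': 4, 'axes.titlesize': 18}):
    fig, ax = plt.subplots()
    thick_line, = ax.plot([1, 2, 3, 4], [1, 2, 4, 3])
    ax.set_title('Thick line, large title')

# Outside the block, the previous settings are back
width_after = plt.rcParams['lines.linewidth']
```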

Exploring relationships between two quantitative variables¶

As mentioned above, the main strength of a descriptive plotting syntax lies in describing the plot appearance in human-friendly vocabulary and have the computer assign variables to graphical objects accordingly. For example, to plot subsets of the data in different colors, the hue parameter can be used.

sns.lineplot(x='year', y='population', hue='income_group',
             data=world_data, ci=None, estimator='sum')

<matplotlib.axes._subplots.AxesSubplot at 0x7f4db21721d0>

This stratification of the income groups reveals that the population growth has been the fastest in middle income countries.

The plot can be made more accessible (especially to those with color vision deficiency) by changing the style of each line instead of only relying on color to separate them.

sns.lineplot(x='year', y='population', hue='income_group', style='income_group',
             data=world_data, ci=None, estimator='sum')

<matplotlib.axes._subplots.AxesSubplot at 0x7f4db21bed30>

Just like in the previous lecture, the values of the ordinal variable income_group are not listed in an intuitive order. A custom order can easily be specified by passing a list to the hue_order parameter, but this would have to be done for every plot. A more effective approach is to encode the order in the dataframe itself, using the top level pandas function Categorical().

world_data['income_group'] = (
    pd.Categorical(world_data['income_group'], ordered=True,
                   categories=['Low', 'Lower middle', 'Upper middle', 'High'])
)
world_data['income_group'].dtype

CategoricalDtype(categories=['Low', 'Lower middle', 'Upper middle', 'High'], ordered=True)

sns.lineplot(x='year', y='population', hue='income_group', style='income_group',
             data=world_data, ci=None, estimator='sum')

<matplotlib.axes._subplots.AxesSubplot at 0x7f4db20abe48>

The legend now lists the colors in the expected order. This modification also ensures that when making plots with income groups on the x- or y-axis, they will be plotted in the expected order.
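The effect of an ordered categorical can be checked without plotting. In the sketch below, a handful of made-up income levels are sorted both as plain strings and as ordered categories.

```python
import pandas as pd

# Made-up observations, deliberately out of order
levels = pd.Series(['High', 'Low', 'Upper middle', 'Lower middle', 'Low'])

ordered_levels = pd.Series(pd.Categorical(
    levels, ordered=True,
    categories=['Low', 'Lower middle', 'Upper middle', 'High']))

# Plain strings sort alphabetically, ordered categories sort by rank
print(levels.sort_values().tolist())
# ['High', 'Low', 'Low', 'Lower middle', 'Upper middle']
print(ordered_levels.sort_values().tolist())
# ['Low', 'Low', 'Lower middle', 'Upper middle', 'High']

# Ordered categories also support comparisons
print((ordered_levels > 'Low').tolist())
# [True, False, True, True, False]
```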

Conditioning quantitative relationships on qualitative variables¶

It is difficult to explore multiple categorical relationships within one single plot. For example, to see how the income groups compare within each region, the hue and style variables could be used for different variables, but this makes the plot dense and difficult to interpret.

sns.lineplot(x='year', y='population', hue='income_group', style='region',
             data=world_data, ci=None, estimator='sum')

<matplotlib.axes._subplots.AxesSubplot at 0x7f4db2148978>

An effective approach for exploring multiple categorical variables in a data set is to plot so-called "small multiples" of the data, where the same type of plot is used for different subsets of the data. These subplots are drawn in rows and columns forming a grid pattern, and can be referred to as a "facet", "lattice" or "trellis" plot.

Visualizing categorical variables in this manner is a key step in exploratory data analysis, and thus seaborn has a dedicated plot function for this, called relplot() (for "relational plot" since it visualizes the relationships between numerical variables). The syntax to relplot() is very similar to lineplot(), but we need to specify that the kind of plot we want is a line plot.

# Create the same plot as above
sns.relplot(x='year', y='population', hue='income_group', style='income_group', kind='line',
            data=world_data, ci=None, estimator='sum')

<seaborn.axisgrid.FacetGrid at 0x7f4db1fe77f0>

The region variable can now be mapped to different facets/subplots in a grid pattern.

# TODO switch this to some more interesting column if I have time
sns.relplot(x='year', y='population', data=world_data, estimator='sum',
            kind='line', hue='income_group', col='region', ci=None)

<seaborn.axisgrid.FacetGrid at 0x7f4db1f4fd30>

It's a little hard to see because the figure is very wide and has been shrunk to fit in the notebook. To avoid this, relplot() can use the col_wrap parameter to distribute the plots over several rows. The height and aspect parameters can be used to set the height and width of each facet.

sns.relplot(x='year', y='population', data=world_data, estimator='sum',
            kind='line', hue='income_group', col='region', ci=None,
            col_wrap=3, height=2.5, aspect=1.3)

<seaborn.axisgrid.FacetGrid at 0x7f4db1d1c470>

Faceting the plot by region reveals that the largest absolute population increase occurred among middle income countries in Asia. We will soon look closer at which countries these are.

The returned object from relplot() is a grid (a special kind of figure) with many axes, and can therefore not be placed within a preexisting figure. It is saved just as any matplotlib figure with savefig(), but has some special methods for easily changing the aesthetics of each axes.

g = sns.relplot(x='year', y='population', data=world_data, estimator='sum',
                kind='line', hue='income_group', col='region', ci=None,
                col_wrap=3, height=2.5, aspect=1.3)

g.set_titles('{col_name}', y=0.95)
g.set_axis_labels(y_var='Population', x_var='Year')
g.savefig('grid-figure.png')

Remember that names such as fig, ax, and here g, are only by convention, and any variable name could have been used.
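Since each facet in the grid is a regular matplotlib axes, matplotlib methods can be applied to the facets one by one. The sketch below assumes a made-up table with two hypothetical regions and uses the default scatter plot; the reference line at 5 and the tick positions are arbitrary.

```python
import pandas as pd
import seaborn as sns

# Made-up numbers for two hypothetical regions
toy = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'South'],
    'year': [1900, 1950, 2000, 1900, 1950, 2000],
    'population': [1, 2, 4, 2, 3, 9],
})

g = sns.relplot(x='year', y='population', col='region', data=toy, height=2.5)

# g.axes holds the matplotlib axes of the grid, so any matplotlib
# method can be applied to each facet in turn
for ax in g.axes.flat:
    ax.axhline(5, color='grey', linestyle='dashed')
    ax.set_xticks([1900, 1950, 2000])
```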

We might want the color to indicate income group, but draw separate lines for each country. For this we can set units='country' and estimator=None (so don't aggregate, just draw one line per country with the raw values).

sns.relplot(x='year', y='population', data=world_data, estimator=None, units='country',
            kind='line', hue='income_group', col='region', ci=None,
            col_wrap=3, height=2.5, aspect=1.3)

<seaborn.axisgrid.FacetGrid at 0x7f4db19822b0>

Two countries in Asia stand out in terms of total population. To find out which these are, we can filter the data.

world_data.loc[world_data['year'] == 2018].nlargest(8, 'population')

Challenge 3¶

To find out the total amount of CO2 released into the atmosphere, use the co2_per_capita and population columns to create a new column: co2_total.

Plot the total CO2 per year for the world.

Plot the total CO2 per year for the world and for each region.

Create a faceted plot comparing total CO2 levels across income groups and regions.

# Challenge 3 solutions

# 1.
world_data['co2_total'] = world_data['co2_per_capita'] * world_data['population']

# 2.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum')

# 3.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum', hue='region')

# 4.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum',
            hue='income_group', col='region', col_wrap=3, height=4)

# Discuss what these plots tell us:
# The world's total CO2 emissions are rapidly increasing. Europe and the Americas have been the highest emitters for
# many years, but have recently been overtaken by Asia, which is now producing around twice the amount of CO2 compared
# to Europe and the Americas. But don't forget that we saw in the last lecture that the population in Asia is 5-6 times
# bigger than in Europe and the Americas!

# It's important to look at the total production from a country because change within that single country has big
# potential of reaching many people. Not plotted here, but also important, is to explore which countries are high in
# CO2 per capita since these might have more room to reduce the production. Of course, reality is more complicated.
# Some countries might import goods that demand high CO2 production in their manufacturing country instead of producing
# them themselves, so they might "sponsor" the production in another country, but would not show up high in this list.

To continue exploring the CO2 emissions we started to look at in the last challenge, let's use the other type of plot for comparing quantitative variables: scatterplot(). This is the default in the relplot() function, so we don't need to specify kind='scatter'.

As mentioned in the discussion above, in addition to considering the total amount of CO2 produced per country, it can be insightful to explore the CO2 produced per citizen.

sns.relplot(x='co2_total', y='co2_per_capita', data=world_data)

<seaborn.axisgrid.FacetGrid at 0x7f4db14f2dd8>

This looks funky, and not quite as expected... The reason is that we have plotted multiple data points per country, one for each year! This can be confusing since we don't know which dot is for which year and this plot is probably not what we wanted to create. Instead, we can filter the data to focus on a specific year. Unfortunately, there are no CO2 measurements available for the last few years. To find out in which years there are countries with CO2 measurements, we can drop the NAs in co2_per_capita and look at the min and max value.

world_data.dropna(subset=['co2_per_capita'])['year'].agg(['min', 'max'])

min    1800
max    2014
Name: year, dtype: int64

Now we can subset the data for the latest available year with CO2 measurements, which is 2014.

world_data_2014 = world_data.loc[world_data['year'] == 2014]
sns.relplot(x='co2_total', y='co2_per_capita', data=world_data_2014)

This reveals that a few countries in the world stand out with markedly higher values than the rest in one of the measurements, and that one country is rather high in both measurements.

Just as before, it is possible to map plot semantics and facet the plot according to variables in the data set. scatterplot() can also scale the dot size according to a variable in the data set.

# `sizes` controls the dots min and max size
sns.relplot(x='co2_total', y='co2_per_capita', hue='income_group', size='population',
            data=world_data_2014, sizes=(40, 400))

<seaborn.axisgrid.FacetGrid at 0x7f4db0409f98>

Unsurprisingly, some of the countries that are high in total CO2 emissions are also the most populous countries. The trends between different regions can now be easily compared by faceting the data by region.

sns.relplot(x='co2_total', y='co2_per_capita', hue='income_group', size='population',
            data=world_data_2014, sizes=(40, 400), col='region', col_wrap=3, height=4)

<seaborn.axisgrid.FacetGrid at 0x7f4db03f4f98>

Already here we can get a pretty good idea of which some of these countries are. The high emission middle income countries in Asia are likely China and India, while the American country high in both total emissions and emissions per capita must be the USA. However, some observations are harder to resolve, such as which countries in Asia and the Americas have high CO2 emissions per capita.

Challenge 4¶
Let's use some of the aggregation methods from yesterday to complement the plots we have just made.

Find out which are the 10 countries with the highest co2 emissions per capita.

Find out which are the 10 countries with the highest total co2 emissions.

Which 10 countries have produced the most CO2 in total since the 1800s?

# Challenge 4 solutions

# 1.
world_data_2014.nlargest(10, 'co2_per_capita')

# 2.
world_data_2014.nlargest(10, 'co2_total')

# 3.
world_data.groupby('country')['co2_total'].sum().nlargest(10)

In addition to what we observed above, an interesting aspect to explore is how the relationship between per capita and total CO2 emissions has changed over time for different income groups. As we have seen before, this can be explored in a line graph, but if we instead subset certain years from the data and create a facet for each year, we can see the spread at each point in time.

world_data_1920_2018 = world_data.loc[world_data['year'].isin([1920, 1940, 1960, 1980, 2000, 2014])]

sns.relplot(x='co2_total', y='co2_per_capita', col='year', hue='income_group',
            data=world_data_1920_2018, col_wrap=3, height=3.5)

<seaborn.axisgrid.FacetGrid at 0x7f4db01d3978>

How to know which relationships to start exploring?¶

In the exercises above, we chose suitable variables to illustrate the plotting concepts. Often when doing EDA, it will not be as easy to know what comparison to start with. Unless you have good reason to look at a particular relationship, starting by plotting the pairwise relationships of all quantitative variables can be helpful.

# Use 2014 data since we know that there are CO2 measurements in that year
# This might take some time
sns.pairplot(world_data_2014)

The year column is not that insightful since there is only one year in the data. Removing that column gives more space for the rest of the plots.

sns.pairplot(world_data_2014.drop(columns='year'))

Each histogram on the diagonal shows the distribution of a single variable. The scatter plots below the diagonal show the relationship between two numerical variables. The scatter plots above the diagonal are mirror images of those below the diagonal.

Plotting all pairwise relationships can provide clues for what to explore next. For example, the relationship we explored above between total CO2 and CO2 per capita can also be seen here, as can the one between child mortality and children per woman. It is possible to quantify the strength of these relationships by computing the Pearson correlation coefficients between columns.

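# Keep only the numeric columns, since correlations are not defined for text columns
world_data_2014.drop(columns='year').select_dtypes(include='number').corr()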
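To illustrate what the Pearson correlation coefficient measures, it can be computed by hand for two columns and compared with the result from corr(). The numbers in this sketch are made up; only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Made-up numbers for two columns
toy = pd.DataFrame({'income': [1.0, 2.0, 4.0, 8.0, 16.0],
                    'life_expectancy': [50.0, 58.0, 65.0, 71.0, 80.0]})

x = toy['income']
y = toy['life_expectancy']

# Pearson's r: the covariance of x and y divided by the product of their
# standard deviations (the 1/(n-1) factors cancel out)
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# The same number is found off the diagonal of the correlation matrix
r_pandas = toy.corr().loc['income', 'life_expectancy']
```

A value close to 1 or -1 indicates a strong linear relationship, while a value close to 0 indicates a weak one.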

With so much data, it is slow for us to process all the information as numbers in a table. A higher bandwidth operation is to let our brain interpret colors for the strength of the relationships through a heatmap.

sns.heatmap(world_data_2014.drop(columns='year').select_dtypes(include='number').corr())

The heatmap can be made more informative by changing to a diverging colormap, which is generally recommended when there is a natural central value (such as 0 in our case). Optionally, the heatmap can be annotated with the correlation coefficients.

fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(world_data_2014.drop(columns='year').select_dtypes(include='number').corr(),
            annot=True, ax=ax, cmap='coolwarm')

There are more formal ways of interrogating variable interactions and their potential causality (such as regressions), but these are outside the scope of this lecture. However, the pairwise scatter plot and correlation coefficient matrix are quick means to get an informative overview of how the dataframe columns relate to each other.

Let's zoom in on the relationship between income and life expectancy, which appears to be quite strong.

# TODO Make this a challenge where they learn how to find things on stackoverflow
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014)

This relationship appears to be log linear and can be visualized with the x-axis set to log-scale.

Challenge 5¶

Find out how to change the x-axis to be log-scaled. Search online for how to change the scale of a matplotlib axes object. Remember that seaborn plots return matplotlib axes objects, so all matplotlib functions to modify the axes will work on this plot. Good sites to use are the documentation pages for the respective package, and stackoverflow. However, it is often the fastest to type in a well chosen query in your favorite search engine.

In the logged plot, color the dots according to the region of the observation.

# Challenge solutions
# 1.
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014)
ax.set_xscale('log')

# Challenge solutions
# 2.
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014, hue='region')
ax.set_xscale('log')

Another interesting relationship we could see from the pairplot is how child mortality relates to how many children are born per woman. We can filter out years of the data and look at how the relationship has changed over time using the same approach as for the CO2 data.

world_data_1920_2018 = world_data.loc[world_data['year'].isin([1920, 1940, 1960, 1980, 2000, 2018])]

sns.relplot(x='children_per_woman', y='child_mortality', col='year', hue='income_group',
            data=world_data_1920_2018, col_wrap=3, height=3.5)

<seaborn.axisgrid.FacetGrid at 0x7f4d980e26a0>

A common misconception is that saving poor children will lead to overpopulation. However, we can see that lower child mortality is correlated with smaller family sizes. As more children survive, parents feel more secure with a smaller family size. Reducing poverty is also related to these variables, since most high income countries are found in the lower left corner of the plots (remember that the income group is classified based on the income in 2018 and not for each year that is being plotted above).

It is important to note that from a plot like this, it is not possible to tell causation, just correlation. However, in the gapminder video library there are a few videos on this topic (including this and this one), discussing how reducing poverty can help slow down population growth through decreased family sizes. Current estimates suggest that the world population will stabilize around 11 billion people and the average number of children per woman will be close to two worldwide in year 2100.

Exploring a single quantitative variable across multiple levels of a categorical variable¶

When exploring a single quantitative variable, we can choose between plotting every data point (e.g. categorical scatterplots such as swarm plots and strip plots), an approximation of the distribution (e.g. histograms and violinplots), or distribution statistics such as measures of central tendency (e.g. boxplots and barplots).

A good place to start is to visualize the variable's distribution with distplot(). Let's look at life expectancy during 2018 using this technique.

world_data_2018 = world_data.loc[world_data['year'] == 2018]
sns.distplot(world_data_2018['life_expectancy'])

The line represents a KDE (kernel density estimate), as seen previously in the grouped pairplot. Conceptually, this is similar to a smoothed histogram.
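To make the "smoothed histogram" idea concrete, here is a minimal sketch of a Gaussian KDE written with NumPy. This is not seaborn's implementation; the function names `gaussian_kde_sketch` and `count_peaks`, the sample values, and the two bandwidths are all made up for illustration. Each observation contributes one small bell curve, and the bandwidth sets how wide each curve is.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

def count_peaks(density):
    """Number of strict local maxima in a curve sampled on a grid."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))

# Invented life expectancy values forming two clusters
values = [58, 60, 61, 63, 64, 78, 80, 81, 82, 83]
# Grid deliberately much wider than the data so that the tails are included
grid = np.linspace(-20, 160, 1801)

narrow = gaussian_kde_sketch(values, grid, bandwidth=1.5)
wide = gaussian_kde_sketch(values, grid, bandwidth=15)

# A density should integrate to roughly one
step = grid[1] - grid[0]
print('area:', narrow.sum() * step, wide.sum() * step)
print('peaks:', count_peaks(narrow), count_peaks(wide))
```

With the small bandwidth the two clusters show up as separate peaks; with the large one they merge into a single broad hump.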

distplot() can be customized to change the number of bins and the bandwidth of the kernel. By default, both are calculated according to heuristics for what should be good values for the underlying data, but it is good to know how to change them.

sns.distplot(world_data_2018['life_expectancy'], bins=30, rug=True,
             kde_kws={'bw':1, 'color':'black'})

<matplotlib.axes._subplots.AxesSubplot at 0x7f4d98067e10>

The rug plot along the x-axis shows exactly where each data point resides. To compare distributions between the values of a categorical variable, violinplots are often used. These consist of two KDEs mirrored across a midline.

sns.violinplot(x='life_expectancy', y='income_group', data=world_data_2018)

<matplotlib.axes._subplots.AxesSubplot at 0x7f4d98f25828>

Since income_group was defined as an ordered categorical variable previously, this order is preserved when distributing the income groups along the y-axis.
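As a reminder of what an ordered categorical looks like, the sketch below builds one with `pd.Categorical` on a few made-up rows. How `income_group` was actually defined earlier in the lesson is not shown in this section, so treat the category order here as an assumption.

```python
import pandas as pd

# Assumed order, from lowest to highest income
income_order = ['Low', 'Lower middle', 'Upper middle', 'High']

toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],  # placeholder names
    'income_group': ['High', 'Low', 'Upper middle', 'Lower middle'],
})
toy['income_group'] = pd.Categorical(
    toy['income_group'], categories=income_order, ordered=True)

# Sorting and comparisons now follow the declared order, not the alphabet
sorted_groups = toy.sort_values('income_group')['income_group'].tolist()
above_low = (toy['income_group'] > 'Low').tolist()
print(sorted_groups)
print(above_low)
```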

There is notable variation in life expectancy between income groups: people in wealthier countries live longer. This variation contributes to the multimodality seen in the first distribution plot of life expectancy for all countries in the world. However, there is also large overlap between income groups and variation within the groups, so more variables than income affect life expectancy.

Dissecting a multimodal distribution in this manner, to find underlying variables that explain why it appears to consist of several smaller distributions, is common practice during exploratory data analysis (EDA). It looks like some income groups, e.g. "high", still have multimodal distributions. To explore these further, faceting can be used just as before. The categorical equivalent of relplot is catplot (categorical plot).

sns.catplot(x='life_expectancy', y='income_group', data=world_data_2018)

<seaborn.axisgrid.FacetGrid at 0x7f4d93fcb4a8>

The default in catplot is to create a stripplot, a categorical scatterplot where the dots are randomly jittered to not overlap. This is fast to create, but it is sometimes hard to see how many dots are in a group due to overlap. A more ordered approach is to create another type of categorical scatterplot, called swarmplot, where the dots are positioned to avoid overlap.

sns.catplot(x='life_expectancy', y='income_group', data=world_data_2018, kind='swarm')

<seaborn.axisgrid.FacetGrid at 0x7f4d980ba668>

The swarm plot communicates the shape of the distribution more clearly than the stripplot. Here, we can see the same bimodality in the high income group as in the violinplot, which was hard to see in the stripplot.

A drawback is that swarmplots can be slow to create for large datasets. For really large datasets, even stripplot is slow, and it is necessary to approximate the distributions (e.g. with a violinplot) or show distribution statistics (e.g. with a boxplot) instead of showing each observation.
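When a dataset is too large to draw every observation, the numbers behind a boxplot can also be computed directly with pandas. A minimal sketch on made-up values (not the gapminder data), with column names mirroring the ones used above:

```python
import pandas as pd

# Invented values for two income groups
toy = pd.DataFrame({
    'income_group': ['Low'] * 5 + ['High'] * 5,
    'life_expectancy': [55.0, 58.0, 60.0, 63.0, 70.0,
                        74.0, 79.0, 81.0, 82.0, 84.0],
})

# Quartiles per group: the box of a boxplot spans the 25% to 75% quantiles
quartiles = (toy.groupby('income_group')['life_expectancy']
                .quantile([0.25, 0.5, 0.75])
                .unstack())
print(quartiles)
```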

We can facet by income group to find out if regional differences are related to income level.

# TODO Will update this to look prettier
sns.catplot(x='life_expectancy', y='region', data=world_data_2014, kind='box',
            col='income_group', col_wrap=2)

The variable levels are automatically ordered, and it is easy to see how life expectancy generally grows with higher average income. We can also look at change over time. In contrast to a line plot of the average change over time, plotting one violin per year (as in the plot below) shows how the distribution itself changes, not just the average. While countries in general have increased their life expectancy, they differ in how they have done it: Europe and the Americas have gone from a mix of high and low life expectancy levels to tighter distributions where all countries have high life expectancy, whereas Africa has transitioned from most countries having low life expectancy to a wide range of life expectancies depending on the country.

# If both columns can be interpreted as numerical,
# the `orient` keyword can be added to be explicit
sns.catplot(x='life_expectancy', y='year', orient='horizontal', data=world_data_1920_2018, kind='violin',
            col='region', col_wrap=3, color='lightgrey')

Let's explore how much of the variation during the transition in African life expectancy can be explained by geographically close regions performing differently. First, how many sub-regions are there in each region?

world_data_1920_2018.groupby('region')['sub_region'].nunique()

region
Africa      2
Americas    2
Asia        5
Europe      4
Oceania     4
Name: sub_region, dtype: int64

Africa has two sub-regions; let's find out which ones.

world_data_1920_2018.groupby('region')['sub_region'].unique()

region
Africa                  [Northern Africa, Sub-Saharan Africa]
Americas    [Latin America and the Caribbean, Northern Ame...
Asia        [Southern Asia, Western Asia, South-eastern As...
Europe      [Southern Europe, Western Europe, Eastern Euro...
Oceania     [Australia and New Zealand, Melanesia, Microne...
Name: sub_region, dtype: object

Let's see if Sub-Saharan and Northern Africa have developed differently when it comes to life expectancy.

# The split parameter saves some space and looks slick
africa = world_data_1920_2018.loc[world_data_1920_2018['region'] == 'Africa']
sns.catplot(x='life_expectancy', y='year', orient='horizontal', data=africa, kind='violin',
            hue='sub_region', palette='pastel', split=True)

<seaborn.axisgrid.FacetGrid at 0x7f4d93955f28>

For the last challenge, we will explore how an education indicator varies between men and women. First, let's check which years have data on women's years in school.

world_data.dropna(subset=['years_in_school_women'])['year'].agg(['min', 'max'])

min    1970
max    2015
Name: year, dtype: int64

Challenge¶

Subset the dataframe to the years 1975, 1995, and 2015.

Create a new column with the ratio of women's to men's years in school.

Plot how this ratio has changed over time, first for each region and then for each income group.

# Challenge solutions
# 1.
world_data_1970_2015 = world_data.loc[world_data['year'].isin([1975, 1995, 2015])].copy()

# 2.
world_data_1970_2015['women_men_school_ratio'] = world_data_1970_2015['years_in_school_women'] / world_data_1970_2015['years_in_school_men']
# world_data_1970_2015['women_men_school_ratio']

# 3a.
sns.catplot(y='women_men_school_ratio', x='year', data=world_data_1970_2015, hue='region', dodge=True, kind='point')

<seaborn.axisgrid.FacetGrid at 0x7f4d980d9f60>

# 3b.
sns.catplot(y='women_men_school_ratio', x='year', data=world_data_1970_2015, hue='income_group', dodge=True, kind='point')

<seaborn.axisgrid.FacetGrid at 0x7f4d93878710>
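The dots in a point plot show the mean of each group (seaborn's default estimator for kind='point'), so the same numbers can be tabulated with a pandas pivot table. The sketch below uses a tiny invented dataframe in place of `world_data_1970_2015`; the values are made up for illustration.

```python
import pandas as pd

# Invented rows standing in for world_data_1970_2015
toy = pd.DataFrame({
    'year': [1975, 1975, 2015, 2015],
    'region': ['Africa', 'Europe', 'Africa', 'Europe'],
    'years_in_school_women': [1.0, 8.0, 6.0, 12.0],
    'years_in_school_men': [2.0, 8.0, 8.0, 12.0],
})
toy['women_men_school_ratio'] = (
    toy['years_in_school_women'] / toy['years_in_school_men'])

# One row per year, one column per region, mean ratio in each cell
mean_ratio = toy.pivot_table(index='year', columns='region',
                             values='women_men_school_ratio', aggfunc='mean')
print(mean_ratio)
```

A ratio of 1 means that women and men spend the same number of years in school; values below 1 mean fewer years for women.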
