Do you remember how to:
1 - Read in the data from the csv file from yesterday?
import pandas as pd
world_data = pd.read_csv('https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv')
# If saved locally yesterday:
# world_data = pd.read_csv("world_data.csv")
2 - How to select only the columns 'country' and 'year' from the dataframe?
world_data[['country', 'year']].head() #head just to limit output
3 - How to select a few rows together with the columns above?
world_data.loc[[1, 13, 24], ['country', 'year']]
4 - How to select only data from year 1995?
world_data.loc[world_data['year'] == 1995]
5 - Select only the rows where the region is Asia or Africa.
world_data.loc[world_data['region'].isin(['Asia', 'Africa'])]
6 - Calculate the total population in each region
world_data.groupby('region')['population'].sum()
7 - Get the number of countries in each region for the year 2018.
world_data.loc[world_data['year'] == 2018].groupby('region').size()
The human visual system is one of the most advanced apparatuses for detecting patterns and it allows for quick exploration of complex visual relationships. Data visualization is therefore a quick, efficient way of unearthing clues to interesting features in the data that can later be investigated in a robust, quantitative manner. Visualizations are also unparalleled in communicating insights drawn from data. For these reasons, it is important to possess the skills to graphically represent the data in a way that is efficient for humans to process.
There are many plotting packages in Python, making it possible to create diverse visualizations such as interactive web graphics, 3D animations, statistical visualizations, and map-based plots. When starting out, it can be helpful to find an example of a plot similar to the one you want to create and then copy and modify that code. Examples of plots can be found in many excellent online Python plotting galleries, such as those from matplotlib, seaborn, and the Python graph gallery.
Our focus will be on two of the most useful packages for creating publication quality visualizations:
matplotlib,
which is a robust, detail-oriented, low level plotting interface,
and seaborn,
which provides high level functions on top of matplotlib
and allows the plotting calls to be expressed in terms of what is being explored in the underlying data
rather than what graphical elements to add to the plot.
The high-level figures created by seaborn
can be configured via the matplotlib parameters,
so learning these packages in tandem is useful.
%matplotlib inline
# Note that this will only need to be done the first time you create a plot in a notebook
# all subsequent plots will show up as expected.
To facilitate our understanding of plotting concepts,
the initial examples here will not include dataframes,
but instead have simple lists holding just a few data points.
To create a line plot,
the plot()
function from matplotlib.pyplot
can be used.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [1, 2, 4, 3]
plt.plot(x, y)
Using plot() like this is not very explicit
and a few things happen "under the hood",
e.g. a figure is automatically created
and it is assumed that the plot should go into the currently active region of this figure.
This gives little control over exactly where to place the plots within a figure
and how to make modifications to the plot after creating it,
e.g. adding a title or labeling the axes.
To facilitate modifications to the plot,
it is recommended to use the object oriented plotting interface in matplotlib,
where an empty figure and at least one axes object are explicitly created
before a plot is added to it.
This figure and its axes objects are assigned to variable names
which are then used for plotting.
In matplotlib,
an axes object refers to what you would often call a subplot colloquially
and it is named "axes" because it consists of an x-axis and a y-axis by default.
fig, ax = plt.subplots()
Calling subplots()
returns two objects,
the figure and its axes object.
Plots can be added to the axes object of the figure
by using the name we assigned to the returned axes object (ax
by convention).
fig, ax = plt.subplots()
ax.plot(x, y)
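To make the roles of the two objects more concrete, the following minimal sketch inspects what `subplots()` and `plot()` hand back. The variable name `lines` is only chosen for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]

fig, ax = plt.subplots()
lines = ax.plot(x, y)  # plot() returns a list of the line objects it added

# The figure is the whole canvas, the axes is the plotting region inside it
print(isinstance(fig, Figure), isinstance(ax, Axes))
# The axes knows which figure it belongs to and keeps track of what is drawn on it
print(ax.figure is fig, len(ax.lines))
```

Because the plot elements are stored on the axes object, they can be looked up and modified after the plot has been created.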
To create a scatter plot,
use scatter() instead of plot().
fig, ax = plt.subplots()
ax.scatter(x, y)
Plots can also be combined together in the same axes. The line style and marker color can be changed to facilitate viewing the elements in the combined plot.
fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')
And plot elements can be resized.
fig, ax = plt.subplots()
ax.scatter(x, y, color='red', s=100)
ax.plot(x, y, linestyle='dashed', linewidth=3)
It is common to modify the plot after creating it, e.g. adding a title or labeling the axes.
fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')
ax.set_title('Line and scatter plot')
ax.set_xlabel('Measurement X')
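As a small sketch of the same idea, several properties can be set in one call with the axes `set()` method, which accepts the same properties as the individual `set_*` methods. The label texts are placeholders.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]

fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')
# Equivalent to calling set_title(), set_xlabel() and set_ylabel() one by one
ax.set(title='Line and scatter plot',
       xlabel='Measurement X',
       ylabel='Measurement Y')
```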
The scatter and line plots can easily be separated into two subplots within the same figure,
by telling plt.subplots()
to create a figure with one row and two columns
(so two subplots side by side).
This returns two axes objects,
one for each subplot,
which we assign to the variable names ax1 and ax2.
fig, (ax1, ax2) = plt.subplots(1, 2)
# The default is (1, 1), that's why it does not need
# to be specified with only one subplot
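For grids with more than one row and more than one column, `subplots()` returns the axes objects in a two-dimensional NumPy array rather than as a flat sequence. The sketch below, using the same small lists as before, shows two ways of reaching the individual axes.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]

# Two rows and two columns give a 2x2 array of axes objects
fig, axes = plt.subplots(2, 2)
print(axes.shape)

# Individual axes are picked out by [row, column] ...
axes[0, 0].plot(x, y)
axes[1, 1].scatter(x, y)

# ... and `flat` visits all of them in turn, which is handy in loops
for number, ax in enumerate(axes.flat, start=1):
    ax.set_title(f'Subplot {number}')
```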
To prevent plot elements,
such as the axis tick labels,
from overlapping,
the tight_layout() method can be used.
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.tight_layout()
The figure size can easily be controlled when it is created.
# `figsize` refers to the size of the figure in inches when printed
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
fig.tight_layout()
Bringing it all together to separate the line and scatter plot.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(x, y, color='red')
ax2.plot(x, y, linestyle='dashed')
ax1.set_title('Scatter plot')
ax2.set_title('Line plot')
fig.tight_layout()
Challenge 1
- There is a plethora of colors available to use in matplotlib. Change the color of the line and the dots in the figure using your favorite color from the list of named colors in the matplotlib documentation.
- Use the documentation to change the styling of the line in the line plot and the type of marker used in the scatter plot (you might need to search online to figure this out).
Figures can be saved by calling the savefig() method
and specifying the name of the file to create.
The resolution of the figure can be controlled by the dpi parameter.
fig.savefig('scatter-and-line.png', dpi=300)
In the JupyterLab file browser, you can see that a new image file has been created. A PDF file can be saved by changing the extension in the specified file name. Since PDF is a vector file format, a resolution does not need to be specified.
fig.savefig('scatter-and-line.pdf')
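The following sketch saves the same figure in both formats and lists the resulting files. The temporary directory and the file names are only there to keep the example self-contained, and `bbox_inches='tight'` is an optional extra that trims surrounding whitespace.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 2, 4, 3])

# A temporary directory is used here only to keep the example self-contained
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'line.png')
pdf_path = os.path.join(out_dir, 'line.pdf')

# The file format is inferred from the file extension
fig.savefig(png_path, dpi=300, bbox_inches='tight')
fig.savefig(pdf_path)

print(sorted(os.listdir(out_dir)))
```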
This concludes the customization section. The concepts taught here will be applied in the next section on how to choose a suitable plot type for data sets with many observations.
If the dataframe from the previous lecture is not loaded, read it in first.
import pandas as pd
# world_data = pd.read_csv('../world-data-gapminder.csv')
# If not saved to disk yesterday
url = 'https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv'
world_data = pd.read_csv(url)
We can use scatter()
with the data
parameter
to plot columns from the dataframe.
fig, ax = plt.subplots()
ax.scatter(x='year', y='population', data=world_data)
This graph is not immediately intuitive because one scatter dot has been added for each year for every country. To instead see how the world's total population has changed over the years, the populations of all countries need to be summed for each year. This can be done using the dataframe techniques from the previous lecture.
# One could also do `as_index=False` with `groupby()`
world_pop = world_data.groupby('year')['population'].sum().reset_index()
fig, ax = plt.subplots()
ax.scatter(x='year', y='population', data=world_pop)
This plot shows that the world population has been steadily increasing since the 1800s and dramatically picked up pace in the 1950s.
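The comment in the code above mentions `as_index=False` as an alternative to `reset_index()`. A minimal sketch with a made-up dataframe (the name `toy` and all numbers are invented for illustration) suggests that the two routes give the same result:

```python
import pandas as pd

# Made-up numbers, only for illustration
toy = pd.DataFrame({
    'country': ['A', 'B', 'A', 'B'],
    'year': [1800, 1800, 1900, 1900],
    'population': [10, 20, 30, 40],
})

# Summing per year leaves `year` as the index of a Series ...
as_series = toy.groupby('year')['population'].sum()
# ... and reset_index() turns it back into a regular column
with_reset = as_series.reset_index()
# as_index=False keeps `year` as a column in one step
with_flag = toy.groupby('year', as_index=False)['population'].sum()

print(with_reset)
print(with_flag)
```

Keeping `year` as a regular column matters here because the `data` parameter looks up 'year' and 'population' by column name.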
It is possible to use matplotlib
in this way to explore visual relationships in a dataframe.
However,
it is a little cumbersome already with these simple examples
and it will get more complicated once we want to include more variables,
e.g. stratifying the data in subplots based on region and income level
would include writing double loops and keeping track of plot
layout and grouping variables manually.
The Python package seaborn is designed for effectively exploring data visually
without getting bogged down in technical plotting details.
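To illustrate the bookkeeping that plain matplotlib requires, here is a minimal sketch that draws one line per group by looping over a made-up dataframe. The name `toy` and all values are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, only for illustration
toy = pd.DataFrame({
    'year': [1800, 1900, 2000] * 2,
    'population': [1, 2, 4, 2, 3, 9],
    'region': ['North'] * 3 + ['South'] * 3,
})

fig, ax = plt.subplots()
# One plotting call per group: the loop and the labels are managed by hand
for region_name, region_data in toy.groupby('region'):
    ax.plot(region_data['year'], region_data['population'], label=region_name)
ax.legend(title='region')
```

With a second grouping variable spread over subplots, this turns into nested loops over both the groups and the axes.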
seaborn
When visually exploring data with lots of variables, it is in many cases easier to think in terms of what is to be explored in the data, rather than what graphical elements are to be added to the plot. For example, instead of instructing the computer to "go through a dataframe and plot any observations of country X in blue, any observations of country Y in red, etc", it is easier to just type "color the data by country". There are many benefits to using a so-called descriptive syntax, instead of an imperative one.
Facilitating semantic mappings of data variables to graphical elements
is one of the goals of the seaborn plotting package.
Thanks to its functional way of interfacing with data,
only minimal changes are required if the underlying data change
or to switch the type of plot used for the visualization.
seaborn
provides a language that facilitates thinking about data
in ways that are conducive for exploratory data analysis
and allows for the creation of publication quality plots
with minimal adjustments and tweaking.
The seaborn syntax was introduced briefly already in the introductory lecture and it is similar to how matplotlib plots dataframes. For example, to make the same scatter plot as above:
import seaborn as sns
sns.scatterplot(x='year', y='population', data=world_pop)
In addition to providing a data-centric syntax,
seaborn also facilitates visualization of common statistical aggregations.
For example,
when creating a line plot in seaborn,
the default is to aggregate and average all observations with the same value on the x-axis,
and to create a shaded region representing the 95% confidence interval for these observations.
sns.lineplot(x='year', y='population', data=world_data)
In this case, it would be more appropriate to have the shaded area describe the variation in the data, such as the standard deviation, rather than an inference about the reproducibility, such as the default 95% CI.
sns.lineplot(x='year', y='population', data=world_data, ci='sd')
To change from showing the average world population per country and year
to showing the total population for all countries per year,
the estimator parameter can be used.
Here,
the shaded area is also removed with ci=None.
# The `estimator` parameter is currently non-functional for sns.scatterplot, but will be added soon
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
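To get a feeling for the aggregations involved, the same kind of per-year summaries can be computed directly with pandas. This is only a sketch on a made-up dataframe (the name `toy` and all numbers are invented), and it roughly mirrors what the plots show rather than reproducing seaborn's internals.

```python
import pandas as pd

# Made-up numbers, only for illustration
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'] * 2,
    'year': [1900] * 3 + [2000] * 3,
    'population': [1, 2, 3, 2, 4, 6],
})

# For each value on the x-axis: the mean (the default estimator),
# the standard deviation (the spread shown with ci='sd'),
# and the sum (what estimator='sum' draws)
summary = toy.groupby('year')['population'].agg(['mean', 'std', 'sum'])
print(summary)
```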
Before continuing the exploration of the world population data,
let's discuss how to customize the appearance of our plots.
The object returned from a seaborn plotting function such as lineplot() is a matplotlib axes,
so all configuration available through matplotlib
can be applied to the returned object by first assigning it to a variable name (ax by convention).
ax = sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
ax.set_title('World population since the 1800s', fontsize=16)
ax.set_xlabel('Year', fontsize=12)
In addition to all the customization available through the standard matplotlib
syntax,
seaborn
also offers its own functions for changing the appearance of the plots.
In essence,
these functions are shortcuts to change several matplotlib parameters simultaneously.
For example,
a more effective approach than setting individual font sizes or colors of graphical elements
is to set the overall size and style for all graphs.
sns.set(context='talk', style='darkgrid', palette='pastel')
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
These functions are analogous to making changes in the settings menu of a graphical program and they will apply to all following plots.
Challenge 2
Find out which styles and contexts are available in seaborn. Try some of them out and choose your favorite style and context. Hint: this information is available both through the built-in and the online documentation.
For the rest of this tutorial, the ticks
style will be used.
sns.set(context='notebook', style='ticks', font_scale=1.4)
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
For styles that include the frame around the plot,
there is a special seaborn
function to remove the top- and rightmost borders
(again by modifying matplotlib
parameters under the hood).
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
sns.despine()
If the style options exposed through seaborn
are not sufficient,
it is possible to change all plot parameters directly through the matplotlib
rc and style interfaces.
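As a minimal sketch of the matplotlib side of this, `rc_context()` applies a set of rc parameters temporarily, and `plt.style.available` lists the predefined style sheets. The particular parameters below are only examples.

```python
import matplotlib.pyplot as plt

# Settings changed through rc_context() only apply inside the `with` block
with plt.rc_context({'axes.spines.top': False,
                     'axes.spines.right': False,
                     'lines.linewidth': 3}):
    fig, ax = plt.subplots()
    line, = ax.plot([1, 2, 3, 4], [1, 2, 4, 3])

# The names of the predefined style sheets that can be passed to plt.style.use()
print(plt.style.available)
```

Assigning to `plt.rcParams` directly changes the same settings for all following plots instead of temporarily.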
As mentioned above,
the main strength of a descriptive plotting syntax
lies in describing the plot appearance in human-friendly vocabulary
and having the computer assign variables to graphical objects accordingly.
For example,
to plot subsets of the data in different colors,
the hue
parameter can be used.
sns.lineplot(x='year', y='population', hue='income_group',
data=world_data, ci=None, estimator='sum')
This stratification of the income groups reveals that the population growth has been the fastest in middle income countries.
The plot can be made more accessible (especially to those with color vision deficiency) by changing the style of each line instead of only relying on color to separate them.
sns.lineplot(x='year', y='population', hue='income_group', style='income_group',
data=world_data, ci=None, estimator='sum')
Just like in the previous lecture,
the values of the ordinal variable income_group
are not listed in an intuitive order.
A custom order can easily be specified by passing a list to the hue_order parameter,
but this would have to be done for every plot.
A more effective approach is to encode the order in the dataframe itself,
using the top level pandas function Categorical().
world_data['income_group'] = (
pd.Categorical(world_data['income_group'], ordered=True,
categories=['Low', 'Lower middle', 'Upper middle', 'High'])
)
world_data['income_group'].dtype
sns.lineplot(x='year', y='population', hue='income_group', style='income_group',
data=world_data, ci=None, estimator='sum')
The legend now lists the colors in the expected order. This modification also ensures that when making plots with income groups on the x- or y-axis, they will be plotted in the expected order.
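The effect of an ordered categorical can also be seen without plotting. The sketch below uses a small made-up series (the names `levels` and `ordered_levels` are invented for illustration) to compare sorting before and after the conversion.

```python
import pandas as pd

# Made-up values, only for illustration
levels = pd.Series(['High', 'Low', 'Upper middle', 'Lower middle'])
# Plain strings are sorted alphabetically
print(levels.sort_values().tolist())

ordered_levels = pd.Series(pd.Categorical(
    levels, ordered=True,
    categories=['Low', 'Lower middle', 'Upper middle', 'High']))
# Ordered categoricals are sorted according to the category order
print(ordered_levels.sort_values().tolist())

# They also support comparisons against one of the categories
print((ordered_levels > 'Lower middle').tolist())
```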
It is difficult to explore multiple categorical relationships within one single plot.
For example,
to see how the income groups compare within each region,
the hue
and style
variables could be used for different variables,
but this makes the plot dense and difficult to interpret.
sns.lineplot(x='year', y='population', hue='income_group', style='region',
data=world_data, ci=None, estimator='sum')
An effective approach for exploring multiple categorical variables in a data set is to plot so-called "small multiples" of the data, where the same type of plot is used for different subsets of the data. These subplots are drawn in rows and columns forming a grid pattern, and can be referred to as a "facet", "lattice" or "trellis" plot.
Visualizing categorical variables in this manner is a key step in exploratory data analysis,
and thus seaborn has a dedicated plot function for this,
called relplot()
(for "relational plot" since it visualizes the relationships between numerical variables).
The syntax of relplot() is very similar to lineplot(),
but we need to specify that the kind of plot we want is a line plot.
# Create the same plot as above
sns.relplot(x='year', y='population', hue='income_group', style='income_group', kind='line',
data=world_data, ci=None, estimator='sum')
The region
variable can now be mapped to different facets/subplots in a grid pattern.
# TODO switch this to some more interesting column if I have time
sns.relplot(x='year', y='population', data=world_data, estimator='sum',
kind='line', hue='income_group', col='region', ci=None)
It's a little hard to see because the figure is very wide
and has been shrunk to fit in the notebook.
To avoid this,
relplot()
can use the col_wrap
parameter to distribute the plots over several rows.
The height
and aspect
parameters can be used to set the height and width of each facet.
sns.relplot(x='year', y='population', data=world_data, estimator='sum',
kind='line', hue='income_group', col='region', ci=None,
col_wrap=3, height=2.5, aspect=1.3)
Faceting the plot by region reveals that the largest absolute population increase occurred among middle income countries in Asia. We will soon look closer at which countries these are.
The returned object from relplot()
is a grid
(a special kind of figure)
with many axes,
and can therefore not be placed within a preexisting figure.
It is saved just as any matplotlib figure with savefig(),
but has some special methods for easily changing the aesthetics of each axes.
g = sns.relplot(x='year', y='population', data=world_data,
kind='line', hue='income_group', col='region', ci=None,
col_wrap=3, height=2.5, aspect=1.3)
g.set_titles('{col_name}', y=0.95)
g.set_axis_labels(y_var='Population', x_var='Year')
g.savefig('grid-figure.png')
Remember that names such as fig, ax, and here g,
are only by convention,
and any variable name could have been used.
We might want the color to indicate income group,
but draw separate lines for each country.
For this we can set units='country'
and estimator=None
(so don't aggregate,
just draw one line per country with the raw values).
sns.relplot(x='year', y='population', data=world_data, estimator=None, units='country',
kind='line', hue='income_group', col='region', ci=None,
col_wrap=3, height=2.5, aspect=1.3)
Two countries in Asia stand out in terms of total population. To find out which these are, we can filter the data.
world_data.loc[world_data['year'] == 2018].nlargest(8, 'population')
Challenge 3
- To find out the total amount of CO2 released into the atmosphere, use the co2_per_capita and population columns to create a new column: co2_total.
- Plot the total CO2 per year for the world.
- Plot the total CO2 per year for the world and for each region.
- Create a faceted plot comparing total CO2 levels across income groups and regions.
# Challenge 3 solutions
# 1.
world_data['co2_total'] = world_data['co2_per_capita'] * world_data['population']
# 2.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum')
# 3.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum', hue='region')
# 4.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum',
hue='income_group', col='region', col_wrap=3, height=4)
# Discuss what these plots tell us:
# The world's total CO2 emissions are rapidly increasing. Europe and the Americas have been the highest emitters for
# many years, but have recently been overtaken by Asia, which is now producing around twice the amount of CO2 compared
# to Europe and the Americas. But don't forget that we saw in the last lecture that the population in Asia is 5-6 times bigger
# than in Europe and the Americas!
# It's important to look at the total production from a country, because change within that single country has big
# potential of reaching many people. Not plotted here, but also important, is to explore which countries are high in CO2 per capita
# since these might have more room to reduce the production. Of course, reality is more complicated. Some countries
# might import goods that demand high CO2 production in their manufacturing country instead of producing these goods themselves,
# so they might "sponsor" the production in another country, but would not show up high in this list.
To continue exploring the CO2 emissions we started to look at in the last challenge,
let's use the other type of plot for comparing quantitative variables:
scatterplot().
This is the default in the relplot() function,
so we don't need to specify kind='scatter'.
As mentioned in the discussion above, in addition to considering the total amount of CO2 produced per country, it can be insightful to explore the CO2 produced per citizen.
sns.relplot(x='co2_total', y='co2_per_capita', data=world_data)
This looks funky,
and not quite as expected...
The reason is that we have plotted multiple data points per country,
one for each year!
This can be confusing
since we don't know which dot is for which year
and this plot is probably not what we wanted to create.
Instead,
we can filter the data to focus on a specific year.
Unfortunately,
there are no CO2 measurements available for the last few years.
To find out in which years there are countries with CO2 measurements,
we can drop the NAs in co2_per_capita
and look at the min and max value.
world_data.dropna(subset=['co2_per_capita'])['year'].agg(['min', 'max'])
Now we can subset the data for the latest available year with CO2 measurements, which is 2014.
world_data_2014 = world_data.loc[world_data['year'] == 2014]
sns.relplot(x='co2_total', y='co2_per_capita', data=world_data_2014)
This reveals that a few countries in the world stand out clearly from the rest, and that one country is rather high in both measurements.
Just as before,
it is possible to map plot semantics and facet the plot according to variables in the data set.
scatterplot()
can also scale the dot size according to a variable in the data set.
# `sizes` controls the dots min and max size
sns.relplot(x='co2_total', y='co2_per_capita', hue='income_group', size='population',
data=world_data_2014, sizes=(40, 400))
Unsurprisingly, some of the countries with high total CO2 emissions are also the most populous countries. The trends between different regions can now be easily compared by faceting the data by region.
sns.relplot(x='co2_total', y='co2_per_capita', hue='income_group', size='population',
data=world_data_2014, sizes=(40, 400), col='region', col_wrap=3, height=4)
Already here we can get a pretty good idea of which some of these countries are. The high-emission middle income countries in Asia are likely China and India, while the American country high in both total emissions and emissions per capita must be the USA. However, some observations are harder to resolve, such as which countries in Asia and the Americas have high CO2 emissions per capita.
Challenge 4
Let's use some of the aggregation methods from yesterday to complement the plots we have just made.
- Find out which are the 10 countries with the highest CO2 emissions per capita.
- Find out which are the 10 countries with the highest total CO2 emissions.
- Which 10 countries have produced the most CO2 in total since the 1800s?
# Challenge 4 solutions
# 1.
world_data_2014.nlargest(10, 'co2_per_capita')
# 2.
world_data_2014.nlargest(10, 'co2_total')
# 3.
world_data.groupby('country')['co2_total'].sum().nlargest(10)
In addition to what we observed above, an interesting aspect to explore is how the relationship between per capita and total CO2 emissions has changed over time for different income groups. As we have seen before, this can be explored in a line graph, but if we instead subset certain years from the data and create a facet for each year, we can see the spread at each point in time.
world_data_1920_2018 = world_data.loc[world_data['year'].isin([1920, 1940, 1960, 1980, 2000, 2014])]
sns.relplot(x='co2_total', y='co2_per_capita', col='year', hue='income_group',
data=world_data_1920_2018, col_wrap=3, height=3.5)
In the exercises above, we chose suitable variables to illustrate the plotting concepts. Often when doing EDA, it will not be as easy to know what comparison to start with. Unless you have good reason to look at a particular relationship, starting by plotting the pairwise relationships of all quantitative variables can be helpful.
# Use 2014 data since we know that there are CO2 measurements in that year
# This might take some time
sns.pairplot(world_data_2014)
The year column is not that insightful since there is only one year in the data. Removing that column gives more space for the rest of the plots.
sns.pairplot(world_data_2014.drop(columns='year'))
Each plot on the diagonal is a histogram showing the distribution of a single variable. The scatter plots below the diagonal show the relationship between two numerical variables, and those above the diagonal are mirror images of those below it.
Plotting all pairwise relationships can provide clues for what to explore next. For example, the relationship between child mortality and children per woman, and the one we explored above between total CO2 and CO2 per capita, can both be seen here. It is possible to quantify the strength of these relationships by computing the Pearson correlation coefficients between columns.
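# Correlations are only defined for the numerical columns,
# and recent pandas versions raise an error if text columns are included
world_data_2014.drop(columns='year').select_dtypes('number').corr()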
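To show what the coefficient measures, the sketch below computes Pearson's r by hand for two made-up columns and compares it with the value from `corr()`. The dataframe `toy` and its numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up numbers, only for illustration
toy = pd.DataFrame({
    'income': [1.0, 2.0, 3.0, 4.0, 5.0],
    'life_expectancy': [50.0, 55.0, 61.0, 64.0, 70.0],
})

# Pearson's r: how much the two columns vary together,
# scaled by how much each of them varies on its own
x_dev = toy['income'] - toy['income'].mean()
y_dev = toy['life_expectancy'] - toy['life_expectancy'].mean()
r_by_hand = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

r_pandas = toy.corr().loc['income', 'life_expectancy']
print(r_by_hand, r_pandas)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.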
With so much data, it is slow for us to process all the information as numbers in a table. A higher bandwidth operation is to let our brain interpret colors for the strength of the relationships through a heatmap.
sns.heatmap(world_data_2014.drop(columns='year').select_dtypes('number').corr())
The heatmap can be made more informative by changing to a diverging colormap, which is generally recommended when there is a natural central value (such as 0 in our case). Optionally, the heatmap can be annotated with the correlation coefficients.
fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(world_data_2014.drop(columns='year').select_dtypes('number').corr(), annot=True, ax=ax, cmap='coolwarm')
There are more formal ways of interrogating variable interactions and their potential causality (such as regressions), but these are outside the scope of this lecture. However, the pairwise scatter plot and correlation coefficient matrix are quick means to get an informative overview of how the dataframe columns relate to each other.
Let's zoom in on the relationship between income and life expectancy, which appears to be quite strong.
# TODO Make this a challenge where they learn how to find things on stackoverflow
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014)
This relationship appears to be log linear and can be visualized with the x-axis set to log-scale.
Challenge
- Find out how to change the x-axis to be log-scaled. Search online for how to change the scale of a matplotlib axes object. Remember that seaborn plots return matplotlib axes objects, so all matplotlib functions that modify the axes will work on this plot. Good sites to use are the documentation pages for the respective package, and Stack Overflow. However, it is often fastest to type a well-chosen query into your favorite search engine.
- In the log-scaled plot, color the dots according to the region of the observation.
# Challenge solutions
# 1.
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014)
ax.set_xscale('log')
# Challenge solutions
# 2.
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014, hue='region')
ax.set_xscale('log')
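To see why a log-scaled axis helps with this kind of relationship, the following sketch plots made-up values (all numbers invented for illustration) where y grows by a fixed amount every time x is multiplied by ten.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up values: y increases by 10 each time x is multiplied by 10
x = np.array([100, 1_000, 10_000, 100_000])
y = np.array([50, 60, 70, 80])

fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(8, 4))
ax_linear.scatter(x, y)
ax_linear.set_title('Linear x-axis')

ax_log.scatter(x, y)
ax_log.set_xscale('log')  # equal ratios in x now take up equal distances
ax_log.set_title('Log-scaled x-axis')
fig.tight_layout()
```

On the linear axis the points bend sharply, while on the log-scaled axis they fall on a straight line.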
Another interesting relationship we could see from the pairplot
is how child mortality relates to how many children are born per woman.
We can filter out years of the data
and look at how the relationship has changed over time
using the same approach as for the CO2 data.
world_data_1920_2018 = world_data.loc[world_data['year'].isin([1920, 1940, 1960, 1980, 2000, 2018])]
sns.relplot(x='children_per_woman', y='child_mortality', col='year', hue='income_group',
data=world_data_1920_2018, col_wrap=3, height=3.5)
A common misconception is that saving poor children will lead to overpopulation. However, we can see that lower child mortality is correlated with smaller family sizes. As more children survive, parents feel more secure with a smaller family size. Reducing poverty is also related to these variables, since most high income countries are found in the lower left corner of the plots (remember that the income group is classified based on the income in 2018 and not for each year that is being plotted above).
It is important to note that from a plot like this, it is not possible to tell causation, just correlation. However, in the Gapminder video library there are a few videos on this topic, discussing how reducing poverty can help slow down population growth through decreased family sizes. Current estimates suggest that the world population will stabilize around 11 billion people and the average number of children per woman will be close to two worldwide in year 2100.
When exploring a single quantitative variable, we can choose between plotting every data point (e.g. categorical scatterplots such as swarm plots and strip plots), an approximation of the distribution (e.g. histograms and violinplots), or distribution statistics such as measures of central tendency (e.g. boxplots and barplots).
A good place to start is to visualize the variable's distribution with distplot().
Let's look at life expectancy during 2018 using this technique.
world_data_2018 = world_data.loc[world_data['year'] == 2018]
sns.distplot(world_data_2018['life_expectancy'])
The line represents a KDE (kernel density estimate), as seen previously in the grouped pairplot. Conceptually, this is similar to a smoothed histogram.
distplot() can be customized to change the number of bins
and the bandwidth of the kernel.
These are both calculated according to heuristics
for what should be good numbers for the underlying data,
but it is good to know how to change them.
sns.distplot(world_data_2018['life_expectancy'], bins=30, rug=True,
kde_kws={'bw':1, 'color':'black'})
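To give an intuition for what the bandwidth does, here is a minimal hand-written sketch of a Gaussian kernel density estimate. It is not how seaborn implements the KDE, and the function name `gaussian_kde_sketch` and the data values are made up for illustration.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated at each grid position."""
    data = np.asarray(data, dtype=float)
    # Distance from every grid position to every data point, in bandwidth units
    z = (grid[:, np.newaxis] - data[np.newaxis, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Made-up life expectancies, only for illustration
values = [52, 55, 61, 70, 72, 73, 80, 81]
grid = np.linspace(30, 100, 701)

narrow = gaussian_kde_sketch(values, grid, bandwidth=1)  # follows individual points
wide = gaussian_kde_sketch(values, grid, bandwidth=5)    # smoother overall shape
```

A small bandwidth gives a wiggly curve with a peak near each cluster of points, while a large bandwidth smooths these into broader humps. Both curves enclose an area of approximately one.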
The rug plot along the x-axis shows exactly where each data point resides. To compare distributions between values of a categorical variable, violinplots are often used. These consist of two KDEs mirrored across a midline.
sns.violinplot(x='life_expectancy', y='income_group', data=world_data_2018)
Since income_group
was defined as an ordered categorical variable previously,
this order is preserved when distributing the income groups along the y-axis.
There is notable variation in life expectancy between income groups: people in wealthier countries live longer. This variation contributes to the multimodality seen in the first distribution plot of the life expectancy for all countries in the world. However, there is also large overlap between income groups and variation within the groups, so there are more variables affecting the life expectancy than just the income.
Dissecting multimodal distributions in this manner
to find underlying variables that explain
why a distribution appears to consist of several smaller distributions
is common practice during EDA.
It looks like some income groups,
e.g. "High",
still consist of multimodal distributions.
To explore these further,
faceting can be used just as previously.
The categorical equivalent of relplot
is catplot
(categorical plot).
sns.catplot(x='life_expectancy', y='income_group', data=world_data_2018)
The default in catplot
is to create a stripplot,
a categorical scatterplot where the dots are randomly jittered to not overlap.
This is fast to create,
but it is sometimes hard to see how many dots are in a group due to overlap.
A more ordered approach is to create another type of categorical scatterplot,
called swarmplot,
where the dots are positioned to avoid overlap.
sns.catplot(x='life_expectancy', y='income_group', data=world_data_2018, kind='swarm')
The swarm plot communicates the shape of the distribution more clearly than the stripplot. Here, we can see the same bimodality in the high income group as seen in the violinplot, which was hard to see in the stripplot.
A drawback is that swarmplots can be slow to create for large datasets. For really large datasets, even stripplot is slow and it is necessary to approximate the distributions (e.g. with a violinplot) or show distribution statistics (e.g. with a boxplot), instead of showing each observation.
We can use facets to find out if regional differences are related to income level.
# TODO Will update this to look prettier
sns.catplot(x='life_expectancy', y='region', data=world_data_2014, kind='box',
col='income_group', col_wrap=2)
The variable levels are automatically ordered and it is easy to see how life expectancy generally grows with higher average income. In contrast to a line plot with the average change over time, we can here see how the distribution itself changes, not just the average. While countries in general have increased their life expectancy, differences can be seen in how they have done it: Europe and the Americas have gone from a mix of high and low life expectancy levels to tighter distributions where all countries have high life expectancy, whereas Africa has transitioned from most countries having low life expectancy to diverse life lengths depending on country.
# If both columns can be interpreted as numerical,
# the `orient` keyword can be added to be explicit
sns.catplot(x='life_expectancy', y='year', orient='horizontal', data=world_data_1920_2018, kind='violin',
col='region', col_wrap=3, color='lightgrey')
Let's explore how much of the variation during the transition in African life expectancy can be explained by geographically close regions performing differently. First, let's find out how many sub-regions there are in each region.
world_data_1920_2018.groupby('region')['sub_region'].nunique()
Africa has two sub-regions; let's find out which ones.
world_data_1920_2018.groupby('region')['sub_region'].unique()
Let's see if Sub-Saharan and Northern Africa have had different development when it comes to life expectancy.
# The split parameter saves some space and looks slick
africa = world_data_1920_2018.loc[world_data_1920_2018['region'] == 'Africa']
sns.catplot(x='life_expectancy', y='year', orient='horizontal', data=africa, kind='violin',
hue='sub_region', palette='pastel', split=True)
For the last challenge, we will explore how an education indicator varies between men and women.
world_data.dropna(subset=['years_in_school_women'])['year'].agg(['min', 'max'])
Challenge
- Subset the dataframe for the years 1975, 1995, and 2015.
- Create a new column with the ratio of women's to men's years in school.
- Plot this ratio over time, first for each region and then for each income group.
# Challenge solutions
# 1.
world_data_1970_2015 = world_data.loc[world_data['year'].isin([1975, 1995, 2015])].copy()
# 2.
world_data_1970_2015['women_men_school_ratio'] = world_data_1970_2015['years_in_school_women'] / world_data_1970_2015['years_in_school_men']
# world_data_1970_2015['women_men_school_ratio']
# 3a.
sns.catplot(y='women_men_school_ratio', x='year', data=world_data_1970_2015, hue='region', dodge=True, kind='point')
# 3b.
sns.catplot(y='women_men_school_ratio', x='year', data=world_data_1970_2015, hue='income_group', dodge=True, kind='point')