Introduction to programmatic data analyses


Lesson objectives

  • To give students an overview of the capabilities of Python and how to use JupyterLab for exploratory data analyses.
  • Learn about some differences between Python and Excel.
  • Learn some basic Python commands.
  • Learn about the Markdown syntax and how to use it within the Jupyter notebook.

Lesson outline

  • Communicating with computers (5 min)
    • Advantages of text-based communication (10 min)
    • Speaking Python (5 min)
    • Natural and formal languages (10 min)
  • The Jupyter notebook (20 min)
  • Data analysis in Python (5 min)
    • Packages (5 min)
    • How to get help (5 min)
    • Exploring data with pandas (10 min)
    • Visualizing data with seaborn (10 min)

The aim of this workshop is to teach you basic concepts, skills, and tools for working with data, so that you can get more done in less time while having more fun. You will learn how to use the programming language Python to replace many of the tasks you would normally do in spreadsheet software such as Excel, and also do more advanced analysis. This first section will be a brief introduction to communicating with your computer via text rather than by pointing and clicking in a graphical user interface, which might be what you are used to.


Communicating with computers

Before we get into practically doing things, I want to give some background to the idea of computing. Essentially, computing is about humans communicating with the computer to modulate flows of current in the hardware, in order to get the computer to carry out advanced calculations that we are unable to efficiently compute ourselves. Early examples of human-computer communication were quite primitive and included physically disconnecting a wire and connecting it again in a different spot. Luckily, we are not doing this anymore. Instead, we have graphical user interfaces with menus and buttons, which is what you are commonly using on your laptop. These graphical interfaces can be thought of as a layer (or shell) around the internal components of your operating system. They exist as a middleman, making it easier for us to express our thoughts, and for computers to interpret them.

Examples of programs with a graphical interface are spreadsheet software, such as Microsoft Excel and LibreOffice Calc. In these programs, all the functionality is accessible via hierarchical menus, and clicking buttons sends instructions to the computer, which then responds and sends the results back to your screen.

Spreadsheet software is great for viewing and entering small data sets, and for quickly creating simple visualizations. However, it can be tricky to design publication-ready figures, create automatic reproducible analysis workflows, perform advanced calculations, and reliably clean data sets. Even when using a spreadsheet program to record data, it is often beneficial to have some basic programming skills to facilitate the analyses of those data.

Advantages of text-based communication

Today, we will learn about communicating with your computer via text, rather than graphical point and click. Typing instructions to the computer might at first seem counterintuitive: why do we need it when it is so easy to point and click with the mouse? Well, graphical user interfaces can be nice when you are new to something, but text-based interfaces are more powerful, faster, and actually easier to use once you get comfortable with them.

We can compare it to learning a language: in the beginning it's convenient to look up words in a dictionary (the analog of a menu in a graphical interface), to slowly string together sentences one word at a time. But, once we become more proficient in the language and know what we want to say, it is easier to say or type it directly, instead of having to look up every word in the dictionary first. Likewise, experienced programmers can work faster and with more flow when they can type directly to the computer, instead of being constrained by graphical menus. It would be even faster to have the computer interpret speech or even thought, via speech- and brain-computer interfaces, but accuracy is often a limiting factor.

A critical advantage of text interfaces is the ease of task automation and repetition. Once you have all the instructions written down, it is simple to apply them to a different dataset. This facilitates reproducibility of analysis, not only between teams from different organizations, but also between individuals on the same team. Compare being shown how to perform a certain analysis in spreadsheet software, where the instructions will be "first you click here, then here, then here...", with being handed the same workflow written down in several lines of code, which you can analyze and understand at your own pace.
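As a rough sketch of what "applying the same instructions to a different dataset" can look like, the example below writes a short set of instructions once and reuses it on two small lists of numbers. The function name summarize and the numbers are made up for this illustration, and the syntax itself is covered later in the workshop, so there is no need to understand every detail yet.

```python
from statistics import mean


def summarize(values):
    """Return the smallest, largest, and mean value of a sequence of numbers."""
    return min(values), max(values), mean(values)


# Two made-up datasets, used only for illustration.
petal_lengths = [1.4, 1.3, 1.5]
sepal_lengths = [5.1, 4.9, 4.7]

# The exact same instructions are reused for both datasets.
petal_summary = summarize(petal_lengths)
sepal_summary = summarize(sepal_lengths)
print(petal_summary)
print(sepal_summary)
```

If a third dataset shows up tomorrow, the only thing that changes is the list that is handed to summarize().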

Another advantage of text interfaces is that they are less resource intensive than their graphical counterparts, and programs with text interfaces are easier to develop since you don't have to code the graphical components. This frees up developers to spend their time on other valuable components of the software. The ease of implementation, combined with the efficiency of use for experienced users, means that many powerful programs are written without a graphical user interface. To use these programs, you need to know how to interact via text. For example, many of the best data analysis and machine learning packages are written in Python or R, and you need to know these languages to use the machine learning components. Even if the program or package you want to use is not written in one of these languages, much of the knowledge you gain from understanding one programming language can be transferred to others. In addition, most powerful computers that you can log into remotely only give you a text interface to work with and there is no way to launch a graphical user interface.

Speaking Python

To communicate with the computer via Python, we need to use the Python interpreter, which translates our typed commands into machine language so that the computer can understand them. On Windows open the Anaconda Prompt, on macOS open Terminal.app, and on Linux open whichever terminal you prefer (e.g. gnome-terminal or konsole). Then type in python and hit Enter. You should see something like this:

Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:39:56)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

The >>> is called the "prompt", because it is prompting you to enter a command (a suggestion enhanced by the blinking block cursor). Now let's speak Python!

Natural and formal languages

While English and other spoken languages are referred to as "natural" languages, computer languages are said to be "formal" languages. You might believe that it is quite tricky to learn formal languages, but it is actually not! You already know one: mathematics, which is in fact written largely the same way in Python as you would write it by hand.

In [1]:
4 + 5
Out[1]:
9

The Python interpreter returns the result directly under our input and prompts us to enter new instructions. This is another strength of using Python for data analysis. Some programming languages require an additional step where the typed instructions are compiled into machine language and saved as a separate file that the computer can run. Although compiling code often results in faster execution time, Python allows us to very quickly experiment and test new code, which is where most of the time is spent when doing exploratory data analysis.

The sparseness of the input 4 + 5 is much more efficient than typing "Computer, could you please add 4 and 5 for me?". Formal computer languages also avoid the ambiguity present in natural languages. You can think of Python as a combination of math and a formal, succinct version of English. Since it is designed to reduce ambiguity, Python lacks the edge cases and special rules that can make English so difficult to learn, and there is almost always a logical reason for how the Python language is designed, not only a historical one.

The syntax for assigning a value to a variable is also similar to how this is written in math.

In [2]:
a = 4

The value 4 is now accessible via the variable name a. We can now perform operations with the variable name, just as we would with the value directly.

In [3]:
a * 2
Out[3]:
8

This becomes very helpful when working with more complex data than a single value, and descriptive variable names can make our code more readable, as we will see later on.
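To give a feel for what descriptive names add, here is a small sketch that computes the area of a rectangle. The variable names and the measurements are made up for this illustration.

```python
# Hypothetical measurements, made up for this illustration.
rectangle_width = 4
rectangle_height = 3

# The names document what the calculation means.
rectangle_area = rectangle_width * rectangle_height
print(rectangle_area)
```

Compared with writing 4 * 3 directly, the named version tells the reader what the numbers represent.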

Learning programming is similar to learning a foreign natural language: you will often learn the most from trying to do something and receiving feedback (from the computer or another person)! When there is something you can't wrap your head around, or if you are actively trying to find a new way of expressing a thought, then look it up, just as you would with a natural language. You can use your favorite search engine to find answers to your questions, many of which might already have been asked by other learners. One reputable hub for programming-related questions is the Q&A site StackOverflow. They have helpful directions if you can't find what you are looking for and want to ask a question that has not been asked previously.

JupyterLab and the Jupyter notebook

Although the Python interpreter is powerful, it is commonly bundled with other useful tools in user interfaces specifically designed for exploratory data analysis. One such interface is JupyterLab from Project Jupyter, which is what we will be using today. JupyterLab originates from a project called IPython, an effort to make Python development more interactive. Since its inception, the scope of the project expanded to include additional programming languages, such as Julia and R, so the name was changed to "Jupyter" (JU-PY-te-R) as a reference to its core languages. Here, we will be using the Jupyter notebook within JupyterLab, which allows us to easily take notes about our analysis and view plots within the same document where we code. The notebook format also facilitates sharing of analyses, since the notebook interface is easily accessible through any web browser as well as exportable as a PDF or HTML page.

JupyterLab is launched by running the command jupyter-lab from the terminal, or by finding it in the Anaconda navigator from your operating system menu. This should output some text in the terminal and open a new tab in your default browser. Although a web browser is used to display the JupyterLab interface, you don't need to be connected to the internet to use it. All the files necessary to run JupyterLab are stored locally and the browser is simply used to display the interface. In the new browser tab, click the tile that says "Python 3" under the "Notebook" heading, or use the "File" menu (File --> New --> Notebook). The new notebook is named "Untitled". If you right click on the tab where it says "Untitled", you will be given the option of changing the name to whatever you want. The notebook is divided into cells. Initially there will be a single input cell. You can type Python code directly into the cell, just as we did before. To run the code, click the play button in the toolbar, or press Shift + Enter.

In [4]:
4 + 5
Out[4]:
9

By default, the code in the current cell is executed and the next existing cell is selected (if there is no next cell, a new empty one is created). You can execute multiple lines of code in the same code cell; the lines will be executed one after the other.

In [5]:
a = 4
a * 2
Out[5]:
8

In notebooks, you can take nicely formatted notes via the Markdown text format. To use it, create a new cell by clicking the "+" sign in the toolbar. Then click the dropdown menu that says "code", and change it to say "markdown". In Markdown, you can use symbols to indicate how certain text should be rendered. You might already be familiar with this syntax if you have commented in online forums or used chat applications. An example of the syntax can look like this:

# Heading level one

- A bullet point
- *Emphasis in italics*
- **Strong emphasis in bold**

This is a [link to learn more about markdown](https://guides.github.com/features/mastering-markdown/)

The combination of code, plots, notes, and easy sharing, makes for a powerful data analysis environment that facilitates creating automated reproducible documents. It is possible to write an entire academic paper in this environment, and it is very handy for reports such as progress updates, since you can share your notes together with the analysis itself.

A few more notebook tips

The little counter on the left of each cell keeps track of the order in which the cells were executed. It changes to an * while the computer is processing the computation (mostly noticeable for computations that take a long time). If the * is shown for a really long time and you want to interrupt it, you can click the stop button. If you ever need to restart Python in the notebook, you can click the circular arrow button. Cells can be reordered by drag and drop with the mouse, and copy and paste is available via right mouse click. The shortcut keys in the right click menu refer to the Jupyter Command mode, which is not that important to know about when just starting out, but can be interesting to look into if you like keyboard shortcuts. These are the key JupyterLab features for getting started; there are many more that you can learn on your own if you want to.

The notebook is saved automatically, but it can also be done manually from the toolbar or by hitting Ctrl + s. Both the input and the output cells are saved, so any plots that you make will be present in the notebook next time you open it up without the need to rerun any code. This allows you to create complete documents with both your code and the output of the code in a single place, instead of spread across text files for your code and separate image files for each of your plots.

The notebook is stored as a JSON file with an .ipynb extension. These are specially formatted text files, which allow you to share your code, results, and documentation as a single document. When you want to share your notebooks with collaborators that don't have JupyterLab installed, you can export the notebook to HTML or PDF, so that your colleagues can view them in a web browser or PDF-viewer without installing anything! Exporting is done via File --> Export Notebook As... (The first time trying to export to PDF, there might be an error message with instructions on how to install TeX. Follow those instructions and try exporting again. If it is still not working, click Help --> Launch Classic Notebook and try exporting the same way as before).

It is also possible to open up other document types in JupyterLab, e.g. text documents and terminals. These can be placed side by side with the notebook through drag and drop, and all running programs can be viewed in the "Running" tab to the left. To search among all available commands for the notebook, the "Commands" tab can be used. Existing documents can be opened from the "File Browser" tab.

Data analysis in Python

To access additional functionality in a spreadsheet program, you need to click the menu and select the tool you want to use. All charts are in one menu, text layout tools in another, data analyses tools in a third, and so on. Programming languages such as Python have so many tools and functions that they would not fit in a menu. Instead of clicking File -> Open and choosing the file, you would type something similar to file.open('<filename>') in a programming language. Don't worry if you forget the exact expression; it is often enough to just type the first few letters and then hit Tab to show the available options (more on that later).
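In Python, the counterpart of File -> Open is the built-in function open(). The sketch below first creates a small throwaway file so that the example is self-contained, and then opens it and reads it back. The file name example.txt and its contents are made up for this illustration.

```python
import os
import tempfile

# Create a small throwaway file so that the example is self-contained.
# The file name and its contents are made up for this illustration.
folder = tempfile.mkdtemp()
filename = os.path.join(folder, 'example.txt')

with open(filename, 'w') as f:
    f.write('Hello from a text file')

# Typing this is the text-based counterpart of clicking File -> Open.
with open(filename) as f:
    content = f.read()

print(content)
```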

Exploring data with the pandas package

For this section of the tutorial, the goal is to understand the concepts of data analysis in Python and how they are different from analyzing data in graphical programs. Therefore, it is recommended not to code along for this part, but rather to try to get a feel for the overall workflow. All these steps will be covered in detail during later sections of the tutorial.

Packages contain domain-specific functionality that is not included in the default Python installation, a bit like how an app can add functionality to your phone. pandas is a Python package commonly used to perform data analysis on spreadsheet-like data. The name is derived from "panel data", an econometrics term for multidimensional structured data sets. Data are easily loaded into pandas from CSV files and other spreadsheet formats. The format pandas uses to represent this data is called a dataframe.

For the example data, we will load a public dataset from the web (you can view the data by pasting the URL into your browser). This sample dataset describes the length and width of petals (the colorful leaves of flowers) and sepals (the small leaves covering budding flowers) for three species of iris flowers. When you open a file in a graphical spreadsheet program, it will immediately display the content in the window. Likewise, pandas will display the information of the dataset when you read it in.

In [6]:
import pandas as pd

url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
pd.read_csv(url)
Out[6]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

To not flood the screen with data, pandas by default only displays the first five and last five rows of datasets with more than 60 rows in total. The ellipsis in the middle indicates that the output has been truncated.

To do useful and interesting things with the data, we first need to assign it to a variable name so that it is easy to access later. Let's save our dataframe as a variable called iris.

In [7]:
iris = pd.read_csv(url)
iris
Out[7]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

And a single column can be selected with the following syntax.

In [8]:
iris['sepal_length']
Out[8]:
0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

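We could calculate the mean of all numeric columns easily.

In [9]:
# numeric_only=True skips the text column 'species'; recent pandas versions raise an error without it
iris.mean(numeric_only=True)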
Out[9]:
sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

And even divide the observations into groups depending on which species of iris flower they belong to.

In [10]:
iris.groupby('species').mean()
Out[10]:
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026

This technique is often referred to as "split-apply-combine". The groupby() method splits the observations into groups, the mean() operation is applied to each group, and the results are combined into the table that we can see here. We will learn much more about this in a later lecture.
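To make the three steps more tangible, the sketch below spells out split-apply-combine by hand in plain Python, without pandas. It is only meant to illustrate the idea and is not how pandas implements groupby() internally. The four observations are made up and are not taken from the iris dataset.

```python
from statistics import mean

# A handful of made-up (species, sepal_length) observations.
observations = [
    ('setosa', 5.0),
    ('setosa', 5.2),
    ('virginica', 6.5),
    ('virginica', 6.7),
]

# Split: collect the values belonging to each species.
groups = {}
for species, sepal_length in observations:
    groups.setdefault(species, []).append(sepal_length)

# Apply: compute the mean within each group.
# Combine: gather the per-group results in a single dictionary.
group_means = {}
for species, values in groups.items():
    group_means[species] = mean(values)

print(group_means)
```

With pandas, all of these steps are expressed in the single line shown in the cell above.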

Visualizing data with seaborn

A crucial part of any exploratory data analysis is data visualization. Humans have great pattern recognition systems, which makes it much easier for us to understand data when it is represented by graphical elements in plots rather than numbers in tables. To visualize the results with plots, we will use a Python package dedicated to statistical visualization, seaborn (the name is a reference to a TV-show character). To count the observations in each species, we could use countplot().

In [11]:
import seaborn as sns

sns.countplot(x='species', data=iris)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbf7b7f2490>

We can see that there are 50 observations recorded for each species of iris. More interesting could be to compare the sepal lengths to see if there are differences depending on species. We could use swarmplot() for this, which plots every single observation as a dot.

In [12]:
sns.swarmplot(x='species', y='sepal_length', data=iris)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbf7970a4c0>

Since the number of observations for each species was 50, there will be 50 dots for each species in the plot. This plot agrees with what we saw when looking at the mean values for each species earlier.

There is much more to learn about plotting, which we will do in a later lecture. Here is one last example to illustrate the power of programmatic data analysis and how straightforward it can be to create complex visualizations. A common exploratory visualization is to investigate the pairwise relationship between variables to find out if there are measurements that are correlated with each other. This can be done with the pairplot() function.

In [13]:
sns.pairplot(data=iris)
Out[13]:
<seaborn.axisgrid.PairGrid at 0x7fbf796f2fa0>

On the diagonal, we can see histograms of the distribution of each variable. The pairwise relationships between the columns in the data set are shown in scatter plots below the diagonal and the same information is mirrored above the diagonal. From visually inspecting the plot, some variables certainly seem to depend on each other. For example, the petal width and petal length increase together, which indicates that some flowers have overall bigger petals than others.

We can also make out some clusters or groups of points within each pairwise scatter plot. It would be interesting to find out if these correspond to some inherent structure in the data. Maybe observations from the same species cluster together? To investigate, we can make a minor modification to the code above, to instruct seaborn to adjust the color hue of the data points according to their species affiliation.

In [14]:
sns.pairplot(data=iris, hue='species')
Out[14]:
<seaborn.axisgrid.PairGrid at 0x7fbf736d5130>

It certainly looks like observations from the same species are close together and a lot of the variation in our data can be explained by which species the observation belongs to!

This has been an introduction to what data analysis looks like with Python in the Jupyter Notebook. We will get into details about all the steps of this workflow in the following lectures, and you can keep referring back to this lecture as a high level overview of the process.

Introduction to programming in Python


Learning objectives

  • Perform mathematical operations in Python using basic operators.
  • Define the following data types in Python: strings, integers, and floats.
  • Define the following as it relates to Python: lists, tuples, and dictionaries.

Lesson outline

  • Introduction to programming in Python (50 min)

Operators

Python can be used as a calculator, where mathematical calculations use familiar syntax for operators such as +, -, /, and *.

In [15]:
2 + 2
Out[15]:
4
In [16]:
6 * 7
Out[16]:
42
In [17]:
4 / 3
Out[17]:
1.3333333333333333

Text prefaced with a # is called a "comment". These are often technical notes, reminders, or clarification to readers, and they will be ignored by the Python interpreter.

In [18]:
# `**` means "to the power of"
2 ** 3
Out[18]:
8
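Two more arithmetic operators that often come in handy are floor division and modulo. They are not used in the rest of this lesson, so the short sketch below is only a pointer; the variable names are made up for this illustration.

```python
# `//` is floor division: it divides and rounds down to a whole number
quotient = 7 // 2

# `%` is the modulo operator: it returns the remainder of the division
remainder = 7 % 2

print(quotient, remainder)
```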

Variables

Values can be stored as variables, which is a little like giving values nicknames so that they are easier to access. This is called assigning values to variables and is handy when the same value will be used multiple times. The assignment operator in Python is =.

In [19]:
a = 5
a * 2
Out[19]:
10

A variable can be named almost anything. It is recommended to separate multiple words with underscores, and the name must start with a letter or an underscore, not a number or other symbol.

In [20]:
new_variable = 4
a - new_variable
Out[20]:
1
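If you are unsure whether a name is allowed, one way to check is the string method isidentifier(), which reports whether a piece of text follows Python's naming rules. The candidate names below are made up for this illustration.

```python
# str.isidentifier() reports whether a piece of text follows the naming rules.
uses_underscore = 'new_variable'.isidentifier()     # letters and underscores
digit_at_the_end = 'variable2'.isidentifier()       # digits after the first character
starts_with_digit = '2nd_variable'.isidentifier()   # starts with a number
contains_symbol = 'new-variable'.isidentifier()     # contains a minus sign

print(uses_underscore, digit_at_the_end, starts_with_digit, contains_symbol)
```

The first two names are accepted, while the last two are rejected because they start with a number and contain a symbol, respectively.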

Variables can hold different types of data, not just numbers. For example, a sequence of characters surrounded by single or double quotation marks is called a string and can be assigned to a variable. In Python, it is intuitive to combine strings by adding them together:

In [21]:
# Either single or double quotes can be used to define a string.
# It does not matter as long as you are consistent.
b = 'Hello'
c = 'universe'
b + c
Out[21]:
'Hellouniverse'

A space can be added to separate the words.

In [22]:
b + ' ' + c
Out[22]:
'Hello universe'

Functions

To find out what type a variable is, the built-in function type() can be used. In essence, a function is passed input values, follows a set of instructions for how to operate on the input, and then outputs the result. This is analogous to following a recipe: the ingredients are the input, the recipe specifies the set of instructions, and the output is the finished dish.

In [23]:
type(a)
Out[23]:
int

int stands for "integer", which is the type of any number without a decimal component.

To be reminded of the value of a, the variable name can be typed into an empty code cell.

In [24]:
a
Out[24]:
5

A code cell will only output its last value. To see more than one value per code cell, the built-in function print() can be used. When Python is used non-interactively, such as when executing a set of Python instructions together as a script rather than in an interactive interface like the Jupyter Notebook, the print() function is a suitable choice for displaying output.

In [25]:
print(a)
type(a)
5
Out[25]:
int

Numbers with a decimal component are referred to as floats.

In [26]:
type(3.14)
Out[26]:
float

Text is of the type str, which stands for "string". Strings hold sequences of characters, which can be letters, numbers, punctuation, or more exotic forms of text (even emojis!).

In [27]:
print(type(b))
b
<class 'str'>
Out[27]:
'Hello'

The output from type() is formatted slightly differently when it is printed.
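Built-in functions such as type() and print() come with Python, but the recipe analogy from the start of this section also applies to functions you write yourself. Defining functions is not covered in this lesson, so the sketch below is only a preview, and the function name double is made up for this illustration.

```python
def double(value):
    """Return the input multiplied by two."""
    # The instructions: multiply the input by two and hand back the result.
    return value * 2


# The input is 5, and the output is stored in the variable result.
result = double(5)
print(result)
print(type(result))
```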

Packages

Certain functions, like the ones we used above, are considered essential for the Python programming language, and are installed together with Python automatically. Other highly useful, but often more domain-specific functionality can be accessed separately in the form of Python packages. A package is essentially a set of related functions bundled together in a format that is easy to install. Since there are so many packages available, it is not feasible to include all of them with the default Python installation (it would be as if your new phone came with every single app from the app store preinstalled). Instead, Python packages can be downloaded from central repositories online and installed as needed. The Anaconda Python distribution already bundles the core Python language with many of the most effective Python packages for data analysis, so for this tutorial we don't need to install anything.

Installing packages

This section is only included as a reference for learners to rely on from home. It can be skipped or covered conceptually during the workshop.

After Anaconda has been installed on your system, you can use the command line conda package manager or the GUI-driven anaconda-navigator to install Python packages. For comprehensive instructions on both of these, refer to the [official documentation]. Brief step-by-step instructions to get up and running with conda follow.

  1. To install a new Python package from the Anaconda repositories, run conda install <package name> in a terminal. You can also use the pip package manager, but it will be easier to keep track of packages by sticking to one installation method.
  2. Some packages are not available in the default Anaconda repositories. User-contributed packages are available in Anaconda "channels"; use anaconda search -t conda <package name> to find a channel with the desired package. To install this package, use conda install -c <channel name> <package name>. The conda-forge channel has many of the packages not in the default repositories.

Using packages

As in spreadsheet software menus, there are lots of different tools within each Python package. For example, if you want to use numerical Python functions, you can import the numerical python module, numpy. You can then access any function by writing numpy.<function_name>.

In [28]:
import numpy

numpy.mean([1, 2, 3, 4, 5])
Out[28]:
3.0

It is common to give packages nicknames, to reduce typing. This is not necessary, but it makes code less verbose and easier to read.

In [29]:
import numpy as np

np.mean([1, 2, 3, 4, 5])
Out[29]:
3.0

How to get help with packages and functions

When you start out using Python, you don't know what functions are available within each package. Luckily, in the Jupyter notebook, you can type numpy.Tab (that is numpy + period + tab-key) and a small menu will pop up that shows you all the available functions in that module. This is analogous to clicking a 'numpy-menu' and then going through its list of functions. As mentioned earlier, there are plenty of available functions and it can be helpful to filter the tab autocompletion menu by typing some letters of the function name.

To get more info on the function you want to use, you can type out the full name and then press Shift + Tab to bring up a help dialogue and again to expand that dialogue. We can see that to use the mean() function, we need to supply it with an argument (a), which should be 'array-like'. An array is essentially just a sequence of items. We just saw that one way of doing this was to enclose numbers in brackets [], which in Python means that these numbers are in a list, something you will hear more about later. Instead of manually activating the help every time, JupyterLab offers a tool called "Contextual Help", which displays help information as you type. Open it via the menu (Help --> Show Contextual Help) and drag the tab side by side with the notebook. When you start getting familiar with typing function names, you will notice that this is often faster than looking for functions in menus. However, sometimes you forget and it is useful to get hints via the help system described above.
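The help text is also available outside of JupyterLab, for example when working in the plain Python interpreter. A minimal sketch, using the built-in function len() purely as an example:

```python
# help() prints the built-in documentation for a function.
help(len)

# The same text is stored in the function's __doc__ attribute.
len_doc = len.__doc__
print(len_doc)
```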

Comparisons

Python also supports comparison and logic operators (<, >, ==, !=, <=, >=, and, or, not), which return either True or False.

In [30]:
3 > 4
Out[30]:
False

not reverses the outcome from a comparison.

In [31]:
not 3 > 4
Out[31]:
True

and checks if both comparisons are True.

In [32]:
3 > 4 and 5 > 1
Out[32]:
False

or checks if at least one of the comparisons is True.

In [33]:
3 > 4 or 5 > 1
Out[33]:
True

The type of the resulting True or False value is called "boolean".

In [34]:
type(True)
Out[34]:
bool

Boolean comparisons like these are important when extracting specific values from a larger set of values. This use case will be explored in detail later in this material, when we need to filter observations that meet specific criteria.

Another common use of boolean comparisons is in conditional statements, where the code after the comparison is only executed if the comparison is True.

In [35]:
if a == 4:
    print('a is 4')
else:
    print('a is not 4')
a is not 4
In [36]:
a
Out[36]:
5

Note that the second line in the example above is indented. Indentation is important in Python, and the Python interpreter uses it to understand that the code in the indented block will only be executed if the conditional statement above is True.
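The sketch below shows how indentation decides which lines belong to which branch. It also uses elif (short for "else if"), which is not covered above, to test a second condition. The variable name size is made up for this illustration.

```python
a = 5

if a < 4:
    size = 'smaller than 4'
elif a == 4:
    size = 'exactly 4'
else:
    size = 'larger than 4'

# This line is not indented, so it runs regardless of which branch was taken.
print('a is ' + size)
```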

Challenge 1

  1. Assign a*2 to the variable name two_a.
  2. Change the value of a to 3. What is the value of two_a now, 6 or 10?

Array-like Python types

Lists

Lists are a suitable data structure for holding a sequence of elements.

In [37]:
planets = ['Earth', 'Mars', 'Venus']
planets
Out[37]:
['Earth', 'Mars', 'Venus']

Each element can be accessed by an index. Note that Python indices start with 0 instead of 1.

In [38]:
planets[0]
Out[38]:
'Earth'

You can index from the end of the list by prefixing with a minus sign.

In [39]:
planets[-1]
Out[39]:
'Venus'

Multiple elements can be selected via slicing.

In [40]:
planets[0:2]
Out[40]:
['Earth', 'Mars']

Slicing is inclusive of the start of the range and exclusive of the end, so 0:2 returns list elements 0 and 1.

Either the start or the end number of the range can be omitted to include all items from the beginning or to the end of the list, respectively.

In [41]:
# Same as [0:2]
planets[:2]
Out[41]:
['Earth', 'Mars']
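The slicing rules described above can be summarized in a short sketch. The list is redefined here so that the example runs on its own, and the variable names for the slices are made up for illustration.

```python
planets = ['Earth', 'Mars', 'Venus']

# Start is included and end is excluded: elements 1 and 2
middle_to_end = planets[1:3]
print(middle_to_end)

# Omitting the end selects everything from the start index onwards
from_second = planets[1:]
print(from_second)

# Negative numbers count from the end: the last two elements
last_two = planets[-2:]
print(last_two)

# The number of elements returned equals end minus start
print(len(planets[0:2]))
```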

To add items to the list, the addition operator can be used together with a list of the items to be added.

In [42]:
planets = planets + ['Neptune']
planets
Out[42]:
['Earth', 'Mars', 'Venus', 'Neptune']
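The addition operator builds a new list, which is why the result above is assigned back to the name planets. As an alternative, lists also have an append method that adds a single item to the existing list. A minimal sketch comparing the two (the variable longer is made up for this example):

```python
planets = ['Earth', 'Mars', 'Venus']

# `+` builds a new list; `planets` itself is unchanged until it is reassigned
longer = planets + ['Neptune']
print(planets)
print(longer)

# `append` adds one item to the existing list in place
planets.append('Neptune')
print(planets)
```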

Loops

A loop can be used to access the elements in a list or other Python data structure one at a time.

In [43]:
for planet in planets:
    print(planet)
Earth
Mars
Venus
Neptune

The variable planet is reassigned on every iteration of the loop until the list planets has been exhausted.

Operations can be performed on the elements inside loops.

In [44]:
for planet in planets:
    print('I live on ' + planet)
I live on Earth
I live on Mars
I live on Venus
I live on Neptune
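Loops can be combined with the conditional statements introduced earlier, so that different code runs for different elements. The sketch below is one way to do this; the counter other_planets is a made-up name used only for illustration.

```python
planets = ['Earth', 'Mars', 'Venus', 'Neptune']
other_planets = 0  # counts the planets that are not Earth

for planet in planets:
    if planet == 'Earth':
        print('I live on ' + planet)
    else:
        print('I do not live on ' + planet)
        other_planets = other_planets + 1

# This line is not indented, so it runs once, after the loop has finished
print(other_planets)
```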

Tuples

A tuple is similar to a list in that it is a sequence of elements. However, tuples cannot be changed once created (they are "immutable"). Tuples are created by separating values with commas (for clarity, the values are commonly surrounded by parentheses).

In [45]:
a_tuple = (1, 2, 3)
another_tuple = ('blue', 'green', 'red')
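As a small sketch of the points above, the example below shows that the commas, not the parentheses, create the tuple, and that elements are read by index just as with lists. The names rgb, same_rgb, and single are made up for this example.

```python
# Parentheses are optional: the commas are what create the tuple
rgb = 'blue', 'green', 'red'
same_rgb = ('blue', 'green', 'red')
print(rgb == same_rgb)

# Elements are read by index and by slicing, just like lists
print(rgb[0])
print(rgb[1:])

# A tuple with a single element needs a trailing comma
single = (1,)
print(len(single))
```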

Challenge - Tuples

  1. Type type(a_tuple) into Python - what is the object type?
  2. What happens when you type a_tuple[2] = 5 vs planets[1] = 5 ?

Dictionaries

A dictionary is a container that holds pairs of objects: keys and values.

In [46]:
fruit_colors = {'banana': 'yellow', 'strawberry': 'red'}
fruit_colors
Out[46]:
{'banana': 'yellow', 'strawberry': 'red'}

Dictionaries are indexed with keys. Think of a key as a unique identifier for a value in the dictionary. Keys can only have particular types: they have to be "hashable". Strings and numeric types are acceptable, but lists are not.

In [47]:
fruit_colors['banana']
Out[47]:
'yellow'
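To make the "hashable" requirement more concrete, here is a minimal sketch with a made-up dictionary called coordinates. It uses try and except, which are not covered in this material, only so that the error from the list key can be displayed without stopping the cell.

```python
# Tuples of strings or numbers are hashable, so they can be used as keys
coordinates = {(0, 0): 'origin', (1, 0): 'one step right'}
print(coordinates[(0, 0)])

# A list is not hashable, so using one as a key raises a TypeError
list_key_failed = False
try:
    bad_dict = {[0, 0]: 'origin'}
except TypeError as error:
    list_key_failed = True
    print('TypeError:', error)
```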

To add an item to the dictionary, a value is assigned to a new dictionary key.

In [48]:
fruit_colors['apple'] = 'green'
fruit_colors
Out[48]:
{'banana': 'yellow', 'strawberry': 'red', 'apple': 'green'}

Using loops with dictionaries iterates over the keys by default.

In [49]:
for fruit in fruit_colors:
    print(fruit, fruit_colors[fruit])
banana yellow
strawberry red
apple green
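Besides looping over the keys, dictionaries provide the methods items, keys, and values for looping over pairs, keys, or values explicitly. A short sketch, with the dictionary redefined so the example runs on its own:

```python
fruit_colors = {'banana': 'yellow', 'strawberry': 'red', 'apple': 'green'}

# .items() yields (key, value) pairs, which can be unpacked in the loop
for fruit, color in fruit_colors.items():
    print(fruit, color)

# .keys() and .values() give the keys or the values on their own
fruits = list(fruit_colors.keys())
colors = list(fruit_colors.values())
print(fruits)
print(colors)
```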

Trying to use a non-existing key, e.g. from making a typo, throws an error message.

In [50]:
fruit_colors['bannana']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-50-84b86acf3267> in <module>
----> 1 fruit_colors['bannana']

KeyError: 'bannana'

This error message is commonly referred to as a "traceback", since you can use it to trace back what has gone awry. The message pinpoints which line in the code cell resulted in an error by pointing at it with an arrow (---->). This is helpful in figuring out what went wrong, especially when many lines of code are executed in the same cell.
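If a key might be missing, the error can be avoided by checking for the key first or by asking for a fallback value. The sketch below shows both approaches; the variable names key and color, and the fallback 'unknown', are made up for illustration.

```python
fruit_colors = {'banana': 'yellow', 'strawberry': 'red', 'apple': 'green'}
key = 'bannana'  # deliberately misspelled

# `in` checks whether a key exists before it is used
if key in fruit_colors:
    print(fruit_colors[key])
else:
    print(key + ' is not a key in fruit_colors')

# .get() returns a fallback value instead of raising a KeyError
color = fruit_colors.get(key, 'unknown')
print(color)
```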

Challenge - Can you do reassignment in a dictionary?

  1. In the fruit_colors dictionary, change the color of apple to 'red'.
  2. Loop through the fruit_colors dictionary and print the key only if the value that key points to in the dictionary is 'red'.

Writing functions

We have already seen how to use preexisting functions; now let's see how we can create our own. Defining a section of code as a function in Python is done using the def keyword. For example, a function that takes two arguments and returns their difference can be defined as follows:

In [51]:
def subtract_function(a, b):
    result = a - b
    return result

There is no output until we call the function.

In [52]:
subtract_function(a=8, b=5)
Out[52]:
3

a and b are called parameters and the values passed to them are arguments. If the names of the parameters are not specified in the function call, the arguments are assigned in the same order as the parameters are listed in the function definition.

In [53]:
subtract_function(8, 5)
Out[53]:
3

If the parameter names are specified, they can be in any order.

In [54]:
subtract_function(b=8, a=5)
Out[54]:
-3
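The two calling styles can also be mixed, as long as the arguments without names come before the named ones. A minimal sketch, with the function redefined so the example runs on its own (the names mixed and named are made up):

```python
def subtract_function(a, b):
    result = a - b
    return result

# Arguments without names come first; named arguments may follow
mixed = subtract_function(8, b=5)
print(mixed)

# With names given for both, the order in the call does not matter
named = subtract_function(b=5, a=8)
print(named)
```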

The result from a function can be assigned to a variable.

In [55]:
z = subtract_function(8, 5)
z
Out[55]:
3

A function can return more than one value.

In [56]:
def subtract_function_2(a, b):
    result = a - b
    return result, 2 * result

subtract_function_2(4, 1)
Out[56]:
(3, 6)

The returned values can be assigned to a single variable as a tuple.

In [57]:
z = subtract_function_2(4, 1)
z
Out[57]:
(3, 6)

Or to two variables.

In [58]:
z, x = subtract_function_2(4, 1)
z
Out[58]:
3
In [59]:
x
Out[59]:
6

It is helpful to include a description of the function, which in Python is referred to as a "docstring". There is a special syntax for this in Python, which ensures that the message shows up in the help messages we used previously.

In [60]:
def subtract_function(a, b):
    """This subtracts b from a"""
    result = a - b
    return result

Just as before, the ? can be used to get help for the function.

In [61]:
?subtract_function
Signature: subtract_function(a, b)
Docstring: This subtracts b from a
File:      ~/proj/uoftcoders/workshops/workshops-dc-py/notebooks/<ipython-input-60-48e2ec4422d0>
Type:      function

It is important to write a clear description of the function, but extensive docstring coverage is outside the scope of this material. Packages often have their own guidelines on how to write docstrings; the pandas docstring guide describes a good set of conventions to follow.
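As a rough sketch of what a longer docstring might look like, the example below uses a sectioned layout with "Parameters" and "Returns" headings, in the style used by many scientific Python packages. The wording of the descriptions is an assumption made for illustration and is not taken from any official guide.

```python
def subtract_function(a, b):
    """Subtract b from a.

    Parameters
    ----------
    a : int or float
        The number to subtract from.
    b : int or float
        The number to subtract.

    Returns
    -------
    int or float
        The difference a - b.
    """
    result = a - b
    return result

# The docstring is stored on the function and is what the help system shows
print(subtract_function.__doc__)
```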

It is possible to inspect the source code of a function by using double ? (this can be quite complex for complicated functions).

In [62]:
??subtract_function
Signature: subtract_function(a, b)
Source:   
def subtract_function(a, b):
    """This subtracts b from a"""
    result = a - b
    return result
File:      ~/proj/uoftcoders/workshops/workshops-dc-py/notebooks/<ipython-input-60-48e2ec4422d0>
Type:      function

Much of the power of languages such as Python comes from community-contributed functions and packages written by talented people and shared openly, so that anyone can use them instead of reinventing the wheel.