Machine Learning with scikit-learn

Machine learning in Python

Authors: Nil Sahin
Research field: Computational Biology and Molecular Genetics
Lesson topic: Machine learning using Python’s scikit-learn package
Lesson content URL: https://github.com/UofTCoders/studyGroup/tree/gh-pages/lessons/python/scikit-learn

Introduction to Machine Learning

“Machine learning gives computers the ability to learn without being explicitly programmed.” - Arthur Samuel, 1959

Machine learning originated from pattern recognition and computational learning theory in AI. It is the study and construction of algorithms that learn from data and make predictions on it by building a model from sample inputs.
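As a minimal sketch of "building a model from sample input", the snippet below fits a linear regression to a handful of points and then predicts the output for an input it has not seen. The synthetic data (y = 2x + 1) is an assumption chosen only for illustration, not part of the lesson material.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic sample input (illustrative assumption): y = 2x + 1 with no noise.
X = np.arange(10, dtype=float).reshape(-1, 1)  # 10 samples, 1 feature
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)                        # learn the model from the samples
prediction = model.predict([[12.0]])   # predict for an unseen input

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("prediction for x = 12:", prediction[0])
```

Nearly every scikit-learn estimator follows this same pattern: construct the object, call fit on sample data, then call predict (or transform) on new data.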

Some uses of learning algorithms

Classification: determine which discrete category an example belongs to
Recognizing patterns: speech recognition, facial identity …
Recommender Systems: noisy data, commercial pay-off (e.g., Amazon, Netflix)
Information retrieval: find documents or images with similar content
Computer vision: detection, segmentation, depth estimation, optical flow …
Robotics: perception, planning …
Learning to play games: AlphaGo
Recognizing anomalies: unusual sequences of credit card transactions, panic situation at an airport
Spam filtering, fraud detection: The enemy adapts so we must adapt too

Types of learning tasks

Supervised Learning: the correct output is known for each training example, and the goal is to predict the output for a new input vector
- Classification: 1-of-N output, e.g. object recognition, medical diagnosis
- Regression: real-valued output, e.g. predicting market prices, customer ratings
Unsupervised Learning: learn an internal representation of the input that captures regularities and structure in the data, without any labels
- Clustering: dividing the input into groups that are not known beforehand
- Dimensionality reduction: extract informative features, e.g. PCA, t-SNE
Reinforcement Learning: choose actions so as to maximize payoff, learning from feedback in the form of rewards and punishments, e.g. playing a game against an opponent
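The difference between supervised and unsupervised tasks can be sketched with scikit-learn's bundled iris dataset. The choice of dataset, estimators, and parameter values here is an assumption for illustration. Reinforcement learning is left out because scikit-learn does not provide it.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# Supervised (classification): the labels y are used during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions

# Unsupervised (clustering): only X is given, never y.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Unsupervised (dimensionality reduction): project 4 features onto 2.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("test accuracy:", accuracy)
print("cluster ids of first five samples:", cluster_ids[:5])
print("reduced shape:", X_2d.shape)
```

Note that the cluster ids returned by K-means are arbitrary integers. They need not match the numbering of the true class labels, because the algorithm never sees those labels.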

Packages to be installed

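numpy, pandas, matplotlib, scikit-learn (imported as sklearn), scipy. itertools is part of Python’s standard library and needs no installation.
tpot: for installation, please refer to the TPOT documentation.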
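A quick way to confirm that the packages are installed is to ask Python for their versions. The sketch below uses the standard-library importlib.metadata module (Python 3.8 or later). The helper name check_packages is hypothetical, and the distribution names are the ones normally used with pip, which is an assumption about how the packages were installed.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution (pip) names; note that sklearn is distributed as scikit-learn.
PACKAGES = ["numpy", "pandas", "matplotlib", "scikit-learn", "scipy", "tpot"]


def check_packages(names):
    """Return {name: installed version string, or None if not installed}."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


for name, installed in check_packages(PACKAGES).items():
    print(f"{name}: {installed if installed else 'NOT INSTALLED'}")
```

Any package reported as not installed can then be added with your usual installer before starting the lesson.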

In this tutorial, we will cover:

Regression (e.g. Linear Regression)
Classification (e.g. SVM, Logistic regression)
Clustering (e.g. K-means)
Dimensionality Reduction (e.g. PCA)
TPOT, a separate package built on top of scikit-learn, for optimized machine learning pipelines

We won’t cover:

Other classification and clustering algorithms (e.g. Neural Networks, Hierarchical Clustering)
Model selection (e.g. using the Bayesian Information Criterion, BIC)
Hyper-parameter selection

Useful websites

Sklearn class and function reference page
Machine Learning course from the Department of Computer Science, University of Toronto
TPOT for optimized machine learning pipelines
Keras package for Neural Networks
TensorFlow Playground for Deep Neural Networks