Machine Learning with scikit-learn
Machine learning in Python
Introduction to Machine Learning
“Machine learning gives computers the ability to learn without being explicitly programmed.” - Arthur Samuel, 1959
Machine learning grew out of pattern recognition and computational learning theory in AI. It is the study and construction of algorithms that learn from data and make predictions on it by building a model from sample inputs.
Some uses of learning algorithms
- Classification: determine which discrete category an example belongs to
- Recognizing patterns: speech recognition, facial identity …
- Recommender Systems: noisy data, commercial pay-off (e.g., Amazon, Netflix)
- Information retrieval: find documents or images with similar content
- Computer vision: detection, segmentation, depth estimation, optical flow …
- Robotics: perception, planning …
- Learning to play games: AlphaGo
- Recognizing anomalies: unusual sequences of credit card transactions, a panic situation at an airport
- Spam filtering, fraud detection: The enemy adapts so we must adapt too
Types of learning tasks
- Supervised Learning: the correct output is known for each training example; the goal is to predict the output for a new input vector
  - Classification: 1-of-N output, e.g. object recognition, medical diagnosis
  - Regression: real-valued output, e.g. predicting market prices, customer ratings
- Unsupervised Learning: learn an internal representation of the input that captures regularities and structure in the data, without any labels
  - Clustering: divide the input into groups that are unknown beforehand
  - Dimensionality reduction: extract informative features, e.g. PCA, t-SNE
- Reinforcement Learning: choose actions so as to maximize payoff, using feedback in the form of rewards and punishments, e.g. playing a game against an opponent
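To make the difference between supervised and unsupervised learning concrete, here is a minimal sketch using scikit-learn. The synthetic two-group dataset, the choice of logistic regression and K-means, and all variable names are assumptions made for illustration; they are not taken from a specific part of this tutorial.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data (an assumption): 200 points drawn around
# two well-separated centers. y holds the true group of each point.
X, y = make_blobs(
    n_samples=200,
    centers=[[-5, -5], [5, 5]],
    cluster_std=1.0,
    random_state=0,
)

# Supervised learning: the labels y are given to the model during fitting.
clf = LogisticRegression().fit(X, y)
train_accuracy = clf.score(X, y)

# Unsupervised learning: only X is given; the model has to discover
# the two groups on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

# Cluster ids are arbitrary (0 and 1 may be swapped relative to y),
# so measure agreement up to that relabeling.
match = np.mean(cluster_ids == y)
agreement = max(match, 1.0 - match)

print(f"classifier training accuracy: {train_accuracy:.2f}")
print(f"cluster/label agreement:      {agreement:.2f}")
```

Note that the classifier is scored against the known labels, whereas the clustering never sees them; the labels are used here only afterwards, to check how well the discovered groups line up.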
Packages to be installed
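numpy, pandas, matplotlib, scikit-learn (imported as sklearn), scipy; itertools is also used, but it is part of the Python standard library and needs no installation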
tpot: for installation, please refer to the TPOT documentation.
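One way to confirm that everything is available before starting is sketched below. It uses only the standard library; the helper name `missing_packages` and the `REQUIRED` list are hypothetical names introduced for this example.

```python
import importlib.util

# Import names, not install names: scikit-learn is imported as "sklearn".
# REQUIRED and missing_packages are hypothetical names for this sketch.
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn", "scipy", "itertools", "tpot"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```

The check only looks the modules up without importing them, so it stays fast even for large packages.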
In this tutorial, we will cover:
- Regression (e.g. Linear Regression)
- Classification (e.g. SVM, Logistic regression)
- Clustering (e.g. K-means)
- Dimensionality Reduction (e.g. PCA)
- TPOT, a separate package built on top of scikit-learn, for optimized machine learning pipelines
We won’t cover:
- Other classification and clustering algorithms (e.g. Neural Networks, Hierarchical Clustering)
- Model selection (such as the Bayesian Information Criterion, BIC)
- Hyper-parameter selection
Useful websites