We will use the following textbooks for this course:

[HTF] The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Springer, 2001. Q325.75.F75 2001 c. 1. Available at http://statweb.stanford.edu/~tibs/ElemStatLearn/.

[BIS] Pattern Recognition and Machine Learning. Christopher M. Bishop. Springer, 2009. Q327.B52 2009 c. 1.

Other useful references are:

[MUR] Machine Learning: A Probabilistic Perspective. Kevin P. Murphy. MIT Press, 2012. Q325.5 .M87 2012 c. 1.

[TM] Machine Learning. Tom M. Mitchell. McGraw-Hill, 1997. Q325.5.M58 1997 c. 1.

[SSBD] Understanding Machine Learning: From Theory to Algorithms. Shai Shalev-Shwartz and Shai Ben-David. Cambridge University Press, 2014. Q325.5 .S475 2014 c. 1. Available at http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf.

[MRT] Foundations of Machine Learning. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. MIT Press, 2012. Q325.5 .M64 2012 c. 1.

Schedule

Introduction

M 1/30: Introduction to Machine Learning
Topics: What learning is; why and when we need machine learning; different types of machine learning; recent successes and notable applications
Readings: MUR Chapter 1 (PDF); BIS Chapter 1; SSBD Chapter 1

W 2/1: Course overview, formal introduction
Topics: Supervised learning: task, performance, evaluation; classification, regression, loss functions, risk
Readings: MRT Chapter 1; HTF Chapter 1

F 2/3: Recitation: Probability and Statistics
Topics: Events, random variables, probabilities, pdf, pmf, cdf, mean, mode, median, variance, multivariate distributions, marginals, conditionals, Bayes' theorem, independence
Readings: Review slides; Aaditya Ramdas' tutorials; math review (BIS Appendix B; MRT Appendices A and C); BIS Chapter 2

M 2/6: Foundations
Topics: Bayes optimal rule, Bayes risk, empirical risk minimization (ERM), generalization error
Readings: SSBD Chapter 2

W 2/8: Supervised Learning
Topics: Classification, regression, rote learning, lazy learning, model fitting
Readings: HTF Chapter 1; MUR Sections 1.1-1.2

F 2/10: Recitation: Linear Algebra
Topics: Vector spaces, norms, metric spaces, inner product spaces, the Cauchy-Schwarz inequality, orthonormal bases
Readings: Video lecture; review notes; review slides

M 2/13: Linear Regression
Topics: Linear functions, loss functions, empirical risk minimization, least squares (see the sketch below)
Readings: HTF Sections 2.3, 3.2; BIS Section 3.1
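
To make the least-squares idea concrete, here is a minimal NumPy sketch (not course-provided code): it minimizes the empirical squared-error risk via np.linalg.lstsq on synthetic, purely illustrative data.

    import numpy as np

    # Synthetic data: y = 2x + 1 plus Gaussian noise (illustrative only).
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=100)
    y = 2 * x + 1 + 0.1 * rng.normal(size=100)

    # Design matrix with a bias column; least squares solves min_w ||Xw - y||^2.
    X = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(w)  # should be approximately [1.0, 2.0]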

W 2/15: Error analysis, statistical view
Topics: Bayes optimal predictor, error decomposition, the statistical view of regression, the Gaussian model, maximum likelihood estimation (MLE)
Readings: HTF Sections 2.4, 2.6, 2.9; BIS Sections 1.2, 3.2; MUR Sections 7.1-7.3

F 2/17: Generalized Linear Regression, Intro to Python
Topics: Polynomial regression, generalized additive regression, gradient descent, introduction to Python, Jupyter notebooks, NumPy (see the sketch below)
Readings: SSBD Section 9.2; BIS Sections 1.1, 3.1; Python Tutorial; Learn Python
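
A hedged sketch of batch gradient descent on a polynomial regression, assuming squared loss; the degree, step size, and data are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=200)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=200)

    # Degree-3 polynomial features; gradient descent on the mean squared error.
    X = np.vander(x, N=4, increasing=True)     # columns: 1, x, x^2, x^3
    w = np.zeros(X.shape[1])
    lr = 0.1
    for _ in range(2000):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the MSE in w
        w -= lr * grad
    print(w)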

M 2/20: Regularization
Topics: Model complexity and overfitting, penalizing model complexity, description length, shrinkage methods, ridge regression, the Lasso (see the sketch below)
Readings: HTF Section 3.3; BIS Sections 1.1, 1.3, 3.1.4; MUR Section 7.5
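
A minimal sketch of ridge regression's closed form, min_w ||Xw - y||^2 + lam ||w||^2; the data and the choice lam = 1.0 are illustrative assumptions (and, for simplicity, any bias term is penalized along with the rest of w).

    import numpy as np

    def ridge(X, y, lam):
        """Closed-form ridge solution: w = (X^T X + lam I)^{-1} X^T y."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)
    print(ridge(X, y, lam=1.0))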

W 2/22: Classification
Topics: Introduction, classification as regression, linear classifiers, risk, conditional risk, logistic regression, MLE, surrogate losses, generalized additive models
Readings: HTF Sections 4.1, 4.4; BIS Sections 1.5, 4.3.2; MUR Sections 8.1-8.3

F 2/24: Recitation: Convex Optimization
Topics: Convex sets, convex functions, standard form, Lagrange multipliers, equivalence of the constrained and penalized (unconstrained) forms of ridge regression and Lasso regression
Readings: MRT Appendix B; BIS Appendices D, E

M 2/27: Logistic Regression
Topics: Log-odds ratio, the logistic function, gradient descent, Newton-Raphson (see the sketch below)
Readings: BIS Section 4.3.4; SSBD Sections 9.3, 14.1; MUR Section 8.5
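
A small sketch of logistic regression trained by gradient descent on the mean negative log-likelihood, assuming labels in {0, 1}; the data and step size are synthetic and illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)

    w = np.zeros(2)
    lr = 0.5
    for _ in range(500):
        p = sigmoid(X @ w)                 # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the mean NLL
    print(w)  # direction should roughly match [1.5, -2.0]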

W 3/1: Stochastic gradient descent
Topics: Overfitting with logistic regression, MAP estimation, regularization, softmax, stochastic gradient descent (SGD) (see the sketch below)
Readings: HTF Section 4.5; BIS Section 7.1; MRT Sections 4.1-4.2; SSBD Sections 15.1-15.1.1; MUR Section 14.5
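
For contrast with the batch update above, a sketch of plain SGD for the same logistic model, assuming one uniformly sampled example per step and a 1/sqrt(t) step size; all constants are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = (X @ np.array([1.5, -2.0]) > 0).astype(float)

    w = np.zeros(2)
    for t in range(1, 20001):
        i = rng.integers(len(y))               # pick one example at random
        g = (sigmoid(X[i] @ w) - y[i]) * X[i]  # stochastic gradient estimate
        w -= (1.0 / np.sqrt(t)) * g            # decaying step size
    print(w)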

F 3/3: Recitation: MLE, MAP
Topics: Parametric distributions, parameter estimation (MLE), overfitting, MAP estimation
Readings: BIS Chapter 2

M 3/6: Support Vector Machines
Topics: Optimal separating hyperplane, large-margin classifiers, margin and regularization, Lagrange multipliers, KKT conditions, max-margin optimization, quadratic programming, support vectors
Readings: HTF Section 4.5; MRT Sections 4.1-4.2

W 3/8: Support Vector Machines II
Topics: The non-separable case, SVMs with slack variables, loss functions in SVMs, solving the SVM in the primal, SVM regression (see the sketch below)
Readings: BIS Section 7.1; SSBD Sections 15.1-15.1.1; MUR Section 14.5
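
A sketch of solving the soft-margin SVM in the primal by stochastic subgradient descent on the regularized hinge loss (a Pegasos-style update); labels are assumed in {-1, +1}, and the data and constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.sign(X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200))

    lam, w = 0.1, np.zeros(2)
    for t in range(1, 5001):
        i = rng.integers(len(y))
        eta = 1.0 / (lam * t)              # Pegasos step size
        margin = y[i] * (X[i] @ w)
        # Subgradient step on lam/2 ||w||^2 + max(0, 1 - y_i w.x_i)
        if margin < 1:
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    print(w)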

F 3/10: Recitation: Classification
Topics: Classification as regression: linear regression, logistic regression, SVMs, empirical risk minimization, losses
Readings: Notes

M 3/13: Kernel methods
Topics: Solving the SVM in the primal, subgradients, subgradient descent, nonlinear features, feature spaces, the kernel trick, the representer theorem, kernel SVM in the primal, Mercer kernels, radial basis functions, kernel SVM, SVM regression (see the sketch below)
Readings: BIS Sections 6.1, 6.2; MRT Sections 5.1, 5.2, 5.3.1-5.3.2; SSBD Sections 15.2, 15.4-15.5; MUR Sections 14.1-14.2
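
A compact sketch of the kernel trick and the representer theorem via kernel ridge regression with an RBF kernel (kernel ridge rather than kernel SVM, simply to keep the example short); the bandwidth gamma and regularizer lam are illustrative assumptions.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        """K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(100, 1))
    y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=100)

    # Representer theorem: the predictor is sum_i alpha_i k(x_i, .).
    lam = 0.1
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)

    X_test = np.array([[0.5]])
    print(rbf_kernel(X_test, X) @ alpha)  # prediction near sin(1.0) ~ 0.84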

W 3/15: Graphical Models: Representations
Topics: Bayesian networks, Markov random fields, factor graphs
Readings: BIS Chapter 8

F 3/17: Graphical Models: Inference Algorithms
Topics: Exact inference algorithms, message passing (belief propagation), variational inference
Readings: BIS Chapter 8

Spring Break (3/20-3/24)

M 3/27: Expectation Maximization
Topics: Probabilistic models, the EM algorithm, EM for MAP estimation, incremental EM, Gaussian mixture models (GMMs), structured prediction, sequential models, Markov chains
Readings: BIS Chapter 9

W 3/29: Sequence models
Topics: Markov models, hidden Markov models, linear dynamical systems
Readings: BIS Chapter 13

F 3/31: Recitation: Midterm review
Topics: Foundations of learning, bias-variance tradeoffs, Bayes optimal rule, regression, linear regression, least squares, polynomial regression, regularization, classification, logistic regression, support vector machines, primal and dual forms, kernel SVMs

M 4/3: In-class midterm

W 4/5: Multi-class SVMs, generative models for classification
Topics: Multi-class SVMs, generative models, discriminant functions, likelihood ratio test, Gaussian discriminant analysis, linear and quadratic discriminants, generative models for classification, mixture models
Readings: HTF Sections 4.3, 12.4-12.6

F 4/7: Midterms returned, solutions discussed

M 4/10: No class

W 4/12: Mixture models, the EM algorithm
Topics: Review of multivariate Gaussians, mixture models, likelihood of mixture models, mixture density estimation, expectation maximization, EM for GMMs, generic EM for mixture models, EM overfitting and regularization (see the sketch below)
Readings: HTF Section 12.7; BIS Sections 9.2, 9.3
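
A condensed sketch of EM for a two-component one-dimensional Gaussian mixture, assuming the number of components is known; the initialization and iteration count are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])

    def gauss(x, m, v):
        return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

    w = np.array([0.5, 0.5])      # mixing weights
    mu = np.array([-1.0, 1.0])    # component means
    var = np.array([1.0, 1.0])    # component variances
    for _ in range(100):
        # E-step: responsibilities r[k, i] proportional to w_k N(x_i | mu_k, var_k)
        r = w[:, None] * gauss(x[None, :], mu[:, None], var[:, None])
        r /= r.sum(axis=0)
        # M-step: re-estimate weights, means, variances from responsibilities
        n_k = r.sum(axis=1)
        w = n_k / len(x)
        mu = (r @ x) / n_k
        var = (r * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / n_k
    print(w, mu, var)  # should recover roughly [0.6, 0.4], [-2, 3], [1, 1]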

F 4/14: Conditional mixture models
Topics: Mixture models for regression, the mixture-of-experts model, gating networks, conditional mixtures, EM for mixtures of experts
Readings: BIS Section 14.5

M 4/17: Perceptron, Neural Networks
Topics: The perceptron, perceptron loss, perceptrons and neurons, generalized linear methods represented as neural networks, two-layer networks, feed-forward networks, training networks (see the sketch below)
Readings: BIS Sections 5.1, 5.2
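
The classic perceptron update in a few lines, assuming labels in {-1, +1} and linearly separable synthetic data; the epoch cap is an illustrative safeguard.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.sign(X @ np.array([1.0, -1.0]) + 0.5)  # separable with a bias term

    # Fold the bias into the weight vector via a constant feature.
    Xb = np.column_stack([X, np.ones(len(X))])
    w = np.zeros(3)
    for _ in range(100):                  # epoch cap
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if yi * (xi @ w) <= 0:        # mistake: move w toward the example
                w += yi * xi
                mistakes += 1
        if mistakes == 0:                 # converged: all points classified
            break
    print(w)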

W 4/19: Backpropagation
Topics: Training neural networks, backpropagation, MLPs as universal approximators, deep vs. shallow networks, non-convex optimization, classification networks, multi-class networks (see the sketch below)
Readings: BIS Sections 5.3-5.5
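
A hedged sketch of backpropagation for a two-layer network with a tanh hidden layer and squared loss, trained by full-batch gradient descent; the architecture, width, and constants are illustrative choices, not the course's.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 1))
    y = np.sin(3 * X[:, 0])

    # Two-layer network: h = tanh(X W1 + b1), yhat = h W2 + b2
    H = 20
    W1 = 0.5 * rng.normal(size=(1, H))
    b1 = np.zeros(H)
    W2 = 0.5 * rng.normal(size=H)
    b2 = 0.0

    lr = 0.05
    for _ in range(3000):
        # Forward pass
        a = X @ W1 + b1
        h = np.tanh(a)
        yhat = h @ W2 + b2
        # Backward pass: chain rule outward from the mean squared loss
        d_yhat = 2 * (yhat - y) / len(y)
        dW2 = h.T @ d_yhat
        db2 = d_yhat.sum()
        d_h = np.outer(d_yhat, W2)
        d_a = d_h * (1 - h ** 2)          # tanh'(a) = 1 - tanh(a)^2
        dW1 = X.T @ d_a
        db1 = d_a.sum(axis=0)
        # Gradient-descent updates
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    print(float(((yhat - y) ** 2).mean()))  # final training MSE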

F 4/21: Deep neural networks
Topics: Classification networks, multi-class and multi-label classification, model complexity, learning rate, momentum, ReLU activations, network ensembles, dropout, multilayer networks, network topology, skip-layer connections
Readings: BIS Section 14.5

M 4/24: Decision Trees
Topics: Partition trees, classification and regression trees (CART), regression tree construction, region splitting, regression tree complexity, regression tree pruning, classification trees, the Gini index (see the sketch below)
Readings: BIS Section 14.4; HTF Sections 9.2, 9.5
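
A sketch of the core CART step: an exhaustive search for the split on one feature that minimizes weighted Gini impurity; the data and the use of observed values as candidate thresholds are illustrative simplifications.

    import numpy as np

    def gini(y):
        """Gini impurity of a binary (0/1) label array."""
        if len(y) == 0:
            return 0.0
        p = y.mean()
        return 2 * p * (1 - p)

    def best_split(x, y):
        """Exhaustive threshold search on a single feature."""
        best_t, best_score = None, np.inf
        for t in np.unique(x):
            left, right = y[x <= t], y[x > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_t, best_score = t, score
        return best_t, best_score

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 100)
    y = (x > 0.6).astype(float)       # labels determined by a threshold
    print(best_split(x, y))           # recovers a threshold near 0.6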

W 4/26: Ensemble methods
Topics: Combining "weak" classifiers, greedy assembly, the AdaBoost algorithm, the exponential loss, weighted loss, optimizing the weak learner
Readings: BIS Section 14.3; HTF Sections 10.1-10.3

F 4/28: AdaBoost
Topics: AdaBoost derivation, AdaBoost behavior, boosting the margin, boosting decision stumps, boosting and the bias-variance tradeoff, combinations of regressors, forward stepwise regression, combining regression trees, random forests, classification with random forests, bagging (bootstrap aggregation) (see the sketch below)
Readings: HTF Sections 10.4-10.5
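
A sketch of AdaBoost with one-feature decision stumps, assuming labels in {-1, +1}; the brute-force stump search and round count are illustrative simplifications of the algorithm covered in lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 200)
    y = np.where(np.abs(x) > 0.5, 1.0, -1.0)   # not separable by one stump

    D = np.full(len(x), 1.0 / len(x))          # example weights
    stumps = []
    for _ in range(20):
        # Weak learner: best threshold/sign stump under the current weights.
        best = None
        for t in np.unique(x):
            for s in (1.0, -1.0):
                pred = s * np.where(x > t, 1.0, -1.0)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, t, s)
        err, t, s = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        stumps.append((alpha, t, s))
        # Reweight: misclassified examples gain weight, correct ones lose it.
        D *= np.exp(-alpha * y * (s * np.where(x > t, 1.0, -1.0)))
        D /= D.sum()

    F = sum(a * s * np.where(x > t, 1.0, -1.0) for a, t, s in stumps)
    print((np.sign(F) == y).mean())            # training accuracy of the ensemble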

M 5/1: Representation Learning
Topics: Unsupervised feature learning, principal component analysis (PCA), the power iteration method, kernel PCA, canonical correlation analysis (CCA) (see the sketch below)
Readings: BIS Sections 12.1, 12.3; HTF Section 14.5
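
A minimal PCA sketch via the SVD of the centered data matrix; the number of components and the synthetic data are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    # Correlated 3-D data with most variance along one direction.
    Z = rng.normal(size=(200, 1))
    X = Z @ np.array([[3.0, 2.0, 1.0]]) + 0.1 * rng.normal(size=(200, 3))

    Xc = X - X.mean(axis=0)             # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:2]                 # top-2 principal directions
    scores = Xc @ components.T          # projections onto those directions
    explained = S ** 2 / (len(X) - 1)   # variance captured along each direction
    print(components[0], explained[:2])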

W 5/3: Clustering
Topics: Cluster analysis, dissimilarities based on attributes, k-means, soft k-means, hierarchical clustering, spectral clustering (see the sketch below)
Readings: BIS Section 9.1; HTF Section 14.3
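
A sketch of Lloyd's algorithm for k-means, assuming k is known, with random data-point initialization; the iteration cap is illustrative, and the sketch ignores the empty-cluster edge case.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), k, replace=False)]  # init at data points
        for _ in range(iters):
            # Assignment step: each point goes to its nearest center.
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            # Update step: centers move to their cluster means.
            new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        return centers, labels

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
    centers, labels = kmeans(X, k=2)
    print(centers)  # should land near (-2, -2) and (2, 2)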

F 5/5: Learning Theory
Topics: Probably approximately correct (PAC) learning, finite hypothesis classes, infinite hypothesis classes, VC dimension, Rademacher complexity, final review
Readings: MRT Chapter 2