Machine Learning
Introduction to Machine Learning
Marek Petrik
January 26, 2017
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
What is machine learning?

Arthur Samuel (1959, IBM):

Field of study that gives computers the ability to learn without being explicitly programmed
The rise of machine learning

ICML: International Conference on Machine Learning

2009: 500 attendees
2015: 1600 attendees
2016: 3300 attendees
Data is everywhere!
IBM Watson: Computers Beat Humans in Jeopardy
Fair use, https://en.wikipedia.org/w/index.php?curid=31142331
AlphaGo: Computers Beat Humans in Go
Photograph by Saran Poroong, Getty Images/iStockphoto
Personalized Product Recommendations
Online retailers mine purchase history to make recommendations
Identifying Crops from Space
Identify the type of crop / vegetation / urban space from satelliteobservations
Other Applications
1. Health-care: Identify risks of getting a disease
2. Health-care: Predict effectiveness of a treatment
3. Detect spam in emails
4. Recognize hand-written text
5. Speech recognition (speech to text)
6. Machine translation
7. Predict probability of an employee leaving
Activity: any other applications?
This Course: Introduction to Machine Learning
I Build a foundation for practice and research in ML
I Basic machine learning concepts: max likelihood, cross validation
I Fundamental machine learning techniques: regression, model selection, deep learning
I Educational goals:
  1. How to apply basic methods
  2. Reveal what happens inside
  3. What are the pitfalls
  4. Expand understanding of linear algebra, statistics, and optimization
Course Overview
I Grading:
50% 6 Assignments
15% Midterm exam
30% Final exam
15% Class project
I Assignments: posted on myCourses and the website
I Discuss questions:
https://piazza.com/unh/spring2017/cs780cs880
I Programming language: R (or Python, but discouraged)
Course Texts
I Textbooks:
  ISL James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning
  ELS Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer Series in Statistics (2nd ed.)
I Other sources:
  DL Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. [online pdf on github]
  LA Strang, G. (2016). Introduction to Linear Algebra.
  CO Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. [online pdf]
  RL Sutton, R. S., & Barto, A. (2012). Reinforcement Learning. 2nd edition [online pdf draft]
Class Plan I
Date Day Topic
Jan 24 Tue Statistical learning and R language
Jan 26 Thu Linear regression with one variable
Jan 31 Tue Linear regression with multiple variables
Feb 02 Thu No class
Feb 07 Tue Classification and logistic regression 1
Feb 09 Thu Classification, naive Bayes and LDA
Feb 14 Tue Linear algebra for machine learning: review
Feb 16 Thu Linear algebra and optimization
Feb 21 Tue Overfitting and resampling methods
Feb 23 Thu Cross-validation and bootstrapping
Feb 28 Tue Linear model selection, priors
Mar 02 Thu Linear model selection and regularization
Class Plan II

Mar 07 Tue Midterm review
Mar 09 Thu Midterm exam; material until 2/23
Mar 14 Tue Spring break, no class
Mar 16 Thu Spring break, no class
Mar 21 Tue Building nonlinear features
Mar 23 Thu Nearest neighbor methods and GAMs
Mar 28 Tue Tree-based methods and boosting
Mar 30 Thu Support vector machines and other techniques
Apr 04 Tue Unsupervised learning, PCA
Apr 06 Thu Unsupervised learning, k-means
Apr 11 Tue Reinforcement learning
Apr 13 Thu Neural networks and deep learning
Apr 18 Tue Neural networks and deep learning
Apr 20 Thu Big data and machine learning
Class Plan III
Apr 25 Tue Machine learning in practice
Apr 27 Thu Project presentations
May 02 Tue Guest speaker
May 04 Thu Final exam review
May 11-17 ? Final exam
What is Machine Learning

I Discover unknown function f:

Y = f(X)

I X = set of features, or inputs
I Y = target, or response
[Figure: scatter plots of Sales against TV, Radio, and Newspaper advertising budgets]
Sales = f(TV,Radio,Newspaper)
Notation
Y = f(X) = f(X1, X2, X3)
Sales = f(TV,Radio,Newspaper)
I Y = Sales
I X1 = TV
I X2 = Radio
I X3 = Newspaper
Vector:

X = (X1, X2, X3)
Dataset
Sales = f(TV,Radio,Newspaper)
[Figure: scatter plots of Sales against TV, Radio, and Newspaper advertising budgets]
Dataset:

Y    X1   X2   X3
10   101  20   35
20   66   41   85
11   101  43   78
25   25   10   6
15   310  51   11

rows are samples
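The course uses R, but the sample layout above translates directly into code. A minimal Python sketch (the numbers are the toy values from the table, not real advertising data):

```python
# Each row of X is one sample with features (X1 = TV, X2 = Radio, X3 = Newspaper);
# Y holds the corresponding targets (Sales), one per row.
X = [
    [101, 20, 35],
    [66, 41, 85],
    [101, 43, 78],
    [25, 10, 6],
    [310, 51, 11],
]
Y = [10, 20, 11, 25, 15]

# One target per sample; sample i is the pair (X[i], Y[i]).
assert len(X) == len(Y)
```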
Errors in Machine Learning: World is Noisy

I World is too complex to model precisely
I Many features are not captured in data sets
I Need to allow for errors ε in f:

Y = f(X) + ε
Machine Learning Algorithm
I Input: Training data-set with features and targets
I Output: Prediction function f
Parametric Prediction Methods
[Figure: 3-D surface of Income as a function of Years of Education and Seniority]
Linear models (linear regression)
income = f(education, seniority) = β0+β1×education+β2×seniority
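The parametric form above can be written out in a few lines. A Python sketch, where the β values are made-up placeholders rather than fitted coefficients (a learning algorithm would estimate them from data):

```python
def linear_model(education, seniority, beta0=20.0, beta1=2.5, beta2=0.4):
    """Linear model: income = beta0 + beta1 * education + beta2 * seniority.

    The beta values here are illustrative placeholders, not coefficients
    fitted to any real data.
    """
    return beta0 + beta1 * education + beta2 * seniority

# Predicted income for 16 years of education and 10 years of seniority:
# 20 + 2.5 * 16 + 0.4 * 10 = 64.0
income = linear_model(16, 10)
```

Learning the model then amounts to choosing β0, β1, β2 that make such predictions match the training targets.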
Why Estimate f?
[Figure: scatter plots of Sales against TV, Radio, and Newspaper advertising budgets]
1. Prediction: Make predictions about the future: Best medium mix to spend ad money?
2. Inference: Understand the relationship: What kind of ads work? Why?
Prediction or Inference?

Application                                  Prediction  Inference
Identify risk of getting a disease
Predict effectiveness of a treatment
Recognize hand-written text
Speech recognition
Predict probability of an employee leaving
Statistical View of Machine Learning
I Probability space Ω: Set of all adults
I Random variable X : Ω → R: Years of education
I Random variable Y : Ω → R: Salary
[Figure: Income plotted against Years of Education, with a fitted curve]
How Good are Predictions?
I Learned function f
I Test data: (x1, y1), (x2, y2), . . .
I Mean Squared Error (MSE):

MSE = (1/n) ∑_{i=1}^n (y_i − f(x_i))²

I This is an estimate of:

MSE = E[(Y − f(X))²] = (1/|Ω|) ∑_{ω∈Ω} (Y(ω) − f(X(ω)))²

I Important: Samples x_i are i.i.d.
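The empirical MSE is a one-line computation. A minimal Python sketch with made-up targets and predictions:

```python
def mse(targets, predictions):
    """Empirical mean squared error: (1/n) * sum of (y_i - f(x_i))^2."""
    assert len(targets) == len(predictions)
    n = len(targets)
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n

# Residuals are 1, 0, and -2, so the MSE is (1 + 0 + 4) / 3.
example = mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```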
Do We Need Test Data?

I Why not just test on the training data?
[Figure: left, data with polynomial fits of increasing flexibility; right, Mean Squared Error against Flexibility]
I Flexibility is the degree of the polynomial being fit
I Gray line: training error, red line: testing error
Bias-Variance Decomposition
Y = f(X) + ε
Mean Squared Error can be decomposed as:

MSE = E[(Y − f̂(X))²] = Var(f̂(X)) + (E[f̂(X)] − f(X))² + Var(ε)

where f̂ is the learned function: Var(f̂(X)) is the Variance term and (E[f̂(X)] − f(X))² is the squared Bias.

I Bias: How well the method would work with infinite data
I Variance: How much the output changes with different data sets
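The decomposition can be checked numerically at one fixed point x0. A Python sketch; the prediction values, f(x0), and noise variance below are hypothetical numbers chosen for illustration:

```python
def decompose(preds_at_x0, true_value, noise_var):
    """Bias-variance decomposition of expected squared error at one point x0.

    preds_at_x0: predictions f_hat(x0) from models trained on different
    training sets; true_value: the true f(x0); noise_var: Var(epsilon).
    Returns (variance, bias_squared, expected_mse).
    """
    n = len(preds_at_x0)
    mean_pred = sum(preds_at_x0) / n
    variance = sum((p - mean_pred) ** 2 for p in preds_at_x0) / n
    bias_sq = (mean_pred - true_value) ** 2
    return variance, bias_sq, variance + bias_sq + noise_var

# Hypothetical predictions at x0 from five training sets; f(x0) = 3.5, Var(eps) = 1.0.
var, bias_sq, total = decompose([2.9, 3.1, 3.0, 2.8, 3.2], 3.5, 1.0)
```

Here the spread of the predictions gives the variance, the gap between their mean and the truth gives the squared bias, and the irreducible noise is added on top.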
Bias-Variance Trade-off
[Figure: three panels plotting MSE, squared Bias, and Variance against Flexibility]
Types of Function f
Regression: continuous target
f : X → R
[Figure: 3-D regression surface of Income over Years of Education and Seniority]
Classification: discrete target
f : X → {1, 2, 3, . . . , k}
[Figure: two-class scatter plot in features X1 and X2]
Regression or Classification?

Application                                  Regression  Classification
Identify risk of getting a disease
Predict effectiveness of a treatment
Recognize hand-written text
Speech recognition
Predict probability of an employee leaving
Error Rate In Classification
I Learned function f
I Test data: (x1, y1), (x2, y2), . . .
I Error rate:

(1/n) ∑_{i=1}^n I(y_i ≠ f(x_i))

I Bayes classifier: assign each observation to the most likely class

f(x) = argmax_j Pr[Y = j | X = x]

I Bayes classifier would require the true conditional distribution Pr[Y | X]
I Its error rate is a lower bound on the achievable error
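Both ideas fit in a few lines of Python; the class labels and conditional probabilities below are made-up examples:

```python
def error_rate(targets, predictions):
    """Fraction of test points where the predicted class differs from the target."""
    n = len(targets)
    return sum(1 for y, p in zip(targets, predictions) if y != p) / n

def bayes_classifier(cond_probs):
    """Given Pr[Y = j | X = x] for each class j, return the most likely class."""
    return max(cond_probs, key=cond_probs.get)

# One of four predictions is wrong, so the error rate is 0.25.
err = error_rate(["spam", "ham", "ham", "spam"], ["spam", "ham", "spam", "spam"])

# With these hypothetical conditional probabilities, the Bayes rule picks "ham".
label = bayes_classifier({"spam": 0.3, "ham": 0.7})
```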
KNN: K-Nearest Neighbors
I Bayes classifier can only predict the values x in the training set
I Idea: Use similar training points when making predictions
[Figure: a test point and its three nearest training points from two classes]
I Non-parametric method (unlike regression)
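A minimal KNN sketch in Python, using squared Euclidean distance and a majority vote; the toy training points are made up:

```python
def knn_predict(train, x, k):
    """k-nearest-neighbors: majority vote among the k training points closest to x.

    train is a list of (features, label) pairs; features are numeric tuples.
    Squared distance is enough, since only the ordering of neighbors matters.
    """
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    nearest = sorted(train, key=lambda point: dist2(point[0], x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Toy two-class data: three "blue" points near the origin, two "orange" far away.
train = [((0, 0), "blue"), ((0, 1), "blue"), ((1, 0), "blue"),
         ((5, 5), "orange"), ((5, 6), "orange")]
prediction = knn_predict(train, (0.2, 0.2), k=3)  # all 3 neighbors are "blue"
```

Note there is no training step at all: the "model" is the training set itself, which is what makes the method non-parametric.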
KNN: Choosing k
[Figure: KNN decision boundaries on two-class data for K=1 (left) and K=100 (right)]
KNN: Training and Test Errors
[Figure: training and test error rates plotted against 1/K]
R Language
I Download and install R: http://cran.r-project.org
I Try using RStudio as an R IDE
I Read the R lab: ISL 2.3
I Use Piazza for questions
Alternatives to R
I Python (+ NumPy + Pandas + Matplotlib + scikit-learn + SciPy)
  More popular, more flexible, but less mature
I MATLAB
  Linear algebra and control focus, but a lot of ML toolkits
I Julia
  New technical computing language, "walks like Python, runs like C", but least mature