Lec18-Perceptron

Transcript of Lec18-Perceptron

  • 7/28/2019 Lec18-Perceptron

    1/13

    Linear Discriminators

    Chapter 20

    Only relevant parts

  • 7/28/2019 Lec18-Perceptron

    2/13

    Concerns

    Generalization Accuracy

    Efficiency

    Noise

    Irrelevant features

    Generality: when does this work?

  • 7/28/2019 Lec18-Perceptron

    3/13

    Linear Model Let f1, fn be the feature values of an example. Let class be denoted {+1, -1}.

    Define f0 = 1 (a constant bias feature).

    The linear model defines weights w0, w1, ..., wn; -w0 is the threshold.

    Classification rule: If w0*f0 + w1*f1 + ... + wn*fn > 0, predict class +; else predict class -.

    Briefly: W*F > 0, where * is the inner product of the weight vector and the feature vector, and F has been augmented with an extra 1.
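    As a minimal sketch of this rule in Python (the weights and feature values below are made
    up for illustration):

        # Linear threshold classification: predict +1 if W*F > 0, else -1.
        # F is assumed to be augmented with a leading f0 = 1, so weights[0] is the bias w0.
        def classify(weights, features):
            score = sum(w * f for w, f in zip(weights, features))
            return +1 if score > 0 else -1

        weights = [-1.0, 0.5, 2.0]          # (w0, w1, w2), made-up values
        features = [1.0, 1.0, 0.2]          # (f0 = 1, f1, f2), made-up values
        print(classify(weights, features))  # 0.5*1 + 2.0*0.2 - 1.0 = -0.1, so predict -1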

  • 7/28/2019 Lec18-Perceptron

    4/13

    Augmentation Trick

    Suppose the data is defined by features f1 and f2.

    2*f1 + 3*f2 > 4 is the classifier.

    Equivalently: (2, 3, -4) * (f1, f2, 1) > 0. Mapping data (f1, f2) to (f1, f2, 1) allows
    learning/representing the threshold as just another feature.

    Mapping data into higher dimensions is the key idea behind SVMs.
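    A quick check of this equivalence (the test points are arbitrary):

        # The thresholded rule 2*f1 + 3*f2 > 4 and the augmented rule
        # (2, 3, -4) * (f1, f2, 1) > 0 make identical predictions.
        def rule_threshold(f1, f2):
            return 2*f1 + 3*f2 > 4

        def rule_augmented(f1, f2):
            w, f = (2, 3, -4), (f1, f2, 1)
            return sum(wi * fi for wi, fi in zip(w, f)) > 0

        for f1, f2 in [(0, 0), (1, 1), (2, 0), (0.5, 1.5)]:
            assert rule_threshold(f1, f2) == rule_augmented(f1, f2)
        print("the two rules agree on all test points")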

  • 7/28/2019 Lec18-Perceptron

    5/13

    Mapping to enable Linear Separation

    Let x1, ..., xm be m vectors in R^N.

    Map xi into R^{N+m} by xi -> (xi, ei), where ei has a 1 in the (N+i)-th position and 0
    elsewhere.

    For any labelling of the xi by classes +/-, this embedding makes the data linearly
    separable: define wj = 0 for j <= N and let the weight in position N+i be the class label
    of xi. Then W * (xi, ei) equals xi's label, so every example is classified correctly.
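    A small sketch of this construction (the points and labels below are made up):

        import numpy as np

        # m points in R^N with an arbitrary labelling
        X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.8]])   # m = 3, N = 2
        y = np.array([+1, -1, +1])
        m, N = X.shape

        # Embed each xi in R^{N+m} by appending the i-th standard basis vector
        X_emb = np.hstack([X, np.eye(m)])

        # Weights: 0 on the original N coordinates, the label yi on the i-th new coordinate
        w = np.concatenate([np.zeros(N), y])

        # W * (xi, ei) = yi, so every point gets the correct sign
        print(np.sign(X_emb @ w))   # matches y exactly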

  • 7/28/2019 Lec18-Perceptron

    6/13

    Representational Power

    OR of n features (valued 0/1): wi = 1, threshold = 0

    AND of n features: wi = 1, threshold = n - 1

    k of n features (prototype): wi = 1, threshold = k - 1

    Can't do XOR.

    Combining linear threshold units yields any boolean function.
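    These three constructions as a single threshold unit (the feature values are made up;
    features are assumed to be 0/1):

        # A linear threshold unit: output 1 if sum(wi*fi) > threshold, else 0.
        def threshold_unit(weights, threshold, features):
            return 1 if sum(w * f for w, f in zip(weights, features)) > threshold else 0

        n = 4
        features = [1, 0, 1, 1]   # made-up 0/1 feature values

        or_n = threshold_unit([1] * n, 0, features)        # 1 iff at least one feature is 1
        and_n = threshold_unit([1] * n, n - 1, features)   # 1 iff all n features are 1
        k = 3
        k_of_n = threshold_unit([1] * n, k - 1, features)  # 1 iff at least k features are 1
        print(or_n, and_n, k_of_n)                         # 1 0 1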

  • 7/28/2019 Lec18-Perceptron

    7/13

  • 7/28/2019 Lec18-Perceptron

    8/13

    Classical Perceptron

    Theorem: If the concept is linearly separable, then the algorithm finds a solution.

    Training time can be exponential in the number of features.

    An epoch is a single pass through the entire data set.

    Convergence can take exponentially many epochs, but it is guaranteed to work: if |xi| <= R
    for every example and the data is separable with margin gamma, the number of mistakes is at
    most (R/gamma)^2.
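    The slide states the guarantee but not the algorithm itself; a minimal sketch of the
    classical perceptron update (the toy data is made up, and features are assumed to be
    augmented with a leading 1):

        import numpy as np

        def perceptron_train(X, y, max_epochs=100):
            # X: examples, one per row, augmented with a leading 1; y: labels in {+1, -1}
            W = np.zeros(X.shape[1])
            for _ in range(max_epochs):             # one epoch = one pass through the data
                mistakes = 0
                for xi, yi in zip(X, y):
                    if yi * np.dot(W, xi) <= 0:     # example misclassified (or on the boundary)
                        W = W + yi * xi             # classical update: add or subtract the example
                        mistakes += 1
                if mistakes == 0:                   # a full pass with no mistakes: converged
                    break
            return W

        X = np.array([[1, 0.0, 0.0], [1, 2.0, 0.0], [1, 0.0, 2.0], [1, 0.2, 0.3]])
        y = np.array([-1, +1, +1, -1])
        W = perceptron_train(X, y)
        print(np.sign(X @ W))   # matches y once training has converged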

  • 7/28/2019 Lec18-Perceptron

    9/13

    Hill-Climbing Search

    This is an optimization problem.

    The solution is found by hill-climbing, so there is no guarantee of finding the optimal
    solution.

    While derivatives tell you the direction (the negative gradient), they do not tell you how
    much to change each Wi.

    On the plus side, it is fast. On the negative side, there is no guarantee of separation.
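    The step-size point can be seen on a one-variable example (illustrative only, not from the
    slides): the gradient fixes the direction, but the learning rate alpha must be chosen
    separately, and too large a step overshoots.

        # Minimize E(w) = w^2 by hill-climbing; the gradient is 2*w, so each step is -alpha*2*w.
        def descend(alpha, w=1.0, steps=5):
            for _ in range(steps):
                w = w - alpha * 2 * w   # move in the direction of the negative gradient
            return w

        print(descend(alpha=0.1))   # small steps creep toward the minimum at 0
        print(descend(alpha=1.5))   # steps too large: overshoots and diverges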

  • 7/28/2019 Lec18-Perceptron

    10/13

    Hill-climbing View

    Goal: minimize the squared error Err^2.

    Let the class yi be 1 or -1.

    Let Err = W*Xi - yi, where Xi is the i-th example (the squared errors are summed over all
    examples).

    This is a function only of the weights.

    Use calculus: take partial derivatives with respect to Wj.

    To move to a lower value, move in the direction of the negative gradient, i.e. the change
    in Wj is proportional to -2*Err*Xj.
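    A minimal sketch of this gradient step (the toy data and the step size alpha are made up;
    the slides do not specify them):

        import numpy as np

        def squared_error_epoch(W, X, y, alpha=0.05):
            # One pass of gradient steps on the squared error.
            # X: examples, one per row, augmented with a leading 1; y: labels in {+1, -1}.
            for xi, yi in zip(X, y):
                err = np.dot(W, xi) - yi        # Err = W*Xi - yi
                W = W - alpha * 2 * err * xi    # change in Wj is -2*Err*Xj, scaled by alpha
            return W

        X = np.array([[1, 0.0, 0.0], [1, 2.0, 0.0], [1, 0.0, 2.0]])
        y = np.array([-1, +1, +1])
        W = np.zeros(3)
        for _ in range(200):
            W = squared_error_epoch(W, X, y)
        print(np.sign(X @ W))   # matches y after enough passes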

  • 7/28/2019 Lec18-Perceptron

    11/13

    Support Vector Machine

    Goal: maximize the margin.

    Assuming the line separates the data, the margin is the distance from the line to the
    closest example, whether positive or negative.

    Good News: This can be solved by a quadratic program.

    Implemented in Weka as SMO.

    If not linearly separable, SVM will add

    more features.
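    The slides point to Weka's SMO; as a Python illustration (using scikit-learn's SVC, which
    solves the same kind of quadratic program; the toy data is made up), the margin can be read
    off the learned weight vector:

        import numpy as np
        from sklearn.svm import SVC

        # Tiny linearly separable 2-D data set
        X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
        y = np.array([-1, -1, +1, +1])

        clf = SVC(kernel="linear", C=1e6)   # a large C approximates a hard margin
        clf.fit(X, y)

        w = clf.coef_[0]
        margin = 1.0 / np.linalg.norm(w)    # distance from the line to the closest example
        print(clf.predict(X))               # [-1 -1  1  1]
        print(margin)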

  • 7/28/2019 Lec18-Perceptron

    12/13

    If not Linearly Separable

    1. Add more nodes: Neural Nets

       1. Can represent any boolean function: why?

       2. No guarantees about learning

       3. Slow

       4. Incomprehensible

    2. Add more features: SVM

       1. Can represent any boolean function

       2. Learning guarantees

       3. Fast

       4. Semi-comprehensible

  • 7/28/2019 Lec18-Perceptron

    13/13

    Adding features

    Suppose a point (x, y) is positive if it lies in the unit disk, else negative.

    This is clearly not linearly separable. Map (x, y) -> (x, y, x^2 + y^2).

    Now, in 3-space, the data is easily separable: points inside the disk have third coordinate
    below 1, points outside have it above 1.

    This works for any learning algorithm, but SVM will almost do it for you (set the
    parameters).
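    A small sketch of this mapping (the sample points are made up):

        import numpy as np

        # Points labelled + if inside the unit disk, - otherwise
        pts = np.array([[0.1, 0.2], [-0.5, 0.3], [1.5, 0.0], [0.9, 0.9]])
        labels = np.where(pts[:, 0]**2 + pts[:, 1]**2 < 1, +1, -1)

        # Map (x, y) -> (x, y, x^2 + y^2)
        mapped = np.hstack([pts, (pts[:, 0]**2 + pts[:, 1]**2).reshape(-1, 1)])

        # In 3-space a linear rule suffices: predict + iff the third coordinate is below 1,
        # e.g. weights (0, 0, -1) with threshold -1
        W = np.array([0.0, 0.0, -1.0])
        pred = np.where(mapped @ W > -1, +1, -1)
        print(np.array_equal(pred, labels))   # True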