Lec18-Perceptron

Transcript of Lec18-Perceptron

  • 7/28/2019 Lec18-Perceptron

    1/13

    Linear Discriminators

    Chapter 20

    Only relevant parts

  • 7/28/2019 Lec18-Perceptron

    2/13

    Concerns

    Generalization Accuracy

    Efficiency

    Noise

    Irrelevant features

    Generality: when does this work?

  • 7/28/2019 Lec18-Perceptron

    3/13

    Linear Model Let f1, fn be the feature values of an example. Let class be denoted {+1, -1}.

    Define f0 = 1 (a constant bias feature).

    The linear model defines weights w0, w1, ..., wn; -w0 is the threshold.

    Classification rule: If w0*f0 + w1*f1 + ... + wn*fn > 0, predict class +; else predict class -.

    Briefly: W*F > 0, where * is the inner product of the weight vector and the feature vector, and F has been augmented with an extra 1.
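    As a minimal sketch of this rule in Python (the weights and feature values below are made
    up for illustration):

        # Linear threshold classification: predict +1 if W*F > 0, else -1.
        # F is assumed to be augmented with a leading f0 = 1, so weights[0] is the bias w0.
        def classify(weights, features):
            score = sum(w * f for w, f in zip(weights, features))
            return +1 if score > 0 else -1

        weights = [-1.0, 0.5, 2.0]          # (w0, w1, w2), made-up values
        features = [1.0, 1.0, 0.2]          # (f0 = 1, f1, f2), made-up values
        print(classify(weights, features))  # 0.5*1 + 2.0*0.2 - 1.0 = -0.1, so predict -1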

  • 7/28/2019 Lec18-Perceptron

    4/13

    Augmentation Trick

    Suppose the data is defined by features f1 and f2.

    2*f1 + 3*f2 > 4 is the classifier.

    Equivalently: (2, 3, -4) * (f1, f2, 1) > 0. Mapping data (f1, f2) to (f1, f2, 1) allows
    learning/representing the threshold as just another feature.

    Mapping data into higher dimensions is the key idea behind SVMs.
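    A quick check of this equivalence (the test points are arbitrary):

        # The thresholded rule 2*f1 + 3*f2 > 4 and the augmented rule
        # (2, 3, -4) * (f1, f2, 1) > 0 make identical predictions.
        def rule_threshold(f1, f2):
            return 2*f1 + 3*f2 > 4

        def rule_augmented(f1, f2):
            w, f = (2, 3, -4), (f1, f2, 1)
            return sum(wi * fi for wi, fi in zip(w, f)) > 0

        for f1, f2 in [(0, 0), (1, 1), (2, 0), (0.5, 1.5)]:
            assert rule_threshold(f1, f2) == rule_augmented(f1, f2)
        print("the two rules agree on all test points")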

  • 7/28/2019 Lec18-Perceptron

    5/13

    Mapping to enable Linear Separation

    Let x1, ..., xm be m vectors in R^N.

    Map xi into R^{N+m} by xi -> (xi, ei), where ei has a 1 in the (N+i)-th position and 0
    elsewhere.

    For any labelling of the xi by classes +/-, this embedding makes the data linearly
    separable: define wj = 0 for j <= N and let the weight in position N+i be the class label
    of xi. Then W * (xi, ei) equals xi's label, so every example is classified correctly.
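    A small sketch of this construction (the points and labels below are made up):

        import numpy as np

        # m points in R^N with an arbitrary labelling
        X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.8]])   # m = 3, N = 2
        y = np.array([+1, -1, +1])
        m, N = X.shape

        # Embed each xi in R^{N+m} by appending the i-th standard basis vector
        X_emb = np.hstack([X, np.eye(m)])

        # Weights: 0 on the original N coordinates, the label yi on the i-th new coordinate
        w = np.concatenate([np.zeros(N), y])

        # W * (xi, ei) = yi, so every point gets the correct sign
        print(np.sign(X_emb @ w))   # matches y exactly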

  • 7/28/2019 Lec18-Perceptron

    6/13

    Representational Power

    OR of n features (valued 0/1): wi = 1, threshold = 0

    AND of n features: wi = 1, threshold = n - 1

    k of n features (prototype): wi = 1, threshold = k - 1

    Can't do XOR.

    Combining linear threshold units yields any boolean function.
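    These three constructions as a single threshold unit (the feature values are made up;
    features are assumed to be 0/1):

        # A linear threshold unit: output 1 if sum(wi*fi) > threshold, else 0.
        def threshold_unit(weights, threshold, features):
            return 1 if sum(w * f for w, f in zip(weights, features)) > threshold else 0

        n = 4
        features = [1, 0, 1, 1]   # made-up 0/1 feature values

        or_n = threshold_unit([1] * n, 0, features)        # 1 iff at least one feature is 1
        and_n = threshold_unit([1] * n, n - 1, features)   # 1 iff all n features are 1
        k = 3
        k_of_n = threshold_unit([1] * n, k - 1, features)  # 1 iff at least k features are 1
        print(or_n, and_n, k_of_n)                         # 1 0 1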

  • 7/28/2019 Lec18-Perceptron

    7/13

  • 7/28/2019 Lec18-Perceptron

    8/13

    Classical Perceptron

    Theorem: If the concept is linearly separable, then the algorithm finds a solution.

    Training time can be exponential in the number of features.

    An epoch is a single pass through the entire data set.

    Convergence can take exponentially many epochs, but it is guaranteed to work: if |xi| <= R
    for every example and the data is separable with margin gamma, the number of mistakes is at
    most (R/gamma)^2.
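    The slide states the guarantee but not the algorithm itself; a minimal sketch of the
    classical perceptron update (the toy data is made up, and features are assumed to be
    augmented with a leading 1):

        import numpy as np

        def perceptron_train(X, y, max_epochs=100):
            # X: examples, one per row, augmented with a leading 1; y: labels in {+1, -1}
            W = np.zeros(X.shape[1])
            for _ in range(max_epochs):             # one epoch = one pass through the data
                mistakes = 0
                for xi, yi in zip(X, y):
                    if yi * np.dot(W, xi) <= 0:     # example misclassified (or on the boundary)
                        W = W + yi * xi             # classical update: add or subtract the example
                        mistakes += 1
                if mistakes == 0:                   # a full pass with no mistakes: converged
                    break
            return W

        X = np.array([[1, 0.0, 0.0], [1, 2.0, 0.0], [1, 0.0, 2.0], [1, 0.2, 0.3]])
        y = np.array([-1, +1, +1, -1])
        W = perceptron_train(X, y)
        print(np.sign(X @ W))   # matches y once training has converged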

  • 7/28/2019 Lec18-Perceptron

    9/13

    Hill-Climbing Search

    This is an optimization problem.

    The solution is found by hill-climbing, so there is no guarantee of finding the optimal
    solution.

    While derivatives tell you the direction (the negative gradient), they do not tell you how
    much to change each Wi.

    On the plus side, it is fast. On the negative side, there is no guarantee of separation.
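    The step-size point can be seen on a one-variable example (illustrative only, not from the
    slides): the gradient fixes the direction, but the learning rate alpha must be chosen
    separately, and too large a step overshoots.

        # Minimize E(w) = w^2 by hill-climbing; the gradient is 2*w, so each step is -alpha*2*w.
        def descend(alpha, w=1.0, steps=5):
            for _ in range(steps):
                w = w - alpha * 2 * w   # move in the direction of the negative gradient
            return w

        print(descend(alpha=0.1))   # small steps creep toward the minimum at 0
        print(descend(alpha=1.5))   # steps too large: overshoots and diverges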

  • 7/28/2019 Lec18-Perceptron

    10/13

    Hill-climbing View

    Goal: minimize the squared error Err^2.

    Let the class yi be 1 or -1.

    Let Err = W*Xi - yi, where Xi is the i-th example (the squared errors are summed over all
    examples).

    This is a function only of the weights.

    Use calculus: take partial derivatives with respect to Wj.

    To move to a lower value, move in the direction of the negative gradient, i.e. the change
    in Wj is proportional to -2*Err*Xj.
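    A minimal sketch of this gradient step (the toy data and the step size alpha are made up;
    the slides do not specify them):

        import numpy as np

        def squared_error_epoch(W, X, y, alpha=0.05):
            # One pass of gradient steps on the squared error.
            # X: examples, one per row, augmented with a leading 1; y: labels in {+1, -1}.
            for xi, yi in zip(X, y):
                err = np.dot(W, xi) - yi        # Err = W*Xi - yi
                W = W - alpha * 2 * err * xi    # change in Wj is -2*Err*Xj, scaled by alpha
            return W

        X = np.array([[1, 0.0, 0.0], [1, 2.0, 0.0], [1, 0.0, 2.0]])
        y = np.array([-1, +1, +1])
        W = np.zeros(3)
        for _ in range(200):
            W = squared_error_epoch(W, X, y)
        print(np.sign(X @ W))   # matches y after enough passes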

  • 7/28/2019 Lec18-Perceptron

    11/13

    Support Vector Machine

    Goal: maximize the margin.

    Assuming the line separates the data, the margin is the distance from the line to the
    closest example, whether positive or negative.

    Good News: This can be solved by a quadratic program.

    Implemented in Weka as SMO.

    If not linearly separable, SVM will add

    more features.
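    The slides point to Weka's SMO; as a Python illustration (using scikit-learn's SVC, which
    solves the same kind of quadratic program; the toy data is made up), the margin can be read
    off the learned weight vector:

        import numpy as np
        from sklearn.svm import SVC

        # Tiny linearly separable 2-D data set
        X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
        y = np.array([-1, -1, +1, +1])

        clf = SVC(kernel="linear", C=1e6)   # a large C approximates a hard margin
        clf.fit(X, y)

        w = clf.coef_[0]
        margin = 1.0 / np.linalg.norm(w)    # distance from the line to the closest example
        print(clf.predict(X))               # [-1 -1  1  1]
        print(margin)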

  • 7/28/2019 Lec18-Perceptron

    12/13

    If not Linearly Separable

    1. Add more nodes: Neural Nets

       1. Can represent any boolean function: why?

       2. No guarantees about learning

       3. Slow

       4. Incomprehensible

    2. Add more features: SVM

       1. Can represent any boolean function

       2. Learning guarantees

       3. Fast

       4. Semi-comprehensible

  • 7/28/2019 Lec18-Perceptron

    13/13

    Adding features

    Suppose a point (x, y) is positive if it lies in the unit disk, else negative.

    This is clearly not linearly separable. Map (x, y) -> (x, y, x^2 + y^2).

    Now, in 3-space, the data is easily separable: points inside the disk have third coordinate
    below 1, points outside have it above 1.

    This works for any learning algorithm, but SVM will almost do it for you (set the
    parameters).
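    A small sketch of this mapping (the sample points are made up):

        import numpy as np

        # Points labelled + if inside the unit disk, - otherwise
        pts = np.array([[0.1, 0.2], [-0.5, 0.3], [1.5, 0.0], [0.9, 0.9]])
        labels = np.where(pts[:, 0]**2 + pts[:, 1]**2 < 1, +1, -1)

        # Map (x, y) -> (x, y, x^2 + y^2)
        mapped = np.hstack([pts, (pts[:, 0]**2 + pts[:, 1]**2).reshape(-1, 1)])

        # In 3-space a linear rule suffices: predict + iff the third coordinate is below 1,
        # e.g. weights (0, 0, -1) with threshold -1
        W = np.array([0.0, 0.0, -1.0])
        pred = np.where(mapped @ W > -1, +1, -1)
        print(np.array_equal(pred, labels))   # True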