SVM : The Dual Formulation in Machine Learning

Optimization problem with convex quadratic objectives and linear constraints

Can be solved using QP.

Lagrange duality to get the optimization problem's dual form,

Allow us to use kernels to get optimal margin classifiers to work efficiently in very high dimensional spaces.
Allow us to derive an efficient algorithm for solving the above optimization problem that will typically do much better than generic QP software.

Lagrangian Duality in brief

The Primal Problem

The generalized Lagrangian :

the 𝛼's (𝛼≥0) and β' are called the Lagrange multipliers Lemma:

A re-written Primal:

The Primal Problem

The Dual Problem :

Theorem (weak duality) :

Theorem (strong duality):

If there exist a saddle point of L(w,𝛼,β), we have

The KKT conditions

If there exists some saddle point of L, then it satisfies the following "Karush-Kuhn-Tucker" (KKT) conditions:

Theorem : If w* , a* and b* satisfy the KKT condition , then it is also a solution to the primal and the dual problems.

Support Vectors

Only a few 𝛼i's can be nonzero
Call the training data points whose 𝛼i's are nonzero the support vectors

if 𝛼i > 0 then g(w) = 0

Solving the Optimization Problem

Quadratic Programming with linear constraints

Lagrangian Function ↕️

The Dual Problem

Now we have the following dual opt problem:

This is a quadratic programming problem.

- A global maximum of 𝛼i can always be found.

Support Vector Machine

Once we have the Lagrange multipliers {𝛼i} we can reconstruct the parameter vector w as a weighted combination of the training examples:

For testing with a new data z

- Compute

and classify z as class 1 if the sum is positive, and class 2 otherwise

Note: w need not be formed explicitly

The Solving the Optimization Problem

The discriminant function is:

It relies on a dot product between the test point x and the support vector xi.

Solving the optimization problem involved computing the dot products xiTxj between all pairs of training points.

The optimal w is a linear combination of a small number of data points.