In statistics, kernel Fisher discriminant analysis (KFD), also known as generalized discriminant analysis and kernel discriminant analysis, is a kernelized version of linear discriminant analysis (LDA). It is named after Ronald Fisher.
Linear discriminant analysis
Intuitively, the idea of LDA is to find a projection where class separation is maximized. Given two sets of labeled data, $\mathbf{C}_1$ and $\mathbf{C}_2$, we can calculate the mean value of each class, $\mathbf{m}_1$ and $\mathbf{m}_2$, as

$$\mathbf{m}_i = \frac{1}{l_i}\sum_{n=1}^{l_i}\mathbf{x}_n^i,$$

where $l_i$ is the number of examples of class $\mathbf{C}_i$. The goal of linear discriminant analysis is to give a large separation of the class means while also keeping the in-class variance small. This is formulated as maximizing, with respect to $\mathbf{w}$, the following ratio:

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w}},$$

where $\mathbf{S}_B$ is the between-class covariance matrix and $\mathbf{S}_W$ is the total within-class covariance matrix:

$$\begin{aligned}\mathbf{S}_B &= (\mathbf{m}_2-\mathbf{m}_1)(\mathbf{m}_2-\mathbf{m}_1)^{\text{T}}\\ \mathbf{S}_W &= \sum_{i=1,2}\sum_{n=1}^{l_i}(\mathbf{x}_n^i-\mathbf{m}_i)(\mathbf{x}_n^i-\mathbf{m}_i)^{\text{T}}.\end{aligned}$$
The maximum of the above ratio is attained at

$$\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1),$$

as can be shown by the Lagrange multiplier method (sketch of proof):

Maximizing $J(\mathbf{w})$ is equivalent to maximizing

$$\mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w}$$

subject to

$$\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} = 1.$$

This, in turn, is equivalent to maximizing $I(\mathbf{w},\lambda) = \mathbf{w}^{\text{T}}\mathbf{S}_B\mathbf{w} - \lambda\left(\mathbf{w}^{\text{T}}\mathbf{S}_W\mathbf{w} - 1\right)$, where $\lambda$ is the Lagrange multiplier.

At the maximum, the derivatives of $I(\mathbf{w},\lambda)$ with respect to $\mathbf{w}$ and $\lambda$ must be zero. Taking $\frac{dI}{d\mathbf{w}} = \mathbf{0}$ yields

$$\mathbf{S}_B\mathbf{w} = \lambda\mathbf{S}_W\mathbf{w},$$

which is trivially satisfied by $\mathbf{w} = c\,\mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1)$ and $\lambda = (\mathbf{m}_2-\mathbf{m}_1)^{\text{T}}\mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1)$.
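For concreteness, the following is a minimal NumPy sketch (not part of the original article) of the two-class LDA solution above: it forms the within-class scatter $\mathbf{S}_W$ and solves for the direction $\mathbf{w}\propto\mathbf{S}_W^{-1}(\mathbf{m}_2-\mathbf{m}_1)$. The function name and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """X1, X2: arrays of shape (l_1, d) and (l_2, d) holding the two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of outer products of centered samples.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Solve S_W w = (m_2 - m_1) rather than forming an explicit inverse.
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(50, 2))
print("LDA direction:", fisher_lda_direction(X1, X2))
```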
Extending LDA
To extend LDA to non-linear mappings, the data, given as the $l$ points $\mathbf{x}_i$, can be mapped to a new feature space, $F$, via some function $\phi$. In this new feature space, the function that needs to be maximized is

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{w}}{\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}},$$

where

$$\begin{aligned}\mathbf{S}_B^{\phi} &= \left(\mathbf{m}_2^{\phi}-\mathbf{m}_1^{\phi}\right)\left(\mathbf{m}_2^{\phi}-\mathbf{m}_1^{\phi}\right)^{\text{T}}\\ \mathbf{S}_W^{\phi} &= \sum_{i=1,2}\sum_{n=1}^{l_i}\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi}\right)\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi}\right)^{\text{T}},\end{aligned}$$

and

$$\mathbf{m}_i^{\phi} = \frac{1}{l_i}\sum_{j=1}^{l_i}\phi(\mathbf{x}_j^i).$$

Further, note that $\mathbf{w}\in F$. Explicitly computing the mappings $\phi(\mathbf{x}_i)$ and then performing LDA can be computationally expensive, and in many cases intractable. For example, $F$ may be infinite-dimensional. Thus, rather than explicitly mapping the data to $F$, the data can be implicitly embedded by rewriting the algorithm in terms of dot products and using kernel functions, in which the dot product in the new feature space is replaced by a kernel function, $k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})\cdot\phi(\mathbf{y})$.
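The article does not prescribe a particular kernel, so the following sketch of a commonly used choice, the Gaussian (RBF) kernel, together with a helper that builds the matrix of pairwise kernel evaluations, is purely illustrative (the function names are assumptions).

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, Y, kernel=rbf_kernel):
    """Matrix of kernel evaluations k(X[i], Y[j]) between two sets of points."""
    return np.array([[kernel(x, y) for y in Y] for x in X])
```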
LDA can be reformulated in terms of dot products by first noting that $\mathbf{w}$ will have an expansion of the form

$$\mathbf{w} = \sum_{i=1}^{l}\alpha_i\phi(\mathbf{x}_i).$$

Then note that

$$\mathbf{w}^{\text{T}}\mathbf{m}_i^{\phi} = \frac{1}{l_i}\sum_{j=1}^{l}\sum_{k=1}^{l_i}\alpha_j k(\mathbf{x}_j,\mathbf{x}_k^i) = \mathbf{\alpha}^{\text{T}}\mathbf{M}_i,$$

where

$$(\mathbf{M}_i)_j = \frac{1}{l_i}\sum_{k=1}^{l_i}k(\mathbf{x}_j,\mathbf{x}_k^i).$$

The numerator of $J(\mathbf{w})$ can then be written as:

$$\mathbf{w}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{w} = \mathbf{\alpha}^{\text{T}}\mathbf{M}\mathbf{\alpha}, \quad \text{where } \mathbf{M} = (\mathbf{M}_2-\mathbf{M}_1)(\mathbf{M}_2-\mathbf{M}_1)^{\text{T}}.$$

Similarly, the denominator can be written as

$$\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w} = \mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha}, \quad \text{where } \mathbf{N} = \sum_{j=1,2}\mathbf{K}_j\left(\mathbf{I}-\mathbf{1}_{l_j}\right)\mathbf{K}_j^{\text{T}},$$

with the $(n,m)^{\text{th}}$ component of $\mathbf{K}_j$ defined as $k(\mathbf{x}_n,\mathbf{x}_m^j)$, $\mathbf{I}$ the identity matrix, and $\mathbf{1}_{l_j}$ the matrix with all entries equal to $1/l_j$. This identity can be derived by starting out with the expression for $\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}$ and using the expansion of $\mathbf{w}$ and the definitions of $\mathbf{S}_W^{\phi}$ and $\mathbf{m}_i^{\phi}$:
$$\begin{aligned}
\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}
&= \left(\sum_{i=1}^{l}\alpha_i\phi^{\text{T}}(\mathbf{x}_i)\right)\left(\sum_{j=1,2}\sum_{n=1}^{l_j}\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^{\phi}\right)\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^{\phi}\right)^{\text{T}}\right)\left(\sum_{k=1}^{l}\alpha_k\phi(\mathbf{x}_k)\right)\\
&= \sum_{j=1,2}\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i\phi^{\text{T}}(\mathbf{x}_i)\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^{\phi}\right)\left(\phi(\mathbf{x}_n^j)-\mathbf{m}_j^{\phi}\right)^{\text{T}}\alpha_k\phi(\mathbf{x}_k)\right)\\
&= \sum_{j=1,2}\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i k(\mathbf{x}_i,\mathbf{x}_n^j)-\frac{1}{l_j}\sum_{p=1}^{l_j}\alpha_i k(\mathbf{x}_i,\mathbf{x}_p^j)\right)\left(\alpha_k k(\mathbf{x}_k,\mathbf{x}_n^j)-\frac{1}{l_j}\sum_{q=1}^{l_j}\alpha_k k(\mathbf{x}_k,\mathbf{x}_q^j)\right)\\
&= \sum_{j=1,2}\left(\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i\alpha_k k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_n^j)-\frac{2\alpha_i\alpha_k}{l_j}\sum_{p=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_p^j)+\frac{\alpha_i\alpha_k}{l_j^2}\sum_{p=1}^{l_j}\sum_{q=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_p^j)k(\mathbf{x}_k,\mathbf{x}_q^j)\right)\right)\\
&= \sum_{j=1,2}\left(\sum_{i=1}^{l}\sum_{n=1}^{l_j}\sum_{k=1}^{l}\left(\alpha_i\alpha_k k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_n^j)-\frac{\alpha_i\alpha_k}{l_j}\sum_{p=1}^{l_j}k(\mathbf{x}_i,\mathbf{x}_n^j)k(\mathbf{x}_k,\mathbf{x}_p^j)\right)\right)\\
&= \sum_{j=1,2}\mathbf{\alpha}^{\text{T}}\mathbf{K}_j\mathbf{K}_j^{\text{T}}\mathbf{\alpha}-\mathbf{\alpha}^{\text{T}}\mathbf{K}_j\mathbf{1}_{l_j}\mathbf{K}_j^{\text{T}}\mathbf{\alpha}\\
&= \mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha}.
\end{aligned}$$
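The identity can also be checked numerically. The sketch below (an illustration, not from the article) uses the explicit feature map of the inhomogeneous degree-2 polynomial kernel $k(\mathbf{x},\mathbf{y}) = (\mathbf{x}\cdot\mathbf{y}+1)^2$ so that $\mathbf{w}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{w}$ can be computed directly in feature space and compared with $\mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha}$; the kernel choice and random data are assumptions.

```python
import numpy as np

def phi(x):
    # Explicit feature map of k(x, y) = (x . y + 1)^2 for 2-dimensional inputs.
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def k(x, y):
    return (np.dot(x, y) + 1.0) ** 2

rng = np.random.default_rng(1)
X1 = rng.normal(size=(5, 2))          # class 1 samples
X2 = rng.normal(size=(7, 2)) + 1.0    # class 2 samples
X = np.vstack([X1, X2])               # all l = 12 samples
alpha = rng.normal(size=len(X))       # arbitrary expansion coefficients
w = sum(a * phi(x) for a, x in zip(alpha, X))

# Left-hand side: w^T S_W^phi w computed with the explicit feature map.
lhs = 0.0
for Xj in (X1, X2):
    Phi = np.array([phi(x) for x in Xj])
    centered = Phi - Phi.mean(axis=0)
    lhs += w @ (centered.T @ centered) @ w

# Right-hand side: alpha^T N alpha with N = sum_j K_j (I - 1_{l_j}) K_j^T.
rhs = 0.0
for Xj in (X1, X2):
    Kj = np.array([[k(xn, xm) for xm in Xj] for xn in X])   # l x l_j
    lj = len(Xj)
    rhs += alpha @ Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T @ alpha

print(np.isclose(lhs, rhs))  # True
```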
With these equations for the numerator and denominator of $J(\mathbf{w})$, the equation for $J$ can be rewritten as

$$J(\mathbf{\alpha}) = \frac{\mathbf{\alpha}^{\text{T}}\mathbf{M}\mathbf{\alpha}}{\mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha}}.$$

Then, differentiating and setting equal to zero gives

$$(\mathbf{\alpha}^{\text{T}}\mathbf{M}\mathbf{\alpha})\mathbf{N}\mathbf{\alpha} = (\mathbf{\alpha}^{\text{T}}\mathbf{N}\mathbf{\alpha})\mathbf{M}\mathbf{\alpha}.$$

Since only the direction of $\mathbf{w}$, and hence the direction of $\mathbf{\alpha}$, matters, the above can be solved for $\mathbf{\alpha}$ as

$$\mathbf{\alpha} = \mathbf{N}^{-1}(\mathbf{M}_2-\mathbf{M}_1).$$

Note that in practice, $\mathbf{N}$ is usually singular, so a multiple of the identity is added to it:

$$\mathbf{N}_{\epsilon} = \mathbf{N}+\epsilon\mathbf{I}.$$

Given the solution for $\mathbf{\alpha}$, the projection of a new data point is given by

$$y(\mathbf{x}) = \left(\mathbf{w}\cdot\phi(\mathbf{x})\right) = \sum_{i=1}^{l}\alpha_i k(\mathbf{x}_i,\mathbf{x}).$$
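Putting the two-class pieces together, a minimal sketch of training ($\mathbf{\alpha}=\mathbf{N}_{\epsilon}^{-1}(\mathbf{M}_2-\mathbf{M}_1)$) and of projecting a new point might look as follows. The RBF kernel, the regularization value, the function names, and the synthetic data are all illustrative assumptions, not part of the article.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def train_kfd(X1, X2, kernel=rbf, eps=1e-3):
    """Two-class KFD: returns the expansion coefficients alpha and the training data."""
    X = np.vstack([X1, X2])
    l = len(X)
    # K_j has entries k(x_n, x_m^j) for all training points x_n and class-j points x_m^j.
    K1 = np.array([[kernel(xn, xm) for xm in X1] for xn in X])
    K2 = np.array([[kernel(xn, xm) for xm in X2] for xn in X])
    M1, M2 = K1.mean(axis=1), K2.mean(axis=1)   # (M_j)_n = (1/l_j) sum_m k(x_n, x_m^j)
    N = eps * np.eye(l)                          # N is usually singular: add eps * I
    for Kj in (K1, K2):
        lj = Kj.shape[1]
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
    alpha = np.linalg.solve(N, M2 - M1)          # alpha = N^{-1} (M_2 - M_1)
    return alpha, X

def project(x, alpha, X, kernel=rbf):
    """y(x) = sum_i alpha_i k(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X))

rng = np.random.default_rng(0)
X1 = rng.normal(size=(30, 2))                 # class 1 (synthetic data for illustration)
X2 = rng.normal(size=(30, 2)) + [2.0, 0.0]    # class 2
alpha, X = train_kfd(X1, X2)
print(project(np.array([2.0, 0.0]), alpha, X))
```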
Multi-class KFD
The extension to cases where there are more than two classes is relatively straightforward. Let $c$ be the number of classes. Then multi-class KFD involves projecting the data into a $(c-1)$-dimensional space using $c-1$ discriminant functions

$$y_i = \mathbf{w}_i^{\text{T}}\phi(\mathbf{x}), \qquad i = 1,\ldots,c-1.$$

This can be written in matrix notation

$$\mathbf{y} = \mathbf{W}^{\text{T}}\phi(\mathbf{x}),$$

where the $\mathbf{w}_i$ are the columns of $\mathbf{W}$. Further, the between-class covariance matrix is now

$$\mathbf{S}_B^{\phi} = \sum_{i=1}^{c}l_i\left(\mathbf{m}_i^{\phi}-\mathbf{m}^{\phi}\right)\left(\mathbf{m}_i^{\phi}-\mathbf{m}^{\phi}\right)^{\text{T}},$$

where $\mathbf{m}^{\phi}$ is the mean of all the data in the new feature space. The within-class covariance matrix is

$$\mathbf{S}_W^{\phi} = \sum_{i=1}^{c}\sum_{n=1}^{l_i}\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi}\right)\left(\phi(\mathbf{x}_n^i)-\mathbf{m}_i^{\phi}\right)^{\text{T}}.$$

The solution is now obtained by maximizing

$$J(\mathbf{W}) = \frac{\left|\mathbf{W}^{\text{T}}\mathbf{S}_B^{\phi}\mathbf{W}\right|}{\left|\mathbf{W}^{\text{T}}\mathbf{S}_W^{\phi}\mathbf{W}\right|}.$$

The kernel trick can again be used, and the goal of multi-class KFD becomes

$$\mathbf{A}^{*} = \underset{\mathbf{A}}{\operatorname{argmax}}\;\frac{\left|\mathbf{A}^{\text{T}}\mathbf{M}\mathbf{A}\right|}{\left|\mathbf{A}^{\text{T}}\mathbf{N}\mathbf{A}\right|},$$

where $\mathbf{A} = [\mathbf{\alpha}_1,\ldots,\mathbf{\alpha}_{c-1}]$ and

$$\begin{aligned}\mathbf{M} &= \sum_{j=1}^{c}l_j\left(\mathbf{M}_j-\mathbf{M}_{*}\right)\left(\mathbf{M}_j-\mathbf{M}_{*}\right)^{\text{T}}\\ \mathbf{N} &= \sum_{j=1}^{c}\mathbf{K}_j\left(\mathbf{I}-\mathbf{1}_{l_j}\right)\mathbf{K}_j^{\text{T}}.\end{aligned}$$

The $\mathbf{M}_j$ are defined as in the above section and $\mathbf{M}_{*}$ is defined as

$$(\mathbf{M}_{*})_j = \frac{1}{l}\sum_{k=1}^{l}k(\mathbf{x}_j,\mathbf{x}_k).$$

$\mathbf{A}^{*}$ can then be computed by finding the $c-1$ leading eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$. Furthermore, the projection of a new input, $\mathbf{x}_t$, is given by

$$\mathbf{y}(\mathbf{x}_t) = \left(\mathbf{A}^{*}\right)^{\text{T}}\mathbf{K}_t,$$

where the $i^{\text{th}}$ component of $\mathbf{K}_t$ is given by $k(\mathbf{x}_i,\mathbf{x}_t)$.
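A corresponding sketch of the multi-class case is given below; as with the two-class sketch, the RBF kernel, regularization, and helper names are assumptions. It builds $\mathbf{M}$ and $\mathbf{N}$ from the class kernel matrices and takes the $c-1$ leading eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$ as the columns of $\mathbf{A}^{*}$.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def multiclass_kfd(classes, kernel=rbf, eps=1e-3):
    """classes: list of (l_j, d) arrays, one per class. Returns A* (l x (c-1)) and stacked data."""
    X = np.vstack(classes)
    l, c = len(X), len(classes)
    # (M_*)_j = (1/l) sum_k k(x_j, x_k): kernel mean over the whole data set.
    M_star = np.array([np.mean([kernel(xj, xk) for xk in X]) for xj in X])
    M = np.zeros((l, l))
    N = eps * np.eye(l)                          # regularized N
    for Xj in classes:
        Kj = np.array([[kernel(xn, xm) for xm in Xj] for xn in X])
        lj = Kj.shape[1]
        d = (Kj.mean(axis=1) - M_star)[:, None]  # M_j - M_*
        M += lj * (d @ d.T)
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T
    # Columns of A* are the c-1 leading eigenvectors of N^{-1} M.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(N, M))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:c - 1]].real, X

def project(x_t, A, X, kernel=rbf):
    """y(x_t) = A^T K_t with (K_t)_i = k(x_i, x_t)."""
    K_t = np.array([kernel(xi, x_t) for xi in X])
    return A.T @ K_t
```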
Classification using KFD
In both two-class and multi-class KFD, the class label of a new input can be assigned as

$$f(\mathbf{x}) = \arg\min_{j} D\left(\mathbf{y}(\mathbf{x}),\bar{\mathbf{y}}_j\right),$$

where $\bar{\mathbf{y}}_j$ is the projected mean for class $j$ and $D(\cdot,\cdot)$ is a distance function.
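Continuing the multi-class sketch above (and reusing its hypothetical `project` helper and kernel), a nearest-projected-mean classifier with Euclidean distance as $D$ could be written as:

```python
import numpy as np

def classify(x, A, X, classes, kernel):
    """Assign x to the class whose projected mean is nearest in Euclidean distance."""
    y = project(x, A, X, kernel)                 # projection of the new input
    dists = [np.linalg.norm(y - np.mean([project(xi, A, X, kernel) for xi in Xj], axis=0))
             for Xj in classes]
    return int(np.argmin(dists))
```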
Applications
Kernel discriminant analysis has been used in a variety of applications. These include:
- Face recognition and detection
- Hand-written digit recognition
- Palmprint recognition
- Classification of malignant and benign cluster microcalcifications
- Seed classification
- Search for the Higgs boson at CERN