**Feature Reduction :-**

The information about the target class is inherent in the variables.

Naive view :

More features

⇒ More information

⇒ Better discrimination power

In practice :

- There are many reasons why this is not the case!

**Curse of Dimensionality**

When the number of training examples is fixed :

- the classifier's performance will usually degrade as the number of features grows large!

**Feature Selection :-**

Given a set of features F = {𝓍1, ..., 𝓍n},

the feature selection problem is to find a subset F' ⊆ F that maximizes the learner's ability to classify patterns.

Formally, F' should maximize some scoring function.

(𝓍1, 𝓍2, ..., 𝓍n) → (𝓍i1, 𝓍i2, ..., 𝓍in)

**Feature Selection Steps**

Feature selection is an optimization problem

Step 1 : Search the space of possible feature subsets.

Step 2 : Pick the subset that is optimal or near-optimal with respect to some objective function.
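The two steps above can be sketched as an exhaustive ("optimum") search. This is only a minimal illustration: the function name `best_subset`, the feature names, and the toy scoring function are all assumptions made for the example, not part of the original notes.

```python
from itertools import combinations


def best_subset(features, score):
    """Search every non-empty subset and return the highest-scoring one.

    `score` is a hypothetical scoring function: it takes a tuple of
    feature names and returns a number (higher is better).
    """
    best, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        # Step 1: enumerate the space of subsets of size k.
        for subset in combinations(features, k):
            s = score(subset)
            # Step 2: keep the subset that is best so far.
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


# Toy scoring function (made up for illustration): each feature has a
# fixed usefulness, and every extra feature costs a small penalty.
usefulness = {"x1": 0.5, "x2": 0.05, "x3": 0.4}


def toy_score(subset):
    return sum(usefulness[f] for f in subset) - 0.1 * len(subset)


subset, value = best_subset(["x1", "x2", "x3"], toy_score)
print(subset, round(value, 2))  # ('x1', 'x3') 0.7
```

With n features there are 2^n - 1 non-empty subsets to score, which is why heuristic and randomized search strategies are used when n is large.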

Search strategies

- Optimum

- Heuristic

- Randomized

Evaluation strategies

- Filter methods

- Wrapper methods

**Evaluating a feature subset**

**Supervised (Wrapper method)**

- Train using selected subset

- Estimate error on validation dataset

**Unsupervised (Filter method)**

- Look at input only

- Select the subset that has the most information
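To make the contrast concrete, here is a rough sketch of the two kinds of evaluation. The 1-nearest-neighbour classifier, the use of summed variance as a stand-in for "information", and the toy data are all assumptions chosen to keep the example small.

```python
from statistics import pvariance


def validation_error(subset, train, valid):
    """Wrapper-style score: 1-nearest-neighbour error on a validation set,
    using only the feature indices in `subset`.

    `train` and `valid` are lists of (feature_vector, label) pairs.
    """
    def dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in subset)

    errors = 0
    for x, y in valid:
        # "Training" 1-NN is just storing the data; predict from the nearest point.
        _, predicted = min(train, key=lambda pair: dist(pair[0], x))
        errors += predicted != y
    return errors / len(valid)


def total_variance(subset, inputs):
    """Filter-style score: looks at the inputs only (no labels).
    Summed variance is a crude proxy for "information" (an assumption)."""
    return sum(pvariance([x[j] for x in inputs]) for j in subset)


# Toy data: feature 0 separates the classes, feature 1 does not.
train = [((0.0, 5.0), "a"), ((0.2, 1.0), "a"), ((1.0, 5.2), "b"), ((1.2, 0.8), "b")]
valid = [((0.1, 0.7), "a"), ((1.1, 1.1), "b"), ((0.3, 5.3), "a"), ((0.9, 4.9), "b")]
inputs = [x for x, _ in train]

print(validation_error([0], train, valid))  # 0.0
print(validation_error([1], train, valid))  # 1.0
print(total_variance([0], inputs), total_variance([1], inputs))
```

On this toy data the wrapper score prefers feature 0, while the variance-based filter score is larger for feature 1, which shows how a score that ignores the labels can disagree with validation error.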

**Forward Selection**

- Start with empty feature set

- Try each remaining feature

- Estimate the classification/regression error for adding each feature

- Select the feature that gives the maximum improvement

- Stop when there is no significant improvement
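The forward selection loop can be sketched as follows. The `error` callable, the `min_gain` threshold used for "significant improvement", and the toy error function are hypothetical; in practice `error` would train a model on the subset and return its validation error.

```python
def forward_selection(n_features, error, min_gain=1e-3):
    """Greedy forward selection over feature indices 0..n_features-1.

    `error(subset)` is a hypothetical callable returning the estimated
    classification/regression error for a list of feature indices.
    """
    selected = []                       # start with the empty feature set
    current = error(selected)
    remaining = set(range(n_features))
    while remaining:
        # Try each remaining feature and keep the one with the lowest error.
        best = min(sorted(remaining), key=lambda j: error(selected + [j]))
        best_err = error(selected + [best])
        if current - best_err < min_gain:
            break                       # no significant improvement: stop
        selected.append(best)
        remaining.remove(best)
        current = best_err
    return selected, current


# Toy error function (made up): features 0 and 2 reduce the error,
# feature 1 contributes nothing.
gains = {0: 0.25, 1: 0.0, 2: 0.125}


def toy_error(subset):
    return 0.5 - sum(gains[j] for j in subset)


print(forward_selection(3, toy_error))  # ([0, 2], 0.125)
```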

**Backward Search**

- Start with full feature set

- Try removing each remaining feature

- Drop the feature with the smallest impact on error
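A comparable sketch for backward search is given below. The notes do not state a stopping rule, so the one used here (stop once every possible removal raises the error by more than `max_increase`) is an assumption, as are the `error` callable and the toy error function.

```python
def backward_elimination(n_features, error, max_increase=1e-3):
    """Greedy backward search over feature indices 0..n_features-1.

    `error(subset)` is a hypothetical callable returning the estimated
    error for a list of feature indices.
    """
    selected = list(range(n_features))  # start with the full feature set
    current = error(selected)
    while len(selected) > 1:
        def without(j):
            return [f for f in selected if f != j]

        # Try removing each feature; find the removal that hurts least.
        drop = min(selected, key=lambda j: error(without(j)))
        new_err = error(without(drop))
        if new_err - current > max_increase:
            break                       # every removal makes things clearly worse
        selected = without(drop)
        current = new_err
    return selected, current


# Toy error function (made up): features 0 and 2 reduce the error,
# feature 1 contributes nothing.
gains = {0: 0.25, 1: 0.0, 2: 0.125}


def toy_error(subset):
    return 0.5 - sum(gains[j] for j in subset)


print(backward_elimination(3, toy_error))  # ([0, 2], 0.125)
```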

Univariate (looks at each feature independently of others)

- Pearson correlation coefficient

- F-score

- Chi-square

- Signal to noise ratio

- Mutual information

- Etc.

Rank features by importance

Ranking cut-off is determined by user
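Ranking with a user-chosen cut-off might look like this. The scores are made-up numbers standing in for any of the univariate measures listed above, and `top_k_features` is a hypothetical helper name.

```python
def top_k_features(scores, k):
    """Rank feature indices by score (highest first) and keep the first k."""
    ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranking[:k]


scores = [0.82, 0.05, 0.40, 0.61]   # one made-up score per feature
print(top_k_features(scores, 2))    # [0, 3]
```

Here `k` plays the role of the ranking cut-off; a score threshold could be used instead.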

**Pearson correlation coefficient**

- Measures the linear correlation between two variables

- Formula for Pearson correlation : r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )

- The correlation r is between +1 and -1.

- +1 means perfect positive correlation
- -1 means perfect negative correlation (the other direction)
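As a quick check of the ±1 extremes, the coefficient can be computed directly from the standard sample formula (the summed products of deviations divided by the square root of the product of the summed squared deviations). The function name and the toy data are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0  (perfect positive)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0 (perfect negative)
```

For feature selection, x would be the values of one feature and y the target, and features would be ranked by |r|.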

**Signal to noise ratio**

- Difference in the class means divided by the sum of the standard deviations of the two classes

S2N(X,Y) = (μx - μy) / (σx + σy)

- Large absolute values indicate a strong correlation between the feature and the class
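A small sketch of this score for one feature is shown below. It uses the sum of the two class standard deviations in the denominator, which is the form usually quoted for this measure; that choice, the use of the population standard deviation, and the toy values are assumptions.

```python
from statistics import mean, pstdev


def signal_to_noise(values, labels, positive):
    """Signal-to-noise score of one feature for a two-class problem.

    `values` holds the feature value of each example, `labels` the class
    of each example, and `positive` names one of the two classes.
    """
    pos = [v for v, lab in zip(values, labels) if lab == positive]
    neg = [v for v, lab in zip(values, labels) if lab != positive]
    # Difference in class means over the sum of class standard deviations.
    return (mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg))


values = [4, 6, 1, 3]
labels = ["x", "x", "y", "y"]
print(signal_to_noise(values, labels, "x"))  # 1.5
```

Swapping the two classes only flips the sign, so features are typically ranked by the absolute value of the score.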

**Multivariate feature selection**

- Multivariate (consider all features simultaneously)

- Consider the vector w for any linear classifier.

- Classification of a point x is given by wᵀx + w0.

- Small entries of w will have little effect on the dot product and therefore those features are less relevant

- For example if w = (10, 0.1, -9) then features 0 and 2 are contributing more to the dot product than feature 1.

- A ranking of features given by this w is 0,2,1.

- The vector w can be obtained from any linear classifier

- A variant of this approach is called **recursive feature elimination** :

Step 1 : Compute w on all features

Step 2 : Remove the feature with the smallest |wi|

Step 3 : Recompute w on the reduced data

Step 4 : If the stopping criterion is not met, go to step 2
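The weight-based ranking and the elimination loop can be sketched together. The linear classifier used here is a nearest-centroid rule (w = mean of class +1 minus mean of class -1), picked only because it needs no training loop; any linear classifier could be swapped in. The stopping criterion (a target number of features `n_keep`), the function names, and the toy data are likewise assumptions, and features are assumed to be on comparable scales so that weight magnitudes are comparable.

```python
# Ranking from the example weight vector: larger |w_i| means more relevant.
w_example = (10, 0.1, -9)
ranking = sorted(range(len(w_example)), key=lambda j: abs(w_example[j]), reverse=True)
print(ranking)  # [0, 2, 1]


def centroid_weights(X, y, features):
    """Weights of a nearest-centroid linear classifier on the given features."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    return {
        j: sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg)
        for j in features
    }


def rfe(X, y, n_keep, fit=centroid_weights):
    """Recursive feature elimination down to `n_keep` features."""
    features = list(range(len(X[0])))
    eliminated = []
    while len(features) > n_keep:                    # stopping criterion
        w = fit(X, y, features)                      # (re)compute w
        weakest = min(features, key=lambda j: abs(w[j]))
        features.remove(weakest)                     # drop the smallest |w_i|
        eliminated.append(weakest)
    return features, eliminated


# Toy data: feature 0 separates the classes most, feature 1 not at all.
X = [(2.0, 0.5, 1.0), (3.0, 0.25, 1.5), (-2.0, 0.5, -1.0), (-3.0, 0.25, -1.5)]
y = [1, 1, -1, -1]
print(rfe(X, y, n_keep=1))  # ([0], [1, 2])
```

The order in which features are eliminated gives a ranking from least to most relevant.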
