Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance.


- There may be noise in the training data.

- The training data may be insufficient.

A hypothesis h is said to overfit the training data if there is another hypothesis, h', such that h has smaller error than h' on the training data but h has larger error on the test data than h'.

**Underfitting and Overfitting**

**Underfitting:** when the model is too simple, both training and test errors are large.

**Overfitting due to Insufficient Examples**

A lack of data points makes it difficult to correctly predict the class labels of that region.

Overfitting results in decision trees that are more complex than necessary.

Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.

**Avoid Overfitting**

How can we avoid overfitting a decision tree?

- Pre-pruning: stop growing the tree when a data split is not statistically significant.

- Post-pruning: grow the full tree, then remove nodes.

Methods for evaluating subtrees to prune:

- Minimum description length (MDL):

Minimize: size(tree) + size(misclassifications(tree))

- Cross-validation
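As a toy illustration of the MDL trade-off (the cost units and the node/error counts below are made up; a real MDL coding would measure both terms in bits), a subtree is kept only if its extra size pays for the errors it removes:

```python
# Hypothetical MDL-style score: one cost unit per tree node plus one
# per misclassified training example.
def mdl_score(num_nodes, num_errors):
    return num_nodes + num_errors

# Full subtree: 7 nodes, 2 training errors -> score 9.
# Pruned to a single leaf: 1 node, 5 training errors -> score 6.
# The pruned tree has the smaller description, so MDL prefers pruning here.
print(mdl_score(7, 2), mdl_score(1, 5))
```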

**Pre-Pruning (Early stopping)**

Evaluate splits before installing them:

- Don't install splits that don't look worthwhile.

- When there are no worthwhile splits left to install, stop.

Typical stopping conditions for a node:

- Stop if all instances belong to the same class

- Stop if all the attribute values are the same

More restrictive conditions:

- Stop if the number of instances is less than some user-specified threshold.

- Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).

- Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
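The χ² condition can be sketched in plain Python. The statistic below tests whether the class counts in a candidate split's two children look independent of the split; the 2×2 contingency table is illustrative:

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Class counts in the two children of a candidate split:
table = [[30, 10],   # left  child: 30 positive, 10 negative
         [10, 30]]   # right child: 10 positive, 30 negative

stat = chi2_statistic(table)
# The critical value for 1 degree of freedom at alpha = 0.05 is 3.841;
# above it, class and split look dependent, so the split is installed.
print(stat, stat > 3.841)   # 20.0 True
```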

**Reduced-error Pruning**

A post-pruning, cross-validation approach:

- Partition the training data into a "grow" set and a "validation" set.

- Build a complete tree from the "grow" set.

- Until accuracy on the validation set decreases, do:

  - For each non-leaf node in the tree:

    - Temporarily prune the subtree below it, replacing it with a majority-vote leaf.

    - Test the accuracy of this hypothesis on the validation set.

  - Permanently prune the node whose removal yields the greatest increase in accuracy on the validation set.

Problem: this uses less data to construct the tree.

Pruning is sometimes done at the rule level instead.
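The reduced-error pruning loop can be sketched on a toy, dict-based tree (a hypothetical representation, not any particular library's API):

```python
# Internal node: {"feature": i, "left": t, "right": t, "majority": label}
# Leaf:          {"leaf": label}

def predict(node, x):
    while "feature" in node:
        node = node["left"] if x[node["feature"]] == 0 else node["right"]
    return node["leaf"]

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    if "feature" in node:
        yield node
        yield from internal_nodes(node["left"])
        yield from internal_nodes(node["right"])

def reduced_error_prune(tree, val):
    while True:
        base = accuracy(tree, val)
        best_acc, best = -1.0, None
        for node in list(internal_nodes(tree)):
            saved = dict(node)                  # temporarily prune the node
            node.clear()
            node["leaf"] = saved["majority"]    # ...to a majority-vote leaf
            acc = accuracy(tree, val)
            node.clear()
            node.update(saved)                  # restore the subtree
            if acc > best_acc:
                best_acc, best = acc, node
        if best is None or best_acc < base:     # pruning would hurt: stop
            return tree
        label = best["majority"]                # prune permanently
        best.clear()
        best["leaf"] = label

# Toy grown tree: the split on feature 1 is spurious (fits noise).
tree = {"feature": 0, "majority": 0,
        "left":  {"feature": 1, "majority": 0,
                  "left": {"leaf": 0}, "right": {"leaf": 1}},
        "right": {"leaf": 1}}
val = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]

reduced_error_prune(tree, val)
print(accuracy(tree, val))   # 1.0 once the spurious split is pruned
```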

**Model Selection & Generalization**

Learning is an ill-posed problem: the data alone are not sufficient to find a unique solution.

Hence the need for an inductive bias: assumptions about the hypothesis class H.

Generalization: How well a model performs on new data

Overfitting: H is more complex than the underlying class C or function f.

Underfitting: H is less complex than C or f.

**Triple Trade-Off**

There is a trade-off between three factors:

- complexity of H, c(H),

- training set size, N,

- generalization error, E, on new data.

As N increases, E decreases.

As c(H) increases, E first decreases and then increases.

As c(H) increases, the training error decreases for some time and then stays constant (frequently at 0).

Overfitting happens when a model is capturing idiosyncrasies of the data rather than generalities.

- Often caused by too many parameters relative to the amount of training data.

- E.g., a degree-N polynomial can interpolate any N+1 data points (with distinct x values).
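The polynomial claim is easy to check numerically (a NumPy sketch with made-up data):

```python
import numpy as np

# A degree-7 polynomial has 8 free coefficients, so it can pass
# through 8 training points exactly.
x = np.linspace(0.0, 1.0, 8)
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(8)

coeffs = np.polyfit(x, y, deg=7)   # as many coefficients as data points
train_err = float(np.max(np.abs(np.polyval(coeffs, x) - y)))
print(train_err)   # ~0: training error vanishes, but the fit has
                   # memorized the noise rather than the sine curve
```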

**Dealing with Overfitting**

- Use more data

- Use a tuning set

- Regularization

- Be a Bayesian

**Regularization**

In a linear regression model, overfitting is characterized by large weights.

**Penalize large weights in Linear Regression**

Introduce a penalty term in the loss function.
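A minimal sketch of such a penalty is ridge regression, which minimizes ||Xw − y||² + λ||w||² and has the closed form w = (XᵀX + λI)⁻¹Xᵀy. The data and the value of λ below are illustrative:

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(20)

w_ols   = ridge_weights(X, y, 0.0)    # no penalty: ordinary least squares
w_ridge = ridge_weights(X, y, 10.0)   # penalized

# The penalty shrinks the weight vector toward zero.
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))   # True
```

Shrinkage is guaranteed here: for λ > 0 the ridge solution always has a smaller norm than the unpenalized one (unless the latter is already zero), which is exactly how the penalty counteracts the large weights that characterize overfitting.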
