Overfitting in Machine Learning | Underfitting and Overfitting (Example)

Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization performance.
- There may be noise in the training data
- May be based on insufficient data
A hypothesis h is said to overfit the training data if there is another hypothesis, h', such that h has smaller error than h' on the training data but h has larger error on the test data than h'.

Underfitting and Overfitting (Example)

Underfitting : When model is too simple, both training and test errors are large.

Overfitting due to Insufficient Examples

Lack of data points makes it difficult to predict correctly the class labels of that region.

Overfitting results in decision trees that are more complex than necessary.

Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.

Avoid Overfitting

How can we avoid overfitting a decision tree?
- Prepruning : Stop growing when data split not statistically significant
- Postpruning : Grow full tree then remove nodes

Methods for evaluating subtrees to prune:
- Minimum description length (MDL):
Minimum: size(tree) + size(misclassifications(tree))
- Cross-validation

Pre-Pruning (Early stopping)

Evaluate splits before installing them :
- Don't install splits that don't look worthwhile
- When no worthwhile splits to install, done
Typical stopping condition for a node:
- Stop if all instances belong to the same class
- Stop if all the attribute values are the same
More restrictive conditions:
- Stop if number of instance is less than some user-specified threshold
- Stop if class distribution of instance are independent of the available features (e.g., using 𝓍2 test)
- Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).

Reduced-error Pruning

A post-pruning, cross validation approach
- Partition training data into "grow" set and "validation" set.
- Build a complete tree for the "grow" data
- Until accuracy on validation set decreases, do:
For each non-leaf node in the tree
Temporarily prune the tree below; replace it by majority vote
Test the accuracy of the hypothesis on the validation set
Permanently prune the node with the greatest increase in accuracy on the validation test.
Problem: Uses less data to construct the tree
Sometimes done at the rules level

Model Selection & Generalization

Learning is an ill-posed problem; data is not sufficient to find a unique solution
The need for inductive bias, assumptions about H
Generalization: How well a model performs on new data
Overfitting: H more complex than C or ƒ
Undderfitting: H less complex than C or ƒ

Triple Trade-Off

There is a trade-off between three factors:
- Complexity of H, c (H),
- Training set size, N,
- Generalization error, E on new data

As N increases-, E decreases
As c (H) increases-, first E decreases and then E-increases
As c (H)- increases, the training error decreases for some time and then stays constant (frequently at 0)

Overfitting happens when a model is capturing idiosyncrasies of the data rather than generalities.
- Often caused by too many parameters relative to the amount of training data.
- E.g. an order-N polynomial can intersect any N+1 data points

Dealing with Overfitting

Use more data
Use a tuning set
Regularization
Be a Bayesian

Regularization
In a linear regression model overfitting is characterized by large weights.