**Decision tree learning** is one of the predictive modelling approaches used in statistics, data mining and **machine learning**. It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves).
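As a minimal sketch of this idea in Python (assuming scikit-learn is available; the tiny weather-style dataset below is invented purely for illustration):

```python
# Fit a decision tree, then follow its branches from observations
# to a target value at a leaf. The toy data here is hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Each row: [outlook (0=sunny, 1=overcast, 2=rain), windy (0/1)]
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1]]
y = [0, 0, 1, 1, 0]          # target: play tennis? (0=no, 1=yes)

clf = DecisionTreeClassifier(criterion="entropy")  # split by information gain
clf.fit(X, y)
print(clf.predict([[1, 1]]))  # walk the tree for a new observation
```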

**Principled Criterion**

Selection of an attribute to test at each node: choosing the most useful attribute for classifying examples.

Information gain

- measures how well a given attribute separates the training examples according to their target classification.

- This measure is used to select among the candidate attributes at each step while growing the tree.

- Gain is a measure of how much we can reduce uncertainty (its value lies between 0 and 1).

**Entropy:-**

A measure of:

- uncertainty

- purity

- information content

Information theory: an optimal-length code assigns -log2(p) bits to a message having probability p. For example, a message with probability 0.5 costs -log2(0.5) = 1 bit.

S is a sample of training examples:

- p+ is the proportion of positive examples in S

- p- is the proportion of negative examples in S

Entropy of S: the average optimal number of bits needed to encode the certainty/uncertainty about S.


Entropy measures the impurity of S

Entropy(S) = -p+ log2(p+) - p- log2(p-)

The entropy is 0 if the outcome is "certain".

The entropy is maximum if we have no knowledge of the system (i.e., all outcomes are equally likely).
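As a small sketch of this definition in plain Python (the helper name `entropy` is mine; it takes the positive/negative counts of a sample):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                  # 0 * log2(0) is treated as 0
            result -= p * log2(p)  # each outcome costs -log2(p) bits
    return result

print(entropy(5, 5))   # 1.0 -> no knowledge, 50/50 split
print(entropy(10, 0))  # 0.0 -> outcome is certain
```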

**Information Gain**

Gain(S,A): the expected reduction in entropy due to partitioning S on attribute A:

Gain(S,A) = Entropy(S) - Σv (|Sv|/|S|) Entropy(Sv)

where the sum runs over the values v of A and Sv is the subset of S for which A has value v.

For example, let S = [29+,35-] (so Entropy(S) ≈ 0.99), and suppose attribute A1 splits S into [21+,5-] and [8+,30-], while A2 splits S into [18+,33-] and [11+,2-]:

Entropy([21+,5-]) = 0.71

Entropy([8+,30-]) = 0.74

Gain(S,A1) = Entropy(S) - 26/64 * Entropy([21+,5-]) - 38/64 * Entropy([8+,30-]) = 0.27

Entropy([18+,33-]) = 0.94

Entropy([11+,2-]) = 0.62

Gain(S,A2) = Entropy(S) - 51/64 * Entropy([18+,33-]) - 13/64 * Entropy([11+,2-]) = 0.12
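Using the `entropy` helper sketched above, these two gains can be checked directly:

```python
e_S = entropy(29, 35)   # entropy of the full sample S = [29+,35-], ~0.99

gain_A1 = e_S - 26/64 * entropy(21, 5) - 38/64 * entropy(8, 30)
gain_A2 = e_S - 51/64 * entropy(18, 33) - 13/64 * entropy(11, 2)

print(round(gain_A1, 2))  # 0.27
print(round(gain_A2, 2))  # 0.12
```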

**Training Example:-**

The running example is the standard PlayTennis dataset: 14 training examples described by the attributes Outlook, Temperature, Humidity and Wind, with the target PlayTennis (yes/no).

**Selecting the Next Attribute**

The information gain values for the 4 attributes are:

Gain(S, Outlook) = 0.247

Gain(S, Humidity) = 0.151

Gain(S, Wind) = 0.048

Gain(S, Temperature) = 0.029

where S denotes the collection of training examples. Outlook has the highest gain, so it is selected as the attribute to test at the root node.
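These values can be reproduced with the `entropy` helper above; for example for Outlook, assuming the usual PlayTennis class counts (S = [9+,5-]; Sunny = [2+,3-], Overcast = [4+,0-], Rain = [3+,2-]):

```python
e_S = entropy(9, 5)  # entropy of the 14 PlayTennis examples, ~0.94
gain_outlook = (e_S
                - 5/14 * entropy(2, 3)   # Sunny
                - 4/14 * entropy(4, 0)   # Overcast (pure node)
                - 5/14 * entropy(3, 2))  # Rain
print(round(gain_outlook, 3))            # 0.247
```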

**Splitting Rule: GINI Index**

GINI Index

- A measure of node impurity:

GINI(t) = 1 - Σj [p(j|t)]²

where p(j|t) is the relative frequency of class j at node t. GINI is 0 for a pure node and largest when the classes are equally mixed.
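A minimal sketch of this measure (the helper name `gini` is mine; it takes the class counts at a node):

```python
def gini(counts):
    """GINI index of a node given its class counts, e.g. gini([9, 5])."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5 -> maximally impure two-class node
print(gini([10, 0]))  # 0.0 -> pure node
```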

**Splitting Based on Continuous Attributes**

**Continuous Attribute - Binary Split**

For a continuous attribute:

- Partition the continuous values of attribute A into a discrete set of intervals.

- Create a new Boolean attribute Ac that is true when A < c, for some threshold c.

- Consider all possible splits and find the best cut, as in the sketch below.
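Here is a sketch of that threshold search for one numeric attribute, reusing the `entropy` helper from above (the function name `best_cut` and the example values are mine; candidate cuts are taken as midpoints between adjacent sorted values):

```python
def best_cut(values, labels):
    """Threshold c on a numeric attribute maximizing information gain
    for the binary split A < c vs. A >= c. Labels are 0/1."""
    n = len(values)
    e_S = entropy(labels.count(1), labels.count(0))
    pairs = sorted(zip(values, labels))
    best_gain, best_c = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                             # no cut between equal values
        c = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [lab for v, lab in pairs if v < c]
        right = [lab for v, lab in pairs if v >= c]
        gain = (e_S
                - len(left) / n * entropy(left.count(1), left.count(0))
                - len(right) / n * entropy(right.count(1), right.count(0)))
        if gain > best_gain:
            best_gain, best_c = gain, c
    return best_c, best_gain

# Illustrative temperature-style values: the best cut falls between 48 and 60
print(best_cut([40, 48, 60, 72, 80, 90], [0, 0, 1, 1, 1, 0]))  # (54.0, ~0.46)
```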
