Classification by Decision Tree Induction

- Decision tree
  - A flow-chart-like tree structure
  - Internal node denotes a test on an attribute
  - Branch represents an outcome of the test
  - Leaf nodes represent class labels or class distribution
- Decision tree generation consists of two phases
  - Tree construction
    - At start, all the training examples are at the root
    - Partition examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree (see the sketch after this list)
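As a minimal sketch of that last step, here is a decision tree hand-encoded as nested Python dicts to match the buys_computer example used throughout these notes, plus a classifier that walks it. The dict encoding and the classify() helper are illustrative assumptions, not part of the original material.

    # Decision tree encoded as nested dicts; leaves are class labels.
    # Structure matches the IF-THEN rules listed later in this post.
    tree = {
        "attribute": "age",
        "branches": {
            "<=30": {"attribute": "student",
                     "branches": {"no": "no", "yes": "yes"}},
            "31…40": "yes",
            ">40": {"attribute": "credit_rating",
                    "branches": {"excellent": "yes", "fair": "no"}},
        },
    }

    def classify(node, sample):
        """Test the sample's attribute values against the tree until a leaf."""
        while isinstance(node, dict):          # internal node: test an attribute
            value = sample[node["attribute"]]  # outcome of the test
            node = node["branches"][value]     # follow the matching branch
        return node                            # leaf: the class label

    print(classify(tree, {"age": "<=30", "student": "yes"}))  # -> yes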
Training Dataset

(The table of training examples, described by attributes such as age, student, and credit_rating together with the class label buys_computer, is not reproduced in this post.)

Output: A Decision Tree for “buys_computer”

(The tree figure is not reproduced; its structure is given by the IF-THEN rules listed later in this post.)
Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm; sketched in code after this list)
  - Tree is constructed in a top-down recursive divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  - There are no samples left
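The loop below sketches this greedy divide-and-conquer procedure under the same assumptions as before (examples as dicts with a "class" key). choose_attribute is left as a parameter so that any selection measure can be plugged in.

    # Greedy top-down induction with the three stopping conditions above.
    from collections import Counter

    def majority_class(examples):
        """Majority voting over the class labels of the examples."""
        return Counter(e["class"] for e in examples).most_common(1)[0][0]

    def build_tree(examples, attributes, choose_attribute):
        classes = {e["class"] for e in examples}
        if len(classes) == 1:                # all samples in the same class
            return classes.pop()
        if not attributes:                   # no remaining attributes:
            return majority_class(examples)  # classify the leaf by majority
        best = choose_attribute(examples, attributes)
        remaining = [a for a in attributes if a != best]
        node = {"attribute": best, "branches": {}}
        for value in {e[best] for e in examples}:   # partition recursively
            subset = [e for e in examples if e[best] == value]
            node["branches"][value] = build_tree(subset, remaining,
                                                 choose_attribute)
        # Only observed values get branches, so the "no samples left"
        # case never produces an empty recursive call here.
        return node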
Attribute Selection Measure

- Information gain (ID3/C4.5)
  - All attributes are assumed to be categorical
  - Can be modified for continuous-valued attributes
- Gini index (IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes
Information Gain

- Select the attribute with the highest information gain
- Assume there are two classes, P and N
- Let the set of examples S contain p elements of class P and n elements of class N
- The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

    I(p, n) = -(p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))
Information Gain in Decision Tree Induction
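If attribute A partitions S into subsets {S1, S2, ..., Sv}, where Si contains pi examples of P and ni examples of N, the expected information (entropy) still needed after splitting on A is

    E(A) = sum over i of ((pi + ni) / (p + n)) * I(pi, ni)

and the information gained by branching on A is

    Gain(A) = I(p, n) - E(A)

The attribute with the highest Gain(A) becomes the test attribute at the current node.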
Attribute Selection by Information Gain Computation
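The original slide worked this computation on the buys_computer training table, which is not reproduced here. The sketch below implements the same computation for the example layout assumed earlier; choose_by_gain() plugs into build_tree() as its choose_attribute argument.

    # Attribute selection by information gain (generalized to >2 classes).
    from collections import Counter
    from math import log2

    def info(examples):
        """I(p, n): expected information of the class distribution."""
        total = len(examples)
        counts = Counter(e["class"] for e in examples)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def gain(examples, attribute):
        """Gain(A) = I(p, n) - E(A)."""
        total = len(examples)
        e_a = 0.0
        for value in {e[attribute] for e in examples}:
            subset = [e for e in examples if e[attribute] == value]
            e_a += (len(subset) / total) * info(subset)   # weighted entropy
        return info(examples) - e_a

    def choose_by_gain(examples, attributes):
        """Select the attribute with the highest information gain."""
        return max(attributes, key=lambda a: gain(examples, a))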
Gini Index (IBM IntelligentMiner)
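For a data set T with relative class frequencies p1, ..., pc, the Gini index is

    gini(T) = 1 - sum over j of (pj)^2

and if T is split into T1 (N1 examples) and T2 (N2 examples), the quality of the split is

    gini_split(T) = (N1 / N) * gini(T1) + (N2 / N) * gini(T2)

The split giving the smallest gini_split(T) is chosen. A small sketch, assuming the same example layout as above and a two-way numeric split:

    from collections import Counter

    def gini(examples):
        """gini(T) = 1 - sum of squared class frequencies."""
        total = len(examples)
        counts = Counter(e["class"] for e in examples)
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    def gini_split(examples, attribute, threshold):
        """Weighted Gini index of the binary split 'attribute <= threshold'."""
        left = [e for e in examples if e[attribute] <= threshold]
        right = [e for e in examples if e[attribute] > threshold]
        n = len(examples)
        return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)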
Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example (see the sketch after these rules)

    IF age = “<=30” AND student = “no” THEN buys_computer = “no”
    IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
    IF age = “31…40” THEN buys_computer = “yes”
    IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
    IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
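A sketch of the extraction itself: enumerate every root-to-leaf path of the dict tree built earlier, conjoining one attribute-value pair per internal node. extract_rules() is an illustrative helper, not from the original notes.

    def extract_rules(node, conditions=()):
        """Print one IF-THEN rule per root-to-leaf path."""
        if not isinstance(node, dict):         # leaf: the class prediction
            antecedent = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
            print(f'IF {antecedent} THEN buys_computer = "{node}"')
            return
        for value, child in node["branches"].items():
            extract_rules(child, conditions + ((node["attribute"], value),))

    extract_rules(tree)   # prints the five rules listed above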
Avoid Overfitting in Classification

- The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - The result is poor accuracy on unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees (see the sketch after this list)
    - Use a set of data different from the training data to decide which is the “best pruned tree”
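scikit-learn's cost-complexity pruning is one concrete postpruning implementation of this idea (not the specific procedure these notes describe): grow a full tree, derive the sequence of progressively pruned trees, and let held-out data pick the winner. The synthetic dataset is purely illustrative.

    # Postpruning via cost-complexity pruning; the "best pruned tree"
    # is selected on data held out from training.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # One ccp_alpha per tree in the progressively pruned sequence.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_train, y_train)

    candidates = [DecisionTreeClassifier(random_state=0, ccp_alpha=a)
                  .fit(X_train, y_train) for a in path.ccp_alphas]
    best = max(candidates, key=lambda t: t.score(X_val, y_val))
    print(best.get_n_leaves(), best.score(X_val, y_val))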
Approaches to Determine the Final Tree Size

- Separate training (2/3) and testing (1/3) sets
- Use cross validation, e.g., 10-fold cross validation
- Use all the data for training
  - but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution (see the sketch after this list)
- Use the minimum description length (MDL) principle:
  - halt growth of the tree when the encoding is minimized
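A sketch of the chi-square variant, assuming the example layout used earlier: before expanding a node on an attribute, test whether the attribute-value/class contingency table shows a significant association. The alpha = 0.05 threshold and the helper name are assumptions.

    # Chi-square test to decide whether expanding a node is justified.
    from scipy.stats import chi2_contingency

    def split_is_significant(examples, attribute, alpha=0.05):
        values = sorted({e[attribute] for e in examples})
        classes = sorted({e["class"] for e in examples})
        # Contingency table: rows = attribute values, columns = classes.
        table = [[sum(1 for e in examples
                      if e[attribute] == v and e["class"] == c)
                  for c in classes]
                 for v in values]
        _, p_value, _, _ = chi2_contingency(table)
        return p_value < alpha    # significant association: expand the node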
Enhancements to Basic Decision Tree Induction

- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals (see the sketch after this list)
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
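A sketch of the continuous-attribute enhancement: scan candidate thresholds (midpoints of adjacent sorted values) and keep the binary split with the highest information gain, reusing info() from the attribute-selection sketch above. best_threshold() is an illustrative helper, not from the original notes.

    def best_threshold(examples, attribute):
        """Best binary split 'attribute <= t' by information gain."""
        values = sorted({e[attribute] for e in examples})
        best_t, best_gain = None, -1.0
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2                  # candidate cut point
            left = [e for e in examples if e[attribute] <= t]
            right = [e for e in examples if e[attribute] > t]
            g = info(examples) - (len(left) / len(examples)) * info(left) \
                               - (len(right) / len(examples)) * info(right)
            if g > best_gain:
                best_t, best_gain = t, g
        return best_t, best_gain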