Classification and Prediction
Classification vs. Prediction
- Classification:
  - predicts categorical class labels
  - classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
- Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment effectiveness analysis
Classification—A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
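To make the accuracy-rate computation concrete, here is a minimal plain-Python sketch; the labels are hypothetical stand-ins for the known test labels and a model's predictions:

```python
# Accuracy rate = correctly classified test samples / total test samples.
def accuracy(true_labels, predicted_labels):
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Hypothetical test data and predictions
true_labels      = ["yes", "no", "yes", "yes", "no"]
predicted_labels = ["yes", "no", "no",  "yes", "no"]
print(accuracy(true_labels, predicted_labels))  # 0.8 -> 80% accuracy rate
```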
Classification Process (1): Model Construction
Classification Process (2): Use the Model in Prediction
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues regarding classification and prediction (1): Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove the irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
Issues regarding classification and prediction (2): Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules
Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - Internal node denotes a test on an attribute
  - Branch represents an outcome of the test
  - Leaf nodes represent class labels or class distribution
- Decision tree generation consists of two phases
  - Tree construction
    - At start, all the training examples are at the root
    - Partition examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
Training Dataset
Output: A Decision Tree for “buys_computer”
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - Tree is constructed in a top-down recursive divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  - There are no samples left
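A minimal sketch of this greedy, top-down procedure, assuming categorical attributes and a caller-supplied select_attribute heuristic (e.g., information gain); the structure and names are illustrative, not taken from any particular system:

```python
from collections import Counter

def build_tree(samples, attributes, select_attribute):
    """samples: list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in samples]
    # Stopping condition: all samples belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition: no remaining attributes -> majority voting
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the test attribute by the heuristic (e.g., information gain)
    best = select_attribute(samples, attributes)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Partition the samples on each value of the chosen attribute
    for value in {x[best] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[best] == value]
        if not subset:  # no samples for this branch -> majority class of the parent
            node["branches"][value] = Counter(labels).most_common(1)[0][0]
        else:
            node["branches"][value] = build_tree(subset, remaining, select_attribute)
    return node
```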
Attribute Selection Measure
- Information gain (ID3/C4.5)
  - All attributes are assumed to be categorical
  - Can be modified for continuous-valued attributes
- Gini index (IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes
Information Gain
- Select the attribute with the highest information gain
- Assume there are two classes, P and N
  - Let the set of examples S contain p elements of class P and n elements of class N
  - The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as
    I(p, n) = -(p/(p+n))·log2(p/(p+n)) - (n/(p+n))·log2(n/(p+n))
Information Gain in Decision Tree Induction
Attribute Selection by Information Gain Computation
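The two headings above correspond to worked examples that are not reproduced here. As a stand-in, the following plain-Python sketch computes the standard ID3-style gain, Gain(A) = I(p, n) - E(A), where E(A) is the expected information summed over the partitions induced by attribute A; the toy data at the bottom is hypothetical:

```python
import math

def info(p, n):
    """I(p, n): expected information to classify a sample into class P or N."""
    total, result = p + n, 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def gain(samples, attribute):
    """Gain(A) = I(p, n) - E(A); samples are (attribute_dict, label) pairs, labels 'P'/'N'."""
    p = sum(1 for _, y in samples if y == "P")
    n = len(samples) - p
    e = 0.0
    for value in {x[attribute] for x, _ in samples}:
        subset = [y for x, y in samples if x[attribute] == value]
        pi = sum(1 for y in subset if y == "P")
        ni = len(subset) - pi
        e += (pi + ni) / (p + n) * info(pi, ni)
    return info(p, n) - e

# Hypothetical toy data: select the attribute with the highest gain
samples = [({"student": "yes"}, "P"), ({"student": "yes"}, "P"),
           ({"student": "no"}, "N"), ({"student": "no"}, "P")]
print(gain(samples, "student"))
```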
Gini Index (IBM IntelligentMiner)
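The Gini index slide is likewise not reproduced here. As a reminder of the standard definition, gini(T) = 1 - Σ p_j² over the class frequencies p_j in data set T, and a binary split is chosen to minimize the weighted Gini of the two partitions. A minimal sketch:

```python
def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(left_labels, right_labels):
    """Weighted Gini of a binary split: (N1/N)*gini(T1) + (N2/N)*gini(T2)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

print(gini(["yes", "yes", "no", "no"]))          # 0.5
print(gini_split(["yes", "yes"], ["no", "no"]))  # 0.0, a pure split
```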
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example
  IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  IF age = “31…40” THEN buys_computer = “yes”
  IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
  IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
Avoid Overfitting in Classification
- The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Results in poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a “fully grown” tree, yielding a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the “best pruned tree”
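As a practical, hedged illustration (assuming scikit-learn is available; the parameter names below are scikit-learn's, not terms from the slides), prepruning can be approximated by limiting growth up front, and postpruning by cost-complexity pruning of a fully grown tree, with a held-out set deciding the best pruned tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# "Prepruning": stop splitting early via thresholds on depth / leaf size
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# "Postpruning": grow fully, then prune with the cost-complexity parameter ccp_alpha,
# chosen here on the held-out set (a separate pruning set, as suggested above)
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train).ccp_alphas
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train) for a in alphas),
    key=lambda t: t.score(X_test, y_test),
)
print(pre.score(X_test, y_test), best.score(X_test, y_test))
```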
Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training
  - but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle:
  - halt growth of the tree when the encoding is minimized
Enhancements to basic decision tree induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - relatively faster learning speed (than other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT’96 — Mehta et al.)
  - builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB’96 — J. Shafer et al.)
  - constructs an attribute list data structure
- PUBLIC (VLDB’98 — Rastogi & Shim)
  - integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al.’97)
- Classification at primitive concept levels
  - E.g., precise temperature, humidity, outlook, etc.
  - Low-level concepts, scattered classes, bushy classification trees
  - Semantic interpretation problems
- Cube-based multi-level classification
  - Relevance analysis at multiple levels
  - Information-gain analysis with dimension + level
Presentation of Classification Results
Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem
Naïve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent
- Greatly reduces the computation cost: only count the class distribution
- Given a training set, we can compute the probabilities
Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities:
  - P(C|X) = prob. that the sample tuple X = <x1,…,xk> is of class C
  - E.g., P(class=N | outlook=sunny, windy=true, …)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities
- Bayes theorem:
  P(C|X) = P(X|C)·P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative freq of class C samples
- C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum
- Problem: computing P(X|C) is unfeasible!
Naïve Bayesian Classification
- Naïve assumption: attribute independence
  P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
- If the i-th attribute is categorical:
  P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- If the i-th attribute is continuous:
  P(xi|C) is estimated through a Gaussian density function
- Computationally easy in both cases
Play-tennis example: estimating P(xi|C)
Play-tennis example: classifying X
- An unseen sample X = <rain, hot, high, false>
- P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
- P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
- Sample X is classified in class n (don’t play)
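The arithmetic above is easy to reproduce; the probabilities in this snippet are exactly the counts quoted on the slide:

```python
# Classify X = <rain, hot, high, false> with the play-tennis conditional probabilities.
p_yes = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # P(X|p)·P(p)
p_no  = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # P(X|n)·P(n)
print(round(p_yes, 6), round(p_no, 6))           # 0.010582  0.018286
print("play" if p_yes > p_no else "don't play")  # don't play
```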
The independence hypothesis…
- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated
- Attempts to overcome this limitation:
  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  - Decision trees, which reason on one attribute at a time, considering the most important attributes first
Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
- Several cases of learning Bayesian belief networks
  - Given both the network structure and all the variables: easy
  - Given the network structure but only some variables
  - When the network structure is not known in advance
Neural Networks
- Advantages
  - prediction accuracy is generally high
  - robust: works when training examples contain errors
  - output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  - fast evaluation of the learned target function
- Criticism
  - long training time
  - difficult to understand the learned function (weights)
  - not easy to incorporate domain knowledge
A Neuron
Network Training
- The ultimate objective of training
  - obtain a set of weights that makes almost all the tuples in the training data classified correctly
- Steps
  - Initialize weights with random values
  - Feed the input tuples into the network one by one
  - For each unit
    - Compute the net input to the unit as a linear combination of all the inputs to the unit
    - Compute the output value using the activation function
    - Compute the error
    - Update the weights and the bias
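A minimal sketch of these steps for a single logistic unit (sigmoid activation, squared-error delta rule); the learning rate, number of epochs, and toy data are illustrative assumptions:

```python
import math, random

def train_unit(samples, epochs=100, lr=0.1):
    """samples: list of (inputs, target) pairs with target in {0, 1}."""
    n = len(samples[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n)]  # initialize with random values
    bias = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, target in samples:                                   # feed tuples one by one
            net = sum(w * xi for w, xi in zip(weights, x)) + bias   # net input: linear combination
            out = 1.0 / (1.0 + math.exp(-net))                      # output via sigmoid activation
            err = (target - out) * out * (1.0 - out)                # error term
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]  # update weights
            bias += lr * err                                             # update bias
    return weights, bias

# Toy usage: learn a (hypothetical) AND-like function
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_unit(data, epochs=2000))
```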
Multi-Layer Perceptron
Network Pruning and Rule Extraction
- Network pruning
  - A fully connected network is hard to articulate
  - N input nodes, h hidden nodes, and m output nodes lead to h(m+N) weights
  - Pruning: remove some of the links without affecting the classification accuracy of the network
- Extracting rules from a trained network
  - Discretize activation values; replace each individual activation value by the cluster average while maintaining the network accuracy
  - Enumerate the output from the discretized activation values to find rules between activation values and output
  - Find the relationship between the input and activation values
  - Combine the above two to obtain rules relating the output to the input
Association-Based Classification
- Several methods for association-based classification
  - ARCS: quantitative association mining and clustering of association rules (Lent et al.’97)
    - It beats C4.5 in (mainly) scalability and also accuracy
  - Associative classification (Liu et al.’98)
    - It mines high-support and high-confidence rules of the form “cond_set => y”, where y is a class label
  - CAEP (Classification by Aggregating Emerging Patterns) (Dong et al.’99)
    - Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
    - Mine EPs based on minimum support and growth rate
Other Classification Methods
- k-nearest neighbor classifier
- case-based reasoning
- Genetic algorithm
- Rough set approach
- Fuzzy set approaches
Instance-Based Methods
- Instance-based learning:
  - Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach
    - Instances represented as points in a Euclidean space
  - Locally weighted regression
    - Constructs a local approximation
  - Case-based reasoning
    - Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance
- The target function could be discrete- or real-valued
- For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
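A compact sketch of the discrete-valued k-NN rule just described (Euclidean distance, majority vote among the k nearest training points); the 2-D points are hypothetical:

```python
import math
from collections import Counter

def knn_classify(train, xq, k=3):
    """train: list of (point, label) pairs; xq: query point; majority vote among k nearest."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda item: dist(item[0], xq))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical 2-D training points
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))  # "A"
```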
Discussion on the k-NN Algorithm
- The k-NN algorithm for continuous-valued target functions
  - Calculate the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm
  - Weight the contribution of each of the k neighbors according to its distance to the query point xq
    - giving greater weight to closer neighbors
  - Similarly, for real-valued target functions
- Robust to noisy data by averaging the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch the axes or eliminate the least relevant attributes
Case-Based Reasoning
- Also uses lazy evaluation and analysis of similar instances
- Difference: instances are not “points in a Euclidean space”
- Example: water faucet problem in CADET (Sycara et al.’92)
- Methodology
  - Instances represented by rich symbolic descriptions (e.g., function graphs)
  - Multiple retrieved cases may be combined
  - Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
- Research issues
  - Indexing based on syntactic similarity measures, and, when this fails, backtracking and adapting to additional cases
Remarks on Lazy vs. Eager Learning
- Instance-based learning: lazy evaluation
- Decision-tree and Bayesian classification: eager evaluation
- Key differences
  - A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
  - An eager method cannot, since it has already committed to a global approximation before seeing the query
- Efficiency: lazy methods spend less time training but more time predicting
- Accuracy
  - A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
  - An eager method must commit to a single hypothesis that covers the entire instance space
Genetic Algorithms
- GA: based on an analogy to biological evolution
- Each rule is represented by a string of bits
- An initial population is created consisting of randomly generated rules
  - e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100
- Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
- The fitness of a rule is represented by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation
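To illustrate the bit-string encoding and the crossover/mutation operators (the fitness function, population size, and rates below are placeholders, not values from the slides):

```python
import random

def crossover(parent1, parent2):
    """Single-point crossover of two equal-length bit strings."""
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:]

def mutate(rule, rate=0.05):
    """Flip each bit with a small probability."""
    return "".join(b if random.random() > rate else ("1" if b == "0" else "0") for b in rule)

def evolve(population, fitness, generations=50):
    """Keep the fittest rules and breed offspring by crossover + mutation."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: len(ranked) // 2]                 # survival of the fittest
        offspring = [mutate(crossover(*random.sample(survivors, 2)))
                     for _ in range(len(population) - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)

# Placeholder fitness: in practice, the rule's classification accuracy on training data
toy_fitness = lambda rule: rule.count("1")
population = ["".join(random.choice("01") for _ in range(8)) for _ in range(20)]
print(evolve(population, toy_fitness))
```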
Rough Set Approach
- Rough sets are used to approximately or “roughly” define equivalence classes
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
- Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix can be used to reduce the computation intensity
Fuzzy Set Approaches
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)
- Attribute values are converted to fuzzy values
  - e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
- For a given new sample, more than one fuzzy value may apply
- Each applicable rule contributes a vote for membership in the categories
- Typically, the truth values for each predicted category are summed
What Is Prediction?
- Prediction is similar to classification
  - First, construct a model
  - Second, use the model to predict unknown values
- The major method for prediction is regression
  - Linear and multiple regression
  - Non-linear regression
- Prediction is different from classification
  - Classification refers to predicting categorical class labels
  - Prediction models continuous-valued functions
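As a minimal example of the regression idea, simple one-variable linear regression fits y = a + b·x by least squares; the data points below are hypothetical:

```python
def simple_linear_regression(xs, ys):
    """Least-squares fit of y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: predict a continuous value from a single attribute
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = simple_linear_regression(xs, ys)
print(a, b, a + b * 6)  # extrapolated prediction at x = 6
```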
Predictive Modeling in Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data
  - One can only predict value ranges or category distributions
- Method outline:
  - Minimal generalization
  - Attribute relevance analysis
  - Generalized linear model construction
  - Prediction
- Determine the major factors which influence the prediction
  - Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction
Locally Weighted Regression
Prediction: Numerical Data
Prediction: Categorical Data
Classification Accuracy: Estimating Error Rates
- Partition: training-and-testing
  - use two independent data sets, e.g., training set (2/3) and test set (1/3)
  - used for data sets with a large number of samples
- Cross-validation
  - divide the data set into k subsamples
  - use k-1 subsamples as training data and one subsample as test data: k-fold cross-validation
  - for data sets of moderate size
- Bootstrapping (leave-one-out)
  - for small data sets
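A plain-Python sketch of the k-fold split itself; train_and_test is an assumption of this sketch, standing in for any classifier's train-then-score routine:

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(samples, train_and_test, k=10):
    """For each fold: train on the other k-1 folds, test on the held-out fold, average accuracy."""
    folds = k_fold_indices(len(samples), k)
    accuracies = []
    for test_idx in folds:
        test = [samples[i] for i in test_idx]
        train = [s for i, s in enumerate(samples) if i not in set(test_idx)]
        accuracies.append(train_and_test(train, test))
    return sum(accuracies) / k
```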
Boosting and Bagging
- Boosting increases classification accuracy
  - Applicable to decision trees or Bayesian classifiers
  - Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
- Boosting requires only linear time and constant space
Boosting Technique — Algorithm
- Assign every example an equal weight 1/N
- For t = 1, 2, …, T do
  - Obtain a hypothesis (classifier) h(t) under the current weights w(t)
  - Calculate the error of h(t) and re-weight the examples based on the error
  - Normalize w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
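A sketch of this loop in the style of AdaBoost; the specific re-weighting and voting formulas are the standard AdaBoost ones, shown here as one concrete instance of the scheme above rather than the exact variant on the original slide:

```python
import math

def boost(samples, weak_learner, T=10):
    """samples: list of (x, y) with y in {-1, +1}; weak_learner(samples, weights) -> h, h(x) in {-1, +1}."""
    n = len(samples)
    weights = [1.0 / n] * n                      # every example starts with weight 1/N
    hypotheses = []
    for _ in range(T):
        h = weak_learner(samples, weights)
        # Weighted error of this hypothesis on the training set
        err = sum(w for w, (x, y) in zip(weights, samples) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)  # accuracy-based vote weight
        hypotheses.append((alpha, h))
        # Increase the weight of misclassified examples, decrease the rest, renormalize
        weights = [w * math.exp(-alpha * y * h(x)) for w, (x, y) in zip(weights, samples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    # Final classifier: weighted vote of all hypotheses
    return lambda x: 1 if sum(a * h(x) for a, h in hypotheses) >= 0 else -1
```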
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is still an important issue for database applications; thus combining classification with database techniques should be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, and multimedia data