Saturday, October 8, 2016

Classification and Prediction


Classification vs. Prediction
n  Classification: 
n  predicts categorical class labels
n  classifies data: constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
n  Prediction: 
n  models continuous-valued functions, i.e., predicts unknown or missing values
n  Typical Applications
n  credit approval
n  target marketing
n  medical diagnosis
n  treatment effectiveness analysis
Classification—A Two-Step Process
n  Model construction: describing a set of predetermined classes
n  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
n  The set of tuples used for model construction: training set
n  The model is represented as classification rules, decision trees, or mathematical formulae
n  Model usage: for classifying future or unknown objects
n  Estimate accuracy of the model
n  The known label of test sample is compared with the classified result from the model
n  Accuracy rate is the percentage of test set samples that are correctly classified by the model
n  Test set is independent of training set, otherwise over-fitting will occur
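To make the two steps concrete, here is a minimal sketch in Python (assuming scikit-learn and its bundled iris data, neither of which is part of the original slides): step 1 builds the model from a training set, step 2 estimates accuracy on an independent test set before the model is used on new data.

```python
# Minimal sketch of the two-step classification process (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # labelled data (y holds the class label attribute)

# Step 1: model construction on a training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the held-out test set, then classify new data
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test-set accuracy: {accuracy:.2f}")
```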
Classification Process (1): Model Construction
Classification Process (2): Use the Model in Prediction
Supervised vs. Unsupervised Learning
n  Supervised learning (classification)
n  Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
n  New data is classified based on the training set
n  Unsupervised learning (clustering)
n  The class labels of the training data are unknown
n  Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
Issues regarding classification and prediction (1): Data Preparation
n  Data cleaning
n  Preprocess data in order to reduce noise and handle missing values
n  Relevance analysis (feature selection)
n  Remove the irrelevant or redundant attributes
n  Data transformation
n  Generalize and/or normalize data
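One possible realisation of the three preparation steps, sketched with scikit-learn (the library, the toy array, and the choice of mean imputation, ANOVA-based feature selection, and z-score normalisation are all illustrative assumptions):

```python
import numpy as np
from sklearn.impute import SimpleImputer                        # data cleaning: missing values
from sklearn.feature_selection import SelectKBest, f_classif    # relevance analysis
from sklearn.preprocessing import StandardScaler                # data transformation: normalisation

X = np.array([[1.0, 200.0, np.nan],
              [2.0, 180.0, 5.0],
              [np.nan, 190.0, 7.0],
              [4.0, 210.0, 6.0]])
y = np.array([0, 1, 0, 1])

X = SimpleImputer(strategy="mean").fit_transform(X)      # fill missing values
X = SelectKBest(f_classif, k=2).fit_transform(X, y)      # keep the 2 most relevant attributes
X = StandardScaler().fit_transform(X)                    # normalise to zero mean, unit variance
```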
Issues regarding classification and prediction (2): Evaluating Classification Methods
n  Predictive accuracy
n  Speed and scalability
n  time to construct the model
n  time to use the model
n  Robustness
n  handling noise and missing values
n  Scalability
n  efficiency in disk-resident databases
n  Interpretability:
n  understanding and insight provided by the model
n  Goodness of rules
n  decision tree size
n  compactness of classification rules
Classification by Decision Tree Induction
n  Decision tree
n  A flow-chart-like tree structure
n  Internal node denotes a test on an attribute
n  Branch represents an outcome of the test
n  Leaf nodes represent class labels or class distribution
n  Decision tree generation consists of two phases
n  Tree construction
n  At start, all the training examples are at the root
n  Partition examples recursively based on selected attributes
n  Tree pruning
n  Identify and remove branches that reflect noise or outliers
n  Use of decision tree: Classifying an unknown sample
n  Test the attribute values of the sample against the decision tree
Training Dataset
Output: A Decision Tree for “buys_computer”
Algorithm for Decision Tree Induction
n  Basic algorithm (a greedy algorithm)
n  Tree is constructed in a top-down recursive divide-and-conquer manner
n  At start, all the training examples are at the root
n  Attributes are categorical (if continuous-valued, they are discretized in advance)
n  Examples are partitioned recursively based on selected attributes
n  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
n  Conditions for stopping partitioning
n  All samples for a given node belong to the same class
n  There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
n  There are no samples left
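The greedy, top-down procedure can be summarised in a short Python sketch. It handles only categorical attributes and uses information gain as the selection heuristic, and it is a simplified illustration rather than a full ID3/C4.5 implementation.

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    # Stopping conditions: all samples in one class, or no attributes left (majority vote)
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Greedy step: pick the attribute with the highest information gain
    def gain(a):
        split = Counter(r[a] for r in rows)
        remainder = sum((n / len(rows)) *
                        entropy([l for r, l in zip(rows, labels) if r[a] == v])
                        for v, n in split.items())
        return entropy(labels) - remainder

    best = max(attributes, key=gain)
    node = {}
    for value in set(r[best] for r in rows):              # partition recursively on `best`
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node[(best, value)] = build_tree([r for r, _ in sub], [l for _, l in sub],
                                         [a for a in attributes if a != best])
    return node

# Tiny invented example
rows = [{"outlook": "sunny", "windy": "false"},
        {"outlook": "rain", "windy": "true"},
        {"outlook": "sunny", "windy": "true"}]
labels = ["yes", "no", "no"]
print(build_tree(rows, labels, ["outlook", "windy"]))
```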
Attribute Selection Measure
n  Information gain (ID3/C4.5)
n  All attributes are assumed to be categorical
n  Can be modified for continuous-valued attributes
n  Gini index (IBM IntelligentMiner)
n  All attributes are assumed continuous-valued
n  Assume there exist several possible split values for each attribute
n  May need other tools, such as clustering, to get the possible split values
n  Can be modified for categorical attributes
Information Gain
n  Select the attribute with the highest information gain
n  Assume there are two classes, P and N
n  Let the set of examples S contain p elements of class P and n elements of class N
n  The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as I(p, n) = -(p/(p+n))·log2(p/(p+n)) - (n/(p+n))·log2(n/(p+n))
Information Gain in Decision Tree Induction
Attribute Selection by Information Gain Computation
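As a worked example of the information-gain computation (the figures on the original slide were lost in extraction), the sketch below assumes the standard 14-tuple buys_computer training set from Han & Kamber, with 9 "yes" and 5 "no" tuples and the usual partition on age.

```python
import math

def info(*counts):
    """I(p, n, ...): expected information (entropy) for a class distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Class distribution of the whole training set (assumed: 9 "yes", 5 "no")
i_all = info(9, 5)                                    # approx. 0.940

# Splitting on age partitions the 14 tuples into (2 yes, 3 no), (4 yes, 0 no), (3 yes, 2 no)
e_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)   # approx. 0.694

print(f"Gain(age) = {i_all - e_age:.3f}")             # approx. 0.247
```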
Gini Index (IBM IntelligentMiner)
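The Gini index slide likewise lost its formulas. As a reminder, gini(T) = 1 - sum_j p_j^2 over the class relative frequencies p_j, and a candidate split is scored by the size-weighted Gini of its partitions. A small sketch with hypothetical class counts:

```python
def gini(counts):
    """gini(T) = 1 - sum_j p_j^2, where p_j are class relative frequencies."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Size-weighted Gini of a split; each partition is a list of class counts."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * gini(p) for p in partitions)

# Hypothetical counts: 14 samples (9 vs 5) split into (6, 1) and (3, 4)
print(gini([9, 5]), gini_split([[6, 1], [3, 4]]))
```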
Extracting Classification Rules from Trees
n  Represent the knowledge in the form of IF-THEN rules
n  One rule is created for each path from the root to a leaf
n  Each attribute-value pair along a path forms a conjunction
n  The leaf node holds the class prediction
n  Rules are easier for humans to understand
n  Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
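If a tree has been trained with scikit-learn, its root-to-leaf paths can be listed directly; each printed path corresponds to one IF-THEN rule, with the leaf giving the class prediction (the dataset and depth limit below are illustrative only).

```python
# List root-to-leaf paths of a trained tree as text (assumes scikit-learn)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=list(load_iris().feature_names)))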
Avoid Overfitting in Classification
n  The generated tree may overfit the training data
n  Too many branches, some may reflect anomalies due to noise or outliers
n  This results in poor accuracy for unseen samples
n  Two approaches to avoid overfitting
n  Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
n  Difficult to choose an appropriate threshold
n  Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
n  Use a set of data different from the training data to decide which is the “best pruned tree”
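One common way to realise postpruning is cost-complexity pruning: grow the full tree, derive a sequence of progressively pruned trees, and keep the one that scores best on data held out from training. A sketch with scikit-learn (dataset and parameters are illustrative, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Sequence of progressively pruned trees, indexed by the pruning strength ccp_alpha
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Pick the "best pruned tree" using data not seen during training
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train) for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(best.get_n_leaves(), best.score(X_val, y_val))
```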
Approaches to Determine the Final Tree Size
n  Separate training (2/3) and testing (1/3) sets
n  Use cross validation, e.g., 10-fold cross validation
n  Use all the data for training
n  but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
n  Use minimum description length (MDL) principle:
n  halting growth of the tree when the encoding is minimized
Enhancements to basic decision tree induction
n  Allow for continuous-valued attributes
n  Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
n  Handle missing attribute values
n  Assign the most common value of the attribute
n  Assign probability to each of the possible values
n  Attribute construction
n  Create new attributes based on existing ones that are sparsely represented
n  This reduces fragmentation, repetition, and replication
Classification in Large Databases
n  Classification—a classical problem extensively studied by statisticians and machine learning researchers
n  Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
n  Why decision tree induction in data mining?
n  relatively faster learning speed (than other classification methods)
n  convertible to simple and easy to understand classification rules
n  can use SQL queries for accessing databases
n  comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
n  SLIQ (EDBT’96 — Mehta et al.)
n  builds an index for each attribute and only class list and the current attribute list reside in memory
n  SPRINT (VLDB’96 — J. Shafer et al.)
n  constructs an attribute list data structure
n  PUBLIC (VLDB’98 — Rastogi & Shim)
n  integrates tree splitting and tree pruning: stop growing the tree earlier
n  RainForest  (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
n  separates the scalability aspects from the criteria that determine the quality of the tree
n  builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
n  Integration of generalization with decision-tree induction (Kamber et al’97).
n  Classification at primitive concept levels
n  E.g., precise temperature, humidity, outlook, etc.
n  Low-level concepts, scattered classes, bushy classification-trees
n  Semantic interpretation problems.
n  Cube-based multi-level classification
n  Relevance analysis at multi-levels.
n  Information-gain analysis with dimension + level.
Presentation of Classification Results
Bayesian Classification: Why?
n  Probabilistic learning: Calculate explicit probabilities for a hypothesis; among the most practical approaches to certain types of learning problems
n  Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct.  Prior knowledge can be combined with observed data.
n  Probabilistic prediction:  Predict multiple hypotheses, weighted by their probabilities
n  Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem
Naïve Bayes Classifier
n  A simplified assumption: attributes are conditionally independent:
n  Greatly reduces the computation cost: only the class distribution needs to be counted.
n  Given a training set, we can compute the probabilities
Bayesian classification
n  The classification problem may be formalized using a-posteriori probabilities:
n  P(C|X) = prob. that the sample tuple X = <x1,…,xk> is of class C.
n  E.g. P(class=N | outlook=sunny,windy=true,…)
n  Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities
n  Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
n  P(X) is constant for all classes
n  P(C) = relative freq of class C samples
n  C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
n  Problem: computing P(X|C) directly is infeasible!
Naïve Bayesian Classification
n  Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
n  If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of samples having value xi as i-th attribute in class C
n  If i-th attribute is continuous:
P(xi|C) is estimated thru a Gaussian density function
n  Computationally easy in both cases
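A minimal sketch of both estimates on hypothetical data: relative frequency for a categorical attribute and a Gaussian density for a continuous one (the attribute names and values are invented for illustration).

```python
import math

def p_categorical(value, column, labels, cls):
    """Relative frequency of `value` among the samples of class `cls`."""
    in_class = [v for v, l in zip(column, labels) if l == cls]
    return in_class.count(value) / len(in_class)

def p_gaussian(x, column, labels, cls):
    """Gaussian density estimate for a continuous attribute within class `cls`."""
    in_class = [v for v, l in zip(column, labels) if l == cls]
    mu = sum(in_class) / len(in_class)
    var = sum((v - mu) ** 2 for v in in_class) / len(in_class)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical training columns
outlook = ["sunny", "rain", "rain", "overcast", "sunny"]
temp    = [30.0, 21.0, 19.0, 24.0, 28.0]
labels  = ["n", "p", "p", "p", "n"]

print(p_categorical("rain", outlook, labels, "p"))   # relative frequency
print(p_gaussian(22.0, temp, labels, "p"))           # Gaussian density
```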
Play-tennis example: estimating P(xi|C)
Play-tennis example: classifying X
n  An unseen sample X = <rain, hot, high, false>
n  P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
n  P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
n  Sample X is classified in class n (don’t play)
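The two products can be checked mechanically; the conditional probabilities below are exactly the slide's estimates.

```python
from math import prod

# P(X|p)·P(p) and P(X|n)·P(n) for X = <rain, hot, high, false>
p_yes = prod([3/9, 2/9, 3/9, 6/9, 9/14])
p_no  = prod([2/5, 2/5, 4/5, 2/5, 5/14])
print(f"{p_yes:.6f} {p_no:.6f}")      # 0.010582 0.018286 -> classify X as n (don't play)
```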
The independence hypothesis…
n  … makes computation possible
n  … yields optimal classifiers when satisfied
n  … but is seldom satisfied in practice, as attributes (variables) are often correlated.
n  Attempts to overcome this limitation:
n  Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes
n  Decision trees, that reason on one attribute at a time, considering the most important attributes first
Bayesian Belief Networks
n  A Bayesian belief network allows class conditional independencies between subsets of variables
n  A graphical model of causal relationships
n  Several cases of learning Bayesian belief networks
n  Given both the network structure and all the variables: easy
n  Given the network structure but only some of the variables
n  When the network structure is not known in advance
Neural Networks
n  Advantages
n  prediction accuracy is generally high
n  robust, works when training examples contain errors
n  output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
n  fast evaluation of the learned target function
n  Criticism
n  long training time
n  difficult to understand the learned function (weights)
n  not easy to incorporate domain knowledge
A Neuron
Network Training
n  The ultimate objective of training
n  obtain a set of weights that classifies almost all of the tuples in the training data correctly
n  Steps
n  Initialize weights with random values
n  Feed the input tuples into the network one by one
n  For each unit
n  Compute the net input to the unit as a linear combination of all the inputs to the unit
n  Compute the output value using the activation function
n  Compute the error
n  Update the weights and the bias
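A minimal numpy sketch of these steps for a single sigmoid unit, trained on the logical-OR function (the data, learning rate, and number of epochs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # input tuples
y = np.array([0., 1., 1., 1.])                           # target class (logical OR)

w = rng.normal(size=2)      # initialize weights with random values
b = 0.0                     # bias
lr = 0.5                    # learning rate (illustrative)

for epoch in range(1000):
    for xi, ti in zip(X, y):
        net = xi @ w + b                      # net input: linear combination of the inputs
        out = 1.0 / (1.0 + np.exp(-net))      # output value via the activation function (sigmoid)
        err = (ti - out) * out * (1.0 - out)  # error term for the unit
        w += lr * err * xi                    # update the weights ...
        b += lr * err                         # ... and the bias

print(np.round(1.0 / (1.0 + np.exp(-(X @ w + b)))))      # learned outputs, approx. [0, 1, 1, 1]
```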
Multi-Layer Perceptron
Network Pruning and Rule Extraction
n  Network pruning
n  A fully connected network is hard to articulate
n  N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights
n  Pruning: Remove some of the links without affecting classification accuracy of the network
n  Extracting rules from a trained network
n  Discretize activation values; replace each individual activation value by its cluster average while maintaining the network's accuracy
n  Enumerate the output from the discretized activation values to find rules between activation value and output
n  Find the relationship between the input and activation value
n  Combine the above two to have rules relating the output to input
Association-Based Classification
n  Several methods for association-based classification
n  ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
n  It beats C4.5 in (mainly) scalability and also accuracy
n  Associative classification: (Liu et al’98) 
n  It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label
n  CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
n  Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another
n  Mine EPs based on minimum support and growth rate
Other Classification Methods
n  k-nearest neighbor classifier
n  case-based reasoning
n  Genetic algorithm
n  Rough set approach
n  Fuzzy set approaches
Instance-Based Methods
n  Instance-based learning:
n  Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
n  Typical approaches
n  k-nearest neighbor approach
n  Instances represented as points in a Euclidean space.
n  Locally weighted regression
n  Constructs local approximation
n  Case-based reasoning
n  Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
n  All instances correspond to points in the n-D space.
n  Nearest neighbors are defined in terms of Euclidean distance.
n  The target function could be discrete- or real- valued.
n  For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
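A self-contained sketch of the discrete-valued case (hypothetical 2-D points; Euclidean distance, majority vote among the k nearest):

```python
from collections import Counter
import math

def knn_classify(query, points, labels, k=3):
    """Return the most common class among the k training points nearest to `query`."""
    nearest = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 2-D training points
points = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (4.1, 3.9), (0.9, 1.1)]
labels = ["A", "A", "B", "B", "A"]
print(knn_classify((1.1, 1.0), points, labels, k=3))   # -> "A"
```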
Discussion on the k-NN Algorithm
n  The k-NN algorithm for continuous-valued target functions
n  Calculate the mean values of the k nearest neighbors
n  Distance-weighted nearest neighbor algorithm
n  Weight the contribution of each of the k neighbors according to their distance to the query point xq
n  giving greater weight to closer neighbors
n  Similarly, for real-valued target functions
n  Robust to noisy data by averaging k-nearest neighbors
n  Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes.  
n  To overcome it, stretch the axes or eliminate the least relevant attributes.
Case-Based Reasoning
n  Also uses: lazy evaluation + analyze similar instances
n  Difference: Instances are not “points in a Euclidean space”
n  Example: Water faucet problem in CADET (Sycara et al’92)
n  Methodology
n  Instances represented by rich symbolic descriptions (e.g., function graphs)
n  Multiple retrieved cases may be combined
n  Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
n  Research issues
n  Indexing based on syntactic similarity measures and, when that fails, backtracking and adapting to additional cases
Remarks on Lazy vs. Eager Learning
n  Instance-based learning:  lazy evaluation
n  Decision-tree and Bayesian classification:  eager evaluation
n  Key differences
n  Lazy method may consider query instance xq when deciding how to generalize beyond the training data D
n  Eager methods cannot, since they have already committed to a global approximation before seeing the query
n  Efficiency: Lazy - less time training but more time predicting
n  Accuracy
n  Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
n  Eager: must commit to a single hypothesis that covers the entire instance space
Genetic Algorithms
n  GA: based on an analogy to biological evolution
n  Each rule is represented by a string of bits
n  An initial population is created consisting of randomly generated rules
n  e.g., IF A1 and Not A2 then C2 can be encoded as 100
n  Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
n  The fitness of a rule is represented by its classification accuracy on a set of training examples
n  Offspring are generated by crossover and mutation
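A toy sketch of this loop on hypothetical data: rules are 3-bit strings (v1, v2, c) read as "IF A1 = v1 AND A2 = v2 THEN class c, ELSE class 1-c", fitness is training accuracy, and new rules come from one-point crossover plus occasional mutation. The encoding and the data are invented for illustration.

```python
import random

random.seed(0)

# Toy training examples: ((A1, A2), class)
examples = [((1, 0), 1), ((1, 1), 0), ((0, 0), 0), ((0, 1), 0), ((1, 0), 1)]

def classify(rule, x):
    """rule = (v1, v2, c): IF A1 = v1 AND A2 = v2 THEN c, ELSE 1 - c."""
    v1, v2, c = rule
    return c if x == (v1, v2) else 1 - c

def fitness(rule):
    """Classification accuracy of the rule on the training examples."""
    return sum(classify(rule, x) == y for x, y in examples) / len(examples)

# Initial population of randomly generated rules
population = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(8)]

for generation in range(20):
    population = sorted(population, key=fitness, reverse=True)[:4]   # survival of the fittest
    offspring = []
    while len(offspring) < 4:
        p1, p2 = random.sample(population, 2)
        cut = random.randint(1, 2)
        child = list(p1[:cut] + p2[cut:])          # crossover
        if random.random() < 0.2:                  # mutation
            i = random.randrange(3)
            child[i] = 1 - child[i]
        offspring.append(tuple(child))
    population += offspring

best = max(population, key=fitness)
print(best, fitness(best))
```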
Rough Set Approach
n  Rough sets are used to approximately or “roughly” define equivalence classes
n  A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
n  Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard but a discernibility matrix is used to reduce the computation intensity
Fuzzy Set Approaches
n  Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using fuzzy membership graph)
n  Attribute values are converted to fuzzy values
n  e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
n  For a given new sample, more than one fuzzy value may apply
n  Each applicable rule contributes a vote for membership in the categories
n  Typically, the truth values for each predicted category are summed
What Is Prediction?
n  Prediction is similar to classification
n  First, construct a model
n  Second, use model to predict unknown value
n  Major method for prediction is regression
n  Linear and multiple regression
n  Non-linear regression
n  Prediction is different from classification
n  Classification predicts categorical class labels
n  Prediction models continuous-valued functions
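A minimal regression sketch (numpy least squares on invented data) showing prediction of a continuous value:

```python
import numpy as np

# Hypothetical data: years of experience -> salary (continuous-valued target)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([30.0, 35.0, 41.0, 44.0, 52.0])

# Fit y = w1*x + w0 by least squares
A = np.vstack([x, np.ones_like(x)]).T
w1, w0 = np.linalg.lstsq(A, y, rcond=None)[0]

print(f"predicted value at x = 6: {w1 * 6 + w0:.1f}")
```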
Predictive Modeling in Databases
n  Predictive modeling: predict data values or construct generalized linear models based on the database data.
n  One can only predict value ranges or category distributions
n  Method outline:
n   Minimal generalization
n   Attribute relevance analysis
n   Generalized linear model construction
n   Prediction
n  Determine the major factors which influence the prediction
n  Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
n  Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction
Locally Weighted Regression
Prediction: Numerical Data
Prediction: Categorical Data
Classification Accuracy: Estimating Error Rates
n  Partition: Training-and-testing
n  use two independent data sets, e.g., training set (2/3), test set (1/3)
n  used for data set with large number of samples
n  Cross-validation
n  divide the data set into k subsamples
n  use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation)
n  for data set with moderate size
n  Bootstrapping (leave-one-out)
n  for small size data
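A short scikit-learn sketch of 10-fold cross-validation as an error-rate estimate (the classifier and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: each fold in turn is the test set, the other 9 train the model
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(f"estimated accuracy: {scores.mean():.2f}  (error rate: {1 - scores.mean():.2f})")
```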
Boosting and Bagging
n  Boosting increases classification accuracy
n  Applicable to decision trees or Bayesian classifier
n  Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
n  Boosting requires only linear time and constant space
Boosting Technique — Algorithm
n  Assign every example an equal weight  1/N
n  For t = 1, 2, …, T do
n  Obtain a hypothesis (classifier) h(t) under the current example weights w(t)
n  Calculate the error of h(t) and re-weight the examples so that misclassified examples receive more weight
n  Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
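A compressed sketch of this scheme in the style of AdaBoost, using decision stumps from scikit-learn; the specific re-weighting and hypothesis-weight formulas are the standard AdaBoost ones, stated here as one concrete instance rather than the exact algorithm on the slide.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y = np.where(y == 0, -1, 1)               # use labels in {-1, +1}

N, T = len(y), 10
w = np.full(N, 1.0 / N)                   # every example starts with equal weight 1/N
hypotheses, alphas = [], []

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y])                            # weighted error of h(t)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))     # hypothesis weight (its accuracy)
    w = w * np.exp(-alpha * y * pred)                     # misclassified examples gain weight
    w /= w.sum()
    hypotheses.append(stump)
    alphas.append(alpha)

# Final prediction: weighted sum (vote) of all hypotheses
combined = np.sign(sum(a * h.predict(X) for a, h in zip(alphas, hypotheses)))
print((combined == y).mean())
```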
Summary
n  Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
n  Classification is probably one of the most widely used data mining techniques with a lot of extensions
n  Scalability is still an important issue for database applications:  thus combining classification with database techniques should be a promising topic
n  Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc..
