Classification and Prediction
Classification vs. Prediction
- Classification:
  - predicts categorical class labels
  - classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
- Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment effectiveness analysis
Classification—A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
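To make the accuracy-rate computation concrete, here is a minimal plain-Python sketch; the labels are hypothetical stand-ins for the known test labels and a model's predictions:

```python
# Accuracy rate = correctly classified test samples / total test samples.
def accuracy(true_labels, predicted_labels):
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Hypothetical test data and predictions
true_labels      = ["yes", "no", "yes", "yes", "no"]
predicted_labels = ["yes", "no", "no",  "yes", "no"]
print(accuracy(true_labels, predicted_labels))  # 0.8 -> 80% accuracy rate
```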
Classification Process (1): Model Construction
Classification Process (2): Use the Model in Prediction
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues regarding classification and prediction (1): Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove the irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
Issues regarding classification and prediction (2): Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules
Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - Internal node denotes a test on an attribute
  - Branch represents an outcome of the test
  - Leaf nodes represent class labels or class distribution
- Decision tree generation consists of two phases
  - Tree construction
    - At start, all the training examples are at the root
    - Partition examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
Training Dataset
Output: A Decision Tree for “buys_computer”
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - Tree is constructed in a top-down recursive divide-and-conquer manner
  - At start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  - There are no samples left
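A minimal sketch of this greedy, top-down procedure, assuming categorical attributes and a caller-supplied select_attribute heuristic (e.g., information gain); the structure and names are illustrative, not taken from any particular system:

```python
from collections import Counter

def build_tree(samples, attributes, select_attribute):
    """samples: list of (attribute_dict, class_label) pairs."""
    labels = [label for _, label in samples]
    # Stopping condition: all samples belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition: no remaining attributes -> majority voting
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the test attribute by the heuristic (e.g., information gain)
    best = select_attribute(samples, attributes)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Partition the samples on each value of the chosen attribute
    for value in {x[best] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[best] == value]
        if not subset:  # no samples for this branch -> majority class of the parent
            node["branches"][value] = Counter(labels).most_common(1)[0][0]
        else:
            node["branches"][value] = build_tree(subset, remaining, select_attribute)
    return node
```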
Attribute Selection Measure
- Information gain (ID3/C4.5)
  - All attributes are assumed to be categorical
  - Can be modified for continuous-valued attributes
- Gini index (IBM IntelligentMiner)
  - All attributes are assumed continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes
Information Gain
- Select the attribute with the highest information gain
- Assume there are two classes, P and N
  - Let the set of examples S contain p elements of class P and n elements of class N
  - The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as
    I(p, n) = -(p/(p+n))·log2(p/(p+n)) - (n/(p+n))·log2(n/(p+n))
Information Gain in Decision Tree Induction
Attribute Selection by Information Gain Computation
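The two headings above correspond to worked examples that are not reproduced here. As a stand-in, the following plain-Python sketch computes the standard ID3-style gain, Gain(A) = I(p, n) - E(A), where E(A) is the expected information summed over the partitions induced by attribute A; the toy data at the bottom is hypothetical:

```python
import math

def info(p, n):
    """I(p, n): expected information to classify a sample into class P or N."""
    total, result = p + n, 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def gain(samples, attribute):
    """Gain(A) = I(p, n) - E(A); samples are (attribute_dict, label) pairs, labels 'P'/'N'."""
    p = sum(1 for _, y in samples if y == "P")
    n = len(samples) - p
    e = 0.0
    for value in {x[attribute] for x, _ in samples}:
        subset = [y for x, y in samples if x[attribute] == value]
        pi = sum(1 for y in subset if y == "P")
        ni = len(subset) - pi
        e += (pi + ni) / (p + n) * info(pi, ni)
    return info(p, n) - e

# Hypothetical toy data: select the attribute with the highest gain
samples = [({"student": "yes"}, "P"), ({"student": "yes"}, "P"),
           ({"student": "no"}, "N"), ({"student": "no"}, "P")]
print(gain(samples, "student"))
```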
Gini Index (IBM IntelligentMiner)
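The Gini index slide is likewise not reproduced here. As a reminder of the standard definition, gini(T) = 1 - Σ p_j² over the class frequencies p_j in data set T, and a binary split is chosen to minimize the weighted Gini of the two partitions. A minimal sketch:

```python
def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in T."""
    total = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(left_labels, right_labels):
    """Weighted Gini of a binary split: (N1/N)*gini(T1) + (N2/N)*gini(T2)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

print(gini(["yes", "yes", "no", "no"]))          # 0.5
print(gini_split(["yes", "yes"], ["no", "no"]))  # 0.0, a pure split
```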
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example
  IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  IF age = “31…40” THEN buys_computer = “yes”
  IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
  IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
Avoid Overfitting in Classification
- The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Results in poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a “fully grown” tree, yielding a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the “best pruned tree”
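As a practical, hedged illustration (assuming scikit-learn is available; the parameter names below are scikit-learn's, not terms from the slides), prepruning can be approximated by limiting growth up front, and postpruning by cost-complexity pruning of a fully grown tree, with a held-out set deciding the best pruned tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# "Prepruning": stop splitting early via thresholds on depth / leaf size
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# "Postpruning": grow fully, then prune with the cost-complexity parameter ccp_alpha,
# chosen here on the held-out set (a separate pruning set, as suggested above)
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train).ccp_alphas
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train) for a in alphas),
    key=lambda t: t.score(X_test, y_test),
)
print(pre.score(X_test, y_test), best.score(X_test, y_test))
```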
Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training
  - but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
- Use the minimum description length (MDL) principle:
  - halt growth of the tree when the encoding is minimized
Enhancements to basic decision tree induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication
Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - relatively faster learning speed (than other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT’96 — Mehta et al.)
  - builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB’96 — J. Shafer et al.)
  - constructs an attribute list data structure
- PUBLIC (VLDB’98 — Rastogi & Shim)
  - integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - builds an AVC-list (attribute, value, class label)
Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al.’97)
- Classification at primitive concept levels
  - E.g., precise temperature, humidity, outlook, etc.
  - Low-level concepts, scattered classes, bushy classification trees
  - Semantic interpretation problems
- Cube-based multi-level classification
  - Relevance analysis at multiple levels
  - Information-gain analysis with dimension + level
Presentation of Classification Results
Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem
Naïve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent
- Greatly reduces the computation cost: only count the class distribution
- Given a training set, we can compute the probabilities
Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities:
  - P(C|X) = prob. that the sample tuple X = <x1,…,xk> is of class C
  - E.g., P(class=N | outlook=sunny, windy=true, …)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities
- Bayes theorem:
  P(C|X) = P(X|C)·P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative freq of class C samples
- C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum
- Problem: computing P(X|C) is unfeasible!
Naïve Bayesian Classification
- Naïve assumption: attribute independence
  P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
- If the i-th attribute is categorical:
  P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
- If the i-th attribute is continuous:
  P(xi|C) is estimated through a Gaussian density function
- Computationally easy in both cases
Play-tennis example: estimating P(xi|C)
Play-tennis example: classifying X
- An unseen sample X = <rain, hot, high, false>
- P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
- P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
- Sample X is classified in class n (don’t play)
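The arithmetic above is easy to reproduce; the probabilities in this snippet are exactly the counts quoted on the slide:

```python
# Classify X = <rain, hot, high, false> with the play-tennis conditional probabilities.
p_yes = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # P(X|p)·P(p)
p_no  = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # P(X|n)·P(n)
print(round(p_yes, 6), round(p_no, 6))           # 0.010582  0.018286
print("play" if p_yes > p_no else "don't play")  # don't play
```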
The independence hypothesis…
- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated
- Attempts to overcome this limitation:
  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  - Decision trees, which reason on one attribute at a time, considering the most important attributes first
Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
- Several cases of learning Bayesian belief networks
  - Given both the network structure and all the variables: easy
  - Given the network structure but only some variables
  - When the network structure is not known in advance
Neural Networks
- Advantages
  - prediction accuracy is generally high
  - robust: works when training examples contain errors
  - output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  - fast evaluation of the learned target function
- Criticism
  - long training time
  - difficult to understand the learned function (weights)
  - not easy to incorporate domain knowledge
A Neuron
Network Training
- The ultimate objective of training
  - obtain a set of weights that makes almost all the tuples in the training data classified correctly
- Steps
  - Initialize weights with random values
  - Feed the input tuples into the network one by one
  - For each unit
    - Compute the net input to the unit as a linear combination of all the inputs to the unit
    - Compute the output value using the activation function
    - Compute the error
    - Update the weights and the bias
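A minimal sketch of these steps for a single logistic unit (sigmoid activation, squared-error delta rule); the learning rate, number of epochs, and toy data are illustrative assumptions:

```python
import math, random

def train_unit(samples, epochs=100, lr=0.1):
    """samples: list of (inputs, target) pairs with target in {0, 1}."""
    n = len(samples[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n)]  # initialize with random values
    bias = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, target in samples:                                   # feed tuples one by one
            net = sum(w * xi for w, xi in zip(weights, x)) + bias   # net input: linear combination
            out = 1.0 / (1.0 + math.exp(-net))                      # output via sigmoid activation
            err = (target - out) * out * (1.0 - out)                # error term
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]  # update weights
            bias += lr * err                                             # update bias
    return weights, bias

# Toy usage: learn a (hypothetical) AND-like function
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
print(train_unit(data, epochs=2000))
```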
Multi-Layer Perceptron
Network Pruning and Rule Extraction
- Network pruning
  - A fully connected network is hard to articulate
  - N input nodes, h hidden nodes, and m output nodes lead to h(m+N) weights
  - Pruning: remove some of the links without affecting the classification accuracy of the network
- Extracting rules from a trained network
  - Discretize activation values; replace each individual activation value by the cluster average while maintaining the network accuracy
  - Enumerate the output from the discretized activation values to find rules between activation values and output
  - Find the relationship between the input and activation values
  - Combine the above two to obtain rules relating the output to the input
Association-Based Classification
- Several methods for association-based classification
  - ARCS: quantitative association mining and clustering of association rules (Lent et al.’97)
    - It beats C4.5 in (mainly) scalability and also accuracy
  - Associative classification (Liu et al.’98)
    - It mines high-support and high-confidence rules of the form “cond_set => y”, where y is a class label
  - CAEP (Classification by Aggregating Emerging Patterns) (Dong et al.’99)
    - Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
    - Mine EPs based on minimum support and growth rate
Other Classification Methods
- k-nearest neighbor classifier
- case-based reasoning
- Genetic algorithm
- Rough set approach
- Fuzzy set approaches
Instance-Based Methods
- Instance-based learning:
  - Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach
    - Instances represented as points in a Euclidean space
  - Locally weighted regression
    - Constructs a local approximation
  - Case-based reasoning
    - Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space
- The nearest neighbors are defined in terms of Euclidean distance
- The target function could be discrete- or real-valued
- For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
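A compact sketch of the discrete-valued k-NN rule just described (Euclidean distance, majority vote among the k nearest training points); the 2-D points are hypothetical:

```python
import math
from collections import Counter

def knn_classify(train, xq, k=3):
    """train: list of (point, label) pairs; xq: query point; majority vote among k nearest."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda item: dist(item[0], xq))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical 2-D training points
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))  # "A"
```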
Discussion on the k-NN Algorithm
- The k-NN algorithm for continuous-valued target functions
  - Calculate the mean value of the k nearest neighbors
- Distance-weighted nearest neighbor algorithm
  - Weight the contribution of each of the k neighbors according to its distance to the query point xq
    - giving greater weight to closer neighbors
  - Similarly, for real-valued target functions
- Robust to noisy data by averaging the k nearest neighbors
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  - To overcome it, stretch the axes or eliminate the least relevant attributes
Case-Based Reasoning
- Also uses lazy evaluation and analysis of similar instances
- Difference: instances are not “points in a Euclidean space”
- Example: water faucet problem in CADET (Sycara et al.’92)
- Methodology
  - Instances represented by rich symbolic descriptions (e.g., function graphs)
  - Multiple retrieved cases may be combined
  - Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
- Research issues
  - Indexing based on syntactic similarity measures, and, when this fails, backtracking and adapting to additional cases
Remarks on Lazy vs. Eager Learning
- Instance-based learning: lazy evaluation
- Decision-tree and Bayesian classification: eager evaluation
- Key differences
  - A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
  - An eager method cannot, since it has already committed to a global approximation before seeing the query
- Efficiency: lazy methods spend less time training but more time predicting
- Accuracy
  - A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
  - An eager method must commit to a single hypothesis that covers the entire instance space
Genetic Algorithms
- GA: based on an analogy to biological evolution
- Each rule is represented by a string of bits
- An initial population is created consisting of randomly generated rules
  - e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100
- Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
- The fitness of a rule is represented by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation
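To illustrate the bit-string encoding and the crossover/mutation operators (the fitness function, population size, and rates below are placeholders, not values from the slides):

```python
import random

def crossover(parent1, parent2):
    """Single-point crossover of two equal-length bit strings."""
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:]

def mutate(rule, rate=0.05):
    """Flip each bit with a small probability."""
    return "".join(b if random.random() > rate else ("1" if b == "0" else "0") for b in rule)

def evolve(population, fitness, generations=50):
    """Keep the fittest rules and breed offspring by crossover + mutation."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: len(ranked) // 2]                 # survival of the fittest
        offspring = [mutate(crossover(*random.sample(survivors, 2)))
                     for _ in range(len(population) - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)

# Placeholder fitness: in practice, the rule's classification accuracy on training data
toy_fitness = lambda rule: rule.count("1")
population = ["".join(random.choice("01") for _ in range(8)) for _ in range(20)]
print(evolve(population, toy_fitness))
```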
Rough Set Approach
- Rough sets are used to approximately or “roughly” define equivalence classes
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
- Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix can be used to reduce the computation intensity
Fuzzy Set Approaches
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)
- Attribute values are converted to fuzzy values
  - e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
- For a given new sample, more than one fuzzy value may apply
- Each applicable rule contributes a vote for membership in the categories
- Typically, the truth values for each predicted category are summed
What Is Prediction?
- Prediction is similar to classification
  - First, construct a model
  - Second, use the model to predict unknown values
- The major method for prediction is regression
  - Linear and multiple regression
  - Non-linear regression
- Prediction is different from classification
  - Classification refers to predicting categorical class labels
  - Prediction models continuous-valued functions
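As a minimal example of the regression idea, simple one-variable linear regression fits y = a + b·x by least squares; the data points below are hypothetical:

```python
def simple_linear_regression(xs, ys):
    """Least-squares fit of y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: predict a continuous value from a single attribute
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = simple_linear_regression(xs, ys)
print(a, b, a + b * 6)  # extrapolated prediction at x = 6
```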
Predictive Modeling in Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data
  - One can only predict value ranges or category distributions
- Method outline:
  - Minimal generalization
  - Attribute relevance analysis
  - Generalized linear model construction
  - Prediction
- Determine the major factors which influence the prediction
  - Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction
Locally Weighted Regression
Prediction: Numerical Data
Prediction: Categorical Data
Classification Accuracy: Estimating Error Rates
- Partition: training-and-testing
  - use two independent data sets, e.g., training set (2/3) and test set (1/3)
  - used for data sets with a large number of samples
- Cross-validation
  - divide the data set into k subsamples
  - use k-1 subsamples as training data and one subsample as test data: k-fold cross-validation
  - for data sets of moderate size
- Bootstrapping (leave-one-out)
  - for small data sets
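A plain-Python sketch of the k-fold split itself; train_and_test is an assumption of this sketch, standing in for any classifier's train-then-score routine:

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(samples, train_and_test, k=10):
    """For each fold: train on the other k-1 folds, test on the held-out fold, average accuracy."""
    folds = k_fold_indices(len(samples), k)
    accuracies = []
    for test_idx in folds:
        test = [samples[i] for i in test_idx]
        train = [s for i, s in enumerate(samples) if i not in set(test_idx)]
        accuracies.append(train_and_test(train, test))
    return sum(accuracies) / k
```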
Boosting and Bagging
- Boosting increases classification accuracy
  - Applicable to decision trees or Bayesian classifiers
  - Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
- Boosting requires only linear time and constant space
Boosting Technique — Algorithm
- Assign every example an equal weight 1/N
- For t = 1, 2, …, T do
  - Obtain a hypothesis (classifier) h(t) under the current weights w(t)
  - Calculate the error of h(t) and re-weight the examples based on the error
  - Normalize w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
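A sketch of this loop in the style of AdaBoost; the specific re-weighting and voting formulas are the standard AdaBoost ones, shown here as one concrete instance of the scheme above rather than the exact variant on the original slide:

```python
import math

def boost(samples, weak_learner, T=10):
    """samples: list of (x, y) with y in {-1, +1}; weak_learner(samples, weights) -> h, h(x) in {-1, +1}."""
    n = len(samples)
    weights = [1.0 / n] * n                      # every example starts with weight 1/N
    hypotheses = []
    for _ in range(T):
        h = weak_learner(samples, weights)
        # Weighted error of this hypothesis on the training set
        err = sum(w for w, (x, y) in zip(weights, samples) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)  # accuracy-based vote weight
        hypotheses.append((alpha, h))
        # Increase the weight of misclassified examples, decrease the rest, renormalize
        weights = [w * math.exp(-alpha * y * h(x)) for w, (x, y) in zip(weights, samples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    # Final classifier: weighted vote of all hypotheses
    return lambda x: 1 if sum(a * h(x) for a, h in hypotheses) >= 0 else -1
```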
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is still an important issue for database applications; thus combining classification with database techniques should be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, and multimedia data