Partition:
- Training-and-testing: use two independent data sets, e.g., a training set (2/3) and a test set (1/3); used for data sets with a large number of samples.
- Cross-validation: divide the data set into k subsamples; use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation); for data sets of moderate size.
- Bootstrapping (leave-one-out): for small data sets.
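The partitioning schemes above can be illustrated with a short sketch. The helper below (a hypothetical `k_fold_splits`, not from any particular library) shuffles the data and yields k train/test pairs; setting k equal to the data size reduces it to leave-one-out.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Partition `data` into k folds and yield (train, test) pairs.
    Illustrative sketch of k-fold cross-validation, not a library API."""
    items = list(data)
    random.Random(seed).shuffle(items)      # random partition of the samples
    folds = [items[i::k] for i in range(k)] # k roughly equal subsamples
    for i in range(k):
        test = folds[i]                     # one subsample held out for testing
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Example: 5-fold split of 10 samples; leave-one-out is k = len(data).
data = list(range(10))
for train, test in k_fold_splits(data, 5):
    print(len(train), len(test))            # 8 training, 2 test samples per fold
```

Each sample appears in exactly one test fold, so every example is used for both training and testing across the k runs.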
Boosting and Bagging
- Boosting increases classification accuracy.
- Applicable to decision trees or Bayesian classifiers.
- Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor.
- Boosting requires only linear time and constant space.
Boosting Technique — Algorithm
- Assign every example an equal weight 1/N.
- For t = 1, 2, …, T do: train a hypothesis on the currently weighted examples, then increase the weights of the examples it misclassifies, so the next classifier pays more attention to them.
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set.
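The steps above can be sketched as a minimal AdaBoost-style loop. This is an illustrative instantiation, assuming 1-D data, ±1 labels, and simple threshold "stumps" as the base classifiers; the function names are hypothetical.

```python
import math

def train_boosted(xs, ys, T):
    """Boosting sketch: xs are 1-D features, ys are +1/-1 labels.
    Base learner is a threshold stump; assumptions, not a fixed API."""
    n = len(xs)
    w = [1.0 / n] * n                     # every example starts at weight 1/N
    ensemble = []                         # list of (alpha, threshold, sign)
    for _ in range(T):
        # pick the stump (threshold, sign) with the lowest weighted error
        best = None
        for thr in xs:
            for sign in (1, -1):
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if (sign if x >= thr else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)        # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # weight by training accuracy
        ensemble.append((alpha, thr, sign))
        # re-weight: misclassified examples gain weight for the next round
        w = [wi * math.exp(-alpha * y * (sign if x >= thr else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all hypotheses in the series."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1
```

The final prediction is the sign of the accuracy-weighted sum of the stumps, matching the "output a weighted sum of all the hypotheses" step.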
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks).
- Classification is probably one of the most widely used data mining techniques, with many extensions.
- Scalability is still an important issue for database applications; thus, combining classification with database techniques should be a promising topic.
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.