
Saturday, October 8, 2016

Classification Accuracy: Estimating Error Rates

- Partition: training-and-testing
  - Use two independent data sets, e.g., a training set (2/3) and a test set (1/3)
  - Suited to data sets with a large number of samples
- Cross-validation
  - Divide the data set into k subsamples
  - Use k-1 subsamples as training data and the remaining subsample as test data; repeating this for each subsample gives k-fold cross-validation
  - Suited to data sets of moderate size
- Bootstrapping (and leave-one-out)
  - Suited to small data sets
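The k-fold procedure above can be sketched in a few lines. This is a minimal illustration, not a library implementation: the 1-nearest-neighbour classifier and the toy data set are stand-ins chosen only so the example is self-contained.

```python
# Minimal k-fold cross-validation sketch. The 1-NN classifier below is a
# placeholder for any learner; in practice you would plug in your own.
def one_nn_predict(train, x):
    # train: list of (features, label) pairs; x: feature tuple.
    return min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

def k_fold_accuracy(data, k):
    folds = [data[i::k] for i in range(k)]  # k disjoint subsamples
    correct = 0
    for i in range(k):
        test = folds[i]                     # one subsample held out as test data
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        correct += sum(one_nn_predict(train, x) == y for x, y in test)
    return correct / len(data)              # accuracy averaged over all k folds

# Toy data: one feature in [0, 0.9], label 1 when the feature exceeds 0.5.
data = [((x / 10,), int(x / 10 > 0.5)) for x in range(10)]
accuracy = k_fold_accuracy(data, 5)
```

The estimated error rate is then simply `1 - accuracy`.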
Boosting and Bagging
- Boosting increases classification accuracy
- Applicable to decision trees or Bayesian classifiers
- Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
- Boosting requires only linear time and constant space
Boosting Technique — Algorithm
- Assign every example an equal weight 1/N
- For t = 1, 2, …, T do:
  - Learn a hypothesis (classifier) h_t under the current weights
  - Increase the weights of the examples h_t misclassifies (and renormalize), so the next classifier pays more attention to them
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
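As a concrete instance of the loop above, here is a hedged sketch of AdaBoost (the best-known boosting algorithm) using one-feature threshold "stumps" as the weak classifiers. The helper names (`train_stump`, `adaboost`) and the toy data are illustrative assumptions, not part of the original slides.

```python
import math

def train_stump(data, weights):
    # Weak learner: pick the threshold/polarity with the lowest weighted error.
    best = None
    for thr in sorted({x for (x,), _ in data}):
        for pol in (1, -1):
            err = sum(w for ((x,), y), w in zip(data, weights)
                      if (pol if x > thr else -pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(data, T):
    n = len(data)
    weights = [1.0 / n] * n            # every example starts with weight 1/N
    ensemble = []                      # (alpha, threshold, polarity) triples
    for _ in range(T):
        err, thr, pol = train_stump(data, weights)
        err = max(err, 1e-10)          # avoid division by zero for perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight: higher accuracy, bigger vote
        ensemble.append((alpha, thr, pol))
        # Re-weight: misclassified examples gain weight, correct ones lose it.
        for i, ((x,), y) in enumerate(data):
            pred = pol if x > thr else -pol
            weights[i] *= math.exp(-alpha * y * pred)
        total = sum(weights)
        weights = [w / total for w in weights]   # renormalize to sum to 1
    return ensemble

def predict(ensemble, x):
    # Weighted sum of all hypotheses, each weighted by its alpha.
    score = sum(a * (p if x > t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data with labels in {-1, +1}.
data = [((0.1,), -1), ((0.2,), -1), ((0.3,), -1),
        ((0.7,), 1), ((0.8,), 1), ((0.9,), 1)]
model = adaboost(data, T=5)
```

Each pass through the loop trains one stump on the re-weighted data, which is exactly how later classifiers come to focus on the predecessors' mistakes.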
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is still an important issue for database applications; combining classification with database techniques should therefore be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, and multimedia data
