Partition:
- Training-and-testing: use two independent data sets, e.g., a training set (2/3) and a test set (1/3); used for data sets with a large number of samples.
- Cross-validation: divide the data set into k subsamples; use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation); for data sets of moderate size.
- Bootstrapping (leave-one-out): for small data sets.
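The partitioning schemes above can be illustrated with a short sketch. The helper below (a hypothetical `k_fold_splits`, not from any particular library) shuffles the data and yields k train/test pairs; setting k equal to the data size reduces it to leave-one-out.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Partition `data` into k folds and yield (train, test) pairs.
    Illustrative sketch of k-fold cross-validation, not a library API."""
    items = list(data)
    random.Random(seed).shuffle(items)      # random partition of the samples
    folds = [items[i::k] for i in range(k)] # k roughly equal subsamples
    for i in range(k):
        test = folds[i]                     # one subsample held out for testing
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Example: 5-fold split of 10 samples; leave-one-out is k = len(data).
data = list(range(10))
for train, test in k_fold_splits(data, 5):
    print(len(train), len(test))            # 8 training, 2 test samples per fold
```

Each sample appears in exactly one test fold, so every example is used for both training and testing across the k runs.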
Boosting and Bagging
- Boosting increases classification accuracy.
- Applicable to decision trees or Bayesian classifiers.
- Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor.
- Boosting requires only linear time and constant space.
Boosting Technique — Algorithm
- Assign every example an equal weight 1/N.
- For t = 1, 2, …, T do: train a hypothesis on the currently weighted examples, then increase the weights of the examples it misclassifies, so the next classifier pays more attention to them.
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set.
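The steps above can be sketched as a minimal AdaBoost-style loop. This is an illustrative instantiation, assuming 1-D data, ±1 labels, and simple threshold "stumps" as the base classifiers; the function names are hypothetical.

```python
import math

def train_boosted(xs, ys, T):
    """Boosting sketch: xs are 1-D features, ys are +1/-1 labels.
    Base learner is a threshold stump; assumptions, not a fixed API."""
    n = len(xs)
    w = [1.0 / n] * n                     # every example starts at weight 1/N
    ensemble = []                         # list of (alpha, threshold, sign)
    for _ in range(T):
        # pick the stump (threshold, sign) with the lowest weighted error
        best = None
        for thr in xs:
            for sign in (1, -1):
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if (sign if x >= thr else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)        # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # weight by training accuracy
        ensemble.append((alpha, thr, sign))
        # re-weight: misclassified examples gain weight for the next round
        w = [wi * math.exp(-alpha * y * (sign if x >= thr else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all hypotheses in the series."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1
```

The final prediction is the sign of the accuracy-weighted sum of the stumps, matching the "output a weighted sum of all the hypotheses" step.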
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks).
- Classification is probably one of the most widely used data mining techniques, with many extensions.
- Scalability is still an important issue for database applications; thus, combining classification with database techniques should be a promising topic.
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.