teknik informatika: Classification in Large Databases

Classification in Large Databases

- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple, easy-to-understand classification rules
  - can use SQL queries to access databases
  - classification accuracy comparable to that of other methods
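To illustrate the "convertible to classification rules" point, here is a minimal sketch that turns each root-to-leaf path of a tree into one IF-THEN rule. The toy tree, its attribute names, and the helper names `extract_rules` and `format_rule` are hypothetical and chosen only for illustration.

```python
# Hypothetical toy tree: an internal node is (attribute, {value: subtree}),
# a leaf is a plain class-label string.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rain": ("windy", {"true": "no", "false": "yes"}),
})


def extract_rules(node, conditions=()):
    """Return one (conditions, label) pair per root-to-leaf path."""
    if isinstance(node, str):  # leaf: the accumulated path becomes a rule
        return [(conditions, node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules


def format_rule(conditions, label):
    """Render a rule as readable IF-THEN text."""
    tests = " AND ".join(f"{a} = '{v}'" for a, v in conditions)
    return f"IF {tests} THEN class = '{label}'"


rules = extract_rules(tree)
for conditions, label in rules:
    print(format_rule(conditions, label))
```

Each rule's condition part has the same shape as a SQL `WHERE` clause, which is one reason tree-based classifiers pair naturally with database queries.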

Scalable Decision Tree Induction Methods in Data Mining Studies

- SLIQ (EDBT’96, Mehta et al.)
  - builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB’96, J. Shafer et al.)
  - constructs an attribute list data structure
- PUBLIC (VLDB’98, Rastogi & Shim)
  - integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti)
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - builds an AVC-list (attribute, value, class label)
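The (attribute, value, class label) idea from the RainForest bullet can be sketched as follows. This is a simplified illustration and not the published algorithm: one sequential scan aggregates class counts per attribute value, and a split criterion is then evaluated from those counts alone, without touching the records again. The toy records and the choice of weighted entropy as the quality criterion are assumptions; any other criterion could be plugged in at the same place.

```python
from collections import defaultdict
from math import log2

# Hypothetical toy records: (attribute dict, class label).
records = [
    ({"outlook": "sunny", "windy": "false"}, "no"),
    ({"outlook": "sunny", "windy": "true"}, "no"),
    ({"outlook": "overcast", "windy": "false"}, "yes"),
    ({"outlook": "rain", "windy": "false"}, "yes"),
    ({"outlook": "rain", "windy": "true"}, "no"),
    ({"outlook": "overcast", "windy": "true"}, "yes"),
]


def build_avc(records):
    """One sequential scan: count (attribute, value, class) co-occurrences."""
    avc = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for attributes, label in records:
        for attribute, value in attributes.items():
            avc[attribute][value][label] += 1
    return avc


def entropy(class_counts):
    """Entropy (in bits) of a {class label: count} mapping."""
    total = sum(class_counts.values())
    return -sum(c / total * log2(c / total) for c in class_counts.values() if c)


def weighted_entropy(value_counts):
    """Expected entropy after splitting on one attribute, from counts only."""
    total = sum(sum(cc.values()) for cc in value_counts.values())
    return sum(sum(cc.values()) / total * entropy(cc)
               for cc in value_counts.values())


avc = build_avc(records)
scores = {attribute: weighted_entropy(vc) for attribute, vc in avc.items()}
best = min(scores, key=scores.get)  # lower remaining entropy is better
print(best, scores)
```

The counts are usually far smaller than the data set itself, since their size depends on the number of distinct attribute values and classes rather than on the number of records.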

Data Cube-Based Decision-Tree Induction

- Integration of generalization with decision-tree induction (Kamber et al. ’97)
- Classification at primitive concept levels
  - e.g., precise temperature, humidity, outlook, etc.
  - low-level concepts, scattered classes, bushy classification trees
  - semantic interpretation problems
- Cube-based multi-level classification
  - relevance analysis at multiple levels
  - information-gain analysis with dimension + level
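A minimal sketch of information-gain analysis per (dimension, level) pair. It assumes a two-level concept hierarchy for each dimension, where level 0 is the primitive value and level 1 is a generalized concept. The thresholds, the toy rows, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def temperature_band(t):
    """Hypothetical level-1 concept for temperature."""
    return "hot" if t >= 80 else "mild" if t >= 70 else "cool"


def humidity_band(h):
    """Hypothetical level-1 concept for humidity."""
    return "high" if h > 80 else "normal"


# Per dimension: [level-0 mapping (identity), level-1 mapping (generalized)].
hierarchies = {
    "temperature": [lambda t: t, temperature_band],
    "humidity": [lambda h: h, humidity_band],
}
columns = {"temperature": 0, "humidity": 1}

# Hypothetical toy rows: (temperature, humidity, class label).
rows = [
    (85, 85, "no"), (80, 90, "no"), (83, 78, "yes"), (70, 96, "yes"),
    (68, 80, "yes"), (65, 70, "no"), (64, 65, "yes"), (72, 95, "no"),
    (69, 70, "yes"), (75, 80, "yes"),
]


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def info_gain(dimension, level):
    """Return (information gain, number of branches) for a dimension/level."""
    generalize = hierarchies[dimension][level]
    groups = defaultdict(list)
    for row in rows:
        groups[generalize(row[columns[dimension]])].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder, len(groups)


for dimension in hierarchies:
    for level in range(len(hierarchies[dimension])):
        gain, branches = info_gain(dimension, level)
        print(f"{dimension} @ level {level}: gain={gain:.3f}, branches={branches}")
```

At level 0 every distinct temperature becomes its own branch, which is the bushy-tree effect noted above. The generalized level gives fewer branches that are easier to interpret, at the cost of some information gain.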

Presentation of Classification Results
