Reference: Data Mining: Concepts and Techniques, 3rd Edition (Han, Kamber, and Pei).
Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky. Such analysis can help provide us with a better understanding of the data at large. Many classification methods have been proposed by researchers in machine learning, pattern recognition, and statistics.
Classification has numerous applications, including fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.
What Is Classification?
A bank loan officer wants an analysis of her data to learn which loan applicants are “safe” and which are “risky” for the bank. A marketing manager at AllElectronics needs data analysis to help predict whether a customer with a given profile will buy a new computer.
Suppose that the marketing manager wants to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a class label. This model is a predictor. Classification and numeric prediction are the two major types of prediction problems.
General Approach to Classification
“How does classification work?”
Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data). The process is shown for the loan application data in Figure 8.1. (The data are simplified for illustrative purposes.)
Figure 8.1: The data classification process.
(a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is a loan decision, and the classifier is represented in the form of classification rules.
(b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
Because the class label of each training tuple is provided, this step is also known as supervised learning (i.e., the learning of the classifier is “supervised” in that it is told to which class each training tuple belongs).
“What about classification accuracy?”
In the second step, the model is used for classification. First, the predictive accuracy of the classifier is estimated. If we were to use the training set to measure the classifier’s accuracy, this estimate would likely be optimistic, because the classifier tends to overfit the data. Therefore, a test set is used, made up of test tuples and their associated class labels; these tuples are not used to build the classifier.
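To make the two steps concrete, here is a minimal sketch using scikit-learn (a library choice of ours, not the book's); the tiny loan-application data set is invented for illustration. The model is learned from a training set, its accuracy is estimated on a held-out test set, and only then is it applied to a new tuple.

```python
# Minimal sketch of the two-step classification process using scikit-learn.
# The loan-application data below is made up for illustration.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Features: [age, income]; class labels: loan decision ("safe" or "risky").
X = [[25, 30000], [40, 80000], [35, 20000], [50, 120000],
     [23, 15000], [45, 95000], [30, 40000], [60, 70000]]
y = ["risky", "safe", "risky", "safe", "risky", "safe", "risky", "safe"]

# Hold out part of the data so accuracy is not measured on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 1 (learning): build the classifier from the training tuples.
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 (classification): estimate predictive accuracy on the test set,
# then apply the model to new tuples.
print("estimated accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("new applicant:", clf.predict([[33, 50000]]))
```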
Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.
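As an illustration, the flowchart structure can be written directly as nested tests; the sketch below hand-codes a tree along the lines of the book's buys_computer example (attribute values and branches reproduced from memory, so treat them as illustrative).

```python
# A hand-coded decision tree: each "if" is an internal node (a test on an
# attribute), and each return is a leaf holding a class label.
def buys_computer(age: str, student: str, credit_rating: str) -> str:
    if age == "youth":                      # root node tests the age attribute
        return "yes" if student == "yes" else "no"
    elif age == "middle_aged":              # this branch leads directly to a leaf
        return "yes"
    else:                                   # age == "senior"
        return "yes" if credit_rating == "fair" else "no"

print(buys_computer("youth", "yes", "fair"))       # -> "yes"
print(buys_computer("senior", "no", "excellent"))  # -> "no"
```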
Attribute Selection Measures
There are several attribute selection measures. Some of them are as follows:
- Information Gain
- Gain ratio
- Gini index
An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes. Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.
Information Gain
Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). ID3 uses information gain as its attribute selection measure. That is,
Gain(A) = Info(D) − InfoA(D),
where InfoA(D) = Σj (|Dj|/|D|) × Info(Dj) is the expected information still required to classify a tuple after D has been partitioned on A into the partitions D1, D2, ..., Dv.
Put another way, information gain is the amount of information gained by knowing the value of the attribute: the entropy of the class distribution before the split minus the weighted entropy of the distributions after it. The attribute yielding the largest information gain therefore leaves the smallest entropy (impurity) after the split.
Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.
This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
The expected information needed to classify a tuple in D is given by:
Info(D) = − Σi pi log2(pi)   (summing over i = 1, ..., m),
where pi is the non-zero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|. A log function to the base 2 is used because the information is encoded in bits. Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
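The following plain-Python sketch computes Info(D) and the information gain of a categorical attribute directly from these definitions; the helper names and toy data are our own.

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy): Info(D) = -sum(pi * log2(pi))."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Gain(A) = Info(D) - InfoA(D), where InfoA(D) is the weighted
    entropy of the partitions induced by attribute A."""
    total = len(labels)
    partitions = {}
    for v, label in zip(attr_values, labels):
        partitions.setdefault(v, []).append(label)
    info_a = sum(len(part) / total * info(part) for part in partitions.values())
    return info(labels) - info_a

# Toy data: does "age" help predict the class label?
age    = ["youth", "youth", "senior", "senior", "middle"]
labels = ["no",    "no",    "yes",    "no",     "yes"]
print(round(info(labels), 3))            # Info(D)
print(round(info_gain(age, labels), 3))  # Gain(age)
```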
Gain Ratio
The gain ratio is a modification of the information gain that reduces its bias toward multi-valued attributes:
- It takes the number and size of branches into account when choosing an attribute.
- It corrects the information gain by taking the intrinsic information of a split into account.
- Information gain by itself prefers to select attributes having a large number of values; the gain ratio compensates for this.
C4.5, a successor of ID3, uses an extension to information gain known as the gain ratio, which attempts to overcome this bias. It applies a kind of normalization to information gain using a “split information” value defined analogously with Info(D) as:
SplitInfoA(D) = − Σj (|Dj|/|D|) × log2(|Dj|/|D|),   for j = 1, ..., v.
The gain ratio is then defined as:
GainRatio(A) = Gain(A) / SplitInfoA(D).
The attribute with the maximum gain ratio is selected as the splitting attribute.
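A corresponding sketch for the gain ratio, again in plain Python with invented toy data, divides the information gain by the split information as defined above.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """GainRatio(A) = Gain(A) / SplitInfoA(D)."""
    n = len(labels)
    parts = {}
    for v, label in zip(attr_values, labels):
        parts.setdefault(v, []).append(label)
    weights = [len(p) / n for p in parts.values()]
    gain = entropy(labels) - sum(w * entropy(p)
                                 for w, p in zip(weights, parts.values()))
    split_info = -sum(w * math.log2(w) for w in weights)
    return gain / split_info if split_info > 0 else 0.0

age    = ["youth", "youth", "senior", "senior", "middle"]
labels = ["no",    "no",    "yes",    "no",     "yes"]
print(round(gain_ratio(age, labels), 3))
```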
Gini Index
The Gini index is used in CART. It measures the impurity of D, a data partition or set of training tuples (rather than its entropy), as:
Gini(D) = 1 − Σi pi²,
where pi is, as before, the probability that a tuple in D belongs to class Ci, estimated by |Ci,D|/|D|, and the sum runs over the m classes.
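A small sketch of the Gini computations follows, using CART's convention of binary splits; the helper names and example labels are our own.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(pi^2) over the class distribution of D."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted Gini index of a binary split, as used by CART:
    GiniA(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)

print(round(gini(["yes", "yes", "no", "no", "no"]), 3))          # impurity of D
print(round(gini_split(["yes", "yes"], ["no", "no", "no"]), 3))  # a pure split -> 0.0
```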
Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning removes these least-reliable branches to address such overfitting. Pruned trees are smaller, less complex, and typically better at classifying independent test data.
Tree Pruning Approaches
There are two approaches to pruning a tree:
- Pre-pruning: The tree is pruned by halting its construction early (e.g., by deciding not to further split or partition the subset of training tuples at a given node). Upon halting, the node becomes a leaf.
- Post-pruning: The second and more common approach removes subtrees from a “fully grown” tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf. The leaf is labeled with the most frequent class among the subtree being replaced.
The cost complexity pruning algorithm used in CART is an example of the post-pruning approach. C4.5 uses a method called pessimistic pruning, which is similar to the cost complexity method in that it also uses error rate estimates to make decisions regarding subtree pruning.
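As one concrete, library-specific illustration (not the book's exact CART procedure), scikit-learn exposes cost-complexity post-pruning through the ccp_alpha parameter of DecisionTreeClassifier; increasing alpha prunes more of the fully grown tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alpha values from the cost-complexity pruning path of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

for alpha in path.ccp_alphas:
    # Larger alpha prunes more aggressively, giving a smaller, simpler tree.
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(
        X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```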
Bayes Classification Methods
Bayesian classification is based on Bayes’ theorem. Bayesian classifiers are statistical classifiers; they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes’ Theorem
Let X be a data tuple. Let H be some hypothesis, such as the hypothesis that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. P(H) is the prior probability, or a priori probability, of H. Bayes’ theorem relates them as:
P(H|X) = P(X|H) P(H) / P(X).
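A tiny worked example of Bayes' theorem, with invented numbers for a loan-application setting:

```python
# Hypothetical numbers: H = "applicant is risky", X = "applicant has low income".
p_h = 0.3          # prior P(H): fraction of risky applicants overall
p_x_given_h = 0.8  # likelihood P(X|H): risky applicants who have low income
p_x = 0.4          # evidence P(X): applicants with low income overall

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 0.6 -> posterior probability of "risky" given low income
```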
Naive Bayesian Classification
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2,..., An.
2. Suppose that there are m classes, C1, C2,..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i.
Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posterior hypothesis. By Bayes’ theorem (Eq. 8.10),
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ··· = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci, D|/|D|, where |Ci, D| is the number of training tuples of class Ci in D.
4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). To reduce computation in evaluating P(X|Ci), the naïve assumption of class-conditional independence is made: the attributes’ values are assumed to be conditionally independent of one another, given the class label of the tuple. Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ··· × P(xn|Ci).
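Putting steps 1 to 4 together, here is a minimal plain-Python sketch of a naïve Bayesian classifier for categorical attributes. The training data and attribute names are invented, and Laplace smoothing is added (an extra not shown in the steps above) so that unseen attribute values do not force a zero probability.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate the counts needed for P(Ci) and P(xk|Ci) from categorical data."""
    priors = Counter(labels)                 # |Ci,D| for each class Ci
    cond = defaultdict(Counter)              # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, value in enumerate(row):
            cond[(c, k)][value] += 1
    return priors, cond, len(labels)

def predict(x, priors, cond, n_total):
    best_class, best_score = None, -1.0
    for c, class_count in priors.items():
        score = class_count / n_total        # P(Ci)
        for k, value in enumerate(x):
            counts = cond[(c, k)]
            # P(xk|Ci) with Laplace smoothing to avoid zero probabilities.
            score *= (counts[value] + 1) / (class_count + len(counts) + 1)
        if score > best_score:               # maximize P(X|Ci) * P(Ci)
            best_class, best_score = c, score
    return best_class

# Toy AllElectronics-style data: (age, income) -> buys_computer
rows   = [("youth", "high"), ("youth", "low"), ("senior", "low"), ("senior", "high")]
labels = ["no", "yes", "yes", "no"]
priors, cond, n = train_naive_bayes(rows, labels)
print(predict(("youth", "low"), priors, cond, n))   # -> "yes"
```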