SIPINA incorporates various supervised learning approaches, but it is especially intended for classification tree algorithms. If you want to use other learning algorithms, I suggest using Tanagra, which implements a broad range of statistical and machine learning methods.
On this webpage, we mainly describe the tree induction algorithms (a French description is also available).
Decision Tree Algorithms
A limited search induction tree algorithm (Catlett, 1991)
Description. Jason Catlett is perhaps one of the first data miners in history. The title of his PhD thesis is "Megainduction: Machine learning on very large databases". He proposed solutions adapted to handling huge databases, especially for decision tree induction.
This approach is one of his suggestions: the number of splits is determined a priori, before the induction. In contexts other than megainduction, this method is not really efficient. But when we handle a huge dataset, the method gives satisfactory results, and the computing time is manageable since it is known in advance.
Parameters. Max number of splits: the maximum number of nodes that we can split during the induction process.
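The split budget can be sketched as a simple counter in the greedy growing loop. The code below is illustrative only, not SIPINA's actual implementation; `find_best_split` is a hypothetical helper supplied by the caller.

```python
def grow_tree(frontier, max_splits, find_best_split):
    """Expand nodes until the a-priori split budget is exhausted.

    `frontier` holds the nodes still open for splitting;
    `find_best_split` is a hypothetical helper that returns a list of
    child nodes, or None when no admissible split exists.
    """
    leaves = []
    splits_done = 0
    while frontier and splits_done < max_splits:
        node = frontier.pop(0)
        children = find_best_split(node)
        if children is None:        # no admissible split: keep as a leaf
            leaves.append(node)
        else:
            frontier.extend(children)
            splits_done += 1
    leaves.extend(frontier)         # unexpanded nodes stay leaves
    return leaves, splits_done
```

Once `splits_done` reaches the budget, every remaining frontier node becomes a leaf, which is what keeps the computing time predictable on large datasets.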
Reference. J. Catlett, « Megainduction : Machine learning on very large
databases », PhD Thesis, School of Computer Science, University of Technology,
Sydney, Australia, 1991.
ID3-IV (Quinlan, 1986)
Description. It is one of the last versions of ID3 (Quinlan, 1979). Subsequently, Quinlan incorporated a post-pruning process into his later induction algorithms, C4 and C4.5.
Compared with the original ID3, this version uses a chi-square test as a
pre-pruning rule. When a split is not statistically significant, the growing
process is stopped.
Parameters. Confidence level: the confidence level of the statistical test. When the computed p-value is higher than this threshold, the node is not split.
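As an illustration, the pre-pruning rule can be sketched by comparing the chi-square statistic of a candidate split with a critical value, which is equivalent to comparing the p-value with the threshold for a fixed degree of freedom. This is a sketch, not SIPINA's code.

```python
def chi2_statistic(table):
    """Pearson chi-square statistic of a contingency table
    (rows = leaves of the candidate split, columns = class values)."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                stat += (obs - expected) ** 2 / expected
    return stat

# Critical value of the chi-square distribution for df = 1, alpha = 0.05
CHI2_CRIT_DF1_05 = 3.841

def split_is_significant(table, critical=CHI2_CRIT_DF1_05):
    """Pre-pruning check: accept the split only if it is significant."""
    return chi2_statistic(table) > critical
```

A strongly associated table such as `[[30, 10], [10, 30]]` passes the check, while a perfectly balanced one such as `[[20, 20], [20, 20]]` stops the growing process.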
Reference. J. Quinlan, " Induction of Decision Trees ", in Machine
Learning, 1, pp.81-106, 1986.
GID3 (Cheng et al., 1988)
GID3 is a generalization of ID3 in which some of the leaves produced by a split can be merged. The idea is to highlight the most interesting leaves, and to gather the other ones into a "default leaf".
Roughly speaking, when we split a node, we detect the leaf which minimizes the Shannon entropy. Then we compute a threshold (TL * Min.Entropy), where TL is the tolerance level, the main parameter of the method. When a leaf has an entropy higher than the threshold, it is merged into the "default leaf".
Parameters. Confidence level: the same as the ID3-IV parameter.
Tolerance level: the parameter controlling the leaf merging. If TL < 1, all the leaves are merged. If TL = 1, we obtain a binary tree. If TL is high, we obtain the standard ID3.
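The merging rule can be sketched as follows. This is an illustrative reading of the description above, not SIPINA's code; leaves are represented here by their class-count vectors.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gid3_merge(leaves, tl):
    """Partition the leaves into kept leaves and a merged "default leaf".

    `leaves` maps a leaf label to its class-count vector; `tl` is the
    tolerance level.  A leaf whose entropy exceeds tl * min-entropy is
    gathered into the default leaf.
    """
    entropies = {k: shannon_entropy(v) for k, v in leaves.items()}
    threshold = tl * min(entropies.values())
    kept = [k for k, e in entropies.items() if e <= threshold]
    merged = [k for k, e in entropies.items() if e > threshold]
    return kept, merged
```

With TL = 1 only the minimum-entropy leaf survives alongside the default leaf (the binary tree), while a large TL keeps every leaf (the standard ID3 behavior).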
Reference. J. Cheng, U. Fayyad, K. Irani, Z. Qian, " Improved
decision trees: a generalized version of ID3 ", Proc. of 5th International
Conference on Machine Learning, pp.100-108, 1988.
ASSISTANT 86 (Cestnik et al., 1987)
ASSISTANT is another improvement of the original ID3 (Quinlan, 1979). It induces a binary tree. The merging process optimizes the information gain criterion; we use a hill-climbing approach in our implementation.
Various parameters enable control over the size of the tree.
Attribute suitability: the goodness of split is the information gain, multiplied by the size of the node. If it exceeds this attribute suitability threshold, the split is performed.
Class frequency: if the proportion of one class value exceeds this threshold, we do not try to split the node.
Node weight: if the weight of a node, i.e. the node size divided by the root node size, is lower than this threshold, we do not try to split the node.
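The two frequency-based stopping rules can be sketched as a single check. The default thresholds below are illustrative placeholders, not SIPINA's defaults.

```python
def assistant_should_split(node_counts, root_size,
                           class_freq_max=0.95, node_weight_min=0.02):
    """Pre-pruning checks built from the two frequency parameters.

    `node_counts` is the class-count vector of the node, `root_size`
    the number of examples at the root.  The threshold defaults are
    illustrative only.
    """
    size = sum(node_counts)
    if max(node_counts) / size > class_freq_max:  # class frequency rule
        return False
    if size / root_size < node_weight_min:        # node weight rule
        return False
    return True
```

A nearly pure node or a node that has become too small relative to the root is left as a leaf without attempting a split.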
Reference. B. Cestnik, I. Kononenko, I. Bratko, " ASSISTANT 86: A
Knowledge Elicitation Tool for Sophisticated Users ", Proc. of the 2nd European
Working Session on Learning, pp.31-45, 1987.
CHAID (Kass, 1980)
CHAID is the version of AID (Morgan & Sonquist, 1963 - the forerunner of all induction tree approaches) devoted to a categorical class attribute.
The particularities of CHAID are: (1) the use of the chi-square statistic as the goodness-of-split criterion; (2) the merging of some leaves which come from the same node.
CHAID is widespread in commercial software. On the other hand, it is less often implemented in free software. This method is above all known to statisticians.
P-level for splitting nodes: the computed p-value of the chi-square statistic is compared with this threshold. When it is higher, the split is not performed. If we decrease the "p-level" threshold, we obtain a shorter tree.
P-level for merging nodes: the merging process compares the class distributions in the leaves. It searches for the two most similar leaves; they are merged if the difference between their distributions is not significant. The statistical significance is determined by comparing the computed p-value of the test with this "p-level" threshold defined by the user. If we decrease the threshold, we obtain larger trees; if we increase it, we obtain a more compact tree.
Bonferroni adjustment: because we multiply the tests in order to find the best solution at each node, we increase the chance of finding a good solution "by chance". The true p-value of the tests must be corrected with this kind of adjustment. In my opinion, it is not really convincing in practice; it is easier to obtain the desired solutions by tuning the "p-level" parameters.
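The search step of the merging process, i.e. finding the two most similar leaves, can be sketched by comparing every pair of leaves with a chi-square statistic on their 2 x c contingency table. This is an illustrative sketch, not SIPINA's code.

```python
from itertools import combinations

def most_similar_pair(leaves):
    """Find the two leaves whose class distributions differ the least,
    measured by the Pearson chi-square statistic on their 2 x c
    contingency table (a sketch of CHAID's merging search)."""
    def chi2(a, b):
        n = sum(a) + sum(b)
        col_tot = [x + y for x, y in zip(a, b)]
        stat = 0.0
        for row in (a, b):
            row_tot = sum(row)
            for j, obs in enumerate(row):
                expected = row_tot * col_tot[j] / n
                if expected > 0:
                    stat += (obs - expected) ** 2 / expected
        return stat

    return min(combinations(leaves, 2),
               key=lambda pair: chi2(leaves[pair[0]], leaves[pair[1]]))
```

The selected pair is then actually merged only if its p-value exceeds the "p-level for merging nodes" threshold.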
Reference. G. Kass, « An exploratory technique for investigating
large quantities of categorical data », Applied Statistics, 29(2), pp. 119-127,
1980.
Improved CHAID (Tschuprow Goodness of Split)
It is my favorite approach. It incorporates some modifications of CHAID in order to better control the size of the tree. They have a mainly practical justification: users can understand them easily and set the right values.
In addition, Tschuprow's t is used instead of the standard chi-square, essentially because it is a normalized measure, varying between 0 and 1.
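Using the standard definition, Tschuprow's t of an r x c table is sqrt(chi2 / (n * sqrt((r-1)(c-1)))), which rescales the chi-square statistic into [0, 1]. A sketch:

```python
import math

def tschuprow_t(table):
    """Tschuprow's t of an r x c contingency table: the chi-square
    statistic normalized so that the measure lies in [0, 1]."""
    n = sum(sum(row) for row in table)
    r, c = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * math.sqrt((r - 1) * (c - 1))))
```

It reaches 1 for a perfect association (for a square table) and 0 under independence, which makes split qualities comparable across attributes with different numbers of values.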
P-level for splitting nodes, P-level for merging nodes and Bonferroni adjustment are the same as the parameters of CHAID.
Max depth: it limits the depth of the tree. This parameter has no real theoretical justification, but it is useful when we handle a big dataset.
Min size of node to split : When a node has a size lower than this threshold, we do not attempt to split it.
Min size of leaves: when one of the leaves resulting from a split has a size lower than this threshold, we do not accept the split.
Reference. R. Rakotomalala, " Arbres de
décision ", in Revue Modulad, n°33, 2005.
C4.5 (Quinlan, 1993)
It is Quinlan's state-of-the-art algorithm, widely used in the machine learning community.
Our version is based on Quinlan's book (1993). We implement mainly two key ideas: we use the gain ratio in order to select the right split attribute, and we use post-pruning according to the pessimistic error criterion in order to determine the right size of the tree.
Quinlan has his own implementation, available online. It has several options which can slightly modify the behavior of the algorithm, but they are not really essential.
CL (Confidence level) for pessimistic pruning: the confidence level used for the computation of the pessimistic error rate, i.e. the upper bound of the confidence interval of the error rate on a leaf.
Size of leaves: a split is accepted only if at least two of its leaves have a size greater than this threshold.
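The pessimistic error idea can be sketched with the upper limit of a binomial confidence interval. The version below uses the Wilson score interval under a normal approximation; it illustrates the principle but is not Quinlan's exact computation (his default CF is 25%).

```python
import math
from statistics import NormalDist

def pessimistic_error(errors, n, cf=0.25):
    """Upper bound of the confidence interval of a leaf's error rate
    (Wilson score interval, normal approximation).  A sketch of the
    pessimistic-error idea, not C4.5's exact formula."""
    z = NormalDist().inv_cdf(1 - cf)
    f = errors / n                      # observed error rate on the leaf
    centre = f + z * z / (2 * n)
    half = z * math.sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    return (centre + half) / (1 + z * z / n)
```

Even a leaf with zero observed errors gets a strictly positive pessimistic error, and the bound grows as the leaf shrinks; this is what lets the pruning step compare a subtree against the leaf that would replace it.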
References. R. Quinlan, « C4.5: Programs for Machine Learning »,
Morgan Kaufmann Publishers, 1993.
R. Kohavi, R. Quinlan, « Decision Tree Discovery », in Handbook of Data Mining and
Knowledge Discovery, Klosgen & Zytkow Editors, Chapter 16.1.3, pages 267-276,
Oxford University Press, 2002.
Generalized entropy C4.5
(Rakotomalala & Lallich, 1999)
It is a generalization of C4.5. We use generalized entropy instead of Shannon entropy. An additional parameter is available.
The same parameters as the standard C4.5.
Beta: the parameter of the generalized entropy (see the online reference for detailed formulas).
Improved gain ratio: if checked, we use Quinlan's gain ratio instead of the standard information gain.
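Following the "entropy of degree s" of the Taneja reference below (Daróczy's entropy), the generalized entropy can be sketched as follows; the exact formula used in SIPINA may differ slightly, so check the online reference.

```python
def entropy_beta(probs, beta):
    """Entropy of degree beta (Daroczy's generalized entropy).

    For beta = 2 it equals twice the Gini index; it tends to the
    Shannon entropy as beta -> 1 (the limit case is excluded here).
    """
    if beta == 1:
        raise ValueError("beta = 1 is the Shannon limit")
    coef = 1.0 / (2 ** (1 - beta) - 1)
    return coef * (sum(p ** beta for p in probs) - 1)
```

Varying beta changes how strongly the measure reacts to small class probabilities, which is the lever used for handling noise in the cited papers.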
References. R. Rakotomalala, S. Lallich, "Handling noise with
generalized entropy of type beta in induction graphs algorithm", Proc. Int.
Conf. on Computer Science and Informatics, pp. 25-27, 1998.
S. Lallich, S. Di Palma, « Studying the behavior of generalized entropy in
induction trees using a m-of-n concept », Proc. of 3rd European Conf. on KDD,
pp. 510-517, 1999.
I. Taneja, « Generalized Information Measures and their Applications » ;
see chapter 3 « Entropy-type Measures - Entropy of Degree s and Degress (r,s) »,
Departemento de Matematica, UFSC, Brazil, 2001.
Cost sensitive C4.5
(Rakotomalala & Chauchat, 2001)
It is a version of C4.5 which can take a "misclassification cost matrix" into account during the induction process.
The same parameters as the standard C4.5.
Handling costs in growing: if checked, we exploit the costs during the growing phase; this is not really decisive.
Handling costs in pruning: if checked, we exploit the costs during the pruning phase.
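One way misclassification costs enter the pruning phase is in the labeling of leaves: the label is chosen to minimize the expected cost rather than by majority vote. The sketch below illustrates that idea; it is not SIPINA's code.

```python
def best_label(class_probs, cost):
    """Pick the leaf label minimizing the expected misclassification
    cost.  `cost[k][l]` is the cost of predicting class l when the
    true class is k (a sketch of cost-sensitive leaf labeling)."""
    n_classes = len(class_probs)

    def expected_cost(label):
        return sum(class_probs[k] * cost[k][label]
                   for k in range(n_classes))

    return min(range(n_classes), key=expected_cost)
```

With an asymmetric matrix a minority class can win the leaf when misclassifying it is expensive, which is exactly the targeting behavior described in the marketing application below.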
References. J.H. Chauchat, R. Rakotomalala, M. Carloz, C. Pelletier, « Targeting
customer groups using gain and cost matrix : a marketing application »,
Proc. of Data Mining for Marketing Applications Workshop, PKDD'2001, pp. 1-13,
2001.
J. Bradford, C. Kunz, R. Kohavi, C. Brunk, C. Brodley, « Pruning
decision trees with misclassification costs », Proc of 10th ECML, pp. 131-136,
1998.
C. Drummond, R. Holte, « Exploiting the cost of (in)sensitivity of
decision tree splitting criteria », Proc. of ICML, pp.239-246, 2000.