SIPINA LEARNING ALGORITHMS

SIPINA incorporates various supervised approaches, but it is especially intended for classification tree algorithms. If you want to use other learning algorithms, I suggest using Tanagra, which implements a broad range of statistical and machine learning methods.

On this webpage, we mainly describe the decision tree induction algorithms (description in French).


Decision Trees Algorithms

A limited search induction tree algorithm (Catlett, 1991)
ID3-IV (Quinlan, 1986)
GID3 (Cheng et al., 1988)
ASSISTANT 86 (Cestnik et al., 1987)
CHAID (Kass, 1980)
Improved CHAID (Tschuprow Goodness of Split)
C4.5 (Quinlan, 1993)
Improved C4.5 (Rakotomalala & Lallich, 1999)
Cost sensitive C4.5 (Rakotomalala & Chauchat, 2001)


A limited search induction tree algorithm (Catlett, 1991)

Description.

Jason Catlett is perhaps one of the first data miners in history. The title of his PhD thesis is "Megainduction: Machine learning on very large databases". He proposed solutions adapted to handling huge databases, especially for decision tree induction.

This approach is one of his proposals. The number of splits is determined a priori, before the induction starts. In contexts other than megainduction, this method is not really efficient. But when we handle a huge dataset, it gives satisfactory results, and the computing time is manageable because it is known in advance.
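
To illustrate the idea, here is a minimal sketch (in Python, not SIPINA's code) of a greedy growing loop whose budget of node expansions is fixed before the induction starts. The toy tree works on a single numeric attribute with the Gini impurity; the function names and the impurity choice are mine, only the split budget mirrors the method described above.

    import numpy as np

    def gini(y):
        # Gini impurity of a vector of class labels
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - float(np.sum(p ** 2))

    def best_threshold(x, y):
        # exhaustive search of the best binary split on one numeric attribute
        best_t, best_gain = None, 0.0
        for t in np.unique(x)[:-1]:
            left, right = y[x <= t], y[x > t]
            gain = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if gain > best_gain:
                best_t, best_gain = t, gain
        return best_t, best_gain

    def limited_search_tree(x, y, max_splits=3):
        # the number of node expansions is fixed before the induction starts
        leaves = [np.arange(len(y))]
        rules = []
        for _ in range(max_splits):
            candidates = []
            for i, idx in enumerate(leaves):
                t, gain = best_threshold(x[idx], y[idx])
                if t is not None:
                    candidates.append((gain, i, t))
            if not candidates:
                break                      # nothing left worth splitting
            gain, i, t = max(candidates)   # expand the most promising leaf first
            idx = leaves.pop(i)
            rules.append((t, gain))
            leaves.append(idx[x[idx] <= t])
            leaves.append(idx[x[idx] > t])
        return rules, leaves

    # toy usage: 200 rows, a budget of 2 splits
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = (x > 0.3).astype(int)
    rules, leaves = limited_search_tree(x, y, max_splits=2)
    print(rules)

Whatever the size of the dataset, the loop performs at most max_splits expansions, which is what makes the running time predictable.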

Parameters.

Max number of splits: the maximum number of nodes that we can split during the induction process.

Reference.

J. Catlett, "Megainduction: Machine learning on very large databases", PhD Thesis, School of Computer Science, University of Technology, Sydney, Australia, 1991.

ID3-IV (Quinlan, 1986)

Description.

It is one of the last versions of ID3 (Quinlan, 1979). Subsequently, Quinlan incorporated a post-pruning process into his newer induction algorithms, C4 and C4.5.

Compared with the original ID3, this version uses a chi-square test as a pre-pruning rule. When a split is not statistically significant, the growing process is stopped.

Parameters.

Confidence level: the confidence level of the statistical test. When the computed p-value is greater than this threshold, the node is not split.
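
A minimal sketch of such a pre-pruning rule, assuming a chi-square test of independence between the candidate attribute and the class on the current node (the function name and the use of scipy are my own choices, not SIPINA's internals):

    import numpy as np
    from scipy.stats import chi2_contingency

    def split_is_significant(attribute, target, p_threshold=0.05):
        # cross-tabulate the candidate attribute against the class on the current node
        values_a, values_c = np.unique(attribute), np.unique(target)
        table = np.array([[np.sum((attribute == a) & (target == c)) for c in values_c]
                          for a in values_a])
        _, p_value, _, _ = chi2_contingency(table)
        # pre-pruning rule: refuse the split when the p-value exceeds the threshold
        return p_value <= p_threshold

    # toy usage: an attribute strongly associated with the class on 200 rows
    attr = np.repeat(["a", "b"], 100)
    cls = np.concatenate([np.repeat(0, 90), np.repeat(1, 10), np.repeat(1, 85), np.repeat(0, 15)])
    print(split_is_significant(attr, cls, p_threshold=0.05))   # True -> the node is split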

Reference.

J. Quinlan, "Induction of Decision Trees", Machine Learning, 1, pp. 81-106, 1986.

GID3 (Cheng et al., 1988)

Description.

GID3 is a generalization of ID3 where some leaves coming from a split can be merged. The idea is to highlight the most interesting leaves, and to gather the other ones into a "default leaf".

Roughly speaking, when we split a node, we detect the leaf which minimizes the Shannon entropy. Then we compute a threshold (TL × Min.Entropy), where TL is the tolerance level, the main parameter of the method. When a leaf has an entropy greater than this threshold, it is merged into the "default leaf".
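
The sketch below only reproduces the principle described above (threshold = TL × minimum entropy, leaves above the threshold gathered into the default leaf); the function names are mine and the details in SIPINA may differ.

    import numpy as np

    def shannon_entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def gid3_merge(leaves, tolerance_level=1.2):
        # leaves: one array of class labels per leaf of the candidate split
        entropies = [shannon_entropy(y) for y in leaves]
        threshold = tolerance_level * min(entropies)
        kept, default = [], []
        for y, h in zip(leaves, entropies):
            (default if h > threshold else kept).append(y)   # high entropy -> "default leaf"
        if default:
            kept.append(np.concatenate(default))
        return kept

    # toy usage: the purest leaf is kept, the two noisier ones are gathered in the default leaf
    leaves = [np.array([0, 0, 0, 0, 1]), np.array([0, 1, 1, 0]), np.array([1, 1, 0, 1, 0, 0])]
    print([list(leaf) for leaf in gid3_merge(leaves, tolerance_level=1.2)])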

Parameters.

Confidence level: the same as the ID3-IV parameter.
Tolerance level: the parameter for the merging of the leaves. If TL < 1, all the leaves are merged. If TL = 1, we obtain a binary tree. If TL is high, we obtain the standard ID3.

Reference.

J. Cheng, U. Fayyad, K. Irani, Z. Qian, "Improved decision trees: a generalized version of ID3", Proc. of the 5th International Conference on Machine Learning, pp. 100-108, 1988.

ASSISTANT 86 (Cestnik et al., 1987)

Description.

ASSISTANT is another improvement of the original ID3 (Quinlan, 1979). It induces a binary tree. The merging process, which groups the attribute values into two subsets, optimizes the information gain criterion. We use a hill-climbing approach in our implementation.

Various parameters enable us to control the size of the tree.

Parameters.

Attribute suitability: the goodness of split is the information gain multiplied by the size of the node. If it is greater than this attribute suitability threshold, the split is performed (see the sketch below).
Class frequency: if the proportion of one class value is greater than this threshold, we do not try to split the node.
Node weight: if the weight of a node, i.e. the node size divided by the root node size, is lower than this threshold, we do not try to split the node.
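
Here is a small sketch of these three stopping rules, assuming a Shannon-entropy information gain and a binary split; the default values and the exact weighting are illustrative, not SIPINA's.

    import numpy as np

    def shannon_entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def information_gain(y, left, right):
        weighted = (len(left) * shannon_entropy(left) + len(right) * shannon_entropy(right)) / len(y)
        return shannon_entropy(y) - weighted

    def accept_split(y, left, right, root_size,
                     attribute_suitability=0.1, class_frequency=0.95, node_weight=0.05):
        # (1) node weight: the node is too small relative to the root -> do not split
        if len(y) / root_size < node_weight:
            return False
        # (2) class frequency: the node is already (almost) pure -> do not split
        _, counts = np.unique(y, return_counts=True)
        if counts.max() / counts.sum() > class_frequency:
            return False
        # (3) attribute suitability: information gain weighted by the node size
        #     (the exact weighting in ASSISTANT/SIPINA may differ from this sketch)
        return information_gain(y, left, right) * len(y) > attribute_suitability

    # toy usage: a node of 60 cases, perfectly separated by the candidate binary split
    y = np.array([0] * 40 + [1] * 20)
    left, right = y[:40], y[40:]
    print(accept_split(y, left, right, root_size=100))   # True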

Reference.

B. Cestnik, I. Kononenko, I. Bratko, "ASSISTANT 86: A Knowledge Elicitation Tool for Sophisticated Users", Proc. of the 2nd European Working Session on Learning, pp. 31-45, 1987.


CHAID (Kass, 1980)

Description.

CHAID is the version of AID (Morgan & Sonquist, 1963, the forerunner of all induction tree approaches) devoted to a categorical class attribute.

The particularities of CHAID are: (1) the use of the chi-square statistic as the goodness-of-split criterion; (2) the merging of some leaves coming from the same node.

CHAID is widespread in commercial software. On the other hand, it is less often implemented in free software. This method is above all known to statisticians.

Parameters.

P-level for splitting nodes: the computed p-value of the chi-square statistic is compared to this threshold. When it is greater, the split is not performed. If we decrease this "p-level" threshold, we obtain a shorter tree.
P-level for merging nodes: the merging process compares the class distributions in the leaves. It searches for the two most similar leaves; they are merged if the difference between their distributions is not significant. The statistical significance is determined by comparing the computed p-value of the test with this "p-level" threshold defined by the user. If we decrease the threshold, we obtain larger trees. If we increase the threshold, we obtain a more compact tree.
Bonferroni adjustment: because we multiply the tests in order to find the best solution at each node, we increase the chance of finding a "good" solution by chance alone. The true p-value of the tests must be corrected with this kind of adjustment. In my opinion, it is not really convincing in practice; it is easier to obtain the desired solution by using the "p-level" parameters. The sketch below shows how these three parameters act.
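
The sketch below is not SIPINA's exact procedure (real CHAID merges attribute categories before splitting, which is only roughly illustrated here by a leaf-merging check); it simply shows, under the assumption of a plain chi-square test of independence, how the two p-level thresholds and a Bonferroni-style correction act.

    import numpy as np
    from scipy.stats import chi2_contingency

    def split_p_value(table, bonferroni=True, n_candidates=1):
        # p-value of the chi-square test between a candidate attribute and the class
        _, p, _, _ = chi2_contingency(table)
        if bonferroni:
            p = min(1.0, p * n_candidates)   # crude Bonferroni-style correction for multiple testing
        return p

    def should_merge(leaf_a, leaf_b, merge_p_level=0.05):
        # leaf_a, leaf_b: class count vectors of two leaves coming from the same node
        _, p, _, _ = chi2_contingency(np.vstack([leaf_a, leaf_b]))
        return p > merge_p_level             # not significantly different -> merge them

    # toy usage: choose the attribute with the smallest (corrected) p-value
    candidates = {
        "attr1": np.array([[30, 10], [12, 28]]),   # strongly associated with the class
        "attr2": np.array([[21, 19], [20, 20]]),   # weakly associated
    }
    p_values = {name: split_p_value(t, n_candidates=len(candidates))
                for name, t in candidates.items()}
    best = min(p_values, key=p_values.get)
    split_p_level = 0.05
    print(best, p_values[best] <= split_p_level)                # attr1 is split
    print(should_merge(np.array([25, 5]), np.array([22, 8])))   # True: similar leaves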

Reference.

G. Kass, "An exploratory technique for investigating large quantities of categorical data", Applied Statistics, 29(2), pp. 119-127, 1980.


Improved CHAID (Tschuprow Goodness of Split)

Description.

It is my favorite approach. It introduces some modifications to CHAID in order to better control the size of the tree. These modifications have a mainly practical justification: users can understand them easily and set the right values. In addition, Tschuprow's t is used instead of the standard chi-square, essentially because it is a normalized measure, varying between 0 and 1.
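
For reference, Tschuprow's t can be computed from the contingency table of a candidate split as follows; this is the standard definition of the measure, the surrounding split-search machinery of SIPINA is not shown.

    import numpy as np
    from scipy.stats import chi2_contingency

    def tschuprow_t(table):
        # Tschuprow's t: the chi-square statistic normalized to the [0, 1] interval
        table = np.asarray(table, dtype=float)
        chi2, _, _, _ = chi2_contingency(table, correction=False)
        n = table.sum()
        r, c = table.shape
        return float(np.sqrt((chi2 / n) / np.sqrt((r - 1) * (c - 1))))

    # toy usage: a perfectly discriminating split reaches 1, an uninformative one stays at 0
    print(tschuprow_t([[20, 0], [0, 20]]))    # 1.0
    print(tschuprow_t([[10, 10], [10, 10]]))  # 0.0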

Parameters.

P-level for splitting nodes, P-level for merging nodes and Bonferroni adjustment are the same as the CHAID parameters.
Max depth: it limits the depth of the tree. This parameter has no real theoretical justification, but it is useful when we handle a big dataset.
Min size of node to split: when a node has a size lower than this threshold, we do not attempt to split it.
Min size of leaves: when one of the leaves coming from a split has a size lower than this threshold, we do not accept the split.

Reference.

R. Rakotomalala, "Arbres de décision", Revue Modulad, n°33, 2005.


C4.5 (Quinlan, 1993)

Description.

It is Quinlan's state-of-the-art algorithm. It is widely used in the machine learning community.

Our version is based on Quinlan's book (1993). We mainly implement two key ideas: we use the gain ratio in order to select the right split attribute, and we use post-pruning according to the pessimistic error criterion in order to determine the right size of the tree.
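
As a reminder of the first idea, the gain ratio divides the information gain by the "split information", i.e. the entropy of the branch sizes, which penalizes attributes with many values. A minimal sketch (not Quinlan's code):

    import numpy as np

    def shannon_entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def gain_ratio(target, partition):
        # partition: one array of row indices per branch of the candidate split
        n = len(target)
        gain = shannon_entropy(target) - sum(
            len(idx) * shannon_entropy(target[idx]) for idx in partition) / n
        # split information: entropy of the branch sizes
        sizes = np.array([len(idx) for idx in partition]) / n
        sizes = sizes[sizes > 0]
        split_info = float(-np.sum(sizes * np.log2(sizes)))
        return gain / split_info if split_info > 0 else 0.0

    # toy usage: a binary split of 12 rows
    target = np.array([0] * 8 + [1] * 4)
    partition = [np.arange(0, 6), np.arange(6, 12)]
    print(round(gain_ratio(target, partition), 3))   # 0.459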

Quinlan has his own implementation, available online. It offers several options which can slightly modify the behavior of the algorithm, but they are not really essential.

Parameters.

CL (Confidence level) for pessimistic pruning: the confidence level used for the computation of the pessimistic error rate, i.e. the upper bound of the confidence interval of the error rate on a leaf (see the sketch below).
Size of leaves: a split is accepted only if at least two of its leaves have a size greater than this threshold.
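
A sketch of the pessimistic error rate mentioned above; it uses the usual textbook approximation (the upper bound of a normal-approximation confidence interval on the observed error rate of the leaf), which may differ slightly from Quinlan's exact binomial computation.

    import math
    from scipy.stats import norm

    def pessimistic_error_rate(errors, n, confidence_level=0.25):
        # upper bound of a binomial confidence interval on the observed error rate of a leaf
        # (a common approximation of C4.5's pessimistic error; Quinlan's own code computes
        #  the binomial bound exactly, so the values may differ slightly)
        f = errors / n
        z = norm.ppf(1.0 - confidence_level)   # one-sided normal quantile
        num = f + z * z / (2 * n) + z * math.sqrt(f * (1 - f) / n + z * z / (4 * n * n))
        return num / (1.0 + z * z / n)

    # toy usage: a leaf covering 14 cases with 2 errors, default CL of 25%
    print(round(pessimistic_error_rate(2, 14), 3))

During post-pruning, a subtree is replaced by a leaf when the pessimistic error of that leaf is not worse than the combined pessimistic errors of the subtree's leaves.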

References.

R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, 1993.
R. Kohavi, R. Quinlan, "Decision Tree Discovery", in Handbook of Data Mining and Knowledge Discovery, Klosgen & Zytkow Editors, Chapter 16.1.3, pp. 267-276, Oxford University Press, 2002.

Improved C4.5 (Rakotomalala & Lallich, 1999)

Description.

It is a generalization of C4.5. We use generalized entropy instead of Shannon entropy. An additional parameter is available.

Parameters.

The same parameters as the standard C4.5.
Beta: the parameter of the generalized entropy (see the online references for the detailed formulas, and the sketch below).
Improved gain ratio: if it is checked, we use Quinlan's gain ratio instead of the standard information gain.
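
As an illustration, here is one standard form of the entropy of degree beta (Daroczy type); the exact normalization used in SIPINA may differ, so refer to the references below for the precise formulas.

    import numpy as np

    def generalized_entropy(y, beta=2.0):
        # entropy of degree beta (Daroczy type); it tends to the Shannon entropy when beta -> 1
        # NOTE: this is one standard normalization of the measure; the exact formula used
        # in SIPINA may differ, see the references for the whole family of measures
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        if abs(beta - 1.0) < 1e-9:
            return float(-np.sum(p * np.log2(p)))
        return float((np.sum(p ** beta) - 1.0) / (2.0 ** (1.0 - beta) - 1.0))

    # toy usage on a node with class distribution (3/4, 1/4)
    y = np.array([0, 0, 0, 1])
    print(round(generalized_entropy(y, beta=2.0), 3))   # 0.75 (quadratic, Gini-like entropy)
    print(round(generalized_entropy(y, beta=1.0), 3))   # 0.811 (Shannon entropy)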

References.

R. Rakotomalala, S. Lallich, "Handling noise with generalized entropy of type beta in induction graphs algorithm", Proc. Int. Conf. on Computer Science and Informatics, pp. 25-27, 1998.
R. Rakotomalala, S. Lallich, S. Di Palma, "Studying the behavior of generalized entropy in induction trees using a m-of-n concept", Proc. of the 3rd European Conf. on KDD, pp. 510-517, 1999.
I. Taneja, "Generalized Information Measures and their Applications"; see chapter 3, "Entropy-type Measures - Entropy of Degree s and Degree (r,s)", Departamento de Matematica, UFSC, Brazil, 2001.

Cost sensitive C4.5 (Rakotomalala & Chauchat, 2001)

Description.

It is a version of C4.5 which can take a misclassification cost matrix into account during the induction process.

Parameters.

The same parameters as the standard C4.5.
Handling costs growing: if checked, we exploit the costs during the growing phase; this is not really decisive.
Handling costs pruning: if checked, we exploit the costs during the pruning phase (see the sketch below).
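
To illustrate how a cost matrix can enter the process, the sketch below assigns to a leaf the class which minimizes the expected misclassification cost, instead of the majority class. It shows the general principle only, not SIPINA's exact implementation.

    import numpy as np

    def min_cost_label(class_counts, cost_matrix):
        # class_counts[k]  : number of cases of class k covered by the leaf
        # cost_matrix[k, j]: cost of predicting class j when the true class is k
        p = np.asarray(class_counts, dtype=float)
        p /= p.sum()
        expected_costs = p @ np.asarray(cost_matrix, dtype=float)   # one expected cost per prediction
        return int(np.argmin(expected_costs)), expected_costs

    # toy usage: missing a case of class 1 costs ten times more than the opposite error
    counts = [80, 20]                  # the leaf covers 80 cases of class 0 and 20 of class 1
    costs = np.array([[0.0, 1.0],      # true class 0: predicting 1 costs 1
                      [10.0, 0.0]])    # true class 1: predicting 0 costs 10
    label, expected = min_cost_label(counts, costs)
    print(label, expected)             # label 1 is assigned despite class 0 being the majority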

References.

J.H. Chauchat, R. Rakotomalala, M. Carloz, C. Pelletier, "Targeting customer groups using gain and cost matrix: a marketing application", Proc. of the Data Mining for Marketing Applications Workshop, PKDD'2001, pp. 1-13, 2001.
J. Bradford, C. Kunz, R. Kohavi, C. Brunk, C. Brodley, "Pruning decision trees with misclassification costs", Proc. of the 10th ECML, pp. 131-136, 1998.
C. Drummond, R. Holte, "Exploiting the cost (in)sensitivity of decision tree splitting criteria", Proc. of ICML, pp. 239-246, 2000.
R. Rakotomalala.