SIPINA incorporates various supervised learning approaches, but it is mainly intended for classification tree algorithms. If you want to use other learning algorithms, I suggest Tanagra, which implements a broad range of statistical and machine learning methods.

In this webpage, we mainly describe the induction tree algorithms (a French description is also available).

This approach is one of the proposed variants. The number of splits is determined a priori, before the induction. Outside the context of megainduction, this method is not really efficient; but when we handle a huge dataset, it gives satisfactory results, and the computing time is manageable because it is known in advance.

Compared with the original ID3, this version uses a chi-square test as a pre-pruning rule: when a split is not statistically significant, the growing process is stopped.
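The pre-pruning rule can be sketched as follows. This is a minimal illustration, not SIPINA's actual code: it computes the Pearson chi-square statistic of the contingency table induced by a candidate binary split on a two-class problem, and compares it to the hard-coded 5% critical value for one degree of freedom (the significance level and its critical value are assumptions for the example).

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for a contingency table
    (rows = leaves of the candidate split, columns = class values)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

# Candidate binary split on a two-class problem: df = (2-1)*(2-1) = 1.
# The chi-square critical value at the 5% level for 1 df is about 3.841.
table = [[30, 10],   # left leaf : 30 positive, 10 negative
         [12, 28]]   # right leaf: 12 positive, 28 negative
stat = chi2_statistic(table)
print(round(stat, 2), stat > 3.841)   # significant: the split is accepted
```

When the statistic falls below the critical value, the node is not split and becomes a leaf.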

Roughly speaking, when we split a node, we detect the leaf which minimizes the Shannon entropy. Then we compute a threshold (TL * Min.Entropy), where TL is the tolerance level, the main parameter of the method. When a leaf has an entropy greater than this threshold, it is merged with the "default leaf".

Tolerance level : It is the parameter for leaf merging. If TL < 1, all the leaves are merged. If TL = 1, we obtain a binary tree. If TL is high, we obtain the standard ID3.
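The merging rule above can be sketched in a few lines of Python. This is an illustrative reconstruction from the description, not SIPINA's implementation: each leaf whose Shannon entropy exceeds TL times the minimum leaf entropy is flagged for merging into the default leaf.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a class distribution."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def merge_mask(leaves, tl):
    """For each leaf (a list of class counts), tell whether it is merged
    into the default leaf: merged when entropy > TL * min(entropy)."""
    entropies = [shannon_entropy(c) for c in leaves]
    threshold = tl * min(entropies)
    return [h > threshold for h in entropies]

# Three leaves of a candidate split (class counts):
leaves = [[45, 5], [25, 25], [38, 12]]
print(merge_mask(leaves, tl=1.0))   # only the purest leaf is kept apart
print(merge_mask(leaves, tl=3.0))   # high TL: no merging, standard ID3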

Various parameters enable us to control the size of the tree.

Class frequency : If the proportion of one class value is greater than this threshold, we do not try to split the node.

Node weight : If the weight of a node, i.e. the node size divided by the root node size, is lower than this threshold, we do not try to split the node.
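These two stopping rules can be combined into a single pre-pruning check. The sketch below is hypothetical (the function and parameter names, and the default thresholds, are mine, not SIPINA's): a node is split only if no dominant class exceeds the class-frequency threshold and the node's weight relative to the root is large enough.

```python
def should_split(node_counts, root_size,
                 max_class_freq=0.95, min_node_weight=0.05):
    """Pre-pruning checks (hypothetical parameter names and defaults):
    - class frequency: stop when one class dominates beyond the threshold;
    - node weight    : stop when the node is too light relative to the root."""
    size = sum(node_counts)
    if max(node_counts) / size > max_class_freq:
        return False          # node is (almost) pure
    if size / root_size < min_node_weight:
        return False          # node too small to split reliably
    return True

print(should_split([30, 20], root_size=1000))  # True : worth splitting
print(should_split([97, 3], root_size=1000))   # False: 97% in one class
print(should_split([3, 2], root_size=1000))    # False: weight 0.005 too low
```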

The particularities of CHAID are: (1) the use of the chi-square statistic as the goodness-of-split criterion; (2) the merging of some leaves which come from the same node.

CHAID is widespread in commercial software. On the other hand, it is less often implemented in free software. This method is above all known to statisticians.

P-level for merging nodes : The merging process compares the class distributions in the leaves. It searches for the two most similar leaves; they are merged if the difference between their distributions is not significant. Statistical significance is determined by comparing the computed p-value of the test with this "p-level" threshold defined by the user. If we decrease the threshold, we obtain larger trees; if we increase the threshold, we obtain a more compact tree.
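One merging step can be sketched as follows. This is an illustrative reconstruction for the two-class case, not CHAID's full procedure: the pair of leaves with the largest chi-square p-value is merged when that p-value exceeds the user's p-level. For one degree of freedom the chi-square survival function is exactly erfc(sqrt(x/2)), so no statistics library is needed.

```python
import math
from itertools import combinations

def chi2_p_two_class(a, b):
    """p-value of the chi-square test (1 df) comparing the class
    distributions of two leaves on a two-class problem."""
    table = [a, b]
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
               / (row_tot[i] * col_tot[j] / n)
               for i in range(2) for j in range(2))
    return math.erfc(math.sqrt(stat / 2))   # chi2(1) survival function

def chaid_merge_step(leaves, p_level=0.05):
    """One merging step: pool the two most similar leaves when the
    difference between their distributions is not significant."""
    (i, j), p = max(((pair, chi2_p_two_class(leaves[pair[0]], leaves[pair[1]]))
                     for pair in combinations(range(len(leaves)), 2)),
                    key=lambda t: t[1])
    if p > p_level:
        merged = [x + y for x, y in zip(leaves[i], leaves[j])]
        return [l for k, l in enumerate(leaves) if k not in (i, j)] + [merged], p
    return leaves, p

leaves = [[40, 10], [38, 12], [10, 40]]
new_leaves, p = chaid_merge_step(leaves)
print(new_leaves)   # the two similar leaves are pooled together
```

In CHAID this step is repeated until no remaining pair of leaves can be merged at the chosen p-level.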

Bonferroni adjustment : Because we perform multiple tests in order to find the best split at each node, we increase the chance of finding a good solution "by chance". The true p-value of the tests must be corrected with this kind of adjustment. In my opinion, it is not really convincing in practice; it is easier to obtain the desired solutions by tuning the "p-level" parameters.

Max depth : It limits the depth of the tree. This parameter has no real theoretical justification, but it is useful when we handle a big dataset.

Min size of node to split : When a node has a size lower than this threshold, we do not attempt to split it.

Min size of leaves : When one of the leaves produced by a split has a size lower than this threshold, we reject the split.

Our version is based on Quinlan's book (1993). We implement mainly two key ideas: the gain ratio in order to select the right split attribute, and post-pruning according to the pessimistic error criterion in order to detect the right size of the tree.
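The gain ratio criterion can be sketched as follows. This is an illustrative computation, not SIPINA's code: the information gain of a split is divided by the split information, i.e. the entropy of the partition sizes, which penalizes attributes with many values.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent, children):
    """C4.5-style gain ratio: information gain divided by the split
    information (entropy of the partition sizes)."""
    n = sum(parent)
    gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
    split_info = entropy([sum(c) for c in children])
    return gain / split_info if split_info > 0 else 0.0

parent = [50, 50]                    # class counts at the node
children = [[40, 10], [10, 40]]      # class counts in the two leaves
print(round(gain_ratio(parent, children), 3))
```

During induction, the attribute maximizing this ratio is chosen to split the node.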

Quinlan's own implementation is available online. There are several options which can slightly modify the behavior of the algorithm, but they are not really essential.

Size of leaves : A split is accepted only if at least two of its leaves have a size greater than this threshold.

R. Kohavi, R. Quinlan, « Decision Tree Discovery », in Handbook of Data Mining and Knowledge Discovery, Klosgen & Zytkow Editors, Chapter 16.1.3, pages 267-276, Oxford University Press, 2002.

Beta : It is the parameter of the generalized entropy (see the online reference for detailed formulas).
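The exact formula used by SIPINA is given in the online reference; as an assumption for illustration, the sketch below uses one common family, the Daroczy entropy of degree beta, which tends to the Shannon entropy as beta tends to 1 and, at beta = 2, equals twice the Gini index.

```python
import math

def daroczy_entropy(counts, beta):
    """Daroczy entropy of degree beta (one common generalized entropy,
    assumed here for illustration); tends to Shannon entropy as beta -> 1."""
    n = sum(counts)
    probs = [c / n for c in counts]
    if abs(beta - 1.0) < 1e-12:
        return -sum(p * math.log2(p) for p in probs if p > 0)
    return (sum(p ** beta for p in probs) - 1) / (2 ** (1 - beta) - 1)

counts = [30, 20]
print(round(daroczy_entropy(counts, 1.0), 4))   # Shannon entropy
print(round(daroczy_entropy(counts, 2.0), 4))   # twice the Gini index
```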

Improved gain ratio : If it is checked, we use Quinlan's gain ratio instead of the standard information gain.

R. Rakotomalala, S. Lallich, S. Di Palma, « Studying the behavior of generalized entropy in induction trees using a m-of-n concept », Proc. of 3rd European Conf. on KDD, pp. 510-517, 1999.

I. Taneja, « Generalized Information Measures and their Applications » ; see chapter 3 « Entropy-type Measures - Entropy of Degree s and Degrees (r,s) », Departamento de Matematica, UFSC, Brazil, 2001.

Handling costs growing : If checked, the misclassification costs are exploited during the growing phase; this option is not really decisive.

Handling costs pruning : If checked, the misclassification costs are exploited during the pruning phase.

J. Bradford, C. Kunz, R. Kohavi, C. Brunk, C. Brodley, « Pruning decision trees with misclassification costs », Proc of 10th ECML, pp. 131-136, 1998.

C. Drummond, R. Holte, « Exploiting the cost of (in)sensitivity of decision tree splitting criteria », Proc. of ICML, pp.239-246, 2000.

R. Rakotomalala.