Releases TANAGRA -- december 18, 2013 -- version 1.4.50
Improvements have been introduced, a new component is added.
HAC. Hierarchical agglomerative clustering. Computing time has been dramatically improved. We will detail the new procedure used in a new tutorial soon.
CATVARHAC. Clustering of the levels of nominal variables. The calculations are based on the work of Abdallah and Saporta (1998). The component performs an agglomerative hierarchical clustering of the levels of qualitative variables. Dice's index is used as the distance measure. Three linkage criteria are proposed: single linkage, complete linkage and average linkage. A tutorial describing the method will follow soon.
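To make the distance concrete, here is a minimal Python sketch of a Dice-based dissimilarity between two levels, each encoded as a 0/1 indicator over the individuals. It uses the textbook form 1 - 2|A and B| / (|A| + |B|); the exact normalization used inside CATVARHAC may differ, and the function name and the toy indicators are illustrative assumptions.

```python
def dice_dissimilarity(a, b):
    """Dice dissimilarity between two 0/1 indicator vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("indicator vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty levels: treated as identical (a convention)
    return 1.0 - 2.0 * both / total


# Hypothetical indicators for three levels observed on six individuals.
smoker_yes = [1, 1, 0, 0, 1, 0]
coffee_high = [1, 0, 0, 0, 0, 0]
sport_often = [0, 0, 1, 1, 0, 1]

print(dice_dissimilarity(smoker_yes, coffee_high))  # 0.5
print(dice_dissimilarity(smoker_yes, sport_often))  # 1.0
```

A matrix of such pairwise values is the kind of input a single, complete or average linkage procedure would then work on.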
Releases TANAGRA -- september 15, 2013 -- version 1.4.49
Some enhancements regarding factor analysis approaches (PCA - principal component analysis, MCA - multiple correspondence analysis, CA - correspondence analysis, FAMD - factor analysis of mixed data) have been incorporated. In particular, the outputs are now more complete.
The VARIMAX rotation has been improved. Thanks to Frédéric Glausinger for optimized source code.
The Benzecri correction is added in the MCA outputs. Thanks to Bernard Choffat for this suggestion.
Releases TANAGRA -- december 1, 2012 -- version 1.4.48
Some new components have been added.
K-Means Strengthening. This component was suggested to me by Mrs. Claire Gauzente. The idea is to strengthen an existing partition (e.g. from a HAC) by using the K-Means algorithm. A comparison of the groups before and after optimization is provided, indicating the efficiency of the optimization. The approach can be plugged into any clustering algorithm in Tanagra. Thanks to Claire for this valuable idea.
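The idea of strengthening a partition can be sketched in a few lines of Python. This is an illustration under simple assumptions (squared Euclidean distance, plain reallocation until the labels stop changing), not the code of the Tanagra component; all names and the toy data are hypothetical.

```python
def centroid(points):
    """Mean of a non-empty list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def strengthen(points, labels, max_iter=20):
    """Refine an existing partition with K-Means reallocation steps."""
    labels = list(labels)
    for _ in range(max_iter):
        groups = sorted(set(labels))
        centers = {g: centroid([p for p, l in zip(points, labels) if l == g])
                   for g in groups}
        new_labels = [min(groups, key=lambda g: sq_dist(p, centers[g]))
                      for p in points]
        if new_labels == labels:
            break  # the partition is stable
        labels = new_labels
    return labels


# Toy data: the third point starts in the wrong group.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
initial = [0, 0, 1, 1, 1, 1]
final = strengthen(points, initial)
moved = sum(1 for a, b in zip(initial, final) if a != b)
print(final)  # [0, 0, 0, 1, 1, 1]
print(moved)  # 1
```

Counting the individuals that changed group, as in the last lines, is one simple way to compare the partition before and after optimization.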
Discriminant Correspondence Analysis. This is an extension of canonical discriminant analysis to discrete attributes (Hervé Abdi, 2007). The approach is based on a clever transformation of the dataset. The initial dataset is transformed into a crosstab: the values of the target attribute are in rows, and all the values of the input attributes are in columns. The algorithm performs a correspondence analysis on this new data table to identify the associations between the values of the target and those of the input variables. Thus, all the tools of correspondence analysis are available for a comprehensive reading of the results (factor scores, contributions, quality of representation).
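The transformation into a crosstab can be illustrated as follows. This Python sketch only builds the table (target values in rows, every attribute=value pair in columns); the correspondence analysis itself is not shown. The function name, the column labelling scheme and the toy dataset are assumptions made for the example.

```python
from collections import Counter


def dca_crosstab(rows, target, inputs):
    """Count co-occurrences of each target value with each 'attribute=value'."""
    counts = Counter()
    for row in rows:
        for attr in inputs:
            counts[(row[target], f"{attr}={row[attr]}")] += 1
    row_labels = sorted({r for r, _ in counts})
    col_labels = sorted({c for _, c in counts})
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


# Hypothetical dataset: one target attribute and two discrete inputs.
data = [
    {"wine": "red", "fruity": "yes", "woody": "no"},
    {"wine": "red", "fruity": "no", "woody": "yes"},
    {"wine": "white", "fruity": "yes", "woody": "no"},
]
row_labels, col_labels, table = dca_crosstab(data, "wine", ["fruity", "woody"])
print(row_labels)  # ['red', 'white']
print(col_labels)  # ['fruity=no', 'fruity=yes', 'woody=no', 'woody=yes']
print(table)       # [[1, 1, 1, 1], [0, 1, 1, 0]]
```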
Other components have been improved.
HAC. Once the number of groups has been chosen in the dendrogram of the Hierarchical Agglomerative Clustering, a last pass over the data assigns each individual of the learning sample to the group with the nearest centroid. Thus, there may be a discrepancy between the number of instances displayed on the tree nodes and the number of individuals in the groups. Tanagra displays the two partitions. Only the last one is used when Tanagra applies the clustering model to new instances, when it computes conditional statistics, etc.
Correspondence Analysis. Tanagra now provides the coefficients of the factor score functions for supplementary columns and rows in the factorial correspondence analysis. Thus, it is easy to calculate the factor scores of new points described by their row or column profile. Finally, the results tables can be sorted according to the contributions of the modalities to the factors.
Multiple correspondence analysis. Several improvements have been made to the multiple correspondence analysis: the component can take into account supplementary continuous and discrete variables; the variables can be sorted according to their contribution to the factors; all the indicators used for interpretation can be brought together in a single large table for a synthetic view of the results (this feature is especially useful when there are few factors); the coefficients of the factor score functions are provided, so the factorial coordinates of supplementary individuals can easily be calculated outside Tanagra.
Some tutorials will come soon to describe the use of these components on realistic case studies.
Releases TANAGRA -- september 24, 2012 -- version 1.4.47
Non iterative Principal Factor Analysis (PFA). This is an approach which tries to detect underlying structures in the relationships between the variables of interest. Unlike PCA, the PFA is focused only on the shared variances of the set of variables. It is suited when the goal is to uncover the latent structure of the variables. It works on a slightly modified version of the correlation matrix where the diagonal, the prior communality estimate of each variable, is replaced by its squared multiple correlation with all others.
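The modification of the correlation matrix can be sketched in Python. The squared multiple correlation of variable i with all the others equals 1 - 1 / r^ii, where r^ii is the i-th diagonal entry of the inverse correlation matrix. The helper names below are illustrative assumptions, and the small Gauss-Jordan inversion is only there to keep the example self-contained.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def reduced_correlation(corr):
    """Replace each diagonal entry by the variable's squared multiple correlation."""
    inv = invert(corr)
    reduced = [list(row) for row in corr]
    for i in range(len(corr)):
        reduced[i][i] = 1.0 - 1.0 / inv[i][i]
    return reduced


# With two variables, the squared multiple correlation is simply r squared.
reduced = reduced_correlation([[1.0, 0.6], [0.6, 1.0]])
print(round(reduced[0][0], 6))  # approximately 0.36
```

The factors are then extracted from this reduced matrix instead of the original correlation matrix.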
Harris Component Analysis. This is a non-iterative factor analysis approach. It tries to detect underlying structures in the relationships between the variables of interest. Like Principal Factor Analysis, it focuses on the shared variances of the set of variables. It works on a modified version of the correlation matrix.
Principal Component Analysis. Two functionalities are added: the reproduced and residual correlation matrices can be computed, and the variables can be sorted according to the loadings in the output tables.
These three components can be combined with the FACTOR ROTATION component (varimax or quartimax). They can also be combined with the re-sampling approaches for the detection of the relevant number of factors (PARALLEL ANALYSIS and BOOTSTRAP EIGENVALUES).
Releases TANAGRA -- september 1, 2012 -- version 1.4.46
AFDM (Factor analysis for mixed data). It extends principal component analysis (PCA) to data containing a mixture of quantitative and qualitative variables. The method was developed by Pagès (2004). A tutorial describing the use of the method and the reading of the results will follow.
Releases TANAGRA -- june 12, 2012 -- version 1.4.45
New features for the principal component analysis (PCA).
PRINCIPAL COMPONENT ANALYSIS. Additional outputs for the component: Scree plot and variance explained cumulative curve; PCA Correlation Matrix - Some outputs are provided for the detection of the significant factors (Kaiser-Guttman, Karlis-Saporta-Spinaki, Legendre-Legendre broken-stick test); PCA Correlation Matrix - Bartlett's sphericity test is performed and the Kaiser's measure of sampling adequacy (MSA) is calculated; PCA Correlation Matrix - The correlation matrix and the partial correlations between each pair of variables controlling for all other variables (the negative anti-image correlation) are produced.
PARALLEL ANALYSIS. The component calculates the distribution of eigenvalues for a set of randomly generated data. It proceeds by randomization. It applies to principal component analysis and to multiple correspondence analysis. A factor is considered significant if its observed eigenvalue is greater than the 95th percentile (this setting can be modified).
BOOTSTRAP EIGENVALUES. It calculates the confidence intervals of the eigenvalues with a bootstrap approach. A factor is considered significant if its eigenvalue is greater than a threshold which depends on the underlying factor method (PCA or MCA), or if the lower bound of its eigenvalue is greater than the upper bound of the following one. The default confidence level of 0.90 can be modified. This component can be applied to principal component analysis or to multiple correspondence analysis.
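The decision rule of parallel analysis can be sketched as follows, assuming the eigenvalues of the randomized datasets have already been computed. The percentile interpolation, the choice to stop at the first non-significant factor, the function names and the numbers are all assumptions made for this illustration; they are not taken from the Tanagra component.

```python
def percentile(values, q):
    """Empirical percentile (0 <= q <= 100) with linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def significant_factors(observed, simulated, q=95.0):
    """Count leading factors whose eigenvalue exceeds the q-th percentile
    of the eigenvalues obtained at the same rank on randomized data."""
    count = 0
    for rank, value in enumerate(observed):
        threshold = percentile([run[rank] for run in simulated], q)
        if value <= threshold:
            break  # assumption: only leading factors are retained
        count += 1
    return count


# Hypothetical eigenvalues for a PCA on three variables.
observed = [1.8, 0.9, 0.3]
simulated = [
    [1.3, 1.0, 0.7],
    [1.4, 1.1, 0.5],
    [1.2, 1.0, 0.8],
    [1.5, 0.9, 0.6],
]
print(significant_factors(observed, simulated))  # 1
```

In practice many more randomized runs would be used so that the 95th percentile is estimated reliably.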
JITTERING. Jittering feature is incorporated to the scatter plot components (SCATTERPLOT, CORRELATION SCATTERPLOT, SCATTERPLOT WITH LABEL, VIEW MULTIPLE SCATTERPLOT).
RANDOM FOREST. Unused memory is now released after the decision tree learning process. This feature is especially useful with ensemble learning approaches that store a large number of trees in memory (BAGGING, BOOSTING, RANDOM FOREST). Memory occupation is reduced and computation efficiency is improved.
Releases TANAGRA -- may 14, 2012 -- version 1.4.44
LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Update of the LIBSVM library for support vector machine algorithms (version 3.12, April 2012) [C - SVC, Epsilon-SVR, nu - SVR]. The calculations are faster. The attributes can be normalized or not. They were automatically normalized previously.
LIBCVM (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html; version 2.2). Incorporation of the LIBCVM library. Two methods are available: CVM and BVM (Core Vector Machine and Ball Vector Machine). The descriptors can be normalized or not.
TR-IRLS (http://autonlab.org/autonweb/10538). Update of the TR-IRLS library, for logistic regression on large datasets (large number of predictive attributes) [last available version – 2006/05/08]. The deviance is automatically provided. The display of the regression coefficients is more precise (higher number of decimals). The user can tune the learning algorithm, especially the stopping rules.
SPARSE DATA FILE. Tanagra can now handle the sparse data file format (see the SVMlight or libsvm file format). The data can be used for supervised learning or regression problems. A description of this kind of file is available online (http://c2inet.sce.ntu.edu.sg/ivor/cvm.html).
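For readers unfamiliar with the format, each line of an SVMlight/libsvm file holds a label followed by index:value pairs for the non-zero attributes only. The following Python sketch parses one such line; the function name is an assumption, and optional SVMlight tokens such as qid: are deliberately not handled.

```python
def parse_sparse_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    content = line.split("#", 1)[0].split()  # drop an optional trailing comment
    if not content:
        raise ValueError("empty line")
    label = float(content[0])
    features = {}
    for token in content[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


print(parse_sparse_line("+1 3:0.5 10:2"))  # (1.0, {3: 0.5, 10: 2.0})
```

Attributes that do not appear on a line are implicitly zero, which is what makes the format compact for datasets with many predictive attributes.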
INSTANCE SELECTION. A new component for the selection of the m first individuals among n in a branch of the diagram is available [SELECT FIRST EXAMPLES]. This option is useful when the data file is the result of the concatenation of the learning and test samples.
Releases TANAGRA -- march 28, 2012 -- version 1.4.43
A few bugs have been fixed and some new features added.
The computed contributions of individuals in PCA (PRINCIPAL COMPONENT ANALYSIS) have been corrected. They were not valid when working on a subsample of the data file. This error was reported by Mr. Gilbert Laffond.
The standardization of the factors after VARIMAX (FACTOR ROTATION) has been corrected so that their variance coincides with the sum of the squares of the correlations with the axes, and thus with the eigenvalue associated with the axis. This modification was suggested by Mr. Gilbert Laffond.
During the calculation of the confidence intervals of the PLS regression coefficients (PLS CONF. INTERVAL), an error could occur when the requested number of axes was greater than the number of predictor variables. This is now corrected. This error was reported by Mr. Alain Morineau.
In some circumstances, an error could occur in FISHER FILTERING, especially when Tanagra is run under Wine for Linux. Some additional checks have been introduced. This error was reported by Mr. Bastien Barchiési.
The check for missing values is now optional. Disabling it favors performance when processing very large files and restores the speed of version 1.4.41 and earlier.
The "COMPONENT / COPY RESULTS" menu sends information in HTML format. It is now compatible with the Calc spreadsheet of Libre Office 3.5.1; previously it worked only with Excel. Curiously, copying to OOCalc (the Open Office spreadsheet) is not possible at the present time (Open Office 3.3.0).
Releases TANAGRA -- february 4, 2012 -- version 1.4.42
The Tanagra.xla add-in for Excel can work now for both the 32 and 64-bit versions of EXCEL.
With the FastMM memory manager, Tanagra can address up to 3 GB under 32-bit Windows and 4 GB under 64-bit Windows. The processing capabilities, especially about the handling of large datasets, are improved.
The import of the tab-delimited text file format and of the xls file format (Excel 97-2003) has been made safer. Previously, the import was interrupted and the dataset truncated when an invalid line was read (with missing or inconsistent values). Now, Tanagra skips the line and continues with the next rows. The number of skipped lines is given in the import report.
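The skip-and-report behaviour can be illustrated with a short Python sketch. The validity test used here (wrong number of fields, or an empty field) is an assumption made for the example; the exact criteria applied by Tanagra are not described in these notes.

```python
def read_tab_delimited(lines):
    """Return (header, rows, skipped): rows with a wrong field count or an
    empty field are skipped and counted instead of stopping the import."""
    lines = [ln.rstrip("\n") for ln in lines]
    header = lines[0].split("\t")
    rows, skipped = [], 0
    for line in lines[1:]:
        fields = line.split("\t")
        if len(fields) != len(header) or any(f.strip() == "" for f in fields):
            skipped += 1
            continue
        rows.append(fields)
    return header, rows, skipped


# Hypothetical file content: the third and fourth lines are invalid.
content = ["age\tincome\n", "25\t3000\n", "31\n", "40\t\n", "52\t4100\n"]
header, rows, skipped = read_tab_delimited(content)
print(rows)     # [['25', '3000'], ['52', '4100']]
print(skipped)  # 2
```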
Releases TANAGRA -- september 22, 2011 -- version 1.4.41
A PRIORI PT. This component generates association rules. It is based on Borgelt's apriori.exe program, which has recently been updated (2011/09/02 - version 5.57). The improvement of this new version, in terms of calculation time, is impressive.
FREQUENT ITEMSETS. Also based on Borgelt's apriori.exe program (version 5.57), this component generates frequent (or closed, maximal, generator) itemsets.
Some tutorials are coming soon to describe the use of these new tools.
Releases TANAGRA -- july 05, 2011 -- version 1.4.40
A few improvements for this new version.
A new add-on for the connection between Tanagra and recent versions of the OpenOffice Calc spreadsheet has been created. The old one did not work with recent versions (OpenOffice 3.3 and LibreOffice 3.4). During the installation process, another library is added ("TanagraModule.oxt") so as not to interfere with the old one, which still works with earlier versions of Open Office (3.2 and earlier). A tutorial describing its installation and use will be put online soon. I take this opportunity to highlight again how convenient a privileged connection between a spreadsheet and a specialized Data Mining tool is. The annual poll organized by the kdnuggets.com website shows the interest of this connection (2011, 2010, 2009,...). We note that there is a similar add-on for the R software (R4Calc). This change was suggested by Jérémy Roos (OpenOffice) and Franck Thomas (LibreOffice).
The non-standardized PCA is now available. It is obtained by unchecking the data standardization option in the Principal Component Analysis component. Change suggested by Elvire Antanjan.
Simultaneous regression has been introduced. It is very similar to the method programmed in LazStats, which is unfortunately no longer freely available. The approach is described in a free online booklet, "Practice of linear regression analysis" (in French) (section 3.6).
The color codes according to the p-value have been introduced for the Linear Correlation component. Change suggested by Samuel KL.
Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.
Releases TANAGRA -- may 26, 2011 -- version 1.4.39
Some minor corrections for the Tanagra 1.4.39 version.
For the PCA (principal component analysis) component, when all the factors were requested, none were generated. Reported by Jérémy Roos.
In the previous 1.4.38 version, the results of the Multinomial Logistic Regression were not consistent with the tutorial on the website. The calculations were wrong. Reported by Nicole Jurado.
It was not possible to obtain the scores from the PLS-DA component (Partial Least Squares Regression - Discriminant Analysis). Reported by Carlos Serrano.
All these bugs are corrected in the 1.4.39 version.
Once again, thank you very much to all those who help me to improve this work by their comments or suggestions.
Releases TANAGRA -- february 04, 2011 -- version 1.4.38
Some minor corrections for the Tanagra 1.4.38 version.
The color codes for the normality tests have been harmonized (Normality Test). In some configurations, the colors associated with the p-values were not consistent, which could mislead users. This problem was reported by Lawrence M. Garmendia.
Following indications from Mr. Oanh Chau, I realized that the standardization of variables for the HAC (hierarchical agglomerative clustering) was based on the sample standard deviation. This is not an error in itself, but the sum of the level indices in the dendrogram was then not consistent with the TSS (total sum of squares), which is unwelcome. The difference is especially noticeable on small datasets and disappears as the dataset size increases. The correction has been introduced. Now the BSS ratio is equal to 1 for the trivial partition, i.e. one individual per group.
Multiple linear regression (MULTIPLE LINEAR REGRESSION) now displays the matrix (X'X)^(-1). It makes it possible to deduce the variance-covariance matrix of the coefficients (by multiplying the matrix by the estimated variance of the error). It can also be used in generalized tests on the model coefficients.
Last, the outputs of the descriptive discriminant analysis (CANONICAL DISCRIMINANT ANALYSIS) were improved. The group centroids (Group centroids) on the factorial axes are directly provided.
Thank you very much to all those who help me to improve this work by their comments or suggestions.
Releases TANAGRA -- october 19, 2010 -- version 1.4.37
Naive Bayes Continuous is a supervised learning component. It implements the naive Bayes principle for continuous predictors (Gaussian assumption, heteroscedasticity or homoscedasticity). Its main originality is that it provides an explicit model corresponding to a linear combination of the predictors and, possibly, their squares.
Enhancement of the reporting module.
Releases TANAGRA -- march 23, 2010 -- version 1.4.36
ReliefF is a component for automatic variable selection in a supervised learning task. It can handle both continuous and discrete descriptors. It can be inserted before any supervised method.
Naive Bayes was modified. It now describes the prediction model in an explicit form (a linear combination), which is easy to understand and to deploy.
Releases TANAGRA -- january 19, 2010 -- version 1.4.35
CTP. The method for detecting the right size of the tree has been modified for the "Clustering Tree" with post-pruning component (CTP). It relies both on the angle between the half-lines at each point of the curve of the decreasing WSS (within-group sum of squares) computed on the growing sample, and on the decrease of the same indicator computed on the pruning sample. Compared to the previous implementation, it results in a smaller number of clusters.
Regression Tree. The previous modification is incorporated into the Regression Tree component which is a univariate version of CTP.
C-RT Regression Tree. A new regression tree component was added. It faithfully implements the technique described in the book by Breiman et al. (1984), including the post-pruning part with the 1-SE Rule (Chapter 8, especially p. 226 about the formula for the variance of the MSE).
C-RT. The report of the induction of decision tree C-RT has been completed. Based on the last column of the post-pruning table, it becomes easier to choose the parameter x (in x-SE Rule) to arbitrarily define the size of the pruned tree.
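The x-SE rule mentioned above can be sketched in Python: among the pruned subtrees, keep the smallest one whose error does not exceed the best error plus x times its standard error. The tuple layout, the function name and the numbers below are assumptions made for the illustration; they do not reproduce the C-RT post-pruning table.

```python
def select_tree(candidates, x=1.0):
    """Pick the smallest tree whose error is within x standard errors of the best.

    candidates: list of (n_leaves, error, std_error) tuples, one per pruned subtree.
    """
    best = min(candidates, key=lambda c: c[1])
    threshold = best[1] + x * best[2]
    eligible = [c for c in candidates if c[1] <= threshold]
    return min(eligible, key=lambda c: c[0])


# Hypothetical post-pruning sequence: (leaves, error on pruning set, std error).
sequence = [
    (1, 0.50, 0.04),
    (3, 0.30, 0.03),
    (5, 0.26, 0.03),
    (9, 0.25, 0.03),
    (15, 0.27, 0.03),
]
print(select_tree(sequence, x=1.0)[0])  # 5
print(select_tree(sequence, x=0.0)[0])  # 9
print(select_tree(sequence, x=2.0)[0])  # 3
```

With x = 0 the rule returns the tree with the minimum error; larger values of x trade a little accuracy for a simpler tree.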
Some tutorials will describe these various changes soon.
Releases TANAGRA -- november 22, 2009 -- version 1.4.34
A component for the induction of predictive rules (RULE INDUCTION) was added under the "Supervised Learning" tab. Its use is described in a tutorial available online (it will be translated soon).
The DECISION LIST component has been improved: the test performed during the pre-pruning process has been changed. The formula is described in the tutorial above.
The SAMPLING and STRATIFIED SAMPLING components (Instance Selection tab) have been slightly modified. It is now possible to set the seed of the pseudorandom number generator.
Following a remark from Anne Viallefont, the calculation of the degrees of freedom in tests on contingency tables is now more generic. Indeed, the calculation was wrong when the database was filtered and some margins (row or column) were equal to zero. Anne, thank you for this information. More generally, thank you to everyone who sent me comments. Programming has always been a kind of leisure for me. The real work starts when it is necessary to check the results, compare them with the available references, cross-check them with other data mining tools, free or not, understand the possible differences, etc. At this step, your help is really valuable.
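The issue with zero margins can be illustrated by a small Python sketch: rows and columns whose total is zero are left out before applying the usual (rows - 1) x (columns - 1) formula. This is an illustration of the principle, with a hypothetical function name and table, not the Tanagra code.

```python
def chi2_degrees_of_freedom(table):
    """(rows - 1) * (cols - 1), counting only rows and columns with a non-zero total."""
    n_rows = sum(1 for row in table if sum(row) > 0)
    n_cols = sum(1 for col in zip(*table) if sum(col) > 0)
    return max(n_rows - 1, 0) * max(n_cols - 1, 0)


# After filtering, the second row and the second column are empty.
filtered = [
    [10, 0, 5],
    [0, 0, 0],
    [4, 0, 6],
]
print(chi2_degrees_of_freedom(filtered))  # 1, not (3 - 1) * (3 - 1) = 4
```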
Releases TANAGRA -- october 3, 2009 -- version 1.4.33
Several logistic regression diagnostics and evaluation tools were implemented. One of them (the reliability diagram) can be applied to any supervised method.
A tutorial describing the utilization of these tools will be available soon.
Releases TANAGRA -- april 15, 2009 -- version 1.4.31
Thierry Leiber has improved the add-on that connects Tanagra and Open Office. It is now possible, on Linux, to install the add-on for Open Office and to launch Tanagra directly after selecting the data (see the tutorials on installing Tanagra under Linux and on integrating the add-on in Open Office). Thierry, thank you very much for this contribution, which helps the users of Tanagra.
Following a suggestion by Mr. Laurent Bougrain, the confusion matrix is added to the automatic saving of results in experiments. Thank you to Laurent, and to all others whose constructive comments help me move in the right direction.
In addition, two new components for regression using the support vector machine principle (support vector regression) were added: Epsilon-SVR and Nu-SVR. A tutorial presenting these methods and comparing our results with those of the R software will be available soon. Tanagra, like the R package "e1071", is based on the famous LIBSVM library.
Tutorials about these releases are coming soon.
Releases TANAGRA -- february 6, 2009 -- version 1.4.30
The main change is the integration of the FastMM library (http://sourceforge.net/projects/fastmm/). Memory allocations have been optimized. This affects mainly the processing time when importing data. It is really much faster now.
The EXPORT DATASET component (DATA VISUALIZATION tab) can now export files in ARFF (Weka) and EXCEL (97 -> XP) format. For Excel, the export is limited to 65,534 rows and 256 columns.
The components A PRIORI MR, SPV ASSOC TREE and SPV ASSOC RULE have been updated. They are described in several tutorials (http://data-mining-tutorials.blogspot.com/).
Releases TANAGRA -- january 6, 2009 -- version 1.4.29
Thierry Leiber pointed out a bug in the FACTOR ROTATION component: the projections for new individuals were wrong. I changed the calculation. Thank you very much, Thierry.
This new version is characterized by a series of components dedicated to the incorporation of misclassification costs in the supervised learning process.
Components for cost sensitive learning | Description |
POSTERIOR PROB (Scoring tab) | This component computes, for each example of the dataset, the posterior probabilities, i.e. the conditional probability of each value of the class attribute. |
CS-CRT (Spv Learning tab) | The CART approach (Breiman et al, 1984) where we want to minimize the expected misclassification cost. The misclassification cost matrix is the most important parameter of this component. |
CS-MC4 (Spv Learning tab) | This is a cost sensitive version of C4.5 (Chauchat and Rakotomalala, 2001). It also tries to minimize the expected misclassification cost. |
COST SENSITIVE LEARNING (Meta-Spv Learning tab) | This is a generic component which adjusts the prediction of a supervised learning algorithm by incorporating the misclassification cost matrix. The goal is to minimize the overall expected misclassification cost. |
COST SENSITIVE BAGGING (Meta-Spv Learning tab) | This is a generic component which implements a bagging process. The prediction of each individual model is adjusted according to the misclassification costs matrix. |
MULTICOST (Meta-Spv Learning tab) | It is a version of MetaCost (Domingos, 1999). The difference is that the individual models are already adjusted. The final model is a single model. This is the main advantage of this approach. |
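The principle shared by the components in this table, choosing the prediction that minimizes the expected misclassification cost, can be sketched in Python. The function name, the dictionary layout for the cost matrix and the numbers are assumptions made for the example.

```python
def min_cost_prediction(posteriors, costs):
    """Return the label with the lowest expected misclassification cost.

    posteriors: {true_label: P(true_label | x)}
    costs: {(true_label, predicted_label): cost}
    """
    labels = sorted(posteriors)

    def expected_cost(predicted):
        return sum(posteriors[t] * costs[(t, predicted)] for t in labels)

    return min(labels, key=expected_cost)


# Hypothetical case: missing a positive is ten times worse than a false alarm.
posteriors = {"neg": 0.8, "pos": 0.2}
costs = {
    ("neg", "neg"): 0, ("neg", "pos"): 1,
    ("pos", "neg"): 10, ("pos", "pos"): 0,
}
print(min_cost_prediction(posteriors, costs))  # pos
```

With this cost matrix the prediction is "pos" even though "neg" is the most probable class, because the expected cost of predicting "neg" (0.2 x 10 = 2) exceeds that of predicting "pos" (0.8 x 1 = 0.8).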
Releases TANAGRA -- october 26, 2008 -- version 1.4.28
Some minor modifications. Mainly, the outputs of the K-Means component (and of other clustering methods) have been improved.
Releases TANAGRA -- august 22, 2008 -- version 1.4.27
New components for nonparametric statistical tests have been added.
Component | Description |
K-S 2-Sample Test | Kolmogorov-Smirnov test of equality of one-dimensional probability distributions of two samples. Kuiper and Cramer - von Mises statistics are also computed. See Kolmogorov-Smirnov test. |
Ansari-Bradley Scale Test | Test for the differences in scale between (K >= 2) independent samples. |
Mood Scale Test | Test for the differences in scale between (K >= 2) independent samples. |
Klotz Scale Test | Test for the differences in scale between (K >= 2) independent samples. |
FYTH 1-way ANOVA | Fisher-Yates-Terry-Hoeffding test. Test for differences in central tendency between (K >= 2) independent samples. This method is closely related to the Wilcoxon-Mann-Whitney test. |
Median Test | Median test. Test for differences in central tendency between (K >= 2) independent samples. See Median test. |
Van der Waerden 1-way ANOVA | Van der Waerden test. Test for differences in central tendency between (K >= 2) independent samples. This method is closely related to the Wilcoxon-Mann-Whitney test. |
Cochran's Q-test | Cochran's Q test. It is an extension of the McNemar test. Test for differences in proportions between (K >= 2) related (matched) samples. |
Releases TANAGRA -- july 17, 2008 -- version 1.4.26
New components about univariate and multivariate tests for homogeneity.
Component | Description |
ANOVA Randomized Blocks | Analysis of variance for randomized blocks/repeated measures design. |
Paired V-Test | Comparison of variances for 2 related samples. |
Welch ANOVA | One-way analysis of variance, with the heteroscedasticity assumption (unequal variance). |
Hotelling's T2 | Multivariate comparison of 2 vectors of means. |
Hotelling's T2 Heteroscedastic | Multivariate comparison of 2 vectors of means with the assumption of unequal variance-covariance matrices. |
Box's M Test | Comparison of K (K >= 2) variance covariance matrices for independent samples. |
Releases TANAGRA -- june 17, 2008 -- version 1.4.25
Two components have been added: PARTIAL CORRELATION and SEMI-PARTIAL CORRELATION. The use of these components is detailed in tutorials (1 and 2).
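As a reminder of what these two coefficients measure, here is a Python sketch of the first-order case (a single control variable z), computed from the three pairwise correlations. The function names are assumptions, and the Tanagra components may handle several control variables, which this sketch does not.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing z from both."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def semi_partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between y and the part of x not explained by z."""
    return (r_xy - r_xz * r_yz) / sqrt(1 - r_xz ** 2)


# Hypothetical correlations: most of the x-y link goes through z.
print(round(partial_correlation(0.5, 0.6, 0.8), 4))
print(round(semi_partial_correlation(0.5, 0.6, 0.8), 4))
```

In both cases the numerator removes the part of the x-y correlation that is carried by z; the two coefficients differ only in whether z is also removed from y in the denominator.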
The output of the PLSR component (Partial Least Square Regression - version 1.4.24) below is completed with graphical charts (scatter plot between w* and c vectors).
Releases TANAGRA -- may 25, 2008 -- version 1.4.24
A new component for PLS regression is proposed (PLSR). It combines the features of two former components (still available): PLS Factorial, which produces scores, and PLS Regression, which produces predictions and residuals. The report has been improved so as to move towards the standards of specialized software. A tutorial detailing the features of the component is available. Its particularity is that it compares the results in detail with those of state-of-the-art software such as SIMCA-P, SAS with PROC PLS, R with the PLS package, and SPAD with its PLS regression component (the tutorial is currently in French; it will be translated soon).
On quite another subject, a second component for the univariate detection of outliers has been implemented (UNIVARIATE OUTLIER DETECTION, see tutorial).
Finally, a short review of the Tanagra project today. It contains 213780 lines of code; 139 methods are implemented; nearly 70 tutorials in English (about 90 in French) accompany the software distribution. The Tanagra website has an average of 160 visitors daily (STATCOUNTER, over the period January 2008 to April 2008).
Releases TANAGRA -- may 8, 2008 -- version 1.4.23
With this version 1.4.23 of TANAGRA, we wanted to highlight the supervised learning methods based on PLS regression, commonly called PLS Discriminant Analysis (with three components: C-PLS, PLS-DA, PLS-LDA). PLS regression is very popular in many research areas, but it is less widespread in the machine learning community. Yet its characteristics are very interesting, even decisive in some contexts, especially when the descriptors are very numerous and highly redundant. This kind of situation occurs frequently in real DATA MINING problems.
In a tutorial, we show how to implement these methods with TANAGRA, how to read and interpret the results.
Releases TANAGRA -- april 5, 2008 -- version 1.4.22
Some minor modifications have been made (scatterplot of the misclassification error rate for C-RT and scatterplot of the within-point scatter for the clustering tree). Above all, two minor bugs are fixed:
Releases TANAGRA -- december 11, 2007 -- version 1.4.21
Two components for variable selection in logistic regression were added (FEATURE SELECTION tab): forward selection (FORWARD-LOGIT) and backward elimination (BACKWARD-LOGIT). All the steps of the computation can be traced. A tutorial describes the use of these components.
The Multinomial logistic regression component (MULTINOMIAL LOGISTIC REGRESSION, LEARNING SPV tab) has been improved. Significance tests for the variables, in each equation and globally, based on Wald statistics, are now provided. A tutorial describes the use of the component.
Releases TANAGRA -- october 20, 2007 -- version 1.4.20
Few visible changes but many internal improvements. GongYu has completed the upgrade of several libraries; he also tracked down a large majority of the memory leaks in the source code. A big, big thank you, GongYu, for your impressive work. He is currently working on a version derived from Tanagra.
Releases TANAGRA -- october 1st, 2007 -- version 1.4.19
New components for measures of association between ordinal variables are added: Goodman and Kruskal's Gamma, Kendall's Tau-b and Tau-c, Somers' d (NONPARAMETRIC STATISTICS tab).
A tutorial about the utilization of these components is available.
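As an illustration of this family of measures, Goodman and Kruskal's Gamma compares the number of concordant pairs C and discordant pairs D: Gamma = (C - D) / (C + D). The Python sketch below works on raw paired observations with a simple quadratic loop; the function name and the toy data are assumptions, and the component itself may work from the contingency table instead.

```python
def goodman_kruskal_gamma(x, y):
    """Gamma = (C - D) / (C + D) over all pairs; tied pairs are ignored."""
    concordant = discordant = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Ordinal ratings coded as integers (hypothetical data).
print(goodman_kruskal_gamma([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(goodman_kruskal_gamma([1, 2, 3], [3, 2, 1]))        # -1.0
```

Kendall's Tau-b and Tau-c and Somers' d use the same C and D counts but differ in how tied pairs enter the denominator.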
Some modifications have been made :
Releases -- may 28th, 2007 -- version 1.4.18
New components for measures of association between nominal variables are added: Goodman and Kruskal's Lambda, Goodman and Kruskal's Tau, Theil's U (NONPARAMETRIC STATISTICS tab).
A tutorial about the use of these components is available. The component which builds a contingency table and performs a CHI-SQUARE test of independence is also presented (CONTINGENCY CHI-SQUARE).
Releases -- may 8th, 2007 -- version 1.4.17
New tools for Regression Analysis are available:
BACKWARD stepwise variable selection (BACKWARD ELIMINATION REG)
Two new components (OUTLIER DETECTION and DFBETAS) for the detection of outliers and influential points. The usual indicators are available (leverage, DFFITS, COVRATIO, Cook's distance, DFBETAS). See tutorial.
Releases -- march 1st, 2007 -- version 1.4.16
Variable clustering components have been added. They rely on the same principle: "clustering around the latent components" (Vigneau and Qannari, 2003). Three methods are implemented: VARKMEANS (reallocation approach), VARHCA (hierarchical agglomerative method), and VARCLUS (top-down or divisive method). The latter is a simplified alternative to the procedure found in some software.
A tutorial details the utilization of these components.
Releases -- february 11, 2007 -- version 1.4.15
A new component, CORRESPONDENCE ANALYSIS, is added. It is a factorial analysis technique, especially intended for the exploration and description of cross-tables.
The formulas and the tutorial are based on a very good French book: Lebart, Morineau and Piron, "Statistique Exploratoire Multidimensionnelle", Dunod, 2000 (pages 67 to 107 for Correspondence Analysis).
A tutorial on how to perform Correspondence Analysis with TANAGRA is available.
Releases -- february 07th, 2007 -- version 1.4.14
"Factorial Analysis" reports are improved. They are filled out according to the standard description provided by two famous French books: G. Saporta's book for PCA (« Probabilités, Analyse de Données et Statistique », Dunod, 2006, pages 177 to 181); and M. Tenenhaus' book for MCA (« Méthodes Statistiques en Gestion », Dunod, 1996, pages 212 to 222).
Releases -- january 27th, 2007 -- version 1.4.13
Feature selection, i.e. the search for the relevant variables, is a key activity in data mining. In the new 1.4.13 version we add the STEPDISC method (Stepwise Discriminant Analysis). It is especially intended for linear discriminant analysis but it can be used in other contexts (see STEPDISC tutorial).
Releases -- december 22nd, 2006 -- version 1.4.12
A new add-on makes it possible to automatically transfer a dataset from the Open Office Calc spreadsheet to Tanagra (see tutorial).
In addition to the usual tutorials (PDF), animated tutorials are now available to show the functionalities of Tanagra (see OOoCalc for instance).
It is now possible to run a keyword search across the tutorials on the whole website (see the demonstration).
Releases -- november 22nd, 2006 -- version 1.4.11
An EXCEL add-in (TANAGRA.XLA) is now available. It adds a new menu to EXCEL and enables us to prepare and transfer a dataset to TANAGRA directly from the EXCEL spreadsheet (see tutorial).
This approach is an alternative to software where the specification menu and the reports are embedded in the spreadsheet, such as XLMINER or XLSTAT. Our add-in is compatible with all EXCEL versions from 97 to 2007.
Releases -- october 31st, 2006 -- version 1.4.10
A detailed tooltip is now available on each component. It briefly describes the underlying method and, above all, the requirements for using the component.
The confidence intervals of the PLS regression coefficients are now computed using a bootstrap scheme. This was suggested by Rainer Block.
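To illustrate the general idea of a bootstrap confidence interval, here is a minimal sketch. It is not the TANAGRA implementation: it uses a plain least-squares slope as a stand-in for a PLS coefficient, and the percentile method, the number of resamples, and the function names (`ols_slope`, `bootstrap_ci`) are all assumptions made for the example.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x (a stand-in for a PLS coefficient)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the slope (assumed scheme)."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        # Resample the observations (pairs) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if max(bx) == min(bx):
            continue  # degenerate resample: the slope is undefined
        estimates.append(ols_slope(bx, by))
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Synthetic data: y = 2x + Gaussian noise.
noise = random.Random(42)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + noise.gauss(0.0, 1.0) for x in xs]

slope = ols_slope(xs, ys)
low, high = bootstrap_ci(xs, ys)
print(f"slope = {slope:.3f}, 95% bootstrap CI = [{low:.3f}, {high:.3f}]")
```

The same resampling loop applies to any estimator: only the function called on each resample changes.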
Releases -- september 01st, 2006 -- version 1.4.9
New components :
Rainer Block reported some bugs and suggested some modifications and new components. Thank you very much for your helpful comments, Rainer.
Releases -- july 03, 2006 -- version 1.4.8
Parts of the diagram can be saved/loaded. We can thus apply the same analysis to various datasets (see tutorial).
Dendrogram (HAC) and Correlation circle (PCA) are available (see tutorial).
Luc Sorel sent me the following note about using TANAGRA under Linux: "In order to use TANAGRA under Linux (Kubuntu 5.10, based on Debian), we must first of all install the software under Wine, before starting it under Linux". Thank you very much, Luc!
Releases -- may 18, 2006 -- version 1.4.7
It is now possible to copy/paste a component or part of the diagram. The parameters of the components are also duplicated, so strictly identical treatments can be carried out on various sets of variables in the diagram (see the following tutorial -- will be translated soon).
A regression tree component very similar to Breiman's algorithm (Breiman et al., 1984) is available.
Releases -- may 3, 2006 -- version 1.4.6
A new clustering component, CLUSTERING TREES, was added. It adapts top-down induction of decision trees to clustering. The groups are described with logical rules, and the approach automatically selects the relevant features. Our main references are Chavent (1998) and Blockeel (1998) (see tutorials/exploratory data analysis section).
The PLS regression was adapted to the classification task (C-PLS). We obtain a linear classifier, and we can control the learning bias through the number of axes retained. This approach is useful in high dimensional spaces and gives performance similar to that of a linear SVM.
The Binary Logistic Regression was improved (see Nakache and Confais, 2005, pp. 82 and 162).
Releases -- february 12, 2006 -- version 1.4.4
A new visual component, GROUP EXPLORATION, makes it possible to explore a group of individuals manually. It is a generalization of GROUP CHARACTERIZATION. A small tutorial shows how to use this component.
Releases -- january 31, 2006 -- version 1.4.3
A "supervised" association rule mining component was added. It is simply an association rule generator for which we can specify the item in the consequent of the rule.
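As an illustration of what fixing the consequent means, here is a minimal brute-force sketch. It is not the algorithm used in TANAGRA: the function name `supervised_rules`, the thresholds, and the toy transactions are assumptions made for the example, and the exhaustive enumeration is only practical on very small item sets.

```python
from itertools import combinations


def supervised_rules(transactions, target, min_support=0.3,
                     min_confidence=0.7, max_len=2):
    """Return rules 'antecedent -> target' with the consequent fixed to `target`."""
    n = len(transactions)
    # Candidate antecedent items: everything except the target item itself.
    items = sorted({item for t in transactions for item in t if item != target})
    rules = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(items, size):
            a = set(antecedent)
            n_antecedent = sum(1 for t in transactions if a <= t)
            if n_antecedent == 0:
                continue
            n_both = sum(1 for t in transactions if a <= t and target in t)
            support = n_both / n                 # P(antecedent and target)
            confidence = n_both / n_antecedent   # P(target | antecedent)
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, target, support, confidence))
    return rules


# Hypothetical toy transactions.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "jam"},
    {"bread", "butter", "jam"},
]

rules = supervised_rules(transactions, target="jam")
for antecedent, consequent, support, confidence in rules:
    print(f"{set(antecedent)} -> {consequent}  "
          f"support={support:.2f} confidence={confidence:.2f}")
```

Every rule returned has the chosen item as its consequent, so the output can be read as a description of the conditions under which that item appears.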
Another important change in the structure of TANAGRA: the possibility of launching an external program was added. To test this technology, we decided to integrate Christian Borgelt's association rule mining program based on a prefix tree, which is particularly powerful both in speed and in memory usage. The results are impressive: building and loading the temporary files is transparent and does not significantly increase the computation time, even on a big dataset (90 MB).
Releases -- january 9, 2006 -- version 1.4.2
Quinlan's C4.5 decision tree algorithm is added.
C-SVC, a multi-class support vector machine for classification tasks from LIBSVM, is added. It is a very efficient library; the original source code was compiled into a DLL which is automatically downloaded. We integrate version 2.8. See the website for more information about the algorithms used.
Releases -- october 24, 2005 -- version 1.4.1
A PLS Regression component is added (Partial Least Squares). Our main reference was M. Tenenhaus, "La Régression PLS -- Théorie et Pratique", Technip, 1998 (chapters 7 to 10).
This component was implemented by Jean-François Grange (Master IDS -- Université Lyon 2).
Releases -- august 15, 2005 -- version 1.3.4 & 1.3.5
Various components are added.
Method | Description and References |
Forward Selection for Regression | Forward stepwise regression: it starts with an empty model and at each step adds the most significant variable using a partial correlation measure. |
Friedman & Kendall K-related Samples Tests | Nonparametric two-way analysis of variance for related samples. |
Kendall's Concordance W | Checks the agreement among raters on the rank ordering of individuals. |
MANOVA | One-way multivariate analysis of variance. |
Canonical Discriminant Analysis | Canonical discriminant analysis produces canonical variables which maximize the distance between groups. |
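As an illustration of the Kendall's Concordance W row above, here is a minimal sketch of the coefficient for complete rankings without ties. The function name `kendalls_w` and the toy rankings are assumptions for the example; the tie correction and the associated significance test are left out.

```python
def kendalls_w(rankings):
    """Kendall's W for m raters ranking the same n items (ranks 1..n, no ties)."""
    m = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks received by each item across raters.
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    # Sum of squared deviations of the rank totals from their mean.
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three raters in complete agreement, then two raters in complete disagreement.
agree = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
disagree = [[1, 2, 3], [3, 2, 1]]

print(kendalls_w(agree))
print(kendalls_w(disagree))
```

W ranges from 0 (no agreement among raters) to 1 (identical rankings).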
Releases -- july 31, 2005 -- version 1.3.3
A new goodness-of-fit component, NORMALITY TEST, is added: it tests the compatibility of the empirical distribution with the normality hypothesis.
Various methods are available; the main restriction is that the sample size must be greater than 4.
Method | Description and References |
Shapiro Wilk Test | This test is the most popular for checking normality. Our main references are Patrick Royston's papers: (1) "An Extension of Shapiro and Wilk's W Test for Normality to Large Samples", Applied Statistics, 31, 115--124, 1982; (2) "A Remark on Algorithm AS 181: The W Test for Normality", Applied Statistics, 44, 547--551, 1995. More details are available from NIST. Our implementation is a conversion into the DELPHI language of the FORTRAN source code from STATLIB (R94). NB: this implementation is valid only if n <= 5000. |
Kolmogorov-Smirnov & Lilliefors Test | This test is based on the Kolmogorov-Smirnov statistic. The Kolmogorov-Smirnov test may be used for various distributions; here we use it for normality checking with the distribution parameters (mean, standard deviation) estimated from the data, so we must use the specific critical values supplied by Lilliefors (1967). |
Anderson-Darling Test | This test is a modification of the Kolmogorov-Smirnov (K-S) test that gives more weight to the tails of the distribution than the K-S test does. Critical values for normality checking are available for various significance levels. |
D'Agostino Test | The D'Agostino test uses the SKEWNESS and the KURTOSIS computed on the dataset. The main reference is D'Agostino, "Test for Normal Distribution", in R. B. D'Agostino and M. A. Stephens, editors, Goodness-of-Fit Techniques, Marcel Dekker, Inc., 1986. Our implementation uses the following description. |
Releases -- july 25, 2005 -- version 1.3.2
Some parametric statistical tests are added.
The main reference for these components is the NIST/SEMATECH e-Handbook of Statistical Methods (NIST).
Component | Tests for Equal Means |
T-Test | Two-samples, equal variances assumed. |
T-Test Unequal Variance | Two-samples, unequal variances. |
Component | Tests for Equality of Variances |
Fisher's Test | Two-samples, this test is very sensitive to departures from normality. |
Bartlett's Test | K-samples, this test is very sensitive to departures from normality. |
Levene's Test | K-samples, this test is less sensitive to departures from normality. |
Brown & Forsythe's Test | K-samples, this test is an extension of Levene's test |
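To make the relation between the last two rows concrete, here is a minimal sketch, not the TANAGRA code. It assumes the usual textbook definitions: both tests run a one-way ANOVA on absolute deviations from a group centre, Levene's test taking the group mean and the Brown & Forsythe variant taking the group median. The function names and sample values are hypothetical, and only the F statistic is computed (no p-value).

```python
from statistics import mean, median


def anova_f(groups):
    """One-way ANOVA F statistic for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [mean(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2
                  for g, m in zip(groups, group_means))
    within = sum((x - m) ** 2
                 for g, m in zip(groups, group_means) for x in g)
    return (between / (k - 1)) / (within / (n - k))


def spread_statistic(groups, center):
    """ANOVA F computed on absolute deviations from each group's centre."""
    deviations = [[abs(x - center(g)) for x in g] for g in groups]
    return anova_f(deviations)


def levene(groups):
    return spread_statistic(groups, mean)      # deviations from the mean


def brown_forsythe(groups):
    return spread_statistic(groups, median)    # deviations from the median


# Hypothetical samples: same centre, very different spread.
tight = [9.0, 10.0, 11.0, 10.0, 10.0]
wide = [2.0, 18.0, 10.0, 5.0, 15.0]

print(levene([tight, wide]), brown_forsythe([tight, wide]))
```

A large statistic points to unequal variances. The median-based version is generally preferred when the data are skewed, since the median is less affected by extreme values.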
Releases -- july 18, 2005 -- version 1.3.1
A new family of components is added: nonparametric statistics.
Our main reference is Siegel & Castellan, "Nonparametric Statistics for the Behavioral Sciences", McGraw-Hill, 1988.
Component | Description |
Kendall's Tau | The Kendall Rank-Order Correlation Coefficient Tau |
Spearman's Rho | The Spearman Rank Order Correlation Rho |
Wald & Wolfowitz Runs Test | Comparison of two independent samples |
Mood Runs Test | Comparison of K independent samples (generalization of Wald & Wolfowitz test) |
Mann & Whitney | Comparison of 2 independent samples based on rank average |
Kruskal & Wallis | Comparison of K independent samples based on rank average (generalization of Mann & Whitney test) |
Sign Test | Comparison on 2 Related Measures or Paired Replicates |
Wilcoxon Signed Ranks Test | Comparison on 2 Related Measures or Paired Replicates |
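Most of the components in this table work on ranks rather than raw values. As an illustration, here is a minimal sketch of Spearman's Rho computed as the Pearson correlation of the ranks, with tied values given their average rank. The function names and data are assumptions for the example, and no significance test is included.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# A monotonic but non-linear relation still gives a perfect rank correlation.
x = [1, 2, 3, 4, 5]
y = [2, 4, 8, 16, 32]
print(spearman_rho(x, y))
```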
Releases -- april 28, 2005 -- version 1.2.1
A new family of components is added to TANAGRA: SCORING.
These components use the posterior class probability information.
Three new components:
Releases -- april 22, 2005 -- version 1.1.7
Two supervised learning methods are added :
Releases -- april, 17th -- version 1.1.6
SVM (Support Vector Machine) is implemented.
Implements John C. Platt's sequential minimal optimization algorithm for training a support vector classifier using polynomial or RBF kernels.
Note: this is a port of the WEKA implementation (SMO.JAVA, ver. 3-4).
The results can differ very slightly in certain cases. This is because the observations are randomly shuffled before the training process, and JAVA and DELPHI do not use the same random number generator.
Releases -- april, 8th -- version 1.1.5
NIPALS algorithm for SVD (singular value decomposition) and PCA (principal component analysis) computing.
Releases -- february, 18th -- version 1.1.4
2 new evaluation procedures of supervised learning were added (Spv Learning Assessment):
Releases -- december 29th, 2004 -- version 1.1.2
TANAGRA can natively handle two new file formats: WEKA (.arff) and EXCEL (.xls, 97 & 2000 versions). Data import time has been improved considerably.
Note: if you want more options for importing WEKA files (better missing data processing, for instance), use DATANAMORF.
Releases -- december 19th, 2004 -- version 1.1.1
New components.
Component | Section | Description |
Fisher Filtering | Feature selection | Use an ANOVA for predictive attribute evaluation |
Runs Filtering | Feature selection | Use a non-parametric test (Mood's runs test, 1940) for predictive attribute evaluation |
C-RT | Supervised learning | The famous classification tree algorithm of Breiman et al. (1984) |
Select examples | Instance selection | Select a subset of examples from an attribute-value description. You can use this component to define an external test set or to apply a classifier to a new dataset (see tutorials). |
Releases -- october 15th, 2004 -- version 1.1
A - Integration of a new operating mode.
Batch mode execution: it is now possible to run a stream diagram from the command line. The stream is executed automatically and the results can be saved.
HTML reports are produced.
B - Other functionalities
Computation time is optimized when there are many attributes (several thousand).
Univariate continuous attribute selection for supervised learning (Fisher's F statistics)
Assessment for supervised learning (train-test, cross-validation) has a new parameter: it is now possible to save the results in a persistent text file. Be careful: old binary stream diagrams are not compatible with the new format; you must save the old diagram in text mode (.TDM file) before loading it in this new version.
DATANAMORF 1.0 -- june 29th, 2004
DATANAMORF is a program that converts the WEKA (.arff) file format into the TANAGRA text file format (.txt). DATANAMORF uses a diagram to represent the import process.
DATANAMORF was written by Aurélien BERTRAND and is freely available.
Releases -- may 24th, 2004 -- version 1.0.2
No structural modification.
Component | Section | Description |
ID3 | Spv Learning | The size limit for leaves is modified: it is now a "greater than or equal to" condition. |
CFS Filtering | Feature selection | Hall et al. (1997 - 2000). |
CHI-2 Filtering | Feature selection | Compute Chi-2 statistic between each descriptor and class attribute. |
FCBF Filtering | Feature selection | Yu and Liu (2003). |
MIFS Filtering | Feature selection | Battiti (1994). |
MODTree Filtering | Feature selection | Lallich and Rakotomalala (1999 - 2002). |
Remove constant | Feature selection | Remove constant attributes from the descriptors. |
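To illustrate the kind of filter listed in the CHI-2 Filtering row, here is a minimal sketch of the Chi-2 statistic between one discrete descriptor and the class attribute. The function name `chi2_statistic` and the toy data are assumptions for the example. How TANAGRA turns the statistic into a selection rule (threshold, p-value, or number of attributes kept) is not specified here.

```python
from collections import Counter


def chi2_statistic(descriptor, target):
    """Pearson Chi-2 statistic of the contingency table descriptor x target."""
    n = len(descriptor)
    joint = Counter(zip(descriptor, target))   # observed cell counts
    row_totals = Counter(descriptor)
    col_totals = Counter(target)
    stat = 0.0
    for a, n_a in row_totals.items():
        for c, n_c in col_totals.items():
            expected = n_a * n_c / n           # count expected under independence
            observed = joint.get((a, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical data: x1 determines the class, x2 is unrelated to it.
target = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
x1 = ["a", "a", "a", "a", "b", "b", "b", "b"]
x2 = ["a", "b", "a", "b", "a", "b", "a", "b"]

scores = {"x1": chi2_statistic(x1, target), "x2": chi2_statistic(x2, target)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Because each descriptor is scored on its own, before any classifier is trained, this kind of filter is fast but blind to redundancy between descriptors.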
Releases -- april 2004 -- version 1.0.1
Add some components, no structural modification.
Component | Section | Description |
EqFreq Disc | Feature construction | Equal-frequency binning, the arity (number of intervals) is a parameter. It is an unsupervised univariate discretization, and it performs well when you are not in a supervised framework. |
EqWidth Disc | Feature construction | Equal-width binning, the arity (number of intervals) is a parameter. It is an unsupervised univariate discretization, and it is very fast because it is not necessary to sort the attribute. However, it can give poor results, especially when the distribution is not symmetric. |
MDLPC | Feature construction | Supervised univariate discretization (Fayyad and Irani, IJCAI-1993), this is a state-of-the-art method, it gives good performance. |
Standardize | Feature construction | Standardization of attributes, it may be used to put the attributes on the same scale, in order to make them comparable for instance. |
HAC | Clustering | Hierarchical agglomerative clustering, the method implemented here introduces a variation known as "hybrid clustering", which works on two levels: first, clusters are roughly computed (e.g. 15--20 clusters); then these clusters are used to build a classical HAC tree. |
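To illustrate the difference between the two unsupervised discretizations in the table above, here is a minimal sketch. It is not the TANAGRA code: the function names are hypothetical, and the equal-frequency version cuts on sorted positions, so tied values may fall into different bins.

```python
def equal_width_bins(values, arity):
    """Bin index of each value using `arity` intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / arity
    if width == 0:
        return [0] * len(values)   # constant attribute: a single bin
    # No sorting needed: each index follows directly from the value.
    return [min(int((v - low) / width), arity - 1) for v in values]


def equal_frequency_bins(values, arity):
    """Bin index of each value using `arity` bins of (nearly) equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for position, i in enumerate(order):
        bins[i] = position * arity // len(values)
    return bins


# A strongly asymmetric attribute: one extreme value stretches the range.
values = [1, 2, 3, 4, 5, 100]
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 3))
```

On this asymmetric attribute, equal-width binning puts every value except the extreme one into the first interval and leaves the middle interval empty, while equal-frequency binning spreads the values evenly.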