The concept of the stream diagram
Introduced by SPAD for data analysis in the early 1990s, before the term "Data Mining" was even in use, the stream diagram represents the sequence of operations applied to the data as a graph in which (1) the nodes (an operator, a component, etc.) symbolize the analyses performed on the data, and (2) the links between nodes represent the flow of the processed data.

The main advantage of this representation is its clarity; it also lies in the ability to easily chain operations on the data generated by some methods: for example, applying a clustering algorithm to the factorial axes produced by a multiple correspondence analysis. It is of course possible to do the same thing with the scripting facilities of some software packages, but does the person carrying out the study have the time and the inclination to learn a new language? Some people consider the use of stream diagrams to be a form of visual programming; this is a little grandiose since, apart from the sequencing of operations, no usual algorithmic structure is used (loops, conditions, etc.). In any case, the stream diagram has become an unavoidable paradigm, adopted by most data mining software vendors (cf. STATISTICA DATA MINER, INSIGHTFUL MINER, SAS EM, SPSS CLEMENTINE, etc.).
In TANAGRA, the graph has been replaced with a simpler form: a tree. Only one data source can feed a given diagram, so the user must prepare the data before importing it. This choice has two main consequences: for users, an easier reading of the operations actually performed; for developers, simpler classes with fewer integrity checks on the data.
The tree structure makes it possible to run several concurrent analyses on the same data. This is useful, for example, when we want to compare the performance of several prediction algorithms.
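The tree of operators can be pictured with a short sketch. This is not TANAGRA's actual API (all names below are hypothetical): it only illustrates a single data source at the root of the diagram and two branches that analyse the same dataset in parallel.

```python
# Hypothetical sketch of a diagram as a tree: one data source at the
# root, operator nodes below it, several branches sharing the data.

class Node:
    def __init__(self, name, operation):
        self.name = name
        self.operation = operation  # callable: data -> result
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def run(self, data, results=None):
        # Execute this operator, then propagate the same data down
        # every branch (depth-first traversal of the tree).
        if results is None:
            results = {}
        results[self.name] = self.operation(data)
        for child in self.children:
            child.run(data, results)
        return results

# Root: the data source; two concurrent branches on the same dataset.
dataset = [1.0, 2.0, 3.0, 4.0]
root = Node("source", lambda d: len(d))
root.add(Node("mean", lambda d: sum(d) / len(d)))
root.add(Node("max", lambda d: max(d)))

print(root.run(dataset))  # {'source': 4, 'mean': 2.5, 'max': 4.0}
```

Because every branch receives the same data flow from the root, comparing two methods is just a matter of attaching two sibling nodes.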
The operator (component) is the key element, as it represents an operation applied to the data. The first operator is always a connection to a dataset, i.e. a table of records described by attributes. The import wizard automatically places this connection at the top of the diagram.
Four types of results can be expected when adding an operator, each of which has its own parameters:
(a) analytical results that describe or model the data;
(b) a restriction or an enlargement of the set of active examples used in the analyses;
(c) a restriction or an enlargement of the set of attributes used in the analyses;
(d) the production of new attributes, added to the dataset.
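The four result types above could be modelled as follows; this is a hypothetical sketch (the enum and function names are not TANAGRA's internals), with result type (b), example selection, shown in action.

```python
from enum import Enum, auto

# Hypothetical names, not TANAGRA's internals: the four result kinds.
class ResultKind(Enum):
    ANALYSIS = auto()             # (a) describes or models the data
    EXAMPLE_SELECTION = auto()    # (b) restricts/enlarges the active examples
    ATTRIBUTE_SELECTION = auto()  # (c) restricts/enlarges the attributes
    NEW_ATTRIBUTES = auto()       # (d) adds computed attributes to the dataset

# Result kind (b) in action: restrict the set of active examples.
def select_examples(rows, predicate):
    return [r for r in rows if predicate(r)]

rows = [{"x": 1}, {"x": 5}, {"x": 9}]
active = select_examples(rows, lambda r: r["x"] > 2)
print(len(active))  # 2
```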
Operators are usually linked to form a sequence in the diagram; there are two levels of control: when an operator is added, and when the sequence is executed.
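These two levels of control can be sketched as follows; the class and function names are hypothetical, chosen only to separate the add-time check (is this operator compatible with its parent?) from the execution-time check (is the incoming data usable?).

```python
# Hypothetical sketch of the two control levels, not TANAGRA's API.

class Operator:
    requires = frozenset()  # what the parent must provide
    provides = frozenset()  # what this operator passes downstream

class Source(Operator):
    provides = frozenset({"dataset"})

class Clustering(Operator):
    requires = frozenset({"dataset"})
    provides = frozenset({"dataset", "clusters"})

def can_attach(parent, child):
    # Control level 1: checked when the operator is added to the diagram.
    return child.requires <= parent.provides

def execute(operator, data):
    # Control level 2: checked when the sequence is executed.
    if not data:
        raise ValueError(f"{type(operator).__name__}: empty data flow")
    return data  # a real operator would transform the data here

print(can_attach(Source(), Clustering()))  # True
```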
Two categories of operators have an unusual behavior that accommodates the specificities of their methods: supervised learning and meta supervised learning. They work as follows: first a "meta" component is added to the diagram, which acts as a container for single methods or even arcing ones; then the "learning" operator (discriminant analysis, induction tree, etc.) is embedded in it. This design multiplies the possible combinations: it allows, for example, boosting a multilayer perceptron, which is not really recommended but is possible nonetheless.
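The meta-component pattern can be illustrated with a sketch; all names here are hypothetical, and bagging is used instead of boosting purely for brevity. The point is the structure: the meta component does no learning itself, it only resamples the data and delegates to copies of whatever learning operator is embedded in it.

```python
import random

class MajorityLearner:
    # A trivial "learning operator": predicts the most frequent label.
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label] * len(X)

class BaggingMeta:
    # The "meta" component: it does not learn by itself; it resamples
    # the data and delegates to copies of the embedded learner.
    def __init__(self, learner_factory, n_rounds=5, seed=0):
        self.learner_factory = learner_factory
        self.n_rounds = n_rounds
        self.rng = random.Random(seed)

    def fit(self, X, y):
        self.models = []
        n = len(X)
        for _ in range(self.n_rounds):
            idx = [self.rng.randrange(n) for _ in range(n)]
            Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
            self.models.append(self.learner_factory().fit(Xs, ys))
        return self

    def predict(self, X):
        # Majority vote over the embedded models.
        votes = [m.predict(X) for m in self.models]
        return [max(set(col), key=col.count) for col in zip(*votes)]

X = [[0], [1], [2], [3]]
y = ["a", "a", "a", "b"]
meta = BaggingMeta(MajorityLearner).fit(X, y)
print(meta.predict([[0], [3]]))
```

Any learner exposing `fit`/`predict` can be dropped into the meta component, which is exactly what makes the combinations multiply.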