MCD'05 - The First International Workshop on Mining Complex Data 2005
Data mining and knowledge discovery, as originally defined,
can today be considered stable fields, with numerous efficient
methods and studies proposed to extract knowledge from data.
Nevertheless, finding the famous golden nugget remains a challenge. The context
has evolved since the KDD process was first defined, and knowledge
now has to be extracted from increasingly complex data.
In the framework of Data Mining, many software solutions were
developed for the extraction of knowledge from tabular data (which are typically
obtained from relational databases). Methodological extensions were proposed
to deal with data initially obtained from other sources, such as natural
language (text mining) and images (image mining). KDD has thus evolved
following a unimodal scheme instantiated according to the type of the underlying
data (tabular data, text, images, etc.), which, in the end, always leads to working
on the classical double-entry tabular format.
However, in a large number of application domains, this unimodal
approach appears to be too restrictive. Consider for instance a corpus of medical
files. Each file can contain tabular data such as results of biological analyses,
textual data coming from clinical reports, and image data such as radiographies,
echograms, or electrocardiograms. In a decision-making framework, treating each
type of information separately has serious drawbacks. It appears therefore more
and more necessary to consider these different data simultaneously, thereby
encompassing all their complexity.
Hence, a natural question arises: how can one combine information
of different natures and associate it with a single semantic unit, such as
the patient? On a methodological level, one could also wonder how to
compare such complex units via similarity measures. The classical approach consists
in aggregating partial dissimilarities computed on components of the same type.
However, this approach tends to produce superposed layers of information: it considers
that the whole entity is the sum of its components. By analogy with the analysis
of complex systems, it appears that knowledge discovery in complex data cannot
simply consist of concatenating the partial information obtained
from each part of the object. The aim would rather be to discover more
“global” knowledge that gives meaning to the components and associates
them with the semantic unit. This fundamental information cannot be extracted
by the currently considered approaches and the available tools.
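The classical aggregation approach mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the PatientRecord fields, the three partial measures (mean absolute difference, simple mismatch, Jaccard distance) and the equal weights are hypothetical choices, not taken from the workshop text.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class PatientRecord:
    """Hypothetical complex unit: one patient described by mixed-type components."""
    lab_values: List[float]   # numerical component, assumed already scaled to [0, 1]
    blood_group: str          # categorical component
    report_words: Set[str]    # textual component, reduced here to a set of words


def numeric_dissimilarity(a: List[float], b: List[float]) -> float:
    # Mean absolute difference between two equally long numeric vectors.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def categorical_dissimilarity(a: str, b: str) -> float:
    # Simple mismatch: 0 if the categories agree, 1 otherwise.
    return 0.0 if a == b else 1.0


def text_dissimilarity(a: Set[str], b: Set[str]) -> float:
    # Jaccard distance between the two word sets.
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def aggregated_dissimilarity(
    p: PatientRecord,
    q: PatientRecord,
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    # One partial dissimilarity per component type, then a weighted sum.
    # The weights are an assumption; nothing in the text prescribes them.
    parts = (
        numeric_dissimilarity(p.lab_values, q.lab_values),
        categorical_dissimilarity(p.blood_group, q.blood_group),
        text_dissimilarity(p.report_words, q.report_words),
    )
    return sum(w * d for w, d in zip(weights, parts))


if __name__ == "__main__":
    p = PatientRecord([0.2, 0.4], "A", {"fever", "cough"})
    q = PatientRecord([0.4, 0.8], "B", {"fever", "fatigue"})
    print(aggregated_dissimilarity(p, q))
```

The weighted sum treats each component independently, which is exactly the limitation the text points out: the result is a stack of per-type layers, with no account of how the components relate to one another within the same patient.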
New data mining strategies must take into account the
specificities of complex objects (the units with which the complex data
are associated). These specificities are summarized hereafter:
- Different kinds. The data associated with an object
are of different types. Besides classical numerical, categorical or symbolic
descriptors, text, image or audio/video data are often available.
- Diversity of the sources. The data come from
different sources. As shown in the context of medical files, the collected
data can come from surveys filled in by doctors, textual reports, measurements
acquired from medical equipment, radiographies, echograms, etc.
- Evolving and distributed. It often happens that
the same object is described according to the same characteristics at different
times or different places. For instance, a patient may often consult several
doctors, each one of them producing specific information. These different
data are associated with the same subject.
- Linked to expert knowledge. Intelligent data
mining should also take into account external information, also called expert
knowledge, which could be incorporated by means of an ontology. In the
framework of oncology for instance, the expert knowledge is organized in
the form of decision trees and is made available as “best
practice guides” called Standard Option Recommendations (SOR).
- Dimensionality of the data. The association
of different data sources at different moments multiplies the points of
view and therefore the number of potential descriptors. The resulting high
dimensionality is the cause of both algorithmic and methodological difficulties.
The difficulty of Knowledge Discovery in complex data lies
in all these specificities.
Why the workshop is of interest at this time.
Many people approach the field of mining complex data from different and interesting
angles. They come from various communities such as data mining, classification,
knowledge discovery and engineering. We believe it is now time to establish
and enhance communication between these communities.
The aim of this workshop will be to address issues related
to the concepts of mining complex data. Since the whole knowledge discovery process
is involved, our goal will be to attract papers dealing with each step of
this process. Indeed, managing complex data within the KDD process implies
working on every step, from pre-processing (e.g. structuring and organizing)
to the visualization and interpretation (e.g. sorting or filtering) of the results,
via the data mining methods themselves (e.g. classification, clustering, frequent