J. Darmont, O. Boussaïd, F. Bentayeb, S. Rabaseda, Y. Zellouf, "Web multiform data structuring for warehousing", Multimedia Systems and Applications, Vol. 22, Kluwer Academic Publishers, 2003, 179-194 (In C. Djeraba, ed., Multimedia Mining: A Highway to Intelligent Multimedia Documents).


In a data warehousing process, the data preparation phase is crucial. Mastering this phase enables multidimensional analysis and the use of data mining algorithms, and yields substantial gains in time and performance when such analyses are run. Furthermore, a data warehouse may require external data. The web is a prevalent data source in this context, but the data published on this medium are very heterogeneous.

In this chapter, we propose a modeling process for integrating these diverse, heterogeneous data into a unified format. Furthermore, the schema definition itself provides first-rate metadata in our data warehousing context. At the conceptual level, a complex object is represented in UML as a superclass of any useful data source (databases, plain or tagged texts, images, sounds, video clips, etc.). Our logical model is an XML schema that can be described with a DTD or the XML-Schema language. Finally, we have designed a Java prototype that transforms our multiform input data into XML documents representing our physical model.
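
To make the transformation step concrete, the following is a minimal sketch of how a single plain-text source might be wrapped in an XML document, using only the standard Java DOM and transformer APIs. It is not the prototype's code: the class name, the element names `complex_object` and `text`, and the attributes `source` and `nb_char` are assumptions chosen for illustration, and the actual schema is the one defined in the chapter.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ComplexObjectSketch {

    // Wraps one plain-text source in a hypothetical <complex_object> root.
    // Element and attribute names are illustrative assumptions.
    static Document wrapPlainText(String sourceName, String content) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();

        Element root = doc.createElement("complex_object");
        root.setAttribute("source", sourceName);
        doc.appendChild(root);

        // The text sub-document carries a simple piece of metadata (its length).
        Element text = doc.createElement("text");
        text.setAttribute("nb_char", String.valueOf(content.length()));
        text.setTextContent(content);
        root.appendChild(text);
        return doc;
    }

    // Serializes the DOM tree to a string without the XML declaration.
    static String serialize(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = wrapPlainText("notes.txt", "Sample web text");
        System.out.println(serialize(doc));
    }
}
```

Other source types (images, sounds, tagged texts) would presumably be added as further child elements of the same root, each carrying its own descriptive attributes.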

The XML documents we obtain are then mapped into a relational database. We view this database as an ODS (Operational Data Storage), whose data must be remodeled multidimensionally before they can be stored in a warehouse and, later, analyzed.
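
One generic way to picture an XML-to-relational mapping is sketched below. It is a swapped-in technique for illustration, not the mapping defined in the chapter: each XML element becomes one row of a single hypothetical table `element(id, parent_id, tag, content)`, and the sketch only produces SQL `INSERT` statements as strings, so no database driver is required. The class, table, and column names are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlToRelationalSketch {

    // Parses an XML string and returns one INSERT statement per element,
    // in document (pre-order) order.
    static List<String> toInserts(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> inserts = new ArrayList<>();
        walk(doc.getDocumentElement(), null, inserts);
        return inserts;
    }

    private static void walk(Element element, Integer parentId, List<String> inserts) {
        // Each element adds exactly one row, so the list size yields a unique id.
        int id = inserts.size() + 1;

        // Collect only the text directly under this element, not under its children.
        StringBuilder text = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }

        inserts.add(String.format(
                "INSERT INTO element (id, parent_id, tag, content) VALUES (%d, %s, '%s', '%s');",
                id,
                parentId == null ? "NULL" : parentId.toString(),
                quote(element.getTagName()),
                quote(text.toString().trim())));

        // Recurse into child elements, passing this element's id as their parent.
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                walk((Element) child, id, inserts);
            }
        }
    }

    // Doubles single quotes so the value is a valid SQL string literal.
    static String quote(String value) {
        return value.replace("'", "''");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<complex_object><text>Hello</text><image/></complex_object>";
        toInserts(xml).forEach(System.out::println);
    }
}
```

A single generic table keeps the sketch short; a schema-driven mapping with one table per element type would give the ODS a structure closer to the logical model, at the cost of more mapping rules.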


Web farming, Multiform data, Integration, Modeling process, UML, XML, Mapping, Data warehousing, Data analysis


