
Publications of Jérôme Darmont

Reference (inproceedings)

R. El-Idrissi, E. Makhlouf, M.R. Sahrane, D.R. Elohounkpon, J.A. Sanchez-Cristancho, J. Simon-Reig, J. Agoun, J.P. Girard, G. de-Prado, S. Loudcher, J. Darmont, "Data, Archives and Archaeological Texts: Creation and Exploitation of a Semantic Data Lake for Archaeology in Catalonia. The DataLAC Project", International symposium Modeling the Past to Anticipate the Future, December 2025.

Abstract

For much of the previous century, archaeology focused primarily on collecting data through the inventory and description of objects. Since the 1970s, archaeology has shifted toward contextual approaches, examining complex, multidimensional resources that often differ in material, state of preservation, or language. The DataLAC project addresses these challenges on a significant archaeological site in Ullastret, Catalonia, by digitizing and unifying 30 years of field notes, scientific literature, and audiovisual records into an interoperable data lake that enables complex queries. Switching to a data lake is necessary to support the heterogeneity of archaeological data and to process a variety of data types effectively. Powerful metadata management in the data lake enables orderly and flexible searching, as well as sustainable interoperability, allowing for advanced study and transdisciplinary research.

Our data lake is built in four phases: data acquisition, data processing, metadata modeling, and finally, data enrichment and semantic querying. The initial step involves scanning handwritten data (field notes, scientific reports) at high resolution, followed by image preprocessing to ensure sufficient data quality for subsequent automation.

The second step applies automatic handwritten text recognition (HTR) to the field notes. To this end, the ManuMC French model is fine-tuned on the linguistic and stylistic variety of the Ullastret notebooks. Transcription quality is estimated in terms of Character Error Rate (CER) and Word Error Rate (WER). When the CER and WER do not meet the required confidence levels, archaeologists carefully review the transcription to ensure accurate results and preserve crucial archaeological information.
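
As an illustration of the quality check described above, the following minimal sketch computes CER and WER as edit distance divided by reference length, then flags a page for manual review. It assumes a validated reference transcription is available for the sample being scored. The function names and the threshold values (`max_cer`, `max_wer`) are hypothetical and are not taken from the DataLAC project.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word edits / reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def needs_review(reference, hypothesis, max_cer=0.05, max_wer=0.15):
    """Flag a transcription for expert review (thresholds are assumed values)."""
    return cer(reference, hypothesis) > max_cer or wer(reference, hypothesis) > max_wer
```

In such a setup, pages for which `needs_review` returns `True` would be routed to archaeologists, while the remaining pages would proceed to metadata modeling.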

The third stage is metadata modeling and data enrichment. Metadata characterizes each notebook, document, and any visual object. Each object is semantically enriched and indexed in a cross-lingual relational database system, facilitating sophisticated links between text, image, and scientific data.

In the last step, we design a multilingual archaeological thesaurus to output standardized data that enable powerful semantic queries. Overall, data and metadata are stored in a data lake, accessible via a Web interface and an Application Programming Interface (API) that simplifies data retrieval and research processes.
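
To make the role of a multilingual thesaurus concrete, here is a minimal sketch of how a search term in one language could be expanded to the labels of the same concept in other languages before querying. The concept identifiers, the choice of languages, the sample entries, and the function names are all illustrative assumptions; the structure of the actual DataLAC thesaurus is not described in this abstract.

```python
# Hypothetical thesaurus: concept ID -> label per language code.
THESAURUS = {
    "C001": {"en": "amphora", "fr": "amphore", "ca": "àmfora", "es": "ánfora"},
    "C002": {"en": "silo", "fr": "silo", "ca": "sitja", "es": "silo"},
}


def build_index(thesaurus):
    """Map each case-folded label to the set of concept IDs that use it."""
    index = {}
    for concept_id, labels in thesaurus.items():
        for label in labels.values():
            index.setdefault(label.casefold(), set()).add(concept_id)
    return index


def expand_query(term, thesaurus, index):
    """Return every label, in any language, of the concepts matching the term."""
    labels = set()
    for concept_id in index.get(term.casefold(), set()):
        labels.update(thesaurus[concept_id].values())
    return labels


INDEX = build_index(THESAURUS)
```

With this sketch, `expand_query("Amphore", THESAURUS, INDEX)` yields the labels of the same concept in all four languages, so a French query could also match records indexed in another language.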

More than 3,000 notebook pages and various images have already been processed and are available in the data lake for further analysis and study. At the intersection of archaeology and data science, DataLAC seeks to demonstrate how interdisciplinary digital methods can facilitate access to and interpretation of archaeological records.

Keywords

Artificial intelligence, Machine learning, Archaeology, Metadata modeling, Data lakes

 
