Publications of Jérôme Darmont
Reference (inproceedings)
R. El-Idrissi, E. Makhlouf, M.R. Sahrane, D.R. Elohounkpon, J.A. Sanchez-Cristancho, J. Simon-Reig, J. Agoun, J.P. Girard, G. de-Prado, S. Loudcher, J. Darmont, "Data, Archives and Archaeological Texts: Creation and Exploitation of a Semantic Data Lake for Archaeology in Catalonia. The DataLAC Project", International symposium Modeling the Past to Anticipate the Future, December 2025.
BibTeX entry
@INPROCEEDINGS{past2025,
Author = {Rajae El-Idrissi and Elias Makhlouf and Mohamed-Riad Sahrane and Dewanou-Romeo Elohounkpon and Jorje-Antonion Sanchez-Cristancho and Josefina Simon-Reig and Juba Agoun and Jean-Pierre Girard and Gabriel de-Prado and Sabine Loudcher and Jérôme Darmont},
Title = {Data, Archives and Archaeological Texts: Creation and Exploitation of a Semantic Data Lake for Archaeology in Catalonia. The DataLAC Project},
Booktitle = {International symposium Modeling the Past to Anticipate the Future},
Month = {December},
Year = {2025},
Abstract = {Archaeology in the previous century was primarily focused on collecting data through the inventory and description of objects. Since the 1970s, archaeology has shifted toward contextual approaches, examining complex, multidimensional resources that often differ in material, state of preservation, or language. The DataLAC project addresses these challenges on a significant archaeological site in Ullastret, Catalonia, by digitizing and unifying 30 years of field notes, scientific literature, and audiovisual records into an interoperable data lake that enables complex queries. Switching to a data lake is necessary to support the heterogeneity of archaeological data and to process a variety of data types effectively. Robust metadata management in the data lake enables orderly and flexible searching, as well as sustainable interoperability, allowing for advanced study and transdisciplinary research.
Our data lake is built in four phases: data acquisition, data processing, metadata modeling, and finally, data enrichment and semantic querying. The initial step involves scanning handwritten data (field notes, scientific reports) at high resolution, followed by image preprocessing to ensure sufficient data quality for subsequent automation.
The second step applies automatic handwritten text recognition (HTR) to the field notes. The ManuMC French model is fine-tuned on the linguistic and stylistic variety of the Ullastret notebooks. Transcription quality is measured by Character Error Rate (CER) and Word Error Rate (WER). When the CER and WER do not meet the required confidence levels, archaeologists carefully review the transcription to ensure accurate results and preserve crucial archaeological information.
The third stage is metadata modeling and data enrichment. Metadata characterizes each notebook, document, and any visual object. Each object is semantically enriched and indexed in a cross-lingual relational database system, facilitating sophisticated links between text, image, and scientific data.
In the last step, we design a multilingual archaeological thesaurus to output standardized data that enable powerful semantic queries. Overall, data and metadata are stored in a data lake, accessible via a Web interface and an Application Programming Interface (API) that simplifies data retrieval and research processes.
More than 3,000 notebook pages and various images have already been processed and now populate the data lake for further analysis and study. At the intersection of archaeology and data science, DataLAC seeks to demonstrate how interdisciplinary digital methods can facilitate access to and interpretation of archaeological records.},
Keywords = {Artificial intelligence, Machine learning, Archeology, Metadata modeling, Data lakes}
}