Project DEFENCE: Data-driven harmful content detection system

Recherche-Research
Partenariat Hubert Curien (PHC) Brancusi, France-Romania.

This project aims to implement a harmful content detection system using a data-driven approach. Its main objectives are: 1) to develop effective machine learning models for harmful content detection and train them following a data-driven approach; 2) to determine whether harmful content transcends social, cultural, political, and economic factors.

New accepted publication at #ADBIS: #Learning #Metadata #Enrichment #Hand-Drawn #Illustrations

Recherche-Research
R. El-Idrissi, J. Agoun, J. Darmont, S. Loudcher, « Content Learning for Metadata Extraction and Enrichment of Historical Hand-Drawn Illustrations », New Trends in Databases and Information Systems – ADBIS 2026 Short Papers, September-October 2026; Communications in Computer and Information Science, Springer, Heidelberg, Germany.

Abstract: The digitisation of cultural heritage collections requires automated metadata generation to improve resource discovery and reuse. This paper presents the Content Learning for Metadata Extraction and Enrichment (CLEAD), a multimodal approach for generating metadata from illustrations in digitised archaeological diaries. CLEAD combines illustration detection and segmentation with vision-language and large language models to generate and enrich illustration descriptions using page-level textual context. Experiments on the DataLAC and IlluHisDoc datasets show that integrating visual and textual information improves the semantic quality and relevance of the generated metadata. These results demonstrate that CLEAD effectively supports the indexing and retrieval of visual heritage resources in digital archives and data lakes.

New accepted publication at #DaWaK: Discovering Relationships

Recherche-Research
A. Diouan, S. Loudcher, J. Darmont, E. Ferey, « Discovering Relationships in Data Lakes Using Large Language Models: An Industrial Case ». 28th International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2026), Graz, Austria. LNCS.

Abstract: Data lakes rely on metadata to remain usable, yet this metadata is often limited or weakly informative for column relationship discovery, especially in ERP-derived datasets with coded or abbreviated schema labels. We propose ColRel, a two-stage method that builds column embeddings from metadata and data available at ingestion time. In difficult cases, such as coded schemata, business dictionaries help better interpret column names and support the generation of short natural-language descriptions used in the second stage. Experiments on public benchmarks and an industrial ERP dataset show that ColRel is particularly effective in semantically related, weak-signal settings.

New publication #HumanitésNumériques

Recherche-Research

R. El-Idrissi, J. Simon-Reig, L. Romero, J. Agoun, J.P. Girard, G. de-Prado, J. Darmont, S. Loudcher, « Structuration, exploration et valorisation d’archives archéologiques par l’intelligence artificielle au sein d’un lac de données », 7th Conference of the Francophone Association for Digital Humanities (Humanistica), Paris, May 2026.