15/06/26 – PhD thesis of Hui Huang: Structure-Aware Representation Learning for Complex Question Answering.

The defense will take place on Monday, June 15 at 10:00 a.m., at the Berges du Rhône campus,
4 bis rue de l’Université, 69007 Lyon, in the Amphithéâtre Laprade.

A map of the Berges du Rhône campus is available at the following address: plan campus BDR. If you plan to attend the defense in person, please arrive a few minutes early.

The defense can also be followed remotely via the
following link

Abstract: Knowledge-intensive question answering (QA) increasingly relies on large language models (LLMs) combined with retrieval-augmented generation (RAG). In these systems, the retrieval module is a critical entry point: if relevant evidence is not retrieved, even powerful LLMs cannot produce correct or trustworthy answers. This challenge is particularly acute in industrial and scientific settings, where information is stored in long, structurally rich documents and where answers often depend on multiple related documents. This thesis investigates how to augment semantic retrieval with explicit modeling of both document-internal and inter-document structures to improve retrieval and reasoning in long-document and multi-document QA. Motivated by a CIFRE collaboration with Worldline, we focus on retrieval architectures and benchmarks that remain efficient and transferable while leveraging structure typically ignored by standard dense retrievers. Methodologically, we propose the Hierarchical Quantized Document Retriever (HQDR), which models documents as hierarchical graphs, fuses pre-trained language model and graph neural network representations, and uses self-supervised structural quantization to align structure-aware embeddings with a fixed semantic codebook. A hybrid dense–sparse scoring function combines semantic similarity with explicit structural profiles. On the data and evaluation side, we introduce MDA-QA, a Multi-Document Academic QA benchmark built on citation communities and LLM-based QA generation with automated filtering to enforce strict multi-document dependence. Systematic evaluation of retrieval and RAG methods on MDA-QA reveals that, even with simple citation-based enhancements, retrieving all necessary documents remains a major bottleneck, motivating further work on structure-aware retrieval and enterprise “knowledge agents”.
