Subject.
Recent alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) [Christiano et al., 2017] and Reinforcement Learning from AI Feedback (RLAIF) [Bai et al., 2022] have improved the behavior of Large Language Models (LLMs), but they still rely on simplified scalar reward signals and limited preference structures. Such signals fail to capture the complexity of real-world tasks, especially as LLMs evolve toward more agentic behaviors, where models plan, decide, and act autonomously over extended sequences. To address these limitations, this internship will investigate more expressive reward and loss designs that better reflect nuanced quality criteria and support emerging agent-style decision processes. The project draws on ideas from preference learning [Ziegler et al., 2019, Rafailov et al., 2023], bilevel optimization [Colson et al., 2007, Franceschi et al., 2018], inverse reinforcement learning [Ng et al., 2000], and modern loss-based alignment methods.
The goal is to explore how structured feedback (whether human, AI-generated, or task-derived) can shape models that behave more consistently, robustly, and interpretably than those trained with standard RLHF pipelines. This internship has several research objectives, listed below:
● Structured reward modeling: Develop and evaluate reward formulations that go beyond a single scalar score. This may include multi-aspect quality signals, decomposed rewards, or decision-aware reward shaping suitable for agentic systems [Christiano et al., 2017, Bai et al., 2022] (a minimal illustrative sketch of a multi-aspect reward head follows this list).
● Design of differentiable alignment losses: Explore surrogate objectives that approximate or replace RLHF/RLAIF without requiring full reinforcement learning. Compare and extend recent methods such as preference ranking losses, policy-based alignment losses, or theoretically grounded alternatives [Rafailov et al., 2023] (a DPO-style loss sketch also follows this list).
● Agentic decision evaluation: Examine how different reward and loss structures influence models that perform multi-step reasoning, planning, or tool-use. Assess whether structured feedback can enable more stable or reliable decision-making in agent-style settings [Zhuang et al., 2023, Webb et al., 2023].
● Human and AI feedback consistency: Analyze how effectively human and AI evaluators provide structured signals. Study reliability, inter-evaluator agreement, and the potential advantages of scalable AI-based supervision for more complex decision tasks.
● Benchmarking across environments: Evaluate the proposed reward and loss mechanisms on a range of benchmarks, including preference datasets, agentic task environments (text-based or game-based), and classical RL testbeds, to assess generalization, robustness, and decision quality across domains [Webb et al., 2023, Lambert et al., 2025].
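To make the structured-reward objective concrete, below is a minimal PyTorch sketch of a multi-aspect reward head that scores several quality dimensions separately and only optionally collapses them into a single scalar. It is illustrative only, not project code: the class name, aspect list, hidden size, and weighting scheme are all assumptions.

```python
# Minimal sketch (illustrative, not project code): a multi-aspect reward head
# that scores several quality dimensions instead of emitting one opaque scalar.
import torch
import torch.nn as nn

class MultiAspectRewardHead(nn.Module):
    def __init__(self, hidden_size: int, aspects=("helpfulness", "harmlessness", "coherence")):
        super().__init__()
        self.aspects = aspects
        # One linear scorer per quality aspect, applied to a pooled LLM representation.
        self.scorers = nn.ModuleDict({a: nn.Linear(hidden_size, 1) for a in aspects})
        # Learnable weights (they could also be fixed) for collapsing aspects into a scalar.
        self.weights = nn.Parameter(torch.ones(len(aspects)) / len(aspects))

    def forward(self, pooled_hidden: torch.Tensor):
        # pooled_hidden: (batch, hidden_size) summary of a response from the underlying LLM.
        per_aspect = torch.cat([self.scorers[a](pooled_hidden) for a in self.aspects], dim=-1)
        scalar = (per_aspect * self.weights.softmax(dim=0)).sum(dim=-1)
        # Keep the decomposition for analysis, the scalar for RLHF-style optimization.
        return per_aspect, scalar

# Example: score a batch of 4 pooled response embeddings of size 768.
head = MultiAspectRewardHead(hidden_size=768)
per_aspect, scalar = head(torch.randn(4, 768))
print(per_aspect.shape, scalar.shape)  # torch.Size([4, 3]) torch.Size([4])
```

Keeping the per-aspect decomposition supports analysis of which dimension drives a preference, while the weighted scalar stays compatible with standard RLHF-style pipelines.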
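Similarly, the differentiable-loss direction can be illustrated with a DPO-style preference ranking loss in the spirit of Rafailov et al. [2023]: given per-sequence log-probabilities of a chosen and a rejected response under the current policy and a frozen reference model, a logistic loss on the implicit reward margin replaces the explicit RL loop. The sketch below is a simplified reading of that objective and assumes the log-probabilities have already been computed; it is not a drop-in implementation.

```python
# Minimal sketch of a DPO-style preference ranking loss (simplified, illustrative only).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward of each response: beta * (policy log-prob - reference log-prob).
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Logistic loss on the reward margin; minimizing it favors the chosen response.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Example with random log-probabilities for a batch of 8 preference pairs.
logps = [torch.randn(8) for _ in range(4)]
print(dpo_loss(*logps))
```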
Contacts.
The intern will join the Data Mining & Decision team of the ERIC lab (Campus Porte des Alpes, Bron).
Duration: 6 months, starting in January or later
Supervision: Pegah Alizadeh (pegah.alizadeh@univ-lyon2.fr)
Required profile.
The ideal candidate is technically and theoretically strong, with a solid background in machine learning, reinforcement learning, or optimization, and strong programming skills in Python (preferably with PyTorch, Ray, Docker, and Git). Familiarity with deep RL algorithms (PPO, SAC, A2C) or differentiable optimization is desirable; familiarity with Large Language Models (LLMs), Vision-Language Models (VLMs), or other foundation models is a plus.
References.
[Bai et al., 2022] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
[Christiano et al., 2017] Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
[Colson et al., 2007] Colson, B., Marcotte, P., and Savard, G. (2007). An overview of bilevel optimization. Annals of operations research, 153(1):235–256.
[Franceschi et al., 2018] Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. (2018). Bilevel programming for hyperparameter optimization and meta-learning. In International conference on machine learning, pages 1568–1577. PMLR.
[Lambert et al., 2025] Lambert, N., Pyatkin, V., Morrison, J., Miranda, L. J. V., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., et al. (2025). RewardBench: Evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1755–1797.
[Ng et al., 2000] Ng, A. Y., Russell, S., et al. (2000). Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2.
[Rafailov et al., 2023] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741.
[Webb et al., 2023] Webb, T., Mondal, S. S., and Momennejad, I. (2023). Improving planning with large language models: A modular agentic architecture. arXiv preprint arXiv:2310.00194.
[Zhuang et al., 2023] Zhuang, Y., Chen, X., Yu, T., Mitra, S., Bursztyn, V., Rossi, R. A., Sarkhel, S., and Zhang, C. (2023). Toolchain*: Efficient action space navigation in large language models with a* search. arXiv preprint arXiv:2310.13227.
[Ziegler et al., 2019] Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
