Metadata
Title
Human-in-the-loop for Safe and Verifiable Reinforcement Learning
Category
general
UUID
bdf4add6a9cb445cb03f5ea0db36526a
Source URL
https://wsai.iitm.ac.in/projects/human-in-the-loop-for-safe-and-verifiable-reinf...
Parent URL
https://wsai.iitm.ac.in/projects/
Crawl Time
2026-03-23T19:04:41+00:00
Rendered Raw Markdown
# Human-in-the-loop for Safe and Verifiable Reinforcement Learning

**Source**: https://wsai.iitm.ac.in/projects/human-in-the-loop-for-safe-and-verifiable-reinforcement-learning/
**Parent**: https://wsai.iitm.ac.in/projects/



**Investigators**

Nirav Bhatt
B. Ravindran

**Tags**

[AI Safety](https://wsai.iitm.ac.in/tags/ai-safety)
[Human in the Loop](https://wsai.iitm.ac.in/tags/human-in-the-loop)
[Reinforcement Learning](https://wsai.iitm.ac.in/tags/reinforcement-learning)

One of the main challenges in deploying RL in real-life applications is safety. In particular, undesired and harmful behaviour of RL agents in settings involving humans is a major safety concern. Involving humans as active participants in the learning process is an active area of research known as human-in-the-loop RL. In this work, we propose to model the human participant as a constraint provider: humans supply context-specific constraints that ensure safety. This project aims to develop a framework for safe human-in-the-loop RL in which the human acts as a constraint provider. The framework will address the following questions:

1. How do we automatically formulate the mathematical constraints imposed by humans?
2. How should the human provide constraints so that a safe optimal policy can be learnt with minimal exploration (or experiments)?
3. How do we determine whether the human-imposed constraints lead to a feasible problem?
4. How do we handle the trade-off between short-term actions and policy optimization?
5. How can we use a nominal solution of the unconstrained (or constrained) MDP for recursive improvement as additional constraints are added?
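
To make question 1 concrete, one standard starting point is the constrained-MDP (CMDP) formulation, in which each human-elicited constraint is compiled into a cost function with a budget; the symbols below are generic placeholders, not the project's actual notation:

$$
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\right] \le d_i,
\qquad i = 1, \dots, m,
$$

where $r$ is the reward, $\gamma$ the discount factor, and each $(c_i, d_i)$ pair encodes one human-provided constraint. In these terms, question 3 (feasibility) asks whether any policy satisfies all $m$ budget constraints simultaneously.

As an illustration of how a human-supplied constraint might act at run time, the sketch below masks unsafe actions before execution. It is a minimal, hypothetical example: the class names, the toy grid world, and the action-masking strategy are assumptions made for exposition, not the framework proposed here.

```python
"""Hypothetical sketch: run-time action masking with human-provided
constraints. Names and the toy grid setting are illustrative only."""

import random
from typing import Callable, List, Tuple

State = Tuple[int, int]   # (x, y) position on a small grid
Action = Tuple[int, int]  # movement delta


class HumanConstraintProvider:
    """Stores context-specific safety predicates elicited from a human."""

    def __init__(self) -> None:
        self._constraints: List[Callable[[State, Action], bool]] = []

    def add_constraint(self, predicate: Callable[[State, Action], bool]) -> None:
        self._constraints.append(predicate)

    def is_safe(self, state: State, action: Action) -> bool:
        # An action is allowed only if every human-provided predicate holds.
        return all(c(state, action) for c in self._constraints)


def shielded_action(state: State,
                    candidate_actions: List[Action],
                    provider: HumanConstraintProvider) -> Action:
    """Explore randomly, but only among actions the human constraints allow."""
    safe = [a for a in candidate_actions if provider.is_safe(state, a)]
    if not safe:
        # No admissible action: the constraints are infeasible in this
        # state (cf. question 3 above).
        raise RuntimeError("Human-imposed constraints are infeasible here")
    return random.choice(safe)


if __name__ == "__main__":
    actions = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    provider = HumanConstraintProvider()
    # Hypothetical human rule: never step outside the 5x5 grid.
    provider.add_constraint(
        lambda s, a: 0 <= s[0] + a[0] < 5 and 0 <= s[1] + a[1] < 5
    )
    state: State = (0, 0)
    print(shielded_action(state, actions, provider))
```

Masking constrains only the immediate action; learning a policy that is near-optimal subject to the human's constraints over whole trajectories, with few exploratory experiments, is the harder problem targeted by questions 2, 4, and 5.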