This is the start of my third and last year at KU Leuven as a postdoc.
Eventually, we published a paper about learning non-Markovian reward machines in an active learning setting. In other words, when an agent's rewards in an MDP depend on the history of its trajectory rather than only on the current state, we learn a finite state machine that captures this temporal reward behaviour.
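To make the idea concrete, here is a minimal, hypothetical sketch of a reward machine: a finite state machine that reads event labels from the environment and emits rewards based on its internal state. The class, state names, and labels below are illustrative only, not taken from the paper.

```python
class RewardMachine:
    """Finite state machine over event labels that emits rewards.

    transitions: dict mapping (state, label) -> (next_state, reward).
    Unlisted (state, label) pairs self-loop with reward 0.
    """

    def __init__(self, transitions, initial_state):
        self.transitions = transitions
        self.initial = initial_state
        self.state = initial_state

    def step(self, label):
        # Advance on an observed label; return the emitted reward.
        self.state, reward = self.transitions.get(
            (self.state, label), (self.state, 0.0))
        return reward

    def reset(self):
        self.state = self.initial


# Example of a temporal dependency a Markovian reward cannot express:
# reward 1.0 only when 'b' is observed after 'a' has already occurred.
rm = RewardMachine({("u0", "a"): ("u1", 0.0),
                    ("u1", "b"): ("u0", 1.0)},
                   initial_state="u0")
rewards = [rm.step(label) for label in ["b", "a", "b"]]  # → [0.0, 0.0, 1.0]
```

Note that the same label ("b") yields different rewards depending on the machine's state, which is exactly the non-Markovian behaviour the learned automaton encodes.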
At the moment, we are finishing up a project about learning safety properties from a set of MDP states labeled as either safe or dangerous. What makes our work novel is that we operate in a relational MDP setting, and the learned safety properties are expressed as probabilistic computation tree logic (pCTL) formulae.
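As an illustration of the kind of property involved (this particular formula is a made-up example, not one from the project), a pCTL safety formula might read:

```latex
P_{\geq 0.95}\left[\, \mathbf{G}\, \neg \mathit{dangerous} \,\right]
```

That is: with probability at least 0.95, the system globally avoids states labeled dangerous. The probability bound is what distinguishes pCTL from plain CTL, and it is a natural fit for MDPs, where outcomes are stochastic.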