I have now also completed a small project to show off my new skill: using Behavior Trees for agent control. Check out my project here.
I have now finished implementing two agent architectures (and one which combines the two) using the Unity editor and the C# language.
- Reward Machine Agent for Patrolling (https://github.com/GavinRens/Reward-Machine-Agent—Patrolling) and Reward Machine Agent for Treasure Hunting (https://github.com/GavinRens/Reward-Machine-Agent—Treasure-Hunting). The algorithm in these frameworks is based on my work with Reward Machines: Instead of rewarding an agent for a given action in a given state, a reward machine allows one to specify rewards for sequences of observations. Every observation is mapped from an action-state pair. For instance, if you want to make your agent kick the ball twice in a row, then give it a reward only after seeing that it has kicked the ball twice in a row. A regular reward function would have to give the same reward for the first and second kick. I implemented a Monte Carlo Tree Search (MCTS) planner, which plans over the given reward machine. In the patrolling environment, the observation mapping function is not completely deterministic, whereas the treasure-hunting environment is fully observable.
- EatPrayDanceSleep_HMB_Agent (https://github.com/GavinRens/BDI-Agent—EatPrayDanceSleep). The algorithm in the framework is based on my work with the Belief-Desire-Intention (BDI) architecture: The agent has a set of goals. The agent periodically selects a subset of these goals to pursue for a while. The currently selected goals are called intentions. In the framework in this project, an agent can pursue all or some intentions simultaneously. The agent designer can specify which goals can/cannot be pursued simultaneously, and the ‘importance’ of every goal can be set. The agent designer can also define what rewards the agent will get in general (apart from those for goals) and define the cost of each action. Taken together, these specifications and definitions produce emergent behavior, where an agent will keep selecting different intentions to pursue. I implemented a Monte Carlo Tree Search (MCTS) planner, which plans over the current set of intentions, weighted by their importance.
- Hybrid-Agent—Work-n-Home (https://github.com/GavinRens/Hybrid-Agent—Work-n-Home) is an agent architecture combining the BDI and Reward-Machine architectures – controlled by two MCTS planners, adapted for each architecture.
Details of these architectures can be found in the README files of the respective repositories on my GitHub site: https://github.com/GavinRens.
I recently finished converting two agent control algorithms from Python code I used for research into C# code for Unity (the real-time 3D engine).
The first algorithm is based on my work with Reward Machines: instead of rewarding an agent for a given action in a given state, a reward machine allows one to specify rewards for sequences of observations. Every observation is mapped from an action-state pair. For instance, if you want to make your agent kick the ball twice in a row, then give it a reward only after seeing that it has kicked the ball twice in a row. A regular reward function would only be able to give the same reward for the first and second kick.
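The kick-twice example can be written as a very small reward machine. The following is only an illustrative Python sketch, not the code from my repositories: the class `RewardMachine`, the node names `u0` and `u1`, the state label `at_ball`, and the `observe` mapping are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Mealy-style machine: (node, observation) -> (next node, reward)."""
    start: str
    transitions: dict  # transitions[(node, observation)] = (next_node, reward)
    node: str = field(init=False)

    def __post_init__(self):
        self.node = self.start

    def step(self, observation):
        # Assumption: unlisted (node, observation) pairs reset the machine
        # to its start node and yield zero reward.
        next_node, reward = self.transitions.get(
            (self.node, observation), (self.start, 0.0)
        )
        self.node = next_node
        return reward


def observe(state, action):
    """Hypothetical observation mapping from an action-state pair."""
    return "kick" if action == "kick" and state == "at_ball" else "other"


def make_kick_twice_machine():
    return RewardMachine(
        start="u0",
        transitions={
            ("u0", "kick"): ("u1", 0.0),  # first kick: remembered, no reward yet
            ("u1", "kick"): ("u0", 1.0),  # second kick in a row: reward, reset
        },
    )


def run(trace):
    """Feed a list of (state, action) pairs through a fresh machine."""
    machine = make_kick_twice_machine()
    return [machine.step(observe(state, action)) for state, action in trace]
```

Calling `run` on two consecutive kicks gives a reward only on the second one, while a kick, a move and another kick give no reward at all. The machine's current node carries the memory that a reward function over single action-state pairs lacks.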
The second algorithm is based on my earlier work on a Hybrid POMDP-BDI agent architecture. In this architecture, an agent can pursue several goals at the same time. The goals that the agent is currently pursuing are called its intentions. The agent programmer must specify what it means for a goal/intention to be satisfied. When an intention is satisfied ‘enough’ or cannot be satisfied at the moment, then it is removed from the set of intentions. The ‘desire-level’ of a goal which has not been satisfied for a while increases. When a goal’s desire-level reaches a user-defined threshold, the goal becomes an intention. With this algorithm, all goals are satisfied periodically. There are several parameters that the programmer can set to achieve a desired agent behavior.
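The desire-level bookkeeping described above can be sketched roughly as follows. This is a simplified Python illustration under my own assumptions, not the actual architecture: the names `Goal`, `growth`, `enough` and `update_intentions` are hypothetical, desire levels grow by a fixed amount per step, and a goal's desire level is reset to zero when it is dropped as an intention.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    name: str
    threshold: float     # desire level at which the goal becomes an intention
    growth: float = 1.0  # assumed fixed increase per step while not pursued
    desire: float = 0.0


def update_intentions(goals, intentions, satisfaction, enough=0.9):
    """One bookkeeping step over all goals.

    `satisfaction` maps a goal name to a degree in [0, 1], or to None when
    the goal cannot be satisfied at the moment. Missing names count as 0.
    `intentions` is a set of goal names and is modified in place.
    """
    for goal in goals:
        level = satisfaction.get(goal.name, 0.0)
        if goal.name in intentions:
            # Drop an intention that is satisfied 'enough' or currently unsatisfiable.
            if level is None or level >= enough:
                intentions.discard(goal.name)
                goal.desire = 0.0  # assumption: desire restarts from zero
        else:
            # A goal that is not being pursued becomes more desirable over time.
            goal.desire += goal.growth
            if goal.desire >= goal.threshold:
                intentions.add(goal.name)
    return intentions
```

With two goals whose thresholds differ, repeated calls make the goals enter and leave the intention set at different times, which is the periodic behavior described above. The threshold, the growth rate and the `enough` level stand in for the kind of parameters the programmer would tune.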
These two algorithms still need some polishing (including comments and descriptions) and I need to implement each one with more example problems. I’ll be making them available on GitHub in a few weeks.
I have always wanted to visualize experiments with my agents and algorithms. I could never find time to learn a real-time graphics program. Since last year, I have considered learning the Unity 3D engine. With Unity, I would also be able to make simple games, something that has interested me for a long time.
So I have the opportunity to put my academic career on hold for a year (the whole of 2022) and just learn Unity. I’ve already made my first game, called Reach the Red (available on the Google Play store). Currently, I’m working on visualizing an agent in a gridworld where the agent has some simple goal-seeking behaviour. I want to see how the agent behaves, given different planning algorithms. Perhaps I’ll release a program/interface where the user can choose a task or environment, select the planning algorithm, and play around with the applicable parameters to see how they change the agent’s behavior.
Last week, I attended the 13th International Conference on Agents and Artificial Intelligence (ICAART). On 4 February I presented my work about Online Learning of Non-Markovian Reward Models. On 5 Feb. I chaired two technical sessions. On the last day (Saturday, 6 Feb.), I could simply attend talks.
There were several interesting papers (for me) and a few interesting keynote talks. The keynote by Gerhard Widmer titled Con Espressione! AI, Machine Learning, and Musical Expressivity was especially interesting: I was not aware how much research has been done in the field of music. Another keynote I found interesting, especially related to my work, was by Guy Van den Broeck about Probabilistic Circuits and Probabilistic Programs. Guy argues that many of the currently used probabilistic graphical models are ‘underpowered’ or incorrectly applied, and that Probabilistic Circuits and Probabilistic Programs are newer, richer and often tractable. There is a three-hour tutorial here (independent of ICAART).
From the middle of last year, I’ve been advising or co-advising five Masters students at KU Leuven. I’ll list the topics they are working on below.
- The goal is to define how an MDP can be modelled by a first-order ProbLog program and to make a program that can solve such an MDP by using the techniques that were applied in DT-ProbLog. ProbLog is a probabilistic logic programming language developed by Luc De Raedt et al. at KU Leuven. DT-ProbLog is a Decision-Theoretic approach implemented in ProbLog.
- Mealy Machines have been used to represent non-Markovian reward models in MDP settings. In this topic, we want to define a Mealy Reward Machine (MRM) with transitions depending on the satisfaction of propositional logic formulae. And we want to learn these logical MRMs. Logical MRMs could be exponentially smaller than regular/flat MRMs, more human readable, and possibly more efficient to use.
- I previously developed the Hybrid POMDP-BDI (HPB) agent architecture. An HPB agent maintains a probability distribution over possible states it can be in, and updates this distribution every time it makes a new observation (due to an action it executes). In this topic, we want to extend the HPB architecture to deal with streams of observations. We are looking at complex event processing for inspiration.
- Angluin’s L* algorithm is a method for learning Mealy (Reward) Machines in an active-learning setting. MRMs represent non-Markovian (temporally dependent) reward models. There is a relationship between non-Markovian reward models and partially observable MDPs (POMDPs). So one might think that one could learn a reward model using L* in a POMDP. In recent work, I could not get this to work. So the student is investigating this issue.
- One student is investigating general game playing (GGP) and how to extend the associated game description language (GDL) to deal with stochastic actions. GDL for imperfect information (II) already exists, which can deal with situations where not all players know what other players know, and where actions or moves have uniformly probabilistic effects (like a fair die). The student is extending GDL II to deal with non-uniform move effects. He is also implementing a user-friendly interface for his version of GGP.
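For the HPB topic in the list above, the basic step of updating a probability distribution over states after an action and an observation can be sketched with the standard POMDP belief update. This is a generic Python illustration, not the HPB implementation: the function names, the two states `left` and `right`, the action `stay` and the 0.8 sensor accuracy are all made up for the example.

```python
def update_belief(belief, action, observation, transition, observe_prob):
    """Standard POMDP belief update.

    b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s).
    `belief` maps each state to its probability.
    """
    states = list(belief)
    unnormalised = {}
    for s_next in states:
        # Prediction: probability of landing in s_next after the action.
        predicted = sum(transition(s, action, s_next) * belief[s] for s in states)
        # Correction: weight by how likely the observation is in s_next.
        unnormalised[s_next] = observe_prob(observation, s_next, action) * predicted
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s: p / total for s, p in unnormalised.items()}


def stay_transition(state, action, next_state):
    """Hypothetical dynamics: the 'stay' action never changes the state."""
    return 1.0 if state == next_state else 0.0


def noisy_sensor(observation, state, action):
    """Hypothetical sensor that reports the true state with probability 0.8."""
    return 0.8 if observation == state else 0.2


prior = {"left": 0.5, "right": 0.5}
posterior = update_belief(prior, "stay", "left", stay_transition, noisy_sensor)
```

Starting from a uniform belief, observing `left` once shifts most of the probability mass to the `left` state. Handling a stream of observations would mean applying such an update repeatedly, or replacing it with something more suitable, which is what the topic is about.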
This is the start of my third and last year at KU Leuven as a postdoc.
Eventually, we published a paper about learning non-Markovian reward machines in an active learning setting. So, in an MDP, if the agent’s rewards have temporal dependencies, we learn a finite state machine that models this temporal reward behaviour.
At the moment, we are finishing up a project about learning safety properties given a set of MDP states labeled as either safe or dangerous. What makes our work novel is that we are in a relational MDP setting, and the safety properties are expressed as probabilistic computation tree logic (pCTL) formulae.
I’m working on the topic of an agent learning a non-Markovian reward model, given the agent’s interactions with its environment, a Markov decision process (MDP). It is taking me longer than I hoped to get results.
I am also the daily advisor of three Masters students. The topics are:
- Extending and developing a better understanding of my work on “probabilistic belief revision via similarity of worlds modulo evidence”.
- Investigating probabilistic belief update and an analysis of partially observable MDP (POMDP) and dynamic Bayesian Networks (DBN) methods to do so.
- A simplification, analysis and implementation of my work on maximising expected impact of an agent in a network of potentially hostile agents.
I am also involved in setting up one of the topics and evaluating students in this semester’s Capita Selecta course. The topic I am leading is Safe AI and Reinforcement Learning.
I arrived in Leuven, Belgium (with my wife and two cats) on the first day of 2019.
I’ll be working in the VeriLearn project, that is, Verifying Learning in Artificial Intelligence.
I’m in Luc De Raedt’s group, which is part of the DTAI group in the Computer Science department.
One full paper was accepted, and a shorter paper was also accepted that will be presented as a poster at the German conference on AI. The former is about investigations into generalizing approaches to probabilistic belief revision, going from Bayesian conditioning to Lewis imaging. The latter is about how to formally describe how agents should act given that their impact on each other could be positive or negative to some degree. A notion of reputation is used.
The third paper – about probabilistic belief update – was accepted at a co-located workshop called Formal and Cognitive Reasoning.
I’ll present the papers and poster in Berlin in September.