Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment without additional interactions.
Referred to as zero-shot reinforcement learning (RL), this ability remains elusive for general-purpose RL algorithms. While recent works have attempted to produce zero-shot RL agents, they rely on assumptions about the nature of the tasks or the structure of the Markov decision process (MDP).
We present the Proto Successor Measure: the basis set for all possible behaviors of an RL agent in a dynamical system. We prove that the visitation distribution of any possible behavior can be expressed as an affine combination of these policy-independent basis functions. Given a reward function at test time, we simply need to find the set of linear weights that combines these bases into the visitation distribution of the optimal policy.
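As an illustrative sketch (the notation here is ours, not fixed by the abstract): writing $d^\pi$ for the discounted state-action visitation distribution of a policy $\pi$, the claim is that there exist a policy-independent basis $\Phi$ and offset $d_0$ such that every achievable $d^\pi$ takes the form
\[
d^\pi = d_0 + \Phi w^\pi \quad \text{for some weights } w^\pi,
\]
so that, for a test-time reward $r$, zero-shot planning reduces to
\[
w^\star \in \arg\max_{w}\; r^\top\!\left(d_0 + \Phi w\right) \quad \text{s.t.}\quad d_0 + \Phi w \ge 0 .
\]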
We derive a practical algorithm to learn these basis functions using reward-free interaction data from the environment and show that our approach can produce the optimal policy at test time for any given reward function without additional environmental interactions.
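A minimal test-time sketch of this idea in the tabular setting, assuming a learned basis matrix Phi and particular solution d0 as in the equations above (the variable names and the use of an off-the-shelf linear program are our illustration, not the paper's stated implementation): given a reward vector r, the weights are found by a small LP and the policy is read off the resulting visitation distribution.

```python
# Hypothetical test-time step: Phi spans the policy-independent solution set of the
# Bellman-flow constraints, d0 is one particular solution, r is the test-time reward.
import numpy as np
from scipy.optimize import linprog

def zero_shot_policy(Phi, d0, r, num_states, num_actions):
    """Phi: (S*A, k) basis, d0: (S*A,) particular solution, r: (S*A,) reward vector."""
    k = Phi.shape[1]
    # maximize r^T (d0 + Phi w)  <=>  minimize -(Phi^T r)^T w
    res = linprog(
        c=-(Phi.T @ r),
        A_ub=-Phi,                    # enforce d0 + Phi w >= 0 (valid visitation)
        b_ub=d0,
        bounds=[(None, None)] * k,    # weights are unconstrained
        method="highs",
    )
    d = (d0 + Phi @ res.x).reshape(num_states, num_actions)  # optimal visitation
    # pi(a|s) is proportional to d(s, a); take the most-visited action per state
    return d.argmax(axis=1)
```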