This paper is about the problem of learning a stochastic policy for generating an object (like a molecular graph) from a sequence of actions, such that the probability of generating an object is proportional to a given reward for that object. Whereas standard return maximization tends to converge to a single return-maximizing sequence, there are cases where we would like to sample a diverse set of high-return solutions. These arise, for example, in black-box function optimization when few rounds are possible, each with large batches of queries, where the batches should be diverse, e.g., in the design of new molecules. One can also see this as a problem of converting an energy function to a generative distribution. While MCMC methods can achieve that, they are expensive and generally only perform local exploration. Instead, training a generative policy amortizes the cost of search during training and yields fast generation. Using insights from Temporal Difference learning, we propose GFlowNet, based on a view of the generative process as a flow network, making it possible to handle the tricky case where different trajectories can yield the same final state, e.g., there are many ways to sequentially add atoms to generate some molecular graph. We cast the set of trajectories as a flow and convert the flow consistency equations into a learning objective, akin to the casting of the Bellman equations into Temporal Difference methods. We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution, and demonstrate the improved performance and diversity of GFlowNet on a simple domain where there are many modes to the reward function, and on a molecule synthesis task.
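The flow view can be illustrated on a toy graph. The sketch below is not the paper's implementation: the graph, the edge flows, the rewards, and all function names are hypothetical, and the squared log-ratio penalty is only one assumed way to score a flow-consistency violation. It shows that when the flow into every state equals its reward plus the flow out of it, a policy that picks each action in proportion to its edge flow reaches each final object with probability proportional to its reward, even though the object `ab` can be reached by two different trajectories.

```python
import math
from collections import defaultdict

# Hypothetical toy DAG: "ab" is reachable via "a" or via "b".
# Edge flows are chosen by hand so that flow consistency holds.
EDGE_FLOW = {
    ("s0", "a"): 2.5,
    ("s0", "b"): 1.5,
    ("a", "ab"): 1.5,
    ("b", "ab"): 0.5,
    ("a", "aa"): 1.0,
    ("b", "bb"): 1.0,
}
# Hypothetical rewards for the final objects (leaves of the DAG).
REWARD = {"ab": 2.0, "aa": 1.0, "bb": 1.0}
# A topological order of the states, starting at the source "s0".
TOPO_ORDER = ["s0", "a", "b", "ab", "aa", "bb"]


def inflow(state, edge_flow):
    """Total flow entering `state`."""
    return sum(f for (_, dst), f in edge_flow.items() if dst == state)


def outflow(state, edge_flow):
    """Total flow leaving `state`."""
    return sum(f for (src, _), f in edge_flow.items() if src == state)


def flow_matching_loss(edge_flow, reward):
    """Sum over non-source states of (log inflow - log(reward + outflow))^2."""
    states = {dst for (_, dst) in edge_flow}
    loss = 0.0
    for s in states:
        target = reward.get(s, 0.0) + outflow(s, edge_flow)
        loss += (math.log(inflow(s, edge_flow)) - math.log(target)) ** 2
    return loss


def terminal_distribution(edge_flow, reward, order, source="s0"):
    """Exact probability of ending at each final object when every action
    is sampled in proportion to the flow on its edge."""
    reach = defaultdict(float)
    reach[source] = 1.0
    for s in order:
        out = outflow(s, edge_flow)
        if out == 0.0:
            continue  # final object: nothing to propagate
        for (src, dst), f in edge_flow.items():
            if src == s:
                reach[dst] += reach[s] * f / out
    return {s: reach[s] for s in reward}


if __name__ == "__main__":
    print("loss:", flow_matching_loss(EDGE_FLOW, REWARD))
    print("terminal distribution:", terminal_distribution(EDGE_FLOW, REWARD, TOPO_ORDER))
```

In a learned setting the hand-written `EDGE_FLOW` table would be replaced by a parametric estimator trained to drive such a loss toward zero; here the table only serves to make the consistency condition concrete.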
Bengio et al. studied this question.