This visualization compares Set Reinforcement Learning (Set RL) and Standard RL, two frameworks whose objectives aim to address the exploration–exploitation tradeoff during training. The central quantity we plot is the advantage assigned to a fixed generation y under each framework: positive values mean y gains probability mass under the next policy update, negative values mean it loses mass.
In Set RL, this is the marginal advantage: the expected score of sets containing y, minus the expected score over all sets. In Standard RL, it is the classical advantage: the score of y itself, minus the average score of all generations. The key conceptual difference is that Set RL evaluates generations collectively: all generations within a set are coupled through a shared set score, so each generation's value depends on the others it appears alongside. Standard RL, by contrast, scores each generation in isolation.
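To make these two definitions concrete, here is one way to write them in symbols. The notation is ours, introduced for this sketch: $G$ for a sampled set of generations, $S(\cdot)$ for the set-level score, and $r(\cdot)$ for the per-generation score.

```latex
% Marginal advantage under Set RL: how much sets containing y
% outscore the average set.
A_{\text{set}}(y) \;=\; \mathbb{E}\big[\,S(G) \,\big|\, y \in G\,\big] \;-\; \mathbb{E}\big[\,S(G)\,\big]

% Classical advantage under Standard RL: how much y itself
% outscores the average generation.
A_{\text{std}}(y) \;=\; r(y) \;-\; \mathbb{E}_{y'}\big[\,r(y')\,\big]
```

Note that $A_{\text{set}}(y)$ depends on the scores of every other generation y can co-occur with, while $A_{\text{std}}(y)$ depends only on y's own score relative to the mean.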
Throughout this visualization, the set size is held fixed at a single constant value.
The polychromic objective evaluates generations through the lens of the sets they belong to. This leads to two distinctive properties.
The diversity bonus objective treats reward and diversity as independent, additive quantities.
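"Independent and additive" can be sketched as follows; the weight $\lambda$ and diversity term $D$ are illustrative placeholders rather than the document's own notation:

```latex
% Diversity bonus: expected reward plus a separately weighted
% diversity term; the two quantities do not interact.
J_{\text{bonus}}(\pi) \;=\; \mathbb{E}_{y \sim \pi}\big[\,r(y)\,\big] \;+\; \lambda \, D(\pi)
```

Because the two terms are summed, improving one never changes how the other is measured, in contrast to the polychromic objective, where a generation's contribution is always evaluated through its set.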
Let us look at the exact equation for the advantage under each framework, shown below:
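As a self-contained toy illustration of the two advantage definitions (not the authors' implementation), the sketch below enumerates every size-k set drawn from a small pool of generations and computes both quantities for a fixed generation y. The rewards, the pool, and the choice of a max-reward set score are all assumptions made for the example:

```python
from itertools import combinations
from statistics import mean

# Toy per-generation rewards; these values are illustrative only.
rewards = {"y": 0.4, "a": 0.9, "b": 0.1, "c": 0.3}
k = 2  # illustrative set size

def set_score(subset):
    # Assumed set-level score: the best reward in the set.
    # (One plausible "collective" scoring rule, chosen for this sketch.)
    return max(rewards[g] for g in subset)

all_sets = list(combinations(rewards, k))
sets_with_y = [s for s in all_sets if "y" in s]

# Marginal advantage (Set RL): expected score of sets containing y,
# minus the expected score over all sets.
marginal_adv = (mean(set_score(s) for s in sets_with_y)
                - mean(set_score(s) for s in all_sets))

# Classical advantage (Standard RL): y's own reward minus the mean reward.
classical_adv = rewards["y"] - mean(rewards.values())

print(round(marginal_adv, 4), round(classical_adv, 4))
```

Here y has a middling reward, so both advantages come out negative, but they differ in magnitude: the marginal advantage also reflects which stronger or weaker generations y tends to appear alongside, not just y's own score.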