Set RL with Polychromic Objective versus Standard RL with Diversity Bonus

Set RL: optimize for the optimal set of trajectories with respect to the set objective f.
Standard RL: optimize for the optimal single trajectory with respect to the reward r.
About this visualization

This visualization compares Set Reinforcement Learning (Set RL) and Standard RL, two frameworks whose objectives aim to address the exploration–exploitation tradeoff during training. The central quantity we plot is the advantage assigned to a fixed generation y under each framework: positive values mean y gains probability mass under the next policy update, and negative values mean it loses mass.

In Set RL, this is the marginal advantage: the expected score of sets containing y minus the expected score of sets drawn entirely from the policy. In Standard RL, this is the classical advantage: the score of y itself minus the average score of all generations. The key conceptual difference is that Set RL evaluates generations collectively: all generations within a set are coupled under a shared set score, so each generation's value depends on the others it appears alongside. Standard RL, by contrast, scores each generation in isolation.
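As a minimal sketch of this distinction (with toy, hypothetical scoring functions r and f; the paper's exact objective is not reproduced here), the two advantages can be estimated by Monte Carlo sampling:

```python
import random

def classical_advantage(y, pool, r):
    """Standard RL: score of y minus the average score of sampled generations."""
    baseline = sum(r(g) for g in pool) / len(pool)
    return r(y) - baseline

def marginal_advantage(y, pool, f, n=4, trials=2000, seed=0):
    """Set RL: expected score of sets containing y, minus the
    expected score of sets drawn entirely at random."""
    rng = random.Random(seed)
    with_y = sum(f([y] + rng.sample(pool, n - 1)) for _ in range(trials)) / trials
    without = sum(f(rng.sample(pool, n)) for _ in range(trials)) / trials
    return with_y - without

# Toy example: generations are numbers, reward is identity,
# and the set score f is the best reward in the set.
pool = [0, 0, 0, 0]
print(classical_advantage(1, pool, lambda g: g))  # 1 scores above the all-zero pool
print(marginal_advantage(1, pool, max))           # adding 1 lifts every set's score
```

With a richer set score f that also values diversity, a zero-reward generation can still receive a positive marginal advantage, which is the key case the visualization highlights.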

Throughout this visualization, we fix the set size to n = 4.

Set RL with Polychromic Objective
Standard RL with Diversity Bonus

The polychromic objective evaluates generations through the lens of the sets they belong to. This leads to two distinctive properties.

  • It can increase the probability of an exploratory but failed generation (i.e. one with r(x, y) = 0), provided it contributes diversity to the sets it appears in and other generations in those sets contribute reward.
  • The marginal advantage explicitly depends on whether a generation helps the model align the two goals of exploitation and exploration. If a generation's presence in a set improves this alignment, it receives a positive update even if it is itself a failed generation.

The diversity bonus objective treats reward and diversity as independent, additive quantities.

  • The objective depends strongly on the hyperparameter λ that weighs these two goals.
  • The standard RL advantage does not take into account whether a particular generation better enables the model to align exploration and exploitation within a set: a generation's score does not depend on how the other generations perform.
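Because the bonus-augmented reward is r̃(x, y) = r(x, y) + λ · d(x, y) (as stated in the hyperparameter notes), the diversity-bonus advantage reduces to a simple additive expression. A minimal sketch (function name is hypothetical; the λ = 2.0 case is purely illustrative and exceeds the 0.5 maximum used in prior works):

```python
def diversity_bonus_advantage(r_y, d_y, mean_r, mean_d, lam=0.5):
    """Advantage under r_tilde = r + lam * d: the bonus-augmented score
    of y minus its expectation under the current policy."""
    return (r_y + lam * d_y) - (mean_r + lam * mean_d)

# A failed but maximally diverse generation (r=0, d=1) against an
# average pool (mean_r=0.5, mean_d=0.5) still has negative advantage
# unless lam is made large:
print(diversity_bonus_advantage(0.0, 1.0, 0.5, 0.5, lam=0.5))  # -0.25
print(diversity_bonus_advantage(0.0, 1.0, 0.5, 0.5, lam=2.0))  # 0.5
```

This makes the bullet above concrete: whether a zero-reward generation gains or loses probability mass here is decided entirely by the scalar λ, not by how the generation interacts with the rest of a set.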

Let us look at the exact equations for the advantage under each framework, shown below:

Set RL with Polychromic Objective
Marginal Advantage  (n = 4)
Standard RL with Diversity Bonus
Advantage
x-axis: average reward of π
y-axis: diversity contribution of y
z-axis: Advantage value
Translucent plane at z = 0
Orange line: zero-advantage boundary (advantage = 0)
Blue > 0: y gains probability mass
Red < 0: y loses probability mass
Hyperparameters
Shared — appear in both formulas
r(x, y) — reward of y (both) 0.00
The reward of the specific generation y we are tracking. Set r(y) = 0 to study the key case from the paper: can a zero-reward generation still gain probability mass?
𝔼Y[d(x, Y)] — expected diversity (both) 0.50
Expected diversity of a completely random set drawn from π. In standard-RL this equals 𝔼Y[d(x,Y)] by iterated expectation (same diversity notion).
Method-specific
Cov(y) — reward–diversity covariance in sets containing y (set-RL) +0.500
How much does the reward of Y2 co-vary with the diversity of the set {y, Y2,…,Yn}? Enters the advantage as + ¾ · Cov(y). Bounded in [−0.25, 0.25] since both variables lie in [0,1].
Cov1:n — reward–diversity covariance in fully random sets (set-RL) 0.00
How much does the reward of Y1 co-vary with the diversity of a fully random set {Y1,…,Yn}? Enters the advantage as − Cov1:n. Bounded in [−0.25, 0.25].
λ — diversity bonus weight (std-RL) 0.50
Coefficient on the diversity bonus r̃(x,y) = r(x,y) + λ · d(x,y). Controls the tilt of the advantage plane along the d(y) axis.
Default value is 0.5, which is the maximum weight used in prior works [1, 2].