Recovering Classical RL from GRL¶
Purpose: Demonstrate that traditional RL algorithms are special cases of GRL
Audience: Classical RL researchers, practitioners, skeptics
Goal: Bridge the gap between familiar methods and the GRL framework
Executive Summary¶
Key Claim: Generalized Reinforcement Learning (GRL) is not a replacement for classical RL—it's a unifying framework that recovers existing methods as special cases while enabling new capabilities.
Why This Matters:
- Adoption: Researchers trust frameworks that subsume what they already know
- Validation: Recovering DQN/PPO/SAC exactly is strong evidence that the framework is sound
- Innovation: Once the connection is clear, extensions become natural
What You'll Learn:
- How Q-learning emerges from GRL with discrete actions
- How DQN is GRL with neural network approximation
- How Policy Gradients (REINFORCE) follow from the energy landscape
- How Actor-Critic (PPO, SAC) naturally arise in GRL
- How RLHF for LLMs is GRL applied to language modeling
Table of Contents¶
- The GRL→Classical RL Dictionary
- Recovery 1: Q-Learning
- Recovery 2: DQN (Deep Q-Network)
- Recovery 3: REINFORCE (Policy Gradient)
- Recovery 4: Actor-Critic (PPO, SAC)
- Recovery 5: RLHF for LLMs
- What GRL Adds Beyond Classical RL
- Implementation: From GRL to Classical
1. The GRL→Classical RL Dictionary¶
| Classical RL Concept | GRL Equivalent | Notes |
|---|---|---|
| State \(s\) | State \(s\) | Same |
| Discrete Action \(a \in \mathcal{A}\) | Fixed parametric mapping \(\theta_a\) | One \(\theta\) per discrete action |
| Continuous Action \(a \in \mathbb{R}^d\) | Action parameters \(\theta \in \mathbb{R}^d\) | Direct correspondence |
| Q-function \(Q(s, a)\) | Reinforcement field \(Q^+(s, \theta)\) | Evaluated at discrete \(\theta_a\) |
| Replay Buffer \(\mathcal{D}\) | Particle Memory \(\Omega\) | Particles are weighted experiences |
| Experience \((s, a, r, s')\) | Particle \((z, w)\) where \(z=(s,\theta)\), \(w=r\) | Single transition |
| TD Target \(y = r + \gamma \max_{a'} Q(s', a')\) | MemoryUpdate with TD target | Belief transition |
| Policy \(\pi(a\|s)\) | Boltzmann over \(Q^+(s, \cdot)\) | Temperature-controlled sampling |
| Value Function \(V(s)\) | \(\max_\theta Q^+(s, \theta)\) | Maximum over action parameters |
| Exploration \(\epsilon\)-greedy | Temperature \(\beta\) in Boltzmann | Smooth instead of discrete |
Key Insight: Classical RL is GRL with:
- Discrete or fixed action spaces
- Tabular or neural network approximation of the field
- Specific choices of update rules (a minimal sketch of this mapping follows below)
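To make the dictionary concrete, here is a minimal sketch of the last two columns: discrete actions become fixed parameter vectors, and the policy is Boltzmann sampling over the field. Plain NumPy; the names `theta_table`, `q_field`, and `boltzmann_policy` are illustrative, not part of any GRL package.

```python
import numpy as np

# Discrete actions a1..aK become fixed parameter vectors θ_1..θ_K.
theta_table = {"a1": np.array([0.0]), "a2": np.array([1.0])}

def boltzmann_policy(q_field, s, beta=1.0):
    """π(a|s) ∝ exp(β Q+(s, θ_a)): temperature-controlled sampling
    replaces ε-greedy exploration."""
    actions = list(theta_table)
    q = np.array([q_field(s, theta_table[a]) for a in actions])
    p = np.exp(beta * (q - q.max()))  # numerically stabilized softmax
    p /= p.sum()
    return actions, p

# Toy field: Q+(s, θ) evaluated only at the discrete θ_a points.
q_field = lambda s, theta: float(-abs(theta[0] - 0.3))
actions, probs = boltzmann_policy(q_field, s="s1", beta=5.0)
```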
2. Recovery 1: Q-Learning¶
Classical Q-Learning¶
Setup:
- State space: \(\mathcal{S}\)
- Action space: \(\mathcal{A} = \{a_1, \ldots, a_K\}\) (discrete, finite)
- Q-function: \(Q(s, a)\) for each \((s, a)\) pair
Update Rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Policy: \(\epsilon\)-greedy or softmax over \(Q(s, \cdot)\)
GRL Version¶
Setup:
- State space: \(\mathcal{S}\) (same)
- Action space: \(\mathcal{A} = \{a_1, \ldots, a_K\}\) (discrete)
- Map each discrete action to a parameter: \(\theta_1, \ldots, \theta_K\) (fixed)
- Augmented space: \(\mathcal{Z} = \mathcal{S} \times \{\theta_1, \ldots, \theta_K\}\)
- Reinforcement field: \(Q^+(s, \theta_i)\) evaluated only at discrete points \(\theta_i\)
Particle Memory:
- Each experience \((s, a_i, r, s')\) creates particle \((z_i, w_i)\) where \(z_i = (s, \theta_i)\), \(w_i = r\)
MemoryUpdate:
- Add particle \((z_t, w_t)\) where \(z_t = (s_t, \theta_{a_t})\), \(w_t = r_t\)
- With no kernel association (set \(k(z, z') = \delta(z, z')\)), MemoryUpdate reduces to:

$$Q^+(s, \theta_a) \leftarrow Q^+(s, \theta_a) + \alpha \left[ y_t - Q^+(s, \theta_a) \right]$$

where \(y_t = r_t + \gamma \max_{a'} Q^+(s', \theta_{a'})\)
This is exactly Q-learning!
Key Takeaways¶
Q-learning is GRL with:
- Discrete action space (finite \(\{\theta_i\}\))
- Delta kernel (no generalization between actions)
- Tabular representation (store \(Q\) for each state-action pair)
What GRL adds:
- Generalization via non-trivial kernels: \(k((s, \theta_i), (s, \theta_j)) > 0\) for \(i \neq j\) (sketched below)
- Continuous interpolation: \(Q^+(s, \theta)\) defined for all \(\theta\), not just discrete actions
- Weighted particles: Experience importance via \(w_i\)
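The generalization claim deserves a concrete form. Below is a sketch of a particle-defined field under an assumed RBF kernel, using Nadaraya-Watson smoothing as one standard realization; `rbf` and `q_plus` are illustrative names, not the GRL codebase API.

```python
import numpy as np

particles = []  # list of ((s, theta), w) pairs

def rbf(z1, z2, ell=0.5):
    """Kernel over the augmented space: exact match on state,
    smooth similarity over action parameters θ."""
    s1, t1 = z1
    s2, t2 = z2
    return float(s1 == s2) * np.exp(-np.sum((t1 - t2) ** 2) / (2 * ell**2))

def q_plus(z):
    """Kernel-weighted average of particle weights: nearby θ share credit."""
    if not particles:
        return 0.0
    k = np.array([rbf(z, zi) for zi, _ in particles])
    w = np.array([wi for _, wi in particles])
    return float(k @ w / (k.sum() + 1e-8))

particles.append((("s1", np.array([0.0])), 1.0))  # experience at θ = 0
print(q_plus(("s1", np.array([0.2]))))            # nonzero: generalizes to θ = 0.2
```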
3. Recovery 2: DQN (Deep Q-Network)¶
Classical DQN¶
Setup:
- Q-function approximated by neural network: \(Q_\psi(s, a)\)
- Experience replay buffer: \(\mathcal{D} = \{(s_i, a_i, r_i, s_i')\}\)
- Target network: \(Q_{\psi^-}\) (delayed copy)
Update Rule:

$$\psi \leftarrow \psi - \eta \nabla_\psi \, \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ (Q_\psi(s, a) - y)^2 \right]$$

where \(y = r + \gamma \max_{a'} Q_{\psi^-}(s', a')\)
GRL Version¶
Setup:
- Field approximator: Neural network \(Q_\psi(s, \theta)\) approximates reinforcement field
- Particle memory: \(\Omega = \{(z_i, w_i)\}\) where \(z_i = (s_i, \theta_i)\)
- Kernel: Implicit kernel induced by neural network architecture
Update Rule: Sample particles from \(\Omega\) and compute TD targets:

$$\psi \leftarrow \psi - \eta \nabla_\psi \, \mathbb{E}_{(z,w) \sim \Omega} \left[ (Q_\psi(z) - y)^2 \right]$$

where \(y = w + \gamma \max_{\theta'} Q_\psi(s', \theta')\)
Target network: Optional, same as DQN
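A minimal PyTorch sketch of this update, assuming a small `FieldNet` module and a grid of candidate \(\theta'\) for the max (both illustrative assumptions; the target network is omitted for brevity):

```python
import torch
import torch.nn as nn

class FieldNet(nn.Module):
    """Q_ψ(s, θ): a neural approximation of the reinforcement field."""
    def __init__(self, state_dim, theta_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + theta_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, theta):
        return self.net(torch.cat([s, theta], dim=-1)).squeeze(-1)

field = FieldNet(state_dim=4, theta_dim=1)
opt = torch.optim.Adam(field.parameters(), lr=1e-3)

# One batch of particles (z, w) with z = (s, θ); θ' grid for the max.
s, theta = torch.randn(32, 4), torch.randn(32, 1)
w, s_next = torch.randn(32), torch.randn(32, 4)
theta_grid = torch.linspace(-1, 1, 11).unsqueeze(-1)  # candidate θ'

with torch.no_grad():
    q_next = torch.stack([field(s_next, t.expand(32, 1)) for t in theta_grid])
    y = w + 0.99 * q_next.max(dim=0).values           # TD target

loss = ((field(s, theta) - y) ** 2).mean()            # (Q_ψ(z) − y)²
opt.zero_grad(); loss.backward(); opt.step()
```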
Key Takeaways¶
DQN is GRL with:
- Neural network approximation of the reinforcement field
- Experience replay = particle memory sampling
- Discrete actions (typically)
What GRL adds:
- Explicit particle representation: Particles are not just for replay, they define the field
- Kernel interpretation: Neural network as implicit kernel
- Continuous action generalization: \(Q_\psi(s, \theta)\) for any \(\theta\)
4. Recovery 3: REINFORCE (Policy Gradient)¶
Classical REINFORCE¶
Setup:
- Policy: \(\pi_\phi(a|s)\) (parameterized, e.g., neural network)
- Objective: \(J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi} [R(\tau)]\) (expected return)
Update Rule (score function gradient):

$$\nabla_\phi J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi} \left[ \sum_t \nabla_\phi \log \pi_\phi(a_t \mid s_t) \cdot G_t \right]$$
where \(G_t = \sum_{t'=t}^T \gamma^{t'-t} r_{t'}\) (return from time \(t\))
GRL Version¶
Setup:
- Reinforcement field: \(Q^+(s, \theta)\)
- Policy: Boltzmann distribution over the field

$$\pi(a \mid s) = \frac{\exp(\beta \, Q^+(s, \theta_a))}{\int \exp(\beta \, Q^+(s, \theta')) \, d\theta'}$$
Gradient of Expected Return:
If we parameterize the field as \(Q^+(s, \theta) = Q_\phi(s, \theta)\), the GRL policy gradient weights the field gradient by the return (or an advantage estimate):

$$\nabla_\phi J(\phi) = \mathbb{E}_{\pi} \left[ \nabla_\phi Q_\phi(s, \theta) \cdot G_t \right]$$
Connection to REINFORCE:
For a Boltzmann policy \(\pi_\phi(a \mid s) \propto \exp(\beta \, Q_\phi(s, \theta_a))\), the score function is

$$\nabla_\phi \log \pi_\phi(a \mid s) = \beta \left[ \nabla_\phi Q_\phi(s, \theta_a) - \mathbb{E}_{a' \sim \pi_\phi} \nabla_\phi Q_\phi(s, \theta_{a'}) \right]$$

that is, the field gradient re-centered by its policy average. The centering term acts as a baseline and does not bias the gradient estimator, so the REINFORCE update and the field-gradient update coincide up to the temperature \(\beta\).
This recovers REINFORCE!
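The identity above can be checked numerically. The sketch below uses a toy linear field in plain PyTorch (no GRL package assumed) and verifies that the autograd score function equals the policy-centered field gradient:

```python
import torch

phi = torch.randn(3, requires_grad=True)  # field parameters
features = torch.randn(4, 3)              # one feature vector per θ_a
beta = 2.0

q = features @ phi                        # Q_φ(s, θ_a) for each action
log_pi = beta * q - torch.logsumexp(beta * q, dim=0)

a = 1                                     # chosen action
score, = torch.autograd.grad(log_pi[a], phi)

# β [∇Q(s, θ_a) − E_π ∇Q]: for a linear field, ∇_φ Q = features
pi = torch.softmax(beta * q, dim=0).detach()
centered = beta * (features[a] - pi @ features)

assert torch.allclose(score, centered, atol=1e-5)
```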
Key Takeaways¶
REINFORCE is GRL with:
- Boltzmann policy derived from energy landscape
- Direct parameterization of the field (or policy)
- Monte Carlo returns (\(G_t\)) as targets
What GRL adds:
- Energy interpretation: \(Q^+ = -E\) provides physics-inspired regularization
- Particle-based updates: No need for full gradient, use particle approximation
- Smooth action selection: Temperature \(\beta\) controls exploration naturally
5. Recovery 4: Actor-Critic (PPO, SAC)¶
Classical Actor-Critic¶
Setup:
- Actor: Policy \(\pi_\phi(a|s)\)
- Critic: Value function \(V_\psi(s)\) or \(Q_\psi(s, a)\)
Update:
- Critic: TD learning

$$\psi \leftarrow \psi - \eta \nabla_\psi \left[ Q_\psi(s, a) - (r + \gamma V_\psi(s')) \right]^2$$

- Actor: policy gradient with advantage

$$\phi \leftarrow \phi + \eta \nabla_\phi \log \pi_\phi(a \mid s) \cdot A(s, a)$$

  where \(A(s, a) = Q(s, a) - V(s)\) is the advantage
Variants:
- PPO: Clipped objective, KL penalty
- SAC: Entropy regularization, temperature tuning
GRL Version¶
Setup:
- Critic: Reinforcement field \(Q^+(s, \theta)\) (this is the "critic")
- Actor: Policy inferred from the field via Boltzmann sampling, \(\pi(\theta \mid s) \propto \exp(\beta \, Q^+(s, \theta))\)
Update:
- Field (Critic): RF-SARSA (two-layer TD system)
  - Primitive layer: TD learning for discrete transitions
  - GP layer: smooth field over the augmented space
- Policy (Actor): derived automatically from the field
  - No separate actor parameters!
  - Policy gradient = field gradient
Connection:
- \(Q^+(s, \theta)\) plays the role of both Q-function and value function
- Boltzmann policy automatically balances exploitation (high \(Q^+\)) and exploration (low \(\beta\))
- Advantage: \(A(s, \theta) = Q^+(s, \theta) - \max_{\theta'} Q^+(s, \theta')\)
This recovers Actor-Critic!
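A short sketch of the "no separate actor" point: both the Boltzmann actor and the advantage are read off a single field. `q_field` and `actor_and_advantage` are illustrative stand-ins, not GRL APIs:

```python
import numpy as np

def actor_and_advantage(q_field, s, thetas, beta=1.0):
    q = np.array([q_field(s, th) for th in thetas])
    pi = np.exp(beta * (q - q.max()))
    pi /= pi.sum()               # Boltzmann actor: π(θ|s) ∝ exp(β Q+)
    advantage = q - q.max()      # A(s, θ) = Q+(s, θ) − max_θ' Q+(s, θ')
    return pi, advantage

thetas = np.linspace(-1, 1, 5)
q_field = lambda s, th: -(th - 0.5) ** 2   # toy field
pi, adv = actor_and_advantage(q_field, s=None, thetas=thetas, beta=4.0)
```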
Key Takeaways¶
Actor-Critic is GRL with:
- Reinforcement field as critic
- Boltzmann policy as actor (no separate parameters)
- RF-SARSA as update rule
What GRL adds:
- Unified representation: No need for separate actor and critic
- Automatic exploration: Temperature \(\beta\) replaces entropy regularization
- Particle-based: Memory naturally handles off-policy data
Special Cases:
- PPO: GRL with clipped field updates, on-policy sampling
- SAC: GRL with entropy term in field (equivalent to temperature)
6. Recovery 5: RLHF for LLMs¶
Classical RLHF (Reinforcement Learning from Human Feedback)¶
Setup (e.g., for ChatGPT):
- LLM: \(\pi_\phi(a_t | s_t)\) where \(s_t\) = (prompt, response so far), \(a_t\) = next token
- Reward Model: \(r_\theta(s, a)\) learned from human preferences
- Algorithm: PPO or similar policy gradient method
Update (PPO): the clipped surrogate with a KL penalty toward the reference model:

$$\mathcal{L}(\phi) = \mathbb{E}_{(s,a) \sim \pi_\phi} \left[ \min\!\left( \rho A,\ \operatorname{clip}(\rho,\, 1 - \epsilon,\, 1 + \epsilon) \, A \right) \right] - \lambda \, D_{\mathrm{KL}}\!\left( \pi_\phi \,\Vert\, \pi_{\text{ref}} \right)$$

where:
- \(\rho = \pi_\phi(a \mid s) / \pi_{\text{old}}(a \mid s)\) is the importance ratio and \(A\) the advantage estimated from the reward model
- the KL term penalizes drift from the reference (pre-RLHF) policy
GRL Version¶
Setup:
- State: \(s_t\) = (prompt, partial response)
- Action: \(\theta_t\) = token ID or logit vector (discrete or continuous parameterization)
- Augmented space: \((s_t, \theta_t)\) in semantic embedding space
- Reinforcement field: \(Q^+(s_t, \theta_t)\) = expected reward for generating token \(\theta_t\) in context \(s_t\)
Formulation:
Option 1: Discrete Tokens (classical RL recovery)
- \(\theta_t \in \{1, \ldots, V\}\) (vocabulary size \(V\))
- Field: \(Q^+(s_t, \theta_t)\) for each token
- Policy: softmax over the field

$$\pi(\theta_t \mid s_t) = \frac{\exp(\beta \, Q^+(s_t, \theta_t))}{\sum_{\theta'} \exp(\beta \, Q^+(s_t, \theta'))}$$

Option 2: Continuous Parameterization (GRL extension)
- \(\theta_t \in \mathbb{R}^d\) = token embedding or logit vector
- Field: \(Q^+(s_t, \theta_t)\) smooth over embedding space
- Policy: sample from a continuous distribution over \(\theta\), map to the nearest token

Update: RF-SARSA with human feedback as rewards
- Particle memory stores (prompt, response, reward) tuples
- Kernel generalizes across similar prompts/responses
- Field learns \(Q^+(s, \theta)\) via TD learning
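For Option 1, a minimal PyTorch sketch of the field-as-logits view: the field over the vocabulary plays the role of token logits, and \(\beta\) recovers the usual sampling temperature. `FieldHead` and the dimensions are placeholders (toy sizes; a real vocabulary would be ~50K), not a released RLHF stack:

```python
import torch
import torch.nn as nn

V, d = 100, 32  # toy vocab size and context embedding dim

class FieldHead(nn.Module):
    """Q+(s_t, θ) for every token θ in the vocabulary, from context s_t."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(d, V)

    def forward(self, context_emb):  # (batch, d) -> (batch, V)
        return self.proj(context_emb)

field = FieldHead()
s_t = torch.randn(1, d)              # embedding of (prompt, partial response)
beta = 1.0

pi = torch.softmax(beta * field(s_t), dim=-1)      # π(θ_t|s_t) ∝ exp(β Q+)
next_token = torch.multinomial(pi, num_samples=1)  # sample the next token
```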
Advantages of GRL for RLHF¶
1. Off-Policy Learning
- Particle memory enables replay
- Sample efficiency: learn from all past human feedback
- Classical RLHF (PPO) is on-policy only
2. Smooth Generalization
- Kernel similarity between prompts
- Transfer value across related contexts
- Fewer human labels needed
3. Uncertainty Quantification
- Sparse particles = high uncertainty
- Exploration naturally targets uncertain regions
- Safety: avoid high-stakes decisions with low confidence
4. Interpretability
- Energy landscape over prompt space
- Visualize which responses are preferred
- Particle inspection: "Why did you say that?"
Implementation Path¶
Phase 1: Discrete Tokens (Q1 2026)
- Implement GRL on small model (GPT-2)
- Reproduce PPO results on standard RLHF benchmarks
- Show GRL recovers classical behavior
Phase 2: Comparison (Q2 2026)
- Compare sample efficiency: GRL vs. PPO
- Measure stability: fewer human labels needed?
- Quantify uncertainty: does GRL know what it doesn't know?
Phase 3: Scale (Q3 2026)
- Apply to larger model (LLaMA-7B or Mistral-7B)
- Test on real human feedback datasets
- Submit paper: "GRL for LLM Fine-tuning"
Key Takeaways¶
RLHF is GRL with:
- Discrete action space (tokens)
- On-policy updates (PPO)
- Neural network approximation of field
What GRL adds for RLHF:
- Off-policy learning (replay buffer of human feedback)
- Kernel generalization (transfer across prompts)
- Uncertainty (exploration where most uncertain)
- Interpretability (energy landscapes, particle inspection)
Strategic Impact of demonstrating GRL on RLHF:
- Validates GRL on the most commercially relevant RL problem
- Opens door to industry adoption (OpenAI, Anthropic, Meta)
- Natural bridge to scaling research
7. What GRL Adds Beyond Classical RL¶
While GRL recovers classical methods, it also enables capabilities that are difficult or impossible in standard RL:
1. Continuous Action Generalization¶
Classical RL: Discretize continuous actions or use neural networks
GRL: Smooth field \(Q^+(s, \theta)\) over continuous \(\theta\) via kernels

Example: Robot grasping
- Classical: Sample \(N\) discrete grasp poses, learn a Q-value for each
- GRL: Learn a smooth field over the continuous grasp space, interpolate between samples
2. Compositional Actions¶
Classical RL: Actions are atomic
GRL: Actions are operators that can be composed

Example: Multi-step manipulation (see the sketch below)
- Classical: Learn separate policies for "pick", "place", "push"
- GRL: Learn operators that compose: \(\hat{O}_{\text{place}} \circ \hat{O}_{\text{pick}}\)
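A sketch of operator composition as referenced above; the `Operator` class and the `@` overload are illustrative, not a GRL API:

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # e.g. {"object_pose": ..., "gripper": ...}

@dataclass
class Operator:
    name: str
    apply: Callable[[State], State]

    def __matmul__(self, other):  # Ô_a @ Ô_b ≙ Ô_a ∘ Ô_b
        return Operator(f"{self.name}∘{other.name}",
                        lambda s: self.apply(other.apply(s)))

pick  = Operator("pick",  lambda s: {**s, "gripper": "holding"})
place = Operator("place", lambda s: {**s, "gripper": "empty", "placed": True})

pick_and_place = place @ pick  # Ô_place ∘ Ô_pick
print(pick_and_place.apply({"gripper": "empty"}))
```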
3. Uncertainty Quantification¶
Classical RL: Uncertainty requires ensembles or Bayesian NNs
GRL: Particle sparsity directly indicates uncertainty

Example: Safe exploration (sketched below)
- Classical: Ensemble of Q-networks, high variance = uncertain
- GRL: Sparse particles = uncertain; avoid or explore based on risk
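A sketch of the sparsity-as-uncertainty idea from the example above: a kernel density over stored particles, where low density means few nearby experiences. Purely illustrative, not a calibrated estimator:

```python
import numpy as np

particles = np.array([[0.0, 0.1], [0.1, 0.0], [2.0, 2.0]])  # visited (s, θ)

def uncertainty(z, ell=0.3):
    """Inverse kernel density over particles: sparse region → near 1."""
    density = np.exp(-np.sum((particles - z) ** 2, axis=1) / (2 * ell**2)).sum()
    return 1.0 / (1.0 + density)

print(uncertainty(np.array([0.05, 0.05])))  # dense region: low uncertainty
print(uncertainty(np.array([5.0, 5.0])))    # far from all particles: high
```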
4. Energy-Based Regularization¶
Classical RL: Entropy regularization (SAC), KL penalties (PPO)
GRL: Energy function \(E = -Q^+\) naturally regularizes via a least-action principle

Example: Smooth, efficient policies
- Classical: Add an entropy bonus to the reward
- GRL: The energy landscape prefers smooth, low-energy paths (physics-inspired)
5. Particle-Based Interpretability¶
Classical RL: Black-box neural networks
GRL: Particles are interpretable experiences

Example: Debugging
- Classical: "Why did the policy fail?" → inspect millions of weights
- GRL: "Why did the policy fail?" → inspect nearby particles, visualize the energy landscape
6. Hierarchical Abstraction (Part II)¶
Classical RL: Hierarchical RL requires careful design
GRL: Concepts emerge via spectral clustering

Example: Long-horizon tasks (sketched below)
- Classical: Manually define options/skills
- GRL: Spectral decomposition discovers concepts automatically
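A sketch of the spectral mechanism referenced above: eigenvectors of a particle-similarity graph Laplacian split experiences into clusters ("concepts"). This is standard spectral clustering, assumed here as the mechanism Part II describes:

```python
import numpy as np

# Two groups of particles in the augmented space
Z = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 5])
W = np.exp(-((Z[:, None] - Z[None]) ** 2).sum(-1) / 2.0)  # similarity graph
L = np.diag(W.sum(1)) - W                                  # graph Laplacian

vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]                  # second eigenvector splits the graph
concepts = (fiedler > 0).astype(int)  # two emergent "concepts"
```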
8. Implementation: From GRL to Classical¶
Code Example: Q-Learning from GRL¶
```python
import torch

from grl.core import ParticleMemory, DeltaKernel
from grl.algorithms import memory_update  # the MemoryUpdate operator

# Classical Q-learning setup
state_space = ["s1", "s2", "s3"]
action_space = ["a1", "a2"]

# GRL setup: map discrete actions to fixed parameters
action_params = {
    "a1": torch.tensor([0.0]),  # θ_1
    "a2": torch.tensor([1.0]),  # θ_2
}

# Particle memory (GRL)
memory = ParticleMemory()

# Delta kernel (no generalization between actions)
kernel = DeltaKernel()

# Experience: (s, a, r, s')
s, a, r, s_next = "s1", "a1", 1.0, "s2"

# Convert to GRL format: z = (s, θ_a), w = r
z = (s, action_params[a])
w = r

# MemoryUpdate (GRL) ≡ Q-learning update (Classical)
memory = memory_update(memory, z, w, kernel, alpha=0.1)

# Query field (GRL) ≡ Q(s, a) (Classical)
Q_sa = memory.query((s, action_params[a]))
# This is Q-learning!
```
Code Example: DQN from GRL¶
```python
from grl.core import NeuralField, ParticleMemory
from grl.algorithms import compute_td_targets

num_episodes, max_steps = 500, 200

# Neural network approximates the reinforcement field Q_ψ(s, θ)
field = NeuralField(state_dim=4, action_dim=2, hidden_dim=64)

# Experience replay (GRL particle memory)
memory = ParticleMemory()

# Training loop (environment interaction and memory.add(...) omitted)
for episode in range(num_episodes):
    for step in range(max_steps):
        # Sample from memory (experience replay)
        batch = memory.sample(batch_size=32)

        # Compute TD targets (same as DQN)
        td_targets = compute_td_targets(batch, field, gamma=0.99)

        # Update field (GRL) ≡ update Q-network (DQN)
        loss = field.update(batch, td_targets)
# This is DQN!
```
Conclusion¶
GRL is a unifying framework that:
- ✅ Recovers classical RL (Q-learning, DQN, REINFORCE, PPO, SAC, RLHF)
- ✅ Extends to continuous actions (smooth generalization via kernels)
- ✅ Enables composition (operator algebra)
- ✅ Provides uncertainty (particle sparsity)
- ✅ Is naturally interpretable (energy landscapes, particles)
- ✅ Discovers structure (spectral concepts in Part II)
For practitioners: GRL gives you what you already know (classical RL), plus new tools for continuous control, uncertainty, and interpretability.
For researchers: GRL provides a principled foundation for understanding why existing methods work and how to extend them.
For industry: GRL applies to modern problems (RLHF for LLMs) while offering advantages (off-policy, uncertainty, sample efficiency).
Next Steps¶
Reproduce Classical Results:
- Implement Q-learning recovery on GridWorld
- Implement DQN recovery on CartPole
- Implement PPO recovery on continuous control (Pendulum)
- Implement RLHF recovery on small LLM (GPT-2)
Document Connections:
- Add "Classical RL Recovery" section to each tutorial chapter
- Create comparison tables (classical vs. GRL)
- Write blog post: "GRL: A Unifying Framework for RL"
Validate:
- Benchmark: GRL vs. DQN on Atari
- Benchmark: GRL vs. SAC on MuJoCo
- Benchmark: GRL vs. PPO on RLHF tasks
Scale:
- Apply GRL to LLaMA-7B fine-tuning
- Demonstrate advantages (sample efficiency, uncertainty)
- Submit paper: "GRL for LLM Fine-tuning"
Last Updated: January 14, 2026
See also: Implementation Roadmap | Research Roadmap