
Chapter 07a: Beyond Discrete Actions — Continuous Policy Inference

Purpose: Address the discrete action bottleneck and explore fully continuous GRL formulations
Prerequisites: Chapter 07 (RF-SARSA)
Key Concepts: Continuous action spaces, Langevin sampling, actor-critic in RKHS, action discovery, learned embeddings


Introduction

Chapter 7 presented RF-SARSA as the core learning algorithm for GRL. However, the algorithm as specified has a critical limitation that constrains its applicability:

The Discrete Action Assumption: RF-SARSA requires a finite set of primitive actions \(\mathcal{A} = \{a^{(1)}, \ldots, a^{(n)}\}\) to query the field \(Q^+(s, a^{(i)})\) during policy inference.

This creates two problems:

  1. Manual Action Design: Requires hand-crafted mapping \(f_{A^+}: a^{(i)} \mapsto x_a^{(i)}\) from primitive actions to parametric representation
  2. Scalability: For high-dimensional action parameters \(\theta \in \mathbb{R}^{d_a}\), enumerating all actions is intractable

Example: In the 2D navigation domain (original GRL paper):

  • Primitive actions: move in 12 directions (like a clock: 0°, 30°, 60°, ..., 330°)
  • Manual mapping: each direction → angle \(\theta \in [0, 2\pi)\)
  • Limitation: What if the optimal action is at angle \(\pi/7 \approx 25.7°\) (not in the discrete set)?

This chapter explores solutions that eliminate discrete actions entirely, enabling fully continuous policy inference in parametric action spaces.


1. The Problem: Discrete Actions as a Bottleneck

1.1 Where Discrete Actions Enter RF-SARSA

In Algorithm 2 (Chapter 7), discrete actions appear in two places:

Step 6: Field-Based Action Evaluation

For each a^(i) ∈ A:
    Form z^(i) = (s, x_a^(i))
    Query Q+(z^(i)) via GPR
Select a via Boltzmann policy

Step 10: Primitive SARSA Update

δ = r + γ Q(s', a') - Q(s, a)
Q(s, a) ← Q(s, a) + α δ

Both steps assume \(a \in \mathcal{A}\) is discrete.


1.2 Why This Is Limiting

Problem 1: Manual Feature Engineering

The mapping \(f_{A^+}: a \mapsto x_a\) must be designed by hand:

  • 2D navigation: direction → angle
  • Robotic reaching: discrete waypoints → joint angles
  • Continuous control: requires discretizing continuous action space

This defeats the purpose of parametric actions—you still need domain expertise!

Problem 2: Curse of Dimensionality

For high-dimensional actions \(\theta \in \mathbb{R}^{d_a}\):

  • Discretizing each dimension with \(k\) values → \(k^{d_a}\) primitive actions
  • Example: \(d_a = 10\), \(k = 10\) → \(10^{10}\) actions (intractable!)

Problem 3: Suboptimal Actions

Optimal action might lie between discrete choices:

  • Discrete set: \(\{30°, 45°, 60°\}\)
  • Optimal: \(42°\) (not representable!)

1.3 The SARSA Constraint

Why does RF-SARSA use discrete actions? Because SARSA does.

Original SARSA (Rummery & Niranjan, 1994): \[Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]\]

This assumes \(a, a'\) are indexable (discrete) so you can store \(Q(s, a)\) in a table or lookup structure.

For continuous actions \(\theta \in \mathbb{R}^{d_a}\), you can't index \(Q(s, \theta)\) this way!


2. Solution 1: Continuous SARSA via Langevin Sampling

Key insight: We don't need primitive actions—we can sample directly from the continuous field \(Q^+(s, \theta)\) using gradient-based sampling.

2.1 Langevin Dynamics Refresher (from Chapter 03a)

Recall from the least action principle that optimal actions follow gradient flow:

\[\theta_{t+1} = \theta_t + \epsilon \nabla_\theta Q^+(s, \theta_t) + \sqrt{2\epsilon\lambda} \, \xi_t\]

where:

  • \(\nabla_\theta Q^+(s, \theta)\) = gradient of field w.r.t. action parameters (via Riesz representer, Chapter 04a)
  • \(\lambda\) = temperature (exploration)
  • \(\xi_t \sim \mathcal{N}(0, I)\) = Gaussian noise

This is Langevin Monte Carlo sampling from the Boltzmann distribution \(\pi(\theta | s) \propto \exp(Q^+(s, \theta) / \lambda)\).

Advantage: No discrete action set needed! Sample \(\theta\) directly from the field.
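
As a concrete illustration, here is a minimal NumPy sketch of this sampler. It assumes a user-supplied function grad_q(s, theta) returning \(\nabla_\theta Q^+(s, \theta)\) (e.g., computed via the Riesz representer of Chapter 04a); the function name and defaults are illustrative, not part of the original algorithm.

import numpy as np

def langevin_sample(s, grad_q, d_a, epsilon=0.01, lam=1.0, K=50, rng=None):
    """Draw θ approximately from π(θ|s) ∝ exp(Q+(s, θ)/λ) via unadjusted Langevin dynamics.

    grad_q(s, theta) is assumed to return ∇_θ Q+(s, θ) as a length-d_a array.
    """
    rng = rng or np.random.default_rng()
    theta = rng.standard_normal(d_a)  # arbitrary initialization (could come from a heuristic)
    for _ in range(K):
        noise = rng.standard_normal(d_a)
        theta = theta + epsilon * grad_q(s, theta) + np.sqrt(2.0 * epsilon * lam) * noise
    return theta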


2.2 Continuous RF-SARSA: Modified Algorithm

Replace discrete action enumeration with continuous sampling:

Original (discrete) RF-SARSA:

For each a^(i) ∈ A:
    Query Q+(s, a^(i))
Select a ~ Boltzmann(Q+)

Continuous RF-SARSA:

Initialize θ_0 randomly (or from heuristic)
For k = 1 to K_sample:
    Compute ∇_θ Q+(s, θ_{k-1})  [via Riesz representer]
    θ_k ← θ_{k-1} + ε ∇_θ Q+(s, θ_{k-1}) + √(2ελ) ξ_k
Return θ_K

Key change: Action inference becomes gradient-based optimization on the field.


2.3 How to Update Q Without Discrete Actions?

Problem: The SARSA update reads and writes \(Q(s, a)\) for a specific action, but now \(\theta\) is continuous, so there is no table entry to index.

Solution 1: Function Approximation (Standard Approach)

Use a parametric function approximator (e.g., a neural network): \(Q_w(s, \theta)\)

Update via gradient descent: \[w \leftarrow w - \alpha \nabla_w [Q_w(s, \theta) - (r + \gamma Q_w(s', \theta'))]^2\]

But wait—this is just deep RL! We've abandoned the particle-based field representation.

Solution 2: Particle-Based Continuous SARSA (GRL Way)

Keep the particle representation \(Q^+(z) = \sum_i w_i k(z, z_i)\) but eliminate primitive \(Q(s, a)\) table.

Modified update:

  1. Sample action: \(\theta \sim \pi(\cdot | s)\) via Langevin
  2. Execute: observe \(r\), \(s'\)
  3. Sample next action: \(\theta' \sim \pi(\cdot | s')\) via Langevin
  4. Compute TD error directly from field queries: \(\delta = r + \gamma Q^+(s', \theta') - Q^+(s, \theta)\)
  5. Form particle: \(\omega = ((s, \theta), r + \gamma Q^+(s', \theta'))\) (bootstrap target, not table Q)
  6. MemoryUpdate: \(\Omega \leftarrow \text{MemoryUpdate}(\omega, \delta, k, \tau, \Omega)\)

Key difference: No primitive \(Q(s, a)\) table! TD target computed directly from field queries.
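
A condensed sketch of steps 1-6, assuming helpers q_plus, sample_action_langevin, and memory_update with the interfaces used later in Section 9; this is illustrative only.

import numpy as np

def continuous_sarsa_step(s, theta, r, s_next,
                          q_plus, sample_action_langevin, memory_update,
                          particles, kernel, gamma=0.9, tau=0.1):
    """One field-bootstrapped TD step; no primitive Q(s, a) table is involved."""
    theta_next = sample_action_langevin(s_next)            # θ' ~ π(·|s') via Langevin
    td_target = r + gamma * q_plus(s_next, theta_next)     # bootstrap target from the field
    delta = td_target - q_plus(s, theta)                   # TD error δ
    omega = (np.concatenate([s, theta]), td_target)        # particle ω = ((s, θ), y)
    particles = memory_update(omega, delta, kernel, tau, particles)
    return particles, theta_next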


2.4 Algorithm: Continuous RF-SARSA

Inputs:

  • Kernel \(k(\cdot, \cdot; \eta)\) with hyperparameters \(\eta\) (written \(\eta\) here to avoid overloading the action parameter \(\theta\))
  • Langevin step size \(\epsilon\), temperature \(\lambda\)
  • Number of Langevin steps \(K_{\text{sample}}\)
  • TD learning rate \(\alpha\) (for particle weight updates)
  • Discount \(\gamma\), association threshold \(\tau\)

Initialization:

  • Particle memory \(\Omega \leftarrow \emptyset\)
  • Kernel hyperparameters \(\eta\) (via ARD or prior)

For each episode:

  1. Observe initial state \(s_0\)

For each step \(t\):

  1. Action sampling via Langevin:
     • Initialize \(\theta_0\) randomly or from a heuristic
     • For \(k = 1, \ldots, K_{\text{sample}}\):
       \[\nabla_\theta Q^+(s_t, \theta_{k-1}) = \sum_{i=1}^N w_i \nabla_\theta k((s_t, \theta_{k-1}), z_i)\]
       \[\theta_k \leftarrow \theta_{k-1} + \epsilon \nabla_\theta Q^+(s_t, \theta_{k-1}) + \sqrt{2\epsilon\lambda} \, \xi_k\]
     • Set \(\theta_t \leftarrow \theta_{K_{\text{sample}}}\)

  2. Execute action:
     • Execute \(\theta_t\) in the environment
     • Observe \(r_t\), \(s_{t+1}\)

  3. Next action sampling:
     • Repeat step 1 at \(s_{t+1}\) to obtain \(\theta_{t+1}\)

  4. Field-based TD update:
     • Query the field: \(Q_t^+ \leftarrow Q^+(s_t, \theta_t)\), \(Q_{t+1}^+ \leftarrow Q^+(s_{t+1}, \theta_{t+1})\)
     • Compute the TD error: \(\delta_t \leftarrow r_t + \gamma Q_{t+1}^+ - Q_t^+\)
     • Form the TD target: \(y_t \leftarrow r_t + \gamma Q_{t+1}^+\) (bootstrap from the field)

  5. Particle reinforcement:
     • Form the particle: \(\omega_t \leftarrow ((s_t, \theta_t), y_t)\)
     • Update memory: \(\Omega \leftarrow \text{MemoryUpdate}(\omega_t, \delta_t, k, \tau, \Omega)\)

  6. Periodic ARD:
     • Every \(T\) steps, update the kernel hyperparameters \(\eta\) via ARD on \(\Omega\)

Repeat until terminal.


2.5 Advantages and Limitations

✅ Advantages:

  1. No manual action mapping: \(\theta\) is sampled directly from field
  2. Fully continuous: No discretization of action space
  3. Principled: Langevin sampling from Boltzmann distribution (Chapter 03a)
  4. Natural exploration: Temperature \(\lambda\) controls stochasticity

⚠️ Limitations:

  1. Gradient computation: Requires \(\nabla_\theta k(z, z')\) (analytic or autodiff)
  2. Langevin convergence: Need \(K_{\text{sample}}\) steps per action (slower)
  3. Local optima: Gradient descent can get stuck (non-convex \(Q^+\))
  4. No primitive Q-table: Loses SARSA's tabular grounding

When to use: High-dimensional continuous actions where discrete enumeration is impossible.


3. Solution 2: Actor-Critic in RKHS

Idea: Decouple policy (actor) from value function (critic), as in standard actor-critic methods.

3.1 The Actor-Critic Framework

Actor: parametric policy \(\pi_\phi(\theta | s)\)

  • Could be Gaussian: \(\pi_\phi(\theta | s) = \mathcal{N}(\mu_\phi(s), \sigma_\phi(s))\)
  • Trained via policy gradient

Critic: value function \(Q^+(s, \theta)\) in RKHS (as before)

  • Trained via TD learning (using particles)

Advantage: Policy is flexible, efficient to sample; critic provides value estimates for learning.


3.2 GRL Actor-Critic Algorithm

Modifications to RF-SARSA:

Action selection (no Langevin needed):

  • Sample from actor: \(\theta_t \sim \pi_\phi(\cdot | s_t)\)
  • Fast sampling (single forward pass)

Critic update (unchanged):

  • Particle-based TD: \(\delta_t = r_t + \gamma Q^+(s_{t+1}, \theta_{t+1}) - Q^+(s_t, \theta_t)\)
  • MemoryUpdate as before

Actor update (policy gradient):

  • Compute advantage: \(A_t = Q^+(s_t, \theta_t) - V(s_t)\) where \(V(s) = \mathbb{E}_{\theta \sim \pi_\phi}[Q^+(s, \theta)]\)
  • Update policy: \(\phi \leftarrow \phi + \beta \nabla_\phi \log \pi_\phi(\theta_t | s_t) A_t\)

3.3 Pseudocode

import numpy as np

# gpr_predict and memory_update are assumed to be available from the Chapter 7 implementation.

class GRL_ActorCritic:
    def __init__(self, actor_net, kernel, alpha=0.1, beta=0.01, gamma=0.9, tau=0.1):
        self.actor = actor_net  # Actor model: s → μ(s), σ(s)
        self.kernel = kernel
        self.particles = []  # Critic: particle memory Ω
        self.alpha = alpha  # Critic learning rate (not used directly in this sketch)
        self.beta = beta  # Actor learning rate
        self.gamma = gamma
        self.tau = tau

    def sample_action(self, s):
        """Sample an action from the actor policy."""
        mu, sigma = self.actor(s)
        theta = np.random.normal(mu, sigma)
        return theta, mu, sigma

    def critic_predict(self, s, theta):
        """Predict Q+(s, θ) via GPR on the particles."""
        z = np.concatenate([s, theta])
        if len(self.particles) == 0:
            return 0.0
        return gpr_predict(z, self.particles, self.kernel)

    def update(self, s, theta, r, s_next, theta_next):
        """Actor-critic update."""
        # Critic TD update
        Q_current = self.critic_predict(s, theta)
        Q_next = self.critic_predict(s_next, theta_next)
        delta = r + self.gamma * Q_next - Q_current

        # Particle reinforcement (critic)
        z = np.concatenate([s, theta])
        y = r + self.gamma * Q_next  # TD target
        particle = (z, y)
        self.particles = memory_update(particle, delta, self.kernel, self.tau, self.particles)

        # Actor policy gradient
        # Advantage: A = Q(s, θ) - V(s); for simplicity the TD error is used as the advantage (A ≈ δ)
        log_prob = self.actor.log_prob(theta, s)  # log π_φ(θ|s)
        actor_loss = -log_prob * delta  # Policy-gradient loss
        # actor_net is assumed to apply a gradient step on this loss internally
        self.actor.update(actor_loss, self.beta)
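
The pseudocode above treats actor_net as a black box exposing __call__ (returning μ(s), σ(s)), log_prob, and an update step. A minimal linear-Gaussian actor along those lines is sketched below; because plain NumPy has no autodiff, its update takes (s, theta, advantage, lr) and applies the REINFORCE gradient explicitly rather than differentiating a scalar loss. This is an illustrative stand-in, not the chapter's prescribed actor.

import numpy as np

class GaussianActor:
    """Linear-Gaussian policy π_φ(θ|s) = N(W s + b, diag(σ²)) with state-independent σ."""

    def __init__(self, state_dim, action_dim, init_log_std=0.0, rng=None):
        self.W = np.zeros((action_dim, state_dim))
        self.b = np.zeros(action_dim)
        self.log_std = np.full(action_dim, init_log_std)
        self.rng = rng or np.random.default_rng()

    def __call__(self, s):
        """Return (mean, std) of the action distribution at state s."""
        return self.W @ s + self.b, np.exp(self.log_std)

    def log_prob(self, theta, s):
        """log π_φ(θ|s) for a diagonal Gaussian."""
        mu, sigma = self(s)
        return float(np.sum(-0.5 * np.log(2.0 * np.pi) - self.log_std
                            - 0.5 * ((theta - mu) / sigma) ** 2))

    def update(self, s, theta, advantage, lr):
        """One REINFORCE step: φ ← φ + lr · advantage · ∇_φ log π_φ(θ|s)."""
        mu, sigma = self(s)
        g_mu = (theta - mu) / sigma ** 2                     # ∇_μ log π
        self.W += lr * advantage * np.outer(g_mu, s)
        self.b += lr * advantage * g_mu
        self.log_std += lr * advantage * (((theta - mu) / sigma) ** 2 - 1.0)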

3.4 Advantages and Limitations

✅ Advantages:

  1. Efficient sampling: No Langevin iterations, single forward pass
  2. Flexible policy: Can model complex distributions (multimodal, correlations)
  3. Standard framework: Leverages decades of actor-critic research
  4. Scalable: Neural networks handle high-dimensional states/actions

⚠️ Limitations:

  1. Parametric policy: Loses non-parametric flexibility of particle-based field
  2. Two learning systems: Actor and critic can be unstable (common in AC methods)
  3. Hyperparameters: Requires tuning learning rates, network architectures
  4. Divergence risk: Policy and value can diverge without careful tuning

When to use: High-dimensional, complex policies where sampling efficiency matters.


4. Solution 3: Learned Action Embeddings

Idea: Instead of hand-designing \(f_{A^+}: a \mapsto x_a\), learn the embedding jointly with the value function.

4.1 The Embedding Problem

Original GRL: Requires manual mapping from primitive actions to parametric representation.

Example (2D navigation):

  • Primitive: "move North"
  • Manual embedding: North → angle \(\theta = \pi/2\)

Question: Can we learn this mapping automatically?

Answer: Yes! Use contrastive learning or auto-encoder.


4.2 Contrastive Action Embedding

Objective: Embed actions such that:

  • Similar actions (similar outcomes) → close in embedding space
  • Dissimilar actions → far apart

Training:

  1. Collect transitions: \((s, a, s')\)
  2. Define similarity: \(\text{sim}(a, a') = \exp(-\|s' - s''\|^2)\), where \(s'\) and \(s''\) are the outcomes of executing \(a\) and \(a'\) from comparable states
  3. Learn embedding: \(f_\psi: a \mapsto x_a\) such that \(\|x_a - x_{a'}\|^2 \propto -\log \text{sim}(a, a')\)

Loss (InfoNCE-style): \[\mathcal{L}_{\text{embed}} = -\log \frac{\exp(\langle f_\psi(a), f_\psi(a^+) \rangle / \tau)}{\sum_{a^-} \exp(\langle f_\psi(a), f_\psi(a^-) \rangle / \tau)}\]

where \(a^+\) is a positive pair (similar action), \(a^-\) are negatives.
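
A NumPy sketch of this loss for one anchor action with a single positive and a batch of negatives; the embedding function f_psi is assumed to be supplied, and (following common practice) the positive pair is included in the denominator.

import numpy as np

def info_nce_loss(f_psi, a, a_pos, a_negs, temperature=0.1):
    """InfoNCE-style loss for one anchor action a, positive a_pos, and negatives a_negs."""
    x = f_psi(a)
    # Similarity logits for [positive, negatives...]
    logits = np.array([np.dot(x, f_psi(b)) / temperature for b in [a_pos] + list(a_negs)])
    # Cross-entropy of the positive pair (index 0), computed with a stable log-sum-exp
    m = logits.max()
    log_denominator = m + np.log(np.sum(np.exp(logits - m)))
    return -(logits[0] - log_denominator)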


4.3 Joint Learning: Embedding + Value Function

Algorithm:

  1. Initialization: Random embedding \(f_\psi\)
  2. Collect experience: \((s, a, r, s')\)
  3. Embed actions: \(x_a \leftarrow f_\psi(a)\), \(x_{a'} \leftarrow f_\psi(a')\)
  4. TD update: Standard RF-SARSA using embedded actions
  5. Embedding update: Contrastive loss based on \((a, a')\) pairs with similar outcomes
  6. Repeat

Key insight: Embedding evolves to make TD learning easier—actions with similar values cluster in embedding space!
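
Schematically, the alternation might look like this, assuming a hypothetical embedder with embed and contrastive_update methods and an RF-SARSA agent as in Chapter 7; the names and the update period are illustrative.

def joint_learning_step(agent, embedder, transition_buffer, s, a, r, s_next, a_next):
    """One alternation of TD learning on embedded actions and a contrastive embedding update."""
    x_a = embedder.embed(a)                         # x_a = f_ψ(a)
    x_a_next = embedder.embed(a_next)
    agent.update(s, x_a, r, s_next, x_a_next)       # standard RF-SARSA on embedded actions
    transition_buffer.append((s, a, s_next))        # store outcomes for similarity pairs
    if len(transition_buffer) % 64 == 0:            # periodic contrastive refinement of f_ψ
        embedder.contrastive_update(transition_buffer)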


4.4 Advantages and Limitations

✅ Advantages:

  1. No manual design: Embedding learned from data
  2. Adaptive: Embedding adapts to task (reward structure)
  3. Discovers structure: May reveal latent action properties

⚠️ Limitations:

  1. Requires experience: Need data to learn embedding
  2. Non-stationarity: Embedding changes as policy improves (moving target)
  3. Computational cost: Joint optimization is complex

When to use: When action space structure is unknown, or manual embedding is infeasible.


5. Solution 4: Hierarchical Action Discovery

Idea: Let the agent discover a discrete action set automatically by clustering in action parameter space.

5.1 The Clustering Approach

Step 1: Start with continuous action parameter space \(\mathbb{R}^{d_a}\)

Step 2: After initial exploration, cluster observed action parameters:

  • Use k-means, GMM, or DBSCAN on \(\{\theta_i\}\) from particle memory
  • Cluster centers become "discovered actions"

Step 3: Use discovered actions as discrete set for RF-SARSA

Step 4: Periodically re-cluster as policy evolves


5.2 Algorithm Sketch

import numpy as np
from sklearn.cluster import KMeans

def discover_actions(particles, state_dim, n_clusters=10):
    """Discover discrete actions via clustering of action parameters in particle memory."""
    # Extract action parameters θ from each particle z = (s, θ)
    theta_set = np.array([z[state_dim:] for (z, q) in particles])

    # Cluster in action-parameter space
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(theta_set)

    # Cluster centers = discovered actions
    discovered_actions = kmeans.cluster_centers_
    return discovered_actions

# In the RF-SARSA loop (sketch; state_dim and T_discovery come from the agent/config):
if episode % T_discovery == 0:
    A_discovered = discover_actions(self.particles, state_dim, n_clusters=10)
    # Use A_discovered as the discrete action set for the next T_discovery episodes

5.3 Options Framework Connection

This is closely related to the options framework (Sutton et al., 1999):

  • Options: Temporally extended actions (subpolicies)
  • GRL version: Discovered action clusters become "options"
  • Can learn hierarchical policies: high-level chooses option, low-level executes

5.4 Advantages and Limitations

✅ Advantages:

  1. Automatic: No manual action design
  2. Data-driven: Discovers actions that matter for the task
  3. Hierarchical: Enables multi-level policies

⚠️ Limitations:

  1. Requires exploration: Need diverse data to cluster well
  2. K selection: How many clusters? (hyperparameter)
  3. Non-stationary: Discovered actions change over time

When to use: Complex action spaces where structure is unknown, or hierarchical RL is beneficial.


6. Solution 5: Direct Optimization on the Field

Most radical solution: Eliminate TD learning entirely! Directly optimize the field \(Q^+\) via gradient ascent on expected return.

6.1 Field-Based Policy Gradient

Objective: Maximize the expected return \[J(\Omega) = \mathbb{E}_{\tau \sim \pi_\Omega}\left[\sum_{t=0}^T r_t\right]\]

where \(\pi_\Omega\) is the policy induced by field \(Q^+(\cdot; \Omega)\) (e.g., via Langevin sampling).

Gradient: \[\nabla_\Omega J = \mathbb{E}_{\tau}\left[\sum_{t=0}^T \nabla_\Omega \log \pi_\Omega(\theta_t | s_t) R_t\right]\]

where \(R_t = \sum_{t'=t}^T r_{t'}\) is the return from time \(t\).

Update: Gradient ascent on particle memory:

  • Add particles where policy needs reinforcement
  • Remove particles where policy is suboptimal
  • Adjust weights to increase expected return

6.2 Challenges

Problem 1: Computing \(\nabla_\Omega \log \pi_\Omega(\theta | s)\) is non-trivial—policy is implicit (via GPR + Langevin).

Problem 2: High variance (standard policy gradient issue).

Solution: Use score matching or diffusion models to learn \(\nabla_\theta \log \pi(\theta | s)\) directly.

This is an active research direction! (It connects to modern diffusion-based RL: Diffusion-QL, Diffuser, Decision Diffuser, etc.)


7. Alternative Learning Mechanisms Beyond SARSA

7.1 Q-Learning in RKHS

Replace SARSA's on-policy update with Q-learning's off-policy update:

Original SARSA: \[\delta = r + \gamma Q(s', a') - Q(s, a) \quad \text{(uses the actual next action } a'\text{)}\]

Q-Learning: \[\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a) \quad \text{(uses the max over actions)}\]

For continuous actions: Replace \(\max_{a'}\) with an optimization: \[\theta^* = \arg\max_\theta Q^+(s', \theta)\]

Use gradient ascent or Langevin sampling to find \(\theta^*\).

Advantage: Off-policy (can reuse experience, more sample-efficient).

Limitation: Still requires optimization at each step (costly).
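
A sketch of the continuous maximization, reusing q_plus and grad_q_plus in the style of Section 9 and approximating the argmax by gradient ascent with random restarts; the restart and step counts are illustrative.

import numpy as np

def argmax_q(s, q_plus, grad_q_plus, d_a, n_restarts=5, steps=50, lr=0.05, rng=None):
    """Approximate θ* = argmax_θ Q+(s, θ) by gradient ascent with random restarts."""
    rng = rng or np.random.default_rng()
    best_theta, best_q = None, -np.inf
    for _ in range(n_restarts):
        theta = rng.standard_normal(d_a)
        for _ in range(steps):
            theta = theta + lr * grad_q_plus(s, theta)
        q = q_plus(s, theta)
        if q > best_q:
            best_theta, best_q = theta, q
    return best_theta, best_q

# Off-policy TD target: δ = r + γ max_θ Q+(s', θ) - Q+(s, θ_t)
# theta_star, q_star = argmax_q(s_next, q_plus, grad_q_plus, d_a)
# delta = r + gamma * q_star - q_plus(s, theta)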


7.2 Model-Based: Dyna-Style Planning

Idea: Use particle memory as a forward model, perform planning.

Dyna-Q analog for GRL:

  1. Direct learning: update \(\Omega\) from real experience (as in RF-SARSA)
  2. Model learning: particles encode transitions \((s, \theta) \to (r, s')\)
  3. Planning: simulate trajectories using particle memory
     • Query \(Q^+(s, \theta)\) to predict \(r\)
     • Use a GP to predict \(s'\) (if transition dynamics are modeled)
     • Perform TD updates on the simulated experience

Advantage: Sample efficiency (real experience + simulated).

Limitation: Requires modeling \(p(s' | s, \theta)\), not just \(Q^+\).
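
A sketch of the planning phase, assuming a hypothetical transition model with a predict(s, θ) -> (r̂, ŝ') interface (for example a GP fitted on observed transitions) and the agent from Section 9; the number of simulated steps is arbitrary.

import numpy as np

def dyna_planning(agent, model, visited_states, n_sim=20, rng=None):
    """Dyna-style planning: field-based TD updates on transitions simulated from a learned model."""
    rng = rng or np.random.default_rng()
    for _ in range(n_sim):
        s = visited_states[rng.integers(len(visited_states))]   # replay a previously visited state
        theta = agent.sample_action_langevin(s)                  # action from the current field
        r_hat, s_hat = model.predict(s, theta)                   # simulated reward and next state
        theta_next = agent.sample_action_langevin(s_hat)
        agent.update(s, theta, r_hat, s_hat, theta_next)         # same update as for real experience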


7.3 Successor Representations

Idea: Decouple environment dynamics from reward.

Successor representation: \[\psi(s, \theta) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t \phi(s_t) \,\middle|\, s_0 = s, \theta_0 = \theta\right]\]

where \(\phi(s)\) are state features.

Value function: \[Q^+(s, \theta) = \psi(s, \theta)^\top w\]

where \(w\) are reward weights.

Advantage: Transfer learning (reuse \(\psi\) for new reward functions \(w\)).

GRL connection: Can represent \(\psi\) via particles in RKHS!
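
A minimal sketch of successor-feature TD learning with a linear parametrization \(\psi(s, \theta) \approx M\, g(s, \theta)\) rather than the particle representation, purely for illustration; the feature maps phi and g, and the separate estimation of the reward weights w, are assumptions.

import numpy as np

class LinearSuccessorFeatures:
    """ψ(s, θ) ≈ M g(s, θ) with a user-supplied joint feature map g; Q+(s, θ) = w·ψ(s, θ)."""

    def __init__(self, g, phi, d_phi, d_g, alpha=0.1, gamma=0.9):
        self.g, self.phi = g, phi            # g: (s, θ) → R^d_g features, phi: s → R^d_phi
        self.M = np.zeros((d_phi, d_g))      # successor-feature weights
        self.w = np.zeros(d_phi)             # reward weights (learned separately, e.g. by regression)
        self.alpha, self.gamma = alpha, gamma

    def psi(self, s, theta):
        return self.M @ self.g(s, theta)

    def q(self, s, theta):
        return self.w @ self.psi(s, theta)

    def td_update(self, s, theta, s_next, theta_next):
        """ψ-TD: ψ(s,θ) ← ψ(s,θ) + α [φ(s) + γ ψ(s',θ') - ψ(s,θ)] (semi-gradient step on M)."""
        target = self.phi(s) + self.gamma * self.psi(s_next, theta_next)
        error = target - self.psi(s, theta)
        self.M += self.alpha * np.outer(error, self.g(s, theta))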


8. Practical Recommendations

8.1 Decision Tree: Which Approach to Use?

Is action space discrete with < 100 actions?
├─ YES → Use original RF-SARSA (Chapter 7)
└─ NO → Is action dimensionality high (d_a > 5)?
    ├─ YES → Use Actor-Critic in RKHS (Solution 2)
    │        Efficient sampling, scalable
    └─ NO → Is gradient computation feasible?
        ├─ YES → Use Continuous RF-SARSA with Langevin (Solution 1)
        │        Principled, no extra networks
        └─ NO → Use Learned Embeddings (Solution 3) or
                 Hierarchical Discovery (Solution 4)

Best of both worlds: Combine solutions!

Stage 1: Exploration (Episodes 1-100)

  • Use Langevin sampling (Solution 1) for pure exploration
  • Build a diverse particle memory

Stage 2: Discovery (Episodes 100-200)

  • Cluster actions (Solution 4) to find structure
  • Use the discovered discrete actions for efficiency

Stage 3: Exploitation (Episodes 200+)

  • Use Actor-Critic (Solution 2) with the structured embedding
  • Fine-tune via continuous Langevin when needed


9. Implementation: Continuous RF-SARSA

Here is a complete implementation of Solution 1 (the MemoryUpdate routine is assumed to be available from Chapter 7):

import numpy as np

# memory_update is assumed to be available from the Chapter 7 implementation.
# The kernel object is assumed to be callable and to expose .lengthscale,
# .set_lengthscales(...), and .matrix(...).

class ContinuousRFSARSA:
    """RF-SARSA without discrete actions: pure Langevin sampling."""

    def __init__(self, kernel, action_dim, epsilon=0.01, lambda_temp=1.0, K_sample=10,
                 gamma=0.9, tau=0.1, T_ard=100):
        self.kernel = kernel
        self.action_dim = action_dim  # Dimensionality d_a of the action parameters θ
        self.epsilon = epsilon  # Langevin step size
        self.lambda_temp = lambda_temp  # Temperature
        self.K_sample = K_sample  # Langevin iterations
        self.gamma = gamma
        self.tau = tau
        self.T_ard = T_ard

        self.particles = []  # (z, y) pairs
        self.alpha_gpr = np.zeros(0)  # GPR coefficients; refit whenever the memory changes
        self.t_ard = 0

    def q_plus(self, s, theta):
        """Query Q+(s, θ) via GPR."""
        if len(self.particles) == 0:
            return 0.0

        z = np.concatenate([s, theta])

        # GP prediction (assuming precomputed alpha coefficients)
        k_vec = np.array([self.kernel(z, p[0]) for p in self.particles])
        return k_vec @ self.alpha_gpr

    def grad_q_plus(self, s, theta):
        """Compute ∇_θ Q+(s, θ) via Riesz representer."""
        if len(self.particles) == 0:
            return np.zeros_like(theta)

        z = np.concatenate([s, theta])
        d_s = len(s)

        # Gradient via kernel derivatives
        grad = np.zeros_like(theta)
        for (z_i, q_i), alpha_i in zip(self.particles, self.alpha_gpr):
            # ∂k(z, z_i)/∂θ (assuming RBF kernel)
            diff = z - z_i
            k_val = self.kernel(z, z_i)
            grad_k = -diff[d_s:] / (self.kernel.lengthscale**2) * k_val
            grad += alpha_i * grad_k

        return grad

    def sample_action_langevin(self, s):
        """Sample action via Langevin dynamics on Q+ field."""
        # Initialize randomly
        theta = np.random.randn(self.action_dim)

        # Langevin iterations
        for k in range(self.K_sample):
            grad = self.grad_q_plus(s, theta)
            theta = theta + self.epsilon * grad + \
                    np.sqrt(2 * self.epsilon * self.lambda_temp) * np.random.randn(self.action_dim)

        return theta

    def update(self, s, theta, r, s_next, theta_next):
        """Continuous RF-SARSA update."""
        # Query field for TD
        Q_current = self.q_plus(s, theta)
        Q_next = self.q_plus(s_next, theta_next)

        # TD error
        delta = r + self.gamma * Q_next - Q_current

        # TD target (bootstrap from field, not from Q-table!)
        y = r + self.gamma * Q_next

        # Form particle
        z = np.concatenate([s, theta])
        particle = (z, y)

        # MemoryUpdate (assumed from Chapter 7)
        self.particles = memory_update(particle, delta, self.kernel, self.tau, self.particles)

        # Keep the GPR coefficients consistent with the updated particle memory
        self._refit_gpr()

        # Periodic ARD
        self.t_ard += 1
        if self.t_ard % self.T_ard == 0:
            self.update_kernel_hyperparameters()

    def update_kernel_hyperparameters(self):
        """Run ARD to update kernel lengthscales."""
        if len(self.particles) < 10:
            return  # Need sufficient data

        Z = np.array([p[0] for p in self.particles])
        q = np.array([p[1] for p in self.particles])

        # Fit GP with ARD
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF

        kernel = RBF(length_scale=np.ones(Z.shape[1]), length_scale_bounds=(1e-2, 1e2))
        gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-6)
        gp.fit(Z, q)

        # Update kernel
        self.kernel.set_lengthscales(gp.kernel_.length_scale)

        # Recompute GPR coefficients under the new lengthscales
        self._refit_gpr()

    def _refit_gpr(self):
        """Recompute the GPR weight vector α = (K + σ²I)⁻¹ y for the current particles."""
        if len(self.particles) == 0:
            self.alpha_gpr = np.zeros(0)
            return
        Z = np.array([p[0] for p in self.particles])
        y = np.array([p[1] for p in self.particles])
        K = self.kernel.matrix(Z, Z)
        self.alpha_gpr = np.linalg.solve(K + 1e-6 * np.eye(len(Z)), y)

    def train(self, env, n_episodes=100):
        """Training loop."""
        for episode in range(n_episodes):
            s = env.reset()
            theta = self.sample_action_langevin(s)

            done = False
            while not done:
                # Execute
                s_next, r, done = env.step(theta)

                # Next action
                theta_next = self.sample_action_langevin(s_next)

                # Update
                self.update(s, theta, r, s_next, theta_next)

                # Advance
                s, theta = s_next, theta_next

            print(f"Episode {episode}: Total reward = {env.episode_return}")

10. Summary

10.1 The Discrete Action Bottleneck

RF-SARSA (Chapter 7) requires discrete primitive actions, creating limitations:

  • Manual action mapping needed
  • Curse of dimensionality for high-d actions
  • Suboptimal actions (discrete approximation of continuous space)

10.2 Five Solutions

| Solution | Key Idea | Advantages | Limitations |
|---|---|---|---|
| 1. Continuous SARSA | Langevin sampling on the field | Principled, no manual design | Gradient computation, convergence |
| 2. Actor-Critic in RKHS | Parametric policy + particle critic | Efficient sampling, scalable | Parametric policy, instability risk |
| 3. Learned Embeddings | Learn the action representation jointly | Adaptive, discovers structure | Non-stationary, requires experience |
| 4. Hierarchical Discovery | Cluster actions automatically | Data-driven, enables hierarchy | K selection, non-stationary |
| 5. Direct Optimization | Policy gradient on the field | No TD needed | High variance, complex |

10.3 Practical Recommendations

For most problems: Start with Continuous SARSA (Solution 1) or Actor-Critic (Solution 2).

Hybrid approach: Combine exploration (Langevin) → discovery (clustering) → exploitation (actor-critic).

10.4 Beyond SARSA

Alternative learning mechanisms:

  • Q-Learning in RKHS: Off-policy, sample-efficient
  • Dyna-style planning: Model-based, simulate with particles
  • Successor representations: Transfer learning across reward functions

11. Key Takeaways

  1. Original RF-SARSA has a discrete action bottleneck requiring manual mapping \(f_{A^+}\)

  2. Continuous SARSA via Langevin eliminates discrete actions entirely—sample directly from field

  3. Actor-Critic in RKHS combines efficiency (neural network policy) with non-parametric critic (particles)

  4. Learned embeddings remove manual design—discover action structure from data

  5. Hierarchical discovery via clustering enables data-driven action spaces

  6. SARSA is not the only way—Q-learning, model-based, policy gradients all viable

  7. The least action principle (Chapter 03a) provides theoretical foundation for Langevin-based continuous RL


12. Open Research Questions

  1. Convergence theory for continuous RF-SARSA: Does Langevin + TD converge? Under what conditions?

  2. Optimal exploration in continuous actions: How to balance Langevin temperature vs. ARD adaptation?

  3. Sparse particle representations: Can we maintain performance with fewer particles in continuous action spaces?

  4. Hierarchical GRL: How to discover and learn action hierarchies automatically?

  5. Transfer learning: Can learned action embeddings transfer across tasks?


Further Reading

Continuous Action RL:

  • Lillicrap, T. P., et al. (2016). "Continuous control with deep reinforcement learning" (DDPG). ICLR.
  • Haarnoja, T., et al. (2018). "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor" (SAC). ICML.

Langevin Dynamics for RL:

  • Levine, S. (2018). "Reinforcement learning and control as probabilistic inference: Tutorial and review." arXiv:1805.00909.
  • Ajay, A., et al. (2023). "Is conditional generative modeling all you need for decision making?" ICLR (Decision Diffuser).

Action Embeddings:

  • van den Oord, A., Li, Y., & Vinyals, O. (2018). "Representation learning with contrastive predictive coding." arXiv:1807.03748.
  • Kipf, T., et al. (2020). "Contrastive learning of structured world models." ICLR.

Options Framework:

  • Sutton, R. S., Precup, D., & Singh, S. (1999). "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning." Artificial Intelligence.

← Back to Chapter 07: RF-SARSA | Next: Chapter 08 →

Related: Chapter 03a - Least Action Principle | Related: Chapter 04a - Riesz Representer


Last Updated: January 14, 2026