
Chapter 03a: The Principle of Least Action (Supplement)

Purpose: This supplement bridges classical physics and reinforcement learning, showing why GRL's energy-based formulation is not just convenient notation—it's a principled framework grounded in one of the most fundamental laws of physics: the principle of least action.

Why this matters for GRL:

  • Explains why the Boltzmann policy \(\pi(\theta|s) \propto \exp(Q^+/\lambda)\) emerges naturally
  • Provides a principled way for agents to discover smooth, optimal actions (not just select from pre-defined sets)
  • Connects modern RL to 300+ years of physics and optimal control theory

1. Classical Mechanics: A Crash Course

1.1 The Action Functional

In classical mechanics, a particle doesn't "choose" its trajectory arbitrarily. Among all possible paths from point A to point B, nature selects the one that minimizes a quantity called the action.

The action functional \(S[\gamma]\) assigns a real number to each possible trajectory \(\gamma(t)\):

\[S[\gamma] = \int_{t_0}^{t_f} L(q(t), \dot{q}(t), t) \, dt\]

where:

  • \(L(q, \dot{q}, t)\) = Lagrangian = Kinetic Energy - Potential Energy
  • \(q(t)\) = position at time \(t\)
  • \(\dot{q}(t)\) = velocity at time \(t\)

Principle of Least Action: The actual trajectory taken by the system is the one that makes \(S[\gamma]\) stationary (usually a minimum).

Example: Free particle

For a particle of mass \(m\) moving freely:

\[L = \frac{1}{2}m\dot{q}^2\]

The action is:

\[S[\gamma] = \int_{t_0}^{t_f} \frac{1}{2}m\dot{q}^2 \, dt\]

Result: The trajectory that minimizes action is a straight line at constant velocity—Newton's first law emerges from a variational principle!
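
To make this concrete, here is a minimal numpy sketch (illustrative only, not part of the chapter's code) that discretizes the free-particle action and compares the straight-line path against randomly perturbed paths with the same endpoints; the straight line always yields the smallest action.

import numpy as np

def action(q, dt, m=1.0):
    """Discretized free-particle action: integral of (1/2) m q_dot^2 along the path."""
    q_dot = np.diff(q) / dt
    return np.sum(0.5 * m * q_dot**2) * dt

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
dt = t[1] - t[0]
straight = 5.0 * t                                   # straight line from q(0)=0 to q(1)=5

S_straight = action(straight, dt)
for _ in range(5):
    bump = np.sin(np.pi * t) * rng.normal(0.0, 0.5)  # perturbation vanishing at the endpoints
    S_perturbed = action(straight + bump, dt)
    print(S_perturbed >= S_straight)                 # expected: True for every perturbation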


1.2 The Euler-Lagrange Equations

Minimizing the action leads to the Euler-Lagrange equations:

\[\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{q}}\right) - \frac{\partial L}{\partial q} = 0\]

These are equivalent to Newton's equations of motion.

Example: Particle in potential

For \(L = \frac{1}{2}m\dot{q}^2 - V(q)\):

\[m\ddot{q} = -\frac{\partial V}{\partial q}\]

This is Newton's second law: \(F = ma\).
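
As a quick symbolic check (illustrative only), the sympy sketch below applies the Euler-Lagrange equation to a harmonic oscillator, \(V(q) = \frac{1}{2}kq^2\), and recovers \(m\ddot{q} = -kq\).

import sympy as sp
from sympy.calculus.euler import euler_equations

t = sp.symbols('t')
m, k = sp.symbols('m k', positive=True)
q = sp.Function('q')

# Lagrangian: kinetic energy minus potential energy
L = sp.Rational(1, 2) * m * q(t).diff(t)**2 - sp.Rational(1, 2) * k * q(t)**2

# Euler-Lagrange: d/dt(dL/dq_dot) - dL/dq = 0  ->  m*q'' + k*q = 0
print(euler_equations(L, [q(t)], t))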


1.3 Why Is This Powerful?

The action principle is powerful because:

  1. Unified framework: Works for all physical systems (mechanics, optics, quantum mechanics, field theory)
  2. Coordinate-free: The action is independent of coordinate choices
  3. Reveals conservation laws: Via Noether's theorem (symmetries ↔ conserved quantities)
  4. Generalizes naturally: Extends to stochastic systems, optimal control, and RL

2. From Physics to Control: Path Integral Control

2.1 The Control Problem

In optimal control, we want to find a trajectory that:

  • Starts at state \(s_0\)
  • Ends at goal state \(s_f\) (or maximizes reward)
  • Minimizes a cost functional

Sound familiar? This is exactly an action minimization problem!

Control action functional:

\[S[\tau] = \int_{t_0}^{t_f} \left[ C(s_t, u_t) + \frac{1}{2\nu} \|u_t\|^2 \right] dt\]

where:

  • \(C(s, u)\) = instantaneous cost (like potential energy)
  • \(\frac{1}{2\nu} \|u_t\|^2\) = control cost (like kinetic energy)
  • \(u_t\) = control input at time \(t\)
  • \(\nu\) = "temperature" parameter (exploration vs. exploitation)

The control Lagrangian is:

\[L(s, u) = -C(s, u) - \frac{1}{2\nu} \|u\|^2\]

(Note the signs: the control action adds the cost and control-effort terms, so \(S[\tau] = -\int L \, dt\); minimizing the action is the same as maximizing the accumulated Lagrangian.)


2.2 Optimal Policy from Action

Key insight (Kappen 2005, Todorov 2009): The optimal stochastic policy is:

\[\pi^*(u|s) \propto \exp\left(-\frac{1}{\nu} S[s \to u]\right)\]

where \(S[s \to u]\) is the action along the optimal trajectory from \(s\) when applying control \(u\).

This is a Boltzmann distribution over actions!

The policy naturally emerges from minimizing action, with temperature \(\nu\) controlling the sharpness:

  • High \(\nu\) (high temperature) → More exploration, softer policy
  • Low \(\nu\) (low temperature) → More exploitation, sharper policy
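
A tiny numeric illustration of this temperature effect, using made-up action values \(S = 1, 2, 3\) for three candidate controls:

import numpy as np

S = np.array([1.0, 2.0, 3.0])                # action of each candidate control
for nu in (5.0, 0.1):                        # high vs. low temperature
    p = np.exp(-S / nu)
    p /= p.sum()
    print(f"nu={nu}: {np.round(p, 3)}")
# High nu: weights are nearly uniform (exploration).
# Low nu: almost all mass sits on the lowest-action control (exploitation).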

2.3 The Cost-to-Go as "Potential"

In path integral control, the cost-to-go (or value function) plays the role of potential energy:

\[V(s) = \min_{u_{[t,\infty)}} \mathbb{E}\left[\int_{t}^{\infty} C(s_{t'}, u_{t'}) \, dt' \,\middle|\, s_t = s\right]\]

The optimal control descends the cost-to-go:

\[u^*(s) \propto -\nabla_s V(s) = \nu \nabla_s \log \mathcal{Z}(s)\]

where \(\mathcal{Z}(s)\) is the "partition function" (a sum over all trajectories) and \(V(s) = -\nu \log \mathcal{Z}(s)\).

This is gradient descent on an energy landscape!


3. GRL's Boltzmann Policy as Least Action

3.1 The GRL Action Functional

In GRL, we work in augmented state-action space \(z = (s, \theta)\). The natural action functional is:

\[S[\tau] = \int_{t_0}^{t_f} \left[ E(s_t, \theta_t) + \frac{1}{2\lambda} \|\dot{\theta}_t\|^2 \right] dt\]

where:

  • \(E(s, \theta) = -Q^+(s, \theta)\) = energy landscape (potential)
  • \(\|\dot{\theta}_t\|^2\) = "kinetic energy" of action parameter changes
  • \(\lambda\) = temperature (controls exploration)

The GRL Lagrangian:

\[L(s, \theta, \dot{\theta}) = -E(s, \theta) - \frac{1}{2\lambda} \|\dot{\theta}\|^2 = Q^+(s, \theta) - \frac{1}{2\lambda} \|\dot{\theta}\|^2\]

This says: Good trajectories have high \(Q^+\) (high reward potential) and smooth changes in action parameters (low kinetic cost).


3.2 Why the Boltzmann Policy Emerges

From path integral control theory, the optimal policy is:

\[\pi^*(\theta|s) \propto \exp\left(-\frac{1}{\lambda} S[s \to \theta]\right)\]

For a single-step decision (no trajectory dynamics), this simplifies to:

\[\pi^*(\theta|s) \propto \exp\left(-\frac{1}{\lambda} E(s, \theta)\right) = \exp\left(\frac{Q^+(s, \theta)}{\lambda}\right)\]

This is exactly GRL's Boltzmann policy!

It's not an ad-hoc choice—it's the optimal policy under the action minimization principle.
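
As a concrete sketch, one can discretize the action-parameter space and form this Boltzmann policy directly; the `Q_plus` callable and the grid below are placeholders for illustration, not part of the GRL codebase.

import numpy as np

def boltzmann_policy(Q_plus, s, theta_grid, lam):
    """Discretized Boltzmann policy: pi(theta|s) proportional to exp(Q+(s, theta)/lambda)."""
    q = np.array([Q_plus(s, theta) for theta in theta_grid])
    logits = q / lam
    logits -= logits.max()                   # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example with a toy Q+ peaked at theta = 0.3:
theta_grid = np.linspace(-1.0, 1.0, 201)
probs = boltzmann_policy(lambda s, th: -(th - 0.3)**2, s=None, theta_grid=theta_grid, lam=0.05)
print(theta_grid[np.argmax(probs)])          # prints a value near 0.3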


3.3 Smooth Actions from Kinetic Regularization

The kinetic term \(\frac{1}{2\lambda}\|\dot{\theta}\|^2\) penalizes rapid changes in action parameters.

Why this matters:

  • Prevents jerky, discontinuous actions
  • Encourages smooth, physically realizable trajectories
  • Natural regularization (Occam's razor for actions)

Example: Robotic reaching

Without kinetic penalty: the agent might command wild, discontinuous joint torques.

With kinetic penalty: the agent learns smooth, human-like reaching motions.

In GRL: The kernel function \(k(z, z')\) implicitly encodes this smoothness preference!

\[k((s, \theta), (s', \theta')) = k_s(s, s') \cdot k_\theta(\theta, \theta')\]

A smooth kernel (e.g., RBF) ensures that nearby actions have similar \(Q^+\) values, effectively implementing the kinetic penalty.
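
Here is a minimal sketch of such a product kernel, assuming RBF components; the helper names and length scales (`ls_state`, `ls_action`) are assumptions for illustration, not GRL's actual API.

import numpy as np

def rbf(x, y, length_scale):
    """Squared-exponential (RBF) kernel on vectors x and y."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-0.5 * np.dot(d, d) / length_scale**2))

def k_augmented(s, theta, s2, theta2, ls_state=1.0, ls_action=0.5):
    """Product kernel k((s, theta), (s', theta')) = k_s(s, s') * k_theta(theta, theta')."""
    return rbf(s, s2, ls_state) * rbf(theta, theta2, ls_action)

# Nearby action parameters get high similarity, so their Q+ values are coupled:
print(k_augmented([0.0, 1.0], [0.2], [0.0, 1.0], [0.25]))   # close to 1
print(k_augmented([0.0, 1.0], [0.2], [0.0, 1.0], [1.5]))    # much smaller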


4. Implications for Action Discovery

4.1 Beyond Fixed Action Sets

Traditional RL assumes a fixed action space:

  • Discrete: \(\mathcal{A} = \{a_1, a_2, \ldots, a_n\}\)
  • Continuous: \(\mathcal{A} = \mathbb{R}^d\) with pre-defined parameterization

GRL with least action: Actions are discovered by minimizing the action functional.

The agent learns:

  1. What actions are smooth (low kinetic cost)
  2. What actions are effective (high \(Q^+\), low energy)
  3. How to balance exploration and exploitation (\(\lambda\) temperature)

No pre-defined action repertoire needed!


4.2 Gradient Flow on the Energy Landscape

The Euler-Lagrange view suggests second-order dynamics on the energy landscape; adding friction and thermal noise at temperature \(\lambda\) gives the stochastic equation of motion:

\[\ddot{\theta}_t = -\nabla_\theta E(s_t, \theta_t) - \gamma \dot{\theta}_t + \sqrt{2\gamma\lambda} \, \xi_t\]

where \(\xi_t\) is standard white noise and \(\gamma\) is a friction coefficient.

In the overdamped limit (friction dominates inertia; time rescaled by \(\gamma\)), this becomes:

\[\dot{\theta}_t = -\nabla_\theta E(s_t, \theta_t) + \sqrt{2\lambda} \, \xi_t = \nabla_\theta Q^+(s_t, \theta_t) + \sqrt{2\lambda} \, \xi_t\]

This is Langevin dynamics!

  • Deterministic part: Follow the gradient of \(Q^+\) uphill (toward high-value actions)
  • Stochastic part: Explore via temperature-controlled noise
  • Result: Agent naturally discovers smooth, high-value action trajectories

4.3 Neural Network Policies as Action Minimizers

When implementing GRL with a neural network \(Q_\phi(s, \theta)\):

Policy optimization becomes:

\[\theta_t \sim \pi_\phi(\theta|s_t) = \frac{\exp(Q_\phi(s_t, \theta)/\lambda)}{\int \exp(Q_\phi(s_t, \theta')/\lambda) d\theta'}\]

Sampling via gradient flow:

  1. Initialize \(\theta_0\) randomly or from heuristic
  2. Update: \(\theta_{t+1} = \theta_t + \alpha \nabla_\theta Q_\phi(s_t, \theta_t) + \sqrt{2\alpha\lambda} \, \epsilon_t\)
  3. Repeat until convergence

This is Langevin Monte Carlo sampling from the Boltzmann distribution!

The agent doesn't need a pre-defined action set—it samples actions from the energy landscape shaped by learning.


5. Principled Policy Optimization

5.1 The Energy-Based Learning Objective

Given the least action principle, the natural learning objective is:

Minimize expected action over trajectories:

\[J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\left[\int_0^T \left[E(s_t, \theta_t) + \frac{1}{2\lambda}\|\dot{\theta}_t\|^2\right] dt\right]\]

subject to the environment dynamics, written in discrete time as \(s_{t+1} = f(s_t, \theta_t, w_t)\).

In practice (episodic RL):

\[J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\left[\sum_{t=0}^T \left[-r_t + \frac{1}{2\lambda}\|\theta_{t+1} - \theta_t\|^2\right]\right]\]

This naturally balances:

  • Reward maximization: via \(-r_t\) (minimize negative reward)
  • Action smoothness: via kinetic penalty
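
A minimal sketch of this discrete objective for a single episode, with hypothetical reward and action-parameter sequences as inputs:

import numpy as np

def episode_action(rewards, thetas, lam):
    """Discrete action of one episode: sum of (-r_t) plus the kinetic smoothness penalty."""
    kinetic = sum(float(np.sum((thetas[t + 1] - thetas[t])**2)) for t in range(len(thetas) - 1))
    return -float(np.sum(rewards)) + kinetic / (2.0 * lam)

# Example: smoother action sequences incur a smaller kinetic penalty.
r = [1.0, 0.5, 2.0]
smooth = [np.array([0.0]), np.array([0.1]), np.array([0.2])]
jerky = [np.array([0.0]), np.array([1.0]), np.array([-1.0])]
print(episode_action(r, smooth, lam=0.5) < episode_action(r, jerky, lam=0.5))   # True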

5.2 Natural Gradient on the Policy Manifold

The least action principle also suggests using the natural gradient (Amari 1998):

\[\nabla_\phi^{\text{nat}} J = F^{-1} \nabla_\phi J\]

where \(F\) is the Fisher information matrix (the Riemannian metric on the policy space).

Why this is "natural": It measures policy distance in terms of KL divergence, not Euclidean distance in parameter space.

Connection to action: the Fisher metric is the second-order (infinitesimal) expansion of the KL divergence between nearby policies, so natural-gradient steps measure change on the policy manifold itself rather than in raw parameter space, which is the same regularizing role the kinetic term plays for trajectories.

Practical algorithms:

  • TRPO (Trust Region Policy Optimization)
  • PPO (Proximal Policy Optimization)
  • Natural Actor-Critic

All of these limit how far each update moves the policy, mirroring the kinetic penalty in the action functional.
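
As a toy sketch (a deliberately simple setup, not any specific GRL or library routine): for a one-dimensional Gaussian policy with fixed variance, the Fisher information for the mean is \(1/\sigma^2\), so a natural-gradient step simply rescales the vanilla policy gradient by \(\sigma^2\).

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                         # Gaussian policy N(mu, sigma^2)
reward = lambda a: -(a - 2.0)**2             # synthetic reward peaked at a = 2

for _ in range(200):
    a = rng.normal(mu, sigma, size=256)      # sample actions from the policy
    score = (a - mu) / sigma**2              # grad_mu log pi(a)
    grad = np.mean(score * reward(a))        # vanilla (Euclidean) policy gradient
    fisher = 1.0 / sigma**2                  # Fisher information for mu
    mu += 0.1 * grad / fisher                # natural-gradient ascent step
print(mu)                                    # should settle near 2.0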


5.3 Smoothness as Inductive Bias

The kinetic term \(\frac{1}{2\lambda}\|\dot{\theta}\|^2\) is an inductive bias favoring smooth policies.

Why this helps learning:

  • Reduces sample complexity (smooth functions generalize better)
  • Improves stability (prevents policy collapse)
  • Encodes physical priors (real systems have inertia)

In GRL: This is naturally encoded by:

  1. Kernel smoothness: RBF kernels enforce continuity
  2. Particle memory: Weighted neighbors smooth the \(Q^+\) estimate
  3. MemoryUpdate propagation: \(\lambda_{\text{prop}}\) controls local smoothing

6. Connection to GRL's Core Ideas

6.1 Energy Function (Chapter 03)

The energy \(E(z) = -Q^+(z)\) is the potential in the action functional:

\[S[\tau] = \int \left[E(z_t) + \frac{1}{2\lambda}\|\dot{z}_t\|^2\right] dt\]

Why call it energy?

  • Consistent with physics (potential energy landscape)
  • Optimal trajectories minimize total energy + kinetic cost
  • Connects to statistical mechanics (Boltzmann distribution)

6.2 Reinforcement Field (Chapter 04)

The reinforcement field \(Q^+: \mathcal{Z} \to \mathbb{R}\) defines the potential energy landscape.

From least action perspective:

  • High \(Q^+\) regions: Low potential energy, attractors
  • Low \(Q^+\) regions: High potential energy, repellers
  • Gradient \(\nabla Q^+\): Force field guiding action selection

The field emerges from particles (Chapter 05), which act like "mass distributions" creating the energy landscape.
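
For intuition only, one plausible kernel-weighted form of such a particle-generated field is sketched below; the actual GRL construction is the one given in Chapter 05, and the function here is purely illustrative.

import numpy as np

def q_plus_from_particles(z, particle_locations, particle_weights, kernel):
    """Superpose particle influences: Q+(z) approximated as sum_i w_i * k(z, z_i)."""
    k = np.array([kernel(z, z_i) for z_i in particle_locations])
    return float(np.dot(particle_weights, k))

# The corresponding energy landscape is E(z) = -Q+(z): heavily weighted particles
# carve deeper potential wells around themselves.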


6.3 MemoryUpdate (Chapter 06)

MemoryUpdate modifies the particle ensemble, which reshapes the energy landscape.

From least action perspective:

  • Adding particle \((z_{\text{new}}, w_{\text{new}})\): Creates a potential well at \(z_{\text{new}}\)
  • Propagating weights: Smooths the landscape (kinetic regularization)
  • Hard threshold \(\epsilon\): Limits influence radius (finite-range potential)

The updated field \(Q^+_{\text{new}}\) guides future action selection via gradient flow.


6.4 RF-SARSA (Chapter 07, coming next)

RF-SARSA implements temporal difference learning on the energy landscape.

From least action perspective:

  • TD error: Mismatch between predicted and actual action along trajectory
  • Weight update: Adjusts potential to make future trajectories optimal
  • Exploration (\(\lambda\)): Temperature for Langevin sampling

The algorithm is performing stochastic gradient descent on the expected action over trajectories!


7. Practical Implementation Notes

7.1 Choosing the Temperature \(\lambda\)

The temperature \(\lambda\) controls exploration:

High \(\lambda\) (hot):

  • Broad distribution over actions
  • More exploration
  • Good early in learning

Low \(\lambda\) (cold):

  • Peaked distribution (near-greedy)
  • More exploitation
  • Good after convergence

Typical schedule: Exponential decay \(\lambda_t = \lambda_0 \cdot \alpha^t\) with \(\alpha \approx 0.99\).
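
A minimal sketch of such a schedule (the starting value and decay rate are just example numbers):

def lambda_schedule(t, lambda_0=1.0, alpha=0.99):
    """Exponentially decaying temperature: lambda_t = lambda_0 * alpha**t."""
    return lambda_0 * alpha**t

print(lambda_schedule(0), lambda_schedule(300))   # 1.0  ~0.049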


7.2 Implementing Gradient Flow

For continuous action parameters \(\theta \in \mathbb{R}^d\):

import torch

def sample_action_langevin(Q_field, s, theta_init, lambda_temp, n_steps=10, step_size=0.01):
    """Sample an action parameter via Langevin dynamics on the Q+ landscape."""
    theta = theta_init.clone()

    for _ in range(n_steps):
        # Gradient of Q+ w.r.t. theta: ∇_θ Q+(s, θ)
        grad_Q = Q_field.gradient(s, theta)

        # Langevin update: ascend the Q+ gradient, plus temperature-scaled Gaussian noise
        noise = torch.randn_like(theta)
        theta = theta + step_size * grad_Q + (2 * step_size * lambda_temp) ** 0.5 * noise

    return theta

Practical note: For high-dimensional \(\theta\), use Metropolis-adjusted Langevin (MALA) for better convergence.


7.3 Kinetic Regularization in Loss

When training a neural network \(Q_\phi(s, \theta)\):

import torch

def compute_loss(Q_phi, trajectories, lambda_kinetic, gamma=0.99):
    """Action-based loss: energy (negative reward) + kinetic smoothness penalty + TD error."""
    loss = 0.0

    for tau in trajectories:
        for t in range(len(tau) - 1):
            s_t, theta_t, r_t = tau[t]
            s_tp1, theta_tp1, r_tp1 = tau[t + 1]

            # Energy term: negative reward
            loss += -r_t

            # Kinetic term: smoothness penalty on consecutive action parameters
            loss += (1 / (2 * lambda_kinetic)) * torch.norm(theta_tp1 - theta_t) ** 2

            # TD error (semi-gradient target; details in Chapter 07)
            Q_current = Q_phi(s_t, theta_t)
            Q_next = Q_phi(s_tp1, theta_tp1).detach()
            td_error = r_t + gamma * Q_next - Q_current
            loss += td_error ** 2

    return loss / len(trajectories)

8. Summary: Why Least Action Matters for GRL

Physics justification:

  • Energy-based formulation is not arbitrary—it's grounded in fundamental physics
  • Boltzmann policy emerges naturally from action minimization
  • Smooth trajectories are optimal, not just convenient

Algorithmic benefits:

  • Principled exploration via temperature \(\lambda\)
  • Natural regularization via kinetic penalty
  • Gradient-based action discovery (no fixed action sets needed)

Theoretical depth:

  • Connects RL to 300+ years of physics and optimal control
  • Provides a unified framework (discrete, continuous, hybrid actions)
  • Opens path to advanced techniques (natural gradients, Riemannian optimization)

Next steps:

  • Chapter 07: RF-SARSA — How to learn \(Q^+\) via temporal differences
  • Quantum-Inspired Chapter 09 — Path integrals and Feynman's formulation

Further Reading

Path Integral Control:

  • Kappen, H. J. (2005). "Path integrals and symmetry breaking for optimal control theory." Journal of Statistical Mechanics.
  • Todorov, E. (2009). "Efficient computation of optimal actions." PNAS.
  • Theodorou, E., Buchli, J., & Schaal, S. (2010). "A generalized path integral control approach to reinforcement learning." JMLR.

Variational Principles:

  • Goldstein, H., Poole, C., & Safko, J. (2002). Classical Mechanics (3rd ed.), Chapter 2.
  • Landau, L. D., & Lifshitz, E. M. (1976). Mechanics (3rd ed.), Chapter 2.

Natural Gradients & Policy Optimization:

  • Amari, S. (1998). "Natural gradient works efficiently in learning." Neural Computation.
  • Schulman, J., et al. (2015). "Trust region policy optimization." ICML.

← Back to Chapter 03: Energy and Fitness | Next: Chapter 04 →

Related: Quantum-Inspired Chapter 09 - Path Integrals