
Chapter 03a: The Principle of Least Action (Supplement)

Purpose: This supplement bridges classical physics and reinforcement learning, showing why GRL's energy-based formulation is not just convenient notation—it's a principled framework grounded in one of the most fundamental laws of physics: the principle of least action.

Why this matters for GRL:

  • Explains why the Boltzmann policy \(\pi(\theta|s) \propto \exp(Q^+/\lambda)\) emerges naturally
  • Provides a principled way for agents to discover smooth, optimal actions (not just select from pre-defined sets)
  • Connects modern RL to 300+ years of physics and optimal control theory

1. Classical Mechanics: A Crash Course

1.1 The Action Functional

In classical mechanics, a particle doesn't "choose" its trajectory arbitrarily. Among all possible paths from point A to point B, nature selects the one that minimizes a quantity called the action.

The action functional \(S[\gamma]\) assigns a real number to each possible trajectory \(\gamma(t)\):

\[S[\gamma] = \int_{t_0}^{t_f} L(q(t), \dot{q}(t), t) \, dt\]

where:

  • \(L(q, \dot{q}, t)\) = Lagrangian = Kinetic Energy - Potential Energy
  • \(q(t)\) = position at time \(t\)
  • \(\dot{q}(t)\) = velocity at time \(t\)

Principle of Least Action: The actual trajectory taken by the system is the one that makes \(S[\gamma]\) stationary (usually a minimum).

Example: Free particle

For a particle of mass \(m\) moving freely:

\[L = \frac{1}{2}m\dot{q}^2\]

The action is:

\[S[\gamma] = \int_{t_0}^{t_f} \frac{1}{2}m\dot{q}^2 \, dt\]

Result: The trajectory that minimizes action is a straight line at constant velocity—Newton's first law emerges from a variational principle!
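
To make this concrete, here is a minimal numpy sketch (illustrative only, not part of the chapter's code) that discretizes the free-particle action and compares the straight-line path against randomly perturbed paths with the same endpoints; the straight line always yields the smallest action.

import numpy as np

def action(q, dt, m=1.0):
    """Discretized free-particle action: integral of (1/2) m q_dot^2 along the path."""
    q_dot = np.diff(q) / dt
    return np.sum(0.5 * m * q_dot**2) * dt

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
dt = t[1] - t[0]
straight = 5.0 * t                                   # straight line from q(0)=0 to q(1)=5

S_straight = action(straight, dt)
for _ in range(5):
    bump = np.sin(np.pi * t) * rng.normal(0.0, 0.5)  # perturbation vanishing at the endpoints
    S_perturbed = action(straight + bump, dt)
    print(S_perturbed >= S_straight)                 # expected: True for every perturbation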


1.2 The Euler-Lagrange Equations

Minimizing the action leads to the Euler-Lagrange equations:

\[\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{q}}\right) - \frac{\partial L}{\partial q} = 0\]

These are equivalent to Newton's equations of motion.

Example: Particle in potential

For \(L = \frac{1}{2}m\dot{q}^2 - V(q)\):

\[m\ddot{q} = -\frac{\partial V}{\partial q}\]

This is Newton's second law: \(F = ma\).
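
As a quick symbolic check (illustrative only), the sympy sketch below applies the Euler-Lagrange equation to a harmonic oscillator, \(V(q) = \frac{1}{2}kq^2\), and recovers \(m\ddot{q} = -kq\).

import sympy as sp
from sympy.calculus.euler import euler_equations

t = sp.symbols('t')
m, k = sp.symbols('m k', positive=True)
q = sp.Function('q')

# Lagrangian: kinetic energy minus potential energy
L = sp.Rational(1, 2) * m * q(t).diff(t)**2 - sp.Rational(1, 2) * k * q(t)**2

# Euler-Lagrange: d/dt(dL/dq_dot) - dL/dq = 0  ->  m*q'' + k*q = 0
print(euler_equations(L, [q(t)], t))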


1.3 Why Is This Powerful?

The action principle is powerful because:

  1. Unified framework: Works for all physical systems (mechanics, optics, quantum mechanics, field theory)
  2. Coordinate-free: The action is independent of coordinate choices
  3. Reveals conservation laws: Via Noether's theorem (symmetries ↔ conserved quantities)
  4. Generalizes naturally: Extends to stochastic systems, optimal control, and RL

2. From Physics to Control: Path Integral Control

2.1 The Control Problem

In optimal control, we want to find a trajectory that:

  • Starts at state \(s_0\)
  • Ends at goal state \(s_f\) (or maximizes reward)
  • Minimizes a cost functional

Sound familiar? This is exactly an action minimization problem!

Control action functional:

\[S[\tau] = \int_{t_0}^{t_f} \left[ C(s_t, u_t) + \frac{1}{2\nu} \|u_t\|^2 \right] dt\]

where:

  • \(C(s, u)\) = instantaneous cost (like potential energy)
  • \(\frac{1}{2\nu} \|u_t\|^2\) = control cost (like kinetic energy)
  • \(u_t\) = control input at time \(t\)
  • \(\nu\) = "temperature" parameter (exploration vs. exploitation)

The control Lagrangian is:

\[L(s, u) = -C(s, u) - \frac{1}{2\nu} \|u\|^2\]

(Note the signs: the control action adds the cost and control-effort terms, so \(S[\tau] = -\int L \, dt\); minimizing the action is the same as maximizing the accumulated Lagrangian.)


2.2 Optimal Policy from Action

Key insight (Kappen 2005, Todorov 2009): The optimal stochastic policy is:

\[\pi^*(u|s) \propto \exp\left(-\frac{1}{\nu} S[s \to u]\right)\]

where \(S[s \to u]\) is the action along the optimal trajectory from \(s\) when applying control \(u\).

This is a Boltzmann distribution over actions!

The policy naturally emerges from minimizing action, with temperature \(\nu\) controlling the sharpness:

  • High \(\nu\) (high temperature) → More exploration, softer policy
  • Low \(\nu\) (low temperature) → More exploitation, sharper policy
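
A tiny numeric illustration of this temperature effect, using made-up action values \(S = 1, 2, 3\) for three candidate controls:

import numpy as np

S = np.array([1.0, 2.0, 3.0])                # action of each candidate control
for nu in (5.0, 0.1):                        # high vs. low temperature
    p = np.exp(-S / nu)
    p /= p.sum()
    print(f"nu={nu}: {np.round(p, 3)}")
# High nu: weights are nearly uniform (exploration).
# Low nu: almost all mass sits on the lowest-action control (exploitation).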

2.3 The Cost-to-Go as "Potential"

In path integral control, the cost-to-go (or value function) plays the role of potential energy:

\[V(s) = \min_{u_{[t,\infty)}} \mathbb{E}\left[\int_{t}^{\infty} C(s_{t'}, u_{t'}) \, dt' \,\middle|\, s_t = s\right]\]

The optimal control descends the cost-to-go:

\[u^*(s) \propto -\nabla_s V(s) = \nu \nabla_s \log \mathcal{Z}(s)\]

where \(\mathcal{Z}(s)\) is the "partition function" (a sum over all trajectories) and \(V(s) = -\nu \log \mathcal{Z}(s)\).

This is gradient descent on an energy landscape!


3. GRL's Boltzmann Policy as Least Action

3.1 The GRL Action Functional

In GRL, we work in augmented state-action space \(z = (s, \theta)\). The natural action functional is:

\[S[\tau] = \int_{t_0}^{t_f} \left[ E(s_t, \theta_t) + \frac{1}{2\lambda} \|\dot{\theta}_t\|^2 \right] dt\]

where:

  • \(E(s, \theta) = -Q^+(s, \theta)\) = energy landscape (potential)
  • \(\|\dot{\theta}_t\|^2\) = "kinetic energy" of action parameter changes
  • \(\lambda\) = temperature (controls exploration)

The GRL Lagrangian:

\[L(s, \theta, \dot{\theta}) = -E(s, \theta) - \frac{1}{2\lambda} \|\dot{\theta}\|^2 = Q^+(s, \theta) - \frac{1}{2\lambda} \|\dot{\theta}\|^2\]

This says: Good trajectories have high \(Q^+\) (high reward potential) and smooth changes in action parameters (low kinetic cost).


3.2 Why the Boltzmann Policy Emerges

From path integral control theory, the optimal policy is:

\[\pi^*(\theta|s) \propto \exp\left(-\frac{1}{\lambda} S[s \to \theta]\right)\]

For a single-step decision (no trajectory dynamics), this simplifies to:

\[\pi^*(\theta|s) \propto \exp\left(-\frac{1}{\lambda} E(s, \theta)\right) = \exp\left(\frac{Q^+(s, \theta)}{\lambda}\right)\]

This is exactly GRL's Boltzmann policy!

It's not an ad-hoc choice—it's the optimal policy under the action minimization principle.
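
As a concrete sketch, one can discretize the action-parameter space and form this Boltzmann policy directly; the `Q_plus` callable and the grid below are placeholders for illustration, not part of the GRL codebase.

import numpy as np

def boltzmann_policy(Q_plus, s, theta_grid, lam):
    """Discretized Boltzmann policy: pi(theta|s) proportional to exp(Q+(s, theta)/lambda)."""
    q = np.array([Q_plus(s, theta) for theta in theta_grid])
    logits = q / lam
    logits -= logits.max()                   # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Example with a toy Q+ peaked at theta = 0.3:
theta_grid = np.linspace(-1.0, 1.0, 201)
probs = boltzmann_policy(lambda s, th: -(th - 0.3)**2, s=None, theta_grid=theta_grid, lam=0.05)
print(theta_grid[np.argmax(probs)])          # prints a value near 0.3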


3.3 Smooth Actions from Kinetic Regularization

The kinetic term \(\frac{1}{2\lambda}\|\dot{\theta}\|^2\) penalizes rapid changes in action parameters.

Why this matters:

  • Prevents jerky, discontinuous actions
  • Encourages smooth, physically realizable trajectories
  • Natural regularization (Occam's razor for actions)

Example: Robotic reaching

Without kinetic penalty: the agent might command wild, discontinuous joint torques.

With kinetic penalty: the agent learns smooth, human-like reaching motions.

In GRL: The kernel function \(k(z, z')\) implicitly encodes this smoothness preference!

\[k((s, \theta), (s', \theta')) = k_s(s, s') \cdot k_\theta(\theta, \theta')\]

A smooth kernel (e.g., RBF) ensures that nearby actions have similar \(Q^+\) values, effectively implementing the kinetic penalty.
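
Here is a minimal sketch of such a product kernel, assuming RBF components; the helper names and length scales (`ls_state`, `ls_action`) are assumptions for illustration, not GRL's actual API.

import numpy as np

def rbf(x, y, length_scale):
    """Squared-exponential (RBF) kernel on vectors x and y."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-0.5 * np.dot(d, d) / length_scale**2))

def k_augmented(s, theta, s2, theta2, ls_state=1.0, ls_action=0.5):
    """Product kernel k((s, theta), (s', theta')) = k_s(s, s') * k_theta(theta, theta')."""
    return rbf(s, s2, ls_state) * rbf(theta, theta2, ls_action)

# Nearby action parameters get high similarity, so their Q+ values are coupled:
print(k_augmented([0.0, 1.0], [0.2], [0.0, 1.0], [0.25]))   # close to 1
print(k_augmented([0.0, 1.0], [0.2], [0.0, 1.0], [1.5]))    # much smaller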


4. Implications for Action Discovery

4.1 Beyond Fixed Action Sets

Traditional RL assumes a fixed action space:

  • Discrete: \(\mathcal{A} = \{a_1, a_2, \ldots, a_n\}\)
  • Continuous: \(\mathcal{A} = \mathbb{R}^d\) with pre-defined parameterization

GRL with least action: Actions are discovered by minimizing the action functional.

The agent learns:

  1. What actions are smooth (low kinetic cost)
  2. What actions are effective (high \(Q^+\), low energy)
  3. How to balance exploration and exploitation (\(\lambda\) temperature)

No pre-defined action repertoire needed!


4.2 Gradient Flow on the Energy Landscape

The Euler-Lagrange view suggests second-order dynamics on the energy landscape; adding friction and thermal noise at temperature \(\lambda\) gives the stochastic equation of motion:

\[\ddot{\theta}_t = -\nabla_\theta E(s_t, \theta_t) - \gamma \dot{\theta}_t + \sqrt{2\gamma\lambda} \, \xi_t\]

where \(\xi_t\) is standard white noise and \(\gamma\) is a friction coefficient.

In the overdamped limit (friction dominates inertia; time rescaled by \(\gamma\)), this becomes:

\[\dot{\theta}_t = -\nabla_\theta E(s_t, \theta_t) + \sqrt{2\lambda} \, \xi_t = \nabla_\theta Q^+(s_t, \theta_t) + \sqrt{2\lambda} \, \xi_t\]

This is Langevin dynamics!

  • Deterministic part: Follow the gradient of \(Q^+\) uphill (toward high-value actions)
  • Stochastic part: Explore via temperature-controlled noise
  • Result: Agent naturally discovers smooth, high-value action trajectories

4.3 Neural Network Policies as Action Minimizers

When implementing GRL with a neural network \(Q_\phi(s, \theta)\):

Policy optimization becomes:

\[\theta_t \sim \pi_\phi(\theta|s_t) = \frac{\exp(Q_\phi(s_t, \theta)/\lambda)}{\int \exp(Q_\phi(s_t, \theta')/\lambda) d\theta'}\]

Sampling via gradient flow:

  1. Initialize \(\theta_0\) randomly or from heuristic
  2. Update: \(\theta_{t+1} = \theta_t + \alpha \nabla_\theta Q_\phi(s_t, \theta_t) + \sqrt{2\alpha\lambda} \, \epsilon_t\)
  3. Repeat until convergence

This is Langevin Monte Carlo sampling from the Boltzmann distribution!

The agent doesn't need a pre-defined action set—it samples actions from the energy landscape shaped by learning.


5. Principled Policy Optimization

5.1 The Energy-Based Learning Objective

Given the least action principle, the natural learning objective is:

Minimize expected action over trajectories:

\[J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\left[\int_0^T \left[E(s_t, \theta_t) + \frac{1}{2\lambda}\|\dot{\theta}_t\|^2\right] dt\right]\]

subject to the environment dynamics, written in discrete time as \(s_{t+1} = f(s_t, \theta_t, w_t)\).

In practice (episodic RL):

\[J(\phi) = \mathbb{E}_{\tau \sim \pi_\phi}\left[\sum_{t=0}^T \left[-r_t + \frac{1}{2\lambda}\|\theta_{t+1} - \theta_t\|^2\right]\right]\]

This naturally balances:

  • Reward maximization: via \(-r_t\) (minimize negative reward)
  • Action smoothness: via kinetic penalty
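
A minimal sketch of this discrete objective for a single episode, with hypothetical reward and action-parameter sequences as inputs:

import numpy as np

def episode_action(rewards, thetas, lam):
    """Discrete action of one episode: sum of (-r_t) plus the kinetic smoothness penalty."""
    kinetic = sum(float(np.sum((thetas[t + 1] - thetas[t])**2)) for t in range(len(thetas) - 1))
    return -float(np.sum(rewards)) + kinetic / (2.0 * lam)

# Example: smoother action sequences incur a smaller kinetic penalty.
r = [1.0, 0.5, 2.0]
smooth = [np.array([0.0]), np.array([0.1]), np.array([0.2])]
jerky = [np.array([0.0]), np.array([1.0]), np.array([-1.0])]
print(episode_action(r, smooth, lam=0.5) < episode_action(r, jerky, lam=0.5))   # True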

5.2 Natural Gradient on the Policy Manifold

The least action principle also suggests using the natural gradient (Amari 1998):

\[\nabla_\phi^{\text{nat}} J = F^{-1} \nabla_\phi J\]

where \(F\) is the Fisher information matrix (the Riemannian metric on the policy space).

Why this is "natural": It measures policy distance in terms of KL divergence, not Euclidean distance in parameter space.

Connection to action: the Fisher metric is the second-order (infinitesimal) expansion of the KL divergence between nearby policies, so natural-gradient steps measure change on the policy manifold itself rather than in raw parameter space, which is the same regularizing role the kinetic term plays for trajectories.

Practical algorithms:

  • TRPO (Trust Region Policy Optimization)
  • PPO (Proximal Policy Optimization)
  • Natural Actor-Critic

All of these limit how far each update moves the policy, mirroring the kinetic penalty in the action functional.
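
As a toy sketch (a deliberately simple setup, not any specific GRL or library routine): for a one-dimensional Gaussian policy with fixed variance, the Fisher information for the mean is \(1/\sigma^2\), so a natural-gradient step simply rescales the vanilla policy gradient by \(\sigma^2\).

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0                         # Gaussian policy N(mu, sigma^2)
reward = lambda a: -(a - 2.0)**2             # synthetic reward peaked at a = 2

for _ in range(200):
    a = rng.normal(mu, sigma, size=256)      # sample actions from the policy
    score = (a - mu) / sigma**2              # grad_mu log pi(a)
    grad = np.mean(score * reward(a))        # vanilla (Euclidean) policy gradient
    fisher = 1.0 / sigma**2                  # Fisher information for mu
    mu += 0.1 * grad / fisher                # natural-gradient ascent step
print(mu)                                    # should settle near 2.0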


5.3 Smoothness as Inductive Bias

The kinetic term \(\frac{1}{2\lambda}\|\dot{\theta}\|^2\) is an inductive bias favoring smooth policies.

Why this helps learning:

  • Reduces sample complexity (smooth functions generalize better)
  • Improves stability (prevents policy collapse)
  • Encodes physical priors (real systems have inertia)

In GRL: This is naturally encoded by:

  1. Kernel smoothness: RBF kernels enforce continuity
  2. Particle memory: Weighted neighbors smooth the \(Q^+\) estimate
  3. MemoryUpdate propagation: \(\lambda_{\text{prop}}\) controls local smoothing

6. Connection to GRL's Core Ideas

6.1 Energy Function (Chapter 03)

The energy \(E(z) = -Q^+(z)\) is the potential in the action functional:

\[S[\tau] = \int \left[E(z_t) + \frac{1}{2\lambda}\|\dot{z}_t\|^2\right] dt\]

Why call it energy?

  • Consistent with physics (potential energy landscape)
  • Optimal trajectories minimize total energy + kinetic cost
  • Connects to statistical mechanics (Boltzmann distribution)

6.2 Reinforcement Field (Chapter 04)

The reinforcement field \(Q^+: \mathcal{Z} \to \mathbb{R}\) defines the potential energy landscape.

From least action perspective:

  • High \(Q^+\) regions: Low potential energy, attractors
  • Low \(Q^+\) regions: High potential energy, repellers
  • Gradient \(\nabla Q^+\): Force field guiding action selection

The field emerges from particles (Chapter 05), which act like "mass distributions" creating the energy landscape.
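
For intuition only, one plausible kernel-weighted form of such a particle-generated field is sketched below; the actual GRL construction is the one given in Chapter 05, and the function here is purely illustrative.

import numpy as np

def q_plus_from_particles(z, particle_locations, particle_weights, kernel):
    """Superpose particle influences: Q+(z) approximated as sum_i w_i * k(z, z_i)."""
    k = np.array([kernel(z, z_i) for z_i in particle_locations])
    return float(np.dot(particle_weights, k))

# The corresponding energy landscape is E(z) = -Q+(z): heavily weighted particles
# carve deeper potential wells around themselves.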


6.3 MemoryUpdate (Chapter 06)

MemoryUpdate modifies the particle ensemble, which reshapes the energy landscape.

From least action perspective:

  • Adding particle \((z_{\text{new}}, w_{\text{new}})\): Creates a potential well at \(z_{\text{new}}\)
  • Propagating weights: Smooths the landscape (kinetic regularization)
  • Hard threshold \(\epsilon\): Limits influence radius (finite-range potential)

The updated field \(Q^+_{\text{new}}\) guides future action selection via gradient flow.


6.4 RF-SARSA (Chapter 07, coming next)

RF-SARSA implements temporal difference learning on the energy landscape.

From least action perspective:

  • TD error: Mismatch between predicted and actual action along trajectory
  • Weight update: Adjusts potential to make future trajectories optimal
  • Exploration (\(\lambda\)): Temperature for Langevin sampling

The algorithm is performing stochastic gradient descent on the expected action over trajectories!


7. Practical Implementation Notes

7.1 Choosing the Temperature \(\lambda\)

The temperature \(\lambda\) controls exploration:

High \(\lambda\) (hot):

  • Broad distribution over actions
  • More exploration
  • Good early in learning

Low \(\lambda\) (cold):

  • Peaked distribution (near-greedy)
  • More exploitation
  • Good after convergence

Typical schedule: Exponential decay \(\lambda_t = \lambda_0 \cdot \alpha^t\) with \(\alpha \approx 0.99\).
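
A minimal sketch of such a schedule (the starting value and decay rate are just example numbers):

def lambda_schedule(t, lambda_0=1.0, alpha=0.99):
    """Exponentially decaying temperature: lambda_t = lambda_0 * alpha**t."""
    return lambda_0 * alpha**t

print(lambda_schedule(0), lambda_schedule(300))   # 1.0  ~0.049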


7.2 Implementing Gradient Flow

For continuous action parameters \(\theta \in \mathbb{R}^d\):

import torch

def sample_action_langevin(Q_field, s, theta_init, lambda_temp, n_steps=10, step_size=0.01):
    """Sample an action parameter via Langevin dynamics on the Q+ landscape."""
    theta = theta_init.clone()

    for _ in range(n_steps):
        # Gradient of Q+ w.r.t. theta: ∇_θ Q+(s, θ)
        grad_Q = Q_field.gradient(s, theta)

        # Langevin update: ascend the Q+ gradient, plus temperature-scaled Gaussian noise
        noise = torch.randn_like(theta)
        theta = theta + step_size * grad_Q + (2 * step_size * lambda_temp) ** 0.5 * noise

    return theta

Practical note: For high-dimensional \(\theta\), use Metropolis-adjusted Langevin (MALA) for better convergence.


7.3 Kinetic Regularization in Loss

When training a neural network \(Q_\phi(s, \theta)\):

import torch

def compute_loss(Q_phi, trajectories, lambda_kinetic, gamma=0.99):
    """Action-based loss: energy (negative reward) + kinetic smoothness penalty + TD error."""
    loss = 0.0

    for tau in trajectories:
        for t in range(len(tau) - 1):
            s_t, theta_t, r_t = tau[t]
            s_tp1, theta_tp1, r_tp1 = tau[t + 1]

            # Energy term: negative reward
            loss += -r_t

            # Kinetic term: smoothness penalty on consecutive action parameters
            loss += (1 / (2 * lambda_kinetic)) * torch.norm(theta_tp1 - theta_t) ** 2

            # TD error (semi-gradient target; details in Chapter 07)
            Q_current = Q_phi(s_t, theta_t)
            Q_next = Q_phi(s_tp1, theta_tp1).detach()
            td_error = r_t + gamma * Q_next - Q_current
            loss += td_error ** 2

    return loss / len(trajectories)

8. Summary: Why Least Action Matters for GRL

Physics justification:

  • Energy-based formulation is not arbitrary—it's grounded in fundamental physics
  • Boltzmann policy emerges naturally from action minimization
  • Smooth trajectories are optimal, not just convenient

Algorithmic benefits:

  • Principled exploration via temperature \(\lambda\)
  • Natural regularization via kinetic penalty
  • Gradient-based action discovery (no fixed action sets needed)

Theoretical depth:

  • Connects RL to 300+ years of physics and optimal control
  • Provides a unified framework (discrete, continuous, hybrid actions)
  • Opens path to advanced techniques (natural gradients, Riemannian optimization)

Next steps:

  • Chapter 07: RF-SARSA — How to learn \(Q^+\) via temporal differences
  • Quantum-Inspired Chapter 09 — Path integrals and Feynman's formulation

Further Reading

Path Integral Control:

  • Kappen, H. J. (2005). "Path integrals and symmetry breaking for optimal control theory." Journal of Statistical Mechanics.
  • Todorov, E. (2009). "Efficient computation of optimal actions." PNAS.
  • Theodorou, E., Buchli, J., & Schaal, S. (2010). "A generalized path integral control approach to reinforcement learning." JMLR.

Variational Principles:

  • Goldstein, H., Poole, C., & Safko, J. (2002). Classical Mechanics (3rd ed.), Chapter 2.
  • Landau, L. D., & Lifshitz, E. M. (1976). Mechanics (3rd ed.), Chapter 2.

Natural Gradients & Policy Optimization:

  • Amari, S. (1998). "Natural gradient works efficiently in learning." Neural Computation.
  • Schulman, J., et al. (2015). "Trust region policy optimization." ICML.

← Back to Chapter 03: Energy and Fitness | Next: Chapter 04 →

Related: Quantum-Inspired Chapter 09 - Path Integrals