Chapter 10: Leveraging Complex-Valued Kernels for GRL — A Practical Tutorial¶
Prerequisites: Chapter 03 — Complex-Valued RKHS (theory), Chapter 01a — Wavefunction Interpretation, Chapter 07 — Learning the Field Beyond GP.
Companion notebook: planned notebooks/field_series/03_complex_fields.ipynb.
Status: 🔬 Tutorial draft — theory is established in Chapter 03; this document focuses on how to actually use complex kernels in practice.
0. What This Tutorial Does¶
Chapter 03 established that we can move GRL from a real-valued reinforcement field \(Q^+(z) \in \mathbb{R}\) to a complex-valued wavefunction \(\Psi(z) \in \mathbb{C}\). This chapter answers the follow-up question:
Given that I have a complex kernel, what concretely changes in the GRL pipeline, and when do I gain something from it?
We walk through five practical patterns, with pseudocode, gradient derivations, pitfalls, and when not to reach for a complex kernel.
1. The Minimum Viable Change¶
The entire GRL machinery goes through with one substitution. The real-valued reinforcement field

\[ Q^+(z) = \sum_i w_i\, k(z, z_i), \qquad w_i \in \mathbb{R}, \]

becomes the complex wavefunction

\[ \Psi(z) = \sum_i c_i\, k_{\mathbb{C}}(z, z_i), \qquad c_i \in \mathbb{C}, \]

with \(c_i = w_i e^{i \phi_i}\) packing magnitude and phase. The field value at \(z\) (analogue of \(Q^+\)) is the Born rule:

\[ V(z) = \lvert\Psi(z)\rvert^2 = \Psi^*(z)\,\Psi(z). \]
Everything downstream — gradients, policy inference, memory update — generalizes straightforwardly, but with a few subtleties that are worth making explicit.
What you store, what you compute¶
| Quantity | Real-valued GRL | Complex-valued GRL |
|---|---|---|
| Particle | \((z_i, w_i)\) | \((z_i, w_i, \phi_i)\) or directly \((z_i, c_i)\) |
| Kernel evaluation | \(k(z, z_i) \in \mathbb{R}\) | \(k_{\mathbb{C}}(z, z_i) \in \mathbb{C}\) |
| Field | \(Q^+(z) \in \mathbb{R}\) | \(\Psi(z) \in \mathbb{C}\) |
| Value (for policy) | \(Q^+(z)\) directly | \(V(z) = \lvert\Psi(z)\rvert^2\) |
| Gradient for policy | \(\nabla_a Q^+(s, a)\) | \(\nabla_a \lvert\Psi(s, a)\rvert^2\) |
The only structural change is "what does the policy read off the field?" In real GRL you ascend \(Q^+\) directly. In complex GRL you ascend the squared magnitude. This is what enables interference.
2. Worked Example: Two Particles, Three Phase Choices¶
Take two particles on the real line at \(z_1 = -1, z_2 = +1\), both with magnitude \(w_1 = w_2 = 1\), and a Gaussian kernel with \(\ell = 1\):

\[ \Psi(z) = e^{i\phi_1}\, e^{-(z - z_1)^2/2} + e^{i\phi_2}\, e^{-(z - z_2)^2/2}. \]

(Here we absorb each particle's phase into its own kernel evaluation for simplicity; equivalently one can set \(\phi_i\) on the complex weight \(c_i\).)
Case A — Aligned phases (\(\phi_1 = \phi_2 = 0\)). Reduces to real GRL. \(\Psi(z) = e^{-(z+1)^2/2} + e^{-(z-1)^2/2}\), giving a smooth landscape with a single bump-like region between the particles. Constructive interference throughout.
Case B — Opposite phases (\(\phi_1 = 0\), \(\phi_2 = \pi\)). Here \(c_2 = -1\). At the midpoint \(z = 0\), the two kernel contributions have equal magnitude and opposite sign, so \(\Psi(0) = 0\) and hence \(V(0) = 0\). A node forms between them — the value landscape has a valley at the midpoint even though both particles have positive magnitude.
Case C — Quadrature (\(\phi_1 = 0\), \(\phi_2 = \pi/2\)). At the midpoint, \(\Psi(0) = k(0, z_1) + i\, k(0, z_2)\). The magnitudes add in quadrature: \(|\Psi(0)|^2 = k(0, z_1)^2 + k(0, z_2)^2\) — no interference at the midpoint, just magnitude addition.
The same two particles produce three qualitatively different value landscapes depending on phase. Phase is a knob that shapes the reinforcement landscape without changing what events occurred.
Pseudocode¶
```python
import numpy as np

def complex_rbf(z, z_i, ell=1.0):
    """Real Gaussian amplitude; phase carried by the particle weight."""
    return np.exp(-np.sum((z - z_i)**2) / (2 * ell**2))

def psi(z, particles, ell=1.0):
    """Complex wavefunction at z."""
    out = 0.0 + 0.0j
    for p in particles:
        c = p["w"] * np.exp(1j * p["phi"])
        out += c * complex_rbf(z, p["z"], ell)
    return out

def value(z, particles, ell=1.0):
    """Born-rule value |Psi(z)|^2."""
    amp = psi(z, particles, ell)
    return (amp.conjugate() * amp).real  # == abs(amp)**2
```
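The three phase cases can be checked numerically. A self-contained sketch (the helper `midpoint_value` is introduced here for illustration only, not part of the chapter's API):

```python
import numpy as np

def midpoint_value(phi1, phi2, z1=-1.0, z2=1.0, ell=1.0):
    """|Psi(0)|^2 for two unit-magnitude particles at z1, z2 with phases phi1, phi2."""
    k1 = np.exp(-(0.0 - z1) ** 2 / (2 * ell ** 2))
    k2 = np.exp(-(0.0 - z2) ** 2 / (2 * ell ** 2))
    amp = np.exp(1j * phi1) * k1 + np.exp(1j * phi2) * k2
    return float(abs(amp) ** 2)

V_A = midpoint_value(0.0, 0.0)        # Case A, aligned: (2 e^{-1/2})^2 = 4/e
V_B = midpoint_value(0.0, np.pi)      # Case B, opposite: exact node, V = 0
V_C = midpoint_value(0.0, np.pi / 2)  # Case C, quadrature: k1^2 + k2^2 = 2/e
```

Same particles, same magnitudes: only the phase knob changes, and the midpoint value moves from maximal to zero.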
For visualization, always plot magnitude \(|\Psi(z)|\) and phase \(\arg \Psi(z)\) separately. A contour plot of \(V(z) = |\Psi|^2\) alone hides the interference structure.
3. Gradients in the Complex Setting¶
The GRL policy ascends the value landscape: \(a \leftarrow a + \eta \nabla_a V(s, a)\). With \(V = |\Psi|^2\), the gradient is

\[ \nabla_a V = \nabla_a (\Psi^* \Psi) = 2\,\mathrm{Re}\{\Psi^*\, \nabla_a \Psi\}. \]

This is standard Wirtinger calculus. The direction of steepest ascent is not just "up the kernel"; it is modulated by the complex conjugate of the current amplitude — the gradient rotates as the phase rotates.
Explicit form for a complex Gaussian kernel¶
With \(k_{\mathbb{C}}(z, z_i) = k_{\mathrm{RBF}}(z, z_i) \cdot e^{i\phi_i(z)}\) and the simplest case of constant per-particle phase \(\phi_i\):

\[ \nabla_z \Psi(z) = \sum_i w_i e^{i\phi_i} \left( -\frac{z - z_i}{\ell^2} \right) k_{\mathrm{RBF}}(z, z_i). \]
Plug into \(\nabla_z V = 2\,\mathrm{Re}\{\Psi^* \nabla_z \Psi\}\) and you have a
closed-form policy gradient. This is the direct analogue of the real-valued
build_gradient function in Notebook 2 — same structure, one extra complex
multiplication per term.
```python
def grad_value(z, particles, ell=1.0):
    """
    Gradient of V(z) = |Psi(z)|^2 with respect to z.
    Returns a real-valued gradient vector.
    """
    amp = psi(z, particles, ell)  # complex scalar
    grad_psi = np.zeros_like(z, dtype=complex)
    for p in particles:
        c = p["w"] * np.exp(1j * p["phi"])
        disp = z - p["z"]
        kern = np.exp(-np.dot(disp, disp) / (2 * ell**2))
        grad_psi += c * (-disp / ell**2) * kern
    return 2.0 * np.real(np.conjugate(amp) * grad_psi)
```
If you are working in PyTorch/JAX with complex dtypes, autograd handles all
of this — you just need to call .abs().pow(2) on the field and let the
framework differentiate. In a reference NumPy implementation, the closed-form
above is preferable because it's numerically stable and obvious to inspect.
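As a sanity check on the closed form, here is a self-contained finite-difference comparison (the particle values and query point are arbitrary choices made for this sketch):

```python
import numpy as np

particles = [{"z": np.array([-1.0, 0.5]), "w": 1.0, "phi": 0.0},
             {"z": np.array([1.0, -0.5]), "w": 0.7, "phi": 2.1}]
ell = 1.0

def psi_c(z):
    """Complex field Psi(z) for the two test particles."""
    out = 0.0 + 0.0j
    for p in particles:
        disp = z - p["z"]
        out += p["w"] * np.exp(1j * p["phi"]) * np.exp(-np.dot(disp, disp) / (2 * ell**2))
    return out

def grad_V(z):
    """Closed-form 2 Re{Psi^* grad Psi}."""
    amp = psi_c(z)
    g = np.zeros_like(z, dtype=complex)
    for p in particles:
        disp = z - p["z"]
        kern = np.exp(-np.dot(disp, disp) / (2 * ell**2))
        g += p["w"] * np.exp(1j * p["phi"]) * (-disp / ell**2) * kern
    return 2.0 * np.real(np.conjugate(amp) * g)

z0 = np.array([0.3, 0.1])
analytic = grad_V(z0)
eps = 1e-6
# Central differences on V = |Psi|^2, one coordinate at a time.
numeric = np.array([
    (abs(psi_c(z0 + eps * e)) ** 2 - abs(psi_c(z0 - eps * e)) ** 2) / (2 * eps)
    for e in np.eye(2)
])
```

The two gradients should agree to finite-difference precision.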
4. Five Patterns for Leveraging Phase¶
This is the core of the chapter: when does the complex extension pay off? Five concrete patterns, each with a one-line recipe and a caveat.
Pattern 1 — Temporal credit assignment via rotating phase¶
Recipe: At experience time \(t_i\), store particle with phase \(\phi_i = \omega\, t_i\) for some angular frequency \(\omega\).
What you get: Experiences close in time interfere constructively; those separated by half a period cancel. The value landscape naturally emphasizes recent experience without an explicit exponential discount.
When it helps: Environments with non-stationary reward structure or episodic tasks where recency matters but exact decay is unknown.
Caveat: The choice of \(\omega\) is a hyperparameter. Too small and everything is constructive (no decoherence); too large and adjacent experiences cancel spuriously. A useful heuristic: set \(\omega = \pi / T_{\text{relevant}}\) where \(T_{\text{relevant}}\) is the horizon of interest.
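A minimal sketch of the recipe and the \(\omega = \pi / T_{\text{relevant}}\) heuristic, with an arbitrary horizon of 5 time units: two visits to the same state half a period apart cancel, while two recent visits reinforce.

```python
import numpy as np

T_relevant = 5.0
omega = np.pi / T_relevant   # heuristic: half a rotation per relevant horizon

def psi_at(z, experiences):
    """Unit-magnitude particles (z_i, t_i) with Pattern-1 phase omega * t_i."""
    return sum(np.exp(1j * omega * t) * np.exp(-(z - zi) ** 2 / 2)
               for zi, t in experiences)

# Two visits to the same state, half a period apart: full cancellation.
stale_pair = [(0.0, 0.0), (0.0, T_relevant)]
# Two visits close in time: nearly constructive.
recent_pair = [(0.0, 9.0), (0.0, 10.0)]

V_stale = abs(psi_at(0.0, stale_pair)) ** 2
V_recent = abs(psi_at(0.0, recent_pair)) ** 2   # = 2 + 2 cos(omega)
```

No explicit discount factor appears anywhere; the decay is carried entirely by phase decoherence.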
Pattern 2 — Multi-task / multi-context separation via discrete phase¶
Recipe: For context \(c \in \{0, 1, \dots, C-1\}\), store each particle with phase \(\phi_i = 2\pi\, c(i) / C\), where \(c(i)\) is the context in which particle \(i\) was observed (not to be confused with the complex weight \(c_i\)). Same-context particles align; different-context particles separate around the phase circle.
What you get: Task-separated value landscapes in a single shared memory. Queries from context \(c\) naturally read out values aligned with that context's phase; other contexts contribute with scrambled phase and partially cancel.
When it helps: Meta-RL, continual learning, or any setting where you want one particle memory to serve many related tasks without explicit segregation.
Caveat: With \(C\) well-spaced phases, cross-context interference averages toward zero but does not vanish exactly unless you project explicitly. For hard separation, use a phase-mask projection at query time (see §7).
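A small illustration of the averaging claim, using \(C = 4\) contexts and one equal-magnitude particle per context at the same location — an idealized worst case for cross-talk:

```python
import numpy as np

C = 4

def context_phase(c):
    """Pattern-2 phase for context index c."""
    return 2 * np.pi * c / C

# Equal-magnitude, co-located contributions from every context.
amps = [np.exp(1j * context_phase(c)) for c in range(C)]

# The C-th roots of unity sum to zero: cross-context mass cancels exactly here.
total = sum(amps)

# Phase projection onto context 1 reads out that context's mass coherently.
aligned = np.real(np.exp(-1j * context_phase(1)) * amps[1])
```

In general the cancellation is only approximate (kernel values differ across particles), which is exactly the caveat above.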
Pattern 3 — Directional preference via position-dependent phase¶
Recipe: For navigation-style problems where the relevant structure is a direction in state-action space, define \(k_{\mathbb{C}}(z, z') = k_{\mathrm{RBF}}(z, z') \exp\!\bigl(i\, \mathbf{n} \cdot (z - z')\bigr)\) for a chosen direction vector \(\mathbf{n}\).
What you get: Particles "in front of" the agent (aligned with \(\mathbf{n}\)) interfere constructively; particles "behind" interfere destructively. The value landscape develops an implicit flow.
When it helps: Flow-field policies, asymmetric tasks where forward motion is preferred, learning from demonstrations with directional bias.
Caveat: This kernel is no longer invariant under translation in the direction of \(\mathbf{n}\). That's usually the point, but it breaks some of the closed-form nice properties (e.g., the equilibrium distribution of a random-walk policy).
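A sketch of the directional kernel; the direction vector `n` is scaled (an illustrative choice) so that the symmetric front/back pair below lands exactly in antiphase at the query:

```python
import numpy as np

def directional_kernel(z, z_prime, n, ell=1.0):
    """RBF magnitude with a plane-wave phase along direction n (Pattern 3)."""
    d = np.asarray(z) - np.asarray(z_prime)
    return np.exp(-np.dot(d, d) / (2 * ell**2)) * np.exp(1j * np.dot(n, d))

n = np.array([np.pi / 2, 0.0])     # preferred direction: +x (scale is illustrative)
z_front = np.array([1.0, 0.0])     # particle "in front" along n
z_back = np.array([-1.0, 0.0])     # particle "behind"
query = np.array([0.0, 0.0])

# Equal distance, equal magnitude, opposite phase: the pair cancels at the query.
amp = directional_kernel(query, z_front, n) + directional_kernel(query, z_back, n)
amp_real = 2 * np.exp(-0.5)        # what a plain real RBF pair would give instead
```

With a real kernel the two particles would add; the phase turns the symmetric pair into a directional null.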
Pattern 4 — Learned phase via a small network¶
Recipe: Train a small network \(\phi_\psi: \mathcal{Z} \to [0, 2\pi)\) that predicts phase from (possibly auxiliary) features. The forward model looks like \(k_{\mathbb{C}}(z, z') = k_{\mathrm{RBF}}(z, z') \exp\!\bigl(i[\phi_\psi(z) - \phi_\psi(z')]\bigr)\).
What you get: A data-adaptive phase structure that discovers whatever
interference pattern actually helps on your task. Combines with the
learned-kernel direction from
dev/GRL_extensions/learned_kernels/00-scope.md.
When it helps: When none of Patterns 1–3 obviously applies but the task seems to have latent temporal or contextual structure.
Caveat: Phase is periodic; naive regression will fight the wrap-around.
Parameterize as \((\cos, \sin)\) pair and recover \(\phi\) via atan2. Also:
anti-collapse regularization is needed just as for the learned-kernel case
(trivially \(\phi_\psi \equiv \text{const}\) recovers the real kernel).
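A minimal sketch of the wrap-around-safe parameterization, using a random linear head in place of a real network (`W`, `b`, and `phase_head` are hypothetical names introduced here, not part of any existing module):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))   # hypothetical "phase head": 3 features -> (u, v)
b = np.zeros(2)

def phase_head(features):
    """Predict a phase via an unconstrained (u, v) pair and atan2 -- no wrap-around."""
    u, v = W @ features + b
    return np.arctan2(v, u) % (2 * np.pi)   # phi in [0, 2*pi)

phi = phase_head(np.array([0.2, -1.0, 0.5]))
```

Gradients flow through \((u, v)\) without ever seeing the \(2\pi\) discontinuity; the modulo only normalizes the readout.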
Pattern 5 — Nodes as "forbidden regions"¶
Recipe: To create a hard "do not go here" region at location \(z_-\) without introducing a negative-weight particle (which just creates a valley), introduce two particles of equal magnitude and phase difference \(\pi\) straddling \(z_-\). The destructive interference zeros out the value on the line between them.
What you get: A node — a region where \(|\Psi|^2 = 0\) — which is qualitatively different from a region of low value. Gradients vanish at a node, and the node is persistent under uniform scaling.
When it helps: Modeling hard constraints (obstacles, infeasible operator parameters) through the field itself rather than external masks.
Caveat: Nodes are zero-measure structures. They are visible in \(|\Psi|^2\) but fragile under smoothing or noise. Not appropriate for soft constraints — use negative weights for those.
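A numeric sketch of the node construction, with an arbitrary straddle offset of 0.5:

```python
import numpy as np

z_minus = 0.0      # forbidden location
delta = 0.5        # straddle offset (illustrative choice)
# Two unit-magnitude particles, phase difference pi, straddling z_minus.
pair = [(z_minus - delta, 0.0), (z_minus + delta, np.pi)]

def psi_pair(z):
    return sum(np.exp(1j * phi) * np.exp(-(z - zi) ** 2 / 2) for zi, phi in pair)

V_node = abs(psi_pair(z_minus)) ** 2        # zero by symmetry: a true node
V_near = abs(psi_pair(z_minus + 0.25)) ** 2 # nonzero just off the node
```

Note how sharply the zero is confined to the symmetric point — the zero-measure fragility from the caveat above.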
5. Reading Out a Policy¶
In real-valued GRL, the policy is gradient ascent on \(Q^+(s, a)\) over the action parameter \(a\). In complex-valued GRL there are three reasonable choices, and which one you pick matters.
Choice 1 — Ascend \(|\Psi|^2\) directly (recommended default)¶

\[ a \leftarrow a + \eta\, \nabla_a |\Psi(s, a)|^2 \]

This is the natural generalization and preserves the Born-rule interpretation. Default unless you have a specific reason otherwise.
Choice 2 — Boltzmann policy on \(|\Psi|^2\)¶
Useful for stochastic policies and for the path-integral connection in Chapter 09. Samples are drawn from the squared-amplitude distribution, which is exactly the Born rule in RL clothing.
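One concrete realization, treating \(\beta\,|\Psi|^2\) as the Boltzmann energy over a discretized action set (the action grid, the stand-in field `V`, and `beta` are all illustrative choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.linspace(-2.0, 2.0, 41)        # discretized 1-D action set
V = np.exp(-(actions - 1.0) ** 2)           # stand-in for |Psi(s, a)|^2
beta = 5.0                                  # inverse temperature (hypothetical)

# Numerically stable softmax over beta * V.
logits = beta * V
probs = np.exp(logits - logits.max())
probs /= probs.sum()

a_sample = rng.choice(actions, p=probs)     # stochastic action draw
```

Higher `beta` concentrates mass on the peak of \(|\Psi|^2\); lower `beta` explores more broadly.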
Choice 3 — Project onto a chosen phase slice¶
For multi-context scenarios (Pattern 2), compute the projected field \(\tilde\Psi_c(z) = e^{-i \phi_c} \Psi(z)\) and ascend \(\mathrm{Re}\{\tilde\Psi_c\}\). This reads out only the value aligned with context \(c\), ignoring particles whose phase is orthogonal to \(\phi_c\).
```python
def project_on_phase(z, particles, phi_target, ell=1.0):
    """
    Read out the value component aligned with phase phi_target.
    For context-conditional policies.
    """
    amp = psi(z, particles, ell)
    return np.real(np.exp(-1j * phi_target) * amp)
```
Which to use when¶
| Setup | Recommended readout |
|---|---|
| Single task, phase is a structural prior | Choice 1 — \(\lvert\Psi\rvert^2\) |
| Multi-task, phase encodes context | Choice 3 — phase-projected |
| Stochastic exploration / path-integral view | Choice 2 — Boltzmann on \(\lvert\Psi\rvert^2\) |
| Temporal credit assignment (Pattern 1) | Choice 1; phase decoheres automatically |
6. Learning the Complex Field¶
The Part I MemoryUpdate algorithm adds a particle and updates weights to fit a target value. The complex extension generalizes to complex particles with complex weights, but the training signal is usually still real-valued (rewards). So you need a loss on \(|\Psi|^2\), not on \(\Psi\) directly.
Loss functions¶
Option A — Born-rule MSE (no phase supervision).

\[ \mathcal{L} = \sum_j \bigl( |\Psi(z_j)|^2 - y_j \bigr)^2 \]

The phase of each particle is a free parameter; the loss only sees magnitude.
This is a non-convex objective (product \(\Psi^* \Psi\) is bilinear in \(c_i\)), so initialization matters. Warm-starting phases with one of Patterns 1–3 is a good strategy.
Option B — Phase-aware loss (when you have a phase signal). If the task provides a phase target (e.g., time-of-experience, context ID), regularize:

\[ \mathcal{L} = \sum_j \bigl( |\Psi(z_j)|^2 - y_j \bigr)^2 + \lambda \sum_i \bigl( \phi_i - \phi_i^{\text{target}} \bigr)^2_{\circ}, \]

where \((\cdot)^2_\circ\) is the circular squared distance (handles wrap-around).
Option C — Complex-amplitude regression. If you can produce complex targets (e.g., in simulation where amplitudes are meaningful), fit directly \(\mathcal{L} = \sum_j |\Psi(z_j) - \psi_j^{\text{target}}|^2\). This is convex in \(\{c_i\}\) and has a closed-form solution — the complex analogue of kernel ridge regression.
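Option C's closed-form solution is ordinary kernel ridge regression with complex targets. A sketch on synthetic data (the locations, targets, and ridge value are arbitrary; `rbf` is re-declared so the snippet stands alone):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 2))                                # particle locations
targets = rng.normal(size=8) + 1j * rng.normal(size=8)     # synthetic complex amplitudes

def rbf(a, b, ell=1.0):
    d = a - b
    return np.exp(-np.dot(d, d) / (2 * ell**2))

# Real, symmetric, PSD Gram matrix over the particle locations.
K = np.array([[rbf(zi, zj) for zj in Z] for zi in Z])

ridge = 1e-10
# Complex weights in closed form: (K + ridge I) c = psi_target.
c = np.linalg.solve(K + ridge * np.eye(len(Z)), targets)

fitted = K @ c   # Psi evaluated back at the particle locations
```

Because the loss is quadratic in the complex weights \(\{c_i\}\), no phase initialization issue arises here, unlike Options A and B.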
Complex MemoryUpdate (sketch)¶
```python
def memory_update_complex(memory, z_new, y_new, phi_new=0.0, ell=1.0, ridge=1e-3):
    """
    Add a particle and refit complex weights.

    Args:
        memory: list of particles with keys z, w, phi.
        z_new: new experience location.
        y_new: observed real-valued return (or reward).
        phi_new: phase assigned to the new particle (user-chosen per Patterns 1-4).
    """
    memory.append({"z": z_new, "w": 1.0, "phi": phi_new})
    Z = [p["z"] for p in memory]
    phis = [p["phi"] for p in memory]
    # Hermitian Gram matrix with per-particle phases folded in:
    # K[i, j] = k(z_i, z_j) * exp(i (phi_i - phi_j)).
    K = np.array([[complex_rbf(zi, zj, ell) * np.exp(1j * (pi - pj))
                   for zj, pj in zip(Z, phis)]
                  for zi, pi in zip(Z, phis)])
    K += ridge * np.eye(len(memory))
    # Solve for magnitudes that reproduce the observed |Psi|^2 targets
    # (y_new and the stored returns) at the memory points. This is one
    # iteration; in practice you'd run a short local optimizer, since
    # the objective is non-convex in the phase block.
    ...
```
(A full implementation belongs in src/grl/core/ alongside the planned
reference RF-SARSA — deferred until the baseline is in place.)
7. A Concrete Walk-Through: Context-Switching Gridworld¶
To make this concrete, consider a two-context gridworld:
- Context \(A\): goal at \((3, 0)\), reward +1.
- Context \(B\): goal at \((-3, 0)\), reward +1.
- Agent observes only position \((x, y)\), not which context is active.
- A single particle memory is shared across both contexts.
Real-valued GRL. The memory ends up with positive particles near both goals. The value landscape has two peaks, and the policy is confused when in context \(A\): it gets pulled toward \((-3, 0)\) from the left half of the map.
Complex-valued GRL with context phase (Pattern 2). Each particle is stored with \(\phi_i = 0\) if observed in context \(A\), \(\phi_i = \pi\) in context \(B\). A single memory holds both.
- In context \(A\), the policy reads the phase-projected field \(\tilde\Psi_A(z) = \Psi(z)\) (projection onto phase 0). The \(\pi\)-phase particles contribute with a sign flip, creating destructive interference near \((-3, 0)\). The peak near \((+3, 0)\) survives.
- In context \(B\), we read \(\tilde\Psi_B(z) = -\Psi(z)\), which re-flips the signs and reveals the \((-3, 0)\) peak while suppressing \((+3, 0)\).
One memory, two disambiguated policies, selected by a phase choice at inference time. This is the concrete win.
What had to change from the real-valued version:
- Each MemoryUpdate call tags the new particle with the current context's
phase.
- Each query provides the context, which selects the readout projection.
- No new learning algorithm, no new capacity — just phase bookkeeping.
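A stripped-down numeric version of the walk-through, assuming one particle per goal and an arbitrary length scale of 1.5 (the 1-particle-per-goal memory is a simplification of the learned memory described above):

```python
import numpy as np

# One shared memory: the goal-A particle carries phase 0, goal-B carries pi.
memory = [{"z": np.array([3.0, 0.0]), "w": 1.0, "phi": 0.0},
          {"z": np.array([-3.0, 0.0]), "w": 1.0, "phi": np.pi}]

def psi_g(z, ell=1.5):
    return sum(p["w"] * np.exp(1j * p["phi"])
               * np.exp(-np.sum((z - p["z"]) ** 2) / (2 * ell**2))
               for p in memory)

def readout(z, phi_context):
    """Phase-projected value Re{exp(-i phi_c) Psi(z)} (Choice 3)."""
    return np.real(np.exp(-1j * phi_context) * psi_g(z))

goal_A = np.array([3.0, 0.0])
goal_B = np.array([-3.0, 0.0])

vA_at_goalA = readout(goal_A, 0.0)      # context A: its own goal reads high
vA_at_goalB = readout(goal_B, 0.0)      # context A: B's goal reads negative (suppressed)
vB_at_goalB = readout(goal_B, np.pi)    # context B: the flip recovers its goal
```

One memory, one extra phase argument at query time, two disambiguated landscapes.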
This is the kind of leverage complex kernels offer: structural priors that compose through phase rather than through separate models.
8. Pitfalls¶
Phase drift under online updates¶
If phases are free parameters and you update them by gradient descent, they can rotate collectively without changing \(|\Psi|^2\) — a gauge symmetry. This manifests as apparent non-convergence. Fix: either pin one particle's phase to 0 (break the gauge), or project gradients onto the phase-difference subspace.
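The gauge symmetry is easy to verify numerically: rotating every phase by the same constant leaves the value untouched (the weights, phases, and kernel values below are arbitrary illustrative numbers):

```python
import numpy as np

phis = np.array([0.3, 1.7, 2.9])
weights = np.array([1.0, 0.5, 0.8])
kvals = np.array([0.9, 0.4, 0.1])   # kernel evaluations at some fixed query

def value_from(phis_):
    """|Psi|^2 at the query for a given phase assignment."""
    return abs(np.sum(weights * np.exp(1j * phis_) * kvals)) ** 2

V0 = value_from(phis)
V_rot = value_from(phis + 1.234)    # global phase rotation: a pure gauge move
```

A phase optimizer can wander along this flat direction forever, which is why pinning one phase to 0 is the standard fix.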
Loss of monotonicity¶
Real \(Q^+\) is linear in particle weights, so adding a positive-weight particle monotonically increases value everywhere. \(|\Psi|^2\) is not monotone in magnitudes — a new particle can reduce value through destructive interference. Don't rely on monotone heuristics (e.g., "more particles is always better") in tests.
Over-interpretation of phase¶
Phase \(\phi_i\) is a mathematical degree of freedom, nothing more. It is temporal, contextual, or directional only because you chose to make it so. There is no canonical meaning, so debugging benefits hugely from recording what each particle's phase was supposed to encode.
Normalization¶
Unlike quantum mechanics, GRL does not require \(\int |\Psi|^2 = 1\). The reinforcement field is an energy/value, not a probability density. You can scale \(\Psi \to \alpha \Psi\) freely — it scales \(V\) by \(|\alpha|^2\) but preserves relative ordering. If you choose to normalize (e.g., for a Boltzmann policy), do it explicitly at the readout step, not in the field definition.
Computational cost¶
Complex arithmetic is roughly 4× real arithmetic on CPU, 2× on GPUs with native complex support. For small particle counts this is irrelevant; for large memories, factor the complex multiplications into real/imaginary batched matmuls.
9. When NOT to Use Complex Kernels¶
Complex kernels are not free lunch. Reach for them only when:
- Your task has a natural phase structure — time, context, direction, or a cycle. If you can't articulate what phase means, don't add it.
- You want interference, not just separation. If Pattern 2 (multi-task) is your use case, you could equally well use \(C\) independent real-valued memories — which is simpler and has the same expressive power as hard-projected complex GRL.
- You have visualization discipline. Debugging \(|\Psi|^2\) landscapes without separately inspecting magnitude and phase is painful.
For most first-attempt GRL problems, start real-valued. Add complexity (literally) only when the real-valued field has a concrete failure mode that phase addresses.
10. Summary and Connections¶
| Question | Real GRL | Complex GRL |
|---|---|---|
| What's stored per particle? | \((z_i, w_i)\) | \((z_i, w_i, \phi_i)\) |
| What's the field? | \(Q^+(z) \in \mathbb{R}\) | \(\Psi(z) \in \mathbb{C}\) |
| What does policy read? | \(Q^+(s, a)\) | \(\lvert\Psi(s, a)\rvert^2\) |
| How do particles combine? | Weighted sum | Complex superposition |
| What interference exists? | Additive only | Constructive / destructive / partial |
| What's the extra knob? | — | Phase \(\phi_i\) |
Cross-references:
- Chapter 03 — theoretical foundations and Hermitian
kernel math.
- Chapter 04 — action/state projections
generalize to complex projections naturally; Pattern 2's phase projection
is the action-field projection with \(e^{i\phi}\) modulation.
- Chapter 07 — non-GP learning
mechanisms; §6 of this chapter sketches how they extend to complex
amplitudes.
- Chapter 09 — path-integral
formulation and genuine quantum interference; the Boltzmann readout
(Choice 2 in §5) is the link.
- dev/GRL_extensions/learned_kernels/
— learned feature maps; Pattern 4 (learned phase) is the complex-valued
analogue.
11. Next Steps¶
- Companion notebook: notebooks/field_series/03_complex_fields.ipynb — visualize the two-particle example from §2 across the three phase cases; then implement the context-switching gridworld from §7.
- Complex kernel module: scaffold src/grl/kernels/complex.py once src/grl/core/ exists, with the RBF + phase kernel from §1 as the reference implementation.
- Spectral clustering on complex Gram matrices: cross-promote this tutorial's Pattern 2 to Part II of the tutorial series (Emergent Structure & Spectral Abstraction) — complex eigendecomposition naturally produces phase-coherent concept subspaces.
Last Updated: 2026-04-22