Chapter 6: The Agent's State and Belief Evolution¶
Motivation¶
Throughout this series, we've discussed "the state" \(Q^+\), projections like \(Q^+(s, a)\), and operations like MemoryUpdate. But what exactly is the agent's state in GRL?
This is not a philosophical question—it's a precise technical question with important implications:
- What object encodes the agent's knowledge?
- What changes when the agent learns?
- What stays fixed during inference?
This chapter provides definitive answers by clarifying:
- The agent's state = particle memory = reinforcement field
- Belief evolution = MemoryUpdate as state transition operator
- The role of weights = implicit GP-derived coefficients, not learned parameters
- Three distinct operations = fixing state, querying state, evolving state
This resolves a common confusion: mixing up the state (what the agent knows) with observations (what the agent queries).
1. What Is "The State" in GRL?¶
The Question¶
In traditional RL, "state" usually means environment state \(s \in \mathcal{S}\).
But in GRL, we have multiple candidates:
- Environment state \(s\)?
- Augmented point \(z = (s, a)\)?
- Field value \(Q^+(s, a)\)?
- Action projection \(Q^+(s, \cdot)\)?
- The entire field \(Q^+\)?
Which one is "the agent's state"?
The Answer: The Entire Field¶
The agent's state is the reinforcement field:

$$
Q^+_t \in \mathcal{H}_k
$$
Why the entire field?
Because this object completely specifies the agent's beliefs about value across all state-action combinations.
Equivalent representation: particle memory

$$
\Omega_t = \{(z_i, w_i)\}_{i=1}^{N_t}
$$

Key equation:

$$
Q^+_t(z) = \sum_{i=1}^{N_t} w_i \, k(z_i, z)
$$
These are two views of the same object:
- \(\Omega\) = discrete representation (particles)
- \(Q^+\) = continuous representation (field)
Why Not Something Smaller?¶
Q: Why isn't \(Q^+(s, \cdot)\) the state?
A: Because \(Q^+(s, \cdot)\) is a projection of the state onto a particular subspace, not the state itself.
Analogy to QM:
| Quantum Mechanics | GRL |
|---|---|
| State: \(\|\psi\rangle \in \mathcal{H}\) | State: \(Q^+ \in \mathcal{H}_k\) |
| Wavefunction: \(\psi(x) = \langle x \| \psi \rangle\) | Field value: \(Q^+(s, a) = \langle Q^+, k((s,a), \cdot) \rangle\) |
| Position representation | Augmented space representation |
The wavefunction \(\psi(x)\) is not the state—it's a coordinate representation of the state.
Same in GRL: \(Q^+(s, a)\) is not the state—it's a coordinate representation (amplitude) of the state.
Particle Memory IS the State¶
Critical insight: storing the particle memory \(\Omega = \{(z_i, w_i)\}\) is storing the state. The field \(Q^+\) is reconstructed exactly from the particles, so nothing else needs to be kept.
What the particles encode:
| Component | Meaning | Type |
|---|---|---|
| \(z_i = (s_i, a_i)\) | Where experience occurred | Position in \(\mathcal{Z} = \mathcal{S} \times \Theta\) |
| \(w_i\) | Evidence strength at \(z_i\) | Real number (positive or negative) |
| \(k(z_i, \cdot)\) | Kernel section | Basis function in \(\mathcal{H}_k\) |
From particles to field:

$$
Q^+(z) = \sum_{i=1}^{N} w_i \, k(z_i, z)
$$

This representation is complete: you can compute \(Q^+(z)\) for any \(z \in \mathcal{Z}\) directly from the particles.
What About \(Q^+(z_i)\) at the Particle Locations?¶
Question: "Should we store \(Q^+(z_i)\) as part of the particle?"
Answer: No, it's redundant!
\(Q^+(z_i)\) is computable from the particles:

$$
Q^+(z_i) = \sum_{j=1}^{N} w_j \, k(z_j, z_i).
$$

So the particle representation is

$$
\Omega = \{(z_i, w_i)\}_{i=1}^{N},
$$

not

$$
\Omega = \{(z_i,\, w_i,\, Q^+(z_i))\}_{i=1}^{N}.
$$
What \(w_i\) represents:
- Original paper: Fitness contribution
- Modern framing: Energy contribution (negative fitness: \(E(z_i) = -w_i k(z_i, z_i)\))
- Mathematically: RKHS expansion coefficient
2. Three Distinct Operations¶
Now that we know the state is \(Q^+\) (equivalently: \(\Omega\)), let's clarify three operations that are often confused:
Operation A: Fixing the Belief State¶
At time \(t\), the particle memory is

$$
\Omega_t = \{(z_i, w_i^{(t)})\}_{i=1}^{N_t}.
$$

This fixes the belief state:

$$
Q^+_t(z) = \sum_{i=1}^{N_t} w_i^{(t)} \, k(z_i, z).
$$
Meaning: "Conditional on the current memory, the agent's knowledge is \(Q^+_t\)."
This is NOT learning—it's just stating what the current belief is.
Operation B: Querying the State (Inference)¶
Given the fixed \(Q^+_t\), compute a field value

$$
Q^+_t(s, a) = \langle Q^+_t, k((s,a), \cdot) \rangle_{\mathcal{H}_k},
$$

the action projection \(Q^+_t(s, \cdot)\) (the action wavefunction of Chapter 4), or a concept activation \(A_{k,t}\) (Chapter 5).
Key point: These operations do not change \(Q^+_t\)!
They are:
- Queries
- Projections
- Evaluations
- Inferences
Analogy: Computing \(\psi(x) = \langle x | \psi \rangle\) doesn't change \(|\psi\rangle\).
This is pure inference, no learning.
Operation C: Evolving the State (Learning via MemoryUpdate)¶
MemoryUpdate transforms the belief state:

$$
\mathcal{U}: Q^+_t \mapsto Q^+_{t+1},
$$

or equivalently, in particle coordinates:

$$
\Omega_{t+1} = \mathrm{MemoryUpdate}(\Omega_t, \text{experience}_t).
$$
What can change:
- Add particles: \(\Omega_{t+1} = \Omega_t \cup \{(z_{new}, w_{new})\}\)
- Update weights: \(w_i^{(t+1)} = w_i^{(t)} + \Delta w_i\)
- Merge particles: Combine nearby particles into one
- Prune particles: Remove low-influence particles
Result: New belief state \(Q^+_{t+1} \neq Q^+_t\)
This IS learning!
Summary Table¶
| Operation | Changes \(Q^+\)? | Purpose |
|---|---|---|
| A. Fix state | No (just specify current state) | Define what agent knows |
| B. Query state | No (projection/evaluation) | Action selection, concept activation |
| C. Evolve state | Yes (belief update) | Learning from experience |
Critical distinction:
Between MemoryUpdate events, \(Q^+\) is fixed. During MemoryUpdate, \(Q^+\) evolves.
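A toy sketch of this separation (the function names are illustrative, and `evolve` performs only particle instantiation, omitting association and consolidation):

```python
# Illustrative only: the three operations on a particle-memory belief state.
particles = [((0.0, 1.0), 0.8), ((2.0, 0.5), -0.3)]    # Operation A: fix the state Omega_t

def query(particles, z, kernel):                        # Operation B: inference, no mutation
    return sum(w_i * kernel(z_i, z) for z_i, w_i in particles)

def evolve(particles, z_new, w_new):                    # Operation C: MemoryUpdate -> Omega_{t+1}
    return particles + [(z_new, w_new)]                 # instantiation only, for brevity
```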
3. Two Time Scales¶
This gives GRL a natural separation of time scales:
Slow Time Scale: Belief Evolution¶
MemoryUpdate events: \(t = 0, 1, 2, \ldots\)
State transitions:

$$
Q^+_{t+1} = \mathcal{U}(Q^+_t, \text{experience}_t)
$$
This is learning.
Frequency: Every episode, or every \(K\) steps, or based on novelty
Fast Time Scale: Inference¶
Between \(t\) and \(t+1\), \(Q^+_t\) is fixed.
Agent performs many queries:
- Evaluate \(Q^+_t(s_1, a)\) for action selection at \(s_1\)
- Evaluate \(Q^+_t(s_2, a)\) for action selection at \(s_2\)
- Compute concept activation \(A_{k,t}\)
- Sample from policy \(\pi_t(a|s) \propto \exp(\beta Q^+_t(s, a))\)
This is inference.
Frequency: Every step, or multiple times per step
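As a concrete illustration, fast-time-scale action selection can repeatedly sample from the softmax policy above without ever touching the particle memory. In this sketch, `belief.query` and the discrete `actions` list are assumptions, not part of the source:

```python
import numpy as np

def sample_action(belief, s, actions, beta=1.0, rng=None):
    """Sample a ~ pi_t(a|s) proportional to exp(beta * Q^+_t(s, a)) from a *fixed* belief state."""
    rng = rng or np.random.default_rng()
    q = np.array([belief.query((s, a)) for a in actions])  # pure queries, no learning
    logits = beta * q - np.max(beta * q)                   # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return actions[rng.choice(len(actions), p=probs)]
```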
Why This Matters¶
Separation of concerns:
- Learning: Happens via MemoryUpdate (slow)
- Acting: Happens via inference (fast)
- No gradient descent mixing learning and inference
Computational efficiency:
- Don't recompute entire field for every action
- Cache kernel evaluations between updates
- Amortize expensive operations (merging, pruning)
Theoretical clarity:
- Clean POMDP interpretation (belief state = \(Q^+\))
- Well-defined state transition operator (\(\mathcal{U}\))
- No ambiguity about "what changed"
4. The Role of Weights: Implicit, Not Learned¶
Common Misconception¶
Misconception: "The weights \(w_i\) are learned parameters, like neural network weights."
Reality: The weights are implicit coefficients determined by the GP posterior, not explicit optimization variables.
How Weights Arise¶
In Gaussian Process regression:
Given data \(\mathcal{D} = \{(z_i, y_i)\}_{i=1}^N\) and kernel \(k\), the posterior mean is

$$
\bar{f}(z) = \mathbf{k}(z)^T (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y},
$$
where:
- \(\mathbf{k}(z) = [k(z_1, z), \ldots, k(z_N, z)]^T\)
- \(\mathbf{K}_{ij} = k(z_i, z_j)\)
- \(\mathbf{y} = [y_1, \ldots, y_N]^T\)
This can be written as

$$
\bar{f}(z) = \sum_{i=1}^{N} w_i \, k(z_i, z),
$$
where \(\mathbf{w} = (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}\).
The weights \(w_i\) are not learned—they're computed from the data and kernel!
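A minimal numpy sketch of this point, with a toy RBF kernel and made-up data (nothing here is from the source): the weights fall out of a linear solve, not a training loop.

```python
import numpy as np

def rbf_kernel(a, b, length=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * length**2))

Z = np.array([[0.0], [1.0], [2.0]])    # particle locations z_i
y = np.array([1.0, -0.5, 0.3])         # observed targets y_i
sigma2 = 0.1                           # observation-noise variance

K = np.array([[rbf_kernel(zi, zj) for zj in Z] for zi in Z])
w = np.linalg.solve(K + sigma2 * np.eye(len(Z)), y)   # w = (K + sigma^2 I)^{-1} y

# Evaluating the posterior mean at a new point uses only the particles and weights:
z_query = np.array([0.5])
f_bar = sum(w_i * rbf_kernel(z_i, z_query) for z_i, w_i in zip(Z, w))
```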
In GRL¶
Similar structure:

$$
Q^+(z) = \sum_{i} w_i \, k(z_i, z)
$$
The weights arise from:
- Experience accumulation: Each \((z_i, r_i)\) contributes
- Kernel propagation: Overlap spreads influence
- TD updates: Temporal difference signals adjust weights
They are NOT:
- Gradient descent parameters
- Explicitly optimized
- Independent of the kernel structure
They ARE:
- State variables (part of the belief state)
- Functionally determined by experience and kernel
- Evidence coefficients (strength of belief at each particle)
Representer Theorem Connection¶
The representer theorem says:
In an RKHS, any function minimizing a regularized empirical loss can be written as a finite sum over the data points:

$$
f^*(\cdot) = \sum_{i=1}^{N} \alpha_i \, k(z_i, \cdot)
$$
In GRL:
- Data points = experience particles \(z_i\)
- Coefficients = weights \(w_i\)
- Function = reinforcement field \(Q^+\)
So the particle representation is not arbitrary—it's the optimal form given the RKHS structure!
5. MemoryUpdate as Belief Transition Operator¶
Formal Definition¶
MemoryUpdate is an operator on belief states:

$$
\mathcal{U}: (Q^+_t, e_t) \mapsto Q^+_{t+1},
$$

where \(e_t\) is the experience gathered since the last update. In particle coordinates:

$$
\Omega_{t+1} = \mathcal{U}(\Omega_t, e_t)
$$
What MemoryUpdate Does (Algorithm 1)¶
From Tutorial Chapter 6, MemoryUpdate performs:
Step 1: Particle instantiation
Given experience \((s_t, a_t, r_t)\), create a new particle

$$
(z_{new}, w_{new}), \qquad z_{new} = (s_t, a_t), \quad w_{new} = f(r_t),
$$
where \(f(\cdot)\) maps reinforcement to weight (e.g., \(f(r) = r\), or \(f(r) = -r\) for energy).
Step 2: Kernel association
Compute the kernel similarity to each existing particle:

$$
a_i = k(z_{new}, z_i), \qquad i = 1, \ldots, N_t.
$$
Step 3: Weight propagation (optional)
For particles with high association (\(a_i > \varepsilon\)):

$$
w_i \leftarrow w_i + \lambda \, a_i \, w_{new}.
$$
This is "experience association"—evidence spreads through kernel geometry!
Step 4: Memory integration

$$
\Omega_{t+1} = \Omega_t \cup \{(z_{new}, w_{new})\}
$$
Step 5: Structural consolidation
- Merge: Combine particles with \(k(z_i, z_j) > \tau_{merge}\)
- Prune: Remove particles with \(|w_i| < \tau_{prune}\)
- Decay: \(w_i^{(t+1)} = \gamma w_i^{(t)}\) for all \(i\)
Result:

$$
Q^+_{t+1}(z) = \sum_{(z_i, w_i) \in \Omega_{t+1}} w_i \, k(z_i, z)
$$
This is a discrete, explicit state transition!
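A minimal functional sketch of these five steps, treating MemoryUpdate as an explicit transition \(\Omega_t \to \Omega_{t+1}\). The kernel choice, parameter values, and the decision to skip merging are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(z1, z2, length=1.0):
    return float(np.exp(-np.sum((np.asarray(z1) - np.asarray(z2)) ** 2) / (2 * length**2)))

def memory_update(particles, experience, lam=0.5, eps=0.1, decay=1.0, tau_prune=1e-3):
    """One explicit belief transition Omega_t -> Omega_{t+1} (merging omitted for brevity)."""
    s, a, r = experience
    z_new, w_new = (s, a), r                          # Step 1: particle instantiation (f(r) = r)
    updated = []
    for z_i, w_i in particles:
        a_i = rbf_kernel(z_new, z_i)                  # Step 2: kernel association
        if a_i > eps:
            w_i = w_i + lam * a_i * w_new             # Step 3: weight propagation
        updated.append((z_i, decay * w_i))            # Step 5: decay (decay=1.0 disables it)
    updated.append((z_new, w_new))                    # Step 4: memory integration
    return [(z, w) for z, w in updated if abs(w) > tau_prune]   # Step 5: prune
```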
Connection to Gaussian Process Updates¶
MemoryUpdate can be viewed as:
GP posterior update expressed in particle (inducing point) coordinates
Standard GP update:
- Observe new data: \((z_{new}, y_{new})\)
- Update posterior: \(p(f | \mathcal{D}_{t+1}) \propto p(y_{new} | f, z_{new}) \cdot p(f | \mathcal{D}_t)\)
GRL equivalent:
- Observe new experience: \((z_{new}, r_{new})\)
- Update particle memory: \(\Omega_{t+1}\) (via MemoryUpdate)
- Resulting field: \(Q^+_{t+1}\)
Key difference: GRL also includes:
- Weight propagation (kernel association)
- Structural consolidation (merge/prune)
These are not standard GP operations, but natural extensions for lifelong learning!
6. Experience Association: What It Really Is¶
The Original Paper's Description¶
Section IV-A describes "experience association"—new experience affects nearby particles through kernel overlap.
Let's make this precise.
Experience Association as Operator¶
Experience association is the weight propagation step in MemoryUpdate:

$$
w_i^{(t+1)} = w_i^{(t)} + \lambda \, a_i \, w_{new} \cdot \mathbb{1}[a_i > \varepsilon],
$$

where

$$
a_i = k(z_{new}, z_i).
$$
In words:
- New evidence at \(z_{new}\) with strength \(w_{new}\)
- Propagates to associated particles \(z_i\) (where \(a_i = k(z_{new}, z_i) > \varepsilon\))
- Strength of propagation: \(\lambda \cdot a_i \cdot w_{new}\)
Why This Differs from Standard GP¶
Standard GP: each data point contributes through a fixed linear expansion,

$$
\bar{f}(z) = \sum_i \alpha_i \, k(z_i, z),
$$

where the coefficients \(\alpha_i\) are determined jointly by the observations and the regularization (by re-solving a linear system), not by each other's subsequent updates.

GRL with experience association: data points influence each other's weights directly (a vectorized sketch follows the list below),

$$
w_i \;\leftarrow\; w_i + \lambda \, k(z_{new}, z_i) \, w_{new}.
$$
This is a form of:
- Soft credit assignment (not just local TD error)
- Geometric belief propagation (through kernel metric)
- Non-local update (affects multiple particles simultaneously)
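A vectorized sketch of this non-local update in numpy (the Gaussian kernel and the parameter values are assumptions): one new piece of evidence shifts every associated weight at once.

```python
import numpy as np

def propagate(Z, w, z_new, w_new, lam=0.5, eps=0.1, length=1.0):
    """Adjust all associated particle weights for new evidence (z_new, w_new).

    Z: (N, d) array of particle locations; w: (N,) array of weights."""
    a = np.exp(-np.sum((Z - z_new) ** 2, axis=1) / (2 * length**2))  # a_i = k(z_new, z_i)
    mask = a > eps                                                   # associated particles only
    return w + lam * a * w_new * mask
```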
Connection to Kernel-Based Message Passing¶
This is similar to:
- Kernel mean embedding updates
- Belief propagation in continuous spaces
- Kernel density estimation with adaptive weights
But GRL's version is unique because:
- Weights can be positive or negative (not just probabilities)
- Propagation is kernel-weighted (not uniform or discrete)
- Updates are compositional (new evidence builds on old)
7. Reconciling with Quantum Mechanics¶
The QM Analogy, Precisely Stated¶
In quantum mechanics:
State: \(|\psi\rangle \in \mathcal{H}\) (Hilbert space vector)
Evolution: Unitary operators (between measurements)
Measurement: Projects onto observable eigenspace
State "fixed": Between measurements
In GRL:
State: \(Q^+ \in \mathcal{H}_k\) (RKHS vector) ≡ particle memory \(\Omega\)
Evolution: MemoryUpdate operator (between inference queries)
Measurement: Projects onto query subspaces
State "fixed": Between MemoryUpdate events
The Parallel Is Structural, Not Metaphorical¶
| Aspect | Quantum Mechanics | GRL |
|---|---|---|
| State space | Hilbert space \(\mathcal{H}\) | RKHS \(\mathcal{H}_k\) |
| State vector | \(\|\psi\rangle\) | \(Q^+\) (or \(\Omega\)) |
| Basis | \(\{\|x\rangle\}\) | \(\{k(z, \cdot)\}\) |
| Coordinate rep | \(\psi(x) = \langle x \| \psi \rangle\) | \(Q^+(z) = \langle Q^+, k(z, \cdot) \rangle\) |
| Evolution | Hamiltonian \(\hat{H}\) | MemoryUpdate \(\mathcal{U}\) |
| Measurement | Observable \(\hat{O}\) | Projection \(P_k\) or query |
| Time scales | Between measurements: fixed | Between updates: fixed |
This is not poetry—it's the same mathematical structure!
8. Practical Implications¶
For Implementation¶
Representation choice:
Store particles, not the full field:
```python
class BeliefState:
    """Particle memory Omega = {(z_i, w_i)}; the field Q^+ is derived on demand."""

    def __init__(self, kernel, epsilon=0.1, lambda_prop=0.5):
        self.particles = []             # list of (z_i, w_i) pairs
        self.kernel = kernel            # similarity function k(z, z')
        self.epsilon = epsilon          # association threshold (illustrative default)
        self.lambda_prop = lambda_prop  # propagation strength (illustrative default)

    def query(self, z_query):
        """Compute Q^+(z_query) from the particles (inference; state unchanged)."""
        return sum(w_i * self.kernel(z_i, z_query) for z_i, w_i in self.particles)

    def update(self, experience):
        """MemoryUpdate: evolve the belief state (learning)."""
        z_new, r_new = experience
        w_new = r_new  # or a more complex reinforcement-to-weight mapping

        # Particle instantiation
        self.particles.append((z_new, w_new))

        # Experience association (weight propagation to pre-existing particles)
        for i, (z_i, w_i) in enumerate(self.particles[:-1]):
            a_i = self.kernel(z_new, z_i)
            if a_i > self.epsilon:
                self.particles[i] = (z_i, w_i + self.lambda_prop * a_i * w_new)

        # Structural consolidation (one possible implementation is sketched below)
        self.merge_particles()
        self.prune_particles()
```
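The consolidation helpers called above are not spelled out in the text. Here is one possible sketch, with illustrative thresholds, attached to the class defined above:

```python
def merge_particles(self, tau_merge=0.95):
    """Combine particles whose kernel similarity exceeds tau_merge (pool their evidence)."""
    merged = []
    for z, w in self.particles:
        for j, (z_j, w_j) in enumerate(merged):
            if self.kernel(z, z_j) > tau_merge:
                merged[j] = (z_j, w_j + w)
                break
        else:
            merged.append((z, w))
    self.particles = merged

def prune_particles(self, tau_prune=1e-3):
    """Remove particles whose weight magnitude falls below tau_prune."""
    self.particles = [(z, w) for z, w in self.particles if abs(w) > tau_prune]

BeliefState.merge_particles = merge_particles
BeliefState.prune_particles = prune_particles
```

With these in place, a single call to `update` performs a full MemoryUpdate pass: instantiation, association, merge, and prune.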
For Efficiency¶
Between MemoryUpdate:
- Cache kernel evaluations
- Precompute Gram matrix if needed
- Use sparse representations for large particle sets
During MemoryUpdate:
- Only update associated particles (threshold \(\varepsilon\))
- Merge periodically, not every step
- Use KD-trees for fast nearest-neighbor finding
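For example, a KD-tree over particle locations can restrict weight propagation to particles near \(z_{new}\). This sketch assumes particle locations are fixed-length real vectors and uses scipy's `cKDTree`; all numeric values are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
positions = rng.random((10_000, 4))            # particle locations z_i
weights = np.zeros(len(positions))             # particle weights w_i

tree = cKDTree(positions)                      # rebuild only at MemoryUpdate events
z_new, w_new = rng.random(4), 1.0

# Only particles within `r` of z_new can be meaningfully associated.
for i in tree.query_ball_point(z_new, r=0.2):
    a_i = np.exp(-np.sum((positions[i] - z_new) ** 2) / (2 * 0.1**2))
    weights[i] += 0.5 * a_i * w_new            # lambda = 0.5 (illustrative)
```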
For Interpretation¶
Visualize belief evolution:
```python
import matplotlib.pyplot as plt

# Track the field value at a fixed test point over MemoryUpdate events.
# `agent`, `env`, `z_test`, and `T` are assumed to exist in the surrounding experiment code.
history = []
for t in range(T):
    history.append(agent.belief_state.query(z_test))  # inference against the fixed Q^+_t

    # Agent acts, observes, learns (one MemoryUpdate event)
    experience = agent.interact(env)
    agent.belief_state.update(experience)

# Plot belief evolution
plt.plot(history)
plt.xlabel('Time (MemoryUpdate events)')
plt.ylabel('Q^+(z_test)')
plt.title('Belief Evolution at Test Point')
plt.show()
```
Summary¶
Key Concepts¶
1. The Agent's State
    - State = reinforcement field \(Q^+ \in \mathcal{H}_k\)
    - Equivalently: particle memory \(\Omega = \{(z_i, w_i)\}\)
    - Complete representation: the particles determine the field
2. Three Operations
    - Fix state: specify the current belief (Operation A)
    - Query state: compute projections/evaluations (Operation B)
    - Evolve state: MemoryUpdate (Operation C)
3. Two Time Scales
    - Slow: learning via MemoryUpdate (\(Q^+_t \to Q^+_{t+1}\))
    - Fast: inference via queries (\(Q^+_t(s, a)\), with \(Q^+_t\) fixed)
4. Weights Are Implicit
    - Not learned parameters
    - GP-derived coefficients
    - State variables, not optimization variables
5. MemoryUpdate as Operator
    - Belief state transition: \(\mathcal{U}: Q^+_t \mapsto Q^+_{t+1}\)
    - Includes: instantiation, association, consolidation
    - Experience association = weight propagation
6. QM Parallel
    - Same structure: a state vector in a Hilbert space
    - Evolution via operators
    - Fixed between update/measurement events
Key Equations¶
State specification:

$$
Q^+_t(z) = \sum_{i=1}^{N_t} w_i^{(t)} \, k(z_i, z)
$$

Query (inference):

$$
Q^+_t(s, a) = \langle Q^+_t, k((s,a), \cdot) \rangle_{\mathcal{H}_k}
$$

Evolution (learning):

$$
Q^+_{t+1} = \mathcal{U}(Q^+_t, \text{experience}_t)
$$

Experience association:

$$
w_i \leftarrow w_i + \lambda \, k(z_{new}, z_i) \, w_{new} \quad \text{for } k(z_{new}, z_i) > \varepsilon
$$
What This Clarifies¶
For theory:
- Rigorous definition of "the state"
- Clean separation of learning and inference
- Well-defined belief evolution operator
- Precise QM parallel
For implementation:
- What to store (particles)
- What to compute (queries)
- When to update (MemoryUpdate events)
- How to optimize (caching, sparse ops)
For Part II (Section V):
- Concept activation operates on fixed \(Q^+\)
- Concept evolution tracks \(A_k(t)\) over MemoryUpdate events
- Clean distinction between concept inference and concept learning
Further Reading¶
Within This Series¶
- Chapter 2: RKHS Basis and Amplitudes
- Chapter 4: Action and State Projections
- Chapter 5: Concept Subspaces
GRL Tutorials¶
- Tutorial Chapter 5: Particle Memory
- Tutorial Chapter 6: MemoryUpdate Algorithm
Related Literature¶
Gaussian Processes:
- Rasmussen & Williams (2006). Gaussian Processes for Machine Learning. MIT Press.
- Snelson & Ghahramani (2006). "Sparse Gaussian Processes using Pseudo-inputs." NIPS.
Belief-State RL:
- Kaelbling et al. (1998). "Planning and Acting in Partially Observable Stochastic Domains."
- Ross et al. (2008). "Online Planning Algorithms for POMDPs."
Kernel Methods:
- Schölkopf & Smola (2002). Learning with Kernels. MIT Press.
- Berlinet & Thomas-Agnan (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics.
Last Updated: January 14, 2026