
Chapter 07c: Experience Replay and Particle Memory

Purpose: Understand experience replay in modern RL, and how GRL's particle memory is a structurally richer form of the same idea
Prerequisites: Chapter 05 (Particle Memory), Chapter 07 (RF-SARSA)
Key Concepts: Experience replay, replay buffers, prioritized replay, particle memory as implicit replay, functional vs. data-level reuse


Introduction

In 2013, DeepMind's DQN paper introduced experience replay as a key innovation that made deep Q-learning stable enough to play Atari games at superhuman level (the second stabilizer, target networks, was added in the 2015 Nature version). The idea was simple: instead of learning from each transition once and discarding it, store transitions in a buffer and resample them for training.

Experience replay is now ubiquitous in off-policy deep RL. But the core idea — reusing past experience to improve learning efficiency — is older than DQN and, as we'll see, was already present in GRL's particle memory (2010), albeit in a fundamentally different form.

This chapter covers:

  1. What experience replay is and why it matters
  2. How replay buffers work in practice (DQN, prioritized replay, etc.)
  3. The deep structural comparison with GRL's particle memory
  4. Why particle memory is "experience replay done at the function level"
  5. What each approach can learn from the other

1. The Problem That Experience Replay Solves

1.1 Online learning is wasteful

In standard online TD learning, the agent:

  1. Observes transition \((s_t, a_t, r_t, s_{t+1})\)
  2. Performs one update using this transition
  3. Discards the transition forever

This is enormously wasteful. Each interaction with the environment produces information that is used exactly once. In domains where environment interactions are expensive (robotics, simulation, real-world deployment), this is unacceptable.

1.2 Online learning is correlated

Consecutive transitions are highly correlated — \(s_{t+1}\) is similar to \(s_t\), the agent visits the same region of state space for many steps before moving on. Training a function approximator on correlated data violates the i.i.d. assumption that underlies stochastic gradient descent, leading to:

  • Oscillation: The approximator overfits to the current region, then overcorrects when the agent moves elsewhere
  • Catastrophic forgetting: Learning about new states erases knowledge about old ones
  • Poor convergence: Correlated updates bias the gradient estimates

1.3 The solution: store and resample

Experience replay addresses both problems, and brings a third benefit:

  • Efficiency: Each transition can be reused many times
  • Decorrelation: Random sampling from a buffer breaks temporal correlations
  • Stability: The training distribution is more uniform over the state space

2. Experience Replay in Practice

2.1 The basic mechanism (DQN)

The DQN algorithm (Mnih et al., 2013, 2015) popularized the standard replay buffer:

Data structure: A fixed-capacity buffer \(\mathcal{D}\) storing transitions:

\[\mathcal{D} = \{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{|\mathcal{D}|}\]

Collection: At each step, store the new transition in \(\mathcal{D}\) (overwriting the oldest if full).

Training: Sample a random mini-batch \(\mathcal{B} \subset \mathcal{D}\) and perform a gradient update:

\[w \leftarrow w - \alpha \frac{1}{|\mathcal{B}|} \sum_{(s,a,r,s') \in \mathcal{B}} \nabla_w \left[ Q_w(s, a) - (r + \gamma \max_{a'} Q_{w^-}(s', a')) \right]^2\]

where \(w^-\) are the target network parameters (frozen periodically).

Key properties:

  • Uniform sampling: Each transition is equally likely to be sampled
  • FIFO eviction: When the buffer is full, the oldest transitions are discarded
  • Off-policy: The transitions in the buffer were generated by old policies — the current policy may be very different
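The mechanism above can be sketched in a few lines of Python. This is an illustrative sketch, not DQN's actual implementation; the class name and API are our own:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity transition buffer with FIFO eviction and uniform sampling."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest entry when full (FIFO)
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlations
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Note that `random.sample` draws without replacement, so `batch_size` must not exceed the current buffer size.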

2.2 Prioritized experience replay

Schaul et al. (2016) observed that not all transitions are equally useful. Transitions with large TD errors are more informative — they indicate regions where the value function is most wrong.

Priority: Assign each transition a priority proportional to its TD error:

\[p_i = |\delta_i| + \epsilon\]

where \(\delta_i = r_i + \gamma \max_{a'} Q(s_{i+1}, a') - Q(s_i, a_i)\) and \(\epsilon > 0\) prevents zero priority.

Sampling: Draw transitions with probability:

\[P(i) = \frac{p_i^\alpha}{\sum_j p_j^\alpha}\]

where \(\alpha \in [0, 1]\) controls how much prioritization matters (\(\alpha = 0\) is uniform, \(\alpha = 1\) is fully prioritized).

Importance sampling correction: Since prioritized sampling is biased (oversamples high-error transitions), correct with importance sampling weights:

\[w_i = \left( \frac{1}{|\mathcal{D}|} \cdot \frac{1}{P(i)} \right)^\beta\]

where \(\beta\) is annealed from a small value to 1 over training.
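The priority, sampling, and correction equations above can be combined into a short NumPy sketch. The function name is ours, and the defaults \(\alpha = 0.6\), \(\beta = 0.4\) follow common practice rather than anything mandated by the text:

```python
import numpy as np


def prioritized_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample transition indices with P(i) proportional to (|delta_i| + eps)^alpha
    and return the importance-sampling weights w_i = (|D| * P(i))^(-beta)."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    # Sampling with replacement, as in the proportional variant of Schaul et al.
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()  # normalize so the largest weight is 1, for stability
    return idx, weights
```

In a full implementation the priorities would live in a sum-tree for \(O(\log N)\) updates; here a flat array keeps the math visible.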

2.3 Other replay variants

The field has produced many refinements:

| Variant | Key Idea | Reference |
| --- | --- | --- |
| Hindsight Experience Replay (HER) | Relabel failed episodes with achieved goals | Andrychowicz et al., 2017 |
| Distributed Replay | Multiple actors fill a shared buffer | Horgan et al., 2018 (Ape-X) |
| N-step Replay | Store n-step returns instead of 1-step | Mnih et al., 2016 (A3C) |
| Curiosity-Prioritized | Prioritize by novelty, not just TD error | Pathak et al., 2017 |
| Reservoir Sampling | Uniform sampling without FIFO bias | Isele & Cosgun, 2018 |

All of these operate at the data level: they store, organize, and resample raw transitions.


3. What Experience Replay Actually Does (Abstractly)

Before comparing with GRL, let's identify what experience replay achieves at a conceptual level, independent of implementation:

3.1 Three functions of replay

Function 1: Temporal persistence of information

Without replay, information from a transition is used once and lost. Replay ensures that the information content of past experience persists and continues to influence learning.

Function 2: Distributional smoothing

Online learning sees a non-stationary, correlated stream of data. Replay transforms this into a more uniform, decorrelated distribution — closer to the stationary distribution needed for convergence.

Function 3: Multi-use of scarce data

Each environment interaction is expensive. Replay extracts more learning signal per interaction by revisiting transitions multiple times.

3.2 The key abstraction

At its core, experience replay is about making past experience available for current learning. The specific mechanism (buffer, sampling, prioritization) is an implementation detail. The essential property is:

Information from past interactions persists and influences future updates.

This is exactly what GRL's particle memory does — but through a completely different mechanism.


4. GRL's Particle Memory as Implicit Experience Replay

4.1 The structural parallel

In GRL, when the agent experiences a transition and RF-SARSA produces a TD update, the result is a weighted particle \((z_i, w_i)\) added to memory \(\Omega\). This particle then influences every future field query \(Q^+(z')\) through the kernel:

\[Q^+(z') = \sum_{i=1}^N w_i \, k(z', z_i)\]

Every past experience — encoded as a particle — contributes to every future prediction. The particle doesn't need to be "replayed" or "resampled" because it is always active as a basis function in the field representation.

This is experience replay, but at the function level rather than the data level.
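The superposition can be made concrete with a short sketch. The helper names are ours, and an RBF kernel is assumed for illustration:

```python
import numpy as np


def rbf_kernel(z, z_i, length_scale=1.0):
    """Squared-exponential kernel: 1 when z == z_i, decaying with distance."""
    return np.exp(-np.sum((z - z_i) ** 2) / (2 * length_scale ** 2))


def field_query(z, particles):
    """Q+(z) as a kernel-weighted superposition over all stored particles.

    `particles` is a list of (z_i, w_i) pairs. Every particle contributes to
    every query, weighted by its kernel similarity to the query point z.
    """
    return sum(w_i * rbf_kernel(z, z_i) for z_i, w_i in particles)
```

Nothing is sampled or replayed here: the stored particles *are* the value function, and a query simply evaluates their superposition.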

4.2 Side-by-side comparison

| Aspect | Experience Replay (DQN) | Particle Memory (GRL) |
| --- | --- | --- |
| What is stored | Raw transitions \((s, a, r, s')\) | Weighted points \((z, w)\) in augmented space |
| How past experience is reused | Resampled and re-trained on | Always active as basis functions |
| When reuse happens | At training time (mini-batch sampling) | At inference time (every field query) |
| Mechanism | Stochastic gradient descent on replayed data | Kernel-weighted superposition |
| Frequency of reuse | Each transition sampled \(\sim k\) times | Each particle contributes to every query |
| Information loss | Transitions eventually evicted (FIFO) | Particles merged/pruned but information preserved in neighbors |
| Correlation breaking | Random sampling from buffer | Kernel geometry (smooth, nonlocal) |
| Prioritization | Explicit (TD error priority) | Implicit (kernel similarity to query point) |
| Off-policy correction | Importance sampling weights | Not needed (on-policy with SARSA) |

4.3 The deep difference: data-level vs. function-level reuse

This distinction is the crux of the comparison:

Experience replay (data-level):

  1. Store raw data: \((s, a, r, s')\)
  2. Resample data
  3. Recompute gradients from resampled data
  4. Update function approximator

The function approximator (neural network) is separate from the stored data. Replay feeds data into the approximator.

Particle memory (function-level):

  1. Convert experience into a basis function: \((z, w) \to w \cdot k(\cdot, z)\)
  2. The basis function is part of the value function
  3. Every future query automatically incorporates this experience
  4. No resampling needed

The stored experience is the function approximator. There is no separation between "data" and "model."

In DQN, past experience is replayed to the value function. In GRL, past experience is the value function.


5. Three Functions of Replay, Revisited Through GRL

Let's check whether GRL's particle memory achieves the three abstract functions of experience replay identified in Section 3.

5.1 Temporal persistence of information

Replay buffer: Transitions persist in the buffer until evicted (FIFO or priority-based). Information is lost when transitions are overwritten.

Particle memory: Particles persist indefinitely (subject to merging/pruning). Even when particles are merged, their information is preserved in the merged particle's weight and position. Information loss is gradual and controlled, not abrupt.

Verdict: Particle memory provides stronger temporal persistence. Information is never abruptly discarded — it is smoothly absorbed into the field.

5.2 Distributional smoothing

Replay buffer: Random sampling from the buffer approximates a uniform distribution over past experience. This breaks temporal correlations.

Particle memory: The kernel provides automatic smoothing. When the agent queries \(Q^+(z)\), the prediction is a kernel-weighted average over all particles — not a random sample. This is a deterministic, smooth interpolation that naturally decorrelates the influence of any single experience.

Verdict: Particle memory provides deterministic distributional smoothing (via kernel), compared to replay's stochastic smoothing (via random sampling). The kernel approach is arguably more principled — it weights past experience by relevance (kernel similarity) rather than by chance (random sampling).

5.3 Multi-use of scarce data

Replay buffer: Each transition is sampled \(\sim k\) times on average before eviction. The replay ratio (updates per environment step) controls how much each transition is reused.

Particle memory: Each particle contributes to every field query for its entire lifetime. A particle added at step 1 is still influencing predictions at step 10,000 (with influence decaying smoothly via the kernel as the agent moves to distant regions of augmented space).

Verdict: Particle memory achieves maximal data reuse — every past experience influences every future prediction, weighted by relevance. This is the theoretical ideal that replay buffers approximate through sampling.


6. Prioritized Replay and Kernel Similarity

6.1 An unexpected parallel

Prioritized experience replay (Schaul et al., 2016) samples transitions with probability proportional to their TD error. The intuition: transitions where the value function is most wrong are most informative.

GRL's particle memory has an analogous mechanism, but it operates through kernel similarity rather than explicit prioritization:

When the agent queries \(Q^+(z)\), each particle \(i\) contributes:

\[\text{contribution}_i = w_i \, k(z, z_i)\]

Particles that are close to the query point (high \(k(z, z_i)\)) contribute more. Particles that are far contribute less. This is automatic, implicit prioritization by relevance.

6.2 Relevance vs. surprise

The two prioritization schemes optimize for different things:

| Prioritized Replay | Particle Memory |
| --- | --- |
| Prioritizes by surprise (\(\lvert\delta\rvert\)) | Prioritizes by relevance (\(k(z, z_i)\)) |
| "Where am I most wrong?" | "What do I know about here?" |
| Global: any transition can be sampled | Local: nearby particles dominate |
| Requires explicit priority updates | Automatic via kernel geometry |

Insight: These are complementary. Surprise-based prioritization helps learning (focus updates where errors are large). Relevance-based weighting helps inference (focus predictions on nearby evidence).

6.3 Can GRL benefit from surprise-based prioritization?

Yes. MemoryUpdate already incorporates the TD error \(\delta\) when updating particle weights (Chapter 6). Particles associated with large \(|\delta|\) receive larger weight updates. This is a form of surprise-based prioritization at the update level, complementing the relevance-based weighting at the query level.

A more explicit version could weight particles by both kernel similarity and accumulated TD error magnitude — combining the strengths of both approaches.


7. Where Replay Buffers Struggle and Particle Memory Doesn't

7.1 The staleness problem

Transitions in a replay buffer were generated by old policies. As the policy improves, old transitions become increasingly unrepresentative of the current policy's behavior. This is the staleness problem.

Consequences:

  • Old transitions may have incorrect state visitation frequencies
  • The replay distribution diverges from the current policy's distribution
  • Importance sampling corrections become high-variance

Particle memory: Particles are not "stale" in the same way. A particle \((z_i, w_i)\) represents a belief about the value at location \(z_i\), not a raw transition. MemoryUpdate continuously revises particle weights as new evidence arrives. Old particles don't represent old policies — they represent the agent's current best estimate at that location, informed by all evidence accumulated so far.

7.2 The capacity problem

Replay buffers have fixed capacity. When full, old transitions are evicted — permanently losing information. Choosing what to evict is a non-trivial problem (FIFO? priority-based? reservoir sampling?).

Particle memory: Particles can be merged rather than evicted. When two particles are close in augmented space, they can be combined into a single particle that preserves the aggregate information. This is information-preserving compression, not information-destroying eviction.
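One simple merge rule with this flavor can be sketched as follows. This is a hypothetical illustration, since the text does not specify GRL's exact merge rule; here the total weight is preserved and the merged particle sits at the \(|w|\)-weighted centroid of the pair:

```python
def merge_particles(z1, w1, z2, w2):
    """Merge two nearby particles into one (illustrative rule, not GRL's exact one).

    Total weight is conserved; the merged position is the |weight|-weighted
    centroid, so the pair's aggregate influence on the field is approximated
    rather than discarded.
    """
    w = w1 + w2
    denom = abs(w1) + abs(w2)
    z = (abs(w1) * z1 + abs(w2) * z2) / denom if denom > 0 else (z1 + z2) / 2
    return z, w
```

Contrast this with FIFO eviction: a buffer drops a transition outright, while a merge keeps a compressed summary of both particles in the field.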

7.3 The representation gap

A replay buffer stores raw data. The function approximator (neural network) must learn to extract useful representations from this data. There is a gap between what is stored (transitions) and what is needed (value function).

Particle memory: There is no gap. Particles directly define the value function through kernel superposition. The representation is the stored experience.


8. Where Particle Memory Struggles and Replay Buffers Don't

The comparison is not one-sided. Replay buffers have genuine advantages:

8.1 Scalability (vanilla GP only — not inherent to GRL)

A common concern is that particle memory requires \(O(N^3)\) GP inference. This is a limitation of vanilla GP, not of GRL itself. The reinforcement field formalism is agnostic to the learning mechanism — GP is one choice among many.

As established in Chapter 7 of the quantum-inspired series, GRL supports multiple alternatives:

| Method | Per-Update Cost | Per-Query Cost | Notes |
| --- | --- | --- | --- |
| Vanilla GP | \(O(N^3)\) | \(O(N^2)\) | Original formulation; small-to-medium particle sets |
| Sparse GP (inducing points) | \(O(M^3)\), \(M \ll N\) | \(O(M^2)\) | FITC, PITC, VFE; bounded memory |
| Online SGD | \(O(N)\) | \(O(N)\) | No matrix inversion; scalable to millions |
| Deep neural network | \(O(1)\) amortized | \(O(1)\) | Replaces kernel with learned representation |
| Mixture of Experts | \(O(M \cdot N/M)\) | Per-expert | Multiple local fields with gating |
With online SGD or neural field representations, GRL scales comparably to replay-buffer-based methods. The key insight from that chapter: "The state-as-field formalism is agnostic to the learning mechanism — you can swap the inference engine while preserving GRL's structure."

That said, replay buffers still have a simplicity advantage: \(O(1)\) insertion and \(O(1)\) random access with no inference overhead at storage time. The cost is deferred to training time.

8.2 Flexibility with function approximators

In its original kernel-based form, particle memory is tied to a specific representation. Replay buffers are agnostic — they work with any function approximator trainable on mini-batches.

However, GRL's hybrid architectures (neural net + particle memory, as in Chapter 7 of the quantum-inspired series, Section 6) bridge this gap. A neural network provides the base value estimate \(Q_\theta(z)\), while a small particle memory provides fast episodic adaptation:

\[Q^+(z) = Q_\theta(z) + \beta \sum_{i \in \text{recent}} w_i \, k(z_i, z)\]

This combines the scalability and representational power of neural networks with the principled, uncertainty-aware interpolation of kernel methods.
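The hybrid query above can be sketched directly. The names are illustrative, with `q_theta` standing in for the trained network:

```python
import numpy as np


def hybrid_q(z, q_theta, recent_particles, beta=0.1, length_scale=1.0):
    """Q+(z) = Q_theta(z) + beta * sum_i w_i k(z_i, z).

    `q_theta` is any callable giving the neural base estimate; the kernel
    correction from recent (z_i, w_i) particles provides fast episodic
    adaptation on top of it.
    """
    correction = sum(
        w_i * np.exp(-np.sum((z - z_i) ** 2) / (2 * length_scale ** 2))
        for z_i, w_i in recent_particles
    )
    return q_theta(z) + beta * correction
```

The design point: the network carries slow, general knowledge, while the small particle set can shift the estimate near recently visited points without any gradient step.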

8.3 Off-policy learning

Replay buffers are designed for off-policy learning — the whole point is to reuse data from old policies. Particle memory, as used in RF-SARSA, is on-policy. Using particle memory for off-policy learning requires the safeguards discussed in Chapter 07b.

8.4 High-dimensional state spaces

Kernel methods suffer from the curse of dimensionality in high-dimensional spaces (images, raw sensor data). Replay buffers combined with deep neural networks handle high-dimensional inputs naturally through learned representations.

Mitigation: Learned embeddings (Chapter 07a, Section 4) or hybrid architectures.


9. Historical Context: GRL Predated DQN's Experience Replay

9.1 Timeline

  • 1992: Lin introduced experience replay for neural network RL (Lin, 1992)
  • 2010: GRL proposed particle memory as a kernel-based experience representation
  • 2013: Mnih et al. (DQN) popularized experience replay as essential for deep RL stability
  • 2016: Schaul et al. introduced prioritized experience replay

GRL's particle memory (2010) was developed independently and contemporaneously with the resurgence of interest in experience replay. While Lin's 1992 work predates both, the modern understanding of why replay matters (decorrelation, stability, efficiency) was crystallized by DQN in 2013.

9.2 Different motivations, convergent solutions

DQN's motivation: "Neural networks need i.i.d. data. Online RL data is correlated. Solution: buffer and resample."

GRL's motivation: "Value functions are continuous fields. Experience provides sparse samples of this field. Solution: represent the field directly through kernel-weighted particles."

These are different starting points that arrive at the same abstract property: past experience persists and influences future learning. The implementations differ radically (data-level resampling vs. function-level superposition), but the functional role is the same.

9.3 What GRL got right early

GRL's particle memory anticipated several ideas that the deep RL community discovered later:

  1. Prioritization by relevance (kernel similarity) — formalized by Schaul et al. (2016) as prioritized replay
  2. Information-preserving compression (particle merging) — echoed in compressed replay buffers and memory-efficient replay
  3. Continuous reuse (every particle always active) — the theoretical ideal that high replay ratios approximate
  4. Uncertainty-aware inference (GP variance) — now pursued through ensemble methods and distributional RL

What GRL did not anticipate was the scalability of deep neural networks and the practical dominance of data-level replay in high-dimensional domains. The kernel-based approach is more principled but less scalable — a trade-off that the field is still navigating.


10. Bridging the Two Worlds

10.1 Hybrid architectures

The most promising direction may be to combine both approaches:

Neural network + particle critic:

  • Use a neural network for state representation (handles high-dimensional inputs)
  • Use particle memory in the learned representation space (provides kernel-based value estimation)
  • Use a replay buffer to train the neural network (provides data-level reuse)

This is essentially the Actor-Critic in RKHS architecture from Chapter 07a, augmented with a replay buffer for the actor.

10.2 Neural experience replay as learned particle memory

Modern work on neural episodic control (Pritzel et al., 2017) and differentiable neural dictionaries (Ritter et al., 2018) can be viewed as neural network implementations of particle memory:

  • Store (key, value) pairs where keys are learned embeddings
  • Query via kernel-weighted lookup (often with learned kernels)
  • This is structurally identical to particle memory with learned features
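A minimal sketch of such a lookup, loosely following the inverse-distance kernel \(k(h, h_i) = 1 / (\lVert h - h_i \rVert^2 + \delta)\) used in Neural Episodic Control (the function and variable names are ours):

```python
import numpy as np


def dnd_lookup(query_key, keys, values, k=5):
    """Kernel-weighted average of the values of the k nearest stored keys.

    `keys` is an (N, d) array of learned embeddings, `values` an (N,) array
    of associated value estimates. Structurally this is a particle-memory
    query restricted to the k nearest particles.
    """
    dists = np.sum((keys - query_key) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    sims = 1.0 / (dists[nearest] + 1e-3)  # inverse-distance kernel, delta = 1e-3
    weights = sims / sims.sum()
    return float(weights @ values[nearest])
```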

The convergence of these independent research lines suggests that the particle memory concept is a natural and powerful abstraction.


11. Summary

11.1 Experience replay in one sentence

Experience replay stores raw transitions and resamples them for training, breaking temporal correlations and improving sample efficiency.

11.2 Particle memory in one sentence

Particle memory converts experience into kernel basis functions that directly define the value function, providing automatic, continuous, relevance-weighted reuse of all past experience.

11.3 The relationship

| Level | Experience Replay | Particle Memory |
| --- | --- | --- |
| Data | Stores and resamples transitions | Stores weighted points |
| Function | Trains a separate approximator | Points are the approximator |
| Reuse | Stochastic (random sampling) | Deterministic (kernel weighting) |
| Prioritization | Explicit (TD error) | Implicit (kernel similarity) |
| Persistence | Until eviction | Until merging (information-preserving) |

11.4 The key insight

Experience replay and particle memory solve the same problem — making past experience available for current learning — but at different levels of abstraction. Replay operates at the data level (store and resample transitions). Particle memory operates at the function level (experience is the value function). GRL's particle memory can be understood as experience replay elevated from data management to functional representation.


References

  1. Lin, L.-J. (1992). "Self-improving reactive agents based on reinforcement learning, planning and teaching." Machine Learning, 8(3-4), 293-321. (Original experience replay)
  2. Mnih, V., et al. (2013). "Playing Atari with deep reinforcement learning." NeurIPS Deep Learning Workshop. (DQN with replay)
  3. Mnih, V., et al. (2015). "Human-level control through deep reinforcement learning." Nature, 518, 529-533. (DQN journal version)
  4. Schaul, T., et al. (2016). "Prioritized experience replay." ICLR. (Prioritized replay)
  5. Andrychowicz, M., et al. (2017). "Hindsight experience replay." NeurIPS. (HER)
  6. Pritzel, A., et al. (2017). "Neural episodic control." ICML. (Neural particle memory analog)
  7. Horgan, D., et al. (2018). "Distributed prioritized experience replay." ICLR. (Ape-X)
  8. Chiu, C.-C. & Huber, M. (2022). "Generalized Reinforcement Learning." arXiv:2208.04822. (GRL particle memory)

← Back to Chapter 07: RF-SARSA | Related: Chapter 05 - Particle Memory

Related: Chapter 07b - RF-Q-Learning and the Deadly Triad | Related: Chapter 06 - MemoryUpdate


Last Updated: February 2026