
Chapter 07c: Experience Replay and Particle Memory

Purpose: Understand experience replay in modern RL, and how GRL's particle memory is a structurally richer form of the same idea
Prerequisites: Chapter 05 (Particle Memory), Chapter 07 (RF-SARSA)
Key Concepts: Experience replay, replay buffers, prioritized replay, particle memory as implicit replay, functional vs. data-level reuse


Introduction

In 2013, DeepMind's DQN paper introduced experience replay as a key innovation that made deep Q-learning stable enough to play Atari games at superhuman level (the second stabilizer, target networks, was added in the 2015 Nature version). The idea was simple: instead of learning from each transition once and discarding it, store transitions in a buffer and resample them for training.

Experience replay is now ubiquitous in off-policy deep RL. But the core idea — reusing past experience to improve learning efficiency — is older than DQN and, as we'll see, was already present in GRL's particle memory (2010), albeit in a fundamentally different form.

This chapter covers:

  1. What experience replay is and why it matters
  2. How replay buffers work in practice (DQN, prioritized replay, etc.)
  3. The deep structural comparison with GRL's particle memory
  4. Why particle memory is "experience replay done at the function level"
  5. What each approach can learn from the other

1. The Problem That Experience Replay Solves

1.1 Online learning is wasteful

In standard online TD learning, the agent:

  1. Observes transition \((s_t, a_t, r_t, s_{t+1})\)
  2. Performs one update using this transition
  3. Discards the transition forever

This is enormously wasteful. Each interaction with the environment produces information that is used exactly once. In domains where environment interactions are expensive (robotics, simulation, real-world deployment), this is unacceptable.

1.2 Online learning is correlated

Consecutive transitions are highly correlated — \(s_{t+1}\) is similar to \(s_t\), the agent visits the same region of state space for many steps before moving on. Training a function approximator on correlated data violates the i.i.d. assumption that underlies stochastic gradient descent, leading to:

  • Oscillation: The approximator overfits to the current region, then overcorrects when the agent moves elsewhere
  • Catastrophic forgetting: Learning about new states erases knowledge about old ones
  • Poor convergence: Correlated updates bias the gradient estimates

1.3 The solution: store and resample

Experience replay addresses both problems, and brings a third benefit:

  • Efficiency: Each transition can be reused many times
  • Decorrelation: Random sampling from a buffer breaks temporal correlations
  • Stability: The training distribution is more uniform over the state space

2. Experience Replay in Practice

2.1 The basic mechanism (DQN)

The DQN algorithm (Mnih et al., 2013, 2015) popularized the standard replay buffer:

Data structure: A fixed-capacity buffer \(\mathcal{D}\) storing transitions:

\[\mathcal{D} = \{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{|\mathcal{D}|}\]

Collection: At each step, store the new transition in \(\mathcal{D}\) (overwriting the oldest if full).

Training: Sample a random mini-batch \(\mathcal{B} \subset \mathcal{D}\) and perform a gradient update:

\[w \leftarrow w - \alpha \frac{1}{|\mathcal{B}|} \sum_{(s,a,r,s') \in \mathcal{B}} \nabla_w \left[ Q_w(s, a) - (r + \gamma \max_{a'} Q_{w^-}(s', a')) \right]^2\]

where \(w^-\) are the target network parameters (frozen periodically).

Key properties:

  • Uniform sampling: Each transition is equally likely to be sampled
  • FIFO eviction: When the buffer is full, the oldest transitions are discarded
  • Off-policy: The transitions in the buffer were generated by old policies — the current policy may be very different
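The mechanism above can be sketched in a few lines of Python. This is an illustrative sketch, not DQN's actual implementation; the class name and API are our own:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity transition buffer with FIFO eviction and uniform sampling."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest entry when full (FIFO)
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlations
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Note that `random.sample` draws without replacement, so `batch_size` must not exceed the current buffer size.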

2.2 Prioritized experience replay

Schaul et al. (2016) observed that not all transitions are equally useful. Transitions with large TD errors are more informative — they indicate regions where the value function is most wrong.

Priority: Assign each transition a priority proportional to its TD error:

\[p_i = |\delta_i| + \epsilon\]

where \(\delta_i = r_i + \gamma \max_{a'} Q(s_{i+1}, a') - Q(s_i, a_i)\) and \(\epsilon > 0\) prevents zero priority.

Sampling: Draw transitions with probability:

\[P(i) = \frac{p_i^\alpha}{\sum_j p_j^\alpha}\]

where \(\alpha \in [0, 1]\) controls how much prioritization matters (\(\alpha = 0\) is uniform, \(\alpha = 1\) is fully prioritized).

Importance sampling correction: Since prioritized sampling is biased (oversamples high-error transitions), correct with importance sampling weights:

\[w_i = \left( \frac{1}{|\mathcal{D}|} \cdot \frac{1}{P(i)} \right)^\beta\]

where \(\beta\) is annealed from a small value to 1 over training.
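The priority, sampling, and correction equations above can be combined into a short NumPy sketch. The function name is ours, and the defaults \(\alpha = 0.6\), \(\beta = 0.4\) follow common practice rather than anything mandated by the text:

```python
import numpy as np


def prioritized_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample transition indices with P(i) proportional to (|delta_i| + eps)^alpha
    and return the importance-sampling weights w_i = (|D| * P(i))^(-beta)."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    # Sampling with replacement, as in the proportional variant of Schaul et al.
    idx = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)
    weights /= weights.max()  # normalize so the largest weight is 1, for stability
    return idx, weights
```

In a full implementation the priorities would live in a sum-tree for \(O(\log N)\) updates; here a flat array keeps the math visible.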

2.3 Other replay variants

The field has produced many refinements:

| Variant | Key Idea | Reference |
| --- | --- | --- |
| Hindsight Experience Replay (HER) | Relabel failed episodes with achieved goals | Andrychowicz et al., 2017 |
| Distributed Replay | Multiple actors fill a shared buffer | Horgan et al., 2018 (Ape-X) |
| N-step Replay | Store n-step returns instead of 1-step | Mnih et al., 2016 (A3C) |
| Curiosity-Prioritized | Prioritize by novelty, not just TD error | Pathak et al., 2017 |
| Reservoir Sampling | Uniform sampling without FIFO bias | Isele & Cosgun, 2018 |

All of these operate at the data level: they store, organize, and resample raw transitions.


3. What Experience Replay Actually Does (Abstractly)

Before comparing with GRL, let's identify what experience replay achieves at a conceptual level, independent of implementation:

3.1 Three functions of replay

Function 1: Temporal persistence of information

Without replay, information from a transition is used once and lost. Replay ensures that the information content of past experience persists and continues to influence learning.

Function 2: Distributional smoothing

Online learning sees a non-stationary, correlated stream of data. Replay transforms this into a more uniform, decorrelated distribution — closer to the stationary distribution needed for convergence.

Function 3: Multi-use of scarce data

Each environment interaction is expensive. Replay extracts more learning signal per interaction by revisiting transitions multiple times.

3.2 The key abstraction

At its core, experience replay is about making past experience available for current learning. The specific mechanism (buffer, sampling, prioritization) is an implementation detail. The essential property is:

Information from past interactions persists and influences future updates.

This is exactly what GRL's particle memory does — but through a completely different mechanism.


4. GRL's Particle Memory as Implicit Experience Replay

4.1 The structural parallel

In GRL, when the agent experiences a transition and RF-SARSA produces a TD update, the result is a weighted particle \((z_i, w_i)\) added to memory \(\Omega\). This particle then influences every future field query \(Q^+(z')\) through the kernel:

\[Q^+(z') = \sum_{i=1}^N w_i \, k(z', z_i)\]

Every past experience — encoded as a particle — contributes to every future prediction. The particle doesn't need to be "replayed" or "resampled" because it is always active as a basis function in the field representation.

This is experience replay, but at the function level rather than the data level.
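The superposition can be made concrete with a short sketch. The helper names are ours, and an RBF kernel is assumed for illustration:

```python
import numpy as np


def rbf_kernel(z, z_i, length_scale=1.0):
    """Squared-exponential kernel: 1 when z == z_i, decaying with distance."""
    return np.exp(-np.sum((z - z_i) ** 2) / (2 * length_scale ** 2))


def field_query(z, particles):
    """Q+(z) as a kernel-weighted superposition over all stored particles.

    `particles` is a list of (z_i, w_i) pairs. Every particle contributes to
    every query, weighted by its kernel similarity to the query point z.
    """
    return sum(w_i * rbf_kernel(z, z_i) for z_i, w_i in particles)
```

Nothing is sampled or replayed here: the stored particles *are* the value function, and a query simply evaluates their superposition.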

4.2 Side-by-side comparison

| Aspect | Experience Replay (DQN) | Particle Memory (GRL) |
| --- | --- | --- |
| What is stored | Raw transitions \((s, a, r, s')\) | Weighted points \((z, w)\) in augmented space |
| How past experience is reused | Resampled and re-trained on | Always active as basis functions |
| When reuse happens | At training time (mini-batch sampling) | At inference time (every field query) |
| Mechanism | Stochastic gradient descent on replayed data | Kernel-weighted superposition |
| Frequency of reuse | Each transition sampled \(\sim k\) times | Each particle contributes to every query |
| Information loss | Transitions eventually evicted (FIFO) | Particles merged/pruned but information preserved in neighbors |
| Correlation breaking | Random sampling from buffer | Kernel geometry (smooth, nonlocal) |
| Prioritization | Explicit (TD error priority) | Implicit (kernel similarity to query point) |
| Off-policy correction | Importance sampling weights | Not needed (on-policy with SARSA) |

4.3 The deep difference: data-level vs. function-level reuse

This distinction is the crux of the comparison:

Experience replay (data-level):

  1. Store raw data: \((s, a, r, s')\)
  2. Resample data
  3. Recompute gradients from resampled data
  4. Update function approximator

The function approximator (neural network) is separate from the stored data. Replay feeds data into the approximator.

Particle memory (function-level):

  1. Convert experience into a basis function: \((z, w) \to w \cdot k(\cdot, z)\)
  2. The basis function is part of the value function
  3. Every future query automatically incorporates this experience
  4. No resampling needed

The stored experience is the function approximator. There is no separation between "data" and "model."

In DQN, past experience is replayed to the value function. In GRL, past experience is the value function.


5. Three Functions of Replay, Revisited Through GRL

Let's check whether GRL's particle memory achieves the three abstract functions of experience replay identified in Section 3.

5.1 Temporal persistence of information

Replay buffer: Transitions persist in the buffer until evicted (FIFO or priority-based). Information is lost when transitions are overwritten.

Particle memory: Particles persist indefinitely (subject to merging/pruning). Even when particles are merged, their information is preserved in the merged particle's weight and position. Information loss is gradual and controlled, not abrupt.

Verdict: Particle memory provides stronger temporal persistence. Information is never abruptly discarded — it is smoothly absorbed into the field.

5.2 Distributional smoothing

Replay buffer: Random sampling from the buffer approximates a uniform distribution over past experience. This breaks temporal correlations.

Particle memory: The kernel provides automatic smoothing. When the agent queries \(Q^+(z)\), the prediction is a kernel-weighted average over all particles — not a random sample. This is a deterministic, smooth interpolation that naturally decorrelates the influence of any single experience.

Verdict: Particle memory provides deterministic distributional smoothing (via kernel), compared to replay's stochastic smoothing (via random sampling). The kernel approach is arguably more principled — it weights past experience by relevance (kernel similarity) rather than by chance (random sampling).

5.3 Multi-use of scarce data

Replay buffer: Each transition is sampled \(\sim k\) times on average before eviction. The replay ratio (updates per environment step) controls how much each transition is reused.

Particle memory: Each particle contributes to every field query for its entire lifetime. A particle added at step 1 is still influencing predictions at step 10,000 (with influence decaying smoothly via the kernel as the agent moves to distant regions of augmented space).

Verdict: Particle memory achieves maximal data reuse — every past experience influences every future prediction, weighted by relevance. This is the theoretical ideal that replay buffers approximate through sampling.


6. Prioritized Replay and Kernel Similarity

6.1 An unexpected parallel

Prioritized experience replay (Schaul et al., 2016) samples transitions with probability proportional to their TD error. The intuition: transitions where the value function is most wrong are most informative.

GRL's particle memory has an analogous mechanism, but it operates through kernel similarity rather than explicit prioritization:

When the agent queries \(Q^+(z)\), each particle \(i\) contributes:

\[\text{contribution}_i = w_i \, k(z, z_i)\]

Particles that are close to the query point (high \(k(z, z_i)\)) contribute more. Particles that are far contribute less. This is automatic, implicit prioritization by relevance.

6.2 Relevance vs. surprise

The two prioritization schemes optimize for different things:

| Prioritized Replay | Particle Memory |
| --- | --- |
| Prioritizes by surprise (\(\lvert\delta\rvert\)) | Prioritizes by relevance (\(k(z, z_i)\)) |
| "Where am I most wrong?" | "What do I know about here?" |
| Global: any transition can be sampled | Local: nearby particles dominate |
| Requires explicit priority updates | Automatic via kernel geometry |

Insight: These are complementary. Surprise-based prioritization helps learning (focus updates where errors are large). Relevance-based weighting helps inference (focus predictions on nearby evidence).

6.3 Can GRL benefit from surprise-based prioritization?

Yes. MemoryUpdate already incorporates the TD error \(\delta\) when updating particle weights (Chapter 6). Particles associated with large \(|\delta|\) receive larger weight updates. This is a form of surprise-based prioritization at the update level, complementing the relevance-based weighting at the query level.

A more explicit version could weight particles by both kernel similarity and accumulated TD error magnitude — combining the strengths of both approaches.


7. Where Replay Buffers Struggle and Particle Memory Doesn't

7.1 The staleness problem

Transitions in a replay buffer were generated by old policies. As the policy improves, old transitions become increasingly unrepresentative of the current policy's behavior. This is the staleness problem.

Consequences:

  • Old transitions may have incorrect state visitation frequencies
  • The replay distribution diverges from the current policy's distribution
  • Importance sampling corrections become high-variance

Particle memory: Particles are not "stale" in the same way. A particle \((z_i, w_i)\) represents a belief about the value at location \(z_i\), not a raw transition. MemoryUpdate continuously revises particle weights as new evidence arrives. Old particles don't represent old policies — they represent the agent's current best estimate at that location, informed by all evidence accumulated so far.

7.2 The capacity problem

Replay buffers have fixed capacity. When full, old transitions are evicted — permanently losing information. Choosing what to evict is a non-trivial problem (FIFO? priority-based? reservoir sampling?).

Particle memory: Particles can be merged rather than evicted. When two particles are close in augmented space, they can be combined into a single particle that preserves the aggregate information. This is information-preserving compression, not information-destroying eviction.
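One simple merge rule with this flavor can be sketched as follows. This is a hypothetical illustration, since the text does not specify GRL's exact merge rule; here the total weight is preserved and the merged particle sits at the \(|w|\)-weighted centroid of the pair:

```python
def merge_particles(z1, w1, z2, w2):
    """Merge two nearby particles into one (illustrative rule, not GRL's exact one).

    Total weight is conserved; the merged position is the |weight|-weighted
    centroid, so the pair's aggregate influence on the field is approximated
    rather than discarded.
    """
    w = w1 + w2
    denom = abs(w1) + abs(w2)
    z = (abs(w1) * z1 + abs(w2) * z2) / denom if denom > 0 else (z1 + z2) / 2
    return z, w
```

Contrast this with FIFO eviction: a buffer drops a transition outright, while a merge keeps a compressed summary of both particles in the field.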

7.3 The representation gap

A replay buffer stores raw data. The function approximator (neural network) must learn to extract useful representations from this data. There is a gap between what is stored (transitions) and what is needed (value function).

Particle memory: There is no gap. Particles directly define the value function through kernel superposition. The representation is the stored experience.


8. Where Particle Memory Struggles and Replay Buffers Don't

The comparison is not one-sided. Replay buffers have genuine advantages:

8.1 Scalability (vanilla GP only — not inherent to GRL)

A common concern is that particle memory requires \(O(N^3)\) GP inference. This is a limitation of vanilla GP, not of GRL itself. The reinforcement field formalism is agnostic to the learning mechanism — GP is one choice among many.

As established in Chapter 7 of the quantum-inspired series, GRL supports multiple alternatives:

| Method | Per-Update Cost | Per-Query Cost | Notes |
| --- | --- | --- | --- |
| Vanilla GP | \(O(N^3)\) | \(O(N^2)\) | Original formulation; small-to-medium particle sets |
| Sparse GP (inducing points) | \(O(M^3)\), \(M \ll N\) | \(O(M^2)\) | FITC, PITC, VFE; bounded memory |
| Online SGD | \(O(N)\) | \(O(N)\) | No matrix inversion; scalable to millions |
| Deep neural network | \(O(1)\) amortized | \(O(1)\) | Replaces kernel with learned representation |
| Mixture of Experts | \(O(M \cdot N/M)\) | Per-expert | Multiple local fields with gating |
With online SGD or neural field representations, GRL scales comparably to replay-buffer-based methods. The key insight from that chapter: "The state-as-field formalism is agnostic to the learning mechanism — you can swap the inference engine while preserving GRL's structure."

That said, replay buffers still have a simplicity advantage: \(O(1)\) insertion and \(O(1)\) random access with no inference overhead at storage time. The cost is deferred to training time.

8.2 Flexibility with function approximators

In its original kernel-based form, particle memory is tied to a specific representation. Replay buffers are agnostic — they work with any function approximator trainable on mini-batches.

However, GRL's hybrid architectures (neural net + particle memory, as in Chapter 7 of the quantum-inspired series, Section 6) bridge this gap. A neural network provides the base value estimate \(Q_\theta(z)\), while a small particle memory provides fast episodic adaptation:

\[Q^+(z) = Q_\theta(z) + \beta \sum_{i \in \text{recent}} w_i \, k(z_i, z)\]

This combines the scalability and representational power of neural networks with the principled, uncertainty-aware interpolation of kernel methods.
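The hybrid query above can be sketched directly. The names are illustrative, with `q_theta` standing in for the trained network:

```python
import numpy as np


def hybrid_q(z, q_theta, recent_particles, beta=0.1, length_scale=1.0):
    """Q+(z) = Q_theta(z) + beta * sum_i w_i k(z_i, z).

    `q_theta` is any callable giving the neural base estimate; the kernel
    correction from recent (z_i, w_i) particles provides fast episodic
    adaptation on top of it.
    """
    correction = sum(
        w_i * np.exp(-np.sum((z - z_i) ** 2) / (2 * length_scale ** 2))
        for z_i, w_i in recent_particles
    )
    return q_theta(z) + beta * correction
```

The design point: the network carries slow, general knowledge, while the small particle set can shift the estimate near recently visited points without any gradient step.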

8.3 Off-policy learning

Replay buffers are designed for off-policy learning — the whole point is to reuse data from old policies. Particle memory, as used in RF-SARSA, is on-policy. Using particle memory for off-policy learning requires the safeguards discussed in Chapter 07b.

8.4 High-dimensional state spaces

Kernel methods suffer from the curse of dimensionality in high-dimensional spaces (images, raw sensor data). Replay buffers combined with deep neural networks handle high-dimensional inputs naturally through learned representations.

Mitigation: Learned embeddings (Chapter 07a, Section 4) or hybrid architectures.


9. Historical Context: GRL Predated DQN's Experience Replay

9.1 Timeline

  • 1992: Lin introduced experience replay for neural network RL (Lin, 1992)
  • 2010: GRL proposed particle memory as a kernel-based experience representation
  • 2013: Mnih et al. (DQN) popularized experience replay as essential for deep RL stability
  • 2016: Schaul et al. introduced prioritized experience replay

GRL's particle memory (2010) was developed independently and contemporaneously with the resurgence of interest in experience replay. While Lin's 1992 work predates both, the modern understanding of why replay matters (decorrelation, stability, efficiency) was crystallized by DQN in 2013.

9.2 Different motivations, convergent solutions

DQN's motivation: "Neural networks need i.i.d. data. Online RL data is correlated. Solution: buffer and resample."

GRL's motivation: "Value functions are continuous fields. Experience provides sparse samples of this field. Solution: represent the field directly through kernel-weighted particles."

These are different starting points that arrive at the same abstract property: past experience persists and influences future learning. The implementations differ radically (data-level resampling vs. function-level superposition), but the functional role is the same.

9.3 What GRL got right early

GRL's particle memory anticipated several ideas that the deep RL community discovered later:

  1. Prioritization by relevance (kernel similarity) — formalized by Schaul et al. (2016) as prioritized replay
  2. Information-preserving compression (particle merging) — echoed in compressed replay buffers and memory-efficient replay
  3. Continuous reuse (every particle always active) — the theoretical ideal that high replay ratios approximate
  4. Uncertainty-aware inference (GP variance) — now pursued through ensemble methods and distributional RL

What GRL did not anticipate was the scalability of deep neural networks and the practical dominance of data-level replay in high-dimensional domains. The kernel-based approach is more principled but less scalable — a trade-off that the field is still navigating.


10. Bridging the Two Worlds

10.1 Hybrid architectures

The most promising direction may be to combine both approaches:

Neural network + particle critic:

  • Use a neural network for state representation (handles high-dimensional inputs)
  • Use particle memory in the learned representation space (provides kernel-based value estimation)
  • Use a replay buffer to train the neural network (provides data-level reuse)

This is essentially the Actor-Critic in RKHS architecture from Chapter 07a, augmented with a replay buffer for the actor.

10.2 Neural experience replay as learned particle memory

Modern work on neural episodic control (Pritzel et al., 2017) and differentiable neural dictionaries (Ritter et al., 2018) can be viewed as neural network implementations of particle memory:

  • Store (key, value) pairs where keys are learned embeddings
  • Query via kernel-weighted lookup (often with learned kernels)
  • This is structurally identical to particle memory with learned features
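A minimal sketch of such a lookup, loosely following the inverse-distance kernel \(k(h, h_i) = 1 / (\lVert h - h_i \rVert^2 + \delta)\) used in Neural Episodic Control (the function and variable names are ours):

```python
import numpy as np


def dnd_lookup(query_key, keys, values, k=5):
    """Kernel-weighted average of the values of the k nearest stored keys.

    `keys` is an (N, d) array of learned embeddings, `values` an (N,) array
    of associated value estimates. Structurally this is a particle-memory
    query restricted to the k nearest particles.
    """
    dists = np.sum((keys - query_key) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    sims = 1.0 / (dists[nearest] + 1e-3)  # inverse-distance kernel, delta = 1e-3
    weights = sims / sims.sum()
    return float(weights @ values[nearest])
```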

The convergence of these independent research lines suggests that the particle memory concept is a natural and powerful abstraction.


11. Summary

11.1 Experience replay in one sentence

Experience replay stores raw transitions and resamples them for training, breaking temporal correlations and improving sample efficiency.

11.2 Particle memory in one sentence

Particle memory converts experience into kernel basis functions that directly define the value function, providing automatic, continuous, relevance-weighted reuse of all past experience.

11.3 The relationship

| Level | Experience Replay | Particle Memory |
| --- | --- | --- |
| Data | Stores and resamples transitions | Stores weighted points |
| Function | Trains a separate approximator | Points are the approximator |
| Reuse | Stochastic (random sampling) | Deterministic (kernel weighting) |
| Prioritization | Explicit (TD error) | Implicit (kernel similarity) |
| Persistence | Until eviction | Until merging (information-preserving) |

11.4 The key insight

Experience replay and particle memory solve the same problem — making past experience available for current learning — but at different levels of abstraction. Replay operates at the data level (store and resample transitions). Particle memory operates at the function level (experience is the value function). GRL's particle memory can be understood as experience replay elevated from data management to functional representation.


References

  1. Lin, L.-J. (1992). "Self-improving reactive agents based on reinforcement learning, planning and teaching." Machine Learning, 8(3-4), 293-321. (Original experience replay)
  2. Mnih, V., et al. (2013). "Playing Atari with deep reinforcement learning." NeurIPS Deep Learning Workshop. (DQN with replay)
  3. Mnih, V., et al. (2015). "Human-level control through deep reinforcement learning." Nature, 518, 529-533. (DQN journal version)
  4. Schaul, T., et al. (2016). "Prioritized experience replay." ICLR. (Prioritized replay)
  5. Andrychowicz, M., et al. (2017). "Hindsight experience replay." NeurIPS. (HER)
  6. Pritzel, A., et al. (2017). "Neural episodic control." ICML. (Neural particle memory analog)
  7. Horgan, D., et al. (2018). "Distributed prioritized experience replay." ICLR. (Ape-X)
  8. Chiu, C.-C. & Huber, M. (2022). "Generalized Reinforcement Learning." arXiv:2208.04822. (GRL particle memory)

← Back to Chapter 07: RF-SARSA | Related: Chapter 05 - Particle Memory

Related: Chapter 07b - RF-Q-Learning and the Deadly Triad | Related: Chapter 06 - MemoryUpdate


Last Updated: February 2026