Why My Q-Learning Agent Failed to Learn
An agent that never reached the goal — hypothesis-driven debugging from rewards to environment reset.
Note: This is a representative case study illustrating debugging workflow. Values are illustrative for portfolio documentation.
Problem
A tabular Q-learning agent on a 10×10 grid world never reached the goal after 5,000 episodes. Average reward flatlined near zero.
Initial Hypothesis
- › Reward shaping — goal reward too small vs step penalty.
- › Exploration — ε-greedy decayed too fast; agent stopped exploring.
- › Learning rate — α too high caused Q-value oscillation.
Investigation
| Step | Action | Result |
|---|---|---|
| 1 | Logged ε, α, episode reward | ε reached 0.01 by episode 200; exploration collapsed early |
| 2 | Raised goal reward 10× | Slight improvement; still no consistent goal hits |
| 3 | Printed Q-table max per episode | Q-values reset to zeros every 100 episodes |
| 4 | Traced env.reset() | New environment instance created on each reset, orphaning learned state |
Root Cause
reset() instantiated a new GridWorld object instead of clearing the existing grid state. The Q-table keyed on (row, col) still referenced old coordinates, but transition dynamics changed when internal obstacle layout was re-randomized on construction.
State consistency broke between episodes — the agent was learning on a moving target.
The Bellman update the agent was trying to apply (on a stable environment) is:
With and , oscillation from resetting every 100 episodes masked any convergence signal.
Fix
def reset(self):
self.grid = self._initial_grid.copy()
self.agent_pos = self.start
return self._obs()
- › Reuse one environment instance for the full training run.
- › Separate
reset()from__init__for obstacle generation (fixed per run).
Result
After the fix, the agent reached the goal within ~320 episodes (ε=0.1, γ=0.99). Average reward curve stabilized with clear upward trend by episode 400.
Lessons Learned
- › State consistency is critical in RL — environment identity must be stable across episodes unless you intentionally want non-stationarity.
- › Debugging is hypothesis-driven — log exploration params before tuning rewards.
- › Always trace object lifetimes on
reset()— a common footgun in custom envs.
Related
- › A* Pathfinding Lab — complementary search intuition on grids