High Machine Learning Fixed

Why My Q-Learning Agent Failed to Learn

An agent that never reached the goal — hypothesis-driven debugging from rewards to environment reset.

Note: This is a representative case study illustrating debugging workflow. Values are illustrative for portfolio documentation.

Problem

A tabular Q-learning agent on a 10×10 grid world had still not reached the goal after 5,000 episodes, and its average reward flatlined near zero.

Initial Hypothesis

  1. Reward shaping — goal reward too small vs step penalty.
  2. Exploration — ε-greedy decayed too fast; agent stopped exploring.
  3. Learning rate — α too high caused Q-value oscillation.

Investigation

Step | Action                          | Result
1    | Logged ε, α, episode reward     | ε reached 0.01 by episode 200; exploration collapsed early
2    | Raised goal reward 10×          | Slight improvement; still no consistent goal hits
3    | Printed Q-table max per episode | Q-values reset to zeros every 100 episodes
4    | Traced env.reset()              | New environment instance created on each reset, orphaning learned state
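
The exploration collapse found in step 1 can be checked offline, before any training run. The sketch below assumes a multiplicative decay schedule (ε multiplied by a constant each episode) starting at 1.0 with a floor of 0.01. The case study does not state the actual schedule, so `epsilon_schedule`, `first_episode_at_floor`, and the decay values are hypothetical.

```python
def epsilon_schedule(start, decay, floor, episodes):
    """Return the epsilon used in each episode under multiplicative decay."""
    eps = start
    values = []
    for _ in range(episodes):
        values.append(eps)
        # Decay once per episode, but never drop below the floor.
        eps = max(floor, eps * decay)
    return values


def first_episode_at_floor(values, floor):
    """Index of the first episode whose epsilon has hit the floor, or None."""
    for episode, eps in enumerate(values):
        if eps <= floor:
            return episode
    return None


# Hypothetical decay rates: an aggressive one and a gentler one.
fast = epsilon_schedule(start=1.0, decay=0.977, floor=0.01, episodes=5000)
slow = epsilon_schedule(start=1.0, decay=0.999, floor=0.01, episodes=5000)

fast_floor_episode = first_episode_at_floor(fast, 0.01)
slow_floor_episode = first_episode_at_floor(slow, 0.01)
print("fast decay hits the floor at episode", fast_floor_episode)
print("slow decay hits the floor at episode", slow_floor_episode)
```

Under these assumed settings the aggressive schedule reaches the floor at roughly episode 200, which matches the logged symptom, while the gentler one keeps exploring for thousands of episodes.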

Root Cause

reset() instantiated a new GridWorld object instead of clearing the existing grid state. The Q-table, keyed on (row, col), kept values learned under the old layout, while the transition dynamics changed because the obstacle layout was re-randomized in the constructor.

State consistency broke between episodes — the agent was learning on a moving target.
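
The failure mode can be reproduced in a few lines. This is a minimal sketch, not the original environment: the `GridWorld` class, its constructor arguments, and the `buggy_reset` helper are hypothetical stand-ins that only mimic "construct a new instance on reset".

```python
import random


class GridWorld:
    """Hypothetical grid world whose obstacles are randomized on construction."""

    def __init__(self, size=10, n_obstacles=15, rng=None):
        self.size = size
        self.rng = rng if rng is not None else random.Random()
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        free = [(r, c) for r in range(size) for c in range(size)
                if (r, c) not in (self.start, self.goal)]
        # The layout is drawn here, so every new instance gets a new layout.
        self.obstacles = frozenset(self.rng.sample(free, n_obstacles))
        self.agent_pos = self.start


def buggy_reset(env):
    """Mimics the bug: builds a brand-new environment instead of clearing state."""
    return GridWorld(size=env.size, n_obstacles=len(env.obstacles), rng=env.rng)


env = GridWorld(rng=random.Random(0))
layouts = set()
for _ in range(20):
    env = buggy_reset(env)
    layouts.add(env.obstacles)

# More than one distinct layout means (row, col) no longer names a stable state.
print("distinct obstacle layouts seen:", len(layouts))
```

Counting distinct layouts across resets is a cheap invariant check: a stationary environment should report exactly one.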

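The Bellman update the agent was trying to apply (on a stable environment) is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

With the learning rate α and discount factor γ used in training, oscillation from the reset every 100 episodes masked any convergence signal.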
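
For reference, the standard tabular Q-learning update can be written as a small function. This is a sketch under stated assumptions: the action names, the dictionary-based Q-table keyed on (state, action), and the α and γ values in the usage lines are illustrative and are not taken from the case study.

```python
from collections import defaultdict

# Hypothetical action set for a grid world.
ACTIONS = ("up", "down", "left", "right")


def q_update(q, state, action, reward, next_state, done, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Terminal transitions have no future value to bootstrap from.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


# Worked example with illustrative numbers (alpha=0.5, gamma=0.5).
q = defaultdict(float)
q[((0, 1), "right")] = 2.0  # pretend the next state already has some value
td = q_update(q, state=(0, 0), action="right", reward=1.0,
              next_state=(0, 1), done=False, alpha=0.5, gamma=0.5)
print("TD error:", td)
print("updated Q:", q[((0, 0), "right")])
```

In the worked example the target is 1.0 + 0.5 × 2.0 = 2.0, so the TD error is 2.0 and the entry moves halfway there, to 1.0. The update only converges if `next_state` means the same thing from one episode to the next, which is exactly what the broken reset violated.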

Fix

    def reset(self):
        # Restore the saved layout instead of constructing a new GridWorld.
        self.grid = self._initial_grid.copy()
        self.agent_pos = self.start
        return self._obs()

  • Reuse one environment instance for the full training run.
  • Generate obstacles once in __init__ (fixed per run); reset() only restores the initial state.
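
To show the fix in context, here is a self-contained sketch of an environment that follows both bullets, plus a small regression check. Everything beyond `reset()`, `_initial_grid`, `agent_pos`, `start`, and `_obs()` is an assumption: the list-of-lists grid, the `seed` argument, and the obstacle count are hypothetical choices made for illustration.

```python
import random


class GridWorld:
    """Hypothetical grid world: obstacles are built once, reset() only restores state."""

    def __init__(self, size=10, n_obstacles=15, seed=None):
        rng = random.Random(seed)
        self.size = size
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        free = [(r, c) for r in range(size) for c in range(size)
                if (r, c) not in (self.start, self.goal)]
        blocked = set(rng.sample(free, n_obstacles))
        # 1 marks an obstacle, 0 a free cell; this layout is fixed for the run.
        self._initial_grid = [[1 if (r, c) in blocked else 0 for c in range(size)]
                              for r in range(size)]
        self.reset()

    def _obs(self):
        return self.agent_pos

    def reset(self):
        # Copy row by row so edits to self.grid never leak into _initial_grid.
        self.grid = [row[:] for row in self._initial_grid]
        self.agent_pos = self.start
        return self._obs()


# Regression check: reset() must restore the same layout on the same object.
env = GridWorld(seed=0)
layout_before = [row[:] for row in env.grid]
env.grid[5][5] = 1 - env.grid[5][5]  # simulate an in-episode change to the grid
env.agent_pos = (3, 4)
obs = env.reset()
print("layout restored:", env.grid == layout_before)
```

Because this sketch stores the grid as nested lists, a plain `.copy()` would be shallow and share the rows, hence the row-by-row copy; with a NumPy array, `.copy()` as in the snippet above is sufficient.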

Result

After the fix, the agent reached the goal within ~320 episodes (ε=0.1, γ=0.99). The average reward curve stabilized, with a clear upward trend by episode 400.

Lessons Learned

  • State consistency is critical in RL — environment identity must be stable across episodes unless you intentionally want non-stationarity.
  • Debugging is hypothesis-driven — log exploration params before tuning rewards.
  • Always trace object lifetimes on reset() — a common footgun in custom envs.