Why My Q-Learning Agent Failed to Learn | Myint Myat Thein

Note: This is a representative case study illustrating debugging workflow. Values are illustrative for portfolio documentation.

Problem

A tabular Q-learning agent on a 10×10 grid world never reached the goal after 5,000 episodes. Average reward flatlined near zero.

Initial Hypothesis

› Reward shaping — goal reward too small vs step penalty.
› Exploration — ε-greedy decayed too fast; agent stopped exploring.
› Learning rate — α too high caused Q-value oscillation.

Investigation

Step	Action	Result
1	Logged ε, α, episode reward	ε reached 0.01 by episode 200; exploration collapsed early
2	Raised goal reward 10×	Slight improvement; still no consistent goal hits
3	Printed Q-table max per episode	Q-values reset to zeros every 100 episodes
4	Traced `env.reset()`	New environment instance created on each reset, orphaning learned state

Root Cause

reset() instantiated a new GridWorld object instead of clearing the existing grid state. The Q-table keyed on (row, col) still referenced old coordinates, but transition dynamics changed when internal obstacle layout was re-randomized on construction.

State consistency broke between episodes — the agent was learning on a moving target.

The Bellman update the agent was trying to apply (on a stable environment) is:

Q (s, a) \leftarrow Q (s, a) + α [r + γ a^{'} max Q (s^{'}, a^{'}) - Q (s, a)]

With $α = 0.1$ and $γ = 0.99$ , oscillation from resetting $Q$ every 100 episodes masked any convergence signal.

Fix

 def reset(self):
    self.grid = self._initial_grid.copy()
    self.agent_pos = self.start
    return self._obs()

› Reuse one environment instance for the full training run.
› Separate reset() from __init__ for obstacle generation (fixed per run).

Result

After the fix, the agent reached the goal within ~320 episodes (ε=0.1, γ=0.99). Average reward curve stabilized with clear upward trend by episode 400.

Lessons Learned

› State consistency is critical in RL — environment identity must be stable across episodes unless you intentionally want non-stationarity.
› Debugging is hypothesis-driven — log exploration params before tuning rewards.
› Always trace object lifetimes on reset() — a common footgun in custom envs.

› A* Pathfinding Lab — complementary search intuition on grids