CartPole — Custom observations
You’ve finished the GridWorld tutorial and you can write a TrainingEnv from scratch. This tutorial teaches a more subtle skill: how observations shape learning. A poorly-chosen observe() can make the fastest algorithm crawl. A well-chosen one can make a toy algorithm look brilliant. By the end you’ll have concrete intuition for picking observation spaces on your own envs.
Estimated time: 20–25 minutes.
Prerequisites
- You’ve completed the GridWorld tutorial or have equivalent experience writing a custom TrainingEnv.
- You have a Vite + TypeScript project with @ignitionai/core and @ignitionai/backend-tfjs installed.
- You have the Quickstart’s CartPoleEnv (or equivalent inlined cart-pole physics) in your project.
Step 1 — Start from the full-observation CartPole
First, make sure the Quickstart baseline works. Your src/cartpole-env.ts should return all four state variables from observe():
observe(): number[] {
  return [this.x, this.xDot, this.theta, this.thetaDot]
}
And your src/main.ts trains it with env.train('dqn'). Run it and note how long training takes to reach the 500-step survival ceiling (usually 30–60 seconds on modern hardware at setSpeed(10)).
Why this step exists: we need a control. Every experiment in the next steps is compared against this baseline. Write down your baseline time — you’ll refer back to it.
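To make "time to reach the ceiling" comparable across the experiments, it helps to fix one rule for when a run counts as converged. The following is a minimal sketch, assuming you can obtain each episode's length from your training loop (how that hooks into IgnitionAI is not covered here). `ConvergenceTracker`, its default threshold of 475 steps, and its 20-episode window are hypothetical choices for illustration, not part of @ignitionai/core.

```typescript
// Hypothetical helper, not part of @ignitionai/core.
// Declares convergence once the mean length of the last `window`
// episodes reaches `threshold` steps.
class ConvergenceTracker {
  private recent: number[] = []
  private episodes = 0
  private convergedAt: number | null = null
  private readonly threshold: number
  private readonly window: number

  // Defaults are assumptions: 475 is 95% of the 500-step ceiling.
  constructor(threshold = 475, window = 20) {
    this.threshold = threshold
    this.window = window
  }

  // Call once per finished episode. Returns the 1-based episode index at
  // which convergence was first detected, or null if not yet converged.
  record(episodeLength: number): number | null {
    this.episodes += 1
    this.recent.push(episodeLength)
    if (this.recent.length > this.window) this.recent.shift()

    if (this.convergedAt === null && this.recent.length === this.window) {
      const mean = this.recent.reduce((a, b) => a + b, 0) / this.window
      if (mean >= this.threshold) this.convergedAt = this.episodes
    }
    return this.convergedAt
  }
}
```

Use the same threshold and window for every variant in this tutorial, so that differences in the recorded episode index reflect the observations rather than the stopping rule.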
Step 2 — Drop cart-related observations
Create a new file src/partial-env.ts:
import { CartPoleEnv } from './cartpole-env'
export class PartialCartPoleEnv extends CartPoleEnv {
  // Only show the agent the pole angle and angular velocity.
  // Hide cart position and cart velocity completely.
  observe(): number[] {
    // Skip the first two entries (x, xDot) so no unused locals are declared.
    const [, , theta, thetaDot] = super.observe()
    return [theta, thetaDot]
  }
}
Update src/main.ts to use it:
import { IgnitionEnvTFJS } from '@ignitionai/backend-tfjs'
import { PartialCartPoleEnv } from './partial-env'
const env = new IgnitionEnvTFJS(new PartialCartPoleEnv())
env.train('dqn')
env.setSpeed(10)
Run it. What to observe: training still works. The pole can be balanced from pole angle and angular velocity alone, because the pole’s equations of motion don’t depend on cart position. But the agent faces a harder problem: it can’t tell when the cart is about to hit the x boundary (at ±2.4 units). You should see more episodes that end because the cart runs off the track rather than because the pole falls, and slightly slower convergence.
Why this step exists: this is the most valuable lesson in the tutorial. The pole is solvable with fewer observations, but the agent learns a policy that works in the lab and fails at the boundaries. Missing observations silently limit the policy. Real envs have this failure mode constantly — the agent works great in training and degrades in deployment because training never exposed it to the edge cases.
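The claim that the pole dynamics ignore cart position can be checked directly. Below is a sketch of the classic cart-pole update with commonly used constants. These constants and the `step` and `isDone` names are assumptions, since your own CartPoleEnv may differ. Note that `x` never appears in either acceleration. It only matters in the termination check.

```typescript
// Commonly used cart-pole constants (assumed; check your own CartPoleEnv).
const GRAVITY = 9.8
const CART_MASS = 1.0
const POLE_MASS = 0.1
const POLE_HALF_LENGTH = 0.5
const FORCE_MAG = 10.0
const DT = 0.02
const X_LIMIT = 2.4
const THETA_LIMIT = (12 * Math.PI) / 180

type State = { x: number; xDot: number; theta: number; thetaDot: number }

// One explicit Euler step. `action` is 0 (push left) or 1 (push right).
function step(s: State, action: 0 | 1): State {
  const force = action === 1 ? FORCE_MAG : -FORCE_MAG
  const totalMass = CART_MASS + POLE_MASS
  const sin = Math.sin(s.theta)
  const cos = Math.cos(s.theta)
  const temp =
    (force + POLE_MASS * POLE_HALF_LENGTH * s.thetaDot ** 2 * sin) / totalMass
  const thetaAcc =
    (GRAVITY * sin - cos * temp) /
    (POLE_HALF_LENGTH * (4 / 3 - (POLE_MASS * cos ** 2) / totalMass))
  const xAcc = temp - (POLE_MASS * POLE_HALF_LENGTH * thetaAcc * cos) / totalMass
  // s.x does not appear in thetaAcc or xAcc above.
  return {
    x: s.x + DT * s.xDot,
    xDot: s.xDot + DT * xAcc,
    theta: s.theta + DT * s.thetaDot,
    thetaDot: s.thetaDot + DT * thetaAcc,
  }
}

// Termination is the only place where cart position matters.
function isDone(s: State): boolean {
  return Math.abs(s.x) > X_LIMIT || Math.abs(s.theta) > THETA_LIMIT
}
```

Two rollouts that start at different cart positions but receive the same actions produce identical pole trajectories. The only difference is which one crosses the x limit first, and that is exactly the information PartialCartPoleEnv hides from the agent.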
Step 3 — Add noise to the observations
Restore the full four observations but add Gaussian noise:
import { CartPoleEnv } from './cartpole-env'
function gaussian(): number {
  // Box–Muller transform. Math.random() can return 0 and Math.log(0) is
  // -Infinity, so shift u1 into (0, 1].
  const u1 = 1 - Math.random()
  const u2 = Math.random()
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
}
export class NoisyCartPoleEnv extends CartPoleEnv {
  observe(): number[] {
    const [x, xDot, theta, thetaDot] = super.observe()
    // Standard deviation of the additive noise, in each observation's own units.
    const noise = 0.05
    return [
      x + noise * gaussian(),
      xDot + noise * gaussian(),
      theta + noise * gaussian(),
      thetaDot + noise * gaussian(),
    ]
  }
}
Point main.ts at NoisyCartPoleEnv and train. What to observe: training still converges, but the final policy is visibly jittery: the agent over-corrects because it can’t fully trust any single observation. Convergence time is longer than the baseline, shorter than PartialCartPoleEnv.
Why this step exists: real-world sensors are noisy. A camera-based pose estimate, an IMU, a LIDAR distance — all of them have noise. Training with noisy observations is a form of regularization: the policy learns to be robust to uncertainty. If you plan to deploy to a real robot, adding noise during training is often the single biggest policy-robustness win.
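One thing to watch when adding noise: a single absolute standard deviation affects each dimension differently, because the dimensions have different ranges (the cart limit is ±2.4 units, while pole-angle limits are typically much smaller). The sketch below takes a separate sigma per dimension and an injectable random source so runs can be reproduced. `mulberry32` is a small, widely used seedable generator. The names `Rng` and `addNoise` and the sigma values in the usage note are illustrative assumptions.

```typescript
type Rng = () => number

// mulberry32: small seedable PRNG returning values in [0, 1).
function mulberry32(seed: number): Rng {
  let a = seed | 0
  return () => {
    a = (a + 0x6d2b79f5) | 0
    let t = Math.imul(a ^ (a >>> 15), 1 | a)
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

// Box–Muller transform; 1 - rng() keeps the log argument in (0, 1].
function gaussian(rng: Rng): number {
  const u1 = 1 - rng()
  const u2 = rng()
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2)
}

// Adds zero-mean Gaussian noise with its own standard deviation per dimension.
function addNoise(obs: number[], sigmas: number[], rng: Rng): number[] {
  if (obs.length !== sigmas.length) {
    throw new Error('obs and sigmas must have the same length')
  }
  return obs.map((v, i) => v + sigmas[i] * gaussian(rng))
}
```

Inside observe() this might look like `addNoise(super.observe(), [0.05, 0.05, 0.005, 0.05], rng)`, where the sigmas are placeholder values you would tune to match each sensor you expect to deploy with.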
Step 4 — Add a redundant observation
Give the agent an extra “helpful” observation it can compute from the existing ones:
import { CartPoleEnv } from './cartpole-env'
export class RedundantCartPoleEnv extends CartPoleEnv {
  observe(): number[] {
    const [x, xDot, theta, thetaDot] = super.observe()
    // Extra obs: how far the pole's tip has moved from the balanced position
    const tipOffset = Math.sin(theta) * 0.5 + x
    return [x, xDot, theta, thetaDot, tipOffset]
  }
}
Train it. What to observe: training converges at roughly the same speed as the baseline. The extra observation neither helps nor hurts significantly. You might even see a small improvement if you’re lucky.
Why this step exists: a common mistake is to think “more features = better agent.” In practice, adding redundant features that a neural network could easily derive itself has a neutral-to-slightly-positive effect. It’s never a silver bullet. The exception is when the derived feature is genuinely hard for the network — e.g., a long-range summary of recent history, or a feature that requires external knowledge. Those help. Repeating information the env already provides does not.
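A summary of recent history is one kind of feature a feed-forward network cannot derive from a single observation. Here is a minimal sketch of a history buffer that concatenates the last few observation vectors. `HistoryStack` is a hypothetical helper, not an IgnitionAI API, and how you call it from your env's step and reset logic is up to you.

```typescript
// Hypothetical helper: keeps the last `depth` observation vectors and
// returns them concatenated, oldest first.
class HistoryStack {
  private frames: number[][] = []
  private readonly depth: number

  constructor(depth: number) {
    if (depth < 1) throw new Error('depth must be at least 1')
    this.depth = depth
  }

  // Call at the start of each episode so frames do not leak across resets.
  reset(): void {
    this.frames = []
  }

  // Call once per environment step with the newest observation.
  push(obs: number[]): number[] {
    if (this.frames.length === 0) {
      // First frame after a reset: pad with copies so the output length
      // is always depth * obs.length.
      for (let i = 0; i < this.depth; i++) this.frames.push([...obs])
    } else {
      this.frames.push([...obs])
      this.frames.shift()
    }
    return ([] as number[]).concat(...this.frames)
  }
}
```

For example, an env that exposes only theta could push it through a depth-2 stack, which gives the network enough information to estimate angular velocity from consecutive angles. Push once per step rather than once per observe() call, since observe() may be called more than once between steps.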
What you just learned
Three concrete lessons, in the order you should internalize them:
- Missing observations silently limit the policy. PartialCartPoleEnv looked like it was working, but the agent couldn’t learn cart-boundary avoidance because it literally couldn’t see the cart. Your env needs to expose every dimension the agent needs to act on.
- Noisy observations are a form of regularization. NoisyCartPoleEnv converged slower but produced a more robust policy. If you plan to deploy to hardware, add noise during training.
- Redundant features are usually a wash. RedundantCartPoleEnv didn’t help meaningfully. Feature engineering in RL is rarely the bottleneck; the bottleneck is almost always the reward signal and the observation coverage.
The meta-lesson: before tuning hyperparameters, audit your observations. Nine times out of ten, a DQN agent that isn’t learning is missing something it should have seen.
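An observation audit can start as a few summary statistics over a batch of observation vectors collected by calling observe() during random play. The sketch below flags dimensions that contain non-finite values, never change, or have a large magnitude. `auditObservations` is a hypothetical helper, and the magnitude cutoff of 100 is an arbitrary assumption to adjust for your env.

```typescript
type DimReport = {
  index: number
  min: number
  max: number
  std: number
  issues: string[]
}

// Summarizes each dimension of a batch of observation vectors.
function auditObservations(samples: number[][]): DimReport[] {
  if (samples.length === 0) throw new Error('need at least one sample')
  const dims = samples[0].length
  const reports: DimReport[] = []

  for (let d = 0; d < dims; d++) {
    const column = samples.map((s) => s[d])
    const finite = column.filter((v) => Number.isFinite(v))
    const issues: string[] = []
    if (finite.length < column.length) issues.push('non-finite value')

    const min = finite.reduce((a, b) => Math.min(a, b), Infinity)
    const max = finite.reduce((a, b) => Math.max(a, b), -Infinity)
    const mean = finite.reduce((a, b) => a + b, 0) / finite.length
    const variance =
      finite.reduce((a, b) => a + (b - mean) ** 2, 0) / finite.length
    const std = Math.sqrt(variance)

    if (finite.length > 1 && std === 0) {
      issues.push('constant: carries no information')
    }
    // Assumed cutoff; large inputs often train better after normalization.
    if (finite.length > 0 && Math.max(Math.abs(min), Math.abs(max)) > 100) {
      issues.push('large magnitude: consider normalizing')
    }
    reports.push({ index: d, min, max, std, issues })
  }
  return reports
}
```

A check like this catches broken or uninformative dimensions, but it cannot tell you that a needed dimension is missing altogether, as in PartialCartPoleEnv. For that, compare the list of observed quantities against every condition that ends an episode or changes the reward.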
Next steps
- MountainCar: reward shaping — the reward-signal equivalent of this tutorial. Observations control what the agent can learn; rewards control why it learns.
- Algorithms → DQN — Section “Failure 1 — Rewards never go up” is where bad observations usually surface.
- Tutorials index — pick your next tutorial.