Truncated, Terminated, Done
This article covers the story behind, and the importance of, three related signals in reinforcement learning: terminated, truncated, and done. Understanding how they differ is critical to getting an RL algorithm right.
The standard introduction to RL goes something like this: you have an agent, a.k.a. a policy. This policy sees observations and picks actions based on them. From data, it then finds which actions correlate with high reward and updates itself so as to produce actions that maximise the expected future reward.
But much of what you need for a working RL setup is left out of this standard picture. Many typical problems we might want RL to solve involve a constant reward and end in termination:
- hold a pendulum upright as long as possible. This is typically encoded as a positive reward, e.g. +1, on each time step while the pendulum is sufficiently upright.
- leave the maze as quickly as possible. This is typically encoded as a negative reward, e.g. -1, on each time step as long as the agent hasn't reached the exit.
These settings occur in many real-life problems, as they describe two things we often want: keeping something in a good state, and not wasting time on the way to a goal.
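As a rough sketch (the function names and the threshold are illustrative, not taken from any particular environment), those two reward schemes could be encoded like this:

# Illustrative reward functions for the two settings above.

def pendulum_reward(angle_from_upright: float, threshold: float = 0.2) -> float:
    # +1 on every time step while the pendulum stays sufficiently upright.
    return 1.0 if abs(angle_from_upright) < threshold else 0.0

def maze_reward(reached_exit: bool) -> float:
    # -1 on every time step until the agent reaches the exit.
    return 0.0 if reached_exit else -1.0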
But if our reward is constant, how can the agent learn to maximise it? While the reward is constant, termination is what determines the expected future reward: with a constant per-step reward, the future reward is simply a function of how many timesteps the agent has left before termination. That's a deep existential thought too, but I digress.
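To make that concrete, here is a tiny sketch (numbers purely illustrative) showing that, with a constant +1 reward per step, the remaining return depends only on how many steps are left before termination:

# Discounted return of a constant +1-per-step reward until termination.
def remaining_return(steps_left: int, gamma: float = 0.99) -> float:
    return sum(gamma**k for k in range(steps_left))

print(remaining_return(10))   # ~9.6
print(remaining_return(100))  # ~63.4: more steps before termination means a higher return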
Now, when openai gym was released, it didn't have a termination signal; instead it had a "done" signal:
obs, reward, done, info = env.step(action)
done was often interpreted as termination. In algorithms that bootstrap, i.e. use a value function V(s) or Q(s,a) to estimate the future value of a state, the difference lies in whether you set that value to zero for the last state. For an actor-critic algorithm it might look like the formula below, where the subscript n means the value at timestep n:
Q(s_1, a_1) = r_1 + gamma * Q(s_2, a_2) * (not termination_1)
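In code, that target could be computed with a minimal sketch like the one below (the function name is my own, not from any library):

# One-step bootstrapped target: the future value is zeroed out on termination.
def td_target(reward: float, q_next: float, terminated: bool, gamma: float = 0.99) -> float:
    return reward + gamma * q_next * (0.0 if terminated else 1.0)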
Sometimes done indeed meant the episode was over. On the other hand, for tasks with no true termination, like the forever-running HalfCheetah, episodes were in fact just truncated at 1000 steps. This is also known as a TimeLimit, and it is done purely for convenience, to keep episodes of a repeatable and not-too-long duration. However, when truncation was treated as termination, setting the future reward to zero told the agent there was no future to consider. So it could just as well land the half-cheetah head-down, legs-up, as long as that brought a few extra bits of reward before step 1000. This is of course not the behavior you would want your robot to learn.
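Conceptually, such a time limit is just a wrapper counting steps. A rough sketch of the old done-only behavior (not gym's actual TimeLimit code):

class OldStyleTimeLimit:
    # Sketch of the old done-only time limit: after max_steps it reports
    # done=True, indistinguishable from a real termination.
    def __init__(self, env, max_steps=1000):
        self.env, self.max_steps, self.t = env, max_steps, 0

    def reset(self):
        self.t = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        if self.t >= self.max_steps:
            done = True  # truncation, but reported as termination
        return obs, reward, done, info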
Many researchers noticed this and got frustrated with this shortcoming of such a widespread API, and in version 0.26 the openai gym package changed the API to return two booleans, terminated and truncated, instead of done. This was, of course, a breaking change. There was quite a bit of discussion on how it could have been done; some other options included:
- returning an object with 3 possible values — x.terminated, x.truncated, x.not_done
- since done = terminated or truncated, it was also possible to return e.g. terminated, done, which in my opinion would have been better from a practical standpoint: in most cases one still ends up calculating done for bookkeeping, since it's really important to know when your episode ends. Truncated, on the other hand, is strictly less informative than done.
Anyway, you now have five return values in the standard environment API, as you can see in the intro to the current version of gymnasium (ex openai gym):
import gymnasium as gym
env = gym.make("LunarLander-v2", render_mode="human")
observation, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # agent policy that uses the observation and info
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()
env.close()
As you can see, they still end up calculating done: if terminated or truncated: .... What do you do with these values? For one thing, you compute done by or-ing the two. And then, remember to set your future value to zero when terminated is True, but not when only truncated is. Sounds simple, right?
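A minimal sketch of both points together, assuming next_value is your own estimate of the value of the next observation and gamma is your discount factor:

# Bootstrap target: only true termination zeroes out the future value.
target = reward + gamma * next_value * (0.0 if terminated else 1.0)

# Episode bookkeeping: either flag means the episode is over.
done = terminated or truncated
if done:
    observation, info = env.reset()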
And one last point: when testing or benchmarking your algorithm, include an environment where termination is interesting, like InvertedPendulum; it should help you catch any issues early.
Stay tuned for more bite-sized (byte-sized?), practical articles about RL!