Next: Introduction
Tight Performance Bounds on Greedy Policies
Based on Imperfect Value Functions
Ronald J. Williams
College of Computer Science
Northeastern University
Boston, MA 02115
rjw@ccs.neu.edu
Leemon C. Baird, III
Wright Laboratory
Wright-Patterson Air Force Base, OH 45433-6543
bairdlc@wL.wpafb.af.mil
Northeastern University
College of Computer Science
Technical Report NU-CCS-93-14
November 24, 1993
Abstract:
Consider a given value function on states of a Markov decision problem,
as might result from applying a reinforcement learning algorithm.
Unless this value function equals the corresponding optimal value function,
at some states there will be a discrepancy, which is natural to call
the Bellman residual, between what the value function specifies at that
state and what is obtained by a one-step lookahead along the seemingly best
action at that state using the given value function to evaluate all
succeeding states. This paper derives a tight bound on how far
from optimal the discounted return for a greedy policy based on the given
value function will be as a function of the maximum norm magnitude
of this Bellman residual. A corresponding result is also obtained for
value functions defined on state-action pairs, as are used in Q-learning.
One significant application of these results is to problems where a function
approximator is used to learn a value function, with training of the
approximator based on trying to minimize the Bellman residual across
states or state-action pairs. When control is based on the use of the
resulting value function, this result provides a link between how well
the objectives of function approximator training are met and the quality
of the resulting control.
Leemon Baird
Thu Oct 12 17:01:39 MDT 1995