Key points are not available for this paper at this time.
We consider a system with a finite number S of states s, labeled by the integers 1, 2, , S. Periodically, say once a day, we observe the current state of the system, and then choose an action a from a finite set A of possible actions. As a joint result of the current state s and the chosen action a, two things happen: (1) we receive an immediate income i (s, a) and (2) the system moves to a new state s' with the probability of a particular new state s' given by a function q = q (s' s, a). Finally there is specified a discount factor, 0 < 1, so that the value of unit income n days in the future is ⁿ. Our problem is to choose a policy which maximizes our total expected income. This problem, which is an interesting special case of the general dynamic programming problem, has been solved by Howard in his excellent book 3. The case = 1, also studied by Howard, is substantially more difficult. We shall obtain in this case results slightly beyond those of Howard, though still not complete. Our method, which treats = 1 as a limiting case of < 1, seems rather simpler than Howard's.
David Blackwell (Fri,) studied this question.