Randomized Iterative Methods and Complexity for Markov Decision Processes
Consider Markov decision processes with S states and A actions per state.
We provide the first set of nearly linear and sublinear running-time upper bounds, together with a nearly matching lower bound, for the discounted Markov decision problem. For the upper bounds, we propose a randomized linear programming algorithm that exploits the linear duality between value and policy functions. The method achieves nearly linear running time for general problems, and sublinear running time for ergodic decision processes when the input is given in specific ways. To further improve the complexity, we propose a randomized value iteration method that leverages the contraction property of the Bellman operator and variance reduction techniques. For the lower bound, we show that any algorithm needs at least Omega(S^2 A) running time to obtain a constant-error approximation to the optimal policy. Surprisingly, this lower bound reduces to Omega(SA/epsilon) when the input is specified in suitable data structures. Together, the upper and lower bounds suggest that the complexity of solving MDPs depends on the data structures in which the input is given. These results provide new complexity benchmarks for solving stochastic dynamic programs.
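As a rough, illustrative sketch (not the algorithm analyzed in the paper), the snippet below shows a randomized value iteration for a discounted MDP: each Bellman backup is estimated from sampled next states rather than exact expectations, relying on the contraction of the Bellman operator to converge despite the sampling noise. The names P, r, gamma, iters, and samples are assumptions introduced only for this example.

```python
import numpy as np

def randomized_value_iteration(P, r, gamma, iters=200, samples=64, seed=0):
    """Sampled value iteration sketch for a discounted MDP.

    P: transition tensor of shape (S, A, S) with each P[s, a] a probability
       vector over next states; r: reward matrix of shape (S, A).
    Illustrative only: each backup uses a Monte Carlo estimate of the
    expected next-state value instead of the exact expectation.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    v = np.zeros(S)
    for _ in range(iters):
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                # Draw sampled next states instead of summing over all S states.
                next_states = rng.choice(S, size=samples, p=P[s, a])
                q[s, a] = r[s, a] + gamma * v[next_states].mean()
        v = q.max(axis=1)  # approximate Bellman backup
    policy = q.argmax(axis=1)
    return v, policy
```

Note that in this sketch, drawing a sample from P[s, a] still touches all S entries of the row, so the per-iteration cost is not actually sublinear; the sublinear regimes discussed above presume the input is given in a data structure from which next-state samples can be drawn cheaply, which is exactly why the stated complexities depend on how the input is represented.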