Off-policy Policy Optimization
Dale Schuurmans (Google Brain & University of Alberta)
Off-policy optimization seeks to overcome the data inefficiency of on-policy learning by leveraging arbitrarily collected domain observations. Although compensating for data provenance is often considered to be the central challenge in off-policy policy optimization, the choice of training loss and the mechanism used for missing data inference also play a critical role. I will discuss some recent progress in developing alternative loss functions for off-policy optimization that are compatible with principled forms of missing data inference. By leveraging standard concepts from other areas of machine learning, such as calibrated surrogate losses and empirical Bayes estimation, simple policy optimization techniques can be derived that are theoretically sound and empirically effective in small-scale scenarios. I will conclude by discussing prospects for scaling these approaches to large problems.
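For readers who want a concrete reference point, the sketch below shows the standard importance-weighted (REINFORCE-style) off-policy gradient on a toy multi-armed bandit, i.e. the kind of provenance correction that the alternative loss functions mentioned above aim to improve on. Everything in it (the bandit, the uniform behaviour policy `mu`, the reward means) is a hypothetical illustration, not the method presented in the talk.

```python
import numpy as np

# Illustrative sketch (not the talk's method): importance-weighted off-policy
# policy gradient for a K-armed bandit, correcting for data provenance with
# weights pi(a) / mu(a).

rng = np.random.default_rng(0)
K = 4
true_means = np.array([0.1, 0.4, 0.7, 0.2])  # hypothetical reward means

# Logged data collected by a fixed behaviour policy mu (not the target policy).
mu = np.full(K, 1.0 / K)
N = 5000
actions = rng.choice(K, size=N, p=mu)
rewards = rng.normal(true_means[actions], 0.1)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.zeros(K)
lr = 0.05
for _ in range(200):
    pi = softmax(theta)
    # Importance-weighted policy-gradient estimate:
    #   E_{a ~ mu}[ (pi(a) / mu(a)) * r * grad log pi(a) ]
    w = pi[actions] / mu[actions]
    grad_log_pi = -np.tile(pi, (N, 1))          # d/d theta of log softmax ...
    grad_log_pi[np.arange(N), actions] += 1.0   # ... is e_a - pi
    grad = (w * rewards)[:, None] * grad_log_pi
    theta += lr * grad.mean(axis=0)

print("learned policy:", softmax(theta).round(3))  # concentrates on the best arm
```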
| Attachment | Size |
|---|---|
| Off-policy Policy Optimization | 2.34 MB |