27 Dec 2013 • Prashanth L. A., Abhranil Chatterjee, Shalabh Bhatnagar
For each criterion, we propose a convergent on-policy Q-learning algorithm that operates on two timescales, while employing function approximation to handle the curse of dimensionality associated with the underlying POMDP.
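
To make the two-timescale idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a hypothetical fully observed 2-state, 2-action toy problem (not a POMDP), one-hot features, a softmax policy, and illustrative step-size exponents. Q-value weights are updated on the faster timescale with an on-policy (SARSA-style) temporal-difference rule under linear function approximation, while policy preferences move on the slower timescale through a simple heuristic update chosen only for illustration. All names (`fast_step`, `slow_step`, `features`, `policy_probs`, `run`) and the cost table are made up for this example.

```python
import numpy as np

def fast_step(n):
    # Faster timescale (assumed schedule): a(n) = 1 / (n + 1)^0.6
    return 1.0 / (n + 1) ** 0.6

def slow_step(n):
    # Slower timescale (assumed schedule): b(n) = 1 / (n + 1), so b(n) / a(n) -> 0
    return 1.0 / (n + 1)

def features(state, action, n_states=2, n_actions=2):
    # One-hot feature vector for a (state, action) pair (assumption for the toy problem)
    phi = np.zeros(n_states * n_actions)
    phi[state * n_actions + action] = 1.0
    return phi

def policy_probs(theta, state):
    # Softmax policy over the per-state action preferences theta[state]
    prefs = theta[state] - theta[state].max()
    e = np.exp(prefs)
    return e / e.sum()

def run(n_steps=5000, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = 2, 2
    w = np.zeros(n_states * n_actions)       # Q-value weights (fast timescale)
    theta = np.zeros((n_states, n_actions))  # policy preferences (slow timescale)
    # Hypothetical single-stage costs cost[s, a]; action 0 stays, action 1 flips the state
    cost = np.array([[1.0, 2.0], [3.0, 0.5]])
    s = 0
    a = rng.choice(n_actions, p=policy_probs(theta, s))
    for n in range(n_steps):
        s_next = s if a == 0 else 1 - s
        a_next = rng.choice(n_actions, p=policy_probs(theta, s_next))
        phi, phi_next = features(s, a), features(s_next, a_next)
        # On-policy TD error: the next action is drawn from the current policy
        td = cost[s, a] + gamma * (w @ phi_next) - (w @ phi)
        w += fast_step(n) * td * phi
        # Heuristic slow update: lower the preference of actions with above-average estimated cost
        q_s = np.array([w @ features(s, b) for b in range(n_actions)])
        theta[s] -= slow_step(n) * (q_s - q_s.mean())
        s, a = s_next, a_next
    return w, theta
```

Calling `run()` returns the learned weight vector and policy preferences. The key design point is the ratio of the two step sizes: because `slow_step(n) / fast_step(n)` tends to zero, the Q-value estimate sees a nearly fixed policy, while the policy sees a nearly converged Q-value estimate.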