RobotLearning: Scaling Continuous Deep Q-Learning Part 2
I explain DDPG as an early deterministic policy gradient method and as a transition from deep Q-learning, which does not work directly for continuous actions because the maximum over actions cannot be enumerated. I detail how we approximate the maximizing action using a mu model, sample batches so the training data is approximately IID, and use target networks for training. I explain the policy update via gradient computation through the Q-function and the Polyak averaging used for target network updates, noting its empirical success but questioning whether the resulting update is a contraction in theory. I then delve into practical challenges, like preventing policy outputs from exploding to infinity, and solutions such as regularization, tanh activations, and gradient squashing. I discuss exploration noise, comparing Gaussian and Ornstein-Uhlenbeck noise, and the choice between discrete and continuous action spaces, emphasizing the importance of multimodality. I highlight how sensitive Q-functions are to the actions the policy feeds them, and how adding noise to the target values, as in TD3, improves robustness and performance. I question the continued use of simple environments like the inverted pendulum for algorithm evaluation, advocating for more complex tasks that better differentiate algorithm performance and reflect real-world challenges, much like progressing from simple addition to complex homework assignments in our studies. Finally, I cover a number of recent papers that explore more scalable versions of deep Q-learning methods using mixtures of experts, layer normalization, and network structure.
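Three of the pieces mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in the lecture: the function names, the hyperparameter defaults (`tau`, `theta`, `sigma`, `noise_std`, `noise_clip`), and the callables `target_mu` and `target_q` standing in for the target policy and target Q-function are all hypothetical. The target computation shows only the noise-on-target-action idea attributed to TD3 above, not the full algorithm, and the policy gradient through the Q-function is omitted because it needs an autodiff library.

```python
import numpy as np


def polyak_update(target_params, online_params, tau=0.005):
    """Polyak averaging: move each target parameter a small step toward the online one."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def ou_noise_step(x, rng, theta=0.15, sigma=0.2, mu=0.0, dt=1e-2):
    """One Euler step of an Ornstein-Uhlenbeck process (temporally correlated noise).

    Gaussian exploration noise would instead draw a fresh, independent sample each step.
    """
    drift = theta * (mu - x) * dt
    diffusion = sigma * np.sqrt(dt) * rng.standard_normal(np.shape(x))
    return x + drift + diffusion


def smoothed_td_target(reward, done, next_state, target_mu, target_q, rng,
                       gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """TD target with clipped Gaussian noise added to the target action.

    target_mu(next_state) -> action and target_q(next_state, action) -> value
    are assumed callables representing the target networks.
    """
    action = target_mu(next_state)
    noise = np.clip(noise_std * rng.standard_normal(action.shape), -noise_clip, noise_clip)
    noisy_action = np.clip(action + noise, -max_action, max_action)
    # (1 - done) removes the bootstrap term on terminal transitions.
    return reward + gamma * (1.0 - done) * target_q(next_state, noisy_action)
```

With a small `tau`, the target parameters trail the online ones slowly, which is the stabilizing effect described above; the noise on the target action averages the Q-value over a small neighborhood of actions, so a sharp spike in the Q-function has less influence on the target.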