RobotLearning: Scaling Deep Q-Learning, Part 1
In this lecture segment, I traced the progression from simple bandits to Q-learning, outlining the challenges each setting poses in reinforcement learning and how they are addressed. I began with multi-armed bandits, emphasizing the exploration-exploitation dilemma and introducing methods such as epsilon-greedy and upper confidence bound (UCB) to balance the two. I then moved to contextual bandits, which incorporate state information, and finally to Q-learning, which learns a state-dependent policy. I highlighted the advantages of Q-learning over policy gradients, such as its ability to learn from off-policy data and its lower variance. I then turned to approximate dynamic programming, explaining how value iteration and policy iteration can be used to train a Q-function. I discussed the computational cost of these methods, particularly the need to perform an argmax over all possible actions, and how policy iteration can reduce this cost by bootstrapping on previous policies. I concluded by hinting at the possibility of combining policy evaluation and improvement into a single step for further efficiency.
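To make the contrast between value iteration and policy iteration concrete, here is a minimal sketch for a small tabular MDP. It is an illustration under stated assumptions, not the code used in the lecture: the function names (`random_mdp`, `q_value_iteration`, `q_policy_iteration`), the discount factor, and the randomly generated MDP are all hypothetical, and the lecture's setting uses function approximation rather than exact tables.

```python
import numpy as np


def random_mdp(num_states=5, num_actions=3, seed=0):
    """Build a hypothetical random MDP: P has shape (S, A, S), R has shape (S, A)."""
    rng = np.random.default_rng(seed)
    P = rng.random((num_states, num_actions, num_states))
    P /= P.sum(axis=2, keepdims=True)  # normalize each row into a distribution
    R = rng.random((num_states, num_actions))
    return P, R


def q_value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman optimality backup until Q stops changing."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iters):
        # Every backup takes a max over all actions in every next state.
        Q_new = R + gamma * (P @ Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q


def q_policy_iteration(P, R, gamma=0.9, max_iters=1_000):
    """Alternate exact policy evaluation with greedy policy improvement."""
    num_states = R.shape[0]
    states = np.arange(num_states)
    policy = np.zeros(num_states, dtype=int)
    for _ in range(max_iters):
        # Evaluation: solve V = R_pi + gamma * P_pi V for the fixed policy.
        # No max over actions is needed in this step.
        P_pi = P[states, policy]  # (S, S)
        R_pi = R[states, policy]  # (S,)
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)
        Q = R + gamma * (P @ V)
        # Improvement: a single argmax over actions per state.
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return Q, policy


P, R = random_mdp()
Q_vi = q_value_iteration(P, R)
Q_pi, greedy_policy = q_policy_iteration(P, R)
print("max |Q_vi - Q_pi| =", np.max(np.abs(Q_vi - Q_pi)))
```

In this sketch, value iteration performs the max over actions inside every backup, while policy iteration evaluates the current policy without any max and only takes an argmax once per improvement step. With a table this small the difference is negligible; the cost matters when the action space is large or the argmax must be approximated.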