Apply value iteration, policy iteration, and Q-learning to develop policies for agents to reach a goal state. Two 'Maze Solving' problems with different state sizes (8X8 and 15X15) are used to investigate how the complexity increases as the state size increases.
Q-Learning (learning algorithm of choice): Q-Learning is a reinforcement learning method that tries to find an optimal action-selection policy for an MDP. It learns an action-value function that gives the expected utility of taking a given action in a given state and following the optimal policy from that point onward. The main advantage of this algorithm is that it can compare the expected utilities of the available actions without requiring a model of the environment. BURLAP provides a class called QLearning, which is used for this purpose in the analysis section.
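For reference, a minimal tabular sketch of the Q-learning update is shown below. It is not BURLAP's QLearning implementation; the MazeEnv interface, the integer state/action indexing, and the epsilon-greedy exploration are illustrative assumptions.

import java.util.Random;

/** Minimal tabular Q-learning sketch (illustrative only, not BURLAP's QLearning). */
public class QLearningSketch {

    /** Hypothetical environment interface: states and actions are indexed 0..n-1. */
    interface MazeEnv {
        int reset();                        // returns the start state
        int numStates();
        int numActions();
        double reward(int s, int a, int sNext);
        int step(int s, int a);             // samples the next state
        boolean isTerminal(int s);
    }

    static double[][] learn(MazeEnv env, int episodes, double alpha, double gamma, double epsilon) {
        double[][] q = new double[env.numStates()][env.numActions()];
        Random rng = new Random(0);

        for (int ep = 0; ep < episodes; ep++) {
            int s = env.reset();
            while (!env.isTerminal(s)) {
                // epsilon-greedy action selection
                int a = rng.nextDouble() < epsilon ? rng.nextInt(env.numActions()) : argmax(q[s]);
                int sNext = env.step(s, a);
                double r = env.reward(s, a, sNext);

                // Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
                double target = r + gamma * q[sNext][argmax(q[sNext])];
                q[s][a] += alpha * (target - q[s][a]);
                s = sNext;
            }
        }
        return q;
    }

    static int argmax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) if (values[i] > values[best]) best = i;
        return best;
    }
}

Because the Q-values of terminal states are never updated and stay at zero, the bootstrap term vanishes automatically when sNext is the goal state.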
All calculations are performed in BURLAP; no external data source is used.
Java 1.8 was used.
The two problems studied are in:
Although the rewards for each algorithm vary, the optimal policies found by value iteration and policy iteration are the same. This could be due to 2 reasons:
The Markov decision process (MDP) planning algorithms and the reinforcement learning algorithm behave very differently, which follows from the difference in the nature of MDP planning and RL.
MDP planning uses a model to create a planner that directs future actions, whereas RL learns from many sampled transitions and then comes up with a policy (as sketched below).
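To make the planning side concrete, the sketch below shows value iteration sweeping over an explicit transition model, which is the kind of computation the planners perform before any policy is extracted. The arrays P and R and the convergence threshold are illustrative assumptions, not BURLAP's data structures.

/** Illustrative value iteration over an explicit MDP model (not BURLAP's planner). */
public class ValueIterationSketch {

    /**
     * P     transition model: P[s][a][sNext] = Pr(sNext | s, a)  (assumed known)
     * R     reward model:     R[s][a]                            (assumed known)
     * gamma discount factor, e.g. 0.9 as in this report
     * theta convergence threshold on the maximum change in utility
     */
    static double[] valueIteration(double[][][] P, double[][] R, double gamma, double theta) {
        int nStates = P.length, nActions = P[0].length;
        double[] v = new double[nStates];

        double delta;
        do {
            delta = 0.0;
            for (int s = 0; s < nStates; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int a = 0; a < nActions; a++) {
                    // Bellman backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')
                    double qsa = R[s][a];
                    for (int sNext = 0; sNext < nStates; sNext++) {
                        qsa += gamma * P[s][a][sNext] * v[sNext];
                    }
                    best = Math.max(best, qsa);
                }
                delta = Math.max(delta, Math.abs(best - v[s]));
                v[s] = best;
            }
        } while (delta > theta);
        return v;
    }
}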
The rewards from the MDP planners were greater than those from Q-learning. This could be because Q-learning, as a reinforcement learning algorithm, does not take a model as input but works from samples instead. So rather than computing a policy exactly from the model, reinforcement learning estimates one from sampled experience, which brings more unpredictability into the resulting policy.
Time taken: Policy iteration > Value iteration > Q-learning
This is not as expected, because policy iteration is supposed to overcome the inefficiency of value iteration, where the maximum change in utility barely moves once the values are close to convergence yet the sweeps continue.
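For comparison, policy iteration alternates a policy-evaluation step with a greedy improvement step, which is why it can stop after fewer outer iterations even though each iteration is heavier: the evaluation step itself loops until the value of the current policy settles. The skeleton below uses the same assumed model arrays as the value-iteration sketch above and is illustrative only.

/** Illustrative policy iteration skeleton, using the same assumed model arrays as above. */
public class PolicyIterationSketch {

    static int[] policyIteration(double[][][] P, double[][] R, double gamma, double theta) {
        int nStates = P.length, nActions = P[0].length;
        int[] policy = new int[nStates];          // start from an arbitrary policy
        double[] v = new double[nStates];
        boolean stable;

        do {
            // 1) Policy evaluation: iterate V under the fixed policy until it settles.
            double delta;
            do {
                delta = 0.0;
                for (int s = 0; s < nStates; s++) {
                    double newV = backup(P, R, v, s, policy[s], gamma);
                    delta = Math.max(delta, Math.abs(newV - v[s]));
                    v[s] = newV;
                }
            } while (delta > theta);

            // 2) Policy improvement: act greedily with respect to the evaluated values.
            stable = true;
            for (int s = 0; s < nStates; s++) {
                int bestA = policy[s];
                double bestQ = backup(P, R, v, s, bestA, gamma);
                for (int a = 0; a < nActions; a++) {
                    double q = backup(P, R, v, s, a, gamma);
                    if (q > bestQ) { bestQ = q; bestA = a; }
                }
                if (bestA != policy[s]) { policy[s] = bestA; stable = false; }
            }
        } while (!stable);
        return policy;
    }

    /** One Bellman backup for a specific action: R(s,a) + gamma * sum_s' P(s'|s,a) * V(s'). */
    static double backup(double[][][] P, double[][] R, double[] v, int s, int a, double gamma) {
        double q = R[s][a];
        for (int sNext = 0; sNext < P[s][a].length; sNext++) q += gamma * P[s][a][sNext] * v[sNext];
        return q;
    }
}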
If there is, on average, a constant number of next states with non-zero probability, then the cost per iteration is linear in the number of states and linear in the number of actions. The 15X15 maze has almost 4X the total states of the 8X8 maze (225 vs. 64, a ratio of roughly 3.5). So, as expected, the overall time taken is about 4 times that of the 8X8 problem.
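As a rough sanity check on that claim, the snippet below compares the number of Bellman backups per sweep for the two grid sizes, under the simplifying assumptions that every grid cell is a state and that there are 4 actions; the actual mazes may exclude wall cells, so the figures are illustrative only.

/** Rough per-sweep cost comparison for the two maze sizes (illustrative assumptions:
 *  every grid cell is a state, 4 actions per state, and a constant number of
 *  successor states per (state, action) pair, so one sweep costs O(|S| * |A|)). */
public class SweepCostSketch {
    public static void main(String[] args) {
        int actions = 4;
        int statesSmall = 8 * 8;      // 64 states
        int statesLarge = 15 * 15;    // 225 states

        int backupsSmall = statesSmall * actions;   // 256 backups per sweep
        int backupsLarge = statesLarge * actions;   // 900 backups per sweep

        System.out.printf("8x8:   %d backups per sweep%n", backupsSmall);
        System.out.printf("15x15: %d backups per sweep%n", backupsLarge);
        System.out.printf("ratio: %.2f%n", (double) backupsLarge / backupsSmall); // ~3.52
    }
}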
All three algorithms on each maze converged at iteration 9, so there is not much difference in convergence in terms of iteration count. Policy iteration, however, takes more time per iteration than value iteration.
*To simplify the problem, all comparisons were performed with a discount factor (γ) of 0.9, and the only reward is obtained when exiting from the top-right corner.