Foundations of Intelligent and Learning Agents
Solution of the multi-armed bandit problem and performance analysis of different sampling algorithms: round-robin, epsilon-greedy, UCB, KL-UCB and Thompson sampling
Solution of MDPs using linear programming and Howard's policy iteration. Reconstruction of a family of MDPs (differing in discount factor) based on the same value function for a certain range of discount factors
Estimation of the value function of a policy for a given MDP from a trajectory of the form state, action, reward, state, action, reward, …
Simulation of the “Windy Gridworld” environment (as an episodic MDP), solution of the environment using a SARSA(0) agent, and further analysis
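
As an illustration of the last item, here is a minimal sketch of a SARSA(0) agent on Windy Gridworld. It is not the project's actual code. The grid size, wind strengths, start and goal cells, and the reward of -1 per step are assumptions taken from the standard textbook version of the environment (Sutton and Barto). The names `step`, `epsilon_greedy` and `train_sarsa`, and the hyperparameter values, are hypothetical.

```python
import random

# Assumed layout: the standard textbook Windy Gridworld (7 rows x 10 columns).
# None of these constants come from the project description itself.
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]          # upward push for each column
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def step(state, action):
    """Apply the move plus the wind of the current column; reward is -1 per step."""
    row, col = state
    d_row, d_col = ACTIONS[action]
    new_row = min(max(row + d_row - WIND[col], 0), ROWS - 1)
    new_col = min(max(col + d_col, 0), COLS - 1)
    next_state = (new_row, new_col)
    return next_state, -1, next_state == GOAL


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    values = q[state]
    return values.index(max(values))


def train_sarsa(episodes=300, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Run tabular SARSA(0); return the Q-table and the length of each episode."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * len(ACTIONS) for r in range(ROWS) for c in range(COLS)}
    lengths = []
    for _ in range(episodes):
        state = START
        action = epsilon_greedy(q, state, epsilon, rng)
        steps, done = 0, False
        while not done:
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(q, next_state, epsilon, rng)
            # On-policy target: uses the action actually chosen in the next state.
            target = reward + (0.0 if done else gamma * q[next_state][next_action])
            q[state][action] += alpha * (target - q[state][action])
            state, action = next_state, next_action
            steps += 1
        lengths.append(steps)
    return q, lengths
```

Calling `train_sarsa()` returns the learned action values and the number of steps taken in each episode. Plotting episode lengths against episode index is one simple way to begin the kind of further analysis mentioned above.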