# Inverse-Reinforcement-Learning

**Repository Path**: wolf953/Inverse-Reinforcement-Learning

## Basic Information

- **Project Name**: Inverse-Reinforcement-Learning
- **License**: MIT
- **Default Branch**: master
- **Created**: 2021-02-03
- **Last Updated**: 2021-02-03

## README

# Inverse Reinforcement Learning

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.555999.svg)](https://doi.org/10.5281/zenodo.555999)

Implements selected inverse reinforcement learning (IRL) algorithms as part of COMP3710, supervised by Dr Mayank Daswani and Dr Marcus Hutter. My final report is available [here](http://matthewja.com/pdfs/irl.pdf) and describes the implemented algorithms.

If you use this code in your work, you can cite it as follows:

```bibtex
@misc{alger16,
  author = {Matthew Alger},
  title = {Inverse Reinforcement Learning},
  year = 2016,
  doi = {10.5281/zenodo.555999},
  url = {https://doi.org/10.5281/zenodo.555999}
}
```

## Algorithms implemented

- Linear programming IRL. From Ng & Russell, 2000. Small state space and large state space linear programming IRL.
- Maximum entropy IRL. From Ziebart et al., 2008.
- Deep maximum entropy IRL. From Wulfmeier et al., 2015; original derivation.

Additionally, the following MDP domains are implemented:

- Gridworld (Sutton, 1998)
- Objectworld (Levine et al., 2011)

## Requirements

- NumPy
- SciPy
- CVXOPT
- Theano
- Matplotlib (for examples)

## Module documentation

The following is a brief list of the functions and classes exported by each module. Full documentation is included in the docstring of each function or class; only functions and classes intended for use outside the module are documented here.

### linear_irl

Implements linear programming inverse reinforcement learning (Ng & Russell, 2000).

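
For orientation, the small-state-space formulation in Ng & Russell (2000) can be written as the linear program below. The notation is the paper's rather than this module's: $P_a$ is the transition matrix for action $a$, $P_a(i)$ is its $i$-th row, $a_1$ denotes the action the policy takes in each state, $\gamma$ is the discount factor, and $\lambda$ is the L1 penalty weight. That $\lambda$ and $R_{\max}$ correspond to the `l1` and `Rmax` arguments of `irl` is an assumption based on the parameter names, not something this README states.

$$
\begin{aligned}
\max_{R}\quad & \sum_{i=1}^{N} \min_{a \in A \setminus \{a_1\}} \left\{ \left(P_{a_1}(i) - P_a(i)\right)\left(I - \gamma P_{a_1}\right)^{-1} R \right\} - \lambda \lVert R \rVert_1 \\
\text{s.t.}\quad & \left(P_{a_1} - P_a\right)\left(I - \gamma P_{a_1}\right)^{-1} R \succeq 0 \quad \forall a \in A \setminus \{a_1\}, \\
& |R_i| \le R_{\max}, \quad i = 1, \dots, N.
\end{aligned}
$$
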
**Functions:**

- `irl(n_states, n_actions, transition_probability, policy, discount, Rmax, l1)`: Find a reward function with inverse RL.
- `large_inverseRL(value, transition_probability, feature_matrix, n_states, n_actions, policy)`: Find the reward in a large state space.

### maxent

Implements maximum entropy inverse reinforcement learning (Ziebart et al., 2008).

**Functions:**

- `irl(feature_matrix, n_actions, discount, transition_probability, trajectories, epochs, learning_rate)`: Find the reward function for the given trajectories.
- `find_svf(feature_matrix, n_actions, discount, transition_probability, trajectories, epochs, learning_rate)`: Find the state visitation frequency from trajectories.
- `find_feature_expectations(feature_matrix, trajectories)`: Find the feature expectations for the given trajectories. This is the average path feature vector.
- `find_expected_svf(n_states, r, n_actions, discount, transition_probability, trajectories)`: Find the expected state visitation frequencies using algorithm 1 from Ziebart et al. 2008.
- `expected_value_difference(n_states, n_actions, transition_probability, reward, discount, p_start_state, optimal_value, true_reward)`: Calculate the expected value difference, which is a proxy for how good a recovered reward function is.

### deep_maxent

Implements deep maximum entropy inverse reinforcement learning based on Ziebart et al., 2008 and Wulfmeier et al., 2015, using symbolic methods with Theano.

**Functions:**

- `irl(structure, feature_matrix, n_actions, discount, transition_probability, trajectories, epochs, learning_rate, initialisation="normal", l1=0.1, l2=0.1)`: Find the reward function for the given trajectories.
- `find_svf(n_states, trajectories)`: Find the state visitation frequency from trajectories.
- `find_expected_svf(n_states, r, n_actions, discount, transition_probability, trajectories)`: Find the expected state visitation frequencies using algorithm 1 from Ziebart et al. 2008.

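
The two trajectory statistics listed above (state visitation frequency and feature expectations) can be illustrated with a small standalone NumPy sketch. This is not the package's code: the function names `state_visitation_frequency` and `feature_expectations` are hypothetical, and the sketch assumes that trajectories form an array of shape `(n_trajectories, trajectory_length, k)` whose first entry per step is the state index, and that both statistics are averaged over the number of trajectories.

```python
import numpy as np


def state_visitation_frequency(n_states, trajectories):
    """Average number of visits to each state per trajectory.

    Assumes trajectories[t, l, 0] is the state index at step l of
    trajectory t (an assumption of this sketch).
    """
    states = np.asarray(trajectories)[:, :, 0].astype(int)
    counts = np.bincount(states.ravel(), minlength=n_states)
    return counts / states.shape[0]


def feature_expectations(feature_matrix, trajectories):
    """Sum of the feature vectors along each path, averaged over paths."""
    states = np.asarray(trajectories)[:, :, 0].astype(int)
    # feature_matrix[states] has shape (n_trajectories, length, n_features).
    return feature_matrix[states].sum(axis=(0, 1)) / states.shape[0]


# Two demonstration trajectories of (state, action) pairs over 3 states.
demo = np.array([
    [[0, 1], [1, 1], [1, 0]],
    [[0, 0], [0, 1], [2, 1]],
])
svf = state_visitation_frequency(3, demo)
mu = feature_expectations(np.eye(3), demo)
print(svf, mu)
```

With an identity feature matrix (one indicator feature per state) the two statistics coincide, which makes a convenient sanity check.
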
### value_iteration

Find the value function associated with a policy. Based on Sutton & Barto, 1998.

**Functions:**

- `value(policy, n_states, transition_probabilities, reward, discount, threshold=1e-2)`: Find the value function associated with a policy.
- `optimal_value(n_states, n_actions, transition_probabilities, reward, discount, threshold=1e-2)`: Find the optimal value function.
- `find_policy(n_states, n_actions, transition_probabilities, reward, discount, threshold=1e-2, v=None, stochastic=True)`: Find the optimal policy.

### mdp

#### gridworld

Implements the gridworld MDP.

**Classes, instance attributes, methods:**

- `Gridworld(grid_size, wind, discount)`: Gridworld MDP.
    - `actions`: Tuple of (dx, dy) actions.
    - `n_actions`: Number of actions. int.
    - `n_states`: Number of states. int.
    - `grid_size`: Size of grid. int.
    - `wind`: Chance of moving randomly. float.
    - `discount`: MDP discount factor. float.
    - `transition_probability`: NumPy array with shape (n_states, n_actions, n_states) where `transition_probability[si, a, sk]` is the probability of transitioning from state si to state sk under action a.
    - `feature_vector(i, feature_map="ident")`: Get the feature vector associated with a state integer.
    - `feature_matrix(feature_map="ident")`: Get the feature matrix for this gridworld.
    - `int_to_point(i)`: Convert a state int into the corresponding coordinate.
    - `point_to_int(p)`: Convert a coordinate into the corresponding state int.
    - `neighbouring(i, k)`: Get whether two points neighbour each other. Also returns true if they are the same point.
    - `reward(state_int)`: Reward for being in state state_int.
    - `average_reward(n_trajectories, trajectory_length, policy)`: Calculate the average total reward obtained by following a given policy over n_trajectories trajectories.
    - `optimal_policy(state_int)`: The optimal policy for this gridworld.
    - `optimal_policy_deterministic(state_int)`: Deterministic version of the optimal policy for this gridworld.
    - `generate_trajectories(n_trajectories, trajectory_length, policy, random_start=False)`: Generate n_trajectories trajectories with length trajectory_length, following the given policy.

#### objectworld

Implements the objectworld MDP described in Levine et al. 2011.

**Classes, instance attributes, methods:**

- `OWObject(inner_colour, outer_colour)`: Object in objectworld.
    - `inner_colour`: Inner colour of object. int.
    - `outer_colour`: Outer colour of object. int.
- `Objectworld(grid_size, n_objects, n_colours, wind, discount)`: Objectworld MDP.
    - `actions`: Tuple of (dx, dy) actions.
    - `n_actions`: Number of actions. int.
    - `n_states`: Number of states. int.
    - `grid_size`: Size of grid. int.
    - `n_objects`: Number of objects in the world. int.
    - `n_colours`: Number of colours to colour objects with. int.
    - `wind`: Chance of moving randomly. float.
    - `discount`: MDP discount factor. float.
    - `objects`: Set of objects in the world.
    - `transition_probability`: NumPy array with shape (n_states, n_actions, n_states) where `transition_probability[si, a, sk]` is the probability of transitioning from state si to state sk under action a.
    - `feature_vector(i, discrete=True)`: Get the feature vector associated with a state integer.
    - `feature_matrix(discrete=True)`: Get the feature matrix for this objectworld.
    - `int_to_point(i)`: Convert a state int into the corresponding coordinate.
    - `point_to_int(p)`: Convert a coordinate into the corresponding state int.
    - `neighbouring(i, k)`: Get whether two points neighbour each other. Also returns true if they are the same point.
    - `reward(state_int)`: Reward for being in state state_int.
    - `average_reward(n_trajectories, trajectory_length, policy)`: Calculate the average total reward obtained by following a given policy over n_trajectories trajectories.
    - `generate_trajectories(n_trajectories, trajectory_length, policy)`: Generate n_trajectories trajectories with length trajectory_length, following the given policy.
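
Both MDP classes expose `transition_probability` with the `[si, a, sk]` layout described above, and the `value_iteration` functions consume an array of that shape. The sketch below shows how that layout can drive a value-iteration loop using NumPy alone. It is an illustration rather than the package's implementation: the helper `chain_mdp` and the three-state chain are hypothetical, and the sketch assumes the reward is collected on arrival in the next state.

```python
import numpy as np


def chain_mdp(n_states=3):
    """Hypothetical deterministic chain: action 0 steps left, 1 steps right."""
    p = np.zeros((n_states, 2, n_states))
    for s in range(n_states):
        p[s, 0, max(s - 1, 0)] = 1.0
        p[s, 1, min(s + 1, n_states - 1)] = 1.0
    return p


def optimal_value(transition_probability, reward, discount, threshold=1e-6):
    """Repeat Bellman optimality backups until the largest change is small."""
    v = np.zeros(transition_probability.shape[0])
    diff = np.inf
    while diff > threshold:
        # q[s, a] = sum_k P[s, a, k] * (reward[k] + discount * v[k])
        q = transition_probability @ (reward + discount * v)
        new_v = q.max(axis=1)
        diff = np.abs(new_v - v).max()
        v = new_v
    return v


def greedy_policy(transition_probability, reward, discount, v):
    """Pick the action with the highest one-step lookahead value per state."""
    q = transition_probability @ (reward + discount * v)
    return q.argmax(axis=1)


p = chain_mdp()
reward = np.array([0.0, 0.0, 1.0])  # reward only for arriving in the last state
v = optimal_value(p, reward, discount=0.5)
policy = greedy_policy(p, reward, 0.5, v)
print(v, policy)
```

Each slice `p[s, a, :]` is a probability distribution over next states, so checking that it sums to 1 is a cheap way to validate a hand-built transition array before running any IRL algorithm on it.
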