revive.algo.policy package¶
Submodules¶
revive.algo.policy.base module¶
- class revive.algo.policy.base.PolicyOperator(*args, **kwargs)[source]¶
Bases: object
- property env¶
- property policy¶
- property val_policy¶
- property other_models¶
- PARAMETER_DESCRIPTION = []¶
- classmethod get_tune_parameters(config: Dict[str, Any], **kargs)[source]¶
Use ray.tune to wrap the parameters to be searched.
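As an illustration of how such a wrapping might look (a sketch only, not the SDK's implementation), entries of PARAMETER_DESCRIPTION that declare a `search_mode` can be turned into a ray.tune search space:
```python
# Sketch only: turning PARAMETER_DESCRIPTION entries with a `search_mode`
# into a ray.tune search space; the real get_tune_parameters may differ.
from ray import tune

def build_search_space(parameter_description, base_config):
    search_space = dict(base_config)
    for param in parameter_description:
        if param.get('search_mode') == 'continuous':
            low, high = param['search_values']
            # loguniform is an assumption; it suits learning rates such as
            # g_lr, whose search_values are [1e-06, 0.001]
            search_space[param['name']] = tune.loguniform(low, high)
        else:
            search_space.setdefault(param['name'], param['default'])
    return search_space
```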
- model_creator(config: Dict[str, Any], node: FunctionDecisionNode) List[Module] [source]¶
Create all the models. The algorithm needs to define models for the nodes to be learned.
- Args:
- config
configuration parameters
- node
the decision node for which models are created
- Return:
a list of models
- optimizer_creator(models: List[Module], config: Dict[str, Any]) List[Optimizer] [source]¶
Define optimizers for the created models.
- Args:
- models
list of all the models
- config
configuration parameters
- Return:
a list of optimizers
- data_creator(config: Dict[str, Any])[source]¶
Create DataLoaders.
- Args:
- config
configuration parameters
- Return:
(train_loader, val_loader)
- venv_test(expert_data: Batch, target_policy, traj_length=None, scope: str = 'trainPolicy_on_valEnv')[source]¶
Use the virtual environment model to evaluate the policy model.
- generate_rollout(expert_data: Batch, target_policy, env: Union[VirtualEnvDev, List[VirtualEnvDev]], traj_length: int, maintain_grad_flow: bool = False, deterministic: bool = True, clip: bool = False)[source]¶
Generate trajectories based on the current policy.
- Args:
- expert_data
sampled data from the dataset
- target_policy
the policy used to generate the rollout actions
- env
the virtual environment, or a list of virtual environments, to roll out in
- traj_length
length of the generated trajectories
- maintain_grad_flow
whether to keep gradients flowing through the generated samples
- deterministic
whether the rollout is generated deterministically (default True)
- clip
whether to clip the generated rollout (default False)
- Return:
batch trajectories
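Putting the hooks above together, a concrete algorithm subclasses PolicyOperator and implements the creator methods. The sketch below is hypothetical: the dimensions, network, and random data are stand-ins, not the SDK's implementation.
```python
# Hypothetical PolicyOperator subclass; networks, dimensions, and data below
# are illustrative stand-ins, not the REVIVE SDK's implementation.
import torch
from torch.utils.data import DataLoader, TensorDataset

from revive.algo.policy.base import PolicyOperator

OBS_DIM, ACT_DIM = 8, 2  # assumed dimensions for the sketch

class MyPolicyOperator(PolicyOperator):
    PARAMETER_DESCRIPTION = [
        {'name': 'batch_size', 'type': int, 'default': 256},
        {'name': 'lr', 'type': float, 'default': 1e-4},
    ]

    def model_creator(self, config, node):
        # One model per node to be learned; a plain MLP stands in for the
        # real policy network.
        policy = torch.nn.Sequential(
            torch.nn.Linear(OBS_DIM, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, ACT_DIM),
        )
        return [policy]

    def optimizer_creator(self, models, config):
        # One optimizer per created model, in the same order.
        return [torch.optim.Adam(m.parameters(), lr=config['lr'])
                for m in models]

    def data_creator(self, config):
        # Must return (train_loader, val_loader); random tensors stand in
        # for the offline dataset.
        def make_loader():
            data = TensorDataset(torch.randn(1024, OBS_DIM),
                                 torch.randn(1024, ACT_DIM))
            return DataLoader(data, batch_size=config['batch_size'],
                              shuffle=True)
        return make_loader(), make_loader()
```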
revive.algo.policy.ppo module¶
- class revive.algo.policy.ppo.PPOOperator(*args, **kwargs)[source]¶
Bases: PolicyOperator
A class used to train the target policy.
- PARAMETER_DESCRIPTION = [
{'name': 'ppo_batch_size', 'description': 'Batch size of the training process.', 'abbreviation': 'pbs', 'type': <class 'int'>, 'default': 256, 'doc': True},
{'name': 'policy_bc_epoch', 'type': <class 'int'>, 'default': 0},
{'name': 'ppo_epoch', 'description': 'Number of epochs for the training process.', 'abbreviation': 'bep', 'type': <class 'int'>, 'default': 200, 'doc': True},
{'name': 'ppo_rollout_horizon', 'description': 'Rollout length of the policy training.', 'abbreviation': 'prh', 'type': <class 'int'>, 'default': 100, 'doc': True},
{'name': 'policy_hidden_features', 'description': 'Number of neurons per layer of the policy network.', 'abbreviation': 'phf', 'type': <class 'int'>, 'default': 256, 'doc': True},
{'name': 'policy_hidden_layers', 'description': 'Depth of the policy network.', 'abbreviation': 'phl', 'type': <class 'int'>, 'default': 4, 'doc': True},
{'name': 'policy_backbone', 'description': 'Backbone of the policy network.', 'abbreviation': 'pb', 'type': <class 'str'>, 'default': 'mlp', 'doc': True},
{'name': 'value_hidden_features', 'abbreviation': 'vhf', 'type': <class 'int'>, 'default': 256},
{'name': 'value_hidden_layers', 'abbreviation': 'vhl', 'type': <class 'int'>, 'default': 4},
{'name': 'ppo_runs', 'type': <class 'int'>, 'default': 2},
{'name': 'epsilon', 'type': <class 'float'>, 'default': 0.2},
{'name': 'w_vl2', 'type': <class 'float'>, 'default': 0.001},
{'name': 'w_ent', 'type': <class 'float'>, 'default': 0.0},
{'name': 'w_kl', 'type': <class 'float'>, 'default': 1.0},
{'name': 'gae_gamma', 'type': <class 'float'>, 'default': 0.99},
{'name': 'gae_lambda', 'type': <class 'float'>, 'default': 0.95},
{'name': 'g_lr', 'description': 'Initial learning rate of the training process.', 'type': <class 'float'>, 'default': 4e-05, 'search_mode': 'continuous', 'search_values': [1e-06, 0.001], 'doc': True},
{'name': 'reward_uncertainty_weight', 'description': 'Reward uncertainty weight (MOPO).', 'type': <class 'float'>, 'default': 0, 'search_mode': 'continuous'}]¶
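The names above are the keys consumed from the `config` dict passed to the creator methods; each entry also carries a short abbreviation. A small hedged example (how the configuration is assembled end-to-end depends on the surrounding REVIVE SDK setup):
```python
# Hypothetical override of a few PPO hyperparameters via the config dict;
# values are taken from the parameter table above.
config = {
    'ppo_batch_size': 512,     # abbreviation 'pbs', default 256
    'ppo_epoch': 100,          # 'bep', default 200
    'policy_backbone': 'mlp',  # 'pb'
    'g_lr': 1e-4,              # tuned over [1e-06, 0.001] when searching
}
```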
- model_creator(config, nodes)[source]¶
Create the models, including the target policy and the value net.
- Returns
the env models, the target policy, and the value net
- optimizer_creator(models, config)[source]¶
- Returns
generator optimizers, including the target policy optimizer and the value net optimizer
- data_creator(config)[source]¶
Create DataLoaders.
- Args:
- config
configuration parameters
- Return:
(train_loader, val_loader)
- train_batch(*args, **kwargs)¶
- bc_train_batch(*args, **kwargs)¶
- ADV(reward, mask, value, gamma, lam, use_gae=True)[source]¶
Compute the advantage function for PPO.
- Parameters
reward – rewards of each step
mask – 1 if the trajectory is done at this step, else 0
value – value estimate for each state
gamma – discount factor
lam – GAE lambda
use_gae – whether to use GAE
- Returns
the advantages and new value targets
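For reference, the sketch below is a standard GAE recursion consistent with the documented arguments, including the mask convention above (1 at a terminal step); it is a generic reference implementation, not necessarily the operator's exact code.
```python
# Generic GAE reference implementation matching the documented arguments;
# mask follows the docstring convention: 1 at a terminal step, else 0.
import torch

def gae_advantage(reward, mask, value, gamma=0.99, lam=0.95, use_gae=True):
    T = reward.shape[0]
    advantages = torch.zeros_like(reward)
    returns = torch.zeros_like(reward)
    if use_gae:
        gae = 0.0
        for t in reversed(range(T)):
            next_value = value[t + 1] if t + 1 < T else 0.0
            not_done = 1.0 - mask[t]
            delta = reward[t] + gamma * next_value * not_done - value[t]
            gae = delta + gamma * lam * not_done * gae
            advantages[t] = gae
        returns = advantages + value  # regression targets for the value net
    else:
        running = torch.zeros(())
        for t in reversed(range(T)):
            running = reward[t] + gamma * running * (1.0 - mask[t])
            returns[t] = running
        advantages = returns - value
    return advantages, returns
```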
- ppo_step(target_policy, value_net, generator_optimizer, states, actions, ret, advantages, action_log_probs, expert_data, value_net_states)[source]¶
Train the target_policy and value_net with the PPO algorithm.
- Parameters
target_policy – the target policy
value_net – the value net
generator_optimizer – the optimizer used to update the target policy and the value net
states – states fed to the target policy
actions – actions of the target policy
ret – return targets computed by GAE
advantages – advantage estimates
action_log_probs – action log-probabilities under the target policy
expert_data – batch of expert data
value_net_states – states fed to the value net
- Returns
v_loss, p_loss, total_loss
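The essence of such a PPO step is the clipped surrogate objective. The sketch below is hedged: `policy(states)` returning a torch Distribution and `w_vl2` acting as an L2 penalty on the value net are assumptions, not confirmed details of this operator.
```python
# Hedged sketch of a clipped-surrogate PPO update; `policy(states)` returning
# a torch Distribution and w_vl2 as a value-net L2 penalty are assumptions.
import torch

def ppo_step_sketch(policy, value_net, optimizer, states, actions, ret,
                    advantages, old_log_probs,
                    epsilon=0.2, w_vl2=0.001, w_ent=0.0):
    dist = policy(states)
    new_log_probs = dist.log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    p_loss = -torch.min(surr1, surr2).mean()          # clipped policy loss
    v_loss = (value_net(states).squeeze(-1) - ret).pow(2).mean()
    l2_reg = sum(p.pow(2).sum() for p in value_net.parameters())
    ent = dist.entropy().mean()
    total_loss = p_loss + v_loss + w_vl2 * l2_reg - w_ent * ent
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return v_loss.item(), p_loss.item(), total_loss.item()
```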
revive.algo.policy.sac module¶
- class revive.algo.policy.sac.ReplayBuffer(buffer_size)[source]¶
Bases: object
A simple FIFO experience replay buffer for SAC agents.
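A minimal sketch of such a FIFO buffer is shown below; the transition layout and method names are illustrative, not revive's internals.
```python
# Minimal FIFO replay buffer sketch; field layout and method names are
# illustrative, not revive's exact internals.
import random

class SimpleReplayBuffer:
    def __init__(self, buffer_size):
        self.buffer_size = int(buffer_size)
        self.storage = []
        self.ptr = 0  # next slot to overwrite once the buffer is full

    def add(self, transition):
        # transition: e.g. (state, action, reward, next_state, done)
        if len(self.storage) < self.buffer_size:
            self.storage.append(transition)
        else:
            self.storage[self.ptr] = transition  # overwrite oldest (FIFO)
        self.ptr = (self.ptr + 1) % self.buffer_size

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))
```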
- class revive.algo.policy.sac.SACOperator(*args, **kwargs)[source]¶
Bases: PolicyOperator
A class used to train the platform policy.
- PARAMETER_DESCRIPTION = [
{'name': 'sac_batch_size', 'description': 'Batch size of the training process.', 'abbreviation': 'pbs', 'type': <class 'int'>, 'default': 1024, 'doc': True},
{'name': 'policy_bc_epoch', 'type': <class 'int'>, 'default': 0},
{'name': 'sac_epoch', 'description': 'Number of epochs for the training process.', 'abbreviation': 'bep', 'type': <class 'int'>, 'default': 200, 'doc': True},
{'name': 'sac_steps_per_epoch', 'description': 'Number of SAC update rounds in each epoch.', 'abbreviation': 'sspe', 'type': <class 'int'>, 'default': 200, 'doc': True},
{'name': 'sac_rollout_horizon', 'abbreviation': 'srh', 'type': <class 'int'>, 'default': 20, 'doc': True},
{'name': 'policy_hidden_features', 'description': 'Number of neurons per layer of the policy network.', 'abbreviation': 'phf', 'type': <class 'int'>, 'default': 256, 'doc': True},
{'name': 'policy_hidden_layers', 'description': 'Depth of the policy network.', 'abbreviation': 'phl', 'type': <class 'int'>, 'default': 4, 'doc': True},
{'name': 'policy_backbone', 'description': 'Backbone of the policy network.', 'abbreviation': 'pb', 'type': <class 'str'>, 'default': 'mlp', 'doc': True},
{'name': 'policy_hidden_activation', 'description': 'Hidden activation of the policy network.', 'abbreviation': 'pha', 'type': <class 'str'>, 'default': 'leakyrelu', 'doc': True},
{'name': 'q_hidden_features', 'abbreviation': 'qhf', 'type': <class 'int'>, 'default': 256},
{'name': 'q_hidden_layers', 'abbreviation': 'qhl', 'type': <class 'int'>, 'default': 2},
{'name': 'num_q_net', 'abbreviation': 'nqn', 'type': <class 'int'>, 'default': 4},
{'name': 'buffer_size', 'description': 'Size of the buffer to store data.', 'abbreviation': 'bfs', 'type': <class 'int'>, 'default': 1000000.0, 'doc': True},
{'name': 'w_kl', 'type': <class 'float'>, 'default': 1.0},
{'name': 'gamma', 'type': <class 'float'>, 'default': 0.99},
{'name': 'alpha', 'type': <class 'float'>, 'default': 0.2},
{'name': 'polyak', 'type': <class 'float'>, 'default': 0.99},
{'name': 'batch_ratio', 'type': <class 'float'>, 'default': 1},
{'name': 'g_lr', 'description': 'Initial learning rate of the training process.', 'type': <class 'float'>, 'default': 4e-05, 'search_mode': 'continuous', 'search_values': [1e-06, 0.001], 'doc': True},
{'name': 'interval', 'description': 'Interval step for index removing.', 'type': <class 'int'>, 'default': 0},
{'name': 'generate_deter', 'description': 'Whether the generator rollout is deterministic.', 'type': <class 'int'>, 'default': 0},
{'name': 'reward_uncertainty_weight', 'description': 'Reward uncertainty weight (MOPO).', 'type': <class 'float'>, 'default': 0, 'search_mode': 'continuous'}]¶
- property policy¶
- property val_policy¶
- model_creator(config, nodes)[source]¶
Create the models, including the platform policy and the value net.
- Returns
the env model, the platform policy, and the value net
- optimizer_creator(models, config)[source]¶
- Returns
generator optimizers, including the platform policy optimizer and the value net optimizer
- data_creator(config)[source]¶
Create DataLoaders.
- Args:
- config
configuration parameters
- Return:
(train_loader, val_loader)
- setup(*args, **kwargs)¶
- train_batch(*args, **kwargs)¶
- bc_train_batch(*args, **kwargs)¶
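For orientation, the sketch below shows the core quantities of a SAC update consistent with the parameters above (gamma, alpha, polyak, num_q_net): a pessimistic TD target taken as the minimum over the Q ensemble, plus polyak averaging of the target networks. The call conventions (`q(state, action)`, `policy(state)` returning a torch Distribution) are assumptions, not confirmed details of this operator.
```python
# Hedged sketch of SAC's core update quantities; `q(state, action)` and
# `policy(state)` returning a torch Distribution are assumptions.
import torch

def sac_q_target(q_targets, policy, next_state, reward, done,
                 gamma=0.99, alpha=0.2):
    """Entropy-regularized TD target, pessimistic over the Q ensemble."""
    with torch.no_grad():
        dist = policy(next_state)
        next_action = dist.rsample()
        logp = dist.log_prob(next_action)
        q_min = torch.min(
            torch.stack([q(next_state, next_action) for q in q_targets]),
            dim=0).values
        return reward + gamma * (1.0 - done) * (q_min - alpha * logp)

def polyak_update(online_net, target_net, polyak=0.99):
    """target <- polyak * target + (1 - polyak) * online."""
    with torch.no_grad():
        for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
            p_targ.mul_(polyak).add_((1.0 - polyak) * p)
```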