revive.algo.policy package¶
Submodules¶
revive.algo.policy.base module¶
- class revive.algo.policy.base.PolicyOperator(*args, **kwargs)[source]¶
Bases:
object
- property env¶
- property train_policy¶
- property val_policy¶
- property policy¶
- property other_train_models¶
- property other_val_models¶
- PARAMETER_DESCRIPTION = []¶
- classmethod get_tune_parameters(config: Dict[str, Any], **kargs)[source]¶
Use ray.tune to wrap the parameters to be searched.
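The entries of PARAMETER_DESCRIPTION that carry a search_mode and search_values (for example g_lr further below) are the ones exposed to the search. The following is a minimal sketch, assuming a straightforward mapping onto ray.tune search spaces; it only illustrates the idea and is not the actual implementation of get_tune_parameters.

```python
# Hedged sketch: how entries such as
# {'name': 'g_lr', 'search_mode': 'continuous', 'search_values': [1e-06, 0.001]}
# might be turned into a ray.tune search space. Only tune.uniform/tune.choice
# are assumed here; the real get_tune_parameters may behave differently.
from ray import tune

def build_search_space(parameter_description, config):
    search_space = dict(config)
    for param in parameter_description:
        mode = param.get('search_mode')
        values = param.get('search_values')
        if mode is None or values is None:
            continue  # fixed parameter, keep the configured value
        if mode == 'continuous':
            # sample uniformly between the given lower and upper bound
            search_space[param['name']] = tune.uniform(values[0], values[1])
        else:
            # otherwise treat the search values as a discrete candidate set
            search_space[param['name']] = tune.choice(values)
    return search_space
```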
- optimizer_creator(models: List[Module], config: Dict[str, Any]) List[Optimizer] [source]¶
Define optimizers for the created models.
- Args:
- models:
list of all the models
- config:
configuration parameters
- Return:
a list of optimizers
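As a concrete illustration, the sketch below shows an optimizer_creator of this shape, assuming one Adam optimizer per model and a learning rate taken from the config under the 'g_lr' key (that key appears in the parameter descriptions below, but its use here is an assumption).

```python
# Hedged sketch of an optimizer_creator override: one Adam optimizer per model.
# The 'g_lr' config key and the default value are assumptions for illustration.
from typing import Any, Dict, List
import torch

def optimizer_creator(models: List[torch.nn.Module],
                      config: Dict[str, Any]) -> List[torch.optim.Optimizer]:
    lr = config.get('g_lr', 4e-5)
    return [torch.optim.Adam(model.parameters(), lr=lr) for model in models]
```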
- data_creator()[source]¶
Create DataLoaders.
- Args:
- config:
configuration parameters
- Return:
(train_loader, val_loader)
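A minimal sketch of the documented return shape is shown below, assuming plain torch DataLoaders; the dataset objects and the 'batch_size' config key are hypothetical and only illustrate the (train_loader, val_loader) contract.

```python
# Hedged sketch of a data_creator: it only illustrates the documented
# (train_loader, val_loader) return value. `train_dataset`, `val_dataset`
# and the 'batch_size' key are hypothetical.
from torch.utils.data import DataLoader

def data_creator(train_dataset, val_dataset, config):
    batch_size = config.get('batch_size', 256)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader
```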
- venv_test(expert_data: Batch, target_policy, traj_length=None, scope: str = 'trainPolicy_on_valEnv')[source]¶
Use the virtual environment model to test the policy model.
- generate_rollout(expert_data: Batch, target_policy, env: VirtualEnvDev | List[VirtualEnvDev], traj_length: int, maintain_grad_flow: bool = False, deterministic: bool = True, clip: bool = False)[source]¶
Generate trajectories based on current policy.
- Args:
- expert_data:
sampled data from the dataset.
- target_policy:
the policy to roll out.
- env:
virtual environment model(s) used to generate transitions.
- traj_length:
length of the generated trajectories.
- maintain_grad_flow:
whether to keep gradients flowing through the generated rollout.
- Return:
batch trajectories
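Conceptually, the rollout is built autoregressively: starting from states drawn from expert_data, the policy proposes an action and the virtual environment model predicts the next state, repeated for traj_length steps. The sketch below only illustrates that loop; the policy(obs) and step_fn(obs, action) callables are hypothetical stand-ins, not the real VirtualEnvDev interface.

```python
# Conceptual sketch only: roll a policy for `traj_length` steps starting from
# given initial observations. `policy` and `step_fn` are hypothetical callables;
# the actual generate_rollout and VirtualEnvDev APIs may differ.
def rollout(initial_obs, policy, step_fn, traj_length):
    obs, trajectory = initial_obs, []
    for _ in range(traj_length):
        action = policy(obs)             # sample (or deterministically pick) an action
        next_obs = step_fn(obs, action)  # let the learned env model predict the next state
        trajectory.append((obs, action, next_obs))
        obs = next_obs
    return trajectory
```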
revive.algo.policy.ppo module¶
- class revive.algo.policy.ppo.PPOOperator(*args, **kwargs)[source]¶
Bases:
PolicyOperator
A class used to train the target policy.
- NAME = 'PPO'¶
- PARAMETER_DESCRIPTION = [{'abbreviation': 'pbs', 'default': 256, 'description': 'Batch size of training process.', 'doc': True, 'name': 'ppo_batch_size', 'type': <class 'int'>}, {'default': 0, 'description': 'Number of epochs to pre-train the policy.', 'doc': True, 'name': 'policy_bc_epoch', 'type': <class 'int'>}, {'abbreviation': 'bep', 'default': 1000, 'description': 'Number of epochs for the training process.', 'doc': True, 'name': 'ppo_epoch', 'type': <class 'int'>}, {'abbreviation': 'prh', 'default': 100, 'description': 'Rollout length during policy training.', 'doc': True, 'name': 'ppo_rollout_horizon', 'type': <class 'int'>}, {'abbreviation': 'phf', 'default': 256, 'description': 'Number of neurons per layer of the policy network.', 'doc': True, 'name': 'policy_hidden_features', 'type': <class 'int'>}, {'abbreviation': 'phl', 'default': 4, 'description': 'Depth of policy network.', 'doc': True, 'name': 'policy_hidden_layers', 'type': <class 'int'>}, {'abbreviation': 'pb', 'default': 'res', 'description': 'Backbone of policy network. [mlp, res, ft_transformer]', 'doc': True, 'name': 'policy_backbone', 'type': <class 'str'>}, {'abbreviation': 'vhf', 'default': 256, 'name': 'value_hidden_features', 'type': <class 'int'>}, {'abbreviation': 'vhl', 'default': 4, 'name': 'value_hidden_layers', 'type': <class 'int'>}, {'default': 2, 'name': 'ppo_runs', 'type': <class 'int'>}, {'default': 0.2, 'name': 'epsilon', 'type': <class 'float'>}, {'default': 0.001, 'name': 'w_vl2', 'type': <class 'float'>}, {'default': 0.0, 'name': 'w_ent', 'type': <class 'float'>}, {'default': 1.0, 'name': 'w_kl', 'type': <class 'float'>}, {'default': 0.99, 'name': 'gae_gamma', 'type': <class 'float'>}, {'default': 0.95, 'name': 'gae_lambda', 'type': <class 'float'>}, {'default': 4e-05, 'description': 'Initial learning rate of the training process.', 'doc': True, 'name': 'g_lr', 'search_mode': 'continuous', 'search_values': [1e-06, 0.001], 'type': <class 'float'>}, {'default': 0, 'description': 'Reward uncertainty weight (MOPO).', 'name': 'reward_uncertainty_weight', 'search_mode': 'continuous', 'type': <class 'float'>}, {'default': False, 'name': 'filter', 'type': <class 'bool'>}, {'default': 50, 'name': 'candidate_num', 'type': <class 'int'>}, {'default': 50, 'name': 'ensemble_size', 'type': <class 'int'>}, {'default': 10, 'name': 'ensemble_choosing_interval', 'type': <class 'int'>}, {'default': False, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_transition_function', 'type': <class 'bool'>}, {'default': 'auto', 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_nodes', 'type': <class 'list'>}, {'default': 100, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_net_num', 'type': <class 'int'>}, {'default': 0.05, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_weight', 'type': <class 'float'>}, {'default': 1, 'description': 'Whether the generator rollout is deterministic.', 'name': 'generate_deter', 'type': <class 'int'>}]¶
- property policy_bc_optimizer¶
- model_creator(nodes)[source]¶
Create models including target policy and value net.
- Returns:
env models, target policy, value net
- optimizer_creator(scope)[source]¶
- Returns:
generator optimizers including target policy optimizers and value net optimizers
- data_creator()[source]¶
Create DataLoaders.
- Args:
- config:
configuration parameters
- Return:
(train_loader, val_loader)
- after_train_epoch(*args, **kwargs)¶
- train_batch(*args, **kwargs)¶
- ADV(reward, mask, value, gamma, lam, use_gae=True)[source]¶
Compute advantage function for PPO.
- Parameters:
reward – rewards of each step
mask – mask is 1 if the trajectory is done, else 0
value – value for each state
gamma – discount factor
lam – GAE lambda
use_gae – whether to use GAE
- Returns:
advantages and new value
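The parameters above mirror standard Generalized Advantage Estimation (Schulman et al., 2016). The sketch below is the textbook recursion written with an explicit done flag; it is a reference illustration, not the exact ADV implementation. Per the docstring, the operator's mask is 1 when the trajectory is done, which would correspond to done here.

```python
# Reference sketch of textbook GAE, not the exact ADV code. `done[t]` is 1.0 at
# terminal steps; under the docstring's convention, mask maps onto done here.
import torch

def gae(reward, value, done, gamma=0.99, lam=0.95):
    T = reward.shape[0]
    advantages = torch.zeros_like(reward)
    gae_t = 0.0
    for t in reversed(range(T)):
        next_value = value[t + 1] if t + 1 < T else 0.0
        not_done = 1.0 - done[t]
        # TD residual, zeroing the bootstrap term at terminal steps
        delta = reward[t] + gamma * next_value * not_done - value[t]
        # discounted, lambda-weighted accumulation of residuals
        gae_t = delta + gamma * lam * not_done * gae_t
        advantages[t] = gae_t
    returns = advantages + value  # value targets returned alongside the advantages
    return advantages, returns
```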
- ppo_step(target_policy, value_net, generator_optimizer, states, actions, ret, advantages, action_log_probs, expert_data, value_net_states, valid_masks=1)[source]¶
Train target_policy and value_net with the PPO algorithm.
- Parameters:
target_policy – target policy
value_net – value net
generator_optimizer – the optimizers used to optimize target policy and value net
states – states fed to the target policy
actions – actions taken by the target policy
ret – GAE return targets
advantages – advantage estimates
action_log_probs – action log probabilities of the target policy
expert_data – batch of expert data
- Returns:
v_loss, p_loss, total_loss
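ppo_step follows the usual PPO recipe of a clipped surrogate policy loss plus a value regression loss. The sketch below shows those standard losses under assumed coefficient names (value_coef, ent_coef); the documented w_vl2 and w_ent parameters may play similar roles, but the exact loss composition of ppo_step is not specified here.

```python
# Hedged sketch of the standard PPO clipped-surrogate losses; the actual
# ppo_step may weight or combine terms differently. `value_coef` and `ent_coef`
# are hypothetical coefficients.
import torch

def ppo_losses(new_log_probs, old_log_probs, advantages, values, returns,
               entropy, epsilon=0.2, value_coef=0.5, ent_coef=0.0):
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    p_loss = -torch.min(surr1, surr2).mean()      # clipped surrogate policy loss
    v_loss = (returns - values).pow(2).mean()     # value regression loss
    total_loss = p_loss + value_coef * v_loss - ent_coef * entropy.mean()
    return v_loss, p_loss, total_loss
```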
revive.algo.policy.sac module¶
- class revive.algo.policy.sac.ReplayBuffer(buffer_size)[source]¶
Bases:
object
A simple FIFO experience replay buffer for SAC agents.
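A minimal sketch of such a FIFO buffer is given below; the field names and storage layout are illustrative and not necessarily those of this class.

```python
# Minimal sketch of a FIFO replay buffer; illustrative only, not the internals
# of revive.algo.policy.sac.ReplayBuffer.
import random
from collections import deque

class SimpleReplayBuffer:
    def __init__(self, buffer_size: int):
        # oldest transitions are dropped first once the buffer is full
        self.storage = deque(maxlen=int(buffer_size))

    def put(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size: int):
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))
```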
- class revive.algo.policy.sac.SACOperator(*args, **kwargs)[source]¶
Bases:
PolicyOperator
A class used to train the platform policy.
- NAME = 'SAC'¶
- critic_pre_trained = False¶
- PARAMETER_DESCRIPTION = [{'abbreviation': 'pbs', 'default': 1024, 'description': 'Batch size of training process.', 'doc': True, 'name': 'sac_batch_size', 'type': <class 'int'>}, {'default': 0, 'description': 'Number of epochs to pre-train the policy.', 'doc': True, 'name': 'policy_bc_epoch', 'type': <class 'int'>}, {'abbreviation': 'bep', 'default': 1000, 'description': 'Number of epochs for the training process.', 'doc': True, 'name': 'sac_epoch', 'type': <class 'int'>}, {'abbreviation': 'sspe', 'default': 200, 'description': 'The number of SAC update rounds in each epoch.', 'doc': True, 'name': 'sac_steps_per_epoch', 'type': <class 'int'>}, {'abbreviation': 'srh', 'default': 20, 'doc': True, 'name': 'sac_rollout_horizon', 'type': <class 'int'>}, {'abbreviation': 'phf', 'default': 256, 'description': 'Number of neurons per layer of the policy network.', 'doc': True, 'name': 'policy_hidden_features', 'type': <class 'int'>}, {'abbreviation': 'phl', 'default': 4, 'description': 'Depth of policy network.', 'doc': True, 'name': 'policy_hidden_layers', 'type': <class 'int'>}, {'abbreviation': 'pb', 'default': 'res', 'description': 'Backbone of policy network. [mlp, res, ft_transformer]', 'doc': True, 'name': 'policy_backbone', 'type': <class 'str'>}, {'abbreviation': 'pha', 'default': 'leakyrelu', 'description': 'Hidden activation of the policy network.', 'doc': True, 'name': 'policy_hidden_activation', 'type': <class 'str'>}, {'abbreviation': 'qhf', 'default': 256, 'name': 'q_hidden_features', 'type': <class 'int'>}, {'abbreviation': 'qhl', 'default': 2, 'name': 'q_hidden_layers', 'type': <class 'int'>}, {'abbreviation': 'nqn', 'default': 4, 'name': 'num_q_net', 'type': <class 'int'>}, {'abbreviation': 'bfs', 'default': 1000000.0, 'description': 'Size of the buffer to store data.', 'doc': True, 'name': 'buffer_size', 'type': <class 'int'>}, {'default': 1.0, 'name': 'w_kl', 'type': <class 'float'>}, {'default': 0.99, 'name': 'gamma', 'type': <class 'float'>}, {'default': 0.2, 'name': 'alpha', 'type': <class 'float'>}, {'default': 0.99, 'name': 'polyak', 'type': <class 'float'>}, {'default': 1, 'name': 'batch_ratio', 'type': <class 'float'>}, {'default': 4e-05, 'description': 'Initial learning rate of the training process.', 'doc': True, 'name': 'g_lr', 'search_mode': 'continuous', 'search_values': [1e-06, 0.001], 'type': <class 'float'>}, {'default': 0, 'description': 'Interval step for index removing.', 'name': 'interval', 'type': <class 'int'>}, {'default': 1, 'description': 'Whether the generator rollout is deterministic.', 'name': 'generate_deter', 'type': <class 'int'>}, {'default': False, 'name': 'filter', 'type': <class 'bool'>}, {'default': 50, 'name': 'candidate_num', 'type': <class 'int'>}, {'default': 50, 'name': 'ensemble_size', 'type': <class 'int'>}, {'default': 10, 'name': 'ensemble_choosing_interval', 'type': <class 'int'>}, {'default': False, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_transition_function', 'type': <class 'bool'>}, {'default': 'auto', 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_nodes', 'type': <class 'list'>}, {'default': 100, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_net_num', 'type': <class 'int'>}, {'default': 0.05, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_weight', 'type': <class 'float'>}, {'default': True, 'name': 'critic_pretrain', 'type': <class 'bool'>}, {'default': 0, 'description': 'Reward uncertainty weight (MOPO).', 'name': 'reward_uncertainty_weight', 'search_mode': 'continuous', 'type': <class 'float'>}, {'default': 20, 'name': 'penalty_sample_num', 'type': <class 'int'>}, {'default': 'None', 'name': 'penalty_type', 'type': <class 'str'>}, {'default': 'auto', 'name': 'ts_conv_nodes', 'type': <class 'list'>}]¶
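The gamma, alpha, polyak and num_q_net parameters above correspond to the standard SAC update: an entropy-regularised soft Q target taken as the minimum over an ensemble of target Q networks, plus Polyak averaging of the target networks. The sketch below illustrates those two pieces under those assumptions; it is not SACOperator's actual train_batch.

```python
# Hedged sketch of the standard SAC target computation and Polyak averaging
# suggested by the `gamma`, `alpha`, `polyak` and `num_q_net` parameters above;
# the actual SACOperator.train_batch may differ.
import torch

@torch.no_grad()
def soft_q_target(reward, next_q_values, next_log_prob, done, gamma=0.99, alpha=0.2):
    # clipped double-Q style: take the minimum over the ensemble of target Q nets
    min_next_q = torch.min(torch.stack(next_q_values, dim=0), dim=0).values
    return reward + gamma * (1.0 - done) * (min_next_q - alpha * next_log_prob)

@torch.no_grad()
def polyak_update(target_net, net, polyak=0.99):
    # slowly track the online network: theta_targ <- polyak*theta_targ + (1-polyak)*theta
    for p_targ, p in zip(target_net.parameters(), net.parameters()):
        p_targ.mul_(polyak).add_((1.0 - polyak) * p)
```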
- model_creator(nodes)[source]¶
Create models including the platform policy and value net.
- Returns:
env model, platform policy, value net
- optimizer_creator(scope)[source]¶
- Returns:
generator optimizer including platform policy optimizer and value net optimizer
- data_creator()[source]¶
Create DataLoaders.
- Args:
- config:
configuration parameters
- Return:
(train_loader, val_loader)
- setup(*args, **kwargs)¶
- before_validate_epoch(*args, **kwargs)¶
- buffer_process(generated_data=None, buffer=None, model_index=None, expert_data=None, buffer_expert=None, expert_index=None, generate_buffer=True, expert_buffer=True)[source]¶
- train_batch(*args, **kwargs)¶