revive.algo.policy package¶
Submodules¶
revive.algo.policy.base module¶
- class revive.algo.policy.base.PolicyOperator(*args, **kwargs)[source]¶
Bases:
object
- property env¶
- property train_policy¶
- property val_policy¶
- property policy¶
- property other_train_models¶
- property other_val_models¶
- PARAMETER_DESCRIPTION = []¶
- classmethod get_tune_parameters(config: Dict[str, Any], **kargs)[source]¶
Use ray.tune to wrap the parameters to be searched.
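The entries of PARAMETER_DESCRIPTION that carry a search_mode and search_values (for example g_lr further below) are the ones exposed to the search. The following is a minimal sketch, assuming a straightforward mapping onto ray.tune search spaces; it only illustrates the idea and is not the actual implementation of get_tune_parameters.

```python
# Hedged sketch: how entries such as
# {'name': 'g_lr', 'search_mode': 'continuous', 'search_values': [1e-06, 0.001]}
# might be turned into a ray.tune search space. Only tune.uniform/tune.choice
# are assumed here; the real get_tune_parameters may behave differently.
from ray import tune

def build_search_space(parameter_description, config):
    search_space = dict(config)
    for param in parameter_description:
        mode = param.get('search_mode')
        values = param.get('search_values')
        if mode is None or values is None:
            continue  # fixed parameter, keep the configured value
        if mode == 'continuous':
            # sample uniformly between the given lower and upper bound
            search_space[param['name']] = tune.uniform(values[0], values[1])
        else:
            # otherwise treat the search values as a discrete candidate set
            search_space[param['name']] = tune.choice(values)
    return search_space
```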
- optimizer_creator(models: List[Module], config: Dict[str, Any]) List[Optimizer] [source]¶
Define optimizers for the created models.
- Args:
- models:
list of all the models
- config:
configuration parameters
- Return:
a list of optimizers
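As a concrete illustration, the sketch below shows an optimizer_creator of this shape, assuming one Adam optimizer per model and a learning rate taken from the config under the 'g_lr' key (that key appears in the parameter descriptions below, but its use here is an assumption).

```python
# Hedged sketch of an optimizer_creator override: one Adam optimizer per model.
# The 'g_lr' config key and the default value are assumptions for illustration.
from typing import Any, Dict, List
import torch

def optimizer_creator(models: List[torch.nn.Module],
                      config: Dict[str, Any]) -> List[torch.optim.Optimizer]:
    lr = config.get('g_lr', 4e-5)
    return [torch.optim.Adam(model.parameters(), lr=lr) for model in models]
```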
- data_creator()[source]¶
Create DataLoaders.
- Args:
- config:
configuration parameters
- Return:
(train_loader, val_loader)
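A minimal sketch of the documented return shape is shown below, assuming plain torch DataLoaders; the dataset objects and the 'batch_size' config key are hypothetical and only illustrate the (train_loader, val_loader) contract.

```python
# Hedged sketch of a data_creator: it only illustrates the documented
# (train_loader, val_loader) return value. `train_dataset`, `val_dataset`
# and the 'batch_size' key are hypothetical.
from torch.utils.data import DataLoader

def data_creator(train_dataset, val_dataset, config):
    batch_size = config.get('batch_size', 256)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader
```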
- venv_test(expert_data: Batch, target_policy, traj_length=None, scope: str = 'trainPolicy_on_valEnv')[source]¶
Use the virtual environment model to test the policy model.
- generate_rollout(expert_data: Batch, target_policy, env: VirtualEnvDev | List[VirtualEnvDev], traj_length: int, maintain_grad_flow: bool = False, deterministic: bool = True, clip: bool = False)[source]¶
Generate trajectories based on current policy.
- Args:
- expert_data:
sampled data from the dataset.
- target_policy:
the policy to roll out.
- env:
virtual environment model(s) used to generate transitions.
- traj_length:
length of the generated trajectories.
- maintain_grad_flow:
whether to keep gradients flowing through the generated rollout.
- Return:
batch trajectories
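Conceptually, the rollout is built autoregressively: starting from states drawn from expert_data, the policy proposes an action and the virtual environment model predicts the next state, repeated for traj_length steps. The sketch below only illustrates that loop; the policy(obs) and step_fn(obs, action) callables are hypothetical stand-ins, not the real VirtualEnvDev interface.

```python
# Conceptual sketch only: roll a policy for `traj_length` steps starting from
# given initial observations. `policy` and `step_fn` are hypothetical callables;
# the actual generate_rollout and VirtualEnvDev APIs may differ.
def rollout(initial_obs, policy, step_fn, traj_length):
    obs, trajectory = initial_obs, []
    for _ in range(traj_length):
        action = policy(obs)             # sample (or deterministically pick) an action
        next_obs = step_fn(obs, action)  # let the learned env model predict the next state
        trajectory.append((obs, action, next_obs))
        obs = next_obs
    return trajectory
```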
revive.algo.policy.ppo module¶
- class revive.algo.policy.ppo.PPOOperator(*args, **kwargs)[source]¶
Bases:
PolicyOperator
A class used to train the target policy.
- NAME = 'PPO'¶
- PARAMETER_DESCRIPTION = [{'abbreviation': 'pbs', 'default': 256, 'description': 'Batch size of training process.', 'doc': True, 'name': 'ppo_batch_size', 'type': <class 'int'>}, {'default': 0, 'description': 'Number of epochs to pre-train the policy.', 'doc': True, 'name': 'policy_bc_epoch', 'type': <class 'int'>}, {'abbreviation': 'bep', 'default': 1000, 'description': 'Number of epochs for the training process.', 'doc': True, 'name': 'ppo_epoch', 'type': <class 'int'>}, {'abbreviation': 'prh', 'default': 100, 'description': 'Rollout length during policy training.', 'doc': True, 'name': 'ppo_rollout_horizon', 'type': <class 'int'>}, {'abbreviation': 'phf', 'default': 256, 'description': 'Number of neurons per layer of the policy network.', 'doc': True, 'name': 'policy_hidden_features', 'type': <class 'int'>}, {'abbreviation': 'phl', 'default': 4, 'description': 'Depth of policy network.', 'doc': True, 'name': 'policy_hidden_layers', 'type': <class 'int'>}, {'abbreviation': 'pb', 'default': 'res', 'description': 'Backbone of policy network. [mlp, res, ft_transformer]', 'doc': True, 'name': 'policy_backbone', 'type': <class 'str'>}, {'abbreviation': 'vhf', 'default': 256, 'name': 'value_hidden_features', 'type': <class 'int'>}, {'abbreviation': 'vhl', 'default': 4, 'name': 'value_hidden_layers', 'type': <class 'int'>}, {'default': 2, 'name': 'ppo_runs', 'type': <class 'int'>}, {'default': 0.2, 'name': 'epsilon', 'type': <class 'float'>}, {'default': 0.001, 'name': 'w_vl2', 'type': <class 'float'>}, {'default': 0.0, 'name': 'w_ent', 'type': <class 'float'>}, {'default': 1.0, 'name': 'w_kl', 'type': <class 'float'>}, {'default': 0.99, 'name': 'gae_gamma', 'type': <class 'float'>}, {'default': 0.95, 'name': 'gae_lambda', 'type': <class 'float'>}, {'default': 4e-05, 'description': 'Initial learning rate of the training process.', 'doc': True, 'name': 'g_lr', 'search_mode': 'continuous', 'search_values': [1e-06, 0.001], 'type': <class 'float'>}, {'default': 0, 'description': 'Reward uncertainty weight (MOPO).', 'name': 'reward_uncertainty_weight', 'search_mode': 'continuous', 'type': <class 'float'>}, {'default': False, 'name': 'filter', 'type': <class 'bool'>}, {'default': 50, 'name': 'candidate_num', 'type': <class 'int'>}, {'default': 50, 'name': 'ensemble_size', 'type': <class 'int'>}, {'default': 10, 'name': 'ensemble_choosing_interval', 'type': <class 'int'>}, {'default': False, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_transition_function', 'type': <class 'bool'>}, {'default': 'auto', 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_nodes', 'type': <class 'list'>}, {'default': 100, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_net_num', 'type': <class 'int'>}, {'default': 0.05, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_weight', 'type': <class 'float'>}, {'default': 1, 'description': 'Whether the generator rollout is deterministic.', 'name': 'generate_deter', 'type': <class 'int'>}]¶
- property policy_bc_optimizer¶
- model_creator(nodes)[source]¶
Create models including target policy and value net.
- Returns:
env models, target policy, value net
- optimizer_creator(scope)[source]¶
- Returns:
generator optimizers including target policy optimizers and value net optimizers
- data_creator()[source]¶
Create DataLoaders.
- Args:
- config:
configuration parameters
- Return:
(train_loader, val_loader)
- after_train_epoch(*args, **kwargs)¶
- train_batch(*args, **kwargs)¶
- ADV(reward, mask, value, gamma, lam, use_gae=True)[source]¶
Compute advantage function for PPO.
- Parameters:
reward – rewards of each step
mask – mask is 1 if the trajectory is done, else 0
value – value for each state
gamma – discount factor
lam – GAE lambda
use_gae – whether to use GAE
- Returns:
advantages and new value
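The parameters above mirror standard Generalized Advantage Estimation (Schulman et al., 2016). The sketch below is the textbook recursion written with an explicit done flag; it is a reference illustration, not the exact ADV implementation. Per the docstring, the operator's mask is 1 when the trajectory is done, which would correspond to done here.

```python
# Reference sketch of textbook GAE, not the exact ADV code. `done[t]` is 1.0 at
# terminal steps; under the docstring's convention, mask maps onto done here.
import torch

def gae(reward, value, done, gamma=0.99, lam=0.95):
    T = reward.shape[0]
    advantages = torch.zeros_like(reward)
    gae_t = 0.0
    for t in reversed(range(T)):
        next_value = value[t + 1] if t + 1 < T else 0.0
        not_done = 1.0 - done[t]
        # TD residual, zeroing the bootstrap term at terminal steps
        delta = reward[t] + gamma * next_value * not_done - value[t]
        # discounted, lambda-weighted accumulation of residuals
        gae_t = delta + gamma * lam * not_done * gae_t
        advantages[t] = gae_t
    returns = advantages + value  # value targets returned alongside the advantages
    return advantages, returns
```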
- ppo_step(target_policy, value_net, generator_optimizer, states, actions, ret, advantages, action_log_probs, expert_data, value_net_states, valid_masks=1)[source]¶
Train target_policy and value_net with the PPO algorithm.
- Parameters:
target_policy – target policy
value_net – value net
generator_optimizer – the optimizers used to optimize target policy and value net
states – states fed to the target policy
actions – actions taken by the target policy
ret – GAE return targets
advantages – advantage estimates
action_log_probs – action log probabilities of the target policy
expert_data – batch of expert data
- Returns:
v_loss, p_loss, total_loss
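ppo_step follows the usual PPO recipe of a clipped surrogate policy loss plus a value regression loss. The sketch below shows those standard losses under assumed coefficient names (value_coef, ent_coef); the documented w_vl2 and w_ent parameters may play similar roles, but the exact loss composition of ppo_step is not specified here.

```python
# Hedged sketch of the standard PPO clipped-surrogate losses; the actual
# ppo_step may weight or combine terms differently. `value_coef` and `ent_coef`
# are hypothetical coefficients.
import torch

def ppo_losses(new_log_probs, old_log_probs, advantages, values, returns,
               entropy, epsilon=0.2, value_coef=0.5, ent_coef=0.0):
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    p_loss = -torch.min(surr1, surr2).mean()      # clipped surrogate policy loss
    v_loss = (returns - values).pow(2).mean()     # value regression loss
    total_loss = p_loss + value_coef * v_loss - ent_coef * entropy.mean()
    return v_loss, p_loss, total_loss
```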
revive.algo.policy.sac module¶
- class revive.algo.policy.sac.ReplayBuffer(buffer_size)[source]¶
Bases:
object
A simple FIFO experience replay buffer for SAC agents.
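A minimal sketch of such a FIFO buffer is given below; the field names and storage layout are illustrative and not necessarily those of this class.

```python
# Minimal sketch of a FIFO replay buffer; illustrative only, not the internals
# of revive.algo.policy.sac.ReplayBuffer.
import random
from collections import deque

class SimpleReplayBuffer:
    def __init__(self, buffer_size: int):
        # oldest transitions are dropped first once the buffer is full
        self.storage = deque(maxlen=int(buffer_size))

    def put(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size: int):
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))
```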
- class revive.algo.policy.sac.SACOperator(*args, **kwargs)[source]¶
Bases:
PolicyOperator
A class used to train the platform policy.
- NAME = 'SAC'¶
- critic_pre_trained = False¶
- PARAMETER_DESCRIPTION = [{'abbreviation': 'pbs', 'default': 1024, 'description': 'Batch size of training process.', 'doc': True, 'name': 'sac_batch_size', 'type': <class 'int'>}, {'default': 0, 'description': 'Number of epochs to pre-train the policy.', 'doc': True, 'name': 'policy_bc_epoch', 'type': <class 'int'>}, {'abbreviation': 'bep', 'default': 1000, 'description': 'Number of epochs for the training process.', 'doc': True, 'name': 'sac_epoch', 'type': <class 'int'>}, {'abbreviation': 'sspe', 'default': 200, 'description': 'The number of SAC update rounds in each epoch.', 'doc': True, 'name': 'sac_steps_per_epoch', 'type': <class 'int'>}, {'abbreviation': 'srh', 'default': 20, 'doc': True, 'name': 'sac_rollout_horizon', 'type': <class 'int'>}, {'abbreviation': 'phf', 'default': 256, 'description': 'Number of neurons per layer of the policy network.', 'doc': True, 'name': 'policy_hidden_features', 'type': <class 'int'>}, {'abbreviation': 'phl', 'default': 4, 'description': 'Depth of policy network.', 'doc': True, 'name': 'policy_hidden_layers', 'type': <class 'int'>}, {'abbreviation': 'pb', 'default': 'res', 'description': 'Backbone of policy network. [mlp, res, ft_transformer]', 'doc': True, 'name': 'policy_backbone', 'type': <class 'str'>}, {'abbreviation': 'pha', 'default': 'leakyrelu', 'description': 'Hidden activation of the policy network.', 'doc': True, 'name': 'policy_hidden_activation', 'type': <class 'str'>}, {'abbreviation': 'qhf', 'default': 256, 'name': 'q_hidden_features', 'type': <class 'int'>}, {'abbreviation': 'qhl', 'default': 2, 'name': 'q_hidden_layers', 'type': <class 'int'>}, {'abbreviation': 'nqn', 'default': 4, 'name': 'num_q_net', 'type': <class 'int'>}, {'abbreviation': 'bfs', 'default': 1000000.0, 'description': 'Size of the buffer to store data.', 'doc': True, 'name': 'buffer_size', 'type': <class 'int'>}, {'default': 1.0, 'name': 'w_kl', 'type': <class 'float'>}, {'default': 0.99, 'name': 'gamma', 'type': <class 'float'>}, {'default': 0.2, 'name': 'alpha', 'type': <class 'float'>}, {'default': 0.99, 'name': 'polyak', 'type': <class 'float'>}, {'default': 1, 'name': 'batch_ratio', 'type': <class 'float'>}, {'default': 4e-05, 'description': 'Initial learning rate of the training process.', 'doc': True, 'name': 'g_lr', 'search_mode': 'continuous', 'search_values': [1e-06, 0.001], 'type': <class 'float'>}, {'default': 0, 'description': 'Interval step for index removing.', 'name': 'interval', 'type': <class 'int'>}, {'default': 1, 'description': 'Whether the generator rollout is deterministic.', 'name': 'generate_deter', 'type': <class 'int'>}, {'default': False, 'name': 'filter', 'type': <class 'bool'>}, {'default': 50, 'name': 'candidate_num', 'type': <class 'int'>}, {'default': 50, 'name': 'ensemble_size', 'type': <class 'int'>}, {'default': 10, 'name': 'ensemble_choosing_interval', 'type': <class 'int'>}, {'default': False, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_transition_function', 'type': <class 'bool'>}, {'default': 'auto', 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_nodes', 'type': <class 'list'>}, {'default': 100, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_net_num', 'type': <class 'int'>}, {'default': 0.05, 'description': 'Disturb network nodes during policy learning.', 'name': 'disturbing_weight', 'type': <class 'float'>}, {'default': True, 'name': 'critic_pretrain', 'type': <class 'bool'>}, {'default': 0, 'description': 'Reward uncertainty weight (MOPO).', 'name': 'reward_uncertainty_weight', 'search_mode': 'continuous', 'type': <class 'float'>}, {'default': 20, 'name': 'penalty_sample_num', 'type': <class 'int'>}, {'default': 'None', 'name': 'penalty_type', 'type': <class 'str'>}, {'default': 'auto', 'name': 'ts_conv_nodes', 'type': <class 'list'>}]¶
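The gamma, alpha, polyak and num_q_net parameters above correspond to the standard SAC update: an entropy-regularised soft Q target taken as the minimum over an ensemble of target Q networks, plus Polyak averaging of the target networks. The sketch below illustrates those two pieces under those assumptions; it is not SACOperator's actual train_batch.

```python
# Hedged sketch of the standard SAC target computation and Polyak averaging
# suggested by the `gamma`, `alpha`, `polyak` and `num_q_net` parameters above;
# the actual SACOperator.train_batch may differ.
import torch

@torch.no_grad()
def soft_q_target(reward, next_q_values, next_log_prob, done, gamma=0.99, alpha=0.2):
    # clipped double-Q style: take the minimum over the ensemble of target Q nets
    min_next_q = torch.min(torch.stack(next_q_values, dim=0), dim=0).values
    return reward + gamma * (1.0 - done) * (min_next_q - alpha * next_log_prob)

@torch.no_grad()
def polyak_update(target_net, net, polyak=0.99):
    # slowly track the online network: theta_targ <- polyak*theta_targ + (1-polyak)*theta
    for p_targ, p in zip(target_net.parameters(), net.parameters()):
        p_targ.mul_(polyak).add_((1.0 - polyak) * p)
```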
- model_creator(nodes)[source]¶
Create models including the platform policy and value net.
- Returns:
env model, platform policy, value net
- optimizer_creator(scope)[source]¶
- Returns:
generator optimizer including platform policy optimizer and value net optimizer
- data_creator()[source]¶
Create DataLoaders.
- Args:
- config:
configuration parameters
- Return:
(train_loader, val_loader)
- setup(*args, **kwargs)¶
- before_validate_epoch(*args, **kwargs)¶
- buffer_process(generated_data=None, buffer=None, model_index=None, expert_data=None, buffer_expert=None, expert_index=None, generate_buffer=True, expert_buffer=True)[source]¶
- train_batch(*args, **kwargs)¶