revive.algo.policy package

Submodules

revive.algo.policy.base module

revive.algo.policy.base.catch_error(func)[source]

Push the training error message to the data buffer.
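
The helper is used as a decorator on operator methods. A minimal sketch of the pattern such a decorator typically follows is shown below; the buffer attribute and its put-style interface are assumptions for illustration, not the actual revive internals:

    import functools
    import traceback

    def catch_error(func):
        # Sketch only: forward training exceptions to a data buffer so the
        # driver process can surface them, then re-raise.
        @functools.wraps(func)
        def wrapped(self, *args, **kwargs):
            try:
                return func(self, *args, **kwargs)
            except Exception:
                self._data_buffer.put(traceback.format_exc())  # hypothetical attribute
                raise
        return wrapped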

class revive.algo.policy.base.PolicyOperator(*args, **kwargs)[source]

Bases: object

property env
property policy
property val_policy
property other_models
PARAMETER_DESCRIPTION = []
classmethod get_parameters(command=None, **kargs)[source]
classmethod get_tune_parameters(config: Dict[str, Any], **kargs)[source]

Use ray.tune to wrap the parameters to be searched.
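
As a hypothetical illustration of what such wrapping can look like (the actual conversion is internal to the class), an entry of PARAMETER_DESCRIPTION marked with search_mode 'continuous' could be turned into a ray.tune search space roughly as follows:

    from ray import tune

    def to_search_space(parameter_description):
        # Illustrative sketch: continuously searched parameters become a
        # tune.uniform range over their search_values; others keep their default.
        space = {}
        for param in parameter_description:
            if param.get('search_mode') == 'continuous' and 'search_values' in param:
                low, high = param['search_values']
                space[param['name']] = tune.uniform(low, high)
            else:
                space[param['name']] = param['default']
        return space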

model_creator(config: Dict[str, Any], node: FunctionDecisionNode) List[Module][source]

Create all the models. The algorithm needs to define models for the nodes to be learned.

Args:
  • config – configuration parameters

Returns

a list of models

optimizer_creator(models: List[Module], config: Dict[str, Any]) List[Optimizer][source]

Define optimizers for the created models.

Args:
  • models – list of all the models

  • config – configuration parameters

Returns

a list of optimizers

data_creator(config: Dict[str, Any])[source]

Create DataLoaders.

Args:
  • config – configuration parameters

Returns

(train_loader, val_loader)
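
The three creator hooks above (model_creator, optimizer_creator and data_creator) together define a training operator. The sketch below shows how a custom subclass might implement them; the MLP layout, the config keys obs_dim and action_dim, and the random tensors are placeholders rather than part of the revive API:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    class MyPolicyOperator(PolicyOperator):
        # Sketch only: one MLP policy model, Adam optimizers and dummy datasets.

        def model_creator(self, config, node):
            policy = torch.nn.Sequential(
                torch.nn.Linear(config['obs_dim'], config['policy_hidden_features']),
                torch.nn.ReLU(),
                torch.nn.Linear(config['policy_hidden_features'], config['action_dim']),
            )
            return [policy]

        def optimizer_creator(self, models, config):
            return [torch.optim.Adam(m.parameters(), lr=config['g_lr']) for m in models]

        def data_creator(self, config):
            train = TensorDataset(torch.randn(1000, config['obs_dim']))
            val = TensorDataset(torch.randn(200, config['obs_dim']))
            return (DataLoader(train, batch_size=256, shuffle=True),
                    DataLoader(val, batch_size=256))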

get_ope_dataset()[source]

Convert the dataset to the OPEDataset format used in d3pe.

venv_test(expert_data: Batch, target_policy, traj_length=None, scope: str = 'trainPolicy_on_valEnv')[source]

Use the virtual environment model to test the policy model.

generate_rollout(expert_data: Batch, target_policy, env: Union[VirtualEnvDev, List[VirtualEnvDev]], traj_length: int, maintain_grad_flow: bool = False, deterministic: bool = True, clip: bool = False)[source]

Generate trajectories based on current policy.

Args:
  • expert_data – sampled data from the dataset

  • target_policy – the policy used to generate actions

  • env – the virtual environment(s) to roll out in

  • traj_length – length of the generated trajectories

  • maintain_grad_flow – whether to maintain the gradient flow through the rollout

Returns

batch trajectories
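
A hedged usage sketch, assuming the call is made from inside an operator method where self.env holds the learned virtual environment(s) and self.config the training configuration:

    # expert_data is a Batch sampled from the train loader.
    generated = self.generate_rollout(
        expert_data,
        self.policy,                  # current target policy
        self.env,                     # VirtualEnvDev or a list of them
        traj_length=self.config['ppo_rollout_horizon'],
        maintain_grad_flow=False,     # rollouts are used as data, not differentiated through
        deterministic=True,
    )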

before_train_epoch(*args, **kwargs)[source]
train_epoch(*args, **kwargs)[source]
validate(*args, **kwargs)[source]
train_batch(expert_data: Batch, batch_info: Dict[str, float], scope: str = 'train')[source]
validate_batch(expert_data: Batch, batch_info: Dict[str, float], scope: str = 'trainPolicy_on_valEnv')[source]
class revive.algo.policy.base.PolicyAlgorithm(algo: str, workspace: Optional[str] = None)[source]

Bases: object

get_train_func(config)[source]
get_trainer(config)[source]
get_trainable(config)[source]
get_parameters(command=None)[source]
get_tune_parameters(config)[source]

revive.algo.policy.ppo module

class revive.algo.policy.ppo.PPOOperator(*args, **kwargs)[source]

Bases: PolicyOperator

A class used to train the target policy.

PARAMETER_DESCRIPTION:
  • ppo_batch_size (pbs, int, default 256) – Batch size of the training process.
  • policy_bc_epoch (int, default 0)
  • ppo_epoch (bep, int, default 200) – Number of epochs for the training process.
  • ppo_rollout_horizon (prh, int, default 100) – Rollout length of the policy training.
  • policy_hidden_features (phf, int, default 256) – Number of neurons per layer of the policy network.
  • policy_hidden_layers (phl, int, default 4) – Depth of the policy network.
  • policy_backbone (pb, str, default 'mlp') – Backbone of the policy network.
  • value_hidden_features (vhf, int, default 256)
  • value_hidden_layers (vhl, int, default 4)
  • ppo_runs (int, default 2)
  • epsilon (float, default 0.2)
  • w_vl2 (float, default 0.001)
  • w_ent (float, default 0.0)
  • w_kl (float, default 1.0)
  • gae_gamma (float, default 0.99)
  • gae_lambda (float, default 0.95)
  • g_lr (float, default 4e-05, continuous search over [1e-06, 0.001]) – Initial learning rate of the training process.
  • reward_uncertainty_weight (float, default 0, continuous search) – Reward uncertainty weight (MOPO).
model_creator(config, nodes)[source]

Create models, including the target policy and the value net.

Returns

env models, target policy, value net

optimizer_creator(models, config)[source]
Returns

generator optimizers including target policy optimizers and value net optimizers

data_creator(config)[source]

Create DataLoaders.

Args:
  • config – configuration parameters

Returns

(train_loader, val_loader)

train_batch(*args, **kwargs)
bc_train_batch(*args, **kwargs)
ADV(reward, mask, value, gamma, lam, use_gae=True)[source]

Compute the advantage function for PPO.

Parameters
  • reward – rewards of each step

  • mask – 1 if the trajectory is done at this step, else 0

  • value – value for each state

  • gamma – discount factor

  • lam – GAE lambda

  • use_gae – whether to use GAE (True or False)

Returns

advantages and new value
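
A generic GAE computation consistent with the parameters described above is sketched below; it illustrates the technique rather than the library's exact implementation, and treats mask as a done flag as stated:

    import torch

    def gae_advantage(reward, mask, value, gamma=0.99, lam=0.95):
        # Generic GAE sketch over a single trajectory of length T.
        T = reward.shape[0]
        advantages = torch.zeros(T)
        gae, next_value = 0.0, 0.0
        for t in reversed(range(T)):
            not_done = 1.0 - mask[t]
            # TD residual; bootstrap only when the trajectory continues.
            delta = reward[t] + gamma * next_value * not_done - value[t]
            gae = delta + gamma * lam * not_done * gae
            advantages[t] = gae
            next_value = value[t]
        returns = advantages + value   # the "new value" targets
        return advantages, returns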

ppo_step(target_policy, value_net, generator_optimizer, states, actions, ret, advantages, action_log_probs, expert_data, value_net_states)[source]

Train target_policy and value_net with the PPO algorithm.

Parameters
  • target_policy – target policy

  • value_net – value net

  • generator_optimizer – the optimizers used to optimize the target policy and the value net

  • states – states fed to the target policy

  • actions – actions of the target policy

  • ret – GAE value

  • advantages – advantages

  • action_log_probs – action log probabilities of the target policy

  • expert_data – batch of expert data

  • value_net_states – states fed to the value net

Returns

v_loss, p_loss, total_loss
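
The losses follow the standard clipped-surrogate form. Below is a generic sketch consistent with the epsilon and w_ent hyperparameters listed above; the details of revive's actual ppo_step may differ:

    import torch
    import torch.nn.functional as F

    def ppo_losses(new_log_probs, old_log_probs, advantages, values, returns,
                   epsilon=0.2, w_ent=0.0, entropy=None):
        # Clipped-surrogate policy loss plus a value regression loss.
        ratio = torch.exp(new_log_probs - old_log_probs)
        surrogate = torch.min(ratio * advantages,
                              torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages)
        p_loss = -surrogate.mean()
        v_loss = F.mse_loss(values, returns)
        total_loss = p_loss + v_loss
        if entropy is not None:
            total_loss = total_loss - w_ent * entropy.mean()
        return v_loss, p_loss, total_loss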

revive.algo.policy.sac module

class revive.algo.policy.sac.ReplayBuffer(buffer_size)[source]

Bases: object

A simple FIFO experience replay buffer for SAC agents.

put(batch_data: Batch)[source]
__len__()[source]
sample(batch_size)[source]
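
A minimal sketch of a FIFO buffer with the same put / __len__ / sample interface; the real class stores revive Batch objects, whereas this illustration uses a plain Python list:

    import random

    class SimpleReplayBuffer:
        # FIFO sketch: the oldest entries are dropped once capacity is exceeded.
        def __init__(self, buffer_size):
            self.buffer_size = buffer_size
            self.storage = []

        def put(self, batch_data):
            self.storage.extend(batch_data)
            overflow = len(self.storage) - self.buffer_size
            if overflow > 0:
                self.storage = self.storage[overflow:]

        def __len__(self):
            return len(self.storage)

        def sample(self, batch_size):
            return random.sample(self.storage, batch_size)
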
class revive.algo.policy.sac.SACOperator(*args, **kwargs)[source]

Bases: PolicyOperator

A class used to train the platform policy.

PARAMETER_DESCRIPTION:
  • sac_batch_size (pbs, int, default 1024) – Batch size of the training process.
  • policy_bc_epoch (int, default 0)
  • sac_epoch (bep, int, default 200) – Number of epochs for the training process.
  • sac_steps_per_epoch (sspe, int, default 200) – Number of SAC update rounds in each epoch.
  • sac_rollout_horizon (srh, int, default 20)
  • policy_hidden_features (phf, int, default 256) – Number of neurons per layer of the policy network.
  • policy_hidden_layers (phl, int, default 4) – Depth of the policy network.
  • policy_backbone (pb, str, default 'mlp') – Backbone of the policy network.
  • policy_hidden_activation (pha, str, default 'leakyrelu') – Hidden activation of the policy network.
  • q_hidden_features (qhf, int, default 256)
  • q_hidden_layers (qhl, int, default 2)
  • num_q_net (nqn, int, default 4)
  • buffer_size (bfs, int, default 1000000) – Size of the buffer to store data.
  • w_kl (float, default 1.0)
  • gamma (float, default 0.99)
  • alpha (float, default 0.2)
  • polyak (float, default 0.99)
  • batch_ratio (float, default 1)
  • g_lr (float, default 4e-05, continuous search over [1e-06, 0.001]) – Initial learning rate of the training process.
  • interval (int, default 0) – Interval step for index removing.
  • generate_deter (int, default 0) – Whether the generator rollout is deterministic.
  • reward_uncertainty_weight (float, default 0, continuous search) – Reward uncertainty weight (MOPO).
property policy
property val_policy
model_creator(config, nodes)[source]

Create models, including the platform policy and the value net.

Returns

env model, platform policy, value net

optimizer_creator(models, config)[source]
Returns

generator optimizer including platform policy optimizer and value net optimizer

data_creator(config)[source]

Create DataLoaders.

Args:
  • config – configuration parameters

Returns

(train_loader, val_loader)

setup(*args, **kwargs)
train_batch(*args, **kwargs)
bc_train_batch(*args, **kwargs)
sac(buffer, buffer_real, target_policy, q_net, target_q_net, gamma=0.99, alpha=0.2, polyak=0.99, actor_optimizer=None, critic_optimizer=None)[source]
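
A generic SAC update step is sketched below under standard assumptions: the policy returns an action together with its log-probability, the Q networks take (state, action) pairs, and the minimum over the Q ensemble is used. The actual SACOperator.sac additionally mixes samples from the rollout buffer and the real-data buffer:

    import torch

    def sac_update(obs, act, rew, next_obs, done, policy, q_nets, target_q_nets,
                   actor_optimizer, critic_optimizer, gamma=0.99, alpha=0.2, polyak=0.99):
        # Critic update: regress every Q network toward the soft Bellman backup.
        with torch.no_grad():
            next_act, next_logp = policy(next_obs)
            target_q = torch.stack([q(next_obs, next_act) for q in target_q_nets]).min(dim=0).values
            backup = rew + gamma * (1 - done) * (target_q - alpha * next_logp)
        q_loss = sum(((q(obs, act) - backup) ** 2).mean() for q in q_nets)
        critic_optimizer.zero_grad()
        q_loss.backward()
        critic_optimizer.step()

        # Actor update: maximize the entropy-regularized Q value. Gradients that
        # also reach the Q networks here are cleared by the next critic zero_grad.
        new_act, logp = policy(obs)
        q_pi = torch.stack([q(obs, new_act) for q in q_nets]).min(dim=0).values
        actor_loss = (alpha * logp - q_pi).mean()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        # Polyak averaging of the target Q networks.
        with torch.no_grad():
            for q, q_targ in zip(q_nets, target_q_nets):
                for p, p_targ in zip(q.parameters(), q_targ.parameters()):
                    p_targ.mul_(polyak).add_((1 - polyak) * p)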

Module contents