revive.algo.policy package

Submodules

revive.algo.policy.base module

revive.algo.policy.base.catch_error(func)[source]

Push the training error message to the data buffer.
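
The helper is used as a decorator on operator methods. A minimal sketch of the pattern such a decorator typically follows is shown below; the buffer attribute and its put-style interface are assumptions for illustration, not the actual revive internals:

    import functools
    import traceback

    def catch_error(func):
        # Sketch only: forward training exceptions to a data buffer so the
        # driver process can surface them, then re-raise.
        @functools.wraps(func)
        def wrapped(self, *args, **kwargs):
            try:
                return func(self, *args, **kwargs)
            except Exception:
                self._data_buffer.put(traceback.format_exc())  # hypothetical attribute
                raise
        return wrapped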

class revive.algo.policy.base.PolicyOperator(*args, **kwargs)[source]

Bases: object

property env
property policy
property val_policy
property other_models
PARAMETER_DESCRIPTION = []
classmethod get_parameters(command=None, **kargs)[source]
classmethod get_tune_parameters(config: Dict[str, Any], **kargs)[source]

Use ray.tune to wrap the parameters to be searched.
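
As a hypothetical illustration of what such wrapping can look like (the actual conversion is internal to the class), an entry of PARAMETER_DESCRIPTION marked with search_mode 'continuous' could be turned into a ray.tune search space roughly as follows:

    from ray import tune

    def to_search_space(parameter_description):
        # Illustrative sketch: continuously searched parameters become a
        # tune.uniform range over their search_values; others keep their default.
        space = {}
        for param in parameter_description:
            if param.get('search_mode') == 'continuous' and 'search_values' in param:
                low, high = param['search_values']
                space[param['name']] = tune.uniform(low, high)
            else:
                space[param['name']] = param['default']
        return space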

model_creator(config: Dict[str, Any], node: FunctionDecisionNode) List[Module][source]

Create all the models. The algorithm needs to define models for the nodes to be learned.

Args:
  • config – configuration parameters

Returns

a list of models

optimizer_creator(models: List[Module], config: Dict[str, Any]) List[Optimizer][source]

Define optimizers for the created models.

Args:
  • models – list of all the models

  • config – configuration parameters

Returns

a list of optimizers

data_creator(config: Dict[str, Any])[source]

Create DataLoaders.

Args:
  • config – configuration parameters

Returns

(train_loader, val_loader)
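
The three creator hooks above (model_creator, optimizer_creator and data_creator) together define a training operator. The sketch below shows how a custom subclass might implement them; the MLP layout, the config keys obs_dim and action_dim, and the random tensors are placeholders rather than part of the revive API:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    class MyPolicyOperator(PolicyOperator):
        # Sketch only: one MLP policy model, Adam optimizers and dummy datasets.

        def model_creator(self, config, node):
            policy = torch.nn.Sequential(
                torch.nn.Linear(config['obs_dim'], config['policy_hidden_features']),
                torch.nn.ReLU(),
                torch.nn.Linear(config['policy_hidden_features'], config['action_dim']),
            )
            return [policy]

        def optimizer_creator(self, models, config):
            return [torch.optim.Adam(m.parameters(), lr=config['g_lr']) for m in models]

        def data_creator(self, config):
            train = TensorDataset(torch.randn(1000, config['obs_dim']))
            val = TensorDataset(torch.randn(200, config['obs_dim']))
            return (DataLoader(train, batch_size=256, shuffle=True),
                    DataLoader(val, batch_size=256))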

get_ope_dataset()[source]

Convert the dataset to the OPEDataset format used in d3pe.

venv_test(expert_data: Batch, target_policy, traj_length=None, scope: str = 'trainPolicy_on_valEnv')[source]

Use the virtual environment model to test the policy model.

generate_rollout(expert_data: Batch, target_policy, env: Union[VirtualEnvDev, List[VirtualEnvDev]], traj_length: int, maintain_grad_flow: bool = False, deterministic: bool = True, clip: bool = False)[source]

Generate trajectories based on current policy.

Args:
  • expert_data – sampled data from the dataset

  • target_policy – the policy used to generate actions

  • env – the virtual environment(s) to roll out in

  • traj_length – length of the generated trajectories

  • maintain_grad_flow – whether to maintain the gradient flow through the rollout

Returns

batch trajectories
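
A hedged usage sketch, assuming the call is made from inside an operator method where self.env holds the learned virtual environment(s) and self.config the training configuration:

    # expert_data is a Batch sampled from the train loader.
    generated = self.generate_rollout(
        expert_data,
        self.policy,                  # current target policy
        self.env,                     # VirtualEnvDev or a list of them
        traj_length=self.config['ppo_rollout_horizon'],
        maintain_grad_flow=False,     # rollouts are used as data, not differentiated through
        deterministic=True,
    )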

before_train_epoch(*args, **kwargs)[source]
train_epoch(*args, **kwargs)[source]
validate(*args, **kwargs)[source]
train_batch(expert_data: Batch, batch_info: Dict[str, float], scope: str = 'train')[source]
validate_batch(expert_data: Batch, batch_info: Dict[str, float], scope: str = 'trainPolicy_on_valEnv')[source]
class revive.algo.policy.base.PolicyAlgorithm(algo: str, workspace: Optional[str] = None)[source]

Bases: object

get_train_func(config)[source]
get_trainer(config)[source]
get_trainable(config)[source]
get_parameters(command=None)[source]
get_tune_parameters(config)[source]

revive.algo.policy.ppo module

class revive.algo.policy.ppo.PPOOperator(*args, **kwargs)[source]

Bases: PolicyOperator

A class used to train the target policy.

PARAMETER_DESCRIPTION:
  • ppo_batch_size (pbs, int, default 256) – Batch size of the training process.
  • policy_bc_epoch (int, default 0)
  • ppo_epoch (bep, int, default 200) – Number of epochs for the training process.
  • ppo_rollout_horizon (prh, int, default 100) – Rollout length of the policy training.
  • policy_hidden_features (phf, int, default 256) – Number of neurons per layer of the policy network.
  • policy_hidden_layers (phl, int, default 4) – Depth of the policy network.
  • policy_backbone (pb, str, default 'mlp') – Backbone of the policy network.
  • value_hidden_features (vhf, int, default 256)
  • value_hidden_layers (vhl, int, default 4)
  • ppo_runs (int, default 2)
  • epsilon (float, default 0.2)
  • w_vl2 (float, default 0.001)
  • w_ent (float, default 0.0)
  • w_kl (float, default 1.0)
  • gae_gamma (float, default 0.99)
  • gae_lambda (float, default 0.95)
  • g_lr (float, default 4e-05, continuous search over [1e-06, 0.001]) – Initial learning rate of the training process.
  • reward_uncertainty_weight (float, default 0, continuous search) – Reward uncertainty weight (MOPO).
model_creator(config, nodes)[source]

Create models, including the target policy and the value net.

Returns

env models, target policy, value net

optimizer_creator(models, config)[source]
Returns

generator optimizers including target policy optimizers and value net optimizers

data_creator(config)[source]

Create DataLoaders.

Args:
  • config – configuration parameters

Returns

(train_loader, val_loader)

train_batch(*args, **kwargs)
bc_train_batch(*args, **kwargs)
ADV(reward, mask, value, gamma, lam, use_gae=True)[source]

Compute the advantage function for PPO.

Parameters
  • reward – rewards of each step

  • mask – 1 if the trajectory is done at this step, else 0

  • value – value for each state

  • gamma – discount factor

  • lam – GAE lambda

  • use_gae – whether to use GAE (True or False)

Returns

advantages and new value
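
A generic GAE computation consistent with the parameters described above is sketched below; it illustrates the technique rather than the library's exact implementation, and treats mask as a done flag as stated:

    import torch

    def gae_advantage(reward, mask, value, gamma=0.99, lam=0.95):
        # Generic GAE sketch over a single trajectory of length T.
        T = reward.shape[0]
        advantages = torch.zeros(T)
        gae, next_value = 0.0, 0.0
        for t in reversed(range(T)):
            not_done = 1.0 - mask[t]
            # TD residual; bootstrap only when the trajectory continues.
            delta = reward[t] + gamma * next_value * not_done - value[t]
            gae = delta + gamma * lam * not_done * gae
            advantages[t] = gae
            next_value = value[t]
        returns = advantages + value   # the "new value" targets
        return advantages, returns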

ppo_step(target_policy, value_net, generator_optimizer, states, actions, ret, advantages, action_log_probs, expert_data, value_net_states)[source]

Train target_policy and value_net with the PPO algorithm.

Parameters
  • target_policy – target policy

  • value_net – value net

  • generator_optimizer – the optimizers used to optimize the target policy and the value net

  • states – states fed to the target policy

  • actions – actions of the target policy

  • ret – GAE value

  • advantages – advantages

  • action_log_probs – action log probabilities of the target policy

  • expert_data – batch of expert data

  • value_net_states – states fed to the value net

Returns

v_loss, p_loss, total_loss
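
The losses follow the standard clipped-surrogate form. Below is a generic sketch consistent with the epsilon and w_ent hyperparameters listed above; the details of revive's actual ppo_step may differ:

    import torch
    import torch.nn.functional as F

    def ppo_losses(new_log_probs, old_log_probs, advantages, values, returns,
                   epsilon=0.2, w_ent=0.0, entropy=None):
        # Clipped-surrogate policy loss plus a value regression loss.
        ratio = torch.exp(new_log_probs - old_log_probs)
        surrogate = torch.min(ratio * advantages,
                              torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages)
        p_loss = -surrogate.mean()
        v_loss = F.mse_loss(values, returns)
        total_loss = p_loss + v_loss
        if entropy is not None:
            total_loss = total_loss - w_ent * entropy.mean()
        return v_loss, p_loss, total_loss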

revive.algo.policy.sac module

class revive.algo.policy.sac.ReplayBuffer(buffer_size)[source]

Bases: object

A simple FIFO experience replay buffer for SAC agents.

put(batch_data: Batch)[source]
__len__()[source]
sample(batch_size)[source]
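
A minimal sketch of a FIFO buffer with the same put / __len__ / sample interface; the real class stores revive Batch objects, whereas this illustration uses a plain Python list:

    import random

    class SimpleReplayBuffer:
        # FIFO sketch: the oldest entries are dropped once capacity is exceeded.
        def __init__(self, buffer_size):
            self.buffer_size = buffer_size
            self.storage = []

        def put(self, batch_data):
            self.storage.extend(batch_data)
            overflow = len(self.storage) - self.buffer_size
            if overflow > 0:
                self.storage = self.storage[overflow:]

        def __len__(self):
            return len(self.storage)

        def sample(self, batch_size):
            return random.sample(self.storage, batch_size)
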
class revive.algo.policy.sac.SACOperator(*args, **kwargs)[source]

Bases: PolicyOperator

A class used to train the platform policy.

PARAMETER_DESCRIPTION:
  • sac_batch_size (pbs, int, default 1024) – Batch size of the training process.
  • policy_bc_epoch (int, default 0)
  • sac_epoch (bep, int, default 200) – Number of epochs for the training process.
  • sac_steps_per_epoch (sspe, int, default 200) – Number of SAC update rounds in each epoch.
  • sac_rollout_horizon (srh, int, default 20)
  • policy_hidden_features (phf, int, default 256) – Number of neurons per layer of the policy network.
  • policy_hidden_layers (phl, int, default 4) – Depth of the policy network.
  • policy_backbone (pb, str, default 'mlp') – Backbone of the policy network.
  • policy_hidden_activation (pha, str, default 'leakyrelu') – Hidden activation of the policy network.
  • q_hidden_features (qhf, int, default 256)
  • q_hidden_layers (qhl, int, default 2)
  • num_q_net (nqn, int, default 4)
  • buffer_size (bfs, int, default 1000000) – Size of the buffer to store data.
  • w_kl (float, default 1.0)
  • gamma (float, default 0.99)
  • alpha (float, default 0.2)
  • polyak (float, default 0.99)
  • batch_ratio (float, default 1)
  • g_lr (float, default 4e-05, continuous search over [1e-06, 0.001]) – Initial learning rate of the training process.
  • interval (int, default 0) – Interval step for index removing.
  • generate_deter (int, default 0) – Whether the generator rollout is deterministic.
  • reward_uncertainty_weight (float, default 0, continuous search) – Reward uncertainty weight (MOPO).
property policy
property val_policy
model_creator(config, nodes)[source]

Create models, including the platform policy and the value net.

Returns

env model, platform policy, value net

optimizer_creator(models, config)[source]
Returns

generator optimizer including platform policy optimizer and value net optimizer

data_creator(config)[source]

Create DataLoaders.

Args:
  • config – configuration parameters

Returns

(train_loader, val_loader)

setup(*args, **kwargs)
train_batch(*args, **kwargs)
bc_train_batch(*args, **kwargs)
sac(buffer, buffer_real, target_policy, q_net, target_q_net, gamma=0.99, alpha=0.2, polyak=0.99, actor_optimizer=None, critic_optimizer=None)[source]
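
A generic SAC update step is sketched below under standard assumptions: the policy returns an action together with its log-probability, the Q networks take (state, action) pairs, and the minimum over the Q ensemble is used. The actual SACOperator.sac additionally mixes samples from the rollout buffer and the real-data buffer:

    import torch

    def sac_update(obs, act, rew, next_obs, done, policy, q_nets, target_q_nets,
                   actor_optimizer, critic_optimizer, gamma=0.99, alpha=0.2, polyak=0.99):
        # Critic update: regress every Q network toward the soft Bellman backup.
        with torch.no_grad():
            next_act, next_logp = policy(next_obs)
            target_q = torch.stack([q(next_obs, next_act) for q in target_q_nets]).min(dim=0).values
            backup = rew + gamma * (1 - done) * (target_q - alpha * next_logp)
        q_loss = sum(((q(obs, act) - backup) ** 2).mean() for q in q_nets)
        critic_optimizer.zero_grad()
        q_loss.backward()
        critic_optimizer.step()

        # Actor update: maximize the entropy-regularized Q value. Gradients that
        # also reach the Q networks here are cleared by the next critic zero_grad.
        new_act, logp = policy(obs)
        q_pi = torch.stack([q(obs, new_act) for q in q_nets]).min(dim=0).values
        actor_loss = (alpha * logp - q_pi).mean()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        # Polyak averaging of the target Q networks.
        with torch.no_grad():
            for q, q_targ in zip(q_nets, target_q_nets):
                for p, p_targ in zip(q.parameters(), q_targ.parameters()):
                    p_targ.mul_(polyak).add_((1 - polyak) * p)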

Module contents