Controlling Mujoco-HalfCheetah using Revive_filter
We can use the revive_filter algorithm integrated in the REVIVE SDK for environment learning. In short, revive_filter learns a generalizable dynamics reward model from offline data. This dynamics reward model can then be used as a transition filter to obtain more reliable environment transitions: when generating a transition, the environment model first produces a batch of candidate transitions, and the dynamics reward model then selects the most reliable transition from this candidate set as the final result. Note that, depending on the parameter settings, this algorithm increases memory and GPU memory overhead to varying degrees and slows down training. Its performance is also sensitive to a few key hyperparameters, which are explained in detail at the end of this article.
The revive_filter environment learning algorithm produces two types of models: an environment model and a discriminator model. The environment model predicts the next state, while the discriminator model judges whether the state predicted by the environment model is reasonable. Next, we demonstrate how to apply this algorithm with the REVIVE SDK to the task of controlling Mujoco-HalfCheetah motion and ultimately learn a better control policy.
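Conceptually, the transition filtering works as follows: the environment model proposes a batch of candidate next states, every discriminator (dynamics reward) model scores each candidate, and the candidate judged most reliable is kept. The sketch below illustrates this selection step only; the helper names (env_model_sample, scorers, filtered_transition) are hypothetical and not part of the REVIVE SDK API.

import numpy as np

def filtered_transition(env_model_sample, scorers, obs, action, candidate_num=20):
    # 1. The environment model proposes a candidate set of next states.
    candidates = env_model_sample(obs, action, candidate_num)      # (candidate_num, obs_dim)

    # 2. Each discriminator / dynamics reward model scores every candidate.
    scores = np.stack([
        np.array([scorer(obs, action, cand) for cand in candidates])
        for scorer in scorers
    ])                                                             # (n_scorers, candidate_num)

    # 3. Keep the candidate judged most reliable on average.
    return candidates[scores.mean(axis=0).argmax()]

# Tiny demo with toy stand-ins for the learned models (HalfCheetah: 17-dim obs, 6-dim action).
toy_env = lambda obs, act, n: obs + 0.01 * np.random.randn(n, obs.shape[-1])
toy_scorers = [lambda o, a, s: -np.abs(s - o).sum() for _ in range(3)]
print(filtered_transition(toy_env, toy_scorers, np.zeros(17), np.zeros(6)).shape)   # (17,)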
In this example, the decision flow graph changes and involves heterogeneous decision flow functions. This means that the decision flow graph used for environment learning differs from the one used for policy learning: during environment learning the action node is not learned, while during policy learning the action node must be learned. Below are the decision flow graphs for training the environment model and the policy model.
Decision Graph for Environment Training:
metadata:
  columns:
    ...
  graph:
    delta_x:
      - obs
      - action
    delta_obs:
      - obs
      - action
    next_obs:
      - obs
      - delta_obs
  expert_functions:
    next_obs:
      'node_function': 'delta_obs.get_next_obs'
Decision Graph for Policy Training:
metadata:
  columns:
    ...
  graph:
    action:
      - obs
    delta_x:
      - obs
      - action
    delta_obs:
      - obs
      - action
    next_obs:
      - obs
      - delta_obs
  expert_functions:
    next_obs:
      'node_function': 'delta_obs.get_next_obs'
The implementation of the expert function get_next_obs is as follows:
import torch
import numpy as np

def get_next_obs(data):
    obs = data["obs"]
    delta_obs = data["delta_obs"]
    # Promote single transitions to a batch of size 1.
    if len(obs.shape) == 1:
        obs = obs.reshape(1, -1)
        delta_obs = delta_obs.reshape(1, -1)
    # The next observation is the current observation plus the predicted delta.
    next_obs = obs + delta_obs
    # Restore the original (unbatched) shape if needed.
    if len(data["obs"].shape) == 1:
        next_obs = next_obs.reshape(-1)
    return next_obs
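Continuing from the definition above, a quick sanity check (not part of the SDK) shows that the function handles both a single 17-dimensional HalfCheetah observation and a batch:

obs = np.zeros(17, dtype=np.float32)         # single HalfCheetah observation
delta_obs = np.ones(17, dtype=np.float32)    # predicted change in observation
print(get_next_obs({"obs": obs, "delta_obs": delta_obs}).shape)              # (17,)

batch = {"obs": np.zeros((32, 17)), "delta_obs": np.ones((32, 17))}
print(get_next_obs(batch).shape)                                             # (32, 17)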
This expert function is placed in the delta_obs.py file. In addition, to enable this algorithm, we need to make the following configuration in config.json:
{
    ...
    "venv_algo_config": {
        "revive_f": [
            {
                "name": "bc_epoch",
                "type": "int",
                "default": 1500
            },
            {
                "name": "revive_epoch",
                "description": "Number of epochs for the training process",
                "abbreviation": "mep",
                "type": "int",
                "default": 1500
            },
            {
                "name": "bc_lr",
                "type": "float",
                "default": 1e-3
            },
            {
                "name": "bc_steps",
                "type": "int",
                "default": 10
            },
            {
                "name": "matcher_record_len",
                "type": "int",
                "default": 50
            },
            {
                "name": "fix_std",
                "type": "float",
                "default": 0.125
            },
            ...
        ],
        ...
    },
    "policy_algo_config": {
        "sac": [
            {
                "name": "penalty_type",
                "type": "str",
                "default": "filter"
            },
            {
                "name": "penalty_sample_num",
                "type": "int",
                "default": 50
            },
            {
                "name": "reward_uncertainty_weight",
                "type": "float",
                "default": 0.75
            },
            {
                "name": "ensemble_choosing_interval",
                "type": "int",
                "default": 5
            },
            {
                "name": "ensemble_size",
                "type": "int",
                "default": 10
            },
            {
                "name": "candidate_num",
                "type": "int",
                "default": 20
            },
            {
                "name": "filter",
                "type": "bool",
                "default": false
            },
            ...
        ]
    }
}
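A quick way to see which hyperparameters and defaults are active is to inspect the file programmatically. The snippet below is a minimal sketch that assumes a complete config.json (without the ... ellipses shown above) in the working directory:

import json

with open("config.json") as f:
    config = json.load(f)

# Print the hyperparameter defaults for environment learning (revive_f) and policy learning (sac).
for section, algo in [("venv_algo_config", "revive_f"), ("policy_algo_config", "sac")]:
    print(f"--- {algo} ---")
    for param in config[section][algo]:
        print(param["name"], "=", param["default"])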
In this example, we first train the environment model with the revive_f algorithm, and then train the policy model with the sac algorithm. Some parameters need to be noted:
During environment training:

- fix_std controls the standard deviation of the output in adversarial venv learning (a small sketch follows this list).
- matcher_record_len controls the number of discriminator models saved.
- revive_epoch controls the number of training epochs for adversarial training.
- bc_epoch controls the number of training epochs for behavior cloning.
- bc_lr controls the learning rate for behavior cloning.
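To make fix_std concrete: during adversarial venv learning, the stochastic node outputs are sampled with a fixed standard deviation rather than a learned one. The following is a minimal, illustrative sketch (not the SDK's internal code) of such an output head:

import torch

fix_std = 0.125                                   # matches the "fix_std" default above

def sample_node_output(predicted_mean: torch.Tensor) -> torch.Tensor:
    # The network predicts only the mean; the standard deviation is held fixed,
    # which bounds how noisy the learned transition model can be.
    dist = torch.distributions.Normal(predicted_mean, fix_std)
    return dist.rsample()

print(sample_node_output(torch.zeros(3)))         # three samples around 0 with std 0.125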
During policy training:

- penalty_type controls the type of reward penalty.
- penalty_sample_num controls the number of samples used to compute the reward penalty.
- reward_uncertainty_weight controls the weight of the reward penalty (a rough sketch of the penalty follows this list).
- ensemble_choosing_interval controls the interval for choosing discriminator models.
- ensemble_size controls the number of discriminator models.
- candidate_num controls the number of candidate samples drawn from the environment model output; this parameter only takes effect when filter is true.
- filter controls whether to use discriminator-model-assisted output. Note that if this parameter is true, training will be slower.
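As a rough illustration of how penalty_sample_num and reward_uncertainty_weight interact (the function below uses hypothetical names and may differ from the SDK's actual penalty computation), the reward passed to sac is reduced by a term proportional to the disagreement among several sampled model predictions:

import numpy as np

def penalized_reward(reward, sampled_next_obs, reward_uncertainty_weight=0.75):
    # sampled_next_obs: (penalty_sample_num, obs_dim) predictions for the same (obs, action).
    # Use the spread of the sampled predictions as an uncertainty estimate.
    uncertainty = sampled_next_obs.std(axis=0).mean()
    return reward - reward_uncertainty_weight * uncertainty

# Example: 50 sampled predictions (matching penalty_sample_num) for a 17-dim observation.
samples = np.random.randn(50, 17) * 0.01
print(penalized_reward(1.0, samples))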
After training the policy, we deploy it in the environment for online testing. Over 50 trajectories it achieves an average return of around 8500, which is roughly a 20% improvement over the revive_p algorithm's result of about 7000.
Note
By default, for efficiency, discriminator-model-assisted output is not used (i.e., filter defaults to false).
For details of the algorithm, please refer to the article: Reward-Consistent Dynamics Models are Strongly Generalizable for Offline Reinforcement Learning.