Controlling Mujoco-HalfCheetah using Revive_filter

We can use the revive_filter algorithm integrated in the REVIVE SDK for environment learning. In short, revive_filter learns a generalizable dynamics reward model from the offline data. This dynamics reward model is then used as a transition filter to obtain more reliable environment transitions: when generating a transition, the environment model first produces a batch of transitions as a candidate set, and the dynamics reward model then selects the most reliable transition from the candidate set as the final result. Note that, depending on the parameter settings, this algorithm increases memory and GPU memory overhead to varying degrees and slows down training. Its performance is also sensitive to a few key hyperparameters, which are explained in detail at the end of this article.
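
Conceptually, the filtering step works as sketched below. This is a minimal illustration of the idea rather than the SDK's actual implementation; the env_models and matcher objects and their predict/score methods are hypothetical placeholders.

import numpy as np


def filtered_transition(env_models, matcher, obs, action, candidate_num=20):
    """Sketch of discriminator-assisted transition selection.

    ``env_models`` (an ensemble of learned transition models) and
    ``matcher`` (the dynamics reward / discriminator model) are hypothetical
    stand-ins, not actual REVIVE SDK objects.
    """
    # 1. Generate a candidate set of possible next states by sampling
    #    repeatedly from the (stochastic) environment models.
    models = [env_models[i % len(env_models)] for i in range(candidate_num)]
    candidates = [m.predict(obs, action) for m in models]

    # 2. Score each candidate with the dynamics reward model; a higher score
    #    means the transition looks more consistent with the offline data.
    scores = np.array([matcher.score(obs, action, nxt) for nxt in candidates])

    # 3. Keep the highest-scoring candidate as the final transition.
    return candidates[int(np.argmax(scores))]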

The revive_filter environment learning algorithm produces two types of models: an environment model and a discriminator model. The environment model predicts the next state, while the discriminator model judges whether the state predicted by the environment model is reasonable. Next, we demonstrate how to apply this algorithm with the REVIVE SDK to the task of controlling Mujoco-HalfCheetah, ultimately learning an improved control policy.

In this example, the decision flow changes and involves heterogeneous decision graphs: the decision graph used for environment learning differs from the one used for policy learning. The difference is that during environment learning the action node is not learned, while during policy learning it is. The decision graphs for training the environment model and the policy model are shown below.

Decision Graph for Environment Training:

metadata:
  columns:
  ...
  graph:
    delta_x:
    - obs
    - action
    delta_obs:
    - obs
    - action
    next_obs:
    - obs
    - delta_obs

  expert_functions:
    next_obs:
        'node_function' : 'delta_obs.get_next_obs'

Decision Graph for Policy Training:

metadata:
  columns:
  ...
  graph:
    action:
    - obs
    delta_x:
    - obs
    - action
    delta_obs:
    - obs
    - action
    next_obs:
    - obs
    - delta_obs

  expert_functions:
    next_obs:
        'node_function' : 'delta_obs.get_next_obs'

The implementation of the expert function get_next_obs is as follows:

import torch
import numpy as np


def get_next_obs(data):
    # "data" maps node names to arrays; values may be NumPy arrays or
    # torch tensors depending on where the function is invoked.
    obs = data["obs"]
    delta_obs = data["delta_obs"]

    # Promote a single transition (1-D input) to a batch of size 1.
    if len(obs.shape) == 1:
        obs = obs.reshape(1, -1)
        delta_obs = delta_obs.reshape(1, -1)

    # The next observation is the current observation plus the predicted delta.
    next_obs = obs + delta_obs

    # Restore the original (un-batched) shape if the input was 1-D.
    if len(data["obs"].shape) == 1:
        next_obs = next_obs.reshape(-1)

    return next_obs

This expert function is placed in the delta_obs.py file.
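
As a quick sanity check, get_next_obs can be called directly with NumPy arrays (at runtime the SDK may instead pass torch tensors; element-wise addition works the same way). The arrays below are made-up example data.

import numpy as np

from delta_obs import get_next_obs

# Batched input: shape (batch_size, obs_dim).
batch = {
    "obs": np.array([[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]),
    "delta_obs": np.array([[0.1, -0.1, 0.0], [0.5, 0.5, 0.5]]),
}
print(get_next_obs(batch))   # each row is obs + delta_obs, shape (2, 3)

# Single transition: a 1-D input is batched internally and returned as 1-D.
single = {
    "obs": np.array([0.0, 1.0, 2.0]),
    "delta_obs": np.array([0.1, -0.1, 0.0]),
}
print(get_next_obs(single))  # obs + delta_obs, shape (3,)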

In addition, to enable this feature, we need to add the following configuration in config.json:

{
   ...
   "venv_algo_config": {
       "revive_f": [
          {
              "name": "bc_epoch",
              "type": "int",
              "default": 1500
          },
          {
              "name": "revive_epoch",
              "description": "Number of epcoh for the training process",
              "abbreviation": "mep",
              "type": "int",
              "default": 1500
           },
           {
               "name": "bc_lr",
               "type": "float",
               "default": 1e-3
           },
           {
               "name": "bc_steps",
               "type": "int",
               "default": 10
           },
           {
               "name": "matcher_record_len",
               "type": "int",
               "default": 50
           },
           {
               "name": "fix_std",
               "type": "float",
               "default": 0.125
           },
           ...
       ],
       ...
   },
   "venv_algo_config":{
       "sac":{
           {
               "name": "penalty_type",
               "type": "str",
               "default": "filter"
           },
           {
               "name": "penalty_sample_num",
               "type": "int",
               "default": 50
           },
           {
               "name": "reward_uncertainty_weight",
               "type": "float",
               "default": 0.75
           },
            {
               "name": "ensemble_choosing_interval",
               "type": "int",
               "default": 5
           },
           {
               "name": "ensemble_size",
               "type": "int",
               "default": 10
           },
           {
               "name": "candidate_num",
               "type": "int",
               "default": 20
           },
           {
               "name": "filter",
               "type": "bool",
               "default": false
           },
            ...
        ],
        ...
    }
}

In this example, we first train the environment model with the revive_f algorithm and then train the policy model with the sac algorithm. A few parameters deserve attention:

During environment training:

  1. fix_std controls the (fixed) standard deviation of the environment model's output during adversarial venv learning (see the sketch after this list).

  2. matcher_record_len controls the number of discriminator models saved.

  3. revive_epoch controls the number of training epochs for adversarial training.

  4. bc_epoch controls the number of training epochs for behavior cloning.

  5. bc_lr controls the learning rate for behavior cloning.
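
For intuition, fix_std corresponds to giving the environment model's stochastic output a constant standard deviation instead of a learned one. The toy model below is only a sketch of that idea; it is a hypothetical class, not the SDK's network.

import torch
import torch.nn as nn


class FixedStdTransitionModel(nn.Module):
    """Toy transition model whose output distribution uses a fixed std."""

    def __init__(self, obs_dim, action_dim, fix_std=0.125):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim),
        )
        # fix_std plays the role of the config entry of the same name:
        # the output std is held constant instead of being learned.
        self.fix_std = fix_std

    def forward(self, obs, action):
        mean = self.net(torch.cat([obs, action], dim=-1))
        return torch.distributions.Normal(mean, self.fix_std)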

During policy training:

  1. penalty_type controls the type of reward penalty.

  2. penalty_sample_num controls the number of samples for computing reward penalty.

  3. reward_uncertainty_weight controls the weight of the reward penalty (a conceptual sketch follows this list).

  4. ensemble_choosing_interval controls the interval for choosing discriminator models.

  5. ensemble_size controls the number of discriminator models.

  6. candidate_num controls the number of candidate samples drawn from the environment model's output. This parameter only takes effect when filter is true.

  7. filter controls whether to use discriminator-assisted output. Note that if this parameter is true, training will be slower.
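
The penalty-related parameters can be understood with the following sketch: the discriminator ensemble scores a set of sampled transitions, the disagreement among those scores is treated as uncertainty, and the reward is penalized accordingly. The function and the shape of matcher_scores are assumptions for illustration, not the SDK's exact computation.

import torch


def penalized_reward(reward, matcher_scores, reward_uncertainty_weight=0.75):
    """Sketch of an uncertainty-based reward penalty.

    ``matcher_scores`` is assumed to have shape
    (penalty_sample_num, batch_size): scores from the discriminator ensemble
    for repeated samples of the same transitions.
    """
    # Disagreement among the sampled scores is used as an uncertainty measure...
    uncertainty = matcher_scores.std(dim=0)
    # ...and subtracted from the reward, scaled by reward_uncertainty_weight.
    return reward - reward_uncertainty_weight * uncertainty


rewards = torch.ones(4)
scores = torch.randn(50, 4)          # e.g. penalty_sample_num = 50
print(penalized_reward(rewards, scores))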

After training the policy, we deploy it in the environment for online testing. Over 50 trajectories, it achieves an average return of around 8500, roughly a 20% improvement over the revive_p algorithm, which scored about 7000.

Note

By default, discriminator-assisted output is disabled for efficiency.

For details of the algorithm, please refer to the paper: Reward-Consistent Dynamics Models are Strongly Generalizable for Offline Reinforcement Learning.