Use REVIVE to control industrial machines


Description of the Industrial Machine Control Task

The Industrial Benchmark (IB) is a reinforcement learning benchmark environment designed to simulate the characteristics of various industrial control tasks, such as wind or gas turbines and chemical reactors. It covers many problems commonly found in real-world industrial settings, such as high-dimensional continuous state and action spaces, delayed rewards, complex noise patterns, and multiple response targets with high stochasticity. We augment the original IB environment by adding two dimensions of the system state to the observation space, which are used to compute the immediate reward at each step. Since IB is itself a high-dimensional and highly stochastic environment, no additional noise is added to the action data when sampling from this environment.

Action Space: Continuous(3,)

Observation Shape: (180,)

Action Space

The action space consists of a continuous 3D vector. For more information, please refer to http://polixir.ai/research/neorl.

Observation Space

The state is a 180-dimensional vector. The raw observation at each time step is a 6-dimensional vector, and the dataset automatically concatenates it with the observations from the previous 29 frames, so the dimension of the current observation is \(180 = 6 \times 30\). For more information, please refer to http://polixir.ai/research/neorl.
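
The following minimal sketch illustrates how such a stacked observation could be assembled from per-step 6-dimensional observations. The frame ordering (oldest frame first) and the padding behaviour are assumptions for illustration only; the NeoRL dataset performs this concatenation automatically.

import numpy as np

FRAME_DIM = 6    # per-step observation dimension
STACK_LEN = 30   # number of stacked frames: 6 * 30 = 180


def stack_observation(history):
    """Concatenate the last 30 per-step observations into one 180-dim vector.

    `history` is a list of 6-dim NumPy arrays, oldest first. If fewer than
    30 frames are available, the earliest frame is repeated (an assumption
    for illustration; the real dataset may pad differently).
    """
    frames = list(history[-STACK_LEN:])
    while len(frames) < STACK_LEN:
        frames.insert(0, frames[0])
    stacked = np.concatenate(frames, axis=0)
    assert stacked.shape == (FRAME_DIM * STACK_LEN,)
    return stacked


# Example: stack 30 random 6-dim frames into one 180-dim observation.
history = [np.random.randn(FRAME_DIM) for _ in range(STACK_LEN)]
print(stack_observation(history).shape)  # (180,)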

Task Objective of Industrial Machine Control

The objective of the industrial machine control task is to maintain various indicators of the machine near their target values. For more information, please refer to http://polixir.ai/research/neorl.
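
Concretely, the reward used in this example (see the reward function below) penalizes the machine's fatigue and power consumption at each step:

\[
r_t = -\left(\mathrm{CRF} \cdot \mathrm{fatigue}_t + \mathrm{CRC} \cdot \mathrm{consumption}_t\right), \qquad \mathrm{CRF} = 3, \quad \mathrm{CRC} = 1.
\]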

Training control policies using REVIVE SDK

REVIVE SDK is a historical-data-driven tool. According to the documentation tutorial, using the REVIVE SDK for the industrial machine control task can be divided into the following steps:

  1. Collect historical decision-making data for the industrial machine control task;

  2. Combine the business scenario with the collected historical data to build the decision flow and array data. The decision flow mainly describes the interaction logic of the business data and is stored in a .yaml file. The array data stores the node data defined in the decision flow and is stored as a .npz or .h5 file.

  3. With the above decision flow and array data, the REVIVE SDK can train a virtual environment model. However, to obtain a better control policy, the reward function must be defined according to the task goal. The reward function defines the optimization objective of the policy and guides the control policy to keep the machine's indicators close to their target values.

  4. After defining the decision flow, training data, and reward function, we can use the REVIVE SDK to train the virtual environment model and the policy model.

  5. Finally, test the policy model trained by REVIVE SDK online.

Preparing Data

We use the IB dataset from NeoRL together with the corresponding reward function to build the training task. For more information, please refer to http://polixir.ai/research/neorl.
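
As a quick sanity check, the prepared .npz file can be inspected with NumPy before training. The key names mentioned in the comments (obs, action, index) are assumptions based on the nodes defined in this example; verify them against the actual file.

import numpy as np

# Load the prepared array data (the path matches the -df argument used in
# the training commands below).
data = np.load("data/ib.npz")

# Print every stored array and its shape, e.g. obs -> (N, 180),
# action -> (N, 3), and an index array marking trajectory boundaries.
for key in data.files:
    print(key, data[key].shape)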

Define Decision Flow

The complete training process for the IB task involves loading heterogeneous decision graphs. For details, see Loading Heterogeneous Decision Graphs.

Here is the .yaml file used when training the virtual environment:

metadata:
    columns:
    - obs_0:
        dim: obs
        type: continuous
    - obs_1:
        dim: obs
        type: continuous
    ...
    - obs_179:
        dim: obs
        type: continuous

    - obs_0:
        dim: current_next_obs
        type: continuous
    - obs_1:
        dim: current_next_obs
        type: continuous
    ...
    - obs_5:
        dim: current_next_obs
        type: continuous

    - action_0:
        dim: action
        type: continuous
    - action_1:
        dim: action
        type: continuous
    - action_2:
        dim: action
        type: continuous

    graph:
        #action:
        #- obs
        current_next_obs:
        - obs
        - action
        next_obs:
        - obs
        - current_next_obs

    expert_functions:
        next_obs:
            'node_function' : 'expert_function.next_obs'
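
The next_obs node is produced by an expert function instead of being learned. Below is a minimal sketch of what expert_function.next_obs could look like: it rebuilds the 180-dimensional next observation by dropping the oldest 6-dimensional frame from obs and appending current_next_obs. The frame ordering and the exact signature are assumptions for illustration; refer to the REVIVE SDK examples for the actual implementation.

import torch
from typing import Dict


def next_obs(data: Dict[str, torch.Tensor]) -> torch.Tensor:
    obs = data["obs"]                            # stacked history, shape (..., 180)
    current_next_obs = data["current_next_obs"]  # newest 6-dim frame, shape (..., 6)

    # Drop the oldest 6-dim frame and append the newest one so the stack
    # stays at 30 frames (180 dimensions). Whether the oldest frame sits at
    # the front or the back of the vector is an assumption here.
    return torch.cat([obs[..., 6:], current_next_obs], dim=-1)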

Here is the .yaml file used when training the policy. Compared with the virtual environment configuration above, the action node is now included in the graph, so it is learned as the policy:

metadata:
    columns:
    - obs_0:
        dim: obs
        type: continuous
    - obs_1:
        dim: obs
        type: continuous
    ...
    - obs_179:
        dim: obs
        type: continuous

    - obs_0:
        dim: current_next_obs
        type: continuous
    - obs_1:
        dim: current_next_obs
        type: continuous
    ...
    - obs_5:
        dim: current_next_obs
        type: continuous

    - action_0:
        dim: action
        type: continuous
    - action_1:
        dim: action
        type: continuous
    - action_2:
        dim: action
        type: continuous

    graph:
        action:
        - obs
        current_next_obs:
        - obs
        - action
        next_obs:
        - obs
        - current_next_obs

    expert_functions:
        next_obs:
            'node_function' : 'expert_function.next_obs'

    #nodes:
    #  action:
    #      step_input: True

Building Reward Function

Here we define the reward function for the policy node in the IB task:

import torch
from typing import Dict


def get_reward(data: Dict[str, torch.Tensor]) -> torch.Tensor:
    obs = data["obs"]
    next_obs = data["next_obs"]

    # Support both single transitions (1-D tensors) and batches (2-D tensors).
    single_reward = False
    if len(obs.shape) == 1:
        single_reward = True
        obs = obs.reshape(1, -1)
    if len(next_obs.shape) == 1:
        next_obs = next_obs.reshape(1, -1)

    # Cost weights for fatigue (CRF) and consumption (CRC).
    CRF = 3.0
    CRC = 1.0

    # Fatigue and consumption are taken from dimensions 4 and 5 of next_obs.
    fatigue = next_obs[:, 4]
    consumption = next_obs[:, 5]

    # The reward is the negative weighted cost: the policy is rewarded for
    # keeping fatigue and consumption low.
    cost = CRF * fatigue + CRC * consumption

    reward = -cost

    if single_reward:
        reward = reward[0].item()
    else:
        reward = reward.reshape(-1, 1)

    return reward
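
A quick way to check the reward function is to call it with random tensors of the expected shapes (a simple smoke test for illustration, not part of the SDK):

import torch

batch = {
    "obs": torch.randn(4, 180),
    "next_obs": torch.randn(4, 180),
}
print(get_reward(batch).shape)  # torch.Size([4, 1])

single = {
    "obs": torch.randn(180),
    "next_obs": torch.randn(180),
}
print(get_reward(single))  # a single Python float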

Train a Control Policy using REVIVE SDK

REVIVE SDK provides the data and code required for training. For details, refer to the REVIVE SDK source code repository. After installing the REVIVE SDK, switch to the examples/task/IB directory and run the following Bash commands to train the virtual environment model and the policy model. During training, we can open the log directory with TensorBoard at any time to monitor progress. When the REVIVE SDK finishes training the virtual environment model and the policy model, the saved models (.pkl or .onnx) can be found in the log folder (logs/<run_id>).

python train.py -df data/ib.npz -cf data/ib_env.yaml -rf data/ib_reward.py -rcf data/config.json -vm tune -pm None --run_id revive

python train.py -df data/ib.npz -cf data/ib_policy.yaml -rf data/ib_reward.py -rcf data/config.json -vm None -pm tune --run_id revive
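
For example, with --run_id revive as above, the training logs can be monitored with:

tensorboard --logdir logs/revive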

Test the Trained Policy Model in the IB Environment

After training is complete, use the provided Jupyter Notebook script to test the performance of the trained policy. For more information, see the Jupyter Notebook.
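
Below is a minimal sketch of how a trained policy model might be loaded and queried. The policy.pkl file name under logs/<run_id> and the infer interface follow the general pattern of the REVIVE SDK, but treat them as assumptions and follow the provided notebook for the actual test procedure.

import pickle

import numpy as np

# Load the policy model saved by the REVIVE SDK (path assumed from the
# logs/<run_id> layout described above).
with open("logs/revive/policy.pkl", "rb") as f:
    policy = pickle.load(f)

# Query the policy with a single 180-dimensional observation; the "obs" key
# matches the node name used in the decision flow.
obs = np.random.randn(180).astype(np.float32)
action = policy.infer({"obs": obs})
print(action)  # expected to be a 3-dimensional action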