Use REVIVE to control industrial machines
=======================================================

.. image:: images/ib.png
   :alt: example-of-ib
   :align: center

Industrial machine control description
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Industrial Benchmark (IB) is a reinforcement learning benchmark environment designed to simulate the characteristics of a variety of industrial control tasks, such as wind or gas turbines and chemical reactors. It covers many problems commonly encountered in real-world industrial settings, such as high-dimensional continuous state and action spaces, delayed rewards, complex noise patterns, and the high stochasticity of multiple response targets. We augment the original IB environment by adding two dimensions of the system state to the observation space so that the immediate reward of each step can be computed. Since IB is already a high-dimensional and highly stochastic environment, no noise is added to the action data when sampling from this environment.

=================  ====================
Action Space       Continuous(3,)
Observation Shape  (180,)
=================  ====================

Action Space
--------------------------

The action space is a continuous 3-dimensional vector. For more information, please refer to `http://polixir.ai/research/neorl <http://polixir.ai/research/neorl>`__.

Observation Space
--------------------------

The state is a 180-dimensional vector. The observation at each time step is actually a 6-dimensional vector, and the dataset automatically concatenates it with the data of the previous 29 frames, so the dimension of the current observation is :math:`180 = 6 \times 30`. For more information, please refer to `http://polixir.ai/research/neorl <http://polixir.ai/research/neorl>`__.

Task Objective of Industrial Machine Control
------------------------------------------------------

The objective of the industrial machine control task is to keep the various indicators of the machine close to their target values. For more information, please refer to `http://polixir.ai/research/neorl <http://polixir.ai/research/neorl>`__.

Training control policies using REVIVE SDK
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The REVIVE SDK is a tool driven by historical data. Following the documentation tutorial, using the REVIVE SDK for the industrial machine control task can be divided into the following steps:

1. Collect historical decision-making data of the industrial machine control task;
2. Combine the business scenario and the collected historical data to build the :doc:`decision flow and array data <../tutorial/data_preparation>`. The decision flow mainly describes the interaction logic of the business data and is stored in a ``.yaml`` file. The array data stores the node data defined in the decision flow and is stored in a ``.npz`` or ``.h5`` file.
3. With the above decision flow and array data, the REVIVE SDK can already train a virtual environment model. However, to obtain a better control policy, it is also necessary to define a :doc:`reward function <../tutorial/reward_function>` according to the task goal. The reward function defines the optimization objective of the policy and guides the control policy to keep the industrial machine more stable.
4. Once the :doc:`decision flow <../tutorial/data_preparation>`, :doc:`training data <../tutorial/data_preparation>` and :doc:`reward function <../tutorial/reward_function>` are defined, we can use the REVIVE SDK to train the virtual environment model and the policy model.
5. Finally, test the policy model trained by the REVIVE SDK online.

Preparing Data
--------------

We use the IB dataset from NeoRL together with the reward function below to build the training task. For more information, please refer to `http://polixir.ai/research/neorl <http://polixir.ai/research/neorl>`__.
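Before training, it can be useful to check that the array data matches the nodes of the decision flow. The following is a minimal sketch, assuming the dataset has been placed at ``data/ib.npz`` (the path used by the training commands later on this page); the exact array names depend on how the data was exported.

.. code:: python

    import numpy as np

    # Inspect the array data used for training. Each entry is expected to
    # hold the data of one decision-flow node (e.g. obs, action); the exact
    # keys depend on how the dataset was exported.
    data = np.load("data/ib.npz")
    for name in data.files:
        print(name, data[name].shape, data[name].dtype)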
Define Decision Flow
--------------------------

The complete training process of the IB task involves loading a heterogeneous decision flow graph. For details, see :doc:`Loading Heterogeneous Decision Graphs <../tutorial/heterogeneous_decision_graphs>`.

Here is the ``.yaml`` file used when **training the virtual environment**:

.. code:: yaml

    metadata:
      columns:
        - obs_0:
            dim: obs
            type: continuous
        - obs_1:
            dim: obs
            type: continuous
        ...
        - obs_179:
            dim: obs
            type: continuous
        - obs_0:
            dim: current_next_obs
            type: continuous
        - obs_1:
            dim: current_next_obs
            type: continuous
        ...
        - obs_5:
            dim: current_next_obs
            type: continuous
        - action_0:
            dim: action
            type: continuous
        - action_1:
            dim: action
            type: continuous
        - action_2:
            dim: action
            type: continuous

      graph:
        #action:
        #  - obs
        current_next_obs:
          - obs
          - action
        next_obs:
          - obs
          - current_next_obs

      expert_functions:
        next_obs:
          'node_function': 'expert_function.next_obs'

Here is the ``.yaml`` file used when **training the policy**:

.. code:: yaml

    metadata:
      columns:
        - obs_0:
            dim: obs
            type: continuous
        - obs_1:
            dim: obs
            type: continuous
        ...
        - obs_179:
            dim: obs
            type: continuous
        - obs_0:
            dim: current_next_obs
            type: continuous
        - obs_1:
            dim: current_next_obs
            type: continuous
        ...
        - obs_5:
            dim: current_next_obs
            type: continuous
        - action_0:
            dim: action
            type: continuous
        - action_1:
            dim: action
            type: continuous
        - action_2:
            dim: action
            type: continuous

      graph:
        action:
          - obs
        current_next_obs:
          - obs
          - action
        next_obs:
          - obs
          - current_next_obs

      expert_functions:
        next_obs:
          'node_function': 'expert_function.next_obs'

      #nodes:
      #  action:
      #    step_input: True

Building Reward Function
--------------------------

Here we define the reward function for the policy node of the IB task. The reward is the negative weighted cost of the machine's fatigue and consumption:

.. code:: python

    import torch
    from typing import Dict


    def get_reward(data: Dict[str, torch.Tensor]) -> torch.Tensor:
        obs = data["obs"]
        next_obs = data["next_obs"]

        # Support both single transitions (1-D tensors) and batches.
        single_reward = False
        if len(obs.shape) == 1:
            single_reward = True
            obs = obs.reshape(1, -1)
        if len(next_obs.shape) == 1:
            next_obs = next_obs.reshape(1, -1)

        # Weights of the fatigue and consumption terms in the cost.
        CRF = 3.0
        CRC = 1.0

        fatigue = next_obs[:, 4]
        consumption = next_obs[:, 5]

        # The reward is the negative weighted cost.
        cost = CRF * fatigue + CRC * consumption
        reward = -cost

        if single_reward:
            reward = reward[0].item()
        else:
            reward = reward.reshape(-1, 1)

        return reward
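To sanity-check the reward shaping, the function can be called directly on a dummy batch (assuming ``get_reward`` from the file above is in scope). The values below are made up for illustration; only columns 4 (fatigue) and 5 (consumption) of ``next_obs`` affect the result.

.. code:: python

    import torch

    # Dummy batch of three 180-dimensional transitions; all values are made
    # up for illustration. Only next_obs[:, 4] (fatigue) and next_obs[:, 5]
    # (consumption) enter the reward.
    obs = torch.zeros(3, 180)
    next_obs = torch.zeros(3, 180)
    next_obs[:, 4] = torch.tensor([0.1, 0.5, 1.0])
    next_obs[:, 5] = torch.tensor([0.2, 0.2, 0.2])

    reward = get_reward({"obs": obs, "next_obs": next_obs})
    print(reward.squeeze())  # tensor([-0.5000, -1.7000, -3.2000])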
Train a Control Policy using REVIVE SDK
---------------------------------------

The REVIVE SDK provides the data and code required for training; for details, please refer to the REVIVE SDK source code repository. After installing the REVIVE SDK, switch to the ``examples/task/IB`` directory and run the following Bash commands to start training the virtual environment model and the policy model. During training, we can open the log directory with TensorBoard at any time to monitor the training process. When the REVIVE SDK finishes training the virtual environment model and the policy model, the saved models (``.pkl`` or ``.onnx``) can be found in the log folder (``logs/``).

.. code:: bash

    python train.py -df data/ib.npz -cf data/ib_env.yaml -rf data/ib_reward.py -rcf data/config.json -vm tune -pm None --run_id revive
    python train.py -df data/ib.npz -cf data/ib_policy.yaml -rf data/ib_reward.py -rcf data/config.json -vm None -pm tune --run_id revive

Test the Trained Policy Model in the IB Environment
---------------------------------------------------

After training is complete, use the provided Jupyter Notebook script to test the performance of the trained policy. For more information, see the Jupyter Notebook shipped with the example.
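As a lightweight alternative to the notebook, the exported policy can also be smoke-tested directly. Below is a minimal sketch, assuming the policy was exported in ONNX format (the exact file name under ``logs/`` depends on the run) and that the exported graph takes a single 180-dimensional observation input and returns the 3-dimensional action.

.. code:: python

    import numpy as np
    import onnxruntime as ort

    # The path is an assumption; check the actual file name produced under logs/.
    sess = ort.InferenceSession("logs/revive/policy.onnx")

    # Feed a dummy 180-dimensional observation through the policy.
    input_name = sess.get_inputs()[0].name
    obs = np.zeros((1, 180), dtype=np.float32)

    action = sess.run(None, {input_name: obs})[0]
    print("action:", action)  # expected shape: (1, 3)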