Example of applying REVIVE to Lander Hover
================================================

.. image:: images/lander_hover.gif
   :alt: example-of-lander_hover
   :align: center

Description
~~~~~~~~~~~~~

The Lander Hover environment is a modification of the Lunar Lander environment in Gym. The goal of the task is changed from landing to hovering above the lunar surface. This environment is part of the `Box2D environments `__; you can refer to that page for general information first.

=================  ====================
Action Space       Discrete(4)
Observation Shape  (4,)
Observation High   [1.5 1.5 5. 5. ]
Observation Low    [-1.5 -1.5 -5. -5. ]
=================  ====================

This environment is a classic rocket trajectory optimization problem. According to Pontryagin's maximum principle, it is optimal to fire the engine at full throttle or to turn it off, which is why the action space of this environment is discrete: each engine is either on or off. The hovering target point is always at the coordinates (0, 1), i.e. :math:`x = 0` and :math:`y = 1`; these coordinates are also the first two dimensions of the state vector. Note that the fuel of the lander is infinite.

Action Space
--------------------------

There are four discrete actions available: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine.

Observation Space
--------------------------

The state is a 4-dimensional vector: the coordinates of the lander along the :math:`x`-axis and :math:`y`-axis, and its linear velocities along the :math:`x`-axis and :math:`y`-axis.

Rewards
--------------------------

The hovering target point is at (0, 1). If the deviation of both the :math:`x` and :math:`y` coordinates of the lander from the target is less than 0.2, the hover is considered successful and a bonus of 10 is given; otherwise a penalty of -0.3 is given.

.. code:: python

   if abs(state[0]) <= 0.2 and abs(state[1] - 1) <= 0.2:
       reward = 10
   else:
       reward = -0.3

Starting State
--------------------------

The lander starts at the top center of the viewport with a random initial force applied to its center of mass. This random initial force gives the lander an initial velocity in both the :math:`x` and :math:`y` directions.

Episode Termination
--------------------------

The episode finishes if:

1. The lander crashes (the lander's body gets in contact with the moon's surface);
2. The lander gets outside of the viewport (its :math:`x` coordinate is greater than 1);
3. The episode reaches 400 time steps;
4. The lander is not awake.

From the `Box2D `__ documentation, a body which is not awake is a body that is not moving and not colliding with any other body:

    When Box2D determines that a body (or group of bodies) has come to rest, the body enters a sleep state which has very little CPU overhead. If a body is awake and collides with a sleeping body, then the sleeping body wakes up. Bodies will also wake up if a joint or contact attached to them is destroyed.
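To make the termination logic concrete, the following is a minimal, hedged sketch of how the four conditions above could be combined; the function and argument names (``crashed``, ``awake``) are illustrative and are not taken from the environment's source code.

.. code:: python

   # Illustrative check of the four termination conditions described above.
   # "crashed" and "awake" are assumed to be provided by the simulation;
   # they are not the environment's actual variable names.
   def episode_done(state, step_count, crashed, awake, max_steps=400):
       x = state[0]                    # lander x coordinate
       if crashed:                     # 1. body touched the lunar surface
           return True
       if abs(x) > 1.0:                # 2. lander left the viewport
           return True
       if step_count >= max_steps:     # 3. time limit of 400 steps reached
           return True
       if not awake:                   # 4. Box2D put the body to sleep
           return True
       return False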
Train Control Policy Using REVIVE SDK
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The REVIVE SDK is a tool driven by historical data. According to the description in the documentation, using the REVIVE SDK for the hovering task can be divided into the following steps:

1. Collect historical decision-making data for the hovering task;
2. Combine the business scenario and the collected historical data to build the decision flow graph and the array data, which together describe the interaction logic of the business data. The decision flow graph is stored in a ``.yaml`` file, and the array data storing the node data defined in the decision flow graph is stored in a ``.npz`` or ``.h5`` file;
3. With the above decision flow graph and array data, the REVIVE SDK can already train a virtual environment model. However, in order to obtain a better control policy, it is necessary to define a reward function according to the task goal. The reward function defines the optimization objective of the policy and can guide the control policy to hover the lander closer to the target position;
4. After defining the decision flow graph, training data, and reward function, we can use the REVIVE SDK to train the virtual environment model and the policy model;
5. Finally, the policy model trained by the REVIVE SDK needs to be tested online.

Collect historical data
----------------------------------------------------

We use a coordinate-based control strategy to simulate the historical decision-making process and collect data. Specifically, the coordinate-based control strategy chooses actions based on how far the lander's current position deviates from the target position, in order to bring the lander as close as possible to the target. The control effect is shown in the figure below:

.. image:: images/rule_result.gif
   :alt: example-of-rule_result
   :align: center

Define decision flow and prepare data
----------------------------------------------------

After collecting the historical data, we need to construct the decision flow graph and the array data based on the business scenario. The decision flow graph precisely defines the causal relationships in the data. In the hovering task, we can observe the relevant state information of the lander. The state is a four-dimensional vector, consisting of the lander's :math:`X`-axis coordinate, :math:`Y`-axis coordinate, :math:`X`-axis velocity, and :math:`Y`-axis velocity. The action space consists of four discrete actions: do nothing, fire the left engine, fire the main engine, and fire the right engine.

Based on our understanding of the business, we construct the following decision flow graph, which includes three nodes. The ``obs`` node represents the state of the lander, the ``action`` node represents the lander's action, and the ``next_obs`` node represents the state of the lander at the next time step. The edge from the ``obs`` node to the ``action`` node indicates that the lander's action should be determined solely by its state, while the edges from the ``obs`` node and the ``action`` node to the ``next_obs`` node indicate that the state of the lander at the next time step is jointly determined by its current state and the control action.

.. image:: images/decision_graph.png
   :alt: decision_graph of hover lander
   :align: center

The ``.yaml`` file corresponding to the above decision flow graph is as follows:

.. code:: yaml

   metadata:
     columns:
       - obs_0:
           dim: obs
           type: continuous
       - obs_1:
           dim: obs
           type: continuous
       - obs_2:
           dim: obs
           type: continuous
       - obs_3:
           dim: obs
           type: continuous
       - action:
           dim: action
           type: category
           values: [0, 1, 2, 3]

     graph:
       action:
         - obs
       next_obs:
         - obs
         - action

While preparing the decision flow graph, we also convert the raw data into a ``.npz`` file for storage, as sketched below. For a more detailed description of decision flow graphs and data preparation, please refer to the :doc:`data preparation <../tutorial/data_preparation>` section of the tutorial.
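As an illustration, the following hedged sketch shows one way the collected trajectories could be packed into a ``.npz`` file. The placeholder arrays, their shapes, and the ``index`` convention (assumed here to hold the cumulative end position of each trajectory) are assumptions based on the node names above; the data preparation tutorial is the authoritative reference for the exact format.

.. code:: python

   # A minimal sketch of packing collected trajectories into a .npz file.
   # The keys "obs" and "action" follow the node names in the decision flow
   # graph above; the "index" array is an assumed trajectory-boundary
   # convention and should be checked against the data preparation tutorial.
   import numpy as np

   # Placeholder data standing in for the trajectories collected with the
   # coordinate-based control strategy (hypothetical shapes and values).
   num_steps = 1200
   obs = np.zeros((num_steps, 4), dtype=np.float32)    # lander states, one row per step
   action = np.zeros((num_steps, 1), dtype=np.int64)   # discrete actions in {0, 1, 2, 3}
   index = np.array([400, 800, 1200], dtype=np.int64)  # assumed: cumulative end index of each trajectory

   np.savez_compressed("data/LanderHover.npz", obs=obs, action=action, index=index)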
Write the Reward Function
----------------------------

The design of the reward function is crucial for policy learning. A good reward function should guide the policy in the expected direction. The REVIVE SDK supports defining reward functions as Python source files. The target point for the hovering task is located at (0, 1). If the deviation of both the lander's :math:`x` and :math:`y` coordinates from the target is less than 0.2, the hover is considered successful and receives a reward of 10; otherwise, it incurs a penalty of -0.3. The following code shows how to convert the task goal of the lander hovering environment into a reward function in the form required by the REVIVE SDK.

.. code:: python

   import torch
   from typing import Dict


   def get_reward(data: Dict[str, torch.Tensor]) -> torch.Tensor:
       # The hover succeeds when the next state is within 0.2 of the target (0, 1).
       hovering = (torch.abs(data["next_obs"][..., 0:1]) < 0.2) & \
                  (torch.abs(data["next_obs"][..., 1:2] - 1) < 0.2)
       return torch.where(hovering, 10.0, -0.3)

For a more detailed description of defining reward functions, please refer to the :doc:`reward function <../tutorial/reward_function>` section of the tutorial.

Use REVIVE to Train a Control Policy
----------------------------------------

The data and code for this example, like those for the :doc:`Pendulum ` and :doc:`refrigerator ` examples, are provided in the `SDK source code repositories `__, where the code can be found and run. After installing the REVIVE SDK, you can switch to the ``examples/task/LanderHover`` directory and run the following Bash command to start training the virtual environment model and the policy model. During training, we can point TensorBoard at the log directory to monitor the progress of training. Once the REVIVE SDK completes the training of the virtual environment model and the policy model, we can find the saved models (``.pkl`` or ``.onnx``) in the log folder (``logs/``).

.. code:: bash

   python train.py -df data/LanderHover.npz -cf data/LanderHover.yaml -rf data/LanderHover.py -rcf data/config.json -vm once -pm once --run_id revive --revive_epoch 1000 --ppo_epoch 500

Testing the Trained Policy Model on the Environment
------------------------------------------------------------------------

Once training is complete, a `Jupyter notebook `__ is provided with an example of testing the effect of the trained policy.

.. image:: images/jupyter_plt.png
   :alt: example-of-jupyter_plt
   :align: center

Showing the Control Effect of Different Policies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following shows the control effect of different policies.

**Policy 1: Control by random selection of actions**

.. image:: images/random_result.gif
   :alt: example-of-random_result
   :align: center

**Policy 2: Control by rules based on coordinate information**

.. image:: images/rule_result.gif
   :alt: example-of-rule_result
   :align: center

**Policy 3: Control by a policy trained online using DQN**

.. image:: images/result_dqn.gif
   :alt: example-of-result_dqn
   :align: center

**Policy 4: Control by a policy trained offline using the REVIVE SDK**

.. image:: images/revive_result.gif
   :alt: example-of-revive_result
   :align: center
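As a complement to the notebook referenced above, the following hedged sketch shows one possible way to load an exported ``.onnx`` policy with ``onnxruntime`` and query it for an action. The model path and the interpretation of the network output are assumptions, not part of the REVIVE SDK documentation; the provided notebook remains the authoritative test example.

.. code:: python

   # A hedged sketch of querying an exported ONNX policy with onnxruntime.
   # The model path and the meaning of the output are assumptions; consult
   # the provided notebook for the authoritative testing procedure.
   import numpy as np
   import onnxruntime as ort

   session = ort.InferenceSession("logs/revive/policy.onnx")  # assumed export path
   input_name = session.get_inputs()[0].name

   obs = np.zeros((1, 4), dtype=np.float32)                   # current lander state
   output = session.run(None, {input_name: obs})[0]

   # Depending on how the policy was exported, the output may already be the
   # chosen action or a distribution over the four discrete actions.
   action = int(np.argmax(output))
   print(action)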