Heterogeneous Decision Flow Loading
After the virtual environment has been trained, we can begin policy training. In some cases, however, we want certain nodes' inputs and outputs to differ slightly from those used during virtual-environment training. In other words, the decision flow chart (`.yaml` file) used for virtual-environment training does not match the one used for policy training; this is where heterogeneous decision flow loading comes in.
In this situation, the REVIVE SDK automatically reuses the unchanged nodes of the trained environment model, while re-initializing the nodes that have changed according to the new `.yaml` file.
For example, suppose we are working on a mechanical control task where the original policy is a PID controller. When training the virtual environment, we bind a PID expert function to the control node in the decision flow chart. During policy training, we want to learn a better controller, so we initialize the control node as a learnable neural network node that can gradually improve the control strategy during training. The heterogeneous decision flow chart makes this possible: the control node from the virtual environment is converted into a learnable node, enabling more flexible and better-optimized strategies during policy training. An example implementation is shown below.
Here is the `.yaml` file used when training the virtual environment. Note that the action node is bound to the PID expert function:
```yaml
metadata:
  graph:
    action:
      - observation
    next_observation:
      - action
      - observation
  columns:
    ...
  expert_functions:
    action:
      'node_function': 'dynamics.pid_policy'
```
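For reference, the bound expert function might look like the following minimal sketch. It is not the SDK's actual implementation: it assumes the REVIVE convention that an expert function receives a dict mapping input-node names to `torch` tensors and returns the output node's tensor, and the PID gains, setpoint, and observation layout are hypothetical.

```python
# dynamics.py -- hypothetical PID expert function bound to the action node.
# Assumed convention: the function receives a dict mapping each input-node
# name to a torch.Tensor of shape [batch, dim] and returns the action tensor.
import torch

KP, KI, KD = 1.0, 0.1, 0.05  # hypothetical PID gains
TARGET = 0.0                 # hypothetical setpoint

def pid_policy(data: dict) -> torch.Tensor:
    obs = data["observation"]
    # Illustrative layout: column 0 = measured value,
    # column 1 = accumulated error, column 2 = error derivative.
    error = TARGET - obs[..., 0:1]
    integral = obs[..., 1:2]
    derivative = obs[..., 2:3]
    action = KP * error + KI * integral + KD * derivative
    return action
```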
Here is the `.yaml` file used during policy training. The action node no longer has a bound expert function, so it is initialized as a learnable neural network node:
```yaml
metadata:
  graph:
    action:
      - observation
    next_observation:
      - action
      - observation
  columns:
    ...
```
The reverse also works: if a node is not fixed during virtual-environment training, we can learn it with a neural network node, and then, during policy training, fix it to a predetermined behavior by binding the corresponding expert function to it. This ensures more stable and predictable behavior of that node when the policy is deployed. This is another advantage of heterogeneous decision flow graphs: different node types can be selected according to the specific training and deployment requirements, as in the sketch below.
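As a minimal illustration of this reverse case (the node and function names here are hypothetical, patterned on the examples above), a node such as `controller_state` could be learned as a neural network node during virtual-environment training and then pinned to a rule during policy training by adding an `expert_functions` entry:

```yaml
# Policy-training .yaml: the (hypothetical) controller_state node, learned as
# a neural network during virtual-environment training, is now fixed by
# binding an expert function to it.
metadata:
  graph:
    controller_state:
      - observation
    action:
      - observation
      - controller_state
    next_observation:
      - action
      - observation
  columns:
    ...
  expert_functions:
    controller_state:
      'node_function': 'dynamics.controller_rule'
```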
In the "Use REVIVE to control industrial machines" example, the historical data mixes several control strategies, which makes it very difficult for a policy control node to learn well. To handle this, we declare the control node as an external variable during virtual-environment training and initialize only the environment transition node as a learnable node. During policy training, we then re-initialize the control node as a learnable neural network node so that it can be optimized for better performance. This approach effectively improves the accuracy and robustness of the model in practical applications where the training data is incomplete or inconsistent. The pair of graphs below sketches this setup.
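The two decision flow charts for that case might look like the following sketch, patterned on the examples above. This assumes the REVIVE convention that a node appearing only as an input of other nodes (and not as a graph output) is read directly from the data, i.e. treated as an external variable:

```yaml
# Virtual-environment training .yaml: action appears only as an input, so it
# is taken from the historical data; only next_observation is learnable.
metadata:
  graph:
    next_observation:
      - action
      - observation
  columns:
    ...
```

```yaml
# Policy-training .yaml: action is added as a graph output, so it is
# initialized as a learnable neural network node, while the trained
# next_observation node is reused unchanged.
metadata:
  graph:
    action:
      - observation
    next_observation:
      - action
      - observation
  columns:
    ...
```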