Define Reward Function

The reward function is a crucial part of reinforcement learning. It defines the reward that the policy receives for taking an action in the environment; this reward can be positive, negative, or zero.

The goal of reinforcement learning is to maximize cumulative reward, and thus, the reward function defines the optimization objective of the policy.

In reinforcement learning, the agent searches for the optimal policy by repeatedly interacting with the environment. At each step, the policy selects an action based on the current state and receives a reward and the next state in return. This process continues until a terminal state is reached. If the policy accumulates a high reward over a sequence of actions and state transitions, it has found a better strategy. The design of the reward function is therefore crucial for optimizing the reinforcement learning policy: a good reward function guides the policy to learn in the intended direction.
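
As a rough illustration only, the interaction loop described above can be sketched as follows. The env and policy objects here are hypothetical stand-ins and are not part of the REVIVE SDK API; the sketch just shows where the reward enters and what "cumulative reward" means.

# Minimal sketch of the agent-environment loop (hypothetical `env` and `policy`)
def run_episode(env, policy):
    state = env.reset()
    cumulative_reward = 0.0
    done = False
    while not done:
        action = policy(state)                        # policy selects an action from the current state
        next_state, reward, done = env.step(action)   # environment returns a reward and the next state
        cumulative_reward += reward                    # the policy is optimized to maximize this sum
        state = next_state
    return cumulative_reward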

The REVIVE SDK supports defining reward functions as Python source files. The input of the reward function is the data of a single-step decision flow, and its output is the reward value obtained by the policy at that step. Here are some examples of reward functions:

Note

The name of the reward function must be get_reward.

Suppose we want to teach an agent to play a jumping game, in which it must learn to jump over obstacles based on the information on the screen. The reward function can be designed to give a positive reward every time the agent jumps over an obstacle and a negative reward every time it runs into one. The corresponding reward function example is as follows:

import torch
from typing import Dict

def get_reward(data: Dict[str, torch.Tensor]) -> torch.Tensor:
    """Reward function"""

    # Get the status of whether the obstacle is passed
    is_passed = torch.gt(data['obs'][..., :1], 0.5)

    # Reward of 100 for passing the obstacle, -10 (a penalty) for failing to pass it
    passed_reward = torch.tensor(100.0)
    unpassed_reward = torch.tensor(-10.0)

    # Select the reward for each time step based on the condition
    reward = torch.where(is_passed, passed_reward, unpassed_reward)

    return reward

Now suppose we face a robot control task in which the policy's goal is to move the robot from one location to another. The reward can be calculated from the reduction in the distance between the robot and the target position. The corresponding reward function example is as follows:

import torch
from typing import Dict

def get_reward(data: Dict[str, torch.Tensor]) -> torch.Tensor:
    """Robot control task reward function"""

    # Get the position of the robot before the action is taken
    pos = data['obs'][..., :2]
    # Get the position of the robot after the action is taken
    next_pos = data['next_obs'][..., :2]
    # Get the target position
    target_pos = data['target_pos'][..., :2]

    # Use the reduction in the distance to the target position as the reward;
    # the norm is taken along the last (feature) axis and the dimension is kept
    # so the reward shape stays [..., 1]
    dist1 = torch.norm(pos - target_pos, dim=-1, keepdim=True)
    dist2 = torch.norm(next_pos - target_pos, dim=-1, keepdim=True)
    reward = (dist1 - dist2) / dist1.clamp(min=1e-6)  # clamp guards against division by zero

    return reward

Note that when a reward function processes data, multiple data points are usually organized into a batch and processed all at once, which makes the code more efficient. When writing a reward function, you must therefore make sure it can handle multidimensional data matching the shape of the input tensors. In addition, the reward is typically computed over the feature dimension, which is the last axis, so features are taken from the last axis using slicing ( [..., n:m] ) and the computation is carried out on them. For example, to take the first two features of the last axis from the obs data, the following code can be used:

obs_features = data['obs'][..., :2]  # take the first two features along the last axis

The returned reward should be a PyTorch tensor with the same batch dimensions as the input data, and its last axis (the feature dimension) should have size 1.
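
As a quick sanity check (the batch below is hypothetical and its shapes are only illustrative, not prescribed by the SDK), you can call a reward function on dummy data and verify the shape of the output. Using the jumping-game get_reward defined above:

import torch

# Hypothetical batch: 32 trajectories, 10 time steps, 4 observation features
data = {'obs': torch.rand(32, 10, 4)}

reward = get_reward(data)
print(reward.shape)  # torch.Size([32, 10, 1]) -- same batch dimensions, last axis of size 1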

Note

The reward function is only needed when training a policy model; it is not required when training a virtual environment model. The REVIVE SDK therefore supports training multiple policy models, each with a different reward function.
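
For example (the file names and reward logic below are purely illustrative), two policies could be trained with two different reward files, each exposing its own get_reward:

# speed_reward.py -- illustrative: reward progress measured by the first observation feature
import torch
from typing import Dict

def get_reward(data: Dict[str, torch.Tensor]) -> torch.Tensor:
    return data['obs'][..., :1]

# safety_reward.py -- illustrative: penalize steps where the second feature exceeds a threshold
import torch
from typing import Dict

def get_reward(data: Dict[str, torch.Tensor]) -> torch.Tensor:
    return -torch.gt(data['obs'][..., 1:2], 0.5).float()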