we designed a reward function that isbased on a game-balancing constant and introduce itinto the Proximal-Policy-Opmitization (PPO) (Schul-man et al., 2017) algorithm, a reinforcement learn-ing method that directly optimizes the policy usinggradient-based learning.
*핵심 reward function + PPO