We define two different reward structures: ternary 1 (win) / 0 (tie) / −1 (loss) received at the end of a game (with all-zero rewards during the game), and Blizzard score. The ternary win/tie/loss score is the real reward that we care about. The Blizzard score is the score seen by players on the victory screen at the end of the game. While players can only see this score at the end of the game, we provide access to the running Blizzard score at every step during the game so that the change in score can be used as a reward for reinforcement learning. It is computed as the sum of current resources and upgrades researched, as well as units and buildings currently alive and being built. This means that the player's cumulative reward increases with more mined resources, decreases when losing units/buildings, and all other actions (training units, building buildings, and researching) do not affect it. The Blizzard score is not zero-sum since it is player-centric, it is far less sparse than the ternary reward signal, and it correlates to some extent with winning or losing.
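The two reward structures can be illustrated with a minimal sketch. It is not the environment's actual API: the `StepInfo` record, its field names, and the helper functions are hypothetical, and the score values are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepInfo:
    """Hypothetical per-step record; field names are illustrative only."""
    running_score: int              # running Blizzard score at this step
    outcome: Optional[int] = None   # 1 win, 0 tie, -1 loss; None while the game is in progress


def ternary_rewards(steps: List[StepInfo]) -> List[int]:
    """All-zero rewards during the game, with the outcome on the final step."""
    return [0 if s.outcome is None else s.outcome for s in steps]


def score_delta_rewards(steps: List[StepInfo]) -> List[int]:
    """Per-step reward defined as the change in the running Blizzard score."""
    scores = [s.running_score for s in steps]
    return [curr - prev for prev, curr in zip(scores, scores[1:])]


# Made-up episode: resources are mined, nothing changes, then units are lost and the game is lost.
episode = [
    StepInfo(running_score=50),
    StepInfo(running_score=75),
    StepInfo(running_score=75),
    StepInfo(running_score=25, outcome=-1),
]

sparse = ternary_rewards(episode)      # non-zero only at the end of the game
dense = score_delta_rewards(episode)   # one value per transition between steps
```

In this sketch the score-difference rewards sum to the final score minus the initial score, so the cumulative reward tracks the running Blizzard score, whereas the ternary signal carries information only on the last step.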