Temporal difference learning

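Temporal difference (TD) learning is a class of model-free reinforcement learning methods that learn by bootstrapping from the current estimate of the value function. These methods sample from the environment, as Monte Carlo methods do, and update the value function based on current estimates, as dynamic programming methods do.^[1]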

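Unlike Monte Carlo methods, which can adjust their estimates only after the final outcome is known, TD methods keep adjusting their parameters before the final outcome arrives so that their predictions become more accurate.^[2] This is a form of bootstrapping, as the following example illustrates: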

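Suppose you need to predict Saturday's weather and have a model for doing so. With the standard approach, you can adjust the model only on Saturday, once the outcome is known. By Friday, however, you should already have a good idea of what Saturday's weather will be, so you can adjust the model before Saturday arrives.^[2]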
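The link between the two approaches in the weather example can be written down explicitly. The notation here is an assumption introduced for illustration: $P_{t}$ is the prediction made on day $t$, $z$ is the observed outcome, $m$ is the last day on which a prediction is made, and $P_{m+1}$ is defined to be $z$. The error that a Monte Carlo style method waits for is then a telescoping sum of the day-to-day changes in prediction that a TD method can act on as they occur:

```latex
z - P_{t} = \sum_{k=t}^{m} \left( P_{k+1} - P_{k} \right), \qquad P_{m+1} := z .
```

Each term $P_{k+1} - P_{k}$ is available on day $k+1$, which is why an adjustment can be made before the final outcome is observed.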

Temporal difference learning is also related to models of animal learning studied in animal cognition.^[3]^[4]^[5]^[6]^[7]

Mathematical formulation

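The tabular $TD(0)$ method is one of the simplest TD methods and a special case of stochastic approximation. It estimates the state value function of a finite-state Markov decision process (MDP) under a policy $\pi$ . Let $V^{\pi }$ denote the state value function of the MDP with states $(s_{t})_{t\in \mathbb {N} }$ , rewards $(r_{t})_{t\in \mathbb {N} }$ , and discount rate $\gamma$ under the policy $\pi$ :^[8]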

V^{\pi }(s)=E_{a\sim \pi }\left\{\sum _{t=0}^{\infty }\gamma ^{t}r_{t}(a_{t}){\Bigg |}s_{0}=s\right\}.

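Dropping the action from the notation for convenience, $V^{\pi }$ satisfies the Bellman equation (the discrete-time counterpart of the Hamilton-Jacobi-Bellman equation):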

V^{\pi }(s)=E_{\pi }\{r_{0}+\gamma V^{\pi }(s_{1})|s_{0}=s\},
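As a brief sketch of why this recursion holds, the definition of $V^{\pi }$ can be unrolled by one step. The derivation assumes the Markov property and a time-homogeneous process, so that the discounted return from $s_{1}$ onward has expected value $V^{\pi }(s_{1})$:

```latex
\begin{aligned}
V^{\pi}(s)
  &= E_{\pi}\left\{ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\Big|\, s_{0}=s \right\} \\
  &= E_{\pi}\left\{ r_{0} + \gamma \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_{0}=s \right\} \\
  &= E_{\pi}\left\{ r_{0} + \gamma\, E_{\pi}\left\{ \sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_{1} \right\} \,\Big|\, s_{0}=s \right\} \\
  &= E_{\pi}\left\{ r_{0} + \gamma V^{\pi}(s_{1}) \,\Big|\, s_{0}=s \right\}.
\end{aligned}
```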

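hence $r_{0}+\gamma V^{\pi }(s_{1})$ is an unbiased estimate of $V^{\pi }(s)$ . This observation motivates the following algorithm for estimating $V^{\pi }$ . The algorithm first initializes a table $V(s)$ with arbitrary values, one for each state of the MDP, and chooses a positive learning rate $\alpha$ . It then repeatedly evaluates the policy $\pi$ , obtains a reward $r$ , and updates the value function for the old state according to the rule:^[9]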

V(s)\leftarrow V(s)+\alpha (\overbrace {r+\gamma V(s')} ^{\text{The TD target}}-V(s))

where $s$ and $s'$ are the old and new states, respectively, and $r+\gamma V(s')$ is known as the TD target.
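The update rule above can be sketched in a few lines of Python. The environment is an assumption chosen purely for illustration: a five-state random walk (states 1 to 5, terminal states 0 and 6, reward 1 only on reaching state 6) under a uniformly random policy, for which the true value of state $s$ is $s/6$. The function names, the zero initialization, and the values of `alpha`, `gamma` and `n_episodes` are likewise illustrative rather than prescribed by the text.

```python
import random


def random_walk_step(state, action):
    """Hypothetical toy MDP: move left (-1) or right (+1) on states 0..6."""
    next_state = state + action
    reward = 1.0 if next_state == 6 else 0.0  # reward only at the right end
    done = next_state in (0, 6)               # 0 and 6 are terminal
    return next_state, reward, done


def td0_policy_evaluation(n_episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) estimate of V under a uniformly random policy."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in range(7)}  # terminal entries are never updated
    for _ in range(n_episodes):
        s, done = 3, False          # every episode starts in the middle
        while not done:
            a = rng.choice((-1, 1))                 # sample action from pi
            s_next, r, done = random_walk_step(s, a)
            td_target = r + gamma * V[s_next]       # r + gamma * V(s')
            V[s] += alpha * (td_target - V[s])      # move V(s) toward target
            s = s_next
    return V


V = td0_policy_evaluation()
for s in range(1, 6):
    print(s, round(V[s], 2), "true value:", round(s / 6, 2))
```

Because the learning rate is held constant, the estimates keep fluctuating around the true values instead of converging exactly; a decaying step size would be the usual remedy.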

TD-Lambda

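TD-Lambda is a learning algorithm invented by Richard S. Sutton, based on earlier work on temporal difference learning by Arthur Samuel. Its best-known application is TD-Gammon, a program developed by Gerald Tesauro that learned to play backgammon at the level of expert human players.^[10] The parameter $\lambda$ is the trace decay parameter, with $0\leq \lambda \leq 1$ . The larger $\lambda$ is, the more weight is given to rewards that arrive far in the future. With $\lambda =1$ , learning parallels that of Monte Carlo reinforcement learning algorithms.^[11]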
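The role of the trace decay parameter can be illustrated with a minimal tabular sketch. It assumes the backward-view formulation with accumulating eligibility traces, which is one common way to realize TD-Lambda and is not specified by the text above. The five-state random walk (terminals 0 and 6, reward 1 on reaching state 6) and all parameter values are hypothetical choices for illustration.

```python
import random


def td_lambda(lam, n_episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    """Tabular TD(lambda) with accumulating traces on a toy random walk."""
    rng = random.Random(seed)
    V = [0.0] * 7                       # V[0] and V[6] are terminal, stay 0
    for _ in range(n_episodes):
        trace = [0.0] * 7               # eligibility trace, reset per episode
        s = 3
        while s not in (0, 6):
            s_next = s + rng.choice((-1, 1))        # uniformly random policy
            r = 1.0 if s_next == 6 else 0.0
            delta = r + gamma * V[s_next] - V[s]    # one-step TD error
            trace[s] += 1.0                         # mark s as just visited
            for i in range(1, 6):
                V[i] += alpha * delta * trace[i]    # credit all recent states
                trace[i] *= gamma * lam             # older visits fade
            s = s_next
    return V


for lam in (0.0, 0.5, 0.9):
    V = td_lambda(lam)
    print(lam, [round(v, 2) for v in V[1:6]])
```

With `lam=0.0` the trace vanishes after one step and the sketch reduces to the one-step TD(0) update; larger values let a single TD error adjust states visited further back in the episode.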

In neuroscience

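The TD algorithm has also received attention in neuroscience. Researchers found that the firing rate of dopamine neurons in the ventral tegmental area and the substantia nigra resembles the error function of the algorithm.^[3]^[4]^[5]^[6]^[7] The error function reports the difference between the estimated reward at any given state or time step and the reward actually received. The larger the error function, the larger the difference between the expected and the actual reward.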

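The behavior of dopamine cells also resembles TD learning. In one experiment, researchers trained a monkey to associate a stimulus with a juice reward and measured the activity of its dopamine cells.^[12] Initially, the firing rate of the dopamine cells increased when the monkey received juice, indicating a difference between the expected and the actual reward. As training progressed, the expected reward changed as well, and the firing rate of the dopamine cells no longer increased significantly. When an expected reward was not delivered, the firing rate of the dopamine cells decreased. This pattern mirrors the behavior of the error function in TD learning.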
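The qualitative pattern in this experiment can be reproduced with a toy TD(0) simulation. This is only an illustrative sketch and not the model used in the cited studies: it assumes that a trial consists of five time steps after stimulus onset, that the juice (reward 1) arrives at the last step, and that the learned value of each time step plays the role of the expected reward. The function name and parameters are hypothetical.

```python
def run_trial(V, reward, alpha=0.1, gamma=1.0, learn=True):
    """Run one conditioning trial and return the TD error at every time step."""
    n = len(V)
    deltas = []
    for t in range(n):
        r = reward if t == n - 1 else 0.0         # juice only at the last step
        v_next = V[t + 1] if t + 1 < n else 0.0   # nothing follows the trial
        delta = r + gamma * v_next - V[t]         # TD error at step t
        if learn:
            V[t] += alpha * delta                 # TD(0) update
        deltas.append(delta)
    return deltas


V = [0.0] * 5                                     # no expectation at first
first = run_trial(V, reward=1.0)                  # untrained animal
for _ in range(2000):
    run_trial(V, reward=1.0)                      # repeated pairings
trained = run_trial(V, reward=1.0)                # reward now fully expected
omitted = run_trial(V, reward=0.0, learn=False)   # expected reward withheld

print("error at reward time, first trial:   ", round(first[-1], 3))
print("error at reward time, after training:", round(trained[-1], 3))
print("error at reward time, reward omitted:", round(omitted[-1], 3))
```

In this sketch the error at reward time starts out positive, shrinks toward zero as the reward becomes predicted, and turns negative when the predicted reward is withheld, matching the increase, absence of increase, and decrease in firing rate described above.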

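Many studies of neural function now build on TD learning,^[13]^[14] and the approach has also been used to study conditions such as schizophrenia and the pharmacological effects of dopamine.^[15]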

References

[1] Sutton & Barto (2018), p. 133.
[2] Sutton, Richard S. Learning to predict by the methods of temporal differences. Machine Learning. 1988-08-01, 3 (1): 9–44 [2023-04-04]. ISSN 1573-0565. doi:10.1007/BF00115009. (Archived from the original on 2023-03-31.)
[3] Schultz, W.; Dayan, P.; Montague, P. R. A neural substrate of prediction and reward. Science. 1997, 275 (5306): 1593–1599. CiteSeerX 10.1.1.133.6176. PMID 9054347. S2CID 220093382. doi:10.1126/science.275.5306.1593.
[4] Montague, P. R.; Dayan, P.; Sejnowski, T. J. A framework for mesencephalic dopamine systems based on predictive Hebbian learning (PDF). The Journal of Neuroscience. 1996-03-01, 16 (5): 1936–1947 [2023-04-04]. ISSN 0270-6474. PMC 6578666. PMID 8774460. doi:10.1523/JNEUROSCI.16-05-01936.1996. (Archived (PDF) from the original on 2018-07-21.)
[5] Montague, P. R.; Dayan, P.; Nowlan, S. J.; Pouget, A.; Sejnowski, T. J. Using aperiodic reinforcement for directed self-organization (PDF). Advances in Neural Information Processing Systems. 1993, 5: 969–976 [2023-04-04]. (Archived (PDF) from the original on 2006-03-12.)
[6] Montague, P. R.; Sejnowski, T. J. The predictive brain: temporal coincidence and temporal order in synaptic learning mechanisms. Learning & Memory. 1994, 1 (1): 1–33. ISSN 1072-0502. PMID 10467583. S2CID 44560099. doi:10.1101/lm.1.1.1.
[7] Sejnowski, T. J.; Dayan, P.; Montague, P. R. Predictive hebbian learning. Proceedings of Eighth ACM Conference on Computational Learning Theory. 1995: 15–18. ISBN 0897917235. S2CID 1709691. doi:10.1145/225298.225300.
[8] Sutton & Barto (2018), p. 134.
[9] Sutton & Barto (2018), p. 135.
[10] Tesauro, Gerald. Temporal difference learning and TD-Gammon. Communications of the ACM. 1995-03-01, 38 (3): 58–68 [2023-04-06]. ISSN 0001-0782. doi:10.1145/203330.203343. (Archived from the original on 2023-04-06.)
[11] Sutton & Barto (2018), p. 175.
[12] Schultz, W. Predictive reward signal of dopamine neurons. Journal of Neurophysiology. 1998, 80 (1): 1–27. CiteSeerX 10.1.1.408.5994. PMID 9658025. S2CID 52857162. doi:10.1152/jn.1998.80.1.1.
[13] Dayan, P. Motivated reinforcement learning (PDF). Advances in Neural Information Processing Systems (MIT Press). 2001, 14: 11–18 [2023-04-11]. (Archived (PDF) from the original on 2012-05-25.)
[14] Tobia, M. J.; et al. Altered behavioral and neural responsiveness to counterfactual gains in the elderly. Cognitive, Affective, & Behavioral Neuroscience. 2016, 16 (3): 457–472. PMID 26864879. S2CID 11299945. doi:10.3758/s13415-016-0406-7.
[15] Smith, A.; Li, M.; Becker, S.; Kapur, S. Dopamine, prediction error, and associative learning: a model-based account. Network: Computation in Neural Systems. 2006, 17 (1): 61–84. PMID 16613795. S2CID 991839. doi:10.1080/09548980500361624.

Works cited

Sutton, Richard S.; Barto, Andrew G. Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press. 2018 [2023-04-04]. (Archived from the original on 2023-04-26.)

Further reading

Meyn, S. P. Control Techniques for Complex Networks. Cambridge University Press. 2007. ISBN 978-0521884419. See final chapter and appendix.
Sutton, R. S.; Barto, A. G. Time Derivative Models of Pavlovian Reinforcement (PDF). Learning and Computational Neuroscience: Foundations of Adaptive Networks. 1990: 497–537 [2023-04-06]. (Archived (PDF) from the original on 2017-03-30.)

External links

Connect Four TDGravity Applet (archived at the Internet Archive) (+ mobile phone version) – self-learned using TD-Leaf method (combination of TD-Lambda with shallow tree search)
Self Learning Meta-Tic-Tac-Toe (archived at the Internet Archive). Example web app showing how temporal difference learning can be used to learn state evaluation constants for a minimax AI playing a simple board game.
Reinforcement Learning Problem, a document explaining how temporal difference learning can be used to speed up Q-learning
TD-Simulator (archived at the Internet Archive). Temporal difference simulator for classical conditioning
