Recurrent networks: connections are allowed in both directions between units, and even from a unit to itself.

7.1 Boltzmann Machines

Stochastic networks with symmetric connections($w_{ij}=w_{ji}$).

  • Original form: slow, because averages over the stochastic variables must be estimated by sampling.
  • Mean-field version: much faster and more widely applied.

Stochastic Units

Visible & hidden units.

Hidden units: No connection to outside world.

$$ \begin{aligned} P(S_{i} = +1) &= g(h_{i}) = \frac{1}{1+\exp{(-2\beta h_{i})}}\\ P(S_{i} = -1) &= 1 - g(h_{i}) \end{aligned} $$

$\begin{aligned}h_{i} = \sum_{j}w_{ij}S_{j},\quad\beta = 1/T\end{aligned}$.

Energy function:

$$ H(\{S_{i}\}) = -\frac{1}{2}\sum_{ij}w_{ij}S_{i}S_{j},\quad S_{i} = \text{sgn}{(h_{i})} $$

Boltzmann-Gibbs distribution:

$$ P(\{S_{i}\}) = \frac{e^{-\beta H(\{S_{i}\})}}{Z} $$

Partition function: $$Z = \sum_{\{S_{i}\}} e^{-\beta H(\{S_{i}\})}$$

average $\langle X\rangle$:

$$ \langle X\rangle = \sum_{\{S_{i}\}}P(\{S_{i}\}) X(\{S_{i}\}) $$
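For a small network these quantities can be evaluated exactly by enumerating all $2^{N}$ configurations. A minimal sketch in Python (the weights, $\beta$, and the choice of $X$ are arbitrary illustrations):

```python
import itertools
import numpy as np

# Tiny example: 3 units with symmetric weights (w_ii = 0), arbitrary values.
w = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 0.3],
              [-0.5, 0.3, 0.0]])
beta = 1.0  # inverse temperature 1/T

def energy(S):
    # H = -1/2 sum_ij w_ij S_i S_j
    return -0.5 * S @ w @ S

# Enumerate all 2^N configurations with S_i = +/-1.
states = [np.array(s) for s in itertools.product([-1, 1], repeat=3)]
boltzmann = np.array([np.exp(-beta * energy(S)) for S in states])
Z = boltzmann.sum()          # partition function
P = boltzmann / Z            # Boltzmann-Gibbs distribution

# Average of any quantity X({S_i}), e.g. the correlation S_0 S_1.
avg_S0S1 = sum(p * S[0] * S[1] for p, S in zip(P, states))
print(Z, avg_S0S1)
```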

Pattern completion: at low $T$, only a few states have high (and roughly equal) probability.

The Hebb-like prescription $w_{ij}\propto \xi_{i}\xi_{j}$ gives no information about weights involving hidden units.


$\alpha$: state of visible units;

$\beta$: state of hidden units.

$N$ visible units, $K$ hidden units $\Rightarrow 2^{N+K}$ possibilities.

Probability $P_{\alpha}$ that the visible units are in state $\alpha$:

$$ \begin{aligned} P_{\alpha} &= \sum_{\beta}P_{\alpha\beta} = \sum_{\beta}\frac{1}{Z}e^{-\beta H_{\alpha\beta}}\\ Z &= \sum_{\alpha\beta}e^{-\beta H_{\alpha\beta}}\\ H_{\alpha\beta} &= -\frac{1}{2}\sum_{ij}w_{ij}S_{i}^{\alpha\beta}S_{j}^{\alpha\beta} \end{aligned} $$

Desired probability: $R_{\alpha}$

Cost function: relative entropy (measuring the difference between $R_{\alpha}$ and $P_{\alpha}$):

$$ E = \sum_{\alpha}R_{\alpha}\log{\frac{R_{\alpha}}{P_{\alpha}}} = E_{0} {\color{green}{- \sum_{\alpha}R_{\alpha}\log{P_{\alpha}}}}\geq 0 $$

$E = 0$ only if $P_{\alpha} = R_{\alpha}$, $\forall \alpha$.

Gradient descent:

$$ \begin{aligned} {\color{blue}{\Delta w_{ij}}} &= -\eta \frac{\partial E}{\partial w_{ij}} = \eta\sum_{\alpha}\frac{R_{\alpha}}{P_{\alpha}}{\color{red}{\frac{\partial P_{\alpha}}{\partial w_{ij}}}}\\ {\color{red}{\frac{\partial P_{\alpha}}{\partial w_{ij}}}} &= \sum_{\beta}\frac{1}{Z}e^{-\beta H_{\alpha\beta}}S_{i}^{\alpha\beta}S_{j}^{\alpha\beta} - \frac{1}{Z^{2}}\left(\sum_{\beta}e^{-\beta H_{\alpha\beta}}\right)\sum_{\lambda\mu}e^{-\beta H_{\lambda\mu}}S_{i}^{\lambda\mu}S_{j}^{\lambda\mu}\\ &= \beta\left[ \sum_{\beta}S_{i}^{\alpha\beta}S_{j}^{\alpha\beta}P_{\alpha\beta} - P_{\alpha}\langle S_{i}S_{j}\rangle \right]\\ {\color{blue}{\Delta w_{ij}}} &= \eta\beta\left[ \sum_{\alpha}\frac{R_{\alpha}}{P_{\alpha}}\sum_{\beta}S_{i}^{\alpha\beta}S_{j}^{\alpha\beta}P_{\alpha\beta} - \sum_{\alpha}R_{\alpha}\langle S_{i}S_{j}\rangle \right]\\ &= \eta\beta\left[ \sum_{\alpha\beta}R_{\alpha}P_{\beta|\alpha}S_{i}^{\alpha\beta}S_{j}^{\alpha\beta} - \langle S_{i}S_{j}\rangle \right]\\ &= \eta\beta\left[ \overset{\text{Hebb term}}{\overline{\langle S_{i}S_{j}\rangle}_{\text{clamped}}} - \overset{\text{unlearning term}}{\langle S_{i}S_{j}\rangle_{\text{free}}} \right],\quad\text{Boltzmann learning rule} \end{aligned} $$

conditional probability: $P_{\beta|\alpha} = P_{\alpha\beta}/P_{\alpha}$.

$\overline{\langle S_{i}S_{j}\rangle}_{\text{clamped}}$: the average of $S_{i}S_{j}$ with the visible units clamped in state $\alpha$ with probability $R_{\alpha}$.

Bring the system to equilibrium before taking an average.

Monte Carlo simulation:

Two kinds of methods:

  • Select a unit at random, update its state according to $P(S_{i} = \pm 1)$.
  • Flip a unit with prob $$P(S_{i}\rightarrow -S_{i}) = \frac{1}{1+\exp{(\beta\Delta H_{i})}}.$$ Slow at low $T$ (tends to get trapped in local minima)

Faster solution: simulated annealing procedure

  1. Adjust weights until convergence (outer loop);
  2. Calculate $\langle S_{i}S_{j}\rangle$ in the clamped and free states;
  3. Follow an annealing schedule $T(t)$;
  4. Sample and update units (see the sketch below).
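A minimal sketch of one Boltzmann learning step along these lines, using single-unit Gibbs updates and a simple geometric annealing schedule (the network size, schedule, learning rate, and toy training patterns are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N_vis, N_hid = 4, 2
N = N_vis + N_hid
w = 0.1 * rng.standard_normal((N, N))
w = (w + w.T) / 2
np.fill_diagonal(w, 0.0)                      # symmetric weights, no self-connections

def gibbs_sweep(S, beta, clamped=()):
    """Update each unclamped unit with P(S_i=+1) = 1/(1+exp(-2*beta*h_i))."""
    for i in rng.permutation(N):
        if i in clamped:
            continue
        h = w[i] @ S
        S[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2 * beta * h)) else -1
    return S

def correlations(clamp_pattern=None, sweeps=50):
    """Anneal from high T to low T, then measure <S_i S_j> at the final T."""
    S = rng.choice([-1, 1], size=N)
    clamped = ()
    if clamp_pattern is not None:             # clamp visible units to a pattern
        S[:N_vis] = clamp_pattern
        clamped = tuple(range(N_vis))
    corr = np.zeros((N, N))
    for T in [4.0, 2.0, 1.0, 0.5]:            # annealing schedule T(t)
        for _ in range(sweeps):
            S = gibbs_sweep(S, beta=1.0 / T, clamped=clamped)
    for _ in range(sweeps):                   # sample at the final temperature
        S = gibbs_sweep(S, beta=2.0, clamped=clamped)
        corr += np.outer(S, S)
    return corr / sweeps

# One Boltzmann learning step: Hebb term minus unlearning term.
patterns = rng.choice([-1, 1], size=(3, N_vis))    # toy training patterns (R_alpha uniform)
eta, beta = 0.05, 2.0
clamped_corr = np.mean([correlations(p) for p in patterns], axis=0)
free_corr = correlations(None)
w += eta * beta * (clamped_corr - free_corr)
np.fill_diagonal(w, 0.0)
```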

Fast simulated annealing(Cauchy machine): multiple flips.

Slow but effective.

Variations on Boltzmann Machines

  1. Weight decay terms $w_{ij}^{\text{new}} = (1-\varepsilon)w_{ij}^{\text{old}}$: automatically become symmetric;
  2. Incremental rule

If $S_{i}$ and $S_{j}$ are on together, change $w_{ij}$ by a fixed step $\varepsilon$: increase it in the clamped phase, decrease it in the free phase.

This avoids the oscillations across a valley that can occur when relying on the local gradient.

  • If $\exists\,\alpha$ with $R_{\alpha}=0$, raise it to a small $\epsilon>0$ to avoid infinite weights.

Input ($\gamma$), output ($\alpha$), hidden ($\beta$). Learn $\gamma\rightarrow \alpha$ (i.e., the goal is to make $P_{\alpha|\gamma}=R_{\alpha|\gamma}$).

Error measure(entropic cost function):

$$ E = \sum_{\gamma}p_{\gamma}\sum_{\alpha}R_{\alpha|\gamma}\log{\frac{R_{\alpha|\gamma}}{P_{\alpha|\gamma}}} $$

Learning rule:

$$ \Delta w_{ij} = \eta\beta\left[ \overline{\langle S_{i}S_{j}\rangle}_{\text{I,O clamped}} - \overline{\langle S_{i}S_{j}\rangle}_{\text{I clamped}} \right] $$

  1. Harmonium: a two-layer Boltzmann machine, with connections between the layers only.

  2. Hopfield-like case (no hidden units): $\overline{\langle S_{i}S_{j}\rangle}_{\text{clamped}}$ is just the average of $\xi_{i}^{\mu}\xi_{j}^{\mu}$ over the patterns $\mu$; for the free average, start from a random configuration and relax at $T=0$. Faster than the incremental rule.

Statistical Mechanics Reformulation

$P_{\alpha}$: Prob of visible units in state $\alpha$

$\begin{aligned}Z_{\text{clamped}}^{\alpha} = \sum_{\beta}e^{-\beta H_{\alpha\beta}}\end{aligned}$: Partition function (visible units clamped in state $\alpha$)

$$ Z = e^{-\beta F} $$

$$ P_{\alpha} = \frac{Z_{\text{clamped}}^{\alpha}}{Z} = \frac{e^{-\beta F_{\text{clamped}}^{\alpha}}}{e^{-\beta F}} = e^{-\beta (F_{\text{clamped}}^{\alpha} - F)} $$

Cost function:

$$ \begin{aligned} E = E_{0} - \sum_{\alpha}R_{\alpha}\log{P_{\alpha}} &= E_{0} + \beta\sum_{\alpha}R_{\alpha}(F_{\text{clamped}}^{\alpha} - F)\\ &= E_{0} + \beta\left[ \overline{F_{\text{clamped}}^{\alpha}} - F \right] \end{aligned} $$

$$ \begin{aligned} \langle S_{i}S_{j}\rangle &= -\frac{\partial F}{\partial w_{ij}} \\ &= T\frac{\partial \log{Z}}{\partial w_{ij}} = \frac{T}{Z}\frac{\partial Z}{\partial w_{ij}} = \frac{1}{Z}\sum_{\alpha\beta} S_{i}^{\alpha\beta}S_{j}^{\alpha\beta}e^{-\beta H_{\alpha\beta}} \end{aligned} $$

Deterministic Boltzmann Machines

Mean field annealing: $\begin{aligned}m_{i}\equiv\langle S_{i}\rangle = \tanh{\left(\beta\sum_{j}w_{ij}m_{j}\right)},\quad \langle S_{i}S_{j}\rangle\approx m_{i}m_{j}\end{aligned}$

Iteration

$$ m_{i}^{\text{new}} = \tanh{\left(\beta\sum_{j}w_{ij}m_{j}^{\text{old}}\right)} $$
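A minimal sketch of this mean-field fixed-point iteration (the weights, $\beta$, and the convergence tolerance are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
w = 0.3 * rng.standard_normal((N, N))
w = (w + w.T) / 2
np.fill_diagonal(w, 0.0)              # symmetric weights, no self-coupling
beta = 1.0

# Mean-field fixed-point iteration: m_i <- tanh(beta * sum_j w_ij m_j).
m = 0.01 * rng.standard_normal(N)     # small random initial magnetizations
for _ in range(200):
    m_new = np.tanh(beta * (w @ m))
    if np.max(np.abs(m_new - m)) < 1e-8:
        m = m_new
        break
    m = m_new

# <S_i S_j> is then approximated by m_i m_j (the mean-field factorization).
corr_mf = np.outer(m, m)
```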

Entropy:

$$ S = - \sum_{i}\left( p_{i}^{+}\log{p_{i}^{+}} + p_{i}^{-}\log{p_{i}^{-}} \right), \quad p_{i}^{\pm} = P(S_{i} = \pm 1) $$

$p_{i}^{+} + p_{i}^{-} = 1$, $m_{i} = p_{i}^{+} - p_{i}^{-}$

$$ p_{i}^{\pm} = \frac{1\pm m_{i}}{2} $$

Mean field free energy:

$$ \begin{aligned} F_{\text{MF}} &= H - TS \\ &= -\frac{1}{2}\sum_{ij}w_{ij}m_{i}m_{j} + T\sum_{i}\left( \frac{1+m_{i}}{2}\log{\frac{1+m_{i}}{2}} + \frac{1-m_{i}}{2}\log{\frac{1-m_{i}}{2}} \right) \end{aligned} $$

$$ E = E_{0} + \beta\left[ \overline{F_{\text{clamped}}^{\alpha}} - F \right]\overset{\text{MF}}{\longrightarrow}E_{\text{MF}} = E_{0} + \beta\left[ \overline{F_{\text{MF}}^{\alpha}} - F_{\text{MF}} \right] $$

Gradient descent for $w_{ij}$:

$$ \Delta w_{ij} = -\eta\frac{\partial E_{\text{MF}}}{\partial w_{ij}} = \eta\beta \bigg[\overline{m_{i}^{\alpha}m_{j}^{\alpha}} - m_{i}m_{j}\bigg] $$

Compare the original (stochastic) rule: $$ \Delta w_{ij} = \eta\beta \bigg[\overline{\langle S_{i}S_{j}\rangle}_{\text{clamped}} - \langle S_{i}S_{j}\rangle_{\text{free}}\bigg] $$

7.2 Recurrent Back-Propagation

$V_{i}$: $N$ continuous-valued units

$w_{ij}$: connections

$g(h)$: activation function

$\zeta_{i}^{\mu}$: desired training values.

Dynamic evolution rules:

$$ \tau\frac{\mathrm{d}V_{i}}{\mathrm{d}t} = -V_{i} + g\left( \sum_{j}w_{ij}V_{j} + \xi_{i} \right) $$

fixed point equation $\begin{aligned}\left(\frac{\mathrm{d}V_{i}}{\mathrm{d}t} = 0\right)\end{aligned}$:

$$ V_{i} = g(h_{i}) = g\left( \sum_{j}w_{ij}V_{j} + \xi_{i} \right) $$

Assumption: At least 1 fixed point exists and is a stable attractor.

Error measure(Cost function):

$$ E = \frac{1}{2}\sum_{k}E_{k}^{2},\quad E_{k} = \begin{cases} \zeta_{k} - V_{k} & \text{if }k\text{ is output unit}\\ 0 & \text{otherwise} \end{cases} $$

Gradient descent:

$$ \Delta w_{pq} = -\eta\frac{\partial E}{\partial w_{pq}} = \eta\sum_{k}E_{k}\frac{\partial V_{k}}{\partial w_{pq}} $$

Differentiating fixed point equation:

$$ \begin{aligned} \frac{\partial V_{i}}{\partial w_{pq}} &= g^{\prime}(h_{i}){\color{red}{\frac{\partial h_{i}}{\partial w_{pq}}}}\\ {\color{red}{\frac{\partial h_{i}}{\partial w_{pq}}}} &= \frac{\partial }{\partial w_{pq}}\left(\sum_{j}{\color{blue}{w_{ij}V_{j}}} + \xi_{i}\right)\\ \frac{\partial ({\color{blue}{w_{ij}V_{j}}})}{\partial w_{pq}} &= {\color{green}{\frac{\partial w_{ij}}{\partial w_{pq}}}}V_{j} + w_{ij}\frac{\partial V_{j}}{\partial w_{pq}} = {\color{green}{\delta_{ip}\delta_{jq}}}V_{j} + w_{ij}\frac{\partial V_{j}}{\partial w_{pq}}\\ \Rightarrow {\color{red}{\frac{\partial h_{i}}{\partial w_{pq}}}} &= \sum_{{\color{green}{j}}}\left( \delta_{ip}\delta_{{\color{green}{j}}q}V_{{\color{green}{j}}} + w_{ij}\frac{\partial V_{j}}{\partial w_{pq}} \right) = \delta_{ip}V_{q} + \sum_{j}w_{ij}\frac{\partial V_{j}}{\partial w_{pq}}\\ \Rightarrow \frac{\partial V_{i}}{\partial w_{pq}} &= g^{\prime}(h_{i})\left( \delta_{ip}V_{q} + \sum_{j}w_{ij}\frac{\partial V_{j}}{\partial w_{pq}} \right)\\ &= \delta_{ip}g^{\prime}(h_{i})V_{q} + \sum_{j}g^{\prime}(h_{i})w_{ij}\frac{\partial V_{j}}{\partial w_{pq}} \end{aligned} $$

Write the single term on the left as a sum with a Kronecker delta, $\begin{aligned}\frac{\partial V_{i}}{\partial w_{pq}} = \sum_{j}\delta_{ij}\frac{\partial V_{j}}{\partial w_{pq}}\end{aligned}$, then move the second term on the right-hand side to the left:

$$ \sum_{j}{\color{red}{\bigg[\delta_{ij}-g^{\prime}(h_{i})w_{ij}\bigg]}}\frac{\partial V_{j}}{\partial w_{pq}} = \delta_{ip}g^{\prime}(h_{i})V_{q}\\ \sum_{j}{\color{red}{\mathbf{L}_{ij}}}\frac{\partial V_{j}}{\partial w_{pq}} = \delta_{ip}g^{\prime}(h_{i})V_{q} $$

This can be read as the matrix product $\mathbf{L}_{i\times j}\times \frac{\partial \mathbf{V}}{\partial w_{pq}}_{j\times 1}$. Left-multiplying both sides by $\mathbf{L}^{-1}$:

$$ \frac{\partial V_{k}}{\partial w_{pq}} = (\mathbf{L}^{-1})_{kp}g^{\prime}(h_{p})V_{q} $$

delta-rule:

$$ \begin{aligned} \Delta w_{pq} &= \eta{\color{red}{\sum_{k} E_{k}(\mathbf{L}^{-1})_{kp}g^{\prime}(h_{p})}}V_{q}\\ &= \eta{\color{red}{\delta_{p}}}V_{q} \end{aligned} $$

Recall: the delta rule comes from gradient descent, but it can be generalized to a more universal form.

Cost function:

$$ E[\vec{w}] = \frac{1}{2}\sum_{i\mu}(\zeta_{i}^{\mu} - O_{i}^{\mu})^{2} = \frac{1}{2}\sum_{i\mu}\left(\zeta_{i}^{\mu}-\sum_{k}w_{ik}\xi_{k}^{\mu}\right)^{2} $$

$$ \Delta w_{ik} = -\eta\frac{\partial E}{\partial w_{ik}} = \eta\sum_{\mu}(\zeta_{i}^{\mu} - O_{i}^{\mu})\xi_{k}^{\mu} $$

For a fixed $\mu$:

$$ \Delta w_{ik} = \eta(\zeta_{i}^{\mu}-O_{i}^{\mu})\xi_{k}^{\mu} = {\color{red}{\eta\delta_{i}\xi_{k}^{\mu}}} $$

Computing the inverse $\mathbf{L}^{-1}$ is time-consuming.

Improvement:

$$ \delta_{p} = {\color{red}{\sum_{k} E_{k}(\mathbf{L}^{-1})_{kp}}}g^{\prime}(h_{p}) = g^{\prime}(h_{p}){\color{red}{Y_{p}}} $$

$$ \begin{aligned} \sum_{p}\mathbf{L}_{pi}Y_{p} &= \sum_{p}\mathbf{L}_{pi}\left[ \sum_{k}E_{k}(\mathbf{L}^{-1})_{kp} \right] \\ &= \sum_{k}E_{k}\left[ \sum_{p}\mathbf{L}_{pi}(\mathbf{L}^{-1})_{kp} \right]\\ &= \sum_{k}E_{k}\delta_{ik} = E_{i} \end{aligned} $$

Recall: $$ \mathbf{L}_{ij} = \delta_{ij} - g^{\prime}(h_{i})w_{ij} $$

Then

$$ \begin{aligned} \sum_{p}\mathbf{L}_{pi}Y_{p} &= \sum_{p}\bigg[ \delta_{pi}-g^{\prime}(h_{p})w_{pi} \bigg]Y_{p} \\ &= \sum_{p}\delta_{pi}Y_{p} - \sum_{p}g^{\prime}(h_{p})w_{pi}Y_{p} \\ &= \boxed{Y_{i} - \sum_{p}g^{\prime}(h_{p})w_{pi}Y_{p} = E_{i}} \end{aligned} $$

Error-propagation network:

$$ \tau\frac{\mathrm{d}Y_{i}}{\mathrm{d}t} = - Y_{i} + \sum_{p}g^{\prime}(h_{p})w_{pi}Y_{p} + E_{i} $$

The error-propagation network is obtained from the original one by the substitutions $w_{ij}\rightarrow g^{\prime}(h_{i})w_{ij}$ and $g(x)\rightarrow x$ (linear units), with the connections transposed.

Network transposition: explicit matrix inversion is avoided by running this transposed network instead.

Procedure:

  1. Run $\begin{aligned}\tau\frac{\mathrm{d}V_{i}}{\mathrm{d}t} = -V_{i} + g\left(\sum_{j}w_{ij}V_{j} + \xi_{i}\right)\end{aligned}$ to find $V_{i}$;

  2. Use $\begin{aligned}E_{k} = \begin{cases}\zeta_{k}-V_{k} & k\text{ is output}\\0&\text{otherwise}\end{cases}\end{aligned}$ to determine $E_{i}$;

  3. Run $\begin{aligned}\tau\frac{\mathrm{d}Y_{i}}{\mathrm{d}t} = - Y_{i} + \sum_{p}g^{\prime}(h_{p})w_{pi}Y_{p} + E_{i}\end{aligned}$ to find $Y_{i}$;

  4. Update the weights with $\Delta w_{pq}=\eta\delta_{p}V_{q}$, $\delta_{p} = g^{\prime}(h_{p})Y_{p}$ (the full procedure is sketched in code below).

This requires no non-local operations (no matrix inversion).

The original network supplies the error signals $E_{i}$; the error-propagation network is used to adjust the weights in the original network.
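A minimal sketch of one recurrent back-propagation training step along the lines of the procedure above, using simple Euler integration, a tanh activation, and an arbitrary choice of output units (all of these are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_out = 6, 2
w = 0.2 * rng.standard_normal((N, N))       # recurrent weights w_ij
xi = rng.standard_normal(N)                  # external input xi_i
zeta = rng.standard_normal(n_out)            # targets for the output units
out_idx = np.arange(n_out)                   # which units are outputs (assumption)
g, g_prime = np.tanh, lambda h: 1.0 - np.tanh(h) ** 2
eta, tau, dt, steps = 0.1, 1.0, 0.05, 2000

# 1. Relax the original network to its fixed point V_i = g(h_i).
V = np.zeros(N)
for _ in range(steps):
    h = w @ V + xi
    V += (dt / tau) * (-V + g(h))

# 2. Error signals: E_k = zeta_k - V_k on output units, 0 elsewhere.
E = np.zeros(N)
E[out_idx] = zeta - V[out_idx]

# 3. Relax the error-propagation (transposed, linear) network to find Y_i.
Y = np.zeros(N)
gp = g_prime(h)
for _ in range(steps):
    Y += (dt / tau) * (-Y + (w * gp[:, None]).T @ Y + E)

# 4. Delta rule: delta_p = g'(h_p) Y_p, and Delta w_pq = eta * delta_p * V_q.
delta = gp * Y
w += eta * np.outer(delta, V)
```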


$\begin{aligned}Y_{i} - \sum_{p}g^{\prime}(h_{p})w_{pi}Y_{p} = E_{i}\end{aligned}$ is a stable attractor if $\begin{aligned}V_{i} = g\left(\sum_{j}w_{ij}V_{j} + \xi_{i}\right)\end{aligned}$ is a stable attractor.

Let $V_{i}^{*}$ and $Y_{i}^{*}$ be the fixed points of the two dynamical equations. Consider their neighbourhoods, $V_{i} = V_{i}^{*} + \epsilon_{i}$ and $Y_{i} = Y_{i}^{*} + \eta_{i}$, substitute into the differential equations, and keep terms to first order:

$$ \begin{aligned} \tau\frac{\mathrm{d}\epsilon_{i}}{\mathrm{d}t} &= -\epsilon_{i} + g^{\prime}(h_{i})\sum_{j}w_{ij}\epsilon_{j} = -\sum_{j}\mathbf{L}_{ij}\epsilon_{j}\\ \tau\frac{\mathrm{d}\eta_{i}}{\mathrm{d}t} &= -\eta_{i} + \sum_{p}g^{\prime}(h_{p})w_{pi}\eta_{p} = -\sum_{p}\mathbf{L}_{ip}^{T}\eta_{p} \end{aligned} $$

$\mathbf{L}$ and $\mathbf{L}^{T}$ have the same eigenvalues, so the local stability of the two fixed points is the same.

For a fully connected network, inverting $\mathbf{L}$ ($N\times N$) costs $\mathcal{O}(N^{3})$ time, whereas the network-transposition method costs only $\mathcal{O}(N^{2})$ per relaxation.

7.3 Learning Time Sequences

  • Sequence Recognition: a specific input sequence $\to$ a particular output pattern.

  • Sequence Reproduction: generate the rest of a sequence when given part of it.

  • Temporal Association: produce a particular output sequence in response to a specific input sequence.

Tapped Delay Lines

time-delay neural networks:

Samples $x[\tau], x[\tau-\Delta], \cdots, x[\tau-(m-1)\Delta]$ taken from the signal $x[t]$ are presented to the network simultaneously as inputs.

  • cannot handle sequences of arbitrary length (inflexible);
  • needs many training examples and much computation (costly);
  • requires a precise clock (strict synchronization).

Tank & Hopfield:

Let the raw signal be $\vec{x}(t)$.

  • Usual delay line: $\vec{x}(t), \vec{x}(t-\tau_{1}),\vec{x}(t-\tau_{2}),\cdots, \vec{x}(t-\tau_{m})$;

  • Tank & Hopfield: $\begin{aligned} y(t;\tau_{i}) = \int_{-\infty}^{t}G(t-t^{\prime};\tau_{i})\vec{x}(t^{\prime})\mathrm{d}t^{\prime}\end{aligned}$

$$ G(t;\tau_{i}) = \left(\frac{t}{\tau_{i}}\right)^{\alpha}e^{\alpha(1-t/\tau_{i})} $$

$G$ reaches its maximum value 1 at $t=\tau_{i}$; the larger $\alpha$ is, the sharper the peak of $G$.

no need for precise synchronization by a central clock.

Similar in spirit to a convolution with a sliding window.
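A small sketch of this filtered delay-line front end, approximating the convolution in discrete time (the example signal, $\alpha$, and the set of $\tau_{i}$ are arbitrary):

```python
import numpy as np

dt = 0.01
t = np.arange(0.0, 10.0, dt)
x = np.sin(2 * np.pi * 0.5 * t)              # example raw signal x(t)
alpha = 4.0                                   # larger alpha -> sharper kernel
taus = [0.5, 1.0, 2.0]                        # delays tau_i

def G(t, tau, alpha):
    # Kernel peaking at t = tau with value 1.
    return (t / tau) ** alpha * np.exp(alpha * (1.0 - t / tau))

# y(t; tau_i) = integral_{-inf}^{t} G(t - t', tau_i) x(t') dt'
ys = []
for tau in taus:
    kernel = G(t, tau, alpha)                 # G evaluated on [0, 10)
    y = np.convolve(x, kernel)[: len(t)] * dt # causal discrete convolution
    ys.append(y)
ys = np.array(ys)                             # these become the network inputs
```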

Context Units

Partially recurrent networks(sequential networks):

  • mainly feed-forward;
  • chosen set of feed-back connections

context units $C_{i}$: receive clocked feedback signals.

At time $t$, a context unit receives the network signals from time $t-1$.

Single arrow: unit $i$ connects only to unit $i$ of the next layer (one-to-one).

Shaded arrow: each unit connects to all units of the previous layer (fully connected).

(a) context units hold a copy of the activations of the hidden units from the previous time step.

(b) updating rule:

$$ C_{i}(t+1) = \alpha C_{i}(t) + O_{i}(t) $$

$O_{i}$: output units;

$\alpha < 1$: strength of the self-connections.

If $O_{i}$ fixed: $C_{i}\to O_{i}/(1-\alpha)$

Hence they are also called decay, integrating, or capacitive units.

Iterating:

$$ \begin{aligned} C_{i}(t+1) &= O_{i}(t) + \alpha O_{i}(t-1) + \alpha^{2}O_{i}(t-2)+\cdots\\ &= \sum_{t^{\prime}=0}^{t}\alpha^{t-t^{\prime}}O_{i}(t^{\prime})\\ &\Rightarrow \int_{0}^{t}e^{-\gamma (t-t^{\prime})}O_{i}(t^{\prime})\mathrm{d}t^{\prime},\quad \gamma = |\log{\alpha}| \end{aligned} $$

This shows that $\alpha$ should be matched to the time scale of the input sequence. Several groups of context units can be used, each group with a different value of $\alpha$.
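A tiny sketch of update rule (b), showing the convergence $C_{i}\to O_{i}/(1-\alpha)$ for a constant output (the value of $\alpha$ is illustrative):

```python
import numpy as np

alpha = 0.8          # self-connection strength (decay rate), illustrative
T = 60
O = np.ones(T)       # output unit held fixed at O_i = 1

C = np.zeros(T + 1)
for t in range(T):
    # Context-unit update: C(t+1) = alpha * C(t) + O(t)
    C[t + 1] = alpha * C[t] + O[t]

# With O fixed, C converges to O / (1 - alpha) = 5 for alpha = 0.8.
print(C[-1], 1.0 / (1.0 - alpha))
```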

(c) Feedback: context units themselves only.

(d)

  • The connections from the input units to the context units are fully connected rather than one-to-one.
  • Self-connections can be trained.

Ordinary back-propagation can be used as long as there are no modifiable recurrent connections.

Mozer:

$$ C_{i}(t+1) = \alpha_{i}C_{i}(t) + g\left( \sum_{j}w_{ij}\xi_{j} \right) $$

$\xi_{j}$: input pattern

Williams & Zipser:

$$ C_{i}(t+1) = g\left[ \alpha_{i}C_{i}(t) + \sum_{j}w_{ij}\xi_{j} \right] $$

Back-Propagation Through Time

fully recurrent networks: Any unit $V_{i}$ may be connected to any other.

Update rule:

$$ V_{i}(t+1) = g[h_{i}(t)] = g\left[ \sum_{j}w_{ij}V_{j}(t) + \xi_{i}(t) \right] $$

$\xi_{i}(t)$: input at node $i$ at time $t$; taken to be $0$ when there is no input.

Produce $\zeta_{i}(t)$ in response to input sequence $\xi_{i}(t)$

Trick: any recurrent network can be unfolded in time into an equivalent feed-forward network.

Example: a 2-unit network unfolded for $T=4$ time steps; the weights $w_{ij}$ are shared across time steps (independent of $t$).

No clock needed;

Need large computer resources:

  • Storage;
  • Computer time for simulation;
  • Number of training examples.

Once trained (in unfolded form), the network can perform temporal association tasks.
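A minimal sketch of the unfolding trick for the 2-unit, $T=4$ example: one forward pass through the unfolded network and one backward pass that accumulates the gradient for the single shared weight matrix (squared error at the final step only; inputs, targets, and the error placement are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 2, 4                                   # 2 units unfolded for 4 time steps
w = 0.5 * rng.standard_normal((N, N))          # shared weights w_ij (t-independent)
xi = rng.standard_normal((T, N))               # inputs xi_i(t)
zeta = rng.standard_normal(N)                  # target zeta_i at the final step
g, g_prime = np.tanh, lambda h: 1.0 - np.tanh(h) ** 2
eta = 0.1

# Forward pass through the unfolded (feed-forward) network.
V = np.zeros((T + 1, N))
H = np.zeros((T, N))
for t in range(T):
    H[t] = w @ V[t] + xi[t]
    V[t + 1] = g(H[t])

# Backward pass: propagate deltas through the T copies, accumulating
# the gradient for the single shared weight matrix.
delta = (V[T] - zeta) * g_prime(H[T - 1])      # error at the last layer
grad = np.zeros_like(w)
for t in reversed(range(T)):
    grad += np.outer(delta, V[t])
    if t > 0:
        delta = (w.T @ delta) * g_prime(H[t - 1])

w -= eta * grad                                # one gradient-descent update
```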

Real-Time Recurrent Learning

A learning rule that does not require duplicating the units and can deal with sequences of arbitrary length.


Dynamics:

$$ V_{i}(t) = g[h_{i}(t-1)] = g\left[ \sum_{j}w_{ij}V_{j}(t-1) + \xi_{i}(t-1) \right] $$

$\zeta_{k}(t)$: target output

Error measure:

$$ E_{k}(t) = \begin{cases} \zeta_{k}(t) - V_{k}(t) & \text{if }\zeta_{k}(t)\text{ is defined at time }t\\ 0 & \text{otherwise} \end{cases} $$

Cost function:

$$ E = \sum_{t=0}^{T} E(t) $$

$$ E(t) = \frac{1}{2}\sum_{k}[E_{k}(t)]^{2},\quad t = 0,1,2,\cdots,T $$

Gradient descent:

$$ \Delta w_{pq}(t) = -\eta\frac{\partial E(t)}{\partial w_{pq}} = \eta\sum_{k}E_{k}(t)\frac{\partial V_{k}(t)}{\partial w_{pq}} $$

differentiating dynamics:

$$ \frac{\partial V_{i}(t)}{\partial w_{pq}} = g^{\prime}[h_{i}(t-1)]\left[ \delta_{ip}V_{q}(t-1) + \sum_{j}w_{ij}\frac{\partial V_{j}(t-1)}{\partial w_{pq}} \right] $$

Recall: $$ \frac{\partial V_{i}}{\partial w_{pq}} = g^{\prime}(h_{i})\left( \delta_{ip}V_{q} + \sum_{j}w_{ij}\frac{\partial V_{j}}{\partial w_{pq}} \right) $$

If only a stable attractor is of interest, one may impose $\partial_{w_{pq}}V_{i}(t)=\partial_{w_{pq}}V_{i}(t-1)$ at the fixed point; otherwise start from $\partial_{w_{pq}}V_{i}(0)=0$ and iterate forward.

With $N$ units there are $N^{3}$ derivatives to track ($N$ activations $\times N^{2}$ weights), and each update costs $\mathcal{O}(N^{4})$ time ($N^{3}$ derivatives, each involving a sum over $N$ units).

Real-time recurrent learning: the weights are updated in real time, as the sequence is presented.

This avoids storing the full history of activations (no unfolding in time).
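A minimal sketch of real-time recurrent learning that carries the $N^{3}$ derivatives $\partial V_{i}/\partial w_{pq}$ forward in time and updates the weights at every step (sizes, targets, and the assumption that every unit is an output are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N, T_steps = 3, 20
w = 0.3 * rng.standard_normal((N, N))
xi = rng.standard_normal((T_steps, N))          # input sequence xi_i(t)
zeta = rng.standard_normal((T_steps, N))        # targets (assume all units are outputs)
g, g_prime = np.tanh, lambda h: 1.0 - np.tanh(h) ** 2
eta = 0.05

V = np.zeros(N)
# P[i, p, q] = dV_i / dw_pq ; initialized to zero (N^3 derivatives to track).
P = np.zeros((N, N, N))

for t in range(T_steps):
    h = w @ V + xi[t]
    V_new = g(h)

    # Recursion: dV_i(t)/dw_pq = g'(h_i) [ delta_ip V_q(t-1) + sum_j w_ij dV_j(t-1)/dw_pq ]
    P_new = np.einsum('ij,jpq->ipq', w, P)       # sum_j w_ij * dV_j/dw_pq
    for i in range(N):
        P_new[i, i, :] += V                      # the delta_ip V_q(t-1) term
    P_new *= g_prime(h)[:, None, None]

    # Real-time weight update from the instantaneous error E_k(t) = zeta_k - V_k.
    E = zeta[t] - V_new
    w += eta * np.einsum('k,kpq->pq', E, P_new)

    V, P = V_new, P_new
```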

teacher forcing: always replace $V_{k}(t)$ by $\zeta_{k}(t)$ after $E_{k}(t)$ & derivatives computed.

This speeds up convergence and prevents early prediction errors from accumulating.

However, an attractor created under teacher forcing may disappear when the network runs freely, or even turn into a repellor.

Flip-flop: output a signal whenever pattern A is followed, after an arbitrary interval, by pattern B.

A tapped delay line could never do this.

Time-Dependent Recurrent Back-Propagation

dynamics:

$$ \tau_{i}\frac{\mathrm{d}V_{i}}{\mathrm{d}t} = -V_{i} + g\left( \sum_{j}w_{ij}V_{j} \right) + \xi_{i}(t)\tag{*} $$

Recall: Recurrent Back-Propagation $$ \tau\frac{\mathrm{d}V_{i}}{\mathrm{d}t} = -V_{i} + g\left(\sum_{j}w_{ij}V_{j} + \xi_{i}\right) $$

Error function:

$$ E = \frac{1}{2}\int_{0}^{T}\sum_{k\in O}[V_{k}(t)-\zeta_{k}(t)]^{2}\mathrm{d}t $$

$O$: output units

$\zeta_{k}(t)$: target output

Gradient descent:

  1. Functional derivative:

$$ E_{k}(t) = \frac{\delta E}{\delta V_{k}(t)} = [V_{k}(t) - \zeta_{k}(t)] $$

  2. Gradient:

$$ \frac{\partial E}{\partial w_{ij}} = \frac{1}{\tau_{i}}\int_{0}^{T}Y_{i}g^{\prime}(h_{i})V_{j}\mathrm{d}t $$

One can also compute $\begin{aligned}\frac{\partial E}{\partial \tau_{i}}\end{aligned}$ and optimize by adjusting the time constants $\tau_{i}$ as well.

$\begin{aligned}h_{i}(t) = \sum_{j}w_{ij}V_{j}(t)\end{aligned}$

$Y_{i}(t)$: solution of

$$ \frac{\mathrm{d}Y_{i}}{\mathrm{d}t} = \frac{1}{\tau_{i}}Y_{i} - \sum_{j}\frac{1}{\tau_{j}}w_{ji}g^{\prime}(h_{j})Y_{j} - E_{i}(t),\quad Y_{i}(T) = 0, \forall i\tag{**} $$

recall:

$$ \mathbf{L}_{ij} = \delta_{ij} - g^{\prime}(h_{i})w_{ij}\\ Y_{p} = \sum_{k}E_{k}(\mathbf{L}^{-1})_{kp} $$ error propagation network dynamics: $$ \tau\frac{\mathrm{d}Y_{i}}{\mathrm{d}t} = -Y_{i} + \sum_{p}g^{\prime}(h_{p})w_{pi}Y_{p} + E_{i} $$

Integrating (*) forward from $0$ to $T$ gives $V_{i}(t)$;

integrating (**) backward from $T$ to $0$ gives $Y_{i}(t)$,

so that we get $\begin{aligned}\Delta w_{ij} = -\eta\frac{\partial E}{\partial w_{ij}}\end{aligned}$
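A minimal sketch of this forward-backward scheme with Euler integration (the input, target, choice of output unit, and step sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
N, n_out = 4, 1
w = 0.3 * rng.standard_normal((N, N))
tau = np.ones(N)                                 # time constants tau_i
g, g_prime = np.tanh, lambda h: 1.0 - np.tanh(h) ** 2
dt, T = 0.01, 5.0
steps = int(T / dt)
time = np.arange(steps) * dt
xi = np.zeros((steps, N)); xi[:, 0] = np.sin(time)            # example input xi_i(t)
zeta = np.zeros((steps, n_out)); zeta[:, 0] = np.cos(time)    # target zeta_k(t)
out_idx = np.array([N - 1])                      # which unit is an output (assumption)

# Forward: integrate (*) from 0 to T, storing V(t) and h(t).
V = np.zeros((steps + 1, N))
H = np.zeros((steps, N))
for n in range(steps):
    H[n] = w @ V[n]
    V[n + 1] = V[n] + (dt / tau) * (-V[n] + g(H[n]) + xi[n])

# Backward: integrate (**) from T down to 0 with Y(T) = 0,
# accumulating dE/dw_ij = (1/tau_i) * integral Y_i g'(h_i) V_j dt.
Y = np.zeros(N)
grad = np.zeros_like(w)
for n in reversed(range(steps)):
    E = np.zeros(N)
    E[out_idx] = V[n, out_idx] - zeta[n]         # functional derivative E_k(t)
    gp = g_prime(H[n])
    dY = Y / tau - w.T @ (gp * Y / tau) - E      # right-hand side of equation (**)
    grad += np.outer(Y * gp / tau, V[n]) * dt
    Y = Y - dt * dY                              # step backward in time

eta = 0.1
w -= eta * grad                                  # Delta w_ij = -eta dE/dw_ij
```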


Large time and storage requirements, for $N$ fully recurrent units and $K$ time steps ($0\to T$):

  • time per forward-backward pass $\sim N^{2}K$ (Real-Time Recurrent Learning: $\sim N^{4}K$)
  • memory $\sim aNK+bN^{2}$ (RTRL: $\sim N^{3}$)

7.4 Reinforcement Learning

No detailed target values. Only “right” or “wrong”. Learn with a critic, not a teacher.

Introduce randomness (via stochastic units) to explore for correct outputs.

Reinforcement learning problems:

  • I. The reinforcement signal is the same for a given input-output pair (a fixed I/O mapping), with no reference to previous outputs.

  • II. Stochastic environment: for a given input-output pair the reinforcement is drawn from a fixed probability distribution, independent of history.

Two-armed bandit problem: two slot machines with unknown payoff probabilities; find the one with the higher average payoff. Method: stochastic learning automata.

  • III. The reinforcement and the input patterns may depend arbitrarily on the output history. Game theory: the “environment” consists of the other players.

[Example] Chess: win or lose is known only after a long sequence of moves. Credit assignment problem: how to judge each move’s contribution to the victory or defeat?

Associative Reward-Penalty ($A_{\text{RP}}$)

$S_{i}=\pm 1$: output units

Dynamical rule:

$$ P(S_{i} = \pm 1) = g(\pm h_{i}) = f_{\beta}(\pm h_{i}) = \frac{1}{1+\exp{(\mp 2\beta h_{i})}} $$

$$ h_{i} = \sum_{j}w_{ij}V_{j} $$

$V_{j}$: activations of hidden/input units.

Reinforcement signal $r=\pm 1$ (reward/penalty).

Target patterns:

$$ \zeta_{i}^{\mu} = \begin{cases} +S_{i}^{\mu} & \text{if }r^{\mu} = +1\\ -S_{i}^{\mu} & \text{if }r^{\mu} = -1 \end{cases} $$

Recall: Stochastic Units

$$ \begin{aligned} P(S_{i}=\pm 1) &= f_{\beta}(\pm h_{i}) = \frac{1}{1+\exp{(\mp 2\beta h_{i})}},\quad h_{i}^{\mu} = \sum_{k}w_{ik}\xi_{k}^{\mu}\\ \langle S_{i}^{\mu}\rangle &= (+1)g(h_{i}^{\mu}) + (-1)[1-g(h_{i}^{\mu})] = \tanh{\left(\beta h_{i}^{\mu}\right)}\\ \Delta w_{ik} &= \eta\delta_{i}^{\mu}\xi_{k}^{\mu}, \quad \delta_{i}^{\mu} = \zeta_{i}^{\mu} - \langle S_{i}^{\mu}\rangle \end{aligned} $$

Weight update rule

$$ \Delta w_{ij} = \eta(r^{\mu})\delta_{i}^{\mu}V_{j}^{\mu} $$

Multiply by $g^{\prime}(h_{i}^{\mu})$ if the aim is to minimize $\begin{aligned}\sum_{i}(\delta_{i}^{\mu})^{2}\end{aligned}$ (not needed for the entropic cost function).

$\eta(+1)$ is typically 10-100 times larger than $\eta(-1)$.

So the learning rule

$$ \Delta w_{ij} = \begin{cases} \eta^{+}[+S_{i}^{\mu}-\langle S_{i}^{\mu}\rangle]V_{j}^{\mu} & \text{if }r^{\mu} = +1\\ \eta^{-}[-S_{i}^{\mu}-\langle S_{i}^{\mu}\rangle]V_{j}^{\mu} & \text{if }r^{\mu} = -1 \end{cases} $$

$\eta^{\pm} = \eta(\pm 1)$.
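A minimal sketch of the $A_{\text{RP}}$ rule on a toy class-I environment (the environment, sizes, and learning rates are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_out = 5, 2
w = 0.1 * rng.standard_normal((n_out, n_in))
beta = 1.0
eta_plus, eta_minus = 0.1, 0.005      # eta(+1) much larger than eta(-1)

def arp_step(xi, reward_fn):
    """One associative reward-penalty update for input pattern xi."""
    h = w @ xi
    p_plus = 1.0 / (1.0 + np.exp(-2 * beta * h))          # P(S_i = +1)
    S = np.where(rng.random(n_out) < p_plus, 1.0, -1.0)   # stochastic outputs
    S_mean = np.tanh(beta * h)                             # <S_i>
    r = reward_fn(xi, S)                                   # environment: r = +/-1
    if r == +1:
        dw = eta_plus * np.outer(+S - S_mean, xi)          # reward: move toward S
    else:
        dw = eta_minus * np.outer(-S - S_mean, xi)         # penalty: move toward -S
    return dw, r

# Toy class-I environment (an assumption): reward if output 0 matches the sign of xi[0].
def toy_reward(xi, S):
    return +1 if S[0] == np.sign(xi[0]) else -1

for _ in range(1000):
    xi = rng.choice([-1.0, 1.0], size=n_in)
    dw, r = arp_step(xi, toy_reward)
    w += dw
```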

Slow compared to supervised learning; the speed depends on the size of the output space and on the fraction of it that is correct.

[Example] If there is only one correct output for each input, the search can be very long.

Variations on $A_{\text{RP}}$:

  • 0/1 units. $-S_{i}^{\mu}\to 1-S_{i}^{\mu}$, $\langle S_{i}^{\mu}\rangle\to p_{i}^{\mu}$.
  • Continuous-valued reward $r\in [0,1]$, ranging from terrible to excellent; the update rule becomes:

$$ \Delta w_{ij} = \eta(r^{\mu})\{ r^{\mu}[S_{i}^{\mu} - \langle S_{i}^{\mu}\rangle] + (1-r^{\mu})[-S_{i}^{\mu} - \langle S_{i}^{\mu}\rangle] \}V_{j}^{\mu} $$

  • for penalty case, use $\langle S_{i}^{\mu}\rangle - S_{i}^{\mu}$ instead of $-S_{i}^{\mu} - \langle S_{i}^{\mu}\rangle$:

$$ \Delta w_{ij} = \eta r^{\mu}[S_{i}^{\mu}-\langle S_{i}^{\mu}\rangle]V_{j}^{\mu} $$

ignore dependence of $\eta$ on $r$.

For $r=-1$, the standard form says "do the opposite of what you are doing now", while this variation says "do not do what you are doing now".

standard form works better for larger weight changes.

this variation is more amenable to theoretical analysis.

  • To speed learning up: present each pattern $\mu$ several times before moving on to the next, or use batch mode, accumulating all the $\Delta w_{ij}$ before updating.

Theory of Associative Reward-Penalty

Convergence: proved only in special cases.

Usual approach: Cost function $\to \epsilon$

Not so satisfactory for the stochastic process.

Cost function: $-\langle r\rangle$

$E\downarrow \Rightarrow \langle r\rangle\uparrow$


$V_{j}\to \xi_{j}$ (no hidden units). Deterministic environment (class I), so $r = r(\mathbf{S},\mathbf{\xi})$.

$\mathbf{S}$: output pattern; $\mathbf{\xi}$: input pattern.

$$ \langle r^{\mu}\rangle = \sum_{\mathbf{S}}P(\mathbf{S}|\mathbf{w},\xi^{\mu})r(\mathbf{S},\xi^{\mu}) $$

$P(\mathbf{S}|\mathbf{w},\mathbf{\xi}^{\mu})$: prob of $\mathbf{S}$ if $\mathbf{w}$ and $\mathbf{\xi}^{\mu}$.

$$ P(\mathbf{S}|\mathbf{w},\mathbf{\xi}^{\mu}) = \prod_{k}\begin{cases} g(h_{k}^{\mu}) & \text{if }S_{k} = +1\\ 1-g(h_{k}^{\mu}) & \text{if }S_{k} = -1 \end{cases},\quad h_{k}^{\mu} = \sum_{j}w_{kj}\xi_{j}^{\mu} $$

Gradient:

  • differentiating $P$:

$$ \frac{\partial P(\mathbf{S}|\mathbf{w},\mathbf{\xi}^{\mu})}{\partial w_{ij}} = \left(\prod_{k\neq i}\begin{cases} g(h_{k}^{\mu}) & \text{if }S_{k} = +1\\ 1-g(h_{k}^{\mu}) & \text{if }S_{k} = -1 \end{cases}\right)\begin{cases} g^{\prime}(h_{i}^{\mu})V_{j}^{\mu} & \text{if }S_{i} = +1\\ -g^{\prime}(h_{i}^{\mu})V_{j}^{\mu} & \text{if }S_{i} = -1 \end{cases}\\ = P(\mathbf{S}|\mathbf{w},\mathbf{\xi}^{\mu})\left(\begin{cases} g^{\prime}(h_{i}^{\mu})/g(h_{i}^{\mu}) & \text{if }S_{i} = +1\\ -g^{\prime}(h_{i}^{\mu})/[1-g(h_{i}^{\mu})] & \text{if }S_{i} = -1 \end{cases}\right)V_{j}^{\mu} $$

$g(h) = \frac{1}{1+e^{-2\beta h}}$, $g^{\prime}(h) = 2\beta g(h)[1-g(h)]$

$\langle S\rangle = \tanh{(\beta h)} = \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}$

$\Rightarrow g(h) = \frac{1}{2}(\langle S\rangle + 1)$

Simplify:

$$ \begin{cases} g^{\prime}(h_{i}^{\mu})/g(h_{i}^{\mu}) & \text{if }S_{i} = +1\\ -g^{\prime}(h_{i}^{\mu})/[1-g(h_{i}^{\mu})] & \text{if }S_{i} = -1 \end{cases} = \beta[S_{i}^{\mu} - \langle S_{i}^{\mu}\rangle] $$

  • differentiating $\langle r^{\mu}\rangle$:

$$ \begin{aligned} \frac{\partial\langle r\rangle^{\mu}}{\partial w_{ij}} &= \beta\sum_{\mathbf{S}}P(\mathbf{S}|\mathbf{w},\mathbf{\xi}^{\mu})r(\mathbf{S},\mathbf{\xi}^{\mu})[S_{i}^{\mu}-\langle S_{i}^{\mu}\rangle]V_{j}^{\mu}\\ &= \beta\bigg\langle r(\mathbf{S},\mathbf{\xi}^{\mu})[S_{i}^{\mu}-\langle S_{i}^{\mu}\rangle]\bigg\rangle V_{j}^{\mu} \end{aligned} $$

$\bigg\langle\cdots\bigg\rangle$: over all output patterns $\mathbf{S}$.

When $\partial_{\mathbf{S}}r(\mathbf{S},\mathbf{\xi}^{\mu})=0$ we have $\partial_{w_{ij}}\langle r\rangle^{\mu} = 0$; i.e., the gradient vanishes once the network has reached a correct solution that almost always receives $r^{\mu} = +1$.

Update rule:

$$ \begin{aligned} \langle \Delta w_{ij}\rangle^{\mu} = \frac{\eta}{\beta}\frac{\partial \langle r\rangle^{\mu}}{\partial w_{ij}}\\ \overset{\text{average over all }\mu}{\Rightarrow} \langle \Delta w_{ij}\rangle = \frac{\eta}{\beta}\frac{\partial \langle r\rangle}{\partial w_{ij}} \end{aligned} $$

The weights stop changing only when the average reinforcement signal is at its maximum.

$A_{\text{RP}}\overset{\eta^{-}\to 0}{\rightarrow} A_{\text{RI}}$ (associative reward-inaction rule)

Models & Critics

Model the environment by an auxiliary network, which provides a target for each output unit (instead of a single global $r$).

The model has an evaluation unit: a continuous-valued output unit $R$. If $R\approx r$, the model is good.

First train the model, then train the main network so as to maximize $R$.

This is implemented with back-propagation: propagating $\delta = -R$ back through the model yields an individual error signal for each unit of the main network, so ordinary supervised learning can be used from there on.

One minimizes $(R-r)^{2}$ while maximizing $r$; the best strategy is to train the model network first.

For class II problems:

  • the model averages over the fluctuations of the environment;
  • it keeps the main network from chasing meaningless fluctuations of $r$;
  • the smooth $R$ is easier to maximize.

For class III problems:

  • the critic should estimate a weighted average of the future reinforcement, i.e., a cumulative rather than an instantaneous value (sacrificing short-term reward for long-term gain).

Critic: the raw reinforcement $r$ is converted by the critic into a processed signal $\rho$.

Reinforcement comparison: the critic generates a prediction $R$ of $r$, and one can use $\rho = r-R$ ($\rho>0$: reward; $\rho<0$: penalty).