Paper Reading: Image-to-Markup Generation with Coarse-to-Fine Attention

Motivation

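Mathematical formulas carry structural information, so recognizing them differs from ordinary OCR. The task is, however, similar to image captioning. The authors therefore build on the approach of [1] and add a row encoder.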

Method

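Encoder
A VGG network produces the feature map $\hat V$, which is then passed through a row encoder: each row of the CNN feature map is fed into a bidirectional LSTM, $V_{hw} = RNN(V_{h,w-1}, \hat V_{hw})$. The result is the row-encoded feature grid $V$, which has the same spatial size as its input $\hat V$.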
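
The row encoder can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the paper's implementation: a plain tanh RNN stands in for the LSTM, the weights are random, and the names `rnn_pass`, `row_encode`, `fwd_params`, and `bwd_params` are hypothetical. It shows that each row is encoded independently, in both directions, and that the output keeps the $H \times W$ grid of $\hat V$.

```python
import numpy as np

def rnn_pass(row, W_in, W_rec, reverse=False):
    """Run a plain tanh RNN along one row of features: (W, C) -> (W, D)."""
    n = len(row)
    steps = range(n - 1, -1, -1) if reverse else range(n)
    state = np.zeros(W_rec.shape[0])
    out = np.zeros((n, W_rec.shape[0]))
    for w in steps:
        # V_{h,w} = RNN(V_{h,w-1}, V_hat_{h,w}), here with a simple tanh cell
        state = np.tanh(W_in @ row[w] + W_rec @ state)
        out[w] = state
    return out

def row_encode(V_hat, fwd_params, bwd_params):
    """Encode every row of the (H, W, C) feature grid in both directions."""
    rows = []
    for row in V_hat:
        fwd = rnn_pass(row, *fwd_params)
        bwd = rnn_pass(row, *bwd_params, reverse=True)
        rows.append(np.concatenate([fwd, bwd], axis=-1))
    return np.stack(rows)  # shape (H, W, 2 * D)

rng = np.random.default_rng(0)
H, W, C, D = 4, 6, 8, 5  # toy sizes, chosen only for the example

def make_params():
    return rng.normal(size=(D, C)) * 0.1, rng.normal(size=(D, D)) * 0.1

fwd_params, bwd_params = make_params(), make_params()
V_hat = rng.normal(size=(H, W, C))   # stand-in for the CNN output
V = row_encode(V_hat, fwd_params, bwd_params)
print(V.shape)  # (4, 6, 10)
```

The spatial grid is preserved; only the channel dimension changes, because the forward and backward states are concatenated.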

Decoder
$p(y_{t+1} \mid y_{1},\dots,y_{t},V)=\mathrm{softmax}(W^{out}o_{t})$
$o_{t}=\tanh(W^{c}[h_{t};c_{t}])$
$h_{t}=RNN(h_{t-1},[y_{t-1};o_{t-1}])$
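
One decoding step following the three equations above might look like the sketch below. It is a hedged illustration rather than the authors' code: a tanh cell replaces the LSTM, all weights are random, the context vector $c_t$ comes from a placeholder `context_fn` (uniform attention here), and the names `decoder_step`, `W_h`, `U_h`, and `y_prev_emb` are assumptions introduced for the example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(h_prev, y_prev_emb, o_prev, context_fn, params):
    """Compute h_t, o_t and the next-token distribution for one time step."""
    W_h, U_h, W_c, W_out = params
    # h_t = RNN(h_{t-1}, [y_{t-1}; o_{t-1}]); a tanh cell stands in for the LSTM
    h_t = np.tanh(W_h @ np.concatenate([y_prev_emb, o_prev]) + U_h @ h_prev)
    c_t = context_fn(h_t)                               # context vector c_t
    o_t = np.tanh(W_c @ np.concatenate([h_t, c_t]))     # o_t = tanh(W^c [h_t; c_t])
    probs = softmax(W_out @ o_t)                        # p(y_{t+1} | y_1..y_t, V)
    return h_t, o_t, probs

rng = np.random.default_rng(0)
E, Do, Dh, Dv, vocab = 4, 6, 5, 7, 10   # toy sizes for the example
params = (
    rng.normal(size=(Dh, E + Do)) * 0.1,   # W_h
    rng.normal(size=(Dh, Dh)) * 0.1,       # U_h
    rng.normal(size=(Do, Dh + Dv)) * 0.1,  # W_c
    rng.normal(size=(vocab, Do)) * 0.1,    # W_out
)
V = rng.normal(size=(3, 4, Dv))            # stand-in for the row-encoded grid

def uniform_context(h_t):
    # Placeholder: uniform attention over all cells instead of learned attention
    return V.mean(axis=(0, 1))

h_t, o_t, probs = decoder_step(
    np.zeros(Dh), rng.normal(size=E), np.zeros(Do), uniform_context, params
)
print(probs.shape, round(float(probs.sum()), 6))
```

Feeding $o_{t-1}$ back into the recurrent input lets the decoder take its previous attention-informed output into account when computing $h_t$.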

$c_{t}= \sum_{h,w}p(z_{t}=(h,w))V_{hw}$
$p(z_{t}) = \mathrm{softmax}(a(h_{t},\{V_{hw}\}))$
$a_{t,h,w}=\beta^{T}\tanh(W_{1}h_{t}+W_{2}V_{hw})$
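
The attention equations above can be sketched in a few lines of NumPy. This is a minimal illustration with random weights and toy dimensions; the function name `attention_step` and the sizes are assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(h_t, V, W1, W2, beta):
    """Return the attention weights p(z_t) over the grid and the context c_t."""
    # a_{t,h,w} = beta^T tanh(W1 h_t + W2 V_hw), evaluated for every cell (h, w)
    scores = np.tanh(W1 @ h_t + V @ W2.T) @ beta             # shape (H, W)
    # p(z_t) = softmax over all grid cells jointly
    probs = softmax(scores.ravel()).reshape(scores.shape)
    # c_t = sum_{h,w} p(z_t = (h, w)) V_hw
    context = np.tensordot(probs, V, axes=([0, 1], [0, 1]))  # shape (Dv,)
    return probs, context

rng = np.random.default_rng(0)
H, W, Dv, Dh, A = 3, 4, 7, 5, 6     # toy sizes for the example
V = rng.normal(size=(H, W, Dv))     # stand-in for the row-encoded features
h_t = rng.normal(size=Dh)           # stand-in for the decoder state
W1 = rng.normal(size=(A, Dh))
W2 = rng.normal(size=(A, Dv))
beta = rng.normal(size=A)

probs, c_t = attention_step(h_t, V, W1, W2, beta)
print(probs.shape, c_t.shape)  # (3, 4) (7,)
```

The weights form a single distribution over the whole $H \times W$ grid, so $c_t$ is a convex combination of the feature vectors $V_{hw}$.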

The decoder is an LSTM combined with attention over the row-encoded features $V$.

Experimental results

\[1\] Xu K, Ba J, Kiros R, et al. Show, attend and tell: Neural image caption generation with visual attention[C]//International conference on machine learning. 2015: 2048-2057.