Understanding LSTM Networks

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

The repeating module in a standard RNN contains a single layer.
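
To make this concrete, here is a minimal NumPy sketch of one step of such a module; the names (rnn_step, W, b) are illustrative, not taken from any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One repeating module of a standard RNN: a single tanh layer.

    x_t    : input vector at time t, shape (input_size,)
    h_prev : hidden state from the previous step, shape (hidden_size,)
    W      : weight matrix, shape (hidden_size, hidden_size + input_size)
    b      : bias vector, shape (hidden_size,)
    """
    concat = np.concatenate([h_prev, x_t])   # the merged line in the diagram
    return np.tanh(W @ concat + b)           # the single tanh layer
```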

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

An LSTM neural network.

The repeating module in an LSTM contains four interacting layers.

Don’t worry about the details of what’s going on. We’ll walk through the LSTM diagram step by step later. For now, let’s just try to get comfortable with the notation we’ll be using.

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denote its content being copied and the copies going to different locations.

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!”

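As a concrete illustration, here is a minimal NumPy sketch of a gate, assuming it looks at the previous hidden state and the current input; the names (gate, W_g, b_g) are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(x_t, h_prev, W_g, b_g, candidate):
    """A gate: a sigmoid layer plus a pointwise multiplication.

    Each sigmoid output lies between 0 and 1: 0 lets nothing through,
    1 lets everything through.
    """
    g = sigmoid(W_g @ np.concatenate([h_prev, x_t]) + b_g)
    return g * candidate   # pointwise: pass each component in proportion to g
```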

An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents “completely keep this,” while a 0 represents “completely get rid of this.”

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
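
A minimal sketch of the forget gate layer, with illustrative weight names W_f and b_f:

```python
import numpy as np

def forget_gate(x_t, h_prev, W_f, b_f):
    """Forget gate layer: a number between 0 and 1 per entry of C_{t-1}.

    A value near 1 means "completely keep this"; a value near 0 means
    "completely get rid of this".
    """
    z = W_f @ np.concatenate([h_prev, x_t]) + b_f
    return 1.0 / (1.0 + np.exp(-z))   # f_t
```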

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.
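
A minimal sketch of both parts of this step, with illustrative weight names (W_i, b_i for the input gate layer, W_C, b_C for the tanh layer):

```python
import numpy as np

def input_gate_and_candidates(x_t, h_prev, W_i, b_i, W_C, b_C):
    """The input gate layer i_t and the tanh layer's candidate values C̃_t."""
    concat = np.concatenate([h_prev, x_t])
    i_t = 1.0 / (1.0 + np.exp(-(W_i @ concat + b_i)))  # which values to update
    C_tilde = np.tanh(W_C @ concat + b_C)              # candidate values C̃_t
    return i_t, C_tilde
```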

It’s now time to update the old cell state, C_{t-1}, into the new cell state C_t. The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t ∗ C̃_t. This is the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.
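
In code, the update itself is just two pointwise operations on the quantities from the previous sketches (names illustrative):

```python
def update_cell_state(C_prev, f_t, i_t, C_tilde):
    """C_t = f_t * C_{t-1} + i_t * C̃_t  (all operations are pointwise)."""
    return f_t * C_prev + i_t * C_tilde
```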

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
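
Putting the pieces together, here is a minimal sketch of one full LSTM step under the same illustrative naming as the snippets above; it mirrors the walkthrough rather than any particular library implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM step: returns the new hidden state h_t and cell state C_t."""
    concat = np.concatenate([h_prev, x_t])

    f_t = sigmoid(W_f @ concat + b_f)          # forget gate
    i_t = sigmoid(W_i @ concat + b_i)          # input gate
    C_tilde = np.tanh(W_C @ concat + b_C)      # candidate values
    C_t = f_t * C_prev + i_t * C_tilde         # new cell state

    o_t = sigmoid(W_o @ concat + b_o)          # output gate
    h_t = o_t * np.tanh(C_t)                   # filtered cell state
    return h_t, C_t
```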

Variants on Long Short Term Memory

What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state.

The above diagram adds peepholes to all the gates, but many papers will give some peepholes and not others.
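
A minimal sketch of just the forget gate with a peephole, keeping the illustrative naming from before; here the cell state is simply concatenated into what the gate looks at, whereas some papers use separate peephole weights instead.

```python
import numpy as np

def peephole_forget_gate(x_t, h_prev, C_prev, W_f, b_f):
    """Forget gate with a peephole: the gate also 'looks at' the cell state."""
    z = W_f @ np.concatenate([C_prev, h_prev, x_t]) + b_f
    return 1.0 / (1.0 + np.exp(-z))
```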

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what we should add new information to, we make those decisions together. We only forget when we’re going to input something in its place. We only input new values to the state when we forget something older.
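
In terms of the update sketched earlier, the coupling amounts to replacing the input gate with 1 − f_t (names illustrative):

```python
def coupled_update(C_prev, f_t, C_tilde):
    """Coupled forget/input gates: only input new values where we forget,
    i.e. i_t = 1 - f_t, so C_t = f_t * C_{t-1} + (1 - f_t) * C̃_t."""
    return f_t * C_prev + (1.0 - f_t) * C_tilde
```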

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

A gated recurrent unit neural network.
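
A minimal sketch of one GRU step, with illustrative weight names and biases omitted for brevity; note the single hidden state h_t in place of a separate cell state, and the update gate z_t playing the combined forget/input role.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step: returns the new hidden state h_t."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                                   # update gate
    r_t = sigmoid(W_r @ concat)                                   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # new h_t
```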

These are only a few of the most notable LSTM variants. There are lots of others, like Depth Gated RNNs by Yao, et al. (2015). There are also completely different approaches to tackling long-term dependencies, like Clockwork RNNs by Koutnik, et al. (2014).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) do a nice comparison of popular variants, finding that they’re all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It’s natural to wonder: is there another big step? A common opinion among researchers is: “Yes! There is a next step and it’s attention!” The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this – it might be a fun starting point if you want to explore attention! There’s been a number of really exciting results using attention, and it seems like a lot more are around the corner…

Attention isn’t the only exciting thread in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models – such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) – also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!

 
