RECURRENT NEURAL NETWORKS TUTORIAL, PART 2 – IMPLEMENTING A RNN WITH PYTHON, NUMPY AND THEANO

In this post we will implement a full recurrent neural network in Python and NumPy, and then optimize it with Theano.

## Language Modeling

Our goal is to build a language model using an RNN. Here is what that means: say we have a sentence of $$m$$ words. A language model allows us to assign a probability to that sentence, i.e. to score how likely it is to occur:

$$P(w_1,\cdots,w_m) = \prod_{i=1}^m P(w_i \mid w_1,\cdots,w_{i-1})$$

The probability of each word is conditioned on all of the words that come before it.

What is such a model good for?

  • Scoring candidate sentences, for example in machine translation or speech recognition
  • Generating new sentences by sampling words according to their predicted probabilities

Note that in the formula above the probability of each word is conditioned on all previous words. In practice, conditioning on the full history is prohibitively expensive to compute and store, which is why most models only look at the last 2 to 4 words of context.
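
To make the factorization above concrete, here is a toy sketch (not part of the original post) that scores a sentence using a truncated context window. The cond_prob argument is a hypothetical stand-in for whatever model supplies $$P(w_i \mid \text{context})$$; below we plug in a uniform dummy.

import numpy as np

def sentence_log_prob(words, cond_prob, context_size=3):
    # Sum log P(w_i | previous words), keeping at most `context_size` words of context.
    log_p = 0.0
    for i, w in enumerate(words):
        context = words[max(0, i - context_size):i]
        log_p += np.log(cond_prob(w, context))
    return log_p

# A dummy model that assigns every word the same probability (1/8000).
uniform_model = lambda word, context: 1.0 / 8000
print sentence_log_prob(["SENTENCE_START", "what", "are", "you", "doing"], uniform_model)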

## Training Data and Preprocessing

1. Tokenization

The raw text first has to be tokenized: we split the text into sentences and the sentences into words, and we also need to deal with punctuation. We will use NLTK's sent_tokenize and word_tokenize methods for this.
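
For instance, a minimal sketch of how these two NLTK helpers are used (this assumes the punkt tokenizer models have already been downloaded via nltk.download):

import nltk
# nltk.download('punkt')  # only needed once, for the sentence tokenizer

text = "I like cats. Do you?"
for sent in nltk.sent_tokenize(text.lower()):
    print nltk.word_tokenize(sent)
# roughly: ['i', 'like', 'cats', '.'] and then ['do', 'you', '?']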

2. Removing infrequent words

Removing infrequent words helps both training and prediction. Here we limit the vocabulary to the vocabulary_size = 8000 most common words; every word that does not make the cut is replaced by UNKNOWN_TOKEN in the input. When the model outputs UNKNOWN_TOKEN we can, for example, replace it with a randomly chosen word outside the vocabulary, or simply keep sampling until we get a word that is not UNKNOWN_TOKEN.

3. Prepending special start and end tokens

To let the model learn which words tend to start and end a sentence, we prepend SENTENCE_START to every sentence and append SENTENCE_END at the end.

4. Building the training data matrices

The input and output of the RNN are vectors, not strings, so we map each word to an index using index_to_word and word_to_index. A training example $$x$$ might look like [0, 179, 341, 416] (each element conceptually stands for a one-hot vector of length vocabulary_size, so $$x$$ is really a matrix), and its label $$y$$ would be [179, 341, 416, 1]. Remember that our goal is to predict the next word, so $$y$$ is simply $$x$$ shifted by one position, with the final predicted word appended; SENTENCE_START and SENTENCE_END map to indices 0 and 1 respectively.

import csv
import itertools
import operator
import numpy as np
import nltk

vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
print "Reading CSV file..."
with open('data/reddit-comments-2015-08.csv', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print "Parsed %d sentences." % (len(sentences))

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print "Found %d unique words tokens." % len(word_freq.items())

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print "Using vocabulary size %d." % vocabulary_size
print "The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1])

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

print "\nExample sentence: '%s'" % sentences[0]
print "\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0]

# Create the training data
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences])

Here is an actual training example from our data:

x:
SENTENCE_START what are n't you understanding about this ? !
[0, 51, 27, 16, 10, 856, 53, 25, 34, 69]

y:
what are n't you understanding about this ? ! SENTENCE_END
[51, 27, 16, 10, 856, 53, 25, 34, 69, 1]

With the data in place, we can start building the RNN.

## Building the RNN

Let's recap what our RNN looks like. The input $$x$$ is a matrix representing one sentence; the input at each time step, $$x_t$$, is a vector representing a single word, and the output at each time step, $$o_t$$, is also a vector whose elements are the predicted probabilities of every word in the vocabulary.

The RNN equations:

$$
s_t = \tanh(Ux_t + Ws_{t-1}) \\
o_t = \mathrm{softmax}(Vs_t)
$$

Let's look at the dimensions. Say we pick a vocabulary size of $$C = 8000$$ and a hidden layer size of $$H = 100$$. You can think of the hidden layer size as the memory of the network: a larger hidden layer lets it capture more, but also makes it more expensive to compute. The dimensions of our variables are then:

$$
x_t \in \Bbb{R}^{8000} \quad o_t \in \Bbb{R}^{8000} \quad s_t \in \Bbb{R}^{100} \quad U \in \Bbb{R}^{100 \times 8000} \quad V \in \Bbb{R}^{8000 \times 100} \quad W \in \Bbb{R}^{100 \times 100}
$$

Here $$U, V, W$$ are the parameters of the network, so according to the equations above we have $$2HC + H^2$$ parameters to learn. Because $$x_t$$ is a sparse one-hot vector, multiplying it with $$U$$ amounts to selecting a single column of $$U$$; $$W$$ and $$s_t$$ are comparatively small, so the most expensive operation is computing $$Vs_t$$.
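
Plugging in $$H = 100$$ and $$C = 8000$$ as a quick sanity check:

$$
2HC + H^2 = 2 \cdot 100 \cdot 8000 + 100^2 = 1{,}610{,}000
$$

so the model has roughly 1.6 million parameters to learn.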

### Initialization

We start with a first implementation in NumPy. Initializing $$U, V, W$$ is a bit tricky; a common recommendation is to draw the initial values uniformly from $$[-\frac{1}{\sqrt n},\frac{1}{\sqrt n}]$$, where $$n$$ is the number of incoming connections from the previous layer.

class RNNNumpy:

    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))

Above, word_dim is the size of our vocabulary and hidden_dim is the size of the hidden layer.

### Forward Propagation

Next we implement forward propagation, i.e. predicting the word probabilities:

def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    # During forward propagation we save all hidden states in s because we need them later.
    # We add one additional element for the initial hidden state, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    # For each time step...
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]

RNNNumpy.forward_propagation = forward_propagation
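
The softmax helper used above is not defined in this snippet; in the code accompanying the tutorial it lives in a small utility module. A minimal, numerically stable version you could drop in (my sketch, not necessarily identical to the original helper):

def softmax(x):
    # Shift by the maximum for numerical stability, then normalize the exponentials.
    xt = np.exp(x - np.max(x))
    return xt / np.sum(xt)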

We return both the outputs and the hidden states; the hidden states are needed later to compute the gradients. Each $$o_t$$ is a vector of probabilities over the vocabulary, but sometimes we just want the single most likely word. That is what predict does:

def predict(self, x):
    # Perform forward propagation and return index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)

RNNNumpy.predict = predict

Let's try it out:

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
o, s = model.forward_propagation(X_train[10])
print o.shape
print o
(45, 8000)
[[ 0.00012408 0.0001244 0.00012603 ..., 0.00012515 0.00012488
0.00012508]
[ 0.00012536 0.00012582 0.00012436 ..., 0.00012482 0.00012456
0.00012451]
[ 0.00012387 0.0001252 0.00012474 ..., 0.00012559 0.00012588
0.00012551]
...,
[ 0.00012414 0.00012455 0.0001252 ..., 0.00012487 0.00012494
0.0001263 ]
[ 0.0001252 0.00012393 0.00012509 ..., 0.00012407 0.00012578
0.00012502]
[ 0.00012472 0.0001253 0.00012487 ..., 0.00012463 0.00012536
0.00012665]]

For each word in the sentence above (45 of them), the model outputs 8000 probability values, one per word in the vocabulary. Because the parameters are still at their random initial values, these predictions are random as well. The following gives the index of the highest-probability prediction at each step:

predictions = model.predict(X_train[10])
print predictions.shape
print predictions
(45,)
[1284 5221 7653 7430 1013 3562 7366 4860 2212 6601 7299 4556 2481 238 2539
21 6548 261 1780 2005 1810 5376 4146 477 7051 4832 4991 897 3485 21
7291 2007 6006 760 4864 2182 6569 2800 2752 6821 4437 7021 7875 6912 3575]

Next we need a way to measure the loss.

### Calculating the Loss

We use the cross-entropy loss as our loss function. If we have $$N$$ training examples (words in our text) and $$C$$ classes (the size of the vocabulary), then the loss with respect to our predictions $$o$$ and the true labels $$y$$ is:

$$
L(y,o) = - \frac{1}{N} \sum_{n \in N} y_n \log o_n
$$

The loss simply measures how far our predictions $$o$$ are from the correct words $$y$$. We compute it as follows:

def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples
    N = np.sum([len(y_i) for y_i in y])
    return self.calculate_total_loss(x, y) / N

RNNNumpy.calculate_total_loss = calculate_total_loss
RNNNumpy.calculate_loss = calculate_loss

Let's sanity-check this. With a vocabulary of size $$C$$, each word is initially predicted with probability roughly $$\frac{1}{C}$$, which gives an expected loss of $$L = - \frac{1}{N} N \log \frac{1}{C} = \log C$$:

# Limit to 1000 examples to save time
print "Expected Loss for random predictions: %f" % np.log(vocabulary_size)
print "Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000])

Pretty close. Keep in mind that evaluating the loss on the full dataset is expensive and can take a very long time if the data is large.

### Training the RNN with SGD and Backpropagation Through Time (BPTT)

Given a training example $$(x,y)$$, we need the gradients $$\frac{\partial L}{\partial {U}},\frac{\partial L}{\partial {V}},\frac{\partial L}{\partial {W}}$$. BPTT is implemented as follows:

def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]

RNNNumpy.bptt = bptt

Next we verify that these gradients are correct.

### Gradient Checking

Whenever you implement backpropagation it is good practice to also implement gradient checking, which approximates the gradient numerically:

$$
\frac{\partial L}{\partial{\theta}} \approx \lim_{h \rightarrow0} \frac{J(\theta + h)-J(\theta - h)}{2h}
$$

We compare this numerical estimate with the gradient computed by backpropagation; the two should be nearly equal. Because the approximation has to be evaluated for every single parameter, gradient checking is itself very expensive, so it is best to run it on a model with a small vocabulary.

def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    # Calculate the gradients using backpropagation. We want to check if these are correct.
    bptt_gradients = self.bptt(x, y)
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        # Get the actual parameter value from the model, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print "Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape))
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            # Save the original value so we can reset it later
            original_value = parameter[ix]
            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = self.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = self.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)
            # Reset parameter to original value
            parameter[ix] = original_value
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
            # Calculate the relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient)/(np.abs(backprop_gradient) + np.abs(estimated_gradient))
            # If the error is too large fail the gradient check
            if relative_error > error_threshold:
                print "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix)
                print "+h Loss: %f" % gradplus
                print "-h Loss: %f" % gradminus
                print "Estimated_gradient: %f" % estimated_gradient
                print "Backpropagation gradient: %f" % backprop_gradient
                print "Relative Error: %f" % relative_error
                return
            it.iternext()
        print "Gradient check for parameter %s passed." % (pname)

RNNNumpy.gradient_check = gradient_check

# To avoid performing millions of expensive calculations we use a smaller vocabulary size for checking.
grad_check_vocab_size = 100
np.random.seed(10)
model = RNNNumpy(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0,1,2,3], [1,2,3,4])

Now we can implement SGD.

### Implementing SGD

We implement it in two steps:

  • sgd_step, which computes the gradients and performs the parameter updates for a single training example
  • An outer loop that iterates over the whole training set and adjusts the learning rate
def numpy_sdg_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNNNumpy.sgd_step = numpy_sdg_step
import sys
from datetime import datetime

# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    for epoch in range(nepoch):
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print "%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss)
            # Adjust the learning rate if loss increases
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5
                print "Setting learning rate to %f" % learning_rate
            sys.stdout.flush()
        # For each training example...
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1

Done. Let's get an idea of how long a single SGD step takes:

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)

On my machine one SGD step takes about 180 ms, which means training on the full dataset would take days, if not longer. There are many ways to speed this up, from optimizing the code to changing the model; here we will use a GPU. Before that, let's confirm that SGD actually drives the loss down on a small subset of the data:

np.random.seed(10)
# Train on a small subset of the data to see what happens
model = RNNNumpy(vocabulary_size)
losses = train_with_sgd(model, X_train[:100], y_train[:100], nepoch=10, evaluate_loss_after=1)
2016-06-13 16:59:46: Loss after num_examples_seen=0 epoch=0: 8.987425
2016-06-13 16:59:56: Loss after num_examples_seen=100 epoch=1: 8.976270
2016-06-13 17:00:06: Loss after num_examples_seen=200 epoch=2: 8.960212
2016-06-13 17:00:16: Loss after num_examples_seen=300 epoch=3: 8.930430
2016-06-13 17:00:26: Loss after num_examples_seen=400 epoch=4: 8.862264
2016-06-13 17:00:37: Loss after num_examples_seen=500 epoch=5: 6.913570
2016-06-13 17:00:46: Loss after num_examples_seen=600 epoch=6: 6.302493
2016-06-13 17:00:56: Loss after num_examples_seen=700 epoch=7: 6.014995
2016-06-13 17:01:06: Loss after num_examples_seen=800 epoch=8: 5.833877
2016-06-13 17:01:16: Loss after num_examples_seen=900 epoch=9: 5.710718

Good: SGD is working, the loss is going down.

## Training the Network with Theano and the GPU

The Theano implementation, RNNTheano in the code accompanying this tutorial, mirrors the NumPy model but lets Theano compile the computation graph and run it on the GPU. Timing an SGD step again:

# RNNTheano is provided by the code accompanying this tutorial
np.random.seed(10)
model = RNNTheano(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)

This time a single SGD step takes 73.7 ms.

Instead of training from scratch, we load a set of pre-trained parameters:

from utils import load_model_parameters_theano, save_model_parameters_theano

model = RNNTheano(vocabulary_size, hidden_dim=50)
# losses = train_with_sgd(model, X_train, y_train, nepoch=50)
# save_model_parameters_theano('./data/trained-model-theano.npz', model)
load_model_parameters_theano('./data/trained-model-theano.npz', model)
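
The save/load helpers come from the utils module in the accompanying code. As a rough sketch of what such functions might look like (my assumption, not the exact implementation), the Theano shared variables can be written to and restored from NumPy's npz format:

import numpy as np

def save_model_parameters_theano(outfile, model):
    # Pull the current values out of the Theano shared variables and save them.
    U, V, W = model.U.get_value(), model.V.get_value(), model.W.get_value()
    np.savez(outfile, U=U, V=V, W=W)

def load_model_parameters_theano(path, model):
    # Load the saved arrays and push them back into the model's shared variables.
    npzfile = np.load(path)
    model.U.set_value(npzfile["U"])
    model.V.set_value(npzfile["V"])
    model.W.set_value(npzfile["W"])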

### Generating Text

With a trained model we can generate new sentences:

def generate_sentence(model):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    # Repeat until we get an end token
    while not new_sentence[-1] == word_to_index[sentence_end_token]:
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs[-1])
            sampled_word = np.argmax(samples)
        new_sentence.append(sampled_word)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    return sentence_str

num_sentences = 10
senten_min_length = 7

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model)
    print " ".join(sent)

A few of the generated sentences:

  • no to blame their stuff go at all .
  • consider via under gear but equal every game .
  • no similar work on the ui birth a ce nightmare .
  • the challenging what is absolutely hard .
  • me do you research getting +2 .
  • ugh is much good , no .
  • me so many different lines hair .
  • probably not very a bot or gain .
  • correct this is affected so why ?
  • register but a grown gun environment .

Not bad! There are still obvious flaws, though, and many of them stem from the fact that a vanilla RNN cannot capture long-range dependencies. In the next part we will look at BPTT in more detail and at the vanishing/exploding gradient problem.