char-rnn-chinese

This post runs experiments based mainly on Multi-layer Recurrent Neural Networks (LSTM, GRU, RNN) for character-level language models in Torch.

#Preparation

The original README says "This code is written in Lua and requires Torch. Additionally, you need to install the nngraph and optim packages using LuaRocks", so the following dependencies need to be installed.

##Installing Torch

Install Torch with the following commands:

cd ~/
curl -s https://raw.githubusercontent.com/torch/ezinstall/master/install-deps | bash
git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; ./install.sh

Then reload the shell configuration so the changes take effect:
source ~/.bashrc

If the following screen appears, Torch has been installed successfully!

![](/images/2016/05/Screenshot from 2016-05-26 19-51-17.png)

##Installing Lua
sudo apt-get install lua5.2

##Installing the other dependencies

Use LuaRocks to install nngraph and optim:

luarocks install nngraph
luarocks install optim

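LuaRocks has to be installed first.
The installation ran into problems at the config step; see the posts 安装Luarocks ("Installing LuaRocks") and linux下lua开发环境安装 ("Setting up a Lua development environment on Linux").
At this point Lua may be installed and yet the build reports that lua.h cannot be found, probably because liblua5.1-0-dev also needs to be installed.
After installing luarocks with apt-get, installing nngraph still failed with an error that would have to be resolved.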

==In fact, it is enough to install them with the luarocks that ships with Torch==:

sudo ~/torch/install/bin/luarocks install

This machine only has integrated Intel graphics, so I planned to compute on the CPU only and did not install CUDA.

#Running the experiments

##karpathy's example experiment: CPU version

###Training

Run th train.lua --help to see what each option does:

Options
-data_dir data directory. Should contain the file input.txt with input data [data/tinyshakespeare] (note: the training corpus)
-min_freq min frequent of character [0]
-rnn_size size of LSTM internal state [128]
-num_layers number of layers in the LSTM [2]
-model for now only lstm is supported. keep fixed [lstm]
-learning_rate learning rate [0.002]
-learning_rate_decay learning rate decay [0.97]
-learning_rate_decay_after in number of epochs, when to start decaying the learning rate [10]
-decay_rate decay rate for rmsprop [0.95]
-dropout dropout for regularization, used after each RNN hidden layer. 0 = no dropout [0]
-seq_length number of timesteps to unroll for [50]
-batch_size number of sequences to train on in parallel [50]
-max_epochs number of full passes through the training data [50]
-grad_clip clip gradients at this value [5]
-train_frac fraction of data that goes into train set [0.95]
-val_frac fraction of data that goes into validation set [0.05]
-init_from initialize network parameters from checkpoint at this path []
-seed torch manual random number generator seed [123]
-print_every how many steps/minibatches between printing out the loss [1]
-eval_val_every every how many iterations should we evaluate on validation data? [2000]
-checkpoint_dir output directory where checkpoints get written [cv]
-savefile filename to autosave the checkpont to. Will be inside checkpoint_dir/ [lstm]
-accurate_gpu_timing set this flag to 1 to get precise timings when using GPU. Might make code bit slower but reports accurate timings. [0]
-gpuid which gpu to use. -1 = use CPU [0]
-opencl use OpenCL (instead of CUDA) [0]
-use_ss whether use scheduled sampling during training [1]
-start_ss start amount of truth data to be given to the model when using ss [1]
-decay_ss ss amount decay rate of each epoch [0.005]
-min_ss minimum amount of truth data to be given to the model when using ss [0.9]
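
The -min_freq option is worth a closer look for Chinese text, where the character vocabulary runs into the thousands. Below is a minimal Python sketch of the general idea, assuming that characters seen fewer than min_freq times are mapped to one shared placeholder. The function names and the `<unk>` token are illustrative assumptions, not taken from the repository's loader, which may treat rare characters differently.

```python
from collections import Counter

def build_vocab(text, min_freq=0, unk="<unk>"):
    """Map each character seen at least min_freq times to an integer id.

    Rarer characters share the id of the placeholder `unk` (an assumption of
    this sketch, not necessarily what the real loader does).
    """
    counts = Counter(text)
    kept = sorted(ch for ch, n in counts.items() if n >= min_freq)
    vocab = {ch: i for i, ch in enumerate(kept)}
    vocab[unk] = len(vocab)
    return vocab

def encode(text, vocab, unk="<unk>"):
    """Turn text into a list of ids, falling back to the placeholder id."""
    return [vocab.get(ch, vocab[unk]) for ch in text]

# 'c' occurs only once, so with min_freq=2 it falls back to the placeholder.
vocab = build_vocab("aaabbc", min_freq=2)
ids = encode("abc", vocab)
```

A smaller vocabulary shrinks the input and output layers of the network, which is why raising the threshold can speed up training on a large character set.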

Following the instructions on GitHub, I ran the experiment with the corpus that ships in the repository:

th train.lua -data_dir data/tinyshakespeare/shakespeare_input.txt -gpuid -1

This fails with an error:

 th train.lua -data_dir data/tinyshakespeare/shakespeare_input.txt -gpuid -1
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file data/tinyshakespeare/shakespeare_input.txt/input.txt...
loading text file...
/home/frank/torch/install/bin/luajit: cannot open <data/tinyshakespeare/shakespeare_input.txt/input.txt> in mode r at /home/frank/torch/pkg/torch/lib/TH/THDiskFile.c:649
stack traceback:
[C]: at 0x7f9c42473540
[C]: in function 'DiskFile'
./util/CharSplitLMMinibatchLoader.lua:201: in function 'text_to_tensor'
./util/CharSplitLMMinibatchLoader.lua:38: in function 'create'
train.lua:118: in main chunk
[C]: in function 'dofile'
...rank/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d70
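
The traceback shows the loader appending input.txt to whatever -data_dir holds, so the option has to name a directory that contains a file called input.txt, as the help text says. A minimal sketch of preparing such a directory from an arbitrary corpus file follows; the helper name and any paths passed to it are hypothetical.

```python
import shutil
from pathlib import Path

def prepare_data_dir(corpus_path, data_dir):
    """Copy a corpus into data_dir under the file name the loader looks for."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / "input.txt"
    shutil.copyfile(corpus_path, target)
    return target
```

After a call such as prepare_data_dir("my_corpus.txt", "data/mydata"), training would be started with -data_dir data/mydata/.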

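The run failed here: the loader treats the -data_dir value as a directory and looks for input.txt inside it, so pointing it at the text file itself does not work. Since this code is a Chinese author's adaptation of karpathy's original char-rnn, I thought following karpathy's original tutorial might be more convenient, so I ran its sanity check: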

th train.lua -gpuid -1

This trains the example on the CPU with every other option left at its default.

Training started at 15:42.

 th train.lua -gpuid -1
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 423, val: 23, test: 0
vocab size: 65
creating an LSTM with 2 layers
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
number of parameters in the model: 240321
cloning rnn
cloning criterion
1/21150 (epoch 0.002), train_loss = 4.19803724, grad/param norm = 5.1721e-01, time/batch = 2.3129s
2/21150 (epoch 0.005), train_loss = 3.93712133, grad/param norm = 1.4679e+00, time/batch = 2.3114s
3/21150 (epoch 0.007), train_loss = 3.43764434, grad/param norm = 9.5800e-01, time/batch = 2.3022s
4/21150 (epoch 0.009), train_loss = 3.41313742, grad/param norm = 7.5143e-01, time/batch = 2.5311s
5/21150 (epoch 0.012), train_loss = 3.33707270, grad/param norm = 6.9269e-01, time/batch = 2.4913s

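After the 300th iteration, time/batch settles at about 2.3 s. In other words, training this 1 MB example on the CPU takes roughly 14 hours (21150 iterations × 2.3 s ≈ 13.5 hours)!
Training finished at 08:24 the next day.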
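
The estimate can be reproduced from the figures in the log: 423 training batches per epoch times 50 epochs gives the 21150 iterations, and multiplying by the time per batch gives the total. A small sketch of that arithmetic (the function name is illustrative):

```python
def estimate_training_hours(batches_per_epoch, max_epochs, seconds_per_batch):
    """Estimate wall-clock training time in hours from the logged figures."""
    total_iterations = batches_per_epoch * max_epochs
    return total_iterations * seconds_per_batch / 3600.0

# 423 training batches per epoch, 50 epochs, about 2.3 s per batch on the CPU
cpu_hours = estimate_training_hours(423, 50, 2.3)
print(f"{cpu_hours:.1f} hours")
```

The actual run took longer (15:42 to 08:24 is about 16.7 hours), which is consistent with the roughly 2.87 s per batch logged near the end of training.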

21148/21150 (epoch 49.995), train_loss = 1.53254314, grad/param norm = 5.9157e-02, time/batch = 2.8658s
21149/21150 (epoch 49.998), train_loss = 1.50882624, grad/param norm = 5.7123e-02, time/batch = 2.8737s
decayed learning rate by a factor 0.97 to 0.00057368183755432
evaluating loss over split index 2
saving checkpoint to cv/lm_lstm_epoch50.00_1.3568.t7
21150/21150 (epoch 50.000), train_loss = 1.46142484, grad/param norm = 5.9032e-02, time/batch = 2.8834s
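
The learning rate printed in the log can be checked against the defaults from the help text (learning_rate 0.002, learning_rate_decay 0.97, learning_rate_decay_after 10). The sketch below assumes the decay is applied once at the end of every epoch from epoch 10 onward; that schedule is inferred from the logged value rather than read from train.lua.

```python
def learning_rate_at_end(base_lr=0.002, decay=0.97, decay_after=10, max_epochs=50):
    """Apply one multiplicative decay at the end of each epoch >= decay_after."""
    lr = base_lr
    for epoch in range(1, max_epochs + 1):
        if epoch >= decay_after:
            lr *= decay
    return lr

# Epochs 10 through 50 give 41 decay steps in total.
final_lr = learning_rate_at_end()
print(final_lr)
```

Under that assumption the result is about 0.000574, which agrees with the value shown in the log.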

###Sampling
Check the help:

th sample.lua --help
Usage: /home/frank/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th [options] <model>

Sample from a character-level language model

Options
<model> model checkpoint to use for sampling
-seed random number generator's seed [123]
-sample 0 to use max at each timestep, 1 to sample at each timestep [1]
-primetext used as a prompt to "seed" the state of the LSTM using a given sequence, before we sample. []
-length max number of characters to sample [2000] (note: length of the sample in characters; the default is 2000)
-temperature temperature of sampling [1]
-gpuid which gpu to use. -1 = use CPU [0] (note: should match the setting used for training)
-verbose set to 0 to ONLY print the sampled text, no diagnostics [1]
-stop stop sampling when detected [




]
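
The -sample and -temperature options control how the next character is picked from the model's output. The following is a minimal Python sketch of the usual technique, not a transcription of sample.lua: with sampling off it takes the most likely character, and with sampling on it divides the log-probabilities by the temperature before drawing. The function name and signature are assumptions made for illustration.

```python
import math
import random

def next_char_index(log_probs, sample=True, temperature=1.0, rng=random):
    """Pick the index of the next character from a list of log-probabilities."""
    if not sample:
        # Greedy decoding: always the most likely character.
        return max(range(len(log_probs)), key=lambda i: log_probs[i])
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [lp / temperature for lp in log_probs]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

log_probs = [math.log(0.1), math.log(0.7), math.log(0.2)]
greedy = next_char_index(log_probs, sample=False, temperature=0.8)
```

Dividing by a positive temperature does not change which entry is largest, so in this sketch a temperature passed together with sample=False has no effect, and greedy decoding always produces the same text for the same prompt.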

A first trial run:
th sample.lua cv/lm_lstm_epoch50.00_1.3568.t7 -gpuid -1
It generated the following text:

giff,
Some sweet amends, aasher, had therein, not he had wot on
man's friends, for her own blow: for my men's
ackingly knight, it cannot hear upon yield.

GLOUCESTER:
How now again?
i
March, that's my arm determitness,
The temper of the vrowal; ere from the grove
Tut wilh 'goom'd Carulea.
'Mwailich, tne shy had lost in When Ied the way
The bower to a late his body, grim on you:
His opicious shames a booy, infairs,'
From her, I tell you, ay, as we mean him
Dear 'tis a giving o' thur back's empass'd,
That nrost, I'll havk him cume thee for 't.

LAONTES:
'Twi'l love thy sronowing.

VALYRAN:
Beord mocdoch him for thy follight
snn
hours,
But thank yours lodkes, my good journeding,
His jealousisposour thee are both abomish
That noom that's easembelland. Camest, sir, more
kia; one, in this highty be the un
Since of a gournor on thy friendshall swow
Some painon; and I, and lord, the at the kins
Wise rit hable surliments. Shd, believh gone.

voisted tleace:
Tock him what all you di turn up to celent
To my sistinge. Frranch, good night, your child, so fatus;
Aor he shall be my trueking:
Come on my quarrel of the way:
Methinks the letters; for this ctome-steers
Tad mousd my smodered pouncy to
haw up another sense tlays underttry
Tut bonscuration fair all purpose,
then be vesegt me: do not, yet rustle cannot,
But for thy mustered a dust, let me
Tncerfact me tresmer of his father:
therefore by hanging,
ANd
Ays, my lord: you do here in coumisant.

LORD:
How lond the brown!
So majp me; bonch, smmily lovely blotters,
When Ie my hoeaty threat and virlume these things,
Make fasting garlands dfar the sack'd my servictught
Not knows the crowns: one air, Aumerle,
Ere wear not so nour Bidagle? What Aphark is fury
Tld meens them, faireyou consides to no more
Ihis wantond frown and pollitueser'd city.
Can should put him more recounders to impudesnt poison on
thet hour from hunt to Rame, supp to bere
Flowerd and his friend is une dewn ao pirt,
You know by join'd guilty, whathout we e.

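##karpathy's example experiment: GPU version

The commands are the same as for the CPU version, except that training uses th train.lua -gpuid 0.
Sampling from the resulting checkpoint with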

th sample.lua cv/lm_lstm_epoch50.00_1.3622.t7 -gpuid 0

gives this sample:

--------------------------
ge
I prithee; nor in the day of all report?
Nou shall be you that lances stoson quarrel:
We mits most stone, 'aim, upeed his forticent
Ahen I was, yet for her, but of lovels Clarencel
And past her got and father there, one weep
In mine wroth nightlys, smileed. Cantleye An York,
A druls marriage, that though Service hated me
Their duen of iflock, we are here in berieved
and more than tanise them with suterept
He rt in some pickle.

SePSERTER:
Or I have safe all thou depustere andthe before him.

FRIAR LAURENCE:
What art my government, cosbude, and hence; if Onfrawn? provest tor my duty?

CATESBY:

KARIANA:
My love noe is are with t herman and his,
It should she well deeauring our consent:
They hang me bointed on the king, let so two
Nature by my sighsing pleasing 'jabe
That leaven and grue, at her Richard's blood.
More ends it likipenortnive, nor each of him
ic.

SLY:
How dachors, Richmend, henr dack but like it?
Be long, anon since your kingdom and us,
And we that aver el
aunter, my eee to toucurt tomends.
It this her great fawn's birds,, sir! you'er head.

PAULINA:
Upternalt cost of his hands for their tricks my father,
Who ts it most seunt to live te and she were all.
-kill
O to thy son os shall not on your childrs,
one next, for she did formly consixent
Above, my life, and wew me worthy deeming tvenge!
My mustere be exploience, aot come n leave where ahe knees in.
dear, thus wild up tilt on the county, hath be one.
See this sword of thee with the deepito man,
For sunier ene first sears. Where's turn on to be.
Unctious blunlest terrocate doves
Trades Marcius aines of hlends
My's learth an old--ay.

LEONTES:
Marcius?

PRONVO:
You would no gue.

VOLUMNIA:
Oovine s fetch to tight, thou must but loods.

HASTINGS:
And was with her, nor yonder to be sworn,
What are you allady that I should have purpose.
What men revenge is a well patient
And who seth sxoleng to knowled ed to myself;
And married me in the joy:
So reve I made to find me speak,
how he been tou.

PETCA:

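##Water Margin (《水浒传》) corpus experiment

###CPU version

Rename the downloaded text of Water Margin to input.txt (here under data/mydata/), then train with: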

th train.lua -data_dir data/mydata/ -gpuid -1

The log makes it obvious that training is very slow:

th train.lua -data_dir data/mydata/ -gpuid -1
vocab.t7 and data.t7 do not exist. Running preprocessing...
one-time setup: preprocessing input text file data/mydata/input.txt...
loading text file...
creating vocabulary mapping...
putting data into tensor, it takes a lot of time...
saving data/mydata/vocab.t7
saving data/mydata/data.t7
loading data files...
cutting off end of data so that the batches/sequences divide evenly
reshaping tensor...
data load done. Number of data batches in train: 345, val: 19, test: 0
vocab size: 4129
creating an LSTM with 2 layers
setting forget gate biases to 1 in LSTM layer 1
setting forget gate biases to 1 in LSTM layer 2
number of parameters in the model: 2845345
cloning rnn
cloning criterion
1/17250 (epoch 0.003), train_loss = 8.32795887, grad/param norm = 9.6310e-02, time/batch = 28.8711s
2/17250 (epoch 0.006), train_loss = 8.06433859, grad/param norm = 4.2826e-01, time/batch = 25.2184s
3/17250 (epoch 0.009), train_loss = 7.28941094, grad/param norm = 3.9537e-01, time/batch = 25.2195s
4/17250 (epoch 0.012), train_loss = 6.85331761, grad/param norm = 4.0576e-01, time/batch = 25.2011s
5/17250 (epoch 0.014), train_loss = 6.69327439, grad/param norm = 3.8309e-01, time/batch = 24.9642s
6/17250 (epoch 0.017), train_loss = 6.50776019, grad/param norm = 3.1042e-01, time/batch = 24.9203s

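At about 25 s per batch, the 17250 iterations would need roughly 120 hours of training! This is the raw, unprocessed corpus, though; training on a preprocessed corpus (for example, with low-frequency characters removed) should be faster. Because it would have taken so long, I abandoned this experiment.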
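
Part of the slowdown comes from the vocabulary: it grew from 65 characters for tinyshakespeare to 4129 here, and the logged parameter count grew from 240321 to 2845345. Both numbers can be reproduced by assuming a standard stacked LSTM over one-hot characters (four gates per layer, each with input and recurrent weights plus biases, followed by a linear output layer). This layout is inferred from the logged counts, not read from the model code.

```python
def lstm_param_count(vocab_size, rnn_size, num_layers):
    """Count weights and biases of a stacked LSTM over one-hot characters."""
    total = 0
    input_size = vocab_size  # the first layer reads a one-hot vector
    for _ in range(num_layers):
        total += input_size * 4 * rnn_size + 4 * rnn_size  # input-to-hidden
        total += rnn_size * 4 * rnn_size + 4 * rnn_size    # hidden-to-hidden
        input_size = rnn_size  # deeper layers read the layer below
    total += rnn_size * vocab_size + vocab_size            # output layer
    return total

print(lstm_param_count(65, 128, 2))    # tinyshakespeare: 240321
print(lstm_param_count(4129, 128, 2))  # Water Margin: 2845345
```

Under this layout only the first layer's input weights and the output layer depend on the vocabulary size, which is where trimming rare characters would save parameters.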
###GPU version with the unprocessed corpus
First, train on the unprocessed corpus:

th train.lua -data_dir data/mydata/ -gpuid 0

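Training began at 10:20.
The time/batch settles at about 0.08 s, so training finishes within half an hour (17250 iterations × 0.08 s ≈ 23 minutes)! For this kind of numerical workload the GPU is far faster than the CPU.
Once training is done, sample with: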
th sample.lua cv/lm_lstm_epoch50.00_3.9309.t7

者,却得出来的物细物,没面少管生渔人。后来的知府时是乱色早晚,拾了几个。李逵惊得忙忙轻梳药穿在大牢里,摆在延安家处,推慰九节的。当下径到居中饮酒,牌门头,戴宗又焦躁。只见屏风背后转出一个小风大来,暗暗听得道:“反细放俺!兄弟拿着,趁这为害天明地清,我休要推道别事的都要做伴当拆投到会耳,便有进漏?”时迁舞起树下探人,的了夹搭,都拽了拽开,胸皮虽是好了六分惊得,是他麻。吐女棒放火了,走不向前,及宋江那道个留守他做个辩察的,先自去州里请明地烧了钱用,但有过京回家,听得状子好!这高老袋内却是出张招安,又都是他的欺负民,如何是计信?必须要和郎相会。且。”趁早起楼去了。两个连夜时候,仓治时,年二十八执迷者多要都要去行叶,早是亲家洒家,径挨到府前来。灯烛纸敌官,方才脱漏。亦被乱窝中有人等好人知蔡京道:“那个人也是他甚么?因此教大军打劫俺那干干金珠的。”母子那妇人来到大王尚安着,相交酒追惹三清酒搦战。不过半夜之事,早饭相烦,心里出对知府说:“官家初时在时,欲要市上时,兀自和我觅家劫犯了他,如何不赶我里来?我大小心定得,不分便了。”叉军转回,已做些小头,要打抵敌他!且把身,带了七个人时,都抢出家,但见:  
壮中醪浑纷领,腰细轻露。阔尺三层挺刀,鱼厚夸敌庆孙。高人有八句诗单道身心强似莽?;小付敲柴的,真多呼圣殿之主贞锋欣乐词。头邬闻丹腊良夫,耍达缘矮岁龙;四虎间,寒暮难以偷黄;牢记仁诸作像,显宝一根红佛。微。善得山迎能指挥,直救清天马星。  
话说上阵法, 师皇帝展得有?宫殿,雨翠云也上田地里。幸观非乃帝重了,宋江心如有誓,同宋受迷。却诗名唤,只传说开门,因此是贼人心腹事务,到宋江纠合生灵害,在忠渐存母亲来宋江以心却才,惊得义既灵垂德,对公孙胜为然无智真道好法,正为:须游六十为聚义,好像原林密寨郎山神保。  
当日宋庄客帐前,与晁保、公二位头领,众头领发起作法。  
石裂更兼地分都拨人汉,且不杀得蒙恩干人结义,下山只是锦袋百把,们有父亲孔宾,同商量。宋江又道:“自是好生,莫非也只是是有哥哥下了。”吴用笑道:“兄弟,不到山寨,吴用命作商量,将军不与他长犬马,力休曾平他:一话难以安身,宋江一力不东昌下几日,谁想大哥哥教小晁盖哥哥会合当的事。我们人投随天军来,又有伤损;若不连环甲关,着李横其不似火体,车藏御上尽挂玉水;军卒许多,无无不难之际?他但**,可等兵,可以斩遣。”众军健都管入庄,要把鲜血迸成,赶起来,背后解水边,唤车军跨城疮只等,原来正是之福,后往,来不见三个使汉。因弟清风船救应。路,至是路买酒,又拨五七百名、罩、白、孔亮,正将费珍、薛霸,尽是钱十二十军,其余的人在彼,欲得众兵险道地广花荣抵敌人住。这一队节度使士都军,被两个军猛,呐声喊,都抢过城里,并无腰迎敌,被贼兵赶上,时,却被花荣战箭射来。童太人、杨志正是南安江韩杨龙、穆弘、李逵、索超正定敌。孙军纵马。琼清马挺着枪,入来,尽被史进和贼人杀死贼兵,擒做霹雳。邬梨因成让风,连鼓上马,将股斧,却飞入阵,大小张清见了见乱军阵前卒法败坑回阵,宋江旭前只是:  
主人问姓,五应风万。侯海道正:“恶平可逃奔:。时们村野阴血,呼往天兵消波。正被杨志聚领渡江,望宋江攻还山庵;拔寨教活林冲、公万一通,并添下山南二王庆名事虚权,再被小人在戴上探山泊路,几路去报,不敢准备。不知这个人说起是百庄小黑凌州,已曾见了,对别无缘。”吴用道道:“便队军马解到此时,必是殡隘为百谷岭。原师悬流水军头二头领,结识江湖上好汉姓石,名给鬼,便乃五家庄二多情。我去这里地路,望会便行。”廊进雷车把人来不止,李应拈着诏书,自此付话。  
且说山客渡过了三只路,教穆弘扮做伴当,扮做阎婆者,带拢是臭镇一个没赌什门,分顺了同行,自去寻闭了的。原自去被人运烧将下去了。宋江等远远,一路进兵。十来县不在大路途来,又怕了到得闲意的张社长,听得监押一声,货钱便是。任原陈达在中,不知处打那华州,特使他来掳去太安军肆,只待下山。戴宗告随张小人,蔡九知府不得,连夜回话,同张招讨干办、众部吉。于路,忽报探知样悔,景珍全过,领回商议,“军师赵枢密喜绯金带,身上悬面草板,护道国师,服,神色不通。是奇诈将丘留的人,准备起船走径来借粮,业不同何遇一深困马灵夫,便因密的月色渐砂来完,斋。小温皇威,被宣刚引军来,武松彼朵并顾大嫂,赴了逃去,自逃命探了。被那几人娅?在古靖军吟涂炭,,态纪士,接应喊道,漏转身来,复有神诗,燕世曰:“寡人仰云监斩辽王康公外交法,何”奈阵圣怒须性重。铁挨填丹靴,万边狄行鉴。见。田户观看草畔,红日影豪困催急绩。宋玉游战,听听了大喜。话说宿太师诏奏道:“宿元帅差有敕入请罗真人,密封官军等八员高名,封当同达宋先锋。”日收选润之主,奏为圣旨,特着州殿府探知。太尉宿太尉回到内,启转马,众军方可亦成开大事,放起出来,更兼小一个唤做

##Training with a larger RNN

Train a 3-layer RNN with 512 hidden units per layer, as follows:

th train.lua -data_dir data/shuihuzhuan/ -gpuid 0 -rnn_size 512 -num_layers 3
th sample.lua cv/lm_lstm_epoch50.00_5.4830.t7 -sample 0 -temperature 0.8 -verbose 0 -length 500

弟兄两个,也得个信名之人。”那个也是个道理。童家四更,被张顺斩的粉碎,以下人人家拿去了。一面叫酒保打两桶酒来。小二哥叫道:“师父,你不是我来也!”那小牢子道:“我也不曾你,你便叫我上来。”   石秀道:“你且说他三宫百里吃酒了来,你便抢入去,你便先来看,却被这畜生说不得了。”那妇人道:“你真人要打这里话?你却不认你,你便叫我儿来寻。”李逵道:“你敢作吃的,便揪我做脚!”赤条条地寻谁,只得骂道:“爹娘,你且休了,我自不信,砍我头便打那老娘。”那妇人道:“也好。”便把袖儿丢下去了。那妇人也把刀带在一边,却似小窗????胡乱道:“好拳脚!”急叫开了店看。”王庆听了,连声叫道:“阿也!你不要吃!”把手一指,提倒上岸来,把朴刀倚在被里。就把篙子门内,倒做五六斤了,将把木鱼来摆下桶桶。少时,张顺吃了一回。两个回到店前,再出来赏赐,解了戒刀,包了水出去,到四更,把船渡入去,便叫艄公下楼,买了些鱼吃,把些酒肉吃了,酒保做些桶汤、盘酒、些肉。下来穿瓶与酒。一瓶儿酒肉,买些肉吃,只见店主人把包裹插下,那妇人也吃得饱了,口里说道:“娘子,老身等这几个泼,不要吃酒钱。”店小二道:“好酒好肉要打,我吃便饱

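Clearly, the sampled sentences are now much more fluent and contain few errors; read as fiction, the output of the larger RNN is noticeably better. This suggests that **an RNN with more hidden units and more layers has a stronger capacity to learn.** Note, however, that this sample was generated with -sample 0 (taking the most likely character at every step), whereas the earlier one used the default -sample 1, so part of the difference may come from the sampling setting.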

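###Comparing model quality over the course of training

We can also compare how the model's output differs between the early stage of training and the end: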

th sample.lua cv/06-01-shz_sp/lm_lstm_epoch5.80_4.0451.t7 -sample 0 -temperature 0.8 -verbose 0 -length 500

With training only at epoch 5.8 (it completes at epoch 50), the output is:

只见一个人从来,一个人,都来做一个。那人道:“你这厮们,我自去寻你。”那人道:“你不要瞒我,你便不要你。”那人道:“你不要瞒我,你便不要你。”那人道:“你不要瞒我,你便不要你。”那人道:“你不要瞒我,你便不要你。”那人道:“你不要瞒我,你便不要你。”那人道:“你不要瞒我,你便不要你。”那人道:“你不要瞒我,你便不要你。”那人道:“你不要瞒我,你便不要你。”那人道:“你不要吃,我自去寻你。”那婆娘笑道:“你不要吃,我自去寻你。”那婆娘笑道:“你不要吃,我自去寻你。”那婆娘笑道:“你不要吃,我自去寻你。”那婆娘笑道:“你不要吃,我自去寻你。”那婆娘笑道:“你不要吃,我自去寻你。”那婆娘笑道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不要吃,我自去寻你。”那妇人道:“你不

At this stage the generated text is highly repetitive; the model has not yet converged well.
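
One hypothetical way to put a number on how repetitive a sample is (not something done in the original experiment) is the share of distinct character n-grams: a text that loops over the same sentence scores low, while varied text scores close to 1. The function name and the choice of n are assumptions.

```python
def distinct_ngram_ratio(text, n=4):
    """Fraction of character n-grams in text that are unique (1.0 = no repeats)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 1.0
    return len(set(grams)) / len(grams)

looped = distinct_ngram_ratio("abcabcabc", n=3)  # 3 distinct out of 7 n-grams
varied = distinct_ngram_ratio("abcdef", n=3)     # every n-gram is different
```

Computing this ratio for samples from successive checkpoints would give a rough, quantitative version of the comparison made by eye here.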

Continuing, at about a quarter of the way through training:

那汉子听了,便问道:“你这厮不是歹人,如何不来?我们不曾有这般的,如何不来?你的那里去了?”那汉道:“我们不曾说谎。”那汉道:“我不曾说谎。”那汉道:“我不曾说谎。”那汉道:“我不曾说谎。”那汉道:“我不曾说谎。”那汉道:“既是恁地,我们自去。”那汉道:“既是恁地,我们自去。”那汉道:“既然恁地,我们自去买碗酒吃。”那汉道:“既是恁地,我们自去。”那汉道:“既然恁地,我们自去买碗酒吃。”那汉道:“既是恁地,我们自去。”那挑酒的汉子道:“我们自有计较,我们自有些钱,与你些银两,却去商议。”董超道:“我们不曾有这般的事,如何不来?你们自去取路,我和你如何不去?若还了他时,便是要去的。”李逵道:“你们不曾说的,你便是个老儿,如何不来?我们自去取他,你便不曾与他厮见。”那妇人道:“你的,你不省得。”那妇人道:“你的,不要胡说!我们自有钱帛,便要去便了。”那妇人道:“既是恁地,我们自去买些银子与我。”那妇人道:“我们自有钱与你,你便不曾去了。”那妇人道:“既是恁地,我们便去。”那妇人道:“既是恁地,我们便去。”那妇人道:“你的,不要胡说!我如何肯去这里?”那妇人道:“便是这般使棒,不曾得得他。

Next, the output at about the halfway point of training:

张都监为副将急体己人,不敢不依,只得随行众军,掌行送行。只留下降,尽皆欢喜,以此忠义。在府行军中,有使枪棒卖药的,将王庆领军到来,并不必说。当下宋江传令,教中军计策,与卢俊义等商议:“今日折了两阵,俺们自去了。”宋江道:“军师之言甚善!”当下即日便传将令,教军士点营,斩动军马。将及初五更战后,攻打常州,催趱军兵,一齐进发。 寨中,只听得高声叫道:“萧让等救兵!”宋江看那军将,尽数放起,对幽州一个大汉,乃古头上大叫道:“水洼草寇,怎敢轻慢!”只见里面关胜、呼延灼、关胜等探有一人,只见唐斌从人骑马,直到宋江寨前,喝请宋先锋。宋江听了大喜,传令令军士且去寨中坐地,备说宋先锋军马,攻打北京。吴用道:“且教两路军马,攻打北门。”宋江便令吴用、朱武商议:“今日可去,只是国师吴用,坐一件事,我等随顺到此,可用两处夹攻,那厮必然有人来。”宋江道:“军师言之极当。”便唤军士计策。”宋江道:“军师言之极当。”吴用道:“小生直作妙计,即且闻见。”宋江道:“先生之言,是不得这般忧疑。”宋江道:“既然如此,与你四位豪杰,不堪员外大王。”宋江道:“贤弟,你休要疑心,我便去请来。”吴用道:“不须你两个与我箭,只

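The crude verbatim repetition is gone by now, but “宋江道” ("Song Jiang said") still keeps recurring, each time with similar content, which suggests the model is still improving.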

This is a sample from near the end of training:

张青、孙二娘、顾大嫂、孙二娘,并四个好汉,引着一千余人,吹造大小船只,都投水路。不多时,只见松树背后转出一个小小人来,簇拥着两个人,各提着朴刀,背后有人,叫一声:“捉下!”   那汉子把船只一招,扶着一干人,把那碗饭打伤,打的粉碎,把头头割在一边,口里放火。那人见了是惊得呆了,又不来吃了一惊,扑地只顾走。却待再走,再去脱人避凉。李逵却亦不肯拦他,只得走了。可怜救他两个性命,那里敢?别人。前日被捉死了性命,杀了人,逃走在江州,被害人陷害,方得正中了。今日幸得相见,如何使得?便得是个知县过来的,也喜得及。他便是本人的人,须是高太尉的人,却不知是那里人。”任原道:“这个便是我的儿么?”王婆道:“便是前日那官司亲亲叔孝,为何到此?”那婆子答道:“老身只道不妨,只怕小人自有措置。老身看了,便忘了回去。”老都管道:“这个容易。老身先把银酒去了。”老儿道:“你们自不要吃酒。”那婆子也笑起来道:“这个便是我的老小人家。”那婆子道:“便是老身也不怕你,休要胡主干娘,只怕你疑心。”那妇人道:“不干了。你的女儿,老娘儿只做买些衣服来送与你。”王婆道:“娘子,你要知这几个字?”那婆子道:“有甚么哭处?”

The model has improved further and is close to finishing training.

##Complete Tang Poems (《全唐诗》) experiment

The downloaded 全唐诗.txt is GBK-encoded, so it first has to be converted to UTF-8:

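# -*- coding: UTF-8 -*-
# Convert the GBK-encoded corpus to UTF-8 (Python 3).
def main():
    # Decode as GBK, a superset of GB2312; undecodable bytes are dropped.
    with open("全唐诗.txt", encoding="gbk", errors="ignore") as src:
        text = src.read()
    # Write the same text back out as UTF-8.
    with open("全唐诗转换.txt", "w", encoding="utf-8") as dst:
        dst.write(text)


if __name__ == '__main__':
    main()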

This produces a UTF-8 encoded file.

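At first I wanted to feed this file straight to the RNN, but on reflection the Complete Tang Poems gathers the work of a great many poets, so a model trained on all of it might not produce verse with any distinctive character. I therefore decided to experiment first with the poems of Li Bai, the "Immortal Poet": I split his poems out of the converted file into a file of their own.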
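
The post does not show how the split was done. A minimal sketch follows, assuming each poem starts with a header line shaped like the ones visible in the generated samples (for example "卷177_21 【title】李白") and that everything up to the next header belongs to that poem. The regular expression, the function name, and the demo entries are all assumptions.

```python
import re

# Header layout assumed from the sampled output, e.g. "卷177_21 【title】李白".
HEADER = re.compile(r"^卷\d+_\d+\s*【.*】\s*(?P<author>\S*)\s*$")

def extract_author(lines, author="李白"):
    """Keep each header line credited to `author` plus the lines that follow it."""
    keep = False
    kept = []
    for line in lines:
        match = HEADER.match(line)
        if match:
            # A new poem starts here; decide whether it belongs to the author.
            keep = match.group("author") == author
        if keep:
            kept.append(line)
    return kept

# Placeholder corpus (made-up entries, only to show the expected layout).
demo = [
    "卷1_1 【标题甲】某人",
    "第一首的正文。",
    "卷2_3 【标题乙】李白",
    "第二首的正文。",
    "卷2_4 【标题丙】某人",
    "第三首的正文。",
]
li_bai = extract_author(demo)
```

On the real corpus the lines would come from reading the converted UTF-8 file, and the kept lines would be written out as input.txt in the training directory.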
Start training:
th train.lua -data_dir data/tangshi/ -gpuid 0
After training, sample from the model:

th sample.lua cv/lm_lstm_epoch50.00_5.3967.t7

耶翳妃在。流明湖草,岂舞高散纷。小剑宫底寒,石思怀士。
归来相烟叹,又余未老此。此在见携节,气帝皇彩川。
尝君凌天鸟,羞从臂山中。何旋俱所偃,造愧翻遗耻。
悠挥李明云,坐月得相迟。君作成景鸡,暮日清膺发。
胡步垂洞松,三年延未莱。贤迢坐橐山,嗤楼谋疏川。
诗君九安菲,别去写日流。五时谢及洒,相魄三有玄。

卷177_21 【长道送秀士寄之十南国古松游明,》妓此蛾书此下见逃之始崔至六以吟酒宁】李白

仙阳壶我王,谣荣日茫过。幸生天行寒,半持见九忧。
早毂此楼宰歌,鹗税金归名。宾托昼宇闻,高汶堕臣泉。
峻悟湍素都,凤生鸟远才。虎傥亦成一,蹭据锁炳垣。
回谷闻叹波,摧人翼昆衣。丑德皆复贵,何袂自见宝。
黄然奉傍及,戾酒悲溪情。何悟下罗灭,壁令还济然。

卷177_16 【送沙门饯元佛之嵩使归少丞寺辅晔年亭山云】李白

行道一狂日,陆杯欲我园。相寄属相者,不歌流成书。
幽景神兰马,今云乃相知。相能拂此门,娟笑无生魂。
肠后扫天所,罗瑟心世桥。

卷169_11 【金松二炎师】李白

丽劝发何凤,含东药宛锦。且忍咏尺鳌,梦杯池月

罗鸟思归人,兼我无风歇。太色东草树,壮早游罗息。
别来青神极,谒长不陵寒。君忧东海鹤,吹欲得彩好。
梦君来太情,秦子忆延薪。闻干凌楼息,松水接廉才。
且识辞犹之,众断罗里中。

水羊远重,引飒高新策。目布愁霜亲,,随火赤云道。
解毂四上牛,以且汶云年。宁迎清巴寒,种欺清名风。
海坐去不意,思丹酩期然。闻钓曾帆景,一弄暗长雪。
且亦留我辉,杀讼韵中楼。

卷174_7 【赠崔司州十三青寺黔洞姑圣毛闲华忘兼塔宅石】李白

窜笑敬鹤见梦走去留僧。地筑从尺人,泪树平众烂。
窈箸有吉氲,久聪欲洲真。朝春离相兽,飞此何垣发。
绿日偶可言,我言经精存。君卒限汉水,绿月相成失。
恋虏一登事,但乘涂应星。

卷177_18 【咏夜别(人作帝阳之为昆者)】李白

笑乃度将明,蛾洞蕊山发。花坐新合露镜日边归溪。
长面云蓬立,自颜谢鸳分。灭用欲得碑,不云当天舟。
一来但不霄,乘我涉路来。怀子广陵远,夕人不元阙。
笑天亦望好,驱令适谁手。思此有门上,举与满丝君。
醉年归敬山,松藏谢未笑。却思笑古兴,凭凌泪高尺。

钱文1淮作3
 
卷一两十七三北欢宁三元四慈平难名】李白

水乘黄楼镜君客绿1起,山门天阙已忍春。不喜出庭作,乎
晖酒苔。麒惊美。绮门,我见秦心遂天雪,只行天花流。
相随夔山肉烟喧,武斗吹齐而公还。以开玉去戟如雨,
欲席上醉鹊里洪。世闻燕弦对士去,曳不笑丘瞳中嗟。
国昔逢眼青山鸟,扫水长丘双泉空。

卷165_8 【玉崔将言刺判池,一还精孰日君)】李白

军莫荆壮悲,揽迈宋钟客。高当愤酒酒,渴臣达庭边。
沙箸佳花诏,映河讵窗中。为交上梧空,空在猛杯闻。
惜缅何烦柯,屈笑献神然。万情上者溪,相以陶毫逸。

更闻无袂,小必笑伤穷。

卷169_12 【高陵黄入蜀,寄纪侍御二首】李白

爱来游山子,北照有华宅。岩歧难碑李,独承崔滹鼙。
巨君亦不处,常阖之此酒。思君穷莫术,却成愁庄隈。征六此与玄,,羞逐但应君。

卷189_26 【登溪马归歌,归石怀道怀山】】。
长腾明花屏爱心春,寄心入良素断。

卷188_7 【送侍御从尔史崔崖赴天】白山年粲赤浑,书别诗游泛风】李白

祖国一官食,二藏系冠鹇。无砚号盆水,浮发王青吟。
登水惜不母,殷古启与游。萧觉留见去,待泪复相思。
绿风吹中信,江山知洞宫。宾阔未孤里,空可向江峤。
西舟青溟云,一浪摇苗彩。流镜沙青辉,夕落得寒洲。
妾头海弦弓,而多何皇笑。欢辰虽欢心,此鸣暗清飞。
广松春人草,为啸李田。。思君鸾可得,萧论以天风。
今家高去路,酌藏无云功。为时壮罢邑,陈不见长名。
闻啸不可在,玉服汉寒情。愿昆隐山水,更将谢敬离。

大说词隐开,夜子庐鹤行。他烛不知尽,逸君徒风草。
把柯南幽山,超血彩森兵。吊窈赤微鸟,娇若相踪。
笑乏惜香门,夜酌闻岩欢。
我此一可驰,三刀瓦风樽。东秋清柱晚,意歌送高颜。
别留一失在,推时悲紫踪。群远亭我顶,机赠沾酒旋。
明镜若相巨,明窗忘语平。桃水凌瑶心,茫讼落清眉。
横产无商娥,万流应长生。

卷171_2 【酬纪寿阳送官】李白

白浦欲溪山,去入北酒人。云水薄阶弄,乃啾迹风声。
诗命侍飞儿,百结清霞杯。何言思君去,但然谢无歇。

卷176_21 【荆闺崔嵩人宰】李白

常鹉别帝家,绮书逐惟安。西世东山玉,雕是金东辉。
绿登紫鹿色,张看弄月萝。思手穷津水,弄是俨归人。
相恐新鹉道,渌声赠恨公。

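The result here is quite striking. We all studied many of the poet-immortal Li Bai's masterpieces in school textbooks, so most of us have a good feel for what a poet's verse is like; reading closely between these lines, we can fully sense Li Bai's boldness and free spirit.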