encoding - How to use Stanford LexParser for Chinese text?


I can't seem to get the input encoding right for Stanford NLP's LexParser.

How do I use the Stanford LexParser on Chinese text?

I did the following to download the tool:

$ wget http://nlp.stanford.edu/software/stanford-parser-full-2015-04-20.zip
$ unzip stanford-parser-full-2015-04-20.zip
$ cd stanford-parser-full-2015-04-20/

And my input text is in UTF-8:

$ echo "应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事 。" > input.txt
$ echo "应有尽有#VV 的#DEC 丰富#JJ 选择#NN 定#VV 将#AD 为#P 您#PN 的#DEG 旅程#NN 增添#VV 无数#CD 的#DEG 赏心#NN 乐事#NN  。#PUNCT" > pos-input.txt

According to README.txt, this is what the parser was trained on:

CHINESE

There are Chinese grammars trained on mainland material from Xinhua, and more mixed material from the LDC Chinese Treebank. The default input encoding is GB18030.

So I tried the UTF-8 file first:

$ bash lexparser-lang.sh Chinese 80 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz parsed input.txt
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ...  done [1.0 sec].
Parsing file: input.txt
Parsing [sent. 1 len. 16]: 应有尽有 的1�7 丰富 选择 宄1�7 射1�7 丄1�7 悄1�7 的1�7 旅程 增添 无数 的1�7 赏心 乐事 〄1�7
Parsed file: input.txt [1 sentences].
Parsed 16 words in 1 sentences (21.00 wds/sec; 1.31 sents/sec).

It didn't seem to work. The parser produced a file, input.txt.parsed.80.stp, in which the single-character words are garbled and the parse is a flat fragment.

[out]:

$ cat input.txt.parsed.80.stp
(FRAG (NR 应有尽有) (NR 的1�7) (NT 丰富) (NT 选择) (NN 宄1�7) (NN 射1�7) (NN 丄1�7) (NN 悄1�7) (NR 的1�7) (NT 旅程) (NT 增添) (NN 无数) (NN 的1�7) (NR 赏心) (NR 乐事) (VV 〄1�7))
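One plausible reading of the `的1�7` tokens, offered as an assumption rather than something confirmed from the parser's source: the UTF-8 bytes were decoded as GB18030, the parse was written back out as GB18030, and the terminal then displayed those bytes as UTF-8. Words with an even number of characters would survive that round trip by accident, because their bytes pair up into valid two-byte sequences, while single characters would leave a stray byte behind. The Python 3 sketch below reproduces the three steps on a shortened sentence. Python's codec may not replace invalid bytes exactly the way Java does, so the result can differ in detail from the log above.

```python
# A minimal sketch (Python 3) of the suspected mis-decoding round trip.
# The variable names are illustrative and not part of the Stanford tools.
sentence = "应有尽有 的 丰富 选择 定 将 。"

# The file on disk holds UTF-8 bytes.
utf8_bytes = sentence.encode("utf-8")

# Step 1: the bytes are read as GB18030 (the default per README.txt).
# Invalid sequences become U+FFFD instead of raising an error.
misread = utf8_bytes.decode("gb18030", errors="replace")

# Step 2: the misread text is written back out as GB18030.
rewritten = misread.encode("gb18030")

# Step 3: a UTF-8 terminal displays those bytes.
shown = rewritten.decode("utf-8", errors="replace")

# The round trip is lossy, so the displayed text no longer matches.
print(shown == sentence)          # False
print("\ufffd" in misread)        # True: at least one byte could not be decoded
```

Printing `shown` on a UTF-8 terminal lets you compare it with the parser log by eye.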

Then I tried encoding the sentence in GB18030:

$ bash lexparser-lang.sh Chinese 80 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz parsed input-gb18030.txt
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ...  done [1.0 sec].
Parsing file: input-gb18030.txt
Parsing [sent. 1 len. 16]: Ӧ�о��� �� �ḻ ѡ�� �� �� Ϊ �� �� �ó� ���� ���� �� ���� ���� ��
Parsed file: input-gb18030.txt [1 sentences].
Parsed 16 words in 1 sentences (19.90 wds/sec; 1.24 sents/sec).
alvas@ubi:~/stanford-parser-full-2015-04-20$ cat input-gb18030.txt.parsed.80.stp
(IP
  (NP
    (CP
      (IP
        (VP (VV Ӧ�о���)))
      (DEC ��))
    (ADJP (JJ �ḻ))
    (NP (NN ѡ��)))
  (VP (VV ��)
    (VP
      (ADVP (AD ��))
      (PP (P Ϊ)
        (NP
          (DNP
            (NP (PN ��))
            (DEG ��))
          (NP (NN �ó�))))
      (VP (VV ����)
        (NP
          (DNP
            (ADJP (JJ ����))
            (DEG ��))
          (NP (NN ����) (NN ����))))))
  (PU ��))

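The parse structure looks right, so it seems to be working, but the output is in GB18030 and unreadable on my UTF-8 terminal. How do I convert the file to UTF-8?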

I tried this, but it only printed the first line:

$ cat input-gb18030.txt.parsed.80.stp | python -c "print raw_input().decode('gb18030').encode('utf8')"
(IP

Here are my concluding questions:

  • How do I convert from GB18030 to UTF-8, and from UTF-8 to GB18030?
  • How do I use the Stanford LexParser on Chinese UTF-8 text?

I followed your steps, and they show that you can use an encoding converter to achieve your goal.

I used iconv in my testing.

iconv -f gb18030 -t utf-8 input2.txt.parsed.80.stp -o output 

Here is the output:

dmk@dmk-debian /t/stanford-parser-full-2015-04-20 ❯❯❯ cat input2.txt.parsed.80.stp
(IP
  (NP
    (CP
      (IP
        (VP (VV Ӧ�о���)))
      (DEC ��))
    (ADJP (JJ �ḻ))
    (NP (NN ѡ��)))
  (VP (VV ��)
    (VP
      (ADVP (AD ��))
      (PP (P Ϊ)
        (NP
          (DNP
            (NP (PN ��))
            (DEG ��))
          (NP (NN �ó�))))
      (VP (VV ����)
        (NP
          (DNP
            (ADJP (JJ ����))
            (DEG ��))
          (NP (NN ����) (NN ����))))))
  (PU ��))

dmk@dmk-debian /t/stanford-parser-full-2015-04-20 ❯❯❯ iconv -f gb18030 -t utf-8 input2.txt.parsed.80.stp -o output
dmk@dmk-debian /t/stanford-parser-full-2015-04-20 ❯❯❯ cat output
(IP
  (NP
    (CP
      (IP
        (VP (VV 应有尽有)))
      (DEC 的))
    (ADJP (JJ 丰富))
    (NP (NN 选择)))
  (VP (VV 定)
    (VP
      (ADVP (AD 将))
      (PP (P 为)
        (NP
          (DNP
            (NP (PN 您))
            (DEG 的))
          (NP (NN 旅程))))
      (VP (VV 增添)
        (NP
          (DNP
            (ADJP (JJ 无数))
            (DEG 的))
          (NP (NN 赏心) (NN 乐事))))))
  (PU 。))
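If you would rather stay in Python, the same conversion can be sketched as below. The one-liner in the question printed only the first line because `raw_input()` reads a single line, and the parse tree spans many. This version reads the whole file. It assumes Python 3; `convert_encoding` and the file names are hypothetical helpers for illustration, not part of the Stanford tools. It covers both directions: UTF-8 to GB18030 before parsing, and GB18030 to UTF-8 afterwards.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path, src_enc, dst_enc):
    """Re-encode an entire text file from src_enc to dst_enc."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding=src_enc, newline="") as src:
        text = src.read()  # read every line, not just the first
    with open(dst_path, "w", encoding=dst_enc, newline="") as dst:
        dst.write(text)


# Demo in a temporary directory so nothing in the parser folder is touched.
workdir = tempfile.mkdtemp()
utf8_in = os.path.join(workdir, "input.txt")
gb_in = os.path.join(workdir, "input-gb18030.txt")
utf8_back = os.path.join(workdir, "roundtrip.txt")

sentence = "应有尽有 的 丰富 选择 。\n"
with open(utf8_in, "w", encoding="utf-8", newline="") as f:
    f.write(sentence)

# UTF-8 -> GB18030: the parser's default input encoding per README.txt.
convert_encoding(utf8_in, gb_in, "utf-8", "gb18030")

# GB18030 -> UTF-8: what a UTF-8 terminal can display.
convert_encoding(gb_in, utf8_back, "gb18030", "utf-8")

with open(utf8_back, "r", encoding="utf-8", newline="") as f:
    roundtrip = f.read()

print(roundtrip == sentence)  # True: the round trip is lossless
```

In the workflow above, you would call `convert_encoding("input.txt", "input-gb18030.txt", "utf-8", "gb18030")` before running the parser, then convert the `.stp` output back with the encodings swapped.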
