java - Chinese sentence segmenter with Stanford CoreNLP


I'm using the Stanford CoreNLP system with the following command:

java -cp stanford-corenlp-3.5.2.jar:stanford-chinese-corenlp-2015-04-20-models.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators segment,ssplit -file input.txt

This works great on small Chinese texts. However, I need to train an MT system that only requires me to segment my input, so I only need to use -annotators segment, but with these parameters the system outputs an empty file. I could run the tool with the ssplit annotator as well, but I don't want to do that: my input is a parallel corpus that already contains one sentence per line, and ssplit may not split the sentences along those lines, which would create problems in the parallel data.

Is there a way to tell the system to do segmentation only, or to tell it that the input already contains exactly one sentence per line?

Use the Stanford Segmenter instead:

$ wget http://nlp.stanford.edu/software/stanford-segmenter-2015-04-20.zip
$ unzip stanford-segmenter-2015-04-20.zip
$ echo "应有尽有的丰富选择定将为您的旅程增添无数的赏心乐事" > input.txt
$ bash stanford-segmenter-2015-04-20/segment.sh ctb input.txt utf-8 0 > output.txt
$ cat output.txt
应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事
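Since the input is a parallel corpus with one sentence per line, it may be worth confirming that the segmented output still has the same number of lines as the input before feeding it to the MT system. The following is only a minimal sketch: the helper name same_line_count is hypothetical and is not part of any Stanford tool.

```shell
# Hypothetical helper (not part of the Stanford tools):
# succeeds only if the two files have the same number of lines.
same_line_count() {
    in_lines=$(wc -l < "$1")
    out_lines=$(wc -l < "$2")
    [ "$in_lines" -eq "$out_lines" ]
}
```

With the file names used above, it could be called as `same_line_count input.txt output.txt && echo "line counts match"`. A matching count does not prove the alignment is intact, but a mismatch is a quick sign that something went wrong.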

Other than the Stanford Segmenter, there are many other segmenters that might be more suitable; see the question "Is there an open-source or freely available Chinese segmentation algorithm available?"


To continue using the Stanford NLP tools for POS tagging:

$ wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-04-20.zip
$ unzip stanford-postagger-full-2015-04-20.zip
$ cd stanford-postagger-full-2015-01-30/
$ echo "应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事" > input.txt
$ bash stanford-postagger.sh models/chinese-distsim.tagger input.txt > output.txt
$ cat output.txt
应有尽有#VV 的#DEC 丰富#JJ 选择#NN 定#VV 将#AD 为#P 您#PN 的#DEG 旅程#NN 增添#VV 无数#CD 的#DEG 赏心#NN 乐事#NN
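If a later step needs the words and tags separately, the word#TAG output shown above can be split with standard Unix tools. This is a rough sketch, assuming the tag is whatever follows the last # in each token; the function name split_tagged is hypothetical.

```shell
# Hypothetical helper: read "word#TAG word#TAG ..." on stdin and
# print one "word<TAB>TAG" pair per line, splitting at the last '#'.
split_tagged() {
    tr ' ' '\n' | awk 'NF {
        i = match($0, /#[^#]*$/)          # position of the last "#"
        if (i) print substr($0, 1, i - 1) "\t" substr($0, i + 1)
    }'
}
```

For example, `split_tagged < output.txt` would print each word of the tagged sentence on its own line, followed by a tab and its tag. Note that this drops the sentence boundaries, so it suits inspection more than building training data.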
