Difference between R tm package stemDocument function behavior and original Porter stemming algorithm
Using R's stemDocument function from the tm package (see session info below), I get:
    > library(tm)
    > stemDocument("cmos")
    [1] "cmos"
However, when using this implementation in Java, or this "online Porter stemmer", the result of stemming "cmos" is "cmo".
Also, in the original article, the Step 1a rule says:
    Step 1a

    SSES -> SS      caresses -> caress
    IES  -> I       ponies   -> poni
                    ties     -> ti
    SS   -> SS      caress   -> caress
    S    ->         cats     -> cat
This means that the string "cmos", which ends in "s", should be stemmed to "cmo" by deleting the final "s".
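To make that expectation concrete, here is a minimal sketch of Step 1a alone, written directly from the rule table quoted above. The function name `porter_step1a` is hypothetical and is not part of tm or SnowballC. It covers only this one step of the original algorithm, not a full stemmer, and it makes no claim about what stemDocument does internally.

```r
# Sketch of Step 1a of the original Porter algorithm only.
# porter_step1a is a hypothetical helper, not a tm or SnowballC function.
# Suffixes are tested longest first, so "sses" and "ss" are handled
# before the bare "s" rule can apply.
porter_step1a <- function(word) {
  if (grepl("sses$", word)) return(sub("sses$", "ss", word))  # SSES -> SS
  if (grepl("ies$", word))  return(sub("ies$", "i", word))    # IES  -> I
  if (grepl("ss$", word))   return(word)                      # SS   -> SS
  if (grepl("s$", word))    return(sub("s$", "", word))       # S    -> (deleted)
  word                                                        # no rule applies
}

words <- c("caresses", "ponies", "ties", "caress", "cats", "cmos")
vapply(words, porter_step1a, character(1), USE.NAMES = FALSE)
# [1] "caress" "poni"   "ti"     "caress" "cat"    "cmo"
```

Under these rules "cmos" falls through to the last case, the bare "s" rule, which is why "cmo" is the result I expect.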
So why does R's stemDocument function behave differently?
    > sessionInfo()
    R version 3.1.2 (2014-10-31)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252
    [2] LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] tm_0.6    NLP_0.1-5

    loaded via namespace (and not attached):
    [1] parallel_3.1.2  slam_0.1-32     SnowballC_0.5.1 tools_3.1.2