Difference between the R tm package's stemDocument function and the original Porter stemming algorithm


Using R's stemDocument function from the tm package (see session info below), I get:

    library(tm)
    stemDocument("cmos")
    [1] "cmos"

However, both when using this implementation in Java and when using this "online Porter stemmer", the result of stemming "cmos" is "cmo".

Also, in the original article, the Step 1a rule says:

    Step 1a

        SSES -> SS                         caresses  ->  caress
        IES  -> I                          ponies    ->  poni
                                           ties      ->  ti
        SS   -> SS                         caress    ->  caress
        S    ->                            cats      ->  cat

This means that the string "cmos", which ends in "s", should be stemmed to "cmo" by deleting the "s".
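To make my reading of the table concrete, here is a minimal sketch of Step 1a alone in base R. The helper name porter_step1a is made up for illustration, it assumes lowercase input, and it does not implement the rest of the algorithm. The rules are tried longest suffix first, as the article describes.

```r
# Hypothetical helper: Step 1a of the original Porter algorithm only.
# Assumes lowercase input; the first (longest) matching suffix wins.
porter_step1a <- function(word) {
  if (grepl("sses$", word)) return(sub("sses$", "ss", word))  # SSES -> SS
  if (grepl("ies$", word))  return(sub("ies$", "i", word))    # IES  -> I
  if (grepl("ss$", word))   return(word)                      # SS   -> SS
  if (grepl("s$", word))    return(sub("s$", "", word))       # S    -> (nothing)
  word                                                        # no rule applies
}

words <- c("caresses", "ponies", "ties", "caress", "cats", "cmos")
vapply(words, porter_step1a, character(1), USE.NAMES = FALSE)
# returns: caress, poni, ti, caress, cat, cmo
```

Under this reading, "cmos" falls through to the last rule and loses its final "s".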

So why is the behavior of R's stemDocument function different?

    > sessionInfo()
    R version 3.1.2 (2014-10-31)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] tm_0.6    NLP_0.1-5

    loaded via namespace (and not attached):
    [1] parallel_3.1.2  slam_0.1-32     SnowballC_0.5.1 tools_3.1.2

