cluster analysis - Mahout clustering: How to retrieve the name of a named vector


I want to cluster multiple documents using Mahout. The clustering works fine, but I have no idea how to find out which documents are located in each cluster.

I read that I can use the option --namedVector when creating the sparse files to keep the ID, but how can I retrieve the ID after the clustering is completed?


Right now I am doing the following steps:

I have a directory with one file for each document. The files are in the following format, with the ID of the document as the filename:

filename: documentid.txt
[title]
[content]
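Since the document ID lives only in the filename, any name recovered from a named vector has to be mapped back to that ID. The sketch below assumes the vector name resembles the file path, for example `/12345.txt`; that naming and the helper `document_id_from_name` are illustrative assumptions, not something the Mahout output shown here confirms.

```python
from pathlib import PurePosixPath


def document_id_from_name(vector_name: str) -> str:
    """Return the document ID for a vector name such as '/12345.txt'.

    Assumption: the name is a POSIX-style path whose last component is
    '<documentid>.txt'. Leading slashes and directories are ignored.
    """
    # .stem drops the directory part and the final extension only.
    return PurePosixPath(vector_name).stem


print(document_id_from_name("/12345.txt"))
```

If the names turn out to carry no extension or a different prefix, only this one helper needs to change.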

I create the sparse directory with named vectors using:

./mahout seqdirectory -i tmp/es-out -o tmp/es-out-seqdir -c UTF-8 -chunk 64 -xm sequential
./mahout seq2sparse -i tmp/es-out-seqdir -o tmp/es-out-sparse --maxDFPercent 85 --namedVector

Then I can cluster the results and create a dump:

./mahout kmeans -i tmp/es-out-sparse/tfidf-vectors -c tmp/es-kmeans-clusters -o tmp/es-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
./mahout clusterdump -i tmp/es-kmeans/clusters-10-final -o tmp/clusterdump -d tmp/es-out-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir tmp/es-kmeans/clusteredPoints
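One way to approach the question is to work from the point assignments that kmeans writes when run with --clustering (the points directory handed to clusterdump above), where each record pairs a cluster ID with a vector. The sketch below assumes those records have already been extracted into `(cluster_id, vector_name)` pairs by some other means; the extraction step, the sample data, and the function name `group_by_cluster` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_cluster(
    assignments: Iterable[Tuple[int, str]]
) -> Dict[int, List[str]]:
    """Group vector names by the cluster ID they were assigned to."""
    clusters: Dict[int, List[str]] = defaultdict(list)
    for cluster_id, vector_name in assignments:
        clusters[cluster_id].append(vector_name)
    return dict(clusters)


# Hypothetical sample data; real pairs would come from the points directory.
sample = [(190, "/doc-a.txt"), (42, "/doc-b.txt"), (190, "/doc-c.txt")]

for cluster_id, names in sorted(group_by_cluster(sample).items()):
    print(cluster_id, names)
```

Keeping the grouping separate from the extraction means the same function works whichever tool produces the pairs.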

The dump looks like this:

:vl-190{n=1 c=[1:3.407, 110:6.193, 2007:3.736, about:1.762, according:2.948, account:3.507, acting:6.
    top terms:
        epa                                     =>  13.471728324890137
        mountaintop                             =>  11.364262580871582
        mine                                    =>  10.942587852478027
    weight : [props - optional]:  point:
    [...]
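For post-processing a dump like this, the `term => weight` entries can be pulled out with a small parser. This is a minimal sketch that assumes each term sits on its own line in the actual dump file; the function name `parse_top_terms` is illustrative, and the sample text reuses only the three terms shown above.

```python
import re
from typing import List, Tuple

# Matches lines such as "    epa      =>  13.471728324890137".
TERM_LINE = re.compile(r"^\s*(\S+)\s*=>\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$")


def parse_top_terms(dump_text: str) -> List[Tuple[str, float]]:
    """Extract (term, weight) pairs from the 'top terms' part of a dump."""
    terms: List[Tuple[str, float]] = []
    for line in dump_text.splitlines():
        match = TERM_LINE.match(line)
        if match:  # header and cluster lines do not match and are skipped
            terms.append((match.group(1), float(match.group(2))))
    return terms


sample_dump = """\
    top terms:
        epa                                     =>  13.471728324890137
        mountaintop                             =>  11.364262580871582
        mine                                    =>  10.942587852478027
"""

print(parse_top_terms(sample_dump))
```

Note that this recovers only the terms of each cluster; it does not reveal which documents belong to it.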

K-means in Mahout is a toy.

You can use it for howtos and tutorials, but for real use it is too slow, too limited, and too hard to use. (Also, k-means results are not half as good as people think... most of the time they are dogfood.)

Benchmark other tools, and you'll be surprised big time.

