python - Save and reuse TfidfVectorizer in scikit-learn


I am using TfidfVectorizer in scikit-learn to create a matrix from text data. I need to save this object so that I can reuse it later. I tried to use pickle, but it gave the following error.

    loc = open('vectorizer.obj', 'w')
    pickle.dump(self.vectorizer, loc)
    *** TypeError: can't pickle instancemethod objects

I also tried using joblib from sklearn.externals, which again gave a similar error. Is there a way to save this object so that I can reuse it later?

Here is my full object:

    class ChangeToMatrix(object):
        def __init__(self, ngram_range=(1, 1), tokenizer=StemTokenizer()):
            from sklearn.feature_extraction.text import TfidfVectorizer
            self.vectorizer = TfidfVectorizer(ngram_range=ngram_range, analyzer='word', lowercase=True,
                                              token_pattern='[a-zA-Z0-9]+', strip_accents='unicode',
                                              tokenizer=tokenizer)

        def load_ref_text(self, text_file):
            textfile = open(text_file, 'r')
            lines = textfile.readlines()
            textfile.close()
            lines = ' '.join(lines)
            sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
            sentences = [sent_tokenizer.tokenize(lines.strip())]
            sentences1 = [item.strip().strip('.') for sublist in sentences for item in sublist]
            chk2 = pd.DataFrame(self.vectorizer.fit_transform(sentences1).toarray())  # vectorizer is transformed in this step
            return sentences1, [chk2]

        def get_processed_data(self, data_loc):
            ref_sentences, ref_dataframes = self.load_ref_text(data_loc)
            loc = open("indexeddata/vectorizer.obj", "w")
            pickle.dump(self.vectorizer, loc)  # getting error here
            loc.close()
            return ref_sentences, ref_dataframes

Firstly, it's better to leave the import at the top of your code instead of within your class:

    from sklearn.feature_extraction.text import TfidfVectorizer

    class ChangeToMatrix(object):
        def __init__(self, ngram_range=(1, 1), tokenizer=StemTokenizer()):
            ...

Next, StemTokenizer doesn't seem to be a canonical class. Possibly you got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or maybe somewhere else, so we'll assume it returns a list of strings.

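    from nltk import word_tokenize
    from nltk.corpus import wordnet as wn

    class StemTokenizer(object):
        def __init__(self):
            # Words to drop from the output entirely.
            self.ignore_set = {'footnote', 'nietzsche', 'plato', 'mr.'}

        def __call__(self, doc):
            words = []
            for word in word_tokenize(doc):
                word = word.lower()
                w = wn.morphy(word)  # WordNet base form, or None if not found
                if w and len(w) > 1 and w not in self.ignore_set:
                    words.append(w)
            return words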

Now, to answer your actual question: it's possible that you need to open the file in byte mode before dumping the pickle, i.e.:

    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from nltk import word_tokenize
    >>> import cPickle as pickle
    >>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=word_tokenize)
    >>> vectorizer
    TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(0, 2), norm=u'l2', preprocessor=None, smooth_idf=True,
            stop_words=None, strip_accents='unicode', sublinear_tf=False,
            token_pattern='[a-zA-Z0-9]+',
            tokenizer=<function word_tokenize at 0x7f5ea68e88c0>, use_idf=True,
            vocabulary=None)
    >>> with open('vectorizer.pk', 'wb') as fin:
    ...     pickle.dump(vectorizer, fin)
    ...
    >>> exit()
    alvas@ubi:~$ ls -lah vectorizer.pk
    -rw-rw-r-- 1 alvas alvas 763 Jun 15 14:18 vectorizer.pk

Note: using the with idiom for file I/O automatically closes the file once you leave the with scope.
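To make the byte-mode point concrete, here is a minimal sketch, assuming Python 3 (the sessions in this answer are Python 2) and using only the standard library. The `state` dictionary is a hypothetical stand-in for the vectorizer, and the temporary path is illustrative. On Python 3, dumping to a file opened in text mode raises a TypeError, while binary mode works for both saving and loading:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the vectorizer; any picklable object behaves the same here.
state = {"vocabulary": {"save": 0, "reuse": 1}, "ngram_range": (1, 1)}

path = os.path.join(tempfile.mkdtemp(), "vectorizer.pk")

# Text mode: pickle writes bytes, so on Python 3 this raises TypeError.
text_mode_failed = False
try:
    with open(path, "w") as fout:
        pickle.dump(state, fout)
except TypeError:
    text_mode_failed = True

# Binary mode for writing ...
with open(path, "wb") as fout:
    pickle.dump(state, fout)

# ... and binary mode again for reading the object back.
with open(path, "rb") as fin:
    restored = pickle.load(fin)
```

The same 'wb' and 'rb' pairing should apply when the object being dumped is a TfidfVectorizer.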

Regarding your issue with SnowballStemmer(), note that SnowballStemmer('english') is an object, while the stemming function is SnowballStemmer('english').stem.

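Important:

  • TfidfVectorizer's tokenizer parameter expects a callable that takes a string and returns a list of strings.
  • But the Snowball stemmer does not take a string as input and return a list of strings; its stem method takes a single word and returns a single stemmed word.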

So you will need to do this:

    >>> from nltk.stem import SnowballStemmer
    >>> from nltk import word_tokenize
    >>> stemmer = SnowballStemmer('english').stem
    >>> def stem_tokenize(text):
    ...     return [stemmer(i) for i in word_tokenize(text)]
    ...
    >>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=stem_tokenize)
    >>> with open('vectorizer.pk', 'wb') as fin:
    ...     pickle.dump(vectorizer, fin)
    ...
    >>> exit()
    alvas@ubi:~$ ls -lah vectorizer.pk
    -rw-rw-r-- 1 alvas alvas 758 Jun 15 15:55 vectorizer.pk
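The sessions above pickle the vectorizer before it has been fitted. To reuse it for transforming new text later, it would presumably need to be fitted first so that the vocabulary and idf weights are saved with it. The following is a minimal sketch of that round trip, assuming Python 3 and a recent scikit-learn. The documents, the temporary path, and the default tokenizer are illustrative choices, not taken from the question:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data (assumption: not from the original question).
train_docs = ["save the vectorizer", "reuse the vectorizer later"]
new_docs = ["reuse it later"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(train_docs)  # fit BEFORE pickling so the learned state is stored

path = os.path.join(tempfile.mkdtemp(), "vectorizer.pk")
with open(path, "wb") as fout:
    pickle.dump(vectorizer, fout)

# Later, possibly in another process: load and transform without refitting.
with open(path, "rb") as fin:
    loaded = pickle.load(fin)

original_matrix = vectorizer.transform(new_docs)
reloaded_matrix = loaded.transform(new_docs)
```

If a custom tokenizer such as stem_tokenize is used, keep in mind that pickle stores functions by reference, so the same function has to be importable under the same module and name in the session that loads the file.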
