python - Save and reuse TfidfVectorizer in scikit learn
I am using TfidfVectorizer in scikit-learn to create a matrix from text data. Now I need to save this object so I can reuse it later. I tried to use pickle, but it gave the following error.

    loc = open('vectorizer.obj', 'w')
    pickle.dump(self.vectorizer, loc)
    *** TypeError: can't pickle instancemethod objects

I tried using joblib in sklearn.externals, which again gave a similar error. Is there any way to save this object so that I can reuse it later?

Here is my full object:
    class ChangeToMatrix(object):
        def __init__(self, ngram_range=(1, 1), tokenizer=StemTokenizer()):
            from sklearn.feature_extraction.text import TfidfVectorizer
            self.vectorizer = TfidfVectorizer(ngram_range=ngram_range, analyzer='word', lowercase=True,
                                              token_pattern='[a-zA-Z0-9]+', strip_accents='unicode',
                                              tokenizer=tokenizer)

        def load_ref_text(self, text_file):
            textfile = open(text_file, 'r')
            lines = textfile.readlines()
            textfile.close()
            lines = ' '.join(lines)
            sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
            sentences = [sent_tokenizer.tokenize(lines.strip())]
            sentences1 = [item.strip().strip('.') for sublist in sentences for item in sublist]
            chk2 = pd.DataFrame(self.vectorizer.fit_transform(sentences1).toarray())  # vectorizer is transformed in this step
            return sentences1, [chk2]

        def get_processed_data(self, data_loc):
            ref_sentences, ref_dataframes = self.load_ref_text(data_loc)
            loc = open("indexeddata/vectorizer.obj", "w")
            pickle.dump(self.vectorizer, loc)  # getting error here
            loc.close()
            return ref_sentences, ref_dataframes
Firstly, it's better to leave the import at the top of your code instead of within your class:
    from sklearn.feature_extraction.text import TfidfVectorizer

    class ChangeToMatrix(object):
        def __init__(self, ngram_range=(1, 1), tokenizer=StemTokenizer()):
            ...

Next, StemTokenizer doesn't seem to be a canonical class. Possibly you've got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or maybe somewhere else, so we'll assume it returns a list of strings.
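    from nltk import word_tokenize
    from nltk.corpus import wordnet as wn

    class StemTokenizer(object):
        def __init__(self):
            # Words to drop from the output even if they survive stemming.
            self.ignore_set = {'footnote', 'nietzsche', 'plato', 'mr.'}

        def __call__(self, doc):
            words = []
            for word in word_tokenize(doc):
                word = word.lower()
                w = wn.morphy(word)  # base form of the word, or None if WordNet has none
                if w and len(w) > 1 and w not in self.ignore_set:
                    words.append(w)
            return words

Now to answer your actual question: it's possible that you need to open the file in byte mode before dumping a pickle, i.e.: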
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from nltk import word_tokenize
    >>> import cPickle as pickle
    >>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=word_tokenize)
    >>> vectorizer
    TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(0, 2), norm=u'l2', preprocessor=None, smooth_idf=True,
            stop_words=None, strip_accents='unicode', sublinear_tf=False,
            token_pattern='[a-zA-Z0-9]+',
            tokenizer=<function word_tokenize at 0x7f5ea68e88c0>, use_idf=True,
            vocabulary=None)
    >>> with open('vectorizer.pk', 'wb') as fin:
    ...     pickle.dump(vectorizer, fin)
    ...
    >>> exit()
    alvas@ubi:~$ ls -lah vectorizer.pk
    -rw-rw-r-- 1 alvas alvas 763 Jun 15 14:18 vectorizer.pk

Note: using the with idiom for file I/O automatically closes the file once you leave the with scope.
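To make the byte-mode point concrete, here is a minimal sketch of the write and read halves together, assuming Python 3 (the session above is Python 2, where the standard pickle module replaces cPickle). The `state` dictionary is a hypothetical stand-in for a fitted vectorizer, used only so the snippet runs without scikit-learn. The file handling is the same for any picklable object.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted vectorizer; not a real scikit-learn object.
state = {"vocabulary": {"save": 0, "reuse": 1}, "ngram_range": (1, 1)}

path = os.path.join(tempfile.mkdtemp(), "vectorizer.pk")

# Text mode: on Python 3 pickle writes bytes, so dumping to a 'w' file fails.
text_mode_failed = False
try:
    with open(path, "w") as fout:
        pickle.dump(state, fout)
except TypeError:
    text_mode_failed = True

# Byte mode for writing ('wb') and for reading back ('rb').
with open(path, "wb") as fout:
    pickle.dump(state, fout)
with open(path, "rb") as fin:
    restored = pickle.load(fin)
```

With a real fitted TfidfVectorizer in place of the stand-in, the restored object can be reused by calling transform() on new text. Calling fit_transform() again would relearn the vocabulary.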
Regarding your issue with SnowballStemmer(), note that SnowballStemmer('english') is an object, while the stemming function is SnowballStemmer('english').stem.

IMPORTANT:

TfidfVectorizer's tokenizer parameter expects a callable that takes a string and returns a list of strings, but the Snowball stemmer does not take a string as input and return a list of strings.

So you will need to do this:
    >>> from sklearn.feature_extraction.text import TfidfVectorizer
    >>> from nltk.stem import SnowballStemmer
    >>> from nltk import word_tokenize
    >>> import cPickle as pickle
    >>> stemmer = SnowballStemmer('english').stem
    >>> def stem_tokenize(text):
    ...     return [stemmer(i) for i in word_tokenize(text)]
    ...
    >>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=stem_tokenize)
    >>> with open('vectorizer.pk', 'wb') as fin:
    ...     pickle.dump(vectorizer, fin)
    ...
    >>> exit()
    alvas@ubi:~$ ls -lah vectorizer.pk
    -rw-rw-r-- 1 alvas alvas 758 Jun 15 15:55 vectorizer.pk
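The tokenizer contract described above (one string in, a list of strings out) can be checked without NLTK. The sketch below is illustrative only: `toy_stem` is a hypothetical stand-in for SnowballStemmer('english').stem, and str.split() stands in for word_tokenize. It also shows that a module-level function like this survives a pickle round trip, which is what lets a vectorizer that holds it be saved.

```python
import pickle


def toy_stem(word):
    # Hypothetical word-level stemmer: one word in, one word out.
    # It only strips a trailing "s" and is not a real stemming algorithm.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stem_tokenize(text):
    # Tokenizer contract: one string in, a list of strings out.
    # str.split() stands in for nltk's word_tokenize here.
    return [toy_stem(token) for token in text.lower().split()]


tokens = stem_tokenize("Vectorizers save tokens")

# Module-level functions are pickled by reference (module and name),
# so the round trip gives back a usable callable.
restored = pickle.loads(pickle.dumps(stem_tokenize))
```

Because pickle stores only the function's module and name, stem_tokenize has to be importable under the same name in whatever program later loads the saved vectorizer.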