python - A fast and efficient, not-so-complex word content filter
Without getting into a Bayesian-level content classification project, I'm trying to make a simple profanity filter for Twitter accounts.

In essence, I join all of a user's tweets into one large text blob and run the content against my filter, which works like this:
    badwords = ['bad', 'worse', 'momwouldbeangry', 'thousandsofperversesayings', 'xxx', 'etc']

    s = 'get free xxx etc'

    score = 0
    for b in badwords:
        if b in s:
            score = score + 1
I have a list of 3,000 bad words (what a perverted world we live in!), and ideally I'd like to build the score not just on whether a word occurs, but on how many times each word occurs: if a word occurs twice, the score should be incremented twice.

The score generator above is extremely simplistic: it re-evaluates the string thousands of times, and it doesn't increment the way I'd like.

How can it be adjusted for performance and accuracy?
So len(badwords) == 3000 and, with tweet_words = s.split(), typically len(tweet_words) < len(badwords); hence

    for b in badwords:
        if b in s:
            score = score + 1

is inefficient.
The first thing to do: make badwords a frozenset. That way, checking whether a word occurs in it is much faster.
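For illustration, a minimal sketch of that conversion (the word list is a made-up stand-in for the real 3,000-entry one):

    # Membership tests on a frozenset are O(1) on average,
    # versus an O(n) scan through a 3,000-element list.
    badwords = frozenset(['bad', 'worse', 'momwouldbeangry',
                          'thousandsofperversesayings', 'xxx', 'etc'])

    print('xxx' in badwords)   # True, found by hash lookup
    print('good' in badwords)  # False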
Then, search for the tweet's words in badwords, not the other way around:

    for t_word in tweet_words:
        if t_word in badwords:
            score = score + 1
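Since the question also asks to count repeated occurrences, a possible variant (an addition, not part of the snippets above) tallies per-word counts with collections.Counter; the input data is hypothetical:

    from collections import Counter

    # Hypothetical inputs, reusing the names from above.
    badwords = frozenset(['bad', 'worse', 'xxx'])
    tweet_words = 'get free xxx etc xxx bad'.split()

    # Count every token, then sum the counts of the bad ones; each
    # occurrence of a bad word contributes 1 to the score.
    counts = Counter(tweet_words)
    score = sum(n for word, n in counts.items() if word in badwords)
    print(score)  # 3: 'xxx' twice, 'bad' once

Note the plain loop above already counts each occurrence, since it iterates over tokens rather than over the bad-word list; the Counter version just makes the per-word tallies available as well.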
Then, make it a bit more functional:

    score_function = lambda word: 0 if len(word) < 3 or (word not in badwords) else 1
    score = lambda tweet: sum(score_function(word.lower()) for word in tweet.split())

which is faster than the full loops, because Python has to construct and destruct fewer temporary contexts (that's technically a bit misleading, but you save a lot of CPython PyObject creations).
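Putting it together, a self-contained sketch of the whole approach (the word list and tweets are placeholder data):

    # Runnable sketch combining the suggestions above.
    badwords = frozenset(['bad', 'worse', 'momwouldbeangry',
                          'thousandsofperversesayings', 'xxx'])

    score_function = lambda word: 0 if len(word) < 3 or (word not in badwords) else 1
    score = lambda tweet: sum(score_function(word.lower()) for word in tweet.split())

    # Join a user's tweets into one blob, as the question describes.
    tweets = ['get free XXX now', 'have a nice day', 'bad bad xxx']
    blob = ' '.join(tweets)

    print(score(blob))  # 4: 'xxx' twice (case-folded), 'bad' twice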