python - A fast and efficient, not-so-complex word content filter
Without getting into a Bayesian-level content classification project, I'm trying to make a simple profanity filter for Twitter accounts.

In essence, I join all of a user's tweets into one large text blob and run the content against my filter, which works like this:
    badwords = ['bad', 'worse', 'momwouldbeangry', 'thousandsofperversesayings', 'xxx', 'etc']

    s = 'get free xxx etc'

    score = 0
    for b in badwords:
        if b in s:
            score = score + 1
I have a list of 3,000 bad words (what a perverted world we live in!), and ideally I'd like to build the score not just on whether a word occurs, but on how many times each word occurs: if a word occurs twice, the score should be incremented twice.

The score generator above is extremely simplistic: it re-evaluates the string thousands of times, and it doesn't increment the way I'd like.

How can it be adjusted for performance and accuracy?
So len(badwords) == 3000 and, with tweet_words = s.split(), typically len(tweet_words) < len(badwords); hence

    for b in badwords:
        if b in s:
            score = score + 1

is inefficient.
The first thing to do: make badwords a frozenset. That way, checking whether a word occurs in it is much faster.
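For illustration, a minimal sketch of that conversion (the word list is a made-up stand-in for the real 3,000-entry one):

    # Membership tests on a frozenset are O(1) on average,
    # versus an O(n) scan through a 3,000-element list.
    badwords = frozenset(['bad', 'worse', 'momwouldbeangry',
                          'thousandsofperversesayings', 'xxx', 'etc'])

    print('xxx' in badwords)   # True, found by hash lookup
    print('good' in badwords)  # False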
Then, search for the tweet's words in badwords, not the other way around:

    for t_word in tweet_words:
        if t_word in badwords:
            score = score + 1
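Since the question also asks to count repeated occurrences, a possible variant (an addition, not part of the snippets above) tallies per-word counts with collections.Counter; the input data is hypothetical:

    from collections import Counter

    # Hypothetical inputs, reusing the names from above.
    badwords = frozenset(['bad', 'worse', 'xxx'])
    tweet_words = 'get free xxx etc xxx bad'.split()

    # Count every token, then sum the counts of the bad ones; each
    # occurrence of a bad word contributes 1 to the score.
    counts = Counter(tweet_words)
    score = sum(n for word, n in counts.items() if word in badwords)
    print(score)  # 3: 'xxx' twice, 'bad' once

Note the plain loop above already counts each occurrence, since it iterates over tokens rather than over the bad-word list; the Counter version just makes the per-word tallies available as well.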
Then, make it a bit more functional:

    score_function = lambda word: 0 if len(word) < 3 or (word not in badwords) else 1
    score = lambda tweet: sum(score_function(word.lower()) for word in tweet.split())

which is faster than the full loops, because Python has to construct and destruct fewer temporary contexts (that's technically a bit misleading, but you save a lot of CPython PyObject creations).
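Putting it together, a self-contained sketch of the whole approach (the word list and tweets are placeholder data):

    # Runnable sketch combining the suggestions above.
    badwords = frozenset(['bad', 'worse', 'momwouldbeangry',
                          'thousandsofperversesayings', 'xxx'])

    score_function = lambda word: 0 if len(word) < 3 or (word not in badwords) else 1
    score = lambda tweet: sum(score_function(word.lower()) for word in tweet.split())

    # Join a user's tweets into one blob, as the question describes.
    tweets = ['get free XXX now', 'have a nice day', 'bad bad xxx']
    blob = ' '.join(tweets)

    print(score(blob))  # 4: 'xxx' twice (case-folded), 'bad' twice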