lucene - Shingles in Elasticsearch which respect punctuation -

June 15, 2011

I am building an address matching engine for UK addresses in Elasticsearch and have found shingles very useful, but I am seeing some issues when it comes to punctuation. A query for "4 walmley close" returns the following matches:

1. units 3 and 4, walmley chambers, 3 walmley close
2. flat 4, walmley court, 10 walmley close
3. co-operative retail services ltd, 4 walmley close

The true match is number 3, but 1 and 2 also match (falsely) because both produce a '4 walmley' shingle when turned into shingles. I would like to tell the shingle analyzer not to generate shingles that straddle commas. So, for example, for 1) I currently get:

units 3
3 and
and 4
4 walmley
walmley chambers
chambers 3
3 walmley
walmley close

...when in actual fact what I want is...

units 3
3 and
and 4
walmley chambers
3 walmley
walmley close
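
To make the difference between the two lists concrete, here is a minimal pure-Python sketch. It is not Elasticsearch code: `naive_shingles` only approximates a word tokenizer that throws punctuation away, and `comma_aware_shingles` is a hypothetical helper that shows the desired behaviour by shingling each comma-delimited segment on its own.

```python
import re

def bigrams(tokens):
    """Join each pair of adjacent tokens into a two-word shingle."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def naive_shingles(text):
    # Approximation of a tokenizer that drops punctuation before shingling.
    return bigrams(re.findall(r"\w+", text.lower()))

def comma_aware_shingles(text):
    # Shingle each comma-delimited segment separately,
    # so no shingle can straddle a comma.
    shingles = []
    for segment in text.lower().split(","):
        shingles.extend(bigrams(segment.split()))
    return shingles

address = "units 3 and 4, walmley chambers, 3 walmley close"
print(naive_shingles(address))        # includes '4 walmley' and 'chambers 3'
print(comma_aware_shingles(address))  # both straddling shingles are gone
```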

My current settings are below. I have experimented with swapping the tokenizer from standard to whitespace. This helps in that it retains the commas and could potentially avoid the situation above (i.e. I end up with a '4, walmley' shingle in addresses 1 and 2), but I end up with lots of unusable shingles in my index, and with 70 million documents I need to keep the index size down.
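
As a rough illustration of the whitespace-tokenizer trade-off, the sketch below shingles address 1 in plain Python, with `str.split` standing in for the whitespace tokenizer. The function name is made up for this example.

```python
def whitespace_shingles(text):
    # Splitting on whitespace keeps each comma attached to the preceding word.
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

address = "units 3 and 4, walmley chambers, 3 walmley close"
shingles = whitespace_shingles(address)
with_comma = [s for s in shingles if "," in s]

print(shingles)
print(f"{len(with_comma)} of {len(shingles)} shingles carry a comma")
```

The '4 walmley' false match disappears, but shingles such as 'and 4,' or 'walmley chambers,' will not match a query typed without the comma, which is the index bloat described above.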

As you can see in the index settings, I also have a street_sym filter that I would love to be able to use in my shingles. For this example, in addition to generating 'walmley close' I would like to have 'walmley cl', but when I attempted to include it I got shingles of 'close cl', which is not terribly helpful!

Any advice from more experienced Elasticsearch users would be hugely appreciated. I have read through Gormley and Tong's excellent book but cannot get my head around this particular issue.

Thanks in advance for any help offered.

"analysis": {     "filter": {         "shingle": {             "type": "shingle",             "output_unigrams": false         },         "street_sym": {             "type": "synonym",             "synonyms": [                 "st => street",                 "rd => road",                 "ave => avenue",                 "ct => court",                 "ln => lane",                 "terr => terrace",                 "cir => circle",                 "hwy => highway",                 "pkwy => parkway",                 "cl => close",                 "blvd => boulevard",                 "dr => drive",                 "ste => suite",                 "wy => way",                 "tr => trail"             ]         }     },     "analyzer": {         "shingle": {             "type": "custom",             "tokenizer": "standard",             "filter": [                 "lowercase",                 "shingle"             ]         }     } }

See my comment on the question for why this solution still won't stop "4 walmley close" from matching all 3 of the matches provided. However, it is possible to at least get the tokens you want. I'm not sure it's the most elegant/performant solution, but using pattern replace, pattern capture, and length filters on the shingles seems to do the trick:

"analysis": {     "filter": {         "shingle": {             "type": "shingle",             "output_unigrams": false         },         "street_sym": {             "type": "synonym",             "synonyms": [                 "st => street",                 "rd => road",                 "ave => avenue",                 "ct => court",                 "ln => lane",                 "terr => terrace",                 "cir => circle",                 "hwy => highway",                 "pkwy => parkway",                 "cl => close",                 "blvd => boulevard",                 "dr => drive",                 "ste => suite",                 "wy => way",                 "tr => trail"             ]         },         "no_middle_comma": {             "type": "pattern_replace",             "pattern": ".+,.+",             "replacement": ""          },         "no_trailing_comma": {             "type": "pattern_capture",             "preserve_original": false,             "patterns": [                 "(.*),"             ]         },         "not_empty": {             "type": "length",             "min": 1         }     },     "analyzer": {         "test": {             "type": "custom",             "tokenizer": "whitespace",             "filter": [                 "lowercase",                 "street_sym",                 "shingle",                 "no_middle_comma",                 "no_trailing_comma",                 "not_empty"             ]         }     } }

no_middle_comma: replaces any token that has a comma in the middle with an empty token
no_trailing_comma: replaces any token that ends with a comma with the part before the comma
not_empty: removes the empty tokens produced by the filters above
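
The filter chain can be traced step by step with a small pure-Python imitation. This is only a sketch of the logic, not how Lucene implements it: the function names mirror the filter names above, `SYNONYMS` is a three-entry subset of street_sym, and `analyze` is a made-up helper. It assumes, consistent with the output shown in this answer, that pattern_capture with preserve_original set to false passes through tokens that do not match.

```python
import re

SYNONYMS = {"cl": "close", "st": "street", "rd": "road"}  # subset of street_sym

def shingle(tokens):
    # Two-word shingles only, mirroring output_unigrams: false.
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def no_middle_comma(token):
    # pattern_replace: blank out a token that has text on both sides of a comma.
    return re.sub(r".+,.+", "", token)

def no_trailing_comma(token):
    # pattern_capture: keep the part before the last comma;
    # tokens without a comma are assumed to pass through unchanged.
    match = re.search(r"(.*),", token)
    return match.group(1) if match else token

def analyze(text):
    tokens = text.lower().split()                  # whitespace tokenizer + lowercase
    tokens = [SYNONYMS.get(t, t) for t in tokens]  # street_sym
    tokens = shingle(tokens)
    tokens = [no_middle_comma(t) for t in tokens]
    tokens = [no_trailing_comma(t) for t in tokens]
    return [t for t in tokens if len(t) >= 1]      # not_empty

print(analyze("units 3 and 4, walmley chambers, 3 walmley cl"))
```

Running the synonym step before shingling is what lets "cl" become "close" inside the shingle. A token such as "cl," (with the comma still attached) would not be rewritten in this sketch.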

For example, "units 3 and 4, walmley chambers, 3 walmley cl" becomes:

{
   "tokens": [
      {
         "token": "units 3",
         "start_offset": 0,
         "end_offset": 7,
         "type": "shingle",
         "position": 0
      },
      {
         "token": "3 and",
         "start_offset": 6,
         "end_offset": 11,
         "type": "shingle",
         "position": 1
      },
      {
         "token": "and 4",
         "start_offset": 8,
         "end_offset": 14,
         "type": "shingle",
         "position": 2
      },
      {
         "token": "walmley chambers",
         "start_offset": 15,
         "end_offset": 32,
         "type": "shingle",
         "position": 4
      },
      {
         "token": "3 walmley",
         "start_offset": 33,
         "end_offset": 42,
         "type": "shingle",
         "position": 6
      },
      {
         "token": "walmley close",
         "start_offset": 35,
         "end_offset": 45,
         "type": "shingle",
         "position": 7
      }
   ]
}

Note that the synonym filter works: "walmley cl" became "walmley close".
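
The caveat at the top of this answer, that "4 walmley close" still matches all three addresses, can be checked by comparing shingle sets. The sketch below is plain Python with a hypothetical `shingles` helper that never crosses a comma. It assumes the query needs only one shared shingle to count as a match, which depends on how the search itself is written.

```python
def shingles(text):
    """Two-word shingles that never straddle a comma."""
    result = set()
    for segment in text.lower().split(","):
        words = segment.split()
        result.update(f"{a} {b}" for a, b in zip(words, words[1:]))
    return result

addresses = [
    "units 3 and 4, walmley chambers, 3 walmley close",
    "flat 4, walmley court, 10 walmley close",
    "co-operative retail services ltd, 4 walmley close",
]

query = shingles("4 walmley close")  # {'4 walmley', 'walmley close'}
overlap = {n: sorted(query & shingles(a)) for n, a in enumerate(addresses, start=1)}

for n, shared in overlap.items():
    print(n, shared)
```

Addresses 1 and 2 still share 'walmley close' with the query, so they are returned. Only address 3 shares both shingles, so under those assumptions it should at least score highest.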
