lucene - Shingles in Elasticsearch which respect punctuation


I am building an address matching engine for UK addresses in Elasticsearch. I have found shingles useful, but I am seeing issues when it comes to punctuation. A query for "4 walmley close" is returning the following matches:

  1. units 3 and 4, walmley chambers, 3 walmley close
  2. flat 4, walmley court, 10 walmley close
  3. co-operative retail services ltd, 4 walmley close

The true match is number 3, but both 1 and 2 match (falsely) because both contain '4 walmley' once they are turned into shingles. I would like to tell the shingle analyzer not to generate shingles that straddle commas. So, for example 1), I currently get:

  • units 3
  • 3 and
  • and 4
  • 4 walmley
  • walmley chambers
  • chambers 3
  • 3 walmley
  • walmley close

...when in actual fact what I want is:

  • units 3
  • 3 and
  • and 4
  • walmley chambers
  • 3 walmley
  • walmley close
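To make the problem concrete, here is a rough Python sketch of what happens to the three addresses. It is not Elasticsearch: `standard_like_tokens` and `bigram_shingles` are hypothetical helpers that only approximate the standard tokenizer (lowercase, punctuation dropped) followed by two-word shingles.

```python
import re

def standard_like_tokens(text):
    """Rough stand-in for the standard tokenizer: lowercase, drop punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bigram_shingles(tokens):
    """Two-word shingles only, mirroring output_unigrams: false."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

ADDRESSES = [
    "units 3 and 4, walmley chambers, 3 walmley close",
    "flat 4, walmley court, 10 walmley close",
    "co-operative retail services ltd, 4 walmley close",
]

for address in ADDRESSES:
    shingles = bigram_shingles(standard_like_tokens(address))
    # Once the commas are gone, "4, walmley" and "4 walmley" look identical.
    print("4 walmley" in shingles, shingles)
```

Because the commas are discarded before shingling, all three addresses end up containing the shingle '4 walmley', which is why 1 and 2 match falsely.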

My current settings are below. I have experimented with swapping the tokenizer from standard to whitespace. This helps in that it retains the commas and could potentially avoid the situation above (i.e. I end up with a '4, walmley' shingle in addresses 1 and 2), but I also end up with lots of unusable shingles in the index, and with 70 million documents I need to keep the index size down.

As you can see in the index settings, I also have a street_sym filter that I would love to be able to use in the shingles. For this example, in addition to generating 'walmley close' I would also have 'walmley cl'. When I attempted to include it, however, I got shingles of 'close cl', which is not terribly helpful!

Any advice from more experienced Elasticsearch users would be hugely appreciated. I have read through Gormley and Tong's excellent book but cannot get my head around this particular issue.

Thanks in advance for any help offered.

"analysis": {     "filter": {         "shingle": {             "type": "shingle",             "output_unigrams": false         },         "street_sym": {             "type": "synonym",             "synonyms": [                 "st => street",                 "rd => road",                 "ave => avenue",                 "ct => court",                 "ln => lane",                 "terr => terrace",                 "cir => circle",                 "hwy => highway",                 "pkwy => parkway",                 "cl => close",                 "blvd => boulevard",                 "dr => drive",                 "ste => suite",                 "wy => way",                 "tr => trail"             ]         }     },     "analyzer": {         "shingle": {             "type": "custom",             "tokenizer": "standard",             "filter": [                 "lowercase",                 "shingle"             ]         }     } } 

See my comment on the question for why this solution still won't stop "4 walmley close" from matching all 3 of the matches you provided. However, it is possible to at least get the tokens you want. I'm not sure it's the most elegant/performant solution, but using pattern replace, pattern capture, and length filters on the shingles seems to do the trick:

"analysis": {     "filter": {         "shingle": {             "type": "shingle",             "output_unigrams": false         },         "street_sym": {             "type": "synonym",             "synonyms": [                 "st => street",                 "rd => road",                 "ave => avenue",                 "ct => court",                 "ln => lane",                 "terr => terrace",                 "cir => circle",                 "hwy => highway",                 "pkwy => parkway",                 "cl => close",                 "blvd => boulevard",                 "dr => drive",                 "ste => suite",                 "wy => way",                 "tr => trail"             ]         },         "no_middle_comma": {             "type": "pattern_replace",             "pattern": ".+,.+",             "replacement": ""          },         "no_trailing_comma": {             "type": "pattern_capture",             "preserve_original": false,             "patterns": [                 "(.*),"             ]         },         "not_empty": {             "type": "length",             "min": 1         }     },     "analyzer": {         "test": {             "type": "custom",             "tokenizer": "whitespace",             "filter": [                 "lowercase",                 "street_sym",                 "shingle",                 "no_middle_comma",                 "no_trailing_comma",                 "not_empty"             ]         }     } } 
  • no_middle_comma: replaces tokens that have a comma in the middle with an empty token
  • no_trailing_comma: replaces tokens ending in a comma with the part before the comma
  • not_empty: removes the empty tokens resulting from the above
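
The filter chain can be followed step by step in a small Python sketch. This is only an approximation of what the analyzer does, not Elasticsearch code: `analyze` and `SYNONYMS` are hypothetical names, the synonym map is a subset of street_sym, and Python's `re` module stands in for the Java regex engine (the two patterns used here behave the same in both).

```python
import re

# Subset of the street_sym mappings, enough for the example address.
SYNONYMS = {"cl": "close", "st": "street", "rd": "road"}

def strip_trailing_comma(token):
    """Mimic no_trailing_comma: keep the captured group, else the token as-is."""
    match = re.search(r"(.*),", token)
    return match.group(1) if match else token

def analyze(text):
    # whitespace tokenizer + lowercase: commas stay attached to their word
    tokens = text.lower().split()
    # street_sym: rewrite abbreviations before shingling
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    # shingle with output_unigrams false: two-word shingles only
    shingles = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    # no_middle_comma: a shingle with a comma in the middle becomes ""
    shingles = [re.sub(r".+,.+", "", s) for s in shingles]
    # no_trailing_comma: "and 4," becomes "and 4"
    shingles = [strip_trailing_comma(s) for s in shingles]
    # not_empty: drop the empty tokens left behind
    return [s for s in shingles if len(s) >= 1]

print(analyze("units 3 and 4, walmley chambers, 3 walmley cl"))
```

The order matters: no_middle_comma has to run before no_trailing_comma, otherwise a shingle such as "4, walmley" would be trimmed to "4" instead of being removed.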

For example, "units 3 and 4, walmley chambers, 3 walmley cl" becomes:

{
    "tokens": [
        {
            "token": "units 3",
            "start_offset": 0,
            "end_offset": 7,
            "type": "shingle",
            "position": 0
        },
        {
            "token": "3 and",
            "start_offset": 6,
            "end_offset": 11,
            "type": "shingle",
            "position": 1
        },
        {
            "token": "and 4",
            "start_offset": 8,
            "end_offset": 14,
            "type": "shingle",
            "position": 2
        },
        {
            "token": "walmley chambers",
            "start_offset": 15,
            "end_offset": 32,
            "type": "shingle",
            "position": 4
        },
        {
            "token": "3 walmley",
            "start_offset": 33,
            "end_offset": 42,
            "type": "shingle",
            "position": 6
        },
        {
            "token": "walmley close",
            "start_offset": 35,
            "end_offset": 45,
            "type": "shingle",
            "position": 7
        }
    ]
}

Note that the synonym filter works too: "walmley cl" became "walmley close".
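
On the earlier point that the query will still match all three addresses, one way to see it is to compare shingle sets directly. The sketch below is an assumption-laden illustration rather than a reproduction of Elasticsearch scoring: `comma_aware_shingles` and `shared_shingles` are hypothetical helpers that build two-word shingles inside each comma-separated segment, which gives the same tokens as the filter chain for these inputs (synonyms aside).

```python
def comma_aware_shingles(text):
    """Two-word shingles that never straddle a comma."""
    shingles = set()
    for segment in text.lower().split(","):
        words = segment.split()
        shingles.update(f"{a} {b}" for a, b in zip(words, words[1:]))
    return shingles

def shared_shingles(query, address):
    """Shingles that the query and the address have in common, sorted."""
    return sorted(comma_aware_shingles(query) & comma_aware_shingles(address))

ADDRESSES = [
    "units 3 and 4, walmley chambers, 3 walmley close",
    "flat 4, walmley court, 10 walmley close",
    "co-operative retail services ltd, 4 walmley close",
]

for address in ADDRESSES:
    print(shared_shingles("4 walmley close", address), "<-", address)
```

Every address still shares 'walmley close' with the query, so a query that accepts any matching shingle would still return all three. Only address 3 shares '4 walmley' as well, so it should at least have the largest overlap.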

