lucene - Shingles in Elasticsearch which respect punctuation
I am building an address matching engine for UK addresses in Elasticsearch, and have found shingles useful, but I am seeing issues when it comes to punctuation. A query for "4 walmley close" is returning the following matches:
- units 3 and 4, walmley chambers, 3 walmley close
- flat 4, walmley court, 10 walmley close
- co-operative retail services ltd, 4 walmley close
The true match is number 3, but both 1 and 2 match (falsely) because both produce '4 walmley' when turned into shingles. I would like to tell the shingle analyzer not to generate shingles that straddle commas. So, for example 1, I get:
- units 3
- 3 and
- and 4
- 4 walmley
- walmley chambers
- chambers 3
- 3 walmley
- walmley close
...when in actual fact what I want is:
- units 3
- 3 and
- and 4
- walmley chambers
- 3 walmley
- walmley close
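To make the difference between the two lists concrete, here is a minimal plain-Python sketch. It is not Elasticsearch code: the function names are made up for illustration, and the regex tokenizer is only a rough stand-in for the standard tokenizer. It assumes two-word shingles with no unigrams, and treats each comma-delimited segment as its own shingling window.

```python
import re


def tokenize(text):
    """Lowercase and keep only alphanumeric runs (rough stand-in for the standard tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bigrams(tokens):
    """Two-word shingles only, mirroring output_unigrams: false."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def naive_shingles(text):
    # Commas are discarded before shingling, so shingles can straddle them.
    return bigrams(tokenize(text))


def comma_aware_shingles(text):
    # Shingle each comma-delimited segment separately, so no shingle crosses a comma.
    shingles = []
    for segment in text.split(","):
        shingles.extend(bigrams(tokenize(segment)))
    return shingles


address_1 = "units 3 and 4, walmley chambers, 3 walmley close"
print(naive_shingles(address_1))        # includes '4 walmley' and 'chambers 3'
print(comma_aware_shingles(address_1))  # omits both
```

The second function produces exactly the six shingles in the desired list; the open question is how to get the same behaviour out of an Elasticsearch analyzer.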
My current settings are below. I have experimented with swapping the tokenizer from standard to whitespace, which helps in that it retains the commas and could potentially avoid the situation above (i.e. I end up with a '4, walmley' shingle in addresses 1 and 2), but I then end up with lots of unusable shingles in my index, and with 70 million documents I need to keep the index size down.
As you can see in the index settings, I also have a street_sym filter that I would love to be able to use in my shingles. For example, in addition to generating 'walmley close' I would like to have 'walmley cl', but when I attempted to include it I got shingles of 'close cl', which is not terribly helpful!
Any advice from more experienced Elasticsearch users would be hugely appreciated. I have read through Gormley and Tong's excellent book but cannot get my head around this particular issue.
Thanks in advance for any help offered.

"analysis": {
  "filter": {
    "shingle": {
      "type": "shingle",
      "output_unigrams": false
    },
    "street_sym": {
      "type": "synonym",
      "synonyms": [
        "st => street",
        "rd => road",
        "ave => avenue",
        "ct => court",
        "ln => lane",
        "terr => terrace",
        "cir => circle",
        "hwy => highway",
        "pkwy => parkway",
        "cl => close",
        "blvd => boulevard",
        "dr => drive",
        "ste => suite",
        "wy => way",
        "tr => trail"
      ]
    }
  },
  "analyzer": {
    "shingle": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [ "lowercase", "shingle" ]
    }
  }
}
See my comment on the question for why this solution still won't stop "4 walmley close" from matching all 3 of the matches provided. However, it is possible to at least get the tokens you want. I'm not sure it's the most elegant/performant solution, but using pattern replace, pattern capture, and length filters on the shingles seems to do the trick:

"analysis": {
  "filter": {
    "shingle": {
      "type": "shingle",
      "output_unigrams": false
    },
    "street_sym": {
      "type": "synonym",
      "synonyms": [
        "st => street",
        "rd => road",
        "ave => avenue",
        "ct => court",
        "ln => lane",
        "terr => terrace",
        "cir => circle",
        "hwy => highway",
        "pkwy => parkway",
        "cl => close",
        "blvd => boulevard",
        "dr => drive",
        "ste => suite",
        "wy => way",
        "tr => trail"
      ]
    },
    "no_middle_comma": {
      "type": "pattern_replace",
      "pattern": ".+,.+",
      "replacement": ""
    },
    "no_trailing_comma": {
      "type": "pattern_capture",
      "preserve_original": false,
      "patterns": [ "(.*)," ]
    },
    "not_empty": {
      "type": "length",
      "min": 1
    }
  },
  "analyzer": {
    "test": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": [
        "lowercase",
        "street_sym",
        "shingle",
        "no_middle_comma",
        "no_trailing_comma",
        "not_empty"
      ]
    }
  }
}

- no_middle_comma: replaces tokens that have a comma in the middle with an empty token
- no_trailing_comma: replaces tokens ending in a comma with the part before the comma
- not_empty: removes the empty tokens resulting from the above
For example, "units 3 and 4, walmley chambers, 3 walmley cl" becomes:

{
  "tokens": [
    { "token": "units 3", "start_offset": 0, "end_offset": 7, "type": "shingle", "position": 0 },
    { "token": "3 and", "start_offset": 6, "end_offset": 11, "type": "shingle", "position": 1 },
    { "token": "and 4", "start_offset": 8, "end_offset": 14, "type": "shingle", "position": 2 },
    { "token": "walmley chambers", "start_offset": 15, "end_offset": 32, "type": "shingle", "position": 4 },
    { "token": "3 walmley", "start_offset": 33, "end_offset": 42, "type": "shingle", "position": 6 },
    { "token": "walmley close", "start_offset": 35, "end_offset": 45, "type": "shingle", "position": 7 }
  ]
}

Note that the synonym filter also works: "walmley cl" became "walmley close".
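The filter chain can also be walked through outside Elasticsearch. The following is a rough plain-Python simulation, not the Lucene implementation: `analyze` and `STREET_SYM` are illustrative names, only a few of the synonyms are included, and it assumes two-word shingles and that a token matching no capture pattern passes through unchanged.

```python
import re

# Small subset of the street_sym mappings, for illustration only.
STREET_SYM = {"cl": "close", "ct": "court", "rd": "road", "st": "street"}


def analyze(text):
    """Approximate the 'test' analyzer: whitespace, lowercase, synonyms, shingles, comma filters."""
    tokens = [STREET_SYM.get(t, t) for t in text.lower().split()]
    shingles = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    result = []
    for shingle in shingles:
        # no_middle_comma: a comma with text on both sides blanks the whole shingle.
        shingle = re.sub(r".+,.+", "", shingle)
        # no_trailing_comma: keep only the part before a trailing comma.
        match = re.search(r"(.*),", shingle)
        if match:
            shingle = match.group(1)
        # not_empty: drop shingles that were blanked above.
        if len(shingle) >= 1:
            result.append(shingle)
    return result


addresses = [
    "units 3 and 4, walmley chambers, 3 walmley cl",
    "flat 4, walmley court, 10 walmley close",
    "co-operative retail services ltd, 4 walmley close",
]
query_shingles = analyze("4 walmley close")
for address in addresses:
    shared = [s for s in analyze(address) if s in query_shingles]
    print(address, "->", shared)
```

In this sketch all three addresses still share the "walmley close" shingle with the query, and only the third also shares "4 walmley". That is presumably the reason the query keeps matching all three documents, although the true match should now score higher than the other two.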