c++ - Recursive solutions for glob pattern matching -
i'm studying implementations of unix-style glob pattern matching. generally, fnmatch
library reference implementation functionality.
looking @ of the implementations, reading various blogs/tutorials this, seems algorithm implemented recursively.
generally, sort of algorithm requires "back tracking", or requires returning earlier state, nicely lends recursive solution. includes things tree traversal, or parsing nested structures.
but i'm having trouble understanding why glob pattern matching in particular implemented recursively. idea tracking necessary, example if have string aabaabxbaab
, , pattern a*baab
, characters after *
match first "baab" substring, aa(baab)xbaab
, , fail match rest of string. algorithm need backtrack character match counter starts over, , can match second occurrence of baab
, like: aabaabx(baab)
.
okay, recursion used when might require multiple nested levels of backtracking, , don't see how necessary in case. seems we'd ever have backtrack 1 level @ time, when iterator on pattern , iterator on input string fail match. @ point, iterator on pattern need move last "save point", either beginning of string, or last *
matched something. doesn't require stack - single save point.
the complication can think of in event of "overlapping" match. example if have input string aabaabaab
, , pattern a*baab
, fail match because "b" in last baab
part of either first match or second match. seems solved backtracking input iterator if distance between last pattern iterator save point , end of pattern greater distance between input iterator position , end of input string.
so, far i'm seeing, shouldn't of issue implement glob matching algorithm iteratively (assuming simple glob matcher, uses *
character mean "match 0 or more characters". also, matching strategy greedy.)
so, assume i'm wrong this, because else recursively - must missing something. it's can't see i'm missing here. implemented simple glob matcher in c++ (that supports *
operator), see if figure out i'm missing. extremely straightforward, simple iterative solution uses inner loop wildcard matching. records indices *
character matches in vector of pairs:
bool match_pattern(const std::string& pattern, const std::string& input, std::vector<std::pair<std::size_t, std::size_t>>& matches) { const char wildcard = '*'; auto pat = std::begin(pattern); auto pat_end = std::end(pattern); auto = std::begin(input); auto end = std::end(input); while (it != end && pat != pat_end) { const char c = *pat; if (*it == c) { ++it; ++pat; } else if (c == wildcard) { matches.push_back(std::make_pair(std::distance(std::begin(input), it), 0)); ++pat; if (pat == pat_end) { matches.back().second = input.size(); return true; } auto save = pat; std::size_t matched_chars = 0; while (it != end && pat != pat_end) { if (*it == *pat) { ++it; ++pat; ++matched_chars; if (pat == pat_end && != end) { pat = save; matched_chars = 0; // check overlap , input iterator if necessary // std::size_t d1 = std::distance(it, end); std::size_t d2 = std::distance(pat, pat_end); if (d2 > d1) -= (d2 - d1); } } else if (*pat == wildcard) { break; } else { if (pat == save) ++it; pat = save; matched_chars = 0; } } matches.back().second = std::distance(std::begin(input), it) - matched_chars; } else break; } return == end && pat == pat_end; }
then wrote series of tests see if find pattern or input string require multiple levels of backtracking (and therefore recursion or stack), can't seem come anything.
here test function:
void test(const std::string& input, const std::string& pattern) { std::vector<std::pair<std::size_t, std::size_t>> matches; bool result = match_pattern(pattern, input, matches); auto match_iter = matches.begin(); std::cout << "input: " << input << std::endl; std::cout << "pattern: " << pattern << std::endl; std::cout << "indices: "; (auto& p : matches) { std::cout << "(" << p.first << "," << p.second << ") "; } std::cout << std::endl; if (result) { std::cout << "match: "; (std::size_t idx = 0; idx < input.size(); ++idx) { if (match_iter != matches.end()) { if (idx == match_iter->first) std::cout << '('; else if (idx == match_iter->second) { std::cout << ')'; ++match_iter; } } std::cout << input[idx]; } if (!matches.empty() && matches.back().second == input.size()) std::cout << ")"; std::cout << std::endl; } else { std::cout << "no match!" << std::endl; } std::cout << std::endl; }
and actual tests:
test("aabaabaab", "a*b*ab"); test("aabaabaab", "a*"); test("aabaabaab", "aa*"); test("aabaabaab", "aaba*"); test("/foo/bar/baz/xlig/fig/blig", "/foo/*/blig"); test("/foo/bar/baz/blig/fig/blig", "/foo/*/blig"); test("abcdd", "*d"); test("abcdd", "*d*"); test("aabaabqqbaab", "a*baab"); test("aabaabaab", "a*baab");
so outputs:
input: aabaabaab pattern: a*b*ab indices: (1,2) (3,7) match: a(a)b(aaba)ab input: aabaabaab pattern: a* indices: (1,9) match: a(abaabaab) input: aabaabaab pattern: aa* indices: (2,9) match: aa(baabaab) input: aabaabaab pattern: aaba* indices: (4,9) match: aaba(abaab) input: /foo/bar/baz/xlig/fig/blig pattern: /foo/*/blig indices: (5,21) match: /foo/(bar/baz/xlig/fig)/blig input: /foo/bar/baz/blig/fig/blig pattern: /foo/*/blig indices: (5,21) match: /foo/(bar/baz/blig/fig)/blig input: abcdd pattern: *d indices: (0,4) match: (abcd)d input: abcdd pattern: *d* indices: (0,3) (4,5) match: (abc)d(d) input: aabaabqqbaab pattern: a*baab indices: (1,8) match: a(abaabqq)baab input: aabaabaab pattern: a*baab indices: (1,5) match: a(abaa)baab
the parentheses appear in output after "match: "
show substrings matched/consumed each *
character. so, seems work fine, , can't seem see why necessary backtrack multiple levels here - @ least if limit our pattern allow *
characters, , assume greedy matching.
so assume i'm wrong this, , overlooking simple - can me see why algorithm might require multiple levels of backtracking (and therefore recursion or stack)?
i didn't check details of implementation, true can match without recursive backtracking.
you can glob matching without backtracking @ building simple finite-state machine. translate glob regular expression , build dfa in normal way, or use similar aho-corasick machine; if tweaked algorithm little bit, you'd achieve same result. (the key don't need backup input scan; need figure out correct scan state, can precomputed.)
the classic fnmatch implementations not optimized speed; they're based on assumption patterns , target strings short. assumption reasonable, if allow untrusted patterns, you're opening dos attack. , based on assumption, algorithm similar 1 present, not require precomputation, faster in vast majority of use cases algorithm requires precomputing state transition tables while avoiding exponential blowup pathological patterns.
Comments
Post a Comment