The good suffix match heuristic The occurence rule

Consider a string matching situation where a suffix S has been matched when a mismatch occurs:
T = a b a a b a b a c b a P = c a b a b

Then the next opportunity for a match is when the matched suffix appears again before the matched location:

T = a b a a b a b a c b a P = c a b a b --- <---- here: b a --- <---- or here: a b (found !) Next state: T = a b a a b a b a c b a P = c a b a b

The good suffix match heuristic The no occurence rule

If the matched suffix S does not appear as a proper prefix of the pattern:

T = a b c a b a b a c b a P = c b a a b

Then we can shift the pattern m positions, i.e.: shift it pass the matched prefix:

T = a b c a b a b a c b a P = c b a a b P = c b a a b P = c b a a b P = c b a a b <-- Watch out for this "edge" case ! New state: T = a b c a b a b a c b a P = c b a a b

A word of warning: edge cases

When finding the matched suffix in the proper prefix of the pattern, be careful to consider the edge cases where a portion of the suffix appears as a prefix of the pattern:

T = . . . . x b a b . . . . . . P = a b b a c b a b ^ ^ | | Matched suffix = b a b s j The matched prefix "b a b" is not found in proper prefix of P
However, a portion "a b" is: T = . . . . x b a b . . . . . . P = a b b a c b a b This is the new state after the shift !

The good suffix match heuristic

The shift amount when there is a mismatch with a non-empty prefix:

Suppose mismatch happens at X and the matched prefix is found at pos k: T = . . . . . . . . . . . . . X y y y . . . . . P = . . y y y . . . x y y y ^ ^ ^ | | <---->| s k pre m <------------------> m-k New state: T = . . . . . . . . . . . . . X y y y . . . . . P = . . y y y . . . x y y y ^ | s' = s + m - k - len(pre)

The good suffix match heuristic

The shift amount for edge cases:

Suppose mismatch happens at X and a portion P' is found at start: T = . . . . . . . . . . . . . X a b c . . . . . P = b c . . . . . . x a b c ^ ^ | | <-> s P' <-------------------> m - len(P') New state: T = . . . . . . . . . . . . . X a b c . . . . . P = b c . . . . . . x a b c ^ | s' = s + m - len(P')

Computing the shift amount for the good suffix match heuristic

int goodSuffixShift(string P, int pos) { string matchedSuffix = P.substr(pos+1, P.length()); int len = matchedSuffix.length(); // (1) Check string P[k] .. P[k+len-1] == matchedSuffix for ( int k = pos; k >= 0; k-- ) { string s1 = P.substr(k, len); if ( s1 == matchedSuffix ) return P.length() - k - matchedSuffix.length(); } // (2) Check edge cases for ( int k = matchedSuffix.length()-1; k >= 1; k-- ) { if ( P.substr(0,k) == P.substr(P.length()-k, k) ) return P.length() - k; } // (3) No suffix in P --> shift P.length() return P.length(); }

The Boyer-Moore algorithm using only the good suffix match heuristic

int bm(string T, string P, int s) // s = start position { int n = T.length(), m = P.length(); int j = m-1; // Start match with character pos = m-1 while ( s+(m-1) < n ) // Last char pos in P is valid character in T { if ( P[j] == T[s+j] ) { if ( j == 0 ) return(s); // Pattern found at pos s j--; // Continue with next character } else { int shift = 1; // Default shift if ( j < m-1 ) // Make sure len(suffix) >= 1 shift = goodSuffixShift(P, j); // Find suffix P[j+1]..P[m-1] s += shift; j = m-1; // reset to 1st char in pattern to match } } return -1; // Pattern not found }

DEMO: Progs/bm-2.cc