The good suffix match heuristic   The occurence rule

  • Consider a string matching situation where a suffix S has been matched when a mismatch occurs:

       T = a b a a b a b a c b a
       P = c a b a b       

  • Then the next opportunity for a match is when the matched suffix appears again before the matched location:

       T = a b a a b a b a c b a
       P = c a b a b 
               ---          <----    here:    b a
             ---            <---- or here:    a b (found !)
    
     Next state:
    
       T = a b a a b a b a c b a
       P =     c a b a b 
    

The good suffix match heuristic   The no occurence rule

  • If the matched suffix S does not appear as a proper prefix of the pattern:

       T = a b c a b a b a c b a
       P = c b a a b       

  • Then we can shift the pattern m positions, i.e.: shift it pass the matched prefix:

       T = a b c a b a b a c b a
       P =   c b a a b 
       P =     c b a a b   
       P =       c b a a b 
       P =         c b a a b  <-- Watch out for this "edge" case !
    
     New state:
    
       T = a b c a b a b a c b a
       P =           c b a a b 
    

A word of warning:    edge cases

  • When finding the matched suffix in the proper prefix of the pattern, be careful to consider the edge cases where a portion of the suffix appears as a prefix of the pattern:

      T = . . . . x b a b . . . . . . 
      P = a b b a c b a b
          ^       ^
          |	      |         Matched suffix = b a b 
          s	      j 
    
     The matched prefix "b a b" is not found in proper prefix of P
    
    
    However, a portion "a b" is: T = . . . . x b a b . . . . . . P = a b b a c b a b This is the new state after the shift !

The good suffix match heuristic

The shift amount when there is a mismatch with a non-empty prefix:

 Suppose mismatch happens at X and the matched prefix is found at pos k:

  T = . . . . . . . . . . . . . X y y y . . . . .
  P =           . . y y y . . . x y y y
                ^   ^                   ^
                |   |             <---->|
		s   k              pre  m
                     <------------------>
                            m-k

 New state:

  T = . . . . . . . . . . . . . X y y y . . . . .
  P =                         . . y y y . . . x y y y
                              ^
			      |
			   s' = s + m - k - len(pre)
 

The good suffix match heuristic

The shift amount for edge cases:

 Suppose mismatch happens at X and a portion P' is found at start:

  T = . . . . . . . . . . . . . X a b c . . . . .
  P =           b c . . . . . . x a b c
                ^ ^                   
                | |                 <->
		s                    P' 
                   <------------------->
                         m - len(P')

 New state:
  T = . . . . . . . . . . . . . X a b c . . . . .
  P =                               b c . . . . . . x a b c
                                    ^
			            |
			   s' = s + m - len(P')
 

Computing the shift amount for the good suffix match heuristic

int goodSuffixShift(string P, int pos)
{
   string matchedSuffix = P.substr(pos+1, P.length());
   int    len = matchedSuffix.length();

   // (1) Check string  P[k] .. P[k+len-1] == matchedSuffix
   for ( int k = pos; k >= 0; k-- )
   {
      string s1 = P.substr(k, len);
      if ( s1 == matchedSuffix )
          return P.length() - k - matchedSuffix.length();
   }

   // (2) Check edge cases
   for ( int k = matchedSuffix.length()-1; k >= 1; k-- )
   {
       if ( P.substr(0,k) == P.substr(P.length()-k, k) )
          return  P.length() - k;
   }

   // (3) No suffix in P --> shift P.length()
   return P.length();
} 

The Boyer-Moore algorithm using only the good suffix match heuristic

int bm(string T, string P, int s)  // s = start position
{
    int n = T.length(), m = P.length();

    int j = m-1;  // Start match with character pos = m-1

    while ( s+(m-1) < n )  // Last char pos in P is valid character in T
    {
        if ( P[j] == T[s+j] )  
        { 
            if ( j == 0 )
                return(s);    // Pattern found at pos s
            j--;              // Continue with next character
        }
        else
        {
            int shift = 1;     // Default shift
            if ( j < m-1 )      // Make sure len(suffix) >= 1
                shift = goodSuffixShift(P, j); // Find suffix P[j+1]..P[m-1]
            s += shift;

            j = m-1;      // reset to 1st char in pattern to match
        }
    }
    return -1;          // Pattern not found
}

DEMO: Progs/bm-2.cc