The bad character match heuristic   The no occurence rule

  • There are advantages to perform string matching from right ⇒ left.

  • Consider a string matching situation where the text symbol (here: d) that is compared with the pattern symbol does not occur in the pattern at all:

       T = a b b a d a b a c b a
       P = b a b a c                  d does not occur in P
    
     We can skip all these shifts:
    
       T = a b b a d a b a c b a
       P =   b a b a c
       P =     b a b a c
       P =       b a b a c
       P =         b a b a c
      

The bad character match heuristic   The no occurence rule

  • If the text symbol that is compared with the pattern symbol does not occur in the pattern at all, then:

      • The pattern can be shifted by m (length(P)) positions behind the mismatched text symbol.

  • Example 1:

       T = a b b a d a b a c b a
       P = b a b a c
    
       d does not occur in P: there is no way to match d
    
    
    Shift the pattern pass the missing character: T = a b b a d a b a c b a P = b a b a c

The bad character match heuristic   The no occurence rule

  • If the text symbol that is compared with the pattern symbol does not occur in the pattern at all, then:

      • The pattern can be shifted by m (length(P)) positions behind the mismatched text symbol.

  • Example 2: some characters has matched before the mismatch

       T = a b b a d b a a b a c b a
       P = b a b a c b a
    
       d does not occur in P: there is no way to match d
    
    
    Shift the pattern pass the missing character: T = a b b a d b a a b a c b a P = b a b a c b a

The bad character match heuristic   The no occurence rule

 

 Suppose mismatch happens here:

  T = . . . . . . . . . . X y y y . . . .
  P =           . . . . . x y y y   
                ^         ^     ^   Mismatched symbol X is not in pattern
		|         |     |
		s         j    m-1
                 <------->
                     j

 New state:

  T = . . . . . . . . . . X y y y . . . .
  P =                       . . . . . x y y y
                            ^               ^
		            |               |
	                 s' = s + j + 1    m-1
  

The bad character match heuristic   The occurence rule

  • If the text symbol X that is compared with the pattern symbol occur in the pattern, then:

      • The pattern can be safely shifted where the last occurring symbol X in the pattern lines up with the mismatched character X

  • Example:

       T = a b b a b a b a c b a
       P = b a b a c
    
       b does occur in P: we can safely line b up 
                          with the last "b" in the pattern 
    
    
    Shift the pattern so the last b lines up with the mismatched b: T = a b b a b a b a c b a P = b a b a c

The bad character match heuristic   The occurence rule

  • Note:

      • It is possible that the last occurring symbol X in the pattern is after the mismatched symbol.

      • In this case, we do not have any useful information and shift pattern 1 position forward.

  • Example 2: some characters has matched before the mismatch

       T = a b b a b b a a b a c b a
       P = b a b a c b a
    
     The last occuring symbol b is after the mismatched symbol
    
    
    Shift the pattern 1 position forward: T = a b b a d b a a b a c b a P = b a b a c d a

The bad character match heuristic   The occurence rule

 

 Suppose mismatch happens here:

  T = . . . . . . . . . . X y y y . . . .
  P =           . . X . . x y y y
                ^   ^     ^     ^
		|   |     |     |
		s   k     j    m-1
                     <--->
                      j-k

 New state:

  T = . . . . . . . . . . X y y y . . . .
  P =                 . . X . . x y y y
                      ^               ^
		      |               |
	         s' = s + j - k      m-1
  

The bad character match heuristic   Consolidation the 2 rules

  • Shift amount when mismatched character does not occur in pattern:

       s' = s + j + 1        (j = position of mismatch)

  • Shift amount when mismatched character occurs lastest at position k in pattern:

       s' = s + j - k        (j = position of mismatch)

  • How to consolidate both rules:

       If character does not occur in pattern:
    
             Assign lastPos = -1 !!!
    

Review:    The brute-force matching algorithm with right ⇒ left matchhing

int bruteForce(string T, string P, int s)  // s = start position
{
    int n = T.length(), m = P.length();

    int j = m-1;  // Start match with character pos = m-1

    while ( s+(m-1) < n )  // Last char pos in P is valid character in T
    {
        if ( P[j] == T[s+j] )  
        { 
            if ( j == 0 )
                return(s);    // Pattern found at pos s
            j--;              // Continue with next character
        }
        else
        {


            s += 1;       // Use next shift position
            j = m-1;      // reset to 1st char in pattern to match
        }
    }
    return -1;          // Pattern not found
}

The Boyer-Moore algorithm using only the bad character match heuristic

int bm(string T, string P, int s)  // s = start position
{
    int n = T.length(), m = P.length();

    int j = m-1;  // Start match with character pos = m-1

    while ( s+(m-1) < n )  // Last char pos in P is valid character in T
    {
        if ( P[j] == T[s+j] )  
        { 
            if ( j == 0 )
                return(s);    // Pattern found at pos s
            j--;              // Continue with next character
        }
        else
        {
            int lastPos = findLastPos(P, T[s+j]);
	    int shift = j - lastPos;
            s += ((shift >= 1) ? shift : 1);  
            j = m-1;      // reset to 1st char in pattern to match
        }
    }
    return -1;          // Pattern not found
}

The findLastPos(s, c) function

 

int findLastPos(string s, char c)
{
   for ( int i = s.length()-1; i >= 0; i-- )
      if ( c == s[i] )
          return i;

   // Character c does not occur in pattern: return -1
   return -1;
}
  

DEMO: Progs/bm-1.cc

Note: we can of course speed things up by pre-computing the bad character shift table...
It's very simple so I will not do it in my notes...