The bad character match heuristic The no occurence rule

There are advantages to perform string matching from right ⇒ left.

Consider a string matching situation where the text symbol (here: d) that is compared with the pattern symbol does not occur in the pattern at all:

T = a b b a d a b a c b a P = b a b a c d does not occur in P We can skip all these shifts: T = a b b a d a b a c b a P = b a b a c P = b a b a c P = b a b a c P = b a b a c

The bad character match heuristic The no occurence rule

If the text symbol that is compared with the pattern symbol does not occur in the pattern at all, then:

Example 1:

T = a b b a d a b a c b a P = b a b a c d does not occur in P: there is no way to match d
Shift the pattern pass the missing character: T = a b b a d a b a c b a P = b a b a c

The bad character match heuristic The no occurence rule

If the text symbol that is compared with the pattern symbol does not occur in the pattern at all, then:

Example 2: some characters has matched before the mismatch

T = a b b a d b a a b a c b a P = b a b a c b a d does not occur in P: there is no way to match d
Shift the pattern pass the missing character: T = a b b a d b a a b a c b a P = b a b a c b a

The bad character match heuristic The no occurence rule

Suppose mismatch happens here: T = . . . . . . . . . . X y y y . . . . P = . . . . . x y y y ^ ^ ^ Mismatched symbol X is not in pattern | | | s j m-1 <-------> j New state: T = . . . . . . . . . . X y y y . . . . P = . . . . . x y y y ^ ^ | | s' = s + j + 1 m-1

The bad character match heuristic The occurence rule

If the text symbol X that is compared with the pattern symbol occur in the pattern, then:

Example:

T = a b b a b a b a c b a P = b a b a c b does occur in P: we can safely line b up with the last "b" in the pattern
Shift the pattern so the last b lines up with the mismatched b: T = a b b a b a b a c b a P = b a b a c

The bad character match heuristic The occurence rule

Note:

It is possible that the last occurring symbol X in the pattern is after the mismatched symbol.
In this case, we do not have any useful information and shift pattern 1 position forward.

Example 2: some characters has matched before the mismatch

T = a b b a b b a a b a c b a P = b a b a c b a The last occuring symbol b is after the mismatched symbol
Shift the pattern 1 position forward: T = a b b a d b a a b a c b a P = b a b a c d a

The bad character match heuristic The occurence rule

Suppose mismatch happens here: T = . . . . . . . . . . X y y y . . . . P = . . X . . x y y y ^ ^ ^ ^ | | | | s k j m-1 <---> j-k New state: T = . . . . . . . . . . X y y y . . . . P = . . X . . x y y y ^ ^ | | s' = s + j - k m-1

The bad character match heuristic Consolidation the 2 rules

Shift amount when mismatched character does not occur in pattern:

s' = s + j + 1 (j = position of mismatch)

Shift amount when mismatched character occurs lastest at position k in pattern:
s' = s + j - k (j = position of mismatch)

How to consolidate both rules:

If character does not occur in pattern: Assign lastPos = -1 !!!

Review: The brute-force matching algorithm with right ⇒ left matchhing

int bruteForce(string T, string P, int s) // s = start position { int n = T.length(), m = P.length(); int j = m-1; // Start match with character pos = m-1 while ( s+(m-1) < n ) // Last char pos in P is valid character in T { if ( P[j] == T[s+j] ) { if ( j == 0 ) return(s); // Pattern found at pos s j--; // Continue with next character } else { s += 1; // Use next shift position j = m-1; // reset to 1st char in pattern to match } } return -1; // Pattern not found }

The Boyer-Moore algorithm using only the bad character match heuristic

int bm(string T, string P, int s) // s = start position { int n = T.length(), m = P.length(); int j = m-1; // Start match with character pos = m-1 while ( s+(m-1) < n ) // Last char pos in P is valid character in T { if ( P[j] == T[s+j] ) { if ( j == 0 ) return(s); // Pattern found at pos s j--; // Continue with next character } else { int lastPos = findLastPos(P, T[s+j]); int shift = j - lastPos; s += ((shift >= 1) ? shift : 1); j = m-1; // reset to 1st char in pattern to match } } return -1; // Pattern not found }

The findLastPos(s, c) function

int findLastPos(string s, char c) { for ( int i = s.length()-1; i >= 0; i-- ) if ( c == s[i] ) return i; // Character c does not occur in pattern: return -1 return -1; }

DEMO: Progs/bm-1.cc

Note: we can of course speed things up by pre-computing the bad character shift table...
It's very simple so I will not do it in my notes...