The
bad character match
heuristic
The no occurence rule
- There are advantages to
perform
string matching from
right ⇒ left.
- Consider a
string matching situation where
the text symbol
(here:
d)
that is compared with
the
pattern symbol
does not occur in the pattern
at all:
T = a b b a d a b a c b a
P = b a b a c d does not occur in P
We can skip all these shifts:
T = a b b a d a b a c b a
P = b a b a c
P = b a b a c
P = b a b a c
P = b a b a c
|
|
The
bad character match
heuristic
The no occurence rule
- If the text symbol
that is compared with
the
pattern symbol
does not occur in the pattern
at all, then:
- The pattern
can be shifted
by m (length(P))
positions
behind the
mismatched text symbol.
|
- Example 1:
T = a b b a d a b a c b a
P = b a b a c
d does not occur in P: there is no way to match d
Shift the pattern pass the missing character:
T = a b b a d a b a c b a
P = b a b a c
|
|
The
bad character match
heuristic
The no occurence rule
- If the text symbol
that is compared with
the
pattern symbol
does not occur in the pattern
at all, then:
- The pattern
can be shifted
by m (length(P))
positions
behind the
mismatched text symbol.
|
- Example 2: some characters
has matched before
the mismatch
T = a b b a d b a a b a c b a
P = b a b a c b a
d does not occur in P: there is no way to match d
Shift the pattern pass the missing character:
T = a b b a d b a a b a c b a
P = b a b a c b a
|
|
The
bad character match
heuristic
The no occurence rule
Suppose mismatch happens here:
T = . . . . . . . . . . X y y y . . . .
P = . . . . . x y y y
^ ^ ^ Mismatched symbol X is not in pattern
| | |
s j m-1
<------->
j
New state:
T = . . . . . . . . . . X y y y . . . .
P = . . . . . x y y y
^ ^
| |
s' = s + j + 1 m-1
|
The
bad character match
heuristic
The occurence rule
- If the text symbol X
that is compared with
the
pattern symbol
occur in the pattern,
then:
- The pattern
can be
safely shifted
where the
last occurring
symbol X
in the pattern
lines up with
the mismatched character X
|
- Example:
T = a b b a b a b a c b a
P = b a b a c
b does occur in P: we can safely line b up
with the last "b" in the pattern
Shift the pattern so the last b lines up with the mismatched b:
T = a b b a b a b a c b a
P = b a b a c
|
|
The
bad character match
heuristic
The occurence rule
-
Note:
- It is possible that the
last occurring symbol X
in the pattern is
after the
mismatched symbol.
- In this case,
we do not have any
useful information and
shift pattern
1 position
forward.
|
- Example 2: some characters
has matched before
the mismatch
T = a b b a b b a a b a c b a
P = b a b a c b a
The last occuring symbol b is after the mismatched symbol
Shift the pattern 1 position forward:
T = a b b a d b a a b a c b a
P = b a b a c d a
|
|
The
bad character match
heuristic
The occurence rule
Suppose mismatch happens here:
T = . . . . . . . . . . X y y y . . . .
P = . . X . . x y y y
^ ^ ^ ^
| | | |
s k j m-1
<--->
j-k
New state:
T = . . . . . . . . . . X y y y . . . .
P = . . X . . x y y y
^ ^
| |
s' = s + j - k m-1
|
The
bad character match
heuristic
Consolidation the 2 rules
- Shift amount when
mismatched character
does
not occur in
pattern:
s' = s + j + 1 (j = position of mismatch)
|
- Shift amount when
mismatched character
occurs lastest at
position k in
pattern:
s' = s + j - k (j = position of mismatch)
|
- How to
consolidate
both rules:
If character does not occur in pattern:
Assign lastPos = -1 !!!
|
|
Review:
The brute-force matching algorithm
with right ⇒ left matchhing
int bruteForce(string T, string P, int s) // s = start position
{
int n = T.length(), m = P.length();
int j = m-1; // Start match with character pos = m-1
while ( s+(m-1) < n ) // Last char pos in P is valid character in T
{
if ( P[j] == T[s+j] )
{
if ( j == 0 )
return(s); // Pattern found at pos s
j--; // Continue with next character
}
else
{
s += 1; // Use next shift position
j = m-1; // reset to 1st char in pattern to match
}
}
return -1; // Pattern not found
}
|
The Boyer-Moore
algorithm using
only the
bad character match
heuristic
int bm(string T, string P, int s) // s = start position
{
int n = T.length(), m = P.length();
int j = m-1; // Start match with character pos = m-1
while ( s+(m-1) < n ) // Last char pos in P is valid character in T
{
if ( P[j] == T[s+j] )
{
if ( j == 0 )
return(s); // Pattern found at pos s
j--; // Continue with next character
}
else
{
int lastPos = findLastPos(P, T[s+j]);
int shift = j - lastPos;
s += ((shift >= 1) ? shift : 1);
j = m-1; // reset to 1st char in pattern to match
}
}
return -1; // Pattern not found
}
|
The
findLastPos(s, c)
function
int findLastPos(string s, char c)
{
for ( int i = s.length()-1; i >= 0; i-- )
if ( c == s[i] )
return i;
// Character c does not occur in pattern: return -1
return -1;
}
|
DEMO:
Progs/bm-1.cc
Note:
we can of course
speed things up by
pre-computing the
bad character shift
table...
It's very
simple so
I will not do it in
my notes...
❮
❯