Definition: prefix

  • Prefix (of the pattern P):

      P = aaaaabaa????????
          ^^^^^^^^
          Prefix
    
      Prefix (of pattern P) = a portion of the text 
                              at the start of P 
     

     

     

     

Definition: proper suffix of a prefix

  • A proper suffix of a prefix (of a pattern P):

      P = aaaaabaa????????
                ^^
               Proper suffix of a Prefix
    
      Proper suffix of a prefix  = a proper substring 
                                   at the tail portion
                                   of a prefix of P 

  • The entire prefix is not a proper suffix:

      P = aaaaabaa????????
          ^^^^^^^^
          Not a proper suffix

The maximum overlap of a prefix

  • Let pre be a prefix of a pattern P

  • MaxOverlap(pre) = the longest proper prefix of pre that is also a prefix of pre

  • Examples:

       MaxOverlap("aaaa") = "aaa"
    
           because:         aaaa    aaa = longest proper suffix
                             aaaa   that is also a prefix
    
    MaxOverlap("aaba") = "a" because: aaba a = longest proper suffix aaba that is also a prefix

The maximum overlap of a prefix

  • The MaxOverlap(pre) can be empty:

  • Example:

       MaxOverlap("aaab") = ""
    
           because:         aaab       "" = longest proper suffix
                                aaab   that is also a prefix
     
     

     

     

How to use a matched prefix to shift the pattern farther

  • Suppose the first mismatch occured here:

      T:   .......abc??abcX..............
      P:          abc??abcx???
                  ^       ^    Matched prefix = aab??abc
    	      |       |    MaxOverlap("abc??abc") = "abc"
    	      s	      j
    
      First mismatch occurs at j: P[j] != T[s+j]
    
    
    
    
    
    
    
    

How to use a matched prefix to shift the pattern farther

  • The earliest possible match can start at the maximum overlap:

      T:   .......abc??abcX..............
      P:          abc??abcx???
                  ^       ^    Matched prefix = aab??abc
    	      |       |    MaxOverlap("abc??abc") = "abc"
    	      s	      j
    
      Earliest possible match can start here:
    
      T:   .......abc??abcX..............
      P:               abc??abcx???
                       ^         Matched prefix = abc
    	           |            
    	           s += len(pre) - MaxOverlap(pre)
                       j=0 

How to use a matched prefix to shift the pattern farther   Example

  • Suppose the first mismatch occured here:

      T:   .....abadababaccabacabaabb
      P:        abadabacb
                ^      ^    Matched prefix = abadaba
    	    |      |    MaxOverlap("abadaba") = "aba"
    	    s	   j
    
    
    
    
    
    
    
    
     

How to use a matched prefix to shift the pattern farther   Example

  • Suppose the first mismatch occured here:

      T:   .....abadababaccabacabaabb
      P:        abadabacb
                ^      ^    Matched prefix = abadaba
    	    |      |    MaxOverlap("abadaba") = "aba"
    	    s	   j
    
      Earliest possible match can start here:
    
      T:   .....abadababaccabacabaabb
      P:            abadabacb
                    ^       Matched prefix = aba
    	        |            
    	        s += len("abadaba") - len("aba") = 4
                       j=0 

Review:    The brute force string matching algorithm

int bruteForce(string T, string P, int s)  // s = start position
{
    int n = T.length(), m = P.length();

    int j = 0;  // Start match with character pos = 0

    while ( s+(m-1) < n )  // Last char pos in P is valid character in T
    {
        if ( P[j] == T[s+j] )  
        { 
            j++;              // Continue with next character
            if ( j == m )
                return(s);    // Pattern found at pos s
        }
        else
        {
            s += 1;     // Use next shift position 
            j = 0;      // reset to 1st char in pattern to match
        }
    }
    return -1;          // Pattern not found
}

The Knutt-Morris-Pratt string matching algorithm

int KMP(string T, string P, int s)  // s = start position
{
    int n = T.length(), m = P.length();

    int j = 0;  // Start match with character pos = 0

    while ( s+(m-1) < n )  // Last char pos in P is valid character in T
    {
        if ( P[j] == T[s+j] )  
        { 
            j++;              // Continue with next character
            if ( j == m )
                return(s);    // Pattern found at pos s
        }
        else
        {   string pre = P.substr(0, j);
            s += max(1, len(pre) - MaxOverlap(pre)); 
            j = 0;      // reset to 1st char in pattern to match
        }
    }
    return -1;          // Pattern not found
} 

How to compute MaxOverlap(pre)

 

int MaxOverlap(string s)
{
   for (int k = s.length()-1; k >= 1; k--)
   {
      string s1 = s.substr(0, k);
      string s2 = s.substr(s.length()-k, s.length());

      if ( s1 == s2 )
         return s1.length();
   }

   return 0;
}
  

Demo: Progs/kmp-1.cc

Precomputing the MaxOverlap( ) values

  • Notice that the KMP( ) algorithm calls MaxOverlap( ) with different prefixes of the pattern P

  • However, there is a fixed number (= finite) of different prefixes of pattern P

  • Example:

      P = abadabacb
          ^  ^
          s  j
      
         j     prefix     MaxOverlap
      -------------------------------   
         0                 ""
         1       a         ""  Note: a prefix is uniquely identified
         2	     ab	       ""        by its length "j"
         3	     aba       a
         4	     abad      ""  and so on... 

Precomputing the MaxOverlap( ) values

  • One solution make MaxOverlap( ) run more efficiently is:

      1. Pre-compute MaxOverlap(s) for every prefix

        Note: each prefix pre is uniquely identified by its length pre.length()

      2. The KMP( ) will use pre.length() to lookup the value MaxOverlap(pre.length())

  • Comment:

      • The table of pre-computed MaxOverlap(s) values is called the: KMP failure function

Example KMP failure function

    P:           abacab

    length       prefix         MaxOverlap(prefix)      f(k)
  ---------------------------------------------------------------------
    k = 0          ""             ""                    f(k=0) = 0
    k = 1          a              ""                    f(k=1) = 0
    k = 2	   ab		  ""			f(k=2) = 0
    k = 3	   aba		  a			f(k=3) = 1
    k = 4	   abac		  ""			f(k=4) = 0
    k = 5	   abaca	  a			f(k=5) = 1
    k = 6	   abacab	  ab                    f(k=6) = 2

  KMP failure function:

                 k = | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
           ----------+---+---+---+---+---+---+----
              f(k) = | 0 | 0 | 0 | 1 | 0 | 1 | 2 |
  

DEMO: Progs/kmp-2.cc

My personal preference: use memoization

int KMPFailureF[m]; // Initialized to -1, -1, -1, ...

int MaxOverlap(string s)
{
   if ( KMPFailureF[s.length()] >= 0 )
      return KMPFailureF[s.length()];

   for (int k = s.length()-1; k >= 1; k--)
   {
      string s1 = s.substr(0, k);
      string s2 = s.substr(s.length()-k, s.length());

      if ( s1 == s2 )
      {
         KMPFailureF[s.length()] = s1.length();
         return KMPFailureF[s.length()];
      }
   }

   return 0;
} 

DEMO: Progs/kmp-3.cc