Duplicate substring searching

Question

Is there any efficient way to find the duplicate substring? Here, duplicate means that two same substring close to each other have the same value without overlap. For example, the source string is:

ABCDDEFGHFGH

'D' and 'FGH' is duplicated. 'F' appear two times in the sequence, however, they are not close to each other, so it does not duplicate. so our algorithm will return ['D', 'FGH']. I want to know whether there exists an elegant algorithm instead the brute force method?


Show source
| search   | algorithm   | substring   2016-12-22 12:12 3 Answers

Answers to Duplicate substring searching ( 3 )

  1. 2016-12-22 12:12

    It relates to Longest repeated substring problem, which builds Suffix Tree to provide string searching in linear time and space complexity Θ(n)

  2. 2016-12-22 13:12

    Not very efficient (suffix tree/array are better for very large strings), but very short regular expression solution (C#):

      string source = @"ABCDDEFGHFGH";
    
      string[] result = Regex
        .Matches(source, @"(.+)\1")
        .OfType<Match>()
        .Select(match => match.Groups[1].Value)
        .ToArray(); 
    

    Explanation

    (.+) - group of any (at least 1) characters
    \1   - the same group (group #1) repeated 
    

    Test

      Console.Write(string.Join(", ", result));     
    

    Outcome

      D, FGH
    

    In case of ambiguity, e.g. "AAAA" where we can provide "AA" as well as "A" the solution performs greedy and thus "AA" is returned.

  3. 2016-12-22 15:12

    Without using any regex which might turn out to be very slow, I guess it's best to use two cursors running hand to hand. The algorithm is pretty obvious from the below JS code.

    function getNborDupes(s){
      var cl = 0,  // cursor left
          cr = 0,  // cursor right
          ts = "", // test string
         res = []; // result array
      while (cl < s.length){
        cr = cl;
        while (++cr < s.length){
          ts = s.slice(cl,cr);  // ts starting from cl to cr (char @ cr excluded)
          
          // check ts with subst from cr to cr + ts.length (char @ cr + ts.length excluded)
          // if they match push it to result advance cursors to cl + ts.length and continue
          
          ts === s.substr(cr,ts.length) && (res.push(ts), cl = cr += ts.length);
        }
      cl++;
      }
      return res;
    }
    
    var str = "ABCDDEFGHFGH";
    console.log(getNborDupes(str));

    Throughout the whole process ts will take the following values.

    A
    AB
    ABC
    ABCD
    ABCDD
    ABCDDE
    ABCDDEF
    ABCDDEFG
    ABCDDEFGH
    ABCDDEFGHF
    ABCDDEFGHFG
    B
    BC
    BCD
    BCDD
    BCDDE
    BCDDEF
    BCDDEFG
    BCDDEFGH
    BCDDEFGHF
    BCDDEFGHFG
    C
    CD
    CDD
    CDDE
    CDDEF
    CDDEFG
    CDDEFGH
    CDDEFGHF
    CDDEFGHFG
    D
    E
    EF
    EFG
    EFGH
    EFGHF
    EFGHFG
    F
    FG
    FGH
    

    Though the cl = cr += ts.length part decides whether or not to re-start searching on from before or after the matching sub-string. As of currently the above code; "ABABABAB" input would return ["AB","AB"] for but if you make it cr = cl += ts.length then you should expect the result to be ["AB", "AB", "AB"].

Leave a reply to - Duplicate substring searching

◀ Go back