## Duplicate substring searching

Question

Is there any efficient way to find the duplicate substring? Here, duplicate means that two same substring close to each other have the same value without overlap. For example, the source string is:

``````ABCDDEFGHFGH
``````

'D' and 'FGH' is duplicated. 'F' appear two times in the sequence, however, they are not close to each other, so it does not duplicate. so our algorithm will return ['D', 'FGH']. I want to know whether there exists an elegant algorithm instead the brute force method?

Show source

## Answers to Duplicate substring searching ( 3 )

1. It relates to Longest repeated substring problem, which builds Suffix Tree to provide string searching in linear time and space complexity `Θ(n)`

2. Not very efficient (suffix tree/array are better for very large strings), but very short regular expression solution (C#):

``````  string source = @"ABCDDEFGHFGH";

string[] result = Regex
.Matches(source, @"(.+)\1")
.OfType<Match>()
.Select(match => match.Groups[1].Value)
.ToArray();
``````

Explanation

``````(.+) - group of any (at least 1) characters
\1   - the same group (group #1) repeated
``````

Test

``````  Console.Write(string.Join(", ", result));
``````

Outcome

``````  D, FGH
``````

In case of ambiguity, e.g. `"AAAA"` where we can provide `"AA"` as well as `"A"` the solution performs greedy and thus `"AA"` is returned.

3. Without using any regex which might turn out to be very slow, I guess it's best to use two cursors running hand to hand. The algorithm is pretty obvious from the below JS code.

``````function getNborDupes(s){
var cl = 0,  // cursor left
cr = 0,  // cursor right
ts = "", // test string
res = []; // result array
while (cl < s.length){
cr = cl;
while (++cr < s.length){
ts = s.slice(cl,cr);  // ts starting from cl to cr (char @ cr excluded)

// check ts with subst from cr to cr + ts.length (char @ cr + ts.length excluded)
// if they match push it to result advance cursors to cl + ts.length and continue

ts === s.substr(cr,ts.length) && (res.push(ts), cl = cr += ts.length);
}
cl++;
}
return res;
}

var str = "ABCDDEFGHFGH";
console.log(getNborDupes(str));``````

Throughout the whole process `ts` will take the following values.

``````A
AB
ABC
ABCD
ABCDD
ABCDDE
ABCDDEF
ABCDDEFG
ABCDDEFGH
ABCDDEFGHF
ABCDDEFGHFG
B
BC
BCD
BCDD
BCDDE
BCDDEF
BCDDEFG
BCDDEFGH
BCDDEFGHF
BCDDEFGHFG
C
CD
CDD
CDDE
CDDEF
CDDEFG
CDDEFGH
CDDEFGHF
CDDEFGHFG
D
E
EF
EFG
EFGH
EFGHF
EFGHFG
F
FG
FGH
``````

Though the `cl = cr += ts.length` part decides whether or not to re-start searching on from before or after the matching sub-string. As of currently the above code; `"ABABABAB"` input would return `["AB","AB"]` for but if you make it `cr = cl += ts.length` then you should expect the result to be `["AB", "AB", "AB"]`.