## How do I find the percentage of similarity between two multiline Strings?

Question

I have two multi-line strings, and I'm using the following code to determine the similarity between them. It uses the Levenshtein distance algorithm.

```
public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) {
        longer = s2;
        shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}

public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();
    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0)
                costs[j] = j;
            else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1))
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0)
            costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
}
```

But the above code is not working as expected.

For instance, say we have the following two strings, `s1` and `s2`:

S1 -> `How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?`

S2 -> `How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?`

When I pass these strings to the `similarity` method, it does not find the exact percentage of difference. How do I optimize the algorithm?

Here is my main method:

**update**:

```
public static boolean authQuestion(String question) throws SQLException {
    boolean isQuestionAvailable = false;
    Connection dbCon = null;
    try {
        dbCon = MyResource.getConnection();
        String query = "SELECT * FROM WORDBANK where WORD ~* ?;";
        PreparedStatement checkStmt = dbCon.prepareStatement(query);
        checkStmt.setString(1, question);
        ResultSet rs = checkStmt.executeQuery();
        while (rs.next()) {
            double re = similarity(rs.getString("question"), question);
            if (re > 0.6) {
                isQuestionAvailable = true;
            } else {
                isQuestionAvailable = false;
            }
        }
    } catch (URISyntaxException e1) {
        e1.printStackTrace();
    } catch (SQLException sqle) {
        sqle.printStackTrace();
    } catch (Exception e) {
        if (dbCon != null)
            dbCon.close();
    } finally {
        if (dbCon != null)
            dbCon.close();
    }
    return isQuestionAvailable;
}
```


## Answers (3)

I can suggest an approach.

You are using edit distance, which gives you the number of characters in S1 that you need to change, add, or remove in order to turn it into S2.

So, for example, for two three-character strings with no characters in common, the edit distance is 3 and they are 100% different (if you think of it as a character-by-character comparison).
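
To make the idea of edit distance concrete, here is a minimal sketch of the classic two-row dynamic-programming Levenshtein computation. The class name `EditDistanceDemo` and the sample pair `"abc"` / `"xyz"` are hypothetical illustrations, not taken from the original answer; unlike the question's code, this version does not lowercase its inputs.

```java
final class EditDistanceDemo {

    // Levenshtein distance: the minimum number of single-character
    // substitutions, insertions, and deletions that turn a into b.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j characters of b
        // takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = prev[j - 1] + cost;
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // The row just computed becomes the previous row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With this sketch, `EditDistanceDemo.editDistance("abc", "xyz")` is 3 (three substitutions), and the two strings from the question are also 3 edits apart: one substitution (`the` to `tje`) and two deletions (the extra ` .`).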

So you can get an approximate percentage of difference by dividing the edit distance by the length of S1 and capping the result at 100% with a `min`.

The `min` is a workaround for cases where the strings are very different, for example when S2 is much longer than S1, so that the edit distance is bigger than the length of S1 and the percentage would come out above 100%. Dividing by the larger of the two lengths is probably better.
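
The two variants discussed here can be sketched as follows. This is an illustration under assumptions: the exact formula from the original answer is not shown, so the sketch assumes "edit distance divided by the length of S1, capped with `min`" for the first variant; the class and method names are hypothetical. Both methods take a precomputed edit distance so the block stays self-contained.

```java
final class DifferencePercent {

    // Assumed first variant: difference relative to the length of S1,
    // capped at 100% with min.
    static double cappedByFirst(int editDistance, int lengthS1) {
        if (lengthS1 == 0) {
            // Avoid dividing by zero: an empty S1 is either identical
            // to S2 (distance 0) or completely different.
            return editDistance == 0 ? 0.0 : 100.0;
        }
        return Math.min(100.0, 100.0 * editDistance / lengthS1);
    }

    // Second variant: difference relative to the longer string. No cap is
    // needed because the edit distance never exceeds the longer length.
    static double byLonger(int editDistance, int lengthS1, int lengthS2) {
        int longer = Math.max(lengthS1, lengthS2);
        if (longer == 0) {
            return 0.0; // both strings are empty, so they are identical
        }
        return 100.0 * editDistance / longer;
    }
}
```

For example, `"abc"` against `"abcdefgh"` has an edit distance of 5 (five insertions). Relative to S1 that would be about 167%, so `cappedByFirst(5, 3)` returns 100.0, while `byLonger(5, 3, 8)` returns 62.5, which reflects the shared prefix better.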

Hope that helps.

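Your `similarity` method returns a number between 0 and 1 (both ends inclusive), where 1 means that the strings are the same (edit distance is zero). However, in your `authQuestion` method you are acting as if it returns a number between 0 and 100, as evidenced by the line that compares `re` against the threshold.

You need to change that comparison to use a threshold on the 0-to-1 scale, or scale the result of `similarity` to a percentage before comparing.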
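
As a minimal sketch of keeping the value and the threshold on the same scale: the 0.6 cutoff below is taken from the updated code in the question, while the class and method names are hypothetical and only serve the illustration.

```java
final class ThresholdCheck {

    // Option 1: keep the similarity as a fraction in [0, 1] and compare it
    // with a fractional threshold.
    static boolean isSimilarAsFraction(double similarity) {
        return similarity > 0.6;
    }

    // Option 2: convert the similarity to a percentage first and compare it
    // with a percentage threshold.
    static boolean isSimilarAsPercent(double similarity) {
        double percent = similarity * 100;
        return percent > 60;
    }

    // The mismatch described above: a fraction in [0, 1] compared with a
    // percentage threshold can never pass.
    static boolean mismatchedScales(double similarity) {
        return similarity > 60;
    }
}
```

With a similarity of 0.97, both `isSimilarAsFraction` and `isSimilarAsPercent` return `true`, while `mismatchedScales` returns `false` for every possible return value of `similarity`.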

Since you are using your entire S1 in the where clause of your SQL query, it will either find a perfect match or won't return any result at all.

As mentioned by @ErwinBolwidt, if it returns nothing then your `isQuestionAvailable` will always remain false. And if it returns a perfect match then you are bound to get 100% similarity.

What you can do is use a substring of your S1 to search for questions that match that part. You can make the following changes to the `authQuestion` method: search with the substring, and then, out of the results fetched, compare each result with your question for similarity.
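
A rough sketch of that flow follows. It is not the original answer's code, and several details are assumptions made for illustration: the `WORD` column is assumed to hold the stored question text, the search key is simply the first few characters of the question, the match uses a portable `LOWER(...) LIKE` instead of the question's `~*` operator, and the names `QuestionLookup`, `searchKey`, `fetchCandidates`, and `anySimilar` are hypothetical. The similarity function is passed in so that the `similarity` method from the question can be plugged in.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class QuestionLookup {

    // Hypothetical helper: use at most the first maxLength characters of
    // the (trimmed) question as the search key.
    static String searchKey(String question, int maxLength) {
        String trimmed = question.trim();
        return trimmed.length() <= maxLength ? trimmed : trimmed.substring(0, maxLength);
    }

    // Fetch stored questions that contain the key, ignoring case.
    // Note: '%' and '_' inside the key act as LIKE wildcards.
    static List<String> fetchCandidates(Connection dbCon, String key) throws SQLException {
        String query = "SELECT WORD FROM WORDBANK WHERE LOWER(WORD) LIKE LOWER(?)";
        List<String> candidates = new ArrayList<>();
        try (PreparedStatement stmt = dbCon.prepareStatement(query)) {
            stmt.setString(1, "%" + key + "%");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    candidates.add(rs.getString("WORD"));
                }
            }
        }
        return candidates;
    }

    // Return true as soon as one candidate is similar enough, so that a
    // later, less similar row cannot reset the result.
    static boolean anySimilar(List<String> candidates, String question,
                              ToDoubleBiFunction<String, String> similarity,
                              double threshold) {
        for (String candidate : candidates) {
            if (similarity.applyAsDouble(candidate, question) > threshold) {
                return true;
            }
        }
        return false;
    }
}
```

Returning on the first match is a deliberate choice: in the question's loop, `isQuestionAvailable` is overwritten on every row, so only the last row fetched decides the result.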