How do I find the percentage of similarity between two multiline Strings?

Question

I have two multi-line strings, and I'm using the following code to determine how similar they are. It uses the Levenshtein distance algorithm.

  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { 
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }

    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

But the above code is not working as expected.

For instance, say we have the following two strings, s1 and s2:

S1 -> How do we optimize the performance? . What should we do to compare both strings to find the percentage of similarity between both?

S2 -> How do we optimize tje performance? What should we do to compare both strings to find the percentage of similarity between both?

I pass these two strings to the similarity method, but it does not return the percentage of difference I expect. How do I fix or optimize the algorithm?

Update: the following is the method that calls similarity:

public static boolean authQuestion(String question) throws SQLException{


        boolean isQuestionAvailable = false;
        Connection dbCon = null;
        try {
            dbCon = MyResource.getConnection();
            String query = "SELECT * FROM WORDBANK where WORD ~*  ?;";
            PreparedStatement checkStmt = dbCon.prepareStatement(query);
            checkStmt.setString(1, question);
            ResultSet rs = checkStmt.executeQuery();
            while (rs.next()) {
                double re = similarity(rs.getString("question"), question);
                if (re > 0.6) {
                    isQuestionAvailable = true;
                } else {
                    isQuestionAvailable = false;
                }
            }
        } catch (URISyntaxException e1) {
            e1.printStackTrace();
        } catch (SQLException sqle) {
            sqle.printStackTrace();
        } catch (Exception e) {
            if (dbCon != null)
                dbCon.close();
        } finally {
            if (dbCon != null)
                dbCon.close();
        }

        return isQuestionAvailable;
    }

java | algorithm | levenshtein-distance | 2017-01-03 06:01 | 3 Answers

Answers to How do I find the percentage of similarity between two multiline Strings? ( 3 )

  1. 2017-01-03 07:01

    I can suggest an approach.

    You are using edit distance, which gives you the number of characters in S1 that you need to change, add, or remove in order to turn it into S2.

    So, for example:

    S1 = "abc"
    S2 = "cde"
    

    The edit distance is 3, so the strings are 100% different (assuming a character-by-character comparison).

    So you can get an approximate percentage of difference with:

    S1 = "abc"
    S2 = "cde"
    edit = edit_distance(S1, S2)
    percentage = min(edit/S1.length(), edit/S2.length())
    

    The min is a workaround for cases where the strings differ greatly in length, for example:

    S1 = "abc"
    S2 = "defghijklmno"
    

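    Here the edit distance (12) is larger than the length of S1 (3), so dividing by S1.length() alone would give more than 100%. Taking the min of the two ratios amounts to dividing by the longer length, so it is probably simpler to divide by the longer of the two lengths directly.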

    hope that helps
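
As a minimal sketch of the normalization described in this answer, the Java helper below divides an already computed edit distance by the longer of the two string lengths. The names `DifferencePercentage` and `differencePercent` are hypothetical and do not come from the question's code, and returning 0 for two empty strings is an assumption.

```java
final class DifferencePercentage {

    /**
     * Converts an edit distance into a percentage of difference (0.0 to 100.0)
     * by dividing by the longer of the two string lengths.
     * Assumption: two empty strings are treated as 0% different.
     */
    static double differencePercent(int editDistance, int length1, int length2) {
        int longer = Math.max(length1, length2);
        if (longer == 0) {
            return 0.0;
        }
        return 100.0 * editDistance / longer;
    }

    public static void main(String[] args) {
        // "abc" vs "cde": edit distance 3, both lengths 3
        System.out.println(differencePercent(3, 3, 3));   // prints 100.0
        // "abc" vs "defghijklmno": edit distance 12, lengths 3 and 12
        System.out.println(differencePercent(12, 3, 12)); // prints 100.0
    }
}
```

Because an edit distance can never exceed the longer length, the result stays within 0 to 100 whenever the inputs come from a real pair of strings.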

  2. 2017-01-03 08:01

    Your similarity method returns a number between 0 and 1 (both ends inclusive) where one means that the strings are the same (edit distance is zero).

    However, in your authQuestion method you act as if it returns a number between 0 and 100, as this line shows:

    if(re > 60){
    

    You need to change that to

    if(re > .6){
    

    Or to

    if(re * 100 > 60){
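
To make the scale concrete, here is a small self-contained sketch that runs the two sample strings from the question through a similarity function. It uses a textbook two-row Levenshtein implementation rather than the question's exact code, and the class name `SimilarityScaleDemo` is made up for illustration.

```java
final class SimilarityScaleDemo {

    /** Case-insensitive Levenshtein distance, keeping only two rows of the table. */
    static int editDistance(String a, String b) {
        a = a.toLowerCase();
        b = b.toLowerCase();
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // the row just filled becomes the previous row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns a value between 0.0 and 1.0, where 1.0 means the strings are identical. */
    static double similarity(String s1, String s2) {
        int longer = Math.max(s1.length(), s2.length());
        if (longer == 0) {
            return 1.0;
        }
        return (longer - editDistance(s1, s2)) / (double) longer;
    }

    public static void main(String[] args) {
        String s1 = "How do we optimize the performance? . What should we do to compare "
                + "both strings to find the percentage of similarity between both?";
        String s2 = "How do we optimize tje performance? What should we do to compare "
                + "both strings to find the percentage of similarity between both?";
        double re = similarity(s1, s2);
        System.out.println(re);       // a value just below 1.0
        System.out.println(re > 60);  // prints false: the result is never above 1.0
        System.out.println(re > 0.6); // prints true
    }
}
```

With the threshold written as `0.6` the check passes for these strings; written as `60` it can never pass, because the returned value never exceeds 1.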
    
  3. 2017-01-03 11:01

    Since you use your entire S1 in the WHERE clause of your SQL query, it will either find a perfect match or return no result at all.

    As mentioned by @ErwinBolwidt, if it returns nothing then isQuestionAvailable will always remain false. And if it returns a perfect match, you are bound to get 100% similarity.

    What you can do is use a substring of your S1 to search for questions that match that part.

    You can make the following change in the authQuestion method:

    checkStmt.setString(1, question.substring(0, Math.min(20, question.length()))); // 20 is an arbitrary prefix length
    

    Out of the results fetched, you can compare each result with your question for similarity.
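
The comparison step could look roughly like the sketch below. To keep it runnable without a database, it takes the fetched questions as a plain list and the similarity function as a parameter; `CandidateMatcher` and `anySimilar` are hypothetical names, and stopping at the first candidate above the threshold is an assumption about the desired behavior. Note that the loop in the question overwrites isQuestionAvailable on every row, so only the last row decides the result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class CandidateMatcher {

    /**
     * Returns true as soon as any candidate scores above the threshold.
     * The similarity function is expected to return a value between 0.0 and 1.0.
     */
    static boolean anySimilar(List<String> candidates, String question, double threshold,
                              ToDoubleBiFunction<String, String> similarity) {
        for (String candidate : candidates) {
            if (similarity.applyAsDouble(candidate, question) > threshold) {
                return true; // a later, less similar row can no longer reset the result
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in similarity for the demo only: 1.0 for equal strings (ignoring case), else 0.0.
        ToDoubleBiFunction<String, String> exact =
                (a, b) -> a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        List<String> rows = Arrays.asList("How do we optimize?", "Something unrelated");
        System.out.println(anySimilar(rows, "how do we optimize?", 0.6, exact)); // prints true
    }
}
```

In authQuestion, the list would be filled from the ResultSet and the question's own similarity method passed in as the function.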
