deletion distance between words

Question

I'm trying to find how many characters I would need to delete to make the two words the same. For instance "at", "cat" would be 1 because I could delete the c, "boat" and "got" would be 3 because I could delete the b,a and g to make it ot. I put the words into a dictionary with their count as the value. Then I iterate over the dictionary and see if that key exists in the other dictionary otherwise I add 1 to the difference. Is this a very inefficient algorithm?

But it is overestimating the number of deletions I need.

def deletiondistance(firstword, secondword):
dfw = {}
dsw = {}
diff = 0
for i in range(len(firstword)):
    print firstword[i]
    if firstword[i] in dfw:
        dfw[firstword[i]]+=1
    else:
        dfw[firstword[i]]=1
for j in range(len(secondword)):
    if secondword[j] in dsw:
        dsw[secondword[j]] +=1
    else:
        dsw[secondword[j]]=1

for key, value in dfw.iteritems():

    if key in dsw:
        #print "key exists"
        pass

    else:
        diff +=1

print "diff",diff

Show source
| python   | algorithm   2016-12-22 04:12 3 Answers

Answers ( 3 )

  1. 2016-12-22 04:12

    I think your goal is similar to levenshtein distance.

    Levenshtein distance is a metric for measuring distance between 2 strings.

    Here is a wiki-link. https://en.wikipedia.org/wiki/Levenshtein_distance

    And here is pypi package for levenshtein distance. https://pypi.python.org/pypi/python-Levenshtein

  2. 2016-12-22 07:12

    You can use difflib for this.

    Example:

    import difflib
    
    cases=[('cat','bat'),
           ('bat','cat'),
           ('broom', 'ballroom'),
           ('boat','got')]
    
    for a,b in cases:     
        print('{} => {}'.format(a,b)) 
        cnt=0
        for i,s in enumerate(difflib.ndiff(a, b)):
            if s[0]==' ': continue
            elif s[0]=='-':
                print(u'Delete "{}" from position {}'.format(s[-1],i))
            elif s[0]=='+':
                print(u'Add "{}" to position {}'.format(s[-1],i))    
            cnt+=1  
        print("total=",cnt,"\n")
    

    Prints:

    cat => bat
    Delete "c" from position 0
    Add "b" to position 1
    total= 2 
    
    bat => cat
    Delete "b" from position 0
    Add "c" to position 1
    total= 2 
    
    broom => ballroom
    Add "a" to position 1
    Add "l" to position 2
    Add "l" to position 3
    total= 3 
    
    boat => got
    Delete "b" from position 0
    Add "g" to position 1
    Delete "a" from position 3
    total= 3 
    
  3. 2016-12-22 09:12

    As @Hulk mentioned this is similar to levenshtein distance. The only difference is that substitutions are not allowed but that can be rectified by using substitution cost of 2 which is the same as removing character from both strings. Example:

    def dist(s1, s2):
        cur = list(range(len(s2) + 1))
        prev = [0] * (len(s2) + 1)
        for i in range(len(s1)):
            cur, prev = prev, cur
            cur[0] = i + 1
            for j in range(len(s2)):
                # Substitution is same as two deletions
                sub = 0 if s1[i] == s2[j] else 2
                cur[j+1] = min(prev[j] + sub, cur[j] + 1, prev[j+1] + 1)
    
        return cur[-1]
    
    cases=[('cat','bat'),
           ('bat','cat'),
           ('broom', 'ballroom'),
           ('boat','got'),
           ('foo', 'bar'),
           ('foobar', '')]
    
    for s1, s2 in cases:
        print('{} & {} = {}'.format(s1, s2, dist(s1, s2)))
    

    Output:

    cat & bat = 2
    bat & cat = 2
    broom & ballroom = 3
    boat & got = 3
    foo & bar = 6
    foobar &  = 6
    
◀ Go back