Matching IDs Between Pandas DataFrames and Applying Function

Question

I have two data frames that look like the following:

df_A:

ID    x     y
a     0     0
c     3     2
b     2     5

df_B:

ID    x     y
a     2     1
c     3     5
b     1     2

I want to add a column to df_B that is the Euclidean distance between the x,y coordinates in df_B and the x,y coordinates in df_A for each identifier. The desired result would be:

ID    x     y    dist
a     2     1    2.236
c     3     5    3
b     1     2    3.162

The identifiers are not necessarily going to be in the same order. I know how to do this by looping through the rows of df_A and finding the matching ID in df_B, but I was hoping to avoid using a for loop since this will be used on data with tens of millions of rows. Is there some way to use apply but condition it on matching IDs?
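
For reference, the two frames can be reconstructed with a minimal sketch like the one below (column names and sample values are taken from the tables above; nothing else is assumed):

    import pandas as pd

    # Sample frames from the question; 'ID' is a plain column at this point.
    df_A = pd.DataFrame({'ID': ['a', 'c', 'b'], 'x': [0, 3, 2], 'y': [0, 2, 5]})
    df_B = pd.DataFrame({'ID': ['a', 'c', 'b'], 'x': [2, 3, 1], 'y': [1, 5, 2]})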


Tags: python | performance | numpy | apply | pandas (asked 2016-12-24 23:12)

Answers ( 3 )

  1. 2016-12-24 23:12

    If ID isn't the index, make it so.

    # Make 'ID' the index of both frames so the arithmetic aligns on it.
    df_B.set_index('ID', inplace=True)
    df_A.set_index('ID', inplace=True)

    # Row-wise Euclidean distance: squared differences, summed per row, then sqrt.
    df_B['dist'] = ((df_A - df_B) ** 2).sum(axis=1) ** 0.5
    

    Since index and columns are already aligned, simply doing the math should just work.
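
    As a quick sanity check, here is a hedged end-to-end sketch of this answer with the IDs deliberately in different orders; the subtraction still pairs rows correctly because it aligns on the index:

    import pandas as pd

    df_A = pd.DataFrame({'ID': ['a', 'c', 'b'], 'x': [0, 3, 2], 'y': [0, 2, 5]})
    df_B = pd.DataFrame({'ID': ['c', 'a', 'b'], 'x': [3, 2, 1], 'y': [5, 1, 2]})

    A = df_A.set_index('ID')
    B = df_B.set_index('ID')

    # Alignment on the shared 'ID' index makes row order irrelevant.
    B['dist'] = ((A - B) ** 2).sum(axis=1) ** 0.5

    print(B.reset_index())   # 'ID' back as a column: a -> 2.236, b -> 3.162, c -> 3.0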

  2. 2016-12-25 00:12

    A solution using the sklearn.metrics.pairwise.paired_distances method:

    In [73]: A
    Out[73]:
        x  y
    ID
    a   0  0
    c   3  2
    b   2  5
    
    In [74]: B
    Out[74]:
        x  y
    ID
    a   2  1
    c   3  5
    b   1  2
    
    In [75]: from sklearn.metrics.pairwise import paired_distances
    
    In [76]: B['dist'] = paired_distances(B, A)
    
    In [77]: B
    Out[77]:
        x  y      dist
    ID
    a   2  1  2.236068
    c   3  5  3.000000
    b   1  2  3.162278
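
    One caveat worth noting: paired_distances pairs rows by position, not by index, so the snippet above relies on A and B already being in the same ID order. If the order can differ, reindexing A to B's index first keeps the pairing correct; a minimal sketch under that assumption:

    import pandas as pd
    from sklearn.metrics.pairwise import paired_distances

    A = pd.DataFrame({'x': [0, 3, 2], 'y': [0, 2, 5]}, index=['a', 'c', 'b'])
    B = pd.DataFrame({'x': [3, 2, 1], 'y': [5, 1, 2]}, index=['c', 'a', 'b'])

    # Reorder A's rows to match B's index, then pair row by row.
    B['dist'] = paired_distances(B[['x', 'y']], A.reindex(B.index))
    print(B)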
    
  3. 2016-12-25 08:12

    For performance, you might want to work with NumPy arrays, and for Euclidean distance computations between corresponding rows, np.einsum does it pretty efficiently.

    Incorporating the row alignment (matching df_A's rows to df_B's ID order), here's an implementation -

    import numpy as np

    # Get sorted row indices for dataframe-A
    sidx = df_A.index.argsort()
    idx = sidx[df_A.index.searchsorted(df_B.index, sorter=sidx)]

    # Sort A rows accordingly and get the elementwise differences against B
    s = df_A.values[idx] - df_B.values

    # Use einsum to square and sum each row and finally sqrt for distances
    df_B['dist'] = np.sqrt(np.einsum('ij,ij->i', s, s))
    

    Sample input, output -

    In [121]: df_A
    Out[121]: 
       0  1
    a  0  0
    c  3  2
    b  2  5
    
    In [122]: df_B
    Out[122]: 
       0  1
    c  3  5
    a  2  1
    b  1  2
    
    In [124]: df_B  # After code run
    Out[124]: 
       0  1      dist
    c  3  5  3.000000
    a  2  1  2.236068
    b  1  2  3.162278
    

    Here's a runtime test comparing einsum against a few other approaches.
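
    The original timings are not reproduced here, but a hedged sketch of how such a comparison could be run on larger, shuffled data (sizes and values below are made up for illustration):

    import numpy as np
    import pandas as pd
    from timeit import timeit

    # Hypothetical benchmark data: 10**6 rows, df_B's IDs shuffled relative to df_A.
    n = 10**6
    ids = np.arange(n)
    df_A = pd.DataFrame(np.random.rand(n, 2), index=ids, columns=['x', 'y'])
    df_B = pd.DataFrame(np.random.rand(n, 2), index=np.random.permutation(ids),
                        columns=['x', 'y'])

    def pandas_aligned():
        # Index-aligned subtraction (answer 1).
        return ((df_A - df_B) ** 2).sum(axis=1) ** 0.5

    def sklearn_paired():
        # paired_distances after reindexing (answer 2 plus explicit alignment).
        from sklearn.metrics.pairwise import paired_distances
        return paired_distances(df_B.values, df_A.reindex(df_B.index).values)

    def einsum_sorted():
        # searchsorted + einsum (this answer).
        sidx = df_A.index.argsort()
        idx = sidx[df_A.index.searchsorted(df_B.index, sorter=sidx)]
        s = df_A.values[idx] - df_B.values
        return np.sqrt(np.einsum('ij,ij->i', s, s))

    print('pandas aligned :', timeit(pandas_aligned, number=10))
    print('sklearn paired :', timeit(sklearn_paired, number=10))
    print('einsum sorted  :', timeit(einsum_sorted, number=10))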
