## Matching IDs Between Pandas DataFrames and Applying Function

Question

I have two data frames that look like the following:

df_A:

```
ID    x     y
a     0     0
c     3     2
b     2     5
```

df_B:

```
ID    x     y
a     2     1
c     3     5
b     1     2
```

I want to add a column to df_B that is the Euclidean distance between the x, y coordinates in df_B and those in df_A for each identifier. The desired result would be:

```
ID    x     y    dist
a     2     1    2.236
c     3     5    3
b     1     2    3.162
```

The identifiers are not necessarily going to be in the same order. I know how to do this by looping through the rows of df_A and finding the matching ID in df_B, but I was hoping to avoid using a for loop since this will be used on data with tens of millions of rows. Is there some way to use apply but condition it on matching IDs?


## Answers (3)

1. If `ID` isn't the index, make it so.

```
df_B.set_index('ID', inplace=True)
df_A.set_index('ID', inplace=True)

df_B['dist'] = ((df_A - df_B) ** 2).sum(axis=1) ** 0.5
```

Since index and columns are already aligned, simply doing the math should just work.
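Put together, a minimal end-to-end sketch of this approach (the frame contents are taken from the question, with df_B's IDs deliberately shuffled to show that alignment handles the matching):

```python
import pandas as pd

# Sample frames from the question; note the IDs are in different orders.
df_A = pd.DataFrame({'ID': ['a', 'c', 'b'], 'x': [0, 3, 2], 'y': [0, 2, 5]})
df_B = pd.DataFrame({'ID': ['c', 'b', 'a'], 'x': [3, 1, 2], 'y': [5, 2, 1]})

df_A = df_A.set_index('ID')
df_B = df_B.set_index('ID')

# Subtraction aligns rows by index label, so row order doesn't matter.
df_B['dist'] = ((df_A - df_B) ** 2).sum(axis=1) ** 0.5

print(df_B.round(3))
```

Because the arithmetic aligns on the index labels, no explicit sort or merge step is needed.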

2. Solution which uses sklearn.metrics.pairwise.paired_distances method:

```
In [73]: A
Out[73]:
    x  y
ID
a   0  0
c   3  2
b   2  5

In [74]: B
Out[74]:
    x  y
ID
a   2  1
c   3  5
b   1  2

In [75]: from sklearn.metrics.pairwise import paired_distances

In [76]: B['dist'] = paired_distances(B, A)

In [77]: B
Out[77]:
    x  y      dist
ID
a   2  1  2.236068
c   3  5  3.000000
b   1  2  3.162278
```
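One caveat: `paired_distances` compares rows positionally, so the transcript above relies on `A` and `B` listing the IDs in the same order. When they don't (as the question allows), align first with `reindex`. A minimal sketch, using plain NumPy for the row-wise distances and the question's values with `B`'s rows shuffled:

```python
import numpy as np
import pandas as pd

# Frames with mismatched row order.
A = pd.DataFrame({'x': [0, 3, 2], 'y': [0, 2, 5]}, index=['a', 'c', 'b'])
B = pd.DataFrame({'x': [3, 1, 2], 'y': [5, 2, 1]}, index=['c', 'b', 'a'])

# Align A's rows to B's index before any positional row-by-row comparison.
A_aligned = A.reindex(B.index)

# Row-wise Euclidean distance on the aligned arrays
# (equivalent to paired_distances(B, A_aligned)).
B['dist'] = np.linalg.norm(A_aligned.values - B.values, axis=1)
```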
3. For performance, you might want to work with NumPy arrays; for Euclidean distance computations between corresponding rows, `np.einsum` can do it pretty efficiently.

Incorporating the row alignment, here's an implementation:

```python
import numpy as np

# Get row indices that align df_A's index to df_B's index
sidx = df_A.index.argsort()
idx = sidx[df_A.index.searchsorted(df_B.index, sorter=sidx)]

# Reorder A's rows accordingly and take elementwise differences against B
s = df_A.values[idx] - df_B.values

# Use einsum to square and sum each row, then sqrt for the distances
df_B['dist'] = np.sqrt(np.einsum('ij,ij->i', s, s))
```

Sample input, output -

```
In [121]: df_A
Out[121]:
   0  1
a  0  0
c  3  2
b  2  5

In [122]: df_B
Out[122]:
   0  1
c  3  5
a  2  1
b  1  2

In [124]: df_B  # after running the code above
Out[124]:
   0  1      dist
c  3  5  3.000000
a  2  1  2.236068
b  1  2  3.162278
```

At tens of millions of rows, it's worth running a runtime test comparing `einsum` against the other approaches on your own data.
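As a starting point, here's a minimal timing sketch comparing the index-aligned pandas arithmetic from the first answer against the `searchsorted` + `einsum` route (the row count and random data are arbitrary placeholders; results will vary with your hardware and data):

```python
import timeit
import numpy as np
import pandas as pd

# Two frames with shuffled integer IDs; n is an arbitrary test size.
n = 10 ** 5
ids = np.arange(n)
rng = np.random.default_rng(0)
df_A = pd.DataFrame(rng.random((n, 2)), columns=['x', 'y'], index=ids)
df_B = pd.DataFrame(rng.random((n, 2)), columns=['x', 'y'],
                    index=rng.permutation(ids))

def pandas_align():
    # Index-aligned arithmetic (answer 1).
    return ((df_A - df_B) ** 2).sum(axis=1) ** 0.5

def numpy_einsum():
    # searchsorted + einsum (answer 3).
    sidx = df_A.index.argsort()
    idx = sidx[df_A.index.searchsorted(df_B.index, sorter=sidx)]
    s = df_A.values[idx] - df_B.values
    return np.sqrt(np.einsum('ij,ij->i', s, s))

for fn in (pandas_align, numpy_einsum):
    t = timeit.timeit(fn, number=10)
    print(f'{fn.__name__}: {t / 10:.4f} s per call')
```

Both routes produce the same distances (up to row order), so the comparison is purely about speed.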