[SOLVED] How can I speed up closest point comparison using cdist or tensorflow?


I have two sets of points, one is a map consisting of x,y coordinates, and the second is a path of x,y coordinates. I’m trying to find the closest map points to my path points, pretty simple. Except my map is 380000 points and my paths (of which I have several) each consist of ~ 350000 points themselves.
Other than sampling my data to get smaller datasets, I’m trying to find a faster way to accomplish this task.

base algorithm:

import pandas as pd
from scipy.spatial.distance import cdist

def closeset_point(point, points):
    return points[cdist([point], points).argmin()]

# log['point'].shape; 333000
# map_data['point'].shape; 380000
closest = [closest_point(log_p, list(map_data['point'])) for log_p in log['point']]

as per this example: Find closest point in Pandas DataFrames

After converting this to a tqdm progress bar to see how long it would take (as it was taking a while, obviously), I noticed it would take about 10hrs to complete.

tqdm loop:

for i in trange(len(log), desc='finding closest points'):
    closest.append(closest_point(log['point'].loc[i], list(map_data['point'])))
>> finding closest points: 5%|   | 16432/333456 [32:11<10:13:52], 8.60it/s

While 10 hours is not impossible, I wonder if there is a way to speed this up? I have a solid gpu/cpu/ram at my disposal so I feel this should be doable. I’m also learning tensorflow (but honestly my math is atrocious so I’m very in the dark with it)

Any ideas on how to speed this up with either multi-threading, gpu computation, tensorflow or some other sort of wizardry?
inb4 python is slow 😉

blue is map, green is path, orange is closest point

*edit: image shows what i’m trying to do. green is path, blue is map, orange is what I’m trying to find.


The following is a mini example of what you’re trying to do. Considers the variable coords1 as your variable log['point'] and coords2 as your variable log['point']. The end result is the index of the coord2 closest to coord1.

from scipy.spatial import distance
import numpy as np
coords1 = [(35.0456, -85.2672),
          (35.1174, -89.9711),
          (35.9728, -83.9422),
          (36.1667, -86.7833)]
coords2 = [(35.0456, -85.2672),
          (35.1174, -89.9711),
          (35.9728, -83.9422),
          (34.9728, -83.9422),
          (36.1667, -86.7833)]

tmp = distance.cdist(coords1, coords2, "sqeuclidean") #  sqeuclidean based on Mark Setchell comment to improve speed further
result = np.argmin(tmp,1)
# result: array([0, 1, 2, 4])

This should be way faster, because it’s done everything in one iteration.

Answered By – silgon

Answer Checked By – Gilberto Lyons (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *