[SOLVED] How to sort strings in pandas dataframe or check equivalency without regard to order?


I have a data frame ex and want to compare two variables bt_x, bt_y (the data frame is about 20K rows; will reach 150K+ eventually).

I would like to compare strings, but the words are in different order. If the sets are otherwise exactly equivalent i.e. ‘abc’ = ‘cab’, but ‘abc’ != ‘cabe’ then I want to show a match, otherwise a nonmatch.

Currently code is set up like this for all variables:

ex['bt M'] = np.where(ex['bt_x'] == ex['bt_y'], 1, 0)

Most variables are numbers, but there are a couple of strings where I would like to ignore the order, or sort. I tried this:

ex['bt_x_2'] = ''.join(sorted(ex['bt_x']))

The new variable ‘bt_x_2’ appears to contain the sorted result of all rows of ‘bt_x’ in the data frame in every row. I want to sort the result in each row, independently of all others. In other words: apply

ex['bt_x_2'] = ''.join(sorted(ex['bt_x']))

or some other method to each row. I’ll do this for both comparison strings, then check for equivalency. If there is a better way then great. I’d love to hear about it. I’ve looked for a good way to do this here and there on a handful of occasions. I’ve written a bunch of regex rules before, but would be best not to do this.

Example data frame:

File Name: "file 1.pdf", "file 2.pdf"
bt_x: "Series A + Series B; Series C + D; No Common Shares", "series A-1 + B-1" 
bt_y: "Series C + D; No Common Shares; Series A + Series B", series B-1 + A-1  
dividends_x: .08, .667  
dividends_y: .11, .06667

Desired output (the spaces and other symbols are ultimately unimportant to match):

bt_x: "Series A + Series B; Series C + D; No Common Shares", "series A-1 + B-1" 
bt_y: "Series A + Series B, Series C + D; No Common Shares; , series A-1 + B-1  

I am good with stripping the characters " ; or + or – or ""

Basically, is there an exactly matching set of words and letters in bt_x and bt_y is the question I want to answer in a third column bt_M(1,0)


You could use apply combined with collections.Counter:

import pandas as pd
from collections import Counter

data = [['abc', 'bca'],
        ['aab', 'aba'],
        ['abc', 'cabe']]

df = pd.DataFrame(data=data, columns=['A', 'B'])

df['C_A'] = df.A.apply(Counter)
df['C_B'] = df.B.apply(Counter)

mask = df.C_A == df.C_B


0     True
1     True
2    False
dtype: bool

Counter creates a dictionary with the counts of each letter in the string, eg:

'abc' -> Counter({'a': 1, 'b': 1, 'c': 1})

Counters are equals if and only both the keys and the counts are equal, i.e, if and only if the strings are equals regardless of the order of the characters. This solution is O(n) for each string vs the sorted approach that is O(n*logn).

Answered By – Dani Mesejo

Answer Checked By – Gilberto Lyons (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *