Issue
I have a data frame ex
and want to compare two variables bt_x
, bt_y
(the data frame is about 20K rows; will reach 150K+ eventually).
I would like to compare strings, but the words are in different order. If the sets are otherwise exactly equivalent i.e. ‘abc’ = ‘cab’, but ‘abc’ != ‘cabe’ then I want to show a match, otherwise a nonmatch.
Currently code is set up like this for all variables:
ex['bt M'] = np.where(ex['bt_x'] == ex['bt_y'], 1, 0)
Most variables are numbers, but there are a couple of strings where I would like to ignore the order, or sort. I tried this:
ex['bt_x_2'] = ''.join(sorted(ex['bt_x']))
The new variable ‘bt_x_2’ appears to contain the sorted result of all rows of ‘bt_x’ in the data frame in every row. I want to sort the result in each row, independently of all others. In other words: apply
ex['bt_x_2'] = ''.join(sorted(ex['bt_x']))
or some other method to each row. I’ll do this for both comparison strings, then check for equivalency. If there is a better way then great. I’d love to hear about it. I’ve looked for a good way to do this here and there on a handful of occasions. I’ve written a bunch of regex rules before, but would be best not to do this.
Example data frame:
File Name: "file 1.pdf", "file 2.pdf"
bt_x: "Series A + Series B; Series C + D; No Common Shares", "series A-1 + B-1"
bt_y: "Series C + D; No Common Shares; Series A + Series B", series B-1 + A-1
dividends_x: .08, .667
dividends_y: .11, .06667
Desired output (the spaces and other symbols are ultimately unimportant to match):
bt_x: "Series A + Series B; Series C + D; No Common Shares", "series A-1 + B-1"
bt_y: "Series A + Series B, Series C + D; No Common Shares; , series A-1 + B-1
I am good with stripping the characters " ; or + or – or ""
Basically, is there an exactly matching set of words and letters in bt_x
and bt_y
is the question I want to answer in a third column bt_M(1,0)
Solution
You could use apply combined with collections.Counter:
import pandas as pd
from collections import Counter
data = [['abc', 'bca'],
['aab', 'aba'],
['abc', 'cabe']]
df = pd.DataFrame(data=data, columns=['A', 'B'])
df['C_A'] = df.A.apply(Counter)
df['C_B'] = df.B.apply(Counter)
mask = df.C_A == df.C_B
print(mask)
Output
0 True
1 True
2 False
dtype: bool
Counter creates a dictionary with the counts of each letter in the string, eg:
'abc' -> Counter({'a': 1, 'b': 1, 'c': 1})
Counters are equals if and only both the keys and the counts are equal, i.e, if and only if the strings are equals regardless of the order of the characters. This solution is O(n)
for each string vs the sorted approach that is O(n*logn)
.
Answered By – Dani Mesejo
Answer Checked By – Gilberto Lyons (BugsFixing Admin)