[SOLVED] How can I remove duplicate parts of a string, within a column and sort the values

Issue

I have a dataframe like this (assuming one column):

column
[A,C,B,A]
[HELLO,HELLO,ha]
[test/1, test/1, test2]

The type of the column above is:
dtype('O')

I would like to remove the duplicates here, resulting in:

column
[A,C,B]          # removing one A
[HELLO, ha]      # removing one HELLO
[test/1, test2]  # removing one test/1

Then, I would like to sort the data:

column
[A,B,C]
[ha, HELLO]
[test2, test/1]  # assuming that digits sort before /

I am struggling to get this done properly. I hope someone has a good idea (would it make sense to transform the values into small lists?).

Solution

Assuming that you have lists in the column, use a list comprehension.

If you want to maintain order:

df['column_keep_order'] = [list(dict.fromkeys(x)) for x in df['column']]

If you want to sort the items:

df['column_sorted'] = [sorted(set(x)) for x in df['column']]

output:

                    column column_keep_order    column_sorted
0             [A, C, B, A]         [A, C, B]        [A, B, C]
1       [HELLO, HELLO, ha]       [HELLO, ha]      [HELLO, ha]
2  [test/1, test/1, test2]   [test/1, test2]  [test/1, test2]
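
Note that plain sorted() compares strings by code point, so uppercase letters come before lowercase ones and '/' comes before digits. That is why the output above has [HELLO, ha] and [test/1, test2] rather than the order shown in the question. If a case-insensitive order is wanted, one possible sketch is to pass a key to sorted(); the column name column_sorted_ci is only an illustrative choice, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'column': [['A', 'C', 'B', 'A'],
                              ['HELLO', 'HELLO', 'ha'],
                              ['test/1', 'test/1', 'test2']]})

# Deduplicate with set(), then sort ignoring case.
# The original string is used as a tie-breaker so that items differing
# only by case always come out in the same order.
df['column_sorted_ci'] = [sorted(set(x), key=lambda s: (s.casefold(), s))
                          for x in df['column']]

print(df['column_sorted_ci'].tolist())
```

This puts ha before HELLO. The third row still comes out as [test/1, test2], because '/' precedes digits in code point order; getting test2 first would need a custom key that encodes whatever ordering rule is intended.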

reproducible input:

import pandas as pd

df = pd.DataFrame({'column': [['A','C','B','A'],
                              ['HELLO','HELLO','ha'],
                              ['test/1', 'test/1', 'test2']]})
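
A dtype of 'O' does not say whether the cells hold lists or strings. If they turn out to be strings such as '[A,C,B,A]', they would need to be split into lists first. A minimal sketch, assuming comma-separated items wrapped in square brackets with no commas inside an item; parse_bracketed is a hypothetical helper name, not a pandas function:

```python
def parse_bracketed(text):
    """Turn a string such as '[A,C,B,A]' into ['A', 'C', 'B', 'A']."""
    # Drop surrounding whitespace and the enclosing brackets.
    inner = text.strip().strip('[]')
    # Split on commas and trim each item, skipping empty pieces.
    return [item.strip() for item in inner.split(',') if item.strip()]

raw = ['[A,C,B,A]', '[HELLO,HELLO,ha]', '[test/1, test/1, test2]']
parsed = [parse_bracketed(s) for s in raw]

# Same order-preserving deduplication as in the answer.
deduped = [list(dict.fromkeys(x)) for x in parsed]
print(deduped)
```

On a DataFrame, the same helper could be applied with df['column'].map(parse_bracketed) before running either list comprehension.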

Answered By – mozway

Answer Checked By – Cary Denson (BugsFixing Admin)
