[SOLVED] How to find duplicates in a string using Spark SQL?

Issue

I want to find duplicates in a string, I wonder is there any way that we can find this using spark sql. Below is the query which i’ve written.

spark.sql("""select case when lower(value) like '%,code,%' or lower(value) like '%,code%' or lower(value) like '%code,%' then 'Y' else 'N' end as value from input""").show(false)

Input :

val sample = Seq(("code,code")).toDF("value")

The issue is what if the value ‘code’ repeats twice? in that case my requirement is to return flag ‘N’. Is there any way to do that using Spark SQL? Please share your suggestions. TIA

Solution

You can establish the number of occurances using length and replace

set @a = 'code,aaa';
set @b = 'code,code';
set @c = 'aaa,bbb,ccc';

select @a, length(@a) - length(replace(@a,'code','')) lengthdiff,
        case when length(@a) - length(replace(@a,'code','')) = 4 then 'y'
             when length(@a) - length(replace(@a,'code','')) >= 4 then 'n'
        else 'n'
        end  occurances
;

+----------+------------+------------+
| @a       | lengthdiff | occurances |
+----------+------------+------------+
| code,aaa |          4 | y          |
+----------+------------+------------+
1 row in set (0.001 sec)

select @b, length(@b) - length(replace(@b,'code','')) lengthdiff,
        case when length(@b) - length(replace(@b,'code','')) = 4 then 'y'
             when length(@b) - length(replace(@b,'code','')) > 4 then 'n'
        else 'n'
        end  occurances
;

+-----------+------------+------------+
| @b        | lengthdiff | occurances |
+-----------+------------+------------+
| code,code |          8 | n          |
+-----------+------------+------------+
1 row in set (0.000 sec)

select @c, length(@c) - length(replace(@c,'code','')) lengthdiff,
        case when length(@c) - length(replace(@c,'code','')) = 4 then 'y'
             when length(@c) - length(replace(@c,'code','')) > 4 then 'n'
        else 'n'
        end  occurances
;

+-------------+------------+------------+
| @c          | lengthdiff | occurances |
+-------------+------------+------------+
| aaa,bbb,ccc |          0 | n          |
+-------------+------------+------------+
1 row in set (0.001 sec)

Answered By – P.Salmon

Answer Checked By – Gilberto Lyons (BugsFixing Admin)

Leave a Reply

Your email address will not be published. Required fields are marked *