[SOLVED] How to remove overlapping sequences from two data.tables?

Issue

I have two data.tables that provide sequence coordinates across different chromosomes (categories). For example:

library(data.table)
dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 110, 150, 110),
                  end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
                  start = c(12, 60, 50),
                  end = c(20, 115, 80))

I need to create a third data.table holding the parts of each dt1 sequence that do not overlap any dt2 sequence on the same chromosome. In other words, it should keep every integer covered by dt1 that is not covered by dt2. For example:

dt3 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 116, 150, 110),
                  end = c(11, 59, 121, 200, 200))
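
To make the requirement concrete, here is a brute-force sketch that expands every sequence to its integers, removes the ones covered by dt2, and collapses the rest back into ranges. The helper name expand_positions is made up for illustration. It is only practical for small inputs, but it can serve as a reference to check a faster method against.

```r
library(data.table)

dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 110, 150, 110),
                  end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
                  start = c(12, 60, 50),
                  end = c(20, 115, 80))

# Hypothetical helper: one row per (chromosome, integer position).
expand_positions <- function(dt) {
  dt[, .(pos = unlist(Map(seq, start, end))), by = chromosome]
}

p1 <- unique(expand_positions(dt1))
p2 <- unique(expand_positions(dt2))

# Anti-join: keep dt1 positions that do not occur in dt2.
kept <- p1[!p2, on = .(chromosome, pos)]

# Collapse runs of consecutive positions back into start/end ranges.
setorder(kept, chromosome, pos)
kept[, run := cumsum(c(TRUE, diff(pos) != 1L)), by = chromosome]
res <- kept[, .(start = min(pos), end = max(pos)), by = .(chromosome, run)]
res <- res[, .(chromosome, start, end)]
print(res)
# chromosome 1: (1, 11), (50, 59), (116, 121), (150, 200); chromosome X: (110, 200)
```

Note that this collapses touching dt1 sequences into one range, which does not matter for the example above.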

The data.tables I need to run this on are very large and therefore I need to maximise performance. I have tried using the foverlaps() function but to no avail. Any help would be greatly appreciated!

Solution

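You can start with something like this, using foverlaps(). It covers the cases in the example: a dt1 sequence with no overlap, or a dt1 sequence clipped on one side by a single dt2 sequence. A dt2 sequence lying strictly inside a dt1 sequence, or several dt2 sequences overlapping the same dt1 row, would need additional handling.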

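# Key dt2 so that foverlaps() can use it as the lookup table
setkey(dt2, chromosome, start, end)
# One row per (dt1 row, overlapping dt2 row); start/end come from dt2
# (NA when nothing overlaps), i.start/i.end come from dt1
ds <- foverlaps(dt1, dt2, type = "any")
ds[, .(chromosome,
       # keep dt1's start if there is no overlap or dt1 begins at or before
       # the dt2 sequence; otherwise begin just after the dt2 sequence
       start = fcase(is.na(start) | i.start <= start, i.start,
                     i.end >= end, end + 1),
       # keep dt1's end if there is no overlap or dt1 ends at or after
       # the dt2 sequence; otherwise stop just before the dt2 sequence
       end = fcase(is.na(end) | i.end >= end, i.end,
                   i.start <= start, start - 1))]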
#   chromosome start   end
#       <char> <num> <num>
#1:          1     1    11
#2:          1    50    59
#3:          1   116   121
#4:          1   150   200
#5:          X   110   200
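
If the real data contains dt2 sequences that fall inside a dt1 sequence, or several dt2 sequences hitting the same dt1 row, one possible generalisation is to turn dt2 into its gaps and intersect dt1 with those gaps. The following is a minimal sketch, not a tuned implementation. The function name subtract_intervals is made up here, and it assumes closed, integer-valued intervals with start <= end, character chromosome columns, and a non-empty second table.

```r
library(data.table)

# Hypothetical helper (not part of data.table): returns the parts of the
# intervals in x that are not covered by any interval in y, per chromosome.
subtract_intervals <- function(x, y) {
  # 1. Merge overlapping or adjacent y intervals within each chromosome.
  m <- y[order(chromosome, start, end)]
  m[, grp := cumsum(c(TRUE, start[-1L] > cummax(end)[-.N] + 1)), by = chromosome]
  m <- m[, .(start = min(start), end = max(end)), by = .(chromosome, grp)]

  # 2. Build the gaps between merged y intervals, bounded by the range of x.
  lo <- min(x$start)
  hi <- max(x$end)
  gaps <- m[, .(start = c(lo, end + 1), end = c(start - 1, hi)), by = chromosome]

  # Chromosomes that never occur in y form one single gap.
  free <- setdiff(x$chromosome, m$chromosome)
  gaps <- rbind(gaps, data.table(chromosome = free,
                                 start = rep(lo, length(free)),
                                 end = rep(hi, length(free))))
  gaps <- gaps[start <= end]            # drop empty gaps at the edges
  if (nrow(gaps) == 0L) return(x[0L])   # y covers all of x
  setkey(gaps, chromosome, start, end)

  # 3. Intersect every x interval with the gaps it overlaps.
  ov <- foverlaps(x, gaps, by.x = c("chromosome", "start", "end"), type = "any")
  ov <- ov[!is.na(start)]               # x rows fully covered by y drop out
  ov[, .(chromosome, start = pmax(start, i.start), end = pmin(end, i.end))]
}

dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 110, 150, 110),
                  end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
                  start = c(12, 60, 50),
                  end = c(20, 115, 80))

dt3_general <- subtract_intervals(dt1, dt2)
print(dt3_general)   # same five ranges as dt3 in the question

# A dt2 sequence strictly inside a dt1 sequence splits it in two:
split_demo <- subtract_intervals(
  data.table(chromosome = "1", start = 1, end = 100),
  data.table(chromosome = "1", start = 40, end = 60)
)
print(split_demo)    # ranges (1, 39) and (61, 100)
```

Merging dt2 first keeps the gap table small, and the only heavy step is a single foverlaps() call, so this should scale similarly to the snippet above. Unlike that snippet, it does not set a key on dt2 by reference.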

Answered By – Peace Wang

Answer Checked By – David Marino (BugsFixing Volunteer)
