Question: IRanges: reduce(), findOverlaps(), and intersect() by group
0
gravatar for pesquivel
3.9 years ago by
pesquivel0
United States
pesquivel0 wrote:

Hi,

Background I have two drug utilization datasets for two drugs A and B. Each row in the datasets represents a prescription, described by patient.id, drug.namestart.date , and  days.supply. Both datasets have been filtered so they contain only seven-day-long prescriptions (days.supply = 7 for all rows) and only  patients that consumed both drugs. All values of start.date are positive integers.

Question Using IRanges but without resorting to loops, how could one find the patients who simultaneously consumed both drugs? More precisely, find a no-loop IRanges algorithm that identifies every patient.id with at least one prescription for A and one prescription for B that mutually overlap by at least one day. Note that two A-prescriptions for the same patient should be considered one longer A-prescription; same goes for B-prescriptions.

Initial code and error message This question seems to me to be a time to use reduce() and either findOverlaps() or intersect(). While it is easy enough to apply these functions to each dataset as a whole, they however must instead be applied patient by patient.

ir.A <- IRanges(start = as.integer(A$start.date), width = as.integer(A$days.supply))
ir.B <- IRanges(start = as.integer(B$start.date), width = as.integer(B$days.supply))

split.A <- split(ir.A, A$patient.id)​
split.B <- split(ir.B, B$patient.id)

red.A <- reduce(ir.A)
red.B <- reduce(ir.B)

x <- findOverlaps(red.A, red.B)

#Warning in View :
#'optional' and arguments in '...' are ignored
#Error in View : arguments imply differing number of rows: 1, 0, 2, 3, 4, 6

Session information

R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] XVector_0.10.0      IRanges_2.4.4       S4Vectors_0.8.2     BiocGenerics_0.16.1
[5] data.table_1.9.6    bit64_0.9-5         bit_1.1-12         

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0 tools_3.2.2     chron_2.3-47
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by pesquivel0

Would you please provide a reproducible example of this error? That code should work.

ADD REPLYlink written 3.9 years ago by Michael Lawrence11k

Hi,

I've added to my 'Initial code and error message' section the error-causing line of code, which I originally forgot:

x <- findOverlaps(red.A, red.B)

As for data that works, I have CSVs that become the data.tables A and B for drugs A and B (respectively). Would you want me to provide these data.tables as part of my reproducible error? If so, what is the best way for me to provide them?

 

ADD REPLYlink written 3.9 years ago by pesquivel0

Just a minimal example would suffice, i.e., a minimal subset of your full dataset. Ideally something that can be constructed directly in code.

ADD REPLYlink written 3.9 years ago by Michael Lawrence11k

Hi Michael,

Here is a minimal example:

a <- data.frame(as.character(c(seq(1:4), 4)),
                c("A", "A", "A", "A", "A"),
                c(as.Date("2013-10-30"), as.Date("2013-11-15"),
                  as.Date("2013-09-05"), as.Date("2013-11-29"), as.Date("2013-12-27")),
                c(as.Date("2013-11-06"), as.Date("2013-11-22"),
                  as.Date("2013-09-12"), as.Date("2013-12-06"), as.Date("2014-01-03")))
names(a) <- c("patient.id", "drug", "start.date", "end.date")
a.ir <- IRanges(as.integer(a$start.date), as.integer(a$end.date))
a.split <- split(a.ir, a$patient.id)
a.red <- reduce(a.split)
b <- data.frame(as.character(seq(1:4)),
                c("B", "B", "B", "B"),
                c(as.Date("2013-10-30"), as.Date("2013-11-15"),
                  as.Date("2013-09-05"), as.Date("2013-11-29")),
                c(as.Date("2013-11-06"), as.Date("2013-11-22"),
                  as.Date("2013-09-12"), as.Date("2013-12-06")))
names(b) <- c("patient.id", "drug", "start.date", "end.date")
b.ir <- IRanges(as.integer(b$start.date), as.integer(b$end.date))
b.split <- split(b.ir, b$patient.id)
b.red <- reduce(b.split)

x <- findOverlaps(a.red, b.red)
View(x)

Weirdly enough, this is the new message that arises from this minimal example (the full datasets produce the error message that I noted originally):

Warning messages:
1: In as.data.frame.Hits(x[[i]], optional = TRUE) :
  'optional' and arguments in '...' are ignored
2: In as.data.frame.Hits(x[[i]], optional = TRUE) :
  'optional' and arguments in '...' are ignored
3: In as.data.frame.Hits(x[[i]], optional = TRUE) :
  'optional' and arguments in '...' are ignored
4: In as.data.frame.Hits(x[[i]], optional = TRUE) :
  'optional' and arguments in '...' are ignored

What is weird is that the final output has unexpected columns such as queryHits.1, queryHits.2, and subjectHits.1.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by pesquivel0

The warning is unnecessary, so I will remove it. As for the weird columns, this comes down to there being no c,Hits method, and it is not obvious how to implement one. So we would need to do something at coercion to DataFrame.
 

ADD REPLYlink written 3.9 years ago by Michael Lawrence11k

Hi, Maybe this is useful to note in the context of the 'different number of rows' error: The number of rows in data.table A can be expected to differ from the number of rows in data.table B. But the number of patient.id values should be identical in each data.table because both data.tables refer to the same patient population.

 

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by pesquivel0
Answer: IRanges: reduce(), findOverlaps(), and intersect() by group
0
gravatar for pesquivel
3.9 years ago by
pesquivel0
United States
pesquivel0 wrote:

Michael, thanks for providing technical details.

Turns out that the answer to my particular question is easy to find. Instead of using findOverlaps, I simply used overlapsAny():

overlaps <- as.data.table(overlapsAny(a, b))
names(overlaps) <- c("group", "patient.id", "has.overlaps")
overlaps[, has.overlaps := as.numeric(has.overlaps)]
overlaps[, has.overlaps := max(has.overlaps), by = patient.id]
overlappers <- overlaps[has.overlaps == TRUE]

overlappers lists all the patients who simultaneously consumed drugs A and B, which was what we wanted.

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by pesquivel0

Yea, or just:

any(overlapsAny(a, b))

But I do want to iron out that issue with HitsList.

ADD REPLYlink written 3.9 years ago by Michael Lawrence11k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 448 users visited in the last hour