Question: Big data for IRanges: Combining reduce and gaps with dplyr functions
0
3.9 years ago by
United States
pesquivel0 wrote:

Hi all,

I am an advanced beginner in R who is extremely thankful for IRanges! It has accelerated the process I describe below by ~10-fold.

Desired advice: Because I am working with millions of records, I wonder if there are further speed improvements that can be obtained by creatively combining IRanges with functions from the dplyr package, which are generally fast. Any insights into using the data.table package would also be appreciated.

Background: My job involves working on the medication records of de-identified health insurance plan members. My team increasingly needs to exclude patients from downstream analyses of our Claims data if their medication history contains any gap longer than 60 days long. We identify these patients by looking at the startdate and enddate of each record in our dataset; each record also contains the member_id of the member who made the purchase. Unfortunately, identifying 'gappy' patients is not as simple as pairwise comparison of records. This simplified excerpt from Claims illustrates why; dates are represented as integers below:

member_id   startdate   enddate

A           1           90
A           14          15
A           121         180
B           1           30
B           2001        2030
...         ...         ... 

Patient B should obviously be removed since he has a gap of length 2001 − 30 − 1 = 1970 > 60. I would like to retain Patient A despite the gap of length 121 − 15 − 1 = 75 > 60 between his second and third prescriptions, however; the only gap in his medication history is the one of length 121 − 90 − 1 = 30 < 60 between his first and third prescriptions.

Current approach and situation: I have been able to take these issues into account using a custom function called smart (it definitely is smarter than the previous loop-based function we employed). smart invokes IRanges::reduce and IRanges::gaps

> smart <- function(Claims) > { >   MemberClaims_I <- IRanges(start = as.numeric(Claims$startdate), end = as.numeric(Claims$enddate)) >   MemberClaims_Red <- reduce(MemberClaims_I) >   MemberGaps <- as.data.table(gaps(MemberClaims_Red)) > }

This custom function is then currently applied to Claims using plyr:ddply:

> member_id <- levels(Claims\$member_id) > #system.time(Claims_Smart <- ddply(Claims, .(member_id), smart)) > Claims_Smart <- ddply(Claims, .(member_id), smart)

The hashed-out line tells me that ~20,000 rows for ~1,000 patients are processed in ~8 seconds. A dataset with 3 million rows and 600,000 patients gets processed in ~8 hours. Here are my session details:

> sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.11 (El Capitan)

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages: [1] data.table_1.9.6    dplyr_0.4.3         zoo_1.7-12          stringr_1.0.0       fields_8.3-5        maps_3.0.0-2       [7] spam_1.3-0          shiny_0.12.2        RPostgreSQL_0.4     DBI_0.3.1           zipcode_1.0         visreg_2.2-0       [13] plyr_1.8.3          IRanges_2.4.1       S4Vectors_0.8.2     BiocGenerics_0.16.1

loaded via a namespace (and not attached): [1] Rcpp_0.12.2     magrittr_1.5    xtable_1.8-0    lattice_0.20-33 R6_2.1.1        tools_3.2.2     htmltools_0.2.6 lazyeval_0.1.10 [9] assertthat_0.1  digest_0.6.8    mime_0.4        stringi_1.0-1   jsonlite_0.9.17 chron_2.3-47    httpuv_1.3.3   

Thanks for your help!

iranges R reduce gaps • 851 views