I am an advanced beginner in R who is extremely thankful for
IRanges! It has accelerated the process I describe below by ~10-fold.
Desired advice: Because I am working with millions of records, I wonder if there are further speed improvements that can be obtained by creatively combining
IRanges with functions from the
dplyr package, which are generally fast. Any insights into using the
data.table package would also be appreciated.
Background: My job involves working on the medication records of de-identified health insurance plan members. My team increasingly needs to exclude patients from downstream analyses of our
Claims data if their medication history contains any gap longer than 60 days. We identify these patients by looking at the
enddate of each record in our dataset; each record also contains the
member_id of the member who made the purchase. Unfortunately, identifying 'gappy' patients is not as simple as pairwise comparison of records. This simplified excerpt from
Claims illustrates why; dates are represented as integers below:
member_id  startdate  enddate
A          1          90
A          14         15
A          121        180
B          1          30
B          2001       2030
...        ...        ...
Patient B should obviously be removed, since he has a gap of length 2001 − 30 − 1 = 1970 > 60. Patient A, however, should be retained despite the apparent gap of length 121 − 15 − 1 = 105 > 60 between his second and third prescriptions: his first prescription covers days 1 through 90, so the only real gap in his medication history is the one of length 121 − 90 − 1 = 30 < 60 between his first and third prescriptions.
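To illustrate the logic on Patient A's records (a minimal sketch using the interval values from the excerpt above): reduce() merges the overlapping first and second prescriptions into one covered stretch, so gaps() reports only the real 30-day gap.

```r
library(IRanges)

# Patient A's three prescriptions from the excerpt above
a <- IRanges(start = c(1, 14, 121), end = c(90, 15, 180))

reduce(a)        # [1, 90] and [14, 15] collapse into a single [1, 90]
width(gaps(a))   # the only true gap runs from day 91 to day 120: width 30
```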
Current approach and situation: I have been able to take these issues into account using a custom function called
smart (it definitely is smarter than the previous loop-based function we employed).
> smart <- function(Claims) {
>   # build one range per claim, merge overlapping coverage, return the true gaps
>   MemberClaims_I <- IRanges(start = as.numeric(Claims$startdate),
>                             end = as.numeric(Claims$enddate))
>   MemberClaims_Red <- reduce(MemberClaims_I)
>   MemberGaps <- as.data.table(gaps(MemberClaims_Red))
>   MemberGaps
> }
This custom function is then currently applied per member via plyr::ddply:
> member_id <- levels(Claims$member_id)
> #system.time(Claims_Smart <- ddply(Claims, .(member_id), smart))
> Claims_Smart <- ddply(Claims, .(member_id), smart)
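Since my question mentions data.table, here is the kind of grouped alternative I have been wondering about (an untested sketch; it assumes Claims has been converted with setDT(), and the width > 60 filter and the gappy name are mine, not part of my current pipeline):

```r
library(data.table)

# Group in data.table's optimized by-machinery instead of plyr's split/apply loop;
# .SD is the per-member subset of columns passed to smart()
setDT(Claims)
Claims_Smart <- Claims[, smart(.SD), by = member_id]

# as.data.table(gaps(...)) yields start/end/width columns, so 'width' is the
# gap length; these are the members to exclude downstream
gappy <- unique(Claims_Smart[width > 60, member_id])
```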
The hashed-out system.time() line tells me that ~20,000 rows covering ~1,000 patients are processed in ~8 seconds; a dataset with 3 million rows and 600,000 patients takes ~8 hours. Here are my session details:
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11 (El Capitan)
attached base packages:
 grid stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
 data.table_1.9.6 dplyr_0.4.3 zoo_1.7-12 stringr_1.0.0 fields_8.3-5 maps_3.0.0-2
 spam_1.3-0 shiny_0.12.2 RPostgreSQL_0.4 DBI_0.3.1 zipcode_1.0 visreg_2.2-0
 plyr_1.8.3 IRanges_2.4.1 S4Vectors_0.8.2 BiocGenerics_0.16.1
loaded via a namespace (and not attached):
 Rcpp_0.12.2 magrittr_1.5 xtable_1.8-0 lattice_0.20-33 R6_2.1.1 tools_3.2.2 htmltools_0.2.6 lazyeval_0.1.10
 assertthat_0.1 digest_0.6.8 mime_0.4 stringi_1.0-1 jsonlite_0.9.17 chron_2.3-47 httpuv_1.3.3
Thanks for your help!