Question: Big data for IRanges: Combining reduce and gaps with dplyr functions
0
3.9 years ago by
United States
pesquivel0 wrote:

Hi all,

I am an advanced beginner in R who is extremely thankful for IRanges! It has accelerated the process I describe below by ~10-fold.

Desired advice: Because I am working with millions of records, I wonder if there are further speed improvements that can be obtained by creatively combining IRanges with functions from the dplyr package, which are generally fast. Any insights into using the data.table package would also be appreciated.

Background: My job involves working on the medication records of de-identified health insurance plan members. My team increasingly needs to exclude patients from downstream analyses of our Claims data if their medication history contains any gap longer than 60 days long. We identify these patients by looking at the startdate and enddate of each record in our dataset; each record also contains the member_id of the member who made the purchase. Unfortunately, identifying 'gappy' patients is not as simple as pairwise comparison of records. This simplified excerpt from Claims illustrates why; dates are represented as integers below:

member_id   startdate   enddate

A           1           90
A           14          15
A           121         180
B           1           30
B           2001        2030
...         ...         ... 

Patient B should obviously be removed since he has a gap of length 2001 − 30 − 1 = 1970 > 60. I would like to retain Patient A despite the gap of length 121 − 15 − 1 = 75 > 60 between his second and third prescriptions, however; the only gap in his medication history is the one of length 121 − 90 − 1 = 30 < 60 between his first and third prescriptions.

EDIT: Previous approach and situation: I have been able to take these issues into account using a custom function called smart (it definitely is smarter than the previous loop-based function we employed). smart invokes IRanges::reduce and IRanges::gaps

> smart <- function(Claims)
> {
>   MemberClaims_I <- IRanges(start = as.numeric(Claims$startdate), end = as.numeric(Claims$enddate))
>   MemberClaims_Red <- reduce(MemberClaims_I)
>   MemberGaps <- as.data.table(gaps(MemberClaims_Red))
> }

This custom function is then currently applied to Claims using plyr:ddply:

> member_id <- levels(Claims$member_id) > #system.time(Claims_Smart <- ddply(Claims, .(member_id), smart)) > Claims_Smart <- ddply(Claims, .(member_id), smart) The hashed-out line tells me that ~20,000 rows for ~1,000 patients are processed in ~8 seconds. A dataset with 3 million rows and 600,000 patients gets processed in ~8 hours. EDIT: Current solution: Thanks to Michael, I now have a function that covers ~20,000 rows for ~3,000 patients in just 0.11 second --- even on a sucky computer. See below; hashed-out lines were run to help confirm accuracy of solution by spot-checking output. > gapruler <- function (Claims) > { > ClaimsByMember <- with(Claims, split(IRanges(as.integer(Claims$startdate), as.integer(Claims$enddate)), member_id)) > #Gapsnew <- as.data.frame(gaps(ClaimsByMember)) > #colnames(Gapsnew) <- c("group", "member_id", "startdate", "enddate", "daysgap") > #Gapsnew <- Gapsnew[order(Gapsnew$daysgap),]
>   #rownames(Gapsnew) <- seq(1:nrow(Gapsnew))
>   Gapwidths <- width(gaps(ClaimsByMember))
>   threshgapwidth <- quantile(unlist(Gapwidths, use.names=FALSE), threshgapperc/100)
>   Claims_FirstTime <- any(splitAsList(Claims$first_time, Claims$member_id) == "Y")
>   Claims_Included <- ClaimsByMember[max(Gapwidths) <= threshgapwidth & Claims_FirstTime]
>   LTVTable <- as.data.frame(unlist(range(ClaimsByMember)))
>   colnames(LTVTable) <- c("mincoresdate", "maxcoredate", "therapylength", "member_id")
>   LTVTable <- subset(LTVTable, mincoresdate < thresholdenddate)
>   LTV <- mean(LTVTable$therapylength) > print(paste("LTV =", round(LTV), "days")) > } > > Gaps <- as.data.frame(gapruler(Claims))  Here are my session details: > sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 8 x64 (build 9200) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] visreg_2.2-0 plyr_1.8.3 zipcode_1.0 XVector_0.10.0 IRanges_2.4.1 S4Vectors_0.8.2 BiocGenerics_0.16.1 [8] dplyr_0.4.3 loaded via a namespace (and not attached): [1] Rcpp_0.12.2 lattice_0.20-33 assertthat_0.1 grid_3.2.2 R6_2.1.1 DBI_0.3.1 magrittr_1.5 zlibbioc_1.16.0 [9] tools_3.2.2  iranges R reduce gaps • 874 views ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by pesquivel0 Answer: Big data for IRanges: Combining reduce and gaps with dplyr functions 1 3.9 years ago by United States Michael Lawrence11k wrote: I don't think ddply() is going to help much here in terms of performance, due to the running time of the smart() function. You're going to want to form a RangesList by splitting your ranges by member_id: claimsByMember <- with(Claims, split(IRanges(startdate, enddate), member_id)) Then just compute the gaps (the gaps imply reduction, btw), and find the elements where the max gap width is less than some value: valid <- max(width(gaps(claimsByMember))) <= 60 I haven't tested that, but it should get you pretty close, and with reasonable performance. If not, please let us know. ADD COMMENTlink written 3.9 years ago by Michael Lawrence11k Hi Michael, Thank you for being a lifesaver! I could have been a bit clearer in saying that we remove any patient who has __even one__ too-long gap. It nevertheless inspired the following code, which differs from yours merely in that it does not employ max(). This approach can process 3 million records for 600,000 patients in 0.3 seconds. ClaimsByMember <- with(Claims, split(IRanges(as.numeric(Claims$startdate), as.numeric(Claims$enddate)), member_id)) Gaps <- as.data.frame((width(gaps(ClaimsByMember)))) Gaps <- select(Gaps, -group) Gaps <- as.data.frame(Gaps) colnames(Gaps) <- c("member_id", "daysgap") Once more, thank you very much for your inspiration! ADD REPLYlink written 3.9 years ago by pesquivel0 Wouldn't finding the maximum gap per patient detect if a patient has even one gap over the limit? Computing the max on the list of widths is going to be extremely fast, so you might want to make use of it. Couple of more notes: the endpoints are integer, so use as.integer() instead of as.numeric(). If you really want to convert the list to a table, do this: stack(Gaps, index.var="member_id", value.var="daysgap") ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Michael Lawrence11k Hi Michael, Thanks again for your advice. I oversimplified what my protocol does in real life. The threshold gap length is not a hardcoded 60 days. Rather, it is found as the 90th- (or 99th, or 99.9th-) percentile length; the threshold percentile is itself a variable. It is still true, though, that all records for any patient with even one gap longer than the threshold gap length must be excluded from downstream analysis. Nevertheless, I would be open to a faster procedure to exclude patients. Currently, the procedure downstream of the code I updated takes 30 seconds. Not bad, but I know R can do it better. Thoughts? Best, Paolo ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by pesquivel0 It might help to see the downstream code. ADD REPLYlink written 3.9 years ago by Michael Lawrence11k In the code below, LTV is the variable of interest. It is conceptualized as the length of time that patients are on a given medication. We calculate it only for each included patient. LTV is defined as the difference between the maximum of the patient's enddates and the minimum of his startdates. This entire procedure depends mostly on the threshold gap-length percentile variable threshgapperc (which can take on the values of 99.9, 99, and 90). #Excluding outlier gapwidths to calculate threshold gapwith Gaps <- Gaps[order(Gaps$daysgap),]
Gaps <- Gaps[1:round(threshgapperc*nrow(Gaps)/100),]
threshgapwidth <- max(Gaps$daysgap) #Define an outlier claim as a claim whose member_id has is linked to even one gap of 'gapwidth' > 'threshgapwidth' #Equivalently, include only patients with all gaps of 'gapwidth' <= 'threshgapwidth' Gaps_Excluder <- Gaps[Gaps$daysgap > threshgapwidth,]
Claims_Included <- Claims[!(Claims$member_id %in% Gaps_Excluder$member_id),]
Claims_FT <- Claims[Claims_Included$first_time == "Y",] Claims_FT$member_id <- factor(Claims_FT$member_id) print(paste("Number of distinct member_id values in Claims_FT =", length(levels(Claims_FT$member_id))))

#Find all claims from Claims_Included whose member_id is in Claims_FT
Claims_Included <- Claims_Included[Claims_Included$member_id %in% Claims_FT$member_id,]

#Calculate patient-level LTVs
Mincorsdate <- as.data.frame(aggregate(x = list(Claims_Included$startdate), by = list(Claims_Included$member_id), FUN = "min"))
Maxcoredate <- as.data.frame(aggregate(x = list(Claims_Included$enddate), by = list(Claims_Included$member_id), FUN = "max"))
names(Mincorsdate) <- c("member_id", "mincorsdate")
names(Maxcoredate) <- c("member_id", "maxcoredate")
LTVTable <- merge(Mincorsdate, Maxcoredate, by.x = "member_id", by.y = "member_id")

#Remove from LTV calculation all patients who started 'too late'
LTVTable <- LTVTable[LTVTable$mincorsdate < thresholdenddate,] LTVTable$therapylength <- LTVTable$maxcoredate - LTVTable$mincorsdate
LTVTable$member_id <- factor(LTVTable$member_id)
print(paste("Number of distinct member_id values in LTVTable =", length(levels(LTVTable$member_id)))) LTV <- as.numeric(mean(LTVTable$therapylength))
print(paste("LTV =", round(LTV), "days"))

Here is some code I whipped up and haven't tested. But it should get the point across.

claimsByMember <- with(Claims, split(IRanges(startdate, enddate), member_id))
gapwidths <- width(gaps(claimsByMember))
threshgapwidth <- quantile(unlist(gapwidths, use.names=FALSE),
threshgapperc/100)
ft <- any(splitAsList(Claims$first_time, Claims$member_id) == "Y")
Claims_Included <- claimsByMember[max(gapwidths) <= threshgapwidth & ft]
LTVTable <- as.data.frame(unlist(range(claimsByMember)))
colnames(LTVTable) <- c("mincoresdate", "maxcoredate", "therapylength",
"member_id")
LTVTable <- subset(LTVTable, mincoresdate < thresholdenddate)
LTV <- mean(LTVTable$therapylength) ADD REPLYlink written 3.9 years ago by Michael Lawrence11k Hi Michael, Thank you so much for your help thus far! I've spot-checked the output of your code and have found it working great! Three million rows for 4,000 patients are done in 0.11 seconds. Here is the final function I used: gapruler <- function (Claims) { ClaimsByMember <- with(Claims, split(IRanges(as.integer(Claims$startdate), as.integer(Claims$enddate)), member_id)) #Gapsnew <- as.data.frame(gaps(ClaimsByMember)) #colnames(Gapsnew) <- c("group", "member_id", "startdate", "enddate", "daysgap") #Gapsnew <- Gapsnew[order(Gapsnew$daysgap),]
#rownames(Gapsnew) <- seq(1:nrow(Gapsnew))
Gapwidths <- width(gaps(ClaimsByMember))
threshgapwidth <- quantile(unlist(Gapwidths, use.names=FALSE), threshgapperc/100)
Claims_FirstTime <- any(splitAsList(Claims$first_time, Claims$member_id) == "Y")
Claims_Included <- ClaimsByMember[max(Gapwidths) <= threshgapwidth & Claims_FirstTime]
LTVTable <- as.data.frame(unlist(range(ClaimsByMember)))
colnames(LTVTable) <- c("mincoresdate", "maxcoredate", "therapylength", "member_id")
LTVTable <- subset(LTVTable, mincoresdate < thresholdenddate)
LTV <- mean(LTVTable\$therapylength)
print(paste("LTV =", round(LTV), "days"))
}

Gaps <- as.data.frame(gapruler(Claims))