Fastest way to work with a large RangedSummarizedExperiment object
@alexgutteridge-7451
Last seen 3.1 years ago
United States

I have a large RangedSummarizedExperiment generated by recount3 (>1000 samples, >10M exon junctions). I'd like to compute some simple summary statistics for each sample. My code works but is rather slow (~30s per sample on my machine).

library(recount3)
#Takes about 5 minutes with cached files; another couple of minutes to download
ccle_jxn = create_rse(
  subset(available_projects(), project == "SRP186687" & project_type == "data_sources"),
  type = "jxn"
)
min_reads = 5
total_junctions = lapply(colData(ccle_jxn)$sra.sample_name, function(cell_line){
  x = subset(ccle_jxn, select = sra.sample_name == cell_line)
  return(sum(assay(x) >= min_reads))
})

This tweak makes it about 2x faster per sample:

min_reads = 5
total_junctions = lapply(colData(ccle_jxn)$sra.sample_name, function(cell_line){
  return(sum(assay(ccle_jxn)[,which(colData(ccle_jxn)$sra.sample_name == cell_line)] >= min_reads))
})

But it is a bit less readable. Is there a way to optimise this further?
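One direction that might be worth trying is to drop the per-sample loop and count all columns in a single vectorised pass. The sketch below is only an illustration under assumptions: it uses a small random sparse matrix (`counts`, a made-up stand-in) instead of the real `assay(ccle_jxn)`, and it assumes the assay is a sparse matrix from the Matrix package, which I have not verified for this object.

```r
library(Matrix)

# Toy stand-in for assay(ccle_jxn): a sparse junction-by-sample count matrix.
# The dimensions and the count distribution are arbitrary, for illustration only.
set.seed(1)
counts <- rsparsematrix(1000, 20, density = 0.05,
                        rand.x = function(n) rpois(n, 6) + 1)
colnames(counts) <- paste0("sample", seq_len(ncol(counts)))

min_reads <- 5

# One pass over all columns instead of one subset per sample.
# Because min_reads > 0, zeros compare as FALSE and the logical
# result of the comparison stays sparse.
total_junctions <- Matrix::colSums(counts >= min_reads)

# Reference result using a per-sample loop, as in the question.
per_sample <- vapply(seq_len(ncol(counts)),
                     function(j) sum(counts[, j] >= min_reads),
                     numeric(1))
stopifnot(all(total_junctions == per_sample))
```

If the assumption holds, the equivalent call on the real object would be `Matrix::colSums(assay(ccle_jxn) >= min_reads)`, with `names()` on the result set from `colData(ccle_jxn)$sra.sample_name` if sample names are wanted as labels. Whether this is actually faster on >10M junctions would need to be timed.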

SummarizedExperiment summ • 789 views