Question

Fastest way to work with a large RangedSummarizedExperiment object

alexgutteridge ▴ 50

I have a large RangedSummarizedExperiment generated by recount3 (>1000 samples, >10M exon junctions). I'd like to compute some simple summary statistics for each sample. My code works but is rather slow (~30s per sample on my machine).

library(recount3)
#Takes about 5 minutes with cached files; another couple of minutes to download
ccle_jxn = create_rse(
  subset(available_projects(), project == "SRP186687" & project_type == "data_sources"),
  type = "jxn"
)
min_reads = 5
total_junctions = lapply(colData(ccle_jxn)$sra.sample_name, function(cell_line){
  x = subset(ccle_jxn, select = sra.sample_name == cell_line)
  return(sum(assay(x) >= min_reads))
})

This tweak makes it about 2x faster per sample:

min_reads = 5
total_junctions = lapply(colData(ccle_jxn)$sra.sample_name, function(cell_line){
  return(sum(assay(ccle_jxn)[,which(colData(ccle_jxn)$sra.sample_name == cell_line)] >= min_reads))
})

But it is a bit less readable. Is there a way to optimise this further?
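
For reference, one direction that might be worth trying is to drop the per-sample loop and count over the whole assay in one vectorised pass. The sketch below is only an illustration on a toy matrix: it assumes the junction assay is a sparse matrix from the Matrix package (I have not checked this for every recount3 object), and the matrix `counts` and its sample names are made up for the example.

```r
library(Matrix)

# Toy stand-in for assay(ccle_jxn): junctions in rows, samples in columns.
# The values and sample names here are invented for illustration.
counts <- sparseMatrix(
  i = c(1, 2, 3, 1, 4, 2),
  j = c(1, 1, 1, 2, 2, 3),
  x = c(5, 2, 10, 7, 4, 6),
  dims = c(4, 3),
  dimnames = list(NULL, c("s1", "s2", "s3"))
)

min_reads <- 5

# Compare the whole matrix against the threshold once, then count the
# TRUE entries in each column. Because min_reads > 0, zero entries never
# pass the threshold, so the logical result can stay sparse.
total_junctions <- Matrix::colSums(counts >= min_reads)
```

On the real object the equivalent call would presumably be `Matrix::colSums(assay(ccle_jxn) >= min_reads)`, which gives one count per column rather than one per `sra.sample_name`, so the two would only match the loop above if each sample name appears in a single column.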

SummarizedExperiment summ • 627 views

2.6 years ago alexgutteridge ▴ 50