Subsampling of BAM/SAM alignments
1
0
Entering edit mode
Julian Gehring ★ 1.3k
@julian-gehring-5818
Last seen 5.0 years ago
Hi, Is there an efficient function to randomly subsample alignments from a BAM/SAM within R? In the best case, an equivalent to 'FastqSampler' in 'ShortRead'. If not, what would be a clever way of doing this with the bioc framework, without having to read the whole file first? Best wishes Julian
• 1.4k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 16 days ago
United States
On 03/12/2013 01:32 PM, Julian Gehring wrote: > Hi, > > Is there an efficient function to randomly subsample alignments from a BAM/SAM > within R? In the best case, an equivalent to 'FastqSampler' in 'ShortRead'. If > not, what would be a clever way of doing this with the bioc framework, without > having to read the whole file first? Hi Julian -- there isn't anything at the moment; below is a function that will have at most yieldSize * 2 elements in memory at one time. All the elements in the bam file (satisfying ScanBamParam) will eventually pass through memory, so not super efficient. sampleBam <- function(file, index=file, ..., yieldSize, verbose=FALSE, param=ScanBamParam(what=scanBamWhat())) { qunlist <- Rsamtools:::.quickUnlist tot <- sampleSize <- yieldSize bf <- open(BamFile(file, index, yieldSize=yieldSize)) smpl <- qunlist(scanBam(bf, param=param)) repeat { yld <- qunlist(scanBam(bf, param=param)) yld_n <- length(yld[[1]]) if (length(yld[[1]]) == 0L) break tot <- tot + yld_n keep <- rbinom(1L, yld_n, yld_n/ tot) if (verbose) message(sprintf("sampling %d of %d", keep, tot)) if (keep == 0L) next i <- sample(sampleSize, keep) j <- sample(yld_n, keep) smpl <- Map(function(x, y, i, j) { x[i] <- y[j] x }, smpl, yld, MoreArgs=list(i=i, j=j)) } close(bf) smpl } with fl <- system.file("extdata", "ex1.bam", package="Rsamtools") param <- ScanBamParam( flag=scanBamFlag(isUnmappedQuery=FALSE), what=c("rname", "pos", "cigar", "strand")) lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE) The end result could be fed to a more friendly structure names(lst)[names(lst) == "rname"] = "seqname" do.call(GappedAlignments, lst) Lightly tested, especially when ScanBamParam() is non-trivial. Martin > > Best wishes > Julian > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
On 03/12/2013 07:45 PM, Martin Morgan wrote: > On 03/12/2013 01:32 PM, Julian Gehring wrote: >> Hi, >> >> Is there an efficient function to randomly subsample alignments from a BAM/SAM >> within R? In the best case, an equivalent to 'FastqSampler' in 'ShortRead'. If >> not, what would be a clever way of doing this with the bioc framework, without >> having to read the whole file first? > > Hi Julian -- there isn't anything at the moment; below is a function that will > have at most yieldSize * 2 elements in memory at one time. All the elements in > the bam file (satisfying ScanBamParam) will eventually pass through memory, so > not super efficient. I took a different path in Rsamtools 1.11.23; one can create a BamSampler instance and use it like BamFile, e.g., in readGappedAlignments / scanBam / ... > example(BamSampler) BmSmpl> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") BmSmpl> samp <- BamSampler(fl, yieldSize=1000) BmSmpl> ## two independent samples BmSmpl> head(readBamGappedAlignments(samp)) ... BmSmpl> head(readBamGappedAlignments(samp)) ... Still not super-efficient. Martin > > sampleBam <- > function(file, index=file, ..., yieldSize, verbose=FALSE, > param=ScanBamParam(what=scanBamWhat())) > { > qunlist <- Rsamtools:::.quickUnlist > tot <- sampleSize <- yieldSize > bf <- open(BamFile(file, index, yieldSize=yieldSize)) > smpl <- qunlist(scanBam(bf, param=param)) > > repeat { > yld <- qunlist(scanBam(bf, param=param)) > yld_n <- length(yld[[1]]) > if (length(yld[[1]]) == 0L) > break > tot <- tot + yld_n > keep <- rbinom(1L, yld_n, yld_n/ tot) > if (verbose) > message(sprintf("sampling %d of %d", keep, tot)) > if (keep == 0L) > next > > i <- sample(sampleSize, keep) > j <- sample(yld_n, keep) > smpl <- Map(function(x, y, i, j) { > x[i] <- y[j] > x > }, smpl, yld, MoreArgs=list(i=i, j=j)) > } > > close(bf) > smpl > } > > > with > > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam( > flag=scanBamFlag(isUnmappedQuery=FALSE), > what=c("rname", "pos", "cigar", "strand")) > > lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE) > > The end result could be fed to a more friendly structure > > names(lst)[names(lst) == "rname"] = "seqname" > do.call(GappedAlignments, lst) > > Lightly tested, especially when ScanBamParam() is non-trivial. > > Martin > >> >> Best wishes >> Julian >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLY
0
Entering edit mode
Hi Martin, Thanks, that is very useful. For large-scale subsampling, the command line tools may still be the way to go, but for many use cases the BamSampler will be sufficient. Best wishes Julian On 03/13/2013 09:26 PM, Martin Morgan wrote: > On 03/12/2013 07:45 PM, Martin Morgan wrote: >> On 03/12/2013 01:32 PM, Julian Gehring wrote: >>> Hi, >>> >>> Is there an efficient function to randomly subsample alignments from >>> a BAM/SAM >>> within R? In the best case, an equivalent to 'FastqSampler' in >>> 'ShortRead'. If >>> not, what would be a clever way of doing this with the bioc >>> framework, without >>> having to read the whole file first? >> >> Hi Julian -- there isn't anything at the moment; below is a function >> that will >> have at most yieldSize * 2 elements in memory at one time. All the >> elements in >> the bam file (satisfying ScanBamParam) will eventually pass through >> memory, so >> not super efficient. > > I took a different path in Rsamtools 1.11.23; one can create a > BamSampler instance and use it like BamFile, e.g., in > readGappedAlignments / scanBam / ... > > > example(BamSampler) > > BmSmpl> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > > BmSmpl> samp <- BamSampler(fl, yieldSize=1000) > > BmSmpl> ## two independent samples > BmSmpl> head(readBamGappedAlignments(samp)) > ... > BmSmpl> head(readBamGappedAlignments(samp)) > ... > > Still not super-efficient. > > Martin > >> >> sampleBam <- >> function(file, index=file, ..., yieldSize, verbose=FALSE, >> param=ScanBamParam(what=scanBamWhat())) >> { >> qunlist <- Rsamtools:::.quickUnlist >> tot <- sampleSize <- yieldSize >> bf <- open(BamFile(file, index, yieldSize=yieldSize)) >> smpl <- qunlist(scanBam(bf, param=param)) >> >> repeat { >> yld <- qunlist(scanBam(bf, param=param)) >> yld_n <- length(yld[[1]]) >> if (length(yld[[1]]) == 0L) >> break >> tot <- tot + yld_n >> keep <- rbinom(1L, yld_n, yld_n/ tot) >> if (verbose) >> message(sprintf("sampling %d of %d", keep, tot)) >> if (keep == 0L) >> next >> >> i <- sample(sampleSize, keep) >> j <- sample(yld_n, keep) >> smpl <- Map(function(x, y, i, j) { >> x[i] <- y[j] >> x >> }, smpl, yld, MoreArgs=list(i=i, j=j)) >> } >> >> close(bf) >> smpl >> } >> >> >> with >> >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> param <- ScanBamParam( >> flag=scanBamFlag(isUnmappedQuery=FALSE), >> what=c("rname", "pos", "cigar", "strand")) >> >> lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE) >> >> The end result could be fed to a more friendly structure >> >> names(lst)[names(lst) == "rname"] = "seqname" >> do.call(GappedAlignments, lst) >> >> Lightly tested, especially when ScanBamParam() is non-trivial. >> >> Martin >> >>> >>> Best wishes >>> Julian >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > >
ADD REPLY

Login before adding your answer.

Traffic: 551 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6