Hi,
Is there an efficient function to randomly subsample alignments from a
BAM/SAM within R? In the best case, an equivalent to 'FastqSampler'
in
'ShortRead'. If not, what would be a clever way of doing this with
the
bioc framework, without having to read the whole file first?
Best wishes
Julian
On 03/12/2013 01:32 PM, Julian Gehring wrote:
> Hi,
>
> Is there an efficient function to randomly subsample alignments from
a BAM/SAM
> within R? In the best case, an equivalent to 'FastqSampler' in
'ShortRead'. If
> not, what would be a clever way of doing this with the bioc
framework, without
> having to read the whole file first?
Hi Julian -- there isn't anything at the moment; below is a function
that will
have at most yieldSize * 2 elements in memory at one time. All the
elements in
the bam file (satisfying ScanBamParam) will eventually pass through
memory, so
not super efficient.
sampleBam <-
function(file, index=file, ..., yieldSize, verbose=FALSE,
param=ScanBamParam(what=scanBamWhat()))
{
qunlist <- Rsamtools:::.quickUnlist
tot <- sampleSize <- yieldSize
bf <- open(BamFile(file, index, yieldSize=yieldSize))
smpl <- qunlist(scanBam(bf, param=param))
repeat {
yld <- qunlist(scanBam(bf, param=param))
yld_n <- length(yld[[1]])
if (length(yld[[1]]) == 0L)
break
tot <- tot + yld_n
keep <- rbinom(1L, yld_n, yld_n/ tot)
if (verbose)
message(sprintf("sampling %d of %d", keep, tot))
if (keep == 0L)
next
i <- sample(sampleSize, keep)
j <- sample(yld_n, keep)
smpl <- Map(function(x, y, i, j) {
x[i] <- y[j]
x
}, smpl, yld, MoreArgs=list(i=i, j=j))
}
close(bf)
smpl
}
with
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(
flag=scanBamFlag(isUnmappedQuery=FALSE),
what=c("rname", "pos", "cigar", "strand"))
lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE)
The end result could be fed to a more friendly structure
names(lst)[names(lst) == "rname"] = "seqname"
do.call(GappedAlignments, lst)
Lightly tested, especially when ScanBamParam() is non-trivial.
Martin
>
> Best wishes
> Julian
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
On 03/12/2013 07:45 PM, Martin Morgan wrote:
> On 03/12/2013 01:32 PM, Julian Gehring wrote:
>> Hi,
>>
>> Is there an efficient function to randomly subsample alignments
from a BAM/SAM
>> within R? In the best case, an equivalent to 'FastqSampler' in
'ShortRead'. If
>> not, what would be a clever way of doing this with the bioc
framework, without
>> having to read the whole file first?
>
> Hi Julian -- there isn't anything at the moment; below is a function
that will
> have at most yieldSize * 2 elements in memory at one time. All the
elements in
> the bam file (satisfying ScanBamParam) will eventually pass through
memory, so
> not super efficient.
I took a different path in Rsamtools 1.11.23; one can create a
BamSampler
instance and use it like BamFile, e.g., in readGappedAlignments /
scanBam / ...
> example(BamSampler)
BmSmpl> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
BmSmpl> samp <- BamSampler(fl, yieldSize=1000)
BmSmpl> ## two independent samples
BmSmpl> head(readBamGappedAlignments(samp))
...
BmSmpl> head(readBamGappedAlignments(samp))
...
Still not super-efficient.
Martin
>
> sampleBam <-
> function(file, index=file, ..., yieldSize, verbose=FALSE,
> param=ScanBamParam(what=scanBamWhat()))
> {
> qunlist <- Rsamtools:::.quickUnlist
> tot <- sampleSize <- yieldSize
> bf <- open(BamFile(file, index, yieldSize=yieldSize))
> smpl <- qunlist(scanBam(bf, param=param))
>
> repeat {
> yld <- qunlist(scanBam(bf, param=param))
> yld_n <- length(yld[[1]])
> if (length(yld[[1]]) == 0L)
> break
> tot <- tot + yld_n
> keep <- rbinom(1L, yld_n, yld_n/ tot)
> if (verbose)
> message(sprintf("sampling %d of %d", keep, tot))
> if (keep == 0L)
> next
>
> i <- sample(sampleSize, keep)
> j <- sample(yld_n, keep)
> smpl <- Map(function(x, y, i, j) {
> x[i] <- y[j]
> x
> }, smpl, yld, MoreArgs=list(i=i, j=j))
> }
>
> close(bf)
> smpl
> }
>
>
> with
>
> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> param <- ScanBamParam(
> flag=scanBamFlag(isUnmappedQuery=FALSE),
> what=c("rname", "pos", "cigar", "strand"))
>
> lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE)
>
> The end result could be fed to a more friendly structure
>
> names(lst)[names(lst) == "rname"] = "seqname"
> do.call(GappedAlignments, lst)
>
> Lightly tested, especially when ScanBamParam() is non-trivial.
>
> Martin
>
>>
>> Best wishes
>> Julian
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
Hi Martin,
Thanks, that is very useful. For large-scale subsampling, the command
line tools may still be the way to go, but for many use cases the
BamSampler will be sufficient.
Best wishes
Julian
On 03/13/2013 09:26 PM, Martin Morgan wrote:
> On 03/12/2013 07:45 PM, Martin Morgan wrote:
>> On 03/12/2013 01:32 PM, Julian Gehring wrote:
>>> Hi,
>>>
>>> Is there an efficient function to randomly subsample alignments
from
>>> a BAM/SAM
>>> within R? In the best case, an equivalent to 'FastqSampler' in
>>> 'ShortRead'. If
>>> not, what would be a clever way of doing this with the bioc
>>> framework, without
>>> having to read the whole file first?
>>
>> Hi Julian -- there isn't anything at the moment; below is a
function
>> that will
>> have at most yieldSize * 2 elements in memory at one time. All the
>> elements in
>> the bam file (satisfying ScanBamParam) will eventually pass through
>> memory, so
>> not super efficient.
>
> I took a different path in Rsamtools 1.11.23; one can create a
> BamSampler instance and use it like BamFile, e.g., in
> readGappedAlignments / scanBam / ...
>
> > example(BamSampler)
>
> BmSmpl> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>
> BmSmpl> samp <- BamSampler(fl, yieldSize=1000)
>
> BmSmpl> ## two independent samples
> BmSmpl> head(readBamGappedAlignments(samp))
> ...
> BmSmpl> head(readBamGappedAlignments(samp))
> ...
>
> Still not super-efficient.
>
> Martin
>
>>
>> sampleBam <-
>> function(file, index=file, ..., yieldSize, verbose=FALSE,
>> param=ScanBamParam(what=scanBamWhat()))
>> {
>> qunlist <- Rsamtools:::.quickUnlist
>> tot <- sampleSize <- yieldSize
>> bf <- open(BamFile(file, index, yieldSize=yieldSize))
>> smpl <- qunlist(scanBam(bf, param=param))
>>
>> repeat {
>> yld <- qunlist(scanBam(bf, param=param))
>> yld_n <- length(yld[[1]])
>> if (length(yld[[1]]) == 0L)
>> break
>> tot <- tot + yld_n
>> keep <- rbinom(1L, yld_n, yld_n/ tot)
>> if (verbose)
>> message(sprintf("sampling %d of %d", keep, tot))
>> if (keep == 0L)
>> next
>>
>> i <- sample(sampleSize, keep)
>> j <- sample(yld_n, keep)
>> smpl <- Map(function(x, y, i, j) {
>> x[i] <- y[j]
>> x
>> }, smpl, yld, MoreArgs=list(i=i, j=j))
>> }
>>
>> close(bf)
>> smpl
>> }
>>
>>
>> with
>>
>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
>> param <- ScanBamParam(
>> flag=scanBamFlag(isUnmappedQuery=FALSE),
>> what=c("rname", "pos", "cigar", "strand"))
>>
>> lst <- sampleBam(fl, yieldSize=1000, param=param, verbose=TRUE)
>>
>> The end result could be fed to a more friendly structure
>>
>> names(lst)[names(lst) == "rname"] = "seqname"
>> do.call(GappedAlignments, lst)
>>
>> Lightly tested, especially when ScanBamParam() is non-trivial.
>>
>> Martin
>>
>>>
>>> Best wishes
>>> Julian
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>