On 1/22/2013 7:54 AM, Renjie Tan wrote:
> Hi Martin,
>
> Thank you for your answer. I still have some questions about that.
>
> 1, the seqinfo() only result to 86 sequences for a really bam file.
So I know the 86 sequence is not the read sequence. What the 86
sequences are? And where are the reads?
seqinfo is returning information about the sequences to which reads
have been
aligned, e.g., chr1, chr2, chr3, .... This is read from the BAM header
file,
probably added by the aligner you used.
> 2, " cvg <- coverage(bf)" this step really time costly. It needs
more than 10 mins for a really 5G bam file with a cluster. If I just
want to calculate some parts of region's RD. should us read the whole
bam file into RAM? If so how can I parse hundreds of bam samples?
In the release version, probably the easiest thing to do is to create
a
'GRanges' describing the regions that you are interested in. You could
create it
by hand if there were only a few regions, but maybe you'd create it by
manipulating some annotation resource of one sort or another. By hand,
and from
your earlier email
>> Y:2595298-2595418
>>
>> Y:2601341-2601461
>>
>> Y:2606003-2606363
maybe you would do
starts <- c(2595298, 2601341, 2606003)
ends <- c(2595418, 2601461, 2606363)
roi <- GRanges("Y", IRanges(starts, ends), seqinfo=si)
to define your 'regions of interest', and then use this to input just
the reads
that overlap (including partially) those regions
param <- ScanBamParam(which=roi)
gal <- readBamGappedAlignments(bf, param=param)
This should be quick, provided there are not too many regions. Then
cvg <- coverage(gal)
brings you to
> 3, the Result's output format is like this? What's the " [ 9 9 8
9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...]"parts mean? Which number
is the RD for this region?
> start end width
> [1] 100 299 200 [ 9 9 8 9 9 8 8 8 7 7 8 8 9
10 7 6 6 ...]
> [2] 1000 1199 200 [35 36 38 38 39 40 40 39 39 39 40 39 40
43 43 45 44 ...]
This is a 'View' onto the coverage vector, and it represents the
number of reads
over the nucleotide at coordinate 100 (9), 101 (9), 102 (8), etc. The
coordinates 100-299, 1000-1199 are in this view because in my response
I
identified those as regions of interest. Obviously you would use 'roi'
created
above.
> 4, the output format I expect is look like GATK's form. Such as
> Target total_coverage
> 1:14467-14587 261
I guess (but you can tell me if I'm wrong) that 'total_coverage' is
the sum of
the Views vectors, so from below...
> lapply(v, viewSums)
$seq1
[1] 4126 8412
$seq2
integer(0)
which says that, in the first region of interest, reads contributed a
total of
4126 nucleotides.
If one were analyzing many BAM files, it would make sense to write a
function
that takes as an argument a single bam file and perhaps other
arguments, and
then does whatever it is that you're interested in
fun <- function(bamFile, param) {
bf <- BamFile(bamFile)
gal <- readBamGappedAlignments(bf, param=param)
cvg <- coverage(gal)
## etc
}
then to apply it to a list of bam files
fls <- dir("some/where", pattern=".*bam$")
res <- lapply(fls, fun, param=param)
and then to parallelize it, e.g., on a single machine
library(parallel)
res <- mclapply(fls, fun, param=param)
It is also possible to process the files in a cluster with only a
little more
difficulty, but that's probably a topic for a different email thread.
You might find the resources at
http://bioconductor.org/help/course-materials/2012/
e.g., the PDF at least
http://bioconductor.org/help/course-materials/2012/CSC2012/
useful, and there is an upcoming course
https://secure.bioconductor.org/Seattle-Feb-2013/
that would be suitable for this type of question
Martin
> 1:14639-14883 1974
> 1:14943-15064 3412
> 1:15671-15990 374
> 1:16591-16719 0
>
> I appreciate your kindly help.
>
> Thanks,
> Renjie
>
>
> -----Original Message-----
> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
> Sent: Monday, January 21, 2013 6:45 PM
> To: Renjie Tan
> Cc: Maintainer; bioconductor at r-project.org
> Subject: Re: [devteam-bioc] Can you give me a sample to calculate
the Read Depth with Rsamtools?
>
> Hi Renjie -- it's better to ask these questions on the Bioconductor
mailing list, e.g., through the 'guest posting' facility if you don't
want to subscribe
>
>
http://bioconductor.org/help/mailing-list/mailform/
>
> That way everyone benefits from both questions and answers. For a
reproducible example I ran
>
> library(Rsamtools)
> example(scanBam)
>
> This creates a variable 'fl' that points to a small bam file. I made
an instance of the BamFile class, then discovered the information
about chromosomes in the file
>
> bf <- BamFile(fl)
> si <- seqinfo(bf)
>
> which told me that I have two short sequences
>
> > si
> Seqinfo of length 2
> seqnames seqlengths isCircular genome
> seq1 1575 NA <na>
> seq2 1584 NA <na>
>
> Suppose some 'regions of interest' were on 'seq1', e.g., windows of
width 200, starting at position 100 and 1000.
>
> roi <- GRanges("seq1", IRanges(c(100, 1000), width=200),
seqinfo=si)
>
> I could calculate coverage in my BAM file, just over my regions of
interest
>
> cvg <- coverage(bf)
>
> and then create 'Views' on my regions of interest
>
> v <- Map(Views, cvg, ranges(split(roi, seqnames(roi))))
>
> > v
> $seq1
> Views on a 1575-length Rle subject
>
> views:
> start end width
> [1] 100 299 200 [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7
6 6 ...]
> [2] 1000 1199 200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43
45 44 ...]
>
> $seq2
> Views on a 1584-length Rle subject
>
> views: NONE
>
> You could then access elements of this as, e.g., v$seq1[1] so for a
playful animation
>
> for (i in seq_along(v$seq1)) {
> plot(v$seq1[[i]], type="l")
> Sys.sleep(2)
> }
>
> If using the 'devel' (to be 3.0) version of R, coverage(bf) takes an
argument
> param=ScanBamParam(which=roi) which would just calculate coverage of
(reads that
> overlap) your regions of interest; this would obviously save a lot
of computation if you had just a few regions, or regions on a single
chromosome.
>
> Martin
>
> On 01/21/2013 01:29 PM, Maintainer wrote:
>> Hi Martin,
>>
>> I am a new beginner Rsamtools user. I know that's really a powerful
>> tool. Can you give me a sample to calculate the Read Depth with
Rsamtools?
>>
>> My input is a bam file(s) and a series regions like:
>>
>> Y:2595298-2595418
>>
>> Y:2601341-2601461
>>
>> Y:2606003-2606363
>>
>> ........................
>>
>> I need to calculate the *Read depth* for every regions above. Not
the read count.
>>
>> Thanks,
>>
>> Renjie
>>
>>
>>
>>
______________________________________________________________________
>> __
>> devteam-bioc mailing list
>> To unsubscribe from this mailing list send a blank email to
>> devteam-bioc-leave at lists.fhcrc.org
>> You can also unsubscribe or change your personal options at
>>
https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>>
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109