[devteam-bioc] Can you give me a sample to calculate the Read Depth with Rsamtools?

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 2 days ago

United States

Hi Renjie -- it's better to ask these questions on the Bioconductor mailing list, e.g., through the 'guest posting' facility if you don't want to subscribe http://bioconductor.org/help/mailing-list/mailform/ That way everyone benefits from both questions and answers. For a reproducible example I ran library(Rsamtools) example(scanBam) This creates a variable 'fl' that points to a small bam file. I made an instance of the BamFile class, then discovered the information about chromosomes in the file bf <- BamFile(fl) si <- seqinfo(bf) which told me that I have two short sequences > si Seqinfo of length 2 seqnames seqlengths isCircular genome seq1 1575 NA <na> seq2 1584 NA <na> Suppose some 'regions of interest' were on 'seq1', e.g., windows of width 200, starting at position 100 and 1000. roi <- GRanges("seq1", IRanges(c(100, 1000), width=200), seqinfo=si) I could calculate coverage in my BAM file, just over my regions of interest cvg <- coverage(bf) and then create 'Views' on my regions of interest v <- Map(Views, cvg, ranges(split(roi, seqnames(roi)))) > v $seq1 Views on a 1575-length Rle subject views: start end width [1] 100 299 200 [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...] [2] 1000 1199 200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43 45 44 ...] $seq2 Views on a 1584-length Rle subject views: NONE You could then access elements of this as, e.g., v$seq1[1] so for a playful animation for (i in seq_along(v$seq1)) { plot(v$seq1[[i]], type="l") Sys.sleep(2) } If using the 'devel' (to be 3.0) version of R, coverage(bf) takes an argument param=ScanBamParam(which=roi) which would just calculate coverage of (reads that overlap) your regions of interest; this would obviously save a lot of computation if you had just a few regions, or regions on a single chromosome. Martin On 01/21/2013 01:29 PM, Maintainer wrote: > Hi Martin, > > I am a new beginner Rsamtools user. I know that?s really a powerful tool. Can > you give me a sample to calculate the Read Depth with Rsamtools? > > My input is a bam file(s) and a series regions like: > > Y:2595298-2595418 > > Y:2601341-2601461 > > Y:2606003-2606363 > > ???????? > > I need to calculate the *Read depth* for every regions above. Not the read count. > > Thanks, > > Renjie > > > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

Coverage Cancer Rsamtools Coverage Cancer Rsamtools • 1.8k views

ADD COMMENT • link updated 13.0 years ago by Renjie Tan ▴ 10 • written 13.0 years ago by Martin Morgan 25k

0

Entering edit mode

Renjie Tan ▴ 10

@renjie-tan-5724

Last seen 11.4 years ago

Hi Martin, Thank you for your answer. I still have some questions about that. 1, the seqinfo() only result to 86 sequences for a really bam file. So I know the 86 sequence is not the read sequence. What the 86 sequences are? And where are the reads? 2, " cvg <- coverage(bf)" this step really time costly. It needs more than 10 mins for a really 5G bam file with a cluster. If I just want to calculate some parts of region's RD. should us read the whole bam file into RAM? If so how can I parse hundreds of bam samples? 3, the Result's output format is like this? What's the " [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...]"parts mean? Which number is the RD for this region? start end width [1] 100 299 200 [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...] [2] 1000 1199 200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43 45 44 ...] 4, the output format I expect is look like GATK's form. Such as Target total_coverage 1:14467-14587 261 1:14639-14883 1974 1:14943-15064 3412 1:15671-15990 374 1:16591-16719 0 I appreciate your kindly help. Thanks, Renjie -----Original Message----- From: Martin Morgan [mailto:mtmorgan@fhcrc.org] Sent: Monday, January 21, 2013 6:45 PM To: Renjie Tan Cc: Maintainer; bioconductor at r-project.org Subject: Re: [devteam-bioc] Can you give me a sample to calculate the Read Depth with Rsamtools? Hi Renjie -- it's better to ask these questions on the Bioconductor mailing list, e.g., through the 'guest posting' facility if you don't want to subscribe http://bioconductor.org/help/mailing-list/mailform/ That way everyone benefits from both questions and answers. For a reproducible example I ran library(Rsamtools) example(scanBam) This creates a variable 'fl' that points to a small bam file. I made an instance of the BamFile class, then discovered the information about chromosomes in the file bf <- BamFile(fl) si <- seqinfo(bf) which told me that I have two short sequences > si Seqinfo of length 2 seqnames seqlengths isCircular genome seq1 1575 NA <na> seq2 1584 NA <na> Suppose some 'regions of interest' were on 'seq1', e.g., windows of width 200, starting at position 100 and 1000. roi <- GRanges("seq1", IRanges(c(100, 1000), width=200), seqinfo=si) I could calculate coverage in my BAM file, just over my regions of interest cvg <- coverage(bf) and then create 'Views' on my regions of interest v <- Map(Views, cvg, ranges(split(roi, seqnames(roi)))) > v $seq1 Views on a 1575-length Rle subject views: start end width [1] 100 299 200 [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...] [2] 1000 1199 200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43 45 44 ...] $seq2 Views on a 1584-length Rle subject views: NONE You could then access elements of this as, e.g., v$seq1[1] so for a playful animation for (i in seq_along(v$seq1)) { plot(v$seq1[[i]], type="l") Sys.sleep(2) } If using the 'devel' (to be 3.0) version of R, coverage(bf) takes an argument param=ScanBamParam(which=roi) which would just calculate coverage of (reads that overlap) your regions of interest; this would obviously save a lot of computation if you had just a few regions, or regions on a single chromosome. Martin On 01/21/2013 01:29 PM, Maintainer wrote: > Hi Martin, > > I am a new beginner Rsamtools user. I know that's really a powerful > tool. Can you give me a sample to calculate the Read Depth with Rsamtools? > > My input is a bam file(s) and a series regions like: > > Y:2595298-2595418 > > Y:2601341-2601461 > > Y:2606003-2606363 > > ........................ > > I need to calculate the *Read depth* for every regions above. Not the read count. > > Thanks, > > Renjie > > > > ______________________________________________________________________ > __ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 13.0 years ago Renjie Tan ▴ 10

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 2 days ago

United States

On 1/22/2013 7:54 AM, Renjie Tan wrote: > Hi Martin, > > Thank you for your answer. I still have some questions about that. > > 1, the seqinfo() only result to 86 sequences for a really bam file. So I know the 86 sequence is not the read sequence. What the 86 sequences are? And where are the reads? seqinfo is returning information about the sequences to which reads have been aligned, e.g., chr1, chr2, chr3, .... This is read from the BAM header file, probably added by the aligner you used. > 2, " cvg <- coverage(bf)" this step really time costly. It needs more than 10 mins for a really 5G bam file with a cluster. If I just want to calculate some parts of region's RD. should us read the whole bam file into RAM? If so how can I parse hundreds of bam samples? In the release version, probably the easiest thing to do is to create a 'GRanges' describing the regions that you are interested in. You could create it by hand if there were only a few regions, but maybe you'd create it by manipulating some annotation resource of one sort or another. By hand, and from your earlier email >> Y:2595298-2595418 >> >> Y:2601341-2601461 >> >> Y:2606003-2606363 maybe you would do starts <- c(2595298, 2601341, 2606003) ends <- c(2595418, 2601461, 2606363) roi <- GRanges("Y", IRanges(starts, ends), seqinfo=si) to define your 'regions of interest', and then use this to input just the reads that overlap (including partially) those regions param <- ScanBamParam(which=roi) gal <- readBamGappedAlignments(bf, param=param) This should be quick, provided there are not too many regions. Then cvg <- coverage(gal) brings you to > 3, the Result's output format is like this? What's the " [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...]"parts mean? Which number is the RD for this region? > start end width > [1] 100 299 200 [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...] > [2] 1000 1199 200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43 45 44 ...] This is a 'View' onto the coverage vector, and it represents the number of reads over the nucleotide at coordinate 100 (9), 101 (9), 102 (8), etc. The coordinates 100-299, 1000-1199 are in this view because in my response I identified those as regions of interest. Obviously you would use 'roi' created above. > 4, the output format I expect is look like GATK's form. Such as > Target total_coverage > 1:14467-14587 261 I guess (but you can tell me if I'm wrong) that 'total_coverage' is the sum of the Views vectors, so from below... > lapply(v, viewSums) $seq1 [1] 4126 8412 $seq2 integer(0) which says that, in the first region of interest, reads contributed a total of 4126 nucleotides. If one were analyzing many BAM files, it would make sense to write a function that takes as an argument a single bam file and perhaps other arguments, and then does whatever it is that you're interested in fun <- function(bamFile, param) { bf <- BamFile(bamFile) gal <- readBamGappedAlignments(bf, param=param) cvg <- coverage(gal) ## etc } then to apply it to a list of bam files fls <- dir("some/where", pattern=".*bam$") res <- lapply(fls, fun, param=param) and then to parallelize it, e.g., on a single machine library(parallel) res <- mclapply(fls, fun, param=param) It is also possible to process the files in a cluster with only a little more difficulty, but that's probably a topic for a different email thread. You might find the resources at http://bioconductor.org/help/course-materials/2012/ e.g., the PDF at least http://bioconductor.org/help/course-materials/2012/CSC2012/ useful, and there is an upcoming course https://secure.bioconductor.org/Seattle-Feb-2013/ that would be suitable for this type of question Martin > 1:14639-14883 1974 > 1:14943-15064 3412 > 1:15671-15990 374 > 1:16591-16719 0 > > I appreciate your kindly help. > > Thanks, > Renjie > > > -----Original Message----- > From: Martin Morgan [mailto:mtmorgan at fhcrc.org] > Sent: Monday, January 21, 2013 6:45 PM > To: Renjie Tan > Cc: Maintainer; bioconductor at r-project.org > Subject: Re: [devteam-bioc] Can you give me a sample to calculate the Read Depth with Rsamtools? > > Hi Renjie -- it's better to ask these questions on the Bioconductor mailing list, e.g., through the 'guest posting' facility if you don't want to subscribe > > http://bioconductor.org/help/mailing-list/mailform/ > > That way everyone benefits from both questions and answers. For a reproducible example I ran > > library(Rsamtools) > example(scanBam) > > This creates a variable 'fl' that points to a small bam file. I made an instance of the BamFile class, then discovered the information about chromosomes in the file > > bf <- BamFile(fl) > si <- seqinfo(bf) > > which told me that I have two short sequences > > > si > Seqinfo of length 2 > seqnames seqlengths isCircular genome > seq1 1575 NA <na> > seq2 1584 NA <na> > > Suppose some 'regions of interest' were on 'seq1', e.g., windows of width 200, starting at position 100 and 1000. > > roi <- GRanges("seq1", IRanges(c(100, 1000), width=200), seqinfo=si) > > I could calculate coverage in my BAM file, just over my regions of interest > > cvg <- coverage(bf) > > and then create 'Views' on my regions of interest > > v <- Map(Views, cvg, ranges(split(roi, seqnames(roi)))) > > > v > $seq1 > Views on a 1575-length Rle subject > > views: > start end width > [1] 100 299 200 [ 9 9 8 9 9 8 8 8 7 7 8 8 9 10 7 6 6 ...] > [2] 1000 1199 200 [35 36 38 38 39 40 40 39 39 39 40 39 40 43 43 45 44 ...] > > $seq2 > Views on a 1584-length Rle subject > > views: NONE > > You could then access elements of this as, e.g., v$seq1[1] so for a playful animation > > for (i in seq_along(v$seq1)) { > plot(v$seq1[[i]], type="l") > Sys.sleep(2) > } > > If using the 'devel' (to be 3.0) version of R, coverage(bf) takes an argument > param=ScanBamParam(which=roi) which would just calculate coverage of (reads that > overlap) your regions of interest; this would obviously save a lot of computation if you had just a few regions, or regions on a single chromosome. > > Martin > > On 01/21/2013 01:29 PM, Maintainer wrote: >> Hi Martin, >> >> I am a new beginner Rsamtools user. I know that's really a powerful >> tool. Can you give me a sample to calculate the Read Depth with Rsamtools? >> >> My input is a bam file(s) and a series regions like: >> >> Y:2595298-2595418 >> >> Y:2601341-2601461 >> >> Y:2606003-2606363 >> >> ........................ >> >> I need to calculate the *Read depth* for every regions above. Not the read count. >> >> Thanks, >> >> Renjie >> >> >> >> ______________________________________________________________________ >> __ >> devteam-bioc mailing list >> To unsubscribe from this mailing list send a blank email to >> devteam-bioc-leave at lists.fhcrc.org >> You can also unsubscribe or change your personal options at >> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > -- Dr. Martin Morgan, PhD Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

ADD COMMENT • link 13.0 years ago Martin Morgan 25k

Login before adding your answer.