Rsubread/Rsamtools mappings outside of chromosome ranges
5
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 10 months ago
United States
With some Illumina libraries I have been running into problems importing the read mappings from Rsubread into Rsamtools. After some testing I found out that some reads reported in Rsubread's SAM output have their end positions outside of the chromosome ranges. If those reads are removed from the SAM file then the import into Rsamtools works just fine. Below is a reproducible example of this problem. I could think of several solutions to fix this, e.g. not reporting reads outside of chromosome ranges or updating their length in the CIAGR string. An additional option could be to remove invalid mappings during the import into Rsamtools. Thanks, Thomas ################################ ## Read mapping with Rsubread ## ################################ library(Rsubread) ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta") align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2) ##################################### ## Process SAM file with Rsamtools ## ##################################### library(Rsamtools) asBam(file="SRR390302_subread.sam", destination="SRR390302_subread") [1] "./data/SRR390302_subread.bam" reads <- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE) Error in validObject(.Object) : invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds ## The error disappears if the following two reads are removed from the SAM file: ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside ## the range of Chr5 as illustrated here: ## Relevant chunks of SAM file generated by Rsubread @SQ SN:Chr1 LN:30427671 @SQ SN:Chr2 LN:19698289 @SQ SN:Chr3 LN:23459830 @SQ SN:Chr4 LN:18585056 @SQ SN:Chr5 LN:26975502 @SQ SN:ChrC LN:154478 @SQ SN:ChrM LN:366924 SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0 SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@ ... SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4( SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2& > sessionInfo() R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2 loaded via a namespace (and not attached): [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0
PROcess Rsamtools Rsubread TCC PROcess Rsamtools Rsubread TCC • 2.4k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States
On 02/10/2012 07:44 PM, Thomas Girke wrote: > With some Illumina libraries I have been running into problems importing > the read mappings from Rsubread into Rsamtools. After some testing I > found out that some reads reported in Rsubread's SAM output have their > end positions outside of the chromosome ranges. If those reads are > removed from the SAM file then the import into Rsamtools works just > fine. Below is a reproducible example of this problem. > > I could think of several solutions to fix this, e.g. not reporting reads > outside of chromosome ranges or updating their length in the CIAGR string. > An additional option could be to remove invalid mappings during the import > into Rsamtools. Hi Thomas -- for the latter and for the case where those reads are intrinsically not interesting, it might be reasonable to specify a range on readGappedAlignments that precludes the possibility of extension beyond sequence ends. > si <- seqinfo(BamFile(fl)) > gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34)) > bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr)) One would like a better solution, either doing this automatically with a warning, or warning rather than stopping on reads overlapping ends. Martin > > Thanks, > > Thomas > > ################################ > ## Read mapping with Rsubread ## > ################################ > library(Rsubread) > ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ > ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ > buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta") > align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2) > > ##################################### > ## Process SAM file with Rsamtools ## > ##################################### > library(Rsamtools) > asBam(file="SRR390302_subread.sam", destination="SRR390302_subread") > [1] "./data/SRR390302_subread.bam" > reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE) > Error in validObject(.Object) : > invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds > > ## The error disappears if the following two reads are removed from the SAM file: > ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside > ## the range of Chr5 as illustrated here: > > ## Relevant chunks of SAM file generated by Rsubread > @SQ SN:Chr1 LN:30427671 > @SQ SN:Chr2 LN:19698289 > @SQ SN:Chr3 LN:23459830 > @SQ SN:Chr4 LN:18585056 > @SQ SN:Chr5 LN:26975502 > @SQ SN:ChrC LN:154478 > @SQ SN:ChrM LN:366924 > SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0 > SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@ > ... > SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4( > SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2& > > >> sessionInfo() > R version 2.14.0 (2011-10-31) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2 > > loaded via a namespace (and not attached): > [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD COMMENT
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 10 months ago
United States
Thanks Martin for your suggestion! Thomas On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: > On 02/10/2012 07:44 PM, Thomas Girke wrote: > > With some Illumina libraries I have been running into problems importing > > the read mappings from Rsubread into Rsamtools. After some testing I > > found out that some reads reported in Rsubread's SAM output have their > > end positions outside of the chromosome ranges. If those reads are > > removed from the SAM file then the import into Rsamtools works just > > fine. Below is a reproducible example of this problem. > > > > I could think of several solutions to fix this, e.g. not reporting reads > > outside of chromosome ranges or updating their length in the CIAGR string. > > An additional option could be to remove invalid mappings during the import > > into Rsamtools. > > Hi Thomas -- > > for the latter and for the case where those reads are intrinsically not > interesting, it might be reasonable to specify a range on > readGappedAlignments that precludes the possibility of extension beyond > sequence ends. > > > si <- seqinfo(BamFile(fl)) > > gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34)) > > bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr)) > > One would like a better solution, either doing this automatically with a > warning, or warning rather than stopping on reads overlapping ends. > > Martin > > > > > Thanks, > > > > Thomas > > > > ################################ > > ## Read mapping with Rsubread ## > > ################################ > > library(Rsubread) > > ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ > > ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ > > buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta") > > align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2) > > > > ##################################### > > ## Process SAM file with Rsamtools ## > > ##################################### > > library(Rsamtools) > > asBam(file="SRR390302_subread.sam", destination="SRR390302_subread") > > [1] "./data/SRR390302_subread.bam" > > reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE) > > Error in validObject(.Object) : > > invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds > > > > ## The error disappears if the following two reads are removed from the SAM file: > > ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside > > ## the range of Chr5 as illustrated here: > > > > ## Relevant chunks of SAM file generated by Rsubread > > @SQ SN:Chr1 LN:30427671 > > @SQ SN:Chr2 LN:19698289 > > @SQ SN:Chr3 LN:23459830 > > @SQ SN:Chr4 LN:18585056 > > @SQ SN:Chr5 LN:26975502 > > @SQ SN:ChrC LN:154478 > > @SQ SN:ChrM LN:366924 > > SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0 > > SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@ > > ... > > SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4( > > SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2& > > > > > >> sessionInfo() > > R version 2.14.0 (2011-10-31) > > Platform: x86_64-unknown-linux-gnu (64-bit) > > > > locale: > > [1] C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2 > > > > loaded via a namespace (and not attached): > > [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793
ADD COMMENT
0
Entering edit mode
Dear Thomas, Sorry for the problem you have encountered. This was a bug we have identified a little while ago and it has been fixed in the C version of subread aligner (http://subread.sourceforge.net). I'm now updating the C code in Rsubread package to fix this and will let you know when this is done (should be in a couple of days). Cheers, Wei On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote: > Thanks Martin for your suggestion! > > Thomas > > On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: >> On 02/10/2012 07:44 PM, Thomas Girke wrote: >>> With some Illumina libraries I have been running into problems importing >>> the read mappings from Rsubread into Rsamtools. After some testing I >>> found out that some reads reported in Rsubread's SAM output have their >>> end positions outside of the chromosome ranges. If those reads are >>> removed from the SAM file then the import into Rsamtools works just >>> fine. Below is a reproducible example of this problem. >>> >>> I could think of several solutions to fix this, e.g. not reporting reads >>> outside of chromosome ranges or updating their length in the CIAGR string. >>> An additional option could be to remove invalid mappings during the import >>> into Rsamtools. >> >> Hi Thomas -- >> >> for the latter and for the case where those reads are intrinsically not >> interesting, it might be reasonable to specify a range on >> readGappedAlignments that precludes the possibility of extension beyond >> sequence ends. >> >>> si <- seqinfo(BamFile(fl)) >>> gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34)) >>> bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr)) >> >> One would like a better solution, either doing this automatically with a >> warning, or warning rather than stopping on reads overlapping ends. >> >> Martin >> >>> >>> Thanks, >>> >>> Thomas >>> >>> ################################ >>> ## Read mapping with Rsubread ## >>> ################################ >>> library(Rsubread) >>> ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ >>> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ >>> buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta") >>> align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2) >>> >>> ##################################### >>> ## Process SAM file with Rsamtools ## >>> ##################################### >>> library(Rsamtools) >>> asBam(file="SRR390302_subread.sam", destination="SRR390302_subread") >>> [1] "./data/SRR390302_subread.bam" >>> reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE) >>> Error in validObject(.Object) : >>> invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds >>> >>> ## The error disappears if the following two reads are removed from the SAM file: >>> ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside >>> ## the range of Chr5 as illustrated here: >>> >>> ## Relevant chunks of SAM file generated by Rsubread >>> @SQ SN:Chr1 LN:30427671 >>> @SQ SN:Chr2 LN:19698289 >>> @SQ SN:Chr3 LN:23459830 >>> @SQ SN:Chr4 LN:18585056 >>> @SQ SN:Chr5 LN:26975502 >>> @SQ SN:ChrC LN:154478 >>> @SQ SN:ChrM LN:366924 >>> SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0 >>> SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@ >>> ... >>> SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4( >>> SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2& >>> >>> >>>> sessionInfo() >>> R version 2.14.0 (2011-10-31) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2 >>> >>> loaded via a namespace (and not attached): >>> [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> -- >> Computational Biology >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >> >> Location: M1-B861 >> Telephone: 206 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD REPLY
0
Entering edit mode
Hi Thomas, I have just committed changes to both the release version and the devel version of Rsubread package to fix the bug of mapping reads out of chromosomal boundaries. They should be available in 24-36 hours. Please rebuild your index when you rerun your alignment. Let me know if the problem persists or you encounter further problems. Cheers, Wei On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote: > Thanks Martin for your suggestion! > > Thomas > > On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: >> On 02/10/2012 07:44 PM, Thomas Girke wrote: >>> With some Illumina libraries I have been running into problems importing >>> the read mappings from Rsubread into Rsamtools. After some testing I >>> found out that some reads reported in Rsubread's SAM output have their >>> end positions outside of the chromosome ranges. If those reads are >>> removed from the SAM file then the import into Rsamtools works just >>> fine. Below is a reproducible example of this problem. >>> >>> I could think of several solutions to fix this, e.g. not reporting reads >>> outside of chromosome ranges or updating their length in the CIAGR string. >>> An additional option could be to remove invalid mappings during the import >>> into Rsamtools. >> >> Hi Thomas -- >> >> for the latter and for the case where those reads are intrinsically not >> interesting, it might be reasonable to specify a range on >> readGappedAlignments that precludes the possibility of extension beyond >> sequence ends. >> >>> si <- seqinfo(BamFile(fl)) >>> gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34)) >>> bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr)) >> >> One would like a better solution, either doing this automatically with a >> warning, or warning rather than stopping on reads overlapping ends. >> >> Martin >> >>> >>> Thanks, >>> >>> Thomas >>> >>> ################################ >>> ## Read mapping with Rsubread ## >>> ################################ >>> library(Rsubread) >>> ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ >>> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ >>> buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta") >>> align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2) >>> >>> ##################################### >>> ## Process SAM file with Rsamtools ## >>> ##################################### >>> library(Rsamtools) >>> asBam(file="SRR390302_subread.sam", destination="SRR390302_subread") >>> [1] "./data/SRR390302_subread.bam" >>> reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE) >>> Error in validObject(.Object) : >>> invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds >>> >>> ## The error disappears if the following two reads are removed from the SAM file: >>> ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside >>> ## the range of Chr5 as illustrated here: >>> >>> ## Relevant chunks of SAM file generated by Rsubread >>> @SQ SN:Chr1 LN:30427671 >>> @SQ SN:Chr2 LN:19698289 >>> @SQ SN:Chr3 LN:23459830 >>> @SQ SN:Chr4 LN:18585056 >>> @SQ SN:Chr5 LN:26975502 >>> @SQ SN:ChrC LN:154478 >>> @SQ SN:ChrM LN:366924 >>> SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0 >>> SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@ >>> ... >>> SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4( >>> SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2& >>> >>> >>>> sessionInfo() >>> R version 2.14.0 (2011-10-31) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> >>> locale: >>> [1] C >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2 >>> >>> loaded via a namespace (and not attached): >>> [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> -- >> Computational Biology >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >> >> Location: M1-B861 >> Telephone: 206 667-2793 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD REPLY
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 10 months ago
United States
That's great, thanks a lot! BTW: are there any published/shareable performance test results available comparing subread against other RNA-read to genome aligners such as Tophat with respect to the accuracy of aligning reads across exon/intron junctions? Thanks, Thomas On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote: > Hi Thomas, > > I have just committed changes to both the release version and the devel version of Rsubread package to fix the bug of mapping reads out of chromosomal boundaries. They should be available in 24-36 hours. Please rebuild your index when you rerun your alignment. > > Let me know if the problem persists or you encounter further problems. > > Cheers, > Wei > > On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote: > > > Thanks Martin for your suggestion! > > > > Thomas > > > > On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: > >> On 02/10/2012 07:44 PM, Thomas Girke wrote: > >>> With some Illumina libraries I have been running into problems importing > >>> the read mappings from Rsubread into Rsamtools. After some testing I > >>> found out that some reads reported in Rsubread's SAM output have their > >>> end positions outside of the chromosome ranges. If those reads are > >>> removed from the SAM file then the import into Rsamtools works just > >>> fine. Below is a reproducible example of this problem. > >>> > >>> I could think of several solutions to fix this, e.g. not reporting reads > >>> outside of chromosome ranges or updating their length in the CIAGR string. > >>> An additional option could be to remove invalid mappings during the import > >>> into Rsamtools. > >> > >> Hi Thomas -- > >> > >> for the latter and for the case where those reads are intrinsically not > >> interesting, it might be reasonable to specify a range on > >> readGappedAlignments that precludes the possibility of extension beyond > >> sequence ends. > >> > >>> si <- seqinfo(BamFile(fl)) > >>> gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34)) > >>> bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr)) > >> > >> One would like a better solution, either doing this automatically with a > >> warning, or warning rather than stopping on reads overlapping ends. > >> > >> Martin > >> > >>> > >>> Thanks, > >>> > >>> Thomas > >>> > >>> ################################ > >>> ## Read mapping with Rsubread ## > >>> ################################ > >>> library(Rsubread) > >>> ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ > >>> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ > >>> buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta") > >>> align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2) > >>> > >>> ##################################### > >>> ## Process SAM file with Rsamtools ## > >>> ##################################### > >>> library(Rsamtools) > >>> asBam(file="SRR390302_subread.sam", destination="SRR390302_subread") > >>> [1] "./data/SRR390302_subread.bam" > >>> reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE) > >>> Error in validObject(.Object) : > >>> invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds > >>> > >>> ## The error disappears if the following two reads are removed from the SAM file: > >>> ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside > >>> ## the range of Chr5 as illustrated here: > >>> > >>> ## Relevant chunks of SAM file generated by Rsubread > >>> @SQ SN:Chr1 LN:30427671 > >>> @SQ SN:Chr2 LN:19698289 > >>> @SQ SN:Chr3 LN:23459830 > >>> @SQ SN:Chr4 LN:18585056 > >>> @SQ SN:Chr5 LN:26975502 > >>> @SQ SN:ChrC LN:154478 > >>> @SQ SN:ChrM LN:366924 > >>> SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0 > >>> SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@ > >>> ... > >>> SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4( > >>> SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2& > >>> > >>> > >>>> sessionInfo() > >>> R version 2.14.0 (2011-10-31) > >>> Platform: x86_64-unknown-linux-gnu (64-bit) > >>> > >>> locale: > >>> [1] C > >>> > >>> attached base packages: > >>> [1] stats graphics grDevices utils datasets methods base > >>> > >>> other attached packages: > >>> [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2 > >>> > >>> loaded via a namespace (and not attached): > >>> [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0 > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> -- > >> Computational Biology > >> Fred Hutchinson Cancer Research Center > >> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > >> > >> Location: M1-B861 > >> Telephone: 206 667-2793 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:4}}
ADD COMMENT
0
Entering edit mode
Hi Thomas, Although subread can map junction reads to the genome which is enough for performing an RNA-seq expression analysis, it does not have the capacity to identify splicing points in the genome. However, the subjunc function included in Rsubread package can be used for this purpose. It takes as input the SAM file outputted from the align() function and then uses the discovered junction reads (CIGAR strings of such reads contain letter 'N') to find exon-exon junctions. It can detect junction locations in the reads which are as close as 4bp from the ends of the reads. We have compared subjunc with tophat, mapsplice and splicemap and obtained quite nice results. I could not make these results publicly available before we get our algorithms published, but I will be happy to send you some of the evaluation results off the list. Cheers, Wei The subjunc function in Rsubread package On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote: > That's great, thanks a lot! > > BTW: are there any published/shareable performance test results > available comparing subread against other RNA-read to genome aligners > such as Tophat with respect to the accuracy of aligning reads across > exon/intron junctions? > > Thanks, > > Thomas > > On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote: >> Hi Thomas, >> >> I have just committed changes to both the release version and the devel version of Rsubread package to fix the bug of mapping reads out of chromosomal boundaries. They should be available in 24-36 hours. Please rebuild your index when you rerun your alignment. >> >> Let me know if the problem persists or you encounter further problems. >> >> Cheers, >> Wei >> >> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote: >> >>> Thanks Martin for your suggestion! >>> >>> Thomas >>> >>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: >>>> On 02/10/2012 07:44 PM, Thomas Girke wrote: >>>>> With some Illumina libraries I have been running into problems importing >>>>> the read mappings from Rsubread into Rsamtools. After some testing I >>>>> found out that some reads reported in Rsubread's SAM output have their >>>>> end positions outside of the chromosome ranges. If those reads are >>>>> removed from the SAM file then the import into Rsamtools works just >>>>> fine. Below is a reproducible example of this problem. >>>>> >>>>> I could think of several solutions to fix this, e.g. not reporting reads >>>>> outside of chromosome ranges or updating their length in the CIAGR string. >>>>> An additional option could be to remove invalid mappings during the import >>>>> into Rsamtools. >>>> >>>> Hi Thomas -- >>>> >>>> for the latter and for the case where those reads are intrinsically not >>>> interesting, it might be reasonable to specify a range on >>>> readGappedAlignments that precludes the possibility of extension beyond >>>> sequence ends. >>>> >>>>> si <- seqinfo(BamFile(fl)) >>>>> gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34)) >>>>> bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr)) >>>> >>>> One would like a better solution, either doing this automatically with a >>>> warning, or warning rather than stopping on reads overlapping ends. >>>> >>>> Martin >>>> >>>>> >>>>> Thanks, >>>>> >>>>> Thomas >>>>> >>>>> ################################ >>>>> ## Read mapping with Rsubread ## >>>>> ################################ >>>>> library(Rsubread) >>>>> ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ >>>>> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ >>>>> buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta") >>>>> align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2) >>>>> >>>>> ##################################### >>>>> ## Process SAM file with Rsamtools ## >>>>> ##################################### >>>>> library(Rsamtools) >>>>> asBam(file="SRR390302_subread.sam", destination="SRR390302_subread") >>>>> [1] "./data/SRR390302_subread.bam" >>>>> reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE) >>>>> Error in validObject(.Object) : >>>>> invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds >>>>> >>>>> ## The error disappears if the following two reads are removed from the SAM file: >>>>> ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside >>>>> ## the range of Chr5 as illustrated here: >>>>> >>>>> ## Relevant chunks of SAM file generated by Rsubread >>>>> @SQ SN:Chr1 LN:30427671 >>>>> @SQ SN:Chr2 LN:19698289 >>>>> @SQ SN:Chr3 LN:23459830 >>>>> @SQ SN:Chr4 LN:18585056 >>>>> @SQ SN:Chr5 LN:26975502 >>>>> @SQ SN:ChrC LN:154478 >>>>> @SQ SN:ChrM LN:366924 >>>>> SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0 >>>>> SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@ >>>>> ... >>>>> SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4( >>>>> SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2& >>>>> >>>>> >>>>>> sessionInfo() >>>>> R version 2.14.0 (2011-10-31) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] C >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices utils datasets methods base >>>>> >>>>> other attached packages: >>>>> [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> -- >>>> Computational Biology >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>>> >>>> Location: M1-B861 >>>> Telephone: 206 667-2793 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD REPLY
0
Entering edit mode
Hi Thomas, After re-checking the changes we have made before for addressing the issue of reads mapped out of the chromosome ranges, I found we actually made the decision to allow reads to be mapped out of the chromosome ranges, due to the fact that the reference sequences might not be complete and reads could originate from outside of the reference sequences used for mapping. So you may take Martin's suggestion to add extra parameters when you call 'readGappedAlignments' function to make it work for you. Or alternatively you may try the 'featureCounts' function in Rsubread package to count the number of reads mapped to each gene, if the purpose of your analysis is perform RNA-seq analysis. This function works on SAM file, so you do not need to convert it to a BAM file. But you will have to provide an Arabidopsis anntation file it as it has built-in annotation for mouse and human only at present. I think it might be better if the 'readGappedAlignments' function could allow reads being mapped outside of chromosomal ranges. We have been successfully running through a SNP discovery pipeline using subread/subjunc, samtools and vcftools on unix command lines, which does not complain about the reads mapped outside of the chromosomal ranges. I guess many other downstream analysis tools do not require mapping locations of reads to be within the chromosomal ranges as well. Hope this helps. Cheers, Wei On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote: > That's great, thanks a lot! > > BTW: are there any published/shareable performance test results > available comparing subread against other RNA-read to genome aligners > such as Tophat with respect to the accuracy of aligning reads across > exon/intron junctions? > > Thanks, > > Thomas > > On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote: >> Hi Thomas, >> >> I have just committed changes to both the release version and the devel version of Rsubread package to fix the bug of mapping reads out of chromosomal boundaries. They should be available in 24-36 hours. Please rebuild your index when you rerun your alignment. >> >> Let me know if the problem persists or you encounter further problems. >> >> Cheers, >> Wei >> >> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote: >> >>> Thanks Martin for your suggestion! >>> >>> Thomas >>> >>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: >>>> On 02/10/2012 07:44 PM, Thomas Girke wrote: >>>>> With some Illumina libraries I have been running into problems importing >>>>> the read mappings from Rsubread into Rsamtools. After some testing I >>>>> found out that some reads reported in Rsubread's SAM output have their >>>>> end positions outside of the chromosome ranges. If those reads are >>>>> removed from the SAM file then the import into Rsamtools works just >>>>> fine. Below is a reproducible example of this problem. >>>>> >>>>> I could think of several solutions to fix this, e.g. not reporting reads >>>>> outside of chromosome ranges or updating their length in the CIAGR string. >>>>> An additional option could be to remove invalid mappings during the import >>>>> into Rsamtools. >>>> >>>> Hi Thomas -- >>>> >>>> for the latter and for the case where those reads are intrinsically not >>>> interesting, it might be reasonable to specify a range on >>>> readGappedAlignments that precludes the possibility of extension beyond >>>> sequence ends. >>>> >>>>> si <- seqinfo(BamFile(fl)) >>>>> gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34)) >>>>> bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr)) >>>> >>>> One would like a better solution, either doing this automatically with a >>>> warning, or warning rather than stopping on reads overlapping ends. >>>> >>>> Martin >>>> >>>>> >>>>> Thanks, >>>>> >>>>> Thomas >>>>> >>>>> ################################ >>>>> ## Read mapping with Rsubread ## >>>>> ################################ >>>>> library(Rsubread) >>>>> ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ >>>>> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ >>>>> buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta") >>>>> align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2) >>>>> >>>>> ##################################### >>>>> ## Process SAM file with Rsamtools ## >>>>> ##################################### >>>>> library(Rsamtools) >>>>> asBam(file="SRR390302_subread.sam", destination="SRR390302_subread") >>>>> [1] "./data/SRR390302_subread.bam" >>>>> reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE) >>>>> Error in validObject(.Object) : >>>>> invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds >>>>> >>>>> ## The error disappears if the following two reads are removed from the SAM file: >>>>> ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside >>>>> ## the range of Chr5 as illustrated here: >>>>> >>>>> ## Relevant chunks of SAM file generated by Rsubread >>>>> @SQ SN:Chr1 LN:30427671 >>>>> @SQ SN:Chr2 LN:19698289 >>>>> @SQ SN:Chr3 LN:23459830 >>>>> @SQ SN:Chr4 LN:18585056 >>>>> @SQ SN:Chr5 LN:26975502 >>>>> @SQ SN:ChrC LN:154478 >>>>> @SQ SN:ChrM LN:366924 >>>>> SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0 >>>>> SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@ >>>>> ... >>>>> SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4( >>>>> SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2& >>>>> >>>>> >>>>>> sessionInfo() >>>>> R version 2.14.0 (2011-10-31) >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) >>>>> >>>>> locale: >>>>> [1] C >>>>> >>>>> attached base packages: >>>>> [1] stats graphics grDevices utils datasets methods base >>>>> >>>>> other attached packages: >>>>> [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2 >>>>> >>>>> loaded via a namespace (and not attached): >>>>> [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> -- >>>> Computational Biology >>>> Fred Hutchinson Cancer Research Center >>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>>> >>>> Location: M1-B861 >>>> Telephone: 206 667-2793 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD REPLY
0
Entering edit mode
On 02/14/2012 02:36 PM, Wei Shi wrote: > Hi Thomas, > > After re-checking the changes we have made before for addressing the > issue of reads mapped out of the chromosome ranges, I found we > actually made the decision to allow reads to be mapped out of the > chromosome ranges, due to the fact that the reference sequences might > not be complete and reads could originate from outside of the > reference sequences used for mapping. > > So you may take Martin's suggestion to add extra parameters when you Actually, Herve has patched GenomicRanges (1.7.25) in the development branch so that ranges outside sequence bounds are now accepted (with a warning). Martin > call 'readGappedAlignments' function to make it work for you. Or > alternatively you may try the 'featureCounts' function in Rsubread > package to count the number of reads mapped to each gene, if the > purpose of your analysis is perform RNA-seq analysis. This function > works on SAM file, so you do not need to convert it to a BAM file. > But you will have to provide an Arabidopsis anntation file it as it > has built-in annotation for mouse and human only at present. > > I think it might be better if the 'readGappedAlignments' function > could allow reads being mapped outside of chromosomal ranges. We have > been successfully running through a SNP discovery pipeline using > subread/subjunc, samtools and vcftools on unix command lines, which > does not complain about the reads mapped outside of the chromosomal > ranges. I guess many other downstream analysis tools do not require > mapping locations of reads to be within the chromosomal ranges as > well. > > Hope this helps. > > Cheers, Wei > > > > On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote: > >> That's great, thanks a lot! >> >> BTW: are there any published/shareable performance test results >> available comparing subread against other RNA-read to genome >> aligners such as Tophat with respect to the accuracy of aligning >> reads across exon/intron junctions? >> >> Thanks, >> >> Thomas >> >> On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote: >>> Hi Thomas, >>> >>> I have just committed changes to both the release version and the >>> devel version of Rsubread package to fix the bug of mapping reads >>> out of chromosomal boundaries. They should be available in 24-36 >>> hours. Please rebuild your index when you rerun your alignment. >>> >>> Let me know if the problem persists or you encounter further >>> problems. >>> >>> Cheers, Wei >>> >>> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote: >>> >>>> Thanks Martin for your suggestion! >>>> >>>> Thomas >>>> >>>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: >>>>> On 02/10/2012 07:44 PM, Thomas Girke wrote: >>>>>> With some Illumina libraries I have been running into >>>>>> problems importing the read mappings from Rsubread into >>>>>> Rsamtools. After some testing I found out that some reads >>>>>> reported in Rsubread's SAM output have their end positions >>>>>> outside of the chromosome ranges. If those reads are >>>>>> removed from the SAM file then the import into Rsamtools >>>>>> works just fine. Below is a reproducible example of this >>>>>> problem. >>>>>> >>>>>> I could think of several solutions to fix this, e.g. not >>>>>> reporting reads outside of chromosome ranges or updating >>>>>> their length in the CIAGR string. An additional option >>>>>> could be to remove invalid mappings during the import into >>>>>> Rsamtools. >>>>> >>>>> Hi Thomas -- >>>>> >>>>> for the latter and for the case where those reads are >>>>> intrinsically not interesting, it might be reasonable to >>>>> specify a range on readGappedAlignments that precludes the >>>>> possibility of extension beyond sequence ends. >>>>> >>>>>> si<- seqinfo(BamFile(fl)) gr<- GRanges(seqnames(si), >>>>>> IRanges(34, seqlengths(si)-34)) bam<- >>>>>> readGappedAlignments(fl, param=ScanBamParam(which=gr)) >>>>> >>>>> One would like a better solution, either doing this >>>>> automatically with a warning, or warning rather than stopping >>>>> on reads overlapping ends. >>>>> >>>>> Martin >>>>> >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Thomas >>>>>> >>>>>> ################################ ## Read mapping with >>>>>> Rsubread ## ################################ >>>>>> library(Rsubread) ## Reference source: >>>>>> ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ >>>>>> >>>>>> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ >>>>>> buildindex(basename="tair10chr.fasta", >>>>>> reference="tair10chr.fasta") align(index="tair10chr.fasta", >>>>>> readfile1="SRR390302.fastq", >>>>>> output_file="SRR390302_subread.sam", nthreads=8, indels=1, >>>>>> TH1=2) >>>>>> >>>>>> ##################################### ## Process SAM file >>>>>> with Rsamtools ## ##################################### >>>>>> library(Rsamtools) asBam(file="SRR390302_subread.sam", >>>>>> destination="SRR390302_subread") [1] >>>>>> "./data/SRR390302_subread.bam" reads<- >>>>>> readBamGappedAlignments("SRR390302_subread.bam", >>>>>> use.names=FALSE) Error in validObject(.Object) : invalid >>>>>> class "GappedAlignments" object: 'ranges' contains values >>>>>> outside of sequence bounds >>>>>> >>>>>> ## The error disappears if the following two reads are >>>>>> removed from the SAM file: ## SRR390302.434558 and >>>>>> SRR390302.2716043. Both reads have their end positions >>>>>> outside ## the range of Chr5 as illustrated here: >>>>>> >>>>>> ## Relevant chunks of SAM file generated by Rsubread @SQ >>>>>> SN:Chr1 LN:30427671 @SQ SN:Chr2 LN:19698289 @SQ >>>>>> SN:Chr3 LN:23459830 @SQ SN:Chr4 LN:18585056 @SQ >>>>>> SN:Chr5 LN:26975502 @SQ SN:ChrC LN:154478 @SQ >>>>>> SN:ChrM LN:366924 SRR390302.1 0 Chr1 27018257 >>>>>> 2.0 35M * 0 0 TGG...ACC >>>>>> III...GD)0 SRR390302.2 4 * 0 0 >>>>>> * * 0 0 TCC...AAA III...+I@ >>>>>> ... SRR390302.434558 0 Chr5 26975485 >>>>>> 5.0 35M * 0 0 ATG...TGG >>>>>> III...&4( SRR390302.2716043 0 Chr5 26975486 >>>>>> 3.0 35M * 0 0 CAT...GGT >>>>>> III...+2& >>>>>> >>>>>> >>>>>>> sessionInfo() >>>>>> R version 2.14.0 (2011-10-31) Platform: >>>>>> x86_64-unknown-linux-gnu (64-bit) >>>>>> >>>>>> locale: [1] C >>>>>> >>>>>> attached base packages: [1] stats graphics grDevices >>>>>> utils datasets methods base >>>>>> >>>>>> other attached packages: [1] Rsamtools_1.6.3 >>>>>> Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 >>>>>> Rsubread_1.4.2 >>>>>> >>>>>> loaded via a namespace (and not attached): [1] >>>>>> BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 >>>>>> bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 >>>>>> zlibbioc_1.0.0 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search >>>>>> the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>>> >>>>>> -- >>>>> Computational Biology Fred Hutchinson Cancer Research Center >>>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>>>> >>>>> Location: M1-B861 Telephone: 206 667-2793 >>>> >>>> _______________________________________________ Bioconductor >>>> mailing list Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the >>>> archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >>> >>>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely >>> for the addressee. You must not disclose, forward, print or use >>> it without the permission of the sender. >>> ______________________________________________________________________ > >>> > > ______________________________________________________________________ > > The information in this email is confidential and intend...{{dropped:16}}
ADD REPLY
0
Entering edit mode
That's great! Thanks Martin. Cheers, Wei On Feb 15, 2012, at 9:40 AM, Martin Morgan wrote: > On 02/14/2012 02:36 PM, Wei Shi wrote: >> Hi Thomas, >> >> After re-checking the changes we have made before for addressing the >> issue of reads mapped out of the chromosome ranges, I found we >> actually made the decision to allow reads to be mapped out of the >> chromosome ranges, due to the fact that the reference sequences might >> not be complete and reads could originate from outside of the >> reference sequences used for mapping. >> >> So you may take Martin's suggestion to add extra parameters when you > > Actually, Herve has patched GenomicRanges (1.7.25) in the development branch so that ranges outside sequence bounds are now accepted (with a warning). > > Martin > >> call 'readGappedAlignments' function to make it work for you. Or >> alternatively you may try the 'featureCounts' function in Rsubread >> package to count the number of reads mapped to each gene, if the >> purpose of your analysis is perform RNA-seq analysis. This function >> works on SAM file, so you do not need to convert it to a BAM file. >> But you will have to provide an Arabidopsis anntation file it as it >> has built-in annotation for mouse and human only at present. >> >> I think it might be better if the 'readGappedAlignments' function >> could allow reads being mapped outside of chromosomal ranges. We have >> been successfully running through a SNP discovery pipeline using >> subread/subjunc, samtools and vcftools on unix command lines, which >> does not complain about the reads mapped outside of the chromosomal >> ranges. I guess many other downstream analysis tools do not require >> mapping locations of reads to be within the chromosomal ranges as >> well. >> >> Hope this helps. >> >> Cheers, Wei >> >> >> >> On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote: >> >>> That's great, thanks a lot! >>> >>> BTW: are there any published/shareable performance test results >>> available comparing subread against other RNA-read to genome >>> aligners such as Tophat with respect to the accuracy of aligning >>> reads across exon/intron junctions? >>> >>> Thanks, >>> >>> Thomas >>> >>> On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote: >>>> Hi Thomas, >>>> >>>> I have just committed changes to both the release version and the >>>> devel version of Rsubread package to fix the bug of mapping reads >>>> out of chromosomal boundaries. They should be available in 24-36 >>>> hours. Please rebuild your index when you rerun your alignment. >>>> >>>> Let me know if the problem persists or you encounter further >>>> problems. >>>> >>>> Cheers, Wei >>>> >>>> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote: >>>> >>>>> Thanks Martin for your suggestion! >>>>> >>>>> Thomas >>>>> >>>>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: >>>>>> On 02/10/2012 07:44 PM, Thomas Girke wrote: >>>>>>> With some Illumina libraries I have been running into >>>>>>> problems importing the read mappings from Rsubread into >>>>>>> Rsamtools. After some testing I found out that some reads >>>>>>> reported in Rsubread's SAM output have their end positions >>>>>>> outside of the chromosome ranges. If those reads are >>>>>>> removed from the SAM file then the import into Rsamtools >>>>>>> works just fine. Below is a reproducible example of this >>>>>>> problem. >>>>>>> >>>>>>> I could think of several solutions to fix this, e.g. not >>>>>>> reporting reads outside of chromosome ranges or updating >>>>>>> their length in the CIAGR string. An additional option >>>>>>> could be to remove invalid mappings during the import into >>>>>>> Rsamtools. >>>>>> >>>>>> Hi Thomas -- >>>>>> >>>>>> for the latter and for the case where those reads are >>>>>> intrinsically not interesting, it might be reasonable to >>>>>> specify a range on readGappedAlignments that precludes the >>>>>> possibility of extension beyond sequence ends. >>>>>> >>>>>>> si<- seqinfo(BamFile(fl)) gr<- GRanges(seqnames(si), >>>>>>> IRanges(34, seqlengths(si)-34)) bam<- >>>>>>> readGappedAlignments(fl, param=ScanBamParam(which=gr)) >>>>>> >>>>>> One would like a better solution, either doing this >>>>>> automatically with a warning, or warning rather than stopping >>>>>> on reads overlapping ends. >>>>>> >>>>>> Martin >>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Thomas >>>>>>> >>>>>>> ################################ ## Read mapping with >>>>>>> Rsubread ## ################################ >>>>>>> library(Rsubread) ## Reference source: >>>>>>> ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ >>>>>>> >>>>>>> > ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ >>>>>>> buildindex(basename="tair10chr.fasta", >>>>>>> reference="tair10chr.fasta") align(index="tair10chr.fasta", >>>>>>> readfile1="SRR390302.fastq", >>>>>>> output_file="SRR390302_subread.sam", nthreads=8, indels=1, >>>>>>> TH1=2) >>>>>>> >>>>>>> ##################################### ## Process SAM file >>>>>>> with Rsamtools ## ##################################### >>>>>>> library(Rsamtools) asBam(file="SRR390302_subread.sam", >>>>>>> destination="SRR390302_subread") [1] >>>>>>> "./data/SRR390302_subread.bam" reads<- >>>>>>> readBamGappedAlignments("SRR390302_subread.bam", >>>>>>> use.names=FALSE) Error in validObject(.Object) : invalid >>>>>>> class "GappedAlignments" object: 'ranges' contains values >>>>>>> outside of sequence bounds >>>>>>> >>>>>>> ## The error disappears if the following two reads are >>>>>>> removed from the SAM file: ## SRR390302.434558 and >>>>>>> SRR390302.2716043. Both reads have their end positions >>>>>>> outside ## the range of Chr5 as illustrated here: >>>>>>> >>>>>>> ## Relevant chunks of SAM file generated by Rsubread @SQ >>>>>>> SN:Chr1 LN:30427671 @SQ SN:Chr2 LN:19698289 @SQ >>>>>>> SN:Chr3 LN:23459830 @SQ SN:Chr4 LN:18585056 @SQ >>>>>>> SN:Chr5 LN:26975502 @SQ SN:ChrC LN:154478 @SQ >>>>>>> SN:ChrM LN:366924 SRR390302.1 0 Chr1 27018257 >>>>>>> 2.0 35M * 0 0 TGG...ACC >>>>>>> III...GD)0 SRR390302.2 4 * 0 0 >>>>>>> * * 0 0 TCC...AAA III...+I@ >>>>>>> ... SRR390302.434558 0 Chr5 26975485 >>>>>>> 5.0 35M * 0 0 ATG...TGG >>>>>>> III...&4( SRR390302.2716043 0 Chr5 26975486 >>>>>>> 3.0 35M * 0 0 CAT...GGT >>>>>>> III...+2& >>>>>>> >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 2.14.0 (2011-10-31) Platform: >>>>>>> x86_64-unknown-linux-gnu (64-bit) >>>>>>> >>>>>>> locale: [1] C >>>>>>> >>>>>>> attached base packages: [1] stats graphics grDevices >>>>>>> utils datasets methods base >>>>>>> >>>>>>> other attached packages: [1] Rsamtools_1.6.3 >>>>>>> Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 >>>>>>> Rsubread_1.4.2 >>>>>>> >>>>>>> loaded via a namespace (and not attached): [1] >>>>>>> BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 >>>>>>> bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 >>>>>>> zlibbioc_1.0.0 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search >>>>>>> the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> >>>>>> >>>>>>> > -- >>>>>> Computational Biology Fred Hutchinson Cancer Research Center >>>>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 >>>>>> >>>>>> Location: M1-B861 Telephone: 206 667-2793 >>>>> >>>>> _______________________________________________ Bioconductor >>>>> mailing list Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the >>>>> archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>>> >>>>> > ______________________________________________________________________ >>>> The information in this email is confidential and intended solely >>>> for the addressee. You must not disclose, forward, print or use >>>> it without the permission of the sender. >>>> ______________________________________________________________________ >> >>>> >> >> ______________________________________________________________________ >> >> > The information in this email is confidential and inte...{{dropped:24}}
ADD REPLY
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 10 months ago
United States
Hi Wei, I am fine either way. What is important to me is the connectivity with Rsamtools/GenomicRanges and its clan members since this opens up an entire universe of downstream analysis opportunities. Masking the chromosome ends as well as Martin's suggestion both work fine, but it might be useful to document this exception somewhere in the vignettes since others might run into similar problems. Thanks for your rapid response! Thomas On Tue, Feb 14, 2012 at 10:36:53PM +0000, Wei Shi wrote: > Hi Thomas, > > After re-checking the changes we have made before for addressing the issue of reads mapped out of the chromosome ranges, I found we actually made the decision to allow reads to be mapped out of the chromosome ranges, due to the fact that the reference sequences might not be complete and reads could originate from outside of the reference sequences used for mapping. > > So you may take Martin's suggestion to add extra parameters when you call 'readGappedAlignments' function to make it work for you. Or alternatively you may try the 'featureCounts' function in Rsubread package to count the number of reads mapped to each gene, if the purpose of your analysis is perform RNA-seq analysis. This function works on SAM file, so you do not need to convert it to a BAM file. But you will have to provide an Arabidopsis anntation file it as it has built-in annotation for mouse and human only at present. > > I think it might be better if the 'readGappedAlignments' function could allow reads being mapped outside of chromosomal ranges. We have been successfully running through a SNP discovery pipeline using subread/subjunc, samtools and vcftools on unix command lines, which does not complain about the reads mapped outside of the chromosomal ranges. I guess many other downstream analysis tools do not require mapping locations of reads to be within the chromosomal ranges as well. > > Hope this helps. > > Cheers, > Wei > > > > On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote: > > > That's great, thanks a lot! > > > > BTW: are there any published/shareable performance test results > > available comparing subread against other RNA-read to genome aligners > > such as Tophat with respect to the accuracy of aligning reads across > > exon/intron junctions? > > > > Thanks, > > > > Thomas > > > > On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote: > >> Hi Thomas, > >> > >> I have just committed changes to both the release version and the devel version of Rsubread package to fix the bug of mapping reads out of chromosomal boundaries. They should be available in 24-36 hours. Please rebuild your index when you rerun your alignment. > >> > >> Let me know if the problem persists or you encounter further problems. > >> > >> Cheers, > >> Wei > >> > >> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote: > >> > >>> Thanks Martin for your suggestion! > >>> > >>> Thomas > >>> > >>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: > >>>> On 02/10/2012 07:44 PM, Thomas Girke wrote: > >>>>> With some Illumina libraries I have been running into problems importing > >>>>> the read mappings from Rsubread into Rsamtools. After some testing I > >>>>> found out that some reads reported in Rsubread's SAM output have their > >>>>> end positions outside of the chromosome ranges. If those reads are > >>>>> removed from the SAM file then the import into Rsamtools works just > >>>>> fine. Below is a reproducible example of this problem. > >>>>> > >>>>> I could think of several solutions to fix this, e.g. not reporting reads > >>>>> outside of chromosome ranges or updating their length in the CIAGR string. > >>>>> An additional option could be to remove invalid mappings during the import > >>>>> into Rsamtools. > >>>> > >>>> Hi Thomas -- > >>>> > >>>> for the latter and for the case where those reads are intrinsically not > >>>> interesting, it might be reasonable to specify a range on > >>>> readGappedAlignments that precludes the possibility of extension beyond > >>>> sequence ends. > >>>> > >>>>> si <- seqinfo(BamFile(fl)) > >>>>> gr <- GRanges(seqnames(si), IRanges(34, seqlengths(si)-34)) > >>>>> bam <- readGappedAlignments(fl, param=ScanBamParam(which=gr)) > >>>> > >>>> One would like a better solution, either doing this automatically with a > >>>> warning, or warning rather than stopping on reads overlapping ends. > >>>> > >>>> Martin > >>>> > >>>>> > >>>>> Thanks, > >>>>> > >>>>> Thomas > >>>>> > >>>>> ################################ > >>>>> ## Read mapping with Rsubread ## > >>>>> ################################ > >>>>> library(Rsubread) > >>>>> ## Reference source: ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ > >>>>> ## Fastq source: ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ > >>>>> buildindex(basename="tair10chr.fasta", reference="tair10chr.fasta") > >>>>> align(index="tair10chr.fasta", readfile1="SRR390302.fastq", output_file="SRR390302_subread.sam", nthreads=8, indels=1, TH1=2) > >>>>> > >>>>> ##################################### > >>>>> ## Process SAM file with Rsamtools ## > >>>>> ##################################### > >>>>> library(Rsamtools) > >>>>> asBam(file="SRR390302_subread.sam", destination="SRR390302_subread") > >>>>> [1] "./data/SRR390302_subread.bam" > >>>>> reads<- readBamGappedAlignments("SRR390302_subread.bam", use.names=FALSE) > >>>>> Error in validObject(.Object) : > >>>>> invalid class "GappedAlignments" object: 'ranges' contains values outside of sequence bounds > >>>>> > >>>>> ## The error disappears if the following two reads are removed from the SAM file: > >>>>> ## SRR390302.434558 and SRR390302.2716043. Both reads have their end positions outside > >>>>> ## the range of Chr5 as illustrated here: > >>>>> > >>>>> ## Relevant chunks of SAM file generated by Rsubread > >>>>> @SQ SN:Chr1 LN:30427671 > >>>>> @SQ SN:Chr2 LN:19698289 > >>>>> @SQ SN:Chr3 LN:23459830 > >>>>> @SQ SN:Chr4 LN:18585056 > >>>>> @SQ SN:Chr5 LN:26975502 > >>>>> @SQ SN:ChrC LN:154478 > >>>>> @SQ SN:ChrM LN:366924 > >>>>> SRR390302.1 0 Chr1 27018257 2.0 35M * 0 0 TGG...ACC III...GD)0 > >>>>> SRR390302.2 4 * 0 0 * * 0 0 TCC...AAA III...+I@ > >>>>> ... > >>>>> SRR390302.434558 0 Chr5 26975485 5.0 35M * 0 0 ATG...TGG III...&4( > >>>>> SRR390302.2716043 0 Chr5 26975486 3.0 35M * 0 0 CAT...GGT III...+2& > >>>>> > >>>>> > >>>>>> sessionInfo() > >>>>> R version 2.14.0 (2011-10-31) > >>>>> Platform: x86_64-unknown-linux-gnu (64-bit) > >>>>> > >>>>> locale: > >>>>> [1] C > >>>>> > >>>>> attached base packages: > >>>>> [1] stats graphics grDevices utils datasets methods base > >>>>> > >>>>> other attached packages: > >>>>> [1] Rsamtools_1.6.3 Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 Rsubread_1.4.2 > >>>>> > >>>>> loaded via a namespace (and not attached): > >>>>> [1] BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 zlibbioc_1.0.0 > >>>>> > >>>>> _______________________________________________ > >>>>> Bioconductor mailing list > >>>>> Bioconductor at r-project.org > >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>> > >>>> > >>>> -- > >>>> Computational Biology > >>>> Fred Hutchinson Cancer Research Center > >>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > >>>> > >>>> Location: M1-B861 > >>>> Telephone: 206 667-2793 > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> > >> ______________________________________________________________________ > >> The information in this email is confidential and intended solely for the addressee. > >> You must not disclose, forward, print or use it without the permission of the sender. > >> ______________________________________________________________________ > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:4}}
ADD COMMENT
0
Entering edit mode
Thomas Girke ★ 1.7k
@thomas-girke-993
Last seen 10 months ago
United States
Great! Ticket closed :). On Tue, Feb 14, 2012 at 10:40:40PM +0000, Martin Morgan wrote: > On 02/14/2012 02:36 PM, Wei Shi wrote: > > Hi Thomas, > > > > After re-checking the changes we have made before for addressing the > > issue of reads mapped out of the chromosome ranges, I found we > > actually made the decision to allow reads to be mapped out of the > > chromosome ranges, due to the fact that the reference sequences might > > not be complete and reads could originate from outside of the > > reference sequences used for mapping. > > > > So you may take Martin's suggestion to add extra parameters when you > > Actually, Herve has patched GenomicRanges (1.7.25) in the development > branch so that ranges outside sequence bounds are now accepted (with a > warning). > > Martin > > > call 'readGappedAlignments' function to make it work for you. Or > > alternatively you may try the 'featureCounts' function in Rsubread > > package to count the number of reads mapped to each gene, if the > > purpose of your analysis is perform RNA-seq analysis. This function > > works on SAM file, so you do not need to convert it to a BAM file. > > But you will have to provide an Arabidopsis anntation file it as it > > has built-in annotation for mouse and human only at present. > > > > I think it might be better if the 'readGappedAlignments' function > > could allow reads being mapped outside of chromosomal ranges. We have > > been successfully running through a SNP discovery pipeline using > > subread/subjunc, samtools and vcftools on unix command lines, which > > does not complain about the reads mapped outside of the chromosomal > > ranges. I guess many other downstream analysis tools do not require > > mapping locations of reads to be within the chromosomal ranges as > > well. > > > > Hope this helps. > > > > Cheers, Wei > > > > > > > > On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote: > > > >> That's great, thanks a lot! > >> > >> BTW: are there any published/shareable performance test results > >> available comparing subread against other RNA-read to genome > >> aligners such as Tophat with respect to the accuracy of aligning > >> reads across exon/intron junctions? > >> > >> Thanks, > >> > >> Thomas > >> > >> On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote: > >>> Hi Thomas, > >>> > >>> I have just committed changes to both the release version and the > >>> devel version of Rsubread package to fix the bug of mapping reads > >>> out of chromosomal boundaries. They should be available in 24-36 > >>> hours. Please rebuild your index when you rerun your alignment. > >>> > >>> Let me know if the problem persists or you encounter further > >>> problems. > >>> > >>> Cheers, Wei > >>> > >>> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote: > >>> > >>>> Thanks Martin for your suggestion! > >>>> > >>>> Thomas > >>>> > >>>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote: > >>>>> On 02/10/2012 07:44 PM, Thomas Girke wrote: > >>>>>> With some Illumina libraries I have been running into > >>>>>> problems importing the read mappings from Rsubread into > >>>>>> Rsamtools. After some testing I found out that some reads > >>>>>> reported in Rsubread's SAM output have their end positions > >>>>>> outside of the chromosome ranges. If those reads are > >>>>>> removed from the SAM file then the import into Rsamtools > >>>>>> works just fine. Below is a reproducible example of this > >>>>>> problem. > >>>>>> > >>>>>> I could think of several solutions to fix this, e.g. not > >>>>>> reporting reads outside of chromosome ranges or updating > >>>>>> their length in the CIAGR string. An additional option > >>>>>> could be to remove invalid mappings during the import into > >>>>>> Rsamtools. > >>>>> > >>>>> Hi Thomas -- > >>>>> > >>>>> for the latter and for the case where those reads are > >>>>> intrinsically not interesting, it might be reasonable to > >>>>> specify a range on readGappedAlignments that precludes the > >>>>> possibility of extension beyond sequence ends. > >>>>> > >>>>>> si<- seqinfo(BamFile(fl)) gr<- GRanges(seqnames(si), > >>>>>> IRanges(34, seqlengths(si)-34)) bam<- > >>>>>> readGappedAlignments(fl, param=ScanBamParam(which=gr)) > >>>>> > >>>>> One would like a better solution, either doing this > >>>>> automatically with a warning, or warning rather than stopping > >>>>> on reads overlapping ends. > >>>>> > >>>>> Martin > >>>>> > >>>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> Thomas > >>>>>> > >>>>>> ################################ ## Read mapping with > >>>>>> Rsubread ## ################################ > >>>>>> library(Rsubread) ## Reference source: > >>>>>> ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/ > >>>>>> > >>>>>> > ## Fastq source: > ftp://ftp-trace.ncbi.nih.gov/sra/sra- instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/ > >>>>>> buildindex(basename="tair10chr.fasta", > >>>>>> reference="tair10chr.fasta") align(index="tair10chr.fasta", > >>>>>> readfile1="SRR390302.fastq", > >>>>>> output_file="SRR390302_subread.sam", nthreads=8, indels=1, > >>>>>> TH1=2) > >>>>>> > >>>>>> ##################################### ## Process SAM file > >>>>>> with Rsamtools ## ##################################### > >>>>>> library(Rsamtools) asBam(file="SRR390302_subread.sam", > >>>>>> destination="SRR390302_subread") [1] > >>>>>> "./data/SRR390302_subread.bam" reads<- > >>>>>> readBamGappedAlignments("SRR390302_subread.bam", > >>>>>> use.names=FALSE) Error in validObject(.Object) : invalid > >>>>>> class "GappedAlignments" object: 'ranges' contains values > >>>>>> outside of sequence bounds > >>>>>> > >>>>>> ## The error disappears if the following two reads are > >>>>>> removed from the SAM file: ## SRR390302.434558 and > >>>>>> SRR390302.2716043. Both reads have their end positions > >>>>>> outside ## the range of Chr5 as illustrated here: > >>>>>> > >>>>>> ## Relevant chunks of SAM file generated by Rsubread @SQ > >>>>>> SN:Chr1 LN:30427671 @SQ SN:Chr2 LN:19698289 @SQ > >>>>>> SN:Chr3 LN:23459830 @SQ SN:Chr4 LN:18585056 @SQ > >>>>>> SN:Chr5 LN:26975502 @SQ SN:ChrC LN:154478 @SQ > >>>>>> SN:ChrM LN:366924 SRR390302.1 0 Chr1 27018257 > >>>>>> 2.0 35M * 0 0 TGG...ACC > >>>>>> III...GD)0 SRR390302.2 4 * 0 0 > >>>>>> * * 0 0 TCC...AAA III...+I@ > >>>>>> ... SRR390302.434558 0 Chr5 26975485 > >>>>>> 5.0 35M * 0 0 ATG...TGG > >>>>>> III...&4( SRR390302.2716043 0 Chr5 26975486 > >>>>>> 3.0 35M * 0 0 CAT...GGT > >>>>>> III...+2& > >>>>>> > >>>>>> > >>>>>>> sessionInfo() > >>>>>> R version 2.14.0 (2011-10-31) Platform: > >>>>>> x86_64-unknown-linux-gnu (64-bit) > >>>>>> > >>>>>> locale: [1] C > >>>>>> > >>>>>> attached base packages: [1] stats graphics grDevices > >>>>>> utils datasets methods base > >>>>>> > >>>>>> other attached packages: [1] Rsamtools_1.6.3 > >>>>>> Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5 > >>>>>> Rsubread_1.4.2 > >>>>>> > >>>>>> loaded via a namespace (and not attached): [1] > >>>>>> BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2 > >>>>>> bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0 > >>>>>> zlibbioc_1.0.0 > >>>>>> > >>>>>> _______________________________________________ > >>>>>> Bioconductor mailing list Bioconductor at r-project.org > >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search > >>>>>> the archives: > >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>>>> > >>>>> > >>>>> > >>>>>> > -- > >>>>> Computational Biology Fred Hutchinson Cancer Research Center > >>>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > >>>>> > >>>>> Location: M1-B861 Telephone: 206 667-2793 > >>>> > >>>> _______________________________________________ Bioconductor > >>>> mailing list Bioconductor at r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the > >>>> archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> > >>> > >>> > >>>> > ______________________________________________________________________ > >>> The information in this email is confidential and intended solely > >>> for the addressee. You must not disclose, forward, print or use > >>> it without the permission of the sender. > >>> ______________________________________________________________________ > > > >>> > > > > ______________________________________________________________________ > > > > > The information in this email is confidential and intended solely for > the addressee. > > You must not disclose, forward, print or use it without the > > permission of the sender. > > ______________________________________________________________________ > > > -- > > > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793
ADD COMMENT

Login before adding your answer.

Traffic: 537 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6