Great! Ticket closed :).
On Tue, Feb 14, 2012 at 10:40:40PM +0000, Martin Morgan wrote:
> On 02/14/2012 02:36 PM, Wei Shi wrote:
> > Hi Thomas,
> >
> > After re-checking the changes we have made before for addressing
the
> > issue of reads mapped out of the chromosome ranges, I found we
> > actually made the decision to allow reads to be mapped out of the
> > chromosome ranges, due to the fact that the reference sequences
might
> > not be complete and reads could originate from outside of the
> > reference sequences used for mapping.
> >
> > So you may take Martin's suggestion to add extra parameters when
you
>
> Actually, Herve has patched GenomicRanges (1.7.25) in the
development
> branch so that ranges outside sequence bounds are now accepted (with
a
> warning).
>
> Martin
>
> > call 'readGappedAlignments' function to make it work for you. Or
> > alternatively you may try the 'featureCounts' function in Rsubread
> > package to count the number of reads mapped to each gene, if the
> > purpose of your analysis is perform RNA-seq analysis. This
function
> > works on SAM file, so you do not need to convert it to a BAM file.
> > But you will have to provide an Arabidopsis anntation file it as
it
> > has built-in annotation for mouse and human only at present.
> >
> > I think it might be better if the 'readGappedAlignments' function
> > could allow reads being mapped outside of chromosomal ranges. We
have
> > been successfully running through a SNP discovery pipeline using
> > subread/subjunc, samtools and vcftools on unix command lines,
which
> > does not complain about the reads mapped outside of the
chromosomal
> > ranges. I guess many other downstream analysis tools do not
require
> > mapping locations of reads to be within the chromosomal ranges as
> > well.
> >
> > Hope this helps.
> >
> > Cheers, Wei
> >
> >
> >
> > On Feb 14, 2012, at 11:58 AM, Thomas Girke wrote:
> >
> >> That's great, thanks a lot!
> >>
> >> BTW: are there any published/shareable performance test results
> >> available comparing subread against other RNA-read to genome
> >> aligners such as Tophat with respect to the accuracy of aligning
> >> reads across exon/intron junctions?
> >>
> >> Thanks,
> >>
> >> Thomas
> >>
> >> On Mon, Feb 13, 2012 at 11:41:42PM +0000, Wei Shi wrote:
> >>> Hi Thomas,
> >>>
> >>> I have just committed changes to both the release version and
the
> >>> devel version of Rsubread package to fix the bug of mapping
reads
> >>> out of chromosomal boundaries. They should be available in 24-36
> >>> hours. Please rebuild your index when you rerun your alignment.
> >>>
> >>> Let me know if the problem persists or you encounter further
> >>> problems.
> >>>
> >>> Cheers, Wei
> >>>
> >>> On Feb 13, 2012, at 11:12 AM, Thomas Girke wrote:
> >>>
> >>>> Thanks Martin for your suggestion!
> >>>>
> >>>> Thomas
> >>>>
> >>>> On Sun, Feb 12, 2012 at 05:28:21PM +0000, Martin Morgan wrote:
> >>>>> On 02/10/2012 07:44 PM, Thomas Girke wrote:
> >>>>>> With some Illumina libraries I have been running into
> >>>>>> problems importing the read mappings from Rsubread into
> >>>>>> Rsamtools. After some testing I found out that some reads
> >>>>>> reported in Rsubread's SAM output have their end positions
> >>>>>> outside of the chromosome ranges. If those reads are
> >>>>>> removed from the SAM file then the import into Rsamtools
> >>>>>> works just fine. Below is a reproducible example of this
> >>>>>> problem.
> >>>>>>
> >>>>>> I could think of several solutions to fix this, e.g. not
> >>>>>> reporting reads outside of chromosome ranges or updating
> >>>>>> their length in the CIAGR string. An additional option
> >>>>>> could be to remove invalid mappings during the import into
> >>>>>> Rsamtools.
> >>>>>
> >>>>> Hi Thomas --
> >>>>>
> >>>>> for the latter and for the case where those reads are
> >>>>> intrinsically not interesting, it might be reasonable to
> >>>>> specify a range on readGappedAlignments that precludes the
> >>>>> possibility of extension beyond sequence ends.
> >>>>>
> >>>>>> si<- seqinfo(BamFile(fl)) gr<- GRanges(seqnames(si),
> >>>>>> IRanges(34, seqlengths(si)-34)) bam<-
> >>>>>> readGappedAlignments(fl, param=ScanBamParam(which=gr))
> >>>>>
> >>>>> One would like a better solution, either doing this
> >>>>> automatically with a warning, or warning rather than stopping
> >>>>> on reads overlapping ends.
> >>>>>
> >>>>> Martin
> >>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Thomas
> >>>>>>
> >>>>>> ################################ ## Read mapping with
> >>>>>> Rsubread ## ################################
> >>>>>> library(Rsubread) ## Reference source:
> >>>>>>
ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/
> >>>>>>
> >>>>>>
> ## Fastq source:
>
ftp://ftp-trace.ncbi.nih.gov/sra/sra-
instant/reads/ByStudy/sra/SRP%2FSRP009%2FSRP009850/SRR390302/
> >>>>>> buildindex(basename="tair10chr.fasta",
> >>>>>> reference="tair10chr.fasta") align(index="tair10chr.fasta",
> >>>>>> readfile1="SRR390302.fastq",
> >>>>>> output_file="SRR390302_subread.sam", nthreads=8, indels=1,
> >>>>>> TH1=2)
> >>>>>>
> >>>>>> ##################################### ## Process SAM file
> >>>>>> with Rsamtools ## #####################################
> >>>>>> library(Rsamtools) asBam(file="SRR390302_subread.sam",
> >>>>>> destination="SRR390302_subread") [1]
> >>>>>> "./data/SRR390302_subread.bam" reads<-
> >>>>>> readBamGappedAlignments("SRR390302_subread.bam",
> >>>>>> use.names=FALSE) Error in validObject(.Object) : invalid
> >>>>>> class "GappedAlignments" object: 'ranges' contains values
> >>>>>> outside of sequence bounds
> >>>>>>
> >>>>>> ## The error disappears if the following two reads are
> >>>>>> removed from the SAM file: ## SRR390302.434558 and
> >>>>>> SRR390302.2716043. Both reads have their end positions
> >>>>>> outside ## the range of Chr5 as illustrated here:
> >>>>>>
> >>>>>> ## Relevant chunks of SAM file generated by Rsubread @SQ
> >>>>>> SN:Chr1 LN:30427671 @SQ SN:Chr2 LN:19698289 @SQ
> >>>>>> SN:Chr3 LN:23459830 @SQ SN:Chr4 LN:18585056 @SQ
> >>>>>> SN:Chr5 LN:26975502 @SQ SN:ChrC LN:154478 @SQ
> >>>>>> SN:ChrM LN:366924 SRR390302.1 0 Chr1 27018257
> >>>>>> 2.0 35M * 0 0 TGG...ACC
> >>>>>> III...GD)0 SRR390302.2 4 * 0 0
> >>>>>> * * 0 0 TCC...AAA III...+I@
> >>>>>> ... SRR390302.434558 0 Chr5 26975485
> >>>>>> 5.0 35M * 0 0 ATG...TGG
> >>>>>> III...&4( SRR390302.2716043 0 Chr5 26975486
> >>>>>> 3.0 35M * 0 0 CAT...GGT
> >>>>>> III...+2&
> >>>>>>
> >>>>>>
> >>>>>>> sessionInfo()
> >>>>>> R version 2.14.0 (2011-10-31) Platform:
> >>>>>> x86_64-unknown-linux-gnu (64-bit)
> >>>>>>
> >>>>>> locale: [1] C
> >>>>>>
> >>>>>> attached base packages: [1] stats graphics grDevices
> >>>>>> utils datasets methods base
> >>>>>>
> >>>>>> other attached packages: [1] Rsamtools_1.6.3
> >>>>>> Biostrings_2.22.0 GenomicRanges_1.6.4 IRanges_1.12.5
> >>>>>> Rsubread_1.4.2
> >>>>>>
> >>>>>> loaded via a namespace (and not attached): [1]
> >>>>>> BSgenome_1.22.0 RCurl_1.8-0 XML_3.6-2
> >>>>>> bitops_1.0-4.1 rtracklayer_1.14.4 tools_2.14.0
> >>>>>> zlibbioc_1.0.0
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Bioconductor mailing list Bioconductor at r-project.org
> >>>>>>
https://stat.ethz.ch/mailman/listinfo/bioconductor Search
> >>>>>> the archives:
> >>>>>>
http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> --
> >>>>> Computational Biology Fred Hutchinson Cancer Research Center
> >>>>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
> >>>>>
> >>>>> Location: M1-B861 Telephone: 206 667-2793
> >>>>
> >>>> _______________________________________________ Bioconductor
> >>>> mailing list Bioconductor at r-project.org
> >>>>
https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> >>>> archives:
> >>>>
http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>
> >>>
> >>>
> >>>>
>
______________________________________________________________________
> >>> The information in this email is confidential and intended
solely
> >>> for the addressee. You must not disclose, forward, print or use
> >>> it without the permission of the sender.
> >>>
______________________________________________________________________
> >
> >>>
> >
> >
______________________________________________________________________
> >
> >
> The information in this email is confidential and intended solely
for
> the addressee.
> > You must not disclose, forward, print or use it without the
> > permission of the sender.
> >
______________________________________________________________________
>
>
> --
> >
> Computational Biology
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
> Location: M1-B861
> Telephone: 206 667-2793