Question

Error mapping RNA-seq data to genome

0

Entering edit mode

Paul Geeleher ★ 1.3k

@paul-geeleher-2679

Last seen 11.5 years ago

I'm attempting to read some aligned RNA-seq paired end reads in Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) and converted to SAM then BAM format by samtools as I couldn't get the paired end reads into biocondcutor in MAQ format. I'm basically following this tutorial: http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf The following step causes me an error: > seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] Error in validObject(.Object) : invalid class "GRanges" object: slot 'ranges' contains values outside of sequence bounds I think it means that there is something aligned to outside the range of my reference genome but I don't understand why this would be as I used HG18 for everything. Any suggestions as to how I can possibly figure out what went wrong here would be greatly appreciated. I can provide more information if necessary. Summaries of my GenomicRanges and GenomicFeatures objects are below. > alnRanges GRanges with 11634719 ranges and 1 elementMetadata value seqnames ranges strand | pData.alignData.from...notNA... <rle> <iranges> <rle> | <integer> [1] chr1 [3345, 3381] + | 163 [2] chr1 [3391, 3427] - | 147 [3] chr1 [4223, 4259] + | 163 [4] chr1 [4247, 4283] - | 147 [5] chr1 [4275, 4311] + | 163 [6] chr1 [4304, 4340] + | 163 [7] chr1 [4320, 4356] + | 163 [8] chr1 [4343, 4379] - | 147 [9] chr1 [4343, 4379] - | 147 ... ... ... ... ... ... [11634711] chrM [16531, 16567] - | 147 [11634712] chrM [16532, 16568] + | 161 [11634713] chrM [16532, 16568] - | 147 [11634714] chrM [16532, 16568] - | 147 [11634715] chrM [16532, 16568] - | 147 [11634716] chrM [16534, 16570] - | 147 [11634717] chrM [16534, 16570] - | 147 [11634718] chrM [16534, 16570] - | 147 [11634719] chrM [16541, 16577] - | 147 seqlengths chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX chrY chrM NA NA NA NA NA NA ... NA NA NA NA NA NA > hgTxDb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene") > hgTxDb TranscriptDb object: | Db type: TranscriptDb | Data source: UCSC | Genome: hg18 | UCSC Table: ensGene | Type of Gene ID: Ensembl gene ID | Full dataset: yes | transcript_nrow: 63280 | exon_nrow: 276075 | cds_nrow: 225373 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2010-08-17 15:28:19 +0100 (Tue, 17 Aug 2010) | GenomicFeatures version at creation time: 1.0.6 | RSQLite version at creation time: 0.9-1 > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com

GenomicFeatures GenomicRanges GenomicFeatures GenomicRanges • 1.5k views

ADD COMMENT • link updated 15.5 years ago by Martin Morgan 25k • written 15.5 years ago by Paul Geeleher ★ 1.3k

score 0 · Answer 1 · 2010-08-18

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 5 weeks ago

United States

Hi Paul -- On 08/17/2010 08:09 AM, Paul Geeleher wrote: > I'm attempting to read some aligned RNA-seq paired end reads in > Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) > and converted to SAM then BAM format by samtools as I couldn't get the > paired end reads into biocondcutor in MAQ format. > > I'm basically following this tutorial: > http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf > > The following step causes me an error: > >> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] > Error in validObject(.Object) : > invalid class "GRanges" object: slot 'ranges' contains values > outside of sequence bounds > > I think it means that there is something aligned to outside the range > of my reference genome but I don't understand why this would be as I > used HG18 for everything. probably one or more of your reads are 'hanging over' one of the chromosome bounds. You might tapply(alnRanges, seqnames(alnRanges), f unction(x) c(min(start(x)), max(end(x)))) to get a sense of whether you're on the right track, and maybe something like refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)] alnRanges[refLen < end(alnRanges) to get the actual ranges? You could trim or discard these Martin > Any suggestions as to how I can possibly figure out what went wrong > here would be greatly appreciated. I can provide more information if > necessary. Summaries of my GenomicRanges and GenomicFeatures objects > are below. > > >> alnRanges > GRanges with 11634719 ranges and 1 elementMetadata value > seqnames ranges strand | pData.alignData.from...notNA... > <rle> <iranges> <rle> | <integer> > [1] chr1 [3345, 3381] + | 163 > [2] chr1 [3391, 3427] - | 147 > [3] chr1 [4223, 4259] + | 163 > [4] chr1 [4247, 4283] - | 147 > [5] chr1 [4275, 4311] + | 163 > [6] chr1 [4304, 4340] + | 163 > [7] chr1 [4320, 4356] + | 163 > [8] chr1 [4343, 4379] - | 147 > [9] chr1 [4343, 4379] - | 147 > ... ... ... ... ... ... > [11634711] chrM [16531, 16567] - | 147 > [11634712] chrM [16532, 16568] + | 161 > [11634713] chrM [16532, 16568] - | 147 > [11634714] chrM [16532, 16568] - | 147 > [11634715] chrM [16532, 16568] - | 147 > [11634716] chrM [16534, 16570] - | 147 > [11634717] chrM [16534, 16570] - | 147 > [11634718] chrM [16534, 16570] - | 147 > [11634719] chrM [16541, 16577] - | 147 > > seqlengths > chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX chrY chrM > NA NA NA NA NA NA ... NA NA NA NA NA NA > >> hgTxDb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene") >> hgTxDb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg18 > | UCSC Table: ensGene > | Type of Gene ID: Ensembl gene ID > | Full dataset: yes > | transcript_nrow: 63280 > | exon_nrow: 276075 > | cds_nrow: 225373 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-08-17 15:28:19 +0100 (Tue, 17 Aug 2010) > | GenomicFeatures version at creation time: 1.0.6 > | RSQLite version at creation time: 0.9-1 >> > > > > > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793

ADD COMMENT • link 15.5 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin, Your advice seems to have been pretty successful, there was a read aligned beyond the end of the chrM. No idea why this might have happened though, which is a bit worrying. For the record removing the offending read worked ok. I used the following code: # find offending read: > refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges))] > print(alnRanges[refLen < end(alnRanges)]) GRanges with 1 range and 1 elementMetadata value seqnames ranges strand | pData.alignData.from...notNA... <rle> <iranges> <rle> | <integer> [1] chrM [16541, 16577] - | 147 # remove offending read: alnRanges <- alnRanges[-which(refLen < end(alnRanges))] Cheers, Paul On Wed, Aug 18, 2010 at 2:30 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > Hi Paul -- > > On 08/17/2010 08:09 AM, Paul Geeleher wrote: >> I'm attempting to read some aligned RNA-seq paired end reads in >> Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) >> and converted to SAM then BAM format by samtools as I couldn't get the >> paired end reads into biocondcutor in MAQ format. >> >> I'm basically following this tutorial: >> http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf >> >> The following step causes me an error: >> >>> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] >> Error in validObject(.Object) : >> ? invalid class "GRanges" object: slot 'ranges' contains values >> outside of sequence bounds >> >> I think it means that there is something aligned to outside the range >> of my reference genome but I don't understand why this would be as I >> used HG18 for everything. > > probably one or more of your reads are 'hanging over' one of the > chromosome bounds. You might > > ?tapply(alnRanges, seqnames(alnRanges), f > ? ? ? ? unction(x) c(min(start(x)), max(end(x)))) > > to get a sense of whether you're on the right track, and maybe something > like > > ?refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)] > ?alnRanges[refLen < end(alnRanges) > > to get the actual ranges? You could trim or discard these > > Martin > >> Any suggestions as to how I can possibly figure out what went wrong >> here would be greatly appreciated. I can provide more information if >> necessary. Summaries of my GenomicRanges and GenomicFeatures objects >> are below. >> >> >>> alnRanges >> GRanges with 11634719 ranges and 1 elementMetadata value >> ? ? ? ? ? ?seqnames ? ? ? ? ranges strand ? | pData.alignData.from...notNA... >> ? ? ? ? ? ? ? <rle> ? ? ?<iranges> ?<rle> ? | ? ? ? ? ? ? ? ? ? ? ? <integer> >> ? ? ? ?[1] ? ? chr1 ? [3345, 3381] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[2] ? ? chr1 ? [3391, 3427] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> ? ? ? ?[3] ? ? chr1 ? [4223, 4259] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[4] ? ? chr1 ? [4247, 4283] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> ? ? ? ?[5] ? ? chr1 ? [4275, 4311] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[6] ? ? chr1 ? [4304, 4340] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[7] ? ? chr1 ? [4320, 4356] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[8] ? ? chr1 ? [4343, 4379] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> ? ? ? ?[9] ? ? chr1 ? [4343, 4379] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> ? ? ? ?... ? ? ?... ? ? ? ? ? ?... ? ?... ... ? ? ? ? ? ? ? ? ? ? ? ? ? ? ... >> [11634711] ? ? chrM [16531, 16567] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634712] ? ? chrM [16532, 16568] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 161 >> [11634713] ? ? chrM [16532, 16568] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634714] ? ? chrM [16532, 16568] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634715] ? ? chrM [16532, 16568] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634716] ? ? chrM [16534, 16570] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634717] ? ? chrM [16534, 16570] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634718] ? ? chrM [16534, 16570] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634719] ? ? chrM [16541, 16577] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> >> seqlengths >> ? chr1 ?chr2 ?chr3 ?chr4 ?chr5 ?chr6 ... chr20 chr21 chr22 ?chrX ?chrY ?chrM >> ? ? NA ? ?NA ? ?NA ? ?NA ? ?NA ? ?NA ... ? ?NA ? ?NA ? ?NA ? ?NA ? ?NA ? ?NA >> >>> hgTxDb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene") >>> hgTxDb >> TranscriptDb object: >> | Db type: TranscriptDb >> | Data source: UCSC >> | Genome: hg18 >> | UCSC Table: ensGene >> | Type of Gene ID: Ensembl gene ID >> | Full dataset: yes >> | transcript_nrow: 63280 >> | exon_nrow: 276075 >> | cds_nrow: 225373 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2010-08-17 15:28:19 +0100 (Tue, 17 Aug 2010) >> | GenomicFeatures version at creation time: 1.0.6 >> | RSQLite version at creation time: 0.9-1 >>> >> >> >> >> >> > > > -- > Martin Morgan > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com

ADD REPLY • link 15.5 years ago Paul Geeleher ★ 1.3k

0

Entering edit mode

Hello, as the mtDNA is circular, such an overhanging alignment covering end and start of the mtDNA is not worrying at all. As such an alignment can also happen with bacterial genomes, one could start thinking whether such alignments should be treated in a special way. However, I am not sure whether this should be addressed by BioC or rather upstream in the alignment program output formats such as SAM/BAM. Regards, Joern On Mon, 23 Aug 2010 12:51:03 +0100, Paul Geeleher wrote > Hi Martin, > > Your advice seems to have been pretty successful, there was a read > aligned beyond the end of the chrM. No idea why this might have > happened though, which is a bit worrying. > > For the record removing the offending read worked ok. I used the > following code: > > # find offending read: > > refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges))] > > print(alnRanges[refLen < end(alnRanges)]) > > GRanges with 1 range and 1 elementMetadata value > seqnames ranges strand | pData.alignData.from...notNA... > <rle> <iranges> <rle> | <integer> > [1] chrM [16541, 16577] - | 147 > > # remove offending read: > alnRanges <- alnRanges[-which(refLen < end(alnRanges))] > > Cheers, > > Paul > > On Wed, Aug 18, 2010 at 2:30 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > > Hi Paul -- > > > > On 08/17/2010 08:09 AM, Paul Geeleher wrote: > >> I'm attempting to read some aligned RNA-seq paired end reads in > >> Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) > >> and converted to SAM then BAM format by samtools as I couldn't get the > >> paired end reads into biocondcutor in MAQ format. > >> > >> I'm basically following this tutorial: > >> http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf > >> > >> The following step causes me an error: > >> > >>> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] > >> Error in validObject(.Object) : > >> ? invalid class "GRanges" object: slot 'ranges' contains values > >> outside of sequence bounds > >> > >> I think it means that there is something aligned to outside the range > >> of my reference genome but I don't understand why this would be as I > >> used HG18 for everything. > > > > probably one or more of your reads are 'hanging over' one of the > > chromosome bounds. You might > > > > ?tapply(alnRanges, seqnames(alnRanges), f > > ? ? ? ? unction(x) c(min(start(x)), max(end(x)))) > > > > to get a sense of whether you're on the right track, and maybe something > > like > > > > ?refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)] > > ?alnRanges[refLen < end(alnRanges) > > > > to get the actual ranges? You could trim or discard these > > > > Martin > >

ADD REPLY • link 15.5 years ago Joern Toedling ▴ 450

0

Entering edit mode

Ah of course, that never occured to me, thanks! On Mon, Aug 23, 2010 at 2:03 PM, Joern Toedling <joern.toedling at="" curie.fr=""> wrote: > Hello, > > as the mtDNA is circular, such an overhanging alignment covering end and start > of the mtDNA is not worrying at all. > > As such an alignment can also happen with bacterial genomes, one could start > thinking whether such alignments should be treated in a special way. However, > I am not sure whether this should be addressed by BioC or rather upstream in > the alignment program output formats such as SAM/BAM. > > Regards, > Joern > > On Mon, 23 Aug 2010 12:51:03 +0100, Paul Geeleher wrote >> Hi Martin, >> >> Your advice seems to have been pretty successful, there was a read >> aligned beyond the end of the chrM. No idea why this might have >> happened though, which is a bit worrying. >> >> For the record removing the offending read worked ok. I used the >> following code: >> >> # find offending read: >> > refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges))] >> > print(alnRanges[refLen < end(alnRanges)]) >> >> GRanges with 1 range and 1 elementMetadata value >> ? ? seqnames ? ? ? ? ranges strand | pData.alignData.from...notNA... >> ? ? ? ?<rle> ? ? ?<iranges> ?<rle> | ? ? ? ? ? ? ? ? ? ? ? <integer> >> [1] ? ? chrM [16541, 16577] ? ? ?- | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> >> # remove offending read: >> alnRanges <- alnRanges[-which(refLen < end(alnRanges))] >> >> Cheers, >> >> Paul >> >> On Wed, Aug 18, 2010 at 2:30 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: >> > Hi Paul -- >> > >> > On 08/17/2010 08:09 AM, Paul Geeleher wrote: >> >> I'm attempting to read some aligned RNA-seq paired end reads in >> >> Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) >> >> and converted to SAM then BAM format by samtools as I couldn't get the >> >> paired end reads into biocondcutor in MAQ format. >> >> >> >> I'm basically following this tutorial: >> >> > http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf >> >> >> >> The following step causes me an error: >> >> >> >>> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] >> >> Error in validObject(.Object) : >> >> ? invalid class "GRanges" object: slot 'ranges' contains values >> >> outside of sequence bounds >> >> >> >> I think it means that there is something aligned to outside the range >> >> of my reference genome but I don't understand why this would be as I >> >> used HG18 for everything. >> > >> > probably one or more of your reads are 'hanging over' one of the >> > chromosome bounds. You might >> > >> > ?tapply(alnRanges, seqnames(alnRanges), f >> > ? ? ? ? unction(x) c(min(start(x)), max(end(x)))) >> > >> > to get a sense of whether you're on the right track, and maybe something >> > like >> > >> > ?refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)] >> > ?alnRanges[refLen < end(alnRanges) >> > >> > to get the actual ranges? You could trim or discard these >> > >> > Martin >> > > > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com

ADD REPLY • link 15.5 years ago Paul Geeleher ★ 1.3k