Error mapping RNA-seq data to genome
1
0
Entering edit mode
Paul Geeleher ★ 1.3k
@paul-geeleher-2679
Last seen 9.6 years ago
I'm attempting to read some aligned RNA-seq paired end reads in Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) and converted to SAM then BAM format by samtools as I couldn't get the paired end reads into biocondcutor in MAQ format. I'm basically following this tutorial: http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf The following step causes me an error: > seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] Error in validObject(.Object) : invalid class "GRanges" object: slot 'ranges' contains values outside of sequence bounds I think it means that there is something aligned to outside the range of my reference genome but I don't understand why this would be as I used HG18 for everything. Any suggestions as to how I can possibly figure out what went wrong here would be greatly appreciated. I can provide more information if necessary. Summaries of my GenomicRanges and GenomicFeatures objects are below. > alnRanges GRanges with 11634719 ranges and 1 elementMetadata value seqnames ranges strand | pData.alignData.from...notNA... <rle> <iranges> <rle> | <integer> [1] chr1 [3345, 3381] + | 163 [2] chr1 [3391, 3427] - | 147 [3] chr1 [4223, 4259] + | 163 [4] chr1 [4247, 4283] - | 147 [5] chr1 [4275, 4311] + | 163 [6] chr1 [4304, 4340] + | 163 [7] chr1 [4320, 4356] + | 163 [8] chr1 [4343, 4379] - | 147 [9] chr1 [4343, 4379] - | 147 ... ... ... ... ... ... [11634711] chrM [16531, 16567] - | 147 [11634712] chrM [16532, 16568] + | 161 [11634713] chrM [16532, 16568] - | 147 [11634714] chrM [16532, 16568] - | 147 [11634715] chrM [16532, 16568] - | 147 [11634716] chrM [16534, 16570] - | 147 [11634717] chrM [16534, 16570] - | 147 [11634718] chrM [16534, 16570] - | 147 [11634719] chrM [16541, 16577] - | 147 seqlengths chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX chrY chrM NA NA NA NA NA NA ... NA NA NA NA NA NA > hgTxDb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene") > hgTxDb TranscriptDb object: | Db type: TranscriptDb | Data source: UCSC | Genome: hg18 | UCSC Table: ensGene | Type of Gene ID: Ensembl gene ID | Full dataset: yes | transcript_nrow: 63280 | exon_nrow: 276075 | cds_nrow: 225373 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2010-08-17 15:28:19 +0100 (Tue, 17 Aug 2010) | GenomicFeatures version at creation time: 1.0.6 | RSQLite version at creation time: 0.9-1 > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com
GenomicFeatures GenomicRanges GenomicFeatures GenomicRanges • 1.2k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 hours ago
United States
Hi Paul -- On 08/17/2010 08:09 AM, Paul Geeleher wrote: > I'm attempting to read some aligned RNA-seq paired end reads in > Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) > and converted to SAM then BAM format by samtools as I couldn't get the > paired end reads into biocondcutor in MAQ format. > > I'm basically following this tutorial: > http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf > > The following step causes me an error: > >> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] > Error in validObject(.Object) : > invalid class "GRanges" object: slot 'ranges' contains values > outside of sequence bounds > > I think it means that there is something aligned to outside the range > of my reference genome but I don't understand why this would be as I > used HG18 for everything. probably one or more of your reads are 'hanging over' one of the chromosome bounds. You might tapply(alnRanges, seqnames(alnRanges), f unction(x) c(min(start(x)), max(end(x)))) to get a sense of whether you're on the right track, and maybe something like refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)] alnRanges[refLen < end(alnRanges) to get the actual ranges? You could trim or discard these Martin > Any suggestions as to how I can possibly figure out what went wrong > here would be greatly appreciated. I can provide more information if > necessary. Summaries of my GenomicRanges and GenomicFeatures objects > are below. > > >> alnRanges > GRanges with 11634719 ranges and 1 elementMetadata value > seqnames ranges strand | pData.alignData.from...notNA... > <rle> <iranges> <rle> | <integer> > [1] chr1 [3345, 3381] + | 163 > [2] chr1 [3391, 3427] - | 147 > [3] chr1 [4223, 4259] + | 163 > [4] chr1 [4247, 4283] - | 147 > [5] chr1 [4275, 4311] + | 163 > [6] chr1 [4304, 4340] + | 163 > [7] chr1 [4320, 4356] + | 163 > [8] chr1 [4343, 4379] - | 147 > [9] chr1 [4343, 4379] - | 147 > ... ... ... ... ... ... > [11634711] chrM [16531, 16567] - | 147 > [11634712] chrM [16532, 16568] + | 161 > [11634713] chrM [16532, 16568] - | 147 > [11634714] chrM [16532, 16568] - | 147 > [11634715] chrM [16532, 16568] - | 147 > [11634716] chrM [16534, 16570] - | 147 > [11634717] chrM [16534, 16570] - | 147 > [11634718] chrM [16534, 16570] - | 147 > [11634719] chrM [16541, 16577] - | 147 > > seqlengths > chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX chrY chrM > NA NA NA NA NA NA ... NA NA NA NA NA NA > >> hgTxDb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene") >> hgTxDb > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg18 > | UCSC Table: ensGene > | Type of Gene ID: Ensembl gene ID > | Full dataset: yes > | transcript_nrow: 63280 > | exon_nrow: 276075 > | cds_nrow: 225373 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-08-17 15:28:19 +0100 (Tue, 17 Aug 2010) > | GenomicFeatures version at creation time: 1.0.6 > | RSQLite version at creation time: 0.9-1 >> > > > > > -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
Hi Martin, Your advice seems to have been pretty successful, there was a read aligned beyond the end of the chrM. No idea why this might have happened though, which is a bit worrying. For the record removing the offending read worked ok. I used the following code: # find offending read: > refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges))] > print(alnRanges[refLen < end(alnRanges)]) GRanges with 1 range and 1 elementMetadata value seqnames ranges strand | pData.alignData.from...notNA... <rle> <iranges> <rle> | <integer> [1] chrM [16541, 16577] - | 147 # remove offending read: alnRanges <- alnRanges[-which(refLen < end(alnRanges))] Cheers, Paul On Wed, Aug 18, 2010 at 2:30 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > Hi Paul -- > > On 08/17/2010 08:09 AM, Paul Geeleher wrote: >> I'm attempting to read some aligned RNA-seq paired end reads in >> Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) >> and converted to SAM then BAM format by samtools as I couldn't get the >> paired end reads into biocondcutor in MAQ format. >> >> I'm basically following this tutorial: >> http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf >> >> The following step causes me an error: >> >>> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] >> Error in validObject(.Object) : >> ? invalid class "GRanges" object: slot 'ranges' contains values >> outside of sequence bounds >> >> I think it means that there is something aligned to outside the range >> of my reference genome but I don't understand why this would be as I >> used HG18 for everything. > > probably one or more of your reads are 'hanging over' one of the > chromosome bounds. You might > > ?tapply(alnRanges, seqnames(alnRanges), f > ? ? ? ? unction(x) c(min(start(x)), max(end(x)))) > > to get a sense of whether you're on the right track, and maybe something > like > > ?refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)] > ?alnRanges[refLen < end(alnRanges) > > to get the actual ranges? You could trim or discard these > > Martin > >> Any suggestions as to how I can possibly figure out what went wrong >> here would be greatly appreciated. I can provide more information if >> necessary. Summaries of my GenomicRanges and GenomicFeatures objects >> are below. >> >> >>> alnRanges >> GRanges with 11634719 ranges and 1 elementMetadata value >> ? ? ? ? ? ?seqnames ? ? ? ? ranges strand ? | pData.alignData.from...notNA... >> ? ? ? ? ? ? ? <rle> ? ? ?<iranges> ?<rle> ? | ? ? ? ? ? ? ? ? ? ? ? <integer> >> ? ? ? ?[1] ? ? chr1 ? [3345, 3381] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[2] ? ? chr1 ? [3391, 3427] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> ? ? ? ?[3] ? ? chr1 ? [4223, 4259] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[4] ? ? chr1 ? [4247, 4283] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> ? ? ? ?[5] ? ? chr1 ? [4275, 4311] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[6] ? ? chr1 ? [4304, 4340] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[7] ? ? chr1 ? [4320, 4356] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 163 >> ? ? ? ?[8] ? ? chr1 ? [4343, 4379] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> ? ? ? ?[9] ? ? chr1 ? [4343, 4379] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> ? ? ? ?... ? ? ?... ? ? ? ? ? ?... ? ?... ... ? ? ? ? ? ? ? ? ? ? ? ? ? ? ... >> [11634711] ? ? chrM [16531, 16567] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634712] ? ? chrM [16532, 16568] ? ? ?+ ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 161 >> [11634713] ? ? chrM [16532, 16568] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634714] ? ? chrM [16532, 16568] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634715] ? ? chrM [16532, 16568] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634716] ? ? chrM [16534, 16570] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634717] ? ? chrM [16534, 16570] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634718] ? ? chrM [16534, 16570] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> [11634719] ? ? chrM [16541, 16577] ? ? ?- ? | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> >> seqlengths >> ? chr1 ?chr2 ?chr3 ?chr4 ?chr5 ?chr6 ... chr20 chr21 chr22 ?chrX ?chrY ?chrM >> ? ? NA ? ?NA ? ?NA ? ?NA ? ?NA ? ?NA ... ? ?NA ? ?NA ? ?NA ? ?NA ? ?NA ? ?NA >> >>> hgTxDb <- makeTranscriptDbFromUCSC(genome="hg18", tablename="ensGene") >>> hgTxDb >> TranscriptDb object: >> | Db type: TranscriptDb >> | Data source: UCSC >> | Genome: hg18 >> | UCSC Table: ensGene >> | Type of Gene ID: Ensembl gene ID >> | Full dataset: yes >> | transcript_nrow: 63280 >> | exon_nrow: 276075 >> | cds_nrow: 225373 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2010-08-17 15:28:19 +0100 (Tue, 17 Aug 2010) >> | GenomicFeatures version at creation time: 1.0.6 >> | RSQLite version at creation time: 0.9-1 >>> >> >> >> >> >> > > > -- > Martin Morgan > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com
ADD REPLY
0
Entering edit mode
Hello, as the mtDNA is circular, such an overhanging alignment covering end and start of the mtDNA is not worrying at all. As such an alignment can also happen with bacterial genomes, one could start thinking whether such alignments should be treated in a special way. However, I am not sure whether this should be addressed by BioC or rather upstream in the alignment program output formats such as SAM/BAM. Regards, Joern On Mon, 23 Aug 2010 12:51:03 +0100, Paul Geeleher wrote > Hi Martin, > > Your advice seems to have been pretty successful, there was a read > aligned beyond the end of the chrM. No idea why this might have > happened though, which is a bit worrying. > > For the record removing the offending read worked ok. I used the > following code: > > # find offending read: > > refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges))] > > print(alnRanges[refLen < end(alnRanges)]) > > GRanges with 1 range and 1 elementMetadata value > seqnames ranges strand | pData.alignData.from...notNA... > <rle> <iranges> <rle> | <integer> > [1] chrM [16541, 16577] - | 147 > > # remove offending read: > alnRanges <- alnRanges[-which(refLen < end(alnRanges))] > > Cheers, > > Paul > > On Wed, Aug 18, 2010 at 2:30 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: > > Hi Paul -- > > > > On 08/17/2010 08:09 AM, Paul Geeleher wrote: > >> I'm attempting to read some aligned RNA-seq paired end reads in > >> Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) > >> and converted to SAM then BAM format by samtools as I couldn't get the > >> paired end reads into biocondcutor in MAQ format. > >> > >> I'm basically following this tutorial: > >> http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf > >> > >> The following step causes me an error: > >> > >>> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] > >> Error in validObject(.Object) : > >> ? invalid class "GRanges" object: slot 'ranges' contains values > >> outside of sequence bounds > >> > >> I think it means that there is something aligned to outside the range > >> of my reference genome but I don't understand why this would be as I > >> used HG18 for everything. > > > > probably one or more of your reads are 'hanging over' one of the > > chromosome bounds. You might > > > > ?tapply(alnRanges, seqnames(alnRanges), f > > ? ? ? ? unction(x) c(min(start(x)), max(end(x)))) > > > > to get a sense of whether you're on the right track, and maybe something > > like > > > > ?refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)] > > ?alnRanges[refLen < end(alnRanges) > > > > to get the actual ranges? You could trim or discard these > > > > Martin > >
ADD REPLY
0
Entering edit mode
Ah of course, that never occured to me, thanks! On Mon, Aug 23, 2010 at 2:03 PM, Joern Toedling <joern.toedling at="" curie.fr=""> wrote: > Hello, > > as the mtDNA is circular, such an overhanging alignment covering end and start > of the mtDNA is not worrying at all. > > As such an alignment can also happen with bacterial genomes, one could start > thinking whether such alignments should be treated in a special way. However, > I am not sure whether this should be addressed by BioC or rather upstream in > the alignment program output formats such as SAM/BAM. > > Regards, > Joern > > On Mon, 23 Aug 2010 12:51:03 +0100, Paul Geeleher wrote >> Hi Martin, >> >> Your advice seems to have been pretty successful, there was a read >> aligned beyond the end of the chrM. No idea why this might have >> happened though, which is a bit worrying. >> >> For the record removing the offending read worked ok. I used the >> following code: >> >> # find offending read: >> > refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges))] >> > print(alnRanges[refLen < end(alnRanges)]) >> >> GRanges with 1 range and 1 elementMetadata value >> ? ? seqnames ? ? ? ? ranges strand | pData.alignData.from...notNA... >> ? ? ? ?<rle> ? ? ?<iranges> ?<rle> | ? ? ? ? ? ? ? ? ? ? ? <integer> >> [1] ? ? chrM [16541, 16577] ? ? ?- | ? ? ? ? ? ? ? ? ? ? ? ? ? ? 147 >> >> # remove offending read: >> alnRanges <- alnRanges[-which(refLen < end(alnRanges))] >> >> Cheers, >> >> Paul >> >> On Wed, Aug 18, 2010 at 2:30 PM, Martin Morgan <mtmorgan at="" fhcrc.org=""> wrote: >> > Hi Paul -- >> > >> > On 08/17/2010 08:09 AM, Paul Geeleher wrote: >> >> I'm attempting to read some aligned RNA-seq paired end reads in >> >> Bioconductor. The reads were aligned with maq (.fastq to .bqf to .map) >> >> and converted to SAM then BAM format by samtools as I couldn't get the >> >> paired end reads into biocondcutor in MAQ format. >> >> >> >> I'm basically following this tutorial: >> >> > http://bioconductor.org/help/course- materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf >> >> >> >> The following step causes me an error: >> >> >> >>> seqlengths(alnRanges) <- seqlengths(hgTxDb)[names(seqlengths(alnRanges))] >> >> Error in validObject(.Object) : >> >> ? invalid class "GRanges" object: slot 'ranges' contains values >> >> outside of sequence bounds >> >> >> >> I think it means that there is something aligned to outside the range >> >> of my reference genome but I don't understand why this would be as I >> >> used HG18 for everything. >> > >> > probably one or more of your reads are 'hanging over' one of the >> > chromosome bounds. You might >> > >> > ?tapply(alnRanges, seqnames(alnRanges), f >> > ? ? ? ? unction(x) c(min(start(x)), max(end(x)))) >> > >> > to get a sense of whether you're on the right track, and maybe something >> > like >> > >> > ?refLen <- seqlengths(hgTxDb)[as.character(seqnames(alnRanges)] >> > ?alnRanges[refLen < end(alnRanges) >> > >> > to get the actual ranges? You could trim or discard these >> > >> > Martin >> > > > -- Paul Geeleher School of Mathematics, Statistics and Applied Mathematics National University of Ireland Galway Ireland -- www.bioinformaticstutorials.com
ADD REPLY

Login before adding your answer.

Traffic: 753 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6