readGAlignmentPairsFromBam and readGAlignmentFromBam differences
2
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Hello, I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function, see the code below. I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam but it seems to have much more strict criteria for the pairing and therefore, the counts are much lower (mean number of counts: 185 versus 440 for a random sample). I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me. example for one of these reads: $ samtools view 209_accepted_hits.bam | grep "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259" HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 97 chr15 77414942 50 51M = 77415180 289 GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG &&&(((((**(**++++++++++++++++++++)+++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 145 chr15 77415180 50 51M = 77414942 -289 TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG *+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$ AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:20A30 YT:Z:UU XS:A:- NH:i:1 is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam. Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria. Also I get some reads in readGAlignmentPairsFromBam that I don???t get from readGAlignmentFromBam, which again I don???t understand why. Example for one of these reads: $samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830" HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 161 chr15 77426477 50 51M = 77426641 215 TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT %$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 81 chr15 77426641 50 51M = 77426477 -215 GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG ++++++++++++++++++++++++++*+++++++++++*****(((((&&& AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1 Can someone, please, explain why this is? Thank you in advance for your time. Emmanouela -- output of sessionInfo(): library(GenomicRanges) library(Rsamtools) library(GenomicFeatures) library(rtracklayer) txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene") range.exon <- exonsBy(txdb.ensgene, "gene") my_bam <- "209_accepted_hits.bam.fltrd.bam" aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T) aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T) temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b temp3_aligns <- aligns[subjectHits(temp3)] temp4 <- findOverlaps(range.exon["328561"],aligns_s) temp4_aligns <- aligns_s[subjectHits(temp4)] setdiff(names(temp3_aligns) , names(temp4_aligns))[1] [1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830" setdiff(names(temp4_aligns),names(temp3_aligns))[1] [1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259" sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C [3] LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1 [5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] org.Mm.eg.db_2.10.1 RSQLite_0.11.4 DBI_0.2-7 [4] annotate_1.40.0 rtracklayer_1.22.1 GenomicFeatures_1.14.2 [7] AnnotationDbi_1.24.0 Biobase_2.22.0 Rsamtools_1.14.2 [10] Biostrings_2.30.1 GenomicRanges_1.14.4 XVector_0.2.0 [13] IRanges_1.20.6 BiocGenerics_0.8.0 loaded via a namespace (and not attached): [1] biomaRt_2.18.0 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1 [5] stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 xtable_1.7-1 [9] zlibbioc_1.8.0 -- Sent via the guest posting facility at bioconductor.org.
• 1.0k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States
Hi, On 02/10/2014 07:19 AM, Maintainer wrote: > > Hello, > > I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function, Just to clarify, readGAlignmentsPairsFromBam() does not align reads. It reads in records that have been previously aligned. see the code below. > I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam There is a family of functions available for reading in single and paired-end data. The functions with the *Bam extension are lower level and will probably go away (deprecated) in the next release cycle to eliminate redundancy. The functions below are the higher level counter parts that call the *Bam functions; these are the functions we recommend you use. Single-end: readGAlignments() Paired-end" readGAlignmentPairs() readGAlignmentsList() If you have paired-end data you can use either readGAlignmentPairs() or readGAlignmentsList(). These functions both use a mate-pairing algorithm but return different classes. In the devel branch the two call the same algorithm, in release they are different. readGAlignmentPairs() returns a GAlignmentPairs class. This class can only hold pairs so any unmapped reads, singletons etc. are dumped. The dumped reads can be retrieved with getDumpedAlignments(). readGAlignmentsList() returns a List of GAlignments. The class can hold both pairs and records that cannot be paired. See ?readGAlignmentsList for details. but it seems to have much more strict criteria for the pairing and therefore, the counts are much lower (mean number of counts: 185 versus 440 for a random sample). > > I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me. > > example for one of these reads: > $ samtools view 209_accepted_hits.bam | grep "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259" > HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 97 chr15 77414942 50 51M = 77415180 289 GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG &&&(((((**(**++++++++++++++++++++)+++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 > HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 145 chr15 77415180 50 51M = 77414942 -289 TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG *+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$ AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:20A30 YT:Z:UU XS:A:- NH:i:1 > > is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam. readGAlignmentsFromBam() returns both records because it treats all reads as single-end; there is no mate-paring here. The output of the two functions is not expected to be the same because one assumes single- end reads, the other assumes paired-end. Are you saying that one of these reads above is not returned by readGAlignmentPairs()? If they are pairs (which it looks like they are) either both or neither should come back from readGAlignmentPairs(). Remember the class has right and left pairs that can be accessed with right() and left(). If neither are returned then they did not meet the mate-pairing criteria. As an fyi, you can investigate bam flags further with the bamFlagAsBitMatrix() function. > Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria. > > Also I get some reads in readGAlignmentPairsFromBam that I don???t get from readGAlignmentFromBam, which again I don???t understand why. readGAlignmentPairsFromBam() returns records that satisfy a mate- pairing algorithm. readGAlignmentFromBam() treats all records as single-end; there is no mate-pairing criteria in this function so all records are returned (sans those you filter out with ScanBamParam). > > Example for one of these reads: > $samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830" > HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 161 chr15 77426477 50 51M = 77426641 215 TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT %$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 > HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 81 chr15 77426641 50 51M = 77426477 -215 GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG ++++++++++++++++++++++++++*+++++++++++*****(((((&&& AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1 > Can you try reading the data in with readGAlignmentsList()? This class will hold pairs and records that did not meet the pairing criteria. Using GenomicRanges 1.14.4 in release: library(GenomicRanges) library(Rsamtools) library(pasillaBamSubset) fl <- untreated3_chr4() galist <- readGAlignmentsList(fl) The metadata column 'mates' indicates if they passed the mate criteria. >> head(galist, 2) > GAlignmentsList of length 2: > $1 > GAlignments with 2 alignments and 1 metadata column: > seqnames strand cigar qwidth start end width ngap | mates > [1] chr4 + 37M 37 169 205 37 0 | TRUE > [2] chr4 - 37M 37 326 362 37 0 | TRUE > > $2 > GAlignments with 2 alignments and 1 metadata column: > seqnames strand cigar qwidth start end width ngap | mates > [1] chr4 + 37M 37 946 982 37 0 | TRUE > [2] chr4 - 37M 37 986 1022 37 0 | TRUE > Value is FALSE for records that did not pass. >> tail(galist, 2) > GAlignmentsList of length 2: > $95788 > GAlignments with 2 alignments and 1 metadata column: > seqnames strand cigar qwidth start end width ngap | mates > [1] chr4 + 37M 37 87190 87226 37 0 | FALSE > [2] chr4 - 37M 37 87476 87512 37 0 | FALSE > > $95789 > GAlignments with 2 alignments and 1 metadata column: > seqnames strand cigar qwidth start end width ngap | mates > [1] chr4 + 37M 37 12491 12527 37 0 | FALSE > [2] chr4 - 37M 37 12667 12703 37 0 | FALSE If these pairs appear in the output from readGAlignmentsList() with mates=FALSE then they did not meet some aspect of the criteria. In this case the records can also be found with getDumpedAlignments() after runing readGAlignmentPairs(). Let me know how it goes. Valerie > Can someone, please, explain why this is? > Thank you in advance for your time. > Emmanouela > > > -- output of sessionInfo(): > > > > library(GenomicRanges) > library(Rsamtools) > library(GenomicFeatures) > library(rtracklayer) > > txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene") > range.exon <- exonsBy(txdb.ensgene, "gene") > > my_bam <- "209_accepted_hits.bam.fltrd.bam" > aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T) > aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T) > > temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b > temp3_aligns <- aligns[subjectHits(temp3)] > > temp4 <- findOverlaps(range.exon["328561"],aligns_s) > temp4_aligns <- aligns_s[subjectHits(temp4)] > > setdiff(names(temp3_aligns) , names(temp4_aligns))[1] > [1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830" > > setdiff(names(temp4_aligns),names(temp3_aligns))[1] > [1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259" > > > > sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C > [3] LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1 > [5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] org.Mm.eg.db_2.10.1 RSQLite_0.11.4 DBI_0.2-7 > [4] annotate_1.40.0 rtracklayer_1.22.1 GenomicFeatures_1.14.2 > [7] AnnotationDbi_1.24.0 Biobase_2.22.0 Rsamtools_1.14.2 > [10] Biostrings_2.30.1 GenomicRanges_1.14.4 XVector_0.2.0 > [13] IRanges_1.20.6 BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.18.0 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1 > [5] stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 xtable_1.7-1 > [9] zlibbioc_1.8.0 > > > -- > Sent via the guest posting facility at bioconductor.org. > > > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Valerie Obenchain Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B155 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: vobencha at fhcrc.org Phone: (206) 667-3158 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 4 hours ago
Seattle, WA, United States
Hi Emmanouela, Short answer: It's a strand issue (and you should probably use ignore.strand=TRUE). Long answer: see below. On 02/10/2014 07:19 AM, Maintainer wrote: > > Hello, > > I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function, see the code below. > I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam but it seems to have much more strict criteria for the pairing and therefore, the counts are much lower (mean number of counts: 185 versus 440 for a random sample). > > I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me. > > example for one of these reads: > $ samtools view 209_accepted_hits.bam | grep "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259" > HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 97 chr15 77414942 50 51M = 77415180 289 GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG &&&(((((**(**++++++++++++++++++++)+++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 > HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 145 chr15 77415180 50 51M = 77414942 -289 TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG *+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$ AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:20A30 YT:Z:UU XS:A:- NH:i:1 If I put those 2 records alone in a BAM file, they show up with readGAlignmentsFromBam(): > gal1 <- readGAlignmentsFromBam("test1.bam", use.names=TRUE) > gal1 GAlignments with 2 alignments and 0 metadata columns: seqnames strand cigar qwidth start end <rle> <rle> <character> <integer> <integer> <integer> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 chr15 + 51M 51 77414942 77414992 HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 chr15 - 51M 51 77415180 77415230 width ngap <integer> <integer> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 51 0 HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 51 0 --- seqlengths: chr15 103494974 They also get paired by readGAlignmentPairsFromBam(): > galp1 <- readGAlignmentPairsFromBam("test1.bam", use.names=TRUE) > galp1 GAlignmentPairs with 1 alignment pair and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 chr15 + : [77414942, 77414992] -- [77415180, 77415230] --- seqlengths: chr15 103494974 > > is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam. Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria. > > Also I get some reads in readGAlignmentPairsFromBam that I don???t get from readGAlignmentFromBam, which again I don???t understand why. > > Example for one of these reads: > $samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830" > HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 161 chr15 77426477 50 51M = 77426641 215 TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT %$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 > HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 81 chr15 77426641 50 51M = 77426477 -215 GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG ++++++++++++++++++++++++++*+++++++++++*****(((((&&& AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1 Here too, if I put those 2 records alone in a BAM file: > gal2 <- readGAlignmentsFromBam("test2.bam", use.names=TRUE) > gal2 GAlignments with 2 alignments and 0 metadata columns: seqnames strand cigar qwidth start end <rle> <rle> <character> <integer> <integer> <integer> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 chr15 + 51M 51 77426477 77426527 HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 chr15 - 51M 51 77426641 77426691 width ngap <integer> <integer> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 51 0 HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 51 0 --- seqlengths: chr15 103494974 Also: > galp2 <- readGAlignmentPairsFromBam("test2.bam", use.names=TRUE) > galp2 GAlignmentPairs with 1 alignment pair and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 chr15 - : [77426641, 77426691] -- [77426477, 77426527] --- seqlengths: chr15 103494974 So the "count" issue you describe is not a "pairing criteria" issue. You need to understand how findOverlaps() works on a GAlignmentPairs object. First with GAlignments object 'gal1': > findOverlaps(range.exon["328561"], gal1) Hits of length 1 queryLength: 1 subjectLength: 2 queryHits subjectHits <integer> <integer> 1 1 2 Your gene (which is on the minus strand) overlaps with the 2nd element in 'gal1'. When you pass a GAlignments or GAlignmentPairs object to findOverlaps(), it's first turned into a GRangesList object internally, with grglist(). The key here is that, grglist() on a GAlignmentPairs object inverts the strand of the 2nd mate: > grglist(galp1) GRangesList of length 1: $HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr15 [77414942, 77414992] + [2] chr15 [77415180, 77415230] + --- seqlengths: chr15 103494974 See ?GAlignmentPairs for more info about this (especially the IMPORTANT note in the Coercion section). The reason for this behavior is because the strand of the 2nd mate was already inverted by the sequencing technology (when the cDNA template was sequenced from the 2nd end). This is why you don't see any overlap when you do: > length(findOverlaps(range.exon["328561"], galp1)) [1] 0 (because your gene is on the minus strand). Most of the time those strand considerations don't matter because the RNA-seq protocol is not stranded. In that case you should use ignore.strand=TRUE: > findOverlaps(range.exon["328561"], gal1, ignore.strand=TRUE) Hits of length 2 queryLength: 1 subjectLength: 2 queryHits subjectHits <integer> <integer> 1 1 1 2 1 2 > findOverlaps(range.exon["328561"], galp1, ignore.strand=TRUE) Hits of length 1 queryLength: 1 subjectLength: 1 queryHits subjectHits <integer> <integer> 1 1 1 Similar story for gal2/galp2: > length(findOverlaps(range.exon["328561"], gal2)) [1] 0 No hit with the 1st element in 'gal2' because even though gal2[1] is in the range of your gene, it's not on the same strand. No hit with the 2nd element in 'gal2' because even though gal2[2] is on the same strand as your gene, it's not in its range. However, we see a hit with the pair: > length(findOverlaps(range.exon["328561"], galp2)) [1] 1 That's because when converting to GRangesList, everything ends up on the minus strand: > grglist(galp2) GRangesList of length 1: $HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr15 [77426477, 77426527] - [2] chr15 [77426641, 77426691] - --- seqlengths: chr15 103494974 But again, if you use ignore.strand=TRUE: > findOverlaps(range.exon["328561"], gal2, ignore.strand=TRUE) Hits of length 1 queryLength: 1 subjectLength: 2 queryHits subjectHits <integer> <integer> 1 1 1 Hope this helps, H. > > Can someone, please, explain why this is? > Thank you in advance for your time. > Emmanouela > > > -- output of sessionInfo(): > > > > library(GenomicRanges) > library(Rsamtools) > library(GenomicFeatures) > library(rtracklayer) > > txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene") > range.exon <- exonsBy(txdb.ensgene, "gene") > > my_bam <- "209_accepted_hits.bam.fltrd.bam" > aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T) > aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T) > > temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b > temp3_aligns <- aligns[subjectHits(temp3)] > > temp4 <- findOverlaps(range.exon["328561"],aligns_s) > temp4_aligns <- aligns_s[subjectHits(temp4)] > > setdiff(names(temp3_aligns) , names(temp4_aligns))[1] > [1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830" > > setdiff(names(temp4_aligns),names(temp3_aligns))[1] > [1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259" > > > > sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C > [3] LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1 > [5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] org.Mm.eg.db_2.10.1 RSQLite_0.11.4 DBI_0.2-7 > [4] annotate_1.40.0 rtracklayer_1.22.1 GenomicFeatures_1.14.2 > [7] AnnotationDbi_1.24.0 Biobase_2.22.0 Rsamtools_1.14.2 > [10] Biostrings_2.30.1 GenomicRanges_1.14.4 XVector_0.2.0 > [13] IRanges_1.20.6 BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.18.0 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1 > [5] stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 xtable_1.7-1 > [9] zlibbioc_1.8.0 > > > -- > Sent via the guest posting facility at bioconductor.org. > > > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Thank you Herve, That makes complete sense! Now I get the results I'd expect. I hadn't even seen that ignore.strand parameter because when I did ?findOverlaps I got the IRanges version that doesn't mention the strand at all, and didn't think to look at the GenomicRanges version of it. Thanks a lot! Emmanouela Emmanouela Repapi Computational Biology Research Group Weatherall Institute of Molecular Medicine University of Oxford, OX3 9DS Tel: (01865 2)2253 ________________________________________ From: Hervé Pagès [hpages@fhcrc.org] Sent: 11 February 2014 02:45 To: guest at bioconductor.org; bioconductor at r-project.org; Emmanouela Repapi Cc: Rsamtools Maintainer Subject: Re: [devteam-bioc] readGAlignmentPairsFromBam and readGAlignmentFromBam differences Hi Emmanouela, Short answer: It's a strand issue (and you should probably use ignore.strand=TRUE). Long answer: see below. On 02/10/2014 07:19 AM, Maintainer wrote: > > Hello, > > I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function, see the code below. > I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam but it seems to have much more strict criteria for the pairing and therefore, the counts are much lower (mean number of counts: 185 versus 440 for a random sample). > > I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me. > > example for one of these reads: > $ samtools view 209_accepted_hits.bam | grep "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259" > HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 97 chr15 77414942 50 51M = 77415180 289 GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG &&&(((((**(**++++++++++++++++++++)+++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 > HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 145 chr15 77415180 50 51M = 77414942 -289 TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG *+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$ AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:20A30 YT:Z:UU XS:A:- NH:i:1 If I put those 2 records alone in a BAM file, they show up with readGAlignmentsFromBam(): > gal1 <- readGAlignmentsFromBam("test1.bam", use.names=TRUE) > gal1 GAlignments with 2 alignments and 0 metadata columns: seqnames strand cigar qwidth start end <rle> <rle> <character> <integer> <integer> <integer> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 chr15 + 51M 51 77414942 77414992 HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 chr15 - 51M 51 77415180 77415230 width ngap <integer> <integer> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 51 0 HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 51 0 --- seqlengths: chr15 103494974 They also get paired by readGAlignmentPairsFromBam(): > galp1 <- readGAlignmentPairsFromBam("test1.bam", use.names=TRUE) > galp1 GAlignmentPairs with 1 alignment pair and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 chr15 + : [77414942, 77414992] -- [77415180, 77415230] --- seqlengths: chr15 103494974 > > is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam. Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria. > > Also I get some reads in readGAlignmentPairsFromBam that I don???t get from readGAlignmentFromBam, which again I don???t understand why. > > Example for one of these reads: > $samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830" > HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 161 chr15 77426477 50 51M = 77426641 215 TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT %$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1 > HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 81 chr15 77426641 50 51M = 77426477 -215 GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG ++++++++++++++++++++++++++*+++++++++++*****(((((&&& AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1 Here too, if I put those 2 records alone in a BAM file: > gal2 <- readGAlignmentsFromBam("test2.bam", use.names=TRUE) > gal2 GAlignments with 2 alignments and 0 metadata columns: seqnames strand cigar qwidth start end <rle> <rle> <character> <integer> <integer> <integer> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 chr15 + 51M 51 77426477 77426527 HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 chr15 - 51M 51 77426641 77426691 width ngap <integer> <integer> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 51 0 HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 51 0 --- seqlengths: chr15 103494974 Also: > galp2 <- readGAlignmentPairsFromBam("test2.bam", use.names=TRUE) > galp2 GAlignmentPairs with 1 alignment pair and 0 metadata columns: seqnames strand : ranges -- ranges <rle> <rle> : <iranges> -- <iranges> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 chr15 - : [77426641, 77426691] -- [77426477, 77426527] --- seqlengths: chr15 103494974 So the "count" issue you describe is not a "pairing criteria" issue. You need to understand how findOverlaps() works on a GAlignmentPairs object. First with GAlignments object 'gal1': > findOverlaps(range.exon["328561"], gal1) Hits of length 1 queryLength: 1 subjectLength: 2 queryHits subjectHits <integer> <integer> 1 1 2 Your gene (which is on the minus strand) overlaps with the 2nd element in 'gal1'. When you pass a GAlignments or GAlignmentPairs object to findOverlaps(), it's first turned into a GRangesList object internally, with grglist(). The key here is that, grglist() on a GAlignmentPairs object inverts the strand of the 2nd mate: > grglist(galp1) GRangesList of length 1: $HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr15 [77414942, 77414992] + [2] chr15 [77415180, 77415230] + --- seqlengths: chr15 103494974 See ?GAlignmentPairs for more info about this (especially the IMPORTANT note in the Coercion section). The reason for this behavior is because the strand of the 2nd mate was already inverted by the sequencing technology (when the cDNA template was sequenced from the 2nd end). This is why you don't see any overlap when you do: > length(findOverlaps(range.exon["328561"], galp1)) [1] 0 (because your gene is on the minus strand). Most of the time those strand considerations don't matter because the RNA-seq protocol is not stranded. In that case you should use ignore.strand=TRUE: > findOverlaps(range.exon["328561"], gal1, ignore.strand=TRUE) Hits of length 2 queryLength: 1 subjectLength: 2 queryHits subjectHits <integer> <integer> 1 1 1 2 1 2 > findOverlaps(range.exon["328561"], galp1, ignore.strand=TRUE) Hits of length 1 queryLength: 1 subjectLength: 1 queryHits subjectHits <integer> <integer> 1 1 1 Similar story for gal2/galp2: > length(findOverlaps(range.exon["328561"], gal2)) [1] 0 No hit with the 1st element in 'gal2' because even though gal2[1] is in the range of your gene, it's not on the same strand. No hit with the 2nd element in 'gal2' because even though gal2[2] is on the same strand as your gene, it's not in its range. However, we see a hit with the pair: > length(findOverlaps(range.exon["328561"], galp2)) [1] 1 That's because when converting to GRangesList, everything ends up on the minus strand: > grglist(galp2) GRangesList of length 1: $HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 GRanges with 2 ranges and 0 metadata columns: seqnames ranges strand <rle> <iranges> <rle> [1] chr15 [77426477, 77426527] - [2] chr15 [77426641, 77426691] - --- seqlengths: chr15 103494974 But again, if you use ignore.strand=TRUE: > findOverlaps(range.exon["328561"], gal2, ignore.strand=TRUE) Hits of length 1 queryLength: 1 subjectLength: 2 queryHits subjectHits <integer> <integer> 1 1 1 Hope this helps, H. > > Can someone, please, explain why this is? > Thank you in advance for your time. > Emmanouela > > > -- output of sessionInfo(): > > > > library(GenomicRanges) > library(Rsamtools) > library(GenomicFeatures) > library(rtracklayer) > > txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene") > range.exon <- exonsBy(txdb.ensgene, "gene") > > my_bam <- "209_accepted_hits.bam.fltrd.bam" > aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T) > aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T) > > temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b > temp3_aligns <- aligns[subjectHits(temp3)] > > temp4 <- findOverlaps(range.exon["328561"],aligns_s) > temp4_aligns <- aligns_s[subjectHits(temp4)] > > setdiff(names(temp3_aligns) , names(temp4_aligns))[1] > [1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830" > > setdiff(names(temp4_aligns),names(temp3_aligns))[1] > [1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259" > > > > sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C > [3] LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1 > [5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] org.Mm.eg.db_2.10.1 RSQLite_0.11.4 DBI_0.2-7 > [4] annotate_1.40.0 rtracklayer_1.22.1 GenomicFeatures_1.14.2 > [7] AnnotationDbi_1.24.0 Biobase_2.22.0 Rsamtools_1.14.2 > [10] Biostrings_2.30.1 GenomicRanges_1.14.4 XVector_0.2.0 > [13] IRanges_1.20.6 BiocGenerics_0.8.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.18.0 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1 > [5] stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 xtable_1.7-1 > [9] zlibbioc_1.8.0 > > > -- > Sent via the guest posting facility at bioconductor.org. > > > > ____________________________________________________________________ ____ > devteam-bioc mailing list > To unsubscribe from this mailing list send a blank email to > devteam-bioc-leave at lists.fhcrc.org > You can also unsubscribe or change your personal options at > https://lists.fhcrc.org/mailman/listinfo/devteam-bioc > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY

Login before adding your answer.

Traffic: 706 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6