readGappedAlignmentPairs with multimapping reads
3
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
How does the readGappedAlignmentPairs from the GenomicRanges library handle reads that map to several places in the genome? Sometimes it can happen that one pair of the read is flagged as properly paired even if the other read maps to several locations, how is this handled? Thank you in advance! -- output of sessionInfo(): R version 2.15.0 (2012-03-30) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4 loaded via a namespace (and not attached): [1] stats4_2.15.0 tools_2.15.0 -- Sent via the guest posting facility at bioconductor.org.
GenomicRanges GenomicRanges • 1.1k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 2 days ago
United States
On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote: > > How does the readGappedAlignmentPairs from the GenomicRanges library handle reads that map to several places in the genome? > > Sometimes it can happen that one pair of the read is flagged as properly paired even if the other read maps to several locations, how is this handled? The full details will eventually be documented on ?readBamGappedAlignmentPairs and ?makeGappedAlignmentPairs; the author is currently taking a few days off. My understanding is that each alignment is a separate record, and that in the SAM specification one mate tells the location of the second mate (and vice versa) so that pairs can be identified unambiguously (except when the mate alignments differ only in the pattern of indels within an end). More details will be forthcoming. Martin > Thank you in advance! > > -- output of sessionInfo(): > > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4 > > loaded via a namespace (and not attached): > [1] stats4_2.15.0 tools_2.15.0 > > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
ADD COMMENT
0
Entering edit mode
On Sat, Jun 2, 2012 at 11:48 AM, Martin Morgan <mtmorgan@fhcrc.org> wrote: > On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote: > >> >> How does the readGappedAlignmentPairs from the GenomicRanges library >> handle reads that map to several places in the genome? >> >> Sometimes it can happen that one pair of the read is flagged as properly >> paired even if the other read maps to several locations, how is this >> handled? >> > > The full details will eventually be documented on > ?readBamGappedAlignmentPairs and ?makeGappedAlignmentPairs; the author is > currently taking a few days off. My understanding is that each alignment is > a separate record, and that in the SAM specification one mate tells the > location of the second mate (and vice versa) so that pairs can be > identified unambiguously (except when the mate alignments differ only in > the pattern of indels within an end). > > More details will be forthcoming. > > I think I can add a few details, since I've been playing with this lately. I think the intent at least is to drop the read pairs that are not one-to-one. I'm a little skeptical that it's actually working correctly though, because it seems to always drop exactly one read pair from my BAMs, even though there must be more than one that is ambiguous. Take this with a grain of salt though. Michael > Martin > > > Thank you in advance! >> >> -- output of sessionInfo(): >> >> R version 2.15.0 (2012-03-30) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C >> [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 >> [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 >> [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4 >> >> loaded via a namespace (and not attached): >> [1] stats4_2.15.0 tools_2.15.0 >> >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > > -- > Computational Biology > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 > > Location: M1-B861 > Telephone: 206 667-2793 > > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 3 hours ago
Seattle, WA, United States
Hi Vedran, On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote: > > How does the readGappedAlignmentPairs from the GenomicRanges library handle reads that map to several places in the genome? Good question and I wish there was a simple answer... readGappedAlignmentPairs() delegates to findMateAlignment() for doing the pairing. findMateAlignment() does not have a full man page yet but will have one soon. The man page will explain the algorithm used for doing the pairing of records loaded from a BAM file. Here is roughly how it works. First only records with flag bit 0x1 set to 1, flag bit 0x4 set to 0, and flag bit 0x8 set to 0 are candidates for pairing (see the SAM Spec for a description of flag bits and fields). Any other record is discarded. That is, records that correspond to single end reads, and records that correspond to paired end reads where one or both ends are unmapped, are discarded. Then the algorithm looks at the following fields and flag bits: (A) QNAME (B) RNAME, RNEXT (C) POS, PNEXT (D) Flag bits Ox10 and 0x20 (E) Flag bits 0x40 and 0x80 2 records rec(i) and rec(j) are considered mates iff all the following conditions are satisfied: (A) They have the same QNAME (B) RNEXT(i) == RNAME(j) and RNEXT(j) == RNAME(i) (C) PNEXT(i) == POS(j) and PNEXT(j) == POS(i) (D) Flag bit 0x20 of rec(i) == Flag bit 0x10 of rec(j) and Flag bit 0x20 of rec(j) == Flag bit 0x10 of rec(i) (E) rec(i) corresponds to the fi rst segment in the template and rec(j) corresponds to the last segment in the template OR rec(j) corresponds to the fi rst segment in the template and rec(i) corresponds to the last segment in the template This algorithm will find almost all pairs unambiguously, even when the same pair of reads maps to several places in the genome. Note that when a given pair maps to a single place in the genome, looking at (A) is enough to pair the 2 corresponding records. The additional conditions (B), (C), (D) and (E) are only here to help in the situation where more than 2 records share the same QNAME. And that works most of the times but there are still situations where this is not enough to solve the pairing problem unambiguously. For example, here are 4 records (loaded in a GappedAlignments object) that cannot be paired with the above algorithm: ** Showing the 4 records as a GappedAlignments object of length 4: GappedAlignments with 4 alignments and 2 elementMetadata cols: seqnames strand cigar qwidth start end <rle> <rle> <character> <integer> <integer> <integer> SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 width ngap | mrnm mpos <integer> <integer> | <factor> <integer> SRR031714.2658602 421 1 | chr2R 6983858 SRR031714.2658602 421 1 | chr2R 6983858 SRR031714.2658602 409 1 | chr2R 6983850 SRR031714.2658602 415 1 | chr2R 6983850 Note that the BAM fields show up in the following columns: - QNAME: the names of the GappedAlignments object (unnamed col) - RNAME: the seqnames col - POS: the start col - RNEXT: the mrnm col - PNEXT: the mpos col As you can see, the aligner has aligned the same pair to the same location twice! The only difference between the 2 aligned pairs is in the cigar i.e. one end of the pair is aligned twice to the same location with exactly the same cigar while the other end of the pair is aligned twice to the same location but with slightly different cigars. ** Now showing the corresponding flag bits: isPaired isProperPair isUnmappedQuery hasUnmappedMate isMinusStrand [1,] 1 1 0 0 0 [2,] 1 1 0 0 0 [3,] 1 1 0 0 1 [4,] 1 1 0 0 1 isMateMinusStrand isFirstMateRead isSecondMateRead isNotPrimaryRead [1,] 1 0 1 0 [2,] 1 0 1 0 [3,] 0 1 0 0 [4,] 0 1 0 0 isNotPassingQualityControls isDuplicate [1,] 0 0 [2,] 0 0 [3,] 0 0 [4,] 0 0 As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4) are both first mates. But looking at (A), (B), (C), (D) and (E), the pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to disambiguate! Also note that everything is tagged as proper pair (flag bit 0x2 is set to 1) and primary read (flag bit 0x100 is set to 0), so using this information would not help disambiguate here. I'm wondering if there is some other place we should look at in the BAM file in order to disambiguate (e.g. a tag?), or if those ambiguous pairings are just part of the life with the SAM Spec. Not sure whether this is a weakness of the Spec? Or A feature? Any input on this would be appreciated. In the meantime, findMateAlignment() is just ignoring those ambiguous pairings (with a warning) i.e. records that cannot be paired unambiguously are not paired at all. Concretely that means that readGappedAlignmentPairs() is guaranteed to return a GappedAlignmentPairs object where every pair could be formed in an non-ambiguous way. Note that AFAICS in practice this approach doesn't seem to leave aside a lot of records because ambiguous pairing events seem pretty rare. Cheers, H. > > Sometimes it can happen that one pair of the read is flagged as properly paired even if the other read maps to several locations, how is this handled? > > Thank you in advance! > > -- output of sessionInfo(): > > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4 > > loaded via a namespace (and not attached): > [1] stats4_2.15.0 tools_2.15.0 > > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 3 hours ago
Seattle, WA, United States
There was a small formatting problem with my previous email (due to some copy/paste from the SAM Spec opened in Acroread to my mailer), so I'm sending this again. Sorry for the noise... On 06/02/2012 06:08 AM, Vedran Franke [guest] wrote: > > How does the readGappedAlignmentPairs from the GenomicRanges library handle reads that map to several places in the genome? Good question and I wish there was a simple answer... readGappedAlignmentPairs() delegates to findMateAlignment() for doing the pairing. findMateAlignment() does not have a full man page yet but will have one soon. The man page will explain the algorithm used for doing the pairing of records loaded from a BAM file. Here is roughly how it works. First only records with flag bit 0x1 set to 1, flag bit 0x4 set to 0, and flag bit 0x8 set to 0 are candidates for pairing (see the SAM Spec for a description of flag bits and fields). Any other record is discarded. That is, records that correspond to single end reads, and records that correspond to paired end reads where one or both ends are unmapped, are discarded. Then the algorithm looks at the following fields and flag bits: (A) QNAME (B) RNAME, RNEXT (C) POS, PNEXT (D) Flag bits Ox10 and 0x20 (E) Flag bits 0x40 and 0x80 2 records rec(i) and rec(j) are considered mates iff all the following conditions are satisfied: (A) They have the same QNAME (B) RNEXT(i) == RNAME(j) and RNEXT(j) == RNAME(i) (C) PNEXT(i) == POS(j) and PNEXT(j) == POS(i) (D) Flag bit 0x20 of rec(i) == Flag bit 0x10 of rec(j) and Flag bit 0x20 of rec(j) == Flag bit 0x10 of rec(i) (E) rec(i) corresponds to the first segment in the template and rec(j) corresponds to the last segment in the template OR rec(j) corresponds to the first segment in the template and rec(i) corresponds to the last segment in the template This algorithm will find almost all pairs unambiguously, even when the same pair of reads maps to several places in the genome. Note that when a given pair maps to a single place in the genome, looking at (A) is enough to pair the 2 corresponding records. The additional conditions (B), (C), (D) and (E) are only here to help in the situation where more than 2 records share the same QNAME. And that works most of the times but there are still situations where this is not enough to solve the pairing problem unambiguously. For example, here are 4 records (loaded in a GappedAlignments object) that cannot be paired with the above algorithm: ** Showing the 4 records as a GappedAlignments object of length 4: GappedAlignments with 4 alignments and 2 elementMetadata cols: seqnames strand cigar qwidth start end <rle> <rle> <character> <integer> <integer> <integer> SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 width ngap | mrnm mpos <integer> <integer> | <factor> <integer> SRR031714.2658602 421 1 | chr2R 6983858 SRR031714.2658602 421 1 | chr2R 6983858 SRR031714.2658602 409 1 | chr2R 6983850 SRR031714.2658602 415 1 | chr2R 6983850 Note that the BAM fields show up in the following columns: - QNAME: the names of the GappedAlignments object (unnamed col) - RNAME: the seqnames col - POS: the start col - RNEXT: the mrnm col - PNEXT: the mpos col As you can see, the aligner has aligned the same pair to the same location twice! The only difference between the 2 aligned pairs is in the cigar i.e. one end of the pair is aligned twice to the same location with exactly the same cigar while the other end of the pair is aligned twice to the same location but with slightly different cigars. ** Now showing the corresponding flag bits: isPaired isProperPair isUnmappedQuery hasUnmappedMate isMinusStrand [1,] 1 1 0 0 0 [2,] 1 1 0 0 0 [3,] 1 1 0 0 1 [4,] 1 1 0 0 1 isMateMinusStrand isFirstMateRead isSecondMateRead isNotPrimaryRead [1,] 1 0 1 0 [2,] 1 0 1 0 [3,] 0 1 0 0 [4,] 0 1 0 0 isNotPassingQualityControls isDuplicate [1,] 0 0 [2,] 0 0 [3,] 0 0 [4,] 0 0 As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4) are both first mates. But looking at (A), (B), (C), (D) and (E), the pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to disambiguate! Also note that everything is tagged as proper pair (flag bit 0x2 is set to 1) and primary read (flag bit 0x100 is set to 0), so using this information would not help disambiguate here. I'm wondering if there is some other place we should look at in the BAM file in order to disambiguate (e.g. a tag?), or if those ambiguous pairings are just part of the life with the SAM Spec. Not sure whether this is a weakness of the Spec? Or A feature? Any input on this would be appreciated. In the meantime, findMateAlignment() is just ignoring those ambiguous pairings (with a warning) i.e. records that cannot be paired unambiguously are not paired at all. Concretely that means that readGappedAlignmentPairs() is guaranteed to return a GappedAlignmentPairs object where every pair was formed in an non-ambiguous way. Note that AFAICS in practice this approach doesn't seem to leave aside a lot of records because ambiguous pairing events seem pretty rare. Cheers, H. > > Sometimes it can happen that one pair of the read is flagged as properly paired even if the other read maps to several locations, how is this handled? > > Thank you in advance! > > -- output of sessionInfo(): > > R version 2.15.0 (2012-03-30) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.iso885915 LC_NUMERIC=C > [3] LC_TIME=en_US.iso885915 LC_COLLATE=en_US.iso885915 > [5] LC_MONETARY=en_US.iso885915 LC_MESSAGES=en_US.iso885915 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicRanges_1.8.3 IRanges_1.14.2 BiocGenerics_0.2.0 > [4] plyr_1.6 stringr_0.6 BiocInstaller_1.4.4 > > loaded via a namespace (and not attached): > [1] stats4_2.15.0 tools_2.15.0 > > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT

Login before adding your answer.

Traffic: 738 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6