Question

The ambiguous alignment examples in findMateAlignment don't have to be dumped

0

Entering edit mode

yun YAN ▴ 20

@yun-yan-6355

Last seen 6.4 years ago

Canada

Hi Hervé Pagès and other maintainers in Bioconductor, Thanks for your package instruction ( http://140.107.3.20/packages/2.14/bioc/manuals/GenomicAlignments/man/G enomicAlignments.pdf) so that I could understand the how does it work. Here you mentioned the dumped alignments but I was wondering whether they should be dumped? seqnames strand cigar qwidth start end > <rle> <rle> <character> <integer> <integer> <integer> > SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd mate/read2 > ... > SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd mate/read2 > ... > SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 ## 1st mate/read1 > ... > SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 ## 1st mate/read1 > ... For them you mentioned: pairs could be rec(1) <-> rec(3) and rec(2) <-> > rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is > no way to disambiguate! I agree the two pairing strategies above are ambiguous in light of codes/computation. But it is fact that rec(1) and rec(2) are exactly same, thus these two alignments are actually one type of multiple alignment, though wired but potentially useful. The factor that leads to the multiple mappable positions of one mate of the fragment and fixed position of the other mate could probably: a> the reference sequence similarity between these two neighboring regions together with different splicing isoforms. Here first mate could have 2 mapped positions. # is exon while = is intron. 2nd mate: ---- ---- 1st mate: ---- ---- 1st mate: ---- ---- isoform1: ==####======#####==========####==####=== isoform2: ==######====#####==========####==####=== b> or just aligner artifact, etc. Therefore I cannot agree your package dumps those alignments. It needs better function name, at least, to retrieve them. Regards, Yun [[alternative HTML version deleted]]

Alignment Alignment • 1.5k views

ADD COMMENT • link updated 11.9 years ago by Hervé Pagès 16k • written 11.9 years ago by yun YAN ▴ 20

score 0 · Answer 1 · 2014-01-23

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 5 days ago

Seattle, WA, United States

Hi Yun, On 01/23/2014 12:56 PM, yun YAN wrote: > Hi Hervé Pagès and other maintainers in Bioconductor, > > Thanks for your package instruction > (http://140.107.3.20/packages/2.14/bioc/manuals/GenomicAlignments/ma n/GenomicAlignments.pdf) > so that I could understand the how does it work. Here you mentioned the > dumped alignments but I was wondering whether they should be dumped? > > seqnames strand cigar qwidth start end > <rle> <rle> <character> <integer> <integer> <integer> > SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd > mate/read2 ... > SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd > mate/read2 ... > SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 ## 1st > mate/read1 ... > SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 ## 1st > mate/read1 ... > > > For them you mentioned: > > pairs could be rec(1) <-> rec(3) and rec(2) <-> > rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> rec(3). > There is no way to disambiguate! > > > I agree the two pairing strategies above are ambiguous in light of > codes/computation. But it is fact that rec(1) and rec(2) are exactly > same, Yes in that case. But it's not always the case. Maybe I should put an example in this man page (i.e. the man page for findMateAlignment) where the 2 records are not the same. Note that in order to decide whether they are exactly the same or not, one would need to look at all the fixed fields in the 2 records plus all the possible tags (there can be an arbitrary number of them). There is a computational cost associated to this and I don't want to impose that cost by default just to save a tiny fraction of ambiguous pairs. Not worth it. My understanding is that it's not even possible at the moment to load all the tags with Rsamtools (unless you already know the tags but I'm not aware of an easy way to discover that). > thusthese two alignments are actually one type of multiple > alignment, though wired but potentially useful. The factor that leads to > the multiple mappable positions of one mate of the fragment and fixed > position of the other mate could probably: > a> the reference sequence similarity between these > two neighboring regions together with different splicing isoforms. Here > first mate could have 2 mapped positions. # is exon while = is intron. > > 2nd mate: ---- ---- > 1st mate: ---- ---- > 1st mate: ---- ---- > isoform1: ==####======#####==========####==####=== > isoform2: ==######====#####==========####==####=== > > b> or just aligner artifact, etc. Therefore I cannot agree your package > dumps those alignments. If you want to keep the ambiguous alignments, you can use readGAlignmentsList() instead of readGAlignmentPairs(). See ?readGAlignmentsList for the details. > It needs better function name, at least, to > retrieve them. The utility to retrieve them is called getDumpedAlignments(). What do you propose? Thanks, H. > > Regards, > Yun > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 11.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Hervé, Thanks for your tips on `readGAlignmentsList`. Unlike `readGAlignmentPairs`, it is not compatible with `first`, `last`, `left`, `right` function with feedbacks: unable to find an inherited method for function first for signature > "GAlignmentsList" I'd better speak out that the goal of my sub-modules here is to get the reads coverage on strand(+)/(-) separately as my RNA-seq library is strand-specific, therefore the small fraction of ambiguous reads are better not dumped. (BTW you are right the fraction is too small, 1340/13 million alignments in my test case, to deserve computational resources). bamData <- readGAlignmentsList("my_alignment.bam", ...) signalOnPositiveStrand <- coverage(grglist(left(bamData), drop.D.range=T)) The psudo-code would make it in one-step if my understanding were right together with the compatibility of `readGAlignmentsList` and `left`. **Is there any other functions or methods that could help me to filter the first/second mate when processing the readGAlignmentsList object or dumped alignments retrieved by getDumpedAlignments?** Like this: firstMate <- bamData[mates(bamData) == 1, ] secondMate <- bamData[mates(bamData) == 2, ] firstMate <- readsRetrived[mates(readsRetrived) == 1, ] secondMate <- readsRetrived[mates(readsRetrived) == 2, ] I do have other three strategies: 1> back to root, the `scanBam` function with the following parameters. But it is slow (as a result some guys prefer ShortRead package and python pysam package), and returns a list, the fundamental R class, thus I'd better not recreate wheels to dig out information, construct data table, etc. isFirstMatePrmt <- ScanBamParam(flag = scanBamFlag(isPaired=TRUE, isProperPair=TRUE, isFirstMateRead=TRUE), what=scanBamWhat()) 2> feed R with the first/second mate bam by using samtools. But it would cost extra disk storage and not strongly dependent on R. 3> betray to HTSeq, a python package, the worst choice. Cheers, Yun On Thu, Jan 23, 2014 at 7:25 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Yun, > > > On 01/23/2014 12:56 PM, yun YAN wrote: > >> Hi Hervé Pagès and other maintainers in Bioconductor, >> >> Thanks for your package instruction >> (http://140.107.3.20/packages/2.14/bioc/manuals/GenomicAlignments/man/ >> GenomicAlignments.pdf) >> so that I could understand the how does it work. Here you mentioned the >> dumped alignments but I was wondering whether they should be dumped? >> >> seqnames strand cigar qwidth start end >> <rle> <rle> <character> <integer> <integer> <integer> >> SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd >> mate/read2 ... >> SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd >> mate/read2 ... >> SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 ## 1st >> mate/read1 ... >> SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 ## 1st >> mate/read1 ... >> >> >> For them you mentioned: >> >> pairs could be rec(1) <-> rec(3) and rec(2) <-> >> rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> rec(3). >> There is no way to disambiguate! >> >> >> I agree the two pairing strategies above are ambiguous in light of >> codes/computation. But it is fact that rec(1) and rec(2) are exactly >> same, >> > > Yes in that case. But it's not always the case. Maybe I should put an > example in this man page (i.e. the man page for findMateAlignment) where > the 2 records are not the same. > > Note that in order to decide whether they are exactly the same or not, one > would need to look at all the fixed fields in the 2 records plus > all the possible tags (there can be an arbitrary number of them). > There is a computational cost associated to this and I don't want to > impose that cost by default just to save a tiny fraction of ambiguous > pairs. Not worth it. My understanding is that it's not even possible > at the moment to load all the tags with Rsamtools (unless you already > know the tags but I'm not aware of an easy way to discover that). > > thusthese two alignments are actually one type of multiple >> >> alignment, though wired but potentially useful. The factor that leads to >> the multiple mappable positions of one mate of the fragment and fixed >> position of the other mate could probably: >> a> the reference sequence similarity between these >> two neighboring regions together with different splicing isoforms. Here >> first mate could have 2 mapped positions. # is exon while = is intron. >> >> 2nd mate: ---- ---- >> 1st mate: ---- ---- >> 1st mate: ---- ---- >> isoform1: ==####======#####==========####==####=== >> isoform2: ==######====#####==========####==####=== >> >> b> or just aligner artifact, etc. Therefore I cannot agree your package >> dumps those alignments. >> > > If you want to keep the ambiguous alignments, you can use > readGAlignmentsList() instead of readGAlignmentPairs(). > See ?readGAlignmentsList for the details. > > > It needs better function name, at least, to >> retrieve them. >> > > The utility to retrieve them is called getDumpedAlignments(). What do > you propose? > > Thanks, > H. > > >> Regards, >> Yun >> >> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- YAN Yun, Master of Science RNA Lab, 6101 Molecular Biology and Biochemistry Department College of Life Sciences, Wuhan University Wuhan, Hubei, P.R.China [[alternative HTML version deleted]]

ADD REPLY • link 11.9 years ago yun YAN ▴ 20

0

Entering edit mode

Hi Yun, On 01/24/2014 02:01 PM, yun YAN wrote: > Hi Herv?, > Thanks for your tips on `readGAlignmentsList`. Unlike > `readGAlignmentPairs`, it is not compatible with `first`, `last`, > `left`, `right` function with feedbacks: > > unable to find an inherited method for function ?first? for > signature ?"GAlignmentsList"? > > I'd better speak out that the goal of my sub-modules here is to get the > reads coverage on strand(+)/(-) separately as my RNA-seq library is > strand-specific, therefore the small fraction of ambiguous reads are > better not dumped. (BTW you are right the fraction is too small, 1340/13 > million alignments in my test case, to deserve computational resources). If you want to compute coverage, you don't need to go thru the pairing process. Just load them in a GAlignments object with readGAlignments() and call coverage() on it. You won't get *exactly* the same coverage vector as if you load the alignments with readGAlignmentPairs() because the latter cannot pair everything. But it will be *very* close. Alternatively, you can load them with readGAlignmentsList() and do coverage(unlist(.)) on your GAlignmentsList object. That should give you exactly the same result as when you load with readGAlignments(). > > bamData <- readGAlignmentsList("my_alignment.bam", ...) > > signalOnPositiveStrand <- coverage(grglist(left(bamData), > drop.D.range=T)) > > > The psudo-code would make it in one-step if my understanding were right > together with the compatibility of `readGAlignmentsList` and `left`. > **Is there any other functions or methods that could help me to filter > the first/second mate when processing the readGAlignmentsList object or > dumped alignments retrieved by getDumpedAlignments?** Like this: > > firstMate <- bamData[mates(bamData) == 1, ] > > secondMate <- bamData[mates(bamData) == 2, ] > > firstMate <- readsRetrived[mates(readsRetrived) == 1, ] > > secondMate <- readsRetrived[mates(readsRetrived) == 2, ] > Like scanBam(), which they rely on, the readGAlignment* functions can take a ScanBamParam object thru their 'param' argument. Using a ScanBamParam object lets you filter the alignments based on strand, 1st/2nd mate, etc... See ?ScanBamParam for more info. > > I do have other three strategies: > 1> back to root, the `scanBam` function with the following parameters. > But it is slow (as a result some guys prefer ShortRead package and > python pysam package), and returns a list, the fundamental R class, thus > I'd better not recreate wheels to dig out information, construct data > table, etc. > > isFirstMatePrmt <- ScanBamParam(flag = scanBamFlag(isPaired=TRUE, > isProperPair=TRUE, isFirstMateRead=TRUE), what=scanBamWhat()) So you know about ScanBamParam() and its 'flag' arg. Now use that with any of the readGAlignment* functions. Note that, when using these function, some flags are implicit e.g. readGAlignmentPairs() automatically sets the isPaired flag to TRUE for you. Cheers, H. > > 2> feed R with the first/second mate bam by using samtools. But it would > cost extra disk storage and not strongly dependent on R. > 3> betray to HTSeq, a python package, the worst choice. > > Cheers, > > Yun > > On Thu, Jan 23, 2014 at 7:25 PM, Hervé Pagès <hpages at="" fhcrc.org=""> <mailto:hpages at="" fhcrc.org="">> wrote: > > Hi Yun, > > > On 01/23/2014 12:56 PM, yun YAN wrote: > > Hi Hervé Pagès and other maintainers in Bioconductor, > > Thanks for your package instruction > (http://140.107.3.20/packages/__2.14/bioc/manuals/__GenomicA lignments/man/__GenomicAlignments.pdf > <http: 140.107.3.20="" packages="" 2.14="" bioc="" manuals="" genomicalign="" ments="" man="" genomicalignments.pdf="">) > so that I could understand the how does it work. Here you > mentioned the > dumped alignments but I was wondering whether they should be dumped? > > seqnames strand cigar qwidth start end > <rle> <rle> <character> <integer> <integer> <integer> > SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd > mate/read2 ... > SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd > mate/read2 ... > SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 ## 1st > mate/read1 ... > SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 ## 1st > mate/read1 ... > > > For them you mentioned: > > pairs could be rec(1) <-> rec(3) and rec(2) <-> > rec(4), or they could be rec(1) <-> rec(4) and rec(2) <-> > rec(3). > There is no way to disambiguate! > > > I agree the two pairing strategies above are ambiguous in light of > codes/computation. But it is fact that rec(1) and rec(2) are exactly > same, > > > Yes in that case. But it's not always the case. Maybe I should put an > example in this man page (i.e. the man page for findMateAlignment) where > the 2 records are not the same. > > Note that in order to decide whether they are exactly the same or > not, one would need to look at all the fixed fields in the 2 records > plus > all the possible tags (there can be an arbitrary number of them). > There is a computational cost associated to this and I don't want to > impose that cost by default just to save a tiny fraction of ambiguous > pairs. Not worth it. My understanding is that it's not even possible > at the moment to load all the tags with Rsamtools (unless you already > know the tags but I'm not aware of an easy way to discover that). > > thusthese two alignments are actually one type of multiple > > alignment, though wired but potentially useful. The factor that > leads to > the multiple mappable positions of one mate of the fragment and > fixed > position of the other mate could probably: > a> the reference sequence similarity between these > two neighboring regions together with different splicing > isoforms. Here > first mate could have 2 mapped positions. # is exon while = is > intron. > > 2nd mate: ---- ---- > 1st mate: ---- ---- > 1st mate: ---- ---- > isoform1: ==####======#####==========###__#==####=== > isoform2: ==######====#####==========###__#==####=== > > b> or just aligner artifact, etc. Therefore I cannot agree your > package > dumps those alignments. > > > If you want to keep the ambiguous alignments, you can use > readGAlignmentsList() instead of readGAlignmentPairs(). > See ?readGAlignmentsList for the details. > > > It needs better function name, at least, to > retrieve them. > > > The utility to retrieve them is called getDumpedAlignments(). What do > you propose? > > Thanks, > H. > > > Regards, > Yun > > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > > > > -- > YAN Yun, Master of Science > RNA Lab, 6101 > Molecular Biology and Biochemistry Department > College of Life Sciences, Wuhan University > Wuhan, Hubei, P.R.China -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.9 years ago Hervé Pagès 16k