GenomicAlignments and QNAME collision

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

You're welcome. Let us know if you run into other problems. Valerie On 06/01/2014 10:01 AM, Stefano Calza wrote: > Great! > > I would like to thank Valerie and all Bioconductor developers. > > Stefano > > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: > Trimming the qname is implemented in Rsamtools 1.17.18 and should be > available via biocLite() ~ noon tomorrow. > > 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the > BamFile class. Currently the trimming is implemented for mate pairing > only, not just reading records from a bam. > > Unique qnames aren't a problem in this sample file - just using it to > demonstrate the output. > > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what="qname") > > ## no trimming > bf <- BamFile(fl, asMates=TRUE) >>> scanBam(bf, param=param)[[1]]$qname[1:3] >> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >> [3] "EAS219_FC30151:7:51:1429:1043" > > ## trim prefix > qnamePrefixEnd(bf) <- "_" >>> scanBam(bf, param=param)[[1]]$qname[1:3] >> [1] "61:4:143:69:578" "61:4:143:69:578" >> "FC30151:7:51:1429:1043" > > ## trim prefix and suffix > qnameSuffixStart(bf) <- ":" >>> scanBam(bf, param=param)[[1]]$qname[1:3] >> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" > > > Valerie > > > On 05/08/14 12:17, Stefano Calza wrote: >> Yes, I say that would be easier to use than regexp >> >> Stefano >> >> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>> Thanks for the SRA tips. >>> >>> It's starting to look like the modifications are an additional prefix >>> or suffix separated by dot or slash (possibly underscore?). Maybe >>> simply adding an option to trim the QNAME by the pre/post term >>> separated by a given character would be sufficient. This allows >>> flexibilty but prevents unwarranted QNAME mangling. >>> >>> Valerie >>> >>> >>> On 05/08/14 10:56, Stefano Calza wrote: >>>> Right this is how I got some other example. I think it would add the >>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are named >>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by actual >>>> read QNAME. >>>> >>>> Stefano >>>> >>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>> Hi Valerie, >>>>> >>>>> You get something similar from the .sra files that you can download >>>>> from the SRA, if they are paired data. If you use the SRA toolkit to >>>>> convert to fastq (fastq-dump), it will spit out two fastq files, and >>>>> the QNAME in each of the fastq files will be appended with a .1 for >>>>> the first pairs and a .2 for the second pairs. >>>>> >>>>> As an example: >>>>> >>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>> >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>> Thanks Valerie >>>>>> >>>>>> I have got this BAM files from different sources but they cannot be >>>>>> distributed. >>>>>> >>>>>> Up to now I experienced twp different 'patterns' in QNAME. One is the >>>>>> trailing value as we said (/1, /2). Another one is a leading string. >>>>>> Eg. (made up QNAME) >>>>>> >>>>>> SRR1122.12345HTR >>>>>> SRR1123.12345HTR >>>>>> >>>>>> So there must be removed SRR1122 and SRR1123) >>>>>> >>>>>> My little program actually uses a regex substitution, so the user can >>>>>> decide what pattern to edit. This second one though it seems quit >>>>>> unusual. >>>>>> >>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>> correctly the use a pipeline called MapSplice) >>>>>> >>>>>> >>>>>> Regards >>>>>> >>>>>> Stefano >>>>>> >>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>> Hi Stefano, >>>>>>> >>>>>>> No, the current mate-pairing doesn't handle the trailing values. I >>>>>>> will implement this and post back when it's done. >>>>>>> >>>>>>> For reference, where did you download your bam files or what >>>>>>> application outputs QNAMEs in this format? I'd like to have some for >>>>>>> test data. >>>>>>> >>>>>>> >>>>>>> Thanks. >>>>>>> Valerie >>>>>>> >>>>>>> >>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>> Hi everybody >>>>>>>> >>>>>>>> >>>>>>>> I am using GenomicAlignments package to read RNAseq pair-end >>>>>>>> data. The >>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>> asMates=TRUE >>>>>>>> in BamFile, returns 0 mates. >>>>>>>> >>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>> >>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>> >>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>> >>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>> still >>>>>>>> quite a substantial amount of time on big files. >>>>>>>> >>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>> >>>>>>>> Regards >>>>>>>> >>>>>>>> Stefano >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> -- >>>>> James W. MacDonald, M.S. >>>>> Biostatistician >>>>> University of Washington >>>>> Environmental and Occupational Health Sciences >>>>> 4225 Roosevelt Way NE, # 100 >>>>> Seattle WA 98105-6099 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

RNASeq Cancer convert Rsamtools GenomicAlignments RNASeq Cancer convert Rsamtools • 1.4k views

ADD COMMENT • link updated 9.9 years ago by Stefano Calza ▴ 90 • written 9.9 years ago by Valerie Obenchain ★ 6.8k

0

Entering edit mode

Stefano Calza ▴ 90

@stefano-calza-5062

Last seen 9.6 years ago

Italy

Valerie, just a little note. In case the character in qnameSuffixEnd does not exist in QNAME (i.e. you specify the wrong character...) scanBam (I tested that) hangs forever. I guess there should be a test on the very first read: if the character is not in the string then report error... Stefano On 06/02/2014 05:34 PM, Valerie Obenchain wrote: > You're welcome. Let us know if you run into other problems. > > Valerie > > > On 06/01/2014 10:01 AM, Stefano Calza wrote: > > Great! > > > > I would like to thank Valerie and all Bioconductor developers. > > > > Stefano > > > > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: >> Trimming the qname is implemented in Rsamtools 1.17.18 and should be >> available via biocLite() ~ noon tomorrow. >> >> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the >> BamFile class. Currently the trimming is implemented for mate pairing >> only, not just reading records from a bam. >> >> Unique qnames aren't a problem in this sample file - just using it to >> demonstrate the output. >> >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> param <- ScanBamParam(what="qname") >> >> ## no trimming >> bf <- BamFile(fl, asMates=TRUE) >>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >>> [3] "EAS219_FC30151:7:51:1429:1043" >> >> ## trim prefix >> qnamePrefixEnd(bf) <- "_" >>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>> [1] "61:4:143:69:578" "61:4:143:69:578" >>> "FC30151:7:51:1429:1043" >> >> ## trim prefix and suffix >> qnameSuffixStart(bf) <- ":" >>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" >> >> >> Valerie >> >> >> On 05/08/14 12:17, Stefano Calza wrote: >>> Yes, I say that would be easier to use than regexp >>> >>> Stefano >>> >>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>>> Thanks for the SRA tips. >>>> >>>> It's starting to look like the modifications are an additional prefix >>>> or suffix separated by dot or slash (possibly underscore?). Maybe >>>> simply adding an option to trim the QNAME by the pre/post term >>>> separated by a given character would be sufficient. This allows >>>> flexibilty but prevents unwarranted QNAME mangling. >>>> >>>> Valerie >>>> >>>> >>>> On 05/08/14 10:56, Stefano Calza wrote: >>>>> Right this is how I got some other example. I think it would add the >>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are >>>>> named >>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by >>>>> actual >>>>> read QNAME. >>>>> >>>>> Stefano >>>>> >>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>>> Hi Valerie, >>>>>> >>>>>> You get something similar from the .sra files that you can download >>>>>> from the SRA, if they are paired data. If you use the SRA toolkit to >>>>>> convert to fastq (fastq-dump), it will spit out two fastq files, and >>>>>> the QNAME in each of the fastq files will be appended with a .1 for >>>>>> the first pairs and a .2 for the second pairs. >>>>>> >>>>>> As an example: >>>>>> >>>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>> >>>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>>> Thanks Valerie >>>>>>> >>>>>>> I have got this BAM files from different sources but they cannot be >>>>>>> distributed. >>>>>>> >>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One >>>>>>> is the >>>>>>> trailing value as we said (/1, /2). Another one is a leading >>>>>>> string. >>>>>>> Eg. (made up QNAME) >>>>>>> >>>>>>> SRR1122.12345HTR >>>>>>> SRR1123.12345HTR >>>>>>> >>>>>>> So there must be removed SRR1122 and SRR1123) >>>>>>> >>>>>>> My little program actually uses a regex substitution, so the >>>>>>> user can >>>>>>> decide what pattern to edit. This second one though it seems quit >>>>>>> unusual. >>>>>>> >>>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>>> correctly the use a pipeline called MapSplice) >>>>>>> >>>>>>> >>>>>>> Regards >>>>>>> >>>>>>> Stefano >>>>>>> >>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>>> Hi Stefano, >>>>>>>> >>>>>>>> No, the current mate-pairing doesn't handle the trailing values. I >>>>>>>> will implement this and post back when it's done. >>>>>>>> >>>>>>>> For reference, where did you download your bam files or what >>>>>>>> application outputs QNAMEs in this format? I'd like to have >>>>>>>> some for >>>>>>>> test data. >>>>>>>> >>>>>>>> >>>>>>>> Thanks. >>>>>>>> Valerie >>>>>>>> >>>>>>>> >>>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>>> Hi everybody >>>>>>>>> >>>>>>>>> >>>>>>>>> I am using GenomicAlignments package to read RNAseq pair-end >>>>>>>>> data. The >>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>>> asMates=TRUE >>>>>>>>> in BamFile, returns 0 mates. >>>>>>>>> >>>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>>> >>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>>> >>>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>>> >>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>>> still >>>>>>>>> quite a substantial amount of time on big files. >>>>>>>>> >>>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>>> >>>>>>>>> Regards >>>>>>>>> >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> -- >>>>>> James W. MacDonald, M.S. >>>>>> Biostatistician >>>>>> University of Washington >>>>>> Environmental and Occupational Health Sciences >>>>>> 4225 Roosevelt Way NE, # 100 >>>>>> Seattle WA 98105-6099 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD COMMENT • link 9.9 years ago Stefano Calza ▴ 90

0

Entering edit mode

I can't reproduce this problem. When a prefix/suffix is not found in the qname no error is thrown - it just doesn't do any trimming. I've clarified this in the documentation in 1.17.22. With Rsamtools 1.17.21: fl <- system.file("extdata", "ex1.bam", package="Rsamtools") ## original qnames: bfm <- BamFile(fl, asMates=TRUE) scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] > [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" ## bogus prefix: bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="*") scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] > [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" ## bogus suffix: bfm <- BamFile(fl, asMates=TRUE, qnameSuffixStart="*") scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] > [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" Can you provide a reproducible example? Maybe you were calling scanBam() differently? Valerie On 06/03/2014 03:43 AM, Stefano Calza wrote: > Valerie, > > just a little note. In case the character in qnameSuffixEnd does not > exist in QNAME (i.e. you specify the wrong character...) scanBam (I > tested that) hangs forever. > > I guess there should be a test on the very first read: if the character > is not in the string then report error... > > > Stefano > > On 06/02/2014 05:34 PM, Valerie Obenchain wrote: >> You're welcome. Let us know if you run into other problems. >> >> Valerie >> >> >> On 06/01/2014 10:01 AM, Stefano Calza wrote: >> > Great! >> > >> > I would like to thank Valerie and all Bioconductor developers. >> > >> > Stefano >> > >> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: >>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be >>> available via biocLite() ~ noon tomorrow. >>> >>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the >>> BamFile class. Currently the trimming is implemented for mate pairing >>> only, not just reading records from a bam. >>> >>> Unique qnames aren't a problem in this sample file - just using it to >>> demonstrate the output. >>> >>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >>> param <- ScanBamParam(what="qname") >>> >>> ## no trimming >>> bf <- BamFile(fl, asMates=TRUE) >>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >>>> [3] "EAS219_FC30151:7:51:1429:1043" >>> >>> ## trim prefix >>> qnamePrefixEnd(bf) <- "_" >>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>> [1] "61:4:143:69:578" "61:4:143:69:578" >>>> "FC30151:7:51:1429:1043" >>> >>> ## trim prefix and suffix >>> qnameSuffixStart(bf) <- ":" >>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" >>> >>> >>> Valerie >>> >>> >>> On 05/08/14 12:17, Stefano Calza wrote: >>>> Yes, I say that would be easier to use than regexp >>>> >>>> Stefano >>>> >>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>>>> Thanks for the SRA tips. >>>>> >>>>> It's starting to look like the modifications are an additional prefix >>>>> or suffix separated by dot or slash (possibly underscore?). Maybe >>>>> simply adding an option to trim the QNAME by the pre/post term >>>>> separated by a given character would be sufficient. This allows >>>>> flexibilty but prevents unwarranted QNAME mangling. >>>>> >>>>> Valerie >>>>> >>>>> >>>>> On 05/08/14 10:56, Stefano Calza wrote: >>>>>> Right this is how I got some other example. I think it would add the >>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are >>>>>> named >>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by >>>>>> actual >>>>>> read QNAME. >>>>>> >>>>>> Stefano >>>>>> >>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>>>> Hi Valerie, >>>>>>> >>>>>>> You get something similar from the .sra files that you can download >>>>>>> from the SRA, if they are paired data. If you use the SRA toolkit to >>>>>>> convert to fastq (fastq-dump), it will spit out two fastq files, and >>>>>>> the QNAME in each of the fastq files will be appended with a .1 for >>>>>>> the first pairs and a .2 for the second pairs. >>>>>>> >>>>>>> As an example: >>>>>>> >>>>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>> >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>>>> Thanks Valerie >>>>>>>> >>>>>>>> I have got this BAM files from different sources but they cannot be >>>>>>>> distributed. >>>>>>>> >>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One >>>>>>>> is the >>>>>>>> trailing value as we said (/1, /2). Another one is a leading >>>>>>>> string. >>>>>>>> Eg. (made up QNAME) >>>>>>>> >>>>>>>> SRR1122.12345HTR >>>>>>>> SRR1123.12345HTR >>>>>>>> >>>>>>>> So there must be removed SRR1122 and SRR1123) >>>>>>>> >>>>>>>> My little program actually uses a regex substitution, so the >>>>>>>> user can >>>>>>>> decide what pattern to edit. This second one though it seems quit >>>>>>>> unusual. >>>>>>>> >>>>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>>>> correctly the use a pipeline called MapSplice) >>>>>>>> >>>>>>>> >>>>>>>> Regards >>>>>>>> >>>>>>>> Stefano >>>>>>>> >>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>>>> Hi Stefano, >>>>>>>>> >>>>>>>>> No, the current mate-pairing doesn't handle the trailing values. I >>>>>>>>> will implement this and post back when it's done. >>>>>>>>> >>>>>>>>> For reference, where did you download your bam files or what >>>>>>>>> application outputs QNAMEs in this format? I'd like to have >>>>>>>>> some for >>>>>>>>> test data. >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks. >>>>>>>>> Valerie >>>>>>>>> >>>>>>>>> >>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>>>> Hi everybody >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair- end >>>>>>>>>> data. The >>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>>>> asMates=TRUE >>>>>>>>>> in BamFile, returns 0 mates. >>>>>>>>>> >>>>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>>>> >>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>>>> >>>>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>>>> >>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>>>> still >>>>>>>>>> quite a substantial amount of time on big files. >>>>>>>>>> >>>>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>>>> >>>>>>>>>> Regards >>>>>>>>>> >>>>>>>>>> Stefano >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: >>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> -- >>>>>>> James W. MacDonald, M.S. >>>>>>> Biostatistician >>>>>>> University of Washington >>>>>>> Environmental and Occupational Health Sciences >>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>> Seattle WA 98105-6099 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD REPLY • link 9.9 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Dear Valerie it might be the string I used: qnameSuffixStart="\\" (while it should have been "/") Being the escape character I guess it has been messing up the code. Stefano On 06/03/2014 05:47 PM, Valerie Obenchain wrote: > I can't reproduce this problem. > > When a prefix/suffix is not found in the qname no error is thrown - it > just doesn't do any trimming. I've clarified this in the documentation > in 1.17.22. > > With Rsamtools 1.17.21: > > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > > ## original qnames: > bfm <- BamFile(fl, asMates=TRUE) > scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" > > ## bogus prefix: > bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="*") > scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" > > ## bogus suffix: > bfm <- BamFile(fl, asMates=TRUE, qnameSuffixStart="*") > scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" > > Can you provide a reproducible example? Maybe you were calling > scanBam() differently? > > Valerie > > On 06/03/2014 03:43 AM, Stefano Calza wrote: >> Valerie, >> >> just a little note. In case the character in qnameSuffixEnd does not >> exist in QNAME (i.e. you specify the wrong character...) scanBam (I >> tested that) hangs forever. >> >> I guess there should be a test on the very first read: if the character >> is not in the string then report error... >> >> >> Stefano >> >> On 06/02/2014 05:34 PM, Valerie Obenchain wrote: >>> You're welcome. Let us know if you run into other problems. >>> >>> Valerie >>> >>> >>> On 06/01/2014 10:01 AM, Stefano Calza wrote: >>> > Great! >>> > >>> > I would like to thank Valerie and all Bioconductor developers. >>> > >>> > Stefano >>> > >>> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: >>>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be >>>> available via biocLite() ~ noon tomorrow. >>>> >>>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the >>>> BamFile class. Currently the trimming is implemented for mate pairing >>>> only, not just reading records from a bam. >>>> >>>> Unique qnames aren't a problem in this sample file - just using it to >>>> demonstrate the output. >>>> >>>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >>>> param <- ScanBamParam(what="qname") >>>> >>>> ## no trimming >>>> bf <- BamFile(fl, asMates=TRUE) >>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >>>>> [3] "EAS219_FC30151:7:51:1429:1043" >>>> >>>> ## trim prefix >>>> qnamePrefixEnd(bf) <- "_" >>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>> [1] "61:4:143:69:578" "61:4:143:69:578" >>>>> "FC30151:7:51:1429:1043" >>>> >>>> ## trim prefix and suffix >>>> qnameSuffixStart(bf) <- ":" >>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" >>>> >>>> >>>> Valerie >>>> >>>> >>>> On 05/08/14 12:17, Stefano Calza wrote: >>>>> Yes, I say that would be easier to use than regexp >>>>> >>>>> Stefano >>>>> >>>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>>>>> Thanks for the SRA tips. >>>>>> >>>>>> It's starting to look like the modifications are an additional >>>>>> prefix >>>>>> or suffix separated by dot or slash (possibly underscore?). Maybe >>>>>> simply adding an option to trim the QNAME by the pre/post term >>>>>> separated by a given character would be sufficient. This allows >>>>>> flexibilty but prevents unwarranted QNAME mangling. >>>>>> >>>>>> Valerie >>>>>> >>>>>> >>>>>> On 05/08/14 10:56, Stefano Calza wrote: >>>>>>> Right this is how I got some other example. I think it would add >>>>>>> the >>>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are >>>>>>> named >>>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by >>>>>>> actual >>>>>>> read QNAME. >>>>>>> >>>>>>> Stefano >>>>>>> >>>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>>>>> Hi Valerie, >>>>>>>> >>>>>>>> You get something similar from the .sra files that you can >>>>>>>> download >>>>>>>> from the SRA, if they are paired data. If you use the SRA >>>>>>>> toolkit to >>>>>>>> convert to fastq (fastq-dump), it will spit out two fastq >>>>>>>> files, and >>>>>>>> the QNAME in each of the fastq files will be appended with a .1 >>>>>>>> for >>>>>>>> the first pairs and a .2 for the second pairs. >>>>>>>> >>>>>>>> As an example: >>>>>>>> >>>>>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>>> >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Jim >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>>>>> Thanks Valerie >>>>>>>>> >>>>>>>>> I have got this BAM files from different sources but they >>>>>>>>> cannot be >>>>>>>>> distributed. >>>>>>>>> >>>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One >>>>>>>>> is the >>>>>>>>> trailing value as we said (/1, /2). Another one is a leading >>>>>>>>> string. >>>>>>>>> Eg. (made up QNAME) >>>>>>>>> >>>>>>>>> SRR1122.12345HTR >>>>>>>>> SRR1123.12345HTR >>>>>>>>> >>>>>>>>> So there must be removed SRR1122 and SRR1123) >>>>>>>>> >>>>>>>>> My little program actually uses a regex substitution, so the >>>>>>>>> user can >>>>>>>>> decide what pattern to edit. This second one though it seems quit >>>>>>>>> unusual. >>>>>>>>> >>>>>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>>>>> correctly the use a pipeline called MapSplice) >>>>>>>>> >>>>>>>>> >>>>>>>>> Regards >>>>>>>>> >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>>>>> Hi Stefano, >>>>>>>>>> >>>>>>>>>> No, the current mate-pairing doesn't handle the trailing >>>>>>>>>> values. I >>>>>>>>>> will implement this and post back when it's done. >>>>>>>>>> >>>>>>>>>> For reference, where did you download your bam files or what >>>>>>>>>> application outputs QNAMEs in this format? I'd like to have >>>>>>>>>> some for >>>>>>>>>> test data. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks. >>>>>>>>>> Valerie >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>>>>> Hi everybody >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair- end >>>>>>>>>>> data. The >>>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>>>>> asMates=TRUE >>>>>>>>>>> in BamFile, returns 0 mates. >>>>>>>>>>> >>>>>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>>>>> >>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>>>>> >>>>>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>>>>> >>>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>>>>> still >>>>>>>>>>> quite a substantial amount of time on big files. >>>>>>>>>>> >>>>>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>>>>> >>>>>>>>>>> Regards >>>>>>>>>>> >>>>>>>>>>> Stefano >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>> Search the archives: >>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>>> -- >>>>>>>> James W. MacDonald, M.S. >>>>>>>> Biostatistician >>>>>>>> University of Washington >>>>>>>> Environmental and Occupational Health Sciences >>>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>>> Seattle WA 98105-6099 >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > >

ADD REPLY • link 9.9 years ago Stefano Calza ▴ 90

0

Entering edit mode

When I use double escape the code doesn't hang it just doesn't trim. > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="\\") > scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" Using escape with any (?) other character gives an error: > bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="\_") Error: '\_' is an unrecognized escape in character string starting ""\_" I'm not sure why the code is hanging for you with the double escape. If you have a reproducible example you can share I can take a look. Valerie On 06/04/2014 12:11 AM, Stefano Calza wrote: > Dear Valerie > > it might be the string I used: qnameSuffixStart="\\" (while it should > have been "/") > > Being the escape character I guess it has been messing up the code. > > Stefano > > On 06/03/2014 05:47 PM, Valerie Obenchain wrote: >> I can't reproduce this problem. >> >> When a prefix/suffix is not found in the qname no error is thrown - it >> just doesn't do any trimming. I've clarified this in the documentation >> in 1.17.22. >> >> With Rsamtools 1.17.21: >> >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> >> ## original qnames: >> bfm <- BamFile(fl, asMates=TRUE) >> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >> >> ## bogus prefix: >> bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="*") >> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >> >> ## bogus suffix: >> bfm <- BamFile(fl, asMates=TRUE, qnameSuffixStart="*") >> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >> >> Can you provide a reproducible example? Maybe you were calling >> scanBam() differently? >> >> Valerie >> >> On 06/03/2014 03:43 AM, Stefano Calza wrote: >>> Valerie, >>> >>> just a little note. In case the character in qnameSuffixEnd does not >>> exist in QNAME (i.e. you specify the wrong character...) scanBam (I >>> tested that) hangs forever. >>> >>> I guess there should be a test on the very first read: if the character >>> is not in the string then report error... >>> >>> >>> Stefano >>> >>> On 06/02/2014 05:34 PM, Valerie Obenchain wrote: >>>> You're welcome. Let us know if you run into other problems. >>>> >>>> Valerie >>>> >>>> >>>> On 06/01/2014 10:01 AM, Stefano Calza wrote: >>>> > Great! >>>> > >>>> > I would like to thank Valerie and all Bioconductor developers. >>>> > >>>> > Stefano >>>> > >>>> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: >>>>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be >>>>> available via biocLite() ~ noon tomorrow. >>>>> >>>>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the >>>>> BamFile class. Currently the trimming is implemented for mate pairing >>>>> only, not just reading records from a bam. >>>>> >>>>> Unique qnames aren't a problem in this sample file - just using it to >>>>> demonstrate the output. >>>>> >>>>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >>>>> param <- ScanBamParam(what="qname") >>>>> >>>>> ## no trimming >>>>> bf <- BamFile(fl, asMates=TRUE) >>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >>>>>> [3] "EAS219_FC30151:7:51:1429:1043" >>>>> >>>>> ## trim prefix >>>>> qnamePrefixEnd(bf) <- "_" >>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>>> [1] "61:4:143:69:578" "61:4:143:69:578" >>>>>> "FC30151:7:51:1429:1043" >>>>> >>>>> ## trim prefix and suffix >>>>> qnameSuffixStart(bf) <- ":" >>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>>> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" >>>>> >>>>> >>>>> Valerie >>>>> >>>>> >>>>> On 05/08/14 12:17, Stefano Calza wrote: >>>>>> Yes, I say that would be easier to use than regexp >>>>>> >>>>>> Stefano >>>>>> >>>>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>>>>>> Thanks for the SRA tips. >>>>>>> >>>>>>> It's starting to look like the modifications are an additional >>>>>>> prefix >>>>>>> or suffix separated by dot or slash (possibly underscore?). Maybe >>>>>>> simply adding an option to trim the QNAME by the pre/post term >>>>>>> separated by a given character would be sufficient. This allows >>>>>>> flexibilty but prevents unwarranted QNAME mangling. >>>>>>> >>>>>>> Valerie >>>>>>> >>>>>>> >>>>>>> On 05/08/14 10:56, Stefano Calza wrote: >>>>>>>> Right this is how I got some other example. I think it would add >>>>>>>> the >>>>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are >>>>>>>> named >>>>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by >>>>>>>> actual >>>>>>>> read QNAME. >>>>>>>> >>>>>>>> Stefano >>>>>>>> >>>>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>>>>>> Hi Valerie, >>>>>>>>> >>>>>>>>> You get something similar from the .sra files that you can >>>>>>>>> download >>>>>>>>> from the SRA, if they are paired data. If you use the SRA >>>>>>>>> toolkit to >>>>>>>>> convert to fastq (fastq-dump), it will spit out two fastq >>>>>>>>> files, and >>>>>>>>> the QNAME in each of the fastq files will be appended with a .1 >>>>>>>>> for >>>>>>>>> the first pairs and a .2 for the second pairs. >>>>>>>>> >>>>>>>>> As an example: >>>>>>>>> >>>>>>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>>>> >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> >>>>>>>>> Jim >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>>>>>> Thanks Valerie >>>>>>>>>> >>>>>>>>>> I have got this BAM files from different sources but they >>>>>>>>>> cannot be >>>>>>>>>> distributed. >>>>>>>>>> >>>>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One >>>>>>>>>> is the >>>>>>>>>> trailing value as we said (/1, /2). Another one is a leading >>>>>>>>>> string. >>>>>>>>>> Eg. (made up QNAME) >>>>>>>>>> >>>>>>>>>> SRR1122.12345HTR >>>>>>>>>> SRR1123.12345HTR >>>>>>>>>> >>>>>>>>>> So there must be removed SRR1122 and SRR1123) >>>>>>>>>> >>>>>>>>>> My little program actually uses a regex substitution, so the >>>>>>>>>> user can >>>>>>>>>> decide what pattern to edit. This second one though it seems quit >>>>>>>>>> unusual. >>>>>>>>>> >>>>>>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>>>>>> correctly the use a pipeline called MapSplice) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Regards >>>>>>>>>> >>>>>>>>>> Stefano >>>>>>>>>> >>>>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>>>>>> Hi Stefano, >>>>>>>>>>> >>>>>>>>>>> No, the current mate-pairing doesn't handle the trailing >>>>>>>>>>> values. I >>>>>>>>>>> will implement this and post back when it's done. >>>>>>>>>>> >>>>>>>>>>> For reference, where did you download your bam files or what >>>>>>>>>>> application outputs QNAMEs in this format? I'd like to have >>>>>>>>>>> some for >>>>>>>>>>> test data. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Thanks. >>>>>>>>>>> Valerie >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>>>>>> Hi everybody >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair- end >>>>>>>>>>>> data. The >>>>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>>>>>> asMates=TRUE >>>>>>>>>>>> in BamFile, returns 0 mates. >>>>>>>>>>>> >>>>>>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>>>>>> >>>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>>>>>> >>>>>>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>>>>>> >>>>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>>>>>> still >>>>>>>>>>>> quite a substantial amount of time on big files. >>>>>>>>>>>> >>>>>>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>>>>>> >>>>>>>>>>>> Regards >>>>>>>>>>>> >>>>>>>>>>>> Stefano >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>>> Search the archives: >>>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: >>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>> >>>>>>>>> -- >>>>>>>>> James W. MacDonald, M.S. >>>>>>>>> Biostatistician >>>>>>>>> University of Washington >>>>>>>>> Environmental and Occupational Health Sciences >>>>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>>>> Seattle WA 98105-6099 >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >> >> > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD REPLY • link 9.9 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.