GenomicAlignments and QNAME collision
1
0
Entering edit mode
@valerie-obenchain-4275
Last seen 22 months ago
United States
You're welcome. Let us know if you run into other problems. Valerie On 06/01/2014 10:01 AM, Stefano Calza wrote: > Great! > > I would like to thank Valerie and all Bioconductor developers. > > Stefano > > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: > Trimming the qname is implemented in Rsamtools 1.17.18 and should be > available via biocLite() ~ noon tomorrow. > > 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the > BamFile class. Currently the trimming is implemented for mate pairing > only, not just reading records from a bam. > > Unique qnames aren't a problem in this sample file - just using it to > demonstrate the output. > > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > param <- ScanBamParam(what="qname") > > ## no trimming > bf <- BamFile(fl, asMates=TRUE) >>> scanBam(bf, param=param)[[1]]$qname[1:3] >> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >> [3] "EAS219_FC30151:7:51:1429:1043" > > ## trim prefix > qnamePrefixEnd(bf) <- "_" >>> scanBam(bf, param=param)[[1]]$qname[1:3] >> [1] "61:4:143:69:578" "61:4:143:69:578" >> "FC30151:7:51:1429:1043" > > ## trim prefix and suffix > qnameSuffixStart(bf) <- ":" >>> scanBam(bf, param=param)[[1]]$qname[1:3] >> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" > > > Valerie > > > On 05/08/14 12:17, Stefano Calza wrote: >> Yes, I say that would be easier to use than regexp >> >> Stefano >> >> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>> Thanks for the SRA tips. >>> >>> It's starting to look like the modifications are an additional prefix >>> or suffix separated by dot or slash (possibly underscore?). Maybe >>> simply adding an option to trim the QNAME by the pre/post term >>> separated by a given character would be sufficient. This allows >>> flexibilty but prevents unwarranted QNAME mangling. >>> >>> Valerie >>> >>> >>> On 05/08/14 10:56, Stefano Calza wrote: >>>> Right this is how I got some other example. I think it would add the >>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are named >>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by actual >>>> read QNAME. >>>> >>>> Stefano >>>> >>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>> Hi Valerie, >>>>> >>>>> You get something similar from the .sra files that you can download >>>>> from the SRA, if they are paired data. If you use the SRA toolkit to >>>>> convert to fastq (fastq-dump), it will spit out two fastq files, and >>>>> the QNAME in each of the fastq files will be appended with a .1 for >>>>> the first pairs and a .2 for the second pairs. >>>>> >>>>> As an example: >>>>> >>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>> >>>>> >>>>> Best, >>>>> >>>>> Jim >>>>> >>>>> >>>>> >>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>> Thanks Valerie >>>>>> >>>>>> I have got this BAM files from different sources but they cannot be >>>>>> distributed. >>>>>> >>>>>> Up to now I experienced twp different 'patterns' in QNAME. One is the >>>>>> trailing value as we said (/1, /2). Another one is a leading string. >>>>>> Eg. (made up QNAME) >>>>>> >>>>>> SRR1122.12345HTR >>>>>> SRR1123.12345HTR >>>>>> >>>>>> So there must be removed SRR1122 and SRR1123) >>>>>> >>>>>> My little program actually uses a regex substitution, so the user can >>>>>> decide what pattern to edit. This second one though it seems quit >>>>>> unusual. >>>>>> >>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>> correctly the use a pipeline called MapSplice) >>>>>> >>>>>> >>>>>> Regards >>>>>> >>>>>> Stefano >>>>>> >>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>> Hi Stefano, >>>>>>> >>>>>>> No, the current mate-pairing doesn't handle the trailing values. I >>>>>>> will implement this and post back when it's done. >>>>>>> >>>>>>> For reference, where did you download your bam files or what >>>>>>> application outputs QNAMEs in this format? I'd like to have some for >>>>>>> test data. >>>>>>> >>>>>>> >>>>>>> Thanks. >>>>>>> Valerie >>>>>>> >>>>>>> >>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>> Hi everybody >>>>>>>> >>>>>>>> >>>>>>>> I am using GenomicAlignments package to read RNAseq pair-end >>>>>>>> data. The >>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>> asMates=TRUE >>>>>>>> in BamFile, returns 0 mates. >>>>>>>> >>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>> >>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>> >>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>> >>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>> still >>>>>>>> quite a substantial amount of time on big files. >>>>>>>> >>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>> >>>>>>>> Regards >>>>>>>> >>>>>>>> Stefano >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> -- >>>>> James W. MacDonald, M.S. >>>>> Biostatistician >>>>> University of Washington >>>>> Environmental and Occupational Health Sciences >>>>> 4225 Roosevelt Way NE, # 100 >>>>> Seattle WA 98105-6099 >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158
RNASeq Cancer convert Rsamtools GenomicAlignments RNASeq Cancer convert Rsamtools • 835 views
ADD COMMENT
0
Entering edit mode
@stefano-calza-5062
Last seen 7.0 years ago
Italy
Valerie, just a little note. In case the character in qnameSuffixEnd does not exist in QNAME (i.e. you specify the wrong character...) scanBam (I tested that) hangs forever. I guess there should be a test on the very first read: if the character is not in the string then report error... Stefano On 06/02/2014 05:34 PM, Valerie Obenchain wrote: > You're welcome. Let us know if you run into other problems. > > Valerie > > > On 06/01/2014 10:01 AM, Stefano Calza wrote: > > Great! > > > > I would like to thank Valerie and all Bioconductor developers. > > > > Stefano > > > > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: >> Trimming the qname is implemented in Rsamtools 1.17.18 and should be >> available via biocLite() ~ noon tomorrow. >> >> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the >> BamFile class. Currently the trimming is implemented for mate pairing >> only, not just reading records from a bam. >> >> Unique qnames aren't a problem in this sample file - just using it to >> demonstrate the output. >> >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> param <- ScanBamParam(what="qname") >> >> ## no trimming >> bf <- BamFile(fl, asMates=TRUE) >>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >>> [3] "EAS219_FC30151:7:51:1429:1043" >> >> ## trim prefix >> qnamePrefixEnd(bf) <- "_" >>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>> [1] "61:4:143:69:578" "61:4:143:69:578" >>> "FC30151:7:51:1429:1043" >> >> ## trim prefix and suffix >> qnameSuffixStart(bf) <- ":" >>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" >> >> >> Valerie >> >> >> On 05/08/14 12:17, Stefano Calza wrote: >>> Yes, I say that would be easier to use than regexp >>> >>> Stefano >>> >>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>>> Thanks for the SRA tips. >>>> >>>> It's starting to look like the modifications are an additional prefix >>>> or suffix separated by dot or slash (possibly underscore?). Maybe >>>> simply adding an option to trim the QNAME by the pre/post term >>>> separated by a given character would be sufficient. This allows >>>> flexibilty but prevents unwarranted QNAME mangling. >>>> >>>> Valerie >>>> >>>> >>>> On 05/08/14 10:56, Stefano Calza wrote: >>>>> Right this is how I got some other example. I think it would add the >>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are >>>>> named >>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by >>>>> actual >>>>> read QNAME. >>>>> >>>>> Stefano >>>>> >>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>>> Hi Valerie, >>>>>> >>>>>> You get something similar from the .sra files that you can download >>>>>> from the SRA, if they are paired data. If you use the SRA toolkit to >>>>>> convert to fastq (fastq-dump), it will spit out two fastq files, and >>>>>> the QNAME in each of the fastq files will be appended with a .1 for >>>>>> the first pairs and a .2 for the second pairs. >>>>>> >>>>>> As an example: >>>>>> >>>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>> >>>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>>> Thanks Valerie >>>>>>> >>>>>>> I have got this BAM files from different sources but they cannot be >>>>>>> distributed. >>>>>>> >>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One >>>>>>> is the >>>>>>> trailing value as we said (/1, /2). Another one is a leading >>>>>>> string. >>>>>>> Eg. (made up QNAME) >>>>>>> >>>>>>> SRR1122.12345HTR >>>>>>> SRR1123.12345HTR >>>>>>> >>>>>>> So there must be removed SRR1122 and SRR1123) >>>>>>> >>>>>>> My little program actually uses a regex substitution, so the >>>>>>> user can >>>>>>> decide what pattern to edit. This second one though it seems quit >>>>>>> unusual. >>>>>>> >>>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>>> correctly the use a pipeline called MapSplice) >>>>>>> >>>>>>> >>>>>>> Regards >>>>>>> >>>>>>> Stefano >>>>>>> >>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>>> Hi Stefano, >>>>>>>> >>>>>>>> No, the current mate-pairing doesn't handle the trailing values. I >>>>>>>> will implement this and post back when it's done. >>>>>>>> >>>>>>>> For reference, where did you download your bam files or what >>>>>>>> application outputs QNAMEs in this format? I'd like to have >>>>>>>> some for >>>>>>>> test data. >>>>>>>> >>>>>>>> >>>>>>>> Thanks. >>>>>>>> Valerie >>>>>>>> >>>>>>>> >>>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>>> Hi everybody >>>>>>>>> >>>>>>>>> >>>>>>>>> I am using GenomicAlignments package to read RNAseq pair-end >>>>>>>>> data. The >>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>>> asMates=TRUE >>>>>>>>> in BamFile, returns 0 mates. >>>>>>>>> >>>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>>> >>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>>> >>>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>>> >>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>>> still >>>>>>>>> quite a substantial amount of time on big files. >>>>>>>>> >>>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>>> >>>>>>>>> Regards >>>>>>>>> >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> -- >>>>>> James W. MacDonald, M.S. >>>>>> Biostatistician >>>>>> University of Washington >>>>>> Environmental and Occupational Health Sciences >>>>>> 4225 Roosevelt Way NE, # 100 >>>>>> Seattle WA 98105-6099 >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD COMMENT
0
Entering edit mode
I can't reproduce this problem. When a prefix/suffix is not found in the qname no error is thrown - it just doesn't do any trimming. I've clarified this in the documentation in 1.17.22. With Rsamtools 1.17.21: fl <- system.file("extdata", "ex1.bam", package="Rsamtools") ## original qnames: bfm <- BamFile(fl, asMates=TRUE) scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] > [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" ## bogus prefix: bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="*") scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] > [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" ## bogus suffix: bfm <- BamFile(fl, asMates=TRUE, qnameSuffixStart="*") scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] > [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" Can you provide a reproducible example? Maybe you were calling scanBam() differently? Valerie On 06/03/2014 03:43 AM, Stefano Calza wrote: > Valerie, > > just a little note. In case the character in qnameSuffixEnd does not > exist in QNAME (i.e. you specify the wrong character...) scanBam (I > tested that) hangs forever. > > I guess there should be a test on the very first read: if the character > is not in the string then report error... > > > Stefano > > On 06/02/2014 05:34 PM, Valerie Obenchain wrote: >> You're welcome. Let us know if you run into other problems. >> >> Valerie >> >> >> On 06/01/2014 10:01 AM, Stefano Calza wrote: >> > Great! >> > >> > I would like to thank Valerie and all Bioconductor developers. >> > >> > Stefano >> > >> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: >>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be >>> available via biocLite() ~ noon tomorrow. >>> >>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the >>> BamFile class. Currently the trimming is implemented for mate pairing >>> only, not just reading records from a bam. >>> >>> Unique qnames aren't a problem in this sample file - just using it to >>> demonstrate the output. >>> >>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >>> param <- ScanBamParam(what="qname") >>> >>> ## no trimming >>> bf <- BamFile(fl, asMates=TRUE) >>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >>>> [3] "EAS219_FC30151:7:51:1429:1043" >>> >>> ## trim prefix >>> qnamePrefixEnd(bf) <- "_" >>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>> [1] "61:4:143:69:578" "61:4:143:69:578" >>>> "FC30151:7:51:1429:1043" >>> >>> ## trim prefix and suffix >>> qnameSuffixStart(bf) <- ":" >>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" >>> >>> >>> Valerie >>> >>> >>> On 05/08/14 12:17, Stefano Calza wrote: >>>> Yes, I say that would be easier to use than regexp >>>> >>>> Stefano >>>> >>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>>>> Thanks for the SRA tips. >>>>> >>>>> It's starting to look like the modifications are an additional prefix >>>>> or suffix separated by dot or slash (possibly underscore?). Maybe >>>>> simply adding an option to trim the QNAME by the pre/post term >>>>> separated by a given character would be sufficient. This allows >>>>> flexibilty but prevents unwarranted QNAME mangling. >>>>> >>>>> Valerie >>>>> >>>>> >>>>> On 05/08/14 10:56, Stefano Calza wrote: >>>>>> Right this is how I got some other example. I think it would add the >>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are >>>>>> named >>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by >>>>>> actual >>>>>> read QNAME. >>>>>> >>>>>> Stefano >>>>>> >>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>>>> Hi Valerie, >>>>>>> >>>>>>> You get something similar from the .sra files that you can download >>>>>>> from the SRA, if they are paired data. If you use the SRA toolkit to >>>>>>> convert to fastq (fastq-dump), it will spit out two fastq files, and >>>>>>> the QNAME in each of the fastq files will be appended with a .1 for >>>>>>> the first pairs and a .2 for the second pairs. >>>>>>> >>>>>>> As an example: >>>>>>> >>>>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>> >>>>>>> >>>>>>> Best, >>>>>>> >>>>>>> Jim >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>>>> Thanks Valerie >>>>>>>> >>>>>>>> I have got this BAM files from different sources but they cannot be >>>>>>>> distributed. >>>>>>>> >>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One >>>>>>>> is the >>>>>>>> trailing value as we said (/1, /2). Another one is a leading >>>>>>>> string. >>>>>>>> Eg. (made up QNAME) >>>>>>>> >>>>>>>> SRR1122.12345HTR >>>>>>>> SRR1123.12345HTR >>>>>>>> >>>>>>>> So there must be removed SRR1122 and SRR1123) >>>>>>>> >>>>>>>> My little program actually uses a regex substitution, so the >>>>>>>> user can >>>>>>>> decide what pattern to edit. This second one though it seems quit >>>>>>>> unusual. >>>>>>>> >>>>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>>>> correctly the use a pipeline called MapSplice) >>>>>>>> >>>>>>>> >>>>>>>> Regards >>>>>>>> >>>>>>>> Stefano >>>>>>>> >>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>>>> Hi Stefano, >>>>>>>>> >>>>>>>>> No, the current mate-pairing doesn't handle the trailing values. I >>>>>>>>> will implement this and post back when it's done. >>>>>>>>> >>>>>>>>> For reference, where did you download your bam files or what >>>>>>>>> application outputs QNAMEs in this format? I'd like to have >>>>>>>>> some for >>>>>>>>> test data. >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks. >>>>>>>>> Valerie >>>>>>>>> >>>>>>>>> >>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>>>> Hi everybody >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair- end >>>>>>>>>> data. The >>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>>>> asMates=TRUE >>>>>>>>>> in BamFile, returns 0 mates. >>>>>>>>>> >>>>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>>>> >>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>>>> >>>>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>>>> >>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>>>> still >>>>>>>>>> quite a substantial amount of time on big files. >>>>>>>>>> >>>>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>>>> >>>>>>>>>> Regards >>>>>>>>>> >>>>>>>>>> Stefano >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: >>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> -- >>>>>>> James W. MacDonald, M.S. >>>>>>> Biostatistician >>>>>>> University of Washington >>>>>>> Environmental and Occupational Health Sciences >>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>> Seattle WA 98105-6099 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158
ADD REPLY
0
Entering edit mode
Dear Valerie it might be the string I used: qnameSuffixStart="\\" (while it should have been "/") Being the escape character I guess it has been messing up the code. Stefano On 06/03/2014 05:47 PM, Valerie Obenchain wrote: > I can't reproduce this problem. > > When a prefix/suffix is not found in the qname no error is thrown - it > just doesn't do any trimming. I've clarified this in the documentation > in 1.17.22. > > With Rsamtools 1.17.21: > > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > > ## original qnames: > bfm <- BamFile(fl, asMates=TRUE) > scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" > > ## bogus prefix: > bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="*") > scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" > > ## bogus suffix: > bfm <- BamFile(fl, asMates=TRUE, qnameSuffixStart="*") > scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" > > Can you provide a reproducible example? Maybe you were calling > scanBam() differently? > > Valerie > > On 06/03/2014 03:43 AM, Stefano Calza wrote: >> Valerie, >> >> just a little note. In case the character in qnameSuffixEnd does not >> exist in QNAME (i.e. you specify the wrong character...) scanBam (I >> tested that) hangs forever. >> >> I guess there should be a test on the very first read: if the character >> is not in the string then report error... >> >> >> Stefano >> >> On 06/02/2014 05:34 PM, Valerie Obenchain wrote: >>> You're welcome. Let us know if you run into other problems. >>> >>> Valerie >>> >>> >>> On 06/01/2014 10:01 AM, Stefano Calza wrote: >>> > Great! >>> > >>> > I would like to thank Valerie and all Bioconductor developers. >>> > >>> > Stefano >>> > >>> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: >>>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be >>>> available via biocLite() ~ noon tomorrow. >>>> >>>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the >>>> BamFile class. Currently the trimming is implemented for mate pairing >>>> only, not just reading records from a bam. >>>> >>>> Unique qnames aren't a problem in this sample file - just using it to >>>> demonstrate the output. >>>> >>>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >>>> param <- ScanBamParam(what="qname") >>>> >>>> ## no trimming >>>> bf <- BamFile(fl, asMates=TRUE) >>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >>>>> [3] "EAS219_FC30151:7:51:1429:1043" >>>> >>>> ## trim prefix >>>> qnamePrefixEnd(bf) <- "_" >>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>> [1] "61:4:143:69:578" "61:4:143:69:578" >>>>> "FC30151:7:51:1429:1043" >>>> >>>> ## trim prefix and suffix >>>> qnameSuffixStart(bf) <- ":" >>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" >>>> >>>> >>>> Valerie >>>> >>>> >>>> On 05/08/14 12:17, Stefano Calza wrote: >>>>> Yes, I say that would be easier to use than regexp >>>>> >>>>> Stefano >>>>> >>>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>>>>> Thanks for the SRA tips. >>>>>> >>>>>> It's starting to look like the modifications are an additional >>>>>> prefix >>>>>> or suffix separated by dot or slash (possibly underscore?). Maybe >>>>>> simply adding an option to trim the QNAME by the pre/post term >>>>>> separated by a given character would be sufficient. This allows >>>>>> flexibilty but prevents unwarranted QNAME mangling. >>>>>> >>>>>> Valerie >>>>>> >>>>>> >>>>>> On 05/08/14 10:56, Stefano Calza wrote: >>>>>>> Right this is how I got some other example. I think it would add >>>>>>> the >>>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are >>>>>>> named >>>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by >>>>>>> actual >>>>>>> read QNAME. >>>>>>> >>>>>>> Stefano >>>>>>> >>>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>>>>> Hi Valerie, >>>>>>>> >>>>>>>> You get something similar from the .sra files that you can >>>>>>>> download >>>>>>>> from the SRA, if they are paired data. If you use the SRA >>>>>>>> toolkit to >>>>>>>> convert to fastq (fastq-dump), it will spit out two fastq >>>>>>>> files, and >>>>>>>> the QNAME in each of the fastq files will be appended with a .1 >>>>>>>> for >>>>>>>> the first pairs and a .2 for the second pairs. >>>>>>>> >>>>>>>> As an example: >>>>>>>> >>>>>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>>> >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Jim >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>>>>> Thanks Valerie >>>>>>>>> >>>>>>>>> I have got this BAM files from different sources but they >>>>>>>>> cannot be >>>>>>>>> distributed. >>>>>>>>> >>>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One >>>>>>>>> is the >>>>>>>>> trailing value as we said (/1, /2). Another one is a leading >>>>>>>>> string. >>>>>>>>> Eg. (made up QNAME) >>>>>>>>> >>>>>>>>> SRR1122.12345HTR >>>>>>>>> SRR1123.12345HTR >>>>>>>>> >>>>>>>>> So there must be removed SRR1122 and SRR1123) >>>>>>>>> >>>>>>>>> My little program actually uses a regex substitution, so the >>>>>>>>> user can >>>>>>>>> decide what pattern to edit. This second one though it seems quit >>>>>>>>> unusual. >>>>>>>>> >>>>>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>>>>> correctly the use a pipeline called MapSplice) >>>>>>>>> >>>>>>>>> >>>>>>>>> Regards >>>>>>>>> >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>>>>> Hi Stefano, >>>>>>>>>> >>>>>>>>>> No, the current mate-pairing doesn't handle the trailing >>>>>>>>>> values. I >>>>>>>>>> will implement this and post back when it's done. >>>>>>>>>> >>>>>>>>>> For reference, where did you download your bam files or what >>>>>>>>>> application outputs QNAMEs in this format? I'd like to have >>>>>>>>>> some for >>>>>>>>>> test data. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thanks. >>>>>>>>>> Valerie >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>>>>> Hi everybody >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair- end >>>>>>>>>>> data. The >>>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>>>>> asMates=TRUE >>>>>>>>>>> in BamFile, returns 0 mates. >>>>>>>>>>> >>>>>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>>>>> >>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>>>>> >>>>>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>>>>> >>>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>>>>> still >>>>>>>>>>> quite a substantial amount of time on big files. >>>>>>>>>>> >>>>>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>>>>> >>>>>>>>>>> Regards >>>>>>>>>>> >>>>>>>>>>> Stefano >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>> Search the archives: >>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>>> -- >>>>>>>> James W. MacDonald, M.S. >>>>>>>> Biostatistician >>>>>>>> University of Washington >>>>>>>> Environmental and Occupational Health Sciences >>>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>>> Seattle WA 98105-6099 >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> > >
ADD REPLY
0
Entering edit mode
When I use double escape the code doesn't hang it just doesn't trim. > fl <- system.file("extdata", "ex1.bam", package="Rsamtools") > bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="\\") > scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" Using escape with any (?) other character gives an error: > bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="\_") Error: '\_' is an unrecognized escape in character string starting ""\_" I'm not sure why the code is hanging for you with the double escape. If you have a reproducible example you can share I can take a look. Valerie On 06/04/2014 12:11 AM, Stefano Calza wrote: > Dear Valerie > > it might be the string I used: qnameSuffixStart="\\" (while it should > have been "/") > > Being the escape character I guess it has been messing up the code. > > Stefano > > On 06/03/2014 05:47 PM, Valerie Obenchain wrote: >> I can't reproduce this problem. >> >> When a prefix/suffix is not found in the qname no error is thrown - it >> just doesn't do any trimming. I've clarified this in the documentation >> in 1.17.22. >> >> With Rsamtools 1.17.21: >> >> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >> >> ## original qnames: >> bfm <- BamFile(fl, asMates=TRUE) >> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >> >> ## bogus prefix: >> bfm <- BamFile(fl, asMates=TRUE, qnamePrefixEnd="*") >> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >> >> ## bogus suffix: >> bfm <- BamFile(fl, asMates=TRUE, qnameSuffixStart="*") >> scanBam(bfm, param=ScanBamParam(what="qname"))[[1]]$qname[1:2] >>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >> >> Can you provide a reproducible example? Maybe you were calling >> scanBam() differently? >> >> Valerie >> >> On 06/03/2014 03:43 AM, Stefano Calza wrote: >>> Valerie, >>> >>> just a little note. In case the character in qnameSuffixEnd does not >>> exist in QNAME (i.e. you specify the wrong character...) scanBam (I >>> tested that) hangs forever. >>> >>> I guess there should be a test on the very first read: if the character >>> is not in the string then report error... >>> >>> >>> Stefano >>> >>> On 06/02/2014 05:34 PM, Valerie Obenchain wrote: >>>> You're welcome. Let us know if you run into other problems. >>>> >>>> Valerie >>>> >>>> >>>> On 06/01/2014 10:01 AM, Stefano Calza wrote: >>>> > Great! >>>> > >>>> > I would like to thank Valerie and all Bioconductor developers. >>>> > >>>> > Stefano >>>> > >>>> > Il 31/05/2014 21:39, Valerie Obenchain ha scritto: >>>>> Trimming the qname is implemented in Rsamtools 1.17.18 and should be >>>>> available via biocLite() ~ noon tomorrow. >>>>> >>>>> 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the >>>>> BamFile class. Currently the trimming is implemented for mate pairing >>>>> only, not just reading records from a bam. >>>>> >>>>> Unique qnames aren't a problem in this sample file - just using it to >>>>> demonstrate the output. >>>>> >>>>> fl <- system.file("extdata", "ex1.bam", package="Rsamtools") >>>>> param <- ScanBamParam(what="qname") >>>>> >>>>> ## no trimming >>>>> bf <- BamFile(fl, asMates=TRUE) >>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>>> [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" >>>>>> [3] "EAS219_FC30151:7:51:1429:1043" >>>>> >>>>> ## trim prefix >>>>> qnamePrefixEnd(bf) <- "_" >>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>>> [1] "61:4:143:69:578" "61:4:143:69:578" >>>>>> "FC30151:7:51:1429:1043" >>>>> >>>>> ## trim prefix and suffix >>>>> qnameSuffixStart(bf) <- ":" >>>>>>> scanBam(bf, param=param)[[1]]$qname[1:3] >>>>>> [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" >>>>> >>>>> >>>>> Valerie >>>>> >>>>> >>>>> On 05/08/14 12:17, Stefano Calza wrote: >>>>>> Yes, I say that would be easier to use than regexp >>>>>> >>>>>> Stefano >>>>>> >>>>>> On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >>>>>>> Thanks for the SRA tips. >>>>>>> >>>>>>> It's starting to look like the modifications are an additional >>>>>>> prefix >>>>>>> or suffix separated by dot or slash (possibly underscore?). Maybe >>>>>>> simply adding an option to trim the QNAME by the pre/post term >>>>>>> separated by a given character would be sufficient. This allows >>>>>>> flexibilty but prevents unwarranted QNAME mangling. >>>>>>> >>>>>>> Valerie >>>>>>> >>>>>>> >>>>>>> On 05/08/14 10:56, Stefano Calza wrote: >>>>>>>> Right this is how I got some other example. I think it would add >>>>>>>> the >>>>>>>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are >>>>>>>> named >>>>>>>> STARTING with SRR1234 or SRR1235 for the two mates, followed by >>>>>>>> actual >>>>>>>> read QNAME. >>>>>>>> >>>>>>>> Stefano >>>>>>>> >>>>>>>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>>>>>>> Hi Valerie, >>>>>>>>> >>>>>>>>> You get something similar from the .sra files that you can >>>>>>>>> download >>>>>>>>> from the SRA, if they are paired data. If you use the SRA >>>>>>>>> toolkit to >>>>>>>>> convert to fastq (fastq-dump), it will spit out two fastq >>>>>>>>> files, and >>>>>>>>> the QNAME in each of the fastq files will be appended with a .1 >>>>>>>>> for >>>>>>>>> the first pairs and a .2 for the second pairs. >>>>>>>>> >>>>>>>>> As an example: >>>>>>>>> >>>>>>>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>>>>>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>>>>>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>>>>>>> >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> >>>>>>>>> Jim >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>>>>>>> Thanks Valerie >>>>>>>>>> >>>>>>>>>> I have got this BAM files from different sources but they >>>>>>>>>> cannot be >>>>>>>>>> distributed. >>>>>>>>>> >>>>>>>>>> Up to now I experienced twp different 'patterns' in QNAME. One >>>>>>>>>> is the >>>>>>>>>> trailing value as we said (/1, /2). Another one is a leading >>>>>>>>>> string. >>>>>>>>>> Eg. (made up QNAME) >>>>>>>>>> >>>>>>>>>> SRR1122.12345HTR >>>>>>>>>> SRR1123.12345HTR >>>>>>>>>> >>>>>>>>>> So there must be removed SRR1122 and SRR1123) >>>>>>>>>> >>>>>>>>>> My little program actually uses a regex substitution, so the >>>>>>>>>> user can >>>>>>>>>> decide what pattern to edit. This second one though it seems quit >>>>>>>>>> unusual. >>>>>>>>>> >>>>>>>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>>>>>>> correctly the use a pipeline called MapSplice) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Regards >>>>>>>>>> >>>>>>>>>> Stefano >>>>>>>>>> >>>>>>>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>>>>>>> Hi Stefano, >>>>>>>>>>> >>>>>>>>>>> No, the current mate-pairing doesn't handle the trailing >>>>>>>>>>> values. I >>>>>>>>>>> will implement this and post back when it's done. >>>>>>>>>>> >>>>>>>>>>> For reference, where did you download your bam files or what >>>>>>>>>>> application outputs QNAMEs in this format? I'd like to have >>>>>>>>>>> some for >>>>>>>>>>> test data. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Thanks. >>>>>>>>>>> Valerie >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>>>>>>> Hi everybody >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I am using GenomicAlignments package to read RNAseq pair- end >>>>>>>>>>>> data. The >>>>>>>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>>>>>>> asMates=TRUE >>>>>>>>>>>> in BamFile, returns 0 mates. >>>>>>>>>>>> >>>>>>>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>>>>>>> >>>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>>>>>>> >>>>>>>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>>>>>>> >>>>>>>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>>>>>>> still >>>>>>>>>>>> quite a substantial amount of time on big files. >>>>>>>>>>>> >>>>>>>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>>>>>>> >>>>>>>>>>>> Regards >>>>>>>>>>>> >>>>>>>>>>>> Stefano >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Bioconductor mailing list >>>>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>>>> Search the archives: >>>>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Bioconductor mailing list >>>>>>>>>> Bioconductor at r-project.org >>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>>> Search the archives: >>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>>> >>>>>>>>> -- >>>>>>>>> James W. MacDonald, M.S. >>>>>>>>> Biostatistician >>>>>>>>> University of Washington >>>>>>>>> Environmental and Occupational Health Sciences >>>>>>>>> 4225 Roosevelt Way NE, # 100 >>>>>>>>> Seattle WA 98105-6099 >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioconductor mailing list >>>>>>>>> Bioconductor at r-project.org >>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>>> Search the archives: >>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Bioconductor mailing list >>>>>>>> Bioconductor at r-project.org >>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>>> Search the archives: >>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >> >> > -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158
ADD REPLY

Login before adding your answer.

Traffic: 163 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6