GenomicAlignments and QNAME collision

0

Entering edit mode

Stefano Calza ▴ 90

@stefano-calza-5062

Last seen 9.6 years ago

Italy

Hi everybody I am using GenomicAlignments package to read RNAseq pair-end data. The problem is that readGAlignmentPairsFromBam, after setting asMates=TRUE in BamFile, returns 0 mates. The reason is that mates have different QNAMEs. Eg: UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 that is the two mates have /1 or /2 at the end. I wrote a Python (and a cpp) program to fix it...but this takes still quite a substantial amount of time on big files. Does the mating algorithm allow for this? If so how? Regards Stefano

RNASeq GenomicAlignments RNASeq GenomicAlignments • 1.2k views

ADD COMMENT • link updated 10.0 years ago by Valerie Obenchain ★ 6.8k • written 10.0 years ago by Stefano Calza ▴ 90

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Stefano, No, the current mate-pairing doesn't handle the trailing values. I will implement this and post back when it's done. For reference, where did you download your bam files or what application outputs QNAMEs in this format? I'd like to have some for test data. Thanks. Valerie On 05/08/14 08:14, Stefano Calza wrote: > Hi everybody > > > I am using GenomicAlignments package to read RNAseq pair-end data. The > problem is that readGAlignmentPairsFromBam, after setting asMates=TRUE > in BamFile, returns 0 mates. > > The reason is that mates have different QNAMEs. Eg: > > UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 > UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 > > that is the two mates have /1 or /2 at the end. > > I wrote a Python (and a cpp) program to fix it...but this takes still > quite a substantial amount of time on big files. > > Does the mating algorithm allow for this? If so how? > > Regards > > Stefano > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD COMMENT • link 10.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thanks Valerie I have got this BAM files from different sources but they cannot be distributed. Up to now I experienced twp different 'patterns' in QNAME. One is the trailing value as we said (/1, /2). Another one is a leading string. Eg. (made up QNAME) SRR1122.12345HTR SRR1123.12345HTR So there must be removed SRR1122 and SRR1123) My little program actually uses a regex substitution, so the user can decide what pattern to edit. This second one though it seems quit unusual. Those with trailing values were downloaded by TCGA (if I recall correctly the use a pipeline called MapSplice) Regards Stefano On 05/08/2014 05:54 PM, Valerie Obenchain wrote: > Hi Stefano, > > No, the current mate-pairing doesn't handle the trailing values. I > will implement this and post back when it's done. > > For reference, where did you download your bam files or what > application outputs QNAMEs in this format? I'd like to have some for > test data. > > > Thanks. > Valerie > > > On 05/08/14 08:14, Stefano Calza wrote: >> Hi everybody >> >> >> I am using GenomicAlignments package to read RNAseq pair-end data. The >> problem is that readGAlignmentPairsFromBam, after setting asMates=TRUE >> in BamFile, returns 0 mates. >> >> The reason is that mates have different QNAMEs. Eg: >> >> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >> >> that is the two mates have /1 or /2 at the end. >> >> I wrote a Python (and a cpp) program to fix it...but this takes still >> quite a substantial amount of time on big files. >> >> Does the mating algorithm allow for this? If so how? >> >> Regards >> >> Stefano >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 10.0 years ago Stefano Calza ▴ 90

0

Entering edit mode

Hi Valerie, You get something similar from the .sra files that you can download from the SRA, if they are paired data. If you use the SRA toolkit to convert to fastq (fastq-dump), it will spit out two fastq files, and the QNAME in each of the fastq files will be appended with a .1 for the first pairs and a .2 for the second pairs. As an example: zcat SRR833731_1.fastq.gz | head -n 1 @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 zcat SRR833731_2.fastq.gz | head -n 1 @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 Best, Jim On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: > Thanks Valerie > > I have got this BAM files from different sources but they cannot be > distributed. > > Up to now I experienced twp different 'patterns' in QNAME. One is the > trailing value as we said (/1, /2). Another one is a leading string. > Eg. (made up QNAME) > > SRR1122.12345HTR > SRR1123.12345HTR > > So there must be removed SRR1122 and SRR1123) > > My little program actually uses a regex substitution, so the user can > decide what pattern to edit. This second one though it seems quit > unusual. > > Those with trailing values were downloaded by TCGA (if I recall > correctly the use a pipeline called MapSplice) > > > Regards > > Stefano > > On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >> Hi Stefano, >> >> No, the current mate-pairing doesn't handle the trailing values. I >> will implement this and post back when it's done. >> >> For reference, where did you download your bam files or what >> application outputs QNAMEs in this format? I'd like to have some for >> test data. >> >> >> Thanks. >> Valerie >> >> >> On 05/08/14 08:14, Stefano Calza wrote: >>> Hi everybody >>> >>> >>> I am using GenomicAlignments package to read RNAseq pair-end data. The >>> problem is that readGAlignmentPairsFromBam, after setting asMates=TRUE >>> in BamFile, returns 0 mates. >>> >>> The reason is that mates have different QNAMEs. Eg: >>> >>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>> >>> that is the two mates have /1 or /2 at the end. >>> >>> I wrote a Python (and a cpp) program to fix it...but this takes still >>> quite a substantial amount of time on big files. >>> >>> Does the mating algorithm allow for this? If so how? >>> >>> Regards >>> >>> Stefano >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 10.0 years ago James W. MacDonald 65k

0

Entering edit mode

Right this is how I got some other example. I think it would add the files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are named STARTING with SRR1234 or SRR1235 for the two mates, followed by actual read QNAME. Stefano On 05/08/2014 07:05 PM, James W. MacDonald wrote: > Hi Valerie, > > You get something similar from the .sra files that you can download > from the SRA, if they are paired data. If you use the SRA toolkit to > convert to fastq (fastq-dump), it will spit out two fastq files, and > the QNAME in each of the fastq files will be appended with a .1 for > the first pairs and a .2 for the second pairs. > > As an example: > > zcat SRR833731_1.fastq.gz | head -n 1 > @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 > zcat SRR833731_2.fastq.gz | head -n 1 > @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 > > > Best, > > Jim > > > > On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >> Thanks Valerie >> >> I have got this BAM files from different sources but they cannot be >> distributed. >> >> Up to now I experienced twp different 'patterns' in QNAME. One is the >> trailing value as we said (/1, /2). Another one is a leading string. >> Eg. (made up QNAME) >> >> SRR1122.12345HTR >> SRR1123.12345HTR >> >> So there must be removed SRR1122 and SRR1123) >> >> My little program actually uses a regex substitution, so the user can >> decide what pattern to edit. This second one though it seems quit >> unusual. >> >> Those with trailing values were downloaded by TCGA (if I recall >> correctly the use a pipeline called MapSplice) >> >> >> Regards >> >> Stefano >> >> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>> Hi Stefano, >>> >>> No, the current mate-pairing doesn't handle the trailing values. I >>> will implement this and post back when it's done. >>> >>> For reference, where did you download your bam files or what >>> application outputs QNAMEs in this format? I'd like to have some for >>> test data. >>> >>> >>> Thanks. >>> Valerie >>> >>> >>> On 05/08/14 08:14, Stefano Calza wrote: >>>> Hi everybody >>>> >>>> >>>> I am using GenomicAlignments package to read RNAseq pair-end data. The >>>> problem is that readGAlignmentPairsFromBam, after setting asMates=TRUE >>>> in BamFile, returns 0 mates. >>>> >>>> The reason is that mates have different QNAMEs. Eg: >>>> >>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>> >>>> that is the two mates have /1 or /2 at the end. >>>> >>>> I wrote a Python (and a cpp) program to fix it...but this takes still >>>> quite a substantial amount of time on big files. >>>> >>>> Does the mating algorithm allow for this? If so how? >>>> >>>> Regards >>>> >>>> Stefano >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.0 years ago Stefano Calza ▴ 90

0

Entering edit mode

Thanks for the SRA tips. It's starting to look like the modifications are an additional prefix or suffix separated by dot or slash (possibly underscore?). Maybe simply adding an option to trim the QNAME by the pre/post term separated by a given character would be sufficient. This allows flexibilty but prevents unwarranted QNAME mangling. Valerie On 05/08/14 10:56, Stefano Calza wrote: > Right this is how I got some other example. I think it would add the > files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are named > STARTING with SRR1234 or SRR1235 for the two mates, followed by actual > read QNAME. > > Stefano > > On 05/08/2014 07:05 PM, James W. MacDonald wrote: >> Hi Valerie, >> >> You get something similar from the .sra files that you can download >> from the SRA, if they are paired data. If you use the SRA toolkit to >> convert to fastq (fastq-dump), it will spit out two fastq files, and >> the QNAME in each of the fastq files will be appended with a .1 for >> the first pairs and a .2 for the second pairs. >> >> As an example: >> >> zcat SRR833731_1.fastq.gz | head -n 1 >> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >> zcat SRR833731_2.fastq.gz | head -n 1 >> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >> >> >> Best, >> >> Jim >> >> >> >> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>> Thanks Valerie >>> >>> I have got this BAM files from different sources but they cannot be >>> distributed. >>> >>> Up to now I experienced twp different 'patterns' in QNAME. One is the >>> trailing value as we said (/1, /2). Another one is a leading string. >>> Eg. (made up QNAME) >>> >>> SRR1122.12345HTR >>> SRR1123.12345HTR >>> >>> So there must be removed SRR1122 and SRR1123) >>> >>> My little program actually uses a regex substitution, so the user can >>> decide what pattern to edit. This second one though it seems quit >>> unusual. >>> >>> Those with trailing values were downloaded by TCGA (if I recall >>> correctly the use a pipeline called MapSplice) >>> >>> >>> Regards >>> >>> Stefano >>> >>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>> Hi Stefano, >>>> >>>> No, the current mate-pairing doesn't handle the trailing values. I >>>> will implement this and post back when it's done. >>>> >>>> For reference, where did you download your bam files or what >>>> application outputs QNAMEs in this format? I'd like to have some for >>>> test data. >>>> >>>> >>>> Thanks. >>>> Valerie >>>> >>>> >>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>> Hi everybody >>>>> >>>>> >>>>> I am using GenomicAlignments package to read RNAseq pair-end data. The >>>>> problem is that readGAlignmentPairsFromBam, after setting asMates=TRUE >>>>> in BamFile, returns 0 mates. >>>>> >>>>> The reason is that mates have different QNAMEs. Eg: >>>>> >>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>> >>>>> that is the two mates have /1 or /2 at the end. >>>>> >>>>> I wrote a Python (and a cpp) program to fix it...but this takes still >>>>> quite a substantial amount of time on big files. >>>>> >>>>> Does the mating algorithm allow for this? If so how? >>>>> >>>>> Regards >>>>> >>>>> Stefano >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> -- >> James W. MacDonald, M.S. >> Biostatistician >> University of Washington >> Environmental and Occupational Health Sciences >> 4225 Roosevelt Way NE, # 100 >> Seattle WA 98105-6099 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Valerie Obenchain Program in Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, Seattle, WA 98109 Email: vobencha at fhcrc.org Phone: (206) 667-3158

ADD REPLY • link 10.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Yes, I say that would be easier to use than regexp Stefano On 05/08/2014 08:59 PM, Valerie Obenchain wrote: > Thanks for the SRA tips. > > It's starting to look like the modifications are an additional prefix > or suffix separated by dot or slash (possibly underscore?). Maybe > simply adding an option to trim the QNAME by the pre/post term > separated by a given character would be sufficient. This allows > flexibilty but prevents unwarranted QNAME mangling. > > Valerie > > > On 05/08/14 10:56, Stefano Calza wrote: >> Right this is how I got some other example. I think it would add the >> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are named >> STARTING with SRR1234 or SRR1235 for the two mates, followed by actual >> read QNAME. >> >> Stefano >> >> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>> Hi Valerie, >>> >>> You get something similar from the .sra files that you can download >>> from the SRA, if they are paired data. If you use the SRA toolkit to >>> convert to fastq (fastq-dump), it will spit out two fastq files, and >>> the QNAME in each of the fastq files will be appended with a .1 for >>> the first pairs and a .2 for the second pairs. >>> >>> As an example: >>> >>> zcat SRR833731_1.fastq.gz | head -n 1 >>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>> zcat SRR833731_2.fastq.gz | head -n 1 >>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>> >>> >>> Best, >>> >>> Jim >>> >>> >>> >>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>> Thanks Valerie >>>> >>>> I have got this BAM files from different sources but they cannot be >>>> distributed. >>>> >>>> Up to now I experienced twp different 'patterns' in QNAME. One is the >>>> trailing value as we said (/1, /2). Another one is a leading string. >>>> Eg. (made up QNAME) >>>> >>>> SRR1122.12345HTR >>>> SRR1123.12345HTR >>>> >>>> So there must be removed SRR1122 and SRR1123) >>>> >>>> My little program actually uses a regex substitution, so the user can >>>> decide what pattern to edit. This second one though it seems quit >>>> unusual. >>>> >>>> Those with trailing values were downloaded by TCGA (if I recall >>>> correctly the use a pipeline called MapSplice) >>>> >>>> >>>> Regards >>>> >>>> Stefano >>>> >>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>> Hi Stefano, >>>>> >>>>> No, the current mate-pairing doesn't handle the trailing values. I >>>>> will implement this and post back when it's done. >>>>> >>>>> For reference, where did you download your bam files or what >>>>> application outputs QNAMEs in this format? I'd like to have some for >>>>> test data. >>>>> >>>>> >>>>> Thanks. >>>>> Valerie >>>>> >>>>> >>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>> Hi everybody >>>>>> >>>>>> >>>>>> I am using GenomicAlignments package to read RNAseq pair-end >>>>>> data. The >>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>> asMates=TRUE >>>>>> in BamFile, returns 0 mates. >>>>>> >>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>> >>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>> >>>>>> that is the two mates have /1 or /2 at the end. >>>>>> >>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>> still >>>>>> quite a substantial amount of time on big files. >>>>>> >>>>>> Does the mating algorithm allow for this? If so how? >>>>>> >>>>>> Regards >>>>>> >>>>>> Stefano >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> -- >>> James W. MacDonald, M.S. >>> Biostatistician >>> University of Washington >>> Environmental and Occupational Health Sciences >>> 4225 Roosevelt Way NE, # 100 >>> Seattle WA 98105-6099 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >

ADD REPLY • link 10.0 years ago Stefano Calza ▴ 90

0

Entering edit mode

Trimming the qname is implemented in Rsamtools 1.17.18 and should be available via biocLite() ~ noon tomorrow. 'qnamePrefixStart' and 'qnameSuffixEnd' fields have been added to the BamFile class. Currently the trimming is implemented for mate pairing only, not just reading records from a bam. Unique qnames aren't a problem in this sample file - just using it to demonstrate the output. fl <- system.file("extdata", "ex1.bam", package="Rsamtools") param <- ScanBamParam(what="qname") ## no trimming bf <- BamFile(fl, asMates=TRUE) >> scanBam(bf, param=param)[[1]]$qname[1:3] > [1] "EAS54_61:4:143:69:578" "EAS54_61:4:143:69:578" > [3] "EAS219_FC30151:7:51:1429:1043" ## trim prefix qnamePrefixEnd(bf) <- "_" >> scanBam(bf, param=param)[[1]]$qname[1:3] > [1] "61:4:143:69:578" "61:4:143:69:578" "FC30151:7:51:1429:1043" ## trim prefix and suffix qnameSuffixStart(bf) <- ":" >> scanBam(bf, param=param)[[1]]$qname[1:3] > [1] "61:4:143:69" "61:4:143:69" "FC30151:7:51:1429" Valerie On 05/08/14 12:17, Stefano Calza wrote: > Yes, I say that would be easier to use than regexp > > Stefano > > On 05/08/2014 08:59 PM, Valerie Obenchain wrote: >> Thanks for the SRA tips. >> >> It's starting to look like the modifications are an additional prefix >> or suffix separated by dot or slash (possibly underscore?). Maybe >> simply adding an option to trim the QNAME by the pre/post term >> separated by a given character would be sufficient. This allows >> flexibilty but prevents unwarranted QNAME mangling. >> >> Valerie >> >> >> On 05/08/14 10:56, Stefano Calza wrote: >>> Right this is how I got some other example. I think it would add the >>> files names, as from 2 SRA files (SRR1234 & SRR1235) my reads are named >>> STARTING with SRR1234 or SRR1235 for the two mates, followed by actual >>> read QNAME. >>> >>> Stefano >>> >>> On 05/08/2014 07:05 PM, James W. MacDonald wrote: >>>> Hi Valerie, >>>> >>>> You get something similar from the .sra files that you can download >>>> from the SRA, if they are paired data. If you use the SRA toolkit to >>>> convert to fastq (fastq-dump), it will spit out two fastq files, and >>>> the QNAME in each of the fastq files will be appended with a .1 for >>>> the first pairs and a .2 for the second pairs. >>>> >>>> As an example: >>>> >>>> zcat SRR833731_1.fastq.gz | head -n 1 >>>> @SRR833731.1.1 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>> zcat SRR833731_2.fastq.gz | head -n 1 >>>> @SRR833731.1.2 HWI-ST423:250:D0JRLACXX:8:1101:1473:1978 length=101 >>>> >>>> >>>> Best, >>>> >>>> Jim >>>> >>>> >>>> >>>> On Thursday, May 08, 2014 12:03:29 PM, Stefano Calza wrote: >>>>> Thanks Valerie >>>>> >>>>> I have got this BAM files from different sources but they cannot be >>>>> distributed. >>>>> >>>>> Up to now I experienced twp different 'patterns' in QNAME. One is the >>>>> trailing value as we said (/1, /2). Another one is a leading string. >>>>> Eg. (made up QNAME) >>>>> >>>>> SRR1122.12345HTR >>>>> SRR1123.12345HTR >>>>> >>>>> So there must be removed SRR1122 and SRR1123) >>>>> >>>>> My little program actually uses a regex substitution, so the user can >>>>> decide what pattern to edit. This second one though it seems quit >>>>> unusual. >>>>> >>>>> Those with trailing values were downloaded by TCGA (if I recall >>>>> correctly the use a pipeline called MapSplice) >>>>> >>>>> >>>>> Regards >>>>> >>>>> Stefano >>>>> >>>>> On 05/08/2014 05:54 PM, Valerie Obenchain wrote: >>>>>> Hi Stefano, >>>>>> >>>>>> No, the current mate-pairing doesn't handle the trailing values. I >>>>>> will implement this and post back when it's done. >>>>>> >>>>>> For reference, where did you download your bam files or what >>>>>> application outputs QNAMEs in this format? I'd like to have some for >>>>>> test data. >>>>>> >>>>>> >>>>>> Thanks. >>>>>> Valerie >>>>>> >>>>>> >>>>>> On 05/08/14 08:14, Stefano Calza wrote: >>>>>>> Hi everybody >>>>>>> >>>>>>> >>>>>>> I am using GenomicAlignments package to read RNAseq pair-end >>>>>>> data. The >>>>>>> problem is that readGAlignmentPairsFromBam, after setting >>>>>>> asMates=TRUE >>>>>>> in BamFile, returns 0 mates. >>>>>>> >>>>>>> The reason is that mates have different QNAMEs. Eg: >>>>>>> >>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/1 >>>>>>> UNC15-SN850:240:D148CACXX:3:1308:19719:99367/2 >>>>>>> >>>>>>> that is the two mates have /1 or /2 at the end. >>>>>>> >>>>>>> I wrote a Python (and a cpp) program to fix it...but this takes >>>>>>> still >>>>>>> quite a substantial amount of time on big files. >>>>>>> >>>>>>> Does the mating algorithm allow for this? If so how? >>>>>>> >>>>>>> Regards >>>>>>> >>>>>>> Stefano >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>>> -- >>>> James W. MacDonald, M.S. >>>> Biostatistician >>>> University of Washington >>>> Environmental and Occupational Health Sciences >>>> 4225 Roosevelt Way NE, # 100 >>>> Seattle WA 98105-6099 >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >

ADD REPLY • link 9.9 years ago Valerie Obenchain ★ 6.8k

Login before adding your answer.