deqseq_count and BWA-based SAM files

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hello all, I'm trying to use DEXSeq to identify and quantitate alternative splicing events following transgene expression. I've used BWA to map my reads and now have the resulting SAM files. However, when I use dexseq_count (a Python script) to convert this to a ExonCountSet, I get the following error message: "/common/groups/dac/Hawaii/my_new_env/lib/python2.6/site- packages/HTSeq/__init__.py:592: UserWarning: Read HWUSI- EAS381R:1:10:12324:1388#CGATGT claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)" I can see that my SAM file is not "sorted" properly. Does anyone have a workaround for this? I've searched all over and all I can find is people writing, "I used a perl script as a workaround." Thanks in advance, Wyatt -- output of sessionInfo(): R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] DEXSeq_1.0.2 Biobase_2.14.0 loaded via a namespace (and not attached): [1] hwriter_1.3 plyr_1.6 statmod_1.4.14 stringr_0.6 -- Sent via the guest posting facility at bioconductor.org.

convert DEXSeq convert DEXSeq • 1.5k views

ADD COMMENT • link updated 12.4 years ago by Shona Wood ▴ 70 • written 12.4 years ago by Guest User ★ 13k

0

Entering edit mode

wang peter ★ 2.0k

@wang-peter-4647

Last seen 9.6 years ago

you can sort them by linux command sort -T . -S 2G -k 1,1 -k 3,3 $target_fa.psl.sam > $target_fa.psl.nameSorted.sam -- shan gao Room 231(Dr.Fei lab) Boyce Thompson Institute Cornell University Tower Road, Ithaca, NY 14853-1801 Office phone: 1-607-254-1267(day) Official email:sg839 at cornell.edu Facebook:http://www.facebook.com/profile.php?id=100001986532253

ADD COMMENT • link 12.4 years ago wang peter ★ 2.0k

0

Entering edit mode

I'd rather use `samtools sort ...` You probably want to (1) convert your sam to bam, (2) then sort, (3) then index, (4) then use. >From the command line: ## convert to bam samtools view -bS your.sam > your-tmp.bam ## sort bam samtools sort your-tmp.bam your ## index bam samtools index your.bam Now you can junk your sam file and save mucho HD space to boot. -steve On Tue, Dec 13, 2011 at 10:41 AM, wang peter <wng.peter at="" gmail.com=""> wrote: > you can sort them by linux command > > sort -T . -S 2G -k 1,1 -k 3,3 $target_fa.psl.sam > $target_fa.psl.nameSorted.sam > -- > shan gao > Room 231(Dr.Fei lab) > Boyce Thompson Institute > Cornell University > Tower Road, Ithaca, NY 14853-1801 > Office phone: 1-607-254-1267(day) > Official email:sg839 at cornell.edu > Facebook:http://www.facebook.com/profile.php?id=100001986532253 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.4 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thanks to everyone who has replied to this. I really appreciate it! Unfortunately, none of these has worked. I've used both Shan's script as well as samtools and am still having the same problem. Despite everything being very nicely sorted, I still getting the same error message. I'm not sure what to do. I may write the developer of the dexseq_count script. . . . Thanks so much, Wyatt ----- Original Message ----- From: "Steve Lianoglou" <mailinglist.honeypot@gmail.com> To: "wang peter" <wng.peter at="" gmail.com=""> Cc: "Wyatt McMahon [guest]" <guest at="" bioconductor.org="">, bioconductor at r-project.org, wmcmahon at vbi.vt.edu Sent: Tuesday, December 13, 2011 4:02:01 PM Subject: Re: [BioC] deqseq_count and BWA-based SAM files I'd rather use `samtools sort ...` You probably want to (1) convert your sam to bam, (2) then sort, (3) then index, (4) then use. >From the command line: ## convert to bam samtools view -bS your.sam > your-tmp.bam ## sort bam samtools sort your-tmp.bam your ## index bam samtools index your.bam Now you can junk your sam file and save mucho HD space to boot. -steve On Tue, Dec 13, 2011 at 10:41 AM, wang peter <wng.peter at="" gmail.com=""> wrote: > you can sort them by linux command > > sort -T . -S 2G -k 1,1 -k 3,3 $target_fa.psl.sam > $target_fa.psl.nameSorted.sam > -- > shan gao > Room 231(Dr.Fei lab) > Boyce Thompson Institute > Cornell University > Tower Road, Ithaca, NY 14853-1801 > Office phone: 1-607-254-1267(day) > Official email:sg839 at cornell.edu > Facebook:http://www.facebook.com/profile.php?id=100001986532253 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD REPLY • link 12.4 years ago Wyatt McMahon ▴ 20

0

Entering edit mode

Hi Wyatt On 2011-12-13 22:06, Wyatt McMahon wrote: > Unfortunately, none of these has worked. I've used both Shan's > script as well as samtools and am still having the same problem. > Despite everything being very nicely sorted, I still getting the same > error message. Do you get the error for every read or only for some. The latter is typically harmless. To explain: The way how the SAM format stores paired-end reads is, IMO, rather unfortunate. Each mate gets its own SAM line, and the two SAM lines can be at rather different places in the file. Once you sort by name, the mates will be close to each other (even though they may still be mixed up in case there is more than one alignment for the pair). HTSeq takes a chunk of adjacent lines with the same read ID and arranges them into matching pairs (by using the MRNM and MPOS (or RNEXT and PNEXT in the new terminology) columns). If this does not work, the warning is displayed. Often, if you do some filtering, you might remove a SAM line for a read but leave in the line for its mate. HTSeq will simply skip such reads but display the warning you saw. You can silence the warnings (but also all others) with the '-q' option if they bother you. Simon

ADD REPLY • link 12.4 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Shona Wood ▴ 70

@shona-wood-4559

Last seen 9.6 years ago

Hi Wyatt, I use SAMTOOLS to sort sam files: http://samtools.sourceforge.net/samtools.shtml ________________________________________ From: bioconductor-bounces@r-project.org [bioconductor- bounces@r-project.org] on behalf of Wyatt McMahon [guest] [guest@bioconductor.org] Sent: 13 December 2011 14:42 To: bioconductor at r-project.org; wmcmahon at vbi.vt.edu Subject: [BioC] deqseq_count and BWA-based SAM files Hello all, I'm trying to use DEXSeq to identify and quantitate alternative splicing events following transgene expression. I've used BWA to map my reads and now have the resulting SAM files. However, when I use dexseq_count (a Python script) to convert this to a ExonCountSet, I get the following error message: "/common/groups/dac/Hawaii/my_new_env/lib/python2.6/site- packages/HTSeq/__init__.py:592: UserWarning: Read HWUSI- EAS381R:1:10:12324:1388#CGATGT claims to have an aligned mate which could not be found. (Is the SAM file properly sorted?)" I can see that my SAM file is not "sorted" properly. Does anyone have a workaround for this? I've searched all over and all I can find is people writing, "I used a perl script as a workaround." Thanks in advance, Wyatt -- output of sessionInfo(): R version 2.14.0 (2011-10-31) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] DEXSeq_1.0.2 Biobase_2.14.0 loaded via a namespace (and not attached): [1] hwriter_1.3 plyr_1.6 statmod_1.4.14 stringr_0.6 -- Sent via the guest posting facility at bioconductor.org. _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.4 years ago Shona Wood ▴ 70

Login before adding your answer.