Entering edit mode
Guest User
★
13k
@guest-user-4897
Last seen 10.6 years ago
Hi I'm working with the Pasilla and DEXSeq bioconductor package and
I'm trying to replicate the example data, but the flattened .txt files
processed using the provided dexseq_count.py script is not matching
the example data provided in the source package. Although this isn't
directly an 'R' script I'm hoping to someone here may be able to help
since it is part of an R package. Additionally, I think the R
documentation may contain an error (???) that I address in the text
below. Thanks in advance for any help. I'm hoping to use these
examples in order to understand how they are working so I can process
my own data eventually, but want to make sure they are functioning
properly with the provided example data as well.
Here is what I did:
- download copies of the source packages for pasilla and DEXSeq
- copied the python scripts to the directory containing example data
for pasilla (/inst/extdata)
- went to http://www.embl.de/~reyes/Graveley/bam/
- downloaded the bam files for treated1.bam and treated2.bam (because
the source packages didn't seem to have these, but the above link was
reference in the R documentation for downloading the BAM files).
- used the .gff package already processed in the pasilla package
- did the following in terminal
#This is for an example for processing a single-read type alignment
#creates an indexed bam file
$ samtools index treated1.bam
#converts the bam file to a sam file
$ samtools view treated1.bam > treated1.sam
#calls the python script dexseq_count to count the reads in the non-
overlapping exotic parts.
$ python dexseq_count.py Dmel.BDGP5.25.62.DEXSeq.chr.gff treated1.sam
treated1fb_TEST.txt
#This is an example for processing a paired-end type alignment
#creates an indexed bamfile
$ samtools index treated2.bam
#converts the nam file to a sam file
$ samtools view treated2.bam > treated2.sam
#sorts the sam file by position (-k option); and I'm assuming the bam
in the tutorial is a typo here because in the R documentation the
output is a sorted BAM file, but in the next line it is a sorted SAM
file. I did try the leaving it has a bam file as well, but as expected
received an error "no such file or directory"
$ sort -k 1,1 -k2,2n treated2.sam > treated2_sorted.sam
#runs the python script dexseq_count on paired-end data for the
processed sam file treated_2_sorted in order to count thee reads in
each non-overlapping exonic part.
python dexseq_count.py -p yes Dmel.BDGP5.25.62.DEXSeq.chr.gff
treated2_sorted.sam treated2fb_TEST.txt
- I've provided the first 10 lines of both the example data and the
processed output.
I also have a second question (which I will also post on a Seq forum
in case it can not be addressed here in an R/bioconductor forum);
since these python scripts are part of the HTSeq library why must the
paired-end data be sorted by position and not by read name using
samtools (since it was already called twice previously) as is
conducted for the htseq-count script?
Thanks again
output of .txt (1st 10 lines)
SINGLE READ:
treated1fb_pasilla example.txt
FBgn0000003:001 0
FBgn0000008:001 0
FBgn0000008:002 1
FBgn0000008:003 3
FBgn0000008:004 2
FBgn0000008:005 8
FBgn0000008:006 0
FBgn0000008:007 17
FBgn0000008:008 4
FBgn0000008:009 35
treated1fb_my processed output.txt (TEST)
FBgn0000003:001 0
FBgn0000008:001 0
FBgn0000008:002 0
FBgn0000008:003 0
FBgn0000008:004 1
FBgn0000008:005 4
FBgn0000008:006 1
FBgn0000008:007 18
FBgn0000008:008 4
FBgn0000008:009 16
PAIRED-END:
treated2fb_pasilla example.txt
FBgn0000003:001 1
FBgn0000008:001 0
FBgn0000008:002 0
FBgn0000008:003 1
FBgn0000008:004 0
FBgn0000008:005 2
FBgn0000008:006 1
FBgn0000008:007 22
FBgn0000008:008 7
FBgn0000008:009 46
treated2fb_my processed output.txt (TEST)
FBgn0000003:001 0
FBgn0000008:001 0
FBgn0000008:002 0
FBgn0000008:003 1
FBgn0000008:004 0
FBgn0000008:005 1
FBgn0000008:006 0
FBgn0000008:007 8
FBgn0000008:008 1
FBgn0000008:009 17
-- output of sessionInfo():
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pasilla_0.2.11 DESeq_1.8.3 locfit_1.5-8
DEXSeq_1.2.0 Biobase_2.16.0 BiocGenerics_0.2.0
loaded via a namespace (and not attached):
[1] annotate_1.34.1 AnnotationDbi_1.18.1 biomaRt_2.12.0
DBI_0.2-5 genefilter_1.38.0 geneplotter_1.34.0
[7] grid_2.15.0 hwriter_1.3 IRanges_1.14.4
lattice_0.20-6 plyr_1.7.1 RColorBrewer_1.0-5
[13] RCurl_1.91-1 RSQLite_0.11.1 splines_2.15.0
statmod_1.4.14 stats4_2.15.0 stringr_0.6
[19] survival_2.36-14 XML_3.9-4 xtable_1.7-0
--
Sent via the guest posting facility at bioconductor.org.