Hi Margaret
Thanks for the feed-back. We have started working on an improved
vignette but this might still take a bit to get finished. The main
issue
with the current one is that it somehow start in the middle, namely
after the use of the Python scripts, and then explains these
initial steps only at the end, with some parts even been moved to the
pasilla vignette. So, you are certainly right, this needs to be
cleaned
up and brought in a chronological order.
To not let you wait for our new vignettes, let's go through your
issues:
> Issue #1
>
> module load htseq/0.5.3p9
> python
> import HTSeq
> alignment_file = HTSeq.SAM_Reader(" sam file ")
>
> This step generates a SAM alignment object.
>
> However, I need to exit in order to use PBS code to run the
> dexseq_prepare_annotation.py
> and dexseq_count.py and generate gff and txt files,
> and thus lose the SAM alignment object which I am guessing should
> be used as the gtf file?
I am unusure where you found this code fragment. Have yiu started
reading the "Tour through HTSeq" page? This is nice, but really, there
is no need learn about how o program your own Python scripts that use
HTSeq when you just want to use the HTSeq-based Python scripts
supplied
with DEXSeq.
You need to start of with preparing your annotation file. This is the
GTF file with the gene models for your species that you get from
Ensembl. (Make sure that the GTF file matches the genome you aligned
against.) Just run, on the bash shell
python dexseq_prepare_annotation.py Mus_musculus.GRCm38.71.gtf.gz
\
mouse_flattened.gff
This takes the GTF file from Ensembl (here the current one for mouse,
taken from
ftp://ftp.ensembl.org/pub/release-71/gtf/ ) and "flattens"
it. See Figure 1 of our paper for an explanation what we mean by
"flattening".
Then, you run for each of your BAM files the counting script:
python dexseq_count.py mouse_flattened.gff sample1.sam
sample1.counts
Here, sample1.sam is the file produced by tophat (i.e.,
"accepted_hits.sam" for the respective sample). The file
"sample1.counts" contains the counts. Look into it to see whether it
makes sense.
Then, you follow the Section "Creating ExonCountSet objects from fi
les
produced by HTSeq" to read in your count files. And then, you have the
ExonCountDataSet that you need to perform the analysis described in
the
initial sections of the vignette.
> Can I actually skip the alignment_file step if I already posses a
gtf file?
Not sure which step you mean. You always need to get a GTF file
(preferably from Ensembl, the UCSC ones only work after some
tweaking),
"flatten it", and then count.
> Issue #2
>
> If I have a homo sapiens transcript.bam file (not from ensembl) how
do
> I process the file in order to make it work with
> dexseq_prepare_annotation.py?
>
> Should I pass the transcript.sam file through the alignment_file
code
> first?
A ".bam" file? Are you sure you don't mean a gtf or gff file?
Explain a bit more what this file contains, please.
> Issue #3
>
> Using an ensembl animal gene data, I have to specify strandedness=no
within
> the
> python dexseq_count.py code and I am wondering if this will
> lead to errors down the line.. The dexseq_count code won't work
without
> -s no, despite that the bam file is probably from the ensembl site.
Somehow you got confused here. The bam files contain the aligned reads
from your samples, not the annotation. And the "stranded" argument is
about whether the wet-lab protocol you used for sample preparation
preserves strandedness (i.e., whether the strand to which a read gets
aligned can tell you about which strand the original mRNA was
transcribed from.)
> Issue #4
>
> The two vignettes use different ways to create an ExonCountSet
object.I am
> confused about which to use...
Well, they both work. We prefer to use the Python scripts. However,
many
users said they preferred an solution that works completely in R,
without resorting to another language such as Python, and the
GenomicRanges workflow is an attempt to accomodate these wishes.
> Also, if I have a homo sapiens gene transcript data from illumina,
put
> through tophat, what type of functions should I use to prep the
data for
> the ExonCountSet
> creation? Same question for ensembl animal gene data...
I am still confused by what you mean by "data from ensemble". Do you
have your own RNA-Seq data, or are you using published ones?
> Should I prep the data by using this workflow:
>
http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicF
eatures/inst/doc/GenomicFeatures.pdf
This is the vignette for the GenomicFeatures package. You can use this
package _instead_ of our Python scripts, but then, there is still no
need to work through this vignette. Rather, just follow the
instructions
in the DEXSeq vignette, Section "Creating ExonCountSet objects From
GRanges, BamListFiles and transcriptDb objects".
I hope this helps you to get started.
Simon