Question

DEXSeq Problems

0

Entering edit mode

Bio152 ▴ 150

@bio152-5954

Last seen 8.2 years ago

Hi- This is my first time using a bioconductor package. I also have never worked with genetic data before. Here is a list of my DEXSeq issues thus far. Just places within the April 4th and Feb 27th vignettes that I haven't succeeded in replicating for my own data using module r/3.0.1 within a bash shell. Info: I used tophat with a fastq file, and then was able to to create a sam file. Issue #1 module load htseq/0.5.3p9 python import HTSeq alignment_file = HTSeq.SAM_Reader(" sam file ") This step generates a SAM alignment object. However, I need to exit in order to use PBS code to run the dexseq_prepare_annotation.py and dexseq_count.py and generate gff and txt files, and thus lose the SAM alignment object which I am guessing should be used as the gtf file? Can I actually skip the alignment_file step if I already posses a gtf file? Issue #2 If I have a homo sapiens transcript.bam file (not from ensembl) how do I process the file in order to make it work with dexseq_prepare_annotation.py? Should I pass the transcript.sam file through the alignment_file code first? Issue #3 Using an ensembl animal gene data, I have to specify strandedness=no within the python dexseq_count.py code and I am wondering if this will lead to errors down the line.. The dexseq_count code won't work without -s no, despite that the bam file is probably from the ensembl site. Issue #4 The two vignettes use different ways to create an ExonCountSet object.I am confused about which to use... Also, if I have a homo sapiens gene transcript data from illumina, put through tophat, what type of functions should I use to prep the data for the ExonCountSet creation? Same question for ensembl animal gene data... Should I prep the data by using this workflow: http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFea tures/inst/doc/GenomicFeatures.pdf Thanks, ML ASU Grad Student [[alternative HTML version deleted]]

Alignment Homo sapiens PROcess Alignment Homo sapiens PROcess • 1.7k views

ADD COMMENT • link updated 10.9 years ago by Simon Anders ★ 3.7k • written 10.9 years ago by Bio152 ▴ 150

score 0 · Answer 1 · 2013-05-26

Hi Margaret Thanks for the feed-back. We have started working on an improved vignette but this might still take a bit to get finished. The main issue with the current one is that it somehow start in the middle, namely after the use of the Python scripts, and then explains these initial steps only at the end, with some parts even been moved to the pasilla vignette. So, you are certainly right, this needs to be cleaned up and brought in a chronological order. To not let you wait for our new vignettes, let's go through your issues: > Issue #1 > > module load htseq/0.5.3p9 > python > import HTSeq > alignment_file = HTSeq.SAM_Reader(" sam file ") > > This step generates a SAM alignment object. > > However, I need to exit in order to use PBS code to run the > dexseq_prepare_annotation.py > and dexseq_count.py and generate gff and txt files, > and thus lose the SAM alignment object which I am guessing should > be used as the gtf file? I am unusure where you found this code fragment. Have yiu started reading the "Tour through HTSeq" page? This is nice, but really, there is no need learn about how o program your own Python scripts that use HTSeq when you just want to use the HTSeq-based Python scripts supplied with DEXSeq. You need to start of with preparing your annotation file. This is the GTF file with the gene models for your species that you get from Ensembl. (Make sure that the GTF file matches the genome you aligned against.) Just run, on the bash shell python dexseq_prepare_annotation.py Mus_musculus.GRCm38.71.gtf.gz \ mouse_flattened.gff This takes the GTF file from Ensembl (here the current one for mouse, taken from ftp://ftp.ensembl.org/pub/release-71/gtf/ ) and "flattens" it. See Figure 1 of our paper for an explanation what we mean by "flattening". Then, you run for each of your BAM files the counting script: python dexseq_count.py mouse_flattened.gff sample1.sam sample1.counts Here, sample1.sam is the file produced by tophat (i.e., "accepted_hits.sam" for the respective sample). The file "sample1.counts" contains the counts. Look into it to see whether it makes sense. Then, you follow the Section "Creating ExonCountSet objects from fi les produced by HTSeq" to read in your count files. And then, you have the ExonCountDataSet that you need to perform the analysis described in the initial sections of the vignette. > Can I actually skip the alignment_file step if I already posses a gtf file? Not sure which step you mean. You always need to get a GTF file (preferably from Ensembl, the UCSC ones only work after some tweaking), "flatten it", and then count. > Issue #2 > > If I have a homo sapiens transcript.bam file (not from ensembl) how do > I process the file in order to make it work with > dexseq_prepare_annotation.py? > > Should I pass the transcript.sam file through the alignment_file code > first? A ".bam" file? Are you sure you don't mean a gtf or gff file? Explain a bit more what this file contains, please. > Issue #3 > > Using an ensembl animal gene data, I have to specify strandedness=no within > the > python dexseq_count.py code and I am wondering if this will > lead to errors down the line.. The dexseq_count code won't work without > -s no, despite that the bam file is probably from the ensembl site. Somehow you got confused here. The bam files contain the aligned reads from your samples, not the annotation. And the "stranded" argument is about whether the wet-lab protocol you used for sample preparation preserves strandedness (i.e., whether the strand to which a read gets aligned can tell you about which strand the original mRNA was transcribed from.) > Issue #4 > > The two vignettes use different ways to create an ExonCountSet object.I am > confused about which to use... Well, they both work. We prefer to use the Python scripts. However, many users said they preferred an solution that works completely in R, without resorting to another language such as Python, and the GenomicRanges workflow is an attempt to accomodate these wishes. > Also, if I have a homo sapiens gene transcript data from illumina, put > through tophat, what type of functions should I use to prep the data for > the ExonCountSet > creation? Same question for ensembl animal gene data... I am still confused by what you mean by "data from ensemble". Do you have your own RNA-Seq data, or are you using published ones? > Should I prep the data by using this workflow: > http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicF eatures/inst/doc/GenomicFeatures.pdf This is the vignette for the GenomicFeatures package. You can use this package _instead_ of our Python scripts, but then, there is still no need to work through this vignette. Rather, just follow the instructions in the DEXSeq vignette, Section "Creating ExonCountSet objects From GRanges, BamListFiles and transcriptDb objects". I hope this helps you to get started. Simon