Question: DEXSeq Issue
0
6.5 years ago by
Bio152150
Bio152150 wrote:
Hi Simon, Thank you for all of your advice. I am using the code in Section 9.3 of the "Analyzing RNA-seq data for differential exon usage with the DEXSeq package." However I ran into a problem with my file.bam see below: > library("DEXSeq") > exonicParts <- prepareAnnotationForDEXSeq(hse, aggregateGenes=TRUE) > bamDir<-file.path("C:/Users/mlinan/Desktop") > fls <- list.files(bamDir, pattern="bam", full=TRUE) > library("Rsamtools") Loading required package: Biostrings > bamlst <- BamFileList(fls) > library("DEXSeq") > SE <- countReadsForDEXSeq(exonicParts, bamlst) Error in value[[3L]](cond) : failed to open BamFile: SAM/BAM header missing or empty file: 'C:/Users/mlinan/Desktop/file.bam' In addition: Warning messages: 1: In doTryCatch(return(expr), name, parentenv, handler) : [bam_header_read] EOF marker is absent. The input is probably truncated. 2: In doTryCatch(return(expr), name, parentenv, handler) : [bam_header_read] invalid BAM binary header (this is not a BAM file). > SE Error: object 'SE' not found Are there any steps that I can take to fix the errors? My file.bam is a .bam file and is located in the path specified. Thanks, Margaret --------------------------------------------------- Hi Margaret Thanks for the feed-back. We have started working on an improved vignette but this might still take a bit to get finished. The main issue with the current one is that it somehow start in the middle, namely after the use of the Python scripts, and then explains these initial steps only at the end, with some parts even been moved to the pasilla vignette. So, you are certainly right, this needs to be cleaned up and brought in a chronological order. To not let you wait for our new vignettes, let's go through your issues: > Issue #1 > > module load htseq/0.5.3p9 > python > import HTSeq > alignment_file = HTSeq.SAM_Reader(" sam file ") > > This step generates a SAM alignment object. > > However, I need to exit in order to use PBS code to run the > dexseq_prepare_annotation.py > and dexseq_count.py and generate gff and txt files, > and thus lose the SAM alignment object which I am guessing should > be used as the gtf file? I am unusure where you found this code fragment. Have yiu started reading the "Tour through HTSeq" page? This is nice, but really, there is no need learn about how o program your own Python scripts that use HTSeq when you just want to use the HTSeq-based Python scripts supplied with DEXSeq. You need to start of with preparing your annotation file. This is the GTF file with the gene models for your species that you get from Ensembl. (Make sure that the GTF file matches the genome you aligned against.) Just run, on the bash shell python dexseq_prepare_annotation.py Mus_musculus.GRCm38.71.gtf.gz \ mouse_flattened.gff This takes the GTF file from Ensembl (here the current one for mouse, taken from ftp://ftp.ensembl.org/pub/release-71/gtf/ ) and "flattens" it. See Figure 1 of our paper for an explanation what we mean by "flattening". Then, you run for each of your BAM files the counting script: python dexseq_count.py mouse_flattened.gff sample1.sam sample1.counts Here, sample1.sam is the file produced by tophat (i.e., "accepted_hits.sam" for the respective sample). The file "sample1.counts" contains the counts. Look into it to see whether it makes sense. Then, you follow the Section "Creating ExonCountSet objects from fi les produced by HTSeq" to read in your count files. And then, you have the ExonCountDataSet that you need to perform the analysis described in the initial sections of the vignette. > Can I actually skip the alignment_file step if I already posses a gtf file? Not sure which step you mean. You always need to get a GTF file (preferably from Ensembl, the UCSC ones only work after some tweaking), "flatten it", and then count. > Issue #2 > > If I have a homo sapiens transcript.bam file (not from ensembl) how do > I process the file in order to make it work with > dexseq_prepare_annotation.py? > > Should I pass the transcript.sam file through the alignment_file code > first? A ".bam" file? Are you sure you don't mean a gtf or gff file? Explain a bit more what this file contains, please. > Issue #3 > > Using an ensembl animal gene data, I have to specify strandedness=no within > the > python dexseq_count.py code and I am wondering if this will > lead to errors down the line.. The dexseq_count code won't work without > -s no, despite that the bam file is probably from the ensembl site. Somehow you got confused here. The bam files contain the aligned reads from your samples, not the annotation. And the "stranded" argument is about whether the wet-lab protocol you used for sample preparation preserves strandedness (i.e., whether the strand to which a read gets aligned can tell you about which strand the original mRNA was transcribed from.) > Issue #4 > > The two vignettes use different ways to create an ExonCountSet object.I am > confused about which to use... Well, they both work. We prefer to use the Python scripts. However, many users said they preferred an solution that works completely in R, without resorting to another language such as Python, and the GenomicRanges workflow is an attempt to accomodate these wishes. > Also, if I have a homo sapiens gene transcript data from illumina, put > through tophat, what type of functions should I use to prep the data for > the ExonCountSet > creation? Same question for ensembl animal gene data... I am still confused by what you mean by "data from ensemble". Do you have your own RNA-Seq data, or are you using published ones? > Should I prep the data by using this workflow: > http://www.bioconductor.org/packages/release/bioc/vignettes/GenomicFea tures/inst/doc/GenomicFeatures.pdf This is the vignette for the GenomicFeatures package. You can use this package _instead_ of our Python scripts, but then, there is still no need to work through this vignette. Rather, just follow the instructions in the DEXSeq vignette, Section "Creating ExonCountSet objects From GRanges, BamListFiles and transcriptDb objects". I hope this helps you to get started. Simon [[alternative HTML version deleted]] ADD COMMENTlink modified 6.5 years ago by Michael Love26k • written 6.5 years ago by Bio152150 Answer: DEXSeq Issue 0 6.5 years ago by Michael Love26k United States Michael Love26k wrote: hi Margaret, On Tue, May 28, 2013 at 10:26 PM, Margaret Linan <mlinan at="" asu.edu=""> wrote: > > However I ran into a problem with my file.bam see below: > >> library("DEXSeq") >> exonicParts <- prepareAnnotationForDEXSeq(hse, aggregateGenes=TRUE) >> bamDir<-file.path("C:/Users/mlinan/Desktop") >> fls <- list.files(bamDir, pattern="bam", full=TRUE) >> library("Rsamtools") > Loading required package: Biostrings > >> bamlst <- BamFileList(fls) >> library("DEXSeq") > >> SE <- countReadsForDEXSeq(exonicParts, bamlst) > Error in value[[3L]](cond) : > failed to open BamFile: SAM/BAM header missing or empty > file: 'C:/Users/mlinan/Desktop/file.bam' > In addition: Warning messages: > 1: In doTryCatch(return(expr), name, parentenv, handler) : > [bam_header_read] EOF marker is absent. The input is probably truncated. > 2: In doTryCatch(return(expr), name, parentenv, handler) : > [bam_header_read] invalid BAM binary header (this is not a BAM file). >> SE > Error: object 'SE' not found > > Maybe we can troubleshoot with a simpler use of the Rsamtools package, which is used by countReadsForDEXSeq(). Could you try this code: file <- "C:/Users/mlinan/Desktop/file.bam" library(Rsamtools) which <- GRanges("chr1",IRanges(10e6,width=1000)) # assuming you are using 'chr' style coordinates what <- c("rname", "strand", "pos", "qwidth", "seq") param <- ScanBamParam(which=which, what=what) bam <- scanBam(file, param=param) str(bam) best, Mike
Hi Michael, I tried the code but got the exact same errors as before. Thanks, Margaret Linan On Tue, May 28, 2013 at 1:52 PM, Michael Love <michaelisaiahlove@gmail.com>wrote: > hi Margaret, > > On Tue, May 28, 2013 at 10:26 PM, Margaret Linan <mlinan@asu.edu> wrote: > > > > However I ran into a problem with my file.bam see below: > > > >> library("DEXSeq") > >> exonicParts <- prepareAnnotationForDEXSeq(hse, aggregateGenes=TRUE) > >> bamDir<-file.path("C:/Users/mlinan/Desktop") > >> fls <- list.files(bamDir, pattern="bam$", full=TRUE) > >> library("Rsamtools") > > Loading required package: Biostrings > > > >> bamlst <- BamFileList(fls) > >> library("DEXSeq") > > > >> SE <- countReadsForDEXSeq(exonicParts, bamlst) > > Error in value[[3L]](cond) : > > failed to open BamFile: SAM/BAM header missing or empty > > file: 'C:/Users/mlinan/Desktop/file.bam' > > In addition: Warning messages: > > 1: In doTryCatch(return(expr), name, parentenv, handler) : > > [bam_header_read] EOF marker is absent. The input is probably > truncated. > > 2: In doTryCatch(return(expr), name, parentenv, handler) : > > [bam_header_read] invalid BAM binary header (this is not a BAM file). > >> SE > > Error: object 'SE' not found > > > > > > Maybe we can troubleshoot with a simpler use of the Rsamtools package, > which is used by countReadsForDEXSeq(). > > Could you try this code: > > file <- "C:/Users/mlinan/Desktop/file.bam" > library(Rsamtools) > which <- GRanges("chr1",IRanges(10e6,width=1000)) # assuming you are > using 'chr' style coordinates > what <- c("rname", "strand", "pos", "qwidth", "seq") > param <- ScanBamParam(which=which, what=what) > bam <- scanBam(file, param=param) > str(bam) > > best, > > Mike > [[alternative HTML version deleted]] ADD REPLYlink written 6.5 years ago by Bio152150 Hi Margaret, The error says that the header of your bam files is missing. Do you have any output if you do: samtools view -H yourbamfile.bam ? Alejandro > Hi Michael, > > I tried the code but got the exact same errors as before. > > Thanks, > Margaret Linan > > > On Tue, May 28, 2013 at 1:52 PM, Michael Love > <michaelisaiahlove at="" gmail.com="">wrote: > >> hi Margaret, >> >> On Tue, May 28, 2013 at 10:26 PM, Margaret Linan <mlinan at="" asu.edu=""> wrote: >>> However I ran into a problem with my file.bam see below: >>> >>>> library("DEXSeq") >>>> exonicParts <- prepareAnnotationForDEXSeq(hse, aggregateGenes=TRUE) >>>> bamDir<-file.path("C:/Users/mlinan/Desktop") >>>> fls <- list.files(bamDir, pattern="bam$", full=TRUE) >>>> library("Rsamtools") >>> Loading required package: Biostrings >>> >>>> bamlst <- BamFileList(fls) >>>> library("DEXSeq") >>>> SE <- countReadsForDEXSeq(exonicParts, bamlst) >>> Error in value[[3L]](cond) : >>> failed to open BamFile: SAM/BAM header missing or empty >>> file: 'C:/Users/mlinan/Desktop/file.bam' >>> In addition: Warning messages: >>> 1: In doTryCatch(return(expr), name, parentenv, handler) : >>> [bam_header_read] EOF marker is absent. The input is probably >> truncated. >>> 2: In doTryCatch(return(expr), name, parentenv, handler) : >>> [bam_header_read] invalid BAM binary header (this is not a BAM file). >>>> SE >>> Error: object 'SE' not found >>> >>> >> Maybe we can troubleshoot with a simpler use of the Rsamtools package, >> which is used by countReadsForDEXSeq(). >> >> Could you try this code: >> >> file <- "C:/Users/mlinan/Desktop/file.bam" >> library(Rsamtools) >> which <- GRanges("chr1",IRanges(10e6,width=1000)) # assuming you are >> using 'chr' style coordinates >> what <- c("rname", "strand", "pos", "qwidth", "seq") >> param <- ScanBamParam(which=which, what=what) >> bam <- scanBam(file, param=param) >> str(bam) >> >> best, >> >> Mike >> > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor