DEXSeq error: Count files do not correspond to the flattened annotation file

Darwin Sorento Dichmann ▴ 50

@darwin-sorento-dichmann-5702

Last seen 11.4 years ago

Greetings,

I created the count files using dexseq_count.py according to the instructions. However, when I attempt to create an ExonCountSet using read.HTSeqCounts, I get the following error:

---
Error in read.HTSeqCounts(countfiles = file.path(inDir, countfiles), design = tra2bdata, :
  Count files do not correspond to the flattened annotation file
---

I *know* that the ecs files correspond to the flat GFF, and I have rerun that step thrice. So I assume that the error message is really about something else, and I would appreciate any help in getting to the bottom of it.

FWIW, DEXSeq works on the same data set, but with a slightly different flat GFF (in the one that gives me trouble, I have changed the gene names). The GFF is flattened from a Cufflinks GTF.

This is the minimal code that reproduces the error:

    # R version 2.15.2 (2012-10-26) -- "Trick or Treat"

    # Load DEXSeq and setwd.
    source("http://bioconductor.org/biocLite.R")
    biocLite("DEXSeq")
    library(DEXSeq)
    setwd("~/Dropbox/DEXSeq_tra2b_Named/")

    # Create data frame of tra2b experimental setup:
    sampleID <- c("CTR1", "CTR2", "CTR3", "TRA1", "TRA2", "TRA3")
    condition <- c(rep("uninjected", 3), rep("tra2bMO2", 3))
    type <- c(rep("paired-end", 6))
    stage <- c(rep("14", 6))
    tra2bdata <- data.frame(condition, type, stage, row.names = sampleID)

    # Put paths to files into variables.
    inDir <- file.path("~/Dropbox/DEXSeq_tra2b_Named/")  # Directory to files.
    countfiles <- list.files(inDir, pattern = ".txt")    # Path to exon count files from HT-Seq.
    annotationTra <- file.path("~/Dropbox/DEXSeq_tra2b_Named/flat_frog_q1_fixed_v2.gff")

    # Create the ExonCountSet. This is where things go wrong.
    traExons <- read.HTSeqCounts(countfiles = file.path(inDir, countfiles),
                                 design = tra2bdata,
                                 flattenedfile = annotationTra)

---
Error in read.HTSeqCounts(countfiles = file.path(inDir, countfiles), design = tra2bdata, :
  Count files do not correspond to the flattened annotation file
---

Any help is greatly appreciated.

Best,
Darwin

________________________________
Darwin Sorento Dichmann, M.S., PhD
University of California, Berkeley
Harland Lab
Molecular and Cell Biology
571 Life Sciences Addition
Berkeley, CA 94720
Phone# (510) 643-7830
E-mail: dichmann at berkeley.edu
Please send Fedex packages to: 163 Life Sciences Addition, attn: Harland lab room 571

Annotation GO DEXSeq • 3.3k views

updated 12.7 years ago by Simon Anders ★ 3.8k • written 12.7 years ago by Darwin Sorento Dichmann ▴ 50

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 5.5 years ago

Zentrum für Molekularbiologie, Universi…

Hi Darwin

> FWIW, DEXSeq works on the same data set, but with a slightly
> different flat GFF (in the one that gives me trouble, I have changed
> the gene names). The GFF is flattened from a Cufflinks GTF.

Wait, what do you mean by "changed the gene names"?

Each count file has two columns, one with the gene IDs, the other with the counts. The flattened GFF file also contains the gene ID, and to make sure that count files and GFF file fit together, DEXSeq compares the gene IDs and complains if they don't match.

Simon

written 12.7 years ago by Simon Anders ★ 3.8k
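One way to find out which IDs fail the comparison described above is to repeat it by hand. The following is a minimal sketch in base R, not DEXSeq's own implementation. The helpers `gff_exon_ids`, `count_ids` and `gff_row` are hypothetical names introduced here. The sketch assumes tab-separated files, count-file IDs of the form `gene:exon` as in the excerpts in this thread, and that count-file rows starting with `_` are summary counters rather than exons.

```r
# Hypothetical helper: build "gene_id:exonic_part_number" IDs from the
# exonic_part lines of a flattened GFF (tab-separated, 9 columns assumed).
gff_exon_ids <- function(gff_lines) {
  fields <- strsplit(gff_lines, "\t", fixed = TRUE)
  is_exon <- vapply(fields,
                    function(f) length(f) >= 9 && f[3] == "exonic_part",
                    logical(1))
  attrs <- vapply(fields[is_exon], function(f) f[9], character(1))
  gene <- sub('.*gene_id "([^"]*)".*', "\\1", attrs)
  part <- sub('.*exonic_part_number "([^"]*)".*', "\\1", attrs)
  paste(gene, part, sep = ":")
}

# Hypothetical helper: first column of a count file. Rows whose ID starts
# with "_" are dropped on the assumption that they are summary counters.
count_ids <- function(count_lines) {
  ids <- sub("\t.*$", "", count_lines)
  ids[!grepl("^_", ids)]
}

# Hypothetical helper that assembles one tab-separated GFF line.
gff_row <- function(type, start, end, attrs) {
  paste("scaffold_5", "merged_stranded_q1.GFF", type, start, end,
        ".", "-", ".", attrs, sep = "\t")
}

# Lines taken from the excerpts in this thread. With real files, use
# readLines() on the count file and on the flattened GFF instead.
gff <- c(
  gff_row("aggregate_gene", "101514148", "101526061", 'gene_id "bmp2"'),
  gff_row("exonic_part", "101514148", "101516034",
          'transcripts "TCONS_00042267"; exonic_part_number "001"; gene_id "bmp2"'),
  gff_row("exonic_part", "101522614", "101522972",
          'transcripts "TCONS_00042267"; exonic_part_number "002"; gene_id "bmp2"'),
  gff_row("exonic_part", "101525481", "101526061",
          'transcripts "TCONS_00042267"; exonic_part_number "003"; gene_id "bmp2"')
)
counts <- c("bmp2:001\t160", "bmp2:002\t45", "bmp2:003\t47")

# IDs present on one side only are the candidates for the mismatch.
only_in_counts <- setdiff(count_ids(counts), gff_exon_ids(gff))
only_in_gff <- setdiff(gff_exon_ids(gff), count_ids(counts))
print(only_in_counts)
print(only_in_gff)
```

Both differences are empty for the three bmp2 exons quoted in this thread, so any offending IDs would have to be elsewhere in the files. Running the same comparison on the complete files lists them directly.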

Hi Simon,

I have not changed the count files (except their file names) or the flattened GFF after running dexseq_count.py. For example:

---
lsa-579-005:DEXSeq_tra2b_Named darwin$ grep 'bmp2\b' CTR1.txt
bmp2:001 160
bmp2:002 45
bmp2:003 47
lsa-579-005:DEXSeq_tra2b_Named darwin$ grep 'bmp2\b' flat_frog_q1_fixed_v2.gff
scaffold_5 merged_stranded_q1.GFF aggregate_gene 101514148 101526061 . - . gene_id "bmp2"
scaffold_5 merged_stranded_q1.GFF exonic_part 101514148 101516034 . - . transcripts "TCONS_00042267"; exonic_part_number "001"; gene_id "bmp2"
scaffold_5 merged_stranded_q1.GFF exonic_part 101522614 101522972 . - . transcripts "TCONS_00042267"; exonic_part_number "002"; gene_id "bmp2"
scaffold_5 merged_stranded_q1.GFF exonic_part 101525481 101526061 . - . transcripts "TCONS_00042267"; exonic_part_number "003"; gene_id "bmp2"
---

What I tried to say in the passage you quote is that, using the same data set (same aligned reads) and a very similar flat GFF, I can run DEXSeq without problems. In that case I took the combined.GTF from Cufflinks and flattened it. However, in that case the gene_name from the Cufflinks GTF is lost, and gene_id is the not very readable "XLOC" ID that Cufflinks uses to keep track of genes:

---
dichmann@genepool01:/global/projectb/scratch/dichmann/HTSeq_frog_q1$ grep 'XLOC_021009' flat_frog_q1.gff merged.gtf
flat_frog_q1.gff:scaffold_5 merged_stranded_q1.GFF aggregate_gene 101514148 101526061 . - . gene_id "XLOC_021009"
flat_frog_q1.gff:scaffold_5 merged_stranded_q1.GFF exonic_part 101514148 101516034 . - . transcripts "TCONS_00042267"; exonic_part_number "001"; gene_id "XLOC_021009"
flat_frog_q1.gff:scaffold_5 merged_stranded_q1.GFF exonic_part 101522614 101522972 . - . transcripts "TCONS_00042267"; exonic_part_number "002"; gene_id "XLOC_021009"
flat_frog_q1.gff:scaffold_5 merged_stranded_q1.GFF exonic_part 101525481 101526061 . - . transcripts "TCONS_00042267"; exonic_part_number "003"; gene_id "XLOC_021009"
merged.gtf:scaffold_5 Cufflinks exon 101514148 101516034 . - . gene_id "XLOC_021009"; transcript_id "TCONS_00042267"; exon_number "1"; gene_name "bmp2"; oId "PAC:20701023"; nearest_ref "PAC:20701023"; class_code "="; tss_id "TSS27823"; p_id "P24249";
merged.gtf:scaffold_5 Cufflinks exon 101522614 101522972 . - . gene_id "XLOC_021009"; transcript_id "TCONS_00042267"; exon_number "2"; gene_name "bmp2"; oId "PAC:20701023"; nearest_ref "PAC:20701023"; class_code "="; tss_id "TSS27823"; p_id "P24249";
merged.gtf:scaffold_5 Cufflinks exon 101525481 101526061 . - . gene_id "XLOC_021009"; transcript_id "TCONS_00042267"; exon_number "3"; gene_name "bmp2"; oId "PAC:20701023"; nearest_ref "PAC:20701023"; class_code "="; tss_id "TSS27823"; p_id "P24249";
---

So I changed the flattened GFF so that it has the human-readable gene_name instead of the gene_id where possible, and ran dexseq_count.py using that flat GFF. I hope that makes sense. Btw, I used HTSeq-0.5.4p3.

I guess I could hack it and add the gene names to the final DEXSeq output, but since I see myself using this package a lot, I would like to figure out what is going on.

Cheers,
Darwin

On May 24, 2013, at 1:05 PM, Simon Anders <anders@embl.de> wrote:

> Hi Darwin
>
>> FWIW, DEXSeq works on the same data set, but with a slightly
>> different flat GFF (in the one that gives me trouble, I have changed
>> the gene names). The GFF is flattened from a Cufflinks GTF.
>
> Wait, what do you mean by "changed the gene names"?
>
> Each count file has two columns, one with the gene IDs, the other with the counts. The flattened GFF file also contains the gene ID, and to make sure that count files and GFF file fit together, DEXSeq compares the gene IDs and complains if they don't match.
>
> Simon

written 12.7 years ago by Darwin Sorento Dichmann ▴ 50
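Replacing gene_id with gene_name, as described above, can only give unique exon IDs if every gene name belongs to exactly one XLOC locus. Whether that holds for this data set is not stated in the thread, so the following is only a sketch of how one might check it in base R. The helpers `gtf_attr` and `shared_gene_names` are hypothetical names, and the second input line is invented purely to show what a collision would look like.

```r
# Hypothetical helper: pull one attribute value out of GTF attribute strings;
# returns NA where the attribute is absent.
gtf_attr <- function(attrs, key) {
  pattern <- paste0('.*\\b', key, ' "([^"]*)".*')
  found <- grepl(pattern, attrs, perl = TRUE)
  out <- rep(NA_character_, length(attrs))
  out[found] <- sub(pattern, "\\1", attrs[found], perl = TRUE)
  out
}

# Hypothetical helper: gene names that are attached to more than one gene_id
# in a tab-separated GTF (attributes assumed to be in column 9).
shared_gene_names <- function(gtf_lines) {
  attrs <- vapply(strsplit(gtf_lines, "\t", fixed = TRUE),
                  function(f) f[9], character(1))
  map <- unique(data.frame(gene_id = gtf_attr(attrs, "gene_id"),
                           gene_name = gtf_attr(attrs, "gene_name"),
                           stringsAsFactors = FALSE))
  map <- map[!is.na(map$gene_name), ]
  n_loci <- table(map$gene_name)
  names(n_loci)[n_loci > 1]
}

gtf <- c(
  # Shortened version of a merged.gtf line quoted in this thread.
  paste("scaffold_5", "Cufflinks", "exon", "101514148", "101516034", ".", "-", ".",
        'gene_id "XLOC_021009"; transcript_id "TCONS_00042267"; exon_number "1"; gene_name "bmp2";',
        sep = "\t"),
  # Invented line, for illustration only: a second locus with the same name.
  paste("scaffold_9", "Cufflinks", "exon", "100", "200", ".", "+", ".",
        'gene_id "XLOC_999999"; transcript_id "TCONS_99999999"; exon_number "1"; gene_name "bmp2";',
        sep = "\t")
)

# With a real file, pass readLines("merged.gtf") instead of the toy input.
print(shared_gene_names(gtf))
```

Any name this reports would yield exon IDs such as `name:001` more than once after renaming, which is one plausible (but unconfirmed) way for count files and annotation to stop matching.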
