Question

Count files do not correspond to the flattened annotation file

so2346 • 0

Hi,

I'm getting the error in the subject line using DEXSeq version 1.12.1, explained in detail below. The files were preprocessed and the count data generated using the two Python scripts provided with the package.

My annotation started out as the iGenomes UCSC mouse mm10 GTF, which I flattened using dexseq_prepare_annotation.py. (I know you recommend Ensembl, but I would have had to realign all my BAM files.) I then generated the count files using dexseq_count.py -p 'yes' -s 'no' -f 'bam'.

I then removed the 5 lines of 'unmapped' info from all count files:
_ambiguous   0
_ambiguous_readpair_position   0
_empty   32653
_lowaqual   0
_notaligned   0

Then I did a count:

wc -l count_file = 216656

grep -c "exonic_part" flattened_file = 216656

It seemed OK, I thought.
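
A matching line count does not by itself show that the IDs in the two files match. Below is a minimal Python sketch of an ID-level comparison. It assumes the count files hold tab-separated rows whose first column looks like gene_id:exon_number, and that the exonic_part lines of the flattened file carry gene_id and exonic_part_number attributes. The helper names and file paths are hypothetical, and this is not DEXSeq's own check, only a way to list candidate mismatches.

```python
import re

# Attribute patterns assumed for the flattened annotation's 9th column.
GENE_RE = re.compile(r'gene_id "([^"]+)"')
PART_RE = re.compile(r'exonic_part_number "([^"]+)"')


def count_ids(lines):
    """Collect exon-bin IDs from count-file lines, skipping '_' summary rows."""
    ids = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("_"):
            continue
        ids.add(line.split("\t")[0])
    return ids


def annotation_ids(lines):
    """Build gene_id:exonic_part_number IDs from the exonic_part lines."""
    ids = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exonic_part":
            continue
        gene = GENE_RE.search(fields[8])
        part = PART_RE.search(fields[8])
        if gene and part:
            ids.add(f"{gene.group(1)}:{part.group(1)}")
    return ids


def unmatched_ids(count_lines, gff_lines):
    """Return count-file IDs that have no counterpart in the annotation."""
    return sorted(count_ids(count_lines) - annotation_ids(gff_lines))


if __name__ == "__main__":
    # Hypothetical paths; replace with one count file and the flattened file.
    try:
        with open("count_file") as counts, open("flattened_file") as gff:
            missing = unmatched_ids(counts, gff)
        print(len(missing), "count-file IDs not found in the annotation")
        print(missing[:10])
    except FileNotFoundError:
        print("Set the two paths above to real files before running.")
```

If the list comes back empty, the IDs agree under these assumptions and the cause lies elsewhere; if not, the first few unmatched IDs usually show how the two naming schemes differ.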

I then ran the following code in R to generate the error:
dxd = DEXSeqDataSetFromHTSeq(
    countsFiles,
    sampleData = sample_names,
    design = ~ sample + exon + condition:exon,
    flattenedfile = flattened_gtf )

Error in DEXSeqDataSetFromHTSeq(countsFiles, sampleData = sample_names, :
Count files do not correspond to the flattened annotation file

sample_names is a 2-column data.frame of samples and a condition for each sample.

I'd appreciate any help you might provide other than to realign using Ensembl and use their gtf :)

Thanks & A happy new year to all,
Sean.
=================================================================

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] DEXSeq_1.12.1 BiocParallel_1.0.0 DESeq2_1.6.3 RcppArmadillo_0.4.550.1.0 Rcpp_0.11.3
[6] GenomicRanges_1.18.3 GenomeInfoDb_1.2.4 IRanges_2.0.1 S4Vectors_0.4.0 Biobase_2.26.0
[11] BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1 base64enc_0.1-2 BatchJobs_1.5 BBmisc_1

score 0 · Answer 1 · 2015-01-19

Alejandro Reyes ★ 1.9k

Hi,

There is no need to delete those 5 lines manually; the function "DEXSeqDataSetFromHTSeq" removes them automatically.

Alejandro
