Dear all,
I am a new user of DEXSeq and I would need some help concerning an error message that I get.
When I call the DEXSeqDataSetFromHTSeq, I obtain the following error:
> library(DEXSeq) > dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData=sampleTable, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile) Error in strsplit(rownames(dcounts), ":") : non-character argument
Here is what countFiles and sampleTable look like:
> countFiles
[1] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_1.txt"
[2] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_2.txt"
[3] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_3.txt"
[4] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_1.txt"
[5] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_2.txt"
[6] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_3.txt"
[7] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_1.txt"
[8] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_2.txt"
[9] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_3.txt"
[10] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_4.txt"
[11] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_1.txt"
[12] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_2.txt"
[13] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_4.txt"
[14] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_1.txt"
[15] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_4.txt"
[16] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_5.txt"
[17] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_6.txt"
[18] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_10_1.txt"
[19] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_10_2.txt"
[20] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_1.txt"
[21] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_2.txt"
[22] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_3.txt"
[23] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_1.txt"
[24] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_2.txt"
[25] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_3.txt"
[26] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_1.txt"
[27] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_2.txt"
[28] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_3.txt"
> sampleTable
condition libType
basal1 basal single-end
basal2 basal single-end
basal3 basal single-end
bulge1 bulge single-end
bulge2 bulge single-end
bulge3 bulge single-end
cdh31 cdh3 single-end
cdh32 cdh3 single-end
cdh33 cdh3 single-end
cdh34 cdh3 single-end
gli11 gli1 single-end
gli12 gli1 single-end
gli14 gli1 single-end
lef11 lef1 single-end
lef14 lef1 single-end
lef15 lef1 single-end
lef16 lef1 single-end
lgr6101 lgr6 single-end
lgr6102 lgr6 single-end
lgr6401 lgr6 single-end
lgr6402 lgr6 single-end
lgr6403 lgr6 single-end
pdgfra101 pdgfra single-end
pdgfra102 pdgfra single-end
pdgfra103 pdgfra single-end
pdgfra401 pdgfra single-end
pdgfra402 pdgfra single-end
pdgfra403 pdgfra single-end
I guess this is correct. However, when looking at the count files, I have the impression that the .txt look weird. Here is the first one (sample: basal_1) as an example. This is all that is contained in the file:
_ambiguous 0
_ambiguous_readpair_position 0
_empty 9651846
_lowaqual 846199
_notaligned 421258
I created this file with this command (and successfully installed pysam before):
$ python /Library/Frameworks/R.framework/Versions/3.2/Resources/library/DEXSeq/python_scripts/dexseq_count.py -f bam mm10.genes.gff Sample_Basal_1_159/accepted_hits_no_rRNA.bam counts/basal_1.txt
Is this output normal with dexseq_count.py? If not, does anyone know what could be the problem? I tried with different options, but as my data are not paired-end and strand specific, I should use it like this.
Here is my sessionInfo() output:
> sessionInfo() R version 3.2.3 (2015-12-10) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.11.4 (El Capitan) locale: [1] C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DEXSeq_1.16.10 DESeq2_1.10.1 [3] RcppArmadillo_0.6.700.3.0 Rcpp_0.12.4 [5] SummarizedExperiment_1.0.2 GenomicRanges_1.22.4 [7] GenomeInfoDb_1.6.3 IRanges_2.4.8 [9] S4Vectors_0.8.11 Biobase_2.30.0 [11] BiocGenerics_0.16.1 BiocParallel_1.4.3 loaded via a namespace (and not attached): [1] genefilter_1.52.1 statmod_1.4.24 locfit_1.5-9.1 [4] splines_3.2.3 lattice_0.20-33 colorspace_1.2-6 [7] survival_2.39-2 XML_3.98-1.4 foreign_0.8-66 [10] DBI_0.3.1 RColorBrewer_1.1-2 lambda.r_1.1.7 [13] plyr_1.8.3 stringr_1.0.0 zlibbioc_1.16.0 [16] Biostrings_2.38.4 munsell_0.4.3 gtable_0.2.0 [19] futile.logger_1.4.1 hwriter_1.3.2 latticeExtra_0.6-28 [22] geneplotter_1.48.0 biomaRt_2.26.1 AnnotationDbi_1.32.3 [25] acepack_1.3-3.3 xtable_1.8-2 scales_0.4.0 [28] Hmisc_3.17-3 annotate_1.48.0 XVector_0.10.0 [31] Rsamtools_1.22.0 gridExtra_2.2.1 ggplot2_2.1.0 [34] stringi_1.0-1 grid_3.2.3 tools_3.2.3 [37] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.8 [40] RSQLite_1.0.0 Formula_1.2-1 cluster_2.0.4 [43] futile.options_1.0.0 Matrix_1.2-5 rsconnect_0.4.2.1 [46] rpart_4.1-10 nnet_7.3-12

Hi Sakura,
Can you post the first lines of your annotation file?
Alejandro
Hello, of course, here they are:
chr1 unknown gene 3214482 3671498 . - . ID=Xkr4;Name=NM_001011874
chr1 unknown exon 3214482 3216968 . - . Parent=Xkr4
chr1 unknown stop_codon 3216022 3216024 . - . Parent=Xkr4
chr1 unknown CDS 3216025 3216968 . - 2 Parent=Xkr4
chr1 unknown CDS 3421702 3421901 . - 1 Parent=Xkr4
chr1 unknown exon 3421702 3421901 . - . Parent=Xkr4
chr1 unknown CDS 3670552 3671348 . - 0 Parent=Xkr4
chr1 unknown exon 3670552 3671498 . - . Parent=Xkr4
chr1 unknown start_codon 3671346 3671348 . - . Parent=Xkr4
chr1 unknown gene 4290846 4293012 . - . ID=Rp1;Name=NM_001195662
Hi Sakura,
You file is not a gtf file, it is a gff file. You need a gtf file, as the ones provided by ensembl, and preprocess it using the script dexseq_prepare_annotation.py. Please read the DEXSeq vignette for more details.
Alejandro
Yes, sorry you are right, this was a gff file that I obtained previously by converting the gtf.
So I realised the problem happened when preprocessing the gtf. Here are the 10 first lines of gtf file:
chr1 unknown exon 3214482 3216968 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown stop_codon 3216022 3216024 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown CDS 3216025 3216968 . - 2 gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown CDS 3421702 3421901 . - 1 gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown exon 3421702 3421901 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown CDS 3670552 3671348 . - 0 gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown exon 3670552 3671498 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown start_codon 3671346 3671348 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown exon 4290846 4293012 . - . gene_id "Rp1"; gene_name "Rp1"; p_id "P16188"; transcript_id "NM_001195662"; tss_id "TSS5760";
chr1 unknown stop_codon 4292981 4292983 . - . gene_id "Rp1"; gene_name "Rp1"; p_id "P16188"; transcript_id "NM_001195662"; tss_id "TSS5760";
When I run the script like this:
The output gives an empty gff file. That is why my count file is also only with empty counts...
Hi Sakura,
Could you send me your gtf file so I could have a closer look?
Alejandro
Here it is:
https://www.dropbox.com/s/m2zhchpc6vf5nmo/mm10.genes.gtf?dl=0
Hi Sakura,
I did get an error message while running your same line of code:
Traceback (most recent call last):File "/Users/reyes/Work/Rpcks/DEXSeq/inst/python_scripts/dexseq_prepare_annotation.py", line 129, in <module>
raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )
ValueError: Same gene name found on two chromosomes: <GenomicFeature: exonic_part 'Eno1+Gm5506' at chr18: 48046732 -> 48048383 (strand '+')>, <GenomicFeature: exonic_part 'Eno1+Gm5506' at chr4: 150237196 -> 150237323 (strand '+')>
The error message should be self-explanatory.
Alejandro
Hello,
Thanks for the reply, I reinstalled DEXSeq and indeed obtained the same error, as it happened to me a few days ago. At this point I had tried to follow the advices given here: DEXSeq: problem with dexseq_prepare_annotation.py and this started to give me empty files. But now that I tried again, it seems to be better, so it was probably an error that came from the first time I tried to do modifications to the python script.
Thank you again for your help!!
Sakura