Dear all,
I am a new user of DEXSeq and I would need some help concerning an error message that I get.
When I call the DEXSeqDataSetFromHTSeq, I obtain the following error:
> library(DEXSeq) > dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData=sampleTable, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile) Error in strsplit(rownames(dcounts), ":") : non-character argument
Here is what countFiles and sampleTable look like:
> countFiles [1] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_1.txt" [2] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_2.txt" [3] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_3.txt" [4] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_1.txt" [5] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_2.txt" [6] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_3.txt" [7] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_1.txt" [8] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_2.txt" [9] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_3.txt" [10] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_4.txt" [11] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_1.txt" [12] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_2.txt" [13] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_4.txt" [14] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_1.txt" [15] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_4.txt" [16] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_5.txt" [17] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_6.txt" [18] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_10_1.txt" [19] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_10_2.txt" [20] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_1.txt" [21] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_2.txt" [22] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_3.txt" [23] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_1.txt" [24] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_2.txt" [25] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_3.txt" [26] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_1.txt" [27] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_2.txt" [28] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_3.txt" > sampleTable condition libType basal1 basal single-end basal2 basal single-end basal3 basal single-end bulge1 bulge single-end bulge2 bulge single-end bulge3 bulge single-end cdh31 cdh3 single-end cdh32 cdh3 single-end cdh33 cdh3 single-end cdh34 cdh3 single-end gli11 gli1 single-end gli12 gli1 single-end gli14 gli1 single-end lef11 lef1 single-end lef14 lef1 single-end lef15 lef1 single-end lef16 lef1 single-end lgr6101 lgr6 single-end lgr6102 lgr6 single-end lgr6401 lgr6 single-end lgr6402 lgr6 single-end lgr6403 lgr6 single-end pdgfra101 pdgfra single-end pdgfra102 pdgfra single-end pdgfra103 pdgfra single-end pdgfra401 pdgfra single-end pdgfra402 pdgfra single-end pdgfra403 pdgfra single-end
I guess this is correct. However, when looking at the count files, I have the impression that the .txt look weird. Here is the first one (sample: basal_1) as an example. This is all that is contained in the file:
_ambiguous 0
_ambiguous_readpair_position 0
_empty 9651846
_lowaqual 846199
_notaligned 421258
I created this file with this command (and successfully installed pysam before):
$ python /Library/Frameworks/R.framework/Versions/3.2/Resources/library/DEXSeq/python_scripts/dexseq_count.py -f bam mm10.genes.gff Sample_Basal_1_159/accepted_hits_no_rRNA.bam counts/basal_1.txt
Is this output normal with dexseq_count.py? If not, does anyone know what could be the problem? I tried with different options, but as my data are not paired-end and strand specific, I should use it like this.
Here is my sessionInfo() output:
> sessionInfo() R version 3.2.3 (2015-12-10) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X 10.11.4 (El Capitan) locale: [1] C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DEXSeq_1.16.10 DESeq2_1.10.1 [3] RcppArmadillo_0.6.700.3.0 Rcpp_0.12.4 [5] SummarizedExperiment_1.0.2 GenomicRanges_1.22.4 [7] GenomeInfoDb_1.6.3 IRanges_2.4.8 [9] S4Vectors_0.8.11 Biobase_2.30.0 [11] BiocGenerics_0.16.1 BiocParallel_1.4.3 loaded via a namespace (and not attached): [1] genefilter_1.52.1 statmod_1.4.24 locfit_1.5-9.1 [4] splines_3.2.3 lattice_0.20-33 colorspace_1.2-6 [7] survival_2.39-2 XML_3.98-1.4 foreign_0.8-66 [10] DBI_0.3.1 RColorBrewer_1.1-2 lambda.r_1.1.7 [13] plyr_1.8.3 stringr_1.0.0 zlibbioc_1.16.0 [16] Biostrings_2.38.4 munsell_0.4.3 gtable_0.2.0 [19] futile.logger_1.4.1 hwriter_1.3.2 latticeExtra_0.6-28 [22] geneplotter_1.48.0 biomaRt_2.26.1 AnnotationDbi_1.32.3 [25] acepack_1.3-3.3 xtable_1.8-2 scales_0.4.0 [28] Hmisc_3.17-3 annotate_1.48.0 XVector_0.10.0 [31] Rsamtools_1.22.0 gridExtra_2.2.1 ggplot2_2.1.0 [34] stringi_1.0-1 grid_3.2.3 tools_3.2.3 [37] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.8 [40] RSQLite_1.0.0 Formula_1.2-1 cluster_2.0.4 [43] futile.options_1.0.0 Matrix_1.2-5 rsconnect_0.4.2.1 [46] rpart_4.1-10 nnet_7.3-12
Hi Sakura,
Can you post the first lines of your annotation file?
Alejandro
Hello, of course, here they are:
chr1 unknown gene 3214482 3671498 . - . ID=Xkr4;Name=NM_001011874
chr1 unknown exon 3214482 3216968 . - . Parent=Xkr4
chr1 unknown stop_codon 3216022 3216024 . - . Parent=Xkr4
chr1 unknown CDS 3216025 3216968 . - 2 Parent=Xkr4
chr1 unknown CDS 3421702 3421901 . - 1 Parent=Xkr4
chr1 unknown exon 3421702 3421901 . - . Parent=Xkr4
chr1 unknown CDS 3670552 3671348 . - 0 Parent=Xkr4
chr1 unknown exon 3670552 3671498 . - . Parent=Xkr4
chr1 unknown start_codon 3671346 3671348 . - . Parent=Xkr4
chr1 unknown gene 4290846 4293012 . - . ID=Rp1;Name=NM_001195662
Hi Sakura,
You file is not a gtf file, it is a gff file. You need a gtf file, as the ones provided by ensembl, and preprocess it using the script dexseq_prepare_annotation.py. Please read the DEXSeq vignette for more details.
Alejandro
Yes, sorry you are right, this was a gff file that I obtained previously by converting the gtf.
So I realised the problem happened when preprocessing the gtf. Here are the 10 first lines of gtf file:
chr1 unknown exon 3214482 3216968 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown stop_codon 3216022 3216024 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown CDS 3216025 3216968 . - 2 gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown CDS 3421702 3421901 . - 1 gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown exon 3421702 3421901 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown CDS 3670552 3671348 . - 0 gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown exon 3670552 3671498 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown start_codon 3671346 3671348 . - . gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1 unknown exon 4290846 4293012 . - . gene_id "Rp1"; gene_name "Rp1"; p_id "P16188"; transcript_id "NM_001195662"; tss_id "TSS5760";
chr1 unknown stop_codon 4292981 4292983 . - . gene_id "Rp1"; gene_name "Rp1"; p_id "P16188"; transcript_id "NM_001195662"; tss_id "TSS5760";
When I run the script like this:
The output gives an empty gff file. That is why my count file is also only with empty counts...
Hi Sakura,
Could you send me your gtf file so I could have a closer look?
Alejandro
Here it is:
https://www.dropbox.com/s/m2zhchpc6vf5nmo/mm10.genes.gtf?dl=0
Hi Sakura,
I did get an error message while running your same line of code:
Traceback (most recent call last):
File "/Users/reyes/Work/Rpcks/DEXSeq/inst/python_scripts/dexseq_prepare_annotation.py", line 129, in <module>
raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )
ValueError: Same gene name found on two chromosomes: <GenomicFeature: exonic_part 'Eno1+Gm5506' at chr18: 48046732 -> 48048383 (strand '+')>, <GenomicFeature: exonic_part 'Eno1+Gm5506' at chr4: 150237196 -> 150237323 (strand '+')>
The error message should be self-explanatory.
Alejandro
Hello,
Thanks for the reply, I reinstalled DEXSeq and indeed obtained the same error, as it happened to me a few days ago. At this point I had tried to follow the advices given here: DEXSeq: problem with dexseq_prepare_annotation.py and this started to give me empty files. But now that I tried again, it seems to be better, so it was probably an error that came from the first time I tried to do modifications to the python script.
Thank you again for your help!!
Sakura