Question: Error in strsplit() : non-character argument - using DEXSeq
0
gravatar for sakura.nussbaum
3.6 years ago by
sakura.nussbaum10 wrote:

Dear all,

I am a new user of DEXSeq and I would need some help concerning an error message that I get.

 

When I call the DEXSeqDataSetFromHTSeq, I obtain the following error:

> library(DEXSeq)
> dxd = DEXSeqDataSetFromHTSeq(countFiles, sampleData=sampleTable, design= ~ sample + exon + condition:exon, flattenedfile=flattenedFile)
Error in strsplit(rownames(dcounts), ":") : non-character argument

Here is what countFiles and sampleTable look like:

> countFiles
[1] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_1.txt"   
[2] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_2.txt"   
[3] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/basal_3.txt"   
[4] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_1.txt"   
[5] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_2.txt"   
[6] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/bulge_3.txt"   
[7] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_1.txt"    
[8] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_2.txt"    
[9] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_3.txt"    
[10] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/cdh3_4.txt"    
[11] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_1.txt"     
[12] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_2.txt"     
[13] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/gli_4.txt"     
[14] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_1.txt"     
[15] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_4.txt"     
[16] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_5.txt"     
[17] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lef_6.txt"     
[18] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_10_1.txt" 
[19] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_10_2.txt" 
[20] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_1.txt" 
[21] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_2.txt" 
[22] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/lgr6_40_3.txt" 
[23] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_1.txt"
[24] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_2.txt"
[25] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_10_3.txt"
[26] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_1.txt"
[27] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_2.txt"
[28] "/Volumes/TOSHIBAEXT/3.mapping-hisat/Counts/pdgfra_40_3.txt"
> sampleTable
          condition    libType
basal1        basal single-end
basal2        basal single-end
basal3        basal single-end
bulge1        bulge single-end
bulge2        bulge single-end
bulge3        bulge single-end
cdh31          cdh3 single-end
cdh32          cdh3 single-end
cdh33          cdh3 single-end
cdh34          cdh3 single-end
gli11          gli1 single-end
gli12          gli1 single-end
gli14          gli1 single-end
lef11          lef1 single-end
lef14          lef1 single-end
lef15          lef1 single-end
lef16          lef1 single-end
lgr6101        lgr6 single-end
lgr6102        lgr6 single-end
lgr6401        lgr6 single-end
lgr6402        lgr6 single-end
lgr6403        lgr6 single-end
pdgfra101    pdgfra single-end
pdgfra102    pdgfra single-end
pdgfra103    pdgfra single-end
pdgfra401    pdgfra single-end
pdgfra402    pdgfra single-end
pdgfra403    pdgfra single-end

 

I guess this is correct. However, when looking at the count files, I have the impression that the .txt look weird. Here is the first one (sample: basal_1) as an example. This is all that is contained in the file:

_ambiguous    0
_ambiguous_readpair_position    0
_empty    9651846
_lowaqual    846199
_notaligned    421258

I created this file with this command (and successfully installed pysam before):

 

$ python /Library/Frameworks/R.framework/Versions/3.2/Resources/library/DEXSeq/python_scripts/dexseq_count.py -f bam mm10.genes.gff Sample_Basal_1_159/accepted_hits_no_rRNA.bam counts/basal_1.txt



​Is this output normal with dexseq_count.py? If not, does anyone know what could be the problem? I tried with different options, but as my data are not paired-end and strand specific, I should use it like this.

Here is my sessionInfo() output:
 

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] DEXSeq_1.16.10             DESeq2_1.10.1             
 [3] RcppArmadillo_0.6.700.3.0  Rcpp_0.12.4               
 [5] SummarizedExperiment_1.0.2 GenomicRanges_1.22.4      
 [7] GenomeInfoDb_1.6.3         IRanges_2.4.8             
 [9] S4Vectors_0.8.11           Biobase_2.30.0            
[11] BiocGenerics_0.16.1        BiocParallel_1.4.3        

loaded via a namespace (and not attached):
 [1] genefilter_1.52.1    statmod_1.4.24       locfit_1.5-9.1      
 [4] splines_3.2.3        lattice_0.20-33      colorspace_1.2-6    
 [7] survival_2.39-2      XML_3.98-1.4         foreign_0.8-66      
[10] DBI_0.3.1            RColorBrewer_1.1-2   lambda.r_1.1.7      
[13] plyr_1.8.3           stringr_1.0.0        zlibbioc_1.16.0     
[16] Biostrings_2.38.4    munsell_0.4.3        gtable_0.2.0        
[19] futile.logger_1.4.1  hwriter_1.3.2        latticeExtra_0.6-28 
[22] geneplotter_1.48.0   biomaRt_2.26.1       AnnotationDbi_1.32.3
[25] acepack_1.3-3.3      xtable_1.8-2         scales_0.4.0        
[28] Hmisc_3.17-3         annotate_1.48.0      XVector_0.10.0      
[31] Rsamtools_1.22.0     gridExtra_2.2.1      ggplot2_2.1.0       
[34] stringi_1.0-1        grid_3.2.3           tools_3.2.3         
[37] bitops_1.0-6         magrittr_1.5         RCurl_1.95-4.8      
[40] RSQLite_1.0.0        Formula_1.2-1        cluster_2.0.4       
[43] futile.options_1.0.0 Matrix_1.2-5         rsconnect_0.4.2.1   
[46] rpart_4.1-10         nnet_7.3-12
ADD COMMENTlink modified 3.6 years ago by aditi20 • written 3.6 years ago by sakura.nussbaum10

Hi Sakura, 

Can you post the first lines of your annotation file?

mm10.genes.gff

Alejandro

ADD REPLYlink written 3.6 years ago by Alejandro Reyes1.7k

Hello, of course, here they are:

 

chr1    unknown    gene    3214482    3671498    .    -    .    ID=Xkr4;Name=NM_001011874
chr1    unknown    exon    3214482    3216968    .    -    .    Parent=Xkr4
chr1    unknown    stop_codon    3216022    3216024    .    -    .    Parent=Xkr4
chr1    unknown    CDS    3216025    3216968    .    -    2    Parent=Xkr4
chr1    unknown    CDS    3421702    3421901    .    -    1    Parent=Xkr4
chr1    unknown    exon    3421702    3421901    .    -    .    Parent=Xkr4
chr1    unknown    CDS    3670552    3671348    .    -    0    Parent=Xkr4
chr1    unknown    exon    3670552    3671498    .    -    .    Parent=Xkr4
chr1    unknown    start_codon    3671346    3671348    .    -    .    Parent=Xkr4
chr1    unknown    gene    4290846    4293012    .    -    .    ID=Rp1;Name=NM_001195662

ADD REPLYlink written 3.6 years ago by sakura.nussbaum10

Hi Sakura, 

You file is not a gtf file, it is a gff file. You need a gtf file, as the ones provided by ensembl, and preprocess it using the script dexseq_prepare_annotation.py. Please read the DEXSeq vignette for more details.

Alejandro

ADD REPLYlink written 3.6 years ago by Alejandro Reyes1.7k

Yes, sorry you are right, this was a gff file that I obtained previously by converting the gtf. 


So I realised the problem happened when preprocessing the gtf. Here are the 10 first lines of gtf file:

chr1    unknown    exon    3214482    3216968    .    -    .    gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1    unknown    stop_codon    3216022    3216024    .    -    .    gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1    unknown    CDS    3216025    3216968    .    -    2    gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1    unknown    CDS    3421702    3421901    .    -    1    gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1    unknown    exon    3421702    3421901    .    -    .    gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1    unknown    CDS    3670552    3671348    .    -    0    gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1    unknown    exon    3670552    3671498    .    -    .    gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1    unknown    start_codon    3671346    3671348    .    -    .    gene_id "Xkr4"; gene_name "Xkr4"; p_id "P14345"; transcript_id "NM_001011874"; tss_id "TSS25485";
chr1    unknown    exon    4290846    4293012    .    -    .    gene_id "Rp1"; gene_name "Rp1"; p_id "P16188"; transcript_id "NM_001195662"; tss_id "TSS5760";
chr1    unknown    stop_codon    4292981    4292983    .    -    .    gene_id "Rp1"; gene_name "Rp1"; p_id "P16188"; transcript_id "NM_001195662"; tss_id "TSS5760";

 


When I run the script like this: 
 

 

python /Library/Frameworks/R.framework/Versions/3.2/Resources/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py mm10.genes.gtf mm10.genes.gff
​


The output gives an empty gff file. That is why my count file is also only with empty counts...

ADD REPLYlink written 3.6 years ago by sakura.nussbaum10

Hi Sakura, 

Could you send me your gtf file so I could have a closer look?

Alejandro

ADD REPLYlink written 3.6 years ago by Alejandro Reyes1.7k

Here it is:

https://www.dropbox.com/s/m2zhchpc6vf5nmo/mm10.genes.gtf?dl=0

ADD REPLYlink written 3.6 years ago by sakura.nussbaum10

Hi Sakura, 

I did get an error message while running your same line of code:

Traceback (most recent call last):
  File "/Users/reyes/Work/Rpcks/DEXSeq/inst/python_scripts/dexseq_prepare_annotation.py", line 129, in <module>
    raise ValueError, "Same name found on two chromosomes: %s, %s" % ( str(l[i]), str(l[i+1]) )
ValueError: Same gene name found on two chromosomes: <GenomicFeature: exonic_part 'Eno1+Gm5506' at chr18: 48046732 -> 48048383 (strand '+')>, <GenomicFeature: exonic_part 'Eno1+Gm5506' at chr4: 150237196 -> 150237323 (strand '+')>

The error message should be self-explanatory.

Alejandro

 

ADD REPLYlink written 3.6 years ago by Alejandro Reyes1.7k

Hello,

Thanks for the reply, I reinstalled DEXSeq and indeed obtained the same error, as it happened to me a few days ago. At this point I had tried to follow the advices given here: DEXSeq: problem with dexseq_prepare_annotation.py and this started to give me empty files. But now that I tried again, it seems to be better, so it was probably an error that came from the first time I tried to do modifications to the python script. 

Thank you again for your help!!

Sakura

ADD REPLYlink written 3.6 years ago by sakura.nussbaum10
Answer: Error in strsplit() : non-character argument - using DEXSeq
0
gravatar for aditi
3.6 years ago by
aditi20
Indian Institute of Science,Bangalore, India
aditi20 wrote:

Hi Sakura,

Your count files seem to be full of empty counts(9million seems too high). If your library is strand specific I'm guessing that you must have taken care of the defining the library type during the alignment (firststrand or secondstrand if you used tophat, that is). Also, with some library prep kits like NEB ultra-directional kit and Illumina TruSeq kits the you need to specify strandedness as 'reverse' and not just 'yes' ('yes' is the default). You can also check for the kind of strandedness using IGV. Using the wrong information on strandedness usually gives you empty counts. Also, I'm hoping that the annotation file that you used for alignment is the same as the one that you used to create your gff.

Hope this helps!

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by aditi20

Thank you for your answer!

It was indeed (one of) the problem(s). I put -s reverse and now I have 1 million empty counts. However, it still seems high. Do you think it can come from somewhere else?

Best,
Sakura

 

ADD REPLYlink written 3.6 years ago by sakura.nussbaum10

Hi Sakura, 

How many reads do you have per library? 
In my experience, it is not that uncommon to get around ~10 to ~15% of "empty counts".

Alejandro

ADD REPLYlink written 3.6 years ago by Alejandro Reyes1.7k

Hi Alejandro,

Indeed, I have around 10mio per library. So it seems ok!
Thank you again,

Sakura

ADD REPLYlink written 3.6 years ago by sakura.nussbaum10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 411 users visited in the last hour