DESeq2 warning for the design formula
Assa Yeroslaviz ★ 1.5k
@assa-yeroslaviz-1597
Germany

Hi,

I am running DESeq2 on a data set with multiple factors and getting this warning:

the design formula contains a numeric variable with integer values,
  specifying a model with increasing fold change for higher values.
  did you mean for this to be a factor? if so, first convert
  this variable to a factor using the factor() function

What difference does it make for DESeq2 if I use a quantitative variable (numerical values) instead of a qualitative variable (factor)?

Will there be a difference in the end results? I assume so, since otherwise there would be no reason for the warning, but I don't understand what the difference is.

 

thanks

Assa

 

 

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] readr_0.2.2                WriteXLS_4.0.0             BiocParallel_1.4.3         data.table_1.9.6          
 [5] hwriter_1.3.2              GOstats_2.36.0             graph_1.48.0               Category_2.36.0           
 [9] GO.db_3.2.2                AnnotationDbi_1.32.3       Matrix_1.2-3               ggplot2_2.0.0             
[13] gplots_2.17.0              biomaRt_2.26.1             ReportingTools_2.10.0      RSQLite_1.0.0             
[17] DBI_0.3.1                  knitr_1.12.3               RColorBrewer_1.1-2         genefilter_1.52.0         
[21] DESeq2_1.10.1              RcppArmadillo_0.6.400.2.2  Rcpp_0.12.3                SummarizedExperiment_1.0.2
[25] Biobase_2.30.0             GenomicRanges_1.22.3       GenomeInfoDb_1.6.3         IRanges_2.4.6             
[29] S4Vectors_0.8.7            BiocGenerics_0.16.1        stringr_1.0.0             

loaded via a namespace (and not attached):
 [1] edgeR_3.12.0             splines_3.2.3            R.utils_2.2.0            gtools_3.5.0             Formula_1.2-1           
 [6] highr_0.5.1              latticeExtra_0.6-26      RBGL_1.46.0              BSgenome_1.38.0          Rsamtools_1.22.0        
[11] lattice_0.20-33          biovizBase_1.18.0        limma_3.26.6             chron_2.3-47             XVector_0.10.0          
[16] colorspace_1.2-6         ggbio_1.18.3             R.oo_1.19.0              plyr_1.8.3               OrganismDbi_1.12.1      
[21] GSEABase_1.32.0          XML_3.98-1.3             zlibbioc_1.16.0          xtable_1.8-0             scales_0.3.0            
[26] gdata_2.17.0             annotate_1.48.0          PFAM.db_3.2.2            GenomicFeatures_1.22.11  nnet_7.3-11             
[31] survival_2.38-3          magrittr_1.5             evaluate_0.8             R.methodsS3_1.7.0        GGally_1.0.1            
[36] foreign_0.8-66           BiocInstaller_1.20.1     tools_3.2.3              formatR_1.2.1            munsell_0.4.2           
[41] locfit_1.5-9.1           cluster_2.0.3            lambda.r_1.1.7           Biostrings_2.38.3        caTools_1.17.1          
[46] futile.logger_1.4.1      grid_3.2.3               RCurl_1.95-4.7           dichromat_2.0-0          VariantAnnotation_1.16.4
[51] AnnotationForge_1.12.2   bitops_1.0-6             gtable_0.1.2             reshape_0.8.5            reshape2_1.4.1          
[56] GenomicAlignments_1.6.3  gridExtra_2.0.0          rtracklayer_1.30.1       Hmisc_3.17-1             futile.options_1.0.0    
[61] KernSmooth_2.23-15       stringi_1.0-1            geneplotter_1.48.0       rpart_4.1-10             acepack_1.3-3.3
@mikelove
United States

"What difference does it make for deseq2 if I am using quantitative variable (numerical values) instead of quantitative variables (factors)? Will there be a difference in the end results?"

Yes there is a difference. 

This is actually just a "message". There are three levels in R: message, warning and error.
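To make the three levels concrete, here is a minimal base R sketch. The functions `f_message`, `f_warning`, `f_error` and `classify` are made up for illustration and have nothing to do with DESeq2 internals; the point is only that a message is informational and does not change what the function returns.

```r
# The three condition levels in base R, from least to most severe.
# All function names here are hypothetical, for illustration only.
f_message <- function() { message("just so you know"); "finished" }
f_warning <- function() { warning("something looks off"); "finished" }
f_error   <- function() { stop("cannot continue"); "finished" }

# Report which kind of condition a function signals first.
classify <- function(f) {
  tryCatch(
    { f(); "none" },
    message = function(m) "message",
    warning = function(w) "warning",
    error   = function(e) "error"
  )
}

kinds <- c(classify(f_message), classify(f_warning), classify(f_error))
print(kinds)

# A message does not interrupt the computation, and it can be silenced
# once you have read it and decided the coding is intentional.
result <- suppressMessages(f_message())
print(result)
```

A warning can likewise be silenced with `suppressWarnings()`, whereas an error aborts the call unless it is caught.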

The point of the message is that users may have a column such as condition with values 1, 2, 3, 4. Converting this variable to a factor means that each of these groups is modeled with a separate term. This is a flexible and very useful way to do modeling, and it is the typical way that DESeq2 is used. If the conditions were labeled A, B, C, D, this would happen automatically.
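The difference is easy to see in the design matrix that base R builds from the formula. This is only a sketch: the sample table `coldata` and its column names are invented, and it uses `model.matrix()` directly rather than going through DESeq2.

```r
# Hypothetical sample table: 8 samples, conditions coded as 1..4.
coldata <- data.frame(condition = rep(1:4, each = 2))

# Numeric coding: an intercept plus ONE slope for the whole variable.
mm_numeric <- model.matrix(~ condition, data = coldata)

# Factor coding: an intercept plus one term per non-reference group.
coldata$condition_f <- factor(coldata$condition)
mm_factor <- model.matrix(~ condition_f, data = coldata)

# Character labels are turned into a factor by model.matrix(),
# so A/B/C/D gives the same structure as the explicit factor.
coldata$condition_chr <- c("A", "B", "C", "D")[coldata$condition]
mm_chr <- model.matrix(~ condition_chr, data = coldata)

print(colnames(mm_numeric))
print(colnames(mm_factor))
print(colnames(mm_chr))
```

With the numeric coding there is a single coefficient to estimate for condition; with the factor coding there are three, one for each group compared against the reference level.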

It is also possible to have numeric covariates with DESeq2. But this means something very specific that the majority of users (I believe) do not intend: there is a constant fold change for every unit of change in the condition. So if the estimated fold change is 2, this implies that condition 2 = 2x condition 1, condition 3 = 2x condition 2, etc. In other words, the relationship is linear on the log counts scale.
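As a small worked example of that arithmetic, assuming a made-up baseline of 100 expected counts at condition 1 and an estimated fold change of 2 per unit (both numbers are illustrative, not from any real data):

```r
# Assumed values for illustration only.
baseline  <- 100   # expected count at condition 1
log2fc    <- 1     # log2 fold change per unit, i.e. fold change of 2
condition <- 1:4

# Expected counts implied by a numeric covariate in a log-linear model.
expected <- baseline * 2^(log2fc * (condition - 1))
print(expected)

# The ratio between neighbouring conditions is the same everywhere ...
ratios <- expected[-1] / expected[-length(expected)]
print(ratios)

# ... which is the same as equal steps on the log2 scale.
steps <- diff(log2(expected))
print(steps)
```

A gene that goes up at condition 2 and back down at condition 3 cannot be described by this model, whereas the factor coding handles it without trouble.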

It is a message instead of a warning because some users may want to express such a relationship, for example when the covariate is a titration of a chemical with a known log-linear effect on gene expression.

But even with numeric covariates like "age", it is preferred to use the more flexible modeling where you cut() the covariate into bins and fit a term for each bin. Please see the FAQ in the DESeq2 vignette about continuous variables.
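A minimal sketch of the `cut()` approach, with invented ages and arbitrarily chosen break points (the right breaks for real data are a judgment call):

```r
# Hypothetical ages for 8 samples.
age <- c(23, 31, 38, 45, 52, 59, 64, 71)

# Bin the continuous covariate; the breaks are an arbitrary choice here.
age_bin <- cut(age, breaks = c(20, 40, 60, 80))
print(levels(age_bin))
print(table(age_bin))

# The binned variable is a factor, so each bin gets its own term
# (intercept plus one term per non-reference bin).
mm_age <- model.matrix(~ age_bin)
print(colnames(mm_age))
```

The resulting factor can then be added as a column of the sample table and used in the design formula like any other factor.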


Do I understand correctly that in my case, since my variables are not fold changes but time points and replicates, I need to convert these columns into factors?

Wouldn't it be better to make this a warning rather than just a message?

thanks


Yes, in general I recommend that users code time points as a factor, as this is the most flexible and general-purpose model and doesn't require statistical expertise.
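For a sample table with time point and replicate columns, the conversion could look roughly like this. The table, its values and the `~ replicate + time` design are assumptions for illustration; whether replicate belongs in the design at all depends on the experiment.

```r
# Hypothetical sample table: 4 time points x 3 replicates, coded as numbers.
coldata <- data.frame(
  time      = rep(c(0, 2, 4, 8), each = 3),
  replicate = rep(1:3, times = 4)
)

# Both columns start out numeric, so they would be modeled as slopes.
numeric_before <- sapply(coldata, is.numeric)
print(numeric_before)

# Convert them to factors before building the data set.
coldata$time      <- factor(coldata$time)
coldata$replicate <- factor(coldata$replicate)
factor_after <- sapply(coldata, is.factor)
print(factor_after)
print(levels(coldata$time))

# Check the design in base R: intercept + 2 replicate terms + 3 time terms.
design_matrix <- model.matrix(~ replicate + time, data = coldata)
print(colnames(design_matrix))

# With DESeq2 the converted table would then be passed along, e.g.
# (counts is a hypothetical count matrix, not defined here):
# dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
#                               design = ~ replicate + time)
```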

The exception is if you are performing your own modeling of expression over time by choosing a space of smooth functions. If you want to do this kind of modeling, but are not sure how or exactly what this means, you will need to partner with someone with expertise in this area, as there are many choices to make, and these choices are important and will influence results.
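For orientation only, this is roughly what "choosing a space of smooth functions" can look like in base R, here with a natural cubic spline basis from the `splines` package. The time points and `df = 3` are arbitrary assumptions, and picking them is exactly the kind of choice mentioned above.

```r
library(splines)

# Hypothetical time course: 6 time points, 2 samples each.
time <- rep(c(0, 2, 4, 8, 12, 24), each = 2)

# Smooth model: natural cubic spline basis with 3 degrees of freedom
# (df = 3 is an arbitrary choice for this sketch).
mm_spline <- model.matrix(~ ns(time, df = 3))

# Factor model: one term per time point, no smoothness assumed.
mm_factor <- model.matrix(~ factor(time))

print(ncol(mm_spline))
print(ncol(mm_factor))
```

The spline model uses fewer coefficients than the factor model but assumes expression changes smoothly over time, so its results depend on the basis and the degrees of freedom chosen.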

No, I think a message is appropriate here, because this is standard R variable coding. A message should be sufficient for users who did not mean to encode a variable as numeric.
