Hello,
I am using DESeq2 for analysis of RNAseq data. I would like to ask you about the design in the DESEq2 formula. I have tissue from animals treated with a chemical and my animal model is a colorectal cancer model. My variables are gender (male or female), treatment (treated or untreated), genotype (knockout or wild-type), tissue type (tumour or healthy). In my initial analysis I saw that gender plays a role if I compare condition across animals, so I decided to look specifically on males or females across treatment. I read that I need to include all samples in the analysis for calculation of gene dispersion unless some of the samples are highly dispersed by looking a PCA plot. My PCA plot showed that tissue type was a main factor so I decided to make two datasets one for ileum samples and one for colon. In my initial design I had
dds_colon<-DESeqDataSetFromMatrix(countData=countdata_colon,colData=coldata_colon, design= ~Batch+ Gender +Treatment + Genotype + Treatment:Genotype
However as in my analysis I found that gender had a role, I decided to look on males and females separately. So for example wild type untreated males vs knockout untreated males or wild type untreated males vs knockout treated males. I am wondering how I shall make the design as it gets really complicated. What I did it is that I created a new column where I used the paste command combining information from different columns as you will see in the code below where I have created the group variable. My question is: is it a right approach? Shall I make a design like the one on the above command specifying separately gender, treatment and genotype or is it fine that I created a new variable with all the groups I am going to need?
coldata<-read.csv('metadata.csv',header=TRUE,row.names=1,stringsAsFactors = FALSE)#a table with information about the samples
head(coldata)
str(coldata)
coldata$Treatment<-factor(coldata$Treatment)
coldata$Treatment %<>% relevel("water") #we need to have as reference always the control or untreated condition
coldata$Genotype<-factor(coldata$Genotype)
coldata$Genotype %<>% relevel("wt")
coldata$Gender<-factor(coldata$Gender)
coldata$Tissue<-factor(coldata$Tissue)
coldata$Tumour<-factor(coldata$Tumour)
coldata$Parental<-factor(coldata$Parental)
coldata$Batch<-factor(coldata$Batch)
all(colnames(countdata) %in% rownames(coldata)) #very important to check manually that the columns of the count matrix correspond to the rows of the sample information table.
all(rownames(coldata) == colnames(countdata))#to check they are in the same order
countdata <- countdata[, rownames(coldata)]
all(rownames(coldata) == colnames(countdata))
#subset for colon as colon had the biggest variation samples were split based on the area of the gut they were collected
coldata_colon<- coldata%>% filter(Tissue=="Colon")
countdata_colon<-countdata[,rownames(coldata_colon)]
coldata_colon$groupt<- factor(paste0(coldata_colon$Genotype,coldata_colon$Treatment,coldata_colon$Gender))
#format the data so that we can apply DEseq analysis
dds_colon<-DESeqDataSetFromMatrix(countData=countdata_colon,colData=coldata_colon, design= ~Batch +group)
sessionInfo( )
R version 4.1.0 (2021-05-18) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils
 [8] datasets  methods   base
other attached packages:
 [1] org.Mm.eg.db_3.13.0         AnnotationDbi_1.54.1
 [3] sva_3.40.0                  BiocParallel_1.26.2
 [5] genefilter_1.74.0           mgcv_1.8-35
 [7] nlme_3.1-153                glmpca_0.2.0
 [9] PoiClaClu_1.0.2.1           ggbeeswarm_0.6.0
[11] RColorBrewer_1.1-2          vsn_3.60.0
[13] pcaExplorer_2.18.0          UpSetR_1.4.0
[15] pheatmap_1.0.12             magrittr_2.0.1
[17] DEFormats_1.20.0            reshape2_1.4.4
[19] ggpubr_0.4.0                DESeq2_1.32.0
[21] SummarizedExperiment_1.22.0 Biobase_2.52.0
[23] MatrixGenerics_1.4.3        matrixStats_0.61.0
[25] GenomicRanges_1.44.0        GenomeInfoDb_1.28.4
[27] IRanges_2.26.0              S4Vectors_0.30.0
[29] BiocGenerics_0.38.0         forcats_0.5.1
[31] stringr_1.4.0               dplyr_1.0.7
[33] purrr_0.3.4                 readr_2.0.2
[35] tidyr_1.1.4                 tibble_3.1.5
[37] ggplot2_3.3.5               tidyverse_1.3.1
[39] edgeR_3.34.1                limma_3.48.3
loaded via a namespace (and not attached):
  [1] utf8_1.2.2             shinydashboard_0.7.2   tidyselect_1.1.1
  [4] heatmaply_1.3.0        RSQLite_2.2.8          htmlwidgets_1.5.4
  [7] TSP_1.1-11             munsell_0.5.0          preprocessCore_1.54.0 
 [10] codetools_0.2-18       DT_0.19                withr_2.4.2
 [13] colorspace_2.0-2       Category_2.58.0        filelock_1.0.2
 [16] knitr_1.36             rstudioapi_0.13        ggsignif_0.6.3
 [19] NMF_0.23.0             labeling_0.4.2         GenomeInfoDbData_1.2.6
 [22] topGO_2.44.0           farver_2.1.0           bit64_4.0.5
 [25] vctrs_0.3.8            generics_0.1.0         xfun_0.26
 [28] BiocFileCache_2.0.0    R6_2.5.1               doParallel_1.0.16
 [31] seriation_1.3.1        locfit_1.5-9.4         bitops_1.0-7
 [34] cachem_1.0.6           shinyAce_0.4.1         DelayedArray_0.18.0
 [37] assertthat_0.2.1       promises_1.2.0.1       scales_1.1.1
 [40] beeswarm_0.4.0         gtable_0.3.0           affy_1.70.0
 [43] rlang_0.4.11           splines_4.1.0          rstatix_0.7.0
 [46] lazyeval_0.2.2         hexbin_1.28.2          shinyBS_0.61
 [49] broom_0.7.9            checkmate_2.0.0        BiocManager_1.30.16
 [52] abind_1.4-5            modelr_0.1.8           threejs_0.3.3
 [55] crosstalk_1.1.1        backports_1.2.1        httpuv_1.6.3
 [58] RBGL_1.68.0            tools_4.1.0            gridBase_0.4-7
 [61] affyio_1.62.0          ellipsis_0.3.2         Rcpp_1.0.7
 [64] plyr_1.8.6             base64enc_0.1-3        progress_1.2.2
 [67] zlibbioc_1.38.0        RCurl_1.98-1.5         prettyunits_1.1.1
 [70] viridis_0.6.2          haven_2.4.3            ggrepel_0.9.1
 [73] cluster_2.1.2          fs_1.5.0               data.table_1.14.2
 [76] openxlsx_4.2.4         SparseM_1.81           reprex_2.0.1
 [79] hms_1.1.1              mime_0.12              evaluate_0.14
 [82] xtable_1.8-4           XML_3.99-0.8           rio_0.5.27
 [85] readxl_1.3.1           gridExtra_2.3          compiler_4.1.0
 [88] biomaRt_2.48.3         crayon_1.4.1           htmltools_0.5.2
 [91] GOstats_2.58.0         later_1.3.0            tzdb_0.1.2
 [94] geneplotter_1.70.0     lubridate_1.8.0        DBI_1.1.1
 [97] dbplyr_2.1.1           MASS_7.3-54            rappdirs_0.3.3
[100] Matrix_1.3-3           car_3.0-11             cli_3.0.1
[103] igraph_1.2.7           pkgconfig_2.0.3        registry_0.5-1
[106] foreign_0.8-81         plotly_4.10.0          xml2_1.3.2
[109] foreach_1.5.1          annotate_1.70.0        vipor_0.4.5
[112] rngtools_1.5.2         pkgmaker_0.32.2        webshot_0.5.2
[115] XVector_0.32.0         AnnotationForge_1.34.0 rvest_1.0.2
[118] digest_0.6.28          graph_1.70.0           Biostrings_2.60.2
[121] rmarkdown_2.11         cellranger_1.1.0       dendextend_1.15.1
[124] GSEABase_1.54.0        curl_4.3.2             shiny_1.7.1
[127] lifecycle_1.0.1        jsonlite_1.7.2         carData_3.0-4
[130] viridisLite_0.4.0      fansi_0.5.0            pillar_1.6.4
[133] lattice_0.20-44        KEGGREST_1.32.0        fastmap_1.1.0
[136] httr_1.4.2             survival_3.2-11        GO.db_3.13.0
[139] glue_1.4.2             zip_2.2.0              png_0.1-7
[142] iterators_1.0.13       Rgraphviz_2.36.0       bit_4.0.4
[145] stringi_1.7.5          blob_1.2.2             memoise_2.0.0

Thank you! Much appreciated all the help.