Question

Design formula in DESeq2

0

Entering edit mode

m.glymenaki • 0

@998340ab

Last seen 2.0 years ago

United Kingdom

Hello,

I am using DESeq2 for analysis of RNAseq data. I would like to ask you about the design in the DESEq2 formula. I have tissue from animals treated with a chemical and my animal model is a colorectal cancer model. My variables are gender (male or female), treatment (treated or untreated), genotype (knockout or wild-type), tissue type (tumour or healthy). In my initial analysis I saw that gender plays a role if I compare condition across animals, so I decided to look specifically on males or females across treatment. I read that I need to include all samples in the analysis for calculation of gene dispersion unless some of the samples are highly dispersed by looking a PCA plot. My PCA plot showed that tissue type was a main factor so I decided to make two datasets one for ileum samples and one for colon. In my initial design I had

dds_colon<-DESeqDataSetFromMatrix(countData=countdata_colon,colData=coldata_colon, design= ~Batch+ Gender +Treatment + Genotype + Treatment:Genotype

However as in my analysis I found that gender had a role, I decided to look on males and females separately. So for example wild type untreated males vs knockout untreated males or wild type untreated males vs knockout treated males. I am wondering how I shall make the design as it gets really complicated. What I did it is that I created a new column where I used the paste command combining information from different columns as you will see in the code below where I have created the group variable. My question is: is it a right approach? Shall I make a design like the one on the above command specifying separately gender, treatment and genotype or is it fine that I created a new variable with all the groups I am going to need?

coldata<-read.csv('metadata.csv',header=TRUE,row.names=1,stringsAsFactors = FALSE)#a table with information about the samples
head(coldata)
str(coldata)
coldata$Treatment<-factor(coldata$Treatment)
coldata$Treatment %<>% relevel("water") #we need to have as reference always the control or untreated condition
coldata$Genotype<-factor(coldata$Genotype)
coldata$Genotype %<>% relevel("wt")
coldata$Gender<-factor(coldata$Gender)
coldata$Tissue<-factor(coldata$Tissue)
coldata$Tumour<-factor(coldata$Tumour)
coldata$Parental<-factor(coldata$Parental)
coldata$Batch<-factor(coldata$Batch)

all(colnames(countdata) %in% rownames(coldata)) #very important to check manually that the columns of the count matrix correspond to the rows of the sample information table.
all(rownames(coldata) == colnames(countdata))#to check they are in the same order
countdata <- countdata[, rownames(coldata)]
all(rownames(coldata) == colnames(countdata))
#subset for colon as colon had the biggest variation samples were split based on the area of the gut they were collected
coldata_colon<- coldata%>% filter(Tissue=="Colon")
countdata_colon<-countdata[,rownames(coldata_colon)]
coldata_colon$groupt<- factor(paste0(coldata_colon$Genotype,coldata_colon$Treatment,coldata_colon$Gender))

#format the data so that we can apply DEseq analysis
dds_colon<-DESeqDataSetFromMatrix(countData=countdata_colon,colData=coldata_colon, design= ~Batch +group)


sessionInfo( )

R version 4.1.0 (2021-05-18) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale: [1] LC_COLLATE=English_United Kingdom.1252 [2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 [4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages: [1] grid parallel stats4 stats graphics grDevices utils
[8] datasets methods base

other attached packages: [1] org.Mm.eg.db_3.13.0 AnnotationDbi_1.54.1
[3] sva_3.40.0 BiocParallel_1.26.2
[5] genefilter_1.74.0 mgcv_1.8-35
[7] nlme_3.1-153 glmpca_0.2.0
[9] PoiClaClu_1.0.2.1 ggbeeswarm_0.6.0
[11] RColorBrewer_1.1-2 vsn_3.60.0
[13] pcaExplorer_2.18.0 UpSetR_1.4.0
[15] pheatmap_1.0.12 magrittr_2.0.1
[17] DEFormats_1.20.0 reshape2_1.4.4
[19] ggpubr_0.4.0 DESeq2_1.32.0
[21] SummarizedExperiment_1.22.0 Biobase_2.52.0
[23] MatrixGenerics_1.4.3 matrixStats_0.61.0
[25] GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
[27] IRanges_2.26.0 S4Vectors_0.30.0
[29] BiocGenerics_0.38.0 forcats_0.5.1
[31] stringr_1.4.0 dplyr_1.0.7
[33] purrr_0.3.4 readr_2.0.2
[35] tidyr_1.1.4 tibble_3.1.5
[37] ggplot2_3.3.5 tidyverse_1.3.1
[39] edgeR_3.34.1 limma_3.48.3

loaded via a namespace (and not attached): [1] utf8_1.2.2 shinydashboard_0.7.2 tidyselect_1.1.1
[4] heatmaply_1.3.0 RSQLite_2.2.8 htmlwidgets_1.5.4
[7] TSP_1.1-11 munsell_0.5.0 preprocessCore_1.54.0 [10] codetools_0.2-18 DT_0.19 withr_2.4.2
[13] colorspace_2.0-2 Category_2.58.0 filelock_1.0.2
[16] knitr_1.36 rstudioapi_0.13 ggsignif_0.6.3
[19] NMF_0.23.0 labeling_0.4.2 GenomeInfoDbData_1.2.6 [22] topGO_2.44.0 farver_2.1.0 bit64_4.0.5
[25] vctrs_0.3.8 generics_0.1.0 xfun_0.26
[28] BiocFileCache_2.0.0 R6_2.5.1 doParallel_1.0.16
[31] seriation_1.3.1 locfit_1.5-9.4 bitops_1.0-7
[34] cachem_1.0.6 shinyAce_0.4.1 DelayedArray_0.18.0
[37] assertthat_0.2.1 promises_1.2.0.1 scales_1.1.1
[40] beeswarm_0.4.0 gtable_0.3.0 affy_1.70.0
[43] rlang_0.4.11 splines_4.1.0 rstatix_0.7.0
[46] lazyeval_0.2.2 hexbin_1.28.2 shinyBS_0.61
[49] broom_0.7.9 checkmate_2.0.0 BiocManager_1.30.16
[52] abind_1.4-5 modelr_0.1.8 threejs_0.3.3
[55] crosstalk_1.1.1 backports_1.2.1 httpuv_1.6.3
[58] RBGL_1.68.0 tools_4.1.0 gridBase_0.4-7
[61] affyio_1.62.0 ellipsis_0.3.2 Rcpp_1.0.7
[64] plyr_1.8.6 base64enc_0.1-3 progress_1.2.2
[67] zlibbioc_1.38.0 RCurl_1.98-1.5 prettyunits_1.1.1
[70] viridis_0.6.2 haven_2.4.3 ggrepel_0.9.1
[73] cluster_2.1.2 fs_1.5.0 data.table_1.14.2
[76] openxlsx_4.2.4 SparseM_1.81 reprex_2.0.1
[79] hms_1.1.1 mime_0.12 evaluate_0.14
[82] xtable_1.8-4 XML_3.99-0.8 rio_0.5.27
[85] readxl_1.3.1 gridExtra_2.3 compiler_4.1.0
[88] biomaRt_2.48.3 crayon_1.4.1 htmltools_0.5.2
[91] GOstats_2.58.0 later_1.3.0 tzdb_0.1.2
[94] geneplotter_1.70.0 lubridate_1.8.0 DBI_1.1.1
[97] dbplyr_2.1.1 MASS_7.3-54 rappdirs_0.3.3
[100] Matrix_1.3-3 car_3.0-11 cli_3.0.1
[103] igraph_1.2.7 pkgconfig_2.0.3 registry_0.5-1
[106] foreign_0.8-81 plotly_4.10.0 xml2_1.3.2
[109] foreach_1.5.1 annotate_1.70.0 vipor_0.4.5
[112] rngtools_1.5.2 pkgmaker_0.32.2 webshot_0.5.2
[115] XVector_0.32.0 AnnotationForge_1.34.0 rvest_1.0.2
[118] digest_0.6.28 graph_1.70.0 Biostrings_2.60.2
[121] rmarkdown_2.11 cellranger_1.1.0 dendextend_1.15.1
[124] GSEABase_1.54.0 curl_4.3.2 shiny_1.7.1
[127] lifecycle_1.0.1 jsonlite_1.7.2 carData_3.0-4
[130] viridisLite_0.4.0 fansi_0.5.0 pillar_1.6.4
[133] lattice_0.20-44 KEGGREST_1.32.0 fastmap_1.1.0
[136] httr_1.4.2 survival_3.2-11 GO.db_3.13.0
[139] glue_1.4.2 zip_2.2.0 png_0.1-7
[142] iterators_1.0.13 Rgraphviz_2.36.0 bit_4.0.4
[145] stringi_1.7.5 blob_1.2.2 memoise_2.0.0

groups dispersion DESeq2 desing • 1.3k views

ADD COMMENT • link 2.6 years ago m.glymenaki • 0

score 1 · Accepted Answer · 2021-12-08

1

Entering edit mode

swbarnes2 ★ 1.4k

@swbarnes2-14086

Last seen 1 minute ago

San Diego

if sex is, like batch, just a 'nuisance' actor that you aren't really going to interrogate, but just want to include as a co-factor to be accounted for, I think piling it into the design like that is what you should do. (Assuming that you have a large enough number of samples to make it feasible to account for all your variables.)

You can make one dds object, and subset that like this:

dds_colon <- dds[ ,dds$Tissue == "colon"]

That's probably simpler than what you are doing. However, you might have to do a droplevels command to remove colData elements not present in your subset.