I have invented a dataset below (that mirrors my real data) with 3 factors - disease (sick/healthy), treatment (drug/saline/untreated), and cell line (#1 and #2). The cell lines are supposed to be my 2 replicates but plotMDS() of my data showed that they are quite different, so I made cell line a factor.
I want to: 1) find the DEGs between Healthy and Sick in all cells, as well as in cell_line #1 and #2 separately 2) see if the drug (versus saline) has an effect on the sick cells. 3) confirm that Saline and Untreated cells do not differ (would it be best to do Saline vs Untreated or compare Drug vs Saline to Drug vs Untreated to see if they match?).
I made the following design matrices
design2 <- model.matrix(~Disease + Disease:Treatment + Cell_line)
design3 <- model.matrix(~Disease + Disease:Cell_line + Disease:Treatment)
colnames(design2) # "(Intercept)" "DiseaseSick" "Cell_line#2" # "DiseaseHealthy:TreatmentSal" "DiseaseSick:TreatmentSal" "DiseaseHealthy:TreatmentUnt" # "DiseaseSick:TreatmentUnt" colnames(design3) # "(Intercept)" "DiseaseSick" "DiseaseHealthy:Cell_line#2" # "DiseaseSick:Cell_line#2" "DiseaseHealthy:TreatmentSal" "DiseaseSick:TreatmentSal" # "DiseaseHealthy:TreatmentUnt" "DiseaseSick:TreatmentUnt"
My questions today:
1) design2 gives me high dispersion values with plotBCV() with asymptote at 0.5, while design3 gives me much lower dispersion values with asymptote at 0.1 - Does this mean that design3 better approximates the relationships between the covariates, therefore is the better model for the data?
2) In both designs, is coef=2 considering Disease-Unt, Disease-Saline, and Disease-Drug as 6 sample replicates and comparing them to Healthy-Unt, Healthy-Saline, and Healthy-Drug? Is the model's adjustment for the treatment factor so complete that it can be ignored?? Would the DEGs generated for the 2 Disease-Unt vs 2 Healthy-Unt (no treatment effect included) samples be the same as the DEGs from coef=2 (6 Disease vs 6 Healthy samples)? Why does coef=2 in design2 and coef=2 in design3 give me different DEGs even though they compare the same samples?
3) I have a general question about interaction terms using DiseaseSick:Cell_line#2 as an example. From my understanding, specifying this coef gives me the differences (DEGs) between Cell_line #2 vs #1 in the Sick group. But what I want is the difference between Sick and Healthy in Cell_line#2 ! I tried inverting the term in the design matrix (see code below) to Cell_line:Disease but the interaction terms dont reverse! Am I missing something?
design3_reverse <- model.matrix(~Disease + Cell_line:Disease + Disease:Treatment) colames(design3_reverse) # "(Intercept)" "DiseaseSick" "DiseaseHealthy:Cell_line#2" # "DiseaseSick:Cell_line#2" "DiseaseHealthy:TreatmentSal" "DiseaseSick:TreatmentSal" # "DiseaseHealthy:TreatmentUnt" "DiseaseSick:TreatmentUnt"
I hope someone can help explain how to extract what I need! I have not been able to find previous posts or examples in the UserGuide that are similar to my dataset. Session info is below.
sessionInfo( ) #R version 4.0.3 (2020-10-10) #Platform: x86_64-w64-mingw32/x64 (64-bit) #Running under: Windows 10 x64 (build 18363) #Matrix products: default #locale: # LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 # LC_MONETARY=English_United States.1252 LC_NUMERIC=C # LC_TIME=English_United States.1252 #attached base packages: # parallel stats4 stats graphics grDevices utils datasets methods base #other attached packages: # statmod_1.4.35 forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 # readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 tidyverse_1.3.0 # calibrate_1.7.7 MASS_7.3-53 knitr_1.30 org.Hs.eg.db_3.12.0 # AnnotationDbi_1.52.0 IRanges_2.24.1 S4Vectors_0.28.1 Biobase_2.50.0 # BiocGenerics_0.36.0 GOplot_1.0.2 gridExtra_2.3 ggdendro_0.1.22 # goseq_1.42.0 geneLenDataBase_1.26.0 BiasedUrn_1.07 plotly_22.214.171.124 # ggplot2_3.3.2 pheatmap_1.0.12 randomcoloR_126.96.36.199 RColorBrewer_1.1-2 # magrittr_2.0.1 gplots_3.1.1 plyr_1.8.6 dplyr_1.0.2 # caTools_1.18.0 edgeR_3.32.0 Glimma_2.0.0 limma_3.46.0 #loaded via a namespace (and not attached): # readxl_1.3.1 backports_1.2.0 BiocFileCache_1.14.0 # lazyeval_0.2.2 splines_4.0.3 BiocParallel_1.24.1 # GenomeInfoDb_1.26.2 digest_0.6.27 htmltools_0.5.0 # GO.db_3.12.1 fansi_0.4.1 memoise_1.1.0 # cluster_2.1.0 Biostrings_2.58.0 annotate_1.68.0 # modelr_0.1.8 matrixStats_0.57.0 askpass_1.1 # prettyunits_1.1.1 colorspace_2.0-0 blob_1.2.1 # rvest_0.3.6 rappdirs_0.3.1 haven_2.3.1 # xfun_0.19 crayon_1.3.4 RCurl_1.98-1.2 # jsonlite_1.7.2 genefilter_1.72.0 survival_3.2-7 # glue_1.4.2 gtable_0.3.0 zlibbioc_1.36.0 # XVector_0.30.0 DelayedArray_0.16.0 V8_3.4.0 # scales_1.1.1 DBI_1.1.0 Rcpp_1.0.5 # viridisLite_0.3.0 xtable_1.8-4 progress_1.2.2 # bit_4.0.4 htmlwidgets_1.5.3 httr_1.4.2 # ellipsis_0.3.1 pkgconfig_2.0.3 XML_3.99-0.5 # dbplyr_2.0.0 locfit_1.5-9.4 tidyselect_1.1.0 # rlang_0.4.9 munsell_0.5.0 cellranger_1.1.0 # tools_4.0.3 cli_2.2.0 generics_0.1.0 # RSQLite_2.2.1 pacman_0.5.1 broom_0.7.3 # evaluate_0.14 yaml_2.2.1 bit64_4.0.5 # fs_1.5.0 nlme_3.1-149 xml2_1.3.2 # biomaRt_2.46.0 compiler_4.0.3 rstudioapi_0.13 # curl_4.3 reprex_0.3.0 geneplotter_1.68.0 # stringi_1.5.3 GenomicFeatures_1.42.1 lattice_0.20-41 # Matrix_1.2-18 vctrs_0.3.5 pillar_1.4.7 # lifecycle_0.2.0 data.table_1.13.4 bitops_1.0-6 # rtracklayer_1.49.5 GenomicRanges_1.42.0 R6_2.5.0 # KernSmooth_2.23-17 gtools_3.8.2 assertthat_0.2.1 # SummarizedExperiment_1.20.0 openssl_1.4.3 DESeq2_1.30.0 # withr_2.3.0 GenomicAlignments_1.26.0 Rsamtools_2.6.0 #  GenomeInfoDbData_1.2.4 mgcv_1.8-33 hms_0.5.3 # grid_4.0.3 rmarkdown_2.6 MatrixGenerics_1.2.0 # Rtsne_0.15 lubridate_188.8.131.52
I thought I might be able to combine Disease and Treatment (Dis.Tr), and cell_line linearly to the model in the end, to make a single factorial design - model.matrix(~ 0 + Dis.Tr + cell_line). But the cell_line effect may not be the same across both genotypes (between Sick and Healthy), so I dont know if this is correct....
It would be nice to have a look at the MDS plot first. If a linear effect of the cell line is observed on the MDS plot, then it would be okay to use
design <- model.matrix(~ 0 + Dis.Tr + cell_line)
Dear Yunshun, I am not sure how to determine a linear or non-linear effect from the MDS plot. Here is the info that I get from the MDS plot below - there are clear differences between Sick and Healthy samples (disease factor), as well as between #1 and #2 samples (cell-line factor), but there does not seem to be any effect of the drug (treatment factor highlighted by color). How do I check if the effect of each factor is linear or not?
Take your MDS plot as an example, if the four clusters of samples are at the four corners (i.e., separated by cell_line in the dim1 and by Disease in dim 2), then I would say the cell_line effect is linear and it is okay to use
design <- model.matrix(~ 0 + Dis.Tr + cell_line). However, in your MDS plot the Healthy samples in cell line 1 are separated from the other groups in the first dimension. It isn't perfect but I think it should still be okay to use the additive model.
ok I understand - Thank you! I will use this model for my dataset.
Would you be able to address my other questions from the original post? They are more general about the edgeR functions and understanding how they work and how to interpret the results they give. As you will see when you read the original post, my questions are specifically regarding plotBCV(), "coef" comparisons in glmQLFitTest(), and interaction terms in modelmatrix().