I am analyzing a dataset that is looking into the effects of age on cancer using patient samples within one cancer type. In short, we are interested in finding genes involved in tumorigenesis that are altered by age. Each patient is classified as young or old, and had tumor tissue and corresponding normal tissue taken. As a result, the data is paired, with both tumor and normal tissue samples for each patient. However, given that each patient is either young or old, the data is nested and not full-rank.
Age HTB TissueType S22 Young 457 Normal S1 Young 457 Tumor S45 Young 824 Normal S2 Young 824 Tumor S18 Young 451 Normal S3 Young 451 Tumor ... S33 Old 668 Normal S20 Old 668 Tumor S83 Old 862 Normal S24 Old 862 Tumor S52 Old 487 Normal S26 Old 487 Tumor ...
There are a total of 25 pairs of samples for young and 23 for old.
I based the design matrix on the DESeq2 tutorial, specifically the troubleshooting section on "model matrix not full rank" and the subsection on individuals nested within groups. The example code suggests using an
ind.n term within each group:
grp ind cnd ind.n 1 X 1 A 1 2 X 1 B 1 3 X 2 A 2 4 X 2 B 2 5 X 3 A 3 6 X 3 B 3 7 Y 4 A 1 8 Y 4 B 1 9 Y 5 A 2 10 Y 5 B 2 11 Y 6 A 3 12 Y 6 B 3
With the corresponding design matrix defined as
model.matrix(~ grp + grp:ind.n + grp:cnd, coldata)
As I understand it, the columns connect from the example to our data as follows:
grp goes with
cnd goes with
TissueType. As a result, the model matrix is generated as follows:
model.matrix(~ Age + Age:ind.n + Age:TissueType, metadata)
ind.n is generated from
HTB (patient ID) for each age group:
Age HTB TissueType ind.n S22 Young 457 Normal 1 S1 Young 457 Tumor 1 S45 Young 824 Normal 2 S2 Young 824 Tumor 2 S18 Young 451 Normal 3 S3 Young 451 Tumor 3 ... S33 Old 668 Normal 1 S20 Old 668 Tumor 1 S83 Old 862 Normal 2 S24 Old 862 Tumor 2 S52 Old 487 Normal 3 S26 Old 487 Tumor 3 ...
To the best of my knowledge, I think this is the proper way to design the matrix based on the tutorial. However, please provide guidance if there is an error / a better way to do this.
My question is how to get three comparisons from this data set:
- fold change and p value for genes tumor/normal in the young patients
- fold change and p value for genes tumor/normal in the old patients
- "fold change" and p value for genes tumor/normal in the young patients versus in the old patients (aka the effect of age on tumor/normal)
The contrasts I currently have for these comparisons are as follows:
contrasts <- list("Tumor vs normal in young" = list("AgeYoung.TissueTypeTumor"), "Tumor vs normal in old" = list("AgeOld.TissueTypeTumor"), "Tumor vs normal in old vs young" = list("AgeOld.TissueTypeTumor", "AgeYoung.TissueTypeTumor"))
The issue I see is that the p values for results for the final comparison are incredibly small. I realize that the "log2FoldChange" calculated by this method is not strictly a logFC, so I am ignoring that for this. The results for the top 5 genes by
padj are as follows:
log2FoldChange pvalue padj 1 -4.153802e-04 2.089027e-06 0.0395599 2 -1.294253e-04 2.364614e-05 0.2238935 3 8.769469e-04 4.797661e-05 0.2271333 4 2.637959e-04 4.197285e-05 0.2271333 5 -5.939257e-04 6.665953e-05 0.2524663
Only one gene is significant, even though they are many genes with pretty strongly different logFCs in the young or old conditions. Ideally I would like to combine the significant genes identified by each of these comparison to find genes that are altered in tumorigenesis, and I find it hard to believe there is only one gene that is significantly altered by age in tumor/normal. Any ideas on possible reasons why so few genes are differentially expressed in this comparison?
Any help on the design matrix and contrast design would be appreciated, thank you!
SessionInfo output as requested in the post guidelines:
> sessionInfo() R version 4.1.2 (2021-11-01) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur 11.6 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale:  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages:  stats4 stats graphics grDevices utils datasets methods base other attached packages:  comprehenr_0.6.10 ggrepel_0.9.1 ensembldb_2.18.2  AnnotationFilter_1.18.0 GenomicFeatures_1.46.1 AnnotationDbi_1.56.2  DEGreport_1.30.0 pheatmap_1.0.12 RColorBrewer_1.1-2  forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7  purrr_0.3.4 readr_2.1.1 tidyr_1.1.4  tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1  DESeq2_1.34.0 SummarizedExperiment_1.24.0 Biobase_2.54.0  MatrixGenerics_1.6.0 matrixStats_0.61.0 GenomicRanges_1.46.1  GenomeInfoDb_1.30.0 IRanges_2.28.0 S4Vectors_0.32.3  BiocGenerics_0.40.0 loaded via a namespace (and not attached):  readxl_1.3.1 backports_1.4.0 circlize_0.4.13  BiocFileCache_2.2.0 plyr_1.8.6 lazyeval_0.2.2  ConsensusClusterPlus_1.58.0 splines_4.1.2 BiocParallel_1.28.2  digest_0.6.29 foreach_1.5.1 fansi_0.5.0  magrittr_2.0.1 memoise_2.0.1 cluster_2.1.2  doParallel_1.0.16 tzdb_0.2.0 limma_3.50.0  ComplexHeatmap_2.10.0 Biostrings_2.62.0 annotate_1.72.0  Nozzle.R1_1.1-1 modelr_0.1.8 vroom_1.5.7  prettyunits_1.1.1 colorspace_2.0-2 blob_1.2.2  rvest_1.0.2 rappdirs_0.3.3 haven_2.4.3  xfun_0.28 crayon_1.4.2 RCurl_1.98-1.5  jsonlite_1.7.2 genefilter_1.76.0 survival_3.2-13  iterators_1.0.13 glue_1.5.1 gtable_0.3.0  zlibbioc_1.40.0 XVector_0.34.0 GetoptLong_1.0.5  DelayedArray_0.20.0 shape_1.4.6 scales_1.1.1  DBI_1.1.1 edgeR_3.36.0 Rcpp_1.0.7  xtable_1.8-4 progress_1.2.2 lasso2_1.2-22  gridtext_0.1.4 tmvnsim_1.0-2 clue_0.3-60  bit_4.0.4 httr_1.4.2 ellipsis_0.3.2  pkgconfig_2.0.3 reshape_0.8.8 XML_3.99-0.8  dbplyr_2.1.1 locfit_1.5-9.4 utf8_1.2.2  tidyselect_1.1.1 rlang_0.4.12 munsell_0.5.0  cellranger_1.1.0 tools_4.1.2 cachem_1.0.6  cli_3.1.0 generics_0.1.1 RSQLite_2.2.8  broom_0.7.10 fastmap_1.1.0 ggdendro_0.1.22  yaml_2.2.1 knitr_1.36 bit64_4.0.5  fs_1.5.1 KEGGREST_1.34.0 nlme_3.1-153  xml2_1.3.3 biomaRt_2.50.1 compiler_4.1.2  rstudioapi_0.13 filelock_1.0.2 curl_4.3.2  png_0.1-7 reprex_2.0.1 geneplotter_1.72.0  stringi_1.7.6 lattice_0.20-45 ProtGenerics_1.26.0  Matrix_1.3-4 psych_2.1.9 vctrs_0.3.8  pillar_1.6.4 lifecycle_1.0.1 GlobalOptions_0.1.2  cowplot_1.1.1 bitops_1.0-7 rtracklayer_1.54.0  R6_2.5.1 BiocIO_1.4.0 codetools_0.2-18  MASS_7.3-54 assertthat_0.2.1 rjson_0.2.20  withr_2.4.3 GenomicAlignments_1.30.0 Rsamtools_2.10.0  mnormt_2.0.2 GenomeInfoDbData_1.2.7 parallel_4.1.2  hms_1.1.1 ggtext_0.1.1 grid_4.1.2  logging_0.10-108 lubridate_1.8.0 restfulr_0.0.13