Paired sample design in DESeq2
Entering edit mode
Liam • 0
Last seen 9 days ago
United States

I am analyzing a dataset that is looking into the effects of age on cancer using patient samples within one cancer type. In short, we are interested in finding genes involved in tumorigenesis that are altered by age. Each patient is classified as young or old, and had tumor tissue and corresponding normal tissue taken. As a result, the data is paired, with both tumor and normal tissue samples for each patient. However, given that each patient is either young or old, the data is nested and not full-rank.

      Age HTB TissueType
S22 Young 457     Normal
S1  Young 457      Tumor
S45 Young 824     Normal
S2  Young 824      Tumor
S18 Young 451     Normal
S3  Young 451      Tumor
S33   Old 668     Normal
S20   Old 668      Tumor
S83   Old 862     Normal
S24   Old 862      Tumor
S52   Old 487     Normal
S26   Old 487      Tumor

There are a total of 25 pairs of samples for young and 23 for old.

I based the design matrix on the DESeq2 tutorial, specifically the troubleshooting section on "model matrix not full rank" and the subsection on individuals nested within groups. The example code suggests using an ind.n term within each group:

   grp ind cnd ind.n
1    X   1   A     1
2    X   1   B     1
3    X   2   A     2
4    X   2   B     2
5    X   3   A     3
6    X   3   B     3
7    Y   4   A     1
8    Y   4   B     1
9    Y   5   A     2
10   Y   5   B     2
11   Y   6   A     3
12   Y   6   B     3

With the corresponding design matrix defined as

model.matrix(~ grp + grp:ind.n + grp:cnd, coldata)

As I understand it, the columns connect from the example to our data as follows: grp goes with Age and cnd goes with TissueType. As a result, the model matrix is generated as follows:

model.matrix(~ Age + Age:ind.n + Age:TissueType, metadata)

where ind.n is generated from HTB (patient ID) for each age group:

      Age HTB TissueType ind.n
S22 Young 457     Normal     1
S1  Young 457      Tumor     1
S45 Young 824     Normal     2
S2  Young 824      Tumor     2
S18 Young 451     Normal     3
S3  Young 451      Tumor     3
S33   Old 668     Normal     1
S20   Old 668      Tumor     1
S83   Old 862     Normal     2
S24   Old 862      Tumor     2
S52   Old 487     Normal     3
S26   Old 487      Tumor     3

To the best of my knowledge, I think this is the proper way to design the matrix based on the tutorial. However, please provide guidance if there is an error / a better way to do this.

My question is how to get three comparisons from this data set:

  • fold change and p value for genes tumor/normal in the young patients
  • fold change and p value for genes tumor/normal in the old patients
  • "fold change" and p value for genes tumor/normal in the young patients versus in the old patients (aka the effect of age on tumor/normal)

The contrasts I currently have for these comparisons are as follows:

contrasts <- list("Tumor vs normal in young" = list("AgeYoung.TissueTypeTumor"),
                  "Tumor vs normal in old" = list("AgeOld.TissueTypeTumor"),
                  "Tumor vs normal in old vs young" = list("AgeOld.TissueTypeTumor",

The issue I see is that the p values for results for the final comparison are incredibly small. I realize that the "log2FoldChange" calculated by this method is not strictly a logFC, so I am ignoring that for this. The results for the top 5 genes by padj are as follows:

    log2FoldChange       pvalue      padj
1    -4.153802e-04 2.089027e-06 0.0395599
2    -1.294253e-04 2.364614e-05 0.2238935
3     8.769469e-04 4.797661e-05 0.2271333
4     2.637959e-04 4.197285e-05 0.2271333
5    -5.939257e-04 6.665953e-05 0.2524663

Only one gene is significant, even though they are many genes with pretty strongly different logFCs in the young or old conditions. Ideally I would like to combine the significant genes identified by each of these comparison to find genes that are altered in tumorigenesis, and I find it hard to believe there is only one gene that is significantly altered by age in tumor/normal. Any ideas on possible reasons why so few genes are differentially expressed in this comparison?

Any help on the design matrix and contrast design would be appreciated, thank you!

SessionInfo output as requested in the post guidelines:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] comprehenr_0.6.10           ggrepel_0.9.1               ensembldb_2.18.2           
 [4] AnnotationFilter_1.18.0     GenomicFeatures_1.46.1      AnnotationDbi_1.56.2       
 [7] DEGreport_1.30.0            pheatmap_1.0.12             RColorBrewer_1.1-2         
[10] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.7                
[13] purrr_0.3.4                 readr_2.1.1                 tidyr_1.1.4                
[16] tibble_3.1.6                ggplot2_3.3.5               tidyverse_1.3.1            
[19] DESeq2_1.34.0               SummarizedExperiment_1.24.0 Biobase_2.54.0             
[22] MatrixGenerics_1.6.0        matrixStats_0.61.0          GenomicRanges_1.46.1       
[25] GenomeInfoDb_1.30.0         IRanges_2.28.0              S4Vectors_0.32.3           
[28] BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
  [1] readxl_1.3.1                backports_1.4.0             circlize_0.4.13            
  [4] BiocFileCache_2.2.0         plyr_1.8.6                  lazyeval_0.2.2             
  [7] ConsensusClusterPlus_1.58.0 splines_4.1.2               BiocParallel_1.28.2        
 [10] digest_0.6.29               foreach_1.5.1               fansi_0.5.0                
 [13] magrittr_2.0.1              memoise_2.0.1               cluster_2.1.2              
 [16] doParallel_1.0.16           tzdb_0.2.0                  limma_3.50.0               
 [19] ComplexHeatmap_2.10.0       Biostrings_2.62.0           annotate_1.72.0            
 [22] Nozzle.R1_1.1-1             modelr_0.1.8                vroom_1.5.7                
 [25] prettyunits_1.1.1           colorspace_2.0-2            blob_1.2.2                 
 [28] rvest_1.0.2                 rappdirs_0.3.3              haven_2.4.3                
 [31] xfun_0.28                   crayon_1.4.2                RCurl_1.98-1.5             
 [34] jsonlite_1.7.2              genefilter_1.76.0           survival_3.2-13            
 [37] iterators_1.0.13            glue_1.5.1                  gtable_0.3.0               
 [40] zlibbioc_1.40.0             XVector_0.34.0              GetoptLong_1.0.5           
 [43] DelayedArray_0.20.0         shape_1.4.6                 scales_1.1.1               
 [46] DBI_1.1.1                   edgeR_3.36.0                Rcpp_1.0.7                 
 [49] xtable_1.8-4                progress_1.2.2              lasso2_1.2-22              
 [52] gridtext_0.1.4              tmvnsim_1.0-2               clue_0.3-60                
 [55] bit_4.0.4                   httr_1.4.2                  ellipsis_0.3.2             
 [58] pkgconfig_2.0.3             reshape_0.8.8               XML_3.99-0.8               
 [61] dbplyr_2.1.1                locfit_1.5-9.4              utf8_1.2.2                 
 [64] tidyselect_1.1.1            rlang_0.4.12                munsell_0.5.0              
 [67] cellranger_1.1.0            tools_4.1.2                 cachem_1.0.6               
 [70] cli_3.1.0                   generics_0.1.1              RSQLite_2.2.8              
 [73] broom_0.7.10                fastmap_1.1.0               ggdendro_0.1.22            
 [76] yaml_2.2.1                  knitr_1.36                  bit64_4.0.5                
 [79] fs_1.5.1                    KEGGREST_1.34.0             nlme_3.1-153               
 [82] xml2_1.3.3                  biomaRt_2.50.1              compiler_4.1.2             
 [85] rstudioapi_0.13             filelock_1.0.2              curl_4.3.2                 
 [88] png_0.1-7                   reprex_2.0.1                geneplotter_1.72.0         
 [91] stringi_1.7.6               lattice_0.20-45             ProtGenerics_1.26.0        
 [94] Matrix_1.3-4                psych_2.1.9                 vctrs_0.3.8                
 [97] pillar_1.6.4                lifecycle_1.0.1             GlobalOptions_0.1.2        
[100] cowplot_1.1.1               bitops_1.0-7                rtracklayer_1.54.0         
[103] R6_2.5.1                    BiocIO_1.4.0                codetools_0.2-18           
[106] MASS_7.3-54                 assertthat_0.2.1            rjson_0.2.20               
[109] withr_2.4.3                 GenomicAlignments_1.30.0    Rsamtools_2.10.0           
[112] mnormt_2.0.2                GenomeInfoDbData_1.2.7      parallel_4.1.2             
[115] hms_1.1.1                   ggtext_0.1.1                grid_4.1.2                 
[118] logging_0.10-108            lubridate_1.8.0             restfulr_0.0.13
DESeq2 • 160 views
Entering edit mode
Last seen 1 hour ago
United States

Due to time constraints, I only have time to answer software-related questions on the support site. For questions about statistical design and analysis, I recommend collaborating with a local statistician or someone familiar with analyzing linear models in R.


Login before adding your answer.

Traffic: 223 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6