Paired sample design in DESeq2
2
0
Entering edit mode
Liam • 0
@c40ced44
Last seen 2.2 years ago
United States

I am analyzing a dataset that is looking into the effects of age on cancer using patient samples within one cancer type. In short, we are interested in finding genes involved in tumorigenesis that are altered by age. Each patient is classified as young or old, and had tumor tissue and corresponding normal tissue taken. As a result, the data is paired, with both tumor and normal tissue samples for each patient. However, given that each patient is either young or old, the data is nested and not full-rank.

      Age HTB TissueType
S22 Young 457     Normal
S1  Young 457      Tumor
S45 Young 824     Normal
S2  Young 824      Tumor
S18 Young 451     Normal
S3  Young 451      Tumor
...
S33   Old 668     Normal
S20   Old 668      Tumor
S83   Old 862     Normal
S24   Old 862      Tumor
S52   Old 487     Normal
S26   Old 487      Tumor
...

There are a total of 25 pairs of samples for young and 23 for old.

I based the design matrix on the DESeq2 tutorial, specifically the troubleshooting section on "model matrix not full rank" and the subsection on individuals nested within groups. The example code suggests using an ind.n term within each group:

   grp ind cnd ind.n
1    X   1   A     1
2    X   1   B     1
3    X   2   A     2
4    X   2   B     2
5    X   3   A     3
6    X   3   B     3
7    Y   4   A     1
8    Y   4   B     1
9    Y   5   A     2
10   Y   5   B     2
11   Y   6   A     3
12   Y   6   B     3

With the corresponding design matrix defined as

model.matrix(~ grp + grp:ind.n + grp:cnd, coldata)

As I understand it, the columns connect from the example to our data as follows: grp goes with Age and cnd goes with TissueType. As a result, the model matrix is generated as follows:

model.matrix(~ Age + Age:ind.n + Age:TissueType, metadata)

where ind.n is generated from HTB (patient ID) for each age group:

      Age HTB TissueType ind.n
S22 Young 457     Normal     1
S1  Young 457      Tumor     1
S45 Young 824     Normal     2
S2  Young 824      Tumor     2
S18 Young 451     Normal     3
S3  Young 451      Tumor     3
...
S33   Old 668     Normal     1
S20   Old 668      Tumor     1
S83   Old 862     Normal     2
S24   Old 862      Tumor     2
S52   Old 487     Normal     3
S26   Old 487      Tumor     3
...

To the best of my knowledge, I think this is the proper way to design the matrix based on the tutorial. However, please provide guidance if there is an error / a better way to do this.

My question is how to get three comparisons from this data set:

  • fold change and p value for genes tumor/normal in the young patients
  • fold change and p value for genes tumor/normal in the old patients
  • "fold change" and p value for genes tumor/normal in the young patients versus in the old patients (aka the effect of age on tumor/normal)

The contrasts I currently have for these comparisons are as follows:

contrasts <- list("Tumor vs normal in young" = list("AgeYoung.TissueTypeTumor"),
                  "Tumor vs normal in old" = list("AgeOld.TissueTypeTumor"),
                  "Tumor vs normal in old vs young" = list("AgeOld.TissueTypeTumor",
                                                           "AgeYoung.TissueTypeTumor"))

The issue I see is that the p values for results for the final comparison are incredibly small. I realize that the "log2FoldChange" calculated by this method is not strictly a logFC, so I am ignoring that for this. The results for the top 5 genes by padj are as follows:

    log2FoldChange       pvalue      padj
1    -4.153802e-04 2.089027e-06 0.0395599
2    -1.294253e-04 2.364614e-05 0.2238935
3     8.769469e-04 4.797661e-05 0.2271333
4     2.637959e-04 4.197285e-05 0.2271333
5    -5.939257e-04 6.665953e-05 0.2524663

Only one gene is significant, even though they are many genes with pretty strongly different logFCs in the young or old conditions. Ideally I would like to combine the significant genes identified by each of these comparison to find genes that are altered in tumorigenesis, and I find it hard to believe there is only one gene that is significantly altered by age in tumor/normal. Any ideas on possible reasons why so few genes are differentially expressed in this comparison?

Any help on the design matrix and contrast design would be appreciated, thank you!

SessionInfo output as requested in the post guidelines:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] comprehenr_0.6.10           ggrepel_0.9.1               ensembldb_2.18.2           
 [4] AnnotationFilter_1.18.0     GenomicFeatures_1.46.1      AnnotationDbi_1.56.2       
 [7] DEGreport_1.30.0            pheatmap_1.0.12             RColorBrewer_1.1-2         
[10] forcats_0.5.1               stringr_1.4.0               dplyr_1.0.7                
[13] purrr_0.3.4                 readr_2.1.1                 tidyr_1.1.4                
[16] tibble_3.1.6                ggplot2_3.3.5               tidyverse_1.3.1            
[19] DESeq2_1.34.0               SummarizedExperiment_1.24.0 Biobase_2.54.0             
[22] MatrixGenerics_1.6.0        matrixStats_0.61.0          GenomicRanges_1.46.1       
[25] GenomeInfoDb_1.30.0         IRanges_2.28.0              S4Vectors_0.32.3           
[28] BiocGenerics_0.40.0        

loaded via a namespace (and not attached):
  [1] readxl_1.3.1                backports_1.4.0             circlize_0.4.13            
  [4] BiocFileCache_2.2.0         plyr_1.8.6                  lazyeval_0.2.2             
  [7] ConsensusClusterPlus_1.58.0 splines_4.1.2               BiocParallel_1.28.2        
 [10] digest_0.6.29               foreach_1.5.1               fansi_0.5.0                
 [13] magrittr_2.0.1              memoise_2.0.1               cluster_2.1.2              
 [16] doParallel_1.0.16           tzdb_0.2.0                  limma_3.50.0               
 [19] ComplexHeatmap_2.10.0       Biostrings_2.62.0           annotate_1.72.0            
 [22] Nozzle.R1_1.1-1             modelr_0.1.8                vroom_1.5.7                
 [25] prettyunits_1.1.1           colorspace_2.0-2            blob_1.2.2                 
 [28] rvest_1.0.2                 rappdirs_0.3.3              haven_2.4.3                
 [31] xfun_0.28                   crayon_1.4.2                RCurl_1.98-1.5             
 [34] jsonlite_1.7.2              genefilter_1.76.0           survival_3.2-13            
 [37] iterators_1.0.13            glue_1.5.1                  gtable_0.3.0               
 [40] zlibbioc_1.40.0             XVector_0.34.0              GetoptLong_1.0.5           
 [43] DelayedArray_0.20.0         shape_1.4.6                 scales_1.1.1               
 [46] DBI_1.1.1                   edgeR_3.36.0                Rcpp_1.0.7                 
 [49] xtable_1.8-4                progress_1.2.2              lasso2_1.2-22              
 [52] gridtext_0.1.4              tmvnsim_1.0-2               clue_0.3-60                
 [55] bit_4.0.4                   httr_1.4.2                  ellipsis_0.3.2             
 [58] pkgconfig_2.0.3             reshape_0.8.8               XML_3.99-0.8               
 [61] dbplyr_2.1.1                locfit_1.5-9.4              utf8_1.2.2                 
 [64] tidyselect_1.1.1            rlang_0.4.12                munsell_0.5.0              
 [67] cellranger_1.1.0            tools_4.1.2                 cachem_1.0.6               
 [70] cli_3.1.0                   generics_0.1.1              RSQLite_2.2.8              
 [73] broom_0.7.10                fastmap_1.1.0               ggdendro_0.1.22            
 [76] yaml_2.2.1                  knitr_1.36                  bit64_4.0.5                
 [79] fs_1.5.1                    KEGGREST_1.34.0             nlme_3.1-153               
 [82] xml2_1.3.3                  biomaRt_2.50.1              compiler_4.1.2             
 [85] rstudioapi_0.13             filelock_1.0.2              curl_4.3.2                 
 [88] png_0.1-7                   reprex_2.0.1                geneplotter_1.72.0         
 [91] stringi_1.7.6               lattice_0.20-45             ProtGenerics_1.26.0        
 [94] Matrix_1.3-4                psych_2.1.9                 vctrs_0.3.8                
 [97] pillar_1.6.4                lifecycle_1.0.1             GlobalOptions_0.1.2        
[100] cowplot_1.1.1               bitops_1.0-7                rtracklayer_1.54.0         
[103] R6_2.5.1                    BiocIO_1.4.0                codetools_0.2-18           
[106] MASS_7.3-54                 assertthat_0.2.1            rjson_0.2.20               
[109] withr_2.4.3                 GenomicAlignments_1.30.0    Rsamtools_2.10.0           
[112] mnormt_2.0.2                GenomeInfoDbData_1.2.7      parallel_4.1.2             
[115] hms_1.1.1                   ggtext_0.1.1                grid_4.1.2                 
[118] logging_0.10-108            lubridate_1.8.0             restfulr_0.0.13
DESeq2 • 1.7k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 13 hours ago
United States

Due to time constraints, I only have time to answer software-related questions on the support site. For questions about statistical design and analysis, I recommend collaborating with a local statistician or someone familiar with analyzing linear models in R.

ADD COMMENT
0
Entering edit mode
@834f6624
Last seen 12 months ago
South Korea

I have similar experimental design with paired normal/tumor samples and facing very similar problem where only few genes are significantly different. If anyone can post answer to the question of original post, it would be tremendous help.

Thank you in advance.

ADD COMMENT
1
Entering edit mode

Take a look at plotCounts for the gene with smallest pvalue. It will give you a sense of the variability and why few genes may be called.

ADD REPLY
0
Entering edit mode

Thank you for taking your valuable time on this trivial question Prof. Michael love.

I have checked the plotCounts for the gene with the smallest pvalue as you suggested and found out that the gene has lower variability compare to genes with higher p-values. Because of that matter, I am redesigning the design section of the DESeq to make a more meaningful analysis. While I am doing it, I have very small questions that I would like to ask. If you are busy, you don't have to answer them.

  1. As I stated, I have experimental samples very similar to the samples of the original post where I have paired normal/tumor samples (Tumor vs Normal) where some tumor samples have KRAS mutation (mutated vs wild). So i have ran very similar code: model.matrix(~ mutation+ mutation:patient_number + mutation:Tissue, metadata). After that, to get a genes upregulated in the KRAS mutated tumor samples compared to KRAS wildtype tumors (which are normalized by their paired normal samples), I ran res <- results(dds, contrast = list('mutationmutated.tissueTumor', 'mutationwild.tissueTumor'), independentFiltering = TRUE, alpha = 0.05). Would this be correct?

  2. In the raw read count matrix (which is in my excel file), there are some genes that have 0 read counts in most of the samples but few hundreds of read counts in 1 or 2 samples (out of 68 total samples), thereby evading the initial filtering using rowSums function. Should I leave them as independentfiltering option in the results function will accomodate them accordingly during result extraction?

  3. After DESeq, I receive a message saying "xx rows did not converge in beta, labelled in mcols(object)$betaConv. I have used the mcols(object)$betaConv to find some TRUE/FALSE values among genes (I presume) but I am not sure how to extract genes with false values (I guess they are the genes that did not converge). If you can provide easy solution or have mentioned them in previous posts and therefore therefore the link to the post, I would very appreciate it.

  4. I have executed DESeq function with parametric fitType but I am not sure if it is doing better than local fitType. Can you please tell me which one is better? Parametric : https://i.imgur.com/MkB7nAT.png Local : https://i.imgur.com/fzdCGKf.png

I have searched and read lots of posts and still am confused due to lack of bioinformatic knowledge. I know you are very busy with answering unlimited questions on bioconductor and also running your lab. I apologize and thank you so much again for spending your time on my questions professor.

Sincerely,

Jae Hyung Jung.

ADD REPLY
1
Entering edit mode

Look correct as you have it.

"there are some genes that have 0 read counts in most of the samples but few hundreds of read counts in 1 or 2 samples (out of 68 total samples), thereby evading the initial filtering using rowSums function"

I would use the pre-filter for these. I generally like to find trends present in most samples.

The non-convergence usually happens when the data is not a good match to the assumed distribution. You can remove those genes. You can remove them with logical indexing: dds[mcols(dds)$betaConv,]

Usually parametric is the better fit. If it runs without a message about not being able to fit, then I'd go with that one.

ADD REPLY
1
Entering edit mode

Thank you again for the elaborate answers and time you have spent on them. I would definitely follow your suggestions to adequately modify my current analysis for more accurate analysis of the data. I really enjoy using DESeq2 as it is one of the masterpiece packages in the R environment. I guess I don't have to describe how hospitable the package is to non-bioinformaticians like me.

I hope you have a wonderful day and wish best for your continued health and happiness.

Sincerely,

Jae Hyung Jung.

ADD REPLY

Login before adding your answer.

Traffic: 729 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6