I am analyzing a dataset that is looking into the effects of age on cancer using patient samples within one cancer type. In short, we are interested in finding genes involved in tumorigenesis that are altered by age. Each patient is classified as young or old, and had tumor tissue and corresponding normal tissue taken. As a result, the data is paired, with both tumor and normal tissue samples for each patient. However, given that each patient is either young or old, the data is nested and not full-rank.
Age HTB TissueType
S22 Young 457 Normal
S1 Young 457 Tumor
S45 Young 824 Normal
S2 Young 824 Tumor
S18 Young 451 Normal
S3 Young 451 Tumor
...
S33 Old 668 Normal
S20 Old 668 Tumor
S83 Old 862 Normal
S24 Old 862 Tumor
S52 Old 487 Normal
S26 Old 487 Tumor
...
There are a total of 25 pairs of samples for young and 23 for old.
I based the design matrix on the DESeq2 tutorial, specifically the troubleshooting section on "model matrix not full rank" and the subsection on individuals nested within groups. The example code suggests using an ind.n
term within each group:
grp ind cnd ind.n
1 X 1 A 1
2 X 1 B 1
3 X 2 A 2
4 X 2 B 2
5 X 3 A 3
6 X 3 B 3
7 Y 4 A 1
8 Y 4 B 1
9 Y 5 A 2
10 Y 5 B 2
11 Y 6 A 3
12 Y 6 B 3
With the corresponding design matrix defined as
model.matrix(~ grp + grp:ind.n + grp:cnd, coldata)
As I understand it, the columns connect from the example to our data as follows: grp
goes with Age
and cnd
goes with TissueType
. As a result, the model matrix is generated as follows:
model.matrix(~ Age + Age:ind.n + Age:TissueType, metadata)
where ind.n
is generated from HTB
(patient ID) for each age group:
Age HTB TissueType ind.n
S22 Young 457 Normal 1
S1 Young 457 Tumor 1
S45 Young 824 Normal 2
S2 Young 824 Tumor 2
S18 Young 451 Normal 3
S3 Young 451 Tumor 3
...
S33 Old 668 Normal 1
S20 Old 668 Tumor 1
S83 Old 862 Normal 2
S24 Old 862 Tumor 2
S52 Old 487 Normal 3
S26 Old 487 Tumor 3
...
To the best of my knowledge, I think this is the proper way to design the matrix based on the tutorial. However, please provide guidance if there is an error / a better way to do this.
My question is how to get three comparisons from this data set:
- fold change and p value for genes tumor/normal in the young patients
- fold change and p value for genes tumor/normal in the old patients
- "fold change" and p value for genes tumor/normal in the young patients versus in the old patients (aka the effect of age on tumor/normal)
The contrasts I currently have for these comparisons are as follows:
contrasts <- list("Tumor vs normal in young" = list("AgeYoung.TissueTypeTumor"),
"Tumor vs normal in old" = list("AgeOld.TissueTypeTumor"),
"Tumor vs normal in old vs young" = list("AgeOld.TissueTypeTumor",
"AgeYoung.TissueTypeTumor"))
The issue I see is that the p values for results for the final comparison are incredibly small. I realize that the "log2FoldChange" calculated by this method is not strictly a logFC, so I am ignoring that for this. The results for the top 5 genes by padj
are as follows:
log2FoldChange pvalue padj
1 -4.153802e-04 2.089027e-06 0.0395599
2 -1.294253e-04 2.364614e-05 0.2238935
3 8.769469e-04 4.797661e-05 0.2271333
4 2.637959e-04 4.197285e-05 0.2271333
5 -5.939257e-04 6.665953e-05 0.2524663
Only one gene is significant, even though they are many genes with pretty strongly different logFCs in the young or old conditions. Ideally I would like to combine the significant genes identified by each of these comparison to find genes that are altered in tumorigenesis, and I find it hard to believe there is only one gene that is significantly altered by age in tumor/normal. Any ideas on possible reasons why so few genes are differentially expressed in this comparison?
Any help on the design matrix and contrast design would be appreciated, thank you!
SessionInfo output as requested in the post guidelines:
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] comprehenr_0.6.10 ggrepel_0.9.1 ensembldb_2.18.2
[4] AnnotationFilter_1.18.0 GenomicFeatures_1.46.1 AnnotationDbi_1.56.2
[7] DEGreport_1.30.0 pheatmap_1.0.12 RColorBrewer_1.1-2
[10] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
[13] purrr_0.3.4 readr_2.1.1 tidyr_1.1.4
[16] tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
[19] DESeq2_1.34.0 SummarizedExperiment_1.24.0 Biobase_2.54.0
[22] MatrixGenerics_1.6.0 matrixStats_0.61.0 GenomicRanges_1.46.1
[25] GenomeInfoDb_1.30.0 IRanges_2.28.0 S4Vectors_0.32.3
[28] BiocGenerics_0.40.0
loaded via a namespace (and not attached):
[1] readxl_1.3.1 backports_1.4.0 circlize_0.4.13
[4] BiocFileCache_2.2.0 plyr_1.8.6 lazyeval_0.2.2
[7] ConsensusClusterPlus_1.58.0 splines_4.1.2 BiocParallel_1.28.2
[10] digest_0.6.29 foreach_1.5.1 fansi_0.5.0
[13] magrittr_2.0.1 memoise_2.0.1 cluster_2.1.2
[16] doParallel_1.0.16 tzdb_0.2.0 limma_3.50.0
[19] ComplexHeatmap_2.10.0 Biostrings_2.62.0 annotate_1.72.0
[22] Nozzle.R1_1.1-1 modelr_0.1.8 vroom_1.5.7
[25] prettyunits_1.1.1 colorspace_2.0-2 blob_1.2.2
[28] rvest_1.0.2 rappdirs_0.3.3 haven_2.4.3
[31] xfun_0.28 crayon_1.4.2 RCurl_1.98-1.5
[34] jsonlite_1.7.2 genefilter_1.76.0 survival_3.2-13
[37] iterators_1.0.13 glue_1.5.1 gtable_0.3.0
[40] zlibbioc_1.40.0 XVector_0.34.0 GetoptLong_1.0.5
[43] DelayedArray_0.20.0 shape_1.4.6 scales_1.1.1
[46] DBI_1.1.1 edgeR_3.36.0 Rcpp_1.0.7
[49] xtable_1.8-4 progress_1.2.2 lasso2_1.2-22
[52] gridtext_0.1.4 tmvnsim_1.0-2 clue_0.3-60
[55] bit_4.0.4 httr_1.4.2 ellipsis_0.3.2
[58] pkgconfig_2.0.3 reshape_0.8.8 XML_3.99-0.8
[61] dbplyr_2.1.1 locfit_1.5-9.4 utf8_1.2.2
[64] tidyselect_1.1.1 rlang_0.4.12 munsell_0.5.0
[67] cellranger_1.1.0 tools_4.1.2 cachem_1.0.6
[70] cli_3.1.0 generics_0.1.1 RSQLite_2.2.8
[73] broom_0.7.10 fastmap_1.1.0 ggdendro_0.1.22
[76] yaml_2.2.1 knitr_1.36 bit64_4.0.5
[79] fs_1.5.1 KEGGREST_1.34.0 nlme_3.1-153
[82] xml2_1.3.3 biomaRt_2.50.1 compiler_4.1.2
[85] rstudioapi_0.13 filelock_1.0.2 curl_4.3.2
[88] png_0.1-7 reprex_2.0.1 geneplotter_1.72.0
[91] stringi_1.7.6 lattice_0.20-45 ProtGenerics_1.26.0
[94] Matrix_1.3-4 psych_2.1.9 vctrs_0.3.8
[97] pillar_1.6.4 lifecycle_1.0.1 GlobalOptions_0.1.2
[100] cowplot_1.1.1 bitops_1.0-7 rtracklayer_1.54.0
[103] R6_2.5.1 BiocIO_1.4.0 codetools_0.2-18
[106] MASS_7.3-54 assertthat_0.2.1 rjson_0.2.20
[109] withr_2.4.3 GenomicAlignments_1.30.0 Rsamtools_2.10.0
[112] mnormt_2.0.2 GenomeInfoDbData_1.2.7 parallel_4.1.2
[115] hms_1.1.1 ggtext_0.1.1 grid_4.1.2
[118] logging_0.10-108 lubridate_1.8.0 restfulr_0.0.13
Take a look at plotCounts for the gene with smallest pvalue. It will give you a sense of the variability and why few genes may be called.
Thank you for taking your valuable time on this trivial question Prof. Michael love.
I have checked the plotCounts for the gene with the smallest pvalue as you suggested and found out that the gene has lower variability compare to genes with higher p-values. Because of that matter, I am redesigning the design section of the DESeq to make a more meaningful analysis. While I am doing it, I have very small questions that I would like to ask. If you are busy, you don't have to answer them.
As I stated, I have experimental samples very similar to the samples of the original post where I have paired normal/tumor samples (Tumor vs Normal) where some tumor samples have KRAS mutation (mutated vs wild). So i have ran very similar code:
model.matrix(~ mutation+ mutation:patient_number + mutation:Tissue, metadata)
. After that, to get a genes upregulated in the KRAS mutated tumor samples compared to KRAS wildtype tumors (which are normalized by their paired normal samples), I ranres <- results(dds, contrast = list('mutationmutated.tissueTumor', 'mutationwild.tissueTumor'), independentFiltering = TRUE, alpha = 0.05)
. Would this be correct?In the raw read count matrix (which is in my excel file), there are some genes that have 0 read counts in most of the samples but few hundreds of read counts in 1 or 2 samples (out of 68 total samples), thereby evading the initial filtering using rowSums function. Should I leave them as independentfiltering option in the results function will accomodate them accordingly during result extraction?
After DESeq, I receive a message saying "xx rows did not converge in beta, labelled in mcols(object)$betaConv. I have used the mcols(object)$betaConv to find some TRUE/FALSE values among genes (I presume) but I am not sure how to extract genes with false values (I guess they are the genes that did not converge). If you can provide easy solution or have mentioned them in previous posts and therefore therefore the link to the post, I would very appreciate it.
I have executed DESeq function with parametric fitType but I am not sure if it is doing better than local fitType. Can you please tell me which one is better? Parametric : https://i.imgur.com/MkB7nAT.png Local : https://i.imgur.com/fzdCGKf.png
I have searched and read lots of posts and still am confused due to lack of bioinformatic knowledge. I know you are very busy with answering unlimited questions on bioconductor and also running your lab. I apologize and thank you so much again for spending your time on my questions professor.
Sincerely,
Jae Hyung Jung.
Look correct as you have it.
"there are some genes that have 0 read counts in most of the samples but few hundreds of read counts in 1 or 2 samples (out of 68 total samples), thereby evading the initial filtering using rowSums function"
I would use the pre-filter for these. I generally like to find trends present in most samples.
The non-convergence usually happens when the data is not a good match to the assumed distribution. You can remove those genes. You can remove them with logical indexing:
dds[mcols(dds)$betaConv,]
Usually parametric is the better fit. If it runs without a message about not being able to fit, then I'd go with that one.
Thank you again for the elaborate answers and time you have spent on them. I would definitely follow your suggestions to adequately modify my current analysis for more accurate analysis of the data. I really enjoy using DESeq2 as it is one of the masterpiece packages in the R environment. I guess I don't have to describe how hospitable the package is to non-bioinformaticians like me.
I hope you have a wonderful day and wish best for your continued health and happiness.
Sincerely,
Jae Hyung Jung.