- I am working quite a while now on RNAseq data. Brief study design: We have human tissue biopsies. The tissues come from 4 different groups, which can be categorized due to their BMI. Hence, we have group 0, 1, 2, 3 for this tissue. I run different pipelines to find differential expressed genes among the groups. However, I now have an idea in mind which I want to test and looking for help. Let's assume it makes a lot of biological sense to find genes which follow a linear regression from group 0 -> group 3. Hence, I want to perform a linear regression for every gene to get the slope and p-value of each gene (accordingly the expression from gene increases or decreases constantly from group 0 to group 3). I know that I will e.g. not detect genes which might regulated in only one group but this not the aim of this analysis. Can somebody out there give me hint what would be the best way to do this? I was thinking of using the voom/limma approach. However, is there also another/better way or is the best approach to do so? Can I implant this question in DESeq2? Could I run my own glm on e.g. DESeq2 normalized samples?
- On my search I stumbled over a quite similar previous post (Question: DESEQ2 linear model of dose) where they advised to out the vector of interest into the sample information while building the DESeq data set. So I tried this:
sample_group <- sample
sample_group$groupe <-c(0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3)
(sample_group$groupe <- factor(sample_group$groupe , levels = c(0,1,2,3), labels = c(0,1,2,3)))
cds <- DESeqDataSetFromMatrix(counts_3, sample_group, design = ~ groupe )
cd_DESeq <- DESeq(cds)
results(cd_DESeq)
> results(cd_DESeq)
log2 fold change (MAP): groupe 3 vs 0
Wald test p-value: groupe 3 vs 0
DataFrame with 12889 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
7SK 59.7498 0.25838015 0.13543938 1.9077181 0.05642766 0.3133326
A2M 13990.2247 -0.07047253 0.15876606 -0.4438765 0.65713185 0.8821618
A2M-AS1 189.2881 0.21297458 0.13393174 1.5901725 0.11179593 0.4326408
In this previous post, they then claimed that the log2FoldChange is the regression coefficient. However, I can still form different contrasts say
results(cd_DESeq, contrast = c("groupe", "3", "1"))
> results(cd_DESeq, contrast = c("groupe", "3", "1"))
log2 fold change (MAP): groupe 3 vs 1
Wald test p-value: groupe 3 vs 1
DataFrame with 12889 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
7SK 59.7498 0.30781862 0.13015504 2.3650151 0.01802934 0.1655048
A2M 13990.2247 0.02960835 0.15492933 0.1911087 0.84844040 0.9526261
A2M-AS1 189.2881 0.31484488 0.12915912 2.4376513 0.01478303 0.1516203
So, this gives different log2FoldChanges. I would have expected that these changes would stay the same independent of the contrast. Or gives me the first table with 'Wald test p-value: groupe 3 vs 0' already the needed information I want? Hence, I get the regression coefficient from group 0 -> group 3 and in 'Wald test p-value: groupe 3 vs 1' I would then get the regression coefficient from group 1 -> 3 (thus excluding group 0 from the liner regression?)?
Many thanks for any help!
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.3
locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.14.1 SummarizedExperiment_1.4.0 Biobase_2.34.0 GenomicRanges_1.26.3 GenomeInfoDb_1.10.3 IRanges_2.8.1 S4Vectors_0.12.1 BiocGenerics_0.20.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 RColorBrewer_1.1-2 plyr_1.8.4 XVector_0.14.0 bitops_1.0-6 tools_3.3.2 zlibbioc_1.20.0 digest_0.6.12 rpart_4.1-10 base64_2.0
[11] memoise_1.0.0 RSQLite_1.1-2 annotate_1.52.1 tibble_1.2 gtable_0.2.0 htmlTable_1.9 lattice_0.20-34 Matrix_1.2-8 DBI_0.5-1 gridExtra_2.2.1
[21] genefilter_1.56.0 stringr_1.2.0 knitr_1.15.1 cluster_2.0.5 locfit_1.5-9.1 grid_3.3.2 nnet_7.3-12 data.table_1.10.4 AnnotationDbi_1.36.2 XML_3.98-1.5
[31] survival_2.40-1 BiocParallel_1.8.1 foreign_0.8-67 latticeExtra_0.6-28 Formula_1.2-1 geneplotter_1.52.0 ggplot2_2.2.1 magrittr_1.5 htmltools_0.3.5 Hmisc_4.0-2
[41] scales_0.4.1 splines_3.3.2 assertthat_0.1 xtable_1.8-2 colorspace_1.3-2 stringi_1.1.2 acepack_1.4.1 RCurl_1.95-4.8 lazyeval_0.2.0 openssl_0.9.6
[51] munsell_0.4.3
Many thanks for quick and absolutely helpful response.
Sorry for my first noobish question, but the contrasts you modeled can then be added within the limma pipeline right? I am not familiar with limma yet, but with your first model I should get for every gene a regression coefficient and p.value telling me whether I see an linear increased/decreased expression?
Sure, I am not very confident how (biological) meaningful the results will be, but I have a second analysis in mind for which I would like to have these data (and thus I just want to try it out).
However, I really like your second approach! I will definitely try this as well (and again this contrast can then be included in limma?).
For DESeq2, I understood it as you also explained it, that's why I was confused. Most likely I understood the initial question wrongly.