Dear Community,
I am planning to compare each sample group to the mean of all samples, so that I can rank genes according to their difference from the overall mean. Are there any pitfalls in doing that?
In more detail: I have samples from 9 tissues (3 or 4 replicates each) and I'd like to find tissue-enhanced genes. The plan is to perform a DE analysis with DESeq2 on a dataset in which every sample is included twice: the design would assign the first copy of the samples to a single group ("all"), while the second copy would carry the individual tissue labels (tissue1...tissue9). I would then compare "all" with "tissue1" and rank the significantly upregulated genes by their shrunken log2 fold changes, then compare "all" with "tissue2", and so on. In the end I would have, for each tissue, the genes whose expression is significantly higher than the mean of all samples.
Is this procedure statistically ok? Any pitfalls?
Thanks a lot!
Torda
Thank you for the quick response and for the suggestion! Unfortunately, the goal is to compare one tissue to the mean of all 9 tissues, so if I run 9 separate analyses, one per tissue, the reference should be the same every time. Defining a numeric contrast is new to me; I am trying to understand it, but so far I haven't fully grasped how it works. Furthermore, the whole picture is more complicated because I have a batch_effect variable with 3 levels, so the real design would look like: ~ batch + condition. Here I am totally lost...
You may want to work with a statistician on designing the statistical analysis part. I think the most straightforward analysis is the one I outlined above.
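For readers unfamiliar with numeric contrasts, here is a minimal base-R sketch of how "one tissue versus the mean of all nine" could be written as a weight vector for the design ~ batch + condition. The toy sample layout, the level names, and the helper `tissue_vs_mean` are hypothetical, and the check uses ordinary least squares on simulated values rather than DESeq2 itself.

```r
# Hypothetical toy layout: 9 tissues, 3 replicates each, one replicate per batch.
coldata <- expand.grid(
  batch     = factor(paste0("b", 1:3)),
  condition = factor(paste0("tissue", 1:9))
)

# Default (treatment) coding: intercept, 2 batch columns, 8 condition columns.
X <- model.matrix(~ batch + condition, data = coldata)

# Hypothetical helper: weights for "tissue minus the mean of all tissues".
# Every condition coefficient gets -1/9; the tissue of interest gets an extra +1.
# The reference tissue has no column of its own, so it only gets the -1/9 weights.
tissue_vs_mean <- function(tissue, X, prefix = "condition") {
  cond_cols <- grep(paste0("^", prefix), colnames(X), value = TRUE)
  w <- setNames(numeric(ncol(X)), colnames(X))
  w[cond_cols] <- -1 / (length(cond_cols) + 1)
  target <- paste0(prefix, tissue)
  if (target %in% cond_cols) w[target] <- w[target] + 1
  w
}

w3 <- tissue_vs_mean("tissue3", X)
w1 <- tissue_vs_mean("tissue1", X)  # reference level

# Sanity check with ordinary least squares on simulated values (not counts).
set.seed(1)
y <- rnorm(nrow(coldata))
beta <- lm.fit(X, y)$coefficients
tissue_means <- sapply(split(y, coldata$condition), mean)

est3 <- sum(w3 * beta)  # tissue3 mean minus the mean of the 9 tissue means
est1 <- sum(w1 * beta)  # same quantity for the reference tissue

# With DESeq2 the same vector would be passed as a numeric contrast, e.g.
#   results(dds, contrast = w3)
# assuming resultsNames(dds) lists coefficients in the same order as colnames(X).
```

Because the batch weights are all zero, the comparison is made within a batch and the batch effect cancels out. The equality with the raw tissue means in the check relies on the toy layout being balanced; with 3 or 4 replicates per tissue the contrast still targets the same quantity, but the estimates are batch-adjusted.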
Hi! I dug into the literature on numeric contrast design and found what I need: deviation coding (https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/#DEVIATION).
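To make the idea concrete, here is a small base-R sketch of deviation coding for a single factor, using `contr.sum`. The replicate counts and simulated values are made up for illustration; the point is only to show what each coefficient means.

```r
# Hypothetical layout: 9 tissues with 3 or 4 replicates each (unbalanced).
set.seed(42)
n_rep  <- c(3, 4, 3, 4, 3, 4, 3, 4, 3)
tissue <- factor(rep(paste0("tissue", 1:9), times = n_rep))
y      <- rnorm(length(tissue), mean = as.integer(tissue))

# Deviation (sum-to-zero) coding: 1 intercept column + 8 deviation columns.
# The columns are named tissue1 ... tissue8 after the contrast index, which
# here happens to coincide with the first eight level names.
X <- model.matrix(~ tissue, contrasts.arg = list(tissue = "contr.sum"))

beta        <- lm.fit(X, y)$coefficients
group_means <- sapply(split(y, tissue), mean)
grand_mean  <- mean(group_means)  # unweighted mean of the 9 tissue means

# Intercept      = grand mean of the tissue means
# Coefficient k  = mean of tissue k minus the grand mean (k = 1..8)
# Ninth tissue   = minus the sum of the other eight deviations
dev_first8 <- beta[-1]
dev_ninth  <- -sum(dev_first8)
```

Note that the "grand mean" here is the unweighted mean of the tissue means, not the mean over all samples, so tissues with 4 replicates do not count more than tissues with 3.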
With this I can compare one tissue to the grand mean, and it works well, but two new questions have come up:
1) If I have a design of ~ 0 + variable1 (3 levels) + variable2 (9 levels), I get 3 coefficients for variable1 but only 8 for variable2. I think this means that the intercept could be eliminated only for variable1, while variable2 still has a reference level. Is there a way to keep the reference level of variable1 but eliminate the intercept from variable2? The ultimate goal is to have two variables in the design, with treatment coding for the first and deviation coding for the second, similar to the section "Two factors: one treatment-coded, one deviation-coded" here: https://rstudio-pubs-static.s3.amazonaws.com/84177_4604ecc1bae246c9926865db53b6cc29.html, but within the DESeq2 framework.
2) To rank genes I would use a shrinkage estimator, but just out of curiosity: is there a mathematical limitation that prevents the use of a contrast in apeglm?
Thanks, Torda
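One possible way to get the mixed coding asked about in question 1 is to keep the intercept and set the contrasts per factor through the `contrasts.arg` argument of `model.matrix`. The sketch below is base R only; the toy layout and the names `batch` and `tissue` are assumptions, and the DESeq2 calls appear only in comments and were not run.

```r
# Hypothetical toy layout: 9 tissues, each measured once in each of 3 batches.
coldata <- expand.grid(
  batch  = factor(paste0("b", 1:3)),
  tissue = factor(paste0("tissue", 1:9))
)

# Treatment coding for batch (the default), deviation coding for tissue.
X <- model.matrix(
  ~ batch + tissue,
  data = coldata,
  contrasts.arg = list(tissue = "contr.sum")
)
# Columns: (Intercept), batchb2, batchb3, tissue1 ... tissue8.
# tissue1 ... tissue8 are the deviations of the first eight levels from the
# mean of all nine tissues; the ninth level gets no column of its own.

# Sanity check with ordinary least squares on simulated values (not counts).
set.seed(7)
y <- rnorm(nrow(coldata))
beta <- lm.fit(X, y)$coefficients
tissue_means <- sapply(split(y, coldata$tissue), mean)

dev_hat   <- beta[paste0("tissue", 1:8)]  # tissue k minus mean of all tissues
dev_ninth <- -sum(dev_hat)                # deviation of the ninth tissue

# With DESeq2, X could be supplied as the design instead of a formula, e.g.
#   dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = X)
# where `cts` is a hypothetical count matrix.
```

The design keeps its intercept, and each of the eight tissue deviations is a single column of the design matrix. The ninth tissue is only available as a combination of coefficients; to obtain it as a column of its own, one option would be to reorder the factor levels so that a different tissue comes last and refit.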
You can provide DESeq2 with a matrix for the design (see example below), but I would just make sure you are interpreting the coefficients correctly if you are setting these up yourself, e.g. check in with a statistical collaborator. I don't see a problem with applying apeglm to coefficients associated with columns of the design matrix (though we require an intercept, as we want to shrink only cell differences in general).