Question

How to decide whether a variate should be in the model or not

0

Entering edit mode

Raymond ▴ 20

@raymond-14020

Last seen 5.5 years ago

Hi, all/Micheal Love,

Is there a way in DESeq2 package to hep with this kind of problems? For example, I may be able to think of tens of possible variables that may affect gene expressions(Genotypes of a couple of genes, genders, age, PMI, RIN, RIN^2, mapping rate, batches....). Obviously, I should only include a limited number of those variables.

And how could I choose these variables? How the number of my samples would restrain my selections, in order to make a robust estimation?

From the literature, I saw someone using the PC1 as the corresponding factor. Then ANOVA model could be applied to testing the contribution of each possible variables to PC1. This is reasonable, but with obvious limitations. Especially, when sometimes you see the PC1 is mainly dominated by a single factor(such as Batch), then (PC2,PC3, etc) may also be used to identify other factors.

Any suggestions? (I know this is not a pure DESeq2 package problem, but I guess Micheal would have some clue about this:-) )

Thanks in advance,

Raymond

deseq2 rnaseq • 1.4k views

ADD COMMENT • link updated 21 months ago by Michael Love 43k • written 6.1 years ago by Raymond ▴ 20

score 2 · Accepted Answer · 2018-11-15

Hi Raymond,

Usually to decide what variables would go in a model, I'd have a long in-person discussion with the investigator about their experiment. I'll give my quick opinion on these:

I don't like to put in RIN, RIN^2, mapping rate, etc. but instead if I think there is a lot of technical variability in the data, and many samples to estimate technical effects, I typically would use factor analysis methods such as SVA, RUV or PEER. These are more data driven and should capture whatever f(RIN) is attempting to capture if it is systematic and not confounded with the biological condition (hopefully it's not!). I would never use PC1, because it should be strictly worse than these other methods that take into account the design (I think they all compare to simplistic PC1 in their publications to show their superiority).

You can also use quantification methods like Salmon which model numerous technical biases directly, and per-sample, and so removing partially confounded technical effects is not a problem.

If I have clearly annotated library preparation batches, I tend to use batch instead of the factor analysis methods.

Throwing in lots of variables does seem a bit like trying to deal on the back end with observational data, and I'm not sure that it really works that well.

Some of these make sense, like sex, and I would put that in most datasets where there is a block design. If there is a sex specific effect you can include an interaction term with sex and condition.

Remember when you put in something like log(counts) ~ age, you are assuming that the effect is linear on log counts throughout the range of your data.