I understand that this question has popped up over and over again in this forum. The simple solution is to remove redundant categories to make rank full again. I would like to go further than this simple approach and figure out a conceptually more correct solution to this kind of problem.
Let me describe my experiment first: I have several strains of Strep. pneumoiae bacteria, with different virulence. We have collected 2 replicates of rnaseq from each strain and want to find out genes that are differentially expressed across virulence levels. There are more than one strain with 'High' (or 'Low') virulence. In my initial design below, I want to study the effect of virulence on gene expression level, taking into consideration that there are multiple strains in each virulence level (design=~Strain+Virulence).
Experiment No. | Strain | Virulence |
1 | SP01 | High |
2 | SP01 | High |
3 | SP04 | High |
4 | SP04 | High |
5 | SP05 | Low |
6 | SP05 | Low |
Apparently, the design is faulty ("not full rank"), because the "Strain" category has been accounted for by the "Virulence". In other word, in testing the effect of "Virulence" on gene expression, we cannot control for "Strain". One common solution is to simply remove "Strain" category in the design (design=~Virulence).
Here are my concerns:
If we remove "Strain" category, we are treating, in the example above, experiment 1-4 as replicates, implying that they are generated from the same strain, under same condition. Later in analysis, we assume that counts from these 4 experiments follow a negative binomial distribution and fit a glm against them. We know, in fact, they are not from a single strain. Their gene count numbers comes from 2 negative binomial distributions! I worry that the subsequent glm would totally break down.
I am considering an alternative solution to this problem:
Lets analyze gene differential expression between all High-Low strain pairs. The core set of genes that contribute to the Virulence phenotype should be the intersection of all sets of differentially expressed genes. For example, we would test DE between SP01 (High) and SP05 (Low) to identify a set of genes that contribute to their phenotype differences. We would also test DE between SP04 (High) and SP05 (Low) to identify another set of genes that contribute to their phenotype differences. In each set, there are real signals and noises. The intersection of the two sets should give the set of core genes that contribute to Virulence phenotype.
While this seems to be a viable solution, this approach removes p-values (or padj values). This is very bad in my opinion.
I would like to ask what would you recommend to do in this kind of situation? Is it OK to just remove "Strain" category and run DEseq2 and the glm should still work? I am not strong in stats, I would really appreciate it if you could share some insight in DE analysis.
Best regards to all
Lingyu Zhou
Thanks. I will go and study limma a bit more.