DE Analysis strategies for designs that are "not full rank"
1
0
Entering edit mode
zhouly • 0
@zhouly-14171
Last seen 6.5 years ago

I understand that this question has popped up over and over again in this forum. The simple solution is to remove redundant categories to make rank full again. I would like to go further than this simple approach and figure out a conceptually more correct solution to this kind of problem.

 

Let me describe my experiment first: I have several strains of Strep. pneumoiae bacteria, with different virulence. We have collected 2 replicates of rnaseq from each strain and want to find out genes that are differentially expressed across virulence levels. There are more than one strain with 'High' (or 'Low') virulence. In my initial design below, I want to study the effect of virulence on gene expression level, taking into consideration that there are multiple strains in each virulence level (design=~Strain+Virulence).

Experiment No. Strain Virulence
1 SP01 High
2 SP01 High
3 SP04 High
4 SP04 High
5 SP05 Low
6 SP05 Low

 

Apparently, the design is faulty ("not full rank"), because the "Strain" category has been accounted for by the "Virulence". In other word, in testing the effect of "Virulence" on gene expression, we cannot control for "Strain". One common solution is to simply remove "Strain" category in the design (design=~Virulence).

 

Here are my concerns:

If we remove "Strain" category, we are treating, in the example above, experiment 1-4 as replicates, implying that they are generated from the same strain, under same condition. Later in analysis, we assume that counts from these 4 experiments follow a negative binomial distribution and fit a glm against them. We know, in fact, they are not from a single strain. Their gene count numbers comes from 2 negative binomial distributions! I worry that the subsequent glm would totally break down.

 

I am considering an alternative solution to this problem:

Lets analyze gene differential expression between all High-Low strain pairs. The core set of genes that contribute to the Virulence phenotype should be the intersection of all sets of differentially expressed genes. For example, we would test DE between SP01 (High) and SP05 (Low) to identify a set of genes that contribute to their phenotype differences. We would also test DE between SP04 (High) and SP05 (Low) to identify another set of genes that contribute to their phenotype differences. In each set, there are real signals and noises. The intersection of the two sets should give the set of core genes that contribute to Virulence phenotype.

While this seems to be a viable solution, this approach removes p-values (or padj values). This is very bad in my opinion.

 

I would like to ask what would you recommend to do in this kind of situation? Is it OK to just remove "Strain" category and run DEseq2 and the glm should still work? I am not strong in stats, I would really appreciate it if you could share some insight in DE analysis.

 

Best regards to all

Lingyu Zhou

deseq2 rnaseq • 532 views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 8 hours ago
United States

When users seek to account for correlations within a group, and then compare directly across group (not a before/after comparison across group, which can be accounted for using a technique in the vignette), my answer I give on the support site is that this cannot be done with fixed effects, and that you can instead use limma with the duplicateCorrelation() approach. See the limma User Guide. You would need replicates for each strain of course.

ADD COMMENT
0
Entering edit mode

Thanks. I will go and study limma a bit more.

ADD REPLY

Login before adding your answer.

Traffic: 558 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6