Question: how to use edgeR analysis RNA-seq data which has four factors
0
3.9 years ago by
xiaoaozqd0 wrote:

I have a RNA-seq data about the plant inoculated with a fungi pathogen.Three biological replicates of the experiment were conducted at separate times and using independently grown plants and fungi. There are one pair of near-isogenic line, two fungi races and a series of sampling times post inoculation.

The experimental design as follow, can anyone hlep me about how to find the differenttial expression genes with edgeR.

The object is to clear which genes confer the resistance of the plant and which genes confer the virulence of the fungi.

Comprehensive understand the interaction between plant and fungi.

 factor1
factor2 factor3 factor4

Sample NO
 plant

fungi

 time
 batch
genotype

1

NIL-R A
 0h
1 mock_R
2 NIL-R A 0h

2

mock_R
3 NIL-R A 0h 3 mock_R
4 NIL-R A 18h 1 susceptible
5 NIL-R A
 18h
2 susceptible
6 NIL-R A
 18h
3 susceptible
7 NIL-R A
 24h
1 susceptible
8 NIL-R A
 24h
2 susceptible
9 NIL-R A
 24h
3 susceptible
10
 NIL-R
B
 0h
1 mock_R
11
 NIL-R
B
 0h
2 mock_R
12
 NIL-R
B
 0h
3 mock_R
13
 NIL-R
B
 18h
1 resistance
14
 NIL-R
B
 18h
2 resistance
15
 NIL-R
B
 18h
3 resistance
16
 NIL-R
B 24h 1 resistance
17
 NIL-R
B 24h 2 resistance
18
 NIL-R
B 24h 3 resistance
19
 NIL-S
B 0h 1 mock_S
20
 NIL-S
B 0h 2 mock_S
21
 NIL-S
B 0h 3 mock_S
22
 NIL-S
B 18h 1 susceptible
23
 NIL-S
B 18h 2 susceptible
24
 NIL-S
B 18h 3 susceptible
25
 NIL-S
B 24h 1 susceptible
26
 NIL-S
B 24h 2 susceptible
27
 NIL-S
B 24h 3 susceptible

modified 3.7 years ago • written 3.9 years ago by xiaoaozqd0
Answer: how to use edgeR analysis RNA-seq data which has four factors
2
3.9 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

The simplest approach here seems to be a one-way layout for the biological terms, with an additive batch effect:

grouping <- factor(paste(plant, fungi, time, sep="."))
batch <- factor(batch)
design <- model.matrix(~ 0 + grouping + batch)

This avoids making any assumptions about the (lack of) interactions between the interesting experimental factors. Otherwise, if you used an additive approach for all of these factors, you would be assuming that, e.g., the effect of time would be the same in different plants or for different fungi, which may not be true.

Anyway, with the design I've described above, it's fairly easy to do comparisons between specific groups. If you want to identify DE genes between, say, NIL-R inoculated with A at 0h with NIL-R inoculated with B at 0h, you could just do something like:

con <- makeContrasts(groupingNILR.B.0h - groupingNILR.A.0h, levels=design)

With this strategy, you can compare any two (or more) groups of interest. For example, you could identify differences in the inoculation effect of A at 18 hours against that of B in NIL-R plants by doing:

con <- makeContrasts((groupingNILR.B.18h - groupingNILR.B.0h) -
(groupingNILR.A.18h - groupingNILR.A.0h), levels=design)

This will compute the log-fold change between 18 and 0 hours after A inoculation, and compare it to the corresponding value after B inoculation. In both cases, note that I've assumed that you've renamed the levels of plants to get rid of the dash, because makeContrasts doesn't like that in the column names of design.

Thank you very much fro your gudance! Could you kindhearted to point out how to compare the resistance group (groupingNILR.B) to the susceptible group ((groupingNILR.A+groupingNILS.B)/2)?

1

Well, you've almost written it yourself:

con <- makeContrasts(groupingNILR.B.xh -
(groupingNILR.A.xh + groupingNILS.B.xh)/2, levels=design)

... where you replace x with the time point of interest. You'll want to set up comparisons at each time point as separate contrasts, just in case you have a situation where a gene is up in one time point and down in another time point. Such genes would be missed if you summarized samples from all time points into one DE log-fold change.

On a similar note, the contrast I've written above will only test for differences between NILR.B and the average of the other two groups. It is possible to get situations where the reported log-fold change is positive, but the log-fold change between NILR.B and one of the other groups is negative. To protect against this, I tend to prefer ANOVA-like comparisons:

con <- makeContrasts(groupingNILR.B.xh - groupingNILR.A.xh,
groupingNILR.B.xh - groupingNILS.B.xh, levels=design)

I would then focus on significant genes where the log-fold changes are in the same direction, as these are the ones showing a consistent qualitative effect in the resistant group over both of the susceptible groups. If you want to be more rigorous, you can do the NILR.B/NILR.A and NILR.B/NILS.B comparisons separately and intersect the DE lists to obtain genes with significant differences between resistant and both susceptible groups. I think this is probably unnecessarily conservative, though.

Thank you for your response. I really appreciate the advice on ANOVA-like comparisons you have given. From your opinion, it is not sensible to make pairwise comparisons between the resistance group VS the susceptible group which defining each treatment combination as a group. Do you have some good idea about gain the global differential expression genes (needn't set up comparisons at each time point as separate contrasts) involved in the interaction with pathogen, such as resistance or related genes?

2

Why isn't it sensible? I don't recall saying anything like that.

If you want to find globally DE genes, you can set up an ANOVA-like comparison involving all time points:

con <- makeContrasts(groupingNILR.B.0h - groupingNILR.A.0h,
groupingNILR.B.0h - groupingNILS.B.0h,
groupingNILR.B.18h - groupingNILR.A.18h,
groupingNILR.B.18h - groupingNILS.B.18h,
groupingNILR.B.24h - groupingNILR.A.24h,
groupingNILR.B.24h - groupingNILS.B.24h, levels=design)

This will give you genes that are DE between the resistant and susceptible groups at any time point. That said, getting DE genes at a time point of zero mightn't be particularly interesting, because inoculation shouldn't have any effect at the start of the time course. A more sophisticated DE comparison might be something like:

con <- makeContrasts((groupingNILR.B.18h - groupingNILR.B.0h) -
(groupingNILR.A.18h - groupingNILR.A.0h),
(groupingNILR.B.18h - groupingNILR.B.0h) -
(groupingNILS.B.18h - groupingNILS.B.0h),
(groupingNILR.B.24h - groupingNILR.B.0h) -
(groupingNILR.A.24h - groupingNILR.A.0h),
(groupingNILR.B.24h - groupingNILR.B.0h) -
(groupingNILS.B.24h - groupingNILS.B.0h), levels=design)

This will correct the DE at the later time points for any (uninteresting) DE at time point zero. You can then focus on genes where the log-fold changes are of the same sign, as these have a consistent association with resistance against susceptibility for both susceptible groups and across all time points.

Answer: how to use edgeR analysis RNA-seq data which has four factors
0
3.9 years ago by
xiaoaozqd0 wrote:

table move to above

Can you put this table in your original answer, rather than putting it in a new post as you've done here? The ordering of posts changes according to the number of votes, so if you get a lot of replies, this table will get lost.

Answer: how to use edgeR analysis RNA-seq data which has four factors
0
3.7 years ago by
xiaoaozqd0 wrote:

Could you kindhearted to help me again?
In my data, one biological repeat (CYR32_1) in one group is obvious deviation from the other two biological repeats(CYR32_2 and CYR32_3) in the plotMDS with or without the method BCV.

So I just removed this sample, while when the design (= model.matrix(~ 0 + grouping + batch)) was used, it seemed among all of the data only the batch2 and batch3 were used.
design
groupingNILR.B.0h    groupingNILR.A.0h    groupingNILR.B.18h    groupingNILR.B.24h    groupingNILR.S.18h    groupingNILR.S.24h    groupingNILS.B.18h    groupingNILS.B.24h    batchb    batchc

the levels of con：groupingNILR.B.0h    groupingNILR.A.0h    groupingNILR.B.18h    groupingNILR.B.24h    groupingNILR.S.18h    groupingNILR.S.24h    groupingNILS.B.18h    groupingNILS.B.24h    batchb    batchc

After the statistical analysis with glmFit and glmLTR, there are 11994 significantly differentially expressed genes with the FDR<0.05.

I tried to omit the batch effect to use the design3 (= model.matrix(~ 0 + grouping)). The levels of con are groupingMK.26.usp, groupingMK.32.usp, groupingNR.26.18h, groupingNR.26.24h, groupingNR.32.18h, groupingNR.32.24h, groupingNS.32.18h, groupingNS.32.24h. After the statistical analysis with glmFit and glmLTR, there are only 10398 significantly differentially expressed genes (DEG) with the FDR<0.05. compare with the result of “design”, 8722 was identified by both “design” and “design3”. In addition, the FDR and P-value of the same gene is different.

So my questions are:
1. When one of the biological repeat should be removed, how to design the model?
2. In my data, which result seem to be more receivable, the “design” or “design3”, or some others?
3. Does the batch effects have a great effect of the DEG.
4. During the estimation of dispersion, there are three approaches:
d = estimateGLMCommonDisp(d,design)
d = estimateGLMTrendedDisp(d,design)
d = estimateGLMTagwiseDisp(d,design)
After each method was used, the last logFC of each gene will have a minor alteration while the FDR and p-value will have a great change, in some paper only the first and the third method were used, so I’m not so clear whether all of these three approaches should be used together, for in the edgeR users guide just described “The tagwise dispersion approach is strongly recommended in multi-factor experiment cases.”

Put this in a new question. Write some code that allows us to regenerate your design matrix without copy-and-pasting the descriptions for each sample. Finally, show the code you used for every step of the edgeR analysis.