how to use edgeR analysis RNA-seq data which has four factors
3
0
Entering edit mode
xiaoaozqd • 0
@xiaoaozqd-9533
Last seen 7.5 years ago

I have a RNA-seq data about the plant inoculated with a fungi pathogen.Three biological replicates of the experiment were conducted at separate times and using independently grown plants and fungi. There are one pair of near-isogenic line, two fungi races and a series of sampling times post inoculation.

 

The experimental design as follow, can anyone hlep me about how to find the differenttial expression genes with edgeR.

The object is to clear which genes confer the resistance of the plant and which genes confer the virulence of the fungi.

Comprehensive understand the interaction between plant and fungi.

 
factor1
factor2 factor3 factor4

 

Sample NO

plant

 

fungi

time
batch
genotype

1

NIL-R A
0h
1 mock_R
2 NIL-R A 0h

2

mock_R
3 NIL-R A 0h 3 mock_R
4 NIL-R A 18h 1 susceptible
5 NIL-R A
18h
2 susceptible
6 NIL-R A
18h
3 susceptible
7 NIL-R A
24h
1 susceptible
8 NIL-R A
24h
2 susceptible
9 NIL-R A
24h
3 susceptible
10
NIL-R
B
0h
1 mock_R
11
NIL-R
B
0h
2 mock_R
12
NIL-R
B
0h
3 mock_R
13
NIL-R
B
18h
1 resistance
14
NIL-R
B
18h
2 resistance
15
NIL-R
B
18h
3 resistance
16
NIL-R
B 24h 1 resistance
17
NIL-R
B 24h 2 resistance
18
NIL-R
B 24h 3 resistance
19
NIL-S
B 0h 1 mock_S
20
NIL-S
B 0h 2 mock_S
21
NIL-S
B 0h 3 mock_S
22
NIL-S
B 18h 1 susceptible
23
NIL-S
B 18h 2 susceptible
24
NIL-S
B 18h 3 susceptible
25
NIL-S
B 24h 1 susceptible
26
NIL-S
B 24h 2 susceptible
27
NIL-S
B 24h 3 susceptible
           

 

 

edger design matrix comparing more than two groups multiple factor design • 2.3k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 12 hours ago
The city by the bay

The simplest approach here seems to be a one-way layout for the biological terms, with an additive batch effect:

grouping <- factor(paste(plant, fungi, time, sep="."))
batch <- factor(batch)
design <- model.matrix(~ 0 + grouping + batch)

This avoids making any assumptions about the (lack of) interactions between the interesting experimental factors. Otherwise, if you used an additive approach for all of these factors, you would be assuming that, e.g., the effect of time would be the same in different plants or for different fungi, which may not be true.

Anyway, with the design I've described above, it's fairly easy to do comparisons between specific groups. If you want to identify DE genes between, say, NIL-R inoculated with A at 0h with NIL-R inoculated with B at 0h, you could just do something like:

con <- makeContrasts(groupingNILR.B.0h - groupingNILR.A.0h, levels=design)

With this strategy, you can compare any two (or more) groups of interest. For example, you could identify differences in the inoculation effect of A at 18 hours against that of B in NIL-R plants by doing:

con <- makeContrasts((groupingNILR.B.18h - groupingNILR.B.0h) - 
                     (groupingNILR.A.18h - groupingNILR.A.0h), levels=design)

This will compute the log-fold change between 18 and 0 hours after A inoculation, and compare it to the corresponding value after B inoculation. In both cases, note that I've assumed that you've renamed the levels of plants to get rid of the dash, because makeContrasts doesn't like that in the column names of design.

ADD COMMENT
0
Entering edit mode

Thank you very much fro your gudance! Could you kindhearted to point out how to compare the resistance group (groupingNILR.B) to the susceptible group ((groupingNILR.A+groupingNILS.B)/2)?

ADD REPLY
1
Entering edit mode

Well, you've almost written it yourself:

con <- makeContrasts(groupingNILR.B.xh - 
                    (groupingNILR.A.xh + groupingNILS.B.xh)/2, levels=design)

... where you replace x with the time point of interest. You'll want to set up comparisons at each time point as separate contrasts, just in case you have a situation where a gene is up in one time point and down in another time point. Such genes would be missed if you summarized samples from all time points into one DE log-fold change.

On a similar note, the contrast I've written above will only test for differences between NILR.B and the average of the other two groups. It is possible to get situations where the reported log-fold change is positive, but the log-fold change between NILR.B and one of the other groups is negative. To protect against this, I tend to prefer ANOVA-like comparisons:

con <- makeContrasts(groupingNILR.B.xh - groupingNILR.A.xh,
                     groupingNILR.B.xh - groupingNILS.B.xh, levels=design)

I would then focus on significant genes where the log-fold changes are in the same direction, as these are the ones showing a consistent qualitative effect in the resistant group over both of the susceptible groups. If you want to be more rigorous, you can do the NILR.B/NILR.A and NILR.B/NILS.B comparisons separately and intersect the DE lists to obtain genes with significant differences between resistant and both susceptible groups. I think this is probably unnecessarily conservative, though.

ADD REPLY
0
Entering edit mode

Thank you for your response. I really appreciate the advice on ANOVA-like comparisons you have given. From your opinion, it is not sensible to make pairwise comparisons between the resistance group VS the susceptible group which defining each treatment combination as a group. Do you have some good idea about gain the global differential expression genes (needn't set up comparisons at each time point as separate contrasts) involved in the interaction with pathogen, such as resistance or related genes?

ADD REPLY
2
Entering edit mode

Why isn't it sensible? I don't recall saying anything like that.

If you want to find globally DE genes, you can set up an ANOVA-like comparison involving all time points:

con <- makeContrasts(groupingNILR.B.0h - groupingNILR.A.0h,
                     groupingNILR.B.0h - groupingNILS.B.0h,
                     groupingNILR.B.18h - groupingNILR.A.18h,
                     groupingNILR.B.18h - groupingNILS.B.18h,
                     groupingNILR.B.24h - groupingNILR.A.24h,
                     groupingNILR.B.24h - groupingNILS.B.24h, levels=design)

This will give you genes that are DE between the resistant and susceptible groups at any time point. That said, getting DE genes at a time point of zero mightn't be particularly interesting, because inoculation shouldn't have any effect at the start of the time course. A more sophisticated DE comparison might be something like:

con <- makeContrasts((groupingNILR.B.18h - groupingNILR.B.0h) - 
                     (groupingNILR.A.18h - groupingNILR.A.0h),
                     (groupingNILR.B.18h - groupingNILR.B.0h) - 
                     (groupingNILS.B.18h - groupingNILS.B.0h),
                     (groupingNILR.B.24h - groupingNILR.B.0h) - 
                     (groupingNILR.A.24h - groupingNILR.A.0h),
                     (groupingNILR.B.24h - groupingNILR.B.0h) - 
                     (groupingNILS.B.24h - groupingNILS.B.0h), levels=design)

This will correct the DE at the later time points for any (uninteresting) DE at time point zero. You can then focus on genes where the log-fold changes are of the same sign, as these have a consistent association with resistance against susceptibility for both susceptible groups and across all time points.

ADD REPLY
0
Entering edit mode

So appreciate for your advise! Thank you very much!

ADD REPLY
0
Entering edit mode
xiaoaozqd • 0
@xiaoaozqd-9533
Last seen 7.5 years ago

table move to above

ADD COMMENT
0
Entering edit mode

Can you put this table in your original answer, rather than putting it in a new post as you've done here? The ordering of posts changes according to the number of votes, so if you get a lot of replies, this table will get lost.

ADD REPLY
0
Entering edit mode

thank you for your advise! This have been done!

ADD REPLY
0
Entering edit mode
xiaoaozqd • 0
@xiaoaozqd-9533
Last seen 7.5 years ago

Could you kindhearted to help me again? 
In my data, one biological repeat (CYR32_1) in one group is obvious deviation from the other two biological repeats(CYR32_2 and CYR32_3) in the plotMDS with or without the method BCV.

So I just removed this sample, while when the design (= model.matrix(~ 0 + grouping + batch)) was used, it seemed among all of the data only the batch2 and batch3 were used.
design
    groupingNILR.B.0h    groupingNILR.A.0h    groupingNILR.B.18h    groupingNILR.B.24h    groupingNILR.S.18h    groupingNILR.S.24h    groupingNILS.B.18h    groupingNILS.B.24h    batchb    batchc


the levels of con:groupingNILR.B.0h    groupingNILR.A.0h    groupingNILR.B.18h    groupingNILR.B.24h    groupingNILR.S.18h    groupingNILR.S.24h    groupingNILS.B.18h    groupingNILS.B.24h    batchb    batchc

After the statistical analysis with glmFit and glmLTR, there are 11994 significantly differentially expressed genes with the FDR<0.05.

I tried to omit the batch effect to use the design3 (= model.matrix(~ 0 + grouping)). The levels of con are groupingMK.26.usp, groupingMK.32.usp, groupingNR.26.18h, groupingNR.26.24h, groupingNR.32.18h, groupingNR.32.24h, groupingNS.32.18h, groupingNS.32.24h. After the statistical analysis with glmFit and glmLTR, there are only 10398 significantly differentially expressed genes (DEG) with the FDR<0.05. compare with the result of “design”, 8722 was identified by both “design” and “design3”. In addition, the FDR and P-value of the same gene is different.


So my questions are:
1. When one of the biological repeat should be removed, how to design the model?
2. In my data, which result seem to be more receivable, the “design” or “design3”, or some others?
3. Does the batch effects have a great effect of the DEG.
4. During the estimation of dispersion, there are three approaches:
d = estimateGLMCommonDisp(d,design)
d = estimateGLMTrendedDisp(d,design)
d = estimateGLMTagwiseDisp(d,design)
After each method was used, the last logFC of each gene will have a minor alteration while the FDR and p-value will have a great change, in some paper only the first and the third method were used, so I’m not so clear whether all of these three approaches should be used together, for in the edgeR users guide just described “The tagwise dispersion approach is strongly recommended in multi-factor experiment cases.”

ADD COMMENT
0
Entering edit mode

Put this in a new question. Write some code that allows us to regenerate your design matrix without copy-and-pasting the descriptions for each sample. Finally, show the code you used for every step of the edgeR analysis.

ADD REPLY
0
Entering edit mode

thank you very much of your replay, the new post in how to use edgeR analysis RNA-seq data with a obvious deviation

ADD REPLY

Login before adding your answer.

Traffic: 589 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6