Question

how to use edgeR analysis RNA-seq data which has four factors

0

Entering edit mode

xiaoaozqd • 0

@xiaoaozqd-9533

Last seen 6.8 years ago

I have a RNA-seq data about the plant inoculated with a fungi pathogen.Three biological replicates of the experiment were conducted at separate times and using independently grown plants and fungi. There are one pair of near-isogenic line, two fungi races and a series of sampling times post inoculation.

The experimental design as follow, can anyone hlep me about how to find the differenttial expression genes with edgeR.

The object is to clear which genes confer the resistance of the plant and which genes confer the virulence of the fungi.

Comprehensive understand the interaction between plant and fungi.

factor1

factor2

factor3

factor4

Sample NO

plant

fungi

time

batch

genotype

1

NIL-R

A

0h

1

mock_R

2

NIL-R

A

0h

2

mock_R

3

NIL-R

A

0h

3

mock_R

4

NIL-R

A

18h

1

susceptible

5

NIL-R

A

18h

2

susceptible

6

NIL-R

A

18h

3

susceptible

7

NIL-R

A

24h

1

susceptible

8

NIL-R

A

24h

2

susceptible

9

NIL-R

A

24h

3

susceptible

10

NIL-R

B

0h

1

mock_R

11

NIL-R

B

0h

2

mock_R

12

NIL-R

B

0h

3

mock_R

13

NIL-R

B

18h

1

resistance

14

NIL-R

B

18h

2

resistance

15

NIL-R

B

18h

3

resistance

16

NIL-R

B

24h

1

resistance

17

NIL-R

B

24h

2

resistance

18

NIL-R

B

24h

3

resistance

19

NIL-S

B

0h

1

mock_S

20

NIL-S

B

0h

2

mock_S

21

NIL-S

B

0h

3

mock_S

22

NIL-S

B

18h

1

susceptible

23

NIL-S

B

18h

2

susceptible

24

NIL-S

B

18h

3

susceptible

25

NIL-S

B

24h

1

susceptible

26

NIL-S

B

24h

2

susceptible

27

NIL-S

B

24h

3

susceptible

edger design matrix comparing more than two groups multiple factor design • 2.0k views

ADD COMMENT • link 8.3 years ago • updated 8.1 years ago xiaoaozqd • 0

score 2 · Answer 1 · 2016-01-16

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 10 hours ago

The city by the bay

The simplest approach here seems to be a one-way layout for the biological terms, with an additive batch effect:

grouping <- factor(paste(plant, fungi, time, sep="."))
batch <- factor(batch)
design <- model.matrix(~ 0 + grouping + batch)

This avoids making any assumptions about the (lack of) interactions between the interesting experimental factors. Otherwise, if you used an additive approach for all of these factors, you would be assuming that, e.g., the effect of time would be the same in different plants or for different fungi, which may not be true.

Anyway, with the design I've described above, it's fairly easy to do comparisons between specific groups. If you want to identify DE genes between, say, NIL-R inoculated with A at 0h with NIL-R inoculated with B at 0h, you could just do something like:

con <- makeContrasts(groupingNILR.B.0h - groupingNILR.A.0h, levels=design)

With this strategy, you can compare any two (or more) groups of interest. For example, you could identify differences in the inoculation effect of A at 18 hours against that of B in NIL-R plants by doing:

con <- makeContrasts((groupingNILR.B.18h - groupingNILR.B.0h) - 
                     (groupingNILR.A.18h - groupingNILR.A.0h), levels=design)

This will compute the log-fold change between 18 and 0 hours after A inoculation, and compare it to the corresponding value after B inoculation. In both cases, note that I've assumed that you've renamed the levels of plants to get rid of the dash, because makeContrasts doesn't like that in the column names of design.

ADD COMMENT • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you very much fro your gudance! Could you kindhearted to point out how to compare the resistance group (groupingNILR.B) to the susceptible group ((groupingNILR.A+groupingNILS.B)/2)?

ADD REPLY • link 8.3 years ago xiaoaozqd • 0

1

Entering edit mode

Well, you've almost written it yourself:

con <- makeContrasts(groupingNILR.B.xh - 
                    (groupingNILR.A.xh + groupingNILS.B.xh)/2, levels=design)

... where you replace x with the time point of interest. You'll want to set up comparisons at each time point as separate contrasts, just in case you have a situation where a gene is up in one time point and down in another time point. Such genes would be missed if you summarized samples from all time points into one DE log-fold change.

On a similar note, the contrast I've written above will only test for differences between NILR.B and the average of the other two groups. It is possible to get situations where the reported log-fold change is positive, but the log-fold change between NILR.B and one of the other groups is negative. To protect against this, I tend to prefer ANOVA-like comparisons:

con <- makeContrasts(groupingNILR.B.xh - groupingNILR.A.xh,
                     groupingNILR.B.xh - groupingNILS.B.xh, levels=design)

I would then focus on significant genes where the log-fold changes are in the same direction, as these are the ones showing a consistent qualitative effect in the resistant group over both of the susceptible groups. If you want to be more rigorous, you can do the NILR.B/NILR.A and NILR.B/NILS.B comparisons separately and intersect the DE lists to obtain genes with significant differences between resistant and both susceptible groups. I think this is probably unnecessarily conservative, though.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you for your response. I really appreciate the advice on ANOVA-like comparisons you have given. From your opinion, it is not sensible to make pairwise comparisons between the resistance group VS the susceptible group which defining each treatment combination as a group. Do you have some good idea about gain the global differential expression genes (needn't set up comparisons at each time point as separate contrasts) involved in the interaction with pathogen, such as resistance or related genes?

ADD REPLY • link 8.3 years ago xiaoaozqd • 0

2

Entering edit mode

Why isn't it sensible? I don't recall saying anything like that.

If you want to find globally DE genes, you can set up an ANOVA-like comparison involving all time points:

con <- makeContrasts(groupingNILR.B.0h - groupingNILR.A.0h,
                     groupingNILR.B.0h - groupingNILS.B.0h,
                     groupingNILR.B.18h - groupingNILR.A.18h,
                     groupingNILR.B.18h - groupingNILS.B.18h,
                     groupingNILR.B.24h - groupingNILR.A.24h,
                     groupingNILR.B.24h - groupingNILS.B.24h, levels=design)

This will give you genes that are DE between the resistant and susceptible groups at any time point. That said, getting DE genes at a time point of zero mightn't be particularly interesting, because inoculation shouldn't have any effect at the start of the time course. A more sophisticated DE comparison might be something like:

con <- makeContrasts((groupingNILR.B.18h - groupingNILR.B.0h) - 
                     (groupingNILR.A.18h - groupingNILR.A.0h),
                     (groupingNILR.B.18h - groupingNILR.B.0h) - 
                     (groupingNILS.B.18h - groupingNILS.B.0h),
                     (groupingNILR.B.24h - groupingNILR.B.0h) - 
                     (groupingNILR.A.24h - groupingNILR.A.0h),
                     (groupingNILR.B.24h - groupingNILR.B.0h) - 
                     (groupingNILS.B.24h - groupingNILS.B.0h), levels=design)

This will correct the DE at the later time points for any (uninteresting) DE at time point zero. You can then focus on genes where the log-fold changes are of the same sign, as these have a consistent association with resistance against susceptibility for both susceptible groups and across all time points.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

So appreciate for your advise! Thank you very much!

ADD REPLY • link 8.3 years ago xiaoaozqd • 0

score 0 · Answer 2 · 2016-01-16

0

Entering edit mode

xiaoaozqd • 0

@xiaoaozqd-9533

Last seen 6.8 years ago

table move to above

ADD COMMENT • link 8.3 years ago xiaoaozqd • 0

0

Entering edit mode

Can you put this table in your original answer, rather than putting it in a new post as you've done here? The ordering of posts changes according to the number of votes, so if you get a lot of replies, this table will get lost.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

thank you for your advise! This have been done!

ADD REPLY • link 8.3 years ago xiaoaozqd • 0

score 0 · Answer 3 · 2016-04-01

Could you kindhearted to help me again?
In my data, one biological repeat (CYR32_1) in one group is obvious deviation from the other two biological repeats(CYR32_2 and CYR32_3) in the plotMDS with or without the method BCV.

So I just removed this sample, while when the design (= model.matrix(~ 0 + grouping + batch)) was used, it seemed among all of the data only the batch2 and batch3 were used.
design
groupingNILR.B.0h groupingNILR.A.0h groupingNILR.B.18h groupingNILR.B.24h groupingNILR.S.18h groupingNILR.S.24h groupingNILS.B.18h groupingNILS.B.24h batchb batchc

the levels of con：groupingNILR.B.0h groupingNILR.A.0h groupingNILR.B.18h groupingNILR.B.24h groupingNILR.S.18h groupingNILR.S.24h groupingNILS.B.18h groupingNILS.B.24h batchb batchc

After the statistical analysis with glmFit and glmLTR, there are 11994 significantly differentially expressed genes with the FDR<0.05.

I tried to omit the batch effect to use the design3 (= model.matrix(~ 0 + grouping)). The levels of con are groupingMK.26.usp, groupingMK.32.usp, groupingNR.26.18h, groupingNR.26.24h, groupingNR.32.18h, groupingNR.32.24h, groupingNS.32.18h, groupingNS.32.24h. After the statistical analysis with glmFit and glmLTR, there are only 10398 significantly differentially expressed genes (DEG) with the FDR<0.05. compare with the result of “design”, 8722 was identified by both “design” and “design3”. In addition, the FDR and P-value of the same gene is different.

So my questions are:
1. When one of the biological repeat should be removed, how to design the model?
2. In my data, which result seem to be more receivable, the “design” or “design3”, or some others?
3. Does the batch effects have a great effect of the DEG.
4. During the estimation of dispersion, there are three approaches:
d = estimateGLMCommonDisp(d,design)
d = estimateGLMTrendedDisp(d,design)
d = estimateGLMTagwiseDisp(d,design)
After each method was used, the last logFC of each gene will have a minor alteration while the FDR and p-value will have a great change, in some paper only the first and the third method were used, so I’m not so clear whether all of these three approaches should be used together, for in the edgeR users guide just described “The tagwise dispersion approach is strongly recommended in multi-factor experiment cases.”