how to use edgeR analysis RNA-seq data with a obvious deviation sample
Entering edit mode
xiaoaozqd • 0
Last seen 4.4 years ago

the raw code:


> plant = c(rep("MK",5),rep("NR",30),rep("NS",15))
> fungi = c(rep("32",2),rep("26",18),rep("32",30))
> time = c(rep("usp",5),rep("18hpi",3),rep("24hpi",3),rep("48hpi",3),rep("96hpi",3),rep("168hpi",3),rep("18hpi",3),rep("24hpi",3),rep("48hpi",3),rep("96hpi",3),rep("168hpi",3),rep("18hpi",3),rep("24hpi",3),rep("48hpi",3),rep("96hpi",3),rep("168hpi",3))
> batch = c(c("b","c"),rep(c("a","b","c"),16))
> batch = factor(batch)
> grouping <- factor(paste(plant, fungi, time, sep="."))
> table(grouping)
   MK.26.usp    MK.32.usp NR.26.168hpi  NR.26.18hpi  NR.26.24hpi  NR.26.48hpi  NR.26.96hpi NR.32.168hpi  NR.32.18hpi  NR.32.24hpi  NR.32.48hpi  NR.32.96hpi NS.32.168hpi  NS.32.18hpi  NS.32.24hpi  NS.32.48hpi  NS.32.96hpi
           3            2            3            3            3            3            3            3            3            3            3            3            3            3            3            3            3
> counts <- read.csv("All_Fragments_16_3_30.csv")
> head(counts[,1:5],3)
         GeneID MK.32.usp.b MK.32.usp.c MK.26.usp.a MK.26.usp.b
1 PSTCY32_21349         570         517         820         992
2     Contig410           5           5           1           0
3     Contig413         223         328         220         181
> d <- DGEList(counts = counts[,2:51], genes = counts[,1], group = grouping)
> keep <- rowSums(cpm(d)>1)>=2
> d <- d[keep,]
> d <- DGEList(counts= d$counts, genes = d$genes, group = grouping)
> d  =  calcNormFactors(d)
> design = model.matrix(~ condition, samples)
> design <- model.matrix(~ 0 + grouping + batch)

>  d2 = estimateGLMCommonDisp(d,design)
>  d2 = estimateGLMTrendedDisp(d2, design)
>  d2 = estimateGLMTagwiseDisp(d2, design)
> f = glmFit(d2, design)
> con <- makeContrasts( ( groupingNR.26.18hpi - groupingMK.26.usp ) -
 ( groupingNR.32.18hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.18hpi - groupingMK.26.usp ) -
 ( groupingNS.32.18hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.24hpi - groupingMK.26.usp ) -
 ( groupingNR.32.24hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.24hpi - groupingMK.26.usp ) -
 ( groupingNS.32.24hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.48hpi - groupingMK.26.usp ) -
 ( groupingNR.32.48hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.48hpi - groupingMK.26.usp ) -
 ( groupingNS.32.48hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.96hpi - groupingMK.26.usp ) -
 ( groupingNR.32.96hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.96hpi - groupingMK.26.usp ) -
 ( groupingNS.32.96hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.168hpi - groupingMK.26.usp ) -
 ( groupingNR.32.168hpi - groupingMK.32.usp ) ,
 ( groupingNR.26.168hpi - groupingMK.26.usp ) -
 ( groupingNS.32.168hpi - groupingMK.32.usp ) ,levels=design)
> lrt <- glmLRT(f,contrast=con)
> tt <- topTags(lrt, n=Inf)$table


In my data, one biological repeat (CYR32_1) in one group is obvious deviation from the other two biological repeats(CYR32_2 and CYR32_3) in the plotMDS with or without the method BCV.

So I just removed this sample, while when the design (= model.matrix(~ 0 + grouping + batch)) was used, it seemed among all of the data only the batch2 and batch3 were used.
    groupingNILR.B.0h    groupingNILR.A.0h    groupingNILR.B.18h    groupingNILR.B.24h    groupingNILR.S.18h    groupingNILR.S.24h    groupingNILS.B.18h    groupingNILS.B.24h    batchb    batchc

the levels of con:groupingNILR.B.0h    groupingNILR.A.0h    groupingNILR.B.18h    groupingNILR.B.24h    groupingNILR.S.18h    groupingNILR.S.24h    groupingNILS.B.18h    groupingNILS.B.24h    batchb    batchc

After the statistical analysis with glmFit and glmLTR, there are 11994 significantly differentially expressed genes with the FDR<0.05.

I tried to omit the batch effect to use the design3 (= model.matrix(~ 0 + grouping)). The levels of con are groupingMK.26.usp, groupingMK.32.usp, groupingNR.26.18h, groupingNR.26.24h, groupingNR.32.18h, groupingNR.32.24h, groupingNS.32.18h, groupingNS.32.24h.

After the statistical analysis with glmFit and glmLTR, there are only 10398 significantly differentially expressed genes (DEG) with the FDR<0.05. compare with the result of “design”, 8722 was identified by both “design” and “design3”. In addition, the FDR and P-value of the same gene is different.

> design3 (= model.matrix(~ 0 + grouping))

So my questions are:

1. When one of the biological repeat should be removed, how to design the model?
2. In my data, which result seem to be more receivable, the “design” or “design3”, or some others?
3. Does the batch effects have a great effect of the DEG.
4. During the estimation of dispersion, there are three approaches:
d = estimateGLMCommonDisp(d,design)
d = estimateGLMTrendedDisp(d,design)
d = estimateGLMTagwiseDisp(d,design)
After each method was used, the last logFC of each gene will have a minor alteration while the FDR and p-value will have a great change, in some paper only the first and the third method were used, so I’m not so clear whether all of these three approaches should be used together, for in the edgeR users guide just described “The tagwise dispersion approach is strongly recommended in multi-factor experiment cases.”

edger RNA-seq anova • 953 views
Entering edit mode

There's no NILS or NILR in the code you've shown above, only MK, NR or NS.

Entering edit mode
Aaron Lun ★ 27k
Last seen 1 hour ago
The city by the bay

What makes you think that the first batch is not being used? There's no batcha term, but that doesn't mean that the samples in batch A are being ignored. Every grouping* coefficient represents the log-expression of each group, using the sample in batch A as the "baseline". The batchb and batchc terms represent the log-fold change in each other batch over batch A, averaged across all groups. So, all samples that you've included in the design matrix are being used here. Similarly, there's no need to redesign the model if you take out a CYR32 replicate, even if this is the only sample of the group in batch A. (Indeed, the interpretation of the corresponding coefficient does not change; if there is no actual batch A sample in that group, then the coefficient will represent the average log-expression of a hypothetical sample in batch A.)

To determine whether you need to include the batch effect, I would try a couple of things. First, have a look at the MDS plot, and see if your samples are clustering by batch. If they are, then you need to model the batch effect. If not (i.e., samples from different batches intermingle), then it's probably okay to just using grouping alone. Also, I would try the analysis with and without the batch factor, and see what happens to the the number of DE genes. If there's a strong batch effect, then modelling it should increase the number of DE genes. (Specifically, blocking on batch will decrease the dispersion estimates but will use up residual d.f., which generally increase and decrease power, respectively - so you want to see whether the former can offset the latter for DE detection.)

Finally, you should be using all estimateGLM*Disp calls together, as it's a three-step pipeline which uses all functions. Or better yet, use estimateDisp, which is an upgraded single-step version.

Entering edit mode

Thank you very much for your advice, you well explained my doubts. I’m so sorry about the confuse description. In the code I used NR instead of NILR, NS instead of NILS, MK.usp instead of 0h.
In my MDS plot among the six time-points, only one or two points are clustering by batch.

So I decide to analysis with the batch factor. 
Also, I compare the number of DE genes analysis by the design of no_batch and add_batch.

The design model with batch factor (add_batch) have 1600 (15%) more DE genes than without batch factor (no_batch). This also indicate there's a batch effect in my data.

By the way, if I just contrast the different time point under the same plant and the same fungi, can I use the same design model as above?

Thank you again.

Entering edit mode

Yes, you would use the same design matrix, you would just change the contrast matrix. On a similar note, your current contrast matrix has a lot of comparisons stuffed inside it, I would find the resulting set of DE genes rather difficult to interpret as I wouldn't know which comparison was causing the DE. I'd be inclined to use a more restricted set of comparisons unless there's a good reason for doing everything at once.


Login before adding your answer.

Traffic: 303 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6