Question

Is there are better way to improve the design via blocking or contrasts while using limma-voom for the mentioned datasets?

0

Entering edit mode

vd4mmind • 0

@vd4mmind-11547

Last seen 3.3 years ago

United States

I have 2 RNA-Seq datasets(tumors) coming from different countries. total samples n1=35(having tumors and normals) that is in-house and n2=99 external(only tumors). When I plot the log normalized data via glMDSplot I see 3 distinct clusters( one of in-house and 2 from the external). I used SVA to correct them at sva=5 which handles not only the differences of the lab but also the difference of the external data that was within them since it was not apparent why they had a huge difference. As I downloaded them from a consortium where meta-information is not present from which centers each of the data was coming so could not decide apriori a batch information. So the batch correction proceeded with SVA.

Having corrected via SVA, my interest is to just find DEGs within the n2=99 samples which are external data having some classification as 2 different tumor classes(T1 and T2). I am only interested to find T2 vs T1 now using also the information of the SVA values as factors in the design and the information of the tumors being Solid and also liquid. So I created a design matrix.

cond<-factor(batch$tumorType[1:99]) # this is the main condition on which I want to perform DE
gr<-factor(batch.$dataType[1:99]) # this is another factor which the tumor has like some are solid and others are liquid tumors
sv.5<-sv$sv[1:99,] # using the SVA values as factors in the additive model
design <- model.matrix(~sv.5+condition+group)
design
(Intercept)   sv.51    sv.52      sv.53  sv.54  sv.55   condT2     grSol

1            1 -0.068  8.472e-02 -0.032  0.020  0.026   0           0
2            1 -0.051  6.310e-02  0.154 -0.069 -0.080   0           1
3            1 -0.094  9.266e-02  0.054 -0.082  0.068   0           1
4            1 -0.059  6.381e-02  0.046  0.126 -0.059   0           1
5            1 -0.076  4.110e-03 -0.182 -0.121 -0.065   0           1
6            1 -0.102  6.418e-02  0.064 -0.291 -0.085   0           1
. . .
. . .
93           1 -0.078  7.718e-02  0.164 -0.031 -0.066   1           1
94           1 -0.074 -6.784e-03 -0.061  0.120 -0.138   1           1
95           1 -0.077  6.435e-02 -0.035  0.067 -0.100   1           0
96           1 -0.069  8.491e-02  0.344  0.032 -0.106   1           1
97           1 -0.081  4.205e-02  0.302 -0.048 -0.021   1           1
98           1 -0.065  1.633e-02  0.055  0.049 -0.093   1           1
99           1 -0.064  7.848e-02 -0.029  0.024 -0.025   1           0

nf <- calcNormFactors(counts, method = "TMM")
dge <- DGEList(counts =counts, group = cond, norm.factors = nf)
v <- voom(dge, design, plot=TRUE) 
fit <- lmFit(v, design)
fit <- eBayes(fit)
top <- topTable(fit,coef=ncol(design),n = nrow(counts)) #20k genes in topTable
sum(top$adj.P.Val<0.05) # 6646 genes

#recalcluating FDR since I did only filtering of 0 counts so re-esitmating based on FC and expression

ww <- which(abs(top$logFC)>log2(2) & top$AveExpr > 0)
top$fdr <- 1
top$fdr[ww] <- p.adjust(top$P.Value[ww],method="fdr")
dim(top)
length(ww)
degs <- (top[top$fdr < 0.05,])
degs2.cond <- degs[abs(degs$logFC)>1, ]
dim(degs2.cond) # 1293 degs

Now the above DEGs are only taking into account the factor level group(solid vs liquid) and masking the condition(T1 vs T2 tumor types). If I remove the group from the design and proceed with just condition=tumorType(T1 vs T2) then I find only 120 degs of which half of them show the trend of the condition. Rest DEGs are not very clean when projected in a heatmap. I am not getting a perfect clustering but still I see some differences between the tumorTypes. How can I improve the design? I am sure am not giving the perfect model matrix and should I include some blocking or constrast so that I can still preserve the effect of tumorTypes(perform T1 vs T2) and not let the dataType (solid vs liquid) mask the DE analysis. Is there a way to proceed for this using voom? Can anyone give me pointers about

limma-voom voom limma rna-seq • 2.0k views

ADD COMMENT • link updated 7.1 years ago by Aaron Lun ★ 28k • written 7.1 years ago by vd4mmind • 0

2

Entering edit mode

This isn't a direct answer, but your FDR calculation is flawed, and will likely result in overstating your significance (i.e. underestimating the true FDR), because you are filtering by logFC, which is not independent of the p-value. You should instead consider using the treat function to test for a logFC significantly greater than a nonzero threshold. Also, the average expression filter is appropriate, but should generally be done before the voom step.

ADD REPLY • link 7.1 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

I understand what you say but if you consider of genes that are still significant with low fold changes coming as better FDR estimates then one can, in fact, re-estimate the FDR based on avg expression and the logFC if the starting expression set is not filtered. I am using such FDR calculation as my starting expression set does not account for any apriori filtering strategy based on average expression or read counts. I only remove genes from count table that have zero counts. If I had chosen an expression set apriori with higher average expressed genes then this method would not be required since the FDR estimated by limma-voom will still be pretty good. am starting with the same gene list on which SVA was performed which was also based on the fact of genes filtered for only 0 counts. I agree if I reduce the gene set based on average expression apriori feeding to voom, I will not need to re-estimate the FDR. But that will still not account for my design. Is there also a flaw that you find in the design and give some pointers that I can implement that can take care of the comparison I intend to do?

ADD REPLY • link 7.1 years ago vd4mmind • 0

score 5 · Answer 1 · 2017-10-31

5

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 4 hours ago

The city by the bay

Now the above DEGs are only taking into account the factor level group(solid vs liquid) and masking the condition(T1 vs T2 tumor types).

Well, obviously. You've specified coef=ncol(design), and your design matrix suggests that the last coefficient should represent the group effect, i.e., of solid vs liquid tumours. So this will test for DE between solid and liquid tumours, blocking on all the other factors. If you want to test for differences between T1 and T2, you should start by specifying the correct coef in the topTable call.

I remove the group from the design and proceed with just condition=tumorType(T1 vs T2) then I find only 120 degs of which half of them show the trend of the condition.

This is directly caused by inflated variances when you have factors of variation that are not captured in your design matrix. Also, I don't understand what you mean by "trend of the condition", nor do I understand what "Rest DEGs" are in the next sentence.

I understand what you say but if you consider of genes that are still significant with low fold changes coming as better FDR estimates...

I don't know what you mean by this, but let me assure you, Ryan is right and your filtering strategy is wrong. You cannot apply the BH correction on genes that have been filtered by log-fold change. This is guaranteed to enrich for false positives as the log-fold change is not independent of the p-value under the null hypothesis. Consider this simple scenario:

set.seed(1000)
y <- matrix(rnorm(100000), ncol=10)
design <- model.matrix(~gl(2,5))
fit <- lmFit(y, design)
fit <- eBayes(fit)
res <- topTable(fit, coef=2)
sum(res$adj.P.Val <= 0.05) # 0
ww <- abs(res$logFC) > 1
new.fdr <- p.adjust(res$P.Value[ww], method="BH")
sum(new.fdr <= 0.05) # 10

Congratulations; you've just detected 10 genes in a data set with no true differences. Further investigation will reveal that your filtering approach will distort the p-value distribution under the null hypothesis, rendering it non-uniform, and thereby break the BH correction. Proper filtering should be independent of the test statistic, and there is some literature about this - check it out.

P.S. I just realized that you are trying to merge an in-house data set with an external data set. This is rarely a good idea for DE analyses; for starters, the mean-variance relationship is unlikely to be the same between batches, which reduces the accuracy of the modelling. The batch effect may also distort TMM normalization by causing genes to become "DE". Statistics aside, common sense should prevail here; do you really want to compare your in-house normal samples to tumour samples collected and sequenced by someone else at a different time and in a different lab? A much safer approach is to perform DE analyses separately within each batch and to perform a meta-analysis to extract the features of interest; see Merge different datasets regarding a common number of DE genes in R.

ADD COMMENT • link 7.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for the explanation. I was trying to understand basically why is it wrong to re-estimate the FDR from the limma, as far as I have read, most of these packages are in fact controlling the false discovery rate. Also, we are not guaranteed by the fact that what are the true-positive genes in the comparison. However , when we perform p.adjust should not we care about genes that have some average expression and logFC and calculcate the FDR for a control-measure. I am obviously stating this since my gene table that I feed for DE analysis does not account for a threshold of expression. I only remove genes with 0 counts and not with a read count threshold. If I start with lower expression set which read count threshold , I will obviously have better FDR estimates. So why is it wrong if I re-measure FDR for genes with certain logFC and avgExp. I might be wrongly interpreting but if you can re-clarify that would be nice. I would anyway think of resorting to restricting expression set of genes to test the DE and not control the FDR but I fail to understand if I do not remove lowly expressed genes apriori , why it will wrong to re-measure the FDR using specific logFC and top$avgExp, am not using only the logFC alone and I know it is wrong to do that alone.

ADD REPLY • link 7.1 years ago vd4mmind • 0

2

Entering edit mode

Aaron has already given you a simulated example where FDR calculation after logFC filtering clearly gives the wrong answer, and he has given you a link to a publication with more information about the topic. We have already explained that filtering genes with low average expression is appropriate and will not compromise the FDR calculation. You seem to believe that combining both filters is acceptable even though filtering only on logFC is not, which is absurd. That is like saying that a diet of vegetables and cookies is healthy because vegetables are healthy. If you want a non-zero logFC threshold, use the treat function as I have already recommended. The treat function provides a way to test a logFC threshold while properly controlling the FDR.

ADD REPLY • link 7.1 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Sure, thanks for the heads up.

ADD REPLY • link 7.1 years ago vd4mmind • 0

0

Entering edit mode

Dear Aaron,

I re-did the analysis, and what I did here is first performing SVA on my the public data and our data(I want to project all data together), correct them for the confounders. I need to find actually the difference within the public data as I said between class of T2 vs T1. However, since the data is having an internal batch as I see in MDS plot. This is a reason I corrected with SVA since meta-data does not account for the information from which centers the data come from. So what I did is just perform DE analysis with voom on the public data and plot the degs on the sva corrected data. I did not use the sva as covariates in the design matrix this time. What I wanted to use is a design matrix with the cond=tumorType (levels(c(T1,T2) and group= c(Solid, Liquid). Below is the code. I want to test DE for T2 vs T1 but block the effect of Solid vs Liquid. Can you tell if this is the correct way to take account for it or not? Or should I use something else? I construct the design matrix taking both but use coeff=2 for topTable

code:
cond<-factor(batch$tumorType[1:99]) # this is the main condition on which I want to perform DE
gr<-factor(batch.$dataType[1:99]) # this is another factor which the tumor has like some are solid and others are liquid tumors

design<-model.matrix(~cond+gr,data=counts)
design
     (Intercept)       condT2              grSol
1           1               0                  0
2           1               0                  1
3           1               0                  1
4           1               0                  1
5           1               0                  1
6           1               0                  1
...
...
94           1              1                  1
95           1              1                  0
96           1              1                  1
97           1              1                  1
98           1              1                  1
99           1              1                  0

dge <- DGEList(counts = counts)
d <- calcNormFactors(dge, method = "TMM")
v <- voom(d, design, plot=TRUE) 
fit <- lmFit(v, design)
fit <- eBayes(fit)
top <- topTable(fit,coef=2,n = nrow(counts))
sum(top$adj.P.Val<0.05)
ind<-which(top$adj.P.Val<0.05 & abs(top$logFC)>1) 
degs_0.05_1<-top[ind,]
dim(degs_0.05_1) 
#[1] 482   6

The above code gives me 482 DEGs which if I plot on the SVA corrected data gives me genes that are mostly showing clustering of T2vsT1 and the effect of Solid vs Liquid is not dominating it. I would like to have some feedback if this is fine or not. Or is it advisable to you the SVA as batch covariates. In that case how should I control for the Solid vs liquid and just perform the DE for T2 vs T1. Should it be still same coeff=2. Looking for some feedback. Thanks

ADD REPLY • link 7.1 years ago vd4mmind • 0

1

Entering edit mode

The coef=2 is correct if you want to test for the T1 vs T2 effect.
You should use treat rather than filtering manually on the log-fold change.
It is fine to block on the SVA-identified factors as well, just put them into the design matrix. However, I just realized you're merging in-house data with external data, see my edit above.

ADD REPLY • link 7.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks for your reply. I am not using SVA as covariates in the model. I just did the batch correction with our data and the public data not for using them for DE, but to project the DEGs that I find in public data and to see how they behave in both our data and public data. We also have tumors in our dataset and also normals. The idea is to precisely find the overlap of DEGs between the similar kind of tumor classes that I also have in our data and public data. And see if those overlapping genes are also holding the same clustering of separating T1 and T2 tumors in 2 separate clusters or not. This is the reason I did SVA to just project the DEGs that I found in the public data and confirm if they have the same clustering of tumors as ours in the SVA corrected data. I am not doing any DE analysis between our data and public data. I am doing separate DE analysis in tumor classes of public data and our data. I cannot since there is a lot of noise in both the data owing to the batch of time and lab, plus different countries and also the public data has the internal batch which is not informed in their metadata. Thanks for your input

ADD REPLY • link 7.1 years ago vd4mmind • 0