I have 2 RNA-Seq datasets(tumors) coming from different countries. total samples n1=35(having tumors and normals) that is in-house and n2=99 external(only tumors). When I plot the log normalized data via glMDSplot I see 3 distinct clusters( one of in-house and 2 from the external). I used SVA to correct them at sva=5 which handles not only the differences of the lab but also the difference of the external data that was within them since it was not apparent why they had a huge difference. As I downloaded them from a consortium where meta-information is not present from which centers each of the data was coming so could not decide apriori a batch information. So the batch correction proceeded with SVA.

Having corrected via SVA, my interest is to just find DEGs within the n2=99 samples which are external data having some classification as 2 different tumor classes(T1 and T2). I am only interested to find T2 vs T1 now using also the information of the SVA values as factors in the design and the information of the tumors being Solid and also liquid. So I created a design matrix.

cond<-factor(batch$tumorType[1:99]) # this is the main condition on which I want to perform DE gr<-factor(batch.$dataType[1:99]) # this is another factor which the tumor has like some are solid and others are liquid tumors sv.5<-sv$sv[1:99,] # using the SVA values as factors in the additive model design <- model.matrix(~sv.5+condition+group) design (Intercept) sv.51 sv.52 sv.53 sv.54 sv.55 condT2 grSol 1 1 -0.068 8.472e-02 -0.032 0.020 0.026 0 0 2 1 -0.051 6.310e-02 0.154 -0.069 -0.080 0 1 3 1 -0.094 9.266e-02 0.054 -0.082 0.068 0 1 4 1 -0.059 6.381e-02 0.046 0.126 -0.059 0 1 5 1 -0.076 4.110e-03 -0.182 -0.121 -0.065 0 1 6 1 -0.102 6.418e-02 0.064 -0.291 -0.085 0 1 . . . . . . 93 1 -0.078 7.718e-02 0.164 -0.031 -0.066 1 1 94 1 -0.074 -6.784e-03 -0.061 0.120 -0.138 1 1 95 1 -0.077 6.435e-02 -0.035 0.067 -0.100 1 0 96 1 -0.069 8.491e-02 0.344 0.032 -0.106 1 1 97 1 -0.081 4.205e-02 0.302 -0.048 -0.021 1 1 98 1 -0.065 1.633e-02 0.055 0.049 -0.093 1 1 99 1 -0.064 7.848e-02 -0.029 0.024 -0.025 1 0 nf <- calcNormFactors(counts, method = "TMM") dge <- DGEList(counts =counts, group = cond, norm.factors = nf) v <- voom(dge, design, plot=TRUE) fit <- lmFit(v, design) fit <- eBayes(fit) top <- topTable(fit,coef=ncol(design),n = nrow(counts)) #20k genes in topTable sum(top$adj.P.Val<0.05) # 6646 genes #recalcluating FDR since I did only filtering of 0 counts so re-esitmating based on FC and expression ww <- which(abs(top$logFC)>log2(2) & top$AveExpr > 0) top$fdr <- 1 top$fdr[ww] <- p.adjust(top$P.Value[ww],method="fdr") dim(top) length(ww) degs <- (top[top$fdr < 0.05,]) degs2.cond <- degs[abs(degs$logFC)>1, ] dim(degs2.cond) # 1293 degs

Now the above DEGs are only taking into account the factor level group(solid vs liquid) and masking the condition(T1 vs T2 tumor types). If I remove the group from the design and proceed with just condition=tumorType(T1 vs T2) then I find only 120 degs of which half of them show the trend of the condition. Rest DEGs are not very clean when projected in a heatmap. I am not getting a perfect clustering but still I see some differences between the tumorTypes. How can I improve the design? I am sure am not giving the perfect model matrix and should I include some blocking or constrast so that I can still preserve the effect of tumorTypes(perform T1 vs T2) and not let the dataType (solid vs liquid) mask the DE analysis. Is there a way to proceed for this using voom? Can anyone give me pointers about

This isn't a direct answer, but your FDR calculation is flawed, and will likely result in overstating your significance (i.e. underestimating the true FDR), because you are filtering by logFC, which is not independent of the p-value. You should instead consider using the treat function to test for a logFC significantly greater than a nonzero threshold. Also, the average expression filter is appropriate, but should generally be done before the voom step.

I understand what you say but if you consider of genes that are still significant with low fold changes coming as better FDR estimates then one can, in fact, re-estimate the FDR based on avg expression and the logFC if the starting expression set is not filtered. I am using such FDR calculation as my starting expression set does not account for any apriori filtering strategy based on average expression or read counts. I only remove genes from count table that have zero counts. If I had chosen an expression set apriori with higher average expressed genes then this method would not be required since the FDR estimated by limma-voom will still be pretty good. am starting with the same gene list on which SVA was performed which was also based on the fact of genes filtered for only 0 counts. I agree if I reduce the gene set based on average expression apriori feeding to voom, I will not need to re-estimate the FDR. But that will still not account for my design. Is there also a flaw that you find in the design and give some pointers that I can implement that can take care of the comparison I intend to do?