I have 2 RNA-Seq datasets(tumors) coming from different countries. total samples n1=35(having tumors and normals) that is in-house and n2=99 external(only tumors). When I plot the log normalized data via glMDSplot I see 3 distinct clusters( one of in-house and 2 from the external). I used SVA to correct them at sva=5 which handles not only the differences of the lab but also the difference of the external data that was within them since it was not apparent why they had a huge difference. As I downloaded them from a consortium where meta-information is not present from which centers each of the data was coming so could not decide apriori a batch information. So the batch correction proceeded with SVA.
Having corrected via SVA, my interest is to just find DEGs within the n2=99 samples which are external data having some classification as 2 different tumor classes(T1 and T2). I am only interested to find T2 vs T1 now using also the information of the SVA values as factors in the design and the information of the tumors being Solid and also liquid. So I created a design matrix.
cond<-factor(batch$tumorType[1:99]) # this is the main condition on which I want to perform DE gr<-factor(batch.$dataType[1:99]) # this is another factor which the tumor has like some are solid and others are liquid tumors sv.5<-sv$sv[1:99,] # using the SVA values as factors in the additive model design <- model.matrix(~sv.5+condition+group) design (Intercept) sv.51 sv.52 sv.53 sv.54 sv.55 condT2 grSol 1 1 -0.068 8.472e-02 -0.032 0.020 0.026 0 0 2 1 -0.051 6.310e-02 0.154 -0.069 -0.080 0 1 3 1 -0.094 9.266e-02 0.054 -0.082 0.068 0 1 4 1 -0.059 6.381e-02 0.046 0.126 -0.059 0 1 5 1 -0.076 4.110e-03 -0.182 -0.121 -0.065 0 1 6 1 -0.102 6.418e-02 0.064 -0.291 -0.085 0 1 . . . . . . 93 1 -0.078 7.718e-02 0.164 -0.031 -0.066 1 1 94 1 -0.074 -6.784e-03 -0.061 0.120 -0.138 1 1 95 1 -0.077 6.435e-02 -0.035 0.067 -0.100 1 0 96 1 -0.069 8.491e-02 0.344 0.032 -0.106 1 1 97 1 -0.081 4.205e-02 0.302 -0.048 -0.021 1 1 98 1 -0.065 1.633e-02 0.055 0.049 -0.093 1 1 99 1 -0.064 7.848e-02 -0.029 0.024 -0.025 1 0 nf <- calcNormFactors(counts, method = "TMM") dge <- DGEList(counts =counts, group = cond, norm.factors = nf) v <- voom(dge, design, plot=TRUE) fit <- lmFit(v, design) fit <- eBayes(fit) top <- topTable(fit,coef=ncol(design),n = nrow(counts)) #20k genes in topTable sum(top$adj.P.Val<0.05) # 6646 genes #recalcluating FDR since I did only filtering of 0 counts so re-esitmating based on FC and expression ww <- which(abs(top$logFC)>log2(2) & top$AveExpr > 0) top$fdr <- 1 top$fdr[ww] <- p.adjust(top$P.Value[ww],method="fdr") dim(top) length(ww) degs <- (top[top$fdr < 0.05,]) degs2.cond <- degs[abs(degs$logFC)>1, ] dim(degs2.cond) # 1293 degs
Now the above DEGs are only taking into account the factor level group(solid vs liquid) and masking the condition(T1 vs T2 tumor types). If I remove the group from the design and proceed with just condition=tumorType(T1 vs T2) then I find only 120 degs of which half of them show the trend of the condition. Rest DEGs are not very clean when projected in a heatmap. I am not getting a perfect clustering but still I see some differences between the tumorTypes. How can I improve the design? I am sure am not giving the perfect model matrix and should I include some blocking or constrast so that I can still preserve the effect of tumorTypes(perform T1 vs T2) and not let the dataType (solid vs liquid) mask the DE analysis. Is there a way to proceed for this using voom? Can anyone give me pointers about