Pre-processing of RNA-Seq data for WGCNA and DGE
2
2
Entering edit mode
gokce.ouz ▴ 70
@gokceouz-11205
Last seen 6.0 years ago

Hi All,

I would like to do differential gene expression analysis and WGCNA usinf paired-end RNA seq data.  However,I confused a lot related to using same pre-processing for both DGE & WGCNA analysis.

In summary, I have 2 experimental condition and each condition has 18 samples. However, they are sequenced in 3 batches. 1 st batch 6 pair (disease & control), 2nd batch 6 pair (disease & control), 3rd batch 6 pair (disease & control). Each batch done by different people over different times so I believe we have batch effect. As I am using DESeq2 for my analysis I set my model design as “Batch+Type”. I was normally using removeBatchEffect from limma package and plotting PCA & hierarchical clustering plots using spermann correlation  to see which samples are outliers.

I have asked  before if I should use rlog() for differential gene expression analysis and getVarianceStabilizedData() for WGCNA or can I use same normalization for both. I got a feedback that it is better  to use voom() the removedbatcheffect() as they belong to the same package.Do you agree with this suggestion ?

Now, I am using voom() & removedbatcheffect() for WGCNA analysis but what do you suggest me to do for DGE ? Should I use limma package or DESeq2 package ( rld() or vsd() ) ?

In addition to these, I would like to take your advice about if I should remove any patients from analysis according to these hierarchical clusters:

Same data, different transformation methods( rld, vsd, voom), removed batch effect and plot hierarchical clustering using Spearmann correlation.

I really appreciate if you give me some insight,

WGCNA RNA-Seq DESeq2 voom removebatcheffect() • 6.8k views
7
Entering edit mode
@mikelove
Last seen 3 days ago
United States

There's not much point asking the package maintainers which package to use for DE, as we both will recommend our own most likely. But I can tell you what makes sense from my point of view in terms of cross package usage:

Q: "I have asked  before if I should use rlog() for differential gene expression analysis and getVarianceStabilizedData() for WGCNA or can I use same normalization for both. I got a feedback that it is better  to use voom() the removedbatcheffect() as they belong to the same package.Do you agree with this suggestion ?"

I recommend that people use removeBatchEffect() after rlog() or vst() functions if they want to make EDA plots of the data or some other downstream analysis of variance stabilized data after removing mean shifts associated with batch variables. There is absolutely nothing wrong with using these functions across packages.

You shouldn't use removeBatchEffect() before doing DE testing though. For DESeq2 analysis, just use ~batch + type and then run DESeq(). Or use the standard limma-voom pipeline.

Q: "Now, I am using voom() & removedbatcheffect() for WGCNA analysis but what do you suggest me to do for DGE ? Should I use limma package or DESeq2 package ( rld() or vsd() )?"

Do not use rld() or vst() for DGE testing. We say as much in the vignette in the first sentence of the section describing these transformations. Use either DESeq() on raw counts or use the standard limma-voom pipeline.

0
Entering edit mode

Hi Michael,

I guess I explained myself wrong.  As you said, I am using the DESeq() on raw counts then using results() on the "dds" variable to do DGE with defining the contrasts & alpha. The aim of rld or vsd transformation is for EDA and to see how samples are clustering via PCA, hierarchical clustering or MDS plots. I actually do not use removebatcheffect for DGE but for the preprocessing the data for WGCNA.

ddsMat<- DESeqDataSetFromMatrix(seMat, sample.table, design=~Batch+Type)
dds<-DESeq(ddsMat)
dds <- dds[ rowSums(counts(dds)) > 1, ]

#DGE
res<-results(dds,alpha=.05, contrast=c("Type", "Sphere", "Tumor"))
res$symbol <- mapIds(org.Hs.eg.db,keys=row.names(res),column="SYMBOL", keytype="ENSEMBL", multiVals="first") resSig <- subset(res, padj < 0.05) #Transformations rld <- rlog(dds, blind= FALSE) rlogMat <- assay(rld) vsd = getVarianceStabilizedData(dds) #removeBatchEffect design=model.matrix(~sample.table$Type)
removedbatch.rld=removeBatchEffect(rlogMat, batch=sample.table$Batch, design= design ) vsd.removed=removeBatchEffect(vsd, batch=sample.table$Batch, design= design )

#PCA
par(mfrow = c(1,2));
cex1 = 1.5;
plotPCA(rld, intgroup=c("Type", "Batch"))
plotPCA(rld.removed, intgroup=c("Type", "Batch"))
dev.off()

So as you suggested I will not be removing any samples for DGE, but what do you suggest for WGCNA? Because in their tutorial  they are removing the sample that is a outlier in Figure1.  https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-01-dataInput.pdf

Your answer actually cleared the clouds.  There is absolutely nothing wrong with using these functions across packages.

0
Entering edit mode
@peter-langfelder-4469
Last seen 7 months ago
United States

Looking at your sample clustering trees, you don't seem to have outliers; rather, you have 2 very distinct groups. You could add indicators for some of the sample traits to the dendrograms as illustrated in the WGCNA tutorials, to see whether the clusters correspond to different treatments or whether the split is due to something else.

This between-cluster difference will drive the main module signal in WGCNA. If the difference reflects treatment, you are golden, you have very strong signal in the data (I personally tend to be cautious when I see such strong signal since it is often too good to be true). If the difference reflects some other factor, you may want to adjust for it as well using one of the batch effect adjustment functions.

Peter

0
Entering edit mode

Thanks a lot for your answer Peter. As you mentioned I have two distinct group but 1 disease patient (Sph4)  is clustering together with control patients at the right- cluster, so I thought it as outlier. But, you suggest me not to remove it, right ?

For my WGCNA analysis, I am using networkType=Signed hybrid, corType=bicor, pearsonFallback = "individual" & deepSplit= 2 .

#DESeq2
ddsMat<- DESeqDataSetFromMatrix(seMat, sample.table, design=~Batch+Type)
dds<-DESeq(ddsMat)

#Omitting low count rows
dds <- dds[ rowSums(counts(dds)) > 1, ]
datExpr0<- assay(dds)

#Checking for good genes
gsg = goodSamplesGenes(datExpr0, verbose = 5);
gsg$allOK #Varicance Stabilization & Batch effect Removal vsd = getVarianceStabilizedData(dds) design=model.matrix(~sample.table$Type)
vsd.removed=removeBatchEffect(vsd, batch=sample.table$Batch, design= design ) colnames(vsd.removed) <- dds$SampleNo
vsd.removed.t <- t(vsd.removed)
datExpr<- vsd.removed.t

#WGCNA Analysis
powers = c(c(1:10), seq(from = 12, to=40, by=2))
sft = pickSoftThreshold(datExpr,corFnc = "bicor", corOptions=list(maxPOutliers=0.1),networkType = "signed hybrid", powerVector = powers, verbose = 5)

softpower = 9
nethybrid.2 = blockwiseModules(datExpr, power = softpower,maxBlockSize = 10000,
TOMType = "signed", minModuleSize = 30, deepSplit =2,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = FALSE,networkType = "signed hybrid",
verbose = 5,corType = "bicor", maxPOutliers = 0.1,
pearsonFallback = "individual")

I am actually facing a problem. Large set of genes are clustering under 1 module .I am observing 19352 genes under "turquoise" module as you can see from the table below :

table(moduleColors.hybrid.2)
moduleColors.hybrid.2
black          blue         brown          cyan     darkgreen
855          1382          1349           360            44
darkred darkturquoise         green   greenyellow          grey
71            39          1089           465         17420
grey60     lightcyan    lightgreen   lightyellow       magenta
178           208           171           122           616
midnightblue          pink        purple           red     royalblue
313           750           580          1073            86
salmon           tan     turquoise        yellow
414           430         19352          1240 

When I set deepSplit into 3 or 4 the number in the module even increases more. To observe how it is correlated with the Sample Traits ,I plot the Module - Trait relationship Heatmap:

As you can see, only the Group variable ( dummy variables  as  1 for control & 2 for disease) is highly relating with the modules. I would like to ask you

1- Can I continue my WGCNA analysis( pathway, network etc) even though large set of genes grouping under 1 module, or is this shows bias, so I shouldn't continue ?

2 - Should I find a way to split the large module ? I actually tried different combinations with parameters but do you have alternative solutions ?

3- If it is ok to continue to my analysis, I would like to focus on the yellow & blue modules  who are significantly highly  positively correlated with the disease group according to the heatmap. Is it a correct approach ?

4 - The red module which has 1073 genes is significantly highly negatively correlated (-0.89,  2e-13). Can I assume the network/ pathway generated from this module is associated with control group not with the disease group ?

I really appreciate if you give me some insight,

Gokce

2
Entering edit mode

Several points.

I have two distinct group but 1 disease patient (Sph4)  is clustering together with control patients at the right- cluster, so I thought it as outlier. But, you suggest me not to remove it, right ?

Unless you can show that this sample is mislabeled or the the patient was really a control, do not remove it or you will be introducing bias.

Large set of genes are clustering under 1 module .I am observing 19352 genes under "turquoise" module as you can see from the table below :

As I said above, you have a very strong driver of expression. This affects most of your genes which therefore end up in a few very large modules. I cannot read your plots (the font size is too small), so I'm not sure what the driver is, although you seem to imply that it is disease status. If differences between patients and controls are not interesting (which I find unlikely), adjust for status and re-run WGCNA again. See WGCNA FAQ, point 5.

If you are looking for differences between disease and control, you seem to have thousands (perhaps tens of thousands) of genes that are strongly associated with group and therefore, in your data, are strongly correlated.

One thing you could try is re-running the batch correction without status as a covariate. Adding it as a covariate sometimes introduces spurious associations. If I understood your situation correctly, the batch and status are perfectly orthogonal, so it should not be a problem in your case, but you could try it anyway.

1- Can I continue my WGCNA analysis( pathway, network etc) even though large set of genes grouping under 1 module, or is this shows bias, so I shouldn't continue ?

The answer is, do you believe that the disease status has a such a huge effect on gene expression? This is not a WGCNA question - you have an extremely strong split between the control and disease clusters, and you must have tens of thousands of DE genes with FDR<0.05. Is biologically it believable? If not, there's something wrong with your data; if yes, WGCNA simply reflects the data.

2 - Should I find a way to split the large module ? I actually tried different combinations with parameters but do you have alternative solutions ?

There's very little you can do to split the large modules in your data, but you could try using other information or data to split the modules up. One possibility is to find a relatively similar data set (same tissue, similar conditions) and run a consensus WGCNA.

3- If it is ok to continue to my analysis, I would like to focus on the yellow & blue modules  who are significantly highly  positively correlated with the disease group according to the heatmap. Is it a correct approach ?

If you believe the data (see above), then yes, it is logical to focus on the most strongly associated modules.

4 - The red module which has 1073 genes is significantly highly negatively correlated (-0.89,  2e-13). Can I assume the network/ pathway generated from this module is associated with control group not with the disease group ?

I would say the red module genes are overexpressed in controls compared to disease, but we probably mean the same thing.

Peter

0
Entering edit mode

Many many thanks Peter, your answers relieved me a lot. I will also run without setting the batch as covariate.

One last question : How can I understand which run gives me the most accurate results ? For example, when I increase the maxBlocksize parameter from 10.000 to 20.000, naturally  all the module sizes and their corrrlation with sample traits are changing. Is there anyway to figure out which parameter settings are best for my data ? Also do you have any  suggestion  how I can plot these multi-runs with different parameters in a graph ?

Best Regards,

Gokce

0
Entering edit mode

I recommend as few blocks as possible - the blockwise aproach is an approximate method to handle a set of genes too large to be handled normally. In other words, if your computer can handle all genes in one block, run the analysis in one block. There's some discussion in WGCNA Tutorial I, section 2c, https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html.

In general, different settings will produce different modules but overall they should be fairly similar. I don't even know how to define "best", so I cannot answer the question as to which modules are best. I don't have a way to produce a good graphical representation of multiple module identifications either.

0
Entering edit mode

I really appreciate your help. Thanks a lot!

Best Regards,

Gokce