WGCNA - confounding effect based on technical variables (some continuous)
cats_dogs

Hi all,

I am running WGCNA on a set of 129 samples from three timepoints, covering three diseases and various physiological traits. Unfortunately, due to project logistics, the samples could not all be sequenced at the same time, and judging from the module-trait relationships, there is quite a strong batch effect arising from several aspects of the sequencing.

Embarrassing facts: Timepoints 1 (16 samples) and 3 (29 samples) were sequenced in the same run, and Timepoint 2 (84 samples) was mostly sequenced in a separate run, with several replicates in a third run. (Originally, the study was planned only for Timepoint 2, but some samples were contaminated or otherwise aligned poorly, so we sequenced additional replicates - and then the PI wanted to extend the study based on a subset of interesting traits from Timepoint 2).

I have single-end 3' reads and used a pseudoaligner (Kallisto in this particular case, although I get similar results with Salmon, and even with featureCounts), which for single-end data requires an estimated fragment length and standard deviation. I threw fragment length into the traits matrix along with run batch and RIN and found a strong confounding effect from a combination of the three.

I issued the following command, resulting in 40 modules:

consTime = blockwiseConsensusModules(multiExprTime,
                                     power = 6,
                                     networkType = "signed",
                                     TOMType = "signed",
                                     minModuleSize = 75,
                                     deepSplit = 1,
                                     mergeCutHeight = 0.2,
                                     maxPOutliers = 0.05,
                                     corType = "bicor",
                                     numericLabels = TRUE,
                                     maxBlockSize = 25000, # about 24,500 genes in this set
                                     pamRespectsDendro = FALSE,
                                     saveTOMs = TRUE,
                                     saveTOMFileBase = "consT",
                                     saveIndividualTOMs = TRUE,
                                     robustY = FALSE,
                                     verbose = 5)

Fragment length is confounded with run (run 2, containing timepoints 1 and 3, had shorter fragment lengths overall), and run in turn is confounded with timepoint. The RIN distribution is similar across samples. However, in the consensus module-trait relationship map there is still an effect from fragment length and RIN, though the strength of the correlation and the p-values differ between timepoints (i.e., a module seemingly confounded in one timepoint has a low correlation and high p-value in another).
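
For concreteness, the per-timepoint check behind that map looks roughly like the sketch below; techTraits is a hypothetical list with one numeric covariate frame per timepoint (run coded 0/1, fragment length, RIN), row-matched to the samples in multiExprTime:

# Illustrative only: techTraits is a placeholder, not part of the pipeline above.
library(WGCNA)
for (set in seq_along(multiExprTime)) {
  MEs <- consTime$multiMEs[[set]]$data
  techCor <- bicor(MEs, techTraits[[set]], robustY = FALSE, maxPOutliers = 0.05)
  techP <- corPvalueStudent(techCor, nSamples = nrow(MEs))
  # show modules whose eigengenes track any technical covariate in this set
  print(techCor[apply(techP, 1, min) < 0.05, , drop = FALSE])
}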

Fortunately, this isn't exclusively a genomics study: we have physiological data from the population from which this subset of samples was taken, and based on those data, some modules that are confounded yet also associated with traits seem physiologically relevant. I cannot, and would not want to, avoid reporting the confounding effect, but what are the best practices for reporting putative "real" modules amid it? I intend to release all data and code in the supplement, but I also want to be forthright in the narrative wherever I present the data before publication.

Thank you for your time.

@peter-langfelder-4469

It's a bit difficult to give meaningful answers without seeing the data and the specific questions, but here are a few options.

If the modules are expected to be present (i.e., the constituent genes are expected to be correlated) within each run, you can do a consensus WGCNA between the two batches to define modules that are, by definition, not confounded by batch effects. (Edit: this seems to be what you have done, more or less.) You could relate the eigengenes to traits within each batch and combine the associations using meta-analysis. You could then calculate the eigengenes of the found modules in the combined data and see whether any of the relevant modules show a plausible time-point progression. Also, if you're after the interaction of time point with other traits, that analysis would not necessarily be affected by the batch effect.
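
As an illustration only (multiTraits is a placeholder for per-set numeric trait frames row-matched to multiExprTime), the per-set associations could be combined with a weighted Stouffer meta-analysis of the Fisher-transformed correlations:

# Sketch: per-set eigengene-trait correlations combined via Stouffer's
# method (Fisher z, weights sqrt(n)); not the only reasonable scheme.
combineZ <- function(corList, nList) {
  z <- mapply(function(r, n) atanh(r) * sqrt(n - 3), corList, nList,
              SIMPLIFY = FALSE)               # per-set standard-normal Z
  w <- sqrt(unlist(nList))
  metaZ <- Reduce(`+`, Map(`*`, z, as.list(w))) / sqrt(sum(w^2))
  list(Z = metaZ, p = 2 * pnorm(-abs(metaZ)))
}
corList <- lapply(seq_along(multiExprTime), function(set)
  bicor(consTime$multiMEs[[set]]$data, multiTraits[[set]],
        robustY = FALSE, maxPOutliers = 0.05))
nList <- lapply(multiExprTime, function(s) nrow(s$data))
meta <- combineZ(corList, nList)

The p-values in meta$p would of course still need multiple-testing correction across modules and traits; WGCNA's metaAnalysis function offers more refined weighting schemes.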

If there are enough replicates from time point 2 sequenced in the second run (the one with timepoints 1 and 3), you could try to use them to adjust for the batch effect. "Enough" means that each relevant condition has at least a few samples in both runs. You could use the adjusted data in the above analysis, i.e., the modules could still be defined using a consensus analysis, but the eigengenes would be calculated from the adjusted, combined data.
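
Here is a minimal sketch of that idea, under strong assumptions: matched pairs of the same time point 2 samples were sequenced in both runs, rows are ordered so that row i of repRunA and repRunB is the same biological sample, and expression is on the log scale. All object names (datExprAll, sampleInfo, the run labels) are placeholders:

# Estimate a per-gene additive run offset from the replicate pairs,
# then remove it from every sample in the affected run.
repRunA <- datExprAll[sampleInfo$run == "A" & sampleInfo$replicated, ]
repRunB <- datExprAll[sampleInfo$run == "B" & sampleInfo$replicated, ]
offset <- colMeans(repRunB - repRunA)      # per-gene shift of run B vs run A
datExprAdj <- datExprAll
datExprAdj[sampleInfo$run == "B", ] <-
  sweep(datExprAdj[sampleInfo$run == "B", ], 2, offset)
# eigengenes of the consensus modules, recalculated on adjusted combined data
MEsAdj <- moduleEigengenes(datExprAdj, colors = consTime$colors)$eigengenes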

One comment about the code above: for network type "signed", the soft-thresholding power should be roughly doubled compared to unsigned or signed hybrid networks; the power of 6 you used is usually too low for a "signed" network.
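
For example, you could quickly re-check the fit with pickSoftThreshold under networkType = "signed" (shown here for one set; a consensus analysis would repeat this per set):

library(WGCNA)
# signed networks typically need powers around 12 rather than 6
sft <- pickSoftThreshold(multiExprTime[[1]]$data,
                         powerVector = c(1:10, seq(12, 20, 2)),
                         networkType = "signed",
                         corFnc = bicor,
                         corOptions = list(maxPOutliers = 0.05),
                         verbose = 2)
sft$fitIndices  # look for the lowest power with R^2 above ~0.8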


Hi Dr. Langfelder,

Thank you for your response!

Initially, I had split up the analysis by genotype, and noticed that timepoints 1 and 3 had unrealistically similar eigengene correlations, as in the figure below. I estimate that ~70% of the modules showed such an association.

[Figure: timepoints 1 and 3 show similar eigengene correlations]

Here are my eigengene correlation heatmaps as they currently stand. I had to truncate them quite a bit; I hope they are still legible. Since not every condition is replicated in every batch (specifically, mutant 1 and disease 1 are both restricted to timepoint 2), do you recommend reassociating the module eigengenes with a trait matrix that leaves out the non-replicated conditions, even if network topology was calculated including mutant 1 and disease 1? Or do you recommend I recalculate and meta-analyze?

[Figure: module eigengene expression, by consensus and by timepoint]
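
Concretely, what I have in mind for the first option is something like the following; traits2 and the condition column names are hypothetical:

# Keep the consensus modules and eigengenes as defined, but drop the
# binarized indicator columns for the non-replicated conditions from the
# trait matrix before reassociating.
keepCols <- setdiff(colnames(traits2), c("mutant1", "disease1"))
MEs2 <- consTime$multiMEs[[2]]$data        # timepoint 2 eigengenes
corSub <- bicor(MEs2, traits2[, keepCols], robustY = FALSE, maxPOutliers = 0.05)
pSub <- corPvalueStudent(corSub, nSamples = nrow(MEs2))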

Regarding your last note: per the table and graph below, a scale-free topology fit (R^2 ≈ 0.8) was reached at power 6. Is the general recommendation still to use a higher power?

Thank you so much for your input and suggestions.

[Figure: scale-free topology fit graph]
