WGCNA - confounding effect based on technical variables (some continuous)
cats_dogs

Hi all,

I am running WGCNA on a set of 129 samples from three timepoints, covering three diseases and various physiological traits. Unfortunately, due to project logistics, the samples could not all be sequenced at the same time, and judging from the module-trait relationships, there is quite a strong batch effect arising from several aspects of the sequencing.

Embarrassing facts: Timepoints 1 (16 samples) and 3 (29 samples) were sequenced in the same run, and Timepoint 2 (84 samples) was mostly sequenced in a separate run, with several replicates in a third run. (Originally, the study was planned only for Timepoint 2, but some samples were contaminated or otherwise aligned poorly, so we sequenced additional replicates - and then the PI wanted to extend the study based on a subset of interesting traits from Timepoint 2).

I have single-end 3' reads and used a pseudoaligner (Kallisto in this particular case, although I get similar results with Salmon, and even with featureCounts), which for single-end data requires an estimated fragment length and standard deviation. I threw fragment length into the traits matrix along with run batch and RIN and found a strong confounding effect from a combination of the three.

I issued the following command, resulting in 40 modules:

consTime = blockwiseConsensusModules(multiExprTime,
                                     power = 6,
                                     networkType = "signed",
                                     TOMType = "signed",
                                     minModuleSize = 75,
                                     deepSplit = 1,
                                     mergeCutHeight = 0.2,
                                     maxPOutliers = 0.05,
                                     corType = "bicor",
                                     numericLabels = TRUE,
                                     maxBlockSize = 25000, # about 24,500 genes in this set
                                     pamRespectsDendro = FALSE,
                                     saveTOMs = TRUE,
                                     saveTOMFileBase = "consT",
                                     saveIndividualTOMs = TRUE,
                                     robustY = FALSE,
                                     verbose = 5)

Fragment length is confounded with run (run 2, containing timepoints 1 and 3, had shorter fragment lengths overall), and run in turn is confounded with timepoint. The RIN distribution is similar across samples. However, in the consensus module-trait relationship map there is still an effect from fragment length and RIN, though the strength of the correlation and the p-values differ between timepoints (i.e., a module seemingly confounded in one timepoint has a low correlation and high p-value in another).
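
For concreteness, the per-timepoint check behind that map looks roughly like the sketch below; techTraits is a hypothetical list with one numeric covariate frame per timepoint (run coded 0/1, fragment length, RIN), row-matched to the samples in multiExprTime:

# Illustrative only: techTraits is a placeholder, not part of the pipeline above.
library(WGCNA)
for (set in seq_along(multiExprTime)) {
  MEs <- consTime$multiMEs[[set]]$data
  techCor <- bicor(MEs, techTraits[[set]], robustY = FALSE, maxPOutliers = 0.05)
  techP <- corPvalueStudent(techCor, nSamples = nrow(MEs))
  # show modules whose eigengenes track any technical covariate in this set
  print(techCor[apply(techP, 1, min) < 0.05, , drop = FALSE])
}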

Fortunately, this isn't exclusively a genomics study: we have physiological data from the population from which this subset of samples was taken, and based on those data, some modules that are confounded yet also associated with traits seem physiologically relevant. I cannot, and would not want to, avoid reporting the confounding effect, but what are the best practices for reporting putative "real" modules amid it? I intend to release all data and code in the supplement, but I also want to be forthright in the narrative wherever I present the data before publication.

Thank you for your time.

@peter-langfelder-4469

It's a bit difficult to give meaningful answers without seeing the data and the specific questions, but here are a few options.

If the modules are expected to be present (i.e., the constituent genes are expected to be correlated) within each run, you can do a consensus WGCNA between the two batches to define modules that are, by definition, not confounded by batch effects. (Edit: this seems to be what you have done, more or less.) You could relate the eigengenes to traits within each batch and combine the associations using meta-analysis. You could then calculate the eigengenes of the found modules in the combined data and see whether any of the relevant modules show a plausible time-point progression. Also, if you're after the interaction of time point with other traits, that analysis would not necessarily be affected by the batch effect.
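
As an illustration only (multiTraits is a placeholder for per-set numeric trait frames row-matched to multiExprTime), the per-set associations could be combined with a weighted Stouffer meta-analysis of the Fisher-transformed correlations:

# Sketch: per-set eigengene-trait correlations combined via Stouffer's
# method (Fisher z, weights sqrt(n)); not the only reasonable scheme.
combineZ <- function(corList, nList) {
  z <- mapply(function(r, n) atanh(r) * sqrt(n - 3), corList, nList,
              SIMPLIFY = FALSE)               # per-set standard-normal Z
  w <- sqrt(unlist(nList))
  metaZ <- Reduce(`+`, Map(`*`, z, as.list(w))) / sqrt(sum(w^2))
  list(Z = metaZ, p = 2 * pnorm(-abs(metaZ)))
}
corList <- lapply(seq_along(multiExprTime), function(set)
  bicor(consTime$multiMEs[[set]]$data, multiTraits[[set]],
        robustY = FALSE, maxPOutliers = 0.05))
nList <- lapply(multiExprTime, function(s) nrow(s$data))
meta <- combineZ(corList, nList)

The p-values in meta$p would of course still need multiple-testing correction across modules and traits; WGCNA's metaAnalysis function offers more refined weighting schemes.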

If there are enough replicates from time point 2 sequenced in the second run (the one with timepoints 1 and 3), you could try to use them to adjust for the batch effect. "Enough" means that each relevant condition has at least a few samples in both runs. You could use the adjusted data in the above analysis, i.e., the modules could still be defined using a consensus analysis, but the eigengenes would be calculated from the adjusted, combined data.
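
Here is a minimal sketch of that idea, under strong assumptions: matched pairs of the same time point 2 samples were sequenced in both runs, rows are ordered so that row i of repRunA and repRunB is the same biological sample, and expression is on the log scale. All object names (datExprAll, sampleInfo, the run labels) are placeholders:

# Estimate a per-gene additive run offset from the replicate pairs,
# then remove it from every sample in the affected run.
repRunA <- datExprAll[sampleInfo$run == "A" & sampleInfo$replicated, ]
repRunB <- datExprAll[sampleInfo$run == "B" & sampleInfo$replicated, ]
offset <- colMeans(repRunB - repRunA)      # per-gene shift of run B vs run A
datExprAdj <- datExprAll
datExprAdj[sampleInfo$run == "B", ] <-
  sweep(datExprAdj[sampleInfo$run == "B", ], 2, offset)
# eigengenes of the consensus modules, recalculated on adjusted combined data
MEsAdj <- moduleEigengenes(datExprAdj, colors = consTime$colors)$eigengenes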

One comment about the code above: for network type "signed", the soft-thresholding power should be roughly doubled compared to unsigned or signed hybrid networks; the power of 6 you used is usually too low for a "signed" network.
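
For example, you could quickly re-check the fit with pickSoftThreshold under networkType = "signed" (shown here for one set; a consensus analysis would repeat this per set):

library(WGCNA)
# signed networks typically need powers around 12 rather than 6
sft <- pickSoftThreshold(multiExprTime[[1]]$data,
                         powerVector = c(1:10, seq(12, 20, 2)),
                         networkType = "signed",
                         corFnc = bicor,
                         corOptions = list(maxPOutliers = 0.05),
                         verbose = 2)
sft$fitIndices  # look for the lowest power with R^2 above ~0.8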


Hi Dr. Langfelder,

Thank you for your response!

Initially, I had split up the analysis by genotype, and noticed that timepoints 1 and 3 had unrealistically similar eigengene correlations, as in the figure below. I estimate that ~70% of the modules showed such an association.

[Figure: timepoints 1 and 3 show similar eigengene correlations]

Here are my eigengene correlation heatmaps as they currently stand. I had to truncate them quite a bit; I hope they are still legible. Since not every condition is replicated in every batch (specifically, mutant 1 and disease 1 are both restricted to timepoint 2), do you recommend reassociating the module eigengenes with a trait matrix that leaves out the non-replicated conditions, even if network topology was calculated including mutant 1 and disease 1? Or do you recommend I recalculate and meta-analyze?

[Figure: module eigengene expression, by consensus and by timepoint]
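
Concretely, what I have in mind for the first option is something like the following; traits2 and the condition column names are hypothetical:

# Keep the consensus modules and eigengenes as defined, but drop the
# binarized indicator columns for the non-replicated conditions from the
# trait matrix before reassociating.
keepCols <- setdiff(colnames(traits2), c("mutant1", "disease1"))
MEs2 <- consTime$multiMEs[[2]]$data        # timepoint 2 eigengenes
corSub <- bicor(MEs2, traits2[, keepCols], robustY = FALSE, maxPOutliers = 0.05)
pSub <- corPvalueStudent(corSub, nSamples = nrow(MEs2))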

Regarding your last note: per the table and graph below, a scale-free topology fit (R^2 ≈ 0.8) was reached at power 6. Is the general recommendation still to use a higher power?

Thank you so much for your input and suggestions.

[Figure: scale-free topology fit graph]
