Hi all,
I am running WGCNA on a set of 129 samples from three timepoints with three diseases and various physiological traits. Unfortunately, due to project logistics, they were not able to be sequenced at the same time, and looking at the module-trait relationship, it looks like there is quite a strong batch effect resulting from various aspects of sequencing.
Embarrassing facts: Timepoints 1 (16 samples) and 3 (29 samples) were sequenced in the same run, and Timepoint 2 (84 samples) was mostly sequenced in a separate run, with several replicates in a third run. (Originally, the study was planned only for Timepoint 2, but some samples were contaminated or otherwise aligned poorly, so we sequenced additional replicates - and then the PI wanted to extend the study based on a subset of interesting traits from Timepoint 2).
I have single-end 3' reads and used a pseudoaligner (Kallisto, in this particular case, although I get similar results with Salmon, and even featureCounts), which required fragment length and standard devation, so I threw it into the traits matrix along with run batch and RIN and found a strong confounding effect from a combination of the three.
I issued the following command, resulting in 40 modules:
consTime = blockwiseConsensusModules(multiExprTime, power = 6,
networkType = "signed",
TOMType = "signed", minModuleSize = 75, deepSplit=1,
mergeCutHeight = 0.2,
maxPOutliers = 0.05,
corType="bicor",
numericLabels = TRUE,
maxBlockSize = 25000, #about 24500 genes in this set
pamRespectsDendro = FALSE,
saveTOMs = TRUE,
saveTOMFileBase = "consT",
saveIndividualTOMs=TRUE,
robustY=FALSE,
verbose = 5)
Fragment length is confounded with run (run 2/timepoints 1 and 3 had shorter fragment lengths overall), and that's confounded with time. The RIN distribution is similar across samples. However, in the consensus module-trait relationship map, there is still an effect from fragment size and RIN, though the strength of the correlation and the p-values differ between timepoints (ie, a module seemingly confounded in one timepoint has a low correlation/high p-value in another).
Fortunately, this isn't exclusively a genomics study and we have physiological data from the population of samples from which this subset was taken, and based on this, some modules that are confounded and also associated with traits seem physiologically relevant. I am not going to get around, nor would I want to get around, reporting the confounding effect, but what are the best practices for reporting putative "real" modules amid the confounding effect? I intend to release all data and code in the supplement, but I want to be forthright in the narrative as well wherever I present the data before publication.
Thank you for your time.
Hi Dr. Langfelder,
Thank you for your response!
Initially, I had split up the analysis by genotype, where I noticed that timepoints 1 and 3 had unrealistically similar eigenegene correlations, like this. I estimate that ~70% of the modules had such an association.
Here are my eigengene correlation heatmaps as they currently stand. I had to truncate them quite a bit; I hope they are still legible. Since not every condition is replicated in every batch (specifically, mutant 1 and disease 1 are both restricted to timepoint 2), do you recommend reassociating the module eigengenes with a trait matrix that leaves out the non-replicated conditions, even if network topology was calculated including mutant 1 and disease 1? Or do you recommend I recalculate and meta-analyze?
Regarding your last note, per the table and graph, a scale-free approximation (~0.8) was reached at power 6 - is the general recommendation still to use a higher power?
Thank you so much for your input and suggestions.