We are using WGCNA to analyze gene expression data. We have 40 paired tumor-normal samples, and there are batch effects. We first used ComBat to remove the batch effects from the log(RPKM+1)-scaled expression data, and then used log(T/N) as the input to WGCNA. Is this input appropriate, or should we use the original data without batch correction, or only the tumor expression?
For some of our traits, only the values 0 and 1 are provided. Can such binary traits be used in the "Module-trait relationships" analysis?
In addition, we found that the number of genes varies widely between modules, as follows:
I suggest removing batch effects and then using all 40 samples as input to WGCNA. Then you can look for modules that are over/under-expressed in tumor samples vs. normal, or associated with other variables you have.
You could use the T/N or just T expression indices if you have interesting sample traits that are defined only in the tumor samples; otherwise I don't see a point in leaving the normal samples out or doing T/N.
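If you do go the T/N route, here is a minimal sketch of how the paired log ratios could be formed. It assumes the values are already on the log(RPKM+1) scale, so what you actually get is log((T+1)/(N+1)), which is simply the difference of the logged values. The function name `paired_log_ratios`, the gene names and the toy numbers are made up for illustration, and the batch-correction step is skipped here.

```python
import math


def paired_log_ratios(log_tumor, log_normal):
    """Return per-gene log ratios for matched tumor/normal pairs.

    Both arguments map gene -> list of log(RPKM + 1) values, one value per
    patient, with patients in the same order in both inputs. Because the
    inputs are already logged, log((T + 1) / (N + 1)) is a plain difference.
    """
    ratios = {}
    for gene, tumor_values in log_tumor.items():
        normal_values = log_normal[gene]
        if len(tumor_values) != len(normal_values):
            raise ValueError(f"unpaired samples for gene {gene}")
        ratios[gene] = [t - n for t, n in zip(tumor_values, normal_values)]
    return ratios


# Toy RPKM values for two hypothetical genes in two patients.
rpkm_tumor = {"geneA": [7.0, 3.0], "geneB": [0.0, 1.0]}
rpkm_normal = {"geneA": [3.0, 3.0], "geneB": [1.0, 0.0]}

# Log-transform (base 2 is an arbitrary choice for this sketch).
log_tumor = {g: [math.log2(v + 1) for v in vals] for g, vals in rpkm_tumor.items()}
log_normal = {g: [math.log2(v + 1) for v in vals] for g, vals in rpkm_normal.items()}

ratios = paired_log_ratios(log_tumor, log_normal)
print(ratios)
```

Note that with this input every column is a patient rather than a sample, so any trait you correlate with the modules has to be defined per patient.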
You can use a binary trait just as you would a continuous one: simply correlate it with the module eigengenes. You could also use a Student t-test to measure the association of binary traits with eigengenes.
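To illustrate why the two approaches agree, here is a small sketch in plain Python (not WGCNA code). The eigengene values and the 0/1 trait are invented, and `pearson` and `student_t` are helper names made up for this example. For a 0/1 trait, the Pearson correlation r and the equal-variance Student t statistic are linked by t = r * sqrt(n - 2) / sqrt(1 - r^2), so both give the same significance.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def student_t(group1, group0):
    """Two-sample Student t statistic with pooled (equal) variance."""
    n1, n0 = len(group1), len(group0)
    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    ss1 = sum((v - mean1) ** 2 for v in group1)
    ss0 = sum((v - mean0) ** 2 for v in group0)
    pooled_var = (ss1 + ss0) / (n1 + n0 - 2)
    return (mean1 - mean0) / math.sqrt(pooled_var * (1 / n1 + 1 / n0))


# Invented module eigengene values and a binary trait for 8 samples.
eigengene = [0.42, 0.31, 0.25, 0.10, -0.05, -0.20, -0.33, -0.50]
trait = [1, 1, 1, 0, 1, 0, 0, 0]

# Route 1: correlate the 0/1 trait with the eigengene, convert r to t.
r = pearson(eigengene, trait)
n = len(trait)
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Route 2: t-test of eigengene values between the two trait groups.
t_direct = student_t(
    [e for e, b in zip(eigengene, trait) if b == 1],
    [e for e, b in zip(eigengene, trait) if b == 0],
)

print(r, t_from_r, t_direct)
```

The two t values match up to floating-point error. This holds for the pooled-variance t-test only; a Welch (unequal-variance) test would generally give a somewhat different statistic.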
And yes, numbers of genes in modules do tend to vary widely, from a few thousand down to the minimum specified module size, which is usually 20 or 30.
Yes, the more samples the better. I should have said, though, that the most appropriate way of running WGCNA depends on what sample information you have and what questions you want to answer.