Question

WGCNA for RNA-Seq Sample Set, Outliers, and Spearman correlation

5

Entering edit mode

brohawndg ▴ 150

@brohawndg-7386

Last seen 8.7 years ago

United States

Hello,

My name is David Brohawn and I am a 4th year graduate student at Virginia Commonwealth University.

My advisor and I research ALS (Lou Gherigs disease). I have recently generated high density (50 Million 2X150 read pairs) RNA-Sequencing data for 15 human postmortem spinal cord tissue homogenates (7 ALS samples and 8 matched healthy controls). I have processed it with the Tuxedo Suite pipeline and have FPKM measurements for all known annotated genes in the hg19 build for all samples.

I find WGCNA highly intriguing, and would love to perform a consensus network analysis with the data we have prior to assessing differences between the two groups. The R code appears very manageable and does not seem unwieldy.

However, I am unclear if we have enough samples to run this analysis.

I see the first question on the WGCNA FAQ page says:

We do not recommend attempting WGCNA on a data set consisting of fewer than 15 samples. In a typical high-throughput setting, correlations on fewer than 15 samples will simply be too noisy for the network to be biologically meaningful.

Does this mean 15 samples in each group (cases and controls, so 30 total) or 15 total?

Sifting through many publications and online forums (Seqanswers, Biostars) has not shed much light on this numbers question, and I find very few published reports using RNA-Sequencing data and WGCNA with smaller N's.

2) Presuming 15 total samples is the minimum recommended and we can run a consensus analysis, do you recommend use of the spearman correlation or the biweight midcorrelation as we are going to be susceptible to outliers? My understanding is Spearman is more robust to outliers, though it may wipe out true signals.

Please advise, and thank you,

David Brohawn

WGCNA Sample Size Outliers • 14k views

ADD COMMENT • link updated 8.7 years ago by mh.manoj • 0 • written 9.2 years ago by brohawndg ▴ 150

score 2 · Answer 1 · 2015-02-20

Hi David,

I'm the WGCNA author and maintainer. 15 samples (cases plus controls combined) is indeed the minimum.

Unless you have an independent data set, I would run a simple WGCNA on this data and see what comes out. You may very well find cell type-related modules which may or may not be interesting. Running a consensus WGCNA on the ALS and control samples viewed as two separate sets will not make much sense since you only have 7 and 8 samples, respectively.

As for Spearman vs. bicor, I suggest bicor with additional argument maxPOutliers = 0.1 or so. I personally don't have much experience with Spearman correlation but early attempts in our group to use it were not successful.

In terms of pre-processing, please read through the WGCNA FAQ at http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/faq.html, particularly item 3.

Hope this helps.

Peter

score 1 · Answer 2 · 2015-02-20

1

Entering edit mode

Peter Langfelder ★ 3.0k

@peter-langfelder-4469

Last seen 6 weeks ago

United States

Yes, use a binary variable coded 0 for controls and 1 for ALS as the trait. One major change against the tutorials that we recommend is to use argument "type" and/or "networkType" with value "signed hybrid". This applies to functions like pickSoftThreshold, adjacency and blockwiseModules. We have found that signed networks are usually cleaner and more easy to interpret than the default unsigned networks.

HTH,

Peter

ADD COMMENT • link 9.2 years ago Peter Langfelder ★ 3.0k

0

Entering edit mode

Hey Peter,

Sorry to revive an old thread, but i am curious as to the reasoning you recommend using networkType = "signed hybrid" over "signed" ?

Intuitively, it seems like a "hybrid signed" network would contain less information than a signed network because all correlations below 0 are set to 0. Whereas in the signed network, negative correlations still contain some information because they are represented on a continuous scale.

Thanks for your help,

ADD REPLY • link 8.0 years ago bgudena • 0

score 0 · Answer 3 · 2015-02-20

Hello Peter,

Thanks for your quick reply. I already filtered and log transformed my FPKM values, so they are ready to go.

If we run a standard WGCNA analysis with these 15 samples' pre-processed data using the bicor and maxPOutliers flags you recommended:

Could I then set disease/control status as a binary external trait variable and relate identified modules for all 15 samples to that status (in a manner similar to what they did with weight in Tutorial 1 -http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-03-relateModsToExt.pdf)

I understand weight is a continuous variable whereas disease status is not. However, I remain curious if our study parameters are amenable to using your software to look for disease-specific changes.

Thanks so much for your consideration,

David Brohawn

score 0 · Answer 4 · 2015-02-20

0

Entering edit mode

brohawndg ▴ 150

@brohawndg-7386

Last seen 8.7 years ago

United States

Many thanks sir! Very excited to see what comes out.

David Brohawn

ADD COMMENT • link 9.2 years ago brohawndg ▴ 150

score 0 · Answer 5 · 2015-02-27

Hello Peter,

I have been practicing with the online WGCNA tutorials in preparation of running our analysis, and have a followup question regarding proper usage of bicor and maxPOutliers:

I successfully created the datExpr and datTraits variables as instructed in the data input and cleaning section (Tutorial part 1)

I opted for the step-by-step network construction choice (Tutorial part 2b)(http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-02-networkConstr-man.pdf)

I modified the picksoftThreshold and adjacency commands (as well as the TOMsimilarity command) from what is listed in the tutorial to incorporate the bicor and signed hybrid network flags per your suggestion:

sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5, networkType ="signed hybrid", corFnc= "bicor")

adjacencymatrix = adjacency(datExpr, power=softpower, corFnc="bicor", type="signed hybrid")

TOM = TOMsimilarity(adjacency, TOMType="signed")

However, these changes did not incorporate the maxPOutliers = 0.1 flag you recommended. I see in the WGCNA manual it is not an option for the picksoftThreshold, adjacency, or the TOM commands.

I was able to carry out the rest of Tutorial part 2b and 3 without any issues. I see the maxPOutliers flag does appear in the blockwiseModules command for the automated network construction directions in Tutorial part 2c.

Is it appropriate to leave out the bicor command with maxPOutliers=0.1, or is there a way/need to incorporate it if I use the directives in Tutorial part2b (step-by-step network construction)?

Thank you for your help in advance!

David Brohawn

score 0 · Answer 6 · 2015-02-27

0

Entering edit mode

brohawndg ▴ 150

@brohawndg-7386

Last seen 8.7 years ago

United States

Hello,

I should add that I was able to apply the bicor function to my datExpr variable as below:

Test <- bicor(datExpr, use="pairwise.complete.obs", maxPOutliers=0.1)

prior to using the pickSoftThreshold command, though I am not sure if that was appropriate.

Thanks again,

David Brohawn

ADD COMMENT • link 9.2 years ago brohawndg ▴ 150

score 0 · Answer 7 · 2015-02-27

0

Entering edit mode

Peter Langfelder ★ 3.0k

@peter-langfelder-4469

Last seen 6 weeks ago

United States

Use the argument corOptions = list(maxPOutliers =0.1) to pickSoftThreshold() and to adjacency().

Peter

ADD COMMENT • link 9.2 years ago Peter Langfelder ★ 3.0k

score 0 · Answer 8 · 2015-03-03

Hi Peter,

Thank you again for all of your help.

I was able to add corOptions=list(maxPOutliers=0.1) to the pickSoftThreshold command:

sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5, networkType ="signed hybrid", corFnc= "bicor", corOptions=list(maxPOutliers=0.1))

and retrieved a table of soft threshold powers with their corresponding R.sq values.

Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
1 1 0.1500 1.260 0.947 2010.00 1980.00 3220.0
2 2 0.0137 -0.235 0.943 850.00 821.00 1730.0
3 3 0.2760 -1.010 0.946 431.00 404.00 1090.0
4 4 0.4790 -1.370 0.960 244.00 221.00 740.0
5 5 0.6090 -1.650 0.969 149.00 129.00 533.0
6 6 0.6820 -1.850 0.977 96.80 80.20 403.0
7 7 0.7260 -1.990 0.981 65.50 51.70 313.0
8 8 0.7490 -2.120 0.981 45.90 34.50 247.0
9 9 0.7760 -2.160 0.987 33.10 23.70 199.0
10 10 0.7940 -2.190 0.989 24.50 16.70 162.0
11 12 0.8180 -2.220 0.991 14.20 8.82 111.0
12 14 0.8300 -2.210 0.991 8.76 4.97 78.9
13 16 0.8430 -2.170 0.992 5.68 2.94 57.5
14 18 0.8550 -2.130 0.990 3.83 1.82 42.9
15 20 0.8650 -2.060 0.984 2.67 1.16 32.6

However, when I tried to add corOptions=list(maxPOutliers=0.1) to my adjacency command:

adjacency = adjacency(datExpr, power=softpower, corFnc="bicor", type="signed hybrid", corOptions=list(maxPOutliers=0.1))

I recieved the following error message: Error in bicor(datExpr, 0.1) :
'x' and 'y' have incompatible dimensions (unequal numbers of rows).

If I remove the corOptions flag, adjacency will run, albeit with the following warning:

Warning message:
In bicor(datExpr, use = "p") :
bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with MAD=NA.

Can you advise?

Thanks!

David Brohawn

score 0 · Answer 9 · 2015-03-09

0

Entering edit mode

brohawndg ▴ 150

@brohawndg-7386

Last seen 8.7 years ago

United States

Hi Dr. Langfelder,

Have you had a chance to look over the error message I received when trying to use the maxPOutliers flag in the adjacency command?

I'm guessing my previous question got buried in the shuffle.

Thanks!

Dave Brohawn

ADD COMMENT • link 9.2 years ago brohawndg ▴ 150

score 0 · Answer 10 · 2015-04-13

0

Entering edit mode

zaynabmousavian • 0

@zaynabmousavian-7581

Last seen 9.1 years ago

Dear David,

I have the same problem like you, could you pass the error?

Regards

ADD COMMENT • link 9.1 years ago zaynabmousavian • 0

score 0 · Answer 11 · 2015-05-28

0

Entering edit mode

Peter Langfelder ★ 3.0k

@peter-langfelder-4469

Last seen 6 weeks ago

United States

Sorry to revive a long-dead thread, I missed this message. The function adjacency takes the corOptions argument in a different format, you need to specify corOptions = "use = 'p', maxPOutliers = 0.1" (i.e. a literal string, not a list).

ADD COMMENT • link 9.0 years ago Peter Langfelder ★ 3.0k

score 0 · Answer 12 · 2015-09-08

0

Entering edit mode

mh.manoj • 0

@mhmanoj-7441

Last seen 7.2 years ago

United States

Hi Peter,

I had a problem similar to this discussion while trying to use the bicor function. I followed your instruction to use the corOptions as a literal string, but then the bicor was not implemented. Please see various combinations I tried. Could you please suggest the right usage?

> adjacency = adjacency(datExpr, power = softPower, type = "signed hybrid",corFnc="bicor");
Warning message:
In bicor(datExpr, use = "p") :
  bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with MAD=NA.

> adjacency = adjacency(datExpr, power = softPower, type = "signed hybrid",corFnc="bicor", corOptions = "use = 'p', maxPOutliers = 0.1");
Warning message:
In bicor(datExpr, use = "p", maxPOutliers = 0.1) :
  bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with MAD=NA.

> adjacency = adjacency(datExpr, power = softPower, type = "signed",corFnc="bicor", corOptions = "use = 'p'");
Warning message:
In bicor(datExpr, use = "p") :
  bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with MAD=NA.

> adjacency = adjacency(datExpr, power = softPower, type = "signed",corFnc="bicor", corOptions = "use = 'p', maxPOutliers = 0.1");
Warning message:
In bicor(datExpr, use = "p", maxPOutliers = 0.1) :
  bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with MAD=NA.

> adjacency = adjacency(datExpr, power = softPower, type = "unsigned",corFnc="bicor", corOptions = "use = 'p', maxPOutliers = 0.1");
Warning message:
In bicor(datExpr, use = "p", maxPOutliers = 0.1) :
  bicor: zero MAD in variable 'x'. Pearson correlation was used for individual columns with MAD=NA.

Thanks,

Manoj.

ADD COMMENT • link 8.7 years ago mh.manoj • 0

0

Entering edit mode

The function is simply reporting that some columns in datExpr had their mad (median absolute deviation) NA. It is likely that some of the columns in datExpr are all missing. It also could be that the function does not distinguish properly between mad=NA and mad=0; in any case, since bicor is undefined for such columns, Pearson standardization was used for such columns. Please see the article Langfelder P, Horvath S (2012) Fast R Functions for Robust Correlations and Hierarchical Clustering Journal of Statistical Software 46(11) 1-17 PMID: 23050260 PMCID: PMC3465711 (http://www.jstatsoft.org/v46/i11) for details.

ADD REPLY • link 8.7 years ago Peter Langfelder ★ 3.0k

0

Entering edit mode

Thanks for your comment Peter.

I am not an expert in statistics, but from reading the article, especially section 3.2, I understand that the bicor would switch to hybrid Pearson-robust correlation if columns have MAD=NA. I guess that is what is happening in this case. I hope this is a normal behavior -- please let me know if this is cause for concern.

I get TRUE when I do gsg$allOK -- so I guess there were no missing entries and zero-variance genes.

Is there a way to identify cases where the "zero MAD in variable 'x'" was found? I tried to use the "pearsonFallback = "none"", "pearsonFallback = "individual"" etc as mentioned in the Stat article, but that gave me an error. Do I have to remove such cases before proceeding or is it ok if the Pearson correlation was used for individual columns with MAD=NA..? If yes, please let me know how to remove columns with MAD=NA.

Thanks,

Manoj.

ADD REPLY • link 8.7 years ago mh.manoj • 0

0

Entering edit mode

The warnings actually tell you that some columns had zero MAD; the wording is poor and I will improve it to remove the mentioning of NA. To identify cases where mad=0, simply run mad.x = apply(x,2, mad) (substitute your data for x) or colMads(x) from package matrixStats (faster than apply). Then check which mad.x are zero.

Whether it is normal to see zero MAD in your data depends on what the data represent. I frequently do see zero MADs in RNA-seq data for genes for which zero count is frequent. I usually filter out such genes though.

If you turn off Pearson fallback ("none"), you should not see errors but rather a warning "Missing values generated in calculation of bicor."

ADD REPLY • link 8.7 years ago Peter Langfelder ★ 3.0k