novice: building gene co-expression network using RNA-Seq data
Entering edit mode
7kemZmani ▴ 10
Last seen 6.3 years ago


Background: I'm new to bioinformatics (CS background) and I was assigned the task of implementing a procedure for metabolic network reconstruction that employs a few context-based methods for gene function annotation, which include mRNA coexpression.

Question: I just need to share the details of my work with professionals here to get their yes or no, right or wrong, wither the steps I took were acceptable or not.

Process: before I go through the process of creating this co-expression matrix I want to state my goal clearly:

given two genes A and B in a genome X, I need to have a quantity (preferably between 0 and 1) that express the degree of co-expression between gene A and B.

now as you can see, there's no 'downstream analysis' I'm interested in here; simply, I just want a correctly built coexpression network.

I learned that, nowadays, RNA-Seq expression data is preferable over microarray data; and I found RNA-Seq dataset in GEO database from a study that performed 'differential expression analysis' on a genome.

The study shared (as a supplementary file) the raw counts produced by HTseq-count with rRNA, tRNA, and low coverage genes filtered out. The .txt file is a table with columns represent samples, and rows representing genes. So I took the file and did the following:

# load data
rawCounts = read.delim("/PATH/TO/raw_counts_filtered.txt")

# start 'CountDataSet'
dd = DESeq::newCountDataSet(rawCounts, conditions = colnames(rawCounts))

# estimate factors size
dd = DESeq::estimateSizeFactors(dd)

# apply variance stabilizing transformation (VST) to the count data, as recommended in WGCNA webpage
dd = DESeq2::varianceStabilizingTransformation(dd)

# transpose
dd = t(dd)

# `check` if all samples are ok (i.e. gsg$allOK should be TRUE)
gsg = WGCNA::goodSamplesGenes(dd, verbose = 3);

# `check` for outlier samples with the following:
sampleTree = hclust(dist(dd), method = "average");
par(cex = 0.6);
par(mar = c(0,4,2,0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2)

# now IF there's an obvious outlier sample(s), then you need to cut it/them off the tree.
# cutting the tree at height, for example say, 26 where the outliers are above the line:
clust = cutreeStatic(sampleTree, cutHeight = 26, minSize = 10)

# optional: if you want to draw a line where exactly the cut will happen, use:
abline(h = 26, col = "red")

# now here you determain the cluster under the line:
clust = WGCNA::cutreeStatic(sampleTree, cutHeight = 26, minSize = 10)
keepSamples = (clust==1)
datExpr = dd[keepSamples, ]

# don't forget
options(stringsAsFactors = FALSE);

# next, choosing the soft-thresholding power: analysis of network topology:
# powers to choose from:
powers = c(c(1:10), seq(from = 12, to=20, by=2))

# Call the network topology analysis function
sft = WGCNA::pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results:
WGCNA::sizeGrWindow(9, 5)
par(mfrow = c(1,2));
cex1 = 0.9;

plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n", main = paste("Scale independence"));

text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], labels=powers,cex=cex1,col="red");

# now you clearly see power 5 is the lowest power for which the scale-free topology fit
softPower = 5;
adjacency = WGCNA::adjacency(datExpr, power = softPower, type = "distance");

# write results to file:
write.table(adjacency, file="coexprmatrix.txt", row.names=FALSE, col.names=TRUE)

the comments can be ignored, they were for me. 

to reiterate, is this workflow correct or not?

any comment/answer/advice/direction/feedback will be very appreciated.

rna-seq WGCNA deseq method coexpression • 2.5k views
Entering edit mode

It's all in the WGCNA manual I believe, what you are doing.

Entering edit mode

yes, I learned most of this from their tutorials; however; the last part is not in the tutorial. The reason I didn't continue with the tutorial is because, after showing how to pick soft threshold, they started talking about 'block wise network' and module detection, and things I felt I don't really need, if I just want that "distance matrix", if you may, between expression profiles.

I (accidentally) saw the WGCNA::adjacency() function and thought it's the closest thing to what I want, so I used it. Although the resulting matrix looks okay to me, I started doubting myself that this might be not right, or maybe these numbers I'm looking at are not exactly representative of pairwise coexpression degree between each pair of genes. For that reason I decided to ask.

Entering edit mode
Last seen 3 months ago
United States

I would not use the "distance" adjacency type, which uses Euclidean distance instead of correlation. Distance is more appropriate for calculating (dis-) similarities of samples, not of genes. Just leave the adjacency type either default (unsigned network), or use type = "signed hybrid".  If you want to know the difference between signed and unsigned, read and .

As for terminology, "distance" usually means something more specific (e.g., Euclidean or Manhattan distance) than "dissimilarity", which is a general term for any measure that gets larger when objects become less similar. The function adjacency returns a similarity which is a measure that gets larger when objects become more similar.



Entering edit mode

Thank you Peter for your answer.

I also thought that type = "distance" means the Euclidean distance, and I thought it will show somewhere in the help document; yet when you look at the help document:


you see "... for type = "distance", adjacency = (1-(dist/max(dist))^2)^power"

now the formula is clearly different than the Euclidean or Manhattan distance formula; and along the diagonal of the resulting matrix you see '1's; which, of course, means the two 'vectors' are identical. This observation gave me a little bit of confidence that this is, most likely, a measurement of similarity (i.e. when two vectors are the same then it's a 1).


Entering edit mode

would you mind explaining more why the distance measurment is more appropriate for samples, and not genes?

maybe I'm wrong, but aren't they just vectors being compared!

Entering edit mode

would you mind explaining more why the distance measurement is more appropriate for samples, and not genes?

maybe I'm wrong, but aren't they just vectors being compared!

Entering edit mode

okay, I think I'll switch to type = 'signed hybrid' as you suggested in the second article.

Thank you!


Login before adding your answer.

Traffic: 960 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6