Background: I'm new to bioinformatics (CS background) and I was assigned the task of implementing a procedure for metabolic network reconstruction that employs a few context-based methods for gene function annotation, which include mRNA coexpression.
Question: I just need to share the details of my work with professionals here to get their yes or no, right or wrong, wither the steps I took were acceptable or not.
Process: before I go through the process of creating this co-expression matrix I want to state my goal clearly:
given two genes A and B in a genome X, I need to have a quantity (preferably between 0 and 1) that express the degree of co-expression between gene A and B.
now as you can see, there's no 'downstream analysis' I'm interested in here; simply, I just want a correctly built coexpression network.
I learned that, nowadays, RNA-Seq expression data is preferable over microarray data; and I found RNA-Seq dataset in GEO database from a study that performed 'differential expression analysis' on a genome.
The study shared (as a supplementary file) the raw counts produced by HTseq-count with rRNA, tRNA, and low coverage genes filtered out. The .txt file is a table with columns represent samples, and rows representing genes. So I took the file and did the following:
# load data rawCounts = read.delim("/PATH/TO/raw_counts_filtered.txt") # start 'CountDataSet' dd = DESeq::newCountDataSet(rawCounts, conditions = colnames(rawCounts)) # estimate factors size dd = DESeq::estimateSizeFactors(dd) # apply variance stabilizing transformation (VST) to the count data, as recommended in WGCNA webpage dd = DESeq2::varianceStabilizingTransformation(dd) # transpose dd = t(dd) # `check` if all samples are ok (i.e. gsg$allOK should be TRUE) gsg = WGCNA::goodSamplesGenes(dd, verbose = 3); # `check` for outlier samples with the following: sampleTree = hclust(dist(dd), method = "average"); par(cex = 0.6); par(mar = c(0,4,2,0)) plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2) # now IF there's an obvious outlier sample(s), then you need to cut it/them off the tree. # cutting the tree at height, for example say, 26 where the outliers are above the line: clust = cutreeStatic(sampleTree, cutHeight = 26, minSize = 10) # optional: if you want to draw a line where exactly the cut will happen, use: abline(h = 26, col = "red") # now here you determain the cluster under the line: clust = WGCNA::cutreeStatic(sampleTree, cutHeight = 26, minSize = 10) keepSamples = (clust==1) datExpr = dd[keepSamples, ] # don't forget WGCNA::enableWGCNAThreads(4) options(stringsAsFactors = FALSE); # next, choosing the soft-thresholding power: analysis of network topology: # powers to choose from: powers = c(c(1:10), seq(from = 12, to=20, by=2)) # Call the network topology analysis function sft = WGCNA::pickSoftThreshold(datExpr, powerVector = powers, verbose = 5) # Plot the results: WGCNA::sizeGrWindow(9, 5) par(mfrow = c(1,2)); cex1 = 0.9; plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n", main = paste("Scale independence")); text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], labels=powers,cex=cex1,col="red"); # now you clearly see power 5 is the lowest power for which the scale-free topology fit softPower = 5; adjacency = WGCNA::adjacency(datExpr, power = softPower, type = "distance"); # write results to file: write.table(adjacency, file="coexprmatrix.txt", row.names=FALSE, col.names=TRUE)
the comments can be ignored, they were for me.
to reiterate, is this workflow correct or not?
any comment/answer/advice/direction/feedback will be very appreciated.
It's all in the WGCNA manual I believe, what you are doing.
yes, I learned most of this from their tutorials; however; the last part is not in the tutorial. The reason I didn't continue with the tutorial is because, after showing how to pick soft threshold, they started talking about 'block wise network' and module detection, and things I felt I don't really need, if I just want that "distance matrix", if you may, between expression profiles.
I (accidentally) saw the WGCNA::adjacency() function and thought it's the closest thing to what I want, so I used it. Although the resulting matrix looks okay to me, I started doubting myself that this might be not right, or maybe these numbers I'm looking at are not exactly representative of pairwise coexpression degree between each pair of genes. For that reason I decided to ask.