Background: I'm new to bioinformatics (CS background) and I was assigned the task of implementing a procedure for metabolic network reconstruction that employs a few context-based methods for gene function annotation, which include mRNA coexpression.
Question: I just need to share the details of my work with professionals here to get their yes or no, right or wrong, wither the steps I took were acceptable or not.
Process: before I go through the process of creating this co-expression matrix I want to state my goal clearly:
given two genes A and B in a genome X, I need to have a quantity (preferably between 0 and 1) that express the degree of co-expression between gene A and B.
now as you can see, there's no 'downstream analysis' I'm interested in here; simply, I just want a correctly built coexpression network.
I learned that, nowadays, RNA-Seq expression data is preferable over microarray data; and I found RNA-Seq dataset in GEO database from a study that performed 'differential expression analysis' on a genome.
The study shared (as a supplementary file) the raw counts produced by HTseq-count with rRNA, tRNA, and low coverage genes filtered out. The .txt file is a table with columns represent samples, and rows representing genes. So I took the file and did the following:
# load data rawCounts = read.delim("/PATH/TO/raw_counts_filtered.txt") # start 'CountDataSet' dd = DESeq::newCountDataSet(rawCounts, conditions = colnames(rawCounts)) # estimate factors size dd = DESeq::estimateSizeFactors(dd) # apply variance stabilizing transformation (VST) to the count data, as recommended in WGCNA webpage dd = DESeq2::varianceStabilizingTransformation(dd) # transpose dd = t(dd) # `check` if all samples are ok (i.e. gsg$allOK should be TRUE) gsg = WGCNA::goodSamplesGenes(dd, verbose = 3); # `check` for outlier samples with the following: sampleTree = hclust(dist(dd), method = "average"); par(cex = 0.6); par(mar = c(0,4,2,0)) plot(sampleTree, main = "Sample clustering to detect outliers", sub="", xlab="", cex.lab = 1.5, cex.axis = 1.5, cex.main = 2) # now IF there's an obvious outlier sample(s), then you need to cut it/them off the tree. # cutting the tree at height, for example say, 26 where the outliers are above the line: clust = cutreeStatic(sampleTree, cutHeight = 26, minSize = 10) # optional: if you want to draw a line where exactly the cut will happen, use: abline(h = 26, col = "red") # now here you determain the cluster under the line: clust = WGCNA::cutreeStatic(sampleTree, cutHeight = 26, minSize = 10) keepSamples = (clust==1) datExpr = dd[keepSamples, ] # don't forget WGCNA::enableWGCNAThreads(4) options(stringsAsFactors = FALSE); # next, choosing the soft-thresholding power: analysis of network topology: # powers to choose from: powers = c(c(1:10), seq(from = 12, to=20, by=2)) # Call the network topology analysis function sft = WGCNA::pickSoftThreshold(datExpr, powerVector = powers, verbose = 5) # Plot the results: WGCNA::sizeGrWindow(9, 5) par(mfrow = c(1,2)); cex1 = 0.9; plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], xlab="Soft Threshold (power)",ylab="Scale Free Topology Model Fit,signed R^2",type="n", main = paste("Scale independence")); text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2], labels=powers,cex=cex1,col="red"); # now you clearly see power 5 is the lowest power for which the scale-free topology fit softPower = 5; adjacency = WGCNA::adjacency(datExpr, power = softPower, type = "distance"); # write results to file: write.table(adjacency, file="coexprmatrix.txt", row.names=FALSE, col.names=TRUE)
the comments can be ignored, they were for me.
to reiterate, is this workflow correct or not?
any comment/answer/advice/direction/feedback will be very appreciated.