I have a question regarding logFC values which stems from the [apparent] incorrect sign given in our EdgeR result table.
The data I am working with is metatranscriptomic data from environmental samples. We have three conditions (anoxic, surface, and oxycline water samples with ~6 million genes in our count table), and we are attempting to get the logFC between the conditions (AnoxicVsSurface, SurfaceVsOxycline, OxyclineVsAnoxic). We have filtered for low cpm from the DGEList object and further filtered results for significant differential expression using the FDR value. For our logFC values, we have thousands of instances in our data where the apparent sign is incorrect.
For example, if I were to inspect all genes which had zero or low count data in only anoxic waters and had high counts in surface waters, I would expect my logFC to be negative based on the "AvsS = Anoxic - Surface" part of my makeContrasts command; from my interpretation this would indicate upregulation in surface waters, a result that would make sense given that non-zero counts would only have occurred in surface samples. For some genes, this is TRUE, but for other genes this is FALSE, it is very puzzling. Our results show that ~40% of genes give the presumably incorrect sign for logFC, in that logFC is positive for some samples and logFC is negative for other samples that have similar count data.
We have attempted to only process data using two conditions at a time, different formulas and experiment designs (namely intercept inclusion), altering count data to not include zeros by substituting 0.0000001, and have used older versions of edgeR; none of which have resulted in a presumably "correct" sign for more than 75% of the data. I understand that the "fancy" math (dispersion, shrinkage, and normalization) may impact logFC, but I would think not to this degree. Below, I have also pasted 4 genes that contains sum of the count data from two conditions for simplicity so this is clear to see (there are a similar number of samples in each condition, >5 in each condition).
I have a small subset of my significant result table (top 50 "downregulated" and top 50 "upregulated" genes, only focusing on AnoxicVsSurface for simplicity), their raw count data, RPKM data, and logFC so the issue above is clear to see, if this needs to be verified through more direct correspondence (would also provide a reproducible dataset). I have also included the code used in the analysis below... If there are improvements/suggestions, I will hope that it was not one of the formulas or designs I attempted, though I am pretty sure I exhausted my options. Thank you for your time.
# Small example of error
table <- data.frame(anoxicSum = c(13,0, 16,0),
SurfaceSum = c(0,13,0,16),
Length.Raw = c(390, 306, 297, 387),
Length = c(1797, 1212, 1620, 990),
logFC = c(-11.36, -11.36, 14.84, 15.94))
print(table)
# Getting Differential Expressed Genes #####
# Factor sample depths
depth <- factor(metadata13$Category_new)
# Construct edgeR object
edOb <- DGEList(geneIDkNum[,-c(1,2,3,4,5)], group=depth, genes=geneIDkNum[,c(1,5), drop = FALSE])
# Remove genes with low counts
keep <- rowSums(cpm(edOb) > 0.1) >= 3
table(keep) # how many are TRUE
edOb2 <- edOb[keep, , keep.lib.sizes=FALSE]
# Normalization for composition bias
edObNorm <- calcNormFactors(edOb2, method = "TMM")
# Design matrix
design <- model.matrix(~0+depth)
colnames(design) <- levels(depth)
design
# Estimate Dispersion
edObDis <- estimateDisp(edObNorm, design, robust=TRUE)
# Full RPKM Table
rpkmData <- as.data.frame(rpkm(edObDis))
rpkmData$Chr <- edObDis$genes$Chr
write.csv(rpkmData, file = "rpkmData.csv", row.names = FALSE)
# Estimate Quasi-Likelihood (QL) dispersions
edObFit <- glmQLFit(edObDis, design, robust=TRUE)
head(edObFit$coefficients)
# Differential expression analysis # A vs S
diffXaAS <- makeContrasts(AvsS = Anoxic - Surface,
levels = design)
# QLFTest = Quasi-Likelihood F Test
res1AS <- glmQLFTest(edObFit, contrast=diffXaAS)
topTags(res1AS) # view the top DE genes
res1.tableAS <- topTags(res1AS, n = "Inf")$table
res1.tableAS$Chr <- res1AS$genes$Chr
resTableAS <- as.data.frame(res1.tableAS)
resTableAS <- resTableAS[ ,c(1,3,7)]
colnames(resTableAS) <- c("Chr", "LogFC_AvsS", "FDR_AvsS")
# Count how many genes are differentially regulated
diffRegAS <- decideTestsDGE(res1AS)
summary(diffRegAS)
sessionInfo( )
Hi Gordon. It was collation with my edgeR results, BASTA taxonomy, and function gene annotation tables. Prior to doing this last check for you I was doing all these joins before diving into the data. I filtered, sorted, and joined data in a different way and was able to get the expected results. Many-to-many relationships must have messed up the orders (it was a warning I noted, but I had not had this issue with smaller data sets, so I overlooked it). The suggestion to look at my collation methods helped me pin and fix this quite easily. Thanks for your time (and thanks for checking in after 2 days! We got hit by a hurricane in the Southeast US so a fresh set of eyes after no having power was my silver lining)