Question

EdgeR - Apparent Wrong Direction of logFC with Low Count Data

0

Entering edit mode

blmurphy • 0

@8f1ecb52

Last seen 9 months ago

United States

I have a question regarding logFC values which stems from the [apparent] incorrect sign given in our EdgeR result table.

The data I am working with is metatranscriptomic data from environmental samples. We have three conditions (anoxic, surface, and oxycline water samples with ~6 million genes in our count table), and we are attempting to get the logFC between the conditions (AnoxicVsSurface, SurfaceVsOxycline, OxyclineVsAnoxic). We have filtered for low cpm from the DGEList object and further filtered results for significant differential expression using the FDR value. For our logFC values, we have thousands of instances in our data where the apparent sign is incorrect.

For example, if I were to inspect all genes which had zero or low count data in only anoxic waters and had high counts in surface waters, I would expect my logFC to be negative based on the "AvsS = Anoxic - Surface" part of my makeContrasts command; from my interpretation this would indicate upregulation in surface waters, a result that would make sense given that non-zero counts would only have occurred in surface samples. For some genes, this is TRUE, but for other genes this is FALSE, it is very puzzling. Our results show that ~40% of genes give the presumably incorrect sign for logFC, in that logFC is positive for some samples and logFC is negative for other samples that have similar count data.

We have attempted to only process data using two conditions at a time, different formulas and experiment designs (namely intercept inclusion), altering count data to not include zeros by substituting 0.0000001, and have used older versions of edgeR; none of which have resulted in a presumably "correct" sign for more than 75% of the data. I understand that the "fancy" math (dispersion, shrinkage, and normalization) may impact logFC, but I would think not to this degree. Below, I have also pasted 4 genes that contains sum of the count data from two conditions for simplicity so this is clear to see (there are a similar number of samples in each condition, >5 in each condition).

I have a small subset of my significant result table (top 50 "downregulated" and top 50 "upregulated" genes, only focusing on AnoxicVsSurface for simplicity), their raw count data, RPKM data, and logFC so the issue above is clear to see, if this needs to be verified through more direct correspondence (would also provide a reproducible dataset). I have also included the code used in the analysis below... If there are improvements/suggestions, I will hope that it was not one of the formulas or designs I attempted, though I am pretty sure I exhausted my options. Thank you for your time.


# Small example of error
table <- data.frame(anoxicSum = c(13,0, 16,0),
                       SurfaceSum = c(0,13,0,16),
                       Length.Raw = c(390, 306, 297, 387),
                       Length = c(1797, 1212, 1620, 990),
                       logFC = c(-11.36, -11.36, 14.84, 15.94))
print(table)

# Getting Differential Expressed Genes #####
# Factor sample depths
depth <- factor(metadata13$Category_new)
# Construct edgeR object
edOb <- DGEList(geneIDkNum[,-c(1,2,3,4,5)], group=depth, genes=geneIDkNum[,c(1,5), drop = FALSE])
# Remove genes with low counts
keep <- rowSums(cpm(edOb) > 0.1) >= 3
table(keep) # how many are TRUE
edOb2 <- edOb[keep, , keep.lib.sizes=FALSE]
# Normalization for composition bias
edObNorm <- calcNormFactors(edOb2, method = "TMM")
# Design matrix
design <- model.matrix(~0+depth)
colnames(design) <- levels(depth)
design
# Estimate Dispersion
edObDis <- estimateDisp(edObNorm, design, robust=TRUE)
# Full RPKM Table
rpkmData <- as.data.frame(rpkm(edObDis))
rpkmData$Chr <- edObDis$genes$Chr
write.csv(rpkmData, file = "rpkmData.csv",  row.names = FALSE)
# Estimate Quasi-Likelihood (QL) dispersions
edObFit <- glmQLFit(edObDis, design, robust=TRUE)
head(edObFit$coefficients)

# Differential expression analysis # A vs S
diffXaAS <- makeContrasts(AvsS = Anoxic - Surface,
                          levels = design)
# QLFTest = Quasi-Likelihood F Test
res1AS <- glmQLFTest(edObFit, contrast=diffXaAS) 
topTags(res1AS) # view the top DE genes
res1.tableAS <-  topTags(res1AS, n = "Inf")$table
res1.tableAS$Chr <- res1AS$genes$Chr
resTableAS <- as.data.frame(res1.tableAS)
resTableAS <- resTableAS[ ,c(1,3,7)]
colnames(resTableAS) <- c("Chr", "LogFC_AvsS", "FDR_AvsS")
# Count how many genes are differentially regulated
diffRegAS <- decideTestsDGE(res1AS)
summary(diffRegAS)

sessionInfo( )

limma edgeR • 668 views

ADD COMMENT • link updated 9 months ago by Gordon Smyth 52k • written 9 months ago by blmurphy • 0

score 0 · Answer 1 · 2024-09-24

edgeR will always produce a large positive logFC if you compare a group with positive counts vs a group with all zero counts, so what you report seeing is not actually possible. It doesn't depend on the number of groups in your analysis, or the number of samples in each group, or the normalization or dispersion estimation, or the version of edgeR, it is simply not possible to find an all-zero group as up-regulated.

There doesn't look to be any problem with the code you show, so the problem must have occurred when you collate the results. I wonder whether you have simply lined up the logFCs and the count rows in the wrong order? The topTable res1.tableAS that you have stored is sorted by p-value, whereas the counts matrix is not sorted. Can you try regenerating the results table by

res1.tableAS <-  topTags(res1AS, n = "Inf", sort="none")$table

so that the rows of the topTable and the rows of the count matrix will be the same order. Then you can just combine (cbind) the count matrix and topTable data.frame together. Does that solve your problem?

If you still have a problem, then I will suggest some more code to investigate your results and ask you to show the output. As it is, you don't show any edgeR output at all.