Hi, I was wondering if I could ask for confirmation regarding my understanding of logFC values. I know that when you get a negative logFC, the gene is underexpressed, and when you get a positive logFC, the gene is overexpressed. But is there a cutoff value?

I read in an article that they used a cut-off value of 1.5: a gene has to be below -1.5 to be underexpressed, above 1.5 to be overexpressed, and anything in between is not differentially expressed. However, how do you arrive at that value? And is there a way to derive the cut-off?

Hi, just in case there are no DE genes identified: how should one proceed? Is it always necessary to get DE genes? I would be happy to get some resources or example data showing that it is possible to report that "there are no DE genes". The findings could be biologically significant even if statistically not significant.

No, there is no general objective justification for any particular log-fold change threshold. Mathematically speaking, it is possible to reject the null hypothesis at any non-zero log-fold change if the variability is low enough. One could argue that small log-fold changes are not biologically relevant, but the exact definition of "small" is open to interpretation. Larger log-fold changes are also more robustly detected across technologies (e.g., RNA-seq and qPCR), though selecting a threshold on this basis would depend on the sensitivities of the technologies involved. Somewhere between 1.1 and 1.5 is a common choice for a "sensible" threshold.
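To make the first point concrete, here is a toy illustration with simulated values (not real expression data): a log-fold change of only 0.05 is confidently detected once the variability is small enough.

```r
# Toy example: a tiny log-fold change is "significant" when variance is low.
set.seed(1)
grp1 <- rnorm(10, mean = 0.00, sd = 0.01)  # control samples
grp2 <- rnorm(10, mean = 0.05, sd = 0.01)  # treated: logFC of only 0.05
pval <- t.test(grp2, grp1)$p.value
pval < 0.05  # TRUE: the null is rejected despite the negligible effect size
```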

But all this is getting away from the main point, which is the detection of DE genes. If you want to do this in a statistically rigorous manner, use the BH-adjusted p-values to control the false discovery rate. This ensures that the expected proportion of false positives in your set of significant DE genes is below a certain threshold (usually 5%). Now, you might say that this approach also involves the selection of an arbitrary threshold. However, with this approach, at least the choice of threshold is directly related to the probability of whether the genes are truly DE or not. A log-fold change threshold doesn't tell you much about the error rate, as it doesn't account for the variability of the expression values.
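For illustration, the BH adjustment itself is just a call to p.adjust (the p-values below are hypothetical):

```r
# Hypothetical raw p-values for six genes.
pvals <- c(0.0001, 0.003, 0.02, 0.04, 0.3, 0.7)
adj <- p.adjust(pvals, method = "BH")  # Benjamini-Hochberg adjustment
sum(adj < 0.05)  # 3 genes called DE while controlling the FDR at 5%
```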

Finally, if you do need a log-fold change threshold, the treat function should be used, and DE genes selected on the basis of the adjusted p-values. This ensures that the FDR is controlled while only considering genes with log-fold changes above a minimum value.
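A minimal sketch of that workflow, with simulated data standing in for a real experiment (the object names are placeholders):

```r
library(limma)
set.seed(1)
y <- matrix(rnorm(1000 * 6), nrow = 1000)    # simulated log-expression values
design <- cbind(Intercept = 1, Treatment = c(0, 0, 0, 1, 1, 1))
fit <- lmFit(y, design)
tfit <- treat(fit, lfc = log2(1.5))          # null: |logFC| <= log2(1.5)
res <- topTreat(tfit, coef = "Treatment", number = Inf)
sig <- res[res$adj.P.Val < 0.05, ]           # DE genes at 5% FDR
```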

Thank you for your time and clarification. Your explanation makes sense. These are the steps I've undertaken; I've followed the limma manual, but I feel like my comprehension could be wrong.

1) From what I understand, adding "robust" in lmFit, eBayes, and treat would take outliers into consideration. Is it overkill to add it in all three functions? The same goes for "trend" in both eBayes and treat.

2) I performed a log2 transformation on my counts (plus a small constant), then filtered out any rows that had a sum of 0 (basically no values), prior to running lmFit. I'm only interested in an absolute log-fold change greater than 1.5, so should it be lfc=log2(1.5), or just lfc=1.5?

Thanks again in advance. I hope everyone is having a great weekend :)

Use voom if you're working on RNA-seq data, rather than log-transforming manually.
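A minimal voom sketch on simulated counts (placeholder data; substitute your own count matrix and design):

```r
library(limma)
set.seed(42)
counts <- matrix(rnbinom(1000 * 6, mu = 50, size = 10), nrow = 1000)
design <- cbind(Intercept = 1, Treatment = c(0, 0, 0, 1, 1, 1))
v <- voom(counts, design)        # log2-CPM values with precision weights
fit <- eBayes(lmFit(v, design))
res <- topTable(fit, coef = "Treatment", number = Inf)
```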

lfc is the threshold on the log-fold change (base 2), so set the value accordingly.
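In other words, for a minimum fold change of 1.5, the threshold should be lfc=log2(1.5), not lfc=1.5:

```r
log2(1.5)  # ~0.585, the correct lfc for a 1.5-fold change
2^1.5      # ~2.83, the fold change you would actually be demanding with lfc=1.5
```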

I don't usually set method="robust" in lmFit. In general, robustification results in some loss of efficiency, and there's not much information to spare on a gene-wise basis in experimental designs with few replicates. This is not a problem in treat and eBayes, where information is shared across thousands of genes.

Regarding voom, the reason I never used it is that I have FPKM values and not read counts (I know it's not ideal, but that's what I've got to work with; and yes, I have to use limma specifically and not cummeRbund). My understanding is that voom normalises counts, whereas FPKM is already normalised, hence the need to log2-transform manually before proceeding with lmFit and the empirical Bayes step (treat) to get the logFC. Please correct me if I'm wrong.

I'm trying to compare my logFC results to results that are reported as FC, to see if I observe the same expression changes as they did, e.g., if gene A is down-regulated in theirs, whether it's down-regulated in mine too.

I know the computation would be table$FC <- 2^(abs(table$logFC)). But then everything would be positive values, and I would like to show the direction of up- and down-regulation with a positive/negative sign. That said, would it make sense to keep the sign by using table$FC <- ifelse(table$logFC < 0, -table$FC, table$FC)?
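For concreteness, here's a small made-up example of what I mean; sign() keeps the direction while the magnitude comes from back-transforming the logFC:

```r
table <- data.frame(logFC = c(1.2, -0.8, 0.5))      # made-up logFC values
table$FC <- sign(table$logFC) * 2^abs(table$logFC)  # signed fold changes
table$FC  # c(2.30, -1.74, 1.41): negative = down-regulated
```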

I've really appreciated all your advice. It's helped me a lot.

And finally, one last question before my analysis is complete!

edgeR's glmTreat is equivalent to limma's treat, and both use an empirical Bayes method. However, I noticed that only limma applies the "BH" adjustment method, which calculates an adjusted p-value; glmTreat does not provide any adjusted p-value in its output. Would that mean the output p-values have already been adjusted?

Thanks again for your time and help Aaron! I've learned a lot from you.

Hi,
I am working on pretty much the same assignment. Just to verify my findings, can you please confirm whether you transformed the FPKM values to log2 manually, e.g., using Excel? How did you do that? And how did you choose your cut-off value, and on what basis?

In addition to my earlier post: I was fiddling around with my results and found that if I added robust to both lmFit and treat, it would result in more DEGs, with warnings that small variances were detected (I assume due to the robust regression).

And if I removed "robust" from lmFit and just left it in treat, it would result in fewer DEGs, which is expected:

   Treatmentvscontrol
-1                 23
0               37430
1                  26

My additional question is: as interesting as more DEGs would be, would I be introducing bias into my results if I use the robust option in lmFit?

I'd be inclined to avoid method="robust" unless you have an explicit reason to use it. Not only is it slower, but the reduced efficiency of robust regression also decreases the precision of the parameter estimates. This drop in precision falls outside the standard linear modelling framework (i.e., least-squares fitting), which makes it difficult to handle in later steps of the limma workflow. As such, it can result in some anticonservativeness:

# Paraphrased example from ?lmFit.
library(limma)
sd <- 0.3*sqrt(4/rchisq(10000,df=4))
y <- matrix(rnorm(10000*6,sd=sd),10000,6)
rownames(y) <- paste("Gene",1:10000)
design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
# Ordinary least-squares fit gives a uniform p-value distribution.
fit <- lmFit(y,design)
fit <- eBayes(fit)
hist(fit$p.value[,2])
# Robust fit is skewed towards low p-values.
fit2 <- lmFit(y,design,method="robust")
fit2 <- eBayes(fit2)
hist(fit2$p.value[,2])
