Hi, I was wondering if I could ask for confirmation regarding my understanding of logFC values. I know that when you get a negative logFC, the gene is underexpressed, and when you get a positive logFC, the gene is overexpressed. But is there a cutoff value?

I read in an article that they used a cut-off value of 1.5: a gene has to be below -1.5 to be underexpressed, above 1.5 to be overexpressed, and anything in between is not differentially expressed. However, how do you arrive at that value? And is there a way to derive the cut-off?

Hi, just in case there are no DE genes identified: how should one proceed? Is it always necessary to get DE genes? I would be happy to get some resources or example data showing that it is possible to report that "there are no DE genes". The findings could be biologically significant even if statistically not significant.

No, there is no general objective justification for any particular log-fold change threshold. Mathematically speaking, it is possible to reject the null hypothesis at any non-zero log-fold change if the variability is low enough. One could argue that small log-fold changes are not biologically relevant, but the exact definition of "small" is open to interpretation. Larger log-fold changes are also more robustly detected across technologies (e.g., RNA-seq and qPCR), though selecting a threshold on this basis would depend on the sensitivities of the technologies involved. Somewhere between 1.1 and 1.5 is a common choice for a "sensible" threshold.
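To make the first point concrete, here is a toy illustration with simulated values (not real expression data): a log-fold change of only 0.05 is confidently detected once the variability is small enough.

```r
# Toy example: a tiny log-fold change is "significant" when variance is low.
set.seed(1)
grp1 <- rnorm(10, mean = 0.00, sd = 0.01)  # control samples
grp2 <- rnorm(10, mean = 0.05, sd = 0.01)  # treated: logFC of only 0.05
pval <- t.test(grp2, grp1)$p.value
pval < 0.05  # TRUE: the null is rejected despite the negligible effect size
```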

But all this is getting away from the main point, which is the detection of DE genes. If you want to do this in a statistically rigorous manner, use the BH-adjusted p-values to control the false discovery rate. This ensures that the expected proportion of false positives in your set of significant DE genes is below a certain threshold (usually 5%). Now, you might say that this approach also involves the selection of an arbitrary threshold. However, with this approach, at least the choice of threshold is directly related to the probability of whether the genes are truly DE or not. A log-fold change threshold doesn't tell you much about the error rate, as it doesn't account for the variability of the expression values.
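For illustration, the BH adjustment itself is just a call to p.adjust (the p-values below are hypothetical):

```r
# Hypothetical raw p-values for six genes.
pvals <- c(0.0001, 0.003, 0.02, 0.04, 0.3, 0.7)
adj <- p.adjust(pvals, method = "BH")  # Benjamini-Hochberg adjustment
sum(adj < 0.05)  # 3 genes called DE while controlling the FDR at 5%
```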

Finally, if you do need a log-fold change threshold, the treat function should be used, and DE genes selected on the basis of the adjusted p-values. This ensures that the FDR is controlled while only considering genes with log-fold changes above a minimum value.
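A minimal sketch of that workflow, with simulated data standing in for a real experiment (the object names are placeholders):

```r
library(limma)
set.seed(1)
y <- matrix(rnorm(1000 * 6), nrow = 1000)    # simulated log-expression values
design <- cbind(Intercept = 1, Treatment = c(0, 0, 0, 1, 1, 1))
fit <- lmFit(y, design)
tfit <- treat(fit, lfc = log2(1.5))          # null: |logFC| <= log2(1.5)
res <- topTreat(tfit, coef = "Treatment", number = Inf)
sig <- res[res$adj.P.Val < 0.05, ]           # DE genes at 5% FDR
```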

Thank you for your time and clarification. Your explanation makes sense. These are the steps I've undertaken; I've followed the limma manual, but I feel like my comprehension could be wrong.

1) From what I understand, adding "robust" in lmFit, eBayes, and treat would take outliers into consideration. Is it overkill to add it in all three functions? The same goes for "trend" in both eBayes and treat.

2) I performed a log2 transformation on my counts (plus a small constant), then filtered out any rows that had a sum of 0 (basically no values), prior to running lmFit. I'm only interested in an absolute log-fold change greater than 1.5, so should it be lfc=log2(1.5), or just lfc=1.5?

Thanks again in advance. I hope everyone is having a great weekend :)

Use voom if you're working on RNA-seq data, rather than log-transforming manually.
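A minimal voom sketch on simulated counts (placeholder data; substitute your own count matrix and design):

```r
library(limma)
set.seed(42)
counts <- matrix(rnbinom(1000 * 6, mu = 50, size = 10), nrow = 1000)
design <- cbind(Intercept = 1, Treatment = c(0, 0, 0, 1, 1, 1))
v <- voom(counts, design)        # log2-CPM values with precision weights
fit <- eBayes(lmFit(v, design))
res <- topTable(fit, coef = "Treatment", number = Inf)
```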

lfc is the threshold on the log-fold change (base 2), so set the value accordingly.
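In other words, for a minimum fold change of 1.5, the threshold should be lfc=log2(1.5), not lfc=1.5:

```r
log2(1.5)  # ~0.585, the correct lfc for a 1.5-fold change
2^1.5      # ~2.83, the fold change you would actually be demanding with lfc=1.5
```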

I don't usually set method="robust" in lmFit. In general, robustification results in some loss of efficiency, and there's not much information to spare on a gene-wise basis in experimental designs with few replicates. This is not a problem in treat and eBayes, where information is shared across thousands of genes.

Regarding voom, the reason I never used it is that I have FPKM values and not read counts (I know it's not ideal, but that's what I've got to work with; and yes, I have to use limma specifically and not cummeRbund). My understanding is that voom normalises counts, whereas FPKM is already normalised, hence the need to log2-transform manually before proceeding with lmFit and the empirical Bayes step (treat) to get the logFC. Please correct me if I'm wrong.

I'm trying to compare my logFC results to results that are reported as FC, to see if I observe the same expression changes as they did, e.g., if gene A is down-regulated in theirs, whether it's down-regulated in mine too.

I know the computation would be table$FC <- 2^(abs(table$logFC)). But then everything would be positive values, and I would like to show the direction of up- and down-regulation with a positive/negative sign. That said, would it make sense to keep the sign by using table$FC <- ifelse(table$logFC < 0, -table$FC, table$FC)?
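For concreteness, here's a small made-up example of what I mean; sign() keeps the direction while the magnitude comes from back-transforming the logFC:

```r
table <- data.frame(logFC = c(1.2, -0.8, 0.5))      # made-up logFC values
table$FC <- sign(table$logFC) * 2^abs(table$logFC)  # signed fold changes
table$FC  # c(2.30, -1.74, 1.41): negative = down-regulated
```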

I've really appreciated all your advice. It's helped me a lot.

And finally, one last question before my analysis is complete!

edgeR's glmTreat is equivalent to limma's treat, and both use an empirical Bayes method. However, I noticed that only limma applies the "BH" adjustment method, which calculates an adjusted p-value; glmTreat does not provide any adjusted p-value in its output. Would that mean the output p-values have already been adjusted?

Thanks again for your time and help Aaron! I've learned a lot from you.

Hi,
I am working on pretty much the same assignment. Just to verify my findings, can you please confirm whether you transformed the FPKM values to log2 manually, e.g., using Excel? How did you do that? And how did you choose your cut-off value, and on what basis?

In addition to my earlier post: I was fiddling around with my results and found that if I added robust to both lmFit and treat, it would result in more DEGs, with warnings that small variances were detected (I assume due to the robust regression).

And if I removed "robust" from lmFit and just left it in treat, it would result in fewer DEGs, which is expected:

   Treatmentvscontrol
-1                 23
0               37430
1                  26

My additional question is: as interesting as more DEGs would be, would I be introducing bias into my results if I use the robust option in lmFit?

I'd be inclined to avoid method="robust" unless you have an explicit reason to use it. Not only is it slower, but the reduced efficiency of robust regression also decreases the precision of the parameter estimates. This drop in precision falls outside the standard linear modelling framework (i.e., least-squares fitting), which makes it difficult to handle in later steps of the limma workflow. As such, it can result in some anticonservativeness:

# Paraphrased example from ?lmFit.
library(limma)
sd <- 0.3*sqrt(4/rchisq(10000,df=4))
y <- matrix(rnorm(10000*6,sd=sd),10000,6)
rownames(y) <- paste("Gene",1:10000)
design <- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
# Ordinary least-squares fit gives a uniform p-value distribution.
fit <- lmFit(y,design)
fit <- eBayes(fit)
hist(fit$p.value[,2])
# Robust fit is skewed towards low p-values.
fit2 <- lmFit(y,design,method="robust")
fit2 <- eBayes(fit2)
hist(fit2$p.value[,2])
