I have 35 lung cancer samples and 4 normal tissue samples. I'm trying to do a differential expression analysis on the available read counts using edgeR.
For the filtering step I'm using this, which is mentioned in the edgeR tutorial:
keep <- rowSums(cpm(y) > 0.5) >= 2
This keeps genes with CPM values greater than 0.5 in at least two samples. With this filter, 17k of the 19k genes were kept, and when I did the differential analysis between lung cancer samples and normal samples I got only 1000 DEGs with log2 FC 1.2 and FDR <= 0.05.
I expected more differentially expressed genes. Is this because of the small number of normal samples, or do I need to change anything in the filtering step?
Any help is appreciated.
I'm using the glmTreat function only, like the following:
Ok, so you are not using a logFC threshold of 1.2 as you said. You are using a fold change threshold of 1.2, which is roughly a logFC threshold of 0.26. Those are very different. This is why you should provide example code to show what methods and parameter values you have used.
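To make the difference concrete, here is a quick conversion in base R. The variable names are just for illustration; the only numbers used are the 1.2 values mentioned in this thread.

```r
# Convert between fold change and log2 fold change (base R only).
fc_threshold <- 1.2
lfc_from_fc <- log2(fc_threshold)   # a fold change of 1.2 expressed as a logFC

lfc_threshold <- 1.2
fc_from_lfc <- 2^lfc_threshold      # a logFC of 1.2 expressed as a fold change

round(lfc_from_fc, 3)  # 0.263
round(fc_from_lfc, 3)  # 2.297
```

So a fold change threshold of 1.2 asks for a change of at least 20%, while a logFC threshold of 1.2 would ask for roughly a 2.3-fold change.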
In any case, you should still check what results you get from an ordinary differential expression test with a null hypothesis of logFC=0, to see how much of a difference your threshold is making. Obviously, if there are many genes with small but robust changes, you would be filtering those out by using glmTreat.
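A minimal sketch of that comparison is below. I don't have your data or code, so the counts are simulated (with no real differential expression) and the object names, group labels and `coef = 2` are assumptions; substitute your own `DGEList` and design.

```r
library(edgeR)

set.seed(1)
# Simulated counts standing in for the real data: 35 tumour and 4 normal samples.
ngenes <- 2000
group <- factor(c(rep("Tumour", 35), rep("Normal", 4)),
                levels = c("Normal", "Tumour"))
counts <- matrix(rnbinom(ngenes * length(group), mu = 50, size = 5),
                 nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~ group)   # coefficient 2 is Tumour vs Normal
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Ordinary test: null hypothesis is logFC = 0.
res_qlf <- glmQLFTest(fit, coef = 2)
# Thresholded test: null hypothesis is |logFC| <= log2(1.2).
res_treat <- glmTreat(fit, coef = 2, lfc = log2(1.2))

# Number of genes called DE at FDR 0.05 by each test.
n_qlf <- sum(decideTests(res_qlf, p.value = 0.05) != 0)
n_treat <- sum(decideTests(res_treat, p.value = 0.05) != 0)
c(glmQLFTest = n_qlf, glmTreat = n_treat)
```

On simulated null data the two counts are not meaningful in themselves; on your data, the gap between them tells you how many genes the threshold is removing.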
More generally, you should make all the usual diagnostic plots to make sure your data looks reasonable. This includes the aveLogCPM histogram I described above, an MDS plot to verify that the samples are clustering as expected, and the dispersion plot to verify that the dispersion estimation is working well.
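Something along these lines would produce those plots. Again this is only a sketch on simulated counts, since I don't have your data; the filter is the one from your question, and the object names and colours are assumptions.

```r
library(edgeR)

set.seed(1)
# Simulated counts standing in for the real data: 35 tumour and 4 normal samples.
ngenes <- 2000
group <- factor(c(rep("Tumour", 35), rep("Normal", 4)),
                levels = c("Normal", "Tumour"))
counts <- matrix(rnbinom(ngenes * length(group), mu = 50, size = 5),
                 nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))

y <- DGEList(counts = counts, group = group)
keep <- rowSums(cpm(y) > 0.5) >= 2          # filter from the question
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Distribution of average abundance after filtering.
hist(aveLogCPM(y), breaks = 50, main = "Average log CPM", xlab = "aveLogCPM")
# MDS plot, coloured by group, to check that samples cluster as expected.
plotMDS(y, col = as.integer(group))
# Dispersion plots: NB dispersions (BCV) and quasi-likelihood dispersions.
plotBCV(y)
plotQLDisp(fit)
```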
Sorry, that was a typo in my question. It is a fold change threshold of 1.2 only. Yes, with this I get only 1000 DEGs. Yes, I already checked the clustering and everything else for my samples. Everything was good.
With glmQLFTest I got 1200 DEGs and with glmTreat (fold change 1.2 and FDR <= 0.05) I got 1000 DEGs.
If all the statistics and QC look good, then the only possible explanations for a non-significant result are a small (or zero) effect size or a large variance (since roughly speaking, significance is determined by effect size / variance). In the case of cancer samples, tissue heterogeneity can contribute to both of these causes.
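As a rough illustration of how group size enters the variance side of that ratio, consider the standard error of a difference between two group means, which scales with sqrt(1/n1 + 1/n2). This is a simplified two-sample picture and not the model edgeR actually fits; the helper name `se_factor` is made up for this example.

```r
# Scaling factor for the standard error of a difference in group means.
# Simplified illustration only; edgeR's tests are not plain two-sample tests.
se_factor <- function(n1, n2) sqrt(1 / n1 + 1 / n2)

se_35_vs_4 <- se_factor(35, 4)     # design in this thread
se_35_vs_35 <- se_factor(35, 35)   # hypothetical balanced design

round(se_35_vs_4, 3)                # 0.528
round(se_35_vs_35, 3)               # 0.239
round(se_35_vs_4 / se_35_vs_35, 2)  # 2.21
```

Under this simplification, the same effect size and per-sample variance give a test statistic about 2.2 times smaller with 4 normals than with 35, so the small normal group does limit power even when the filtering is fine.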