Question

Filtering step for differential analysis in edgeR

0

Entering edit mode

Biologist ▴ 120

@biologist-9801

Last seen 5.9 years ago

I have 35 lung cancer samples and 4 normal tissues. I'm trying to do differential analysis. With the available read counts data using edgeR for differential analysis.

For the filtering steps I'm using this which is mentioned in edgeR tutorial

keep <- rowSums(cpm(y) > 0.5) >= 2

This is where we keep genes with cpm values greater than 0.5 in at least two cases. But with this among 19k genes after filtering it kept 17k genes and when I do differential analysis between lung cancer samples and Normal cases I got only 1000 DEG's with log2 FC 1.2 and FDR <= 0.05

I expected more differential expressed genes. As only 1000 genes were Differentially expressed Is this because of less number of normal samples or do I need to change anything in the filtering step?

Any help is appreciated.

r edger differential gene expression filtering bioconductor • 11k views

ADD COMMENT • link updated 7.7 years ago by Gordon Smyth 53k • written 7.7 years ago by Biologist ▴ 120

score 1 · Answer 1 · 2018-05-11

1

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 15 months ago

Icahn School of Medicine at Mount Sinai…

A logFC threshold of 1.2 is quite a stringent cutoff. You are ignoring all genes with a fold change less than 2.3 (i.e. 2^1.2). I'm not surprised that relatively few genes pass this threshold.

Also note that using separate thresholds for logFC and FDR is suboptimal, since the double filtering may make your FDR values unreliable. You should have a look at the glmTreat function, which supports testing the logFC relative to a specified threshold value while also properly controlling the FDR.

As for abundance filtering, my recommendation is to filter based on the values returned by aveLogCPM. Plot the values and verify that you get a bimodal distribution, and choose your threshold so as to discard the lower mode while keeping the higher mode. But as I said above, your low-count filtering criterion is unlikely to be the reason you are only getting 1000 significant genes.

ADD COMMENT • link 7.7 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

I'm using glmTreat function only. It is like following:

tr <- glmTreat(fit, contrast=contrast.matrix, lfc=log2(1.2))
topTags(tr)
tab2 <- topTags(tr,n=Inf)
keep <- tab2$table$FDR <= 0.05

DEG <- tab2$table[keep,]

ADD REPLY • link 7.7 years ago Biologist ▴ 120

0

Entering edit mode

Ok, so you are not using a logFC threshold of 1.2 as you said. You are using a fold change threshold of 1.2, which is roughly a logFC threshold of 0.26. Those are very different. This is why you should provide example code to show what methods and parameter values you have used.

In any case, you should still check what results you get from an ordinary differential expression test with a null hypothesis of logFC=0, to see how much of a difference your threshold is making. Obviously, if there are many genes with small but robust changes, you would be filtering those out by using glmTreat.

More generally, you should make all the usual diagnostic plots to make sure your data looks reasonable. This includes the aveLogCPM histogram I described above, an MDS plot to verify that the sample are clustering as expected, and the dispersion plot to verify that the dispersion estimation is working well.

ADD REPLY • link 7.7 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Sorry, that was a typo in my question. it is fold change threshold of 1.2 only. Yes, with this I get only 1000 DEGs. Yes, I already checked the clustering part and everything for my samples. Everything was good.

With glmQLFTest I got 1200 DEG's and with glmTreat (log2FC 1.2 and FDR <=0.05) I got 1000 DEGs.

ADD REPLY • link 7.7 years ago Biologist ▴ 120

0

Entering edit mode

If all the statistics and QC look good, then the only possible explanations for a non-significant result are a small (or zero) effect size or a large variance (since roughly speaking, significance is determined by effect size / variance). In the case of cancer samples, tissue heterogeneity can contribute to both of these causes.

ADD REPLY • link 7.7 years ago Ryan C. Thompson ★ 7.9k

score 1 · Answer 2 · 2018-05-11

1

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 8 hours ago

WEHI, Melbourne, Australia

These issues are discussed in the workflow article

You don't say which edgeR tutorial you are reading, but I guess it is the workflow published in F1000Research:

https://f1000research.com/articles/5-1438

or else the Bioconductor workflow version of the same article:

edgeRQL Vignette

How to choose the filtering step (and whether or not to use glmTreat) is carefully discussed in that article. I'm a bit puzzled why you are not following the advice given in the article. How much of the article have you read? Was the advice not clear?

Just use filterByExpr

You seem to have copied the filtering step used for a different dataset without adapting it to your dataset. The filtering has to be adapted to your sample sizes and library sizes -- you can't just copy the code like that. In the current version of edgeR, we have made it easier for you. You can now simply use:

keep <- filterByExpr(y, design)

That function will then tune the filtering to be appropriate for your sequencing depth and experimental design.

glmTreat is designed to reduce the number of DE genes

As explained in the workflow article, glmTreat is designed to reduce the number of DE genes that you get, and to prioritize the most biologically meaningful of them. I'm a bit puzzled why you would use glmTreat(), but then complain that you don't have enough DE genes, considering that reducing the amount of DE is the whole purpose of the function. 1000 DE genes sounds a lot to me. Why would you want any more DE genes than that?

ADD COMMENT • link 7.7 years ago Gordon Smyth 53k

0

Entering edit mode

Thanks for the reply. Sorry, I didn't know that. I'm interested in DE lncRNA genes. lncRNAs are basically with low expression value. When I used glmTreat I got only 200 DE lncRNA genes among 1000 (both Differentially expressed protein coding and lncRNAs). I expected them to be more. If not glmTreat which function I should use? To use filterByExpr function I tried updating edgeR but it didn't work. Currently edgeR version I'm using is 3.18.1

ADD REPLY • link 7.7 years ago Biologist ▴ 120

2

Entering edit mode

It seems that you are using Bioconductor Release 3.5 instead of 3.7. In other words, you are two Bioconductor Releases out of date. See the Bioconductor website for how to install the current version. You will need to be running R 3.5.0.

As for alternatives to glmTreat, why not read the workflow? All of this is explained. Or else just put lfc=0.

ADD REPLY • link 7.7 years ago Gordon Smyth 53k

0

Entering edit mode

Hello Gordon,

As you said, I did all the installations again and now I can see filterByExpr function. Could you please tell how does this function works? how is it different from other step using cpm? Nothing mentioned about this "filterByExpr" function in the tutorial.

ADD REPLY • link 7.7 years ago Biologist ▴ 120

1

Entering edit mode

filterByExpr() simply implements the filtering strategy already described intuitively in the F1000Research article. It is not different from using cpm, with the cpm cutoff and minimum number chosen appropriately according to the library sizes and the experimental design.

filterByExpr() is used in the edgeRQL Vignette (linked above), although obviously not in the F1000Research article from two years ago.

ADD REPLY • link 7.7 years ago Gordon Smyth 53k

0

Entering edit mode

one last question. When I used "keep <- filterByExpr(y, design)” it retained only 5000 among 15000 genes for further analysis and when I used "keep <- rowSums(cpm(y) > 0.5) >= 1" it retained 12000 among 15000 genes. May I know why there is difference in number.

I see that you mentioned that "filterByExpr" uses cpm cutoff, library size and exp design for filtering. how this cutoff is given by the function?

ADD REPLY • link 7.7 years ago Biologist ▴ 120

0

Entering edit mode

Well, it is obvious that if you use a less stringent filter that you will keep more genes. There's nothing more I can say without knowing anything about your data. Have you understood that the filtering rule is supposed to depend on the library size? It isn't correct to use cpm(y)>0.05 all the time regardless of sequencing depth.

Anyway, I think that how to filter is well explained in the F1000Research article, and I feel I have answered your original question.

ADD REPLY • link 7.7 years ago Gordon Smyth 53k