Filtering of RNA-seq data in EdgeR
Entering edit mode
Last seen 5.7 years ago

Can any one define the filtering criteria of this command? 

> keep <- aveLogCPM(y) > 0

> y <- y[keep,,keep.lib.sizes=FALSE]

rnaseq edger filtering • 5.6k views
Entering edit mode

Firstly, please update this top-level question with the correct version (whatever it may be) instead of adding a new/updated question as a response.

Secondly, it's not clear what you're asking. You have defined the filtering criteria in the keep variable:

keep <- aveLogCPM(y) > 4
y <- y[keep,,keep.lib.sizes=FALSE]

And keep will be TRUE when the average log CPM of the gene is greater than four ... but that's pretty much exactly like it reads, which is why I say that it's not really clear what you're asking.

When you subset a DGEList and specify keep.lib.sizes=FALSE, the lib.size for each sample (cf. the y$samples data.frame) will be recalculated to be the sum of the counts left in the rows of the experiment for each sample.

Entering edit mode

Below is what I have done. I used logCPM for filtering as you can see. First I'd like to make to sure that I used the correct filtering criteria. Secondly I want to know how to write it (the criteria) down for the manuscript?

y <- DGEList(counts=rawdata[,2:9], genes=rawdata[,1])
> keep <- aveLogCPM(y) > 1 (library size =1 according to y$sample)
> y <- y[keep,,keep.lib.sizes=FALSE]
> y <- calcNormFactors(y)
> y <- calcNormFactors(y)
> y$samples

 group lib.size norm.factors
infec1      1 57033472    0.9956575
infect2     1 43491567    0.9781244
infec3      1 49528718    0.9889648
infec4      1 55217200    1.0118344
ctrl1       1 48728124    0.9950521
ctrl2       1 49750792    1.0024793
ctrl3       1 49755660    1.0368674
ctrl4       1 52538386    0.9921132
> plotMDS(y)
> infected<- factor(c(1,1,1,1,0,0,0,0))
>  design <- model.matrix(~infected)
> y <- estimateDisp(y, design)
> fit <- glmQLFit(y, design, robust=TRUE)
> res <- glmQLFTest(fit, coef='infected1')
> topTags(res)

Thank you

Entering edit mode
Last seen 4 days ago
United States

Your first question is unanswerable in the general sense, because there is no way to say what the 'correct' filtering criterion is. Any analysis is based on assumptions that the analyst makes about things that she really doesn't know the answer to. For example, you are filtering out any gene with an average log CPM less than 1. Is that correct? Who knows? You could use plotBCV to see at what average log CPM the data start to look less reliable, or something, but that is just an eyeballometric measure, where you decide that the values below some cutoff are starting to look like noise is predominating.

For the second question, I find myself struggling to answer you (even though I have already done so). This is pretty much the simplest part of the analysis to understand, and you are having problems saying in words what you have done. How do you expect to explain the generalized linear model with a quasi-likelihood dispersion estimate? Or the quasi-likelihood F-test? Those concepts are orders of magnitude more difficult to understand, let alone explain to an unsophisticated audience.

I understand wanting to do the analysis yourself, and certainly having access to free, world-class analysis tools makes it easy to attempt. But most people go to a lawyer for their lawyering, and a doctor for their doctoring. If I were you, I would seriously consider finding a local statistician to (at the very least) help you with your analyzing.

Entering edit mode
Last seen 5.7 years ago

I have done the analysis myself, looked at my results, plotBCV and all that. They look good and I have reasonable results, wrote down the paper and ready to submit. The statistician and 2 other scientists read my the paper  where I explained the QL and F tests and approved it. However, I don't have a bioinformatician and wanted to double check my codes, and make sure that  I did not miss any criteria for filtering etc... This was only a double check on the codes as well as scientific explanation on the filtering, maybe I'm being too concern about this!!!!Thanks for taking your time and writing that (irrelevant) response.....



Entering edit mode
Last seen 9 hours ago
WEHI, Melbourne, Australia

The filtering step you have done is perfectly reasonable. When you write it for publication you might say:

"Genes were filtered from the analysis if their average log2 count per million (as computed by edgeR's aveLogCPM function) was negative. This had the effect of keeping genes with an average count of about 5 or more per sample."

In other words, the secret is to describe precisely what you did.

Note: This corresponds to keep <- aveLogCPM(y) > 0, which is what you wrote in your original post. However you later said you had used keep <- aveLogCPM(y) > 1 when you replied to Steve. The latter would be more conservative, and would correspond to an average count of about 10 per sample.

Entering edit mode
David ROUX ▴ 10
Last seen 3.0 years ago
France (Avignon University)


I have quite exactly the same question, but I did "cpm" instead of  "aveLogCPM", according to the official book p11 :

keep <- rowSums(cpm(y)>1) >= 3

y <- y[keep, , keep.lib.sizes=FALSE]

I put "3" due to my 3 replicats per conditions.

By doing this I go from 60 000 to 20 000 contigs. But I have difficulties to understand later why in my DGE "condition 1 VS condition 2" table,  I still see very low logCPM (-0.86). Do I have to filter again at this time and where to fix the limit on these logCPM ? 0 ? 1,858 (log2(3) ?


Note, I was frightened by the comment of JW MacDonald. I believe that the “open world” and all other “open source and stuffs” are only rich from their communities and exchanges.

One should think about that : “Do you have to be a car manufacturer to drive a car ? No. You have to learn to drive, change a wheel and put gas”. 

I am also working with the help of a statistician but my goal is to gain in autonomy. Unfortunately, some scientists (often the good ones) are lacking the capacities to talk to “unsophisticated audience”.

Entering edit mode

If you want to add a comment, you should click on the ADD COMMENT button and type in the box that comes up. The 'Add your answer' box below is for adding answers, not for adding more questions or comments.

And your question is an example of what I was talking about. Do note that if you have three samples with a CPM of 1, and three with a CPM of zero (e.g, no counts for that gene), you will get

> log2(mean(c(1,1,1,0,0,0)))
[1] -1

So a logCPM of -0.86 is to be expected, given the filtering criterion you used! The fact that you were surprised by this should be a warning to you.

As for your example; no you don't need to be a car manufacturer to drive a car, in the same way that you don't have to be an R package developer to be an R user. But you do have to show proficiency in the act of driving a car before you will be allowed to do so. However, you don't have to show any proficiency in, like anything at all, to download R and Bioconductor and then start analyzing data.

The point I was trying to make is that understanding the underlying statistics is, for most people, much more difficult than figuring out how to get R to do something. You can read through the edgeR User's Guide, and try to extend what Gordon has written there to your own experiment, but that doesn't mean you understand what you have done, that you are sure you have done something valid, and that you can then clearly explain to someone else what you have done and why.

So if my comment frightened you, then that's good! I fully support people learning new things and extending their skill sets, but I do not support the idea that learning how to get edgeR to spit something out is remotely sufficient. You also need to learn some statistics so you understand what you have done and can credibly explain it to someone else.

Entering edit mode

Thanks for your answer. (Next time I will post new thread).

Elsewhere on the Web I asked if the final logCPM in a DGE analysis table was a mean of the CPM of the samples compared together. One gave me the answer "no".

I am reassured by your answer. I was going to make a filter on positive logCPM, so it is not absurd. And maybe also to make a filter like logCPM > log2(3)

I totally agree that we have to learn deeper statistics but it is hard to find the appropriate lesson. This forum appears as the best place to ask our questions. I agree that some of them can be naive.

Via edgeR, that do the best job in NGS analysis according to literature, we have access and are encouraged to make our own settings and filters. So it raises questions.

For the core of the script like the "Generalized linear model", although we can document ourselves, we can do nothing more that believe the script does a good job !  :-)

It would be nice to find somewhere a kind of "edgeR for dummies" with plenty of details concerning stats and plots and interpretations. The nice existing 104p guide-book assumes that we are already statisticians.

Entering edit mode

Start a new question, this thread is getting old.


Login before adding your answer.

Traffic: 386 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6