Your first question is unanswerable in a general sense, because there is no way to say what the 'correct' filtering criterion is. Any analysis is based on assumptions that the analyst makes about things she doesn't really know the answer to. For example, you are filtering out any gene with an average log CPM less than 1. Is that correct? Who knows? You could use plotBCV to see at what average log CPM the data start to look less reliable, but that is just an eyeballometric measure, where you decide that below some cutoff the values start to look as if noise is predominating.
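As a minimal sketch (assuming y is your DGEList and design is your design matrix, neither of which you have shown), the plot could be produced like this:

library(edgeR)
y <- estimateDisp(y, design)  # dispersion estimates are needed before plotting
plotBCV(y)                    # biological coefficient of variation vs average logCPM

Genes at low average logCPM typically show inflated BCV, which is what the eyeballing is looking for.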
For the second question, I find myself struggling to answer you (even though I have already done so). This is pretty much the simplest part of the analysis to understand, and you are having problems saying in words what you have done. How do you expect to explain the generalized linear model with a quasi-likelihood dispersion estimate? Or the quasi-likelihood F-test? Those concepts are orders of magnitude more difficult to understand, let alone explain to an unsophisticated audience.
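For context, those terms refer to the standard edgeR quasi-likelihood pipeline, which in sketch form (assuming y and design as above, and that the comparison of interest is the second coefficient; this is not necessarily your exact code) looks like:

library(edgeR)
y <- estimateDisp(y, design)      # negative binomial dispersion estimates
fit <- glmQLFit(y, design)        # GLM fit with quasi-likelihood dispersion
qlf <- glmQLFTest(fit, coef = 2)  # quasi-likelihood F-test
topTags(qlf)                      # top differentially expressed genes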
I understand wanting to do the analysis yourself, and certainly having access to free, world-class analysis tools makes it easy to attempt. But most people go to a lawyer for their lawyering, and a doctor for their doctoring. If I were you, I would seriously consider finding a local statistician to (at the very least) help you with your analyzing.
I have done the analysis myself and looked at my results, plotBCV and all that. They look good, I have reasonable results, and I have written the paper and am ready to submit. The statistician and two other scientists read the paper, where I explained the QL dispersion and F-tests, and approved it. However, I don't have a bioinformatician, so I wanted to double-check my code and make sure that I did not miss any criteria for the filtering, etc. This was only a double check on the code, as well as on the scientific explanation of the filtering; maybe I am being too concerned about this! Thanks for taking your time to write that (irrelevant) response.
The filtering step you have done is perfectly reasonable. When you write it for publication you might say:
"Genes were filtered from the analysis if their average log2 count per million (as computed by edgeR's aveLogCPM function) was negative. This had the effect of keeping genes with an average count of about 5 or more per sample."
In other words, the secret is to describe precisely what you did.
Note: this corresponds to

keep <- aveLogCPM(y) > 0

which is what you wrote in your original post. However, you later said you had used

keep <- aveLogCPM(y) > 1

when you replied to Steve. The latter would be more conservative, and would correspond to an average count of about 10 per sample.
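For completeness, a minimal sketch of applying either cutoff (assuming y is the DGEList; keep.lib.sizes=FALSE recomputes the library sizes after subsetting):

keep <- aveLogCPM(y) > 0               # average count of about 5 per sample
# keep <- aveLogCPM(y) > 1             # more conservative: about 10 per sample
y <- y[keep, , keep.lib.sizes=FALSE]   # subset and recompute library sizes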
I have almost exactly the same question, but I used cpm instead of aveLogCPM, following the official user's guide (p. 11):
keep <- rowSums(cpm(y) > 1) >= 3       # CPM above 1 in at least 3 samples
y <- y[keep, , keep.lib.sizes=FALSE]   # drop filtered contigs, recompute library sizes
I chose 3 because I have 3 replicates per condition.
By doing this I go from 60,000 to 20,000 contigs. But I have trouble understanding why, later on, I still see very low logCPM values (-0.86) in my "condition 1 vs condition 2" DGE table. Do I have to filter again at this point, and where should I set the limit on these logCPM values? At 0? At 1.585 (= log2(3))?
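For reference, here is a small invented illustration, with made-up counts and library sizes of 40 million reads per sample, of how a contig can pass this filter yet still show a negative average logCPM, because the average also includes the samples where the contig is essentially absent:

library(edgeR)
counts <- matrix(c(41, 41, 41, 0, 0, 0), nrow = 1)      # hypothetical contig
y1 <- DGEList(counts = counts, lib.size = rep(4e7, 6))  # made-up library sizes
rowSums(cpm(y1) > 1)   # 3, so the contig passes the rowSums filter
aveLogCPM(y1)          # roughly -0.8, pulled down by the three zero samples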
Note: I was a bit frightened by JW MacDonald's comment. I believe that the “open world” and all the other “open source” efforts draw their richness from their communities and exchanges.
One should think about it this way: “Do you have to be a car manufacturer to drive a car? No. You have to learn to drive, change a wheel, and put in gas.”
I am also working with the help of a statistician, but my goal is to become more autonomous. Unfortunately, some scientists (often the good ones) lack the ability to talk to an “unsophisticated audience”.