This is a head of my dataset, with expected counts from RSEM, with 3 biological replicates of 2 samples (control and treatment), as can be seen in the group variable below.
> head(rsem, 10) X4_7 X7_9 X8_13 X7_10 X4_27 X5_25 comp0_c0_seq1 0 0 0 0 0 0 comp100001_c0_seq1 0 1 3 2 1 1 comp100006_c0_seq1 2 0 0 0 0 2 comp100009_c0_seq1 0 1 0 0 0 0 comp100010_c0_seq1 0 0 0 0 0 0 comp100012_c0_seq1 0 3 0 3 1 0 comp100013_c0_seq1 0 0 0 3 3 6 comp100017_c0_seq1 0 0 0 0 0 0 comp100018_c0_seq1 2 0 2 3 5 4 comp100019_c0_seq1 0 0 0 3 0 0
> group [1] CS CS CS TS TS TS Levels: CS TS
I have 303872 tags in total
If I do not apply any filter I get Disp = 1.28528; BCV = 1.1337 and DE tags:
[,1] -1 180 0 303530 1 162
With the next filter (CPM > 1 in at least 3 samples):
orig_filter <- function(x){if (rowSums(t(x>1))>=3) TRUE else FALSE;}
I obtain Pass-filter records: 44,256; Disp=0.5114; BCV=0.7151. DE tags:
[,1] -1 207 0 43910 1 139
Now I apply a different one in which a gene has all CPMs either > 0 or = 0 for all the replicates in both treatments:
edit: the filter function was unclear. I completed the function and, as a comment, the tags for which the function return TRUE are the pass-filter tags, while if the function returns FALSE the tags are filtered-out.
filter_1 <- function(x){if (((x[1]==0 & x[2]==0 & x[3]==0) | (x[1]>0 & x[2]>0 & x[3]>0)) & ((x[4]==0 & x[5]==0 & x[6]==0) | (x[4]>0 & x[5]>0 & x[6]>0))) return TRUE else return FALSE}
Obtaining Pass-filter records: 162,651; Disp=0.22766; BCV=0.4771. DE tags:
[,1] -1 1000 0 160308 1 1343
The MDS before filtering:
And after filtering by CPMs:
After filtering based on zero values:
I use the plots with all the datapoints but paiting the ones to be filtered out in red:
BCV - CPM plots. CPM based filter (right) and zero values based filter (left):
Smear plots:
Volcano plots:
Finally, volcano plots with the data after being filtered out. In red, those declared as DE (pvalue 0.05, method="BH"):
Questions:
- Which filter should be applied?
- Is there any technical/statistical problem with the zero based filter? What are the different "trends" that are observed in the BCV plot and, should they be separated and analyzed as different datasets with their own BCV fitting?
- In case of using the zero based filter, should I apply a second filter based on CPMs?
Note: I did a regression of rsem expected counts and CPMs and the slope is around 25 in my data.
Thank you.
"I don't know why there are different trends in your BCV plot. I have never seen such trends before. It suggests a basic problem with your data, but I have no way of knowing what that problem might be."
I don't know why, but they seem to come from the datapoints for which the "filter_1" function return FALSE.
"You have over 300 million tags, which seems a bit over the top. Are the tags transcripts perhaps? Would you consider summarizing at the gene level instead? That would give more stable results. I usually find it more rewarding to work at the gene level, using subread and featureCounts to get read counts."
Yes, they are transcripts. However, the results for genes are somewhat similar, although yes, a bit more stable. Yes, maybe RSEM expected counts are not the best thing here. Moreover, the abundance numbers I have for genes are just the sum of the expected counts of their isoforms, so many of the datapoints that are filtered out by "filter_1" function are not filtered out when using genes.
"Your "zero based filter" doesn't seem sensible to me. Amongst other things, it will remove tags that are zero in one group and high in the other. These tags would seem to be of considerable interest in a DE analysis, so why would you want to remove them?"
I disagree that is going to remove tags that are zero in one group and high in the other. It is going to remove tags for which there is a replicate with 0 and another one high within a group, which is very different. In fact, you can see in the smear and volcano plots that there is a huge amount of observations with 0 in one group and CPM>0 in the other (which I have always thought that are the upper and lower tails of the smear plot).
"We do not recommend that you necessarily filter using a theshold of 1 for the cpm. You should instead choose the cutoff so you keep tags with a reasonable number of counts, say at least 5-10 in at least 3 libraries."
I have tried with CPM = 0.3 (maybe around 25/3 counts), and as consequence the BCV increases and the MDS plot becomes worse. I guess this happens because actually the CPM based filter is (indirectly) filtering many of the observations filtered out by the zero based filter. Note that even reducing the CPM threshold to 0.3 I am evaluating only 25,544 genes of around 102,000 ones, while with the zero based filter I am evaluating 64,712 of them.
"You are overriding the default MDS plot method. Please use the default. It will be more robust for your experiment."
I have been comparing both plots and the logFC based one is better in comparison with the BCV one, in the case of the CPM filter, but it becomes worse again when decreasing the CPM threshold, while the zero based filter gives the best plot for both logFC and BCV method.