Different filters with edgeR
2
0
Entering edit mode
@cpcantalapiedra-7532
Last seen 6.3 years ago
Spain

This is a head of my dataset, with expected counts from RSEM, with 3 biological replicates of 2 samples (control and treatment), as can be seen in the group variable below.

> head(rsem, 10)
X4_7 X7_9 X8_13 X7_10 X4_27 X5_25
comp0_c0_seq1         0    0     0     0     0     0
comp100001_c0_seq1    0    1     3     2     1     1
comp100006_c0_seq1    2    0     0     0     0     2
comp100009_c0_seq1    0    1     0     0     0     0
comp100010_c0_seq1    0    0     0     0     0     0
comp100012_c0_seq1    0    3     0     3     1     0
comp100013_c0_seq1    0    0     0     3     3     6
comp100017_c0_seq1    0    0     0     0     0     0
comp100018_c0_seq1    2    0     2     3     5     4
comp100019_c0_seq1    0    0     0     3     0     0
> group
[1] CS CS CS TS TS TS
Levels: CS TS

I have 303872 tags in total

If I do not apply any filter I get Disp = 1.28528; BCV = 1.1337 and DE tags:

 [,1]
-1    180
0  303530
1     162

With the next filter (CPM > 1 in at least 3 samples):

orig_filter <- function(x){if (rowSums(t(x>1))>=3) TRUE else FALSE;}

I obtain Pass-filter records: 44,256; Disp=0.5114; BCV=0.7151. DE tags:

[,1]
-1   207
0  43910
1    139

Now I apply a different one in which a gene has all CPMs either > 0 or = 0 for all the replicates in both treatments:

edit: the filter function was unclear. I completed the function and, as a comment, the tags for which the function return TRUE are the pass-filter tags, while if the function returns FALSE the tags are filtered-out.

filter_1 <- function(x){if (((x[1]==0 & x[2]==0 & x[3]==0) | (x[1]>0 & x[2]>0 & x[3]>0)) &
((x[4]==0 & x[5]==0 & x[6]==0) | (x[4]>0 & x[5]>0 & x[6]>0))) return TRUE else return FALSE}

Obtaining Pass-filter records: 162,651; Disp=0.22766; BCV=0.4771. DE tags:

[,1]
-1   1000
0  160308
1    1343

The MDS before filtering:

And after filtering by CPMs:

After filtering based on zero values:

I use the plots with all the datapoints but paiting the ones to be filtered out in red:

BCV - CPM plots. CPM based filter (right) and zero values based filter (left):

Smear plots:

Volcano plots:

Finally, volcano plots with the data after being filtered out. In red, those declared as DE (pvalue 0.05, method="BH"):

Questions:

- Which filter should be applied?

- Is there any technical/statistical problem with the zero based filter? What are the different "trends" that are observed in the BCV plot and, should they be separated and analyzed as different datasets with their own BCV fitting?

- In case of using the zero based filter, should I apply a second filter based on CPMs?

Note: I did a regression of rsem expected counts and CPMs and the slope is around 25 in my data.

Thank you.

edger filtering • 1.8k views
1
Entering edit mode
@gordon-smyth
Last seen 18 minutes ago
WEHI, Melbourne, Australia

I don't know why there are different trends in your BCV plot. I have never seen such trends before. It suggests a basic problem with your data, but I have no way of knowing what that problem might be.

You have over 300 million tags, which seems a bit over the top. Are the tags transcripts perhaps? Would you consider summarizing at the gene level instead? That would give more stable results. I usually find it more rewarding to work at the gene level, using subread and featureCounts to get read counts.

Your "zero based filter" doesn't seem sensible to me. Amongst other things, it will remove tags that are zero in one group and high in the other. These tags would seem to be of considerable interest in a DE analysis, so why would you want to remove them? (Edit: the definition of the "zero based filter" is unclear to me and I have probably misinterpreted it. I will post a new response as a separate answer.)

We do not recommend that you necessarily filter using a theshold of 1 for the cpm. You should instead choose the cutoff so you keep tags with a reasonable number of counts, say at least 5-10 in at least 3 libraries.

The variability between your biological replicates is enormous. Why are the samples so inconsistent? This would worry me if it was my experiment.

You are overriding the default MDS plot method. Please use the default. It will be more robust for your experiment.

0
Entering edit mode

"I don't know why there are different trends in your BCV plot. I have never seen such trends before. It suggests a basic problem with your data, but I have no way of knowing what that problem might be."

I don't know why, but they seem to come from the datapoints for which the "filter_1" function return FALSE.

"You have over 300 million tags, which seems a bit over the top. Are the tags transcripts perhaps? Would you consider summarizing at the gene level instead? That would give more stable results. I usually find it more rewarding to work at the gene level, using subread and featureCounts to get read counts."

Yes, they are transcripts. However, the results for genes are somewhat similar, although yes, a bit more stable. Yes, maybe RSEM expected counts are not the best thing here. Moreover, the abundance numbers I have for genes are just the sum of the expected counts of their isoforms, so many of the datapoints that are filtered out by "filter_1" function are not filtered out when using genes.

"Your "zero based filter" doesn't seem sensible to me. Amongst other things, it will remove tags that are zero in one group and high in the other. These tags would seem to be of considerable interest in a DE analysis, so why would you want to remove them?"

I disagree that is going to remove tags that are zero in one group and high in the other. It is going to remove tags for which there is a replicate with 0 and another one high within a group, which is very different. In fact, you can see in the smear and volcano plots that there is a huge amount of observations with 0 in one group and CPM>0 in the other (which I have always thought that are the upper and lower tails of the smear plot).

"We do not recommend that you necessarily filter using a theshold of 1 for the cpm. You should instead choose the cutoff so you keep tags with a reasonable number of counts, say at least 5-10 in at least 3 libraries."

I have tried with CPM = 0.3 (maybe around 25/3 counts), and as consequence the BCV increases and the MDS plot becomes worse. I guess this happens because actually the CPM based filter is (indirectly) filtering many of the observations filtered out by the zero based filter. Note that even reducing the CPM threshold to 0.3 I am evaluating only 25,544 genes of around 102,000 ones, while with the zero based filter I am evaluating 64,712 of them.

"You are overriding the default MDS plot method. Please use the default. It will be more robust for your experiment."

I have been comparing both plots and the logFC based one is better in comparison with the BCV one, in the case of the CPM filter, but it becomes worse again when decreasing the CPM threshold, while the zero based filter gives the best plot for both logFC and BCV method.

1
Entering edit mode
@gordon-smyth
Last seen 18 minutes ago
WEHI, Melbourne, Australia

I found the definition of the "zero based filter" to be unclear, because the code you give for filter_1 is not a complete function and because it wasn't immediately clear to me whether the function was intended to select tags to be kept or tags to be filtered. I will assume that the purpose of the filter is to remove any tag that has both zero and non-zero counts for either of the two groups.

It is a basic requirement of any filter that it should be independent of differential expression. In particular, a filter should not depend on knowledge of the treatment groups in your experiment. If it doesn't then it is highly likely that you are biasing the dispersion estimates or the p-values in any subsequent analysis.

Your filter uses knowledge of which sample belongs to each group. Hence it violates the statistical requirements that a filter much satisfy, and it is likely that many of the extra DE tags you get with this filter may be false discoveries.

If you are concerned about outliers in your data (a single large count in a group, or a single zero in a group), then a far better approach would be to use the edgeR function estimateGLMRobustDisp(), which is specially designed to handle data like this.

0
Entering edit mode

You are right, the function was unclear. I edited it.

I feel like not filtering is biasing even more the dispersion estimates, and like I am having a lot of false negatives. But I have no means to demonstrate (although a little qRT-PCR comparison seems to support this, it is maybe too small).

This is how the plots look after filtering. In red the DE tags.

CPM based filter:

Zero based filter:

I will of course try the estimateGLMRobustDisp.

0
Entering edit mode

After reading the paper of the robust function (which I guess is Zhou et al, NAR 2014), are you sure such a method does not use knowledge of which sample belongs to each group? As far as I understand, the weights are calculated within group. Nevertheless, I am not sure whether such a method can be reliably applied to samples with just 2 or 3 replicates. This prevents as well the use of the non-parametric method of Li et al, SMMR 22(5) 519-536. Finally, the weights from Zhou et al are somewhat a (partial) filter of the observations that are considered outliers (figure 3c), and, from the paper "It is somewhat of a philosophical decision as to whether to completely filter out features or to down-weight them; the observation weight strategy allows both."

0
Entering edit mode

Robust weights are not equivalent to gene filtering. Unlike filtering, the fact that the robust weights are estimated from the data is taken into account as part of the statistical method.

Robust weights may or may not be fully reliable with groups of size 3, but they are likely nevertheless to be better than selecting outliers based on a very ad hoc rule, which is essentially what you were proposing.

0
Entering edit mode

Well, I am not proposing nothing, really. That is far far beyond my knowledge. I am just asking questions.

Weights seem to be used for dispersion estimation and regression coefficients calculation. I guess I will check what values are assigned to the observations I am just filtering out. Sincerely, I am not sure whether I would believe in any weight very different to 0 for them (for example, in the case of 2 replicates with CPMs 0 and 10; or 3 replicates with 10, 40 and 0). In the case that the weights were 0, does that mean that they are not going to be declared as DE but including them helps to control the FDR? Is that what you mean?