Question

Filtering, minimum sample number, and condition in edgeR

0

Entering edit mode

rproendo ▴ 30

@rproendo-17985

Last seen 7 months ago

United States

Hi all,

I'm using edgeR for DEG analysis and have run into a snag with my filtering approach. I have an experimental design wherein individuals from two different populations were treated with a drug or left untreated. This is not a repeated measures, and replicates are not shared between treated and untreated for a populations. The populations are A and B, and the treatments are T and U (untreated), such that we have four levels: A_T, A_U, B_T, and B_U. There are seven replicates in each of the four groups (n=28 for the whole study), and I'm filtering with:

y <- DGEList(counts=data,group=group)
nrow(y$counts)
#[1] 44112

keep_7 <- rowSums(cpm(y)>1) >=7
y_7 <- y[keep_7, , keep.lib.sizes=FALSE]
nrow(y_7$counts)
#[1] 24931

So, filtering (if I've interpreted this correctly), is keeping any gene with cpm>1 in a minimum of 7 libraries (of the 28 total). This is being done without any consideration for which of the four groups those required 7 libraries might belong to. My goal is not only to remove low-expression genes, but to also avoid spurious DEGs that are being driven by one library in a treatment group with crazy-high expression but with zeroes for the remaining six libraries in that group.

I have three experimental goals that I'm trying to answer with this dataset, and therein is my confusion with filtering. Goal (1) is to describe basic differences between populations absent any treatment. This would be A_U versus B_U. Goal (2) is to describe how each population responds to treatment independently (A_T versus A_U; B_T versus B_U). And goal (3) is to describe the interaction between population and treatment. I'm specifying a list of contrasts to identify DEGs for each goal. Each group is given a letter designation (A,B,C,D)

my.contrasts <- makeContrasts(A_TvsA_U=groupB-groupA, B_TvsB_U=groupD-groupC, B_UvsA_U=groupC-groupA, A_TvsA_UvsB_TvsB_U=groupB-groupA-groupD+groupC, levels=design)

What I'm worried about happening is that genes will pass this filter based on their expression in either the treated or the untreated groups and lead to spurious DEGs. To put this in an example: I detect a number of DEGs in goal (2) [A_U versus B_U] that are being driven by one very high read count in A_U, while all the other reads are 0. This gene would have passed the filter because it's expressed in all of the treated samples and ONE of the untreated samples.

What is the ideal approach here? To increase the library threshold above 7? Or to create an entirely independent DGElist object with JUST the untreated samples and use the same filtering approach already described? I would worry that this would rob power from our analysis overall.

Thank you!

edger filtering • 2.2k views

ADD COMMENT • link updated 5.9 years ago by Ryan C. Thompson ★ 7.9k • written 5.9 years ago by rproendo ▴ 30

score 1 · Answer 1 · 2018-11-30

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 2 days ago

United States

A more simple approach (for you) is to use filterByExpr to do the filtering. It has an additional filter for the total counts across all samples (default 15, which you could increase).

But do note that filtering isn't really intended to eliminate false positives, but instead is intended to get rid of genes that are probably not being expressed, and which will supply only noise to the estimate of BCV. It's not impossible to get significant genes that are being driven by a single sample with a high count in one sample, particularly if you are using the GLM approach rather than the quasi-likelihood approach. You don't show code, but if you are using e.g glmFit instead of glmQLFit and friends you might consider switching.

ADD COMMENT • link 5.9 years ago James W. MacDonald 67k

0

Entering edit mode

Hi, James

Thanks for the quick response. I will look into filterByExpr immediately. And I understand the distinction, and I'll have to look into some means to address how many DEGs are coming back post-filtering that might be due to one very high sample.

In fact, the inspiration for the post was a previous attempt using a different filtering approach (average counts > 10, no minimum sample representation) that, with the more conservative glmQLFit, was returning ~40% of total genes as significant in the A_U versus B_U contrast. Manually plotting CPM data for some of the highest logFC objects revealed a number that were being driven by a single high count. After that, I tried adjusting filtering methods in hopes of mitigating that issue, adopting the approach described above. And now we're getting an even higher proportion (but I haven't plotted anything, yet).

ADD REPLY • link 5.9 years ago rproendo ▴ 30

score 1 · Answer 2 · 2018-11-30

If you have many genes with many counts in one library and zero counts in all the replicates of that library, the first thing you should check is whether the outlier is the same library for all of these genes. If so, you may simply have a poor-quality library that consistently diverges from its replicates. You could discard this library from your analysis, or you could switch to using limma with the function voomWithQualityWeights and see if it assigns a lower weight to that library. One other thing to check is library size: if one library has 10x the counts of all other libraries, then it will detect many genes that have zero counts in the lower-depth libraries.

If you have many genes with a single outlier count, but the count does not always come from the same library, then you may have a different issue, which others can perhaps give you suggestions for.