Hi all,
I'm using edgeR for DEG analysis and have run into a snag with my filtering approach. I have an experimental design wherein individuals from two different populations were treated with a drug or left untreated. This is not a repeated measures, and replicates are not shared between treated and untreated for a populations. The populations are A and B, and the treatments are T and U (untreated), such that we have four levels: A_T, A_U, B_T, and B_U. There are seven replicates in each of the four groups (n=28 for the whole study), and I'm filtering with:
y <- DGEList(counts=data,group=group) nrow(y$counts) #[1] 44112 keep_7 <- rowSums(cpm(y)>1) >=7 y_7 <- y[keep_7, , keep.lib.sizes=FALSE] nrow(y_7$counts) #[1] 24931
So, filtering (if I've interpreted this correctly), is keeping any gene with cpm>1 in a minimum of 7 libraries (of the 28 total). This is being done without any consideration for which of the four groups those required 7 libraries might belong to. My goal is not only to remove low-expression genes, but to also avoid spurious DEGs that are being driven by one library in a treatment group with crazy-high expression but with zeroes for the remaining six libraries in that group.
I have three experimental goals that I'm trying to answer with this dataset, and therein is my confusion with filtering. Goal (1) is to describe basic differences between populations absent any treatment. This would be A_U versus B_U. Goal (2) is to describe how each population responds to treatment independently (A_T versus A_U; B_T versus B_U). And goal (3) is to describe the interaction between population and treatment. I'm specifying a list of contrasts to identify DEGs for each goal. Each group is given a letter designation (A,B,C,D)
my.contrasts <- makeContrasts(A_TvsA_U=groupB-groupA, B_TvsB_U=groupD-groupC, B_UvsA_U=groupC-groupA, A_TvsA_UvsB_TvsB_U=groupB-groupA-groupD+groupC, levels=design)
What I'm worried about happening is that genes will pass this filter based on their expression in either the treated or the untreated groups and lead to spurious DEGs. To put this in an example: I detect a number of DEGs in goal (2) [A_U versus B_U] that are being driven by one very high read count in A_U, while all the other reads are 0. This gene would have passed the filter because it's expressed in all of the treated samples and ONE of the untreated samples.
What is the ideal approach here? To increase the library threshold above 7? Or to create an entirely independent DGElist object with JUST the untreated samples and use the same filtering approach already described? I would worry that this would rob power from our analysis overall.
Thank you!
Hi, James
Thanks for the quick response. I will look into filterByExpr immediately. And I understand the distinction, and I'll have to look into some means to address how many DEGs are coming back post-filtering that might be due to one very high sample.
In fact, the inspiration for the post was a previous attempt using a different filtering approach (average counts > 10, no minimum sample representation) that, with the more conservative glmQLFit, was returning ~40% of total genes as significant in the A_U versus B_U contrast. Manually plotting CPM data for some of the highest logFC objects revealed a number that were being driven by a single high count. After that, I tried adjusting filtering methods in hopes of mitigating that issue, adopting the approach described above. And now we're getting an even higher proportion (but I haven't plotted anything, yet).