Question

Independent filtering of NGS data

0

Entering edit mode

Rubén • 0

@ruben-24613

Last seen 5.0 years ago

Hello all,

This is generic question, not really dependent on code. I tag DESeq2 because that is what I am using and it would be ideal to reach a solution using it, but the main question is about independent filtering. I have a dataset with two treatments and 4 biological replicates each. This is an amplicon sequencing experiment where all amplicons are the same size, and the total number of amplicons is ~200K. Similarly to a CRISPR, within each biological replicate I have internal replicates, which are amplicons with different DNA sequence that I would expect to give the same phenotype. So there are 3 of these internal replicates per amplicon, and therefore 3 internal reps x 66K amplicon groups = 200K amplicons in total. The problem I am facing is a very strong dispersion between the biological replicates of the treated samples (the control samples have a good correlation, similar to other datasets I am working with). I have looked at the internal replicates, and seen that there is a lot of variation between them as well. So I am assuming that the experimental system is very noisy. A comparison of data between independent experiments suggests that I have many false positives.

I have been thinking about the independent filtering approach described in DESeq2 and the corresponding filtering paper, so I would like your opinion about this:

would it be OK to discard the amplicons with high variation between internal replicates, and repeat the analysis with the ones with low variation? I understand that doing this would be independent from the null hypothesis (OK according to the paper), but not correlated to the alternative (not OK according to the paper).
is there a way to include this internal replication in the DESeq2 model? If not, could you give me a pointer to literature to approach this? The difference between my experiment and a CRISPR one is that in CRISPR the guides target different parts of a gene, so they are different. In my case, the DNA sequence of the internal replicates is different, but they all encode for the same peptide, so other than differences between codons I would expect low variability in the phenotype for the internal reps within each biological rep.

Thanks a lot, Ruben

independentfiltering DESeq2 CRISPR • 1.7k views

ADD COMMENT • link 5.0 years ago Rubén • 0

score 2 · Answer 1 · 2021-01-25

1) We don't recommend to use within-group variance for independent filtering.

2) "Is there a way to include this internal replication in the DESeq2 model?" If the structure is nested within conditions, as in a random effect, we don't have a way to fit this. Limma has a duplicteCorrelation function, so you could use limma-voom to model replicate structure within condition with that pipeline.

score 0 · Answer 2 · 2021-01-31

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 4 months ago

EMBL European Molecular Biology Laborat…

The fact that you are "facing .... a very strong dispersion between the biological replicates of the treated samples" suggests that the DESeq2 (or similar, like edgeR, limma) approach of finding consistent effects within biological replicates and between conditions may not immediately be the best one, i.e. would lead to too many FPs, FNs, or a suboptimal ranking of your hits. Maybe you can fix this by:

first doing exploratory data analysis on your replicates and from that decide which ones "worked" and which ones did not; and then revisit DESeq2 et al.
or coming up with a different, hand-crafted test statistic such as using the 2nd- or 3rd-strongest effect among your replicates (i.e., essentially, quantiles), and some similar rule between your 3 amplicons per gene. This wouldn't give you p-values, but a ranking, and you could tune the thresholds and parameter choices based on the known positive and negative control genes (which I am sure you have). Such rules could also include filtering ideas such as you mentioned, as long as you give up on the requirement to get p-values, and instead estimate the FPR, FNR, FDR of your analysis algorithm using known control genes and are transparent about it in the reporting.

Hope this helps, Wolfgang

PS DESeq2 is a wonderful hammer, but not everything is a nail.

ADD COMMENT • link 5.0 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Hello Wolfgang, Thank you for the comments (and thank you also Michael!). First I have to mention that I come from a biology background, which is great because it gives me the understanding of the biology and the lab-technical implications of what we are doing. However my data analysis background I am in the process of building, which makes me wary about being creative. I am looking forward to having the tools, like I have in molecular biology, that allow me to control that the results I get are likely to be real and not an artifact of wrong assumptions. I have the exploratory analysis, and I can say there is no good reason to discard any biological rep. My PCA showed a potential outlier biorep, but I realised that most of the variation came from a range of amplicons with low counts. Once I filtered them out, the outlier was no more. I confirmed that by looking at replicate-to-replicate heat scatterplots with Spearman correlation, and comparing each rep against the log mean of all reps from the same treatment in unfiltered data.

I like the idea about the ranking, and I have those pos/neg controls in place. Perhaps you could elaborate a little bit more on "the 2nd- or 3rd-strongest effect among your replicates"? And what would be the dataset to work with, an rlog transformation of the normalised counts? Something I tried was to get a local z-score (log2FoldChange/lfcSE) for each amplicon in my dataset, filter the amplicons with good signal, filter again those with low coefficient of variation between the internal replicates, and then rank by median log2FoldChange. Visually, they make sense on the MA plots: I get a denser cloud of amplicons around the log2FoldChange=0, and they look like potential true DE amplicons. But to see that I am looking just at the population of amplicons with low variation between internal replicates, and discarding the rest of the data. From Michael's answer I see that it's not recommended to run DESeq2 on the filtered dataset, but does it work for ranking on the lines you mention?

Very nice, that comment about hammer an nail. That's what I want to build, a good toolbox! I find it especially rewarding to become a bridge that can communicate biologists with statisticians.

Best, Ruben

ADD REPLY • link 5.0 years ago Rubén • 0