DESeq2 + filtering low counts with NOISeq function
andreia ▴ 10
@andreia-23745
Last seen 23 months ago
Portugal

Hi there,

I am having some doubts about filtering low counts. I have read and re-read papers and forum questions, and I still have the same issue. I know that DESeq2 applies independent filtering in results(); however, after trying different alpha thresholds, my results differ in the number of DEGs but still do not quite include the genes my client is looking for :) I know, I know, that is the result; if a gene does not appear, it is because it is not supposed to! However, if I apply method 2 of the filtering in the NOISeq package, I do get the genes the client wants! That filter was applied before normalization and removed a lot of genes. My question is: can I use the NOISeq filter, or should I just report the results I got with the independent filtering from DESeq2?
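
For reference, this is roughly what the two approaches look like (a minimal sketch; dds, raw_counts, and myfactors stand in for my actual objects, and the cutoffs are just illustrative):

    library(DESeq2)
    library(NOISeq)

    ## Approach 1: let DESeq2's independent filtering in results() handle low counts.
    ## 'dds' is assumed to be a DESeqDataSet already run through DESeq().
    res <- results(dds, alpha = 0.05)   # independent filtering is optimized for this alpha
    summary(res)                        # reports how many genes were removed by the filter

    ## Approach 2: pre-filter the raw counts with NOISeq's filtered.data()
    ## (method = 2, the one mentioned above) before building the DESeq2 object.
    ## 'raw_counts' is a raw count matrix, 'myfactors' a data.frame with a 'condition' column.
    keep <- filtered.data(raw_counts,
                          factor = myfactors$condition,
                          norm   = FALSE,
                          method = 2)
    dds2 <- DESeqDataSetFromMatrix(countData = keep,
                                   colData   = myfactors,
                                   design    = ~ condition)
    dds2 <- DESeq(dds2)
    res2 <- results(dds2, independentFiltering = FALSE)  # filter already applied upstream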

Thanks in advance.

deseq2 noiseq lowcounts rnaseq • 1.6k views
@mikelove
Last seen 23 hours ago
United States

I advise users not to try too many analysis choices when a certain outcome is desired. See: “garden of forking paths” and issues with replication.


Thank you for mentioning the "garden of forking paths" in relation to this issue.

However, I have trouble translating this into bioinformatics. Do the article and your comment mean that a researcher should not perform two separate DGE analyses with a parametric and a non-parametric tool, such as DESeq2 and NOISeq, because there is the risk of unintentional p-hacking, or is it considered unintentional p-hacking to perform both analyses even when both results are reported? Or is it purely about how directly you draw conclusions from the data?

I prefer and recommend testing any bioinformatic results experimentally anyway, since the analysis workflow also affects the results and can lead to false positives.


"because there is the risk of unintentional p-hacking or is it considered unintentional p-hacking to perform both analyses"

Yes, exactly, it's not a good idea to try running multiple methods to see which comes closer to the expected outcome.

"I prefer and recommend to test any bioinformatic results experimentally"

I likewise prefer to test bioinformatic tools on experimental datasets where truth is known (at least partially). Benchmarking methods is one of my main research interests. But it's critical to do this type of testing on a pilot dataset or on an independent benchmarking dataset.

I don't recommend testing multiple tools or options on the primary dataset and then reporting only the results that were in concordance with prior expectations.


Thank you for your answer. I am just wondering because parametric and non-parametric methods produce somewhat different results anyway, and I've been told that non-parametric methods should not be used alone for DGE. If parametric DGE tools are generally preferred, when, in your opinion, is it OK to use a non-parametric method as the only method?

Honestly, I couldn't understand why using two different methods is bad if all results are reported and discussed in the eventual publication, as in Kim et al. 2019, "Transcriptome analysis after PPARγ activation in human meibomian gland epithelial cells (hMGEC)" (DOI: 10.1016/j.jtos.2019.02.003).

EDIT: I want to stress that I am talking about the scenario of performing two independent analyses, not one analysis that combines the filtering method from the other, etc.


Nothing especially concerning about nonparametric methods in my opinion, except that with very low sample size they lose sensitivity. E.g. you won't be able to detect differences in 2 vs 2, where the parametric models can eke out an answer by borrowing information about parameters across features.

We have been using a method built upon SAMseq recently, and it performs well and competitively with parametric models in 5 vs 5 and so on. When the distribution of data is not a good match to a parametric distribution, sometimes the nonparametric methods can outperform in terms of sensitivity at a fixed error rate.
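
For reference, a plain SAMseq call from the samr package looks roughly like this (a sketch; 'counts' and 'group' are placeholders, and this is vanilla SAMseq, not the SAMseq-derived method mentioned above):

    library(samr)

    ## Nonparametric two-class test on a raw count matrix (genes x samples);
    ## 'counts' and 'group' (a vector of 1s and 2s) are illustrative placeholders.
    fit <- SAMseq(counts, group,
                  resp.type  = "Two class unpaired",
                  geneid     = rownames(counts),
                  fdr.output = 0.05)

    ## Genes called significant (up- and down-regulated) at the chosen FDR:
    fit$siggenes.table$genes.up
    fit$siggenes.table$genes.lo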

I don't have much more to say than what I've already said about trying out multiple methods on the same dataset. Again, I think this is useful for benchmarking on pilot experiments or on independent datasets, but I don't recommend this forking-paths approach on the primary dataset. Running methods A, B, C, etc. and then seeing which set of results conforms most to prior expectations on the primary dataset is problematic to me.


Thanks for your fast reply.
