Question

DEseq2 with limited gene set

0

Entering edit mode

dmr210 • 0

@dmr210-11293

Last seen 7.7 years ago

I am using DEseq2 to quantify the gene expression of a very limited set of non coding genes.

It might be obvious, but can I still trust the significance of the differential expression that I obtain, or would having a limited set of genes cause some genes that are not strongly differentially expressed to appear as significantly differentially expressed?

How does selecting a part of the genes only impact the final assessment of the differential expression - so if I was running the same differential expression with these genes along with a full annotation set such as all the ensembl genes, would I still find the same genes as significantly differentially expressed? (without thinking of the obvious change due to the quantification of the raw reads itself which would undoubtedly be slightly different due to gene overlap)

Many thanks, Delphine

DEseq2 differential gene expression • 3.0k views

ADD COMMENT • link 7.7 years ago dmr210 • 0

1

Entering edit mode

How many genes are in the set you are interested in? Did you do a full RNA-seq experiment, or did the assay only target the limited set? If you only targeted the limited set with the assay, do you have some control genes or spike ins for normalization?

ADD REPLY • link 7.7 years ago Michael Love 41k

0

Entering edit mode

Thanks for your answer. There are about 3600 genes in that set. We did do a full RNA-seq experiment.

My data is not stranded, which means that if I use a full annotation such as ensembl along with these 3600 genes of interest and quantify using Htseq-count a lot of my genes of interest are considered as ambiguous, hence separating them from the full data. (I wanted to avoid using a quantification method that would count the read twice when ambiguous)

ADD REPLY • link 7.7 years ago dmr210 • 0

score 4 · Accepted Answer · 2016-08-19

4

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

In this case I'd quantify using a tool like salmon (or kallisto) using the full annotation set still. It sounds like you are ignoring "normal" genes that overlap with the genes you're interested and just taking their quantitation at face value, but that's not correct either.

salmon or kallisto will do their best to quantify these regions using the unique parts of a gene to inform where ambiguous reads align to -- this is likely the best you can do for now.

Then use the tximport package to load your data back in at the gene level, and continue with differential expression. If you only care about a subset of genes, I'd still run the analysis using all of your data, then subset down to the genes of interest after you've done the test. You can readjust your adjust pvalues using the raw pvalues of the subset of genes you're analyzing downstream.

ADD COMMENT • link 7.7 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

I agree with Steve's answer :)

If you just want to test a subset you can do:

res <- results(dds)
res.sub <- res[idx,]  # index the genes you are interested in
res.sub$padj <- p.adjust(res.sub$pvalue, method="BH")

As you are not using the normal independent filtering routine to adjust p-values, you may also want to pre-filter low counts genes, e.g. require that the row sum of count is > 5, before running DESeq(). Either that or you can use the IHW package for hypothesis weighting afterwards.

ADD REPLY • link 7.7 years ago Michael Love 41k

0

Entering edit mode

Thank you both for your quick answers.

I will look into salmon or kallisto for quantification, thanks for the suggestion.

In case that quantification still leads to issues due to overlap, could you help me understand to what extent using DEseq2 on the limited set of genes would not be a good idea?

So if we forget about the whole quantification bias (which I agree is a big one, and you are probably right, using something that can distinguish between overlapping transcripts like salmon would be much better than my current approach), how does using only a part of the genes affect the results of the differential expression?

I can intuitively understand that less genes means less overall information for DEseq, but I am not sure how that affect it exactly.

Thanks very much again.

ADD REPLY • link 7.7 years ago dmr210 • 0

1

Entering edit mode

For the estimation of the dispersion prior, 100s of genes is sufficient.

However, you should use all the genes to estimate size factors, unless you know that the genes in your set are mostly not changed by treatment.

ADD REPLY • link 7.7 years ago Michael Love 41k

0

Entering edit mode

Perfect, thanks very much for your help.

ADD REPLY • link 7.7 years ago dmr210 • 0

0

Entering edit mode

Hi Michael,

I have a similar question only in my case I need to better understand why in current pipelines one must pay multiple testing corrections in two different steps 1.During DGE 2. GSE analysis.

Therefore, I need to understand if subsetting genes by the biological processes that I am interested in studying ( assuming these are >100 genes) would violate DESeq2 statistical assumptions.

ADD REPLY • link 3.9 years ago DataFanatic ▴ 10

0

Entering edit mode

Sorry I don't see the concrete question. For gene set enrichment, I'd recommend making a new post about a specific gene set testing method.

Re: DESeq2, what is your question here? I said above that you need to estimate size factors. How will you do this with your subset?

ADD REPLY • link 3.9 years ago Michael Love 41k

0

Entering edit mode

Does subsetting genes by the biological processes that I am interested in studying ( assuming these are >100 genes) violate DESeq2 statistical assumptions.

ADD REPLY • link 3.9 years ago DataFanatic ▴ 10

0

Entering edit mode

Yes it does. You cannot estimate sequencing depth appropriately.

ADD REPLY • link 3.9 years ago Michael Love 41k

0

Entering edit mode

I see. But I still don't fully understand why your answer here was:

For the estimation of the dispersion prior, 100s of genes is sufficient.

However, you should use all the genes to estimate size factors, unless you know that the genes in your set are mostly not changed by treatment.

I apologize if this is a basic question and would truly appreciate your elaborate answer.

ADD REPLY • link 3.9 years ago DataFanatic ▴ 10

0

Entering edit mode

Sorry, unfortunately, I don't have time to go into this further here. I would recommend to collaborate with a bioinformatician familiar with RNA-seq normalization and analysis.

The issue of estimation of sequencing depth, and why you can't only look at the set of genes that (may be) affected by condition, is also likely covered in the RNA-seq normalization literature.

ADD REPLY • link 3.9 years ago Michael Love 41k