Question: DEseq2 with limited gene set
gravatar for dmr210
18 months ago by
dmr2100 wrote:

I am using DEseq2 to quantify the gene expression of a very limited set of non coding genes.

It might be obvious, but can I still trust the significance of the differential expression that I obtain, or would having a limited set of genes cause some genes that are not strongly differentially expressed to appear as significantly differentially expressed?

How does selecting a part of the genes only impact the final assessment of the differential expression - so if I was running the same differential expression with these genes along with a full annotation set such as all the ensembl genes, would I still find the same genes as significantly differentially expressed? (without thinking of the obvious change due to the quantification of the raw reads itself which would undoubtedly be slightly different due to gene overlap)

Many thanks, Delphine

ADD COMMENTlink written 18 months ago by dmr2100

How many genes are in the set you are interested in? Did you do a full RNA-seq experiment, or did the assay only target the limited set? If you only targeted the limited set with the assay, do you have some control genes or spike ins for normalization?

ADD REPLYlink written 18 months ago by Michael Love16k

Thanks for your answer. There are about 3600 genes in that set. We did do a full RNA-seq experiment.

My data is not stranded, which means that if I use a full annotation such as ensembl along with these 3600 genes of interest and quantify using Htseq-count a lot of my genes of interest are considered as ambiguous, hence separating them from the full data. (I wanted to avoid using a quantification method that would count the read twice when ambiguous)

ADD REPLYlink written 18 months ago by dmr2100
gravatar for Steve Lianoglou
18 months ago by
Steve Lianoglou12k wrote:

In this case I'd quantify using a tool like salmon (or kallisto) using the full annotation set still. It sounds like you are ignoring "normal" genes that overlap with the genes you're interested and just taking their quantitation at face value, but that's not correct either.

salmon or kallisto will do their best to quantify these regions using the unique parts of a gene to inform where ambiguous reads align to -- this is likely the best you can do for now.

Then use the tximport package to load your data back in at the gene level, and continue with differential expression. If you only care about a subset of genes, I'd still run the analysis using all of your data, then subset down to the genes of interest after you've done the test. You can readjust your adjust pvalues using the raw pvalues of the subset of genes you're analyzing downstream.

ADD COMMENTlink written 18 months ago by Steve Lianoglou12k

I agree with Steve's answer :)

If you just want to test a subset you can do:

res <- results(dds)
res.sub <- res[idx,]  # index the genes you are interested in
res.sub$padj <- p.adjust(res.sub$pvalue, method="BH")

As you are not using the normal independent filtering routine to adjust p-values, you may also want to pre-filter low counts genes, e.g. require that the row sum of count is > 5, before running DESeq(). Either that or you can use the IHW package for hypothesis weighting afterwards.

ADD REPLYlink written 18 months ago by Michael Love16k

Thank you both for your quick answers.

I will look into salmon or kallisto for quantification, thanks for the suggestion.

In case that quantification still leads to issues due to overlap, could you help me understand to what extent using DEseq2 on the limited set of genes would not be a good idea?

So if we forget about the whole quantification bias (which I agree is a big one, and you are probably right, using something that can distinguish between overlapping transcripts like salmon would be much better than my current approach), how does using only a part of the genes affect the results of the differential expression?

I can intuitively understand that less genes means less overall information for DEseq, but I am not sure how that affect it exactly.

Thanks very much again.

ADD REPLYlink written 18 months ago by dmr2100

For the estimation of the dispersion prior, 100s of genes is sufficient.

However, you should use all the genes to estimate size factors, unless you know that the genes in your set are mostly not changed by treatment.

ADD REPLYlink written 18 months ago by Michael Love16k

Perfect, thanks very much for your help.

ADD REPLYlink modified 17 months ago • written 17 months ago by dmr2100
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 156 users visited in the last hour