Question

Per gene Covariates

0

Entering edit mode

kodream ▴ 20

@kodream-6952

Last seen 7.1 years ago

United States

I would like to do a per gene ancestry covariate. I believe this could be important because ancestry for portions of chromosome can be different in the same individual so there is no way to correct across all genes with a simple covariate.

Is there any way to do this with DESeq2? I am already using CQN to do gene length and GC content normalizations.

deseq2 ancestry • 1.6k views

ADD COMMENT • link updated 2.6 years ago by Paola • 0 • written 7.3 years ago by kodream ▴ 20

0

Entering edit mode

Is there a way to merge these back together at some point? Or would each separate group have its own FDR calculation?

ADD REPLY • link 7.3 years ago kodream ▴ 20

0

Entering edit mode

(I moved this from an Answer to a Comment)

How many different groups would you have with this approach? How many genes per group?

I'd recommend combining after running results(), and then recalculating the adjusted p-values by running p.adjust() on the pvalue column. If you calculate p-values using p.adjust(), then you should perform your own filtering of low counts genes. You can do this by filtering on the baseMean column (e.g keep genes with mean normalized count larger than 5 or so). The important thing is to filter before running p.adjust(), or else you incur a penalty from multiple testing.

ADD REPLY • link 7.3 years ago Michael Love 41k

0

Entering edit mode

I am thinking through the ramifications of this approach at the moment, so I haven't committed to the grouped gene approach yet, as I was thinking it would be possible to do on a gene by gene basis.

ADD REPLY • link 7.3 years ago kodream ▴ 20

0

Entering edit mode

You need groups of genes, for the estimation of the dispersion prior. It's not really feasible to run per gene, or at least it would require making some assumptions about the dispersion prior which would be complicated to implement.

ADD REPLY • link 7.3 years ago Michael Love 41k

0

Entering edit mode

Now that I think about it you would need to model the priors which would invalidate the basic Gaussian assumption about the dispersion parameters. If you modeled the dispersions with a dirichlet and the sampling with a multinomial, you could use a dirichlet compound multinomial, oh well. Maybe 26 or 27 groups at least one per chromosome.

ADD REPLY • link 7.3 years ago kodream ▴ 20

0

Entering edit mode

I wasn't proposing doing anything multidimensional, just allowing each set to have it's own dispersion prior. But this sound quite involved, I don't know you might be better off finding a solution that doesn't involve so much re-purposing.

ADD REPLY • link 7.3 years ago Michael Love 41k

score 0 · Answer 1 · 2017-01-05

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 11 hours ago

United States

hi,

There is not a way to provide different covariates for certain genes in DESeq2. The only way I can think to hack it would be to break the analysis into groups of genes with the same covariate. You could use the same size factors (or normalization factors) for all genes, but just estimate dispersions and coefficients separately for all groups of genes that share the same covariate.

ADD COMMENT • link 7.3 years ago Michael Love 41k

0

Entering edit mode

Hi Michael, I have a similar problem where we want to run a LRT analysis on RNA-Seq to find differences across 4 groups and correcting for both per sample-covariates and per-gene covariates. The per gene-covariates are necessary as they are estimates of "contaminants" of each genes coming from other sources. The per-genes covariants are also different in each sample (another matrix basically) . Is there a way to use the DESEq2 setup or shall we implement our own anova-like analysis? In that case, how would we approximately reproduce the corrections that are within DESEq2? Thank you for your help!

Paola

ADD REPLY • link 2.6 years ago Paola • 0

0

Entering edit mode

If you have per gene and per sample corrections, you can use normalization factors (see DESeq2 paper and vignette first on this topic).

For a matrix S with elements s(i,j) how do the values relate to the counts K(i,j)?

See 2nd paragraph of Results of DESeq2 paper for details on the equation.

ADD REPLY • link 2.6 years ago Michael Love 41k

0

Entering edit mode

Thank you Michael, yes the matrix would be the same gene x sample structure of the counts K(i,j), where s(i,j) represent a fraction of ambient contamination per sample derived from that gene. I understand that I can include them as normalization factor as described in the vignette, but that would remove the normalization derived from the size factors, which I need? How can include both normalization factor and library size normalization ?

Thank you for your help! Paola

ADD REPLY • link 2.6 years ago Paola • 0

0

Entering edit mode

We have a function we use for that, e.g. when we have gene lengths (which change across samples). If we know that the counts will scale with the lengths, and we additionally want to correct for sequencing depth per sample, we do:

dds <- estimateSizeFactors(dds, normMatrix=normMatrix)

This actually adds normalizationFactors to the dds object, which take into account both normMatrix and sequencing depth. See the note in ?estimateSzieFactors:

It is recommended to divide out the row-wise geometric mean of normMatrix so the rows roughly are centered on 1.

This can be done with

nm <- nm / exp(rowMeans(log(nm))) # divide out the geometric mean

ADD REPLY • link 2.6 years ago Michael Love 41k

0

Entering edit mode

Okay thank you! I tried that workflow, but my normalization matrix has 0s so when I use the suggested formula to center on 1 , it returend a lot of NaN and Inf so I added 1 to solve that and then it worked:

normMatrix <- contam +1 / exp(rowMeans(log(contam+1)))

Is it a correct solution theoretically ?

ADD REPLY • link 2.6 years ago Paola • 0