I would like to do a per gene ancestry covariate. I believe this could be important because ancestry for portions of chromosome can be different in the same individual so there is no way to correct across all genes with a simple covariate.
Is there any way to do this with DESeq2? I am already using CQN to do gene length and GC content normalizations.
How many different groups would you have with this approach? How many genes per group?
I'd recommend combining after running results(), and then recalculating the adjusted p-values by running p.adjust() on the pvalue column. If you calculate p-values using p.adjust(), then you should perform your own filtering of low counts genes. You can do this by filtering on the baseMean column (e.g keep genes with mean normalized count larger than 5 or so). The important thing is to filter before running p.adjust(), or else you incur a penalty from multiple testing.
I am thinking through the ramifications of this approach at the moment, so I haven't committed to the grouped gene approach yet, as I was thinking it would be possible to do on a gene by gene basis.
You need groups of genes, for the estimation of the dispersion prior. It's not really feasible to run per gene, or at least it would require making some assumptions about the dispersion prior which would be complicated to implement.
Now that I think about it you would need to model the priors which would invalidate the basic Gaussian assumption about the dispersion parameters. If you modeled the dispersions with a dirichlet and the sampling with a multinomial, you could use a dirichlet compound multinomial, oh well. Maybe 26 or 27 groups at least one per chromosome.
I wasn't proposing doing anything multidimensional, just allowing each set to have it's own dispersion prior. But this sound quite involved, I don't know you might be better off finding a solution that doesn't involve so much re-purposing.
There is not a way to provide different covariates for certain genes in DESeq2. The only way I can think to hack it would be to break the analysis into groups of genes with the same covariate. You could use the same size factors (or normalization factors) for all genes, but just estimate dispersions and coefficients separately for all groups of genes that share the same covariate.
Hi Michael,
I have a similar problem where we want to run a LRT analysis on RNA-Seq to find differences across 4 groups and correcting for both per sample-covariates and per-gene covariates. The per gene-covariates are necessary as they are estimates of "contaminants" of each genes coming from other sources. The per-genes covariants are also different in each sample (another matrix basically) . Is there a way to use the DESEq2 setup or shall we implement our own anova-like analysis?
In that case, how would we approximately reproduce the corrections that are within DESEq2?
Thank you for your help!
Thank you Michael,
yes the matrix would be the same gene x sample structure of the counts K(i,j), where s(i,j) represent a fraction of ambient contamination per sample derived from that gene. I understand that I can include them as normalization factor as described in the vignette, but that would remove the normalization derived from the size factors, which I need? How can include both normalization factor and library size normalization ?
We have a function we use for that, e.g. when we have gene lengths (which change across samples). If we know that the counts will scale with the lengths, and we additionally want to correct for sequencing depth per sample, we do:
This actually adds normalizationFactors to the dds object, which take into account both normMatrix and sequencing depth. See the note in ?estimateSzieFactors:
It is recommended to divide out the row-wise geometric mean of normMatrix so the rows roughly are centered on 1.
This can be done with
nm <- nm / exp(rowMeans(log(nm))) # divide out the geometric mean
Okay thank you!
I tried that workflow, but my normalization matrix has 0s so when I use the suggested formula to center on 1 , it returend a lot of NaN and Inf so I added 1 to solve that and then it worked:
Is there a way to merge these back together at some point? Or would each separate group have its own FDR calculation?
(I moved this from an Answer to a Comment)
How many different groups would you have with this approach? How many genes per group?
I'd recommend combining after running results(), and then recalculating the adjusted p-values by running p.adjust() on the pvalue column. If you calculate p-values using p.adjust(), then you should perform your own filtering of low counts genes. You can do this by filtering on the baseMean column (e.g keep genes with mean normalized count larger than 5 or so). The important thing is to filter before running p.adjust(), or else you incur a penalty from multiple testing.
I am thinking through the ramifications of this approach at the moment, so I haven't committed to the grouped gene approach yet, as I was thinking it would be possible to do on a gene by gene basis.
You need groups of genes, for the estimation of the dispersion prior. It's not really feasible to run per gene, or at least it would require making some assumptions about the dispersion prior which would be complicated to implement.
Now that I think about it you would need to model the priors which would invalidate the basic Gaussian assumption about the dispersion parameters. If you modeled the dispersions with a dirichlet and the sampling with a multinomial, you could use a dirichlet compound multinomial, oh well. Maybe 26 or 27 groups at least one per chromosome.
I wasn't proposing doing anything multidimensional, just allowing each set to have it's own dispersion prior. But this sound quite involved, I don't know you might be better off finding a solution that doesn't involve so much re-purposing.