I have mice with two possible genotypes: homozygous wildtype for a gene of interest, or homozygous edited. We’ve collected cDNA of single cells from a few tissue types, although here I mostly ignore cell and tissue type. The primary issue is that I clearly have mouse-specific effects, but I don't know how to separate those from genotype. This is analogous to the batch example in section 3.12.1 of the DESeq2 vignette, but because the mouse itself is the batch, it's not like I can change the study design. I'm doing an ad-hoc correction where I discard genes where the test ~mouse vs ~genotype is highly significant (described below). Does this seem reasonable? Is there a better approach?
Thanks!
I simulated some data as an example: 50 cells from 4 mice (two wt, two hum). In reality, I have sibling pairs, so that does help to control for some of the batch effect (testing ~genotype vs ~genotype+sibling).
Here's the simulated expression for each gene. Gene 1 has no genotype effect, but a clear issue in mouse 1. Genes 2 and 3 have a genotype effect, and the rest do not.
Do a wald test for ~genotype:
deseq.data <- DESeqDataSetFromMatrix(expr_counts, cell_annotations, ~genotype) deseq.data.wald.gt = DESeq(deseq.data) deseq.data.wald.gt <- results(deseq.data.wald.gt) deseq.data.wald.gt[deseq.data.wald.gt$padj < .1,] # log2 fold change (MAP): genotype wt vs hum # Wald test p-value: genotype wt vs hum # DataFrame with 4 rows and 6 columns # baseMean log2FoldChange lfcSE stat pvalue padj # <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> # gene1 15.50258 -0.8298745 0.14021331 -5.918657 3.245816e-09 6.491631e-08 # gene2 24.30998 0.4379174 0.08033192 5.451350 4.998887e-08 4.998887e-07 # gene3 21.15279 0.2339134 0.08394988 2.786346 5.330586e-03 2.665293e-02 # gene19 18.54322 -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02
Cannot do ~mouse+genotype vs ~mouse, because genotype is a subset of mouse.
Instead I test full = ~mouse
vs reduced = ~genotype
. This gives a highly significant result for gene1:
# log2 fold change (MLE): mousem4 # LRT p-value: full vs reduced # DataFrame with 1 row and 6 columns # baseMean log2FoldChange lfcSE stat pvalue padj # <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> # gene1 15.50258 3.361492 0.1976183 297.7472 2.213212e-65 4.426425e-64
This is what I'm currently doing: treat a gene as significant only if it has a significant result with ~genotype vs ~1
, and if it has an unsignificant result for ~mouse
vs ~genotype.
deseq.data.wald.gt[deseq.data.wald.gt$padj < .1 & deseq.data.lrt2$padj > .0001,] # log2 fold change (MAP): genotype wt vs hum # Wald test p-value: genotype wt vs hum # DataFrame with 3 rows and 6 columns # baseMean log2FoldChange lfcSE stat pvalue padj # <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> # gene2 24.30998 0.4379174 0.08033192 5.451350 4.998887e-08 4.998887e-07 # gene3 21.15279 0.2339134 0.08394988 2.786346 5.330586e-03 2.665293e-02 # gene19 18.54322 -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02

