Question: Batch effects and model matrices
0
24 months ago by
bvernot0
bvernot0 wrote:

I have mice with two possible genotypes: homozygous wildtype for a gene of interest, or homozygous edited. We’ve collected cDNA of single cells from a few tissue types, although here I mostly ignore cell and tissue type.  The primary issue is that I clearly have mouse-specific effects, but I don't know how to separate those from genotype. This is analogous to the batch example in section 3.12.1 of the DESeq2 vignette, but because the mouse itself is the batch, it's not like I can change the study design.  I'm doing an ad-hoc correction where I discard genes where the test ~mouse vs ~genotype is highly significant (described below).  Does this seem reasonable?  Is there a better approach?

Thanks!

I simulated some data as an example: 50 cells from 4 mice (two wt, two hum). In reality, I have sibling pairs, so that does help to control for some of the batch effect (testing ~genotype vs ~genotype+sibling).

Here's the simulated expression for each gene. Gene 1 has no genotype effect, but a clear issue in mouse 1. Genes 2 and 3 have a genotype effect, and the rest do not.

Do a wald test for ~genotype:

deseq.data <- DESeqDataSetFromMatrix(expr_counts, cell_annotations, ~genotype)
deseq.data.wald.gt = DESeq(deseq.data)
deseq.data.wald.gt <- results(deseq.data.wald.gt)
deseq.data.wald.gt[deseq.data.wald.gt$padj < .1,] # log2 fold change (MAP): genotype wt vs hum # Wald test p-value: genotype wt vs hum # DataFrame with 4 rows and 6 columns # baseMean log2FoldChange lfcSE stat pvalue padj # <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> # gene1 15.50258 -0.8298745 0.14021331 -5.918657 3.245816e-09 6.491631e-08 # gene2 24.30998 0.4379174 0.08033192 5.451350 4.998887e-08 4.998887e-07 # gene3 21.15279 0.2339134 0.08394988 2.786346 5.330586e-03 2.665293e-02 # gene19 18.54322 -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02  Cannot do ~mouse+genotype vs ~mouse, because genotype is a subset of mouse. Instead I test full = ~mouse vs reduced = ~genotype. This gives a highly significant result for gene1: # log2 fold change (MLE): mousem4 # LRT p-value: full vs reduced # DataFrame with 1 row and 6 columns # baseMean log2FoldChange lfcSE stat pvalue padj # <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> # gene1 15.50258 3.361492 0.1976183 297.7472 2.213212e-65 4.426425e-64 This is what I'm currently doing: treat a gene as significant only if it has a significant result with ~genotype vs ~1, and if it has an unsignificant result for ~mouse vs ~genotype. deseq.data.wald.gt[deseq.data.wald.gt$padj < .1 & deseq.data.lrt2\$padj > .0001,]

# log2 fold change (MAP): genotype wt vs hum
# Wald test p-value: genotype wt vs hum
# DataFrame with 3 rows and 6 columns
#         baseMean log2FoldChange      lfcSE      stat       pvalue         padj
#        <numeric>      <numeric>  <numeric> <numeric>    <numeric>    <numeric>
# gene2   24.30998      0.4379174 0.08033192  5.451350 4.998887e-08 4.998887e-07
# gene3   21.15279      0.2339134 0.08394988  2.786346 5.330586e-03 2.665293e-02
# gene19  18.54322     -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02

deseq2 single cell monocle • 471 views
modified 24 months ago by Michael Love23k • written 24 months ago by bvernot0
Answer: Batch effects and model matrices
0
24 months ago by
Michael Love23k
United States
Michael Love23k wrote:

For single cell data, you may want to use a software that handles excess (technical) zeros, relative to a negative binomial or a log linear model.

In general, the LRT you are performing doesn't make sense. The reduced design needs to be nested within the full design (meaning that you remove some terms but you should not introduce other terms).

Here mouse is confounded with genotype (in an unavoidable way), so you can't use fixed effects to try to account for the correlation among expression measurements from the same mouse. The only approach I know of is to use the duplicateCorrelation() functionality of limma.

Thanks for your thoughts, Mike.  I was under the impression that ~genotype is a nested model of ~mouse, since it is just ~mouse constraining m1==m3 and m2==m4 (since those mice are the same genotype).  But perhaps I'm wrong, or deseq2/monocle just can't handle that type of nesting?  For what it's worth, the results from Deseq2 (and monocle) for that particular LRT do seem to make sense.

WRT the single cell aspect, I have used monocle in a similar way as I described here, and I do seem to get more stable results,