Question: Batch effects and model matrices
gravatar for bvernot
7 months ago by
bvernot0 wrote:

I have mice with two possible genotypes: homozygous wildtype for a gene of interest, or homozygous edited. We’ve collected cDNA of single cells from a few tissue types, although here I mostly ignore cell and tissue type.  The primary issue is that I clearly have mouse-specific effects, but I don't know how to separate those from genotype. This is analogous to the batch example in section 3.12.1 of the DESeq2 vignette, but because the mouse itself is the batch, it's not like I can change the study design.  I'm doing an ad-hoc correction where I discard genes where the test ~mouse vs ~genotype is highly significant (described below).  Does this seem reasonable?  Is there a better approach?


I simulated some data as an example: 50 cells from 4 mice (two wt, two hum). In reality, I have sibling pairs, so that does help to control for some of the batch effect (testing ~genotype vs ~genotype+sibling).

Here's the simulated expression for each gene. Gene 1 has no genotype effect, but a clear issue in mouse 1. Genes 2 and 3 have a genotype effect, and the rest do not.


Do a wald test for ~genotype: <- DESeqDataSetFromMatrix(expr_counts, cell_annotations, ~genotype) = DESeq( <- results([$padj < .1,]

# log2 fold change (MAP): genotype wt vs hum 
# Wald test p-value: genotype wt vs hum 
# DataFrame with 4 rows and 6 columns
#         baseMean log2FoldChange      lfcSE      stat       pvalue         padj
#        <numeric>      <numeric>  <numeric> <numeric>    <numeric>    <numeric>
# gene1   15.50258     -0.8298745 0.14021331 -5.918657 3.245816e-09 6.491631e-08
# gene2   24.30998      0.4379174 0.08033192  5.451350 4.998887e-08 4.998887e-07
# gene3   21.15279      0.2339134 0.08394988  2.786346 5.330586e-03 2.665293e-02
# gene19  18.54322     -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02

Cannot do ~mouse+genotype vs ~mouse, because genotype is a subset of mouse.

Instead I test full = ~mouse vs reduced = ~genotype.  This gives a highly significant result for gene1:

# log2 fold change (MLE): mousem4 
# LRT p-value: full vs reduced 
# DataFrame with 1 row and 6 columns
#        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
#       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
# gene1  15.50258       3.361492 0.1976183  297.7472 2.213212e-65 4.426425e-64


This is what I'm currently doing: treat a gene as significant only if it has a significant result with ~genotype vs ~1, and if it has an unsignificant result for ~mouse vs ~genotype.[$padj < .1 &$padj > .0001,]

# log2 fold change (MAP): genotype wt vs hum 
# Wald test p-value: genotype wt vs hum 
# DataFrame with 3 rows and 6 columns
#         baseMean log2FoldChange      lfcSE      stat       pvalue         padj
#        <numeric>      <numeric>  <numeric> <numeric>    <numeric>    <numeric>
# gene2   24.30998      0.4379174 0.08033192  5.451350 4.998887e-08 4.998887e-07
# gene3   21.15279      0.2339134 0.08394988  2.786346 5.330586e-03 2.665293e-02
# gene19  18.54322     -0.2473045 0.08387366 -2.948536 3.192826e-03 2.128551e-02






ADD COMMENTlink modified 7 months ago by Michael Love15k • written 7 months ago by bvernot0
gravatar for Michael Love
7 months ago by
Michael Love15k
United States
Michael Love15k wrote:

For single cell data, you may want to use a software that handles excess (technical) zeros, relative to a negative binomial or a log linear model.

In general, the LRT you are performing doesn't make sense. The reduced design needs to be nested within the full design (meaning that you remove some terms but you should not introduce other terms).

Here mouse is confounded with genotype (in an unavoidable way), so you can't use fixed effects to try to account for the correlation among expression measurements from the same mouse. The only approach I know of is to use the duplicateCorrelation() functionality of limma.

ADD COMMENTlink written 7 months ago by Michael Love15k

Thanks for your thoughts, Mike.  I was under the impression that ~genotype is a nested model of ~mouse, since it is just ~mouse constraining m1==m3 and m2==m4 (since those mice are the same genotype).  But perhaps I'm wrong, or deseq2/monocle just can't handle that type of nesting?  For what it's worth, the results from Deseq2 (and monocle) for that particular LRT do seem to make sense.

WRT the single cell aspect, I have used monocle in a similar way as I described here, and I do seem to get more stable results,

ADD REPLYlink written 7 months ago by bvernot0

I see what you're doing now. You can test ~mouse vs ~genotype, but think about what you're testing: it's whether or not there are differences between the two mice within a genotype. This isn't what you want. Like I said, the only way i know of to deal with the correlations between mice within genotype is using duplicateCorrelation() in the limma pipeline.

ADD REPLYlink written 7 months ago by Michael Love15k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 103 users visited in the last hour