I have two cell types (A and B) that I purified from bulk tissue using FACS and profiled by RNA-seq, and I'd like to perform differential expression analysis using DESeq2. I estimate that contamination rates are generally low for my sorted cells (1-2%). Unfortunately, I believe the contamination rates differ substantially (>2-fold) between samples A and B. The major contaminating cell type, C, has some extremely highly expressed genes that come out as significantly different when I analyze the counts using DESeq2 (i.e., genes I know are specific to C get some of the lowest p-values when comparing A and B).
To be more precise, my tissue is roughly 85% cell type C, 7% A and 5% B (and 3% other), and I think my sorted samples of A are something like 98% A and 2% C, while sorted samples of B are something like 99% B and 1% C.
I happen to also have RNA-seq data from purified cell type C.
Initially, I took the following approach to thinking about correcting for sample contamination:
For a given gene in cell type A, I was thinking I could approximate the observed counts (what I measured) as:
counts[observed] = proportion[A]*counts[A] + proportion[C]*counts[C]
That is, the observed counts are a function of the counts I would observe in pure populations of A or C, scaled by the proportion of those cell types in my sorted sample. I can use a handful of genes I believe are totally specific to cell type C (so counts[A] = 0 for those genes) to estimate my contamination rate (proportion[C]) as:
counts[observed] = proportion[A]*0 + proportion[C]*counts[C]
proportion[C] = counts[observed] / counts[C]
Again: I already have data from pure populations of cell type C and don't believe contamination is an issue for cell type C.
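To make the estimate above concrete, here is a minimal Python sketch of the ratio calculation. The gene names, the numbers, and the helper `estimate_contamination` are all hypothetical, and taking the median across marker genes is my own choice for robustness rather than part of the formula. It also assumes the values have already been put on a comparable scale (e.g. library-size normalized), which the formula implicitly requires.

```python
from statistics import median

def estimate_contamination(observed, pure_c, marker_genes):
    """Estimate proportion[C] as the median of observed / pure-C ratios
    over genes assumed to be expressed only in cell type C."""
    ratios = [observed[g] / pure_c[g] for g in marker_genes if pure_c[g] > 0]
    if not ratios:
        raise ValueError("no marker gene with non-zero expression in pure C")
    return median(ratios)

# Made-up, library-size-normalized values for illustration only.
observed_a = {"markerC1": 200.0, "markerC2": 410.0, "markerC3": 95.0}
pure_c = {"markerC1": 10000.0, "markerC2": 20000.0, "markerC3": 5000.0}

# Per-gene ratios are 0.02, 0.0205 and 0.019; the median is reported.
prop_c = estimate_contamination(observed_a, pure_c, list(pure_c))
print(prop_c)
```

Using several marker genes and a median, rather than a single gene, should make the estimate less sensitive to any one marker that is not perfectly specific to C.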
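So I've done this to estimate my contamination rate (proportion of cell type C) in sorted samples A and B, and I can use that to estimate "corrected" counts by solving the above formula for the pure-population counts, e.g. for cell type A:
counts[A] = (counts[observed] - proportion[C]*counts[C]) / proportion[A]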
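The correction described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper `correct_counts`, the gene names, and the numbers are hypothetical; proportion[A] is taken as 1 - proportion[C] (i.e. C is treated as the only contaminant); and negative results are clipped to zero, which is an extra choice not implied by the formula.

```python
def correct_counts(observed, pure_c, prop_c):
    """Solve counts[observed] = prop_a*counts[A] + prop_c*counts[C] for counts[A]."""
    prop_a = 1.0 - prop_c  # assumption: C is the only contaminating cell type
    corrected = {}
    for gene, obs in observed.items():
        value = (obs - prop_c * pure_c.get(gene, 0.0)) / prop_a
        corrected[gene] = max(value, 0.0)  # clip negatives (an added choice)
    return corrected

# Made-up, library-size-normalized values for illustration only.
observed_a = {"geneX": 500.0, "markerC1": 200.0}
pure_c = {"geneX": 100.0, "markerC1": 10000.0}

corrected = correct_counts(observed_a, pure_c, prop_c=0.02)
print(corrected)
```

In this toy example the C-specific marker drops to zero, while geneX changes only slightly. Note that the corrected values are generally no longer integers, which is part of what motivates question 1 below.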
My questions are:
1) Is there any way to use these heavily modified counts with DESeq2? I feel like, processed in this way, the data have broken all of the assumptions that DESeq2 depends on. So I'm guessing not, but wanted to ask if there's a way to make them usable.
2) Is there an alternative way to incorporate my contamination rate estimates into the design formula in DESeq2 to correct for the different contamination rates when comparing A and B? One of the challenges, as I see it, is that the contamination rates affect the comparison in a gene-specific way, depending on how highly the gene is expressed in C.