I have two cell types (A and B) that I purified from bulk tissue using FACS and profiled by RNA-seq, and I'd like to perform differential expression using DESeq2. I estimate that my contamination rates are generally low for my sorted cells (1-2%). Unfortunately, I believe my contamination rates are significantly different (>2-fold) for samples A and B, and the major contaminating cell type, C, has some extremely highly expressed genes that come out as significantly different when I analyze the counts using DESeq2 (i.e., genes I know are specific to C get some of the lowest p-values when comparing A and B).
To be more precise, my tissue is roughly 85% cell type C, 7% A and 5% B (and 3% other), and I think my sorted samples of A are something like 98% A and 2% C, while sorted samples of B are something like 99% B and 1% C.
I happen to also have RNA-seq data from purified cell type C.
Initially, I took the following approach to thinking about correcting for sample contamination:
For a given gene in cell type A, I was thinking I could approximate the observed counts (what I measured) as:
counts[observed] = proportion[A]*counts[A] + proportion[C]*counts[C]
I.e., a function of the counts I would observe in pure populations of A or C scaled by the proportion of those cell types in my sorted sample. I can use a handful of genes I believe are totally specific to cell type C to estimate my contamination rate (proportion[C]) as:
counts[observed] = proportion[A]*0 + proportion[C]*counts[C]
proportion[C] = counts[observed] / counts[C]
Again: I already have data from pure populations of cell type C and don't believe contamination is an issue for cell type C.
So I've done this to estimate my contamination rate (proportion of cell type C) in sorted samples A and B, and I can use that to estimate "corrected" counts by plugging into the above formula.
My questions are:
1) Is there any way to use these heavily modified counts with DESeq2? I feel like, processed in this way, the data have broken all of the assumptions that DESeq2 depends on. So I'm guessing not, but wanted to ask if there's a way to make them usable.
2) Is there an alternative way to incorporate my contamination rate estimates into the design formula in DESeq2 to correct for the different contamination rates when comparing A and B? One of the challenges (as I see it), is that the contamination rates affect the comparison in a gene-specific way, depending on how high the gene is expressed in C.
Thanks!
Awesome, thanks for your suggestion.
My colData matrix now looks something like this:
sample
A
B
C
A_1
0
0.999161006
0.000838994
A_2
0
0.999457391
0.000542609
A_3
0
0.999160568
0.000839432
B_1
0.992302794
0
0.007697206
B_2
0.989715739
0
0.010284261
B_3
0.993032484
0
0.006967516
C_1
0
0
1
C_2
0
0
1
C_3
0
0
1
However, due to the way I'm estimating my contamination rates/the assumption I'm making about the cell types in my sorted sample, this matrix is not full rank. I.e., once I estimate proportion[C] for each sample, proportion[A] or proportion[B] is 1 - proportion[C]. So when I try to run DESeq2, I get the error that the model matrix is not full rank.
1) Is there a way around this/am I not implementing your suggestion correctly?
If I'm still able to use this approach, I have a follow-up question about extracting the results. For a standard design (design = ~ cell_type), I'm testing whether or not my cell_type factor (A, B, or C) is a significant predictor of the level of expression of an individual gene, and I can use contrasts to directly compare A vs. B. With the new design (A, B, and C are now proportions and design = ~ A + B + C), I'm testing if the estimated proportion of A or B or C is a significant predictor of the level of expression of an individual gene. Correct? If so,
2) how can I structure my test/use contrasts to directly test for genes that are differentially expressed between pure populations of A and B?
Thanks again for your help.
Sorry, I should have said
design ~ A + B + C -1
so that we're not attempting to fit with rank-deficient matrix.I think you should then be able to estimate the A vs B identically to how you would if you'd had pure 0's and 1's in the matrix, so
results(..., contrast=list("A", "B"))
will give the estimate of fold-change. You may need to switch the betaPrior part of DESeq off, and there may be other issues around the subtleties of how DESeq2 does it's fitting, so I'd take quite a bit of time checking how sensible the results you get out are, but conceptually I think this is a correct way of preceding.Thanks, again (sorry for the slow response). I went ahead and tried this. Conceptually, I think it's exactly what I was asking for. In practice, I don't do as good of a job as I would like/need (among genes that I strongly believe are due to differential contamination, I drop some from my differentially expressed gene list, but I retain a lot). So I think, going forward, I'll probably just test for differential expression under a standard model (~cell_type) and do some post-hoc filtering to prioritize genes that I think are specific to my cell types A and B.
Regardless, thank you very much for the suggestion and explanation. A forum/posting question: should I accept your response as an answer since, conceptually, it's what I was looking for? Or should I accept Michael's answer, since it may explain why, in practice, this did not perform as I'd hoped?