DESeq2 with contaminating cell types
2
3
Entering edit mode
hughes.drew ▴ 100
@hughesdrew-6747
Last seen 5.6 years ago
United States

I have two cell types (A and B) that I purified from bulk tissue using FACS and profiled by RNA-seq, and I'd like to perform differential expression using DESeq2. I estimate that my contamination rates are generally low for my sorted cells (1-2%). Unfortunately, I believe my contamination rates are significantly different (>2-fold) for samples A and B, and the major contaminating cell type, C, has some extremely highly expressed genes that come out as significantly different when I analyze the counts using DESeq2 (i.e., genes I know are specific to C get some of the lowest p-values when comparing A and B).

To be more precise, my tissue is roughly 85% cell type C, 7% A and 5% B (and 3% other), and I think my sorted samples of A are something like 98% A and 2% C, while sorted samples of B are something like 99% B and 1% C.

I happen to also have RNA-seq data from purified cell type C.

Initially, I took the following approach to thinking about correcting for sample contamination:

For a given gene in cell type A, I was thinking I could approximate the observed counts (what I measured) as:

counts[observed] = proportion[A]*counts[A] + proportion[C]*counts[C]

I.e., a function of the counts I would observe in pure populations of A or C scaled by the proportion of those cell types in my sorted sample. I can use a handful of genes I believe are totally specific to cell type C to estimate my contamination rate (proportion[C]) as:

counts[observed] = proportion[A]*0 + proportion[C]*counts[C]

proportion[C] = counts[observed] / counts[C]

Again: I already have data from pure populations of cell type C and don't believe contamination is an issue for cell type C.

So I've done this to estimate my contamination rate (proportion of cell type C) in sorted samples A and B, and I can use that to estimate "corrected" counts by plugging into the above formula.

My questions are:

1) Is there any way to use these heavily modified counts with DESeq2? I feel like, processed in this way, the data have broken all of the assumptions that DESeq2 depends on. So I'm guessing not, but wanted to ask if there's a way to make them usable.

2) Is there an alternative way to incorporate my contamination rate estimates into the design formula in DESeq2 to correct for the different contamination rates when comparing A and B? One of the challenges (as I see it), is that the contamination rates affect the comparison in a gene-specific way, depending on how high the gene is expressed in C.

Thanks!

deseq2 rna-seq • 1.3k views
0
Entering edit mode
Gavin Kelly ▴ 660
@gavin-kelly-6944
Last seen 2.6 years ago
United Kingdom / London / Francis Crick…

I'd be tempted to attempt correcting for contamination in the design matrix itself.  So if for each sample, you know (estimate) its A:B:C composition proportions, then the overall expression would be (proportion of A)*(expression in A) + (proportion in B)*(expression in B) + (proportion in C) * (expression in C).  A colData with three columns, each giving the numeric value [0-1] of the proportion of that row's sample that belongs to that cell-type. I guess you should distribute the missing 3% between the three columns, or could have four columns, with an 'other' cell type.

If you then have design = ~ A+B+C and then look at the contrast between A and B, that might give you an estimate of the difference between pure A and pure B?  It's probably worth comparing the effect estimates you get out of this with the alternative of correcting before putting it through the DESeq2 machinery - I'd probably recommend turning betaPriorVar to false when you invoke DESeq in either instance.

Something along these lines should work, but it might depend on your replicate structure - I'm guessing you have a collection of 'primarily A' samples, and a collection of 'primarily B' samples - if possible my intuition says it would be better to have individual estimates of each samples composition, rather than having a couple of global proportions.  Any approach will be relying strongly only the accuracy of you proportion estimates, and a proper incorporation of their uncertainty into the statistical analysis is a research topic in its own right (which someone may have addressed, but I'm unaware of it) - so p-values in either your, or my suggested, approach will be approximate at best.

0
Entering edit mode

My colData matrix now looks something like this:

 sample A B C A_1 0 0.999161006 0.000838994 A_2 0 0.999457391 0.000542609 A_3 0 0.999160568 0.000839432 B_1 0.992302794 0 0.007697206 B_2 0.989715739 0 0.010284261 B_3 0.993032484 0 0.006967516 C_1 0 0 1 C_2 0 0 1 C_3 0 0 1

However, due to the way I'm estimating my contamination rates/the assumption I'm making about the cell types in my sorted sample, this matrix is not full rank. I.e., once I estimate proportion[C] for each sample, proportion[A] or proportion[B] is 1 - proportion[C]. So when I try to run DESeq2, I get the error that the model matrix is not full rank.

1) Is there a way around this/am I not implementing your suggestion correctly?

If I'm still able to use this approach, I have a follow-up question about extracting the results. For a standard design (design = ~ cell_type), I'm testing whether or not my cell_type factor (A, B, or C) is a significant predictor of the level of expression of an individual gene, and I can use contrasts to directly compare A vs. B. With the new design (A, B, and C are now proportions and design = ~ A + B + C), I'm testing if the estimated proportion of A or B or C is a significant predictor of the level of expression of an individual gene. Correct? If so,

2) how can I structure my test/use contrasts to directly test for genes that are differentially expressed between pure populations of A and B?

0
Entering edit mode

Sorry, I should have said design ~ A + B + C -1 so that we're not attempting to fit with rank-deficient matrix.

I think you should then be able to estimate the A vs B identically to how you would if you'd had pure 0's and 1's in the matrix, so results(..., contrast=list("A", "B")) will give the estimate of fold-change.  You may need to switch the betaPrior part of DESeq off, and there may be other issues around the subtleties of how DESeq2 does it's fitting, so I'd take quite a bit of time checking how sensible the results you get out are, but conceptually I think this is a correct way of preceding.

0
Entering edit mode

Thanks, again (sorry for the slow response). I went ahead and tried this. Conceptually, I think it's exactly what I was asking for. In practice, I don't do as good of a job as I would like/need (among genes that I strongly believe are due to differential contamination, I drop some from my differentially expressed gene list, but I retain a lot). So I think, going forward, I'll probably just test for differential expression under a standard model (~cell_type) and do some post-hoc filtering to prioritize genes that I think are specific to my cell types A and B.

Regardless, thank you very much for the suggestion and explanation. A forum/posting question: should I accept your response as an answer since, conceptually, it's what I was looking for? Or should I accept Michael's answer, since it may explain why, in practice, this did not perform as I'd hoped?

0
Entering edit mode
@mikelove
Last seen 3 days ago
United States

hi hughes.drew,

Just catching up on the thread. I believe there may be statistical methods for estimating expression levels of A and B if you have 'pure samples' of C. This is often called 'deconvolution' in the bioinformatics literature although it's technically just a mixing problem, having nothing to do with signal processing. I don't have any software packages that I know of off hand (the ones I can think of, require having also pure A and B). But you might start looking around for packages that can do deconvolution and then testing of the estimated (unobserved) levels within a statistical framework.

The problem with trying to do this with a DESeq2 design is that all the statistical packages for differential expression I can think of work on the log scale of expression (which makes sense because expression is positive and has a right tail, so using the log and linear model or GLM with a log link makes sense).

While you know there are true additive effects on the scale of expression (your samples are made up of proportion A * expression A + ...), you can't attempt to model this with additive effects in the log scale of expression, which is what the statistical packages like DESeq2 offer.

0
Entering edit mode

Thanks, Michael. This is very clear and I think explains why trying to implement this with DESeq2 didn't perform as well as I'd hoped in practice. I've tracked down a handful of tools that attempt to estimate cell type composition from heterogeneous RNA-seq data (mostly for tumor samples), but I'm still trying to find ones built for the additional analysis (differential expression) and ones that can handle populations without a pure reference sample. I think I have an ad hoc solution that works well enough for now, but thank you for your input and suggestions.