Question

Advice on appropriateness of an offset in glmGamPoi/edgeR

Alec • 0

Hello!

I'm working with an interesting single-cell RNA-seq dataset which, because of the collection method, suffers from elevated "contamination" from misassigned reads. Basically, because of how the sample is collected, if two cells are near one another, reads from Cell_1 can be misassigned to Cell_2, and vice versa.

This is causing quite a lot of "false positives" in our differential expression analysis. For example, let's assume gene_A is highly expressed in cell-type_1 but never expressed in cell-type_2. Let's also assume that cell-type_1 varies in abundance between samples from two treatment conditions. Because of contaminating reads, gene_A might "incorrectly" appear as differentially expressed in cell-type_2.

However, we have an observation-specific indirect measure of contamination (per-gene estimate for each cell), and I am wondering if there's any advice on how this could best be used in an existing method to help correct for the "contamination effect" that I'm seeing, or if that's ill-advised.

Basically, since I can generate a genes x samples matrix of counts and a genes x samples matrix of estimated contamination, I am wondering whether it would be appropriate to include this contamination metric as an offset matrix in edgeR, glmGamPoi, or DESeq2 (in addition to a library.size measure). This would be similar in concept to the observation-level GC corrections done in EDASeq (as I understand it). I'm trying not to reinvent the wheel for its own sake, but if I need to go a new route, that's fine as well.

I've tested this a little bit by just running negative binomial regressions at the individual gene level including this indirect contamination measure as a term in the model, or including it as an offset (representative code below).

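library(MASS)  # provides glm.nb
model <- glm.nb(counts_gene_A ~ group + log(gene_A_contamination) + offset(log(library.size)), data = data)  # contamination as a fitted covariate
model_offset <- glm.nb(counts_gene_A ~ group + offset(log(gene_A_contamination)) + offset(log(library.size)), data = data)  # contamination as an offset (coefficient fixed at 1)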

Doing this, I see very little change in AIC values, though model usually has a slightly lower AIC than model_offset. Group coefficients aren't much affected either (again, usually).

Alternatively, I could try to find a way to incorporate this contamination value as a modeled term, but it doesn't seem easily done in existing differential expression models. I've seen that the new edgeR release specifically mentions incorporating log-transformed gene expression values into the design matrix to look at how gene_A might correspond to changed expression of gene_B, but here I would need to model the contamination scores for each gene separately. That said, if there's an easy way to do this that I'm just overlooking, that'd be amazing.

Thanks very much for taking the time to read this!

Tags: DESeq2, glmGamPoi, offset, edgeR, scRNAseq

Written 19 months ago by Alec; updated 18 months ago by Gordon Smyth

score 1 · Answer 1 · 2024-06-27

Hi Alec,

The problem you describe sounds related to problems that people encounter working with spatial data, where suboptimal segmentations misassign reads to the wrong cells. Some time ago, I saw a paper by Kieran Campbell's lab, which proposed a probabilistic model to denoise the initial count matrix (Lee et al., bioRxiv 2024), which might be relevant here.

> we have an observation-specific indirect measure of contamination (per-gene estimate for each cell), and I am wondering if there's any advice on how this could best be used in an existing method to help correct for the "contamination effect" that I'm seeing, or if that's ill-advised.

To me, your idea of using a genes x samples contamination matrix sounds good! Speaking for glmGamPoi, you would need to combine the library size and the per-gene contamination into a single genes x samples matrix and pass it as the offset argument. Offsets are on the log scale, so this means adding the log library size to the log contamination estimate. As far as I know, edgeR also supports a full matrix of offset values, and DESeq2 does as well through the normalizationFactors function.
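
A minimal sketch of what such a combined offset matrix could look like, using simulated data. All object names (`counts`, `contamination`, `col_data`, `offset_mat`) are hypothetical, and the contamination estimate is assumed to be strictly positive so that its log is defined. The glmGamPoi call is guarded because the package may not be installed; passing size factors of 1 is an assumption made here so that library size is not counted twice.

```r
# Simulated stand-ins for the real data (hypothetical names and values).
set.seed(1)
n_genes <- 50
n_samples <- 8
counts <- matrix(rnbinom(n_genes * n_samples, mu = 20, size = 5),
                 nrow = n_genes, ncol = n_samples)
# Assumed: a strictly positive genes x samples contamination estimate.
contamination <- matrix(runif(n_genes * n_samples, min = 0.5, max = 2),
                        nrow = n_genes, ncol = n_samples)
col_data <- data.frame(group = factor(rep(c("ctrl", "trt"), each = n_samples / 2)))

# Offsets live on the log scale: log library size (one value per sample,
# repeated across genes) plus log contamination (one value per gene and sample).
lib_size <- colSums(counts)
log_lib <- matrix(log(lib_size), nrow = n_genes, ncol = n_samples, byrow = TRUE)
offset_mat <- log_lib + log(contamination)

# Pass the combined matrix as the offset; size factors are fixed at 1 because
# the library size is already part of offset_mat.
if (requireNamespace("glmGamPoi", quietly = TRUE)) {
  fit <- glmGamPoi::glm_gp(counts, design = ~ group, col_data = col_data,
                           offset = offset_mat,
                           size_factors = rep(1, n_samples))
}
```

The same `offset_mat` is the kind of object that edgeR's matrix offsets expect, whereas DESeq2's normalization factors are specified on the count scale rather than the log scale.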

> I've tested this a little bit by just running negative binomial regressions at the individual gene level including this indirect contamination measure [...] Doing this, I see very little change in AIC values, though model usually has a slightly lower AIC than model_offset. Group coefficients aren't much affected either (again, usually).

Whether you need to fit a coefficient for your contamination estimates or can include them as an offset (i.e., fix the coefficient to 1) depends on how accurate they are, or more precisely on whether the expected counts really scale in proportion to them. For comparison, if a cell is twice as big and thus its size factor is twice as large, we expect twice as many counts for each gene and can therefore treat the size factor as an offset.
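
One way to probe this for a single gene is to fit the contamination as a covariate and check whether its coefficient is compatible with 1. The sketch below does this on simulated data in which the counts are, by construction, exactly proportional to the contamination estimate; the data, the variable names, and the use of a simple Wald interval are all assumptions made for illustration.

```r
library(MASS)  # provides glm.nb

set.seed(42)
n <- 400
group <- factor(rep(c("ctrl", "trt"), each = n / 2))
library_size <- round(runif(n, 5000, 20000))
# Hypothetical per-observation contamination estimate for one gene.
contamination <- runif(n, 0.2, 5)
# Simulate counts whose mean is proportional to the contamination estimate.
mu <- library_size * contamination * 1e-3 * ifelse(group == "trt", 1.5, 1)
dat <- data.frame(counts = rnbinom(n, mu = mu, size = 10),
                  group, library_size, contamination)

# Fit the contamination as a covariate instead of an offset.
fit <- glm.nb(counts ~ group + log(contamination) + offset(log(library_size)),
              data = dat)
est <- coef(summary(fit))["log(contamination)", ]
# Approximate 95% Wald interval for the contamination coefficient.
ci <- est["Estimate"] + c(-1.96, 1.96) * est["Std. Error"]
print(ci)
```

If the interval sits close to 1, as it should in this simulation, treating the log contamination as an offset is defensible; if it is clearly away from 1 on real data, the offset would mis-scale the correction.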

> Alternatively, I could try to find a way to incorporate this contamination value as a modeled term, but it doesn't seem easily done in existing differential expression models. I've seen that the new edgeR release specifically mentions incorporating log-transformed gene expression values into the design matrix to look at how gene_A might correspond to changed expression of gene_B, but here I would need to model the contamination scores for each gene separately. That said, if there's an easy way to do this that I'm just overlooking, that'd be amazing.

Again speaking for glmGamPoi, you cannot include a gene-specific covariate in the design matrix, so you would have to run a separate fit for each gene. On the other hand, this is not as bad as it may sound: internally, glmGamPoi runs a separate model fit for each gene anyway (glmGamPoi Github).
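
A rough sketch of such a gene-by-gene loop, here written with `MASS::glm.nb` on simulated data rather than with glmGamPoi itself. The helper `fit_one_gene` and all data objects are hypothetical. Note that a plain loop like this estimates the dispersion separately for every gene and does not share information across genes.

```r
library(MASS)  # provides glm.nb

set.seed(7)
n_genes <- 5
n_samples <- 60
group <- factor(rep(c("ctrl", "trt"), each = n_samples / 2))
library_size <- round(runif(n_samples, 5000, 20000))
# Hypothetical genes x samples contamination estimates.
contamination <- matrix(runif(n_genes * n_samples, 0.5, 3), nrow = n_genes,
                        dimnames = list(paste0("gene_", seq_len(n_genes)), NULL))
# Expected counts: contamination scaled by a per-sample depth term.
mu <- sweep(contamination, 2, library_size * 2e-3, `*`)
counts <- matrix(rnbinom(length(mu), mu = mu, size = 10), nrow = n_genes,
                 dimnames = dimnames(contamination))

# Fit one negative binomial model per gene with that gene's own covariate.
fit_one_gene <- function(g) {
  dat <- data.frame(y = counts[g, ], group = group,
                    contam = contamination[g, ], lib = library_size)
  fit <- glm.nb(y ~ group + log(contam) + offset(log(lib)), data = dat)
  coef(summary(fit))["grouptrt", c("Estimate", "Std. Error", "Pr(>|z|)")]
}

# One row per gene: group effect, its standard error, and p-value.
results <- t(sapply(rownames(counts), fit_one_gene))
print(results)
```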

Best, Constantin

score 0 · Answer 2 · 2024-06-28

Gordon Smyth 53k

edgeR allows observation-specific offsets (as already pointed out), although I am not confident that would be a reliable solution to your problem.

edgeR also allows observation-specific weights as part of the quasi-likelihood pipeline, which might be relevant to downweight observations with contamination.

edgeR doesn't allow gene-specific predictors in the design matrix.

A different pipeline would be to use limma::voomaLmFit where predictor is set to your indirect measure. That doesn't correct for the contamination but it does do observation-specific modelling of accuracy with downweighting of more contaminated observations.
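
A sketch of how that call might look on simulated data. The object names are hypothetical, and the conversion of counts to log2 counts per million before calling `voomaLmFit` is an assumption made here, since the function works on log-expression values; the limma calls are guarded because the package may not be installed.

```r
set.seed(3)
n_genes <- 200
n_samples <- 12
group <- factor(rep(c("ctrl", "trt"), each = n_samples / 2))
design <- model.matrix(~ group)
counts <- matrix(rnbinom(n_genes * n_samples, mu = 50, size = 5), nrow = n_genes)
# Hypothetical genes x samples contamination estimates.
contamination <- matrix(runif(n_genes * n_samples, 0.5, 3), nrow = n_genes)

# Assumed preprocessing: log2 counts per million with a small offset,
# computed in base R so the transformation is explicit.
lib_size <- colSums(counts)
logcpm <- log2(t((t(counts) + 0.5) / (lib_size + 1) * 1e6))

if (requireNamespace("limma", quietly = TRUE)) {
  # The predictor has the same dimensions as the expression matrix and is
  # used to model observation-level precision, not to correct the means.
  fit <- limma::voomaLmFit(logcpm, design = design,
                           predictor = log(contamination))
  fit <- limma::eBayes(fit)
}
```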

Thank you for the advice! One clarification question on using predictor in limma::voomaLmFit: the documentation describes predictor as a measure of observation precision. My indirect measure is the predicted amount of contamination, so it scales up as contamination scales up. When using the predictor option, should I transform the matrix so that a high value means high precision (low contamination estimate), or am I misunderstanding?

Thank you again

Reply by Alec, 18 months ago

The predictor should be correlated with precision but it makes no difference whether the correlation is positive or negative. voomaLmFit will figure out the direction and strength of the correlation.

Reply by Gordon Smyth, 18 months ago