Question

Incorporating Salmon offset matrix into Wilcoxon DEG analysis

0

Entering edit mode

Andre • 0

@804722cd

Last seen 8 hours ago

Portugal

I´m working with this dataset(GSE234297) that provides a salmon counts file and a salmon offsets file(GSE234297_gene_raw_counts.txt.gz GSE234297_gene_offset_matrix.txt.gz) . The paper authors use them like this to do DEG:~

 # create DGEList
y <- DGEList(counts = counts, samples = samples, genes = genes, group = samples$Disease_status)

# add salmon offsets
y <- scaleOffset(y, offset = as.matrix(salmon_offset)) 

# filter low count genes 
keep <- filterByExpr(y, group = y$samples$group)
table(keep) 
y <- y[keep, , keep.lib.sizes=FALSE] 

# set design matrix
design <- model.matrix(~ Sex + GC_content + group, data = y$samples)
design

# estimate dispersion parameters using edgeR robust method
y <- estimateGLMRobustDisp(y, design)

But my question is if I want to use the salmon offsets for Wilcoxon (which can't use offsets directly), could I simply do:

Copy <- DGEList(counts)
y$offset <- offsets  
cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE)

To get properly normalized values that account for the Salmon bias corrections?

edgeR • 579 views

ADD COMMENT • link updated 42 minutes ago by Gordon Smyth 52k • written 5 days ago by Andre • 0

score 1 · Answer 1 · 2024-12-31

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 42 minutes ago

WEHI, Melbourne, Australia

To compute CPM values from edgeR:

y <- DGEList(counts = counts, samples = samples, genes = genes, group = samples$Disease_status)
y <- scaleOffset(y, offset = salmon_offset) 
cpms <- cpm(y)

Needless to say, using edgeR v4 quasi-likelihood would be much faster and better than Wilcoxon tests. I would be skipping the estimateGLMRobustDisp call shown in your code above and instead using:

fit <- glmQLFit(y, design, robust=TRUE)
ftest <- glmQLFTest(fit)
topTags(ftest)

BTW, please ask questions just once. I answered essentially the same question from you 3 hours ago, added as a comment to an old answer.

ADD COMMENT • link 5 days ago Gordon Smyth 52k

0

Entering edit mode

Thank you. What is the difference between setting robust=TRUE in glmQLFit and estimateGLMRobustDisp? The paper i mentioned in the post used both:

y <- estimateGLMRobustDisp(y, design) 

fit <- glmQLFit(y, design, robust=TRUE)

ADD REPLY • link 2 days ago Andre • 0

0

Entering edit mode

It is not possible to use both. estimateGLMRobustDisp() estimates negative binomial dispersions for each gene, whereas glmQLFit() estimates quasi-dispersions. estimateGLMRobustDisp() is for the older edgeR glm pipeline based on likelihood ratio tests. If you call the newer pipeline with glmQLFit(), then the gene-specific negative binomial dispersions will be discarded. See Chen et al (2024) for the differences between the pipelines.

How do you know that the paper used robust=TRUE? I see no mention of that in the published paper associated with the GSE234297 dataset.

Chen Y, Chen L, Lun ATL, Baldoni PL, Smyth GK (2024). edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. bioRxiv 2024.01.21.576131. doi:10.1101/2024.01.21.576131

Edit later

I have now remembered that estimateGLMRobustDisp() produces weights as well as NB dispersion, and the weights will definitely affect any quasi-likelihood analysis done downstream. I can't recommend combining estimateGLMRobustDisp with glmQLFit. The edgeR authors didn't intend that the two pipelines be combined in this way.

ADD REPLY • link 2 days ago • updated 2 hours ago Gordon Smyth 52k

0

Entering edit mode

They provide the code they used for DGE analysis in a gitlab link that can be found in the paper. In there you can find:

# estimate dispersion parameters using edgeR robust method
y <- estimateGLMRobustDisp(y, design) 

# visualise negative binomial dispersions 
plotBCV(y)

# fit a quasi-likelihood negative binomial GLM model to count data
fit <- glmQLFit(y, design, robust=TRUE)

ADD REPLY • link 1 day ago Andre • 0

0

Entering edit mode

Here, estimateGLMRobustDisp() was used just to get the diagnostic BCV plot by plotBCV(y). The robust NB dispersions were not used for the DE analysis.

In edgeR v3, used by the paper, glmQLFit() would have taken the trended NB dispersions from the estimateGLMRobustDisp() output, which would be same trended dispersions as from the standard dispersion estimate command estimateDisp(). In edgeR v4, the NB dispersions are being ignored entirely by glmQLFit() so you can simply skip estimateDisp() and friends, resulting in a considerable saving of computation time, because the NB dispersion steps is computationally expensive.

For a dataset with 100-200 samples, the whole edgeR v4 analysis should take well under a minute.

Edit later

I remember now that estimateGLMRobustDisp() produces weights, which will be used by the glmQLFit analysis. We don't recommend that sequence though.

ADD REPLY • link 1 day ago • updated 2 hours ago Gordon Smyth 52k

0

Entering edit mode

The estimateGLMRobustDisp() function seems to have an effect on glmQLFit(), even in edgeR 4 (4.4.1). If I use estimateGLMRobustDisp(), like this:

DGE_list_object <- estimateGLMRobustDisp(DGE_list_object, design) 


fit <- glmQLFit(DGE_list_object, design, robust=robust)

qlf <- glmQLFTest(fit)

print(summary(decideTests(qlf)))

I get 45 differentially expressed genes. However, if I omit estimateGLMRobustDisp():

DGE_list_object <- estimateGLMRobustDisp(DGE_list_object, design) 


fit <- glmQLFit(DGE_list_object, design, robust=robust)

qlf <- glmQLFTest(fit)

print(summary(decideTests(qlf)))

Also, there's quite a big difference between the results obtained for this dataset with edgeR 3 using the same code as in the paper (245 differentially expressed genes) and using the new QL pipeline in edgeR 4 (0 differentially expressed genes).

ADD REPLY • link 11 hours ago Andre • 0

0

Entering edit mode

Sorry, I had forgotten estimateGLMDisp() produces weights as well as NB dispersions. The weights from estimatGLMRobustDisp() were intended to be used by adjustedProfileLik() to estimate the NB dispersions. However, if you pass the DGE object to glmQLFit(), then glmQLFit() will interpret the weights a generalized linear model weights, with possibly unintended results. The edgeR authors did not intended that estimateGLMRobustDisp() be combined with glmQLFit().

BTW, I am a bit confused by the code in your comment. You have given two code sequences, but they both look to be exactly the same.

ADD REPLY • link 3 hours ago • updated 42 minutes ago Gordon Smyth 52k