Incorporating Salmon offset matrix into Wilcoxon DEG analysis
1
0
Entering edit mode
Andre • 0
@804722cd
Last seen 4 hours ago
Portugal

I´m working with this dataset(GSE234297) that provides a salmon counts file and a salmon offsets file(GSE234297_gene_raw_counts.txt.gz GSE234297_gene_offset_matrix.txt.gz) . The paper authors use them like this to do DEG:~

 # create DGEList
y <- DGEList(counts = counts, samples = samples, genes = genes, group = samples$Disease_status)

# add salmon offsets
y <- scaleOffset(y, offset = as.matrix(salmon_offset)) 

# filter low count genes 
keep <- filterByExpr(y, group = y$samples$group)
table(keep) 
y <- y[keep, , keep.lib.sizes=FALSE] 

# set design matrix
design <- model.matrix(~ Sex + GC_content + group, data = y$samples)
design

# estimate dispersion parameters using edgeR robust method
y <- estimateGLMRobustDisp(y, design)

But my question is if I want to use the salmon offsets for Wilcoxon (which can't use offsets directly), could I simply do:

Copy <- DGEList(counts)
y$offset <- offsets  
cpms <- edgeR::cpm(y, offset = y$offset, log = FALSE)

To get properly normalized values that account for the Salmon bias corrections?

edgeR • 540 views
ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen just now
WEHI, Melbourne, Australia

To compute CPM values from edgeR:

y <- DGEList(counts = counts, samples = samples, genes = genes, group = samples$Disease_status)
y <- scaleOffset(y, offset = salmon_offset) 
cpms <- cpm(y)

Needless to say, using edgeR v4 quasi-likelihood would be much faster and better than Wilcoxon tests. I would be skipping the estimateGLMRobustDisp call shown in your code above and instead using:

fit <- glmQLFit(y, design, robust=TRUE)
ftest <- glmQLFTest(fit)
topTags(ftest)

BTW, please ask questions just once. I answered essentially the same question from you 3 hours ago, added as a comment to an old answer.

ADD COMMENT
0
Entering edit mode

Thank you. What is the difference between setting robust=TRUE in glmQLFit and estimateGLMRobustDisp? The paper i mentioned in the post used both:

y <- estimateGLMRobustDisp(y, design) 

fit <- glmQLFit(y, design, robust=TRUE)
ADD REPLY
0
Entering edit mode

It is not possible to use both. estimateGLMRobustDisp() estimates negative binomial dispersions for each gene, whereas glmQLFit() estimates quasi-dispersions. estimateGLMRobustDisp() is for the older edgeR glm pipeline based on likelihood ratio tests. If you call the newer pipeline with glmQLFit(), then the gene-specific negative binomial dispersions will be discarded. See Chen et al (2024) for the differences between the pipelines.

How do you know that the paper used robust=TRUE? I see no mention of that in the published paper associated with the GSE234297 dataset.

Chen Y, Chen L, Lun ATL, Baldoni PL, Smyth GK (2024). edgeR 4.0: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets. bioRxiv 2024.01.21.576131. doi:10.1101/2024.01.21.576131

ADD REPLY
0
Entering edit mode

They provide the code they used for DGE analysis in a gitlab link that can be found in the paper. In there you can find:

# estimate dispersion parameters using edgeR robust method
y <- estimateGLMRobustDisp(y, design) 

# visualise negative binomial dispersions 
plotBCV(y)

# fit a quasi-likelihood negative binomial GLM model to count data
fit <- glmQLFit(y, design, robust=TRUE)
ADD REPLY
0
Entering edit mode

Here, estimateGLMRobustDisp() was used just to get the diagnostic BCV plot by plotBCV(y). The robust NB dispersions were not used for the DE analysis.

In edgeR v3, used by the paper, glmQLFit() would have taken the trended NB dispersions from the estimateGLMRobustDisp() output, which would be same trended dispersions as from the standard dispersion estimate command estimateDisp(). In edgeR v4, the NB dispersions are being ignored entirely by glmQLFit() so you can simply skip estimateDisp() and friends, resulting in a considerable saving of computation time, because the NB dispersion steps is computationally expensive.

For a dataset with 100-200 samples, the whole edgeR v4 analysis should take well under a minute.

ADD REPLY
0
Entering edit mode

The estimateGLMRobustDisp() function seems to have an effect on glmQLFit(), even in edgeR 4 (4.4.1). If I use estimateGLMRobustDisp(), like this:

DGE_list_object <- estimateGLMRobustDisp(DGE_list_object, design) 


fit <- glmQLFit(DGE_list_object, design, robust=robust)

qlf <- glmQLFTest(fit)

print(summary(decideTests(qlf)))

I get 45 differentially expressed genes. However, if I omit estimateGLMRobustDisp():

DGE_list_object <- estimateGLMRobustDisp(DGE_list_object, design) 


fit <- glmQLFit(DGE_list_object, design, robust=robust)

qlf <- glmQLFTest(fit)

print(summary(decideTests(qlf)))

Also, there's quite a big difference between the results obtained for this dataset with edgeR 3 using the same code as in the paper (245 differentially expressed genes) and using the new QL pipeline in edgeR 4 (0 differentially expressed genes).

ADD REPLY

Login before adding your answer.

Traffic: 351 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6