Question

How can I devise a method like DESeq2 to give statistics for paired multiple -omic datasets?

0

Entering edit mode

Christina • 0

@b8b06f4f

Last seen 8 months ago

United States

Hi, I am a PhD student seeking advice on how I can devise a method like DESeq2 to give statistics for paired multiple -omic datasets that are aligned with those produced on RNA-Seq count data by DESeq2 (one of the multi-omic datasets for paired samples), so that they can be more easily and justifiably compared.

I have tried to see how close a negative binomial glm run on the continuous dispersion-normalized RNA-seq data (simulating continuous LFQ intensities) gets to the DESeq2 results stats table (see below).

#DESeq2 output:
# GeneID   baseMean log2FoldChange     lfcSE     stat           pvalue            padj
# Actr5   193.914597    1.21224548      0.31072676   3.90132314 9.57E-05      0.00074757


thisGene.exp<-c(282.2568792,    278.5787192,    171.4072232, 108.2000439,   136.845039, 73.42960134)

Grouping <- c(1,1,1,2,2,2)

lmdat.l2=data.frame(exp=thisGene.exp,Grouping=Grouping)

#negative binomial generalized model
model <- tryCatch( MASS::glm.nb(exp ~ Grouping, data = lmdat.l2), error= function(e) lm(exp ~ Grouping,data = lmdat.l2) )
SE=sqrt(diag(vcov(model))[2])
LFC=mean(lmdat.l2$exp[which(Grouping==names(table(Grouping))[1])])-mean(lmdat.l2$exp[which(Grouping==names(table(Grouping))[2])])
stat=LFC/SE
lowerTail=if(LFC<0) { TRUE } else { FALSE }
pvalue=pnorm(LFC/SE,lower.tail=lowerTail)*2
if(pvalue>1) pvalue=1

c(Gene=this.gene,baseMean=mean(2^lmdat.l2$exp),LFC=LFC,lfcSE=SE,stat=stat,pvalue=pvalue)
# Gene               baseMean                    LFC         lfcSE.Grouping          stat.Grouping 
# "Actr5" "1.66841108419978e+84"     "137.922712453333"    "0.192871498086593"     "715.101577068743" 
# pvalue.Grouping 
# "0"

Additionally, would you defend an argument that different statistics for the different -omics data makes sense and that the methods for calculating the statistics do not have to be aligned in this way to compare the datasets?

Thank you for your time and consideration.

DESeq2 • 558 views

ADD COMMENT • link updated 8 months ago by edammer • 0 • written 8 months ago by Christina • 0

score 0 · Answer 1 · 2023-10-25

0

Entering edit mode

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 2 days ago

United States

This support site is meant to help people with technical questions about Bioconductor packages, not help with methods development. I am not sure I would personally ask for methods development from random strangers on the internet, but if that's what you want to do, I think biostars.org is a better place for that sort of thing.

ADD COMMENT • link 8 months ago James W. MacDonald 66k

0

Entering edit mode

I sent Christina here to post with the intent of showing for a population of transcriptomic measurements, how the negative binomial glm that underpins DESeq2 handles both the integer counts data and continuous data (normalized counts/size factors) is comparable. Then, anticipating common criticism by naive reviewers of multi-omic analyses, we are hoping to get feedback on the appropriateness of (i.e., the answer to "why not...") using statistics derived from the glm.nb or glm(..., family=negative.binomial(theta)) functions on continuous omic abundance data.

Ignoring the complexity of estimating dispersion (we use DESeq2's estimates), we find that with a few exceptions of outlier genes, the negative binomial glm works to produce highly correlated log2FC and lfcSE estimates for the Wald statistic calculation. Shown below are correlations for our actual RNA Seq count integer data, and the corresponding counts/sizeFactors continuous data used as input.

correlation of stats underlying Z stat for Wald Test

And the Z statistic calculations using integer vs. continuous quantification correlate, with a minority of transcripts fanning out from y=x: correlation of Z statistics performed using actual RNA-Seq data pushed through DESeq2 or independently using the negative binomial glm calculation

To allow readers a reproducible example, I put together this R code to generate simulated count data with a distribution and dispersions roughly similar to actual 17000 gene RNA-Seq count data that we are working with, which generates PDF graphical outputs (scatter plots) of all comparisons that can be made:

Multi-omic Harmonized Wald Test Pilot R code with Simulated Count Data

Correlations in outputs generated by the above code, in full do not look much worse than those for the real transcriptomic data comparison of continuous to integer count input. Correlations of simulated RNA-Seq data contributing calculations to Wald Z

Z stat correlations for simulated RNA-Seq data

The log2FC calculated using mean of log2(counts) difference across groups is well correlated as a population, especially given we are simulating a comparison of two groups with n=3: Log2fold-change using classic calculation vs negative binomial (DESeq2 approach)

And generally, this alternate calculation ratioed with the negative binomial-derived lfcSE (again) produces correlated Wald Z statistics: Correlation of standard and alternate Z stat calculations

In conclusion, we ask what should we be worried about in adapting and refining the above approach to generate harmonized statistics for both DESeq2 analysis of RNA-Seq counts, and paired proteomic data which as LFQ can have missing values imputed, but is continuous? In other words, why shouldn't we attempt to harmonize statistics for integer counts and continuous paired sample data for different -omics on the same samples?

ADD REPLY • link 8 months ago edammer • 0