Question: DEXSeq dispersion outlier
0
3 months ago by
fiona.dick9110
fiona.dick9110 wrote:

Hi,

I have been using DEXSeq with transcript counts from salmon (as suggested in e.g. : https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0862-3 ) to investigate DTU between conditions case vs control. I followed a workflow described in: https://f1000research.com/articles/7-952/v3 . When I plotted a volcano plot with the p-values from the DEXSeq result object and its estimated log2fold change (from DEXSeq::estimateExonFoldChange), i saw some extreme p-value outliers. When I plotted the counts (fitted and observed) I saw that there was no visible change at all between conditions. When I looked at the pvalues of DRIMSeq for those transcripts, they were far from significant (~0.5). I saw that those transcripts (among others) were labelled by DEXSeq as dispOutliers (which I assume means dispersion outliers?!). I attached the volcano plots and a rds object which holds the 6 outlier transcripts and the mcols from DEXSeq. What could be the reason for these extreme p-values when there clearly seems to be no significance. I aimed at remodelling everything by hand for those outliers to see if I get similar lrt statistics but I am not completely sure which values to use from mcols(DEXSeqobj) to retrieve correct variable coefficients for the samples.

countplot for most extreme pvalue: https://drive.google.com/file/d/1h0Ch_5v-OMLkypU9KdotPDt-Ei0NEQCl/view?usp=sharing

( I extracted the fitted counts as follows:

fitted.common.scale = as.data.frame( t(t(assays(DEXSeqObj)[["mu"]][,1:49])/DEXSeqObj$sizeFactor)) colnames(fitted.common.scale) <- sampleAnnotation(DEXSeqObj)$sample_id


Im not sure if this is the right way to do it, I referred to this post : https://support.bioconductor.org/p/60567/ (which was about DESeq2))

DEXSeq workflow:

covariates<-lapply(covariates,function(c){paste0("+ ",as.character(c),":exon")})

design<-formula(paste("~sample + exon ",stringi::stri_flatten(covariates)," + condition:exon"))
print("DESGIN:")
print(design)
reduced_design<-formula(paste("~sample + exon ",stringi::stri_flatten(covariates)))
print("DESGIN reduced:")
print(reduced_design)

sample.data <- DRIMSeq::samples(d)
count.data <- round(as.matrix(DRIMSeq::counts(d)[,-c(1:2)]))
dx <- DEXSeqDataSet(countData=count.data,
sampleData=sample.data,
design=design,
featureID=counts(d)$feature_id,groupID=counts(d)$gene_id)
#~sample + exon + rin:exon + condition:exon,

system.time({
dx <- estimateSizeFactors(dx)
dx <- estimateDispersions(dx,quiet=FALSE,maxit=500)
dx <- testForDEU(dx,reduced=reduced_design)
dx <- estimateExonFoldChanges(dx, fitExpToVar="condition")


[1] "DESGIN:" ~sample + exon + rin:exon + ageyears:exon + sex:exon + cohort:exon + oligodendrocytes:exon + condition:exon [1] "DESGIN red:" ~sample + exon + rin:exon + ageyears:exon + sex:exon + cohort:exon + oligodendrocytes:exon converting counts to integer mode the design formula contains a numeric variable with integer values,
specifying a model with increasing fold change for higher values.
did you mean for this to be a factor? if so, first convert this variable to a factor using the factor() function

dexseq pvalue dispersion • 183 views
modified 3 months ago • written 3 months ago by fiona.dick9110

Can you try without e.g. rin:exon? I don't typically put RIN in as a covariate.

yes, I did try. I also changed the volcano plot to use the condition coefficient as l2fc instead of the result of DEXSeq::estimeExonFoldChange. There still remain some outliers but its not the same transcripts. Here you can find both volcano plots. The one resulting from analysis with and one without rin as covariate. We found that our main axis of variation is significantly associated with RIN and adding it as a covariate has been suggested by other studies at least for DGE analyisis (https://www.biostars.org/p/312942/https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-12-42 , https://www.pnas.org/content/pnas/114/27/7130.full.pdf )

0
3 months ago by
Alejandro Reyes1.7k
Dana-Farber Cancer Institute, Boston, USA
Alejandro Reyes1.7k wrote:

Hi Fiona,

Thanks for your detail report! It is interesting what you are describing. For these points that with extreme values, have you tried plotting the (log) counts per transcript for those genes in each sample?

A possible explanation for this is that of the estimation of log2fold changes from DEXSeq does not take into account all the covariates from your model. It only takes into account "condition" If the most of the variability is driven by the other covariates of your model, this might be hiding the "condition" effect.

Hi,

• I have attached the plot you have described for the six outliers that i have reported above (in the .rds file): https://drive.google.com/file/d/1VKSjamZMHjLYWsYL6YH3t_zrABNRSaQ-/view?usp=sharing ( counts from DEXSeq::featureCounts, added pseudocount and the took the log)
• Apart from the volcano plot i am not using the l2fc from DEXSeq::estimateExonFoldChange anywhere. The extreme reported p-value resulted from the model where all covariates were taken into account. When I plot the counts (or relative abundances) of the fitted values...and also the raw observed ones...there is nothing. And more importantly, DRIMSeq shows a p-value of 0.5 for this transcript (vs DEXSeq a pvalue of smaller than 1*10^-50). Furthermore, out of thousands of transcripts, there is only 6 such outliers...and they even have the similar lg2fc...all of this seems to me really suspicious. Can I trust the rest of my results and do I filter those "artifacts" or is there a general mistake somewhere?
• what is the consequence for a transcript to be labelled as a dispOutlier (=TRUE)?
• you say DEXSeq::estimateExonFoldChange only models with the variable given to the function and doesnt take my covariates into account, would you suggest using the modeled coefficient of the condition variable instead (for generating the volcano plot)? (i.e. mcols(DEXSeq_obj)\$exonthis.condition1 )

About points 1 and 2, since you have many covariates (rin, ageyears, sex, cohort, oligodendrocytes), it will be hard to distinguish how each of these contribute to the individual counts. It is also hard to say that there is nothing. What happens if you remove all those covariates and just test the condition variable? Are those points still significant? Even though, the distributions are not identical. For example, the single transcript counts that you shared for ENSG00000254681, there might be a contribution of the condition. In general, I don't think there is a mistake and taking those things that are significant both in DRIMseq and DEXSeq is a good (conservative) way to go. Generally speaking, there should be a good agreement between DRIMseq and DEXseq.

About point 3 (dispersion outlier flag), good question. DEXSeq internally uses DESeq2. I ran the DESeq2 examples with the default and I see the same behavior.

About your last point, definitively. Actually, your comment made me wonder why is the current implementation fitting another round of GLMs to get exon fold changes. I need to revisit why this is the case, but it is likely that there is no need to fit something separately to get exon fold changes!

• the counts from DEXSeq::featureCounts are the raw ones and not the fitted ones, right?
• Sorry for asking again, I dont understand your point with the covariates. We neither see sth when looking at the fitted nor the raw counts, right? And even if there d be sth, the dispersion is so big. How come we have such an EXTREME p-value..that I would interpret as a transcript I can be really confident of. Whereas I have other transcripts that are also significant with much higher p-value (see volcano plot), that have much lower dispersion and that show a clear change.
• I read on a forum somewhere (cant find the link anymore) that DESeq2 doesnt shrink transcripts that have been labelled as dispOutliers but Im not sure I remember correctly