Dear list,
From version 1.16.7 the DEXSeq function: estimateExonFoldChanges does not work properly, i.e., it breaks samples and counts relation as depicted below:
>count_matrix["ENSG00000180155:008",]
s1 s3 s4 s5 s6 s8 s9 s10
ENSG00000180155:008 102 32 131 12 161 34 14 595
>sample_data
condition
s1 Non-Metastatic
s3 Non-Metastatic
s4 Non-Metastatic
s5 Non-Metastatic
s6 Metastatic
s8 Metastatic
s9 Metastatic
s10 Metastatic
>DEXSeqObject<-DEXSeqDataSet(count_matrix, sample_data, design, featureID=aux[,"featureID"], groupID=aux[,"groupID"])
>DEXSeqObject<-estimateSizeFactors(DEXSeqObject)
>DEXSeqObject<-estimateDispersions(DEXSeqObject, maxit=500)
>fullModel<- ~ sample + exon +condition:exon
>reducedModel<- ~ sample + exon
>DEXSeqObject<-testForDEU(DEXSeqObject, fullModel=fullModel, reducedModel=reducedModel)
>DEXSeqObject<-estimateExonFoldChanges(DEXSeqObject, fitExpToVar="condition", denominator="Non-Metastatic")
>myresults<- DEXSeqResults( DEXSeqObject )
For ezample, for the "ENSG00000180155" gene, exon 008 we found:
>myresults["ENSG00000180155:008",]
LRT p-value: full vs reduced
groupID featureID exonBaseMean dispersion
<character> <character> <numeric> <numeric>
ENSG00000180155:008 ENSG00000180155 008 100.024 0.1793136
stat pvalue padj Non.Metastatic
<numeric> <numeric> <numeric> <numeric>
ENSG00000180155:008 54.08763 1.917445e-13 2.158161e-08 7.254149
Metastatic log2fold_Metastatic_Non.Metastatic genomicData
<numeric> <numeric> < GRangesList>
ENSG00000180155:008 11.02504 0.6039062 ########
countData
<matrix>
ENSG00000180155:008 102 32 131 ...
The log2fold_Metastatic_Non.Metastatic is 0.6039062. But examining into the estimateExonFoldChanges function (run step by step):
> .....
frm <- as.formula(paste("count ~", fitExpToVar, "* exon"))
>....
>mf <- object@modelFrameBM
>geteffects <- function(geneID){
rt <- groups %in% geneID & notNAs
if( sum(rt) < 2 ){ return(NULL) }
countsThis <- countsAll[rt,]
rownames(countsThis) <- gsub("\\S+:", "", rownames(countsThis))
dispsThis <- disps[rt]
names(dispsThis) <- features[rt]
numexons <- sum(rt)
newMf <- mf[as.vector( sapply( split( seq_len(nrow(mf)), mf$sample ), "[", seq_len( numexons ) ) ),]
newMf$exon <- factor( rep( features[rt], numsamples ) )
newMf$dispersion <- dispsThis[match(newMf$exon, names(dispsThis))]
newMf$count <- as.vector( countsThis )
newMf <- droplevels(newMf)
coefficients <- fitAndArrangeCoefs( frm, balanceExons = TRUE, mf=newMf, maxRowsMF=maxRowsMF)
if (is.null(coefficients)) {
return(coefficients)
}
ret <- t( getEffectsForPlotting(coefficients, averageOutExpression = TRUE,
groupingVar = fitExpToVar))
rownames(ret) <- paste(geneID, rownames(ret), sep = ":")
return(ret)
}
alleffects <- bplapply( testablegenes, geteffects, BPPARAM=BPPARAM )
If for example the geteffects function is run only using geneID="ENSG00000180155" we can found:
>newMf
sample exon condition run sizeFactor dispersion count
s1 008 Non-Metastatic 1 1.2725022 0.1793136 102
s10 008 Metastatic 0 1.7272164 0.1793136 32
s3 008 Non-Metastatic 1 0.5398603 0.1793136 131
s4 008 Non-Metastatic 1 0.9582068 0.1793136 12
s5 008 Non-Metastatic 0 0.7174887 0.1793136 161
s6 008 Metastatic 1 1.4029030 0.1793136 34
s8 008 Metastatic 0 0.8922834 0.1793136 14
s9 008 Metastatic 1 1.4042222 0.1793136 595
Note that only the sample "s1" have its count properly assigned (remember count_matrix values showed at the beginning). The others samples have their count values mixed. Thus, counts are mixed between conditions too, i.e. condition "Non-Metastatic" should have counts 102 ("s1"), 32 ("s3"), 131 ("s4") and 12 ("s5") and condition "Metastatic" should have 161 ("s6"), 34 ("s8"), 14 ("s9") and 595 ("s10"). Thus the log2FC is erroneously computed.
We found that this confusion is produced when mf is cut in order to build the newMf object because the new object have its sample column (factor) reordered in an alphabetic way follow the factor levels. Thus sample column is :
>newMf$sample
[1] s1 s10 s3 s4 s5 s6 s8 s9
Levels: s1 s10 s3 s4 s5 s6 s8 s9
instead of s1 s3 s4 s5 s6 s8 s9 s10 like sample_data row names.
I suggest add a single code line into the estimateExonFoldChanges in order to avoid this problem. In this method, after mf creation I suggest set the levels of sample columns using the row names of the sample_data design matrix as:
>mf <- object@modelFrameBM
##add this in order to conserve sample-counts relation
>mf$sample<-factor(mf$sample, levels=unique(mf$sample))
or into the geteffects function, control the counts assignation as
>newMf$counts<-as.vector(countsThis[,as.character(unique(newMf$sample))])
Now, the newMf results as:
sample exon condition run sizeFactor dispersion count
s1 008 Non-Metastatic 1 1.2725022 0.1793136 102
s3 008 Non-Metastatic 1 0.5398603 0.1793136 32
s4 008 Non-Metastatic 1 0.9582068 0.1793136 131
s5 008 Non-Metastatic 0 0.7174887 0.1793136 12
s6 008 Metastatic 1 1.4029030 0.1793136 161
s8 008 Metastatic 0 0.8922834 0.1793136 34
s9 008 Metastatic 1 1.4042222 0.1793136 14
s10 008 Metastatic 0 1.7272164 0.1793136 595
And the corresponding DEXSeq results after call the corrected extimateExonFoldChanges function for gene is:
groupID featureID exonBaseMean dispersion
<character> <character> <numeric> <numeric>
ENSG00000180155:008 ENSG00000180155 008 100.024 0.1793136
stat pvalue padj Non.Metastatic
<numeric> <numeric> <numeric> <numeric>
ENSG00000180155:008 54.08763 1.917445e-13 2.158161e-08 5.240263
Metastatic log2fold_Metastatic_Non.Metastatic genomicData
<numeric> <numeric> <GRangesList>
ENSG00000180155:008 11.63103 1.150268 ########
countData
<matrix>
ENSG00000180155:008 102 32 131 ...
As can be observed the log2FC value now is 1.150268 correctly computed between Metastatic vs Non-Metastatic samples. Obviously, the condition means were also corrected.
I hope that this code error can be shortly solved because it cannot be easily detected given that DEXSeq doesn't return an error message and a lot of people uses this software everyday.
Best,
Gabriela