Hello,
I have counts from 32 RNA-seq samples and am trying to study drug effect using DESeq2. I have some trouble of understanding the dispersion estimation after applying a set of customized normalization factors.
First I used raw counts as input and called DESeq function directly. DESeq2 estimated size factor and dispersion automatically. I found that fitted values reflected the regression of gene-wised dispersion. Although final value looked almost exact the same as gene-wised dispersion, from my understanding it's fine as I have quite a lot samples. Please correct me if I'm wrong. The figure links to here:
https://drive.google.com/file/d/0B0LC8cC9lS-ycVU3cTdGcy1RY0k/view?usp=sharing
Then I used a set of gene-wised normalization factors. In order to keep the raw counts at a similar level, those normalization factors have been divided by their geometric means. However, after dispersion estimation, I found that the fitted values seemed to be overestimated, as most of gene-wised dispersions were under the line. And final adjustment didn't lift them toward the line. The figure links to here:
https://drive.google.com/file/d/0B0LC8cC9lS-yVkRkeUM4aGRZT28/view?usp=sharing
I also tried to calculate fitted value locally, it changed almost nothing on the overall pattern.
Is this because that the normalization factors here breaks the rule of sharing information across genes? Any help is really appreciated. Codes attached below:
library(DESeq2)library(matrixStats)
gm_mean = function(x, na.rm=TRUE, zero.propagate = FALSE){ if(any(x < 0, na.rm = TRUE)){ return(NaN) } if(zero.propagate){ if(any(x == 0, na.rm = TRUE)){ return(0) } exp(mean(log(x), na.rm = na.rm)) } else { exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x)) } }
group=read.table("conditions_new.txt",header=T) group$Phenotype=relevel(group$Phenotype,"wt") group$Treatment=relevel(group$Treatment,"ctrl")
d=read.table("SEMatrix.txt",header=T,row.names=1,check.names=F) d=d[,2:dim(d)[2]] #get rid of the annotation column d=as.matrix(d)gc=d[,1:(dim(d)[2]/2)] tc=d[,(1+dim(d)[2]/2):dim(d)[2]] #Raw normalization factors
dd = DESeqDataSetFromMatrix(countData = gc, colData = group, design = ~1) gm=apply(tc,1,gm_mean) #Calculate geometric mean for normalization factors normFactors <- tc / gm #dividing geometric means normFactors[normFactors==0]=1 #Set normalization factors of 0 to 1 normalizationFactors(dd) <- normFactors #insert normalization factors back to DESeq2
design(dd) = ~Phenotype*Treatment+Gender+Batch
dd = estimateDispersions(dd)
plotDispEsts(dd)
Thanks a lot,
Dadi
Thanks Michael, I've added my code. I tried to re-start R session and re-run from the first line attached. It still gave me the same plot.
hi,
I see now what is going on (kind of) but I have more questions. So the major issue is that whatever normalization factors you are using are removing all biological variance from the data, such that the counts look like Poisson for most genes (most of the gene-wise dispersion estimates are essentially 0. This is not likely to happen with RNA-seq data, which typically has an abundance of overdispersion relative to Poisson, and I would suspect some issues with the combination of methods leading you to this point. How did you build the normalization factors?
Hi Michael. I was trying to simplify my question so I didn't make it clear. I'm doing exon inclusion rate comparison between conditions. The detail is:
1) my 'gc' matrix in the code records the number of spliced reads that connect an exon of interest to its two flanking exon neighbors.
2) then I calculate number of spliced reads that connect the two flanking neighbors together (skip the exon of interest).
3) my 'tc' matrix is the sum of 1) and 2). I used 'tc' as raw normalization factor. I was planning to use them as a offset in the GLM, in which way I think would give me the log-transform inclusion ratio of the exon of interest. And because both of my 'tc' and 'gc' matrix have the same size factor from the same RNA-seq library, I think the offset process cancels that out.
I learnt from DESeq2 manual and other discussion threads that normalization factors should have a row-wise geometric mean near 1 to keep normalized counts close to the mean of unnormalized counts. So I divided the geometric mean of each row. I guess I might break the some distribution assumption here.
I understand DEXSeq uses exon of interest as an additional variable in the GLM. I didn't use it because I have the interaction of Phenotype and Treatment as my interest in the formula. Adding an extra "exon" variable makes it complicated.
hi,
I wouldn't double dip on the counts (both going into the "counts" slot and into the "normalizationFactors" slot). This will definitely lead to the problem you have, where the data are no longer negative binomial distributed.
Instead, if you want to test for exon inclusion, I would set this up either by using DEXSeq (probably the easiest approach), or if you really need to test on PSI, you could put the exon spanning counts vs flanking exon spanning counts as consecutive columns in a count matrix, and then use a DESeq2 setup as described here:
http://rpubs.com/mikelove/ase
Crystal clear like always. Thanks heaps Micheal!