DESeq2 high fold change values for comparison of samples with zero counts

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 12 hours ago

United States

hi Noa, Let's keep âall of theâ discussion on âthe Bioconductor â list, so that other users can follow the trail if they encounter âa similar situationâ . On Wed, Jan 15, 2014 at 8:10 AM, Noa Henig <ntzunz@gmail.com> wrote: > > Hi Mike, > I was looking at the leveling as you extracted it and I might have done a mistake with the naming of the sample (I'm checking it now), I apologize. > Regarding your suggestion for running in few separate runs, it does sound that running in few different batches makes it easier to work, but I assumed that the dispersion is influenced by the number of samples that are included. âIt's true that dispersion is estimated ââ using all the samples â.â > > Let's say I don't have any prior information whether the treatment affects in a similar or opposite way in different cell types, is it still right to run DESeq2 on separate groups of cell types? âIâ t's statistically valid either to run âone big modelâ â or separately â for each cell type. It's up to you given the advantages and disadvantages I outlined in the previous emailâ . âIâ f you run them all together â and are interested in the condition effect for each cell typeâ , I would set âthe designâ up like so: design = ~ celltype + condition + time â2â + celltype:condition + celltype:time â2 âAs I understand, you don't have a separate set of â â âcontrol samples at time 1 or 2 or a set of treated samples at time 0. I would then recommend using a condition variable, and a time2 variable (FALSE if time 0 or 1 and TRUE if time 2), to estimate any changes after time1. The interactions allow you to test for changes in the condition for different cell types and for changes from time 2 over time 1 for different cell types. If the interaction terms for, say, celltype:condition are small, this tells you that one condition effect explains the differences across all cell types. Mike On Tue, Jan 14, 2014 at 6:02 PM, Michael Love <michaelisaiahlove@gmail.com>wrote: > hi Noa, > > When I look up the column data for the dataset you sent, I don't see the > samples grouped as I was expecting from your email. > > I show below how it is possible to email the column data, even if it is > not for public viewing, by "scrubbing" the sample and factor level names. > (Note though that this does not change the results names stored in the > DESeqDataSet though, so you have to use the original factor level names > used when DESeq() was run, in order to extract results). > > > library(DESeq2) > > load("~/Downloads/example.RData") > > dim(ddsSub) > [1] 1 45 > > nlevels(colData(ddsSub)$condition) > [1] 26 > > colnames(ddsSub) <- paste0("sample",seq_len(ncol(ddsSub))) > > colnames(ddsSub) > [1] "sample1" "sample2" "sample3" "sample4" "sample5" "sample6" > [7] "sample7" "sample8" "sample9" "sample10" "sample11" "sample12" > [13] "sample13" "sample14" "sample15" "sample16" "sample17" "sample18" > [19] "sample19" "sample20" "sample21" "sample22" "sample23" "sample24" > [25] "sample25" "sample26" "sample27" "sample28" "sample29" "sample30" > [31] "sample31" "sample32" "sample33" "sample34" "sample35" "sample36" > [37] "sample37" "sample38" "sample39" "sample40" "sample41" "sample42" > [43] "sample43" "sample44" "sample45" > > levels(colData(ddsSub)$condition) <- > paste0("lvl",seq_len(nlevels(ddsSub$condition))) > > levels(colData(ddsSub)$condition) > [1] "lvl1" "lvl2" "lvl3" "lvl4" "lvl5" "lvl6" "lvl7" "lvl8" > "lvl9" > [10] "lvl10" "lvl11" "lvl12" "lvl13" "lvl14" "lvl15" "lvl16" "lvl17" > "lvl18" > [19] "lvl19" "lvl20" "lvl21" "lvl22" "lvl23" "lvl24" "lvl25" "lvl26" > > The normalized counts and conditions listed in your email do not seem to > correspond in the condition column. You had: > > T0: 29852 and 39519; T1: 37352 and 44385; T3:52539 and 58,556. >> The log2 fold change of T0/T1 was 3.28 (p-adjusted was 0.04), while the >> log2 fold change of T2/T0 was 2.96 (p-adjusted was 0.105) > > > The normalized counts alongside the condition: > > > nc <- round(counts(ddsSub,normalized=TRUE)[1,]) > > data.frame(condition=colData(ddsSub)$condition, normcounts=nc) > condition normcounts > sample1 lvl1 4267 > sample2 lvl2 4285 > sample3 lvl3 5235 > sample4 lvl4 2406 > sample5 lvl5 3312 > sample6 lvl4 7117 > sample7 lvl5 3356 > sample8 lvl1 2375 > sample9 lvl2 5355 > sample10 lvl3 3746 > sample11 lvl6 284 > sample12 lvl7 8375 > sample13 lvl8 313 > sample14 lvl6 3865 > sample15 lvl7 400 > sample16 lvl9 9866 > sample17 lvl10 223 > sample18 lvl11 370 > sample19 lvl9 365 > sample20 lvl10 431 > sample21 lvl11 452 > sample22 lvl12 25262 > sample23 lvl13 18322 > sample24 lvl14 23999 > sample25 lvl12 20305 > sample26 lvl13 35235 > sample27 lvl14 31593 > sample28 lvl15 32076 > sample29 lvl16 32923 > sample30 lvl17 40614 > sample31 lvl18 42732 > sample32 lvl19 36969 > sample33 lvl20 45901 > sample34 lvl21 29852 > sample35 lvl22 39520 > sample36 lvl23 37353 > sample37 lvl24 44386 > sample38 lvl25 52540 > sample39 lvl26 58556 > sample40 lvl18 16779 > sample41 lvl19 20490 > sample42 lvl20 18864 > sample43 lvl15 17213 > sample44 lvl16 16181 > sample45 lvl17 16639 > > The normalized counts for 29852 and 39519 are in different levels: lvl21 > and lvl22 > > 37352 and 44385 are in lvl23 and lvl24 > > 52539 and 58,556 are in lvl25 and lvl26 > > If I extract the contrast for, say, lvl23 vs lvl21 (using but scrubbing > your original level names), the log fold change (having run DESeq with > betaPrior=FALSE) is equal to the log ratio of the condition means: > > > results(ddsSub, contrast=c("condition","lvl23","lvl21")) > DataFrame with 1 row and 6 columns > baseMean log2FoldChange lfcSE stat pvalue padj > <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> > #### 18237.81 0.3233761 1.492571 0.216657 0.8284756 0.8284756 > > > log2(mean(nc[colData(ddsSub)$condition == "lvl23"]) / > mean(nc[colData(ddsSub)$condition == "lvl21"])) > [1] 0.3233965 > > > It might be easier for you to subset the DESeqDataSet to smaller groups > before running DESeq(), and to examine > as.data.frame(colData(dds)$condition) before running to make sure that the > sample and conditions line up as you expect. Maybe I should address the > question: what's the advantage/disadvantage to running all the samples in > one go, vs subsetting and running separately? One advantage of 'all in one' > is convenience, in that any contrast can be extracted after the model has > run. Another advantage is that if there are replicates for other > experimental conditions, and we expect that the dispersion within these > experimental conditions is similar to the dispersion within the conditions > we are interested in, then estimating the dispersion altogether should > provide better estimates. However, this assumes that the dispersion will be > similar across these groups, and that you have replicates for these > "out-group" conditions. If that is not the case, I recommend splitting the > analysis into subgroups of interest. A disadvantage of 'all in one' is that > the running time is longer, because for each gene we have to find the > maximum of the likelihood or posterior in a high dimensional space. It's > much faster to find this maximal point in a lower dimensional space (when > there are only a few variables, and a few levels per variable). Another > disadvantage of 'all in one' is that it becomes a bit more difficult to > keep track of samples and conditions. > > Regarding the email from your coworker (sorry to have missed it, I > couldn't find this email in searching the Bioc mailing list) > > I also wanted to raise an additional question that my colleague sent few >> weeks ago to the bioC list and was skipped. It's related to version 1.2.5 >> but still we would like to better understand the following issue: >> A Dataset contains 4 different conditions with 3 replicates each. >> In one of the genes examined, the raw counts of all 3 replicates of two >> conditions (I will call them A and B) are zero (while other conditions >> received non-zero values). >> When viewing the results given by the DESeq2 package (after building the >> data set from a matrix, estimating size factors and dispersions, using the >> Wald test and the function results with contrast) I see the following: >> In the comparison : A vs. B, even though all samples received zero >> counts: log2FoldChagne = -1.08976 >> p-value = 0.3044 >> padj = 0.36323 >> The threshold set by the independent filtering for that comparison was >> around 1. If all replicates of conditions A and B received zero counts, >> shouldnât the independent filtering filter this gene from this comparison? >> (meaning the padj value should be NA) or is the condition checked by the >> independent filtering refers to all samples in the data (and not only those >> of the conditions being compared), even though the threshold is set for >> each comparison separately? > > > That the log2FoldChange is not equal to zero for this contrast, can be > answered by the thread I originally posted. ( > > http://permalink.gmane.org/gmane.science.biology.informatics.conduct or/51749 > ). In DESeq2 version 1.2.x, when a DESeqDataSet is constructed with > factors in the design with 3 or more levels, we recommend setting > betaPrior=FALSE. (this is printed as a message in the construction of the > DESeqDataSet in the updated release version since November). Then there > might be a nonzero log2FoldChange, but the standard error will be very > large, as the profile likelihood for this coefficient is sloping toward > negative infinity, so there will not be anything near significance. In > DESeq2 version 1.3.x (devel version which will be released in Spring 2014) > we have a comprehensive solution for using a beta prior when factors are > present in the design formula with 3 or more levels. > > Yes, the independent filtering operates on the mean of the normalized > counts across *all* samples in the DESeqDataSet. The particular threshold > is set by optimizing the number of rejections based on the adjusted > p-values from the test currently being generated. However, such an > operation as you describe can be easily performed by supplied any kind of > independent filtering statistic to the 'filter' argument of the results > function. For filtering based on the mean of only the two conditions > currently being investigated, this would look like: > > > dds <- makeExampleDESeqDataSet() > > colData(dds)$condition <- factor(rep(c("a","b","c"),each=4)) > > dds <- DESeq(dds) > estimating size factors > estimating dispersions > gene-wise dispersion estimates > mean-dispersion relationship > final dispersion estimates > fitting model and testing > > bcMean <- rowMeans(counts(dds,normalized=TRUE)[,colData(dds)$condition > %in% c("b","c")]) > > results(dds, contrast=c("condition","c","b"), filter=bcMean) > DataFrame with 1000 rows and 6 columns > baseMean log2FoldChange lfcSE stat pvalue > padj > <numeric> <numeric> <numeric> <numeric> <numeric> > <numeric> > feature1 64.518521 -0.28198624 0.3521963 -0.8006507 0.4233339 > NA > feature2 3.954962 0.45020401 0.6358071 0.7080827 0.4788939 > NA > feature3 50.301443 0.04948108 0.4502842 0.1098886 0.9124978 > NA > feature4 77.491191 -0.34507847 0.3806662 -0.9065119 0.3646649 > 0.917873 > feature5 76.342876 -0.24041099 0.3612686 -0.6654633 0.5057542 > 0.917873 > ... ... ... ... ... ... > ... > feature996 2.614749 -0.3276690 0.6030464 -0.5433561 0.5868846 > NA > feature997 9.856852 -0.4837603 0.5661241 -0.8545128 0.3928209 > NA > feature998 5.403219 0.4152869 0.6299118 0.6592778 0.5097174 > NA > feature999 9.738524 -0.1803672 0.5557705 -0.3245354 0.7455327 > NA > feature1000 40.187737 0.6003838 0.4249903 1.4127001 0.1577439 > NA > > Mike > > > > > On Sun, Jan 12, 2014 at 11:39 AM, Michael Love < > michaelisaiahlove@gmail.com> wrote: > >> hi Noa, >> >> Could you email me a small reproducible example? e.g.: >> >> ... >> dds <- nbinomWaldTest(dds, betaPrior = FALSE) >> idx <- 1234 # the row you are referencing >> ddsSub <- dds[idx, ] >> save(ddsSub, "small_example.RData") >> >> best, >> >> Mike >> >> >> >> On Sun, Jan 12, 2014 at 9:03 AM, Noa Henig <ntzunz@gmail.com> wrote: >> >>> Hi Simon, >>> I forgot to mention that the counts are the *normalized* counts, I guess >>> this is the reason you asked for the size factors. >>> I ran DESeq2 again using the code below (this time with beta prior = >>> FALSE >>> to handle the issue of zeros and low counts I originally had). The fold >>> change in this run was a little higher than before: >>> log2FC (T1/T0) = 3.97; log2FC (T3/T0) = 3.64 >>> How can this be explained considering the values of the normalized counts >>> for that gene? should I do some modification to adjust also for the high >>> counts? >>> Thanks ! >>> Noa >>> >>> >>> This is the code: >>> countData = read.table("merged_counts.txt", header=TRUE, row.names=1) >>> colData = data.frame(row.names=colnames(countData), condition = >>> c(param[,2])) ## it's a long list of samples names >>> colData$condition = factor( colData$condition, levels = >>> c(param[1:numOfConditions,7])) >>> >>> dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, >>> design = ~condition) >>> dds <- estimateSizeFactors(dds) >>> dds <- estimateDispersions(dds, maxit = max_iterations) ## a user- defined >>> parameter >>> dds <- nbinomWaldTest(dds, betaPrior = FALSE) >>> tmp_counts = counts(dds, normalized=TRUE) >>> normalized_counts = data.frame(tmp_counts) >>> >>> final_stats = NULL >>> filtering_thresholds = matrix(ncol = 2, nrow = as.numeric(param[1,3]), >>> byrow = TRUE) >>> for(i in 1:numOfComparisons) >>> { >>> curr_res = results(dds, alpha = 0.05, contrast = c("condition", >>> param[i,4], >>> param[i,5])) >>> ..writing the data... >>> } >>> >>> >>> >>> >>> On Wed, Jan 8, 2014 at 3:21 PM, Simon Anders <anders@embl.de> wrote: >>> >>> > Hi Noa >>> > >>> > On 08/01/14 14:11, Noa Henig wrote: >>> > > I would expect, for example: >>> > > genes having these counts T0: 29852 and 39519; T1: 37352 and 44385; >>> > > T3:52539 and 58,556. >>> > > The log2 fold change of T0/T1 was 3.28 (p-adjusted was 0.04), while >>> the >>> > > log2 fold change of T2/T0 was 2.96 (p-adjusted was 0.105) >>> > >>> > Please tell us the values of the size factors (run "sizeFactors(dds)") >>> > to see how far off this is. >>> > >>> > And please always post your code, not just the output. >>> > >>> > Simon >>> > >>> > _______________________________________________ >>> > Bioconductor mailing list >>> > Bioconductor@r-project.org >>> > https://stat.ethz.ch/mailman/listinfo/bioconductor >>> > Search the archives: >>> > http://news.gmane.org/gmane.science.biology.informatics.conductor >>> > >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > [[alternative HTML version deleted]]

GO DESeq2 GO DESeq2 • 2.4k views

ADD COMMENT • link 10.3 years ago Michael Love 41k

Login before adding your answer.