RNAseq expression analysis using DESeq: technical replicates, paired samples

0

Entering edit mode

Michael Muratet ▴ 420

@michael-muratet-3076

Last seen 11.4 years ago

On Apr 15, 2010, at 3:59 AM, Simon Anders wrote: > Dear Jinfeng > > On Wed, 14 Apr 2010 22:28:17 +0000 (UTC), Jinfeng Liu <jinfengl at="" gene.com=""> > > wrote: >> I'm trying to use DESeq for RNAseq expression analysis. I haven't >> been >> able to find information about how to deal with the following issues: >> >> 1) technical replicates >> We have two biological samples, two libraries (of different insert >> size) >> were prepared for each of them. so I have four lanes of data in total > and I >> want to do differential expression between the two samples. It >> doesn't > look quite >> right to me to set up the condition vector as >> >> conds <- c( "Sample1", "Sample1","Sample2","Sample2") since they are > only >> technical replicates, not biological. But I'm not sure what to do. > > If you set up your test this way, DESeq will assume that the variance > between the replicates is all there is. Hence, roughly speaking, it > will > call a difference significant if it is larger than the fluctuations > observed between the technical replicates. This then only tells you > that > the gene might be typically different between different samples, but > you > won't know whether the difference is really due to the difference in > treatment or whether you would have observed the same magnitude of > difference between two samples that have been treated the same way. > > Of course, without biological replicates, there is no way to settle > this > question properly. > > The best thing you can do is to add up the counts from each sample, > and > compare just one data column with summed data from Sample 1 with one > data > column for Sample 2. Call DESeq's 'estimateVarianceFunctions' function > with > the argument 'pool=TRUE', and it will ignore the sample labels and > estimate > the variance between the conditions. Hence, it will only call those > genes > differentially expressed that have a much stronger difference between > conditions than the other genes of similar expression strength. You > might > find only few differentially expressed genes, but these are the only > ones > for which you can be somewhat sure that they are proper hits. Greetings I would like to verify that 18 months later, adding counts for technical replicates is still the best approach for combining technical replicates. We are constrained to single biological replicates, but we're interested in using the technical replicates if we can. I recall that limma had ways to set up the model matrix. Thanks Mike > >> 2) Paired samples >> We have samples from three patients. For each patient, we have >> matched >> tumor and adjacent normal samples. How should we set up the >> analysis to > capture the >> pair information? > > Sorry, but DESeq does not support paired tests (yet). I have some > ideas on > how to add this but this might take a while. > > For now, your best option is to use DESeq's > 'getVarianceStabilizedData' to > transform your data to a scale on which it is approximately > homoskedastic. > Then, you can use a pair-wise t-test or a pair-wise z-test. (Don't > do this > with the raw data, use DESeq's variance-stabilizing transcformation to > make > them homoskedastic first.) > > The pairwise t-test should work out of the box with R's standard > 't.test' > function. A pair-wise z-test should have more power in this setting, > as, > after the variance-stabilizing transformation, you may assume that all > data > has the same variance. Estimate this variance from your genes in a > pooled > fashion (ask again if you don't know how to do that) and take the > median. > Divide the pair differences by the square root of this to get z > scores, > then use 'pnorm' to get a p value. In my experience, this should work > reasonably well even though it may not have as much power as a > proper NB > test would have. > > Cheers > Simon > > > +--- > | Dr. Simon Anders, Dipl.-Phys. > | European Molecular Biology Laboratory (EMBL), Heidelberg > | office phone +49-6221-387-8632 > | preferred (permanent) e-mail: sanders at fs.tum.de > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor Michael Muratet, Ph.D. Senior Scientist HudsonAlpha Institute for Biotechnology mmuratet at hudsonalpha.org (256) 327-0473 (p) (256) 327-0966 (f) Room 4005 601 Genome Way Huntsville, Alabama 35806

RNASeq DESeq RNASeq DESeq • 2.5k views

ADD COMMENT • link updated 14.2 years ago by Simon Anders ★ 3.8k • written 14.2 years ago by Michael Muratet ▴ 420

0

Entering edit mode

Simon Anders ★ 3.8k

@simon-anders-3855

Last seen 5.5 years ago

Zentrum für Molekularbiologie, Universi…

Dear Michael On 2011-11-07 22:56, Michael Muratet wrote: > I would like to verify that 18 months later, adding counts for technical > replicates is still the best approach for combining technical replicates. > > We are constrained to single biological replicates, but we're interested > in using the technical replicates if we can. I recall that limma had > ways to set up the model matrix. Your question sounds as if you consider the issue as a technical limitation of the available methods. Unfortunately, it is not, and I am afraid, 18 months are not enough time for fundamental principles of statistics to change. If you want to know whether a difference between control and treatment may be attributed to the treatment, you need to verify that this difference is large compared to what you expect from random biological variation, and without biological replicates you cannot know the extent of biological variation. Of course, you can perform the analysis on the level of technical replicates, as you can with microarrays in limma. Then, you will get lots of results, but any call of significant differences may only be interpreted to mean that your two biological samples differ in this gene. You may not infer that the difference is more likely to be the result of your treatment than just due to chance differences between the samples. Best regards Simon

ADD COMMENT • link 14.2 years ago Simon Anders ★ 3.8k

0

Entering edit mode

On Nov 7, 2011, at 4:11 PM, Simon Anders wrote: > Dear Michael > > On 2011-11-07 22:56, Michael Muratet wrote: >> I would like to verify that 18 months later, adding counts for >> technical >> replicates is still the best approach for combining technical >> replicates. >> >> We are constrained to single biological replicates, but we're >> interested >> in using the technical replicates if we can. I recall that limma had >> ways to set up the model matrix. > > Your question sounds as if you consider the issue as a technical > limitation of the available methods. Unfortunately, it is not, and I > am afraid, 18 months are not enough time for fundamental principles > of statistics to change. Hi Simon > > If you want to know whether a difference between control and > treatment may be attributed to the treatment, you need to verify > that this difference is large compared to what you expect from > random biological variation, and without biological replicates you > cannot know the extent of biological variation. No, I really was just checking for updates in how to arrange the input. I understand the limitations of no biological replicates. However, as noted in section 6 of the vignette, '...such experiments are often undertaken...' and we hope to confirm what we find by other means. I thought I had seen documentation somewhere on how to incorporate technical replicates into deseq, but I could not put my hands back on it, so I thought I'd ask. I have actually processed the original data once and got what results we could get. I don't expect the tech replicates to change the answer. Cheers Mike >> > Of course, you can perform the analysis on the level of technical > replicates, as you can with microarrays in limma. Then, you will get > lots of results, but any call of significant differences may only be > interpreted to mean that your two biological samples differ in this > gene. You may not infer that the difference is more likely to be the > result of your treatment than just due to chance differences between > the samples. > > Best regards > Simon Michael Muratet, Ph.D. Senior Scientist HudsonAlpha Institute for Biotechnology mmuratet at hudsonalpha.org (256) 327-0473 (p) (256) 327-0966 (f) Room 4005 601 Genome Way Huntsville, Alabama 35806

ADD REPLY • link 14.2 years ago Michael Muratet ▴ 420

0

Entering edit mode

Hi Michael On 2011-11-07 23:44, Michael Muratet wrote: > No, I really was just checking for updates in how to arrange the input. > I understand the limitations of no biological replicates. However, as > noted in section 6 of the vignette, '...such experiments are often > undertaken...' and we hope to confirm what we find by other means. > > I thought I had seen documentation somewhere on how to incorporate > technical replicates into deseq, but I could not put my hands back on > it, so I thought I'd ask. I have actually processed the original data > once and got what results we could get. I don't expect the tech > replicates to change the answer. Sorry if I jumped a bit at you in my answer, I just was asked this question a bit too often recently and maybe I should not write emails late in the evening. In my old post of last year, I refered to the papers of Nagalakshmi et al. and Marioni et al., who notices that the variance between technical replicates follows a Poisson distribution. Now, if you add up two Poisson distributed random variable, the sum is Poisson-distributed, too; and this is why any method looking at the technical replicates separately will not have more useful information to work with than one that is given the sum. Even if you have overdispersion (i.e., variance in excess of Poisson) between technical replicates, summing them up will smooth this out a bit, and what is left will simply become part of the variance seen between the biological replicates (i.e., the sums of the technical ones). So if you have layered experiment with both technical and biological, I do not think it is worth the effort to account for this hierarchy in the analysis. Your situation seems to be that you have no biological replicates but expect to see only so few good hits that you can follow them up with independent, replicated verification. For such cases, I recently wondered whether the following trick might help: Make your best guess for a biological coefficient of variation. Let us say, we expect, from experience with previous experiments, that the gene expression between replicates, if you had them, would typically vary by 15% (i.e., 0.15). Then you can inject this dispersion value into your DESeq calculation: # Start as usual: library( DESeq ) # An example data set with only two samples: cds <- makeExampleCountDataSet()[,2:3] # Estimate the dispersion, only to throw it away in the next step cds <- estimateDispersions( cds, method="blind", sharingMode="fit-only" ) # Overwrite the estimates with our guessed fixed value: fData(cds)$disp_blind <- .15^2 # See what you get res <- nbinomTest( cds, "A", "B" ) If you have some hunch what the dispersion might be, you can get some preliminary results that way. Do not take the p values too literally, of course, as they are based on guesswork. Observe how the ranking of the genes and the number of hits changes if you try different dispersion guesses. This help you anticipate how much success you can hope for in your subsequent effort of verification, and for how many genes you should attempt verification. Some _very_ _rough_ (and maybe completely wrong) values for coefficients of variations, as I have seen them in the past: - Comparison between isogenic liquid yeast cell cultures, highly controlled experiment: Below 10% or even 5% (dispersion .1^2 or .05^2) if you are lucky - Comparison between isogenic mammalian cell cultures: maybe around 15% if things go well - Comparison between tissues derived from isogenic litter mates: maybe 20% to 35% - Comparison between different non-related individuals: 50% or even much more Best regards Simon

ADD REPLY • link 14.2 years ago Simon Anders ★ 3.8k

Login before adding your answer.