Question

a few questions on DESeq

0

Entering edit mode

sunghee OH ▴ 50

@sunghee-oh-4019

Last seen 9.6 years ago

Hi All, In DESeq, if one data set has both technical and biological replicates, i am curious how technical and biological replicates are distinguished in that package. for instance, say, each biological replicate has two technical replicates. T_b_1_t_1(cancer biological replicate 1, technical replicate1) T_b_1_t_2, T_b_2_t_1, T_b_2_t_2 VS similarly Normal samples. So. the main null hypothesis is no difference between tumor vs normal group.(4 samples vs 4) For such a case, when analyzing 8 samples together with the above hypothesis, Q1. for "conds" variable indicating the labels of samples mentioned in DESeq manual, how do i have to define to distinguish tech vs biological replicates?(edgeR can not do for such a data, i know that it is able to treat only data with technical replicates rather than biological individuals instead, yet, it seems DESeq is able to handle data set with biological replicates without any additional variable of biological individual factor as like GLM. In the paper, one of published data sets for such kind of type data which has both tech and bio replicates was also analyzed: RNA-Seq of yeast as real data application using DESeq) Q2. If question one is doable in DESeq, it is also able to take experimental effects(nuisance factors) such as library/flow cell/time point(if it is a time series data set) as well as main hypothesis into account in DESeq? Q3. when estimating size factors, Sj, it is using geometric mean pseudo-reference as like quantile adjusted way in 2008 Robinson paper. so. here i was wondering why it is taken by especially geometric mean rather than just simple arithmetic mean(or median) or trimmed mean(or median). is there any specific reason to take geometric mean in the estimation of size factors here? Thank you all in advance. S- [[alternative HTML version deleted]]

DESeq DESeq • 1.7k views

ADD COMMENT • link updated 13.5 years ago by Simon Anders ★ 3.7k • written 13.5 years ago by sunghee OH ▴ 50

score 0 · Answer 1 · 2010-11-11

Dear Sunghee On 11/11/2010 05:44 AM, sunghee OH wrote: > In DESeq, if one data set has both technical and biological replicates, i am > curious how technical and biological replicates are distinguished in that > package. for instance, say, each biological replicate has two technical > replicates. T_b_1_t_1(cancer biological replicate 1, technical replicate1) > T_b_1_t_2, T_b_2_t_1, T_b_2_t_2 VS similarly Normal samples. > > So. the main null hypothesis is no difference between tumor vs normal > group.(4 samples vs 4) > > For such a case, when analyzing 8 samples together with the above > hypothesis, > > Q1. for "conds" variable indicating the labels of samples mentioned in DESeq > manual, how do i have to define to distinguish tech vs biological > replicates?(edgeR can not do for such a data, i know that it is able to > treat only data with technical replicates rather than biological individuals > instead, yet, it seems DESeq is able to handle data set with biological > replicates without any additional variable of biological individual factor > as like GLM. In the paper, one of published data sets for such kind of type > data which has both tech and bio replicates was also analyzed: RNA- Seq of > yeast as real data application using DESeq) The short answer is: Just add up the counts from your technical replicates, and treat them as one biological sample. If you want a justification for this, I have to elaborate quite a bit. Hence, I take the opportunity to write a little treatise on noise in RNA-Seq, as this is something I wanted to write down anyway. Let's first fix some terminology. I divide the noise (or: variance) in our sample into three parts, shot noise, technical noise and biological noise. Shot noise: Assume, you prepare a library including amplification etc and fill the prepared library in two lanes of a Solexa flowcell. Then, we may assume that the concentration of each transcript is exactly the same in the two lanes. However, we will nevertheless get different counts for the transcript from the two lanes because high-throughput sequencing is (as is all shotgun sequencing) a probabilistic process: each cDNA molecule has a certain probability to be read by the sequencer, and hence, the number of observed reads follows a Poisson distribution. The variance of a Poisson distribution is equal to its mean. From this, we can conclude that the theoretical minimum to the variance in an HTS experiment (i.e., the noise occurring even when comparing perfectly identical prepared libraries) is equal to the expected count rate. The actual variance is larger than the shot noise, as two further noise sources need to be taken into account. Technical noise: By this, I mean the amount of variance in addition to shot noise that we observe when comparing two technical replicates, i.e., two libraries prepared from the same sample. This noise captures the non-uniformity of the mRNA extraction and library preparation processes (and any noise in the sequencing process itself beyond the shot noise). Biological noise: This is variation that you observe in addition to shot noise and technical noise, when comparing two biological replicates, i.e., libraries prepared from two samples which were treated in the same way. It reflects how two different samples which the experimenter attempts to treat identical, nevertheless differ in their response to environmental circumstances that the experimenter could not control for. Hence, the total variance that one can observe from comparing two samples is: Total variance = Shot noise + technical noise + biological noise If you compare technical replicates, the biological noise is zero, of course. The first thing to note in this decomposition is that we know the amount of shot noise a priori from theoretical grounds (it is well estimated as equal to the observed count rate), while the other two have to be estimated from the data, as they reflect properties of the specific experiment. The crucial point is that technical noise is usually _much_ smaller than shot noise and biological noise, so small, in fact, that it is not worth treating it separately, rather than just lumping it together with either the shot noise or the biological noise. This fact is the core result of Marioni et al.'s 2008 Genome Res. paper and was also described by Nagalakshmi et al. in their 2008 Science paper. This holds only, of course, if everything went well in library preparation. So, if you have technical replicates, you should take the opportunity to check this. To do so with DESeq, set the 'conditions' factor of your CountDataSet such that each sample in a pair of technical replicates gets the same levels, but biological replicates get different levels, i.e., let DESeq know which samples are technical replicates but let it treat biological replicates as different conditions. Then, estimate the size factors and variance functions and call 'scvPlot'. If technical noise is really negligible compared to shot noise, the distance of the solid lines (total variance minus shot noise) to the x axis should appear very small when compared to the distance between the dashed and the solid line (which is the shot noise), at least in the count rate region where most genes are (as indicated by the black line). If it does not look like this, you may have issues with your sample prep. Once you have convinced yourself that the technical replicates agree as well as the shot noise predicts, there is no longer any point in keeping them separate. Just add up the counts from all technical replicates of a given biological replicate to form a single column in the count table and make a new CountDataSet, with the 'conditions' factor now describing the biological conditions of the samples. If you now look at the SCV plot, the distance between solid lines and x axis indicates the sum of biological and technical noise. If technical noise has before been established as negligible, you can read it as only showing the biological variation. A good way to interpret it, is, by the way, to take the square root of the value and understand it as the coefficient of variation of gene expression over biological replicates. The distance between solid and dashed line is the shot noise, as before. If your technical noise is not negligible, one might argue that you need a hierarchical noise model to get optimal power. To my knowledge, nobody has implemented something like this yet for RNA-Seq. Furthermore, I think, it would still be permissible to add up the technical replicates because the technical noise will still be apparent in the sum and hence properly accounted for once you estimate total variance between the summed-up biological replicates. Hence, my initial short answer: Just sum up the count from technical replicates. This was a long mail, maybe even too long to be read by anyone, so I stop here and reply to your other two questions later today. Cheers Simon +--- | Dr. Simon Anders, Dipl.-Phys. | European Molecular Biology Laboratory (EMBL), Heidelberg | office phone +49-6221-387-8632 | preferred (permanent) e-mail: sanders at fs.tum.de