EdgeR: replicated pools, yes or not?
Guest User ★ 13k
@guest-user-4897
Hello,

I would like to ask for your opinion on whether using replicated pools in the context of RNASeq experiments makes sense or not.

Let's say that we are interested in detecting genes that are differentially expressed between two genetic backgrounds (a certain KO mutant strain and the corresponding WT) in mouse liver.

We could perform an RNASeq experiment using liver tissue from four KO and four WT animals of the same sex, age, and diet.

We would have eight samples: four biological replicates for each of the two conditions to be compared.

However, we decide to pool liver tissue from three animals to prepare each of the eight samples (we would therefore use 24 animals: 12 KO animals pooled to produce four KO samples, and 12 WT animals pooled to produce four WT samples).

We would do so following the argument that pooling samples to build biological replicates reduces variation between replicates and increases the statistical power of the analysis, resulting in a more sensitive detection of genes that are differentially expressed between conditions.

However, edgeR relies precisely on measuring biological variability to establish the statistical significance of differences in gene expression across conditions. Therefore, pooling samples to build biological replicates would not be correct and we would, in fact, be losing statistical power: we would be unable to determine whether the observed differences in gene expression are significant or not.

There are some publications dealing with this issue in the context of microarrays (for example, Kendziorski et al, 2005, "On the utility of pooling biological samples in microarray experiments", PNAS, 102:4252) but I have not found anything similar in the context of RNASeq and, more specifically, of the analysis of RNASeq data with edgeR.

Any comment will be more than welcome, as well as any relevant references.

Thanks a lot in advance.

-- output of sessionInfo():

NA

--
Sent via the guest posting facility at bioconductor.org.
RNASeq Microarray edgeR
@ryan-c-thompson-5618
Icahn School of Medicine at Mount Sinai…
Don't pool. You are throwing away information. If you're going to do 24 animals, you may as well use 24 barcodes. To see that a separate barcode for each animal provides strictly more information than pooling, note that once you have used separate barcodes, you could add the counts together to do in silico pooling and get the same result as if you had done pooling in vitro. In other words, you can get from separate barcodes to pooling by throwing away information.

For a literature reference, try "Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing." http://www.ncbi.nlm.nih.gov/pubmed/22985019

That publication doesn't directly address the issue of pooling multiple biological samples under the same barcode, but it does make clear that more biological replication results in a drastic improvement in results.

You could simulate your described pooling scheme yourself: simply simulate 24 libraries in 2 groups with some number of true differentially expressed genes between them. Then pool them 3 at a time (by adding their counts together) to get the pooled dataset of 8 pooled libraries in 2 groups. Then perform the analysis on both datasets using your preferred tool and compute the ROC curve. I think you will find that pooling significantly diminishes your power to detect differential expression.

-Ryan Thompson

On Wed Apr 23 09:42:15 2014, "Manuel J Gómez [guest]" wrote:
> [...]

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
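The pooling step of the simulation Ryan describes can be sketched in plain Python. This is only a minimal illustration: the gamma-Poisson sampler, the number of genes, the mean, the dispersion, and the fold change are all assumptions chosen for the example, not values from the thread, and the differential expression analysis and ROC curve are left to whatever tool you prefer.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson variate (Knuth's method; adequate for modest means)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def nb_count(rng, mean, dispersion):
    """Negative binomial count as a gamma-Poisson mixture (var = mean + dispersion * mean**2)."""
    shape = 1.0 / dispersion
    lam = rng.gammavariate(shape, mean * dispersion)  # gamma with expectation `mean`
    return poisson(rng, lam)


def simulate(n_genes=200, n_per_group=12, n_de=20, mean=50.0,
             dispersion=0.1, fold=2.0, seed=1):
    """Return a genes x libraries count matrix: WT libraries first, then KO.

    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    counts = []
    for g in range(n_genes):
        ko_mean = mean * fold if g < n_de else mean  # first n_de genes are truly DE
        row = [nb_count(rng, mean, dispersion) for _ in range(n_per_group)]
        row += [nb_count(rng, ko_mean, dispersion) for _ in range(n_per_group)]
        counts.append(row)
    return counts


def pool(counts, pool_size=3):
    """In silico pooling: sum each consecutive block of `pool_size` libraries."""
    return [
        [sum(row[i:i + pool_size]) for i in range(0, len(row), pool_size)]
        for row in counts
    ]


individual = simulate()        # 24 libraries: 12 WT, 12 KO
pooled = pool(individual, 3)   # 8 pooled libraries: 4 WT, 4 KO
print(len(individual[0]), len(pooled[0]))
```

Both matrices can then be given to the same differential expression tool and the resulting gene rankings compared against the known set of truly changed genes (here, the first `n_de` rows).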
Dear Ryan,

Thanks a lot for your answer.

I perfectly understand that using 12 replicates for each condition is more informative than using 4. However, assuming that my budget allows me to sequence only a limited number of samples at a decent coverage (for example, 8 samples at 10 million reads per sample), which of the following would be the preferred solution?

a) using 8 samples obtained from 8 different animals (4 KO and 4 WT);
b) using 8 samples (4 KO and 4 WT) obtained by pooling tissue from "n" animals (with the same genotype, obviously).

I am pretty sure that if the only difference between the two types of animal (or condition) is a specific mutation, solution (a) would be THE correct solution because it would imply using truly biological and independent replicates. Solution (b) would be not just less correct, but blatantly incorrect, because it would eliminate biological variation between replicates (especially if "n" is high), and having an estimate of that variation is necessary to establish the significance of the differences observed between conditions.

I acknowledge that I am answering my own question, but I keep finding examples in which pooling (in the sense that I am describing above) is not completely discouraged. For example, Churchill (in "Fundamentals of experimental design for cDNA microarrays", 2002, Nature Genetics 32) explains that "in a two-sample comparison, we could consider making two large pools of all available units and measuring each pool multiple times. This is a poor design, as it does not allow estimation of the between-pool variance. By pooling all the available samples together we have minimized the biological variance, but we have also eliminated all independent replication. It is better to use several pools and fewer technical replicates". Why does he write that it is better to use several pools? Wouldn't it be better to use no pools at all?

Similarly, a discussion in which pooling is not completely discouraged can be found in:

http://seqanswers.com/forums/showthread.php?t=27905

Finally, pooling samples is often justified because of limited availability of RNA. In those cases pooling is mandatory, obviously. But if replicates have been constructed by pooling RNA from many tiny individual samples, shouldn't we keep in mind that we have lost all information regarding biological variance, and that we will not be able to assess the significance of any differences observed between conditions?

Manuel J Gómez

________________________________________
From: Ryan
Sent: Wednesday, April 23, 2014 7:06 PM
Subject: Re: [BioC] EdgeR: replicated pools, yes or not?

[...]
> However, assuming that my budget allows me to sequence only a limited number of samples at a decent coverage (for example, 8 samples at 10 million reads per sample), which of the following would be the preferred solution?
>
> a) using 8 samples obtained from 8 different animals (4 KO and 4 WT);
> b) using 8 samples (4 KO and 4 WT) obtained by pooling tissue from "n" animals (with the same genotype, obviously).

The preferred solution would be to take your 8 * n animals and sequence them all individually using the same total amount of sequencing as you would have used for the 8 pools. Each individual sample will have 1/n of the coverage, but that doesn't matter because you have still done the same total amount of sequencing per condition. I read a paper showing that increasing the number of biological replicates for an RNA-seq experiment while holding constant the total amount of sequencing (and therefore reducing the sequencing per replicate) continued to give gains in statistical power up to at least 192 biological replicates (which was the largest number they tested). This was in simulations, of course. Unfortunately, I can't find the citation in my ever-growing library of articles, but maybe someone else can supply it.

So, I'm not sure whether option a or b is better, but if you have the capability to do b, then you also probably have the capability to do 8 * n unpooled samples, which is unquestionably better than either a or b.

> I am pretty sure that if the only difference between the two types of animal (or condition) is a specific mutation, solution (a) would be THE correct solution because it would imply using truly biological and independent replicates. Solution (b) would be not just less correct, but blatantly incorrect, because it would eliminate biological variation between replicates (especially if "n" is high), and having an estimate of that variation is necessary to establish the significance of the differences observed between conditions.

This is not necessarily a problem, although it might be. With the pooled samples, your estimate of biological variability will be smaller, but you also have fewer degrees of freedom than you would if you did all the samples separately instead of pooling. I don't know which of these effects would dominate. So your significance estimates may not be any less accurate or unbiased, but they will probably be less precise since you are working with fewer observations.

> I acknowledge that I am answering my own question, but I keep finding examples in which pooling (in the sense that I am describing above) is not completely discouraged. For example, Churchill (in "Fundamentals of experimental design for cDNA microarrays", 2002, Nature Genetics 32) explains that "in a two-sample comparison, we could consider making two large pools of all available units and measuring each pool multiple times. This is a poor design, as it does not allow estimation of the between-pool variance. By pooling all the available samples together we have minimized the biological variance, but we have also eliminated all independent replication. It is better to use several pools and fewer technical replicates". Why does he write that it is better to use several pools? Wouldn't it be better to use no pools at all?

The considerations are different for microarrays. In sequencing, you can divide up your available sequencing space into as many individual replicates as you like. In microarrays, if you only have money to do 10 arrays, then you can only do 10 samples, so you are forced to choose between 10 individuals or 10 pools.

> Similarly, a discussion in which pooling is not completely discouraged can be found in:
>
> http://seqanswers.com/forums/showthread.php?t=27905

The only place I see pooling not discouraged in that thread is the part talking about 5 pools of 10 individuals each for 3 conditions vs 5 individuals each for 3 conditions. In that case Simon says that pooling is acceptable because the money or labor costs of individually prepping 150 samples may be prohibitive. He still notes that prepping every individual separately is the preferred solution if possible, and he notes that there is a trade-off that must be considered for the few samples vs few pools question. This echoes my answer above in this reply.

> Finally, pooling samples is often justified because of limited availability of RNA. In those cases pooling is mandatory, obviously. But if replicates have been constructed by pooling RNA from many tiny individual samples, shouldn't we keep in mind that we have lost all information regarding biological variance, and that we will not be able to assess the significance of any differences observed between conditions?

You haven't lost *all* information about biological variance. There are still different individuals going into each pool. For a concrete example, when doing RNA-seq on C. elegans, a single worm doesn't provide sufficient RNA, so each "sample" is actually a whole tank of worms all receiving the same treatment, i.e. literally a pool of individuals. I have analyzed such an experiment, and the dispersions as estimated by edgeR were on par with the general guide values one would expect for genetically identical individuals. As I said above, there are the balancing factors of reducing variability and reducing degrees of freedom, and I'm not exactly sure how they balance out.

-Ryan
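The two balancing factors can be made concrete under an idealized model. The sketch below assumes the negative binomial relation used by edgeR, in which the squared coefficient of variation of a count is 1/mu + phi (mu is the expected count, phi the squared biological coefficient of variation), that every animal contributes equally to its pool, and that animals are independent. The function name and the example values (mu = 100, phi = 0.16, n = 3) are illustrative assumptions, not figures from the thread.

```python
def design_summary(libs_per_group, animals_per_lib, mu_per_lib, phi):
    """Squared CV of a group's mean expression and the residual degrees of
    freedom of a two-group design, assuming CV^2 = 1/mu + phi per library."""
    phi_lib = phi / animals_per_lib            # averaging n animals divides biological CV^2 by n
    cv2_lib = 1.0 / mu_per_lib + phi_lib       # technical (Poisson) part + biological part
    cv2_group_mean = cv2_lib / libs_per_group  # mean of independent libraries
    residual_df = 2 * libs_per_group - 2       # two groups, one mean estimated per group
    return cv2_group_mean, residual_df


mu, phi, n = 100.0, 0.16, 3  # hypothetical expected count, squared BCV, pool size
designs = {
    "a: 4 individuals per group": design_summary(4, 1, mu, phi),
    "b: 4 pools of n per group": design_summary(4, n, mu, phi),
    "c: 4n individuals per group at 1/n depth": design_summary(4 * n, 1, mu / n, phi),
}
for name, (cv2, df) in designs.items():
    print(f"{name}: CV^2 of group mean = {cv2:.5f}, residual df = {df}")
```

Under these assumptions, designs b and c give the same precision for the group mean, but c leaves 8n - 2 residual degrees of freedom for estimating the dispersion against 6 for b. Equal contribution is the best case for pooling: if animals contribute unequally to a pool, the biological variance of the pool is larger than phi / n.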
Hi Ryan,

I would like to pop in just to emphasize something about the current economics of sequencing that clearly depends on the lab or sequencing facility you're using.

In our institute, and it sounds to me like Manuel is in a similar situation, the most expensive part of doing a proper RNA-seq experiment is the cost of each (barcoded) library. When you reply "if you have the capability to do b [8 pools], then you also probably have the capability to do 8 * n unpooled samples" you are clearly considering that the "per lane" cost of sequencing will be the same, but are missing the reality that many labs pay quite heavily for each library prep. For me, and surely for others, it is quite realistic to only have enough money for a limited number of library preps (say 8 or 12), even though we might have many more individuals (animals, plants, cell cultures, what-not) at almost no extra cost. In these cases, Manuel's question becomes quite relevant: should we pool many individuals into the fixed number of samples to be made into libraries, or should we try to make each library reflect, as closely as possible, a unique "individual"? Of course when the individual provides too little RNA the question is moot, but what about cases like Manuel's where a single animal or tissue is enough for a library?

Best,

Cei

On 4/24/14 3:24 PM, Ryan Thompson wrote:
> [...]

--
Dr. Cei Abreu-Goodger
Research Professor
Langebio CINVESTAV
Hi Cei, Yes, that is a good point. If your dominant cost is per sample and not per lane of sequencing, then you are back to the same situation as in microarrays, where you want to minimize the number of samples required to achieve a given level of significance. My other email gives my best attempt to address the question of pools vs individuals with the same number of samples in each case. -Ryan On 04/25/2014 08:18 AM, Cei Abreu-Goodger wrote: > Hi Ryan, > > I would like to pop in just to emphasize something about the current > economics of sequencing that clearly depends on the lab or sequencing > facility you're using. > > In our institute, and it sounds to me like Manuel is in a similar > situation, the most expensive part of doing a proper RNA-seq > experiment is the cost of each (barcoded) library. When you reply "if > you have the capability to do b [8 pools], then you also probably have > the capability to do 8 * n unpooled samples" you are clearly > considering that the "per lane" cost of sequencing will be the same, > but are missing the reality that many labs pay quite heavily for each > library prep. For me, and surely for others, it is quite realistic to > only have enough money for a limited number of library preps (say 8 or > 12), even though we might have many more individuals (animals, plants, > cell cultures, what-not) at almost no extra cost. In these cases, > Manuel's question becomes quite relevant: should we pool many > individuals into the fixed number of samples to be made into > libraries, or should we try to make the libraries reflect as best as > possible unique "individuals"? Of course when the individual provides > too little RNA the question is moot, but what about cases like > Manuel's where a single animal or tissue is enough for a library? 
> Best,
>
> Cei
>
> On 4/24/14 3:24 PM, Ryan Thompson wrote:
>>> However, assuming that my budget allows me to sequence only a limited number of samples at a decent coverage (for example, 8 samples at 10 million reads per sample), which of the following would be the preferred solution?
>>>
>>> a) using 8 samples obtained from 8 different animals (4 KO and 4 WT);
>>> b) using 8 samples (4 KO and 4 WT) obtained by pooling tissue from "n" animals (with the same genotype, obviously).
>>
>> The preferred solution would be to take your 8 * n animals and sequence them all individually using the same total amount of sequencing as you would have used for the 8 pools. Each individual sample will have n times less coverage, but that doesn't matter because you have still done the same total amount of sequencing per condition. I read a paper showing that increasing the number of biological replicates for an RNA-seq experiment while holding constant the total amount of sequencing (and therefore reducing the sequencing per replicate) continued to give gains in statistical power up to at least 192 biological replicates (which was the largest number they tested). This was in simulations, of course. Unfortunately, I can't find the citation in my ever-growing library of articles, but maybe someone else can supply it.
>>
>> So, I'm not sure whether option a or b is better, but if you have the capability to do b, then you also probably have the capability to do 8 * n unpooled samples, which is unquestionably better than either a or b.
>>
>>> I am pretty sure that if the only difference between the two types of animal (or condition) is a specific mutation, solution (a) would be THE correct solution because it would imply using truly biological and independent replicates. Solution (b) would be not just less correct, but blatantly incorrect, because it would eliminate biological variation between replicates (especially if "n" is high), and having an estimate of that variation is necessary to establish the significance of the differences observed between conditions.
>>
>> This is not necessarily a problem, although it might be. With the pooled samples, your estimate of biological variability will be smaller, but you also have fewer degrees of freedom than you would if you did all the samples separately instead of pooling. I don't know which of these effects would dominate. So your significance estimates may not be any less accurate or unbiased, but they will probably be less precise since you are working with fewer observations.
>>
>>> I acknowledge that I am answering myself, but I keep finding examples in which pooling (in the sense that I am describing above) is not completely discouraged. For example, Churchill (in "Fundamentals of experimental design for cDNA microarrays", 2002, Nature Genetics 32) explains that "in a two-sample comparison, we could consider making two large pools of all available units and measuring each pool multiple times. This is a poor design, as it does not allow estimation of the between-pool variance. By pooling all the available samples together we have minimized the biological variance, but we have also eliminated all independent replication. It is better to use several pools and fewer technical replicates". Why does he write that it is better to use several pools? Wouldn't it be better to use no pools at all?
>>
>> The considerations are different for microarrays. In sequencing, you can divide up your available sequencing space into as many individual replicates as you like. In microarrays, if you only have money to do 10 arrays, then you can only do 10 samples, so you are forced to choose between 10 individuals or 10 pools.
>>
>>> Similarly, a discussion in which pooling is not completely discouraged can be found in:
>>>
>>> http://seqanswers.com/forums/showthread.php?t=27905
>>
>> The only place I see pooling not discouraged in that thread is the part talking about 5 pools of 10 individuals each for 3 conditions vs 5 individuals each for 3 conditions. In that case Simon says that pooling is acceptable because the money or labor costs of individually prepping 150 samples may be prohibitive. He still notes that prepping all the individuals separately is the preferred solution if possible, and he notes that there is a trade-off that must be considered for the few samples vs few pools question. This echoes my answer above in this reply.
>>
>>> Finally, pooling samples is often justified because of limited availability of RNA. In those cases pooling is mandatory, obviously. But if replicates have been constructed by pooling RNA from many tiny individual samples, shouldn't we have in mind that we have lost all information regarding biological variance, and that we will not be able to assess the significance of any differences observed between conditions?
>>
>> You haven't lost *all* information about biological variance. There are still different individuals going into each pool. For a concrete example, when doing RNA-seq on C. elegans, a single worm doesn't provide sufficient RNA, so each "sample" is actually a whole tank of worms all receiving the same treatment, i.e. literally a pool of individuals. I have analyzed such an experiment, and the dispersions as estimated by edgeR were on par with the general guide values one would expect for genetically identical individuals. As I said above, there are the balancing factors of reducing variability and reducing degrees of freedom, and I'm not exactly sure how they balance out.
>>
>> -Ryan
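The trade-off Ryan describes (same total sequencing, but more or fewer libraries) can be put into numbers. The sketch below is only an illustration: the function name `design_summary` is hypothetical, the budget of 8 libraries at 10 million reads comes from the question, and n = 3 animals per pool is taken from the original post. The residual degrees of freedom are the plain "libraries minus groups" count for a two-group comparison; edgeR's moderated dispersion estimates borrow information across genes, so the effective degrees of freedom in a real analysis will differ.

```python
def design_summary(n_libraries, total_reads_m, n_groups=2):
    """Return (reads per library in millions, residual degrees of freedom).

    Residual df is the simple count for a one-factor design:
    number of libraries minus number of group means estimated.
    """
    reads_per_library_m = total_reads_m / n_libraries
    residual_df = n_libraries - n_groups
    return reads_per_library_m, residual_df


animals_per_pool = 3      # n, as in the original question
total_budget_m = 8 * 10   # 8 libraries at 10 million reads each

# Options (a) and (b): 8 libraries, whether single animals or pools.
eight_libraries = design_summary(8, total_budget_m)

# Ryan's preferred option: every animal gets its own barcode,
# with the same total amount of sequencing.
all_individuals = design_summary(8 * animals_per_pool, total_budget_m)

print(eight_libraries)   # 10 M reads per library, 6 residual df
print(all_individuals)   # about 3.3 M reads per library, 22 residual df
```

Under these assumptions, options (a) and (b) have the same residual degrees of freedom (6), so they differ only in how much biological variability each library carries, while the unpooled 24-library design has 22.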
Dear Ryan,

Thank you very much for your detailed answers.

From your comments it seems that a key point in weighing the advantages of pooling is, as you say, the relative contribution of reduced biological variability and reduced degrees of freedom.

I guess that it may depend both on the number of pooled individuals per sample and on the level of variability between individuals.

Since the level of variability can be expected to differ depending on the species, the tissue (if applicable) and the conditions, it may not be possible to derive a general rule.

Best regards,

Manuel J Gómez
I'm glad you are aware of the Kendziorski et al. paper, because it is the most applicable to the concept of biological vs. mathematical averaging.

Also, I agree with Ryan. Several years ago I did exactly what he mentioned, looking at in silico pooling vs. actual pooling, along with extensive simulations. The results were in agreement with Kendziorski using RNA-Seq, with some slight differences due to the dynamic range of RNA-Seq vs. microarrays. An additional benefit of in silico pooling / repeated technical measurements is that the design is more robust to technical problems (e.g., with one bad library prep, all you've lost is a fraction of your reads for that animal rather than all the reads from that animal).

Also, unless one does a carefully designed (and complex) experiment like Kendziorski, the apparent gain in power via biological pooling is a complete mirage, because the within-group variance being measured is technical, not biological. Therefore, significance tests from such experiments do not reflect what one is really after. Philosophically, one has to ask how meaningful such experiments are, especially if the ultimate goal is prediction at the individual level.

Wade

-----Original Message-----
From: Ryan [mailto:rct@thompsonclan.org]
Sent: Wednesday, April 23, 2014 12:06 PM
Subject: Re: [BioC] EdgeR: replicated pools, yes or not?

Don't pool. You are throwing away information. If you're going to do 24 animals, you may as well use 24 barcodes. To see that a separate barcode for each animal provides strictly more information than pooling, note that once you have used separate barcodes, you could add the counts together to do in silico pooling and get the same result as if you had done pooling in vitro. In other words, you can get from separate barcodes to pooling by throwing away information.

For a literature reference, try "Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing." http://www.ncbi.nlm.nih.gov/pubmed/22985019

That publication doesn't directly address the issue of pooling multiple biological samples in the same barcode, but it does make clear that more biological replication results in a drastic improvement in results. You could simulate your described pooling scheme yourself: simply simulate 24 libraries in 2 groups with some number of true differentially expressed genes between them. Then pool them 3 at a time (by adding their counts together) to get the pooled dataset of 8 pooled libraries in 2 groups. Then perform the analysis on both datasets using your preferred tool and compute the ROC curve. I think you will find that pooling significantly diminishes your power to detect differential expression.

-Ryan Thompson
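The data-generation half of the simulation Ryan proposes might look like the sketch below. It is a minimal illustration under stated assumptions, not anyone's actual analysis: counts are drawn from a gamma-Poisson (negative binomial) model, the gene means, the gamma shape of 10, and the "first 20 genes doubled in KO" effect are all invented for the example, and the helper names (`poisson`, `simulate_library`, `pool_in_silico`) are hypothetical. The differential expression test and the ROC curve would still have to be run in a tool such as edgeR, and summing counts only mimics physical pooling if every animal contributes equally to the pool.

```python
import math
import random


def poisson(rng, lam):
    """Knuth's Poisson sampler; adequate for the modest means used here."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_library(rng, means, shape):
    """One library: gamma-distributed biological rate, Poisson counting noise."""
    counts = []
    for mu in means:
        rate = rng.gammavariate(shape, mu / shape)  # mean mu, CV^2 = 1/shape
        counts.append(poisson(rng, rate))
    return counts


def pool_in_silico(libraries, pool_size):
    """Sum gene counts over consecutive groups of `pool_size` libraries."""
    pooled = []
    for start in range(0, len(libraries), pool_size):
        group = libraries[start:start + pool_size]
        pooled.append([sum(gene) for gene in zip(*group)])
    return pooled


rng = random.Random(1)
n_genes = 200
wt_means = [rng.uniform(5, 50) for _ in range(n_genes)]
# Assumed effect for illustration: the first 20 genes are doubled in KO.
ko_means = [2 * m if g < 20 else m for g, m in enumerate(wt_means)]

# 24 libraries in 2 groups, one per animal.
wt = [simulate_library(rng, wt_means, shape=10) for _ in range(12)]
ko = [simulate_library(rng, ko_means, shape=10) for _ in range(12)]

# Pool 3 at a time to get 8 pooled libraries in 2 groups.
wt_pools = pool_in_silico(wt, 3)
ko_pools = pool_in_silico(ko, 3)
print(len(wt_pools), len(ko_pools))
```

Both the 24-library and the 8-pool count tables can then be handed to the same analysis pipeline, which is the comparison Ryan has in mind.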
@manuel-jgomez-6512
Last seen 11.1 years ago
Dear Ryan,

Thanks a lot for your answer. I perfectly understand that using 12 replicates for each condition is more informative than using 4.

However, assuming that my budget allows me to sequence only a limited number of samples at a decent coverage (for example, 8 samples at 10 million reads per sample), which of the following would be the preferred solution?

a) using 8 samples obtained from 8 different animals (4 KO and 4 WT);
b) using 8 samples (4 KO and 4 WT) obtained by pooling tissue from "n" animals (with the same genotype, obviously).

I am pretty sure that if the only difference between the two types of animal (or condition) is a specific mutation, solution (a) would be THE correct solution because it would imply using truly biological and independent replicates. Solution (b) would be not just less correct, but blatantly incorrect, because it would eliminate biological variation between replicates (especially if "n" is high), and having an estimate of that variation is necessary to establish the significance of the differences observed between conditions.

I acknowledge that I am answering myself, but I keep finding examples in which pooling (in the sense that I am describing above) is not completely discouraged. For example, Churchill (in "Fundamentals of experimental design for cDNA microarrays", 2002, Nature Genetics 32) explains that "in a two-sample comparison, we could consider making two large pools of all available units and measuring each pool multiple times. This is a poor design, as it does not allow estimation of the between-pool variance. By pooling all the available samples together we have minimized the biological variance, but we have also eliminated all independent replication. It is better to use several pools and fewer technical replicates". Why does he write that it is better to use several pools? Wouldn't it be better to use no pools at all?

Similarly, a discussion in which pooling is not completely discouraged can be found in:

http://seqanswers.com/forums/showthread.php?t=27905

Finally, pooling samples is often justified because of limited availability of RNA. In those cases pooling is mandatory, obviously. But if replicates have been constructed by pooling RNA from many tiny individual samples, shouldn't we have in mind that we have lost all information regarding biological variance, and that we will not be able to assess the significance of any differences observed between conditions?

- Manuel J Gómez
Robert Castelo ★ 3.4k
@rcastelo
Last seen 7 weeks ago
Barcelona/Universitat Pompeu Fabra
hi Manuel,

On 04/23/2014 06:42 PM, Manuel J Gómez [guest] wrote:
> Hello,
[..]
> There are some publications dealing with this issue in the context of microarrays (for example, Kendziorski et al, 2005, "On the utility of pooling biological samples in microarray experiments", PNAS, 102:4252) but I have not found anything similar in the context of RNASeq and, more specifically, of the analysis of RNASeq data with EdgeR.

i believe the RNA-seq counterpart of that article is this one:

Kasper D Hansen, Zhijin Wu, Rafael A Irizarry & Jeffrey T Leek. Sequencing technology does not eliminate biological variability. Nature Biotechnology, 29:572-573, 2011.
http://www.nature.com/nbt/journal/v29/n7/abs/nbt.1910.html

cheers,

robert.
@ryan-c-thompson-5618
Last seen 12 months ago
Icahn School of Medicine at Mount Sinai…
Thinking about it, it should theoretically be possible to model the dispersion term of a pool as being derived from a mixture of N individuals. For example, taking the model used by edgeR and DESeq, the biological variation is Gamma distributed and the technical variation is Poisson distributed (yielding the NB distribution for the counts). So, instead of modelling the biological variation as a single Gamma distribution, we could model it as the mean of N independent and identically distributed Gamma variables. However, the mean of N iid Gamma(k, theta) random variables is (I think) a Gamma(k * N, theta / N) random variable (using the shape-scale parametrization from Wikipedia). So the NB distribution is equally valid (or equally invalid) for both individuals and pools.

Based on this, I would think that if you have pools, it is perfectly reasonable to use edgeR or DESeq or any other NB method on the pools. You will have fewer degrees of freedom than if you did all the samples without pooling, but your BCV will also be smaller (since the Gamma variance is k * theta^2). So, if you have already sequenced pools, I think NB-based methods will give you a valid answer (in terms of significance levels) based on your data, without you having to do anything special to account for the pooling.

If you have pooled data and you want to estimate what the dispersions would be if you had individual samples, you could back-calculate the parameters by reversing the above (I forget exactly how the Gamma distribution parameters relate to the mean and dispersion of the NB, but there is a formula for that). However, calculating this would only be for curiosity's sake, since this would be the distribution for observations that you don't have (i.e. counts for individual samples), so you can't do any statistics with it.

As to whether pools are preferable, I still think the best way to figure this out would be to simulate an experiment with few samples vs few pools vs many samples and see what happens. My intuition based on the above is that analysis based on M pools would be more powerful than analysis based on M individuals, but of course would be less powerful than analysis based on all the M * N individuals. But I wouldn't trust my intuition, and even if I did, my intuition is based on the assumption of a Gamma distribution for the biological variability, which is not necessarily a valid assumption in the first place, so again I stress the need for a simulation test to see which is better.

-Ryan
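For readers who want the Gamma argument above written out, here is a short derivation. It is a sketch under idealized assumptions that are not stated in the thread: each of the N animals contributes equally to the pool, all animals in a group share the same Gamma(k, theta) distribution, and library sizes are held fixed. The symbols lambda, mu and phi are introduced here for illustration, with phi denoting the NB dispersion (the squared BCV in edgeR's terminology).

```latex
% Individual animal: biological rate \lambda_i, shape-scale parametrization
\lambda_i \sim \mathrm{Gamma}(k, \theta), \qquad
\mathrm{E}[\lambda_i] = k\theta, \qquad
\mathrm{Var}[\lambda_i] = k\theta^2, \qquad
\mathrm{CV}^2 = \frac{k\theta^2}{(k\theta)^2} = \frac{1}{k}

% Pool of N animals: the mean of N iid rates
\bar{\lambda} = \frac{1}{N}\sum_{i=1}^{N} \lambda_i
  \sim \mathrm{Gamma}\!\left(Nk, \frac{\theta}{N}\right), \qquad
\mathrm{E}[\bar{\lambda}] = k\theta, \qquad
\mathrm{Var}[\bar{\lambda}] = \frac{k\theta^2}{N}, \qquad
\mathrm{CV}^2 = \frac{1}{Nk}

% Poisson counting on top of a Gamma rate with mean \mu gives NB counts
Y \mid \lambda \sim \mathrm{Poisson}(\lambda)
  \;\Rightarrow\;
\mathrm{Var}[Y] = \mu + \phi\,\mu^2, \qquad \phi = \mathrm{CV}^2

% Hence, under these assumptions
\phi_{\text{pool}} = \frac{\phi_{\text{individual}}}{N}, \qquad
\mathrm{BCV}_{\text{pool}} = \frac{\mathrm{BCV}_{\text{individual}}}{\sqrt{N}}
```

So the relation between the Gamma parameters and the NB dispersion is phi = 1/k, and back-calculating an individual-level dispersion from pooled data amounts to multiplying the pooled dispersion by N, with all the caveats about the Gamma assumption that the post already raises.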