edgeR: mixing technical replicates from Illumina HiSeq and MiSeq
Nick N ▴ 60
@nick-n-6370
Last seen 8.5 years ago
United Kingdom
Hi,

I have a study where a fraction of the samples have been replicated on two Illumina platforms (HiSeq and MiSeq). These are technical replicates: the library preparation is the same, using the same biological replicates, and only the sequencing differs.

My hunch was that I should introduce the platform as an additional (blocking) factor in the analysis. Then I stumbled upon this post:

https://stat.ethz.ch/pipermail/bioconductor/2010-April/033099.html

It recommends pooling the replicates. The post seems to apply to a different case ("pure" technical replicates, i.e. no differences in the sequencing platform used), so I should probably ignore it. But I am still uncertain about the best way to treat the technical replicates. Could you please advise me on this?

Many thanks!
Nick
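For readers comparing the two options raised in the question, here is a minimal R sketch of what each looks like mechanically. Everything in it is an assumption made for illustration: the sample sheet, the library names (A1, A2, B1, B2) and the toy counts are hypothetical, not Nick's data. It does not say which option is appropriate, and a design matrix alone does not account for the fact that two runs of the same library are not independent samples.

```r
# Hypothetical sample sheet: four libraries, two of them also run on the MiSeq.
samples <- data.frame(
  library   = c("A1", "A2", "B1", "B2", "A1", "B1"),
  condition = factor(c("ctrl", "ctrl", "treat", "treat", "ctrl", "treat")),
  platform  = factor(c("HiSeq", "HiSeq", "HiSeq", "HiSeq", "MiSeq", "MiSeq"))
)

# Toy count matrix (genes x runs); real data would come from the count table.
set.seed(42)
counts <- matrix(
  rpois(5 * nrow(samples), lambda = 50),
  nrow = 5,
  dimnames = list(paste0("gene", 1:5),
                  paste(samples$library, samples$platform, sep = "_"))
)

# Option 1: keep every run and add the platform as a blocking factor.
design_block <- model.matrix(~ platform + condition, data = samples)
print(design_block)

# Option 2: pool the runs of each library by summing their counts.
# rowsum() sums rows by group, so transpose to sum columns (runs) per library.
pooled <- t(rowsum(t(counts), group = samples$library))
print(pooled)
```

With real data, `design_block` would be passed to the model-fitting step, while `pooled` would replace the original count matrix and be paired with a four-row sample sheet.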
@ryan-c-thompson-5618
Last seen 7 months ago
Scripps Research, La Jolla, CA
Hi Nick,

Thanks to the underlying theory behind dispersion estimation, you can easily test whether your "technical replicates" really do behave as technical replicates. Specifically, read counts in technical replicates should follow a Poisson distribution, which is a special case of the negative binomial with zero dispersion. So, simply fit a model using edgeR or DESeq2 with a separate coefficient for each group of technical replicates. All the experimental variation will then be absorbed into the model coefficients, and the only thing left will be the technical variability of the replicates. For true technical replicates, the dispersion should be zero for all genes. So if you estimate dispersions using this model, and plotBCV (edgeR) or plotDispEsts (DESeq2) shows the dispersion very near to zero, then you can be confident that you really have technical replicates. If the dispersion is nonzero, then there is some additional source of unaccounted-for variation.

I have used this method on a pilot dataset with several technical replicates for each condition. edgeR said the dispersion was something like 10^-3 or less for all genes except for the very low-expressed genes.

-Ryan

On 8/28/14, 9:23 AM, Nick N wrote:
> [...]
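The check described in this answer can be sketched in R roughly as follows. This is an illustration under stated assumptions, not Ryan's actual code: the counts are simulated Poisson data standing in for real technical replicates, and the names `library_id`, `mu` and `counts` are made up. The edgeR calls (`DGEList`, `calcNormFactors`, `estimateDisp`, `plotBCV`) are guarded so the script still runs where edgeR is not installed.

```r
set.seed(1)
n_genes <- 500

# Three hypothetical libraries, each sequenced twice (e.g. HiSeq and MiSeq).
library_id <- factor(rep(c("lib1", "lib2", "lib3"), each = 2))

# Each library has its own expected count per gene; its two runs share it,
# so the only variation between runs is Poisson (technical) noise.
mu <- matrix(rexp(n_genes * 3, rate = 1 / 200) + 5, nrow = n_genes)
counts <- matrix(
  rpois(n_genes * 6, lambda = mu[, as.integer(library_id)]),
  nrow = n_genes
)

# One coefficient per group of technical replicates.
design <- model.matrix(~ 0 + library_id)

if (requireNamespace("edgeR", quietly = TRUE)) {
  suppressPackageStartupMessages(library(edgeR))
  y <- DGEList(counts = counts)
  y <- calcNormFactors(y)
  y <- estimateDisp(y, design)
  cat("common dispersion:", y$common.dispersion, "\n")
  if (interactive()) plotBCV(y)  # BCV is the square root of the dispersion
}
```

With real data, `counts` would be the count table and `library_id` the factor identifying which runs came from the same library. A common dispersion well above the order of 10^-3 mentioned in the answer would point to variation beyond sequencing noise, for example a platform effect.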
Thanks Ryan and Nicolas!

I was wondering whether there is some sort of decision tree that can be formalised.

Nicolas, you would consider 3 options - merging, ignoring or adding a factor. Could you recommend some sort of cut-offs for each choice, or is it more of a qualitative decision made by looking at plots and PCA? By the way, my data is RNA-Seq - I forgot to mention it.

Ryan, I would basically ask you the same question.

On Fri, Aug 29, 2014 at 9:42 AM, Ryan <rct at thompsonclan.org> wrote:
> [...]
@nicolas-delhomme-6252
Last seen 5.4 years ago
Sweden
Hej Nick!

Even if technical replicates on Illumina sequencers tend to be very similar, I would always do a number of checks before actually merging them. I usually do as follows to learn how similar or different my technical replicates are (honestly, I haven't had many in the recent past, most have been biological reps, but the same applies):

1) Do scatterplots of the raw data replicates (pair-wise) to see how similar the replicates are (e.g. by binning reads into 1-10 kb windows).

2) Do a PCA (sample based, so on a transpose of the above matrix; i.e. prcomp(t(binnedDataMatrix))). In that PCA, I'll check whether the replicates cluster together, whether there is any dimension separating the tech replicates, and what the contribution of the corresponding component is.

3) Plot the density distribution of all the samples, and boxplots of the same, to see how similar they are between replicates.

4) Normalise the data using a vst approach (to normalise for lib size and to correct for the var~mean relationship). I'm using a vst approach here no matter the sample size of the experiment because it gives, in my opinion, "clearer" plots, or plots I can better interpret. That is independent of how I would conduct the analysis (i.e. I would only use a vst approach there if I have enough replication per condition, see Soneson and Delorenzi, 2013, BMC Bioinformatics for more).

5) Redo all the plots above to complement the analyses.

After that, I usually get a good idea of what the properties of my tech. reps are and whether I should consider 1) ignoring them, 2) merging them or 3) adding an additional factor in the analyses.

I also recall that edgeR, DESeq and DESeq2 expect replicates to be biological replicates and not technical replicates. Technical replicates on Illumina usually show very little variation - hence the suggestion to merge them - and this could possibly bias the dispersion estimation.

You did not specify what data you have at hand (DNA, RNA-Seq?), so I described a more global approach (binning), but for my RNA-Seq studies I actually do the comparison also after I've generated my count table(s).

HTH,

Nico

---------------------------------------------------------------
Nicolas Delhomme
The Street Lab
Department of Plant Physiology
Umeå Plant Science Center
Tel: +46 90 786 5478
Email: nicolas.delhomme at umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------

On 28 Aug 2014, at 18:23, Nick N <feralmedic at gmail.com> wrote:
> [...]
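Some of the checks listed in this answer (pair-wise similarity and a sample-based PCA) can be sketched in base R as below. This is only an illustration under assumptions: the counts are simulated, the names `library_id`, `platform`, `profile` and `depth` are hypothetical, and log2 counts-per-million is used as a simple stand-in for the vst normalisation Nicolas mentions, not as an equivalent of it.

```r
set.seed(7)
n_genes <- 300

# Three hypothetical libraries, each run on both platforms.
library_id <- rep(c("lib1", "lib2", "lib3"), each = 2)
platform <- factor(rep(c("HiSeq", "MiSeq"), times = 3))

# Each library has its own expression profile; MiSeq runs are made shallower.
profile <- matrix(rexp(n_genes * 3, rate = 1 / 100) + 1, nrow = n_genes)
depth <- ifelse(platform == "HiSeq", 1, 0.2)
lambda <- sweep(profile[, rep(1:3, each = 2)], 2, depth, `*`)
counts <- matrix(
  rpois(length(lambda), lambda),
  nrow = n_genes,
  dimnames = list(NULL, paste(library_id, platform, sep = "_"))
)

# Library-size normalisation plus log transform (stand-in for a vst).
log_cpm <- log2(sweep(counts, 2, colSums(counts), `/`) * 1e6 + 1)

# Pair-wise similarity of one replicate pair (numeric analogue of a scatterplot).
rep_cor <- cor(log_cpm[, "lib1_HiSeq"], log_cpm[, "lib1_MiSeq"])
cat("lib1 HiSeq vs MiSeq correlation:", round(rep_cor, 3), "\n")

# Sample-based PCA: samples in rows, hence the transpose.
pca <- prcomp(t(log_cpm))
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# How much of each leading component is explained by the platform?
platform_r2 <- apply(pca$x[, 1:3], 2, function(pc) {
  summary(lm(pc ~ platform))$r.squared
})
print(round(platform_r2, 3))
```

A component with a high `platform_r2` and a sizeable share of `var_explained` would be the kind of "dimension separating the tech replicates" described in step 2. The sketch offers no cut-off for that judgement.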