Question

edgeR: mixing technical replicates from Illumina HiSeq and MiSeq

0

Entering edit mode

Nick N ▴ 60

@nick-n-6370

Last seen 8.6 years ago

United Kingdom

Dea Gordon, Ryan and Nicolas, Than you all for the detailed advice. I have one more question regarding the blocking factor model. In my case I have, actually, 2 external factors to consider - one is the platform, the other one are the subjects. My sample matrix is the following (I've attached the CSV in case you can't view the image): I am only interested in comparing treatments B:D to A (the latter are controls). So far I've never had a model with more than one external factor. I imagine it should be OK to have more - is this correct? If yes - can you, perhaps, check whether I am setting the model matrix correctly? (Apologies if this sounds too trivial) I imagine it shall be defined as: Platform <- factor(targets$Platform) > Subject <- factor(targets$Subject) > Treatment <- factor(targets$Treatment) > design <- model.matrix(~Platform+Subject+Treatment) .. > fit <- glmFit(y, design) > lrt <- glmLRT(fit, coef=24) # for comparing Treatment B to Treatment A Is this correct? On Sun, Aug 31, 2014 at 12:44 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Nick, > > If you go back to the post from 2010 that you give the URL for, you will > see that I was giving very briefly the same advice about checking Poisson > variability that Ryan has explained at greater detail. > > You don't give any information about read lengths, sequence depths or > alignment methods. I would be surprised if MiSeq and HiSeq would generate > perfect Poisson replicates of one another, especially if the read lengths > from the two platform are different or the alignment and counting software > has been varied. So you may well end up back at the blocking idea. > > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > http://www.statsci.org/smyth > > On Sun, 31 Aug 2014, Ryan wrote: > > Thanks to the underlying theory behind dispersion estimation, you can >> easily test whether your "technical replicates" really do represent >> technical replicates. Specifically, read counts in technical replicates >> should follow a Poisson distribution, which is a special case of the >> negative binomial with zero dispersion. So, simply fit a model using edgeR >> or DESeq2 with a separate coefficient for each group of technical >> replicates. Thus all the experimental variation will be absorbed into the >> model coefficients and the only thing left will be the technical >> variability of of the replicates. For true technical replicates, the >> dispersion should be zero for all genes. So if you estimate dispersions >> using this model, and plotBCV/plotDispEsts shows the dispersion very near >> to zero, then you can be confident that you really have technical >> replicates. If the dispersion is nonzero, then there is some additional >> source of unaccounted-for variation. >> >> I have used this method on a pilot dataset with several technical >> replicates for each condition. edgeR said the dispersion was something like >> 10^-3 or less for all genes except for the very low-expressed genes. >> >> -Ryan >> >> On 8/28/14, 9:23 AM, Nick N wrote: >> >>> Hi, >>> >>> I have a study where a fraction of the samples have been replicated on 2 >>> Illumina platforms (HiSeq and Miseq). These are technical replicates - the >>> library preparation is the same using the same biological replicates - it's >>> only the sequencing which is different. >>> >>> My hunch was that I shall introduce the platform as as an additional >>> (blocking) factor in the analysis. Than I stumbled upon this post: >>> >>> https://stat.ethz.ch/pipermail/bioconductor/2010-April/033099.html >>> >>> It recommends pooling the replicates. The post seems to apply to a >>> different case ("pure" technical replicates, i.e. no differences in the >>> sequencing platform used) so I probably shall ignore it. But I still feel a >>> bit uncertain of the best way to treat the technical replicates. Can you, >>> please, advise me on this? >>> >>> many thanks! >>> Nick >>> >> > ______________________________________________________________________ > The information in this email is confidential and intended solely for the > addressee. > You must not disclose, forward, print or use it without the permission of > the sender. > ______________________________________________________________________ >

Sequencing Alignment GO edgeR DESeq2 • 1.4k views

ADD COMMENT • link updated 9.6 years ago by Ryan C. Thompson ★ 7.9k • written 9.6 years ago by Nick N ▴ 60

score 0 · Answer 1 · 2014-09-03

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 58 minutes ago

WEHI, Melbourne, Australia

Yes, that's all that is needed. In general, one corrects for a batch effect by fitting ~ batch + otherterms where "otherterms" is what the model would have been without the batch effect. Best wishes Gordon On Mon, 1 Sep 2014, Nick N wrote: > Dea Gordon, Ryan and Nicolas, > > Than you all for the detailed advice. > > I have one more question regarding the blocking factor model. In my case I > have, actually, 2 external factors to consider - one is the platform, the > other one are the subjects. > > My sample matrix is the following (I've attached the CSV in case you can't > view the image): > > > > > > I am only interested in comparing treatments B:D to A (the latter are > controls). So far I've never had a model with more than one external > factor. I imagine it should be OK to have more - is this correct? If yes - > can you, perhaps, check whether I am setting the model matrix correctly? > (Apologies if this sounds too trivial) I imagine it shall be defined as: > > Platform <- factor(targets$Platform) >> Subject <- factor(targets$Subject) >> Treatment <- factor(targets$Treatment) >> design <- model.matrix(~Platform+Subject+Treatment) > > .. >> fit <- glmFit(y, design) >> lrt <- glmLRT(fit, coef=24) # for comparing Treatment B to Treatment A > > > Is this correct? > > > > On Sun, Aug 31, 2014 at 12:44 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Dear Nick, >> >> If you go back to the post from 2010 that you give the URL for, you will >> see that I was giving very briefly the same advice about checking Poisson >> variability that Ryan has explained at greater detail. >> >> You don't give any information about read lengths, sequence depths or >> alignment methods. I would be surprised if MiSeq and HiSeq would generate >> perfect Poisson replicates of one another, especially if the read lengths >> from the two platform are different or the alignment and counting software >> has been varied. So you may well end up back at the blocking idea. >> >> >> Best wishes >> Gordon >> >> --------------------------------------------- >> Professor Gordon K Smyth, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> http://www.statsci.org/smyth >> >> On Sun, 31 Aug 2014, Ryan wrote: >> >> Thanks to the underlying theory behind dispersion estimation, you can >>> easily test whether your "technical replicates" really do represent >>> technical replicates. Specifically, read counts in technical replicates >>> should follow a Poisson distribution, which is a special case of the >>> negative binomial with zero dispersion. So, simply fit a model using edgeR >>> or DESeq2 with a separate coefficient for each group of technical >>> replicates. Thus all the experimental variation will be absorbed into the >>> model coefficients and the only thing left will be the technical >>> variability of of the replicates. For true technical replicates, the >>> dispersion should be zero for all genes. So if you estimate dispersions >>> using this model, and plotBCV/plotDispEsts shows the dispersion very near >>> to zero, then you can be confident that you really have technical >>> replicates. If the dispersion is nonzero, then there is some additional >>> source of unaccounted-for variation. >>> >>> I have used this method on a pilot dataset with several technical >>> replicates for each condition. edgeR said the dispersion was something like >>> 10^-3 or less for all genes except for the very low-expressed genes. >>> >>> -Ryan >>> >>> On 8/28/14, 9:23 AM, Nick N wrote: >>> >>>> Hi, >>>> >>>> I have a study where a fraction of the samples have been replicated on 2 >>>> Illumina platforms (HiSeq and Miseq). These are technical replicates - the >>>> library preparation is the same using the same biological replicates - it's >>>> only the sequencing which is different. >>>> >>>> My hunch was that I shall introduce the platform as as an additional >>>> (blocking) factor in the analysis. Than I stumbled upon this post: >>>> >>>> https://stat.ethz.ch/pipermail/bioconductor/2010-April/033099.html >>>> >>>> It recommends pooling the replicates. The post seems to apply to a >>>> different case ("pure" technical replicates, i.e. no differences in the >>>> sequencing platform used) so I probably shall ignore it. But I still feel a >>>> bit uncertain of the best way to treat the technical replicates. Can you, >>>> please, advise me on this? >>>> >>>> many thanks! >>>> Nick ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 9.6 years ago Gordon Smyth 50k

0

Entering edit mode

Thanks for all the advice so far. I tried that route but I am getting > y <- estimateGLMCommonDisp(d, design) > > Error in glmFit.default(y, design = design, dispersion = dispersion, >> offset = offset, : > > Design matrix not of full rank. The following coefficients not >> estimable: > > TreatmentB TreatmentC TreatmentD TreatmentE TreatmentF > > For the sake of completeness here is my code: targets <- readTargets() > > d <- readDGE(targets$FileName) > > keep <- rowSums(cpm(d)>1) >=3 > > d <- d[keep,] > > d$samples$lib.size <- colSums(d$counts) > > d <- calcNormFactors(d) > > all_cpm=cpm(d, normalized.lib.size=TRUE) > > all_counts <- cbind(rownames(all_cpm), all_cpm) > > colnames(all_counts)[1] <- "Ensembl.Gene.ID" > > rownames(all_counts) <- NULL > > Subject <- factor(targets$Subject) > > Platform <- factor(targets$Platform) > > Treatment <- factor(targets$Treatment) > > design <- model.matrix(~Platform+Subject+Treatment) > > y <- estimateGLMCommonDisp(d, design) > > The design matrix looks OK to me: > design > > (Intercept) PlatformMiSeq Subjecta8 Subjectb6 Subjectb7 Subjectb8 >> Subjectc5 Subjectc6 Subjectc7 Subjectc8 Subjectd5 Subjectd6 Subjectd7 >> Subjectd8 Subjecte5 Subjecte6 > > 1 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 2 1 1 1 0 0 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 3 1 1 0 1 0 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 4 1 1 0 0 1 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 5 1 1 0 0 0 1 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 6 1 1 0 0 0 0 >> 1 0 0 0 0 0 0 0 >> 0 0 > > 7 1 1 0 0 0 0 >> 0 1 0 0 0 0 0 0 >> 0 0 > > 8 1 1 0 0 0 0 >> 0 0 1 0 0 0 0 0 >> 0 0 > > 9 1 1 0 0 0 0 >> 0 0 0 1 0 0 0 0 >> 0 0 > > 10 1 1 0 0 0 0 >> 0 0 0 0 1 0 0 0 >> 0 0 > > 11 1 1 0 0 0 0 >> 0 0 0 0 0 1 0 0 >> 0 0 > > 12 1 1 0 0 0 0 >> 0 0 0 0 0 0 1 0 >> 0 0 > > 13 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 1 >> 0 0 > > 14 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 0 >> 1 0 > > 15 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 0 >> 0 1 > > 16 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 17 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 18 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 19 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 20 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 21 1 1 0 0 0 0 >> 0 0 0 0 0 0 0 0 >> 0 0 > > 22 1 0 0 0 0 0 >> 1 0 0 0 0 0 0 0 >> 0 0 > > 23 1 0 0 0 0 0 >> 0 1 0 0 0 0 0 0 >> 0 0 > > 24 1 0 0 0 0 0 >> 0 0 1 0 0 0 0 0 >> 0 0 > > 25 1 0 0 0 0 0 >> 0 0 0 1 0 0 0 0 >> 0 0 > > 26 1 0 0 0 0 0 >> 0 0 0 0 1 0 0 0 >> 0 0 > > 27 1 0 0 0 0 0 >> 0 0 0 0 0 1 0 0 >> 0 0 > > 28 1 0 0 0 0 0 >> 0 0 0 0 0 0 1 0 >> 0 0 > > 29 1 0 0 0 0 0 >> 0 0 0 0 0 0 0 1 >> 0 0 > > Subjecte7 Subjecte8 Subjectf5 Subjectf6 Subjectf7 Subjectf8 TreatmentB >> TreatmentC TreatmentD TreatmentE TreatmentF > > 1 0 0 0 0 0 0 0 >> 0 0 0 0 > > 2 0 0 0 0 0 0 0 >> 0 0 0 0 > > 3 0 0 0 0 0 0 1 >> 0 0 0 0 > > 4 0 0 0 0 0 0 1 >> 0 0 0 0 > > 5 0 0 0 0 0 0 1 >> 0 0 0 0 > > 6 0 0 0 0 0 0 0 >> 1 0 0 0 > > 7 0 0 0 0 0 0 0 >> 1 0 0 0 > > 8 0 0 0 0 0 0 0 >> 1 0 0 0 > > 9 0 0 0 0 0 0 0 >> 1 0 0 0 > > 10 0 0 0 0 0 0 0 >> 0 1 0 0 > > 11 0 0 0 0 0 0 0 >> 0 1 0 0 > > 12 0 0 0 0 0 0 0 >> 0 1 0 0 > > 13 0 0 0 0 0 0 0 >> 0 1 0 0 > > 14 0 0 0 0 0 0 0 >> 0 0 1 0 > > 15 0 0 0 0 0 0 0 >> 0 0 1 0 > > 16 1 0 0 0 0 0 0 >> 0 0 1 0 > > 17 0 1 0 0 0 0 0 >> 0 0 1 0 > > 18 0 0 1 0 0 0 0 >> 0 0 0 1 > > 19 0 0 0 1 0 0 0 >> 0 0 0 1 > > 20 0 0 0 0 1 0 0 >> 0 0 0 1 > > 21 0 0 0 0 0 1 0 >> 0 0 0 1 > > 22 0 0 0 0 0 0 0 >> 1 0 0 0 > > 23 0 0 0 0 0 0 0 >> 1 0 0 0 > > 24 0 0 0 0 0 0 0 >> 1 0 0 0 > > 25 0 0 0 0 0 0 0 >> 1 0 0 0 > > 26 0 0 0 0 0 0 0 >> 0 1 0 0 > > 27 0 0 0 0 0 0 0 >> 0 1 0 0 > > 28 0 0 0 0 0 0 0 >> 0 1 0 0 > > 29 0 0 0 0 0 0 0 >> 0 1 0 0 > > I'm also attaching the target file in case it adds any info. Can you, perhaps, give any advice on this? ps. I am not sure whether I should continue submitting to this thread or open up a new one as this problem may not be related to the original question I asked. On Wed, Sep 3, 2014 at 12:12 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Yes, that's all that is needed. > > In general, one corrects for a batch effect by fitting > > ~ batch + otherterms > > where "otherterms" is what the model would have been without the batch > effect. > > Best wishes > Gordon > > > On Mon, 1 Sep 2014, Nick N wrote: > > Dea Gordon, Ryan and Nicolas, >> >> Than you all for the detailed advice. >> >> I have one more question regarding the blocking factor model. In my case I >> have, actually, 2 external factors to consider - one is the platform, the >> other one are the subjects. >> >> My sample matrix is the following (I've attached the CSV in case you can't >> view the image): >> >> >> >> >> >> I am only interested in comparing treatments B:D to A (the latter are >> controls). So far I've never had a model with more than one external >> factor. I imagine it should be OK to have more - is this correct? If yes - >> can you, perhaps, check whether I am setting the model matrix correctly? >> (Apologies if this sounds too trivial) I imagine it shall be defined as: >> >> Platform <- factor(targets$Platform) >> >>> Subject <- factor(targets$Subject) >>> Treatment <- factor(targets$Treatment) >>> design <- model.matrix(~Platform+Subject+Treatment) >>> >> >> .. >> >>> fit <- glmFit(y, design) >>> lrt <- glmLRT(fit, coef=24) # for comparing Treatment B to Treatment A >>> >> >> >> Is this correct? >> >> >> >> >> On Sun, Aug 31, 2014 at 12:44 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> >> wrote: >> >> Dear Nick, >>> >>> If you go back to the post from 2010 that you give the URL for, you will >>> see that I was giving very briefly the same advice about checking Poisson >>> variability that Ryan has explained at greater detail. >>> >>> You don't give any information about read lengths, sequence depths or >>> alignment methods. I would be surprised if MiSeq and HiSeq would >>> generate >>> perfect Poisson replicates of one another, especially if the read lengths >>> from the two platform are different or the alignment and counting >>> software >>> has been varied. So you may well end up back at the blocking idea. >>> >>> >>> Best wishes >>> Gordon >>> >>> --------------------------------------------- >>> Professor Gordon K Smyth, >>> Bioinformatics Division, >>> Walter and Eliza Hall Institute of Medical Research, >>> 1G Royal Parade, Parkville, Vic 3052, Australia. >>> http://www.statsci.org/smyth >>> >>> On Sun, 31 Aug 2014, Ryan wrote: >>> >>> Thanks to the underlying theory behind dispersion estimation, you can >>> >>>> easily test whether your "technical replicates" really do represent >>>> technical replicates. Specifically, read counts in technical replicates >>>> should follow a Poisson distribution, which is a special case of the >>>> negative binomial with zero dispersion. So, simply fit a model using >>>> edgeR >>>> or DESeq2 with a separate coefficient for each group of technical >>>> replicates. Thus all the experimental variation will be absorbed into >>>> the >>>> model coefficients and the only thing left will be the technical >>>> variability of of the replicates. For true technical replicates, the >>>> dispersion should be zero for all genes. So if you estimate dispersions >>>> using this model, and plotBCV/plotDispEsts shows the dispersion very >>>> near >>>> to zero, then you can be confident that you really have technical >>>> replicates. If the dispersion is nonzero, then there is some additional >>>> source of unaccounted-for variation. >>>> >>>> I have used this method on a pilot dataset with several technical >>>> replicates for each condition. edgeR said the dispersion was something >>>> like >>>> 10^-3 or less for all genes except for the very low-expressed genes. >>>> >>>> -Ryan >>>> >>>> On 8/28/14, 9:23 AM, Nick N wrote: >>>> >>>> Hi, >>>>> >>>>> I have a study where a fraction of the samples have been replicated on >>>>> 2 >>>>> Illumina platforms (HiSeq and Miseq). These are technical replicates - >>>>> the >>>>> library preparation is the same using the same biological replicates - >>>>> it's >>>>> only the sequencing which is different. >>>>> >>>>> My hunch was that I shall introduce the platform as as an additional >>>>> (blocking) factor in the analysis. Than I stumbled upon this post: >>>>> >>>>> https://stat.ethz.ch/pipermail/bioconductor/2010-April/033099.html >>>>> >>>>> It recommends pooling the replicates. The post seems to apply to a >>>>> different case ("pure" technical replicates, i.e. no differences in the >>>>> sequencing platform used) so I probably shall ignore it. But I still >>>>> feel a >>>>> bit uncertain of the best way to treat the technical replicates. Can >>>>> you, >>>>> please, advise me on this? >>>>> >>>>> many thanks! >>>>> Nick >>>>> >>>> > ______________________________________________________________________ > The information in this email is confidential and intended solely for the > addressee. > You must not disclose, forward, print or use it without the permission of > the sender. > ______________________________________________________________________ > -------------- next part -------------- FileName Platform Subject Treatment ../A7/accepted_hits_clean.count MiSeq a7 A ../A8/accepted_hits_clean.count MiSeq a8 A ../B6/accepted_hits_clean.count MiSeq b6 B ../B7/accepted_hits_clean.count MiSeq b7 B ../B8/accepted_hits_clean.count MiSeq b8 B ../C5/accepted_hits_clean.count MiSeq c5 C ../C6/accepted_hits_clean.count MiSeq c6 C ../C7/accepted_hits_clean.count MiSeq c7 C ../C8/accepted_hits_clean.count MiSeq c8 C ../D5/accepted_hits_clean.count MiSeq d5 D ../D6/accepted_hits_clean.count MiSeq d6 D ../D7/accepted_hits_clean.count MiSeq d7 D ../D8/accepted_hits_clean.count MiSeq d8 D ../E5/accepted_hits_clean.count MiSeq e5 E ../E6/accepted_hits_clean.count MiSeq e6 E ../E7/accepted_hits_clean.count MiSeq e7 E ../E8/accepted_hits_clean.count MiSeq e8 E ../F5/accepted_hits_clean.count MiSeq f5 F ../F6/accepted_hits_clean.count MiSeq f6 F ../F7/accepted_hits_clean.count MiSeq f7 F ../F8/accepted_hits_clean.count MiSeq f8 F ../SC5/accepted_hits_clean.count HiSeq c5 C ../SC6/accepted_hits_clean.count HiSeq c6 C ../SC7/accepted_hits_clean.count HiSeq c7 C ../SC8/accepted_hits_clean.count HiSeq c8 C ../SD5/accepted_hits_clean.count HiSeq d5 D ../SD6/accepted_hits_clean.count HiSeq d6 D ../SD7/accepted_hits_clean.count HiSeq d7 D ../SD8/accepted_hits_clean.count HiSeq d8 D

ADD REPLY • link 9.6 years ago Nikolay N. ▴ 30

score 0 · Answer 2 · 2014-09-05

Hi, This is a simple problem in building your design matrix. If you are trying our recommendation of fitting dispersions for technical replicates to verify the Poisson variance assumption, then your model should be just "~Subject". If you want to fit model for analyzing the biological aspects, then you either want to pool the different platforms and use "~Treatment" as your design, or you want to keep them separate and use "~Treatment + Platform" as the design, (or you can fit Platform as a mixed effect voom and duplicateCorrelation, but that's more complicated to explain). However, using all three of these factors in a single design is both confusing the two analyses (technical vs biological) and putting too many degrees of freedom into the model. -Ryan On 09/05/2014 02:31 PM, Nikolay N. wrote: > Thanks for all the advice so far. I tried that route but I am getting > > > y <- estimateGLMCommonDisp(d, design) > > Error in glmFit.default(y, design = design, dispersion = > dispersion, offset = offset, : > > Design matrix not of full rank. The following coefficients > not estimable: > > TreatmentB TreatmentC TreatmentD TreatmentE TreatmentF > > > For the sake of completeness here is my code: > > targets <- readTargets() > > d <- readDGE(targets$FileName) > > keep <- rowSums(cpm(d)>1) >=3 > > d <- d[keep,] > > d$samples$lib.size <- colSums(d$counts) > > d <- calcNormFactors(d) > > all_cpm=cpm(d, normalized.lib.size=TRUE) > > all_counts <- cbind(rownames(all_cpm), all_cpm) > > colnames(all_counts)[1] <- "Ensembl.Gene.ID > <http: ensembl.gene.id="">" > > rownames(all_counts) <- NULL > > Subject <- factor(targets$Subject) > > Platform <- factor(targets$Platform) > > Treatment <- factor(targets$Treatment) > > design <- model.matrix(~Platform+Subject+Treatment) > > y <- estimateGLMCommonDisp(d, design) > > > The design matrix looks OK to me: > > > design > > (Intercept) PlatformMiSeq Subjecta8 Subjectb6 Subjectb7 > Subjectb8 Subjectc5 Subjectc6 Subjectc7 Subjectc8 Subjectd5 > Subjectd6 Subjectd7 Subjectd8 Subjecte5 Subjecte6 > > 1 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 2 1 1 1 0 0 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 3 1 1 0 1 0 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 4 1 1 0 0 1 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 5 1 1 0 0 0 > 1 0 0 0 0 0 > 0 0 0 0 0 > > 6 1 1 0 0 0 > 0 1 0 0 0 0 > 0 0 0 0 0 > > 7 1 1 0 0 0 > 0 0 1 0 0 0 > 0 0 0 0 0 > > 8 1 1 0 0 0 > 0 0 0 1 0 0 > 0 0 0 0 0 > > 9 1 1 0 0 0 > 0 0 0 0 1 0 > 0 0 0 0 0 > > 10 1 1 0 0 0 > 0 0 0 0 0 1 > 0 0 0 0 0 > > 11 1 1 0 0 0 > 0 0 0 0 0 0 > 1 0 0 0 0 > > 12 1 1 0 0 0 > 0 0 0 0 0 0 > 0 1 0 0 0 > > 13 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 1 0 0 > > 14 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 0 1 0 > > 15 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 0 0 1 > > 16 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 17 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 18 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 19 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 20 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 21 1 1 0 0 0 > 0 0 0 0 0 0 > 0 0 0 0 0 > > 22 1 0 0 0 0 > 0 1 0 0 0 0 > 0 0 0 0 0 > > 23 1 0 0 0 0 > 0 0 1 0 0 0 > 0 0 0 0 0 > > 24 1 0 0 0 0 > 0 0 0 1 0 0 > 0 0 0 0 0 > > 25 1 0 0 0 0 > 0 0 0 0 1 0 > 0 0 0 0 0 > > 26 1 0 0 0 0 > 0 0 0 0 0 1 > 0 0 0 0 0 > > 27 1 0 0 0 0 > 0 0 0 0 0 0 > 1 0 0 0 0 > > 28 1 0 0 0 0 > 0 0 0 0 0 0 > 0 1 0 0 0 > > 29 1 0 0 0 0 > 0 0 0 0 0 0 > 0 0 1 0 0 > > Subjecte7 Subjecte8 Subjectf5 Subjectf6 Subjectf7 Subjectf8 > TreatmentB TreatmentC TreatmentD TreatmentE TreatmentF > > 1 0 0 0 0 0 0 > 0 0 0 0 0 > > 2 0 0 0 0 0 0 > 0 0 0 0 0 > > 3 0 0 0 0 0 0 > 1 0 0 0 0 > > 4 0 0 0 0 0 0 > 1 0 0 0 0 > > 5 0 0 0 0 0 0 > 1 0 0 0 0 > > 6 0 0 0 0 0 0 > 0 1 0 0 0 > > 7 0 0 0 0 0 0 > 0 1 0 0 0 > > 8 0 0 0 0 0 0 > 0 1 0 0 0 > > 9 0 0 0 0 0 0 > 0 1 0 0 0 > > 10 0 0 0 0 0 0 > 0 0 1 0 0 > > 11 0 0 0 0 0 0 > 0 0 1 0 0 > > 12 0 0 0 0 0 0 > 0 0 1 0 0 > > 13 0 0 0 0 0 0 > 0 0 1 0 0 > > 14 0 0 0 0 0 0 > 0 0 0 1 0 > > 15 0 0 0 0 0 0 > 0 0 0 1 0 > > 16 1 0 0 0 0 0 > 0 0 0 1 0 > > 17 0 1 0 0 0 0 > 0 0 0 1 0 > > 18 0 0 1 0 0 0 > 0 0 0 0 1 > > 19 0 0 0 1 0 0 > 0 0 0 0 1 > > 20 0 0 0 0 1 0 > 0 0 0 0 1 > > 21 0 0 0 0 0 1 > 0 0 0 0 1 > > 22 0 0 0 0 0 0 > 0 1 0 0 0 > > 23 0 0 0 0 0 0 > 0 1 0 0 0 > > 24 0 0 0 0 0 0 > 0 1 0 0 0 > > 25 0 0 0 0 0 0 > 0 1 0 0 0 > > 26 0 0 0 0 0 0 > 0 0 1 0 0 > > 27 0 0 0 0 0 0 > 0 0 1 0 0 > > 28 0 0 0 0 0 0 > 0 0 1 0 0 > > 29 0 0 0 0 0 0 > 0 0 1 0 0 > > > > I'm also attaching the target file in case it adds any info. > > Can you, perhaps, give any advice on this? > > ps. I am not sure whether I should continue submitting to this thread > or open up a new one as this problem may not be related to the > original question I asked. > > On Wed, Sep 3, 2014 at 12:12 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> <mailto:smyth at="" wehi.edu.au="">> wrote: > > Yes, that's all that is needed. > > In general, one corrects for a batch effect by fitting > > ~ batch + otherterms > > where "otherterms" is what the model would have been without the > batch effect. > > Best wishes > Gordon > > > On Mon, 1 Sep 2014, Nick N wrote: > > Dea Gordon, Ryan and Nicolas, > > Than you all for the detailed advice. > > I have one more question regarding the blocking factor model. > In my case I > have, actually, 2 external factors to consider - one is the > platform, the > other one are the subjects. > > My sample matrix is the following (I've attached the CSV in > case you can't > view the image): > > > > > > I am only interested in comparing treatments B:D to A (the > latter are > controls). So far I've never had a model with more than one > external > factor. I imagine it should be OK to have more - is this > correct? If yes - > can you, perhaps, check whether I am setting the model matrix > correctly? > (Apologies if this sounds too trivial) I imagine it shall be > defined as: > > Platform <- factor(targets$Platform) > > Subject <- factor(targets$Subject) > Treatment <- factor(targets$Treatment) > design <- model.matrix(~Platform+Subject+Treatment) > > > .. > > fit <- glmFit(y, design) > lrt <- glmLRT(fit, coef=24) # for comparing Treatment B to > Treatment A > > > > Is this correct? > > > > > On Sun, Aug 31, 2014 at 12:44 AM, Gordon K Smyth > <smyth at="" wehi.edu.au="" <mailto:smyth="" at="" wehi.edu.au="">> wrote: > > Dear Nick, > > If you go back to the post from 2010 that you give the URL > for, you will > see that I was giving very briefly the same advice about > checking Poisson > variability that Ryan has explained at greater detail. > > You don't give any information about read lengths, > sequence depths or > alignment methods. I would be surprised if MiSeq and > HiSeq would generate > perfect Poisson replicates of one another, especially if > the read lengths > from the two platform are different or the alignment and > counting software > has been varied. So you may well end up back at the > blocking idea. > > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > http://www.statsci.org/smyth > > On Sun, 31 Aug 2014, Ryan wrote: > > Thanks to the underlying theory behind dispersion > estimation, you can > > easily test whether your "technical replicates" really > do represent > technical replicates. Specifically, read counts in > technical replicates > should follow a Poisson distribution, which is a > special case of the > negative binomial with zero dispersion. So, simply fit > a model using edgeR > or DESeq2 with a separate coefficient for each group > of technical > replicates. Thus all the experimental variation will > be absorbed into the > model coefficients and the only thing left will be the > technical > variability of of the replicates. For true technical > replicates, the > dispersion should be zero for all genes. So if you > estimate dispersions > using this model, and plotBCV/plotDispEsts shows the > dispersion very near > to zero, then you can be confident that you really > have technical > replicates. If the dispersion is nonzero, then there > is some additional > source of unaccounted-for variation. > > I have used this method on a pilot dataset with > several technical > replicates for each condition. edgeR said the > dispersion was something like > 10^-3 or less for all genes except for the very > low-expressed genes. > > -Ryan > > On 8/28/14, 9:23 AM, Nick N wrote: > > Hi, > > I have a study where a fraction of the samples > have been replicated on 2 > Illumina platforms (HiSeq and Miseq). These are > technical replicates - the > library preparation is the same using the same > biological replicates - it's > only the sequencing which is different. > > My hunch was that I shall introduce the platform > as as an additional > (blocking) factor in the analysis. Than I stumbled > upon this post: > > https://stat.ethz.ch/pipermail/bioconductor/2010-April/033099.html > > It recommends pooling the replicates. The post > seems to apply to a > different case ("pure" technical replicates, i.e. > no differences in the > sequencing platform used) so I probably shall > ignore it. But I still feel a > bit uncertain of the best way to treat the > technical replicates. Can you, > please, advise me on this? > > many thanks! > Nick > > > ______________________________________________________________________ > The information in this email is confidential and intended solely > for the addressee. > You must not disclose, forward, print or use it without the > permission of the sender. > ______________________________________________________________________ > > [[alternative HTML version deleted]]