Hi,
This is a simple problem in building your design matrix. If you are
trying our recommendation of fitting dispersions for technical
replicates to verify the Poisson variance assumption, then your model
should be just "~Subject". If you want to fit model for analyzing the
biological aspects, then you either want to pool the different
platforms
and use "~Treatment" as your design, or you want to keep them separate
and use "~Treatment + Platform" as the design, (or you can fit
Platform
as a mixed effect voom and duplicateCorrelation, but that's more
complicated to explain). However, using all three of these factors in
a
single design is both confusing the two analyses (technical vs
biological) and putting too many degrees of freedom into the model.
-Ryan
On 09/05/2014 02:31 PM, Nikolay N. wrote:
> Thanks for all the advice so far. I tried that route but I am
getting
>
> > y <- estimateGLMCommonDisp(d, design)
>
> Error in glmFit.default(y, design = design, dispersion =
> dispersion, offset = offset, :
>
> Design matrix not of full rank. The following coefficients
> not estimable:
>
> TreatmentB TreatmentC TreatmentD TreatmentE TreatmentF
>
>
> For the sake of completeness here is my code:
>
> targets <- readTargets()
>
> d <- readDGE(targets$FileName)
>
> keep <- rowSums(cpm(d)>1) >=3
>
> d <- d[keep,]
>
> d$samples$lib.size <- colSums(d$counts)
>
> d <- calcNormFactors(d)
>
> all_cpm=cpm(d, normalized.lib.size=TRUE)
>
> all_counts <- cbind(rownames(all_cpm), all_cpm)
>
> colnames(all_counts)[1] <- "Ensembl.Gene.ID
> <http: ensembl.gene.id="">"
>
> rownames(all_counts) <- NULL
>
> Subject <- factor(targets$Subject)
>
> Platform <- factor(targets$Platform)
>
> Treatment <- factor(targets$Treatment)
>
> design <- model.matrix(~Platform+Subject+Treatment)
>
> y <- estimateGLMCommonDisp(d, design)
>
>
> The design matrix looks OK to me:
>
> > design
>
> (Intercept) PlatformMiSeq Subjecta8 Subjectb6 Subjectb7
> Subjectb8 Subjectc5 Subjectc6 Subjectc7 Subjectc8 Subjectd5
> Subjectd6 Subjectd7 Subjectd8 Subjecte5 Subjecte6
>
> 1 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 2 1 1 1 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 3 1 1 0 1 0
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 4 1 1 0 0 1
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 5 1 1 0 0 0
> 1 0 0 0 0 0
> 0 0 0 0 0
>
> 6 1 1 0 0 0
> 0 1 0 0 0 0
> 0 0 0 0 0
>
> 7 1 1 0 0 0
> 0 0 1 0 0 0
> 0 0 0 0 0
>
> 8 1 1 0 0 0
> 0 0 0 1 0 0
> 0 0 0 0 0
>
> 9 1 1 0 0 0
> 0 0 0 0 1 0
> 0 0 0 0 0
>
> 10 1 1 0 0 0
> 0 0 0 0 0 1
> 0 0 0 0 0
>
> 11 1 1 0 0 0
> 0 0 0 0 0 0
> 1 0 0 0 0
>
> 12 1 1 0 0 0
> 0 0 0 0 0 0
> 0 1 0 0 0
>
> 13 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 1 0 0
>
> 14 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 0 1 0
>
> 15 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 1
>
> 16 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 17 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 18 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 19 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 20 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 21 1 1 0 0 0
> 0 0 0 0 0 0
> 0 0 0 0 0
>
> 22 1 0 0 0 0
> 0 1 0 0 0 0
> 0 0 0 0 0
>
> 23 1 0 0 0 0
> 0 0 1 0 0 0
> 0 0 0 0 0
>
> 24 1 0 0 0 0
> 0 0 0 1 0 0
> 0 0 0 0 0
>
> 25 1 0 0 0 0
> 0 0 0 0 1 0
> 0 0 0 0 0
>
> 26 1 0 0 0 0
> 0 0 0 0 0 1
> 0 0 0 0 0
>
> 27 1 0 0 0 0
> 0 0 0 0 0 0
> 1 0 0 0 0
>
> 28 1 0 0 0 0
> 0 0 0 0 0 0
> 0 1 0 0 0
>
> 29 1 0 0 0 0
> 0 0 0 0 0 0
> 0 0 1 0 0
>
> Subjecte7 Subjecte8 Subjectf5 Subjectf6 Subjectf7 Subjectf8
> TreatmentB TreatmentC TreatmentD TreatmentE TreatmentF
>
> 1 0 0 0 0 0 0
> 0 0 0 0 0
>
> 2 0 0 0 0 0 0
> 0 0 0 0 0
>
> 3 0 0 0 0 0 0
> 1 0 0 0 0
>
> 4 0 0 0 0 0 0
> 1 0 0 0 0
>
> 5 0 0 0 0 0 0
> 1 0 0 0 0
>
> 6 0 0 0 0 0 0
> 0 1 0 0 0
>
> 7 0 0 0 0 0 0
> 0 1 0 0 0
>
> 8 0 0 0 0 0 0
> 0 1 0 0 0
>
> 9 0 0 0 0 0 0
> 0 1 0 0 0
>
> 10 0 0 0 0 0 0
> 0 0 1 0 0
>
> 11 0 0 0 0 0 0
> 0 0 1 0 0
>
> 12 0 0 0 0 0 0
> 0 0 1 0 0
>
> 13 0 0 0 0 0 0
> 0 0 1 0 0
>
> 14 0 0 0 0 0 0
> 0 0 0 1 0
>
> 15 0 0 0 0 0 0
> 0 0 0 1 0
>
> 16 1 0 0 0 0 0
> 0 0 0 1 0
>
> 17 0 1 0 0 0 0
> 0 0 0 1 0
>
> 18 0 0 1 0 0 0
> 0 0 0 0 1
>
> 19 0 0 0 1 0 0
> 0 0 0 0 1
>
> 20 0 0 0 0 1 0
> 0 0 0 0 1
>
> 21 0 0 0 0 0 1
> 0 0 0 0 1
>
> 22 0 0 0 0 0 0
> 0 1 0 0 0
>
> 23 0 0 0 0 0 0
> 0 1 0 0 0
>
> 24 0 0 0 0 0 0
> 0 1 0 0 0
>
> 25 0 0 0 0 0 0
> 0 1 0 0 0
>
> 26 0 0 0 0 0 0
> 0 0 1 0 0
>
> 27 0 0 0 0 0 0
> 0 0 1 0 0
>
> 28 0 0 0 0 0 0
> 0 0 1 0 0
>
> 29 0 0 0 0 0 0
> 0 0 1 0 0
>
>
>
> I'm also attaching the target file in case it adds any info.
>
> Can you, perhaps, give any advice on this?
>
> ps. I am not sure whether I should continue submitting to this
thread
> or open up a new one as this problem may not be related to the
> original question I asked.
>
> On Wed, Sep 3, 2014 at 12:12 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> <mailto:smyth at="" wehi.edu.au="">> wrote:
>
> Yes, that's all that is needed.
>
> In general, one corrects for a batch effect by fitting
>
> ~ batch + otherterms
>
> where "otherterms" is what the model would have been without the
> batch effect.
>
> Best wishes
> Gordon
>
>
> On Mon, 1 Sep 2014, Nick N wrote:
>
> Dea Gordon, Ryan and Nicolas,
>
> Than you all for the detailed advice.
>
> I have one more question regarding the blocking factor
model.
> In my case I
> have, actually, 2 external factors to consider - one is the
> platform, the
> other one are the subjects.
>
> My sample matrix is the following (I've attached the CSV in
> case you can't
> view the image):
>
>
>
>
>
> I am only interested in comparing treatments B:D to A (the
> latter are
> controls). So far I've never had a model with more than one
> external
> factor. I imagine it should be OK to have more - is this
> correct? If yes -
> can you, perhaps, check whether I am setting the model
matrix
> correctly?
> (Apologies if this sounds too trivial) I imagine it shall be
> defined as:
>
> Platform <- factor(targets$Platform)
>
> Subject <- factor(targets$Subject)
> Treatment <- factor(targets$Treatment)
> design <- model.matrix(~Platform+Subject+Treatment)
>
>
> ..
>
> fit <- glmFit(y, design)
> lrt <- glmLRT(fit, coef=24) # for comparing Treatment B
to
> Treatment A
>
>
>
> Is this correct?
>
>
>
>
> On Sun, Aug 31, 2014 at 12:44 AM, Gordon K Smyth
> <smyth at="" wehi.edu.au="" <mailto:smyth="" at="" wehi.edu.au="">> wrote:
>
> Dear Nick,
>
> If you go back to the post from 2010 that you give the
URL
> for, you will
> see that I was giving very briefly the same advice about
> checking Poisson
> variability that Ryan has explained at greater detail.
>
> You don't give any information about read lengths,
> sequence depths or
> alignment methods. I would be surprised if MiSeq and
> HiSeq would generate
> perfect Poisson replicates of one another, especially if
> the read lengths
> from the two platform are different or the alignment and
> counting software
> has been varied. So you may well end up back at the
> blocking idea.
>
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
>
http://www.statsci.org/smyth
>
> On Sun, 31 Aug 2014, Ryan wrote:
>
> Thanks to the underlying theory behind dispersion
> estimation, you can
>
> easily test whether your "technical replicates"
really
> do represent
> technical replicates. Specifically, read counts in
> technical replicates
> should follow a Poisson distribution, which is a
> special case of the
> negative binomial with zero dispersion. So, simply
fit
> a model using edgeR
> or DESeq2 with a separate coefficient for each group
> of technical
> replicates. Thus all the experimental variation will
> be absorbed into the
> model coefficients and the only thing left will be
the
> technical
> variability of of the replicates. For true technical
> replicates, the
> dispersion should be zero for all genes. So if you
> estimate dispersions
> using this model, and plotBCV/plotDispEsts shows the
> dispersion very near
> to zero, then you can be confident that you really
> have technical
> replicates. If the dispersion is nonzero, then there
> is some additional
> source of unaccounted-for variation.
>
> I have used this method on a pilot dataset with
> several technical
> replicates for each condition. edgeR said the
> dispersion was something like
> 10^-3 or less for all genes except for the very
> low-expressed genes.
>
> -Ryan
>
> On 8/28/14, 9:23 AM, Nick N wrote:
>
> Hi,
>
> I have a study where a fraction of the samples
> have been replicated on 2
> Illumina platforms (HiSeq and Miseq). These are
> technical replicates - the
> library preparation is the same using the same
> biological replicates - it's
> only the sequencing which is different.
>
> My hunch was that I shall introduce the platform
> as as an additional
> (blocking) factor in the analysis. Than I
stumbled
> upon this post:
>
>
https://stat.ethz.ch/pipermail/bioconductor/2010-April/033099.html
>
> It recommends pooling the replicates. The post
> seems to apply to a
> different case ("pure" technical replicates,
i.e.
> no differences in the
> sequencing platform used) so I probably shall
> ignore it. But I still feel a
> bit uncertain of the best way to treat the
> technical replicates. Can you,
> please, advise me on this?
>
> many thanks!
> Nick
>
>
>
______________________________________________________________________
> The information in this email is confidential and intended
solely
> for the addressee.
> You must not disclose, forward, print or use it without the
> permission of the sender.
>
______________________________________________________________________
>
>
[[alternative HTML version deleted]]