It is possible to manipulate edgeR into ignoring the samples for which there are no distinguishing SNPs. This strategy uses a paired-sample design whereby the maternal and paternal counts for each individual are treated as a pair. A paired design should be more appropriate than your one-way layout, as the former accounts for any individual-specific effects on gene expression. Assuming that counts from the same individual are in adjacent columns of the count matrix, we can do:
individual <- rep(1:10, rep(2, 10))
design <- model.matrix(~factor(parent) + factor(individual))
And then, as Steve suggested:
counts[is.na(counts)] <- 0
The analysis can proceed with the GLM-based methods in edgeR. Specification of the group
is not necessary as the experimental design is described in design
.
d <- DGEList(counts=counts, genes=row.names(counts), remove.zeros=TRUE)
d <- estimateGLMCommonDisp(d, design)
The idea here is that both of the counts for an individual will be NA
if no SNPs are available to distinguish the maternal and paternal alleles. When these missing values are converted into zeros, the coefficient corresponding to that individual will approach negative infinity in the fitted GLM. This will not affect nor be affected by the estimated (finite) value of the coefficient corresponding to the paternal/maternal log-fold change for all other individuals. As such, the missing individual is effectively taken out of consideration during inference of allele-specific DE.
In practice, dispersion estimation is slightly biased by this strategy, due to inaccuracies with calculation of the Cox-Reid adjusted profile likelihood when a fitted value is arbitrarily close to zero. Nonetheless, it's a good place to start.
Thanks for your help,
We are looking at allele-specific expression by RNASeq in an outbred population. In order to distinguish maternally/paternally transcribed RNASeq reads in the offspring we rely on heterozygous SNPs (we've sequenced and genotyped the parents). For a given gene we may or may not have distinguishing SNPs in all samples but would still like to be able to look for parent-of-origin-specific expression.
Does this mean you have two counts (one from
"mother"
, one from"father"
) from each individual? If so, does this mean that both counts are defined asNA
if there are no distinguishing SNPs for that individual?That's right!