EdgeR Differential Expression analysis with NAs
4
0
Entering edit mode
elliott77 • 0
@elliott77-6778
Last seen 15 months ago
United States

Hi,

I'd like to be able to use EdgeR to look for allele-specific differential expression in data sets containing NAs for some samples.  The analysis runs well on my datasets lacking NAs but when I include the NAs. I get and error when I run 'estimateGLMCommonDisp':

parent <-rep(c("mother", "father"), 10)

d <- DGEList(counts = counts, group=parent, genes = row.names(counts), remove.zeros=T)

model.matrix(~parent) -> design
d <- estimateGLMCommonDisp(d, design)

Error in if (any(y < 0)) stop("y must be non-negative") :
missing value where TRUE/FALSE needed

where 'counts' is a data frame of allele-specific digital expression data containing NAs for some samples.

Thanks!

Elliott

EdgeR rnaseq NA • 6.2k views
0
Entering edit mode

We are looking at allele-specific expression by RNASeq in an outbred population.  In order  to distinguish maternally/paternally transcribed RNASeq reads in the offspring we rely on heterozygous SNPs (we've sequenced and genotyped the parents).  For a given gene we may or may not have distinguishing SNPs in all samples but would still like to be able to look for parent-of-origin-specific expression.

1
Entering edit mode

Does this mean you have two counts (one from "mother", one from "father") from each individual? If so, does this mean that both counts are defined as NA if there are no distinguishing SNPs for that individual?

0
Entering edit mode

That's right!

5
Entering edit mode
@gordon-smyth
Last seen 20 minutes ago
WEHI, Melbourne, Australia

Short answer is that we don't allow NAs in edgeR analyses.

In fact, I have yet to see a experiment for which it would be meaningful to have NA counts. So, as Steve says, the real question here is why you have NA counts in the first place.

0
Entering edit mode

Does it makes sense to get 'NA' s in the result of differential expression analysis using edgeR? Under the 'Gene symbol' and 'Gene name' I am getting NAs but all the EntrezID information is filled;however cannot match the EntrezID to a relevant gene name. Could anything be wrong? I was performing a differential gene expression of TCGA data sets. Great thanks in advance!

1
Entering edit mode

edgeR does not generate NAs and associating gene symbols or names with EntrezIDs is not the responsibility of edgeR. Annotation like that is assembled outside of edgeR although edgeR will store it.

You comment here is not actually related to the original question above. If you want help on TCGA annotation it would be better to post it as a separate question.

0
Entering edit mode

OK, Thank you for the guidance.

1
Entering edit mode
@steve-lianoglou-2771
Last seen 26 days ago
United States

An easy answer would be to replace the NA's with 0.

If for some reason that approach is not reasonable to you, perhaps we can get into a philosophical discussion of what an appropriate thing to do for data with missing values is in context of a differential expression analysis ;-)

1
Entering edit mode

Replacement of NA's with zeros will probably be misleading. For example, consider a one-way layout with several groups, like that described by the OP. For a given gene, assume that one group contains several large counts, as well as one or more missing values. Replacing these missing values with zero will inflate the apparent variability in the counts for this group. Similarly, if the number of missing values is different between groups (and all other counts are reasonably large), replacement by zero can result in spurious differences in the group averages.

1
Entering edit mode

I totally agree that replacing NA with 0 will produce different results than if one could deal with missing data "appropriately" (whatever that means in this context), but the question is where are the NA's coming from?

1
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 19 hours ago
The city by the bay

It is possible to manipulate edgeR into ignoring the samples for which there are no distinguishing SNPs. This strategy uses a paired-sample design whereby the maternal and paternal counts for each individual are treated as a pair. A paired design should be more appropriate than your one-way layout, as the former accounts for any individual-specific effects on gene expression. Assuming that counts from the same individual are in adjacent columns of the count matrix, we can do:

individual <- rep(1:10, rep(2, 10))
design <- model.matrix(~factor(parent) + factor(individual))

And then, as Steve suggested:

counts[is.na(counts)] <- 0

The analysis can proceed with the GLM-based methods in edgeR. Specification of the group is not necessary as the experimental design is described in design.

d <- DGEList(counts=counts, genes=row.names(counts), remove.zeros=TRUE)
d <- estimateGLMCommonDisp(d, design)

The idea here is that both of the counts for an individual will be NA if no SNPs are available to distinguish the maternal and paternal alleles. When these missing values are converted into zeros, the coefficient corresponding to that individual will approach negative infinity in the fitted GLM. This will not affect nor be affected by the estimated (finite) value of the coefficient corresponding to the paternal/maternal log-fold change for all other individuals. As such, the missing individual is effectively taken out of consideration during inference of allele-specific DE.

In practice, dispersion estimation is slightly biased by this strategy, due to inaccuracies with calculation of the Cox-Reid adjusted profile likelihood when a fitted value is arbitrarily close to zero. Nonetheless, it's a good place to start.

0
Entering edit mode
elliott77 • 0
@elliott77-6778
Last seen 15 months ago
United States

We are looking at allele-specific expression by RNASeq in an outbred population.  In order  to distinguish maternally/paternally transcribed RNASeq reads in the offspring we rely on heterozygous SNPs (we've sequenced and genotyped the parents).  For a given gene we may or may not have distinguishing SNPs in all samples but would still like to be able to look for parent-of-origin-specific expression.

1
Entering edit mode

Rather than adding another answer, consider adding clarifying information like this as a comment to a previous answer (or to the original question).  Answers the support forum may not remain in the same order and are not threaded (don't follow one another logically).