Question

EdgeR Differential Expression analysis with NAs

0

Entering edit mode

elliott77 • 0

@elliott77-6778

Last seen 3.7 years ago

United States

Hi,

I'd like to be able to use EdgeR to look for allele-specific differential expression in data sets containing NAs for some samples. The analysis runs well on my datasets lacking NAs but when I include the NAs. I get and error when I run 'estimateGLMCommonDisp':

parent <-rep(c("mother", "father"), 10)

d <- DGEList(counts = counts, group=parent, genes = row.names(counts), remove.zeros=T)

model.matrix(~parent) -> design
d <- estimateGLMCommonDisp(d, design)

Error in if (any(y < 0)) stop("y must be non-negative") : 
  missing value where TRUE/FALSE needed

where 'counts' is a data frame of allele-specific digital expression data containing NAs for some samples.

Thanks!

Elliott

EdgeR rnaseq NA • 8.7k views

ADD COMMENT • link updated 10.4 years ago by Aaron Lun ★ 28k • written 10.5 years ago by elliott77 • 0

0

Entering edit mode

Thanks for your help,

We are looking at allele-specific expression by RNASeq in an outbred population. In order to distinguish maternally/paternally transcribed RNASeq reads in the offspring we rely on heterozygous SNPs (we've sequenced and genotyped the parents). For a given gene we may or may not have distinguishing SNPs in all samples but would still like to be able to look for parent-of-origin-specific expression.

ADD REPLY • link 10.4 years ago elliott77 • 0

1

Entering edit mode

Does this mean you have two counts (one from "mother", one from "father") from each individual? If so, does this mean that both counts are defined as NA if there are no distinguishing SNPs for that individual?

ADD REPLY • link 10.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

That's right!

ADD REPLY • link 10.4 years ago elliott77 • 0

score 5 · Answer 1 · 2014-09-25

5

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 1 hour ago

WEHI, Melbourne, Australia

Short answer is that we don't allow NAs in edgeR analyses.

In fact, I have yet to see a experiment for which it would be meaningful to have NA counts. So, as Steve says, the real question here is why you have NA counts in the first place.

ADD COMMENT • link 10.4 years ago Gordon Smyth 52k

0

Entering edit mode

Does it makes sense to get 'NA' s in the result of differential expression analysis using edgeR? Under the 'Gene symbol' and 'Gene name' I am getting NAs but all the EntrezID information is filled;however cannot match the EntrezID to a relevant gene name. Could anything be wrong? I was performing a differential gene expression of TCGA data sets. Great thanks in advance!

ADD REPLY • link 5.1 years ago fawazfebin ▴ 60

1

Entering edit mode

edgeR does not generate NAs and associating gene symbols or names with EntrezIDs is not the responsibility of edgeR. Annotation like that is assembled outside of edgeR although edgeR will store it.

You comment here is not actually related to the original question above. If you want help on TCGA annotation it would be better to post it as a separate question.

ADD REPLY • link 5.1 years ago Gordon Smyth 52k

0

Entering edit mode

OK, Thank you for the guidance.

ADD REPLY • link 5.1 years ago fawazfebin ▴ 60

Gordon Smyth · Answer 2 · 2014-09-24

1

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 2.0 years ago

United States

An easy answer would be to replace the NA's with 0.

If for some reason that approach is not reasonable to you, perhaps we can get into a philosophical discussion of what an appropriate thing to do for data with missing values is in context of a differential expression analysis ;-)

ADD COMMENT • link 10.5 years ago Steve Lianoglou ★ 13k

1

Entering edit mode

Replacement of NA's with zeros will probably be misleading. For example, consider a one-way layout with several groups, like that described by the OP. For a given gene, assume that one group contains several large counts, as well as one or more missing values. Replacing these missing values with zero will inflate the apparent variability in the counts for this group. Similarly, if the number of missing values is different between groups (and all other counts are reasonably large), replacement by zero can result in spurious differences in the group averages.

ADD REPLY • link 10.5 years ago • updated 10.4 years ago Aaron Lun ★ 28k

1

Entering edit mode

I totally agree that replacing NA with 0 will produce different results than if one could deal with missing data "appropriately" (whatever that means in this context), but the question is where are the NA's coming from?

ADD REPLY • link updated 10.4 years ago by Gordon Smyth 52k • written 10.4 years ago by Steve Lianoglou ★ 13k

score 1 · Answer 3 · 2014-09-25

It is possible to manipulate edgeR into ignoring the samples for which there are no distinguishing SNPs. This strategy uses a paired-sample design whereby the maternal and paternal counts for each individual are treated as a pair. A paired design should be more appropriate than your one-way layout, as the former accounts for any individual-specific effects on gene expression. Assuming that counts from the same individual are in adjacent columns of the count matrix, we can do:

individual <- rep(1:10, rep(2, 10))
design <- model.matrix(~factor(parent) + factor(individual))

And then, as Steve suggested:

counts[is.na(counts)] <- 0

The analysis can proceed with the GLM-based methods in edgeR. Specification of the group is not necessary as the experimental design is described in design.

d <- DGEList(counts=counts, genes=row.names(counts), remove.zeros=TRUE)
d <- estimateGLMCommonDisp(d, design)

The idea here is that both of the counts for an individual will be NA if no SNPs are available to distinguish the maternal and paternal alleles. When these missing values are converted into zeros, the coefficient corresponding to that individual will approach negative infinity in the fitted GLM. This will not affect nor be affected by the estimated (finite) value of the coefficient corresponding to the paternal/maternal log-fold change for all other individuals. As such, the missing individual is effectively taken out of consideration during inference of allele-specific DE.

In practice, dispersion estimation is slightly biased by this strategy, due to inaccuracies with calculation of the Cox-Reid adjusted profile likelihood when a fitted value is arbitrarily close to zero. Nonetheless, it's a good place to start.

score 0 · Answer 4 · 2014-09-25

0

Entering edit mode

elliott77 • 0

@elliott77-6778

Last seen 3.7 years ago

United States

Thanks for your help,

We are looking at allele-specific expression by RNASeq in an outbred population. In order to distinguish maternally/paternally transcribed RNASeq reads in the offspring we rely on heterozygous SNPs (we've sequenced and genotyped the parents). For a given gene we may or may not have distinguishing SNPs in all samples but would still like to be able to look for parent-of-origin-specific expression.

ADD COMMENT • link 10.4 years ago elliott77 • 0

1

Entering edit mode

Rather than adding another answer, consider adding clarifying information like this as a comment to a previous answer (or to the original question). Answers the support forum may not remain in the same order and are not threaded (don't follow one another logically).

ADD REPLY • link 10.4 years ago Sean Davis 21k