Question: Using edgeR or DESeq2 to analyze allele-specific expression?
4
3.9 years ago by
United States
trianglescout40 wrote:

Hi,

Does anyone have experience with standard rna-seq analysis tools (edgeR, DESeq2, etc) to detect allele-specific expression? I have a count table in which each F1 hybrid (3 biological replicates) has separate counts for the paternal and maternal allele. I am considering implementing a paired design in edgeR to test for differential allele expression between the parental alleles. However, I have not seen any publications using edgeR or DESeq2 to analyze ASE, so I would like to know if there is an underlying reason why these tools would be inappropriate for this type of analysis.

Though there are some rna-seq analysis tools specifically designed for ASE, most assume a binomial distribution, which is not appropriate for count data.

Thanks!

Jennifer

edger deseq2 ase allele • 2.4k views
modified 3.9 years ago • written 3.9 years ago by trianglescout40
1

I haven't analyzed any ASE data, but would assume the appropriate model is closer to something like bisulfite sequencing for DNA methylation, since you're evaluating the relative levels of two signals within a sample at each locus, not the absolute level of one signal in each sample.

Answer: Using edgeR or DESeq2 to analyze allele-specific expression?
5
3.9 years ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

I don't see any obvious problems with what you're proposing. The real difficulty is in distinguishing the paternal and maternal read counts in the first place - if you've already done that, then it should be relatively easy to do the rest. To clarify the DE analysis, so that we're all talking about the same thing:

individual <- factor(c(1,1,2,2,3,3))
allele <- factor(rep(c("M", "P"), 3))
design <- model.matrix(~0+individual+allele)

The last coefficient represents the log-fold change in expression of the paternal allele over the maternal allele. Dropping it will test for allele-specific expression, averaged across all individuals. The first three coefficients will absorb any differences in overall expression between individuals, so that you're only testing for relative changes within individuals.

1

interesting! I hadn't thought of this before, doing a direct comparison of the counts, rather than the binomial approach.

I'm wondering about library size correction. For this design, I think you would want to make sure that there are not differences in normalization factors between M and P within an individual, or these would absorb the ASE. For the interaction design below, again you wouldn't want library size to affect the ratio of ratios. I'm thinking in both cases you could just set normalization factors to 1. What do you think?

I agree with Mike on this; I would set the normalization factors to unity to start with. Obviously, we're comparing within individuals, so we don't need to normalize between individuals (as any differences between individuals gets absorbed by the blocking factors anyway). There may be a case for normalizing within individuals, if there's a systematic yet uninteresting difference between the paternal and maternal counts, e.g., due to differing mapping success rates (if one parent has the reference genome, and the other has a bunch of variants that reduce the mapping rate).

Running calcNormFactors before analysis would get rid of this systematic difference. I guess it would also remove some ASE, but that would only be a problem if you expect a majority of genes to exhibit ASE (which may or may not be true). In general, you'll have to see if such differences exist on a case-by-case basis; and, if they do, decide whether or not they're due to genuine ASE in a majority of genes, or some sort of technical artefact like the mapping thing I described above.

Yes, the main question we are asking is: “for what genes is the paternal allele differentially expressed from the maternal allele”, and I think design <- model.matrix(~0 + individual + allele) will address this question. thanks!

To complicate things further, we may also look at how ASE changes by time-point. The setup here would be:

individual = factor(c(1,1,1,1,2,2,2,2,3,3,3,3)

allele = factor(rep(c("M", "P"), 6)

time = factor(rep(c(1,1,2,2), 3))

design = model.matrix(~0+individual+allele+time)

I think you'd be better off doing something like:

design = model.matrix(~0+individual:time+allele:time)

If you run colnames(design), you get:

[1] "individual1:time1" "individual2:time1" "individual3:time1"
[4] "individual1:time2" "individual2:time2" "individual3:time2"
[7] "time1:alleleP"     "time2:alleleP"    

The first 6 terms represent blocking factors for each individual/time combination. This is better than having only one term for each individual, as it absorbs any (uninteresting) changes in the baseline expression between times within an individual. The last two terms represent the paternal/maternal log-fold change at each respective time; you can drop them to test for ASE at each time, or compare them to each other to determine if there's a time/ASE interaction.

This thread is likely dead by now, by would be curious to hear more about "compare them to each other to determine if there's a time/ASE interaction". How would you propose that would work? Why not something like (if I get the syntax correct):

design = model.matrix(~0+individual:time+allele + allele:time)?

That formulation is not of full column rank. You could drop alleleM to get to full column rank, but the coefficients require some more work to interpret:

• individualX:timeY represent the blocking factors as before.
• alleleP represents the paternal/maternal log-fold change at time 1. This is confusing because if you just looked at the column name, you wouldn't think time was involved.
• alleleP:time2 is the difference in the P/M log-fold change in time 2 against 1.

By comparison, my formulation above is (i) of full rank, (ii) has coefficients where the default names match up nicely to what they mean mathematically, and (iii) is easy enough to work with if you use makeContrasts() to do the comparison between coefficients.

That makes perfect sense. Thanks! Do you have a suggestion for what it might look like to layer yet another level of data on top of it? [e.g. add temperature variation].

If you can assume that the effect of temperature variation is additive with respect to the other terms, then you can just add it to the design matrix.

If you want to identify allele-specific effects of temperature variation, then you would probably have to set up a more complex model with a allele:time:temp term. This would not be pleasant to try to interpret, but the goal would be to obtain a design (extending from my previous post) that contains time1:alleleP and time1:alleleP:temp terms. These represent the "intercept" and "slope" with respect to temperature of the allele-specific log-fold changes at time 1.

HI @Aaron:

What do you mean by, "Dropping it will test for allele-specific expression, averaged across all individuals"? Are we not supposed to keep the design as it is as

~0+individual+allele

Dropping the relevant coefficient forms the null design matrix, to test whether the coefficient has a non-zero value.

Answer: Using edgeR or DESeq2 to analyze allele-specific expression?
0
3.9 years ago by
United States
trianglescout40 wrote:

For each individual, I have “total expression” (sum of M + P + counts of unknown origin)**. Originally, I planned to normalize across individuals, using this total expression (as is done in standard workflow) and then pass the correction on to the ASE counts.

As an example, prior to normalization for a given gene I could have:

Individual_A : 150

Individual_B : 100

After normalization (total expression):

Individual_A : 100

Individual_B : 150

I know that in individual A, prior to normalization, there are 25 counts of M, 100 counts of P, and 25 counts of unknown origin.

So if use the normalized total expression, these “normalized” allele-specific counts in individual A would become: 16.67 counts of M and 66.67 counts of P in my new matrix. Counts of unknown origin would not be considered this design

**Now, I have also considered having total expression = M + P and discarding counts of unknown origin prior to normalization. I am not sure which approach is best.

1

In terms of the GLM fitting and testing with edgeR, any per-individual normalization (i.e., where you scale both maternal and paternal counts for an individual by the same factor) will be irrelevant. Any changes will be absorbed into the blocking factors of the various design matrices that I've described above. In any case, you should not be feeding normalized counts into edgeR. The analysis requires raw counts, and any normalization should be done through the normalization factors.

Ok, yes I see what you mean. I will start with setting the normalization factors == 1. Thanks for your input!

1

Incidentally, if you have the log-fold change between paternal and maternal counts for a given gene, you can recover the proportion of total expression for that gene from each parent with:

log.fc <- 2 # example for a given gene; assuming paternal/maternal log-FC
prop.mat <- 1/(1+2^log.fc) # proportion of expression from maternal allele

This may be more intuitive for whoever's reading the final DE list. Of course, it doesn't matter to edgeR.