Allele Specific Expression in DESeq2 or EdgeR
3
2
Entering edit mode
martiningi ▴ 40
@martiningi-9274
Last seen 6.0 years ago
United States

Dear all. I am working on allele-specific expression analysis on 60+ human samples (unrelated, do not have parental genotyping).

One of my applications involves comparing expression of reference and alternative allele of around 50k heterozygous SNPs within these individuals between patients who develop cancer and those who do not. I have created a count matrix that containsFor each individual heterozygous for each SNP I have RNAcounts with ref and alt allele, or have listed the counts as NA if the individual is not heterozygous for the SNP:

 SNP sample1 sample2 kgp1876518_REF_G 0 NA kgp1876518_ALT_A 0 NA kgp1876747_REF_A 0 77 kgp1876747_ALT_G 0 35 kgp1877878_REF_C 77 NA kgp1877878_ALT_T 34 NA

I would prefer to use DESeq2 (or EdgeR) for this, based on good prior experience with this software

Of course for each SNP, multiple individuals are homozygous and therefore do not contain information on ref/alt counts. Simply replacing the NA with 0 can bias the count variability and dispersion estimate. Neither package allows NA counts.

Any thoughts on how to do this or alternative suggestions of solution?

rnaseq deseq2 edger • 2.2k views
7
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 20 hours ago
The city by the bay

Actually, if you block on the patient, you'll be fine with replacing the missing values with zeros (which might be surprising, but read on). Assume that we have a scenario with three patients:

patient <- factor(rep(1:3, each=2))
allele <- factor(rep(c("alt", "ref"), 3))
design <- model.matrix(~0 + patient + allele)

If we look at colnames(design), we get:

[1] "patient1"  "patient2"  "patient3"  "alleleref"

The first three coefficients represent blocking factors for each patient, while the last represents the log-fold change of the reference allele over the alternative. The idea is to fit this to the alternative/reference counts for each SNP across all patients, and then test for allele specific expression (ASE) by dropping the last coefficient. For the count matrix, you'll need to arrange it so that you have one row per SNP, and two columns per patient alternating between the alternative and reference alleles.

Now, if the counts for both alleles of any patient are zero, the estimate of the blocking factor for that patient will approach negative infinity. This means that the fitted value for those counts will be arbitrarily close to zero, i.e., the fit is always near-perfect for this patient, regardless of whatever else is happening in the other patients. This patient will not affect the GLM fit for this SNP; it will not affect the NB dispersion estimate; and it will not affect the computed log-FC and p-value for ASE. If you were to use QL edgeR, some adjustment is required to account for the loss of residual d.f., but this is done automatically so don't worry about it. In short, replacement with zeros has the same effect as if you took out that patient before fitting the GLM, which is what you would normally do for NA values.

2
Entering edit mode
martiningi ▴ 40
@martiningi-9274
Last seen 6.0 years ago
United States

This is a fantastic answer and very helpful to use DESeq/edgeR to pick up ASE.

What if you want to test if ASE is associated with another condition, say that the expression of alternative allele of SNP1 is higher in patients with cancer vs those who do not?

1
Entering edit mode

If you have something like:

disease <- factor(rep(c("cancer", "cancer", "normal"), each=2))

... you could do:

design <- model.matrix(~0 + patient + disease:allele)
design <- design[,!grepl("allelealt", colnames(design))] # get to full rank

If we now look at colnames(design), we get:

[1] "patient1"                "patient2"
[3] "patient3"                "diseasecancer:alleleref"
[5] "diseasenormal:alleleref"

So, the fourth and fifth coefficients represent the ASE log-fold change of the reference over the alternative in cancer and normal patients, respectively. Drop either to test for ASE in each disease type, or compare them to each other with contrast=c(0,0,0,1,-1) in glmLRT to test for disease-specific ASE.

0
Entering edit mode

Is there anyway if the treated/control is also paired ? Something like 3 samples, paired, treated vs untreated. So 3 treated and 3 untreated. Is it possible to implement this design in allele specific manner ?

0
Entering edit mode

0
Entering edit mode
martiningi ▴ 40
@martiningi-9274
Last seen 6.0 years ago
United States

This is a fantastic answer and very helpful to use DESeq/edgeR to pick up ASE.

What if you want to test if ASE is associated with another condition, say that the expression of alternative allele of SNP1 is higher in patients with cancer vs those who do not?