Question

allele specific expression in edgeR/DESeq

1

Entering edit mode

gthm ▴ 30

@gthm-8377

Last seen 5.1 years ago

spain

This question is an extension to the previously posted question. Allele Specific Expression in DESeq2 or EdgeR

I have a paired data set, treated vs untreated samples from each individual.

ind1_T,ind1_U, ind2_T,ind2_U, ind3_T,ind3_U

I would like to perform differential allele specific analysis of SNPs i.e SNPs that show a significant allelic imbalance between Treated vs Untreated. I would like to know how to model this in DESeq/edgeR ?

SNP	ind1_T	ind1_U	ind2_T	ind2_U	ind3_T	ind3_U
kgp1876518_REF_G	0	0	0	0	0	0
kgp1876518_ALT_A	0	0	0	0	0	0
kgp1876747_REF_A	0	77	0	55	0	55
kgp1876747_ALT_G	0	35	0	30	0	30
kgp1877878_REF_C	77	0	55	10	60	13
kgp1877878_ALT_T	34	0	25	5	28	8

edger deseq2 deseq • 1.4k views

ADD COMMENT • link updated 8.4 years ago by Aaron Lun ★ 28k • written 8.4 years ago by gthm ▴ 30

score 1 · Answer 1 · 2015-12-07

I would suggest that you arrange your data so that each row contains all the counts for both alleles of each SNP, for all treated/untreated samples from all individuals. This means that you would get 12 counts per gene in total (4 counts per individual). You could then set up your design like:

allele <- factor(rep(c("alt", "ref"), 6), level=c("ref", "alt"))
individual <- factor(rep(1:3, each=4))
treatment <- factor(rep(c("T", "T", "U", "U"), 3))
design <- model.matrix(~0 + individual:treatment + allele:treatment)

This design is a bit complicated, so bear with me. If we look at colnames(design), we get:

[1] "individual1:treatmentT" "individual2:treatmentT" "individual3:treatmentT"
[4] "individual1:treatmentU" "individual2:treatmentU" "individual3:treatmentU"
[7] "treatmentT:allelealt"   "treatmentU:allelealt"

The first six coefficients represent the blocking factors for each individual/treatment combination, i.e., the baseline log-expression of the reference allele in each individual/treatment combination. The last two coefficients represent the log-fold change of the alternative allele over the reference (i.e., allele-specific expression) in the treated or untreated samples. You can drop either of these two coefficients individually to test for ASE in each treatment type, or you can compare them to each other (e.g., by setting the contrast argument to c(0,0,0,0,0,0,-1,1) in glmLRT) to test for differential ASE between treatment types.

For the sake of completeness, I will also describe a more sophisticated model:

design <- model.matrix(~0 + individual:treatment + individual:allele + allele:treatment)

If we look at the colnames(design), we get:

 [1] "individual1:treatmentT" "individual2:treatmentT" "individual3:treatmentT"
 [4] "individual1:treatmentU" "individual2:treatmentU" "individual3:treatmentU"
 [7] "individual1:allelealt"  "individual2:allelealt"  "individual3:allelealt"
[10] "treatmentU:allelealt"

The first six coefficients are as previously described; the next three coefficients represent the alternative/reference ASE in the treated sample of each individual; and the last coefficient represents the log-fold change in the ASE in the untreated over treated samples across individuals. In other words, the last coefficient can be dropped to test for differential ASE upon treatment. This model is more sophisticated as it allows for individual-specific ASE that might have inflated the dispersions in the previous model (thus leading to a decrease in power). However, the cost is that this model provides fewer residual d.f., which will also reduce power.