Question: SNP-level Allele-specific expression detection using DESeq2 in an unpaired experimental design
gravatar for engelbrecht.stephan
8 days ago by
engelbrecht.stephan10 wrote:

Hi, I'm interested in ASE in plant hybrids, of which I have an F2 backcross. F2 hybrid progeny can have both homospecific (or intra) or heterospecific (inter) haplotype combinations. At the same SNP, I will have some individuals that are both heterospecific, and homospecific, most often in unequal proportions. For every SNP, I want to statistically compare changes in allelic ratios (ref/alt+ref) between these haplotype combos while accounting for variation within individuals.

Note that this design is fundamentally different from traditional unpaired condition-specific DE ASE analysis (control vs treatment), since the 'condition' (homo or hetero haplotype combo) is not the same for all SNPs in an individual (due to recombination, some haplotype combinations may be heterospecific, while others may be homospecific). I create a toy example of data from a single SNP so that you guys can determine if I'm on the right track.

Here's the data with counts for each indivuals's reference and alternate allele counts in successive order

hap <- c(rep("het",6),rep("hom",4))
ind <- factor(rep(1:5,each = 2))
allele <- factor(rep(c("ref","alt"),5))
counts <- c(9,19,2,4,10,11,13,16,13,20)

Create counts matrix and sample data

counts_matrix <- t(matrix(counts))
samps <-,hap,allele)

Create DESeq data structure with "dummy" design

dds <- DESeqDataSetFromMatrix(counts_matrix,samps,design = ~ind + allele)

Add a within-group identifier as suggested to prevent confounding

dds$ind.n <- factor(c(rep(1:3,each = 2),rep(1:2,each = 2)))

So now our coldata looks like this:


DataFrame with 10 rows and 4 columns
        ind      hap   allele     ind.n
   <factor> <factor> <factor> <integer>
1         1      het      ref         1
2         1      het      alt         1
3         2      het      ref         2
4         2      het      alt         2
5         3      het      ref         3
6         3      het      alt         3
7         4      hom      ref         1
8         4      hom      alt         1
9         5      hom      ref         2
10        5      hom      alt         2

Create custom design matrix and remove all-zero columns to get full rank. I'm not sure if I should add or remove the intercept. Would you guys say this design is correct to determine differences in allelic ratios between homo/het groups while controlling for uninteresting within individual variation?

mm <- model.matrix(~ 0 + hap + hap:ind.n + hap:allele, colData(dds)) <- apply(mm, MARGIN = 2, function(x) all(x==0)) #detect nonzero cols
idx <- which( #index nonzero cols
mm <- mm[,-idx] #remove all-zero cols

Replace dds design with correct formula used to create custom mm

design(dds) <- ~0 + hap+ hap:ind.n + hap:allele

Set size factors to 1 (specific to ASE analysis)

sizeFactors(dds) <- rep(1,ncol(counts_matrix))

Manually run all steps of DESeq. Dispersion estimation is not possible, since I can't process multiple SNPs, genes at once. Is this a shortfoll that can be fixed?

dds <- estimateDispersionsGeneEst(dds,modelMatrix = mm) 
dispersions(dds) <- mcols(dds)$dispGeneEst   
dds <- nbinomWaldTest(dds, betaPrior = FALSE, modelMatrix = mm) #I've seen people use LRT, but I think this appropriate

These are what the results look like:


log2 fold change (MLE): haphom.alleleref 
Wald test p-value: haphom.alleleref 
DataFrame with 1 row and 6 columns
   baseMean log2FoldChange     lfcSE      stat    pvalue      padj
  <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
1      11.7      -0.469485 0.3713064 -1.264414 0.2060815 0.2060815

As described by Michael Love, I convert LFC to allelic ratios (ref/alt+ref) in the two groups, as well as the contrast between groups as follows (described previously by Michael Love in the link below):

hom_results <- results(dds, name = "haphet.alleleref")
hom_ref.vs.alt <- 2^hom_results$log2FoldChange
hom_ref.vs.tot <- hom_ref.vs.alt/(1+hom_ref.vs.alt)
hom_p <- hom_results[,"pvalue"]

het_results <- results(dds, name = "haphom.alleleref")
het_ref.vs.alt <- 2^het_results$log2FoldChange
het_ref.vs.tot <- het_ref.vs.alt/(1+het_ref.vs.alt)
het_p <- het_results[,"pvalue"]

comp_results <- results(dds,contrast = list("haphom.alleleref","haphet.alleleref"))
diff <- 2^comp_results$log2FoldChange * (1/(1 + hom_ref.vs.alt)) / (1/(1 + het_ref.vs.alt)) #can you check that this is correct?
diff_p <- comp_results[,"pvalue"]


Then I create a vector with calculated values: ASE in the respective groups and difference in ASE between groups. The vector will be appended to a larger data structure using an lapply over all SNPs from which I can p.adjust("fdr") for multiple testing. What do you guys think?

out <- c(homAR = hom_ref.vs.tot, hom_p = hom_p, hetAR = het_ref.vs.tot,het_p =  het_p,diff = diff, diff_p = diff_p)

     homAR      hom_p      hetAR      het_p       diff     diff_p 
0.38181824 0.08255042 0.41935489 0.20608150 1.24490400 0.67943076 

Thanks a lot in advance. And thanks for the really clear manuals and answers to questions. I really has helped me a long way to using DE methods to detect ASE.

Any help, comments, suggestions, flaws, and criticism is highly appreciated!

Stephan Engelbrecht


R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.3

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] limma_3.28.21              BiocInstaller_1.22.3       stringr_1.2.0              gridExtra_2.3              reshape2_1.4.2         
[6] ggplot2_2.2.1              DESeq2_1.12.4              SummarizedExperiment_1.2.3 Biobase_2.32.0             GenomicRanges_1.24.3   
[11] GenomeInfoDb_1.8.7         IRanges_2.6.1              S4Vectors_0.10.3           BiocGenerics_0.18.0        bindrcpp_0.2           
[16] tidyr_0.7.2                dplyr_0.7.4               

loaded via a namespace (and not attached):
[1] bit64_0.9-7          splines_3.3.3        Formula_1.2-2        assertthat_0.2.0     latticeExtra_0.6-28  blob_1.1.0        
[7] RSQLite_2.0          backports_1.1.1      lattice_0.20-35      glue_1.2.0           digest_0.6.12        RColorBrewer_1.1-2
[13] XVector_0.12.1       checkmate_1.8.5      colorspace_1.3-2     htmltools_0.3.6      Matrix_1.2-11        plyr_1.8.4        
[19] XML_3.98-1.9         pkgconfig_2.0.1      genefilter_1.54.2    zlibbioc_1.18.0      purrr_0.2.4          xtable_1.8-2      
[25] scales_0.5.0         BiocParallel_1.6.6   htmlTable_1.9        tibble_1.3.4         annotate_1.50.1      nnet_7.3-12       
[31] lazyeval_0.2.1       survival_2.41-3      magrittr_1.5         memoise_1.1.0        foreign_0.8-69       tools_3.3.3       
[37] data.table_1.10.4-3  munsell_0.4.3        locfit_1.5-9.1       cluster_2.0.6        AnnotationDbi_1.34.4 rlang_0.1.4       
[43] grid_3.3.3           RCurl_1.95-4.8       htmlwidgets_0.9      bitops_1.0-6         base64enc_0.1-3      labeling_0.3      
[49] gtable_0.2.0         DBI_0.7              R6_2.2.2             knitr_1.17           bit_1.1-12           bindr_0.1         
[55] Hmisc_4.0-3          stringi_1.1.5        Rcpp_0.12.13         geneplotter_1.50.0   rpart_4.1-11         acepack_1.4.1     
[61] tidyselect_0.2.3    
ADD COMMENTlink modified 7 days ago by Michael Love14k • written 8 days ago by engelbrecht.stephan10
gravatar for Michael Love
7 days ago by
Michael Love14k
United States
Michael Love14k wrote:

hi Stephan,

"Is this a shortfall that can be fixed?" This is a good question. The step that estimates the parameters for the Bayesian prior on dispersions doesn't need a model matrix. So I think you could rbind() the dds together, then run estimateDispersionsFit(), then split them apart again for estimateDispersionMAP() and nbinomWaldTest(), both with custom model matrices. This way you would avoid having to use MLE estimates of dispersion here:

dispersions(dds) <- mcols(dds)$dispGeneEst   

Does that make sense? Let me know if you need more code, or if there are any hitches. But I think this approach should work. You can use Wald or LRT, depending on if you want to test individual contrasts or test the null hypothesis for multiple coefficients at once. This is how we make use of Wald and LRT in DESeq2 at least.


ADD COMMENTlink written 7 days ago by Michael Love14k

Hi Michael

Thank you for your rapid response! Unfortunately I'm unable to rbind 'dds' experiments together, since the number of samples assayed for each haplotype 'condition' varies across genes/SNPs. Rbinding SummarizedExperiments requires that objects must have the same number of samples. 

I don't think that there is any workaround for this, unless you know of anything, in which case some code would be extremely helpful.

Again, thanks for going over this!

ADD REPLYlink modified 3 days ago • written 3 days ago by engelbrecht.stephan10

Ah, then it would take more hacking. You could really run estimateDispersionsFit() on a dummy dataset, it just looks at mcols(dds)... then distribute the dispersionFunction() to the split apart dataset. But this seems like a lot of hacking, and there may be specialized software that handles this for you.

ADD REPLYlink written 3 days ago by Michael Love14k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 113 users visited in the last hour