Hi, I'm interested in ASE in plant hybrids, of which I have an F2 backcross. F2 hybrid progeny can have both homospecific (or intra) or heterospecific (inter) haplotype combinations. At the same SNP, I will have some individuals that are both heterospecific, and homospecific, most often in unequal proportions. For every SNP, I want to statistically compare changes in allelic ratios (ref/alt+ref) between these haplotype combos while accounting for variation within individuals.
Note that this design is fundamentally different from traditional unpaired condition-specific DE ASE analysis (control vs treatment), since the 'condition' (homo or hetero haplotype combo) is not the same for all SNPs in an individual (due to recombination, some haplotype combinations may be heterospecific, while others may be homospecific). I create a toy example of data from a single SNP so that you guys can determine if I'm on the right track.
Here's the data with counts for each indivuals's reference and alternate allele counts in successive order
hap <- c(rep("het",6),rep("hom",4)) ind <- factor(rep(1:5,each = 2)) allele <- factor(rep(c("ref","alt"),5)) counts <- c(9,19,2,4,10,11,13,16,13,20)
Create counts matrix and sample data
counts_matrix <- t(matrix(counts)) samps <- cbind.data.frame(ind,hap,allele)
Create DESeq data structure with "dummy" design
dds <- DESeqDataSetFromMatrix(counts_matrix,samps,design = ~ind + allele)
Add a within-group identifier as suggested to prevent confounding
dds$ind.n <- factor(c(rep(1:3,each = 2),rep(1:2,each = 2)))
So now our coldata looks like this:
colData(dds) DataFrame with 10 rows and 4 columns ind hap allele ind.n <factor> <factor> <factor> <integer> 1 1 het ref 1 2 1 het alt 1 3 2 het ref 2 4 2 het alt 2 5 3 het ref 3 6 3 het alt 3 7 4 hom ref 1 8 4 hom alt 1 9 5 hom ref 2 10 5 hom alt 2
Create custom design matrix and remove all-zero columns to get full rank. I'm not sure if I should add or remove the intercept. Would you guys say this design is correct to determine differences in allelic ratios between homo/het groups while controlling for uninteresting within individual variation?
mm <- model.matrix(~ 0 + hap + hap:ind.n + hap:allele, colData(dds))
all.zero <- apply(mm, MARGIN = 2, function(x) all(x==0)) #detect nonzero cols idx <- which(all.zero) #index nonzero cols mm <- mm[,-idx] #remove all-zero cols
Replace dds design with correct formula used to create custom mm
design(dds) <- ~0 + hap+ hap:ind.n + hap:allele
Set size factors to 1 (specific to ASE analysis)
sizeFactors(dds) <- rep(1,ncol(counts_matrix))
Manually run all steps of DESeq. Dispersion estimation is not possible, since I can't process multiple SNPs, genes at once. Is this a shortfoll that can be fixed?
dds <- estimateDispersionsGeneEst(dds,modelMatrix = mm) dispersions(dds) <- mcols(dds)$dispGeneEst dds <- nbinomWaldTest(dds, betaPrior = FALSE, modelMatrix = mm) #I've seen people use LRT, but I think this appropriate
These are what the results look like:
results(dds) log2 fold change (MLE): haphom.alleleref Wald test p-value: haphom.alleleref DataFrame with 1 row and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> 1 11.7 -0.469485 0.3713064 -1.264414 0.2060815 0.2060815
As described by Michael Love, I convert LFC to allelic ratios (ref/alt+ref) in the two groups, as well as the contrast between groups as follows (described previously by Michael Love in the link below):
hom_results <- results(dds, name = "haphet.alleleref") hom_ref.vs.alt <- 2^hom_results$log2FoldChange hom_ref.vs.tot <- hom_ref.vs.alt/(1+hom_ref.vs.alt) hom_p <- hom_results[,"pvalue"] het_results <- results(dds, name = "haphom.alleleref") het_ref.vs.alt <- 2^het_results$log2FoldChange het_ref.vs.tot <- het_ref.vs.alt/(1+het_ref.vs.alt) het_p <- het_results[,"pvalue"] comp_results <- results(dds,contrast = list("haphom.alleleref","haphet.alleleref")) diff <- 2^comp_results$log2FoldChange * (1/(1 + hom_ref.vs.alt)) / (1/(1 + het_ref.vs.alt)) #can you check that this is correct? diff_p <- comp_results[,"pvalue"]
Then I create a vector with calculated values: ASE in the respective groups and difference in ASE between groups. The vector will be appended to a larger data structure using an lapply over all SNPs from which I can p.adjust("fdr") for multiple testing. What do you guys think?
out <- c(homAR = hom_ref.vs.tot, hom_p = hom_p, hetAR = het_ref.vs.tot,het_p = het_p,diff = diff, diff_p = diff_p) out homAR hom_p hetAR het_p diff diff_p 0.38181824 0.08255042 0.41935489 0.20608150 1.24490400 0.67943076
Thanks a lot in advance. And thanks for the really clear manuals and answers to questions. I really has helped me a long way to using DE methods to detect ASE.
Any help, comments, suggestions, flaws, and criticism is highly appreciated!
sessionInfo() R version 3.3.3 (2017-03-06) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: OS X Yosemite 10.10.3 locale:  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages:  parallel stats4 stats graphics grDevices utils datasets methods base other attached packages:  limma_3.28.21 BiocInstaller_1.22.3 stringr_1.2.0 gridExtra_2.3 reshape2_1.4.2  ggplot2_2.2.1 DESeq2_1.12.4 SummarizedExperiment_1.2.3 Biobase_2.32.0 GenomicRanges_1.24.3  GenomeInfoDb_1.8.7 IRanges_2.6.1 S4Vectors_0.10.3 BiocGenerics_0.18.0 bindrcpp_0.2  tidyr_0.7.2 dplyr_0.7.4 loaded via a namespace (and not attached):  bit64_0.9-7 splines_3.3.3 Formula_1.2-2 assertthat_0.2.0 latticeExtra_0.6-28 blob_1.1.0  RSQLite_2.0 backports_1.1.1 lattice_0.20-35 glue_1.2.0 digest_0.6.12 RColorBrewer_1.1-2  XVector_0.12.1 checkmate_1.8.5 colorspace_1.3-2 htmltools_0.3.6 Matrix_1.2-11 plyr_1.8.4  XML_3.98-1.9 pkgconfig_2.0.1 genefilter_1.54.2 zlibbioc_1.18.0 purrr_0.2.4 xtable_1.8-2  scales_0.5.0 BiocParallel_1.6.6 htmlTable_1.9 tibble_1.3.4 annotate_1.50.1 nnet_7.3-12  lazyeval_0.2.1 survival_2.41-3 magrittr_1.5 memoise_1.1.0 foreign_0.8-69 tools_3.3.3  data.table_1.10.4-3 munsell_0.4.3 locfit_1.5-9.1 cluster_2.0.6 AnnotationDbi_1.34.4 rlang_0.1.4  grid_3.3.3 RCurl_1.95-4.8 htmlwidgets_0.9 bitops_1.0-6 base64enc_0.1-3 labeling_0.3  gtable_0.2.0 DBI_0.7 R6_2.2.2 knitr_1.17 bit_1.1-12 bindr_0.1  Hmisc_4.0-3 stringi_1.1.5 Rcpp_0.12.13 geneplotter_1.50.0 rpart_4.1-11 acepack_1.4.1  tidyselect_0.2.3