Question

DEXSeq: controlling for cell composition

0

Entering edit mode

skinnider • 0

@skinnider-18958

Last seen 5.4 years ago

I am using DEXSeq to perform an analysis of alternative splicing in the spinal cord parenchyma 1 week after spinal cord injury. Following the “standard analysis workflow” given in the DEXSeq vignette returns about 1,000 alternatively spliced exons at 10% FDR. However, at 1 week after SCI it is possible that the cellular composition of the spinal cord has already undergone some changes: for example, death of neurons. I would like to identify which alternative splicing events are independent of changes in cell type composition, both because these are more likely to be of biological interest, and because section 4 of the vignette (“Additional technical or experimental variables”) suggests this may actually increase statistical power.

To estimate cell type composition, I followed the approach of Swarup et al., 2018, who took the top 100 cell type-specific marker genes for each of five CNS cell types based on RNA-seq data from purified cells described by Zhang et al., 2014, then included the first principal component of each set of marker genes in each sample as a covariate in their regression model. This gives me a data frame of sample annotations in the following format, where columns 3-7 represent the ‘eigengene’ of cell type-specific marker gene expression:

              sample injury astrocytes endothelial_cells microglia oligodendrocytes   neurons
1   1-200-SpinalCord    200  0.2559934         0.2626620 0.2624809        0.2608435 0.2522872
2   2-200-SpinalCord    200  0.2588842         0.2658519 0.2643757        0.2731271 0.2536084
3   3-200-SpinalCord    200  0.2580737         0.2655819 0.2648090        0.2642733 0.2523708
4   4-200-SpinalCord    200  0.2517741         0.2627561 0.2550055        0.2559144 0.2504905
5   5-200-SpinalCord    200  0.2515403         0.2540218 0.2520716        0.2497159 0.2553896
6   6-100-SpinalCord    100  0.2547717         0.2560722 0.2536112        0.2509286 0.2557450
7   7-100-SpinalCord    100  0.2641999         0.2636526 0.2692555        0.2708577 0.2628503
8   8-100-SpinalCord    100  0.2633527         0.2648022 0.2626432        0.2663232 0.2608956
9   9-100-SpinalCord    100  0.2460898         0.2424264 0.2417473        0.2336518 0.2558509
10 10-100-SpinalCord    100  0.2617045         0.2628493 0.2665769        0.2633762 0.2603489
11   11-0-SpinalCord      0  0.2664507         0.2605212 0.2614453        0.2675547 0.2635772
12   12-0-SpinalCord      0  0.2597377         0.2514553 0.2519198        0.2500942 0.2625129
13   13-0-SpinalCord      0  0.2633110         0.2545084 0.2588167        0.2586880 0.2645590
14   14-0-SpinalCord      0  0.2543777         0.2492514 0.2517410        0.2471603 0.2583947
15   15-0-SpinalCord      0  0.2618528         0.2552682 0.2550236        0.2575152 0.2634858

Here injury reflects the force delivered in kDyn and is modeled as a continuous covariate.

In order to identify AS events that are independent of cell type composition, I am trying to follow the code in section 4 of the vignette by including the ‘eigengenes’ of each cell type as a blocking factor, as follows:

formula_reduced = ~ sample + exon + injury:exon
formula_full = ~ sample + exon + injury:exon + 
  astrocytes:exon + endothelial_cells:exon + microglia:exon + 
  oligodendrocytes:exon + neurons:exon

# create dataset
dxd = DEXSeqDataSet(input, sampleData = targets, 
                    design = formula_full,
                    featureID = exon_ids, groupID = gene_ids)

# estimate size factors
dxd = estimateSizeFactors(dxd)
# estimate dispersion
dxd = estimateDispersions(dxd)

# test for AS
dxd = testForDEU(dxd,
                 reducedModel = formula_reduced, 
                 fullModel = formula_full)

However, this code is giving me the following error:

Error in solve.default(xtwx + ridge) : 
  system is computationally singular: reciprocal condition number = 3.9289e-19
Calls: estimateDispersions ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>

I tried removing a single cell type (endothelial cells) from the full formula, but still got the same error, and I am a bit reluctant to drop any more cell types from the analysis.

Could this be solved by using a lower tolerance? And, if so, is there a way to set this from estimateDispersions?

Thanks for your help.

sessionInfo():

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] aggregation_1.0.1           magrittr_1.5                forcats_0.3.0               stringr_1.3.1              
 [5] dplyr_0.7.8                 purrr_0.2.5                 readr_1.1.1                 tidyr_0.8.2                
 [9] tibble_1.4.2                ggplot2_3.1.0               tidyverse_1.2.1             DEXSeq_1.26.0              
[13] RColorBrewer_1.1-2          AnnotationDbi_1.42.1        DESeq2_1.20.0               SummarizedExperiment_1.10.1
[17] DelayedArray_0.8.0          BiocParallel_1.14.2         matrixStats_0.54.0          Biobase_2.40.0             
[21] GenomicRanges_1.34.0        GenomeInfoDb_1.16.0         IRanges_2.16.0              S4Vectors_0.20.0           
[25] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] nlme_3.1-137           bitops_1.0-6           lubridate_1.7.4        bit64_0.9-7            progress_1.2.0        
 [6] httr_1.3.1             tools_3.5.1            backports_1.1.3        R6_2.3.0               rpart_4.1-13          
[11] Hmisc_4.1-1            DBI_1.0.0              lazyeval_0.2.1         colorspace_1.3-2       nnet_7.3-12           
[16] withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3          prettyunits_1.0.2      bit_1.1-14            
[21] compiler_3.5.1         cli_1.0.1              rvest_0.3.2            htmlTable_1.13         xml2_1.2.0            
[26] scales_1.0.0           checkmate_1.8.5        genefilter_1.62.0      digest_0.6.18          Rsamtools_1.32.3      
[31] foreign_0.8-71         XVector_0.20.0         base64enc_0.1-3        pkgconfig_2.0.2        htmltools_0.3.6       
[36] readxl_1.1.0           htmlwidgets_1.3        rlang_0.3.0.1          rstudioapi_0.8         RSQLite_2.1.1         
[41] bindr_0.1.1            jsonlite_1.6           hwriter_1.3.2          acepack_1.4.1          RCurl_1.95-4.11       
[46] GenomeInfoDbData_1.1.0 Formula_1.2-3          Matrix_1.2-14          Rcpp_1.0.0             munsell_0.5.0         
[51] stringi_1.2.4          yaml_2.2.0             zlibbioc_1.26.0        plyr_1.8.4             grid_3.5.1            
[56] blob_1.1.1             crayon_1.3.4           lattice_0.20-35        Biostrings_2.48.0      haven_1.1.2           
[61] splines_3.5.1          annotate_1.58.0        hms_0.4.2              locfit_1.5-9.1         knitr_1.21            
[66] pillar_1.3.1           geneplotter_1.58.0     biomaRt_2.36.1         XML_3.98-1.16          glue_1.3.0            
[71] latticeExtra_0.6-28    modelr_0.1.2           data.table_1.11.8      cellranger_1.1.0       gtable_0.2.0          
[76] assertthat_0.2.0       xfun_0.4               xtable_1.8-3           broom_0.5.0            survival_2.42-6       
[81] memoise_1.1.0          bindrcpp_0.2.2         cluster_2.0.7-1        statmod_1.4.30

dexseq • 744 views

ADD COMMENT • link updated 5.2 years ago by Alejandro Reyes ★ 1.9k • written 5.3 years ago by skinnider • 0

score 0 · Answer 1 · 2019-01-27

This error results when the predictors of a regression are linearly dependent (or in your case, highly correlated). This will be a problem for any regression model. However, the cell-type compositions based on your estimates are almost identical across all samples. My suggestion would be to run the analysis using, individually, each of the cell-type composition as covariates and check whether the results change substantially.