Question

DESeq2 changing Rownames (Gene accession numbers)

0

Entering edit mode

JGibbons1 • 0

@jgibbons1-22546

Last seen 4.4 years ago

I am running DESeq2 to find associations between microbiome abundance and gene expression. The results include accession numbers that look valid but do not appear in the reference that was used when calculating the gene counts and according to the ensembl ID converter were never valid ensembl IDs implying the problem is not that I used an old reference.

This is the code that is running DEseq2 and consolidating the results

 ##Loop through the microbiome abundance data 
##to modify create colData for each microbiome group
  for(i in seq(1,nrow(m_microbiome))){
    microbiome_group<-m_microbiome[i,]
    df_meta$MicrobiomeGroup<-as.vector(t(microbiome_group))

    dds<-DESeqDataSetFromMatrix(countData = m_expression,
                                colData = df_meta,
                                design= ~Timepoint+MicrobiomeGroup)
    dds<-DESeq(dds,
               parallel=TRUE,BPPARAM=MulticoreParam(4))
    df_results<-results(dds)
    df_results$MicrobiomeGroup<-rownames(m_microbiome)[[i]]
    l_results[[i]]<-as.data.frame(df_results)

  }

df_combined<-do.call("rbind",l_results)
df_combined$FDR<-p.adjust(df_combined$pvalue)
 write.csv(df_combined,outfile,row.names = T)

Here is an example of a gene that is not in the input but is in the output:

> print(m_expression[which(rownames(m_expression)=="gene:ENSG000001628921"),])
     Early1 Early2 Early4 Late1 Late2 Late3 Late4
> print(df_combined[which(rownames(df_combined)=="gene:ENSG000001628921"),])
                      baseMean log2FoldChange     lfcSE      stat       pvalue         padj MicrobiomeGroup          FDR
gene:ENSG000001628921 436.3101      -2.112225 0.3147359 -6.711102 1.931604e-11 2.348638e-07      Firmicutes 9.773919e-07

Here is the session info:

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.25                   microbiome_1.8.0             ggplot2_3.2.1                phyloseq_1.29.0             
 [5] DESeq2_1.26.0                SummarizedExperiment_1.15.10 DelayedArray_0.11.8          BiocParallel_1.19.5         
 [9] matrixStats_0.55.0           Biobase_2.45.1               GenomicRanges_1.37.17        GenomeInfoDb_1.21.2         
[13] IRanges_2.19.18              S4Vectors_0.23.25            BiocGenerics_0.31.6         

loaded via a namespace (and not attached):
 [1] nlme_3.1-141           bitops_1.0-6           bit64_0.9-7            RColorBrewer_1.1-2     tools_3.6.1           
 [6] backports_1.1.5        R6_2.4.0               vegan_2.5-6            rpart_4.1-15           mgcv_1.8-29           
[11] Hmisc_4.2-0            DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1       permute_0.9-5         
[16] ade4_1.7-13            nnet_7.3-12            withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3         
[21] bit_1.1-14             compiler_3.6.1         htmlTable_1.13.2       scales_1.0.0           checkmate_1.9.4       
[26] genefilter_1.68.0      stringr_1.4.0          digest_0.6.21          foreign_0.8-72         XVector_0.25.0        
[31] base64enc_0.1-3        pkgconfig_2.0.3        htmltools_0.4.0        htmlwidgets_1.5.1      rlang_0.4.1           
[36] rstudioapi_0.10        RSQLite_2.1.2          jsonlite_1.6           acepack_1.4.1          dplyr_0.8.3           
[41] RCurl_1.95-4.12        magrittr_1.5           GenomeInfoDbData_1.2.2 Formula_1.2-3          biomformat_1.13.0     
[46] Matrix_1.2-17          Rcpp_1.0.2             munsell_0.5.0          Rhdf5lib_1.7.6         ape_5.3               
[51] lifecycle_0.1.0        stringi_1.4.3          MASS_7.3-51.4          zlibbioc_1.31.0        Rtsne_0.15            
[56] rhdf5_2.29.6           plyr_1.8.4             grid_3.6.1             blob_1.2.0             crayon_1.3.4          
[61] lattice_0.20-38        Biostrings_2.53.2      splines_3.6.1          multtest_2.41.0        annotate_1.64.0       
[66] locfit_1.5-9.1         zeallot_0.1.0          pillar_1.4.2           igraph_1.2.4.1         geneplotter_1.64.0    
[71] reshape2_1.4.3         codetools_0.2-16       XML_3.98-1.20          glue_1.3.1             latticeExtra_0.6-28   
[76] data.table_1.12.6      vctrs_0.2.0            foreach_1.4.7          tidyr_1.0.0            gtable_0.3.0          
[81] purrr_0.3.3            assertthat_0.2.1       xfun_0.10              xtable_1.8-4           survival_2.44-1.1     
[86] tibble_2.1.3           iterators_1.0.12       AnnotationDbi_1.47.1   memoise_1.1.0          cluster_2.1.0

Any help explaining and fixing this behavior will be much appreciated.

Thank you

deseq2 • 1.1k views

ADD COMMENT • link 4.4 years ago JGibbons1 • 0

score 0 · Answer 1 · 2019-12-11

You've got a lot of complicated code here, and I'll bet there's an error somewhere in making the list of results tables and rbind-ing.

I promise that DESeq2 outputs the exact same rows as are provided. No rows are dropped from the dataset or results table. This code that undergirds the object input and output hasn't changed in the past 7 years, so it's pretty well tested.

Why don't you step through the objects inside the loop and see where things are missing?

score 0 · Answer 2 · 2019-12-11

The problem was not with any of the DESeq2 functions. The incorrect accession numbers were introduced during the rbind step.

rbind requires all of the rownames to be unique and if they are not it just automatically changes them (see here) .

To solve the problem I created a gene id column for each of the dataframes that was created:

for(i in seq(1,nrow(m_microbiome))){
    microbiome_group<-m_microbiome[i,]
    df_meta$MicrobiomeGroup<-as.vector(t(microbiome_group))
    #According to DESeq2 documentation: Input formula should end with the variable of interest
    #p-value is for the last variable of the design formula (MicrobiomeGroup)
    #print(df_meta)
    dds<-DESeqDataSetFromMatrix(countData = m_expression,
                                colData = df_meta,
                                design= ~Timepoint+MicrobiomeGroup)
    dds<-DESeq(dds,
               parallel=TRUE,BPPARAM=MulticoreParam(4))
    df_results<-results(dds)
    df_results<-as.data.frame(df_results)
    df_results$GeneID<-rownames(df_results) ##This is the new code
    df_results$MicrobiomeGroup<-rownames(m_microbiome)[[i]]
    l_results[[i]]<-df_results

  }
 df_combined<-do.call("rbind",l_results)
 df_combined$FDR<-p.adjust(df_combined$pvalue)
 column_order<-c("GeneID","MicrobiomeGroup","baseMean","log2FoldChange",    "lfcSE",
                 "stat",    "pvalue",   "padj","FDR")
 df_combined<-df_combined[,column_order]
 write.csv(df_combined,outfile,row.names = F) #No longer need to write out column names