Question

Getting pheno tables from recount datasets without any NA values in characteristics and geo_accession fields and also without duplicated row.names

Mustafa ABUELQUMSAN • 0

@mustafa-abuelqumsan-12460

France/Marseille

Dear Bioconductor developers,

I have a basic question about the datasets in the recount repository. I am running statistical analyses on most of the recount datasets, and we noticed that most of the pheno tables in recount have NA values in the characteristics and geo_accession fields.

Could anyone please help me get the pheno tables for all the projects in recount without any NA values in the characteristics and geo_accession fields? I have also hit a serious obstacle with duplicated “row.names”. Could anyone show me how to overcome this problem?

Thanks so much to anyone who can suggest a solution or give me a practical guide.

Mustafa.

recount summarizedexperiment • 1.2k views

Updated 6.4 years ago by Leonardo Collado Torres • written 6.4 years ago by Mustafa ABUELQUMSAN

score 0 · Answer 1 · 2017-11-13

Hi Mustafa,

Nearly 13k of the SRA samples have no characteristics or GEO accession numbers, as the code below shows. There's nothing we can really do about it, although updates to SRAdb sometimes add new GEO accession numbers. Incomplete sample metadata is a problem that Shannon Ellis and others have tried to address in different ways. See http://biorxiv.org/content/early/2017/06/03/145656, http://metasra.biostat.wisc.edu/publication.html and SHARQ beta http://www.cs.cmu.edu/~ckingsf/sharq/about.html, among others.
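If the goal is simply a pheno table restricted to the samples that do carry both fields, filtering is one option. The following is a minimal sketch on a toy data.frame, not on real recount output: the column names mirror the recount metadata, the values are invented, and characteristics is assumed to be a list column with one character vector per sample. A real recount pheno table may store that field differently, so the check might need adapting.

```r
# Toy pheno table; all values are invented for illustration.
pheno <- data.frame(
  run = c("SRR1", "SRR2", "SRR3", "SRR4"),
  geo_accession = c("GSM1", NA, "GSM3", NA),
  stringsAsFactors = FALSE
)

# Assumption: one character vector of characteristics per sample (list column).
pheno$characteristics <- list(
  c("tissue: liver", "age: 40"),
  NA_character_,
  "tissue: brain",
  NA_character_
)

# A sample counts as annotated when at least one characteristic is not NA.
has_chars <- vapply(
  pheno$characteristics,
  function(x) any(!is.na(x)),
  logical(1)
)
has_geo <- !is.na(pheno$geo_accession)

# Keep only samples that have both fields filled in.
pheno_complete <- pheno[has_chars & has_geo, , drop = FALSE]
pheno_complete$run
```

Note that this drops samples rather than recovering their metadata, so the filtered table can be much smaller than the original.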

Regarding the row.names issue, if you post some reproducible code, I bet other people could help you out. It would also help if you could highlight which step is actually failing. In any case, if you are combining rows, you could set the row names to be unique before combining them.
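As a rough illustration of making row names unique, here is a minimal sketch using base R's make.unique() on a toy table. The sample IDs and project labels are invented, and the actual failing step in your code may differ from the one assumed here.

```r
# Toy sample IDs where "SRR1" appears in two projects; values are invented.
ids <- c("SRR1", "SRR2", "SRR1", "SRR3")
pheno <- data.frame(
  project = c("P1", "P1", "P2", "P2"),
  stringsAsFactors = FALSE
)

# Assigning `ids` directly as row names would fail because of the duplicate.
# make.unique() appends a numeric suffix to repeated values.
rownames(pheno) <- make.unique(ids)
rownames(pheno)

# Alternative that keeps the provenance visible: prefix each ID with its project.
prefixed <- paste(pheno$project, ids, sep = "_")
```

The suffix approach is the quickest, while the project prefix makes it easier to trace each row back to its source table.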

Best,

Leonardo

> library(recount)
> m <- all_metadata()
> table(sum(is.na(m$characteristics)) == 1)
FALSE  TRUE 
37278 12821 
> table(is.na(m$geo_accession))
FALSE  TRUE 
37395 12704 

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] recount_1.4.0              SummarizedExperiment_1.8.0 DelayedArray_0.4.0         matrixStats_0.52.2         Biobase_2.38.0            
 [6] GenomicRanges_1.30.0       GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0           BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] bitops_1.0-6             bit64_0.9-7              RColorBrewer_1.1-2       progress_1.1.2           httr_1.3.1               GenomicFiles_1.14.0     
 [7] tools_3.4.2              backports_1.1.1          doRNG_1.6.6              R6_2.2.2                 rpart_4.1-11             Hmisc_4.0-3             
[13] DBI_0.7                  lazyeval_0.2.1           colorspace_1.3-2         nnet_7.3-12              gridExtra_2.3            prettyunits_1.0.2       
[19] RMySQL_0.10.13           bit_1.1-12               compiler_3.4.2           htmlTable_1.9            derfinder_1.12.0         xml2_1.1.1              
[25] pkgmaker_0.22            rtracklayer_1.38.0       scales_0.5.0             checkmate_1.8.5          readr_1.1.1              stringr_1.2.0           
[31] digest_0.6.12            Rsamtools_1.30.0         foreign_0.8-69           rentrez_1.1.0            GEOquery_2.46.1          XVector_0.18.0          
[37] base64enc_0.1-3          pkgconfig_2.0.1          htmltools_0.3.6          BSgenome_1.46.0          htmlwidgets_0.9          rlang_0.1.2             
[43] RSQLite_2.0              bindr_0.1                jsonlite_1.5             BiocParallel_1.12.0      acepack_1.4.1            dplyr_0.7.4             
[49] VariantAnnotation_1.24.0 RCurl_1.95-4.8           magrittr_1.5             GenomeInfoDbData_0.99.1  Formula_1.2-2            Matrix_1.2-11           
[55] Rcpp_0.12.13             munsell_0.4.3            stringi_1.1.5            zlibbioc_1.24.0          qvalue_2.10.0            plyr_1.8.4              
[61] bumphunter_1.20.0        grid_3.4.2               blob_1.1.0               lattice_0.20-35          Biostrings_2.46.0        splines_3.4.2           
[67] GenomicFeatures_1.30.0   hms_0.3                  derfinderHelper_1.12.0   locfit_1.5-9.1           knitr_1.17               rngtools_1.2.4          
[73] reshape2_1.4.2           codetools_0.2-15         biomaRt_2.34.0           XML_3.98-1.9             glue_1.2.0               downloader_0.4          
[79] latticeExtra_0.6-28      data.table_1.10.4-3      foreach_1.4.3            gtable_0.2.0             purrr_0.2.4              tidyr_0.7.2             
[85] assertthat_0.2.0         ggplot2_2.2.1            xtable_1.8-2             survival_2.41-3          tibble_1.3.4             iterators_1.0.8         
[91] GenomicAlignments_1.14.0 AnnotationDbi_1.40.0     registry_0.3             memoise_1.1.0            bindrcpp_0.2             cluster_2.0.6