Question: Getting pheno tables from recount datasets without NA values in the characteristics and geo_accession fields and without duplicated row.names
Asked 14 days ago by Mustafa ABUELQUMSAN (France/Marseille):

Dear Bioconductor developers,

I have a fundamental question about the datasets in the recount repository. I am working on advanced statistical analyses across most of the recount datasets,

and we noticed that most of the pheno tables in recount have NA values in the characteristics and geo_accession fields.

Could anyone please tell me how to get the pheno tables for all the projects in recount without any NA values in the characteristics and geo_accession fields? I have also hit a critical obstacle with duplicated "row.names"; could anyone advise me on how to overcome that problem?

Thanks so much to anyone who can suggest a solution or offer practical guidance.

Mustafa.

Answered 11 days ago by Leonardo Collado Torres (United States):

Hi Mustafa,

Nearly 13k of the SRA samples don't have any characteristics or GEO accession numbers, as shown with the code below. There's nothing we can really do about it. Sometimes updates in SRAdb include new GEO accession numbers. Incomplete sample metadata is a problem that Shannon Ellis and others have tried to address in different ways. Check http://biorxiv.org/content/early/2017/06/03/145656, http://metasra.biostat.wisc.edu/publication.html, SHARQ beta http://www.cs.cmu.edu/~ckingsf/sharq/about.html and elsewhere.
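
If you only want to work with the annotated samples, one option is to drop the rows where either field is missing. This is a minimal sketch, assuming the characteristics column is list-like (one character vector per sample) and that a missing entry shows up as NA. keep_annotated() is a hypothetical helper, not a recount function, and the toy table only stands in for the real metadata:

```r
# Hypothetical helper (not part of recount): keep only the samples that have
# at least one non-NA characteristic and a non-NA GEO accession.
keep_annotated <- function(meta) {
    has_chars <- vapply(
        as.list(meta$characteristics),
        function(x) length(x) > 0 && !all(is.na(x)),
        logical(1)
    )
    has_geo <- !is.na(meta$geo_accession)
    meta[has_chars & has_geo, , drop = FALSE]
}

# Toy stand-in for the real table (made-up values), so the sketch runs
# without downloading anything.
toy <- data.frame(
    run = c("R1", "R2", "R3"),
    geo_accession = c("GSM1", NA, "GSM3"),
    stringsAsFactors = FALSE
)
toy$characteristics <- list(
    c("tissue: liver", "age: 40"),  # annotated sample
    NA_character_,                  # no characteristics, no GEO accession
    NA_character_                   # GEO accession but no characteristics
)

annotated <- keep_annotated(toy)
annotated$run  # only "R1" passes both checks

# With recount installed, the same helper would be applied as:
# library(recount)
# annotated <- keep_annotated(all_metadata())
```

Note that this only removes the incomplete samples; it cannot recover metadata that was never submitted.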

 

Regarding the row.names issue, if you have some reproducible code then I bet other people could help you out. And if you could highlight what step is actually failing that'd be great too. In any case, if you are combining rows, you could set the row names to be unique before combining them. 
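
As an illustration of making the row names unique before combining, here is a minimal sketch with two toy phenotype tables. The sample names and the projA/projB labels are made up; in practice you would probably use the project accession as the prefix. If you don't need to keep track of the project, base R's make.unique() is another way to disambiguate repeated names.

```r
# Two toy phenotype tables whose row names overlap ("S1" appears in both).
pheno_a <- data.frame(
    tissue = c("liver", "brain"),
    row.names = c("S1", "S2"),
    stringsAsFactors = FALSE
)
pheno_b <- data.frame(
    tissue = c("lung", "heart"),
    row.names = c("S1", "S3"),
    stringsAsFactors = FALSE
)

# Prefix each table's row names with a (hypothetical) project label so that
# they are unique across tables and still show where each sample came from.
rownames(pheno_a) <- paste("projA", rownames(pheno_a), sep = ".")
rownames(pheno_b) <- paste("projB", rownames(pheno_b), sep = ".")

combined <- rbind(pheno_a, pheno_b)

# Sanity check: anyDuplicated() returns 0 when there are no repeats.
stopifnot(anyDuplicated(rownames(combined)) == 0)
rownames(combined)
```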

 

Best,

Leonardo

 

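> library(recount)
> ## Metadata for the samples in recount
> m <- all_metadata()
> ## characteristics is list-like, so sum(is.na()) gives one NA count per sample;
> ## TRUE marks the samples with exactly one NA entry
> table(sum(is.na(m$characteristics)) == 1)
FALSE  TRUE 
37278 12821 
> ## geo_accession is a plain vector; TRUE marks the samples without a GEO accession
> table(is.na(m$geo_accession))
FALSE  TRUE 
37395 12704 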

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] recount_1.4.0              SummarizedExperiment_1.8.0 DelayedArray_0.4.0         matrixStats_0.52.2         Biobase_2.38.0            
 [6] GenomicRanges_1.30.0       GenomeInfoDb_1.14.0        IRanges_2.12.0             S4Vectors_0.16.0           BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] bitops_1.0-6             bit64_0.9-7              RColorBrewer_1.1-2       progress_1.1.2           httr_1.3.1               GenomicFiles_1.14.0     
 [7] tools_3.4.2              backports_1.1.1          doRNG_1.6.6              R6_2.2.2                 rpart_4.1-11             Hmisc_4.0-3             
[13] DBI_0.7                  lazyeval_0.2.1           colorspace_1.3-2         nnet_7.3-12              gridExtra_2.3            prettyunits_1.0.2       
[19] RMySQL_0.10.13           bit_1.1-12               compiler_3.4.2           htmlTable_1.9            derfinder_1.12.0         xml2_1.1.1              
[25] pkgmaker_0.22            rtracklayer_1.38.0       scales_0.5.0             checkmate_1.8.5          readr_1.1.1              stringr_1.2.0           
[31] digest_0.6.12            Rsamtools_1.30.0         foreign_0.8-69           rentrez_1.1.0            GEOquery_2.46.1          XVector_0.18.0          
[37] base64enc_0.1-3          pkgconfig_2.0.1          htmltools_0.3.6          BSgenome_1.46.0          htmlwidgets_0.9          rlang_0.1.2             
[43] RSQLite_2.0              bindr_0.1                jsonlite_1.5             BiocParallel_1.12.0      acepack_1.4.1            dplyr_0.7.4             
[49] VariantAnnotation_1.24.0 RCurl_1.95-4.8           magrittr_1.5             GenomeInfoDbData_0.99.1  Formula_1.2-2            Matrix_1.2-11           
[55] Rcpp_0.12.13             munsell_0.4.3            stringi_1.1.5            zlibbioc_1.24.0          qvalue_2.10.0            plyr_1.8.4              
[61] bumphunter_1.20.0        grid_3.4.2               blob_1.1.0               lattice_0.20-35          Biostrings_2.46.0        splines_3.4.2           
[67] GenomicFeatures_1.30.0   hms_0.3                  derfinderHelper_1.12.0   locfit_1.5-9.1           knitr_1.17               rngtools_1.2.4          
[73] reshape2_1.4.2           codetools_0.2-15         biomaRt_2.34.0           XML_3.98-1.9             glue_1.2.0               downloader_0.4          
[79] latticeExtra_0.6-28      data.table_1.10.4-3      foreach_1.4.3            gtable_0.2.0             purrr_0.2.4              tidyr_0.7.2             
[85] assertthat_0.2.0         ggplot2_2.2.1            xtable_1.8-2             survival_2.41-3          tibble_1.3.4             iterators_1.0.8         
[91] GenomicAlignments_1.14.0 AnnotationDbi_1.40.0     registry_0.3             memoise_1.1.0            bindrcpp_0.2             cluster_2.0.6  