AnnotationHub: An RPKM data.frame from the Epigenomics RoadMap Project looks wrong
wcstcyx ▴ 30
@wcstcyx-11636
Last seen 6.2 years ago
China/Beijing/AMSS,CAS

Dear All,

I found an RPKM data.frame that looks wrong. It is obtained from AnnotationHub, and its source is the Epigenomics RoadMap Project. The code below reproduces the problem.

library("AnnotationHub")
ah <- AnnotationHub()
epiFiles <- query(ah, "EpigenomeRoadMap")
dfs <- subset(epiFiles, rdataclass == "data.frame")
# View(data.frame(dfs$title, dfs$description, dfs$sourceurl))
rpkm <- dfs[[8]]
# View(rpkm) # the column names look wrong, and the last column is all NA
# download the source file myself
url <- dfs$sourceurl[8]
filename <-  basename(url)
download.file(url, destfile=filename)
if (file.exists(filename))
  myrpkm <- read.table(filename, header = TRUE, row.names = 1)
# View(myrpkm) # this looks right
# See the EXPRESSION QUANTIFICATION section of
# http://egg2.wustl.edu/roadmap/data/byDataType/rna/README

My sessionInfo() output is:

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] parallel  stats    
[3] graphics  grDevices
[5] utils     datasets 
[7] methods   base     

other attached packages:
[1] AnnotationHub_2.5.12
[2] BiocGenerics_0.19.2 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7                  
 [2] IRanges_2.7.17               
 [3] digest_0.6.10                
 [4] mime_0.5                     
 [5] R6_2.2.0                     
 [6] xtable_1.8-2                 
 [7] DBI_0.5-1                    
 [8] stats4_3.3.1                 
 [9] RSQLite_1.0.0                
[10] BiocInstaller_1.23.9         
[11] httr_1.2.1                   
[12] curl_2.1                     
[13] S4Vectors_0.11.18            
[14] tools_3.3.1                  
[15] Biobase_2.33.4               
[16] shiny_0.14.1                 
[17] httpuv_1.3.3                 
[18] AnnotationDbi_1.35.4         
[19] htmltools_0.3.5              
[20] interactiveDisplayBase_1.11.3

After reading the EXPRESSION QUANTIFICATION section of http://egg2.wustl.edu/roadmap/data/byDataType/rna/README, I think the first column should be the gene ID and the first numeric column should be the expression values of sample E000. That is why I load the file with read.table(filename, header = TRUE, row.names = 1).
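A minimal sketch of what row.names = 1 changes, using a tiny made-up table rather than the real Roadmap file. The gene IDs, the values, and the helper all_na_cols() are assumptions for illustration only; the sketch does not reproduce how AnnotationHub itself reads the file.

```r
# Toy stand-in for an expression table (made-up values, not Roadmap data).
toy <- "gene_id E000 E003
ENSG_A 1.5 0.0
ENSG_B 0.2 3.1
"

# Default read: gene_id stays an ordinary column and the row names
# are the automatic 1, 2, ...
plain <- read.table(text = toy, header = TRUE, stringsAsFactors = FALSE)

# row.names = 1: the first column becomes the row names, so every
# remaining column is a numeric sample column.
keyed <- read.table(text = toy, header = TRUE, row.names = 1)

# Hypothetical helper: names of the columns that contain only NA values.
all_na_cols <- function(df) {
  names(df)[vapply(df, function(x) all(is.na(x)), logical(1))]
}

rownames(keyed)     # gene IDs
colnames(keyed)     # sample IDs only
all_na_cols(keyed)  # empty when no column is entirely NA
```

Running all_na_cols() on a data.frame retrieved from AnnotationHub would be one quick way to spot a column that was filled with NA during import.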

I found that more than one data.frame has this problem. I hope this kind of data can be reloaded correctly by AnnotationHub.

Thanks in advance,
Can Wang

annotationhub roadmap project rpkm data.frame • 1.6k views
@valerie-obenchain-4275
Last seen 2.3 years ago
United States

Hi Can,

Thanks for reporting this bug. As you described, the problem was in how the data were read in: the gene_id column was not being used as the row names. This has been fixed in AnnotationHub 2.5.13 (devel) and 2.4.3 (release). Both should be available via biocLite() on Thursday, Oct. 13, after noon PST, or from svn immediately.

Valerie
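
For readers who want to confirm they have a fixed version, here is a small sketch that compares a version string against the two versions named above. The helper has_fix() is hypothetical and not part of AnnotationHub; it assumes that 2.5.x is the devel series and that anything older belongs to the release series.

```r
# Versions that contain the fix, as stated in the answer above.
fixed_devel <- package_version("2.5.13")
fixed_release <- package_version("2.4.3")

# Hypothetical helper: TRUE when version v is at least the fixed
# version of its series (2.5.0 and later is treated as devel).
has_fix <- function(v) {
  v <- as.package_version(v)
  if (v >= package_version("2.5.0")) {
    v >= fixed_devel
  } else {
    v >= fixed_release
  }
}

has_fix("2.5.12")  # the version in the sessionInfo() above
has_fix("2.4.3")
```

With the package installed, has_fix(packageVersion("AnnotationHub")) applies the same check to the local installation.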


Thank you! I have checked that. It's OK now.

Can
