Trouble annotating HTA microarray

giroudpaul ▴ 40

@giroudpaul-10031

France

Dear Bioconductor members,

I finally got my hands on the results of HTA 2.0 microarray experiments, and I started processing them using the standard methods.

Reading the CEL files and performing RMA poses no problems. For RMA I used oligo this way:

> data.rma = oligo::rma(data, background=TRUE, normalize=TRUE, subset=NULL, target="core")

I then tried to annotate my dataset with two different methods, but neither works.

First, using the PDInfo package:

> data.ann <- annotateEset(data.rma, pd.hta.2.0, type = "core")
Error: There appears to be a mismatch between the ExpressionSet and the annotation data.
Please ensure that the summarization level for the ExpressionSet and the 'type' argument are the same.
 See ?annotateEset for more information on the type argument.

Then, using the ChipDb package:

> data.ann <- annotateEset(data.rma, hta20sttranscriptcluster.db, columns = c("PROBEID", "ENTREZID", "SYMBOL", "ENSEMBL", "GENENAME"))
Error: cannot allocate vector of size 37.1 Gb
In addition: Warning messages:
1: In unique(.Internal(unlist(lapply(x, levels), recursive, FALSE))) :
  Reached total allocation of 8089Mb: see help(memory.size)
2: In unique(.Internal(unlist(lapply(x, levels), recursive, FALSE))) :
  Reached total allocation of 8089Mb: see help(memory.size)
3: In unique(.Internal(unlist(lapply(x, levels), recursive, FALSE))) :
  Reached total allocation of 8089Mb: see help(memory.size)
4: In unique(.Internal(unlist(lapply(x, levels), recursive, FALSE))) :
  Reached total allocation of 8089Mb: see help(memory.size)

So, I don't understand the problem with the PDInfo package, since I used the same summarization level (i.e. "core") in both commands. The second error is simply my computer being unable to process that much data. For the moment I don't have access to a bioinformatics server; I will see whether that is possible. But is there no way to annotate HTA arrays with 8 GB of RAM?

Details:

Computer: Windows 10 64-bit, i5-2410M CPU (dual core, 2.3 GHz), 8 GB RAM, using R with RStudio

Session info:

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C                   LC_TIME=French_France.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] hta20sttranscriptcluster.db_8.3.1 org.Hs.eg.db_3.3.0               
 [3] AnnotationDbi_1.34.4              NMF_0.23.3                       
 [5] cluster_2.0.4                     rngtools_1.2.4                   
 [7] pkgmaker_0.25.10                  registry_0.3                     
 [9] limma_3.28.14                     pd.hta.2.0_3.12.1                
[11] RSQLite_1.0.0                     DBI_0.4-1                        
[13] oligo_1.36.1                      Biostrings_2.40.2                
[15] XVector_0.12.0                    IRanges_2.6.1                    
[17] S4Vectors_0.10.2                  genefilter_1.54.2                
[19] affycoretools_1.44.2              BiocInstaller_1.22.3             
[21] Biobase_2.32.0                    BiocGenerics_0.18.0              
[23] ggplot2_2.1.0                     rpart_4.1-10                     
[25] Matrix_1.2-6                      lattice_0.20-33                  
[27] oligoClasses_1.34.0              

loaded via a namespace (and not attached):
  [1] colorspace_1.2-6              hwriter_1.3.2                 class_7.3-14                 
  [4] modeltools_0.2-21             mclust_5.2                    biovizBase_1.20.0            
  [7] GenomicRanges_1.24.2          dichromat_2.0-0               affyio_1.42.0                
 [10] flexmix_2.3-13                mvtnorm_1.0-5                 interactiveDisplayBase_1.10.3
 [13] codetools_0.2-14              splines_3.3.0                 R.methodsS3_1.7.1            
 [16] ggbio_1.20.1                  doParallel_1.0.10             robustbase_0.92-6            
 [19] geneplotter_1.50.0            knitr_1.13                    Formula_1.2-1                
 [22] Rsamtools_1.24.0              gridBase_0.4-7                annotate_1.50.0              
 [25] kernlab_0.9-24                GO.db_3.3.0                   R.oo_1.20.0                  
 [28] graph_1.50.0                  shiny_0.13.2                  httr_1.2.1                   
 [31] GOstats_2.38.1                acepack_1.3-3.3               htmltools_0.3.5              
 [34] tools_3.3.0                   gtable_0.2.0                  affy_1.50.0                  
 [37] Category_2.38.0               reshape2_1.4.1                affxparser_1.44.0            
 [40] Rcpp_0.12.5                   trimcluster_0.1-2             gdata_2.17.0                 
 [43] preprocessCore_1.34.0         rtracklayer_1.32.1            fpc_2.1-10                   
 [46] iterators_1.0.8               stringr_1.0.0                 mime_0.5                     
 [49] ensembldb_1.4.7               gtools_3.5.0                  XML_3.98-1.4                 
 [52] dendextend_1.2.0              DEoptimR_1.0-6                AnnotationHub_2.4.2          
 [55] edgeR_3.14.0                  MASS_7.3-45                   zlibbioc_1.18.0              
 [58] scales_0.4.0                  BSgenome_1.40.1               VariantAnnotation_1.18.5     
 [61] SummarizedExperiment_1.2.3    RBGL_1.48.1                   RColorBrewer_1.1-2           
 [64] gridExtra_2.2.1               biomaRt_2.28.0                reshape_0.8.5                
 [67] latticeExtra_0.6-28           stringi_1.1.1                 gcrma_2.44.0                 
 [70] foreach_1.4.3                 GenomicFeatures_1.24.4        caTools_1.17.1               
 [73] BiocParallel_1.6.2            chron_2.3-47                  GenomeInfoDb_1.8.3           
 [76] prabclus_2.2-6                ReportingTools_2.12.2         bitops_1.0-6                 
 [79] GenomicAlignments_1.8.4       bit_1.1-12                    GSEABase_1.34.0              
 [82] AnnotationForge_1.14.2        GGally_1.2.0                  plyr_1.8.4                   
 [85] magrittr_1.5                  DESeq2_1.12.3                 R6_2.1.2                     
 [88] gplots_3.0.1                  Hmisc_3.17-4                  whisker_0.3-2                
 [91] foreign_0.8-66                survival_2.39-5               RCurl_1.95-4.8               
 [94] nnet_7.3-12                   KernSmooth_2.23-15            OrganismDbi_1.14.1           
 [97] PFAM.db_3.3.0                 locfit_1.5-9.1                grid_3.3.0                   
[100] data.table_1.9.6              diptest_0.75-7                digest_0.6.9                 
[103] xtable_1.8-2                  ff_2.2-13                     httpuv_1.3.3                 
[106] R.utils_2.3.0                 munsell_0.4.3

hta pd.hta.2.0 hta20sttranscriptcluster.db • 1.3k views

score 2 · Accepted Answer · 2016-07-15

James W. MacDonald 65k

@james-w-macdonald-5106

United States

There is a bug in the code for annotateEset when using the pdInfo package that I will have to fix. And I will also have to put some error checking in for annotateEset when using the chipDb packages. It turns out that you can't do something like

> z <- mapIds(hta20sttranscriptcluster.db, featureNames(eset), "PROBEID","PROBEID")
Error in FUN(X[[i]], ...) : long vectors not supported yet: memory.c:1652

Which is what you are asking for when you do

data.ann <- annotateEset(data.rma, hta20sttranscriptcluster.db, columns = c("PROBEID", "ENTREZID", "SYMBOL", "ENSEMBL", "GENENAME"))

Because it's recursively calling mapIds on all the columns you listed there. Since you ALREADY get the PROBEID back by default, asking for it again is redundant and, as shown above, does not work.

For now, if you just do

data.ann <- annotateEset(data.rma, hta20sttranscriptcluster.db, columns = c("ENTREZID", "SYMBOL", "ENSEMBL", "GENENAME"))

it will work correctly, and downstream packages like limma will still show the probeset ID in the topTable results.
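As a minimal sketch of the workaround above, the key column can be stripped from a column list before it is passed to annotateEset. The helper name `drop_key_column` is hypothetical and not part of affycoretools; the annotateEset call is left commented out because it needs the data and packages from the question.

```r
# Hypothetical helper (not part of affycoretools): remove the key column
# from a requested column list before handing it to annotateEset().
drop_key_column <- function(columns, keytype = "PROBEID") {
  setdiff(columns, keytype)
}

# Column list from the original question, including the redundant PROBEID.
wanted <- c("PROBEID", "ENTREZID", "SYMBOL", "ENSEMBL", "GENENAME")
cols <- drop_key_column(wanted)
print(cols)

# With the objects and packages from the question loaded, the call would be:
# data.ann <- annotateEset(data.rma, hta20sttranscriptcluster.db, columns = cols)
```

This only guards against the redundant PROBEID request; it does not change how annotateEset handles the remaining columns.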

OK, I have fixed the bugs:

> eset <- rma(read.celfiles(list.celfiles()))
Loading required package: pd.hta.2.0
Loading required package: RSQLite
Loading required package: DBI
Platform design info loaded.
Background correcting
Normalizing
Calculating Expression
> library(affycoretools)

> eset2 <- annotateEset(eset, pd.hta.2.0)

> eset3 <- annotateEset(eset, hta20sttranscriptcluster.db)
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns

> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] hta20sttranscriptcluster.db_8.3.1 org.Hs.eg.db_3.3.0               
 [3] AnnotationDbi_1.34.4              affycoretools_1.44.3             
 [5] pd.hta.2.0_3.12.1                 RSQLite_1.0.0                    
 [7] DBI_0.4-1                         oligo_1.36.1                     
 [9] Biostrings_2.40.2                 XVector_0.12.0                   
[11] IRanges_2.6.1                     S4Vectors_0.10.2                 
[13] Biobase_2.32.0                    oligoClasses_1.34.0              
[15] BiocGenerics_0.18.0

The fix should progress through the build machines within a day or two; you are looking for affycoretools version 1.44.3.
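One way to tell whether an installation already carries the fix is to compare version strings with base R. This is a sketch only: the helper name `needs_update` is hypothetical, and the threshold 1.44.3 is taken from the reply above.

```r
# Hypothetical helper: TRUE when the given version is older than the
# release that carries the fix (1.44.3, per the reply above).
needs_update <- function(installed, fixed = "1.44.3") {
  utils::compareVersion(as.character(installed), fixed) < 0
}

# The version listed in the question's sessionInfo() predates the fix.
old_needs_update <- needs_update("1.44.2")
new_needs_update <- needs_update("1.44.3")
print(c(old = old_needs_update, new = new_needs_update))

# Against a real installation (assumes affycoretools is installed):
# needs_update(utils::packageVersion("affycoretools"))
```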