Question

BrainArray Custom CDFs and oligo

0

Entering edit mode

lcordeiro • 0

@lcordeiro-12338

Last seen 16 months ago

United States

I've started using BrainArray's custom CDFs with oligo and noticed something I cannot explain. I looked at some Q&As -- such as this one: How to use brainarray custom cdf with oligo package? -- but couldn't figure out an answer.

When I use CDFs with ENTREZ or ENSEMBL annotation I noticed read.celfiles collapses probesets/transcripts/exon rows into each unique ENTREZ or ENSEMBL code. Using a generic GEO dataseries, for example, the following code (assuming all necessary libraries are loaded; see sessioInfo() below)

# Change dir
base.dir = "~/Documents/IntegrationTest/"
setwd(base.dir)

# Reading data
data.dir = paste(base.dir,"GSE474",sep="")
list1 = list.celfiles(data.dir, full.name=TRUE)
data96a = read.celfiles(list1)
data96b = read.celfiles(list1, pkgname = "pd.hgu133a.hs.ensg")

produces

> data96a  # ExpressionFeatureSet
ExpressionFeatureSet (storageMode: lockedEnvironment)
assayData: 506944 features, 24 samples 
  element names: exprs 
protocolData
  rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hg.u133a 
> data96b  # GenericFeatureSet
GenericFeatureSet (storageMode: lockedEnvironment)
assayData: 506944 features, 24 samples 
  element names: exprs 
protocolData
  rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hgu133a.hs.ensg

The first difference can be seen right here: without any custom CDF, read.celfiles generates an ExpressionFeatureSet, while with the BrainArray CDF it gives a GenericFeatureSet. How could I get an ExpressionFeatureSet with a custom CDF?

Moving on with

eset96a = oligo::rma(data96a)
eset96b = oligo::rma(data96b)

one gets

> eset96a
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22283 features, 24 samples 
  element names: exprs 
protocolData
  rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hg.u133a 
> eset96b
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11959 features, 24 samples 
  element names: exprs 
protocolData
  rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: GSM3979.CEL GSM3980.CEL ... GSM4002.CEL (24 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hgu133a.hs.ensg

It is clear that oligo is collapsing rows into ENSEMBL gene codes and I don't really mind that it is doing it. I just need to understand HOW it is doing that. Is it taking the average, median, max of all rows/transcripts/exons related to any single gene code?

Any help will be greatly appreciated.

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pd.hgu133a.hs.ensg_0.0.1             hugene11sthsentrezg.db_21.0.0        hugene10sthsentrezg.db_21.0.0       
 [4] huex10sthsentrezg.db_21.0.0          hgu133a2hsentrezg.db_21.0.0          hgu133plus2hsentrezg.db_21.0.0      
 [7] hgu133ahsentrezg.db_21.0.0           pd.hugene11st.hs.entrezg_0.0.1       pd.hugene10st.hs.entrezg_0.0.1      
[10] pd.huex10st.hs.entrezg_0.0.1         pd.hgu133a2.hs.entrezg_0.0.1         pd.hgu133plus2.hs.entrezg_0.0.1     
[13] pd.hgu133a.hs.entrezg_0.0.1          hugene11sttranscriptcluster.db_8.5.0 hugene11stprobeset.db_8.5.0         
[16] hugene10sttranscriptcluster.db_8.5.0 hugene10stprobeset.db_8.5.0          huex10sttranscriptcluster.db_8.5.0  
[19] huex10stprobeset.db_8.5.0            hgu133a2.db_3.2.3                    hgu133plus2.db_3.2.3                
[22] hgu133a.db_3.2.3                     org.Hs.eg.db_3.4.0                   AnnotationDbi_1.36.2                
[25] pd.hugene.1.1.st.v1_3.14.1           pd.hugene.1.0.st.v1_3.14.1           pd.huex.1.0.st.v2_3.14.1            
[28] pd.hg.u133a.2_3.12.0                 pd.hg.u133.plus.2_3.12.0             pd.hg.u133a_3.12.0                  
[31] affycoretools_1.46.4                 CustomCDF_1.0.3                      DBI_0.5-1                           
[34] RSQLite_1.1-2                        frma_1.26.0                          oligo_1.38.0                        
[37] Biostrings_2.42.1                    XVector_0.14.0                       IRanges_2.8.1                       
[40] S4Vectors_0.12.1                     oligoClasses_1.36.0                  affy_1.52.0                         
[43] Biobase_2.34.0                       BiocGenerics_0.20.0                 

loaded via a namespace (and not attached):
  [1] backports_1.0.5               GOstats_2.40.0                Hmisc_4.0-2                   AnnotationHub_2.6.4          
  [5] plyr_1.8.4                    lazyeval_0.2.0                GSEABase_1.36.0               splines_3.3.2                
  [9] BiocParallel_1.8.1            GenomeInfoDb_1.10.3           ggplot2_2.2.1                 digest_0.6.12                
 [13] foreach_1.4.3                 BiocInstaller_1.24.0          ensembldb_1.6.2               htmltools_0.3.5              
 [17] GO.db_3.4.0                   gdata_2.17.0                  magrittr_1.5                  checkmate_1.8.2              
 [21] memoise_1.0.0                 BSgenome_1.42.0               cluster_2.0.5                 gcrma_2.46.0                 
 [25] limma_3.30.11                 annotate_1.52.1               R.utils_2.5.0                 ggbio_1.22.3                 
 [29] colorspace_1.3-2              RCurl_1.95-4.8                graph_1.52.0                  genefilter_1.56.0            
 [33] survival_2.40-1               VariantAnnotation_1.20.2      iterators_1.0.8               gtable_0.2.0                 
 [37] zlibbioc_1.20.0               scales_0.4.1                  GGally_1.3.0                  edgeR_3.16.5                 
 [41] Rcpp_0.12.9                   xtable_1.8-2                  htmlTable_1.9                 foreign_0.8-67               
 [45] bit_1.1-12                    OrganismDbi_1.16.0            preprocessCore_1.36.0         Formula_1.2-1                
 [49] AnnotationForge_1.16.0        htmlwidgets_0.8               httr_1.2.1                    gplots_3.0.1                 
 [53] RColorBrewer_1.1-2            acepack_1.4.1                 ff_2.2-13                     reshape_0.8.6                
 [57] XML_3.98-1.5                  R.methodsS3_1.7.1             nnet_7.3-12                   locfit_1.5-9.1               
 [61] reshape2_1.4.2                munsell_0.4.3                 tools_3.3.2                   stringr_1.1.0                
 [65] yaml_2.1.14                   knitr_1.15.1                  caTools_1.17.1                RBGL_1.50.0                  
 [69] mime_0.5                      R.oo_1.21.0                   biomaRt_2.30.0                interactiveDisplayBase_1.12.0
 [73] affyio_1.44.0                 PFAM.db_3.4.0                 tibble_1.2                    geneplotter_1.52.0           
 [77] stringi_1.1.2                 GenomicFeatures_1.26.2        lattice_0.20-34               Matrix_1.2-8                 
 [81] data.table_1.10.4             bitops_1.0-6                  httpuv_1.3.3                  rtracklayer_1.34.1           
 [85] GenomicRanges_1.26.2          R6_2.2.0                      latticeExtra_0.6-28           hwriter_1.3.2                
 [89] KernSmooth_2.23-15            gridExtra_2.2.1               affxparser_1.46.0             codetools_0.2-15             
 [93] dichromat_2.0-0               MASS_7.3-45                   gtools_3.5.0                  assertthat_0.1               
 [97] SummarizedExperiment_1.4.0    DESeq2_1.14.1                 Category_2.40.0               ReportingTools_2.14.0        
[101] GenomicAlignments_1.10.0      Rsamtools_1.26.1              grid_3.3.2                    rpart_4.1-10                 
[105] biovizBase_1.22.0             shiny_1.0.0                   base64enc_0.1-3

oligo brainarray rma customcdf • 1.6k views

ADD COMMENT • link updated 7.2 years ago by James W. MacDonald 65k • written 7.2 years ago by lcordeiro • 0

score 2 · Accepted Answer · 2017-02-12

You cannot get an ExpressionFeatureSet from an MBNI remapped pdInfoPackage. Those were built using new functionality that Benilton introduced in oligo before the last release, to allow people to use arbitrary (or Generic, if you will) array definitions. Since the MBNI group uses the Generic array definition pipeline to generate their pdInfoPackages, you end up with a GenericFeatureSet. This is as it is intended.

HOW it is summarizing at the 'collapsed' level is pretty prosaic. For an Affy array, there are any number of 25-mers that are intended, in aggregate, to measure the expression level of a transcript (or in the case of the Gene ST arrays, you can also summarize at roughly the exon level as well). For a given gene with say three exons, there might be 12 x 25-mer probes that are intended to measure that gene's transcript. You could summarize as three individual probesets, that measure the expression level of each exon, or you could summarize all 12 together to get an overall measure of the gene's expression.

In addition, there might be one or more other probesets that are intended to measure that same gene, in which case you end up with two gene expression values, and which one is the 'right' one?

What the folks at MBNI do is to take all the 25-mers on the array and blast them against the current genome to see where they align. If they align uniquely to a single position, they keep them, otherwise they throw them out. They then look at the genomic regions for each gene (based on Entrez Gene or Ensembl mappings) and then aggregate all the probes that overlap a given gene into a new probeset, based on their mapping. So instead of having two different probesets for a given gene, you only have one, with however many probes mapped to that gene. Then when you summarize, you just use all the probes that mapped to a given gene.

If you use rma, then the summarization step is basically medianpolish, and if you care to know how that works, then google is your friend.