When we had to analyse Affy exon arrays a decade ago, we found that the affy package was not able to handle them. We instead used to RMA normalize using the aroma.affymetrix package from CRAN, restricting to core probe-sets. We used to convert from exon to gene level expression summaries using the aroma.affymetrix function:
ExonRmaPlm(csBN, mergeGroups=TRUE)
Ten years later, are there any recommended Bioconductor pipelines for normalizing and summarizing Affy exon arrays at the gene level? (Apart from the obvious advice to please do RNA-seq instead.)
I am aware of the xps and oligo packages but have not used either of them yet.
As Christian noted, the oligo package is the successor to affy, and intended to be used to analyze the newer random-primer arrays. The default for rma in oligo for these arrays is to summarize the core probesets at the transcript level, so
dat <- read.celfiles(list.celfiles())
eset <- rma(dat)
Thanks, that is simple. Once the required array definition package had been installed, these commands ran in a few seconds on my computer.
I assume that moex10sttranscriptcluster.db is the appropriate transcript-cluster annotation package. That works fine for me, although the Bioconductor functions issue a lot of complaints about deprecation when I use select() to obtain CHR information from this package (or indeed from any package).
For the reference of other readers, I give code below:
library(oligo)
install.packages("pd.moex.1.0.st.v1",
repos="https://bioconductor.org/packages/3.5/data/annotation")
x <- read.celfiles(files)
y <- rma(x)
library(moex10sttranscriptcluster.db)
ann <- select(moex10sttranscriptcluster.db,keys=featureNames(y),
columns=c("ENTREZID","SYMBOL","GENENAME","CHR"),keytype="PROBEID")
Keep first Entrez ID for any probe ID:
d <- duplicated(ann$PROBEID)
ann <- ann[!d,]
m <- match(featureNames(y), ann$PROBEID)
fData(y) <- ann[m,-1]
The complaints arise because the positional information has been moved from the orgDb packages (which are intended to contain mappings of probesets to gene names, symbols, various IDs, etc) to the TxDb packages, which are intended to contain positional information. The ChipDb packages are really just thin wrappers that contain only the probesetid -> Entrez Gene ID mappings, and all the heavy lifting is done by calls to the representative OrgDb package.
Ideally AnnotationDbi would use e.g., the Mus.musculus package instead of the org.Mm.eg.db package, but any change in that direction will require there to be OrganismDb packages for all of the arrays out there, and I don't see that happening any time soon.
You might want to reconsider your selection procedure for the Entrez Gene ID. I assume your rationale is that the smaller number implies that it has been around longer, and is more likely to be 'real' for some definition of?
When I generate the annotation packages for these arrays, I parse the annotation csv files from Affy, and collect all the IDs in order. This dictates the order in the annotation packages, so keeping with the example above, I am parsing this:
NM_001195732 // Fam150a // family with sequence similarity 150, member A // 1 A1|1 // 620393 /// NM_001290514 // Mospd1 // motile sperm domain containing 1 // X|X A4 // 70380 ///
Where the number preceding the triple forward slash is the Entrez Gene ID (so, 620393, then 70380). And when you get these data out of the ChipDb, the order follows:
> select(moex10sttranscriptcluster.db, "6747349", "ENTREZID", "PROBEID")
'select()' returned 1:many mapping between keys and columns
PROBEID ENTREZID
1 6747349 620393
2 6747349 70380
I believe that Affy orders these based on how well the underlying probes align to each of the transcripts. If you align the probes for this probeset or the transcript that Affy used as the basis for designing the probes, there is 100% identity with Fam150a, then 98.3% identity with Mospd1. So all things equal, the better choice in this regard is likely to be the larger Entrez Gene ID.
I chose to keep the smallest Entrez Gene ID for each gene because IDs for Gene Models are larger than IDs for more established genes.
From what you say, I should instead keep the mappings in the original order, so as to get the Entrez ID with best alignment for each transcript-cluster ID. I have edited the code above to do that instead.
you might find the following paper interesting (https://f1000research.com/articles/5-1384/v1) -it mentions in the abstract that "This workflow is directly applicable to current “Gene” type arrays, e.g. the HuGene or MoGene arrays but can easily adapted to similar platforms..", so it might be useful for also Exon ST Arrays.
As you might know, from the packages mentioned package 'affy' cannot be used
for exon arrays; for this purpose package 'oligo' was created, which is
probably used by most people.
As author of 'xps' I will only mention that 'xps' is able to do RMA normalization
of exon arrays, also restricting to core probe-sets only.
Besides the usual vignettes, directory 'xps/examples' contains scripts for
working with Affymetrix arrays; for exon arrays you may look at script
'script4exon.R'.
Thanks. It turned out that I didn't need any special memory management as the oligo functions ran in a few seconds on my laptop computer (which has 32Gb RAM).
Thanks, that is simple. Once the required array definition package had been installed, these commands ran in a few seconds on my computer.
I assume that moex10sttranscriptcluster.db is the appropriate transcript-cluster annotation package. That works fine for me, although the Bioconductor functions issue a lot of complaints about deprecation when I use select() to obtain CHR information from this package (or indeed from any package).
For the reference of other readers, I give code below:
Keep first Entrez ID for any probe ID:
The complaints arise because the positional information has been moved from the orgDb packages (which are intended to contain mappings of probesets to gene names, symbols, various IDs, etc) to the TxDb packages, which are intended to contain positional information. The ChipDb packages are really just thin wrappers that contain only the probesetid -> Entrez Gene ID mappings, and all the heavy lifting is done by calls to the representative OrgDb package.
Ideally AnnotationDbi would use e.g., the Mus.musculus package instead of the org.Mm.eg.db package, but any change in that direction will require there to be OrganismDb packages for all of the arrays out there, and I don't see that happening any time soon.
You might want to reconsider your selection procedure for the Entrez Gene ID. I assume your rationale is that the smaller number implies that it has been around longer, and is more likely to be 'real' for some definition of?
When I generate the annotation packages for these arrays, I parse the annotation csv files from Affy, and collect all the IDs in order. This dictates the order in the annotation packages, so keeping with the example above, I am parsing this:
Where the number preceding the triple forward slash is the Entrez Gene ID (so, 620393, then 70380). And when you get these data out of the ChipDb, the order follows:
I believe that Affy orders these based on how well the underlying probes align to each of the transcripts. If you align the probes for this probeset or the transcript that Affy used as the basis for designing the probes, there is 100% identity with Fam150a, then 98.3% identity with Mospd1. So all things equal, the better choice in this regard is likely to be the larger Entrez Gene ID.
I chose to keep the smallest Entrez Gene ID for each gene because IDs for Gene Models are larger than IDs for more established genes.
From what you say, I should instead keep the mappings in the original order, so as to get the Entrez ID with best alignment for each transcript-cluster ID. I have edited the code above to do that instead.