Question: Gene level analysis of Affymetrix GeneChip Mouse Exon Arrays
gravatar for Gordon Smyth
9 weeks ago by
Gordon Smyth31k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth31k wrote:

I have been asked to do a gene level expression analysis of an experiment that used Affymetrix GeneChip Mouse Exon 1.0 ST Arrays:

When we had to analyse Affy exon arrays a decade ago, we found that the affy package was not able to handle them. We instead used to RMA normalize using the aroma.affymetrix package from CRAN, restricting to core probe-sets. We used to convert from exon to gene level expression summaries using the aroma.affymetrix function:

ExonRmaPlm(csBN, mergeGroups=TRUE)

Ten years later, are there any recommended Bioconductor pipelines for normalizing and summarizing Affy exon arrays at the gene level? (Apart from the obvious advice to please do RNA-seq instead.)

I am aware of the xps and oligo packages but have not used either of them yet.

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Gordon Smyth31k
gravatar for James W. MacDonald
9 weeks ago by
United States
James W. MacDonald44k wrote:

As Christian noted, the oligo package is the successor to affy, and intended to be used to analyze the newer random-primer arrays. The default for rma in oligo for these arrays is to summarize the core probesets at the transcript level, so

dat <- read.celfiles(list.celfiles())
eset <- rma(dat)

will do what you ask.

ADD COMMENTlink written 9 weeks ago by James W. MacDonald44k

Thanks, that is simple. Once the required array definition package had been installed, these commands ran in a few seconds on my computer.

I assume that moex10sttranscriptcluster.db is the appropriate transcript-cluster annotation package. That works fine for me, although the Bioconductor functions issue a lot of complaints about deprecation when I use select() to obtain CHR information from this package (or indeed from any package).

For the reference of other readers, I give code below:

x <- read.celfiles(files)
y <- rma(x)
ann <- select(moex10sttranscriptcluster.db,keys=featureNames(y),

Keep first Entrez ID for any probe ID:

d <- duplicated(ann$PROBEID)
ann <- ann[!d,]
m <- match(featureNames(y), ann$PROBEID)
fData(y) <- ann[m,-1]
ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Gordon Smyth31k

The complaints arise because the positional information has been moved from the orgDb packages (which are intended to contain mappings of probesets to gene names, symbols, various IDs, etc) to the TxDb packages, which are intended to contain positional information. The ChipDb packages are really just thin wrappers that contain only the probesetid -> Entrez Gene ID mappings, and all the heavy lifting is done by calls to the representative OrgDb package.

Ideally AnnotationDbi would use e.g., the Mus.musculus package instead of the package, but any change in that direction will require there to be OrganismDb packages for all of the arrays out there, and I don't see that happening any time soon.

You might want to reconsider your selection procedure for the Entrez Gene ID. I assume your rationale is that the smaller number implies that it has been around longer, and is more likely to be 'real' for some definition of?

When I generate the annotation packages for these arrays, I parse the annotation csv files from Affy, and collect all the IDs in order. This dictates the order in the annotation packages, so keeping with the example above, I am parsing this:

NM_001195732 // Fam150a // family with sequence similarity 150, member A // 1 A1|1 // 620393 /// NM_001290514 // Mospd1 // motile sperm domain containing 1 // X|X A4 // 70380 ///

Where the number preceding the triple forward slash is the Entrez Gene ID (so, 620393, then 70380). And when you get these data out of the ChipDb, the order follows:

> select(moex10sttranscriptcluster.db, "6747349", "ENTREZID", "PROBEID")
'select()' returned 1:many mapping between keys and columns
1 6747349   620393
2 6747349    70380

I believe that Affy orders these based on how well the underlying probes align to each of the transcripts. If you align the probes for this probeset or the transcript that Affy used as the basis for designing the probes, there is 100% identity with Fam150a, then 98.3% identity with Mospd1. So all things equal, the better choice in this regard is likely to be the larger Entrez Gene ID.

ADD REPLYlink written 9 weeks ago by James W. MacDonald44k

I chose to keep the smallest Entrez Gene ID for each gene because IDs for Gene Models are larger than IDs for more established genes.

From what you say, I should instead keep the mappings in the original order, so as to get the Entrez ID with best alignment for each transcript-cluster ID. I have edited the code above to do that instead.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Gordon Smyth31k
gravatar for svlachavas
9 weeks ago by
Greece/Athens/National Hellenic Research Foundation
svlachavas560 wrote:

Dear Gordon,

you might find the following paper interesting ( -it mentions in the abstract that "This workflow is directly applicable to current “Gene” type arrays, e.g. the HuGene or MoGene arrays but can easily adapted to similar platforms..", so it might be useful for also Exon ST Arrays.



ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by svlachavas560

Thanks, the F1000R article does have some helpful explanations.

ADD REPLYlink written 9 weeks ago by Gordon Smyth31k
gravatar for cstrato
9 weeks ago by
cstrato3.8k wrote:

Dear Gordon,

As you might know, from the packages mentioned package 'affy' cannot be used 
for exon arrays; for this purpose package 'oligo' was created, which is 
probably used by most people.

As author of 'xps' I will only mention that 'xps' is able to do RMA normalization
of exon arrays, also restricting to core probe-sets only.

Besides the usual vignettes, directory 'xps/examples' contains scripts for 
working with Affymetrix arrays; for exon arrays you may look at script

Best regards,

ADD COMMENTlink written 9 weeks ago by cstrato3.8k

Thanks. It turned out that I didn't need any special memory management as the oligo functions ran in a few seconds on my laptop computer (which has 32Gb RAM).

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Gordon Smyth31k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 342 users visited in the last hour