Question

Gene level analysis of Affymetrix GeneChip Mouse Exon Arrays

2

Entering edit mode

Gordon Smyth 53k

@gordon-smyth

Last seen 2 hours ago

WEHI, Melbourne, Australia

I have been asked to do a gene level expression analysis of an experiment that used Affymetrix GeneChip Mouse Exon 1.0 ST Arrays:

http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1487/?query=mutant+TDP-43

When we had to analyse Affy exon arrays a decade ago, we found that the affy package was not able to handle them. We instead used to RMA normalize using the aroma.affymetrix package from CRAN, restricting to core probe-sets. We used to convert from exon to gene level expression summaries using the aroma.affymetrix function:

ExonRmaPlm(csBN, mergeGroups=TRUE)

Ten years later, are there any recommended Bioconductor pipelines for normalizing and summarizing Affy exon arrays at the gene level? (Apart from the obvious advice to please do RNA-seq instead.)

I am aware of the xps and oligo packages but have not used either of them yet.

affy aroma.affymetrix exon array xps oligo • 3.0k views

ADD COMMENT • link 8.4 years ago Gordon Smyth 53k

1

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Gordon,

you might find the following paper interesting (https://f1000research.com/articles/5-1384/v1) -it mentions in the abstract that "This workflow is directly applicable to current “Gene” type arrays, e.g. the HuGene or MoGene arrays but can easily adapted to similar platforms..", so it might be useful for also Exon ST Arrays.

Best,

Efstathios-Iason

ADD COMMENT • link 8.4 years ago svlachavas ▴ 840

0

Entering edit mode

Thanks, the F1000R article does have some helpful explanations.

ADD REPLY • link 8.4 years ago Gordon Smyth 53k

1

Entering edit mode

cstrato ★ 3.9k

@cstrato-908

Last seen 7.2 years ago

Austria

Dear Gordon,

As you might know, from the packages mentioned package 'affy' cannot be used
for exon arrays; for this purpose package 'oligo' was created, which is
probably used by most people.

As author of 'xps' I will only mention that 'xps' is able to do RMA normalization
of exon arrays, also restricting to core probe-sets only.

Besides the usual vignettes, directory 'xps/examples' contains scripts for
working with Affymetrix arrays; for exon arrays you may look at script
'script4exon.R'.

Best regards,
Christian

ADD COMMENT • link 8.4 years ago cstrato ★ 3.9k

0

Entering edit mode

Thanks. It turned out that I didn't need any special memory management as the oligo functions ran in a few seconds on my laptop computer (which has 32Gb RAM).

ADD REPLY • link 8.4 years ago Gordon Smyth 53k

score 2 · Accepted Answer · 2017-07-14

2

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 4 days ago

United States

As Christian noted, the oligo package is the successor to affy, and intended to be used to analyze the newer random-primer arrays. The default for rma in oligo for these arrays is to summarize the core probesets at the transcript level, so

dat <- read.celfiles(list.celfiles())
eset <- rma(dat)

will do what you ask.

ADD COMMENT • link 8.4 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks, that is simple. Once the required array definition package had been installed, these commands ran in a few seconds on my computer.

I assume that moex10sttranscriptcluster.db is the appropriate transcript-cluster annotation package. That works fine for me, although the Bioconductor functions issue a lot of complaints about deprecation when I use select() to obtain CHR information from this package (or indeed from any package).

For the reference of other readers, I give code below:

library(oligo)
install.packages("pd.moex.1.0.st.v1",
                 repos="https://bioconductor.org/packages/3.5/data/annotation")
x <- read.celfiles(files)
y <- rma(x)

library(moex10sttranscriptcluster.db)
ann <- select(moex10sttranscriptcluster.db,keys=featureNames(y),
       columns=c("ENTREZID","SYMBOL","GENENAME","CHR"),keytype="PROBEID")

Keep first Entrez ID for any probe ID:

d <- duplicated(ann$PROBEID)
ann <- ann[!d,]
m <- match(featureNames(y), ann$PROBEID)
fData(y) <- ann[m,-1]

ADD REPLY • link 8.4 years ago Gordon Smyth 53k

0

Entering edit mode

The complaints arise because the positional information has been moved from the orgDb packages (which are intended to contain mappings of probesets to gene names, symbols, various IDs, etc) to the TxDb packages, which are intended to contain positional information. The ChipDb packages are really just thin wrappers that contain only the probesetid -> Entrez Gene ID mappings, and all the heavy lifting is done by calls to the representative OrgDb package.

Ideally AnnotationDbi would use e.g., the Mus.musculus package instead of the org.Mm.eg.db package, but any change in that direction will require there to be OrganismDb packages for all of the arrays out there, and I don't see that happening any time soon.

You might want to reconsider your selection procedure for the Entrez Gene ID. I assume your rationale is that the smaller number implies that it has been around longer, and is more likely to be 'real' for some definition of?

When I generate the annotation packages for these arrays, I parse the annotation csv files from Affy, and collect all the IDs in order. This dictates the order in the annotation packages, so keeping with the example above, I am parsing this:

NM_001195732 // Fam150a // family with sequence similarity 150, member A // 1 A1|1 // 620393 /// NM_001290514 // Mospd1 // motile sperm domain containing 1 // X|X A4 // 70380 ///

Where the number preceding the triple forward slash is the Entrez Gene ID (so, 620393, then 70380). And when you get these data out of the ChipDb, the order follows:

> select(moex10sttranscriptcluster.db, "6747349", "ENTREZID", "PROBEID")
'select()' returned 1:many mapping between keys and columns
  PROBEID ENTREZID
1 6747349   620393
2 6747349    70380

I believe that Affy orders these based on how well the underlying probes align to each of the transcripts. If you align the probes for this probeset or the transcript that Affy used as the basis for designing the probes, there is 100% identity with Fam150a, then 98.3% identity with Mospd1. So all things equal, the better choice in this regard is likely to be the larger Entrez Gene ID.

ADD REPLY • link 8.4 years ago James W. MacDonald 68k

0

Entering edit mode

I chose to keep the smallest Entrez Gene ID for each gene because IDs for Gene Models are larger than IDs for more established genes.

From what you say, I should instead keep the mappings in the original order, so as to get the Entrez ID with best alignment for each transcript-cluster ID. I have edited the code above to do that instead.

ADD REPLY • link 8.4 years ago Gordon Smyth 53k