Search
Question: Gene level analysis of Affymetrix GeneChip Mouse Exon Arrays
2
gravatar for Gordon Smyth
4 months ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:

I have been asked to do a gene level expression analysis of an experiment that used Affymetrix GeneChip Mouse Exon 1.0 ST Arrays:

http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-1487/?query=mutant+TDP-43

When we had to analyse Affy exon arrays a decade ago, we found that the affy package was not able to handle them. We instead used to RMA normalize using the aroma.affymetrix package from CRAN, restricting to core probe-sets. We used to convert from exon to gene level expression summaries using the aroma.affymetrix function:

ExonRmaPlm(csBN, mergeGroups=TRUE)

Ten years later, are there any recommended Bioconductor pipelines for normalizing and summarizing Affy exon arrays at the gene level? (Apart from the obvious advice to please do RNA-seq instead.)

I am aware of the xps and oligo packages but have not used either of them yet.

ADD COMMENTlink modified 4 months ago • written 4 months ago by Gordon Smyth32k
2
gravatar for James W. MacDonald
4 months ago by
United States
James W. MacDonald45k wrote:

As Christian noted, the oligo package is the successor to affy, and intended to be used to analyze the newer random-primer arrays. The default for rma in oligo for these arrays is to summarize the core probesets at the transcript level, so

dat <- read.celfiles(list.celfiles())
eset <- rma(dat)

will do what you ask.

ADD COMMENTlink written 4 months ago by James W. MacDonald45k

Thanks, that is simple. Once the required array definition package had been installed, these commands ran in a few seconds on my computer.

I assume that moex10sttranscriptcluster.db is the appropriate transcript-cluster annotation package. That works fine for me, although the Bioconductor functions issue a lot of complaints about deprecation when I use select() to obtain CHR information from this package (or indeed from any package).

For the reference of other readers, I give code below:

library(oligo)
install.packages("pd.moex.1.0.st.v1",
                 repos="https://bioconductor.org/packages/3.5/data/annotation")
x <- read.celfiles(files)
y <- rma(x)
library(moex10sttranscriptcluster.db)
ann <- select(moex10sttranscriptcluster.db,keys=featureNames(y),
       columns=c("ENTREZID","SYMBOL","GENENAME","CHR"),keytype="PROBEID")

Keep first Entrez ID for any probe ID:

d <- duplicated(ann$PROBEID)
ann <- ann[!d,]
m <- match(featureNames(y), ann$PROBEID)
fData(y) <- ann[m,-1]
ADD REPLYlink modified 4 months ago • written 4 months ago by Gordon Smyth32k

The complaints arise because the positional information has been moved from the orgDb packages (which are intended to contain mappings of probesets to gene names, symbols, various IDs, etc) to the TxDb packages, which are intended to contain positional information. The ChipDb packages are really just thin wrappers that contain only the probesetid -> Entrez Gene ID mappings, and all the heavy lifting is done by calls to the representative OrgDb package.

Ideally AnnotationDbi would use e.g., the Mus.musculus package instead of the org.Mm.eg.db package, but any change in that direction will require there to be OrganismDb packages for all of the arrays out there, and I don't see that happening any time soon.

You might want to reconsider your selection procedure for the Entrez Gene ID. I assume your rationale is that the smaller number implies that it has been around longer, and is more likely to be 'real' for some definition of?

When I generate the annotation packages for these arrays, I parse the annotation csv files from Affy, and collect all the IDs in order. This dictates the order in the annotation packages, so keeping with the example above, I am parsing this:

NM_001195732 // Fam150a // family with sequence similarity 150, member A // 1 A1|1 // 620393 /// NM_001290514 // Mospd1 // motile sperm domain containing 1 // X|X A4 // 70380 ///

Where the number preceding the triple forward slash is the Entrez Gene ID (so, 620393, then 70380). And when you get these data out of the ChipDb, the order follows:

> select(moex10sttranscriptcluster.db, "6747349", "ENTREZID", "PROBEID")
'select()' returned 1:many mapping between keys and columns
  PROBEID ENTREZID
1 6747349   620393
2 6747349    70380

I believe that Affy orders these based on how well the underlying probes align to each of the transcripts. If you align the probes for this probeset or the transcript that Affy used as the basis for designing the probes, there is 100% identity with Fam150a, then 98.3% identity with Mospd1. So all things equal, the better choice in this regard is likely to be the larger Entrez Gene ID.

ADD REPLYlink written 4 months ago by James W. MacDonald45k

I chose to keep the smallest Entrez Gene ID for each gene because IDs for Gene Models are larger than IDs for more established genes.

From what you say, I should instead keep the mappings in the original order, so as to get the Entrez ID with best alignment for each transcript-cluster ID. I have edited the code above to do that instead.

ADD REPLYlink modified 4 months ago • written 4 months ago by Gordon Smyth32k
1
gravatar for svlachavas
4 months ago by
svlachavas570
Greece/Athens/National Hellenic Research Foundation
svlachavas570 wrote:

Dear Gordon,

you might find the following paper interesting (https://f1000research.com/articles/5-1384/v1) -it mentions in the abstract that "This workflow is directly applicable to current “Gene” type arrays, e.g. the HuGene or MoGene arrays but can easily adapted to similar platforms..", so it might be useful for also Exon ST Arrays.

Best,

Efstathios-Iason

ADD COMMENTlink modified 4 months ago • written 4 months ago by svlachavas570

Thanks, the F1000R article does have some helpful explanations.

ADD REPLYlink written 4 months ago by Gordon Smyth32k
1
gravatar for cstrato
4 months ago by
cstrato3.8k
Austria
cstrato3.8k wrote:

Dear Gordon,

As you might know, from the packages mentioned package 'affy' cannot be used 
for exon arrays; for this purpose package 'oligo' was created, which is 
probably used by most people.

As author of 'xps' I will only mention that 'xps' is able to do RMA normalization
of exon arrays, also restricting to core probe-sets only.

Besides the usual vignettes, directory 'xps/examples' contains scripts for 
working with Affymetrix arrays; for exon arrays you may look at script
'script4exon.R'.

Best regards,
Christian

ADD COMMENTlink written 4 months ago by cstrato3.8k

Thanks. It turned out that I didn't need any special memory management as the oligo functions ran in a few seconds on my laptop computer (which has 32Gb RAM).

ADD REPLYlink modified 4 months ago • written 4 months ago by Gordon Smyth32k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 122 users visited in the last hour