Question

What is the difference between "pd.mogene.2.0.st" and "mogene20sttranscriptcluster.db"?

1

Nathaniel ▴ 20

@nathaniel-9283

Last seen 8.4 years ago

Denmark

I am trying to annotate transcripts from an Affymetrix Mouse Gene ST 2.0 microarray using 'oligo', but I have found so many resources and annotation approaches that I cannot figure out the relationships and differences between them.

So, to start with, what is the difference between the packages pd.mogene.2.0.st and mogene20sttranscriptcluster.db? How are they used?

Also, how do they differ from the annotation .csv file provided by Affymetrix, which you can obtain using getNetAffx()?

Finally, would annotating with BioMart give the same results as any of the previous approaches?

Thanks.

affymetrix mouse gene arrays • 2.7k views

ADD COMMENT • link 8.4 years ago Nathaniel ▴ 20

score 5 · Answer 1 · 2015-11-28

5

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

The pd.mogene.2.0.st package is used by oligo when you process your arrays. It basically tells oligo where all the probes are on the array, as well as which probes to combine into a probeset when summarizing.

The mogene20sttranscriptcluster.db package maps the 'core' probesets (the default summarization level for oligo) to the genes that are interrogated by each probeset, as well as other information about each gene. Note that you can also summarize this array at the 'probeset' level, which corresponds to the 'PSR' or probe set region, which roughly corresponds to exons. If you do that, then you want to annotate using the mogene20stprobeset.db package.

The annotation packages differ in a couple of ways from the csv files you can get from Affy, but do note that they are based on those files. They differ in ease of use (parsing the Affy csv files is a non-trivial exercise), as well as in the mappings. To generate the annotation packages we get the RefSeq and GenBank IDs for each Affy probeset ID, map those to Entrez Gene, and then map to all the other annotation databases. So if NCBI has different information for a given Entrez Gene ID, the annotation data package may differ from the Affy csv.

You could also annotate using biomaRt, and you will get reasonably similar results. The differences would be between the data housed at NCBI versus EBI.
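The oligo-plus-annotation-package route described in this answer can be sketched roughly as follows. This is a minimal sketch rather than a definitive workflow: the helper name `annotate_core` and its `cel_dir` argument are hypothetical, and it assumes that oligo, Biobase, AnnotationDbi, pd.mogene.2.0.st and mogene20sttranscriptcluster.db are all installed from Bioconductor.

```r
# Hypothetical helper: summarize Mouse Gene 2.0 ST CEL files at the 'core'
# (transcript cluster) level and look up the gene behind each probeset.
# cel_dir is a placeholder for a directory containing the CEL files.
annotate_core <- function(cel_dir) {
  cel_files <- list.files(cel_dir, pattern = "\\.CEL$",
                          ignore.case = TRUE, full.names = TRUE)
  # oligo uses pd.mogene.2.0.st here to learn where the probes are
  raw <- oligo::read.celfiles(cel_files)
  # target = "core" combines probes into transcript-cluster probesets
  eset <- oligo::rma(raw, target = "core")
  # Map probeset IDs to gene symbols and Entrez Gene IDs
  ann <- AnnotationDbi::select(
    mogene20sttranscriptcluster.db::mogene20sttranscriptcluster.db,
    keys = Biobase::featureNames(eset),
    columns = c("SYMBOL", "ENTREZID"),
    keytype = "PROBEID"
  )
  list(eset = eset, annotation = ann)
}
```

For a 'probeset'-level summarization, the same sketch would presumably use `target = "probeset"` and the mogene20stprobeset.db package instead, as mentioned above.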

ADD COMMENT • link 8.4 years ago James W. MacDonald 65k

0

Excellent answer James, thanks a lot!

Just a final question: when mapping the 'core' probesets to genes using the mogene20sttranscriptcluster.db package, should I expect duplicated genes in the collapsed matrix, or will ALL probesets mapping to the same gene have been collapsed?

I expected the latter, but as a sanity check, after collapsing using affyNorm <- rma(affyRaw, target="core"), I annotated all "collapsed" probesets using BioMart, and I obtained many duplicated genes like the following:

probeset.id   gene symbol   gene entrez
17427309      Jun           16476
17427312      Jun           16476

I checked the corresponding probeset sequences with BLAT, and they actually map to different regions of the Jun gene.

Why is this happening? What should we do about it?
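A quick way to see how many probesets map to each gene in an annotation table like the one above is sketched below in base R. The first two rows are the Jun probesets from this post; the third row (`GeneX`, with made-up IDs) is a hypothetical single-probeset gene added only for contrast.

```r
# Toy annotation table. Rows 1-2 come from the post; row 3 is hypothetical.
ann <- data.frame(
  probeset.id = c("17427309", "17427312", "10000001"),
  symbol      = c("Jun", "Jun", "GeneX"),
  entrez      = c("16476", "16476", "99999"),
  stringsAsFactors = FALSE
)

# Count probesets per Entrez Gene ID
n_per_gene <- table(ann$entrez)

# Entrez IDs interrogated by more than one probeset
multi <- names(n_per_gene)[n_per_gene > 1]

# Rows of the annotation table belonging to those genes
dup_rows <- ann[ann$entrez %in% multi, ]
print(dup_rows)
```

With a real annotation table, `length(multi)` would give the number of genes that have more than one 'core' probeset.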

ADD REPLY • link 8.4 years ago Nathaniel ▴ 20

0

No, you shouldn't expect all probesets for a gene to be collapsed. Affymetrix arrays have historically had more than one probeset for some genes, and this pattern continues with the Gene ST arrays. Why this is so, and what you should do with it, are good questions, but I have no answers for you. Given the number of duplicated probesets I would be surprised if Affymetrix had a single rationale for the duplication, and would instead assume that it is a gene-dependent thing.

As an example, they could have just piled all the probes that interrogate Jun into one probeset and called it good. But maybe they think there are two predominant transcripts, and you can use the two probesets to infer differences between those two transcripts by looking at the expression values for the two probesets. Or maybe they think something else about Jun. I really don't know. This does make interpretation of the results more difficult, and using an MBNI cdf where they have all been collapsed at the gene level would make interpretation easier.

On the other hand, there is a valid argument that simplifying data to the gene level ignores the complexity of the transcript, and you could for instance think you have differential expression when in fact you have identical expression levels, but very different transcripts. In other words, consider the hypothetical of a gene with four exons and two main transcripts, where transcript A has all four exons and transcript B has only two. If two sample types are expressing exactly equivalent numbers of transcripts, but one sample type is expressing only transcript A and the other is expressing only transcript B, then you may get very different signal, and interpret it as differential gene expression when in fact it is due entirely to differences in the form of the transcript.
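The four-exon hypothetical can be made concrete with a small base R sketch. Several things here are assumptions for illustration only: that transcript B contains exons 1 and 2, that signal is exactly proportional to transcript copies, that each sample type has 100 copies, and that the gene-level value is a plain average over exons (real summarization such as RMA is more involved).

```r
# Rows are exons 1-4; 1 means the transcript contains that exon.
exon_in_transcript <- cbind(
  A = c(1, 1, 1, 1),  # transcript A: all four exons
  B = c(1, 1, 0, 0)   # transcript B: two exons (which two is an assumption)
)
rownames(exon_in_transcript) <- paste0("exon", 1:4)

copies <- 100  # identical transcript count in both sample types (arbitrary)

# Idealized exon-level signal for each sample type
signal_type1 <- exon_in_transcript[, "A"] * copies  # expresses only A
signal_type2 <- exon_in_transcript[, "B"] * copies  # expresses only B

# Crude gene-level summary: average over the exon-level signals
gene_level <- c(type1 = mean(signal_type1), type2 = mean(signal_type2))
print(gene_level)
```

Under these assumptions the gene-level value for the second sample type is half that of the first, even though both express the same number of transcripts.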

As to what you should do with that, I have no idea. The answer depends on too many variables. There is always a tension between the 'bulk analysis' that we do with microarray data and the particular questions you may have. With 30,000 or more different probesets, you have to do some things in a pretty naive way. For example, you fit the same linear model on each probeset, which you would never do if you were doing conventional statistical analysis. But you cannot go through and decide what the best model is for each gene, because nobody has that kind of time. Plus you usually don't have the replication to decide what the best model is anyway. But at some point you have a set of interesting genes, at which time you might want to look more closely at the probesets, what they are measuring, etc., in order to decide what the data mean.
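To illustrate what fitting the same linear model on each probeset looks like, here is a minimal base R sketch on made-up data. The expression values, probeset names (`ps1`, `ps2`) and group labels are all hypothetical; in practice a package such as limma is typically used for this step, and this sketch does not reproduce it.

```r
# Toy log2 expression matrix: rows are probesets, columns are samples.
expr <- rbind(
  ps1 = c(5.0, 5.2, 4.8, 7.1, 6.9, 7.0),
  ps2 = c(8.0, 8.1, 7.9, 8.0, 7.9, 8.1)
)
colnames(expr) <- paste0("s", 1:6)

# Three control and three treated samples
group <- factor(rep(c("control", "treated"), each = 3))

# With a matrix response, lm() fits the same model to every column,
# so transposing gives one fit per probeset.
fit <- lm(t(expr) ~ group)

# Estimated treated-vs-control difference for each probeset
effects <- coef(fit)["grouptreated", ]
print(round(effects, 2))
```

Every probeset gets the identical design (one intercept, one group effect), regardless of whether that model is the best choice for that particular gene.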

ADD REPLY • link 8.4 years ago James W. MacDonald 65k

0

That perfectly answers my question, thanks James!

ADD REPLY • link 8.4 years ago Nathaniel ▴ 20
