Pre-processing of HTA 2.0 array
pbachali

Hi,

I am Prathyusha, working as a bioinformatician for a biomedical research company. I have come across the HTA-2.0 array for the first time, and I am quite confused about this array and its CDF file. Has anybody done differential gene expression analysis on this chip? If so, could you please provide some background on how to proceed? For annotation I have used the getNetAffx function from the oligo package, and it works, but there are both transcript-level and probeset-level annotations. Which one should I use? How can I get probe IDs from the transcript cluster annotation file to match the ExpressionSet I create? How can I remove low-intensity probes here? Can I use genefilter? One last question: can I use limma to find differentially expressed genes?

Any help is much appreciated.

@james-w-macdonald-5106

The HTA-2.0 arrays are intended to be used for one of three things:

1. Transcript level comparisons
2. Probe-set region level comparisons
3. Comparisons that include estimates of differential expression of probe-set regions

Number 3 is supposed to allow you to compare samples and detect either differential exon usage, or possibly differential expression of splice variants.

The first two types of comparisons are really all you can do with Bioconductor packages. You need Affy's TAC software to attempt to do the third.

If you want to do #1, you can use the oligo package and just do 'the usual'.

## start R in the dir that contains your celfiles

library(oligo)

dat <- read.celfiles(list.celfiles())

eset <- rma(dat)

## you can then get rid of background probes and annotate using functions in my affycoretools package

library(affycoretools)

library(hta20transcriptcluster.db)

eset.main <- getMainProbes(eset, pd.hta.2.0)

eset.main <- annotateEset(eset.main, hta20transcriptcluster.db)

-OR-

eset.main <- annotateEset(eset.main, pd.hta.2.0)

But the former will probably be faster than the latter.

You can then move on to limma, and your topTable results will contain the available annotations.
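For the limma step itself, here is a minimal sketch. It assumes `eset.main` from above contains four samples in two groups; the `group` factor and group labels below are assumptions for illustration, so build the factor from your own sample information:

```r
## Minimal limma sketch (assumes eset.main from the oligo/affycoretools
## steps above; the 'group' factor here is a made-up two-group design).
library(limma)

group <- factor(c("control", "control", "treated", "treated"))
design <- model.matrix(~ group)

fit <- lmFit(eset.main, design)  # fit a linear model per probeset
fit <- eBayes(fit)               # moderated t-statistics

## topTable carries over the annotation columns added by annotateEset
topTable(fit, coef = 2, number = 10)
```

With the `~ group` parameterization, the second coefficient is the treated-vs-control comparison, which is why `coef = 2` is passed to topTable.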

An alternative is to do things at the probeset region (PSR in Affy speak) level, which gives you expression values for exons (or parts of exons), and for probes that span exon-exon junctions. This is necessarily a more complex analysis, because the expression values are not as easy to interpret, and modeling them is probably not as straightforward.

eset <- rma(dat, target = "probeset")

library(hta20probeset.db)

eset.main <- getMainProbes(eset, pd.hta.2.0)

eset.main <- annotateEset(eset.main, hta20probeset.db)

-OR-

eset.main <- annotateEset(eset.main, pd.hta.2.0, level = "probeset")

Note that you need to specify the 'level' if you use the pdInfoPackage for annotating the ExpressionSet. These data are now OK for use with the limma package, although as noted above, you are now making comparisons between smaller transcript regions rather than comparing total transcript abundances.


To add to James's answer: an analysis of differential exon usage (number 3) using R/BioC libraries (the iGEMS approach) has been reported here:

"iGEMS: an integrated model for identification of alternative exon usage events"

http://nar.oxfordjournals.org/content/44/11/e109


Hi Guido,

Thanks for the links and the script on GitHub. I am quite confused about selecting the correct approach among the three James mentioned above. I tried both the transcript level and the probeset level, and ran limma. I got some differentially expressed genes, but I am not able to interpret them. I can see the transcript cluster ID, the probeset-level ID, and the number of probes, but for the gene symbol I see only one gene symbol and one Entrez ID. When we have a set of probes in each transcript cluster, is it correct to have only one gene symbol?

I am not sure whether or not I am understanding it correctly. Any help is much appreciated. If you could explain to me the main differences between the transcript level and the probeset level, that would be very helpful.


Well, James basically explained it all in his answer above... Please note that by summarizing at the transcript or probeset level you are NOT analyzing at the level of genes! A gene is thus expected to be represented by multiple transcripts (or probesets), exactly as you found. The discussion and comments of James in this thread also provide a good read: what is the difference between "pd.mogene.2.0.st" and "mogene20sttranscriptcluster.db"?

Please check here and here for two pictures that clarify the differences between the probeset (= gene exon) level and the transcript level a bit.

The first one nicely illustrates the differences between the exon and transcript levels, and the second one also indicates the exon-exon junctions. (Please note that the latter site deals with the so-called "Glue Grant Human Transcriptome Array (GG-H)", which was basically the predecessor of the HTA 2.0 array, and that detection of SNPs is not possible on the HTA 2.0 array.)

If you would only like to summarize the expression data at the level of genes, you may also want to consider using the so-called remapped (custom) CDFs that Manhong Dai and Fan Meng of the BrainArray group provide. The latest versions (v21) can already be found here (Entrez gene-based) and here (ENSEMBL-based). If you go this way, you unfortunately cannot make use of the oligo framework (because no corresponding PdInfo packages are available for this array); rather, you will have to use the old affy-based framework, for which you have to use the modified affy package, also available at the MBNI site.
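A rough sketch of what the affy-based route would look like, assuming you have installed the modified affy package and the Entrez-based BrainArray CDF package for this array from the MBNI site. Note that the package name "hta20hsentrezgcdf" below is an assumption based on the usual BrainArray naming convention, so check the actual name on the download page:

```r
## Sketch only: assumes the MBNI build of affy and the Entrez-based
## BrainArray CDF package are installed; "hta20hsentrezgcdf" is an
## assumed name following the BrainArray convention -- verify it.
library(affy)

dat <- ReadAffy(cdfname = "hta20hsentrezgcdf")  # read CEL files with the custom CDF
eset <- rma(dat)                                # one summarized value per Entrez gene
```

The upside of this route is that each summarized value maps to exactly one Entrez gene ID, which sidesteps the multiple-probesets-per-gene situation discussed above.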


Hi James,

Thank you for your reply; that was very helpful. I tried the transcript level and the probeset level, but I am unable to understand clearly what the main difference between them is. When I run limma and look at the annotations for the differentially expressed genes, I can see the transcript cluster ID and the number of probes in that transcript cluster, but only one gene symbol is listed. This is quite confusing to me. Is this the splice-variant analysis? I am getting some results but I am unable to interpret them.


Guido has pretty much explained things in his response to your question to him. But I will add some description of what the probeset and transcript level really means.

These arrays are based on the old Exon ST arrays, which tried to allow people to assess differential exon usage. Those arrays had (in general) four-probe probesets (e.g., four 25-mers that were summarized to estimate the expression of a 'probe set region', or PSR). A PSR was some or all of an exon, so it wasn't even that clear what you were measuring. If the exon was long, there might have been multiple PSRs for the exon, or if it was short maybe only one.

So when you summarize at the probeset level on the HTA arrays, you are summarizing all the probes in a probeset, which may measure a PSR, or on this array, may also summarize a set of probes that are supposed to span an exon-exon junction (e.g., the probe is complementary to the sequence that arises when two exons are joined, so if they bind and have signal, it implies that the underlying transcript had both exons in it). Analyzing the data at this level is very complex, particularly in the bulk way that most microarrays are analyzed. Any significantly differentially expressed PSR or JUC (junction probe) just says something about a little chunk of the gene, and what that then means in the larger context of the gene is something that you have to explore further.

When you summarize at the transcript level, all the probes for all the PSRs for a given transcript are collected into a much larger probeset and summarized, giving some measure of the expression of that transcript. As you have noted there are often multiple such transcript probesets for a given gene. And given the propensity for Affy to re-use probes in the later versions of arrays, the multiple probesets for a given gene may well include some of the same probes! At a first approximation, the transcript level probesets provide some relative measure of the underlying transcription level of a gene, and different probesets for the same gene may measure different splice variants. Again, as with the PSRs, there are thousands of these things, and what may be true for the set of probesets that measure the transcripts for one gene may not be true of another.

So there is the sort of bulk processing of the probesets where you use oligo to summarize, and then you use limma to make comparisons. Then you get this list of things that appear to be differentially expressed. What you then do is dependent on what level you summarized at, and how much time and effort you want to put into figuring out what exactly you have measured, and what it means in the context of your experiment.


Hi James,

I now have a pretty good understanding of ST arrays. Thank you for all your help. I have 1820 CEL files from these Exon ST arrays. My laptop has 16 GB of RAM, and I am not able to normalize all the files in RStudio at this data size. Could you please suggest something? Can I use any other program to analyze these files? I assume even R cannot handle more than 64 GB of files, and I also think that the normalized matrix would be more than 64 GB in size. I am wondering if we can do this using any BioPython packages.


You could probably use the xps package, although I have no experience with it, so cannot say for sure. Another alternative would be to use the Bioconductor AMI with a big EC2 instance that can handle a large number of files. Another alternative is aroma.affymetrix, which IIRC works with Exon ST arrays, and can process an unlimited number of files, so long as you have the disk space for all the cruft it writes to disk.

Or you could see if your institution has some big iron you can use. If you have that many celfiles, I would imagine that you have access to more compute power than a single laptop.

0
Entering edit mode

Hi James, I did try all of your suggestions above, and my problem on every platform (I work on a Linux cluster with 250 GB of RAM, and I also used a large EC2 instance with 1 TB of RAM) is that oligo::rma gives me NaNs when processing all the files together. To rule out a corrupt file, I divided the files into multiple smaller batches and saw that the normalization then worked fine. Any reason you can think of why this is happening to the normalization? I really appreciate any inputs.

Thanks!