Question: HTA 2.0 and Transcript level analysis
0
2.2 years ago by
giroudpaul40
France
giroudpaul40 wrote:

Dear Bioconductor members

I am asked to analyze my lab's HTA2.0 data once again, but this time to find information on specific transcript variants.

Do you know if it is possible to annotate an Eset obtained via : data.rma = oligo::rma(data, target="probeset")? For what I understand, this apply RMA to the probeset, without summarizing in transcriptclusters. And can we perform DE analysis on transcripts ?

For instance, how do I know which probes will be of interest for this kind of transcript : https://www.ncbi.nlm.nih.gov/nuccore/XM_011529084.2

I read in an another thread (Use junction probesets (JUC) to detect alternative splicing on Affymetrix human HTA 2.0 arrays? that practically no one was interested in transcript analysis from microarray data, explaining the lack of tools to perform this task. Is it really so ? I find this quite surprising.

Paul

transcripts hta2.0 • 1.1k views
modified 2.2 years ago • written 2.2 years ago by giroudpaul40
Answer: HTA 2.0 and Transcript level analysis
2
2.2 years ago by
United States
James W. MacDonald51k wrote:

You can annotate at the probeset level. Just use the hta20probeset.db package, which contains whatever annotation Affy makes available.

As for performing DE analysis on transcripts, there are, so far as I know, no Bioconductor packages for this. There is Affy's TAC, which does something that they claim is useful, so you might take a look at that.

Hi again,

So I annotated my dataset just to have a look, and it seems that the hta20probeset.db does not really differentiate between transcripts as it maps all refseq transcripts for one gene to each probes corresponding to this gene.

For example:
hta.dat = rma(hta.dat, target="probeset")
ProbesTr <- select(hta20probeset.db, keys=keys(hta20probeset.db), columns = c("SYMBOL", "REFSEQ"))
ProbesTr[ProbesTr$PROBEID %in% "JUC17000798.hg.1",] Returns :  PROBEID SYMBOL REFSEQ 4984845 JUC17000798.hg.1 CD68 NM_001040059 4984846 JUC17000798.hg.1 CD68 NM_001251 4984847 JUC17000798.hg.1 CD68 NP_001035148 4984848 JUC17000798.hg.1 CD68 NP_001242 Corresponding to two transcript variants and two precursors I suppose this is what makes hta.dat <- annotateEset(hta.dat, hta20probeset.db) fails (R crashes). Is this normal ? Is there a trick that I missed ? Or is it simply because affymetrix did not give the correct annotations for its probes. ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by giroudpaul40 Ah, you know what? The way these annotation packages work is completely anathema to how this array is supposed to work. The ChipDb package only has one real table, which maps the probeset ID to the Entrez Gene ID, and any other lookups will be farmed out to the org.Hs.eg.db package. Since all previous Affy arrays (with the notable exception of the Exon ST arrays) were based on gene level expression, this was fine. But for these arrays it doesn't really work. For example the JUC probe you mention is only intended to measure two of the four transcripts for CD68: grep JUC17000798.hg.1 HTA-2_0.na35.hg19.probeset.csv | cut -d, -f 12 | sed 's|///|\n|g' uc002ghu.3 // --- // 100 // 0 // --- // 1 uc002ghv.3 // --- // 100 // 0 // --- // 1 NM_001040059 // --- // 100 // 0 // --- // 1 NM_001251 // --- // 100 // 0 // --- // 1 BC015557 // --- // 100 // 0 // --- // 1 ENST00000250092 // --- // 100 // 0 // --- // 1 ENST00000380498 // --- // 100 // 0 // --- // 1 OTTHUMT00000226949 // --- // 100 // 0 // --- // 1 But when we make the annotation package we map the RefSeq ID to its Entrez Gene ID > dbGetQuery(hta20probeset_dbconn(), "select * from probes where probe_id='JUC17000798.hg.1';") probe_id gene_id is_multiple 1 JUC17000798.hg.1 968 0 And any mapping from that probeset ID to anything else is based on a join to that table, so is by definition gene based, not transcript based. So as they stand, the probeset.db packages are probably not that useful. OTOH, it's a fairly rare thing for people to want to do what you want to do, so I don't know if it's worth putting limited resources towards making a special ChipDb type that is transcript rather than gene centric. ADD REPLYlink written 2.2 years ago by James W. MacDonald51k I should also mention that annotateEset works for me, so I cannot reproduce any crash. You also have access to all the annotation that Affy provides, which you can get using the getNetAffx function in oligo. > z <- getNetAffx(dat) > pData(z)["JUC17000798.hg.1","mrnaassignment"] [1] "uc002ghu.3 // --- // 100 // 0 // --- // 1 /// uc002ghv.3 // --- // 100 // 0 // --- // 1 /// NM_001040059 // --- // 100 // 0 // --- // 1 /// NM_001251 // --- // 100 // 0 // --- // 1 /// BC015557 // --- // 100 // 0 // --- // 1 /// ENST00000250092 // --- // 100 // 0 // --- // 1 /// ENST00000380498 // --- // 100 // 0 // --- // 1 /// OTTHUMT00000226949 // --- // 100 // 0 // --- // 1" Which you can parse to get all the transcripts that each probeset is supposed to interrogate. But this turns out, for this gene, to be exactly what you already would get from the hta20probeset.db package: > toget <- select(hta20probeset.db, "CD68", "PROBEID", "SYMBOL")$PROBEID
'select()' returned 1:many mapping between keys and columns
> anno <- pData(z)[toget,"mrnaassignment"]
> lapply(strsplit(anno, " /// "), function(x) grep("^NM_", sapply(strsplit(x, " // "), "[", 1), value = TRUE))
[[1]]
[1] "NM_001040059" "NM_001251"

[[2]]
[1] "NM_001040059" "NM_001251"

[[3]]
[1] "NM_001040059" "NM_001251"

[[4]]
[1] "NM_001040059" "NM_001251"

[[5]]
[1] "NM_001040059" "NM_001251"

[[6]]
[1] "NM_001040059" "NM_001251"

[[7]]
[1] "NM_001040059" "NM_001251"

[[8]]
[1] "NM_001040059" "NM_001251"

[[9]]
[1] "NM_001040059" "NM_001251"

[[10]]
[1] "NM_001040059" "NM_001251"

[[11]]
[1] "NM_001040059" "NM_001251"

[[12]]
[1] "NM_001040059" "NM_001251"

[[13]]
[1] "NM_001040059" "NM_001251"

[[14]]
[1] "NM_001040059" "NM_001251"

[[15]]
[1] "NM_001040059" "NM_001251"

[[16]]
[1] "NM_001040059" "NM_001251"

[[17]]
[1] "NM_001040059" "NM_001251"

[[18]]
[1] "NM_001040059" "NM_001251"

[[19]]
[1] "NM_001040059" "NM_001251"  

So apparently all of the PSR and JUC probesets for this gene measure just two of the known transcripts? I don't have time to explore further, but this should give you some hints as to how you can proceed on your own.

Thank you very much for these explanations. As I suspected, it had something to do with how the ChipDB package is built.

Paul

I think you missed my point. At least for this particular probeset, we are supplying EXACTLY what Affy supplies, which is the same exact data for all of the probesets. So unless this is an outlier, it's not clear that Affy supply the required data to do any transcript work anyway.

"The ChipDb package only has one real table, which maps the probeset ID to the Entrez Gene ID, and any other lookups will be farmed out to the org.Hs.eg.db package."

This is what I refered to when saying this. As you said, Affymetrix is not providing Transcript detailed information for the ChipDB package of it's HTA chips (but they sell you them telling that its great to look at transcripts). Well, anyway, I cannot do that (which I will explain my colleague). Only solution left is to look at Affymetrix TAC software.