Summarize Microarray data on a transcript level
1
0
Entering edit mode
@a3a9257a
Last seen 16 months ago

Hello Everyone,

I'm analyzing Affymetrix data from Clariom D human Gene chip assays. I understand that there are two ways of look at the data: from a gene level and from a probe set level.

Using the RMA algorithm, I can summarize the exon data on a gene level:

probeset.eset=rma(dat, target="core")


or on an probe set level

probeset.eset=rma(dat, target="probeset")


But how do I get the expression data on a transcript (mRNA Isoforms) level?

I understand the probe sets map to multiple exon ranges, and so they can target exons from multiple transcripts (but not all) of the same gene.

Is there a way to summarize Microarray data on a transcript level?

1
Entering edit mode
@james-w-macdonald-5106
Last seen 3 hours ago
United States

The 'core' level is +/- the transcript level, or as close as you can easily get to that. That said, it's not really that close. For example,

> select(clariomdhumantranscriptcluster.db, "TC0100009406.hg.1", "REFSEQ")
'select()' returned 1:many mapping between keys and columns
PROBEID       REFSEQ
1  TC0100009406.hg.1 NM_001040623
2  TC0100009406.hg.1 NM_001258001
3  TC0100009406.hg.1 NM_001258002
4  TC0100009406.hg.1 NM_001258003
5  TC0100009406.hg.1 NM_001258004
6  TC0100009406.hg.1 NM_001258005
7  TC0100009406.hg.1    NM_021797
8  TC0100009406.hg.1    NM_201653
9  TC0100009406.hg.1 NP_001035713
10 TC0100009406.hg.1 NP_001244930
11 TC0100009406.hg.1 NP_001244931
12 TC0100009406.hg.1 NP_001244932
13 TC0100009406.hg.1 NP_001244933
14 TC0100009406.hg.1 NP_001244934
15 TC0100009406.hg.1    NP_068569
16 TC0100009406.hg.1    NP_970615
17 TC0100009406.hg.1 XM_006710577
18 TC0100009406.hg.1 XM_017001047
19 TC0100009406.hg.1 XM_017001048
20 TC0100009406.hg.1 XP_006710640
21 TC0100009406.hg.1 XP_016856536
22 TC0100009406.hg.1 XP_016856537
> select(clariomdhumantranscriptcluster.db, "TC0100009406.hg.1", "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
PROBEID ENTREZID
1 TC0100009406.hg.1    27159


So the claim from Affy is that this probeset interrogates 22 different transcripts (which it may!), from one gene.

You could use the MBNI re-mapped pdInfoPackage for this array that is based on say Ensembl Transcript IDs, but probably there is still lots of overlap. For example,

> install.packages("http://mbni.org/customcdf/24.0.0/enst.download/pd.clariomdhuman.hs.enst_24.0.0.tar.gz", repos = NULL, type = "source")
## Need type = "source" as I am on Windows here
> library(pd.clariomdhuman.hs.enst)
Warning message:
package 'RSQLite' was built under R version 4.0.3
## we'll do some direct SQL queries
> con <- db(pd.clariomdhuman.hs.enst)
> library(DBI)
> dbListTables(con)
[1] "featureSet1" "mps1pm"      "pmfeature"   "table_info"
## You can check for what is in these tables thusly
> dbGetQuery(con, "select * from featureset1 limit 3;")
fsetid            man_fsetid       type direction
1      1 ENST00000000233.10_at expression antisense
2      2  ENST00000000412.8_at expression antisense
3      3 ENST00000000442.11_at expression antisense
## get an example set of transcripts, using the gene from above.
## First we need the corresponding Ensembl Gene ID to query the Biomart
> select(org.Hs.eg.db, "27159", "ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
ENTREZID         ENSEMBL
1    27159 ENSG00000134216
## so that's the Ensembl Gene ID we want, now get Transcript IDs
> library(biomaRt)
> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> z <- getBM("ensembl_transcript_id_version", "ensembl_gene_id", "ENSG00000134216", mart)
## staple an _at on the end to match the MBNI IDs
> z <- paste0(z[,1], "_at")
> z
[1] "ENST00000422815.5_at"  "ENST00000483391.5_at"  "ENST00000369740.6_at"
[4] "ENST00000352594.10_at" "ENST00000477918.6_at"  "ENST00000353665.10_at"
[7] "ENST00000489524.5_at"  "ENST00000430615.1_at"  "ENST00000451398.6_at"
[10] "ENST00000343320.10_at"
## So those are our transcript IDs; let's do a query
> zz <- dbGetQuery(con, paste0("select fsetid, man_fsetid, fid from featureset1 inner join mps1pm using(fsetid) where man_fsetid in ('", paste(z, collapse = "','"), "');"))
fsetid            man_fsetid    fid
1  14198 ENST00000343320.10_at  28005
2  14198 ENST00000343320.10_at  68436
3  14198 ENST00000343320.10_at  80011
4  14198 ENST00000343320.10_at  88045
5  14198 ENST00000343320.10_at 130690
6  14198 ENST00000343320.10_at 139347
> dim(zz)
[1] 941   3
> zzz <- split(zz[,-1], zz[,1])
> sapply(zzz, nrow)
14198 15333 15436 20293 41209 45482 56686 75939 80547 85688
109   101    88   116    86    91    92   107    79    72
## so about 70-100 probes per probeset. What's the overlap between them?
> sapply(2:10, function(x) sum(zzz[[1]][,2] %in% zzz[[x]][,2])/nrow(zzz[[x]]))
[1] 0.8217822 0.7954545 0.9224138 0.8953488 0.9560440 0.9130435 0.7383178
[8] 0.8354430 0.8888889
## so about 80-90% overlap in the probe usage for each probeset.


In this one example, the MBNI re-mapped probeset has 9 transcripts for a single gene, but re-uses about 90% of the probes for each transcript. It's probably pretty similar for most probesets, so it's not clear if you will gain anything by using that pdInfoPackage instead of the regular one.

0
Entering edit mode

The one-to-many mapping issue from Probeset IDs to Transcripts was exactly what I had ran into before asking this question. What if I go in another direction?

If I have a differentially expressed Exon/probeset (after running RMA on probeset level), is that substantial enough to claim that the transcript(s) this probeset maps to are also differentially expressed?

Or

If all the probesets, that map to a particular transcript, are differentially expressed, then the transcript is differentially expressed as well?

Thank you for your valuable support and time.

0
Entering edit mode

I don't see how either of your proposals are different from just doing rma at the transcript level. If you were to do that, then a significantly differentially expressed gene would also mean that the underlying transcripts were differentially expressed as well.

0
Entering edit mode

My use case if focused on following:

Let's say Gene-A has 4 Transcripts: Transcript 1, Transcript 2, Transcript 3, Transcript 4. I'm interested in Genes that display the following behavior:

Gene Name--------------------Differentially Expressed?

Gene A (Transcript 1)----------------Yes

Gene A (Transcript 2)----------------Yes

Gene A (Transcript 3)-----------------No

Gene A (Transcript 4)-----------------No

If I find a probe set that maps to Transcript 1 & Transcript 2 and doesn't map to Transcript 3 & Transcript 4, will I be able to infer the above-mentioned situation?

I really appreciate your time. Thanks a lot.

1
Entering edit mode

That's the whole idea behind the Clariom D (and HTA) arrays, that you can do differential transcript abundance (DTA). The arrays are actually pretty complex, having probes that span exon-exon junctions to infer different transcripts and stuff.

These arrays were supposed to be Affy's answer to RNA-Seq, but so far as I know, the only people who ever developed software to do DTA was Affy, with their Transcript Analysis Console. Which you could play around with.

A hypothetical alternative would be to assume that the 'core' summarization is at the exon level (it's not - it's at what Affy calls the probe set region (PSR), which may include an exon, or part of an exon, or IIRC more than one exon). But anyway. The limma package has a function called diffSplice that is actually intended for RNA-Seq analyses, but if you summarize at the core level and pretend it's exons, and if you supply those data to diffSplice in the correct format, then maybe it's OK? I've never used that function, and like I said it's actually for RNA-Seq data, so caveat emptor and all that. But you might give it a look.