The 'core' level is +/- the transcript level, or as close as you can easily get to that. That said, it's not really that close. For example,
> select(clariomdhumantranscriptcluster.db, "TC0100009406.hg.1", "REFSEQ")
'select()' returned 1:many mapping between keys and columns
PROBEID REFSEQ
1 TC0100009406.hg.1 NM_001040623
2 TC0100009406.hg.1 NM_001258001
3 TC0100009406.hg.1 NM_001258002
4 TC0100009406.hg.1 NM_001258003
5 TC0100009406.hg.1 NM_001258004
6 TC0100009406.hg.1 NM_001258005
7 TC0100009406.hg.1 NM_021797
8 TC0100009406.hg.1 NM_201653
9 TC0100009406.hg.1 NP_001035713
10 TC0100009406.hg.1 NP_001244930
11 TC0100009406.hg.1 NP_001244931
12 TC0100009406.hg.1 NP_001244932
13 TC0100009406.hg.1 NP_001244933
14 TC0100009406.hg.1 NP_001244934
15 TC0100009406.hg.1 NP_068569
16 TC0100009406.hg.1 NP_970615
17 TC0100009406.hg.1 XM_006710577
18 TC0100009406.hg.1 XM_017001047
19 TC0100009406.hg.1 XM_017001048
20 TC0100009406.hg.1 XP_006710640
21 TC0100009406.hg.1 XP_016856536
22 TC0100009406.hg.1 XP_016856537
> select(clariomdhumantranscriptcluster.db, "TC0100009406.hg.1", "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
PROBEID ENTREZID
1 TC0100009406.hg.1 27159
So the claim from Affy is that this probeset interrogates 22 different transcripts (which it may!), from one gene.
You could use the MBNI re-mapped pdInfoPackage for this array that is based on say Ensembl Transcript IDs, but probably there is still lots of overlap. For example,
> install.packages("http://mbni.org/customcdf/24.0.0/enst.download/pd.clariomdhuman.hs.enst_24.0.0.tar.gz", repos = NULL, type = "source")
## Need type = "source" as I am on Windows here
> library(pd.clariomdhuman.hs.enst)
Loading required package: RSQLite
Loading required package: DBI
Warning message:
package 'RSQLite' was built under R version 4.0.3
## we'll do some direct SQL queries
> con <- db(pd.clariomdhuman.hs.enst)
> library(DBI)
> dbListTables(con)
[1] "featureSet1" "mps1pm" "pmfeature" "table_info"
## You can check for what is in these tables thusly
> dbGetQuery(con, "select * from featureset1 limit 3;")
fsetid man_fsetid type direction
1 1 ENST00000000233.10_at expression antisense
2 2 ENST00000000412.8_at expression antisense
3 3 ENST00000000442.11_at expression antisense
## get an example set of transcripts, using the gene from above.
## First we need the corresponding Ensembl Gene ID to query the Biomart
> select(org.Hs.eg.db, "27159", "ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
ENTREZID ENSEMBL
1 27159 ENSG00000134216
## so that's the Ensembl Gene ID we want, now get Transcript IDs
> library(biomaRt)
> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> z <- getBM("ensembl_transcript_id_version", "ensembl_gene_id", "ENSG00000134216", mart)
## staple an _at on the end to match the MBNI IDs
> z <- paste0(z[,1], "_at")
> z
[1] "ENST00000422815.5_at" "ENST00000483391.5_at" "ENST00000369740.6_at"
[4] "ENST00000352594.10_at" "ENST00000477918.6_at" "ENST00000353665.10_at"
[7] "ENST00000489524.5_at" "ENST00000430615.1_at" "ENST00000451398.6_at"
[10] "ENST00000343320.10_at"
## So those are our transcript IDs; let's do a query
> zz <- dbGetQuery(con, paste0("select fsetid, man_fsetid, fid from featureset1 inner join mps1pm using(fsetid) where man_fsetid in ('", paste(z, collapse = "','"), "');"))
> head(zz)
fsetid man_fsetid fid
1 14198 ENST00000343320.10_at 28005
2 14198 ENST00000343320.10_at 68436
3 14198 ENST00000343320.10_at 80011
4 14198 ENST00000343320.10_at 88045
5 14198 ENST00000343320.10_at 130690
6 14198 ENST00000343320.10_at 139347
> dim(zz)
[1] 941 3
> zzz <- split(zz[,-1], zz[,1])
> sapply(zzz, nrow)
14198 15333 15436 20293 41209 45482 56686 75939 80547 85688
109 101 88 116 86 91 92 107 79 72
## so about 70-100 probes per probeset. What's the overlap between them?
> sapply(2:10, function(x) sum(zzz[[1]][,2] %in% zzz[[x]][,2])/nrow(zzz[[x]]))
[1] 0.8217822 0.7954545 0.9224138 0.8953488 0.9560440 0.9130435 0.7383178
[8] 0.8354430 0.8888889
## so about 80-90% overlap in the probe usage for each probeset.
In this one example, the MBNI re-mapped probeset has 9 transcripts for a single gene, but re-uses about 90% of the probes for each transcript. It's probably pretty similar for most probesets, so it's not clear if you will gain anything by using that pdInfoPackage
instead of the regular one.
Hi James, Thanks a lot for your reply.
The one-to-many mapping issue from Probeset IDs to Transcripts was exactly what I had ran into before asking this question. What if I go in another direction?
If I have a differentially expressed Exon/probeset (after running RMA on probeset level), is that substantial enough to claim that the transcript(s) this probeset maps to are also differentially expressed?
Or
If all the probesets, that map to a particular transcript, are differentially expressed, then the transcript is differentially expressed as well?
Thank you for your valuable support and time.
I don't see how either of your proposals are different from just doing
rma
at the transcript level. If you were to do that, then a significantly differentially expressed gene would also mean that the underlying transcripts were differentially expressed as well.My use case if focused on following:
Let's say Gene-A has 4 Transcripts: Transcript 1, Transcript 2, Transcript 3, Transcript 4. I'm interested in Genes that display the following behavior:
If I find a probe set that maps to Transcript 1 & Transcript 2 and doesn't map to Transcript 3 & Transcript 4, will I be able to infer the above-mentioned situation?
I really appreciate your time. Thanks a lot.
That's the whole idea behind the Clariom D (and HTA) arrays, that you can do differential transcript abundance (DTA). The arrays are actually pretty complex, having probes that span exon-exon junctions to infer different transcripts and stuff.
These arrays were supposed to be Affy's answer to RNA-Seq, but so far as I know, the only people who ever developed software to do DTA was Affy, with their Transcript Analysis Console. Which you could play around with.
A hypothetical alternative would be to assume that the 'core' summarization is at the exon level (it's not - it's at what Affy calls the probe set region (PSR), which may include an exon, or part of an exon, or IIRC more than one exon). But anyway. The limma package has a function called
diffSplice
that is actually intended for RNA-Seq analyses, but if you summarize at the core level and pretend it's exons, and if you supply those data todiffSplice
in the correct format, then maybe it's OK? I've never used that function, and like I said it's actually for RNA-Seq data, so caveat emptor and all that. But you might give it a look.