Question

Summarize Microarray data on a transcript level

0

Entering edit mode

mailtoshivam13 • 0

@a3a9257a

Last seen 5 months ago

United States

Hello Everyone,

I'm analyzing Affymetrix data from Clariom D human Gene chip assays. I understand that there are two ways of look at the data: from a gene level and from a probe set level.

Using the RMA algorithm, I can summarize the exon data on a gene level:

probeset.eset=rma(dat, target="core")

or on an probe set level

probeset.eset=rma(dat, target="probeset")

But how do I get the expression data on a transcript (mRNA Isoforms) level?

I understand the probe sets map to multiple exon ranges, and so they can target exons from multiple transcripts (but not all) of the same gene.

Is there a way to summarize Microarray data on a transcript level?

Thank you for your time!

pd.clariom.d.human oligo clariomdhumanprobeset.db • 1.7k views

ADD COMMENT • link updated 3.3 years ago by James W. MacDonald 65k • written 3.3 years ago by mailtoshivam13 • 0

score 1 · Answer 1 · 2020-12-21

The 'core' level is +/- the transcript level, or as close as you can easily get to that. That said, it's not really that close. For example,

> select(clariomdhumantranscriptcluster.db, "TC0100009406.hg.1", "REFSEQ")
'select()' returned 1:many mapping between keys and columns
             PROBEID       REFSEQ
1  TC0100009406.hg.1 NM_001040623
2  TC0100009406.hg.1 NM_001258001
3  TC0100009406.hg.1 NM_001258002
4  TC0100009406.hg.1 NM_001258003
5  TC0100009406.hg.1 NM_001258004
6  TC0100009406.hg.1 NM_001258005
7  TC0100009406.hg.1    NM_021797
8  TC0100009406.hg.1    NM_201653
9  TC0100009406.hg.1 NP_001035713
10 TC0100009406.hg.1 NP_001244930
11 TC0100009406.hg.1 NP_001244931
12 TC0100009406.hg.1 NP_001244932
13 TC0100009406.hg.1 NP_001244933
14 TC0100009406.hg.1 NP_001244934
15 TC0100009406.hg.1    NP_068569
16 TC0100009406.hg.1    NP_970615
17 TC0100009406.hg.1 XM_006710577
18 TC0100009406.hg.1 XM_017001047
19 TC0100009406.hg.1 XM_017001048
20 TC0100009406.hg.1 XP_006710640
21 TC0100009406.hg.1 XP_016856536
22 TC0100009406.hg.1 XP_016856537
> select(clariomdhumantranscriptcluster.db, "TC0100009406.hg.1", "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
            PROBEID ENTREZID
1 TC0100009406.hg.1    27159

So the claim from Affy is that this probeset interrogates 22 different transcripts (which it may!), from one gene.

You could use the MBNI re-mapped pdInfoPackage for this array that is based on say Ensembl Transcript IDs, but probably there is still lots of overlap. For example,

> install.packages("http://mbni.org/customcdf/24.0.0/enst.download/pd.clariomdhuman.hs.enst_24.0.0.tar.gz", repos = NULL, type = "source")
## Need type = "source" as I am on Windows here
> library(pd.clariomdhuman.hs.enst)
Loading required package: RSQLite
Loading required package: DBI
Warning message:
package 'RSQLite' was built under R version 4.0.3 
## we'll do some direct SQL queries
> con <- db(pd.clariomdhuman.hs.enst)
> library(DBI)
> dbListTables(con)
[1] "featureSet1" "mps1pm"      "pmfeature"   "table_info" 
## You can check for what is in these tables thusly
> dbGetQuery(con, "select * from featureset1 limit 3;")
  fsetid            man_fsetid       type direction
1      1 ENST00000000233.10_at expression antisense
2      2  ENST00000000412.8_at expression antisense
3      3 ENST00000000442.11_at expression antisense
## get an example set of transcripts, using the gene from above. 
## First we need the corresponding Ensembl Gene ID to query the Biomart
> select(org.Hs.eg.db, "27159", "ENSEMBL")
'select()' returned 1:1 mapping between keys and columns
  ENTREZID         ENSEMBL
1    27159 ENSG00000134216
## so that's the Ensembl Gene ID we want, now get Transcript IDs
> library(biomaRt)
> mart <- useEnsembl("ensembl","hsapiens_gene_ensembl")
> z <- getBM("ensembl_transcript_id_version", "ensembl_gene_id", "ENSG00000134216", mart)
## staple an _at on the end to match the MBNI IDs
> z <- paste0(z[,1], "_at")
> z
 [1] "ENST00000422815.5_at"  "ENST00000483391.5_at"  "ENST00000369740.6_at" 
 [4] "ENST00000352594.10_at" "ENST00000477918.6_at"  "ENST00000353665.10_at"
 [7] "ENST00000489524.5_at"  "ENST00000430615.1_at"  "ENST00000451398.6_at" 
[10] "ENST00000343320.10_at"
## So those are our transcript IDs; let's do a query
> zz <- dbGetQuery(con, paste0("select fsetid, man_fsetid, fid from featureset1 inner join mps1pm using(fsetid) where man_fsetid in ('", paste(z, collapse = "','"), "');"))
> head(zz)
  fsetid            man_fsetid    fid
1  14198 ENST00000343320.10_at  28005
2  14198 ENST00000343320.10_at  68436
3  14198 ENST00000343320.10_at  80011
4  14198 ENST00000343320.10_at  88045
5  14198 ENST00000343320.10_at 130690
6  14198 ENST00000343320.10_at 139347
> dim(zz)
[1] 941   3
> zzz <- split(zz[,-1], zz[,1])
> sapply(zzz, nrow)
14198 15333 15436 20293 41209 45482 56686 75939 80547 85688 
  109   101    88   116    86    91    92   107    79    72 
## so about 70-100 probes per probeset. What's the overlap between them?
> sapply(2:10, function(x) sum(zzz[[1]][,2] %in% zzz[[x]][,2])/nrow(zzz[[x]]))
[1] 0.8217822 0.7954545 0.9224138 0.8953488 0.9560440 0.9130435 0.7383178
[8] 0.8354430 0.8888889
## so about 80-90% overlap in the probe usage for each probeset.

In this one example, the MBNI re-mapped probeset has 9 transcripts for a single gene, but re-uses about 90% of the probes for each transcript. It's probably pretty similar for most probesets, so it's not clear if you will gain anything by using that pdInfoPackage instead of the regular one.