Question

How to Annotate my Clariom D Human Array?

jiamacro • 0


I am analyzing my Clariom D Human WT Assay data, which measures the expression of mRNA, lncRNA, miRNA and circRNA at the same time. The official .csv annotation file combines information from many different sources, such as RefSeq, Ensembl, AceView and lncRNAWiki, which makes it too complex for me to extract the gene IDs. I previously used the clariomdhumantranscriptcluster.db package from James W. MacDonald; however, many probesets were annotated as NA, and after I removed those, only about 25,000 matched probesets remained. Besides, I did a DEG analysis of the whole expression matrix, got only about 1,000 DEGs, and found almost nothing enriched in the subsequent enrichment analysis. Emily from the Ensembl team told me there is no annotation package for my assay.

So I have two questions. 1) Does the clariomdhumantranscriptcluster.db package still work? If not, how can I extract the gene IDs from the official annotation file? 2) Should I annotate first and then do the DEG analysis separately for each type of RNA? So far I have done the DEG analysis on the whole expression matrix before annotation and found almost nothing significant. I have also read papers in which the authors separated lncRNA and mRNA first, ran the DEG analysis on each, and got good results. Since I am new to this, I am not sure how to approach my DEG analysis. Should I analyze the different RNA types separately?

> ann.df <- read.csv("Clariom_D_Human.r1.na36.hg38.a1.transcript.csv",
+                    header = TRUE,
+                    sep = ",",
+                    dec = ".",
+                    fill = TRUE,
+                    comment.char = "#"
+ )
> head(ann.df, 1)
  transcript_cluster_id       probeset_id seqname strand start  stop total_probes
1     TC0100006432.hg.1 TC0100006432.hg.1    chr1      + 11869 14412           10
  gene_assignment
1 NR_046018 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// OTTHUMT00000002844 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102 /// OTTHUMT00000362751 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102
  mrna_assignment
1 NR_046018 // RefSeq // Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 (DDX11L1), non-coding RNA. // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000002844 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// OTTHUMT00000362751 // Havana transcript // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1[gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000450305 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:transcribed_unprocessed_pseudogene] // chr1 // 100 // 100 // 0 // --- // 0 /// ENST00000456328 // ENSEMBL // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 [gene_biotype:transcribed_unprocessed_pseudogene transcript_biotype:processed_transcript] // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000001 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000002 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000003 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // lncRNAWiki // Non-coding transcript identified by NONCODE // chr1 // 100 // 100 // 0 // --- // 0 /// NONHSAT000004 // NONCODE // Non-coding transcript identified by NONCODE: Linc // chr1 // 100 
// 100 // 0 // --- // 0
  swissprot
1 NR_046018 // B7ZGX0 /// NR_046018 // B7ZGX2 /// NR_046018 // B7ZGX7 /// NR_046018 // B7ZGX8 /// OTTHUMT00000002844 // B7ZGX0 /// OTTHUMT00000002844 // B7ZGX2 /// OTTHUMT00000002844 // B7ZGX7 /// OTTHUMT00000002844 // B7ZGX8 /// OTTHUMT00000362751 // B7ZGX0 /// OTTHUMT00000362751 // B7ZGX2 /// OTTHUMT00000362751 // B7ZGX7 /// OTTHUMT00000362751 // B7ZGX8 /// ENST00000450305 // B7ZGX0 /// ENST00000450305 // B7ZGX2 /// ENST00000450305 // B7ZGX7 /// ENST00000450305 // B7ZGX8 /// ENST00000450305 // B4E2Z4 /// ENST00000450305 // B7ZGW9 /// ENST00000450305 // Q6ZU42 /// ENST00000450305 // B7ZGX3 /// ENST00000450305 // B5WYT6 /// ENST00000456328 // B7ZGX0 /// ENST00000456328 // B7ZGX2 /// ENST00000456328 // B7ZGX7 /// ENST00000456328 // B7ZGX8 /// ENST00000456328 // B4E2Z4 /// ENST00000456328 // B7ZGW9 /// ENST00000456328 // Q6ZU42 /// ENST00000456328 // B7ZGX3 /// ENST00000456328 // B5WYT6
  unigene
1 NR_046018 // Hs.714157 // testis| normal| adult /// OTTHUMT00000002844 // Hs.714157 // testis| normal| adult /// OTTHUMT00000362751 // Hs.714157 // testis| normal| adult /// ENST00000450305 // Hs.719844 // brain| testis| normal /// ENST00000450305 // Hs.714157 // testis| normal| adult /// ENST00000450305 // Hs.740212 // --- /// ENST00000450305 // Hs.712940 // bladder| bone marrow| brain| embryonic tissue| intestine| mammary gland| muscle| pharynx| placenta| prostate| skin| spleen| stomach| testis| thymus| breast (mammary gland) tumor| gastrointestinal tumor| glioma| non-neoplasia| normal| prostate cancer| skin tumor| soft tissue/muscle tissue tumor|embryoid body| adult /// ENST00000456328 // Hs.719844 // brain| testis| normal /// ENST00000456328 // Hs.714157 // testis| normal| adult /// ENST00000456328 // Hs.740212 // --- /// ENST00000456328 // Hs.712940 // bladder| bone marrow| brain| embryonic tissue| intestine| mammary gland| muscle| pharynx| placenta| prostate| skin| spleen| stomach| testis| thymus| breast (mammary gland) tumor| gastrointestinal tumor| glioma| non-neoplasia| normal| prostate cancer| skin tumor| soft tissue/muscle tissue tumor|embryoid body| adult
  GO_biological_process
1 ENST00000450305 // GO:0006139 // nucleobase-containing compound metabolic process // inferred from electronic annotation  /// ENST00000456328 // GO:0006139 // nucleobase-containing compound metabolic process // inferred from electronic annotation 
  GO_cellular_component
1                   ---
  GO_molecular_function
1 ENST00000450305 // GO:0003676 // nucleic acid binding // inferred from electronic annotation  /// ENST00000450305 // GO:0005524 // ATP binding // inferred from electronic annotation  /// ENST00000450305 // GO:0008026 // ATP-dependent helicase activity // inferred from electronic annotation  /// ENST00000450305 // GO:0016818 // hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides // inferred from electronic annotation  /// ENST00000456328 // GO:0003676 // nucleic acid binding // inferred from electronic annotation  /// ENST00000456328 // GO:0005524 // ATP binding // inferred from electronic annotation  /// ENST00000456328 // GO:0008026 // ATP-dependent helicase activity // inferred from electronic annotation  /// ENST00000456328 // GO:0016818 // hydrolase activity, acting on acid anhydrides, in phosphorus-containing anhydrides // inferred from electronic annotation 
  pathway protein_domains category       locus.type notes Best_Coverage_TaqMan_Assay
1     ---             ---     main Multiple_Complex   ---   TaqMan Probe Unavailable
  Best_Coverage_TaqMan_Assay_HTML
1        TaqMan Probe Unavailable

ClariomDHuman Annotation 3 groups • 1.7k views

Updated 4.5 years ago by James W. MacDonald • written 4.5 years ago by jiamacro

score 0 · Answer 1 · 2019-11-04

The clariomdhumantranscriptcluster.db package still 'works', where by 'works' I mean 'contains the data that were supplied by Affymetrix for that array'. I use the csv file you mention to generate that package, and used the na36 version (which has been static for many years now), so unless Affy is updating things without incrementing the file number, the package still reflects what they supply.
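
If you do want to pull gene IDs straight out of the csv, the following is a minimal sketch in base R. It assumes the layout visible in the question's output: records in gene_assignment separated by ' /// ', fields within a record separated by ' // ', the gene symbol in field 2 and the Entrez Gene ID in field 5. It also assumes that unannotated probesets carry '---' in that column. parse_gene_assignment() is a hypothetical helper written for illustration, not a function from any package.

```r
# Example value: the first two records of the gene_assignment entry
# shown in the question's output.
gene_assignment <- paste(
  "NR_046018 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102",
  "OTTHUMT00000002844 // DDX11L1 // DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 // 1p36.33 // 100287102",
  sep = " /// "
)

# Hypothetical helper: return the unique (symbol, Entrez ID) pairs
# found in one gene_assignment string.
parse_gene_assignment <- function(x) {
  # Assumption: "---" (or NA) marks a probeset without any gene annotation.
  if (is.na(x) || x == "---") {
    return(data.frame(symbol = NA_character_, entrez = NA_character_,
                      stringsAsFactors = FALSE))
  }
  # Split into records, then split each record into its fields.
  records <- strsplit(x, " /// ", fixed = TRUE)[[1]]
  fields <- strsplit(records, " // ", fixed = TRUE)
  # Keep only records that have all five expected fields.
  fields <- fields[lengths(fields) >= 5]
  out <- data.frame(
    symbol = vapply(fields, `[`, character(1), 2),
    entrez = vapply(fields, `[`, character(1), 5),
    stringsAsFactors = FALSE
  )
  unique(out)
}

res <- parse_gene_assignment(gene_assignment)
res
```

Applied over the whole column (for example with lapply), this gives one small table per transcript cluster. Clusters that map to several genes return several rows, so you would still need to decide how to handle those.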

I am not sure what Emily from Ensembl meant by that, unless she was saying that Ensembl hasn't added that array as a filter for BioMart (and consequently the biomaRt package). The array is technically annotated, by which I mean Affymetrix has aligned the probes against the genome and told us what they found.

The Clariom D array is intended to have what Affy calls 'deep content', much of which unfortunately isn't annotated, which is why only ~26.5K of the ~139K probesets on that array have a Gene ID. There are probably many reasons for that, but the primary one is that Affy put a bunch of speculative content on the array, and it remains speculative to this day.

The MBNI group at Michigan does a re-mapping of the probesets in which they simply pretend Affy never did any annotating: they take all the probes on the array, align them to the genome, and throw out every probe that doesn't align to a unique genomic position. They then take the remaining probes, keep those that align to a known gene location, and put them in new probesets, one probeset per gene. Of the 8.1M probes on this array they throw out 5.7M, and use only the remaining 29% to do that.
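
The idea behind that filtering can be illustrated on a toy table. This is only a sketch of the logic as described here, not the MBNI pipeline itself; the `hits` table, the probe names and the gene labels are all made up for the example.

```r
# Toy alignment table: one row per (probe, genomic hit).
# All values are invented; gene is NA when the hit is outside any known gene.
hits <- data.frame(
  probe = c("p1", "p2", "p2", "p3", "p4", "p5"),
  gene  = c("geneA", "geneA", "geneB", "geneB", NA, "geneB"),
  stringsAsFactors = FALSE
)

# Step 1: keep only probes that align to exactly one genomic position.
n_hits <- table(hits$probe)
unique_probes <- names(n_hits)[n_hits == 1]
kept <- hits[hits$probe %in% unique_probes, ]

# Step 2: keep only probes whose single hit falls in a known gene.
kept <- kept[!is.na(kept$gene), ]

# Step 3: build one probeset per gene from the surviving probes.
probesets <- split(kept$probe, kept$gene)
probesets

# Fraction of the original probes that ends up in a probeset.
retained <- length(unlist(probesets)) / length(unique(hits$probe))
retained
```

Here p2 is dropped because it aligns twice and p4 because it hits no known gene, which leaves geneA with p1 and geneB with p3 and p5.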

So an alternative would be to Google 'MBNI custom cdf', download the current pdInfo and annotation packages (pd.clariomdhuman.hs.entrezg and clariomdhumanhsentrezg.db, respectively) and use them instead of the Affy versions. With the MBNI re-mapped data you get more genes:

> length(keys(clariomdhumanhsentrezg.db))
[1] 40899
> length(keys(clariomdhumantranscriptcluster.db, "ENTREZID"))
[1] 26469

That may be the better option.