How to do annotation with Limma
1
0
Entering edit mode
1gagoogag1 • 0
@4fd86e1d
Last seen 3.4 years ago
Turkey

Hey everyone, I'm trying to make an annotation for a specific dataset I have. I'm not sure if this helps but this is the platform of the dataset according to NCBI:

GPL17586    [HTA-2_0] Affymetrix Human Transcriptome Array 2.0 [transcript (gene) version]

I've tried the following example:

BiocManager::install("hgu95av2.db")
library(hgu95av2.db)
select(hgu95av2.db, c("1007_s_at","1053_at"), c("SYMBOL","ENTREZID","GENENAME"))

And this is the output:

> select(hgu95av2.db, c("1007_s_at","1053_at"), c("SYMBOL","ENTREZID","GENENAME"))
'select()' returned 1:many mapping between keys and columns
    PROBEID  SYMBOL  ENTREZID                                    GENENAME
1 1007_s_at    DDR1       780 discoidin domain receptor tyrosine kinase 1
2 1007_s_at MIR4640 100616237                               microRNA 4640
3   1053_at    RFC2      5982              replication factor C subunit 2

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

The output looks good, but it's not very clear to me how I can implement this for my own dataset. Am I supposed to make my own select function that works with my dataset? Or is there a ready function? Any help is appreciated!

annotation limma • 2.5k views
ADD COMMENT
1
Entering edit mode
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Last seen 18 hours ago
Wageningen University, Wageningen, the …

For the HTA2.0 arrays should not use hgu95av2.db, which is the annotation library for the, ehh, human U95 set of Affymetrix arrays! You should rather use the hta20transcriptcluster.db. Link here. After modifying this, and provided you use the correct (thus HTA2.0) probeset ids, your code above should work.

Also check this thread that links so several very relevant posts on the support site dealing with the difference between the analyses of the HTA2.0 array on the level of probesets or transcripts! How to annotate the Affymetrix data HTA-2.0 with oligo

BTW: the function annotateEset from the package affycoretools (link) allows to easy annotate your (normalized) data object; if you do annotate the data set before 'transferring' it to limma, then these annotation will be automagically transferred to the limma object as well.

Some example code to get you started (I downloaded 6 CEL files (= raw data) from a random GEO submission (GSE54937). After extracting the files in the GEO archive GSE54937_RAW.tar into the folder GSE54937_RAW:

library(oligo)
library(affycoretools)
library(hta20transcriptcluster.db)

#set path
path <- './GSE54937_RAW'

#load and normalize (RMA) the data
raw.affy.data <- read.celfiles(filenames = list.celfiles(path,  full.names=TRUE, listGzipped=TRUE) )
normalized.data <- rma(raw.affy.data, target = "core")

# add annotation data
library(hta20transcriptcluster.db)
normalized.data <- annotateEset(normalized.data,  hta20transcriptcluster.db)

# check
fData(normalized.data)[5000:5010,]
                        PROBEID ENTREZID    SYMBOL
TC01002046.hg.1 TC01002046.hg.1     <NA>      <NA>
TC01002047.hg.1 TC01002047.hg.1     <NA>      <NA>
TC01002048.hg.1 TC01002048.hg.1     <NA>      <NA>
TC01002049.hg.1 TC01002049.hg.1   267002     PGBD2
TC01002050.hg.1 TC01002050.hg.1   653635    WASH7P
TC01002051.hg.1 TC01002051.hg.1   645520   FAM138A
TC01002052.hg.1 TC01002052.hg.1     <NA>      <NA>
TC01002053.hg.1 TC01002053.hg.1   729737 LOC729737
TC01002054.hg.1 TC01002054.hg.1     <NA>      <NA>
TC01002055.hg.1 TC01002055.hg.1     <NA>      <NA>
TC01002056.hg.1 TC01002056.hg.1     <NA>      <NA>
                                                    GENENAME
TC01002046.hg.1                                         <NA>
TC01002047.hg.1                                         <NA>
TC01002048.hg.1                                         <NA>
TC01002049.hg.1      piggyBac transposable element derived 2
TC01002050.hg.1            WASP family homolog 7, pseudogene
TC01002051.hg.1 family with sequence similarity 138 member A
TC01002052.hg.1                                         <NA>
TC01002053.hg.1                    uncharacterized LOC729737
TC01002054.hg.1                                         <NA>
TC01002055.hg.1                                         <NA>
TC01002056.hg.1                                         <NA>

Be sure the check the help page for the function annotateEset (by typing ?annotateEset) for more info which annotation info can be retrieved.

ADD COMMENT
0
Entering edit mode

Thank you so much for your explanation! I have a question about how can I get the probeIDs and are they different from the probe set ID?

ADD REPLY
1
Entering edit mode

The probe IDs are different from the probeset IDs, and are usually not of interest. Each probeset is made up of 4 or more probes (depending on the level of summarization you are using), and they are summarized to estimate the overall PSR or transcript expression level.

The IDs can be extracted from the pmfeature table of the pd.hta.2.0 package:

> library(pd.hta.2.0)
> con <- db(pd.hta.2.0)
> dbGetQuery(con, "select fid, fsetid from pmfeature limit 20;")
       fid   fsetid
1     9555 18670000
2  6121713 18670001
3   725124 18670001
4  5156822 18670002
5  1997636 18670002
6  2794445 18670002
7  4143624 18670003
8   933907 18670003
9  4032789 18670003
10 6177676 18670004
11 4447448 18670004
12 5967856 18670004
13 4242702 18670005
14 5894560 18670005
15 5868378 18670005
16  777354 18670006
17  858824 18670006
18 2893709 18670006
19  106094 18670007
20 3599493 18670008

## that gives the mapping between PSR probeset (roughly exon) and probe ID (fid) if you want transcript, it's 

> dbGetQuery(con, "select fid, meta_fsetid as fsetid from pmfeature INNER JOIN core_mps USING(fsetid) limit 20;")
       fid   fsetid
1  4242702 18670005
2  5894560 18670005
3  5868378 18670005
4   106094 18670007
5  3358611 18670009
6  3813084 18670009
7  1225389 18670011
8  1631221 18670011
9  2586803 18670020
10 6674782 18670020
11 4597575 18670022
12  589772 18670023
13 5685963 18670023
14 2916466 18670027
15 2976905 18670027
16 4521670 18670028
17 1818803 18670028
18  699743 18670032
19  558838 18670032
20  257320 18670033

## or if you want a particular probeset

> dbGetQuery(con, "select fid, meta_fsetid as fsetid from pmfeature INNER JOIN core_mps USING(fsetid) where meta_fsetid='TC01000001.hg.1';")
       fid          fsetid
1  5481205 TC01000001.hg.1
2   118563 TC01000001.hg.1
3  2718198 TC01000001.hg.1
4  5650209 TC01000001.hg.1
5  5007080 TC01000001.hg.1
6  2871812 TC01000001.hg.1
7  1971373 TC01000001.hg.1
8  2614617 TC01000001.hg.1
9  5734781 TC01000001.hg.1
10 1503144 TC01000001.hg.1
11 5007561 TC01000001.hg.1
12 3706525 TC01000001.hg.1
13 1613613 TC01000001.hg.1
14 5066374 TC01000001.hg.1
15 6548458 TC01000001.hg.1
16 6730778 TC01000001.hg.1
17 6682814 TC01000001.hg.1
18 6179002 TC01000001.hg.1
19 5737004 TC01000001.hg.1
20  415419 TC01000001.hg.1
21 3832073 TC01000001.hg.1
22  973055 TC01000001.hg.1
23 3650905 TC01000001.hg.1
24 3846991 TC01000001.hg.1
25 1234142 TC01000001.hg.1
26 1652877 TC01000001.hg.1
27  594304 TC01000001.hg.1
28 5005929 TC01000001.hg.1
29 3218452 TC01000001.hg.1
30  337901 TC01000001.hg.1
31 1233321 TC01000001.hg.1
32 6764571 TC01000001.hg.1
33 1032165 TC01000001.hg.1
34 1804287 TC01000001.hg.1
35 2325148 TC01000001.hg.1
36 2266560 TC01000001.hg.1
37 4174932 TC01000001.hg.1
38 5311659 TC01000001.hg.1
39 6415840 TC01000001.hg.1
40 3162554 TC01000001.hg.1
41 2700895 TC01000001.hg.1
42  224878 TC01000001.hg.1
43 4727501 TC01000001.hg.1
44 2123889 TC01000001.hg.1
45 6618713 TC01000001.hg.1
46  784417 TC01000001.hg.1
47 5458432 TC01000001.hg.1
48 6107053 TC01000001.hg.1
49 1221952 TC01000001.hg.1
ADD REPLY
1
Entering edit mode

I should also point out that the fid isn't really the probe ID. For this array (and IIRC all of the random-primer arrays), there really isn't a probe ID, but instead it's just the index position of the probe when you read the array. So for example, the first probe ID for TC01000001.hg.1 is 5481205, and if you were to get the raw data for an array, it would be the 5481205th value (the data from an Affy array is just a vector of numbers).

So for a given PSR, say 18670006, the fids are

> dbGetQuery(con, "select fid, fsetid from pmfeature where fsetid='18670006';")
      fid   fsetid
1  777354 18670006
2  858824 18670006
3 2893709 18670006

and to summarize we would take the three values at those positions in the matrix of array data and run through the RMA algorithm, which is what rma does for you, but for all of the probesets instead of just one.

ADD REPLY
1
Entering edit mode

In addition to James's answer (maybe this was actually confusing you): as far as I know all annotation libraries use the term "PROBEID" to refer to the identifier given by the array manufacturer to a (set of) DNA sequences (=probes) that is/are used to measure the expression of a transcript. FYI: The Affymetrix platform was designed such way that a set of individual probes is used to estimate the expression of a probeSET, that in turn reflects the expression of a transcript. In contrast, on Agilent and Illumina arrays only a single DNA probe is used to measure the expression of a transcript. In other words, Agilent and Illumina arrays don't make use of SETS of probes, and hence the DNA sequences on the glass surface are [just] called probes. The length of the Affymetrix probes are relatively short (~25 nt; hence their rationale to combine multiple probes for a reliable probeset measurement), but the Agilent (60 nt) and Illumina (50 nt) probes are much longer.

ADD REPLY
1
Entering edit mode

Oh, wait. That's probably a more accurate interpretation of the OP's question... Thanks Guido!

ADD REPLY
0
Entering edit mode

Thank you!

ADD REPLY
0
Entering edit mode

Thank you for your help!

ADD REPLY
0
Entering edit mode

I was wondering if you can explain how to analyze the results from the gProfiler

ADD REPLY
0
Entering edit mode

I don't have experience with g:Profiler myselves, but if I recall correctly it is a website that can be used for functional analyses of lists of genes. Please note that this is not (!) a Bioconducor package, so I believe this question is better asked at e.g. Biostars. Yet, if you would like to perform your analyses in R, you may want to check the CRAN package gProfiler2 (link). I also know of the Cytoscape plugin EnrichmentMap that can be used for visualization of g:Profiler results; check for example this Nature Protocols paper (here) to get you started with that. Good luck!

ADD REPLY

Login before adding your answer.

Traffic: 666 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6