Question

Gene set Testing with Limma for Agilent IDs probes?

0

Entering edit mode

Bouzid.a ▴ 20

@bouzida-10697

Last seen 5.3 years ago

Hi all,

I'm studying Agilent Human Gene Expression one color microarray using Limma. It results significant probes with the topTable function but then I didn't succed to do the conversion of my identifiers with "ids2indices" function? nor later the gene set testing?

Please, find below my code, I'm new to Limma what am I doing wrong??
Thanks,

Best,

Amal

##

>require("TxDb.Hsapiens.UCSC.hg19.knownGene")

>humangenes<-genes(TxDb.Hsapiens.UCSC.hg19.knownGene)

or

>download.file("http://bioinf.wehi.edu.au/software/MSigDB/human_c2_v5p2.rdata","human_c2_v5p2.rdata")
>load("human_c2_v5p2.rdata")
>summary(Hs.c2)

>SystematicName.allgenes <- myresults$genes$SystematicName

#Agilent SystematicName look like:

> SystematicName.allgenes[(1:10)]
[1] "NM_004333" "NM_015987" "NM_024604" "NM_178829"
[5] "ENST00000429990" "NM_001040196" "NM_030961" "NM_033081"
[9] "NM_020188" "NM_012318"

>indices.genes <- ids2indices(humangenes$gene_id, SystematicName.allgenes,remove.empty=TRUE)

>indices.genes #named list()

>length(indices.genes) # it results: [1] 0

or

>genes.indices.Hs <- ids2indices(Hs.c2,SystematicName.allgenes)

>genes.indices.Hs #named list()

>length(genes.indices.Hs) # it results: [1] 0

## The same if I use the probe names as identifiers:

probenames.allgenes <- myresults$genes$ProbeName

#Agilent ProbeName look like:

> probenames.allgenes[(1:10)]
[1] "A_23_P42935" "A_23_P117082" "A_23_P2683" "A_23_P157316" "A_32_P14850"
[6] "A_23_P158596" "A_23_P350107" "A_23_P388190" "A_23_P106544" "A_23_P94998"

limma agilent microarrays gene set testing annotation • 2.3k views

ADD COMMENT • link 8.2 years ago Bouzid.a ▴ 20

score 2 · Answer 1 · 2017-01-18

Running summary on a list is probably not that great an idea, as you doubtless found out. Instead, it's preferable to first find the class of an object and then query from there.

> class(Hs.c2)
[1] "list"
> length(Hs.c2)
[1] 4729
> class(Hs.c2[[1]])
[1] "character"
> head(Hs.c2[[1]])
[1] "55902" "2645"  "5232"  "5230"  "5162"  "5160"

So Hs.c2 is a list of numbers in character format. That's not super useful in and of itself, but you could go to the site where you downloaded these data and see that these are character vectors of Entrez Gene IDs.

The help page says the second argument should be a 'character vector of gene identifiers' which you should interpret as 'a character vector of gene identifiers of the same type as in the gene set list you downloaded'. So you need the Entrez Gene IDs for each of your Agilent probes, in character format. This is where things get tricky, because you just have this stupid SystematicName column in the genes data.frame that is pretty useless. The array you are using is the Agilent-026652, if I am not mistaken, so you need the HsAgilentDesign026652.db package (which you can infer by the fact that, um, 026652 is the same for both the array and the annotation package, so you have to do a bit of Sherlock Holmes for that step).

You can now annotate your data, assuming that your normalized data are stored in an EList called 'myresults'

library(BiocInstaller)
biocLite("Hs.AgilentDesign026652.db")
library(Hs.AgilentDesign026652.db)
## add Entrez Gene IDs to your EList
myresults$genes$ENTREZID <- mapIds(Hs.AgilentDesign026652.db, myresults$genes$ProbeName, "ENTREZID","PROBENAME")

## fit the model
fit <- lmFit(myresults, design)
fit2 <- contrasts.fit(fit, contrast)
fit2 <- eBayes(fit2)
tt <- topTable(fit2, 1)

Now the topTable 'tt' has a column of Entrez Gene IDs.

ind <- ids2indices(Hs.c2, as.character(tt$ENTREZID))

score 0 · Answer 2 · 2017-01-20

0

Entering edit mode

Bouzid.a ▴ 20

@bouzida-10697

Last seen 5.3 years ago

Dear James,

Many thanks for your comments and explanation! It really helped me.

Absolutely, I'm using the Agilent-026652 arrays. I write to you a bit of my Sherlock Holmes for that step!

I added to your code:

Considering that "my probes" is a data farme of my resulted significant probes:

>require(HsAgilentDesign026652.db)
>mykeys = myprobes$ProbeName
>mykeys = as.vector(mykeys)
>myprobes$ENTREZID <- mapIds(HsAgilentDesign026652.db,  keys= mykeys , column = "ENTREZID", keytype = "PROBEID")
>probes_ZID <- ids2indices(Hs.c2, as.character(myprobes$ENTREZID)) # results #named list()
##as an alternative
>myENTREZID= myprobes[,12] 
>myind <- ids2indices(Hs.c2, myENTREZID)

##

But after that about 23% of my probes don't have an ENTREZID! It isn't a loss of information?

ADD COMMENT • link 8.2 years ago Bouzid.a ▴ 20

1

Entering edit mode

If you want to make a comment on my post, please use the ADD COMMENT button, rather than the 'Add your answer' box (which is intended for answers only).

Your question is whether or not this is a loss of information, and it isn't clear if it is. Not every sequence has an Entrez Gene ID, and there can be any number of reasons why not. There are lots of cDNA clones, hypothetical transcripts, etc that were put on the array as speculative content, and those things won't have an Entrez Gene ID until there is more evidence that they are real.

So one could argue that restricting to probes with linked Entrez Gene IDs increases the information by getting rid of noise probes that don't measure anything.

ADD REPLY • link 8.2 years ago James W. MacDonald 68k

0

Entering edit mode

Hello James,

Thanks for your quick reply.

So afterwards, it will be interesting to focus only on probes mapped to Enter Gene IDs without seeking explanations of non-mapping for the others!? Thanks,

Best,

ADD REPLY • link 8.2 years ago Bouzid.a ▴ 20

1

Entering edit mode

Oh, I don't know. It depends on how much work you want to put in. You could register for Agilent's eArray site, or more easily, just go to GEO and download the annotation table, which has the probe sequences. You could then get all the probe sequences that they don't have annotations for, and then blast against something (just human, human + mouse, nt?) to get the most likely alignment for that probe, and then add that back into your annotations.

Or you could take the sequences, convert to FASTA, and then use an aligner like subread, or bowtie, or whatever you prefer to align against GRCh38, then read the bam file into R, and use summarizeOverlaps to get the genes that each probe aligns to.

You could probably do other things as well. It just depends on how interested you are in seeking explanations for the non-mapping. In the end I imagine you won't gain many extra annotations by doing that, but you may gain experience in how one would do such a thing, which may be enough of a gain for you to do it anyway.