Search
Question: Retriving Annotation for HTA2.0 using biomRt
0
14 months ago by
Seymoo0
Oslo
Seymoo0 wrote:

I am having problem with biomRt when I want to retrieve annotation for prob IDs from Affymetrix HTA2.0.

My code is as follow;

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
filters <- listFilters(ensembl)
attributes <- listAttributes(ensembl)

genes <- getBM(attributes = c("affy_hta_2_0", "ensembl_gene_id","ensembl_transcript_id","external_gene_name","entrezgene", "gene_biotype","transcript_biotype", "go_id","strand","chromosome_name","status","transcript_status","transcript_length","description"),
filters = "with_affy_hta_2_0",
values = '2824546_st',
mart = ensembl)

##

biomaRt error:  with_affy_hta_2_0  is a boolean filter and needs a corresponding logical value of TRUE or FALSE to indicate if the query should retrieve all data that fulfill the boolean or alternatively that all data that not fulfill the requirement should be retrieved.

I also tried to set the filter as "affy_hta_2_0" but failed to retrieve the annotation.

Where should I put logical value to fix boolean filter issue?

thanks

modified 14 months ago by Thomas Maurel740 • written 14 months ago by Seymoo0
0
14 months ago by
Thomas Maurel740
United Kingdom
Thomas Maurel740 wrote:

Hello,

You can either retrieve all the Affymetrix HTA2.0 mapped to the Ensembl Transcript IDs using "with_affy_hta_2_0" as follow:

genes <- getBM(attributes = c("affy_hta_2_0", "ensembl_gene_id","ensembl_transcript_id","external_gene_name","entrezgene", "gene_biotype","transcript_biotype", "go_id","strand","chromosome_name","status","transcript_status","transcript_length","description"),
filters = "with_affy_hta_2_0",
values = TRUE,
mart = ensembl)

Since this is quite a lot of data, I suggest filtering one chromosome at a time, e.g:

 genes <- getBM(attributes = c("affy_hta_2_0", "ensembl_gene_id","ensembl_transcript_id","external_gene_name","entrezgene", "gene_biotype","transcript_biotype", "go_id","strand","chromosome_name","status","transcript_status","transcript_length","description"),
filters = c("with_affy_hta_2_0","chromosome_name"),
list(with_affy_hta_2_0=TRUE,chromosome_name="1"),
mart = ensembl)

The other option is to give a list of probe IDs to the affy_hta_2_0 filter, e.g:

genes <- getBM(attributes = c("affy_hta_2_0", "ensembl_gene_id","ensembl_transcript_id","external_gene_name","entrezgene", "gene_biotype","transcript_biotype", "go_id","strand","chromosome_name","status","transcript_status","transcript_length","description"),
filters = "affy_hta_2_0",
values = 'TC01003833.hg',
mart = ensembl)

Hope this helps,

Kind Regards,

Thomas

Thanks you very much for your help Thomas. However, when I RMA normalize all loaded CEL files into R by oligo::RMA(CEL_Files), the output is an expressionSet with affymetrix probe IDs like 2824546_st. So when I set the values = row.names(expressionSet) I get the above message.

But If I run your first or second codes, then how can I annotate my expressionSet ?

I might be remembering incorrectly, but I'm pretty sure both 2824546_st and TC01003833.hg are probeset IDs.  They have very different looking names, because the former is a negative control probe that you don't expect to map to anything in the genome, while the latter is designed to target a specific transcript cluster.

If you take Thomas's third block of code, and use values = row.names(expressionSet) you should not get the error message, but should get some results.  Note that in that piece of code he uses filters = "affy_hta_2_0" not filters = "with_affy_hta_2_0".  These look really similar but are used in opposite ways.

• filters = "affy_hta_2_0" expects you to provide the probe IDs you're interested in, and getBM will return any gene (or transcript) that is targeted by those particular probes.
• filters = "with_affy_hta_2_0" restricts your results to only genes (or transcripts) that are targeted by the affy HTA 2.0 platform, but by default will give you all of them.  You can combine with other attributes in Ensembl (like chromosome, GC content, etc) to filter the list further - as Thomas demonstrated in his second code block.

The error message above really doesn't make that clear to me, so I'll have a look at making the explanation of what that type of filter does a bit better.

The code below demonstrates what you want to do with a list of three probe IDs and a smaller list of attributes (just to make it easier to see here).

genes <- getBM(attributes = c("affy_hta_2_0",
"ensembl_gene_id",
"external_gene_name"),
filters = "affy_hta_2_0",
values = c('TC01003833.hg', '2824546_st', 'TC01000020.hg'),
mart = ensembl)

And the result has only 2 entries, since 2824546_st doesn't target anything.

> genes
affy_hta_2_0 ensembl_gene_id external_gene_name
1 TC01003833.hg ENSG00000136628               EPRS
2 TC01000020.hg ENSG00000187634             SAMD11

Hi Mike.
When I set filters = "affy_hta_2_0" I dont get any error but the output is 0 observation. but in BiomRt docs for Affy IDs like "2824546_st" filters = "with_affy_hta_2_0" has been exampled. I might be wrong but maybe that's why from the example you gave me above you just got 2 entries and that's exactly for 2824546_st.

Thanks for the help though.

Can you point me to where in the biomaRt documentation the with_affy_hta_2_0 example is?  I just want to check what it says, because it should always throw the error you've received if you try to give any value other than TRUE or FALSE.

Also, could you give a few more examples of the probeset IDs you have in your expression set so I can try to understand why you aren't getting any results.  The following code should select 10 at random:

sample(x = row.names(expressionSet), size = 10)
 Here are some more prob IDS Mike

[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st" "2827995_st" "2827996_st" "2828010_st" "2828012_st"
[10] "2835442_st" "2835447_st" "2835453_st" "2835456_st" "2835459_st" "2835461_st" "2839509_st" "2839511_st" "2839513_st"
[19] "2839515_st" "2839517_st"
1

Mike already told you this, but maybe not explicitly enough. Anything on the HTA array with a probeset ID that has a number, followed by _st is a control probeset, and will not have any annotation. Because, it's a control probeset, not something that measures a gene! The only things that should return any annotation data are the probesets that start with TC, have some numbers, and end with a .hg.