Question

HTA 2.0 ST array annotation

0

Entering edit mode

sylvia ▴ 10

@sylvia-5630

Last seen 6.5 years ago

Hi,

I'm currently working on HTA 2.0 ST array and would like to do annotation through hta20sttranscriptcluster.db

But I realized the probe ID hta20sttranscriptcluster.db is not the same as the ID in the expression dataset:

> head(featureNames(dataset$eset))
[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st"
[6] "2827995_st"

> head(keys(hta20sttranscriptcluster.db, keytype = "PROBEID"))
[1] "18670000" "18670001" "18670002" "18670003" "18670004" "18670005"

I was wondering if there's a way to convert between the two IDs in hta20sttranscriptcluster.db and expression dataset?

Thank you!

Sylvia

hta20sttranscriptcluster.db oligo • 2.4k views

ADD COMMENT • link updated 8.5 years ago by James W. MacDonald 66k • written 8.5 years ago by sylvia ▴ 10

score 0 · Answer 1 · 2016-01-22

0

Entering edit mode

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 5 hours ago

United States

Why do you assume that the first 6 probesets in your ExpressionSet will be the same as the first 6 keys for the annotation package? That seems like a pretty strong assumption to me.

> sum(head(featureNames(eset)) %in% keys(hta20sttranscriptcluster.db))
[1] 6
> sum(featureNames(eset) %in% keys(hta20sttranscriptcluster.db))
[1] 67480
> dim(eset)
Features  Samples
   67480        8
> length(keys(hta20sttranscriptcluster.db))
[1] 76809

ADD COMMENT • link 8.5 years ago James W. MacDonald 66k

0

Entering edit mode

Hi James,

I just have a follow up question, I was wondering what build of the genome was the annotation based on. Is it hg18 or hg19?

Best,

Sylvia

ADD REPLY • link 8.5 years ago sylvia ▴ 10

0

Entering edit mode

Annotation data are, for the most part, independent of the genome build. In other words, most data that would normally be considered annotation (various gene, protein or transcript IDs, GO IDs, gene names and symbols, pathway annotations, etc) have more to do with either the function or sequence of the gene or its transcripts and resulting proteins than its genomic location.

For example, if the assigned genomic location for BRCA1 were to move 1 Kb upstream between the hg18 and hg19 builds, the only thing different about that gene would be the chromosomal location data, which are in general better to get from a TxDb object than a ChipDb object anyway. It would still have the same symbol, gene name, Gene ID, Ensembl ID, etc.

In addition, annotation data are not released in build format. Most of the annotation databases at NCBI (and presumably at Ensembl as well) are updated on a weekly basis, so they are in a state of constant flux. We 'freeze' the annotation at six month intervals when we release a new set of annotation packages, but this is a tradeoff we make between being completely updated and allowing people to have reproducible results.

I built the annotation packages in the first week of October 2015, so all the data are based on what was available from the various annotation sites as of that week.

ADD REPLY • link 8.5 years ago James W. MacDonald 66k

0

Entering edit mode

I see, thanks for the quick reply! I was a bit concerned cause some of the same genes in the hta20sttrnascriptcluster.db have different chromosome locations and when I checked NCBI I realized none of the location match to the NCBI chromosome location for that gene. For example, Entrez gene ID: 100873740 have CHRLOCCHR 1 and 14 in hta20sttranscriptcluster.db, however, when I checked NCBI it says RNU6-31P is on chromosome 2. That's why I would like to double check what does the annotation base on.

Best,

Sylvia

ADD REPLY • link 8.5 years ago sylvia ▴ 10

0

Entering edit mode

Opinions differ on the location of that gene. Note that the hta20sttranscriptcluster.db package really only has a single table, mapping probeset ID to Entrez Gene ID. The rest of the mappings are handled by the org.Hs.eg.db package. So this discrepancy isn't due to anything in the hta20sttranscriptcluster.db package.

Anyway, to show what I am talking about, let's do some SQL queries on the tables in the org.Hs.eg.db package.

> library(org.Hs.eg.db)

> con <- org.Hs.eg_dbconn()

> dbGetQuery(con, "select * from chromosome_locations inner join genes using(_id) where gene_id='100873740';")
    _id seqname start_location end_location   gene_id
1 40238      14      -76737449    -76737476 100873740
2 40238       1     -101806424   -101806451 100873740

> dbGetQuery(con, "select * from chromosomes inner join genes using(_id) where gene_id='100873740';")
    _id chromosome   gene_id
1 40238          2 100873740

So this explains the discrepancy we get from this select() call, that I didn't really understand:

> select(org.Hs.eg.db, "100873740", c("CHR","CHRLOC"))
'select()' returned 1:many mapping between keys and columns
   ENTREZID CHR     CHRLOC CHRLOCCHR
1 100873740   2  -76737449        14
2 100873740   2 -101806424         1

In other words, the 'chromosomes' table says the gene is on chr2, whereas the 'chromosome_locations' table says otherwise. But like I said before, this isn't really the package for doing this sort of thing. We should really be using a TxDb package.

> tx <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> tx["100873740"]
GRangesList object of length 1:
$100873740
GRanges object with 2 ranges and 2 metadata columns:
      seqnames                 ranges strand |     tx_id     tx_name
         <Rle>              <IRanges>  <Rle> | <integer> <character>
  [1]     chr1 [101806425, 101806451]      - |      5780  uc031pni.1
  [2]    chr14 [ 76737450,  76737476]      - |     53090  uc031qpk.1

So UCSC is pretty sure that this pseudogene is found on chr1 and chr14, not chr2. There is probably more to this story, but I can't really delve further into it because A.) I don't have the time, really, and B.) the server we used to do the annotation is a virtual server in the cloud, that is not running right now, and I don't have the authoritah to fire it up to check over the code to see exactly where the two (chromosomes and chromosome_location) tables come from. I am pretty sure they are from two different tables we download from NCBI, but I don't know which ones.

I should also note that Bioconductor doesn't make any claim to the accuracy of the data we provide. We are simply re-packaging existing data that we get from the people whose job it is to collect and distribute the data. As with any such large projects that are administered by different teams in different countries, there will inevitably be differences that have to be ironed out over time.

ADD REPLY • link 8.5 years ago James W. MacDonald 66k

0

Entering edit mode

Hi James,

Thank you for your time! I'll take a look at TxDb.

Best,

Sylvia

ADD REPLY • link 8.5 years ago sylvia ▴ 10