Question

Unable to annotate ExpressionSet object from hugene11st chip

0

Entering edit mode

Efra • 0

@8a99e9af

Last seen 28 days ago

Mexico

I have expression data from an Affymetrix Human Gene 1.1 ST Array. According to the CDF, the chip is called hugene11st. I'm wishing to perform a differential expression in limma after parsing CEL files with oligo.

While following limma's userguide, I faced problems while trying to annotate the data. I already have an ExpressionSet object named eset. I found an annotation package called hugene11stprobeset.db, thus I ran

BiocManager::install("hugene11stprobeset.db")
library(hugene11stprobeset.db)
library(annotate)

ids <- featureNames(eset)
symbol <- getSYMBOL(ID,"hugene11stprobeset.db")

Nevertheless, it seems that the array symbol is full of NAs.

It is worth mentioning that, while watching the outputs of keys(hugene11stprobeset.db) and featureNames(eset), the numbers seem to match; thus, it seems that the database is the problem.

Besides, I have a GPL22286_hugene11st_Hs_ENSG.cdf.gz file. Is there a way to parse it into R to annotate the ExpressionSet object?

Any recommendation will be highly appreciated. Thank you.

oligo limma hugene11stprobeset.db AffymetrixChip • 376 views

ADD COMMENT • link 5 weeks ago • updated 4 weeks ago Efra • 0

score 0 · Answer 1 · 2024-03-18

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 17 hours ago

United States

The Human Gene arrays can be summarized at two levels, one that approximates the exon level, and one that estimates overall transcript abundances. The default in oligo is to summarize at the transcript level (which is almost surely what you should be doing), in which case you should annotate using the hugene11sttranscriptcluster.db package instead. The simple way to do that is

> library(oligo)
> library(hugene11sttranscriptcluster.db)
> library(affycoretools)
> dat <- read.celfiles(list.celfiles())
> eset <- rma(dat)
> eset <- annotateEset(eset, hugene11sttranscriptcluster.db)

After which you can run the data through limma, and your tables of differentially expressed genes will contain the gene symbols and NCBI IDs.

ADD COMMENT • link 5 weeks ago James W. MacDonald 65k

0

Entering edit mode

I did as you told, and after running eset.annot <- annotateEset(eset, hugene11sttranscriptcluster.db) I got the next output:

'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns

In fact, after running fData(eset.annot), the ENTREZID, SYMBOL and GENENAME are still a bunch of NAs

ADD REPLY • link 5 weeks ago Efra • 0

0

Entering edit mode

Let's take a closer look.

> library(GEOquery)
## some random data
> getGEOSuppFiles("GSE193983")

> setwd("GSE193983/")
> untar("GSE193983_RAW.tar")
> library(affycoretools)
> library(oligo)
## process
> eset <- rma(read.celfiles(list.celfiles(listGzipped = TRUE)))
Loading required package: pd.hugene.1.0.st.v1
Loading required package: RSQLite
Loading required package: DBI
Platform design info loaded.
Reading in : GSM5825021_Hs_1P_A278_raraujo_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825022_Hs_3P_A280_raraujo_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825023_Hs_Cauc1_P_A437_raraujo_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825024_Hs_Cauc2_P_A439_raraujo_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825025_Hs_Liso1_B151_agomes_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825026_Hs_Black2_P_A435_raraujo_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825027_Hs_Af1_B150_agomes_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825028_Hs_A13_B229_agomes_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825029_Hs_A14_B230_agomes_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825030_Hs_A15_B231_agomes_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Reading in : GSM5825031_Hs_A18_B232_agomes_UM_JL_HuGene-1_0-st-v1_.CEL.gz
Background correcting
Normalizing
Calculating Expression
> library(hugene10sttranscriptcluster.db)
## annotate
> eset <- annotateEset(eset, hugene10sttranscriptcluster.db)
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
## maybe you did this?
> head(fData(eset))
        PROBEID ENTREZID SYMBOL GENENAME
7892501 7892501     <NA>   <NA>     <NA>
7892502 7892502     <NA>   <NA>     <NA>
7892503 7892503     <NA>   <NA>     <NA>
7892504 7892504     <NA>   <NA>     <NA>
7892505 7892505     <NA>   <NA>     <NA>
7892506 7892506     <NA>   <NA>     <NA>
## when you should have done this?
> apply(fData(eset), 2, function(x) sum(is.na(x))/length(x))
 PROBEID ENTREZID   SYMBOL GENENAME 
0.000000 0.338529 0.338529 0.338529 

## still seems like a lot of NA values though

The annotations provided in the various ChipDb packages supplied for Affymetrix are based on data that is supplied by Affymetrix (now Fisher Scientific). We are just providing those data in an easily consumable form, and make no representations about the accuracy of the data. But do note that not all of the probesets on this array should measure something recognizable.

## get the NA probeset IDs
> whatarethese <- fData(eset)[is.na(fData(eset)[,2]),1]
> length(whatarethese)
[1] 11272
## that seems like a lot
## Here is some inside baseball forreal

## first make a connection to the pdInfoPackage
> con <- db(pd.hugene.1.0.st.v1)
## look at the featureSet table
> dbGetQuery(con, "select * from featureSet limit 10;")
    fsetid strand start stop transcript_cluster_id exon_id crosshyb_type level chrom type
1  7892501     NA    NA   NA               7892501       0             0    NA    NA    9
2  7892502     NA    NA   NA               7892502       0             0    NA    NA   10
3  7892503     NA    NA   NA               7892503       0             0    NA    NA   10
4  7892504     NA    NA   NA               7892504       0             0    NA    NA   10
5  7892505     NA    NA   NA               7892505       0             0    NA    NA   10
6  7892506     NA    NA   NA               7892506       0             0    NA    NA   10
7  7892507     NA    NA   NA               7892507       0             0    NA    NA   10
8  7892508     NA    NA   NA               7892508       0             0    NA    NA   10
9  7892509     NA    NA   NA               7892509       0             0    NA    NA    9
10 7892510     NA    NA   NA               7892510       0             0    NA    NA   10
## let's get the 'type' for all these NA probesets
> table(dbGetQuery(con, paste0("select type from featureSet where fsetid in ('", paste(whatarethese, collapse = "','"), "');")))
type
   5    7    9   10 
  57   45 1195 2904 
## check the table that provides type information
> dbGetQuery(con, "select * from type_dict;")
   type                    type_id
1     1                       main
2     2            main->junctions
3     3                 main->psrs
4     4               main->rescue
5     5              control->affx
6     6              control->chip
7     7  control->bgp->antigenomic
8     8      control->bgp->genomic
9     9             normgene->exon
10   10           normgene->intron
11   11   rescue->FLmRNA->unmapped
12   12   control->affx->bac_spike
13   13             oligo_spike_in
14   14            r1_bac_spike_at
15   15 control->affx->polya_spike
16   16        control->affx->ercc
17   17  control->affx->ercc->step

So all of those probesets are controls or normgene exon or intron probesets, none of which should be annotated anyway.

ADD REPLY • link 5 weeks ago James W. MacDonald 65k

0

Entering edit mode

Thank you si much!! This was such a complete and useful explanation, and now everything is clear. Thank you for all the information provided (as well as all the packages)

ADD REPLY • link 4 weeks ago Efra • 0