Question: Codelink differential expression and annotation
1
4.3 years ago by
India
Agaz Hussain Wani260 wrote:
I perform differential expression on codelink data and output is shown below. Now i am looking to annotate the probes and i use ACCN column from hwgcod.db to annotate. I am trying to get the ACCN's from probeName column ,about 7000 probe names and ACCN  does not match, which leads to loss of information. Can i get the ACCN instead of probeName as shown below as the second column.

Note: I am using the infromation from probeName to get ACCN which i can use to annotate, say something like "1691167CB1" "NM_002575"  "3251556CB1" "NM_001045"  "201678"     "8187031CB1" "1019621"    "1023071".

probeName       probeType logicalRow logicalCol   meanSNR     logFC
17143  1691167CB1_PROBE1 DISCOVERY        278         26 0.5780484 -6.767287
1460  NM_002575.1_PROBE1 DISCOVERY         26         23 0.6308975  6.655497
19529  3251556CB1_PROBE1 DISCOVERY        315         57 0.6364621 -4.456800
19025 NM_001045.1_PROBE1 DISCOVERY        307         66 0.5722841  5.984381
5311    201678.10_PROBE1 DISCOVERY         87         40 0.5945943 -5.464377
7264   8187031CB1_PROBE1 DISCOVERY        118         23 0.6700745  6.160234
9024    1019621.1_PROBE1 DISCOVERY        146          5 0.7937480 -7.693983
6481    1023071.1_PROBE1 DISCOVERY        105         69 0.7680499 -6.333866
17143 1.379396 -19.89355 8.231218e-06 0.1629205 -2.990755
1460  1.468584  14.45828 3.746444e-05 0.3707668 -3.020015
19529 2.990133 -13.15169 5.856035e-05 0.3806185 -3.032661
19025 1.065737  11.83323 9.614977e-05 0.3806185 -3.049624
5311  1.250057 -11.28423 1.200560e-04 0.3960448 -3.058378
7264  2.368912  10.32842 1.812488e-04 0.4675345 -3.076801
9024  4.481637 -17.38061 8.927544e-05 0.3806185 -3.659384
6481  4.973425 -14.24512 1.889696e-04 0.4675345 -3.668655

My sample data is here https://www.dropbox.com/s/kz12cwgqes10oqu/GSM108290.TXT?dl=0 , which has ACCN column.

The second confusion i want to clear is, is there a single annotation package for affymetrix, illumina and codelink data platforms to annotate from. Thanks

modified 4.3 years ago by Diego Diez730 • written 4.3 years ago by Agaz Hussain Wani260

In the file you link there is no ACCN column. Are you reading this file with readCodelinkSet() or from GEO with GEOquery? How are you getting the ACCN information (i.e. write the actual code)? This question is not about differential expression (as the title and tags suggests), only about annotation.

I can find " Annotation_NCBI-Acc"

Please find the code which i use to extract information from probeNames() to annotate for h20kcod.db

codelink_processed <- list()
files = list.files(pattern = "TXT")
{
for(i in 1:length(files))
{
codset = codCorrect(codset, method = "half", offset = 0)
codset = codNormalize(codset, method = "loess", weights = getWeight(codset), loess.method = "fast")
exprs <- exprs(codset)
probes <- probeNames(codset)
snr  <- getSNR(codset)

## To get the information from probes compatible with ACCNUM
newids <- gsub("\\..*|_PROBE.*", "",probes)

## Annotating the newids
keys <- select(h20kcod.db, newids, c("SYMBOL"), "ACCNUM")

dataFrame <- cbind(keys,exprs,snr)
colnames(dataFrame) <- c("Probe","Gene Symbol","SignalIntensity", "SNR")
}
}

dataframes <- codelink_data(files)

I do not know why you assume that we should read "ACCN" column as "Annotation_NCBI-Acc"- they may or may not refer to the same thing. Please, be specific when referring to data columns (or in general, to anything you want others to help you with) or define before hand any alias you wish to use. I am preparing a workaround for your issue and will post it as an answer soon.

1
4.3 years ago by
Diego Diez730
Japan
Diego Diez730 wrote:

As we discussed in another question, the problem with the specific dataset you are using is that it reports the legacy probe names, instead of the current ones. Therefore, you can't directly use the h20kcod.db annotation package. You tried to workaround this problem by obtaining the accession number that is found in the legacy probe names. But this has the disadvantage that many ids cannot be found anymore. This is because many of the original mappings have changed, due to improved genome assembly and annotation or some of the accession numbers beeing discontinued.

Anyway, your only hope is to remap the legacy probe names to the new ones. In the GEO page for the file you linked (GSM108290; http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM108290) we can find information about the array design in the GPL file: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL2891. Unfortunately, in this case the current codelink probe names are reported, and there is not the correspondence with the legacy ones. However, there is the row and col location of the probes (LOGICAL_ROW and LOGICAL_COL) and we can use this information for the mapping. Below is the code I used to hack this:

NOTE: you need to download the GPL file into a text file. For that I clicked on the View full table button that lead me to this page: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?view=data&acc=GPL2891&id=12727&db=GeoDb_blob06)

library(codelink)

gpl <- read.table("GPL2891.txt", sep = "\t", header = TRUE, fill = TRUE, stringsAsFactors = FALSE, strip.white = TRUE)

# obtain feature data.
fdata <- fData(codset)

# fix extra white.
fdata$logicalRow <- trimWhiteSpace(fdata$logicalRow)
fdata$logicalCol <- trimWhiteSpace(fdata$logicalCol)

# check dimensions (they do not match!)
dim(gpl)
dim(codset) # in the codelink file BLANK probes are missing (!?)

# match each spot.
# NOTE: slow.
tmp <- apply(fdata, 1, function(x) {
cr <- x["logicalRow"]
cc <- x["logicalCol"]
sel <- gpl$LOGICAL_ROW == cr & gpl$LOGICAL_COL == cc
data.frame(id = gpl$ID[sel], probeName = gpl$PROBE_NAME[sel], ori = x["probeName"], row.names = NULL, stringsAsFactors = FALSE)
})

# sanity check: one match per spot.
any(sapply(tmp, nrow) != 1)

# rbind all rows.
# NOTE: slow.
tmp <- do.call(rbind, tmp)

# bind old and new data.
fdata2 <- cbind(fdata, tmp)

# rename original probe name column (contains legacy probe names).
colnames(fdata2)[1] <- "probeName_old"

# sanity check (everything OK?)
all(fdata2$probeName_old == fdata2$ori)

# assign new fdata.
fData(codset) <- fdata2
probeNames(codset)

# now we can get annotations the traditional way:
library(h20kcod.db)
# most probes have a matching ACCNUM:
h20kcod() # h20kcodACCNUM has 18673 mapped keys (of 19993 keys)
# get annotations for 10 first probes:
select(h20kcod.db, keys = probeNames(codset)[1:10], columns = c("ENTREZID", "SYMBOL"))
PROBEID ENTREZID   SYMBOL
1  GE53815    23541  SEC14L2
2  GE59986     9053     MAP7
3  GE53908    57538    ALPK3
4  GE60064     5018    OXA1L
5  GE53952    54480    CHPF2
6  GE60188    10483   SEC23B
7  GE53801   400961   PAIP2B
8  GE59978     1506     CTRL
9  GE53905    57535 KIAA1324
10 GE60052    11099   PTPN21


This should work. I will consider adding a way to automatically map old legacy probes to the new ones. But this will not be immediately available, as so far this is the first case with this problem I found since the package was released (9 years ago).