Question about converting rownames of an expression dataset to their corresponding gene symbols
2
0
Entering edit mode
svlachavas ▴ 780
@svlachavas-7225
Last seen 7 hours ago
Germany/Heidelberg/German Cancer Resear…

Dear ALL,

i have an expression dataset(called data.trusted.eset)(affymetrix, hgu133a) which i have preprocessed and filtered and which i used as the input for my limma paired analysis to find DE genes. My basic question is, that i want to use the same expressionSet an an input(exprs(data.trusted.eset)) as a basic argument to a gene network co-regulation analysis,----but instead of probesets as rownames i have to replace them with unique gene symbols. Thus, remove in an appropriate way remove NAs but also duplicates.

I tried:

affyUniverse <- featureNames(data.trusted.eset)

symbols <- hgu133aSYMBOL[affyUniverse]

symbol.Universe <- unique(as.character(symbols))

My main problem is then how to reduce my expression set with these symbols. Or my approach is totally wrong ?

Any suggestions or ideas would be grateful

filtering bioconductor genesymbols expressionset • 1.5k views
1
Entering edit mode
@james-w-macdonald-5106
Last seen 38 minutes ago
United States
> gns <- select(hgu133a.db, featureNames(data.trusted.eset), "SYMBOL")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
'select' resulted in 1:many mapping between keys and return rows
> gns <- gns[!duplicated(gns[,1]),]
> ind <- !duplicated(gns[,2]) & !is.na(gns[,2])
> gns2 <- gns[ind,]

> table(table(gns2[,2]))

1
13018

> data.trusted.eset.unique <- data.trusted.eset[ind,]

> featureNames(data.trusted.eset.unique) <- as.character(gns2[,2])
1
Entering edit mode
svlachavas ▴ 780
@svlachavas-7225
Last seen 7 hours ago
Germany/Heidelberg/German Cancer Resear…

Dear Mr James W.MacDonald,

thank you for your valuable answer and code to implement it to my analysis. Before reading your answer in my question, i tried to write some small code to remove duplicate probesets based on median absolut deviation and keep the probeset with the bigger "mad" between the other same duplicates,as i think(correct me if im wrong) the duplicated function keeps the first incidence of a probeset . So i would like to see my code below and comment about my rationale of removing duplicates based on this characteristic(also no hard feelings about any mistakes or wrong syntax for my code because any comments or corrections would be essential to my improvement in R skills and knowledge):

symbols <- unlist(mget(featureNames(data.trusted.eset), env=hgu133aSYMBOL))
top2 <- topTable(fit2, coef="conditionCancer", number=nrow(fit2), genelist=symbols, adjust.method="fdr", sort.by="none")
# where top2$ID holds the gene symbols for the probesets and with the arguments number and sort.by i have all my probesets with the same row as the rownames of the data.trusted.eset m <- apply(exprs(data.trusted.eset),1,mad) top2$MAD <- m

top3 <- top2[order(top2$MAD, decreasing=TRUE),] top3 <- top3[!duplicated(top3$ID),]

filtered <- data.trusted.eset[rownames(top3),]

rownames(filtered) <- top3\$ID

1
Entering edit mode

You have to get out of the habit of using things like mget() for accessing data from the annotation packages. By default these old methods return NA for any probeset with one-to-many mappings. Use select() instead, as I have already shown.

Otherwise I think your code is OK.

0
Entering edit mode

0
Entering edit mode

Dear Mr MacDonald,

please excuse me for posting one more question for the above subject, but i have noticed something very important and i would like to elusidate a specific issue(so please excuse me again for any "naive" questions). More specifically, after your above function:

gns <- select(hgu133a.db, featureNames(data.trusted.eset), "SYMBOL")

i noticed that except from the usual case of different probesets matching to the same Gene Symbol(transcript), i have also noticed several probesets, each of them (could) match to more than one gene symbols. Thus, my main concern is that with

gns <- gns[!duplicated(gns[,1]),]

might we (choose) arbitrary select a symbol for each probeset(which is the first incidence with !duplicated function), which might also includes non-coding genes in some circumstances ? Could for instance a probeset might be affected to coding or non-coding genes ? And also this although includes more annotated probesets than mget, affect in a different way my downstream analysis?

0
Entering edit mode

You would have to check this to be sure, but I am not sure that there was any non-coding content on the old HG-U133A array. Note that these arrays were based on UniGene build 133. I am not an expert on the various NCBI databases, but it seems to me that UniGene is intended to be for genes that are translated to protein. In addition, that build is from like, 2003 or so, and I don't think there was much known untranslated content.

0
Entering edit mode

Ι understand . I have so far checked it manually. For instance, for the probeset

265 200715_x_at   RPL13A
266 200715_x_at SNORD35A

the second match if im mistaken it is a non-coding gene. Thus far, maybe it seems random but for a lot of these probesets the first match(which also is chosen from the !duplicated()) is a coding gene and also for instance

4637   210483_at              LOC254896
4638   210483_at              TNFRSF10C

the first match is a RNA gene(non-coding)

0
Entering edit mode

Thus, if there is a small number after the removal of duplicate probesets which might match to untranslated transcripts, should i keep them in order not to use generally information by removing them as NAs with mget ? The difference between the code i posted and your code is about 200 probesets, but it might be significant

0
Entering edit mode

This is the problem with the one-to-many probesets; it's not clear how to select the 'best' gene for a given probeset, and it is certainly not easy to do this in an automated way.

0
Entering edit mode

Ok thank you again for your consideration. I would consider both approaches and i will finally choose one of them. I hope than the small amount of these probesets wouldnt affect too much the downstream analysis, as you mentioned there isnt a "gold standard" for the most powerful approach