Question

Harmonizing Gene Annotations for Meta-Analysis of GSE68183 and GSE80178

Dhite • 0

I am currently working on a meta-analysis involving two GEO datasets: GSE68183 and GSE80178. Both datasets include CEL files, and I aim to process them to ensure consistent gene annotations across both studies. However, I have encountered several challenges:

The gene identifiers in the two datasets appear to differ, making it difficult to align them for comparative analysis.

I have attempted to process the CEL files using various R packages, including affy, affyio, oligo, and oligoclasses. Despite these efforts, I have been unable to generate consistent gene annotations.

I am seeking guidance on the following:

What are the recommended approaches to standardize gene identifiers between these two datasets?
Which tools or packages are best suited for processing CEL files from these specific GEO datasets to achieve consistent gene annotations?

Any insights, suggestions, or references to relevant resources would be greatly appreciated.

Best regards,

GEOquery • 341 views

ADD COMMENT • link updated 4 months ago by James W. MacDonald 68k • written 4 months ago by Dhite • 0

My recommendation would be to first translate the probe identifiers to Ensembl gene IDs, for example with biomaRt, and then take the intersection. You can then proceed with the genes in that "common universe". The problem is that gene annotations change over time, so a probe that captured geneA 10 years ago may today be deprecated and considered an artifact, or its annotation may have changed. Hence, I would find it important to only look at genes that are consistently annotated and have a stable Ensembl ID on all platforms, imho.
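
As a rough sketch of the intersection step only: the two small tables below are toy stand-ins for the probe-to-Ensembl mappings that a biomaRt query might return for each platform. All probe names and Ensembl-style IDs here are invented for illustration.

```r
## Hypothetical probe-to-Ensembl mappings for two platforms (all IDs invented)
map_a <- data.frame(
  probe   = c("a1", "a2", "a3", "a4"),
  ensembl = c("ENSG_A", "ENSG_B", NA, "ENSG_D"),
  stringsAsFactors = FALSE
)
map_b <- data.frame(
  probe   = c("b1", "b2", "b3", "b4"),
  ensembl = c("ENSG_B", "ENSG_C", "ENSG_D", NA),
  stringsAsFactors = FALSE
)

## drop probes that have no Ensembl gene ID, so NA cannot end up in the intersection
map_a <- map_a[!is.na(map_a$ensembl), ]
map_b <- map_b[!is.na(map_b$ensembl), ]

## the "common universe": gene IDs present on both platforms
common <- intersect(map_a$ensembl, map_b$ensembl)

## restrict each mapping to the shared genes
map_a_common <- map_a[map_a$ensembl %in% common, ]
map_b_common <- map_b[map_b$ensembl %in% common, ]
print(common)
```

Probes that map to more than one gene, or several probes that map to the same gene, would still need a rule of their own; that is not handled here.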

ADD REPLY • link 4 months ago ATpoint ★ 4.8k

score 0 · Answer 1 · 2024-11-18

Those are both HuGene-2.0 ST arrays, so there should be no differences. If you are getting the CEL files, it's a simple process to annotate.

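library(BiocManager)
install(c("GEOquery","oligo","pd.hugene.2.0.st","hugene20sttranscriptcluster.db","affycoretools"))
library(GEOquery)
## download the supplementary files (the raw CEL tarballs) for each series
getGEOSuppFiles("GSE80178")
getGEOSuppFiles("GSE68183")
## unpack each tarball inside its own directory
setwd("GSE68183/")
untar("GSE68183_RAW.tar")
setwd("../GSE80178/")
untar("GSE80178_RAW.tar")
setwd("../")
library(oligo)
## read the CEL files and summarize with RMA
gse68 <- rma(read.celfiles(filenames = dir("GSE68183", "CEL", full.names = TRUE)))
gse80 <- rma(read.celfiles(filenames = dir("GSE80178", "CEL", full.names = TRUE)))
library(affycoretools)
library(hugene20sttranscriptcluster.db)
## add gene-level annotation to the featureData of each ExpressionSet
gse68 <- annotateEset(gse68, hugene20sttranscriptcluster.db)
gse80 <- annotateEset(gse80, hugene20sttranscriptcluster.db)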
> all.equal(fData(gse68), fData(gse80))
[1] TRUE
> head(fData(gse68))
          PROBEID ENTREZID SYMBOL
16650001 16650001     <NA>   <NA>
16650003 16650003     <NA>   <NA>
16650005 16650005     <NA>   <NA>
16650007 16650007     <NA>   <NA>
16650009 16650009     <NA>   <NA>
16650011 16650011     <NA>   <NA>
         GENENAME
16650001     <NA>
16650003     <NA>
16650005     <NA>
16650007     <NA>
16650009     <NA>
16650011     <NA>

And do note that the head of the featureData object shows a bunch of control probes that aren't annotated.
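
If you want to drop those unannotated rows before going further, a minimal sketch follows. It uses a toy data frame standing in for fData(gse68); the two annotated rows (probe IDs, Entrez IDs and symbols) are invented for illustration. With the real objects, the same logical index could be applied to the ExpressionSet itself, e.g. gse68[!is.na(fData(gse68)$SYMBOL), ].

```r
## Toy stand-in for fData(gse68); the last two rows are invented for illustration
fd <- data.frame(
  PROBEID  = c("16650001", "16650003", "99990001", "99990002"),
  ENTREZID = c(NA, NA, "111111", "222222"),
  SYMBOL   = c(NA, NA, "GENEA", "GENEB"),
  stringsAsFactors = FALSE
)
rownames(fd) <- fd$PROBEID

## TRUE for probesets that carry a gene symbol
annotated <- !is.na(fd$SYMBOL)

## keep only the annotated probesets
fd_annot <- fd[annotated, ]
print(fd_annot)
```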