Question

Filtering the probe set ids from GSE file

mahm ▴ 20

@mahm-16884

I'm trying to parse the probe set IDs from the series 'GSE15543'; the GDS accession for the same study is GDS4027.

Parsing the probe IDs from the GSE does not work as expected.

GSE file:

library(GEOquery)

eset2 <- getGEO('GSE15543')[[1]]

fData(eset2)[nrow(fData(eset2)), ] # nrow = 54675

Output:

          ID GB_ACC SPOT_ID Species Scientific Name Annotation Date Sequence Type
NA.41561 <NA>   <NA>    <NA>                    <NA>            <NA>          <NA>
         Sequence Source Target Description Representative Public ID Gene Title
NA.41561            <NA>               <NA>                     <NA>       <NA>
         Gene Symbol ENTREZ_GENE_ID RefSeq Transcript ID Gene Ontology Biological Process
NA.41561        <NA>           <NA>                 <NA>                             <NA>
         Gene Ontology Cellular Component Gene Ontology Molecular Function
NA.41561

GDS file:

gds <- getGEO('GDS4027')

eset = GDS2eSet(gds)

fData(eset)[nrow(fData(eset)),] #nrow = 54675

Output:

                             ID Gene title Gene symbol Gene ID UniGene title
AFFX-TrpnX-M_at AFFX-TrpnX-M_at                                             
                UniGene symbol UniGene ID Nucleotide Title GI GenBank Accession
AFFX-TrpnX-M_at                                            NA                  
                Platform_CLONEID Platform_ORF Platform_SPOTID Chromosome location
AFFX-TrpnX-M_at                                     --Control                    
                Chromosome annotation GO:Function GO:Process GO:Component GO:Function ID
AFFX-TrpnX-M_at                                                                         
                GO:Process ID GO:Component ID
AFFX-TrpnX-M_at

As the outputs above show, I cannot obtain the probe ID from the expression set created from the GSE file.

In short, the problem is:

> rownames(exprs(eset2))[1]
[1] "1007_s_at"
> rownames(exprs(eset2))[54675]
[1] "NA.41561"

I cannot parse all the probe IDs (e.g. "NA.41561") and therefore cannot map them to gene symbols.

I would like to parse all the probe set IDs from the GSE file and map them to gene symbols. Any help will be much appreciated.
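
Row names such as "NA.41561" are what base R produces when a data frame is indexed with NA positions, for example after a failed `match()`. The sketch below only illustrates that base R behaviour on a toy table; the probe names and symbols are made up, and it is an assumption (not confirmed in this thread) that something similar happens when the platform annotation cannot be matched to the expression rows.

```r
# Toy annotation table; IDs and symbols are illustrative, not taken from GEO.
annotation <- data.frame(
  ID = c("probe_a", "probe_b"),
  symbol = c("GENE1", "GENE2"),
  row.names = c("probe_a", "probe_b"),
  stringsAsFactors = FALSE
)

# Two of the requested IDs are absent from the table, so match() returns NA.
wanted <- c("probe_a", "probe_x", "probe_y")
idx <- match(wanted, rownames(annotation))

# Indexing rows with NA yields all-NA rows whose names are made unique
# from the string "NA".
looked_up <- annotation[idx, ]
print(rownames(looked_up))

# The IDs that failed to match can be listed directly.
missing_ids <- wanted[is.na(idx)]
print(missing_ids)
```

If this is what is happening, the probe IDs themselves are not lost; only the lookup into the annotation table failed.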

geoquery GSE • 2.4k views

ADD COMMENT • link 7.4 years ago mahm ▴ 20

I'm not sure what is going on. This is what I see:

> gse = getGEO('GSE15543')[[1]]
Found 1 file(s)
GSE15543_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15543/matrix/GSE15543_series_matrix.txt.gz'
Content type 'application/x-gzip' length 14491926 bytes (13.8 MB)
==================================================
downloaded 13.8 MB

Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
File stored at: 
/var/folders/hq/pzgtdx7j55j0g7r4647vqzrr2yvxz9/T//RtmpVfLEKO/GPL570.soft
|============================================================================================================| 100%   80 MB
> gse
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 33 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM388740 GSM388741 ... GSM388772 (33 total)
  varLabels: title geo_accession ... cell type:ch1 (34 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL570 
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2       GEOquery_2.48.0      Biobase_2.40.0       ExperimentHub_1.6.0  AnnotationHub_2.12.0
[6] BiocGenerics_0.26.0 
...               
> fData(gse)[nrow(gse),]
                             ID GB_ACC   SPOT_ID Species Scientific Name Annotation Date    Sequence Type
AFFX-TrpnX-M_at AFFX-TrpnX-M_at        --Control            Homo sapiens     Oct 6, 2014 Control sequence
                                Sequence Source
AFFX-TrpnX-M_at Affymetrix Proprietary Database
                                                                                                                                                                                         Target Description
AFFX-TrpnX-M_at B. subtilis /GEN=trpD, trpC /DB_XREF=gb:K01391.1 /NOTE=SIF corresponding to nucleotides 2880-3359 of gb:K01391.1, not 100% identical /DEF=B.subtilis tryptophan (trp) operon, complete cds.
                Representative Public ID Gene Title Gene Symbol ENTREZ_GENE_ID RefSeq Transcript ID
AFFX-TrpnX-M_at             AFFX-TrpnX-M                                                           
                Gene Ontology Biological Process Gene Ontology Cellular Component Gene Ontology Molecular Function
AFFX-TrpnX-M_at

I'll look into the issue on Windows when I get a chance.
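
Once the feature data is intact, as in the output above, the probe-to-symbol mapping asked about in the question can be sketched roughly as follows. The helper name `map_probes_to_symbols` is hypothetical, and the toy table only mimics the `ID` and `Gene Symbol` columns shown above; its values are made up.

```r
# Toy stand-in for fData(gse), with the same column names as the output above.
# The values are illustrative, not taken from GEO.
feature_table <- data.frame(
  ID = c("probe_a", "probe_b", "probe_c"),
  `Gene Symbol` = c("GENE1", "GENE2", ""),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
rownames(feature_table) <- feature_table$ID

# Hypothetical helper: returns a character vector of symbols named by probe ID.
# Empty symbols (as for the control probe above) are turned into NA.
map_probes_to_symbols <- function(fdata, id_col = "ID",
                                  symbol_col = "Gene Symbol") {
  symbols <- as.character(fdata[[symbol_col]])
  symbols[!nzchar(symbols)] <- NA_character_
  names(symbols) <- as.character(fdata[[id_col]])
  symbols
}

probe_to_symbol <- map_probes_to_symbols(feature_table)

# Probes without a gene symbol.
unmapped <- names(probe_to_symbol)[is.na(probe_to_symbol)]
print(probe_to_symbol)
```

With the real object, the call would presumably be `map_probes_to_symbols(fData(gse))`.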

ADD REPLY • link 7.4 years ago Sean Davis 21k

On Linux,

> gse = getGEO('GSE15543')[[1]]
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15543/matrix/
Found 1 file(s)
GSE15543_series_matrix.txt.gz
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15543/matrix/GSE15543_series_matrix.txt.gz'
ftp data connection made, file length 14491926 bytes
==================================================
downloaded 13.8 MB

Error in download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
  cannot open URL 'http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL570&form=text&view=full'

cannot open URL 'http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL570&form=text&view=full'

sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

locale:
[1] LC_CTYPE=en_IN       LC_NUMERIC=C         LC_TIME=en_IN
[4] LC_COLLATE=en_IN     LC_MONETARY=en_IN    LC_MESSAGES=en_IN
[7] LC_PAPER=en_IN       LC_NAME=C            LC_ADDRESS=C
[10] LC_TELEPHONE=C       LC_MEASUREMENT=en_IN LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] GEOquery_2.36.0 Biobase_2.30.0 BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.11 bitops_1.0-6 XML_3.98-1.16

Any suggestion on how to resolve this error?

ADD REPLY • link 7.4 years ago mahm ▴ 20

Yes. Use a current version of R/Bioconductor. The version you are using is almost 3 years old and no longer supported. As Sean has already shown, the current version works fine.
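
A minimal sketch of checking and updating, using the GEOquery versions reported in the two `sessionInfo()` outputs in this thread. The helpers `is_older` and `update_geoquery` are hypothetical names. The update step assumes R itself has already been upgraded and that BiocManager is the installer for the Bioconductor release in use; it is defined but not called here because it needs network access.

```r
# Versions reported in the two sessionInfo() outputs in this thread.
failing_version <- "2.36.0"  # GEOquery in the R 3.2.3 Linux session
working_version <- "2.48.0"  # GEOquery in the R 3.5.1 macOS session

# Hypothetical helper: TRUE when `installed` is older than `reference`.
is_older <- function(installed, reference) {
  package_version(installed) < package_version(reference)
}

print(is_older(failing_version, working_version))

# Hypothetical helper, defined but not run here.
# Assumes BiocManager is available for the installed R version.
update_geoquery <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("GEOquery")
}
```

In a live session, the installed version can be compared with `is_older(as.character(packageVersion("GEOquery")), working_version)`.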

ADD REPLY • link 7.4 years ago James W. MacDonald 68k