Filtering the probe set ids from GSE file
mahm ▴ 20

I'm trying to extract the probe set IDs from the GEO series 'GSE15543'; the GDS record for the same study is GDS4027.

Extracting the probe IDs works for the GDS but fails for the GSE.

GSE file: 

eset2 <- getGEO('GSE15543')[[1]]

fData(eset2)[nrow(fData(eset2)),] #nrow = 54675

Output:

          ID GB_ACC SPOT_ID Species Scientific Name Annotation Date Sequence Type
NA.41561 <NA>   <NA>    <NA>                    <NA>            <NA>          <NA>
         Sequence Source Target Description Representative Public ID Gene Title
NA.41561            <NA>               <NA>                     <NA>       <NA>
         Gene Symbol ENTREZ_GENE_ID RefSeq Transcript ID Gene Ontology Biological Process
NA.41561        <NA>           <NA>                 <NA>                             <NA>
         Gene Ontology Cellular Component Gene Ontology Molecular Function
NA.41561 

 

GDS file:

gds <- getGEO('GDS4027')

eset = GDS2eSet(gds)

fData(eset)[nrow(fData(eset)),] #nrow = 54675

Output:

                             ID Gene title Gene symbol Gene ID UniGene title
AFFX-TrpnX-M_at AFFX-TrpnX-M_at                                             
                UniGene symbol UniGene ID Nucleotide Title GI GenBank Accession
AFFX-TrpnX-M_at                                            NA                  
                Platform_CLONEID Platform_ORF Platform_SPOTID Chromosome location
AFFX-TrpnX-M_at                                     --Control                    
                Chromosome annotation GO:Function GO:Process GO:Component GO:Function ID
AFFX-TrpnX-M_at                                                                         
                GO:Process ID GO:Component ID
AFFX-TrpnX-M_at  

As the outputs above show, the expression set built from the GSE file does not return the probe ID for the last row, whereas the one built from the GDS file does.

 

In short, the problem is:

> rownames(exprs(eset2))[1]
[1] "1007_s_at"
> rownames(exprs(eset2))[54675]
[1] "NA.41561"

Some of the feature names are placeholders rather than probe IDs (e.g. "NA.41561"), so I cannot map them to gene symbols.
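For context, row names such as "NA", "NA.1", ... are what base R produces when a data frame is indexed with row positions that are NA, for example after a `match()` against an incomplete annotation table. Whether that is exactly what happens inside GEOquery here is an assumption. The sketch below uses a small made-up table (`annot` and `wanted` are hypothetical names, not GEOquery objects) only to show the mechanism and a simple check for such placeholder names.

```r
# Made-up annotation table standing in for a partially parsed platform file.
annot <- data.frame(
  ID = c("1007_s_at", "1053_at"),
  stringsAsFactors = FALSE
)
rownames(annot) <- annot$ID

# Probe IDs as they might appear in the expression matrix;
# the last two are deliberately absent from annot.
wanted <- c("1007_s_at", "117_at", "AFFX-TrpnX-M_at")

# match() returns NA for the missing IDs, and indexing rows by NA yields
# all-NA rows whose names are made unique by R ("NA", "NA.1", ...).
looked_up <- annot[match(wanted, annot$ID), , drop = FALSE]
print(rownames(looked_up))

# Flag placeholder row names so the problem is caught early.
is_placeholder <- grepl("^NA(\\.[0-9]+)?$", rownames(looked_up))
print(sum(is_placeholder))
```

Running the same `grepl()` check on `rownames(exprs(eset2))` would show how many feature names are affected; a large count suggests the platform annotation was not read in full.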

I would like to extract all the probe set IDs from the GSE file and map them to gene symbols. Any help will be much appreciated.
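Once the feature data is intact, the mapping itself is a lookup between the `ID` and `Gene Symbol` columns that appear in the GPL570 output in this thread. A minimal sketch, assuming a data frame shaped like `fData()` output; `feature_table` is a hypothetical stand-in, its values are illustrative, and the " /// " separator for multi-gene probes is an assumption about how the platform file lists them:

```r
# Hypothetical stand-in for fData(gse), using the GPL570 column names
# "ID" and "Gene Symbol"; the symbol values are illustrative only.
feature_table <- data.frame(
  ID = c("1007_s_at", "1053_at", "AFFX-TrpnX-M_at"),
  `Gene Symbol` = c("DDR1 /// MIR4640", "RFC2", ""),
  check.names = FALSE,        # keep the space in "Gene Symbol"
  stringsAsFactors = FALSE
)
rownames(feature_table) <- feature_table$ID

# Named character vector: probe ID -> gene symbol string.
probe_to_symbol <- setNames(feature_table[["Gene Symbol"]], feature_table$ID)

# Control probes have an empty symbol; record them as NA instead.
probe_to_symbol[probe_to_symbol == ""] <- NA_character_

# Where several symbols are listed, keep only the first one.
first_symbol <- sub(" ///.*$", "", probe_to_symbol)

print(first_symbol)
```

With a real ExpressionSet the same steps would start from `fData(gse)` instead of `feature_table`. Keeping only the first symbol is a simplification; some analyses keep all symbols or drop multi-gene probes.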


I'm not sure what is going on. This is what I see: 

> gse = getGEO('GSE15543')[[1]]
Found 1 file(s)
GSE15543_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15543/matrix/GSE15543_series_matrix.txt.gz'
Content type 'application/x-gzip' length 14491926 bytes (13.8 MB)
==================================================
downloaded 13.8 MB

Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
File stored at: 
/var/folders/hq/pzgtdx7j55j0g7r4647vqzrr2yvxz9/T//RtmpVfLEKO/GPL570.soft
|============================================================================================================| 100%   80 MB
> gse
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 33 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM388740 GSM388741 ... GSM388772 (33 total)
  varLabels: title geo_accession ... cell type:ch1 (34 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL570 
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2       GEOquery_2.48.0      Biobase_2.40.0       ExperimentHub_1.6.0  AnnotationHub_2.12.0
[6] BiocGenerics_0.26.0 
...               
> fData(gse)[nrow(gse),]
                             ID GB_ACC   SPOT_ID Species Scientific Name Annotation Date    Sequence Type
AFFX-TrpnX-M_at AFFX-TrpnX-M_at        --Control            Homo sapiens     Oct 6, 2014 Control sequence
                                Sequence Source
AFFX-TrpnX-M_at Affymetrix Proprietary Database
                                                                                                                                                                                         Target Description
AFFX-TrpnX-M_at B. subtilis /GEN=trpD, trpC /DB_XREF=gb:K01391.1 /NOTE=SIF corresponding to nucleotides 2880-3359 of gb:K01391.1, not 100% identical /DEF=B.subtilis tryptophan (trp) operon, complete cds.
                Representative Public ID Gene Title Gene Symbol ENTREZ_GENE_ID RefSeq Transcript ID
AFFX-TrpnX-M_at             AFFX-TrpnX-M                                                           
                Gene Ontology Biological Process Gene Ontology Cellular Component Gene Ontology Molecular Function
AFFX-TrpnX-M_at  

I'll look into the issue on Windows when I get a chance.


On Linux, the same call fails while downloading the platform annotation:

> gse = getGEO('GSE15543')[[1]]
ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15543/matrix/
Found 1 file(s)
GSE15543_series_matrix.txt.gz
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15543/matrix/GSE15543_series_matrix.txt.gz'
ftp data connection made, file length 14491926 bytes
==================================================
downloaded 13.8 MB

Error in download.file(myurl, destfile, mode = mode, quiet = TRUE, method = getOption("download.file.method.GEOquery")) :
  cannot open URL 'http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL570&form=text&view=full'

 cannot open URL 'http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?targ=self&acc=GPL570&form=text&view=full'

sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

locale:
 [1] LC_CTYPE=en_IN       LC_NUMERIC=C         LC_TIME=en_IN       
 [4] LC_COLLATE=en_IN     LC_MONETARY=en_IN    LC_MESSAGES=en_IN   
 [7] LC_PAPER=en_IN       LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=en_IN LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] GEOquery_2.36.0     Biobase_2.30.0      BiocGenerics_0.16.1

loaded via a namespace (and not attached):
[1] RCurl_1.95-4.11 bitops_1.0-6    XML_3.98-1.16

Any suggestions on how to resolve this error?


Yes. Use a current version of R/Bioconductor. You are using a version that is almost 3 years old, which is no longer supported. As Sean has already shown, the current version works fine.
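To see how far behind an installation is, the versions can be compared directly. A minimal sketch using the GEOquery versions reported in the two `sessionInfo()` outputs in this thread; the variable names are made up, and the update commands are left as comments because they need network access:

```r
# GEOquery versions taken from the sessionInfo() outputs in this thread.
failing_version <- package_version("2.36.0")  # Linux session, R 3.2.3
working_version <- package_version("2.48.0")  # macOS session, R 3.5.1

# package_version objects compare component by component, not as strings.
print(failing_version < working_version)

# Report what is installed locally.
print(getRversion())
if (requireNamespace("GEOquery", quietly = TRUE)) {
  print(packageVersion("GEOquery"))
}

# To update, run interactively on a current R release:
# if (!requireNamespace("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# BiocManager::install("GEOquery")
```

BiocManager installs the Bioconductor release that matches the running R version, so R itself has to be updated first.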
