GEOquery reading error on Windows.
1
0
Entering edit mode
Caitlin ▴ 50
@caitlin-5400
Last seen 3.0 years ago
United States

Hi all.

I am trying to download a GSE file from the Gene Expression Omnibus and I am encountering a new error. Since the bug is still not fixed, I am attempting a "workaround".

Code should be placed in three backticks as shown below


library(GEOquery)

my_id <- "GSE12657"
readr::local_edition(1) 
gse <- getGEO(my_id) 

sessionInfo( )

Error in readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) : error reading from the connection In addition: Warning message: In readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) : invalid or incomplete compressed data

I am using a Windows 10 machine with R 4.1.2, Bioconductor 3.14, and all installed packages are current.

Thank you.

GEOquery GEO Gene • 2.6k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 13 hours ago
United States

You didn't supply the important part of the error! You should provide all the output, not just the part that you think is important. Here is the actual error:

> z <- getGEO("GSE12657")
Found 1 file(s)
GSE12657_series_matrix.txt.gz
Using locally cached version: C:\Users\Public\Documents\Wondershare\CreatorTemp\RtmpKQ60AU/GSE12657_series_matrix.txt.gz
File stored at: 
C:\Users\Public\Documents\Wondershare\CreatorTemp\RtmpKQ60AU/GPL8300.soft.gz
Error in readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) : 
  error reading from the connection
In addition: Warning message:
In readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE) :
  invalid or incomplete compressed data

And note that the problem is reading the GPL8300.soft.gz file, not the GSE12657_series_matrix.txt.gz file. The former being the annotation for the array, and the latter being the data for this particular experiment.

Anyway, the fix is to add an additional argument:

> z <- getGEO("GSE12657", AnnotGPL = TRUE)
Found 1 file(s)
GSE12657_series_matrix.txt.gz
Using locally cached version: C:\Users\Public\Documents\Wondershare\CreatorTemp\RtmpKQ60AU/GSE12657_series_matrix.txt.gz
File stored at: 
C:\Users\Public\Documents\Wondershare\CreatorTemp\RtmpKQ60AU/GPL8300.annot.gz
>

Which works because it gets a different annotation file.

ADD COMMENT
0
Entering edit mode

Thank you James.

Unfortunately, the subsequent code does not work as expected.

sampledata <- pData(z) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘pData’ for signature ‘"list"’

features <- fData(z) Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘fData’ for signature ‘"list"’

I am following the advice of a postdoc in the hope of completing a course project and this is the code she suggested.

Update: I am attempting to extract all of the gene symbols from each GSM object on the assumption that these would be the names of the genes identified in the experiment and that I could then use Cytoscape in some capacity with them.

sampleData <- pData(z[[1]]) featureData <- fData(z[[1]])

The above code appears to resolve the issue above. Is there a convenient method to extract all of the gene symbols from each GSM listed in the sampleData object?

ADD REPLY
1
Entering edit mode

Heh. Postdocs... You should never listen to them ;-D

What you get from getGEO is a list, so really you should probably do

z <- getGEO("GSE12657", AnnotGPL = TRUE)[[1]]

In which case pData and fData will now work as expected.

ADD REPLY
2
Entering edit mode

Oh, I missed your edit. I normally follow on email, so that's my excuse and I'm sticking with it.

You could use the data from GEO, but it's a pain, because it is just a reformulation of the Affy CSV files, which themselves are a pain to deal with. I have a function in my affycoretools package that might be useful to you.

> library(GEOquery)
> library(affycoretools)
> library(hgu95av2.db)
> z <- getGEO("GSE12657", AnnotGPL = TRUE)[[1]]
Found 1 file(s)
GSE12657_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE12nnn/GSE12657/matrix/GSE12657_series_matrix.txt.gz'
Content type 'application/x-gzip' length 1510339 bytes (1.4 MB)
downloaded 1.4 MB

File stored at: 
C:\Users\Public\Documents\Wondershare\CreatorTemp\Rtmpyg1zkg/GPL8300.annot.gz

> z
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 25 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM317711 GSM317712 ... GSM317735 (25 total)
  varLabels: title geo_accession ... data_row_count (29 total)
  varMetadata: labelDescription
featureData
  featureNames: 1000_at 1001_at ... AFFX-YEL024w/RIP1_at (12625 total)
  fvarLabels: ID Gene title ... GO:Component ID (21 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL8300   

## re-annotate.

> eset <- annotateEset(z, hgu95av2.db)
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:many mapping between keys and columns
> head(fData(eset))
            PROBEID ENTREZID  SYMBOL
1000_at     1000_at     5595   MAPK3
1001_at     1001_at     7075    TIE1
1002_f_at 1002_f_at     1557 CYP2C19
1003_s_at 1003_s_at      643   CXCR5
1004_at     1004_at      643   CXCR5
1005_at     1005_at     1843   DUSP1
                                                                 GENENAME
1000_at                                mitogen-activated protein kinase 3
1001_at   tyrosine kinase with immunoglobulin like and EGF like domains 1
1002_f_at                  cytochrome P450 family 2 subfamily C member 19
1003_s_at                                C-X-C motif chemokine receptor 5
1004_at                                  C-X-C motif chemokine receptor 5
1005_at                                    dual specificity phosphatase 1

And now the SYMBOL column has the gene symbols

ADD REPLY
2
Entering edit mode

Or if you just want the symbols without dealing with all the other stuff, you can use the functions from AnnotationDbi directly.

> library(hgu95av2.db)
>  zz <- select(hgu95av2.db, row.names(z), "SYMBOL")
'select()' returned 1:many mapping between keys and columns

## but note the message - some probeset IDs map to multiple things!

> zzz <- mapIds(hgu95av2.db, row.names(z), "SYMBOL","PROBEID", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> table(sapply(zzz, length))

    1     2     3     4     5     6     7     8     9    10    14    16    21 
12177   325    60    22    10    11     4     5     1     1     1     1     2 
   22 
    5 
> zzz[sapply(zzz, length) > 16L]
$`31498_f_at`
 [1] "GAGE6"   "GAGE1"   "GAGE2C"  "GAGE4"   "GAGE5"   "GAGE7"   "GAGE12I"
 [8] "GAGE2E"  "GAGE2B"  "GAGE13"  "GAGE12G" "GAGE12J" "GAGE2D"  "GAGE12C"
[15] "GAGE12B" "GAGE12E" "GAGE12H" "GAGE2A"  "GAGE12F" "GAGE8"   "GAGE12D"
[22] "GAGE10" 

$`31954_f_at`
 [1] "GAGE1"   "GAGE2C"  "GAGE4"   "GAGE5"   "GAGE6"   "GAGE7"   "GAGE12I"
 [8] "GAGE2E"  "GAGE2B"  "GAGE13"  "GAGE12G" "GAGE12J" "GAGE2D"  "GAGE12C"
[15] "GAGE12B" "GAGE12E" "GAGE12H" "GAGE2A"  "GAGE12F" "GAGE8"   "GAGE12D"
[22] "GAGE10" 

$`31960_f_at`
 [1] "GAGE2C"  "GAGE1"   "GAGE4"   "GAGE5"   "GAGE6"   "GAGE7"   "GAGE12I"
 [8] "GAGE2E"  "GAGE2B"  "GAGE13"  "GAGE12G" "GAGE12J" "GAGE2D"  "GAGE12C"
[15] "GAGE12B" "GAGE12E" "GAGE12H" "GAGE2A"  "GAGE12F" "GAGE8"   "GAGE12D"

$`33671_f_at`
 [1] "GAGE4"   "GAGE1"   "GAGE2C"  "GAGE5"   "GAGE6"   "GAGE7"   "GAGE12I"
 [8] "GAGE2E"  "GAGE2B"  "GAGE13"  "GAGE12G" "GAGE12J" "GAGE2D"  "GAGE12C"
[15] "GAGE12B" "GAGE12E" "GAGE12H" "GAGE2A"  "GAGE12F" "GAGE8"   "GAGE12D"

$`33680_f_at`
 [1] "GAGE7"   "GAGE1"   "GAGE2C"  "GAGE4"   "GAGE5"   "GAGE6"   "GAGE12I"
 [8] "GAGE2E"  "GAGE2B"  "GAGE13"  "GAGE12G" "GAGE12J" "GAGE2D"  "GAGE12C"
[15] "GAGE12B" "GAGE12E" "GAGE12H" "GAGE2A"  "GAGE12F" "GAGE8"   "GAGE12D"
[22] "GAGE10" 

$`37065_f_at`
 [1] "GAGE5"   "GAGE1"   "GAGE2C"  "GAGE4"   "GAGE6"   "GAGE7"   "GAGE12I"
 [8] "GAGE2E"  "GAGE2B"  "GAGE13"  "GAGE12G" "GAGE12J" "GAGE2D"  "GAGE12C"
[15] "GAGE12B" "GAGE12E" "GAGE12H" "GAGE2A"  "GAGE12F" "GAGE8"   "GAGE12D"
[22] "GAGE10" 

$`657_at`
 [1] "PCDHGC3"  "PCDHGB4"  "PCDHGA8"  "PCDHGA12" "PCDHGC5"  "PCDHGC4" 
 [7] "PCDHGB7"  "PCDHGB6"  "PCDHGB5"  "PCDHGB3"  "PCDHGB2"  "PCDHGB1" 
[13] "PCDHGA11" "PCDHGA10" "PCDHGA9"  "PCDHGA7"  "PCDHGA6"  "PCDHGA5" 
[19] "PCDHGA4"  "PCDHGA3"  "PCDHGA2"  "PCDHGA1"

So there is a decision you have to make as to which symbol you want for the multiples. The simplest and most naive is to take the first one.

> head(as.data.frame(mapIds(hgu95av2.db, row.names(z), "SYMBOL","PROBEID")))
'select()' returned 1:many mapping between keys and columns
          mapIds(hgu95av2.db, row.names(z), "SYMBOL", "PROBEID")
1000_at                                                    MAPK3
1001_at                                                     TIE1
1002_f_at                                                CYP2C19
1003_s_at                                                  CXCR5
1004_at                                                    CXCR5
1005_at                                                    DUSP1
ADD REPLY
0
Entering edit mode

Thank you James ;)

I am attempting to extract all of the gene symbols from each GSM object on the assumption that these would be the names of the genes identified in the experiment and that I could then use Cytoscape in some capacity with them.

Is there a convenient method to extract all of the gene symbols from each GSM listed in the sampleData object?

ADD REPLY
0
Entering edit mode

What you are calling GSM objects are just individual Affy arrays. And each one was used to infer the transcript abundance for a given sample. And by 'infer' what I really mean is that somebody processed some total RNA in such a way that it could be used to hybridize to a bunch of 25-mers that were grown on a silicone wafer, after which a fluorophore was attached and then quantified by measuring the intensity with a CCD camera. The pixel intensity from the picture was then used to generate an 'expression value' which is a unitless thing, based on how bright a spot on a silicone wafer was, and is meant to be proportional to the amount of mRNA was originally in the sample.

That long-winded paragraph is meant to back me up when I say that there are no 'genes identified in the experiment'. All you get are some random numbers that partially reflect the amount of a given transcript that was in the original sample, and can only be used by comparing to the random numbers for the same gene that were generated (preferably) at the same time in the same lab using the same chip type. And the results for the comparison are log fold changes, which give you an estimate of how much the underlying transcript changed between the groups you compared. So you could use those GEO data to find the genes that appear to change expression between glioblastoma samples and controls, but you cannot do anything else, like 'identify genes'. That's not a thing.

If you want to compare groups, then you almost surely want to use the limma package, which has a user guide (google 'limma user guide'), which is very long and very complete, and you should read carefully.

ADD REPLY

Login before adding your answer.

Traffic: 702 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6