Question: Using eset while parsing a GSE file located in a path
mahm wrote:

I get the following error while trying to use pData:

 > eset2 <- getGEO(filename = "GSE15543_family.soft.gz")

Reading file....
Parsing....
Found 34 entities...
GPL570 (1 of 35 entities)
GSM388740 (2 of 35 entities)
GSM388741 (3 of 35 entities)
GSM388742 (4 of 35 entities)
GSM388743 (5 of 35 entities)
GSM388744 (6 of 35 entities)
GSM388745 (7 of 35 entities)
GSM388746 (8 of 35 entities)
GSM388747 (9 of 35 entities)
GSM388748 (10 of 35 entities)
GSM388749 (11 of 35 entities)
GSM388750 (12 of 35 entities)
GSM388751 (13 of 35 entities)
GSM388752 (14 of 35 entities)
GSM388753 (15 of 35 entities)
GSM388754 (16 of 35 entities)
GSM388755 (17 of 35 entities)
GSM388756 (18 of 35 entities)
GSM388757 (19 of 35 entities)
GSM388758 (20 of 35 entities)
GSM388759 (21 of 35 entities)
GSM388760 (22 of 35 entities)
GSM388761 (23 of 35 entities)
GSM388762 (24 of 35 entities)
GSM388763 (25 of 35 entities)
GSM388764 (26 of 35 entities)
GSM388765 (27 of 35 entities)
GSM388766 (28 of 35 entities)
GSM388767 (29 of 35 entities)
GSM388768 (30 of 35 entities)
GSM388769 (31 of 35 entities)
GSM388770 (32 of 35 entities)
GSM388771 (33 of 35 entities)
GSM388772 (34 of 35 entities)

> pData(eset2)
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘pData’ for signature ‘"GSE"’
> pData(eset2[[1]])
Error in eset2[[1]] : this S4 class is not subsettable

The above commands work fine when eset2 <- getGEO('GSE15543') is used instead. Any suggestions on how to resolve the error?

Also, while using eset2 <- getGEO('GSE15543'), the file is downloaded every time rather than being read from a cache. On the terminal, I use $ Rscript file.R to run the script.


modified 14 months ago by Sean Davis • written 14 months ago by mahm
Answer: Using eset while parsing a GSE file located in a path
Sean Davis (United States) wrote:

Parsing a SOFT format GSE will result in a GSE object. It is up to you, then, to parse out the information as needed. See https://bioconductor.org/packages/release/bioc/vignettes/GEOquery/inst/doc/GEOquery.html#converting-gse-to-an-expressionset.

The GSE SOFT parsing functionality is historical and predates the availability of so-called SeriesMatrix format files. The preferred approach these days is simply to use getGEO('GSE15543'), or to use the SeriesMatrix file(s) directly.
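For reference, the vignette's GSE-to-ExpressionSet recipe looks roughly like this (a sketch, assuming the SOFT file from the question is present locally and that each GSM table has a numeric VALUE column):

```r
library(GEOquery)

## Parsing the SOFT file yields a GSE container, not an ExpressionSet --
## which is why pData() has no method for it
gse <- getGEO(filename = "GSE15543_family.soft.gz")

gsmlist <- GSMList(gse)
gpl     <- GPLList(gse)[[1]]

## Build the expression matrix: one column per GSM, rows matched to the
## platform's probe IDs
probes <- Table(gpl)$ID
mat <- do.call(cbind, lapply(gsmlist, function(gsm) {
  tab <- Table(gsm)
  as.numeric(tab$VALUE[match(probes, tab$ID_REF)])
}))
rownames(mat) <- as.character(probes)
colnames(mat) <- names(gsmlist)

## Wrap it in an ExpressionSet; pData() now returns one row per sample
eset <- Biobase::ExpressionSet(assayData = mat)
Biobase::pData(eset)
```

This is the same matrix-assembly idea shown in the vignette section linked above; the exact column to extract depends on the platform.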

written 14 months ago by Sean Davis

I prefer using getGEO('GSE15543'). The trouble I face is that every time I run the R script (i.e. using Rscript file.R), the file GSE15543 gets downloaded again. Why does this happen? To avoid this, I was trying to download and save the file. When I use getGEO('GSE15543') directly in the terminal, the cached version is fetched.

Excuse me for the naive questions

modified 14 months ago • written 14 months ago by mahm

See the help for getGEO. In particular, if you set `destdir` in your script, the files will be downloaded only once; on subsequent runs the cached file is used. If you do not set `destdir`, a temporary directory is used, and that temporary directory only lasts as long as the R process runs.
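A minimal sketch of a script using a persistent cache (the directory path here is just an example; any writable location works):

```r
library(GEOquery)

## Example cache location -- substitute any writable directory
cache_dir <- "~/geo_cache"
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

## The first run downloads GSE15543_series_matrix.txt.gz (and the GPL570
## annotation) into cache_dir; later runs, including via `Rscript file.R`,
## find the files there and skip the download.
eset <- getGEO("GSE15543", destdir = cache_dir)[[1]]

## Without destdir, getGEO() writes into tempdir(), which is deleted when
## the R process exits -- hence the repeated downloads between script runs.
```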

modified 14 months ago • written 14 months ago by Sean Davis

Probably a really silly question. I did as you advised:

eset <- getGEO('GSE15543', destdir = "/media/nat/entain/toy3/Bioconductor/data/")[[1]]

The cached file is fetched from the destination directory, but unfortunately I am not able to obtain the output in the terminal:

nat@nat-HP-Pavilion-Notebook:/media/nat/entain/toy3/Bioconductor$ Rscript new.R
Loading required package: BiocGenerics
Loading required package: methods
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
Found 1 file(s)
GSE15543_series_matrix.txt.gz
Using locally cached version: /media/nat/entain/toy3/Bioconductor/data//GSE15543_series_matrix.txt.gz
Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
Using locally cached version of GPL570 found here:
/media/nat/entain/toy3/Bioconductor/data//GPL570.soft
^C
Execution halted

I had to exit after waiting for quite some time. When the destination directory is not specified, the script runs fine. Please find the link to my code here.

I'm not sure what went wrong.

Many thanks


modified 14 months ago • written 14 months ago by mahm

Looks like GEOquery is working correctly -- not sure why it is slower. You could try using a smaller GSE (GSE20, maybe) to see if the process completes. If you are running in two different settings, file system performance and memory limitations may impact the speed of processing, but now I am speculating.

written 14 months ago by Sean Davis

Sean, yes, the smaller file works!

GSE20_series_matrix.txt.gz
Using locally cached version: /media/nat/entain/toy3/Bioconductor/data//GSE20_series_matrix.txt.gz
Parsed with column specification:
cols(
  ID_REF = col_integer(),
  GSM866 = col_double(),
  GSM867 = col_double(),
  GSM869 = col_double(),
  GSM870 = col_double(),
  GSM871 = col_double(),
  GSM885 = col_double(),
  GSM886 = col_double(),
  GSM894 = col_double()
)

I would like to ask for suggestions on what has to be done for larger files.

Many thanks for your time and attention.


written 14 months ago by mahm

The behavior you are seeing appears to be specific to your environment, so I do not really have a good set of suggestions other than to experiment. You might benefit from getting some local IT support to work with you. 

written 14 months ago by Sean Davis

What kind of a device is at /media/nat? Does it have enough space both for the file and for intermediate storage? Do you have full read / write permissions at that location? Please also post your sessionInfo(), and confirm that BiocInstaller::biocValid() reports that packages are up-to-date.

modified 14 months ago • written 14 months ago by Martin Morgan


/media/nat is a directory on my laptop (type: inode/directory). I have full read and write permissions at that location. It has a total capacity of 212 GB, of which 20 GB is used.

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_IN       LC_NUMERIC=C         LC_TIME=en_IN       
 [4] LC_COLLATE=en_IN     LC_MONETARY=en_IN    LC_MESSAGES=en_IN   
 [7] LC_PAPER=en_IN       LC_NAME=C            LC_ADDRESS=C        
[10] LC_TELEPHONE=C       LC_MEASUREMENT=en_IN LC_IDENTIFICATION=C 

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] dplyr_0.7.6          bindrcpp_0.2.2       GEOquery_2.48.0     
 [4] biomaRt_2.36.1       limma_3.36.3         gcrma_2.52.0        
 [7] affy_1.58.0          Biobase_2.40.0       BiocGenerics_0.26.0 
[10] BiocInstaller_1.30.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18          pillar_1.3.0          bindr_0.1.1          
 [4] compiler_3.5.1        XVector_0.20.0        prettyunits_1.0.2    
 [7] bitops_1.0-6          tools_3.5.1           zlibbioc_1.26.0      
[10] progress_1.2.0        digest_0.6.16         bit_1.1-14           
[13] tibble_1.4.2          RSQLite_2.1.1         memoise_1.1.0        
[16] preprocessCore_1.42.0 pkgconfig_2.0.2       rlang_0.2.2          
[19] cli_1.0.0             DBI_1.0.0             curl_3.2             
[22] xml2_1.2.0            stringr_1.3.1         httr_1.3.1           
[25] Biostrings_2.48.0     S4Vectors_0.18.3      IRanges_2.14.11      
[28] hms_0.4.2             tidyselect_0.2.4      stats4_3.5.1         
[31] bit64_0.9-7           glue_1.3.0            R6_2.2.2             
[34] fansi_0.3.0           AnnotationDbi_1.42.1  XML_3.98-1.16        
[37] tidyr_0.8.1           readr_1.1.1           purrr_0.2.5          
[40] blob_1.1.1            magrittr_1.5          splines_3.5.1        
[43] assertthat_0.2.0      utf8_1.1.4            stringi_1.2.4        
[46] RCurl_1.95-4.11       crayon_1.3.4          affyio_1.50.0  


written 14 months ago by mahm

I asked about /media/ because 'usually' this is where one mounts removable media like thumb drives and so on. I guess when you reported available space it was from a command like `df -h /media/nat/entain/toy3/Bioconductor/data`.

You could try to debug by using `debug()` and then stepping through individual functions until something goes wrong; here's my session. I start by debugging some likely functions, one exported, the other not:

> debug(parseGEO)
> debug(GEOquery:::parseGPL)

I then evaluate the call, where 'fl' is the directory holding my cached file. The function runs until it calls parseGEO:

> gseEset2 <- getGEO('GSE76896', destdir=fl)[[1]]
Found 1 file(s)
GSE76896_series_matrix.txt.gz
Using locally cached version: /tmp/Rtmpe4NWji/file4e3d1576c470/GSE76896_series_matrix.txt.gz
Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
|=================================================================| 100%   84 MB
Using locally cached version of GPL570 found here:
/tmp/Rtmpe4NWji/file4e3d1576c470/GPL570.soft
debugging in: parseGEO(filename, GSElimits, destdir, AnnotGPL = AnnotGPL, getGPL = getGPL)
debug: {
    con <- fileOpen(fname)
    first.entity <- findFirstEntity(con)
    close(con)
    ret <- switch(as.character(first.entity[1]), sample = {
        parseGSM(fname)
    }, series = parseGSE(fname, GSElimits), dataset = {
        parseGDS(fname)
    }, platform = {
        parseGPL(fname)
    }, `0` = {
        parseGSEMatrix(fname, destdir = destdir, AnnotGPL = AnnotGPL,
            getGPL = getGPL)$eset
    }, )
    return(ret)
}

I then look at the help page ?browser and type 'n' to go to the next line of code

Browse[2]> n
debug: con <- fileOpen(fname)
Browse[2]> n
debug: first.entity <- findFirstEntity(con)
Browse[2]> n
debug: close(con)
Browse[2]> n
debug: ret <- switch(as.character(first.entity[1]), sample = {
    parseGSM(fname)
}, series = parseGSE(fname, GSElimits), dataset = {
    parseGDS(fname)
}, platform = {
    parseGPL(fname)
}, `0` = {
    parseGSEMatrix(fname, destdir = destdir, AnnotGPL = AnnotGPL,
        getGPL = getGPL)$eset
}, )
Browse[2]> n

until I get to the next place I'd put a debug call, and then step through some more, eventually exiting

debug: parseGPL(fname)
Browse[2]> n
debugging in: parseGPL(fname)
debug: {
    if (getOption("GEOquery.inmemory.gpl")) {
        info <- file.info(fname, extra_cols = FALSE)
        cache <- get0(fname, envir = GPLcache, inherits = FALSE)
        if (!is.null(cache) && cache$info$mtime == info$mtime) {
            message("Using GPL object found in memory from locally cached version.")
            return(cache$gpl)
        }
    }
    txt = read_lines(fname)
    txt = txt[txt != ""]
    return(.parseGPLTxt(txt))
}
Browse[3]> n
debug: if (getOption("GEOquery.inmemory.gpl")) {
    info <- file.info(fname, extra_cols = FALSE)
    cache <- get0(fname, envir = GPLcache, inherits = FALSE)
    if (!is.null(cache) && cache$info$mtime == info$mtime) {
        message("Using GPL object found in memory from locally cached version.")
        return(cache$gpl)
    }
}
Browse[3]> n
debug: txt = read_lines(fname)
Browse[3]> n
|=================================================================| 100%   80 MB
debug: txt = txt[txt != ""]
Browse[3]> n
debug: return(.parseGPLTxt(txt))
Browse[3]> n
|=================================================================| 100%   75 MB
exiting from: parseGPL(fname)
debug: return(ret)
Browse[2]> n
exiting from: parseGEO(filename, GSElimits, destdir, AnnotGPL = AnnotGPL, getGPL = getGPL)

At some point your own code will freeze. You can then break, debug() that function, and try again. Repeat until you get to what seems like a basic function (e.g., read_csv) that seems like it 'should' work; you can print out things like variables while at the Browse[n]> prompt, so occasionally you will want to do a 'sanity check'.

written 14 months ago by Martin Morgan
Hi, this is the output of df -h:

Filesystem      Size  Used Avail Use% Mounted on
udev            3.9G     0  3.9G   0% /dev
tmpfs           787M  9.5M  777M   2% /run
/dev/sda4        97G   12G   81G  13% /
tmpfs           3.9G  7.5M  3.9G   1% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs           3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda1       256M   98M  159M  38% /boot/efi
cgmfs           100K     0  100K   0% /run/cgmanager/fs
tmpfs           787M   44K  787M   1% /run/user/1000
/dev/sda5       147G   33G  114G  23% /media/nat/New Volume
/dev/sda6       199G   19G  180G  10% /media/nat/entain
/dev/sda3       474G  184G  290G  39% /media/nat/Windows

 

I am following what you mentioned to debug,

> debug(parseGEO)
> debug(GEOquery::parseGPL)
Error: 'parseGPL' is not an exported object from 'namespace:GEOquery'
> debug(GEOquery:::parseGPL)
> gseEset2 <- getGEO('GSE76896', destdir = "/media/nat/entain/toy3/Bioconductor/data/")[[1]]
Found 1 file(s)
GSE76896_series_matrix.txt.gz
Using locally cached version: /media/nat/entain/toy3/Bioconductor/data//GSE76896_series_matrix.txt.gz
Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
|=================================================================| 100%   84 MB
Using locally cached version of GPL570 found here:
/media/nat/entain/toy3/Bioconductor/data//GPL570.soft 
debugging in: parseGEO(filename, GSElimits, destdir, AnnotGPL = AnnotGPL, getGPL = getGPL)
debug: {
    con <- fileOpen(fname)
    first.entity <- findFirstEntity(con)
    close(con)
    ret <- switch(as.character(first.entity[1]), sample = {
        parseGSM(fname)
    }, series = parseGSE(fname, GSElimits), dataset = {
        parseGDS(fname)
    }, platform = {
        parseGPL(fname)
    }, `0` = {
        parseGSEMatrix(fname, destdir = destdir, AnnotGPL = AnnotGPL, 
            getGPL = getGPL)$eset
    }, )
    return(ret)
}
Browse[2]> n
debug: con <- fileOpen(fname)
Browse[2]> n
debug: first.entity <- findFirstEntity(con)
Browse[2]> n

It freezes at this point.

Sorry for being dumb, but I couldn't exactly understand how "You can then break, debug() that function, and try again" is to be done.

modified 14 months ago • written 14 months ago by mahm

So 'break' with Ctrl-C or, when at the Browse[2]> prompt, type Q. I'd then

> undebug(GEOquery:::parseGEO)
> debug(GEOquery:::findFirstEntity)
> gseEset2 <- getGEO('GSE76896', destdir = "/media/nat/entain/toy3/Bioconductor/data/")[[1]]

and then start using the browser again. Here's the start of the function definition

Browse[2]> findFirstEntity
function (con) 
{
    while (TRUE) {
        line <- suppressWarnings(readLines(con, 100))
        entity.line <- grep("^\\^(DATASET|SAMPLE|SERIES|PLATFORM|ANNOTATION)", 
            line, ignore.case = TRUE, value = TRUE, perl = TRUE)

I'd look at the variables that are defined and make sure they make sense

Browse[2]> ls()
[1] "con"
Browse[2]> con
A connection with                                                          
description "/tmp/RtmpvmIdit/file63845b26756e/GPL570.soft"
class       "file"                                        
mode        "r"                                           
text        "text"                                        
opened      "opened"                                      
can read    "yes"                                         
can write   "no"                                          
Browse[2]> file.exists(summary(con)$description)
[1] TRUE

and then step through, checking as I go

Browse[2]> n
debug: while (TRUE) {
    line <- suppressWarnings(readLines(con, 100))
...
Browse[2]> 
debug: line <- suppressWarnings(readLines(con, 100))
Browse[2]> 
debug: entity.line <- grep("^\\^(DATASET|SAMPLE|SERIES|PLATFORM|ANNOTATION)", 
    line, ignore.case = TRUE, value = TRUE, perl = TRUE)
Browse[2]> head(line)
[1] "^PLATFORM = GPL570"                                                            
[2] "!Platform_title = [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array"
[3] "!Platform_geo_accession = GPL570"                                              
[4] "!Platform_status = Public on Nov 07 2003"                                      
[5] "!Platform_submission_date = Nov 07 2003"                                       
[6] "!Platform_last_update_date = Aug 09 2018"                                   

etc. If that readLines() result doesn't look right, I'd start again (Browse[2]> Q, then invoke the function again) and this time, instead of using 'n' to step through that line of code, I'd try to evaluate the line in the browser without the suppressWarnings():

Browse[2]> Q
> seEset2 <- getGEO('GSE76896', destdir = fl)
...
debugging in: findFirstEntity(con)
debug: {
    while (TRUE) {
        line <- suppressWarnings(readLines(con, 100))
...
Browse[2]> n
debug: while (TRUE) {
    line <- suppressWarnings(readLines(con, 100))
...
Browse[2]> 
debug: line <- suppressWarnings(readLines(con, 100))
Browse[2]> head(readLines(con, 100))
[1] "^PLATFORM = GPL570"                                                            
[2] "!Platform_title = [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array"
[3] "!Platform_geo_accession = GPL570"                                              
[4] "!Platform_status = Public on Nov 07 2003"                                      
[5] "!Platform_submission_date = Nov 07 2003"                                       
[6] "!Platform_last_update_date = Aug 09 2018"                                      
Warning message:
In eval(formal.args[[as.character(substitute(arg))]], envir = sys.frame(sysP)) :
  closing unused connection 3 (/tmp/RtmpvmIdit/file63845b26756e/GPL570.soft)
Browse[2]> 

You'll have to use this approach to narrow down as much as possible what the problem is; if it is with readLines(), then I wouldn't go further -- readLines is from base R, and is likely to 'work'.

The file should be a plain text file, so you should be able to open it outside R using a plain text editor. It might help to do the same exercise but storing on a different part of your file system, e.g., '/tmp', and comparing things like file.info().
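For instance, a base-R way to do that comparison looks something like this (written here against throwaway files so it is self-contained; in practice, point the two paths at the cached GPL570.soft under /media/... and under /tmp):

```r
## Illustrative only: create the same small file in two places and compare
## metadata, as one would for the two cache locations
f1 <- tempfile(fileext = ".soft")
writeLines("^PLATFORM = GPL570", f1)
f2 <- file.path(tempdir(), "GPL570-copy.soft")
file.copy(f1, f2, overwrite = TRUE)

## file.info() reports size, type, permissions, and timestamps; differences
## between the two locations (e.g. size 0, odd mode) would be a clue
info <- file.info(c(f1, f2), extra_cols = FALSE)
info[, c("size", "isdir", "mode", "mtime")]
```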

modified 14 months ago • written 14 months ago by Martin Morgan
Powered by Biostar version 16.09