GEOquery: series matrix not parsing correctly
andy_shaps ▴ 10
@andy_shaps-20648
Last seen 5.6 years ago

Hi all,

I have been developing a script that pulls data from GEO, formats it and outputs user-defined subgroups. However, when I call getGEO, every now and then I get a series that isn't parsed correctly, which produces a badly formed expression matrix.

An example can be seen below:

data <- getGEO("GSE2193")

> Found 5 file(s)
> GSE2193-GPL1823_series_matrix.txt.gz
> Using locally cached version: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1823_series_matrix.txt.gz
> Parsed with column specification: cols(.default = col_double())
> See spec(...) for full column specifications.
> Using locally cached version of GPL1823 found here: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1823.soft
> GSE2193-GPL1824_series_matrix.txt.gz
> Using locally cached version: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1824_series_matrix.txt.gz
> Parsed with column specification: cols(.default = col_double())
> See spec(...) for full column specifications.
> Using locally cached version of GPL1824 found here: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1824.soft
> GSE2193-GPL1825_series_matrix.txt.gz
> Using locally cached version: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1825_series_matrix.txt.gz
> Parsed with column specification: cols(.default = col_double())
> See spec(...) for full column specifications.
> Using locally cached version of GPL1825 found here: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1825.soft
> GSE2193-GPL1826_series_matrix.txt.gz
> Using locally cached version: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1826_series_matrix.txt.gz
> Parsed with column specification: cols(`1` = col_double(), `-.954` = col_double(), `.104` = col_double(), `-1.08` = col_double(), X5 = col_double(), `-1.6` = col_double(), X7 = col_double(), `-.14` = col_double(), `-.256` = col_double(), `.929` = col_double(), `.205` = col_double(), `-.939` = col_double())
> Using locally cached version of GPL1826 found here: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1826.soft
> GSE2193-GPL1827_series_matrix.txt.gz
> Using locally cached version: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GSE2193-GPL1827_series_matrix.txt.gz
> Parsed with column specification: cols(`1` = col_double(), `-.226` = col_double(), `.85` = col_double(), `.239` = col_double(), `.239_1` = col_double(), `.239_2` = col_double(), `.239_3` = col_double(), `.239_4` = col_double(), `.597` = col_double())
> Using locally cached version of GPL1827 found here: C:\Users\as3e15\AppData\Local\Temp\RtmpgXmEZZ/GPL1827.soft
> Warning messages:
> 1: Missing column names filled in: 'X3' [3], 'X12' [12], 'X24' [24]
> 2: Missing column names filled in: 'X11' [11], 'X25' [25], 'X28' [28], 'X29' [29], 'X33' [33], 'X36' [36], 'X40' [40]
> 3: Missing column names filled in: 'X2' [2], 'X8' [8], 'X9' [9], 'X11' [11], 'X19' [19], 'X20' [20], 'X24' [24], 'X25' [25], 'X26' [26], 'X31' [31]
> 4: Missing column names filled in: 'X5' [5], 'X7' [7]
> 5: Duplicated column names deduplicated: '.239' => '.239_1' [5], '.239' => '.239_2' [6], '.239' => '.239_3' [7], '.239' => '.239_4' [8]

As you can see, it doesn't appear to find the column names and so uses the first row of values as the header instead. This produces an expression matrix like the one below (note: the example is from only one of the platforms):

head(data[["GSE2193-GPL1823_series_matrix.txt.gz"]]@assayData[["exprs"]])

>   -1.587     X3  1.225  -.195 -1.002  1.519  1.894 -.881  -.354  -.463    X12  -.982  -.393   .047  -.268   .401
> 2 -0.771 -0.049     NA  0.353 -1.880 -0.785 -0.965    NA -1.866 -1.807     NA -1.936  0.062 -1.257 -0.663 -1.878
> 3 -1.753 -1.470  0.320 -0.499 -0.290 -1.026  1.175 1.396  1.291  1.032 -0.679 -1.995 -0.008 -0.525 -0.094  0.399
> 4  0.563  0.195  0.006  1.214  1.506  0.931  0.405 0.178  0.242  1.476  0.357 -0.226  0.588  0.549  1.129  0.008
> 5  1.292  0.864 -1.452  0.866 -0.298  0.492  2.379 2.310  0.012  0.784 -0.502  0.573  0.369 -0.171  0.259 -1.411
> 6 -1.004 -0.202     NA  0.617 -2.138 -0.436 -0.620    NA -0.428 -0.026     NA -0.844  0.686 -0.505 -0.353 -1.745
> 7  1.764  1.416     NA  0.325 -1.795 -1.535  4.634 6.749  0.313  0.427 -3.887  1.149 -1.297 -0.928 -1.351     NA

The downloaded .txt.gz files do seem to contain column headers (i.e. GSM###), but these aren't being parsed when the matrix is created. I have tried re-running after removing any cached versions, but with no success.
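For what it's worth, a check along these lines could flag the affected series from inside a script. It is only a sketch: `has_gsm_colnames` is a hypothetical helper name (not part of GEOquery), it assumes that every sample column should be named with a GSM accession, and the two matrices are made up for illustration.

```r
# Hypothetical helper (not part of GEOquery): TRUE when every column name
# of an expression matrix looks like a GEO sample accession (GSM + digits).
has_gsm_colnames <- function(mat) {
  cn <- colnames(mat)
  !is.null(cn) && length(cn) > 0 && all(grepl("^GSM[0-9]+$", cn))
}

# Made-up matrices for illustration only
good <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2,
               dimnames = list(c("1", "2"), c("GSM1", "GSM2")))
bad <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2,
              dimnames = list(c("2", "3"), c("-1.587", "X3")))

has_gsm_colnames(good)  # column names are accessions
has_gsm_colnames(bad)   # column names are values taken from the first row
```

For an ExpressionSet returned by getGEO, the matrix passed in would presumably come from `exprs()` on that object.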

Is this a bug?

I could find a workaround (i.e. saving the files locally, reading them in as text files and then reformatting), but I am hoping to avoid such a hassle.
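The workaround I have in mind would look roughly like the sketch below. `read_series_matrix` is a hypothetical helper, not a GEOquery function, and it assumes the usual series matrix layout: metadata lines starting with "!" followed by a tab-separated table whose header row begins with ID_REF. The small file written here is synthetic, with made-up values, so that the snippet runs without a download.

```r
# Hypothetical helper (not part of GEOquery): read the expression table
# from a locally saved series matrix file using base R only.
read_series_matrix <- function(path) {
  tab <- read.delim(
    path,
    comment.char = "!",       # skip metadata and table begin/end marker lines
    header = TRUE,            # first remaining line holds ID_REF and GSM names
    row.names = 1,            # ID_REF column becomes the row names
    check.names = FALSE,      # keep sample names exactly as written
    na.strings = c("", "NA", "null")  # assumed placeholders for missing values
  )
  as.matrix(tab)
}

# Tiny synthetic file mimicking the assumed layout (values are made up)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "!Series_title\t\"example\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM1\"\t\"GSM2\"",
  "1\t-1.587\t",
  "2\t-0.771\t-0.049",
  "!series_matrix_table_end"
), tmp)

m <- read_series_matrix(tmp)
colnames(m)
```

As far as I know, base R file connections read gzip-compressed files transparently, so the same call should work on the cached .txt.gz path, but I have not verified that for every file in this series.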

Many thanks,

Andy

> > sessionInfo()
> R version 3.5.3 (2019-03-11)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 8.1 x64 (build 9600)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] xml2_1.2.0  data.table_1.12.2  ggplot2_3.1.1  DT_0.5  shiny_1.3.2
> [6] plyr_1.8.4  GEOquery_2.50.5  Biobase_2.42.0  BiocGenerics_0.28.0
>
> loaded via a namespace (and not attached):
> [1] Rcpp_1.0.1  pillar_1.3.1  compiler_3.5.3  later_0.8.0  tools_3.5.3  digest_0.6.18
> [7] tibble_2.1.1  gtable_0.3.0  pkgconfig_2.0.2  rlang_0.3.4  rstudioapi_0.10  curl_3.3
> [13] withr_2.1.2  dplyr_0.8.0.1  htmlwidgets_1.3  hms_0.4.2  grid_3.5.3  tidyselect_0.2.5
> [19] glue_1.3.1  R6_2.4.0  limma_3.38.3  tidyr_0.8.3  readr_1.3.1  purrr_0.3.2
> [25] magrittr_1.5  scales_1.0.0  promises_1.0.1  htmltools_0.3.6  assertthat_0.2.1  colorspace_1.4-1
> [31] mime_0.6  xtable_1.8-4  httpuv_1.5.1  stringi_1.4.3  lazyeval_0.2.2  munsell_0.5.0
> [37] crayon_1.3.4