GEOquery: incomplete feature data from GPL soft file
@renaud-gaujoux-3125
Hi,

I am getting incorrect feature annotation data when loading a dataset that uses platform GPL4133. The feature data looks like this:

    > head(fData(eset)[, 1:2])
           ID  COL
    12     12  266
    NA   <NA> <NA>
    NA.1 <NA> <NA>
    15     15  266
    16     16  266
    NA.2 <NA> <NA>

This may also leave fewer features in the final expression matrix, if at some point it is restricted to the feature names that match the loaded annotation data.

The real issue seems to be that the soft file is badly formatted: some lines contain double quotes where there should not be any, e.g.:

    12 266 148 A_24_P66027 A_24_P66027 FALSE NM_004900 NM_004900 9582 APOBEC3B apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3B" Hs.226307 ...

Looking at how GEOquery loads the annotation soft files, we see that they are read using `quote="\""`. With a stray double quote in a field, this returns a messed-up data.frame, because everything up to the next double quote is read as a single quoted field.

So:

- the real issue should be solved by the GEO teams (in Cc), by regenerating the soft files after correcting the tools that dump them.
- a quicker option could be to patch GEOquery so that it copes with such files.

Sean, is it really necessary to use quote="\""? Using quote = '' works fine on this GPL at least (see example code below). Is there any GPL that uses quotes? Tab separation should be pretty robust.

Thank you.

Best,
Renaud

## example

    # load eset normally
    eset <- getGEO('GSE20690', destdir = '.')[[1L]]
    annotation(eset)
    dim(eset)

    # read soft file with quote=''
    gp <- readLines('GPL4133.soft')
    grep("!platform_table_begin", gp)
    grep("!platform_table_end", gp)
    # skip everything up to and including the begin marker (line 8578);
    # nrows excludes the header line and the end marker (line 53800)
    gpt <- read.table('GPL4133.soft', comment.char='', quote = '',
                      skip = 8578, nrows = 53800 - 8578 - 2,
                      header = TRUE, sep = "\t")
    dim(gpt)

## output (this is from a "fresh" 2nd-call download, hence the messages about cached files)

    > eset <- getGEO('GSE20690', destdir = '.')[[1L]]
    Found 1 file(s)
    GSE20690_series_matrix.txt.gz
    Using locally cached version: ./GSE20690_series_matrix.txt.gz
    Using locally cached version of GPL4133 found here:
    ./GPL4133.soft
    > annotation(eset)
    [1] "GPL4133"
    > dim(eset)
    Features  Samples
       43376       68
    >
    > # read soft file with quote=''
    > gp <- readLines('GPL4133.soft')
    > grep("!platform_table_begin", gp)
    [1] 8578
    > grep("!platform_table_end", gp)
    [1] 53800
    > gpt <- read.table('GPL4133.soft', comment.char='', quote = '',
    +                   skip = 8578, nrows = 53800 - 8578 - 2,
    +                   header = TRUE, sep = "\t")
    > dim(gpt)
    [1] 45220    20
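
The effect of the quote setting can also be reproduced without downloading anything. Below is a minimal sketch on a made-up miniature SOFT file: only the two `!platform_table_*` markers follow the real format, while the platform name, column names and rows are invented for illustration and are not taken from GPL4133. It also locates the table with `grep()` instead of hard-coding line numbers. It assumes that stray quotes happen to come in pairs, which may or may not be the case in the real file.

```r
# Made-up miniature SOFT file (hypothetical content, not real GPL4133 data).
# Rows 12 and 16 each carry one stray double quote in the description field.
soft <- c(
  "^PLATFORM = GPL_TOY",
  "!Platform_title = toy platform",
  "!platform_table_begin",
  "ID\tGENE_SYMBOL\tDESCRIPTION",
  "11\tGENE_A\tfirst description",
  "12\tAPOBEC3B\tcatalytic polypeptide-like 3B\"",
  "15\tGENE_B\tsecond description",
  "16\tGENE_C\tthird\" description",
  "!platform_table_end"
)

# Locate the table from its markers rather than hard-coding line numbers.
begin <- grep("!platform_table_begin", soft, fixed = TRUE)
end   <- grep("!platform_table_end", soft, fixed = TRUE)
table_lines <- soft[(begin + 1):(end - 1)]  # header line + data rows

# Parse the same lines with a given set of quoting characters.
read_toy <- function(quote) {
  read.table(text = table_lines, sep = "\t", header = TRUE,
             quote = quote, comment.char = "",
             stringsAsFactors = FALSE)
}

with_quote <- read_toy("\"")  # double quote treated as a quoting character
no_quote   <- read_toy("")    # quoting disabled

# Compare how many features survive under each setting.
print(dim(with_quote))
print(dim(no_quote))
print(no_quote$ID)
```

With quoting disabled, every data row comes back as its own feature. With `quote = "\""`, the text between the two stray quotes, line breaks and tabs included, is absorbed into a single field, so the rows in between no longer appear as separate features.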