Renaud Gaujoux
Hi,
I am getting incorrect feature annotation data when loading a dataset
that uses platform GPL4133.
The feature data looks like this:
head(fData(eset)[, 1:2])
       ID  COL
12     12  266
NA   <NA> <NA>
NA.1 <NA> <NA>
15     15  266
16     16  266
NA.2 <NA> <NA>
This possibly also results in fewer features in the final expression
matrix, if at some point it is restricted to the feature names that
match those in the loaded annotation data.
The real issue here seems to be that the SOFT file is badly formatted,
with some lines containing double quotes where there should be none:
12 266 148 A_24_P66027 A_24_P66027 FALSE NM_004900 NM_004900 9582
APOBEC3B apolipoprotein B mRNA editing enzyme, catalytic
polypeptide-like 3B" Hs.226307 ...
Looking at how GEOquery loads the annotation SOFT files, we see that
they are read using `quote="\""`, which clearly returns a corrupted
data.frame for this file.
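To illustrate the mechanism, here is a minimal sketch on a made-up table (the rows and column names are invented for the demonstration, they are not taken from GPL4133). With `quote = "\""`, a stray double quote opens a quoted field that runs until the next double quote, so the rows in between are merged into a single field. With `quote = ""`, every line is read as one row.

```r
# Toy tab-separated table with 10 data rows (invented data).
rows <- sprintf("%d\t266\tgene %d", 1:10, 1:10)
# Two stray double quotes, on rows 6 and 9, mimicking the malformed lines.
rows[6] <- "6\t266\tpolypeptide-like 3B\""
rows[9] <- "9\t266\tanother \"stray quote"
txt <- paste(c("ID\tCOL\tDESC", rows), collapse = "\n")

# Double quote treated as a quoting character.
bad <- read.table(text = txt, sep = "\t", header = TRUE, quote = "\"",
                  comment.char = "", stringsAsFactors = FALSE)

# Quoting disabled altogether.
good <- read.table(text = txt, sep = "\t", header = TRUE, quote = "",
                   comment.char = "", stringsAsFactors = FALSE)

nrow(good)  # all 10 rows
nrow(bad)   # expected to be fewer: rows between the stray quotes are merged
bad$ID      # the swallowed IDs are missing from this column
```

Feature names that were swallowed this way would then have no match in the annotation table, which would be consistent with the NA rows shown above.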
So:
- the real issue should be solved by the GEO team (in Cc), by fixing
the tools that dump the SOFT files and regenerating them.
- a quicker option could be to patch GEOquery so that it copes with
such files.
Sean, is it really necessary to use quote="\""? Using quote = '' works
fine on this GPL at least (see example code below).
Is there any GPL that uses quotes? Tab separation should be pretty
robust.
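One rough way to check whether a given file is affected is to look for lines with an odd number of double quotes. The sketch below uses a hypothetical helper, `find_unbalanced_quotes()`, which is not part of GEOquery, and made-up input lines.

```r
# Hypothetical helper (not a GEOquery function): return the indices of the
# lines that contain an odd number of double quotes, i.e. an unbalanced quote.
find_unbalanced_quotes <- function(lines) {
  # gregexpr() returns -1 when there is no match, so count positive hits only
  n_quotes <- vapply(gregexpr("\"", lines, fixed = TRUE),
                     function(m) sum(m > 0L), integer(1))
  which(n_quotes %% 2L == 1L)
}

# Invented lines: only the second one has an unbalanced quote.
lines <- c("ID\tCOL\tDESC",
           "12\t266\tpolypeptide-like 3B\"",
           "13\t266\tsome gene",
           "14\t266\t\"properly quoted\"")
find_unbalanced_quotes(lines)
# [1] 2
```

On a real file this could be called as `find_unbalanced_quotes(readLines('GPL4133.soft'))`. It only flags candidate lines and says nothing about whether a platform uses quotes legitimately.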
Thank you.
Best,
Renaud
## example
# load eset normally
eset <- getGEO('GSE20690', destdir = '.')[[1L]]
annotation(eset)
dim(eset)
# read soft file with quote=''
gp <- readLines('GPL4133.soft')
grep("!platform_table_begin", gp)
grep("!platform_table_end", gp)
gpt <- read.table('GPL4133.soft', comment.char = '', quote = '',
                  skip = 8578, nrows = 53800 - 8578 - 2,
                  header = TRUE, sep = "\t")
dim(gpt)
## output (this is from a "fresh" 2nd-call download, hence the
## messages about cached files)
> eset <- getGEO('GSE20690', destdir = '.')[[1L]]
Found 1 file(s)
GSE20690_series_matrix.txt.gz
Using locally cached version: ./GSE20690_series_matrix.txt.gz
Using locally cached version of GPL4133 found here:
./GPL4133.soft
> annotation(eset)
[1] "GPL4133"
> dim(eset)
Features Samples
43376 68
>
> # read soft file with quote=''
> gp <- readLines('GPL4133.soft')
> grep("!platform_table_begin", gp)
[1] 8578
> grep("!platform_table_end", gp)
[1] 53800
> gpt <- read.table('GPL4133.soft', comment.char = '', quote = '',
+                   skip = 8578, nrows = 53800 - 8578 - 2,
+                   header = TRUE, sep = "\t")
> dim(gpt)
[1] 45220 20
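The hard-coded offsets above could also be derived from the table markers. This is only a sketch under assumptions: `read_platform_table()` is a hypothetical helper, not a GEOquery function, and the toy file below (including the `GPLxxxx` placeholder) is invented to keep the example self-contained.

```r
# Hypothetical helper (not a GEOquery function): read the platform table of a
# SOFT file with quoting disabled, locating the markers instead of
# hard-coding line numbers.
read_platform_table <- function(path) {
  lines <- readLines(path)
  begin <- grep("!platform_table_begin", lines, fixed = TRUE)[1]
  end   <- grep("!platform_table_end", lines, fixed = TRUE)[1]
  stopifnot(!is.na(begin), !is.na(end), end > begin + 2L)
  read.table(path, sep = "\t", header = TRUE, quote = "", comment.char = "",
             skip = begin,              # header is the line after the begin marker
             nrows = end - begin - 2L,  # rows between the header and the end marker
             stringsAsFactors = FALSE)
}

# Toy SOFT-like file standing in for GPL4133.soft (invented content).
soft <- tempfile(fileext = ".soft")
writeLines(c("^PLATFORM = GPLxxxx",
             "!platform_table_begin",
             "ID\tCOL\tDESC",
             "12\t266\tpolypeptide-like 3B\"",
             "13\t266\tsome gene",
             "!platform_table_end"), soft)
gpt <- read_platform_table(soft)
dim(gpt)
# [1] 2 3
```

The `skip` and `nrows` arithmetic is the same as in the example above (`skip = 8578`, `nrows = 53800 - 8578 - 2`), just computed from the `grep()` results.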
###