Question

GEOquery: Shifted data columns in phenotype data

rfb • 0


I am trying to use the GEOquery package to extract phenotype data, but values for some characteristics seem to be missing in some rows, causing the data to be shifted as shown below. Has anyone else seen this and found a workaround (that does not involve manually manipulating data), or is this simply an error in the raw data that GEOquery cannot handle?

> library(GEOquery)
> GSE44861=getGEO("GSE44861")
> x=pData(phenoData(GSE44861[[1]]))
> x[20:30,c("characteristics_ch1","characteristics_ch1.2","characteristics_ch1.3")]

           characteristics_ch1     characteristics_ch1.2     characteristics_ch1.3
GSM1092928      case_id: 11731    rs5995355 genotype: AA tissue: adjacent nontumor
GSM1092929      case_id: 11873    rs5995355 genotype: AG tissue: adjacent nontumor
GSM1092930      case_id: 11873    rs5995355 genotype: AG             tissue: Tumor
GSM1092931      case_id: 11918 tissue: adjacent nontumor                          
GSM1092932      case_id: 11918             tissue: Tumor           mir34a: -2.2859
GSM1092933      case_id: 12031    rs5995355 genotype: AG tissue: adjacent nontumor
GSM1092934      case_id: 12051    rs5995355 genotype: GG tissue: adjacent nontumor
GSM1092935      case_id: 12051    rs5995355 genotype: GG             tissue: Tumor
GSM1092936      case_id: 12076    rs5995355 genotype: GG tissue: adjacent nontumor
GSM1092937      case_id: 12124    rs5995355 genotype: AG tissue: adjacent nontumor
GSM1092938      case_id: 12124    rs5995355 genotype: AG             tissue: Tumor

geoquery getgeo • 2.2k views

score 1 · Answer 1 · 2016-01-20

I wouldn't blame GEOquery - it looks like there might have been a problem with the submission that the curators didn't catch.

sed -n '41p' GSE44861_series_matrix.txt | cut -f 20-30 | sed 's/\t/\n/g'
"rs5995355 genotype: AA"
"rs5995355 genotype: AA"
"rs5995355 genotype: AG"
"rs5995355 genotype: AG"
"tissue: adjacent nontumor"
"tissue: Tumor"
"rs5995355 genotype: AG"
"rs5995355 genotype: GG"
"rs5995355 genotype: GG"
"rs5995355 genotype: GG"
"rs5995355 genotype: AG"

sed -n '42p' GSE44861_series_matrix.txt | cut -f 20-30 | sed 's/\t/\n/g'
"tissue: Tumor"
"tissue: adjacent nontumor"
"tissue: adjacent nontumor"
"tissue: Tumor"
""
"mir34a: -2.2859"
"tissue: adjacent nontumor"
"tissue: adjacent nontumor"
"tissue: Tumor"
"tissue: adjacent nontumor"
"tissue: adjacent nontumor"

score 1 · Answer 2 · 2016-01-20

In addition to James' comments, I'll add a bit more detail.

When someone submits a GEO sample, the sample can have various "characteristics" associated with it. These are handled as tag:value pairs. When all samples within a GEO Series (which is just a collection of GEO samples, reformatted) have the same number of characteristics, we get a very nice phenoData slot, with each column of the data.frame representing a single one of these characteristics. However, GEO allows samples within the same GEO Series to have different numbers of characteristics. That leads to the (surprisingly common) case noted here. GEO simply stacks the characteristics for each sample into columns in order, so if a characteristic is present in some samples and not in others, the later values for the samples that lack it shift one column to the left.

This tag:value pair format was defined after I originally wrote GEOquery, so I did not handle these data as intelligently as possible. None of this helps you directly. I have an open issue to modify the parsing to deal with these data more effectively, but I just haven't gotten around to doing it.

https://github.com/seandavi/GEOquery/issues/25
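
Until the parsing changes, the shift can in principle be undone after the fact, because every cell still carries its own tag. The following is only a minimal sketch under stated assumptions: `realign_characteristics` is a hypothetical helper (not a GEOquery function), the characteristic columns are assumed to be the ones whose names start with `characteristics_ch1`, and the toy `pd` stands in for `pData(phenoData(GSE44861[[1]]))` using two rows from the question.

```r
# Toy stand-in for the real phenotype table: the second sample has no
# genotype entry, so its tissue value has slid one column to the left.
pd <- data.frame(
  characteristics_ch1   = c("case_id: 11873", "case_id: 11918"),
  characteristics_ch1.2 = c("rs5995355 genotype: AG", "tissue: adjacent nontumor"),
  characteristics_ch1.3 = c("tissue: Tumor", ""),
  row.names = c("GSM1092930", "GSM1092931"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rebuild one column per tag, with NA where a sample
# has no entry for that tag.
realign_characteristics <- function(pd) {
  cols  <- grep("^characteristics_ch1", colnames(pd), value = TRUE)
  cells <- as.matrix(pd[, cols, drop = FALSE])

  # Flatten column-major, remembering which sample each cell came from.
  sample <- rownames(cells)[row(cells)]
  cell   <- as.vector(cells)

  # Keep only cells that look like "tag: value"; split on the first ": ".
  keep   <- grepl(": ", cell, fixed = TRUE)
  tag    <- sub(": .*$", "", cell[keep])
  value  <- sub("^.*?: ", "", cell[keep], perl = TRUE)
  sample <- sample[keep]

  # Fill a sample-by-tag matrix by name, so position no longer matters.
  out <- matrix(NA_character_, nrow = nrow(pd), ncol = length(unique(tag)),
                dimnames = list(rownames(pd), unique(tag)))
  out[cbind(sample, tag)] <- value
  data.frame(out, check.names = FALSE, stringsAsFactors = FALSE)
}

realigned <- realign_characteristics(pd)
print(realigned)
```

If a sample lists the same tag twice, the later entry overwrites the earlier one in this sketch, which may or may not be what you want.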

score 0 · Answer 3 · 2016-01-20

rfb • 0

Alright, thanks. So my best course of action is to notify the curators and hope they fix it, or do some heavy manual editing?

score 0 · Answer 4 · 2016-01-21

Thanks for elaborating, Sean. So as I understand it, it becomes quite difficult to keep track of which characteristics belong to which individuals, because the columns are just pasted together without NA values being inserted for missing entries.

I wrote a small function that reads the individual SOFT files, based on a list of individual GEO accession numbers, and converts the characteristics into a data.frame. Hopefully that should suffice until you update GEOquery with a more elegant solution.

library(GEOquery)

# Build a data.frame of sample characteristics, one row per GSM accession.
# Each "tag: value" entry becomes a column named after the tag; tags that a
# sample lacks stay NA, so the columns remain aligned.
convert_geo_meta=function(x){
  out_table=data.frame("geo_accession"=x,stringsAsFactors=FALSE)
  for(id in out_table$geo_accession){
    geo=getGEO(id)  # downloads and parses the SOFT file for one sample
    characteristics=unlist(Meta(geo)[["characteristics_ch1"]])
    characteristics2=strsplit(characteristics,": ")

    for(i in seq_along(characteristics)){
      tag=characteristics2[[i]][1]
      value=characteristics2[[i]][2]  # NA when the entry has no value
      out_table[[tag]][out_table$geo_accession==id]=value
    }
  }
  return(out_table)
}

> xx=c("GSM1092919","GSM1092920","GSM1092931")
> convert_geo_meta(xx)

  geo_accession case_id microarray batch rs5995355 genotype            tissue  mir34a    mir34bc
1    GSM1092919   11275                A                 GG             Tumor -0.8128 -6.9421665
2    GSM1092920   11303                A                 AG adjacent nontumor    <NA>       <NA>
3    GSM1092931   11918                A               <NA> adjacent nontumor    <NA>       <NA>
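
Fetching samples one at a time means one download per accession. A possible variant, sketched here under assumptions, is to fetch the whole series once in SOFT form with `getGEO(..., GSEMatrix = FALSE)` and read the characteristics from each entry of `GSMList()`. The helper name `characteristics_table` is hypothetical, the toy `ch_list` reuses two samples from the question, and the network-dependent lines are left commented out.

```r
# Hypothetical helper: turn a named list of "tag: value" vectors (one vector
# per sample) into a data.frame with one row per sample, one column per tag.
characteristics_table <- function(ch_list) {
  rows <- lapply(ch_list, function(ch) {
    ch <- ch[grepl(": ", ch, fixed = TRUE)]       # drop entries without a tag
    setNames(sub("^.*?: ", "", ch, perl = TRUE),  # value: after the first ": "
             sub(": .*$", "", ch))                # tag: before the first ": "
  })
  tags <- unique(unlist(lapply(rows, names)))
  # Look every tag up by name; tags a sample lacks come back as NA.
  mat <- do.call(rbind, lapply(rows, function(r) unname(r[tags])))
  colnames(mat) <- tags
  data.frame(geo_accession = names(ch_list), mat,
             check.names = FALSE, stringsAsFactors = FALSE, row.names = NULL)
}

# Toy input taken from two samples shown in the question.
ch_list <- list(
  GSM1092930 = c("case_id: 11873", "rs5995355 genotype: AG", "tissue: Tumor"),
  GSM1092931 = c("case_id: 11918", "tissue: adjacent nontumor")
)
print(characteristics_table(ch_list))

# With network access, the real input could be built like this (not run):
# library(GEOquery)
# gse <- getGEO("GSE44861", GSEMatrix = FALSE)
# ch_list <- lapply(GSMList(gse), function(gsm) Meta(gsm)$characteristics_ch1)
```

Splitting on the first ": " only keeps values that themselves contain a colon intact, which the `strsplit` approach would truncate.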