Question: GEOquery: Shifted data columns in phenotype data
rfb wrote, 3.9 years ago:

I am trying to use the GEOquery package to extract phenotype data, but values for some characteristics seem to be missing in some rows, which shifts the remaining data into the wrong columns, as shown below. Has anyone else seen this and does anyone know of a workaround (one that does not involve manually manipulating the data), or is this simply an error in the raw data that GEOquery cannot handle?

 

>library(GEOquery)
>GSE44861=getGEO("GSE44861")
>x=pData(phenoData(GSE44861[[1]]))
>x[20:30,c("characteristics_ch1","characteristics_ch1.2","characteristics_ch1.3")]

           characteristics_ch1     characteristics_ch1.2     characteristics_ch1.3
GSM1092928      case_id: 11731    rs5995355 genotype: AA tissue: adjacent nontumor
GSM1092929      case_id: 11873    rs5995355 genotype: AG tissue: adjacent nontumor
GSM1092930      case_id: 11873    rs5995355 genotype: AG             tissue: Tumor
GSM1092931      case_id: 11918 tissue: adjacent nontumor                          
GSM1092932      case_id: 11918             tissue: Tumor           mir34a: -2.2859
GSM1092933      case_id: 12031    rs5995355 genotype: AG tissue: adjacent nontumor
GSM1092934      case_id: 12051    rs5995355 genotype: GG tissue: adjacent nontumor
GSM1092935      case_id: 12051    rs5995355 genotype: GG             tissue: Tumor
GSM1092936      case_id: 12076    rs5995355 genotype: GG tissue: adjacent nontumor
GSM1092937      case_id: 12124    rs5995355 genotype: AG tissue: adjacent nontumor
GSM1092938      case_id: 12124    rs5995355 genotype: AG             tissue: Tumor
Answer: GEOquery: Shifted data columns in phenotype data
James W. MacDonald (United States) wrote, 3.9 years ago:

I wouldn't blame GEOquery - it looks like there might have been a problem with the submission that the curators didn't catch.

sed -n '41p' GSE44861_series_matrix.txt | cut -f 20-30 | sed 's/\t/\n/g'
"rs5995355 genotype: AA"
"rs5995355 genotype: AA"
"rs5995355 genotype: AG"
"rs5995355 genotype: AG"
"tissue: adjacent nontumor"
"tissue: Tumor"
"rs5995355 genotype: AG"
"rs5995355 genotype: GG"
"rs5995355 genotype: GG"
"rs5995355 genotype: GG"
"rs5995355 genotype: AG"

sed -n '42p' GSE44861_series_matrix.txt | cut -f 20-30 | sed 's/\t/\n/g'
"tissue: Tumor"
"tissue: adjacent nontumor"
"tissue: adjacent nontumor"
"tissue: Tumor"
""
"mir34a: -2.2859"
"tissue: adjacent nontumor"
"tissue: adjacent nontumor"
"tissue: Tumor"
"tissue: adjacent nontumor"
"tissue: adjacent nontumor"
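
The same check can be sketched in R. The snippet below is only an illustration: the two vectors are copied from the output printed above, and `count_tags` is a hypothetical helper, not a GEOquery function. A row of the series matrix that holds more than one tag (or an empty entry) is a sign that some sample has fewer characteristics than the others.

```r
# Fields 20-30 of the two series-matrix rows, copied from the output above.
row41 <- c("rs5995355 genotype: AA", "rs5995355 genotype: AA",
           "rs5995355 genotype: AG", "rs5995355 genotype: AG",
           "tissue: adjacent nontumor", "tissue: Tumor",
           "rs5995355 genotype: AG", "rs5995355 genotype: GG",
           "rs5995355 genotype: GG", "rs5995355 genotype: GG",
           "rs5995355 genotype: AG")
row42 <- c("tissue: Tumor", "tissue: adjacent nontumor",
           "tissue: adjacent nontumor", "tissue: Tumor", "",
           "mir34a: -2.2859", "tissue: adjacent nontumor",
           "tissue: adjacent nontumor", "tissue: Tumor",
           "tissue: adjacent nontumor", "tissue: adjacent nontumor")

# Hypothetical helper: tabulate the tag (text before the first ": ").
# Entries without a "tag: value" shape are counted as NA.
count_tags <- function(entries) {
  has_pair <- grepl(": ", entries, fixed = TRUE)
  tag <- ifelse(has_pair, sub(": .*$", "", entries), NA)
  table(tag, useNA = "ifany")
}

tags41 <- count_tags(row41)
tags42 <- count_tags(row42)
print(tags41)
print(tags42)
```

Each table having more than one entry means the corresponding row mixes different characteristics.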
Answer: GEOquery: Shifted data columns in phenotype data
Sean Davis (United States) wrote, 3.9 years ago:

In addition to James' comments, I'll add a bit more detail.  

When someone submits a GEO sample, the sample can have various "characteristics" associated with it. These are handled as tag:value pairs. When all samples within a GEO Series (which is just a collection of GEO samples, reformatted) have the same number of characteristics, we get a very nice phenoData slot in which each column of the data.frame represents a single one of these characteristics. However, GEO allows samples within the same GEO Series to have different numbers of characteristics. That leads to the (surprisingly common) case noted here. GEO simply stacks the characteristics for each sample into columns in order, so if a characteristic is present in some samples and not in others, the later values for the samples that lack it end up shifted into the wrong columns.
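
Because each entry carries its own tag, the shifted columns can be re-keyed by tag rather than by position. The following is a minimal sketch of that idea, assuming every non-empty entry has the form `tag: value`. The toy data frame is copied from three of the rows shown in the question, and `realign_characteristics` is a hypothetical helper, not part of GEOquery.

```r
# Toy input copied from the question; in practice this would be the
# characteristics_ch1* columns of pData(phenoData(GSE44861[[1]])).
pd <- data.frame(
  characteristics_ch1   = c("case_id: 11873", "case_id: 11918", "case_id: 11918"),
  characteristics_ch1.2 = c("rs5995355 genotype: AG", "tissue: adjacent nontumor",
                            "tissue: Tumor"),
  characteristics_ch1.3 = c("tissue: Tumor", "", "mir34a: -2.2859"),
  row.names = c("GSM1092930", "GSM1092931", "GSM1092932"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one column per tag, leaving NA where a
# sample does not have that tag.
realign_characteristics <- function(pd) {
  cols <- grep("^characteristics_ch1", colnames(pd), value = TRUE)
  out <- data.frame(row.names = rownames(pd))
  for (col in cols) {
    entry <- as.character(pd[[col]])
    has_pair <- !is.na(entry) & grepl(": ", entry, fixed = TRUE)
    tag <- sub(": .*$", "", entry)                  # text before the first ": "
    value <- sub("^.*?: ", "", entry, perl = TRUE)  # text after the first ": "
    for (i in which(has_pair)) {
      if (is.null(out[[tag[i]]])) out[[tag[i]]] <- NA_character_
      out[i, tag[i]] <- value[i]
    }
  }
  out
}

aligned <- realign_characteristics(pd)
print(aligned)
```

All values come back as character strings, so numeric characteristics such as `mir34a` would still need an explicit `as.numeric()` afterwards.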

This tag:value pair format was defined after I originally wrote GEOquery, so I did not handle these data as intelligently as possible. None of this helps you directly. I have an open issue to modify the parsing so that it deals with these data more effectively, but I just haven't gotten around to doing it.

https://github.com/seandavi/GEOquery/issues/25

Answer: GEOquery: Shifted data columns in phenotype data
rfb wrote, 3.9 years ago:

Alright, thanks. So my best course of action is to notify the curators and hope they fix it, or do some heavy manual editing?

Answer: GEOquery: Shifted data columns in phenotype data
rfb wrote, 3.9 years ago:

Thanks for elaborating, Sean. So as I understand it, it becomes quite difficult to keep track of which characteristics belong to which samples if the columns are just pasted together without inserting NA values.

I wrote a small function that reads the individual SOFT files, based on a list of individual GEO accession numbers, and converts the characteristics into a data.frame. Hopefully that should suffice until you update GEOquery with a more elegant solution.

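library(GEOquery)

# Build one row per GSM accession with one column per characteristic tag.
# Tags that a sample lacks stay NA, so nothing shifts between columns.
convert_geo_meta <- function(x) {
  out_table <- data.frame(geo_accession = x, stringsAsFactors = FALSE)
  for (id in out_table$geo_accession) {
    geo <- getGEO(id)  # downloads the SOFT record for this one sample
    characteristics <- as.character(unlist(geo@header[["characteristics_ch1"]]))
    # Each entry looks like "tag: value"; element 2 is NA when the value is missing
    characteristics2 <- strsplit(characteristics, ": ")

    for (i in seq_along(characteristics)) {
      tag <- characteristics2[[i]][1]
      if (is.na(tag)) next  # skip empty entries
      # Create the column on first sight so every other sample defaults to NA
      if (is.null(out_table[[tag]])) out_table[[tag]] <- NA_character_
      out_table[[tag]][out_table$geo_accession == id] <- characteristics2[[i]][2]
    }
  }
  return(out_table)
}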

> xx=c("GSM1092919","GSM1092920","GSM1092931")
> convert_geo_meta(xx)

  geo_accession case_id microarray batch rs5995355 genotype            tissue  mir34a    mir34bc
1    GSM1092919   11275                A                 GG             Tumor -0.8128 -6.9421665
2    GSM1092920   11303                A                 AG adjacent nontumor    <NA>       <NA>
3    GSM1092931   11918                A               <NA> adjacent nontumor    <NA>       <NA>
