Question

[xps] Problems building scheme for DroGene 1.1 ST arrays in xps

0

Entering edit mode

christopher.kimber • 0

@christopherkimber-17633

Last seen 7.3 years ago

Hi all,

I am attempting to use the xps package to perform background correction and summarization (at the probeset level) on Affymetrix DroGene 1.1 ST arrays, but I am encountering an error right away when attempting to build a scheme file for these arrays as 'exon' arrays ie. using both the probeset and transcript annotations.

I am guessing that the issue may involve changes in the annotation files for na36 that are producing some sort of mismatch with the available PGF and CLF files, but I am not savvy enough with these formats to find the issue. Here's what the error looks like:

> library(xps)

Welcome to xps version 1.40.0
    an R wrapper for XPS - eXpression Profiling System
    (c) Copyright 2001-2018 by Christian Stratowa
    
>
> libdir <- "/Users/chrki23/Documents/Becky Data/XPS/DroGene-1_1-st"
> anndir <- "/Users/chrki23/Documents/Becky Data/XPS/DroGene-1_1-st-v1-na36-dm3-transcript-csv"
> anndir2 <- "/Users/chrki23/Documents/Becky Data/XPS/DroGene-1_1-st-v1-na36-dm3-probeset-csv"
> scmdir <- "/Users/chrki23/Documents/Becky Data/XPS/Schemes"
> ### DroGene-1_1-st-v1.na36.dm3: as exon array
> scheme.drogene11stv1.na36.dm3 <- import.exon.scheme("Scheme_DroGene11stv1", filedir = scmdir,
+  layoutfile = file.path(libdir, "DroGene-1_1-st.clf"),
+  schemefile = file.path(libdir, "DroGene-1_1-st.pgf"),
+  probeset = file.path(anndir2, "DroGene-1_1-st-v1.na36.dm3.probeset.csv"),
+  transcript = file.path(anndir, "DroGene-1_1-st-v1.na36.dm3.transcript.csv"), add.mask = T, verbose = T)
Creating new file </Users/chrki23/Documents/Becky Data/XPS/Schemes/Scheme_DroGene11stv1.root>...
Importing </Users/chrki23/Documents/Becky Data/XPS/DroGene-1_1-st/DroGene-1_1-st.clf> as <DroGene-1_1-st.cxy>...
   <1416100> records imported...Finished
New dataset <DroGene-1_1-st> is added to Content...
Importing </Users/chrki23/Documents/Becky Data/XPS/DroGene-1_1-st-v1-na36-dm3-probeset-csv/DroGene-1_1-st-v1.na36.dm3.probeset.csv> as <DroGene-1_1-st.anp>...
   Number of probesets is <176275>.
   <176275> records read...Finished
   <175929> records imported...Finished
   <71433> exon annotations imported.
Importing </Users/chrki23/Documents/Becky Data/XPS/DroGene-1_1-st/DroGene-1_1-st.pgf> as <DroGene-1_1-st.scm>...
   Reading data from input file...
   Number of probesets is <176275>.
   <176275> records read...Finished
   Sorting data for probeset_type and position...
   Total number of controls is <23>
   Filling trees with data for probeset type: control->chip...
   Number of control->chip items is <0>.
   Filling trees with data for probeset type: control->bgp...
   Filling trees with data for probeset type: control->affx...
   Number of control->affx probesets is <167>.
Error: Number of control->affx imported <167> is not equal to number of annotated AFFX controls <75>.
Error: CDF with version/magic number </Users/chrki23/Documents/Becky Data/XPS/DroGene-1_1-st/DroGene-1_1-st.pgf> is not supported.
Error in import.exon.scheme("Scheme_DroGene11stv1", filedir = scmdir,  :
  error in function ‘ImportExonSchemes’

My session info looks like this:

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xps_1.40.0

loaded via a namespace (and not attached):
[1] tools_3.3.2 yaml_2.2.0

The mention of 167 affx controls vs. 75 makes me think that maybe something in the pgf/clf files is indicating there are also affx->ercc spike-in controls, which I think are present in the HuGeneST arrays (where there are 167 affx controls), but these are not present in the DroGene annotation files. I don't think they are actually present on the DroGene arrays either, though it would be nice if they were. Any ideas how to resolve this problem and successfully build a DroGeneST scheme? Do I somehow have to cook up a new pgf or clf?

Thanks!

Chris

xps • 2.3k views

ADD COMMENT • link updated 7.3 years ago by cstrato ★ 3.9k • written 7.3 years ago by christopher.kimber • 0

0

Entering edit mode

Hi Chris,

This is an old problem of Affymetrix. Sometimes they do not add all AFFX controls to the
annotation files or had controls in the annotation files which were missing in the pgf-file.
Usually I have asked Affymetrix DevNet to add the controls.

I will check my old records to see if there is a possibility to add/remove controls,
and will let you know.

Best regards,
Christian

ADD REPLY • link 7.3 years ago cstrato ★ 3.9k

0

Entering edit mode

Thanks Christian, I appreciate any help you can offer!

ADD REPLY • link 7.3 years ago christopher.kimber • 0

score 3 · Accepted Answer · 2018-10-05

PART 2:

Here is my code and the corresponding correct output:

library(xps)

libdir <- "/Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st"
anndir <- "/Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st-v1-na36-dm3-transcript-csv"
anndir2 <- "/Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st-v1.na36.dm3.probeset.csv"
scmdir <- "/Volumes/T3Data/CRAN/Workspaces/Drosophila/Schemes"

## corrected DroGene-1_1-st-v1.na36.dm3: as exon array
scheme.drogene11stv1.na36.dm3 <- import.exon.scheme("Scheme_DroGene11stv1", filedir = scmdir,
  layoutfile = file.path(libdir, "DroGene-1_1-st.clf"),
  schemefile = file.path(libdir, "DroGene-1_1-st.pgf"),
  probeset   = file.path(anndir2, "DroGene-1_1-st-v1.na36.dm3.probeset.corr.csv"),
  transcript = file.path(anndir, "DroGene-1_1-st-v1.na36.dm3.transcript.corr.csv"), add.mask = F, verbose = T)

Creating new file </Volumes/T3Data/CRAN/Workspaces/Drosophila/Schemes/Scheme_DroGene11stv1.root>...
Importing </Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st/DroGene-1_1-st.clf> as <DroGene-1_1-st.cxy>...
   <1416100> records imported...Finished
New dataset <DroGene-1_1-st> is added to Content...
Importing </Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st-v1.na36.dm3.probeset.csv/DroGene-1_1-st-v1.na36.dm3.probeset.corr.csv> as <DroGene-1_1-st.anp>...
   Number of probesets is <176222>.
   <176222> records read...Finished
   <176021> records imported...Finished
   <71434> exon annotations imported.
Importing </Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st/DroGene-1_1-st.pgf> as <DroGene-1_1-st.scm>...
   Reading data from input file...
   Number of probesets is <176275>.
   Note: Number of annotated probesets <176222> is not equal to number of probesets <176275>.
   <176275> records read...Finished
   Sorting data for probeset_type and position...
   Total number of controls is <23>
   Filling trees with data for probeset type: control->chip...
   Number of control->chip items is <0>.
   Filling trees with data for probeset type: control->bgp...
   Filling trees with data for probeset type: control->affx...
   Number of control->affx probesets is <167>.
   Filling trees with data for probeset type: main...
   <176021> probeset tree entries read...Finished
   Note: Number of exons imported <71433> is not equal to number of annotated exons <71434>.
   Filling trees with data for non-annotated probesets...
   <820798> records imported...Finished
   <49448> total transcript units imported.
   <71946> total exon array units imported.
   Exon cell statistics: 
      Number of probeset cells: minimum = 1,  maximum = 973
      Number of exon cells:     minimum = 1,  maximum = 1531
      Number of unit cells:     minimum = 1,  maximum = 973
Importing </Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st-v1-na36-dm3-transcript-csv/DroGene-1_1-st-v1.na36.dm3.transcript.corr.csv> as <DroGene-1_1-st.ann>...
   Number of annotated transcripts is <50472>.
   <17818> records read...Finished
   <17678> records imported...Finished

I hope that you can now use 'xps' as intended.

Best regards,
Christian

P.S.:
1, This time I will not contact Affymetrix, however, feel free to ask them to correct the
annotation files.

2, Since you are using a Mac, I suggest that you download "Nedit" as editor. It works even
on High Sierra, and allows to add syntax highlighting for R.
The main advantage of Nedit is that you can open the PGF-files and annotation files within
seconds and scroll interactively, e.g. to delete types "rescue" and "reporter".
The only thing to do is to add the following lines to '.xinitrc':
export DYLD_LIBRARY_PATH=/opt/X11/lib/flat_namespace
/Applications/nedit-5.5-Darwin-i386/nedit -geometry +60+60 &

score 2 · Accepted Answer · 2018-10-05

Hi Chris,

NOTE:
My reply seems to be tooo long, only 5000 characters are allowed, so I divide my answer into
different parts!!

PART 1:

As I have already mentioned the missing AFFY controls in the probeset and transcript annotation
files are an old problem. In 2012 it took me many weeks until Affymetrix DevNet has finally
upgraded the HuGene-2_0-st and HuGene-2_1-st arrays. Since a user at that time did need the
correct HuGene-2_1-st annotation files, I had created a Perl-script, which does correct them, see:
https://www.stat.math.ethz.ch/pipermail/bioconductor/2012-August/047755.html

You can use this perl-script (shown below), which I have updated for your DroGene-1_1-st arrays,
which adds the missing AFFX probesets and creates the corrected annotation files:
- DroGene-1_1-st-v1.na36.dm3.probeset.corr.csv
- DroGene-1_1-st-v1.na36.dm3.transcript.corr.csv
In this script you need only change the directories of the input/output file names.

However, these files are not yet ready for use since Affymetrix has added two probeset types,
which are not present in the PGF-file, i.e. types:
- "rescue"
- "reporter"
In my mail to DevNet from 2013/09/23 I had mentioned:
Neither the annotation files nor the mps/ps files should contain probesets which are not listed in the PGF-file!!

For this reason you need to delete the following lines from the corrected annotation files:

1, in "DroGene-1_1-st-v1.na36.dm3.probeset.corr.csv":
- type "rescue": delete lines 147113 - 174175
- type "reporter": delete remaining lines 176077 - 176158

2, in "DroGene-1_1-st-v1.na36.dm3.transcript.corr.csv":
- type "rescue": delete lines 47985 - 48047
- type "reporter": delete remaining lines 50328 - 50409

Now you are finally ready to use the corrected annotation files.

NOTE: the Part 2 will contain my code.

Best regards,
Christian
_._._._._._._._._._._._._._._._._._
C.h.r.i.s.t.i.a.n S.t.r.a.t.o.w.a
V.i.e.n.n.a A.u.s.t.r.i.a
e.m.a.i.l: cstrato at aon.at
_._._._._._._._._._._._._._._._._._

score 2 · Accepted Answer · 2018-10-05

Perl Script PART 1:

###### BEGIN perlscript "DroGene11_update_AFFX.pl" ######

#!/usr/bin/perl
#
#------------------------------------------------------------------------------#
# Perl script to update AFFX controls of DroGene-1_1-st annotation files na36
#
# Copyright (c) 2012-2018 Christian Stratowa, Vienna, Austria.
# All rights reserved.
#
# save DroGene-1_1-st pgf-file and annotation files in current directory and run:
# > perl DroGene11_update_AFFX.pl
# 
#------------------------------------------------------------------------------#

use strict;
use warnings;
# get current working dir
use Cwd;

#------------------------------------------------------------------------------#
# intialize constants
#------------------------------------------------------------------------------#

# input file names
my $in_pgf      = "/Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st/DroGene-1_1-st.pgf";
my $in_annot_tc = "/Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st-v1-na36-dm3-transcript-csv/DroGene-1_1-st-v1.na36.dm3.transcript.csv";
my $in_annot_ps = "/Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st-v1.na36.dm3.probeset.csv/DroGene-1_1-st-v1.na36.dm3.probeset.csv";

# output file names
my $out_affx     = "DroGene11.affx.csv";
my $out_annot_tc = "/Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st-v1-na36-dm3-transcript-csv/DroGene-1_1-st-v1.na36.dm3.transcript.corr.csv";
my $out_annot_ps = "/Volumes/T3Data/CRAN/Workspaces/Drosophila/DroGene-1_1-st-v1.na36.dm3.probeset.csv/DroGene-1_1-st-v1.na36.dm3.probeset.corr.csv";

# predefined strings
my $na                = "---";
my $beg_assignment_tc = "--- // --- // ";
my $end_assignment_tc = " // --- // --- // --- // --- // --- // ---";
my $beg_assignment_ps = "--- // ";
my $end_assignment_ps = " // --- // --- // --- // ---";

# variables
my @array;

my $idx;

#------------------------------------------------------------------------------#
# read pgf-file and put control->affx into array
#------------------------------------------------------------------------------#

print("reading pgf-file and storing control->affx in array ... ");

open(INFILE, $in_pgf) or die("Couldn't read $in_pgf: $!");

# fill array with [probeset_id,mrna_assignment,category, line_nr]
$idx = 0;
while (my $line = <INFILE>) {
   $idx++;
   if ($line =~ /control->affx/) {
      chomp($line);
      $line =~ s/\r//;  # remove optional carriage return character
      my @tmp = split(/\t/, $line);
      push @array, [@tmp, $idx];
   }#if
}#while
push @array, [0, "NA",, "NA", $idx+1];

close(INFILE)  or die("Couldn't close $in_pgf: $!");

# replace "line_nr" with "total_probes"
for (my $i=0; $i<$#array; $i++) {
   $array[$i][3] = ($array[$i+1][3] - $array[$i][3] - 1)/2;
   #very dirty workaround (would need to find number of lines between probeset_ids)
#   if ($array[$i][3] > 100) {$array[$i][3] = $array[$i-1][3];}
   if ($array[$i][3] > 100) {$array[$i][3] = 20;}
}#for

print("done.\n");

#------------------------------------------------------------------------------#
# write control->affx array to out_affx (for testing purposes only)
#------------------------------------------------------------------------------#

print("writing control->affx to $out_affx ... ");

open(OUTFILE, ">$out_affx") or die("Couldn't open $out_affx: $!");

for (my $i=0; $i<$#array; $i++) {
   my $tmp = join("\",\"", @{$array[$i]});
#   print(OUTFILE "\"$tmp\"\n");
   print(OUTFILE "\"$tmp\"\r\n");
}#for

close(OUTFILE) or die("Couldn't close $out_affx: $!");

print("done.\n");

#------------------------------------------------------------------------------#
# append/replace control->affx lines to transcript annotation file out_annot_tc
#------------------------------------------------------------------------------#

print("appending control->affx lines to $out_annot_tc ... ");

open(OUTFILE, ">$out_annot_tc") or die("Couldn't open $out_annot_tc: $!");
open(INFILE, $in_annot_tc)    or die("Couldn't read $in_annot_tc: $!");

# delete old control->affx lines
while (<INFILE>) {
   if (/control->affx/) {next;}
   print(OUTFILE $_);
}#while

# append new control->affx lines
for (my $i=0; $i<$#array; $i++) {
   my $afx = join("", $beg_assignment_tc, $array[$i][2], $end_assignment_tc);
   my $tmp = join("\",\"", $array[$i][0], $array[$i][0], $na, $na, 0,0, $array[$i][3],$na, $afx, $na, $na, $na, $na, $na, $na, $na, $na, $array[$i][1]);
   print(OUTFILE "\"$tmp\"\n");
#   print(OUTFILE "\"$tmp\"\r\n");
}#for

close(INFILE)  or die("Couldn't close $in_annot_tc: $!");
close(OUTFILE) or die("Couldn't close $out_annot_tc: $!");

print("done.\n");

score 2 · Accepted Answer · 2018-10-05

Perl Script PART 2:

#------------------------------------------------------------------------------#
# append/replace control->affx lines to probeset annotation file out_annot_ps
#------------------------------------------------------------------------------#

print("appending control->affx lines to $out_annot_ps ... ");

open(OUTFILE, ">$out_annot_ps") or die("Couldn't open $out_annot_ps: $!");
open(INFILE, $in_annot_ps)      or die("Couldn't read $in_annot_ps: $!");

# delete old control->affx lines
while (<INFILE>) {
   if (/control->affx/) {next;}
   print(OUTFILE $_);
}#while

# append new control->affx lines
for (my $i=0; $i<$#array; $i++) {
   my $afx = join("", $beg_assignment_ps, $array[$i][2], $end_assignment_ps);
   my $tmp = join("\",\"", $array[$i][0], $na, $na, 0, 0, $array[$i][3], 0, 0, 0, $na, $afx, 0, 0, 0, 0, $na, $na, $na, $na, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  $array[$i][1]);
   print(OUTFILE "\"$tmp\"\n");
#   print(OUTFILE "\"$tmp\"\r\n");
}#for

close(INFILE)  or die("Couldn't close $in_annot_ps: $!");
close(OUTFILE) or die("Couldn't close $out_annot_ps: $!");

print("done.\n");

#------------------------------------------------------------------------------#

###### END perlscript "DroGene11_update_AFFX.pl" ######