pdInfoBuilder created problematic package for Affymetrix chip tiling array
1
0
Entering edit mode
efoss ▴ 10
@efoss-8908
Last seen 9 months ago
United States

I tried to use pdInfoBuilder to make a package with the platform information for a budding yeast tiling array from Affymetrix. I wanted to use this package together with Bioconductor's "oligo" package.

4008948_yeast_3X_A.CEL
Sc03b_MF_v04.bpmap
Sc03b_MR_v04.bpmap

I made a package with the following code:

library(pdInfoBuilder)

baseDir <- '/Users/EricF/projects/ISW2_FUN30/Yoshida_et_al_Pasero_BrdU_Mol_Cell_2014/s_cerevisiae_libraryfile/BPMAP'
bpmap <- list.files(path = baseDir, pattern = '^\\S+_MR_\\S+bpmap$', full.names = TRUE) stopifnot(length(bpmap) == 1) cel <- list.files(path = baseDir, pattern = 'CEL$', full.names = TRUE)
stopifnot(length(cel) == 1)

seed <- new("AffyTilingPDInfoPkgSeed",
bpmapFile = bpmap, celFile = cel,
author = "Eric Foss",
email = "efoss@fredhutch.org",
biocViews = "AnnotationData",
genomebuild = "sacCer1",
organism = "Yeast", species = "Saccharomyces cerevisiae")

makePdInfoPackage(seed, destDir = ".")

install.packages('pd.sc03b.mr.v03', repos = NULL, type = 'source')


(I have done this both with the "_MF_" and with the "_MR_" .bpmap files, and I had identical problems with each.) Everything seemed to go well, although I did get 11 warnings, each of which was the following:

1: In result_fetch(res@ptr, n = n) :
SQL statements must be issued with dbExecute() or dbSendStatement() instead of dbGetQuery() or dbSendQuery().


When I loaded the package with "library(pd.sc03b.mr.v03)", everything went smoothly. However, when I look at the lengths of the chromosomes, something is wrong: 5 out of the 16 chromosomes had maximum coordinates that were far greater than the length of the chromosome (e.g. almost twice the length of the entire genome for a single chromosome; the other 11 chromosomes had maximum coordinates that were reasonable). The code below shows an example of how I determined this.

fls <- list.files(pattern = 'CEL$') # 15 .CEL files from PMID 24856221 affyRaw <- read.celfiles(filenames = fls, pkgname = 'pd.sc03b.mr.v03') max(pmPosition(affyRaw)) # 23,944,040 -almost twice the length of the cerevisiae genome  My session info is pasted below: > sessionInfo() R version 4.1.0 (2021-05-18) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur 11.4 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] pd.sc03b.mr.v03_0.0.1 DBI_1.1.1 oligo_1.56.0 Biobase_2.52.0 oligoClasses_1.54.0 RSQLite_2.2.7 [7] Biostrings_2.60.1 GenomeInfoDb_1.28.1 XVector_0.32.0 IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0 loaded via a namespace (and not attached): [1] Rcpp_1.0.6 BiocManager_1.30.16 compiler_4.1.0 MatrixGenerics_1.4.0 bitops_1.0-7 [6] iterators_1.0.13 tools_4.1.0 zlibbioc_1.38.0 bit_4.0.4 preprocessCore_1.54.0 [11] memoise_2.0.0 lattice_0.20-44 ff_4.0.4 pkgconfig_2.0.3 rlang_0.4.11 [16] Matrix_1.3-4 foreach_1.5.1 DelayedArray_0.18.0 rstudioapi_0.13 fastmap_1.1.0 [21] GenomeInfoDbData_1.2.6 affxparser_1.64.0 vctrs_0.3.8 bit64_4.0.5 grid_4.1.0 [26] blob_1.2.1 splines_4.1.0 codetools_0.2-18 matrixStats_0.59.0 GenomicRanges_1.44.0 [31] SummarizedExperiment_1.22.0 RCurl_1.98-1.3 cachem_1.0.5 crayon_1.4.1 affyio_1.62.0 >  I would very much appreciate any suggestions for fixing this problem. Thank you. Eric oligoClasses oligo pdInfoBuilder • 410 views ADD COMMENT 0 Entering edit mode @james-w-macdonald-5106 Last seen 20 minutes ago United States It will probably require some hard-coding in pdInfoBuilder to handle this issue, but I would have to take a closer look to say for sure. A quick look at the BPMAP file (using readBpmap from the affxparser package) brings up the following weirdness: > z <- readBpmap("Sc03b_MF_v04.bpmap") > names(z) [1] "AffxCtrl:v1;r2_Tag" "AffxCtrlBkGrAntiGenomic:v1;gcBin03" [3] "AffxCtrlBkGrAntiGenomic:v1;gcBin04" "AffxCtrlBkGrAntiGenomic:v1;gcBin05" [5] "AffxCtrlBkGrAntiGenomic:v1;gcBin06" "AffxCtrlBkGrAntiGenomic:v1;gcBin07" [7] "AffxCtrlBkGrAntiGenomic:v1;gcBin08" "AffxCtrlBkGrAntiGenomic:v1;gcBin09" [9] "AffxCtrlBkGrAntiGenomic:v1;gcBin10" "AffxCtrlBkGrAntiGenomic:v1;gcBin11" [11] "AffxCtrlBkGrAntiGenomic:v1;gcBin12" "AffxCtrlBkGrAntiGenomic:v1;gcBin13" [13] "AffxCtrlBkGrAntiGenomic:v1;gcBin14" "AffxCtrlBkGrAntiGenomic:v1;gcBin15" [15] "AffxCtrlBkGrAntiGenomic:v1;gcBin16" "AffxCtrlBkGrAntiGenomic:v1;gcBin17" [17] "AffxCtrlBkGrAntiGenomic:v1;gcBin18" "AffxCtrlBkGrAntiGenomic:v1;gcBin19" [19] "AffxCtrlBkGrAntiGenomic:v1;gcBin20" "AffxCtrlBkGrAntiGenomic:v1;gcBin21" [21] "AffxCtrlBkGrAntiGenomic:v1;gcBin22" "AffxCtrlBkGrAntiGenomic:v1;gcBin23" [23] "AffxCtrlBkGrAntiGenomic:v1;gcBin24" "AffxCtrlBkGrAntiGenomic:v1;gcBin25" [25] "AffxCtrlBkGrGenomic:v1;gcBin04" "AffxCtrlBkGrGenomic:v1;gcBin05" [27] "AffxCtrlBkGrGenomic:v1;gcBin06" "AffxCtrlBkGrGenomic:v1;gcBin07" [29] "AffxCtrlBkGrGenomic:v1;gcBin08" "AffxCtrlBkGrGenomic:v1;gcBin09" [31] "AffxCtrlBkGrGenomic:v1;gcBin10" "AffxCtrlBkGrGenomic:v1;gcBin11" [33] "AffxCtrlBkGrGenomic:v1;gcBin12" "AffxCtrlBkGrGenomic:v1;gcBin13" [35] "AffxCtrlBkGrGenomic:v1;gcBin14" "AffxCtrlBkGrGenomic:v1;gcBin15" [37] "AffxCtrlBkGrGenomic:v1;gcBin16" "AffxCtrlBkGrGenomic:v1;gcBin17" [39] "AffxCtrlBkGrGenomic:v1;gcBin18" "AffxCtrlBkGrGenomic:v1;gcBin19" [41] "AffxCtrlBkGrGenomic:v1;gcBin20" "AffxCtrlBkGrGenomic:v1;gcBin21" [43] "AffxCtrlBkGrGenomic:v1;gcBin22" "AffxCtrlBkGrGenomic:v1;gcBin23" [45] "AffxCtrlBs:v1;r2_dap" "AffxCtrlBs:v1;r2_lys" [47] "AffxCtrlBs:v1;r2_phe" "AffxCtrlBs:v1;r2_thr" [49] "AffxCtrlBs:v1;X_Dap" "AffxCtrlBs:v1;X_Lys" [51] "AffxCtrlBs:v1;X_Phe" "AffxCtrlBs:v1;X_Thr" [53] "AffxCtrlBs:v1;X_Trpn" "AffxCtrlEc:v1;r2_bioB" [55] "AffxCtrlEc:v1;r2_bioC" "AffxCtrlEc:v1;r2_bioD" [57] "AffxCtrlHouseKeepingSc:v1;18sr" "AffxCtrlHouseKeepingSc:v1;25sr" [59] "AffxCtrlHouseKeepingSc:v1;YER022w" "AffxCtrlHouseKeepingSc:v1;YER148w" [61] "AffxCtrlHouseKeepingSc:v1;YFL039C" "AffxCtrlP1:v1;r2_cre" [63] "At:TIGRv5;chloroplast" "At:TIGRv5;chr1" [65] "At:TIGRv5;chr2" "At:TIGRv5;chr3" [67] "At:TIGRv5;chr4" "At:TIGRv5;chr5" [69] "Bs:Jan_2004;NC_000964.1" "Sc:Oct_2003;chr1" [71] "Sc:Oct_2003;chr10" "Sc:Oct_2003;chr11" [73] "Sc:Oct_2003;chr12" "Sc:Oct_2003;chr13" [75] "Sc:Oct_2003;chr14" "Sc:Oct_2003;chr15" [77] "Sc:Oct_2003;chr16" "Sc:Oct_2003;chr2" [79] "Sc:Oct_2003;chr3" "Sc:Oct_2003;chr4" [81] "Sc:Oct_2003;chr5" "Sc:Oct_2003;chr6" [83] "Sc:Oct_2003;chr7" "Sc:Oct_2003;chr8" [85] "Sc:Oct_2003;chr9" "Sc:Oct_2003;mitochondrion" [87] "Sc:Oct_2003;plasmid"  The weirdness being the At:TIGRv5 list items (63-68), which appear to be Arabidopsis thaliana, and the Bs item (69), which appears to be the Bacillus subtilis complete genome. Neither of which, so far as I know, are Saccharomyces cerevisiae. And we can see where those random lengths are coming from: > sapply(z[64:85], function(x) max(x$startpos))
At:TIGRv5;chr1          At:TIGRv5;chr2          At:TIGRv5;chr3
20953856                16581460                20352943
At:TIGRv5;chr4          At:TIGRv5;chr5 Bs:Jan_2004;NC_000964.1
17447368                23944040                 3314096
Sc:Oct_2003;chr1       Sc:Oct_2003;chr10       Sc:Oct_2003;chr11
230182                  745639                  666429
Sc:Oct_2003;chr12       Sc:Oct_2003;chr13       Sc:Oct_2003;chr14
1078149                  924401                  784303
Sc:Oct_2003;chr15       Sc:Oct_2003;chr16        Sc:Oct_2003;chr2
1091262                  948010                  813152
Sc:Oct_2003;chr3        Sc:Oct_2003;chr4        Sc:Oct_2003;chr5
316588                 1531886                  576833
Sc:Oct_2003;chr6        Sc:Oct_2003;chr7        Sc:Oct_2003;chr8
270123                 1090910                  562616
Sc:Oct_2003;chr9
439860


Like I said, I haven't looked closely at the code for parsing the BPMAP file, but I bet it just uses all the AFFX list items for controls and then uses the rest as 'main' probes. Which is going to be problematic given the excess cruft in the file. It will probably require something really stupid like an

if(bpmapFile == stupidaffyScTilingArray) {
<ignore the extra stuff>
} else {
<just do the regular>
}

2
Entering edit mode

Now that I think about it, you can just get around the issue yourself. This is definitely a bug and should be fixed, but the days of people using oligo on the regular, and in particular doing Chip-ChIP are probably behind us now, so maybe it's not that pressing. Anyway, you can do the following.

library(pdInfoPackage)
## we have to debug an internal function
debug(pdInfoBuilder:::parseBpmapCel)

## I just downloaded a random CEL from GEO for this - you can use the one you have
seed <- new("AffyTilingPDInfoPkgSeed", bpmapFile = "Sc03b_MF_v04.bpmap", celFile = "GSM2127843_M50.IPB.CEL",
author = "me", email = "me@mine.org", biocViews = "AnnotationData", genomebuild = "sacCer1", organism = "Yeast",
species = "Saccharomyces cerevisiae")

## once we start the function it will drop into the debugger, and we can iterate through, line by line
> makePdInfoPackage(seed, destDir = ".")
================================================================================
Building annotation package for Affymetrix Tiling array
BPMAP: Sc03b_MF_v04.bpmap
CEL..: GSM2127843_M50.IPB.CEL
================================================================================
debugging in: parseBpmapCel(object@bpmapFile, object@celFile, verbose = !quiet)
debug: {
if (verbose)
msgParsingFile(bpmapFile)
if (verbose)
msgOK()
if (verbose)
simpleMessage("Getting geometry from CEL file... ")

<snip>

## Now just go through the debugger, hitting Enter after each line until you get to this:

debug: experimental <- (1:length(grpNames))[-controls]
Browse[2]>
debug: problem <- intersect(experimental, background)

## now we hijack the process to fix things by adding this line

Browse[2]> experimental <- 70:87

## after which you can hit the 'c' key to jump out of the debugger and continue to the end. Then we install

Browse[2]> c
Getting MMs... OK
Getting background probes... OK
Getting sequences... OK
exiting from: parseBpmapCel(object@bpmapFile, object@celFile, verbose = !quiet)
Creating package in ./pd.sc03b.mf.v04
Inserting 18 rows into table chrom_dict... OK
Inserting 2643812 rows into table pmfeature... OK
Inserting 2643812 rows into table mmfeature... OK
Inserting 38551 rows into table bgfeature... OK
Counting rows in bgfeature
Counting rows in chrom_dict
Counting rows in mmfeature
Counting rows in pmfeature
Creating index idx_bgfid on bgfeature... OK
Creating index idx_pmfid on pmfeature... OK
Creating index idx_mmfid on mmfeature... OK
Creating index idx_mmpmfid on mmfeature... OK
Saving DataFrame object for PM.

Saving DataFrame object for BG.

Done.

There were 11 warnings (use warnings() to see them)

## I am on Windows, so have to add the type argument here

> install.packages("pd.sc03b.mf.v04/", repos = NULL, type = "source")
Installing package into 'C:/Users/jmacdon/AppData/Roaming/R/win-library/4.1'
(as 'lib' is unspecified)
* installing *source* package 'pd.sc03b.mf.v04' ...
** using staged installation
** R
<snip>

## and now let's load and check

> library(pd.sc03b.mf.v04)

> max(pmPosition(pd.sc03b.mf.v04))
[1] 1531886

## which seems reasonable to me

0
Entering edit mode

Hi James,

Thanks so much! This worked perfectly. And the Arabidopsis thing explains why it was specifically chromosomes 1-5 whose lengths were so far off, since Arabidopsis has 5 chromosomes. I think that this also suggests that the accessor function called "pmChr" in the oligo package treats, for example, "At:TIGRv5;chr1" as equivalent to "Sc:Oct_2003;chr1", which seems like a bad thing. That's probably part of the bug you referred to.

I have one follow-up question: Your package ended up being named "pd.sc03b.mf.v04" whereas mine was called "pd.sc03b.mr.v03". I started with a .bpmap file with a name identical to that of the one you used: Sc03b_MF_v04.bpmap

This .bpmap file has "MF" in its name. As I understand it, "MF" records strands that hybridize to the Forward strand (so reverse strand sequences), whereas "MR" does the opposite. Also, I understand that it doesn't matter whether one chooses the _MF_ or the _MR_ .bpmap file if the technique you are using is not strand-aware (as mine is not). Also, the .bpmap file has "v04" in its name. It seems pretty reasonable to me that you ended up with a package with "mf" and "v04" in its name; so why did I end up with a package with "mr" and "v03" in its name?

Thanks again for the help. There is no way I could have figured this out on my own.

Best wishes,

Eric

0
Entering edit mode

Yes, that's exactly the problem. Whenever one tries to write a package that parses a type of file (in this case bpmap files from Affy tiling arrays), usually a canonical example file is used to infer what to expect in the file, and that then defines how you will parse and interpret the contents. The original tiling arrays were for human and mouse, so I would assume that Benilton used one or both of those as examples, and assumed all other tiling arrays would follow the same convention. But Affy has a history (IMO) of breaking that sort of convention and consequently breaking our tools. This is just one example of that. Maybe they decided they have the real estate on the array for three different species, so they put them all on there and now they can make one array that they can sell as three different tiling arrays. Which is smart!

But anyway.

The name of the package comes from the CEL file you used. To see the code for an S4 method like makePdInfoPackage, you use selectMethod (or showMethods).

> selectMethod(makePdInfoPackage, "AffyTilingPDInfoPkgSeed")
Method Definition:

function (object, destDir = ".", batch_size = 10000, quiet = FALSE,
{
msgBar()
message("Building annotation package for Affymetrix Tiling array")
message("BPMAP: ", basename(object@bpmapFile))
message("CEL..: ", basename(object@celFile))
msgBar()
chip <- chipName(object)
pkgName <- cleanPlatformName(chip)

<snip>

## the package name comes from those last two lines. Let's explore!

> selectMethod(chipName, "AffyTilingPDInfoPkgSeed")
Method Definition:

function (object)
{
readCelHeader(object@celFile)$chiptype } <bytecode: 0x000000002d3caab0> <environment: namespace:pdInfoBuilder> Signatures: object target "AffyTilingPDInfoPkgSeed" defined "AffyTilingPDInfoPkgSeed" ## and for the random file I used, it is > readCelHeader("GSM2127843_M50.IPB.CEL")$chiptype
[1] "Sc03b_MF_v04"

## and cleanPlatformName just clears out extra cruft, so

> cleanPlatformName(readCelHeader("GSM2127843_M50.IPB.CEL")$chiptype) [1] "pd.sc03b.mf.v04"  If you do the same using your tiling array I am confident you will get 'pd.sc03b.mf.v03'. ADD REPLY 0 Entering edit mode Hi James, Yes, everything checks out exactly as you expected: > readCelHeader('4008948_yeast_3X_A.CEL')$chiptype
[1] "Sc03b_MR_v03"
>


Thanks again. This is really helpful.

Best wishes,

Eric

0
Entering edit mode

I have run into one more problem here with chromosome lengths: The information for this Affy chip is supposed to refer to the sacCer1 build of the saccharomyces cerevisiae genome. I created a TilingFeatureSet object called "affyRaw":

fls <- list.files(pattern = 'CEL\$')

pkgname = 'pd.sc03b.mr.v03')


The maximum coordinates for almost all of my chromosomes in affyRaw are just a little shorter than the lengths of the chromosomes in the sacCer1 genome assembly, which is reasonable. However, for two chromosomes, the maximum coordinates are longer than the length of the chromosome in the genome assembly - in one case longer by 16 base pairs, and in the other by 193 base pairs:

chr2idcs <- which(pmChr(affyRaw) == 'chr2')
chr10idcs <- which(pmChr(affyRaw) == 'chr10')

max(pmPosition(affyRaw)[chr2idcs]) # 813152 - but chr2 in the sacCer1 genome is supposed to be 813136
max(pmPosition(affyRaw)[chr10idcs]) # 745639 - but chr10 in the sacCer1 genome is supposed to be 745446


I don't particularly care about this, except if it is an indication that I am somehow misunderstanding which genome build the Affy chip is supposed to be referring to. "getPlatformDesign" indicates that it is sacCer1:



getPlatformDesign(affyRaw)
# Class........: AffyTilingPDInfo
# Manufacturer.: Affymetrix
# Genome Build.: sacCer1
# Chip Geometry: 2560 rows x  2560 columns


It's possible that this is going back far enough that the system of yeast genome builds wasn't completely formalized, so that there was some ambiguity about what genome "sacCer1" actually refers to. If that's the case, then everything is fine, and I will just accept my results as they are, but I want to make sure that this discrepancy doesn't reflect a mistake that I am making. For example, if somehow these coordinates referred to the most current build of the yeast genome (sacCer3), then all of the chromosome lengths would be longer than the maximum coordinates that I find for each chromosome, so everything would make sense. (My sessionInfo is pasted below.)

Thanks.

Eric

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rtracklayer_1.52.0    GenomicRanges_1.44.0  affxparser_1.64.0     pd.sc03b.mr.v03_0.0.1 DBI_1.1.1             oligo_1.56.0
[7] Biobase_2.52.0        oligoClasses_1.54.0   RSQLite_2.2.7         Biostrings_2.60.1     GenomeInfoDb_1.28.1   XVector_0.32.0
[13] IRanges_2.26.0        S4Vectors_0.30.0      BiocGenerics_0.38.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.7                  restfulr_0.0.13             BiocManager_1.30.16         compiler_4.1.0              MatrixGenerics_1.4.0
[6] bitops_1.0-7                iterators_1.0.13            tools_4.1.0                 zlibbioc_1.38.0             bit_4.0.4
[11] preprocessCore_1.54.0       memoise_2.0.0               lattice_0.20-44             ff_4.0.4                    pkgconfig_2.0.3
[16] rlang_0.4.11                Matrix_1.3-4                foreach_1.5.1               DelayedArray_0.18.0         rstudioapi_0.13
[21] yaml_2.2.1                  fastmap_1.1.0               GenomeInfoDbData_1.2.6      vctrs_0.3.8                 bit64_4.0.5
[26] grid_4.1.0                  BiocParallel_1.26.1         XML_3.99-0.6                blob_1.2.1                  GenomicAlignments_1.28.0
[31] Rsamtools_2.8.0             splines_4.1.0               codetools_0.2-18            matrixStats_0.59.0          SummarizedExperiment_1.22.0
[36] RCurl_1.98-1.3              cachem_1.0.5                rjson_0.2.20                crayon_1.4.1                BiocIO_1.2.0
[41] affyio_1.62.0
>

0
Entering edit mode

Note that you are the one who provided the genome build information when you built the package. However, it appears this array was developed in 2003, which is circa sacCer1, so that makes sense.

I wouldn't be too worried that some of the probes are off the end of two chromosomes - it's not uncommon for genomes to be updated between builds, and the chromosomal extent when Affy designed this array may have been different from what is reported currently.

0
Entering edit mode

Good point - I had forgotten that I was the one who provided that "sacCer1" information, so my "confirmation" that I was working with the correct genome assembly was meaningless.

I think that you are right and that I shouldn't be concerned that the lengths of 2 out of 16 chromosomes are a bit off. Thanks very much for the help.

Eric