Question

Confused by the probe design of Affymetrix Human Transcriptome Array (HTA 2.0)

2

Entering edit mode

tangboyun ▴ 30

@tangboyun-6933

Last seen 7.8 years ago

China

Dear List,

I 'm developing a pipeline for analysing the HTA 2.0 chips. It works smoothly if I use the default settings of the rma function in the oligo package. But when I try to add some extra procedures for bg corrections of the non-specific binding using the probe informations, I found something strange:

The probe number in the pm matrix is less than the probes extracted from the getProbeInfo function:

> nrow(pm(dat))
[1] 6058440

> all.probes <- getProbeInfo(dat,field=c('type'),target='probeset')

> nrow(all.probes)

[1] 7576209

Only 'main' probes can be extracted:

> unique(all.probes$type)
[1] "main" NA

Using raw sql calls confirmed that only 'main' probes are in pm matrix:

> conn <- db(get(annotation(dat)))
> pbinfo <- dbGetQuery(conn,"SELECT fid, fsetid as man_fsetid FROM pmfeature ORDER BY fid ASC")
> psetType <- dbGetQuery(conn,"SELECT fsetid, type FROM featureSet")
> type2typeid <- dbGetQuery(conn,"SELECT type, type_id FROM type_dict")
> idx <- match(pbinfo[,c('man_fsetid')],psetType[,c('fsetid')])
> probeTypeIdx <- match(psetType[idx,c('type')],type2typeid$type)
> unique(type2typeid$type_id[probeTypeIdx])
[1] "main" NA

Where are the other types of probes?

> type2typeid$type_id
 [1] "main"                       "control->affx"             
 [3] "control->chip"              "control->bgp->antigenomic" 
 [5] "control->bgp->genomic"      "normgene->exon"            
 [7] "normgene->intron"           "rescue->FLmRNA->unmapped"  
 [9] "control->affx->bac_spike"   "oligo_spike_in"            
[11] "r1_bac_spike_at"            "control->affx->polya_spike"
[13] "control->affx->ercc"        "control->affx->ercc->step"

By the way, the mm matrix was not empty:

> nrow(mm(dat))
[1] 1121

> head(mm(dat))
      HTA2_Liver_BetaSamplePool_1.CEL HTA2_Liver_BetaSamplePool_2.CEL
49142                              62                              70
49772                             130                             154
50922                             113                             166
51012                             250                             228
51102                              58                             104

EDIT: Add session info:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=zh_CN.UTF-8        LC_COLLATE=zh_CN.UTF-8    
 [5] LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] oligo_1.30.0        Biostrings_2.34.0   XVector_0.6.0      
[4] IRanges_2.0.0       S4Vectors_0.4.0     Biobase_2.26.0     
[7] oligoClasses_1.28.0 BiocGenerics_0.12.0

loaded via a namespace (and not attached):
 [1] affxparser_1.38.0     affyio_1.34.0         BiocInstaller_1.16.0 
 [4] bit_1.1-12            codetools_0.2-8       DBI_0.3.1            
 [7] ff_2.2-13             foreach_1.4.2         GenomeInfoDb_1.2.2   
[10] GenomicRanges_1.18.1  iterators_1.0.7       preprocessCore_1.28.0
[13] splines_3.1.1         zlibbioc_1.12.0

microarray normalization oligo • 4.2k views

ADD COMMENT • link updated 8.5 years ago by burchard • 0 • written 9.5 years ago by tangboyun ▴ 30

0

Entering edit mode

There may be a problem with the package in release. If I build my own version, it works as expected. I have sent Benilton an email directly to see what he says.

But note that you should expect a different number of probes from pm() and getProbeInfo(). The former function is simply extracting all the PM probes. The latter function is extracting all the probes for each probeset, and since there is some re-use of probes (e.g., some probes are used in more than one probeset), you should expect more data from getProbeInfo().

ADD REPLY • link 9.5 years ago James W. MacDonald 65k

score 1 · Answer 1 · 2014-11-05

There is some problem with the annotation files that Affymetrix supplies for this array. In other words, the previous versions (HTA-2_0.na33.hg19.probeset.csv and HTA-2_0.na33.hg19.transcript.csv) of the annotation files do work, when building the package. The newer release versions (change na33 to na34) are somehow different, and that causes problems when building the package.

It's not clear to me what the differences are, and a cursory glance at the two files yesterday didn't turn up anything obvious, so I have built a new pd.hta.2.0 package (version 3.10.1) that I will be sending to Marc Carlson shortly. I don't know how long it will take to get the package up for download, but I would imagine it will be soon.

In the interim, if you need it sooner, you can just download the required files from Affy (the two csv files above, plus the pgf, clf, mps files:

HTA-2_0.r1.clf
HTA-2_0.r1.pgf
HTA-2_0.r1.Psrs.mps

Which are all in the annotation file you can find here: http://www.affymetrix.com/Auth/analysis/downloads/lf/hta/HTA-2_0/HTA-2_0_Analysis_r1.zip

And you can build your own using pdInfoBuilder, doing something like

p <- new("AffyHTAPDInfoPkgSeed", clfFile = "HTA-2_0.r1.clf", pgfFile = "HTA-2_0.r1.pgf", coreMps = "HTA-2_0.r1.Psrs.mps", transFile = "HTA-2_0.na33.hg19.transcript.csv", probeFile = "HTA-2_0.na33.hg19.probeset.csv", author = "tangboyun", email = "tangboyun@wherever.cn", version = "3.10.1")

makePdInfoPackage(p, destDir = ".")

install.packages("pd.hta.2.0/", repos = NULL)

Hope that helps!

score 0 · Answer 2 · 2014-11-05

0

Entering edit mode

Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Thank for the reprocessed package Jim! I have just finished building windows and mac binaries for it and I am pushing it up to release and devel branches of the repository. It should appear online this afternoon.

Marc

ADD COMMENT • link 9.5 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

Now that's service with a smile... Err, plastic hamburger toy ;-D

ADD REPLY • link 9.5 years ago James W. MacDonald 65k

score 0 · Answer 3 · 2015-11-14

Dear List,

Thanks very much for this helpful thread!

The loss of annotation from Affy HTA 2.0 platform design seen prior to pd.hta.2.0 13.10 appears to have recurred in pd.hta.2.0 13.12.0.

To check the issue, I built pd.hta.2.0. packages according to James W. MacDonald's instructions from both Affy annotation releases na33/r1 and na35/r3. As Guilherme Rocha noted for na34, the na35/r3 package contains numeric probeset IDs not found in na33/r1. Both packages supported reading in of CEL files, with the same number of unique probe feature IDs and probe-set IDs. However, the package built from na35/r3 had no probe type annotations other than "main".&nbps; The package built from na33/r1 had diverse probe type annotations, and slightly more "main" annotations.

Has the loss of annotation from HTA 2.0 platform design annotation with Affy's newer annotation file format been tracked to a part of the parsing and package building process?

Code, output and session info are below. Thanks for any help you can provide!

library(pdInfoBuilder)

p = new("AffyHTAPDInfoPkgSeed", clfFile="HTA-2_0.r1.clf", pgfFile="HTA-2_0.r1.pgf", coreMps="HTA-2_0.r1.Psrs.mps", transFile="HTA-2_0.na33.hg19.transcript.csv", probeFile="HTA-2_0.na33.hg19.probeset.csv", author="burchard", email="burchard@ohsu.edu", version="3.13.1")

p2 = new("AffyHTAPDInfoPkgSeed", clfFile="HTA-2_0.r3.clf", pgfFile="HTA-2_0.r3.pgf", coreMps="HTA-2_0.r3.Psrs.mps", transFile="HTA-2_0.na35.2.hg19.transcript.csv", probeFile="HTA-2_0.na35.hg19.probeset.csv", author="burchard", email="burchard@ohsu.edu", version="3.13.2")

makePdInfoPackage(p, destDir = ".") 
makePdInfoPackage(p2, destDir = "..")

install.packages("pd.hta.2.0/", repos = NULL) # also done with ../pd.hta.2.0

# using package built from na35:
library(pd.hta.2.0)
# load data and extract platform info
setwd("../Genome")
eCELs = list.celfiles('.',full.names=T)
Data = read.celfiles(eCELs)
pInfo = getProbeInfo(Data,target="probeset",field=c("fid","type"),sortBy="none")
# check what's here
dim(Data)
# Features  Samples 
#  6892960        8 
dim(pInfo)
# [1] 7576209       3
length(unique(pInfo$fid))
# [1] 6744285
length(unique(pInfo$man_fsetid))
# [1] 925032
unique(pInfo$type)
# [1] "main" NA    
table(pInfo$type)
#
#    main 
# 7341944 

# using package built from na33:
library(pd.hta.2.0)
# load data and extract platform info
setwd("../Genome")
eCELs = list.celfiles('.',full.names=T)
Data = read.celfiles(eCELs)
pInfo = getProbeInfo(Data,target="probeset",field=c("fid","type"),sortBy="none")
# check what's here
dim(Data)
# Features  Samples 
#  6892960        8 
dim(pInfo)
# [1] 7576209       3
length(unique(pInfo$fid))
# [1] 6744285
length(unique(pInfo$man_fsetid))
# [1] 925032
unique(pInfo$type)
#
table(pInfo$type[pInfo$fid %in% rownames(Data)])
#
#   control->affx->bac_spike        control->affx->ercc 
#                         44                       2709 
#  control->affx->ercc->step control->affx->polya_spike 
#                       1870                         44 
#  control->bgp->antigenomic      main 
#                      16943                    7356242 
#             normgene->exon           normgene->intron 
#                       1145                       2328 
table(pInfo$type[pInfo$fid %in% rownames(Data)])
# 
#   control->affx->bac_spike        control->affx->ercc 
#                         44                       2709 
#  control->affx->ercc->step control->affx->polya_spike 
#                       1870                         44 
#  control->bgp->antigenomic      main 
#                      16943                    7356242 
#             normgene->exon           normgene->intron 
#                       1145                       2328 

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C                  
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915    
 [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915   
 [7] LC_PAPER=en_US.iso885915       LC_NAME=C                     
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C                
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C           

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pd.hta.2.0_3.13.1   RSQLite_1.0.0       DBI_0.3.1          
 [4] oligo_1.34.0        Biostrings_2.38.0   XVector_0.10.0     
 [7] IRanges_2.4.1       S4Vectors_0.8.1     Biobase_2.30.0     
[10] oligoClasses_1.32.0 BiocGenerics_0.16.1

loaded via a namespace (and not attached):
 [1] affxparser_1.42.0          GenomicRanges_1.22.1      
 [3] splines_3.2.2              zlibbioc_1.16.0           
 [5] bit_1.1-12                 foreach_1.4.3             
 [7] GenomeInfoDb_1.6.1         SummarizedExperiment_1.0.1
 [9] ff_2.2-13                  iterators_1.0.8           
[11] preprocessCore_1.32.0      affyio_1.40.0             
[13] codetools_0.2-14           BiocInstaller_1.20.0