Question: Confused by the probe design of Affymetrix Human Transcriptome Array (HTA 2.0)
2
gravatar for tangboyun
5.0 years ago by
tangboyun30
China
tangboyun30 wrote:

Dear List,

     I 'm developing a pipeline for analysing the HTA 2.0 chips. It works smoothly if I use the default settings of the rma function in the oligo package. But when I try to add some extra procedures for bg corrections of the non-specific binding  using the probe informations, I found something strange:

The probe number in the pm matrix is less than the probes extracted from the getProbeInfo function:

> nrow(pm(dat))
[1] 6058440

> all.probes <- getProbeInfo(dat,field=c('type'),target='probeset')

> nrow(all.probes)

[1] 7576209

Only 'main' probes can be extracted:

> unique(all.probes$type)
[1] "main" NA 

Using raw sql calls confirmed that only 'main' probes are in pm matrix:

> conn <- db(get(annotation(dat)))
> pbinfo <- dbGetQuery(conn,"SELECT fid, fsetid as man_fsetid FROM pmfeature ORDER BY fid ASC")
> psetType <- dbGetQuery(conn,"SELECT fsetid, type FROM featureSet")
> type2typeid <- dbGetQuery(conn,"SELECT type, type_id FROM type_dict")
> idx <- match(pbinfo[,c('man_fsetid')],psetType[,c('fsetid')])
> probeTypeIdx <- match(psetType[idx,c('type')],type2typeid$type)
> unique(type2typeid$type_id[probeTypeIdx])
[1] "main" NA    

Where are the other types of probes?

> type2typeid$type_id
 [1] "main"                       "control->affx"             
 [3] "control->chip"              "control->bgp->antigenomic" 
 [5] "control->bgp->genomic"      "normgene->exon"            
 [7] "normgene->intron"           "rescue->FLmRNA->unmapped"  
 [9] "control->affx->bac_spike"   "oligo_spike_in"            
[11] "r1_bac_spike_at"            "control->affx->polya_spike"
[13] "control->affx->ercc"        "control->affx->ercc->step" 

By the way, the mm matrix was not empty:

> nrow(mm(dat))
[1] 1121

> head(mm(dat))
      HTA2_Liver_BetaSamplePool_1.CEL HTA2_Liver_BetaSamplePool_2.CEL
49142                              62                              70
49772                             130                             154
50922                             113                             166
51012                             250                             228
51102                              58                             104

 

EDIT: Add session info:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=zh_CN.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=zh_CN.UTF-8        LC_COLLATE=zh_CN.UTF-8    
 [5] LC_MONETARY=zh_CN.UTF-8    LC_MESSAGES=zh_CN.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] oligo_1.30.0        Biostrings_2.34.0   XVector_0.6.0      
[4] IRanges_2.0.0       S4Vectors_0.4.0     Biobase_2.26.0     
[7] oligoClasses_1.28.0 BiocGenerics_0.12.0

loaded via a namespace (and not attached):
 [1] affxparser_1.38.0     affyio_1.34.0         BiocInstaller_1.16.0 
 [4] bit_1.1-12            codetools_0.2-8       DBI_0.3.1            
 [7] ff_2.2-13             foreach_1.4.2         GenomeInfoDb_1.2.2   
[10] GenomicRanges_1.18.1  iterators_1.0.7       preprocessCore_1.28.0
[13] splines_3.1.1         zlibbioc_1.12.0 

 

microarray normalization oligo • 2.8k views
ADD COMMENTlink modified 3.9 years ago by burchard0 • written 5.0 years ago by tangboyun30

There may be a problem with the package in release. If I build my own version, it works as expected. I have sent Benilton an email directly to see what he says.

But note that you should expect a different number of probes from pm() and getProbeInfo(). The former function is simply extracting all the PM probes. The latter function is extracting all the probes for each probeset, and since there is some re-use of probes (e.g., some probes are used in more than one probeset), you should expect more data from getProbeInfo().

 

ADD REPLYlink written 5.0 years ago by James W. MacDonald51k
Answer: Confused by the probe design of Affymetrix Human Transcriptome Array (HTA 2.0)
1
gravatar for James W. MacDonald
5.0 years ago by
United States
James W. MacDonald51k wrote:

There is some problem with the annotation files that Affymetrix supplies for this array. In other words, the previous versions (HTA-2_0.na33.hg19.probeset.csv and HTA-2_0.na33.hg19.transcript.csv) of the annotation files do work, when building the package. The newer release versions (change na33 to na34) are somehow different, and that causes problems when building the package.

It's not clear to me what the differences are, and a cursory glance at the two files yesterday didn't turn up anything obvious, so I have built a new pd.hta.2.0 package (version 3.10.1) that I will be sending to Marc Carlson shortly. I don't know how long it will take to get the package up for download, but I would imagine it will be soon.

In the interim, if you need it sooner, you can just download the required files from Affy (the two csv files above, plus the pgf, clf, mps files:

HTA-2_0.r1.clf
HTA-2_0.r1.pgf
HTA-2_0.r1.Psrs.mps

Which are all in the annotation file you can find here: http://www.affymetrix.com/Auth/analysis/downloads/lf/hta/HTA-2_0/HTA-2_0_Analysis_r1.zip

And you can build your own using pdInfoBuilder, doing something like

p <- new("AffyHTAPDInfoPkgSeed", clfFile = "HTA-2_0.r1.clf", pgfFile = "HTA-2_0.r1.pgf", coreMps = "HTA-2_0.r1.Psrs.mps", transFile = "HTA-2_0.na33.hg19.transcript.csv", probeFile = "HTA-2_0.na33.hg19.probeset.csv", author = "tangboyun", email = "tangboyun@wherever.cn", version = "3.10.1")

makePdInfoPackage(p, destDir = ".")

install.packages("pd.hta.2.0/", repos = NULL)

Hope that helps!

 

ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by James W. MacDonald51k

  I've also had some problem using the na34 annotation file to construct the InfoPackage for HTA-2.0.

  It appears that, for 2995 of the probesets,  the probeset_id column in na34 lists the numerical id of the probeset rather than its alphanumeric id. As far as I could tell, that was the source of the problem.

  Has anybody else noticed that problem?

  G

ADD REPLYlink written 4.6 years ago by Guilherme Rocha40
Answer: Confused by the probe design of Affymetrix Human Transcriptome Array (HTA 2.0)
0
gravatar for Marc Carlson
5.0 years ago by
Marc Carlson7.2k
United States
Marc Carlson7.2k wrote:

Thank for the reprocessed package Jim!  I have just finished building windows and mac binaries for it and I am pushing it up to release and devel branches of the repository.  It should appear online this afternoon.

 

 Marc

ADD COMMENTlink written 5.0 years ago by Marc Carlson7.2k

Now that's service with a smile... Err, plastic hamburger toy ;-D

ADD REPLYlink written 5.0 years ago by James W. MacDonald51k
Answer: Confused by the probe design of Affymetrix Human Transcriptome Array (HTA 2.0)
0
gravatar for burchard
3.9 years ago by
burchard0
burchard0 wrote:

Dear List,

Thanks very much for this helpful thread!

 

The loss of annotation from Affy HTA 2.0 platform design seen prior to pd.hta.2.0 13.10 appears to have recurred in pd.hta.2.0 13.12.0.

 

To check the issue, I built pd.hta.2.0. packages according to James W. MacDonald's instructions from both Affy annotation releases na33/r1 and na35/r3.  As Guilherme Rocha noted for na34, the na35/r3 package contains numeric probeset IDs not found in na33/r1.  Both packages supported reading in of CEL files, with the same number of unique probe feature IDs and probe-set IDs.  However, the package built from na35/r3 had no probe type annotations other than "main".&nbps; The package built from na33/r1 had diverse probe type annotations, and slightly more "main" annotations.

 

Has the loss of annotation from HTA 2.0 platform design annotation with Affy's newer annotation file format been tracked to a part of the parsing and package building process?

 

Code, output and session info are below.  Thanks for any help you can provide!

 

library(pdInfoBuilder)

p = new("AffyHTAPDInfoPkgSeed", clfFile="HTA-2_0.r1.clf", pgfFile="HTA-2_0.r1.pgf", coreMps="HTA-2_0.r1.Psrs.mps", transFile="HTA-2_0.na33.hg19.transcript.csv", probeFile="HTA-2_0.na33.hg19.probeset.csv", author="burchard", email="burchard@ohsu.edu", version="3.13.1")

p2 = new("AffyHTAPDInfoPkgSeed", clfFile="HTA-2_0.r3.clf", pgfFile="HTA-2_0.r3.pgf", coreMps="HTA-2_0.r3.Psrs.mps", transFile="HTA-2_0.na35.2.hg19.transcript.csv", probeFile="HTA-2_0.na35.hg19.probeset.csv", author="burchard", email="burchard@ohsu.edu", version="3.13.2")

makePdInfoPackage(p, destDir = ".") 
makePdInfoPackage(p2, destDir = "..")

install.packages("pd.hta.2.0/", repos = NULL) # also done with ../pd.hta.2.0

# using package built from na35:
library(pd.hta.2.0)
# load data and extract platform info
setwd("../Genome")
eCELs = list.celfiles('.',full.names=T)
Data = read.celfiles(eCELs)
pInfo = getProbeInfo(Data,target="probeset",field=c("fid","type"),sortBy="none")
# check what's here
dim(Data)
# Features  Samples 
#  6892960        8 
dim(pInfo)
# [1] 7576209       3
length(unique(pInfo$fid))
# [1] 6744285
length(unique(pInfo$man_fsetid))
# [1] 925032
unique(pInfo$type)
# [1] "main" NA    
table(pInfo$type)
#
#    main 
# 7341944 

# using package built from na33:
library(pd.hta.2.0)
# load data and extract platform info
setwd("../Genome")
eCELs = list.celfiles('.',full.names=T)
Data = read.celfiles(eCELs)
pInfo = getProbeInfo(Data,target="probeset",field=c("fid","type"),sortBy="none")
# check what's here
dim(Data)
# Features  Samples 
#  6892960        8 
dim(pInfo)
# [1] 7576209       3
length(unique(pInfo$fid))
# [1] 6744285
length(unique(pInfo$man_fsetid))
# [1] 925032
unique(pInfo$type)
#
table(pInfo$type[pInfo$fid %in% rownames(Data)])
#
#   control->affx->bac_spike        control->affx->ercc 
#                         44                       2709 
#  control->affx->ercc->step control->affx->polya_spike 
#                       1870                         44 
#  control->bgp->antigenomic      main 
#                      16943                    7356242 
#             normgene->exon           normgene->intron 
#                       1145                       2328 
table(pInfo$type[pInfo$fid %in% rownames(Data)])
# 
#   control->affx->bac_spike        control->affx->ercc 
#                         44                       2709 
#  control->affx->ercc->step control->affx->polya_spike 
#                       1870                         44 
#  control->bgp->antigenomic      main 
#                      16943                    7356242 
#             normgene->exon           normgene->intron 
#                       1145                       2328 

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C                  
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915    
 [5] LC_MONETARY=en_US.iso885915    LC_MESSAGES=en_US.iso885915   
 [7] LC_PAPER=en_US.iso885915       LC_NAME=C                     
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C                
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C           

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pd.hta.2.0_3.13.1   RSQLite_1.0.0       DBI_0.3.1          
 [4] oligo_1.34.0        Biostrings_2.38.0   XVector_0.10.0     
 [7] IRanges_2.4.1       S4Vectors_0.8.1     Biobase_2.30.0     
[10] oligoClasses_1.32.0 BiocGenerics_0.16.1

loaded via a namespace (and not attached):
 [1] affxparser_1.42.0          GenomicRanges_1.22.1      
 [3] splines_3.2.2              zlibbioc_1.16.0           
 [5] bit_1.1-12                 foreach_1.4.3             
 [7] GenomeInfoDb_1.6.1         SummarizedExperiment_1.0.1
 [9] ff_2.2-13                  iterators_1.0.8           
[11] preprocessCore_1.32.0      affyio_1.40.0             
[13] codetools_0.2-14           BiocInstaller_1.20.0    
ADD COMMENTlink written 3.9 years ago by burchard0

This is the sort of thing that just makes me wonder about Affymetrix. It's almost as if they don't want people to use their products... Or maybe they just don't want Bioconductor people to use their products. It's a mystery, really. So here is the problem:

jmacdon$ awk -F, '{if($1 !~/#/) print $13}' HTA-2_0.na33.hg19.probeset.csv | sort | uniq
"control->affx->bac_spike"
"control->affx->ercc"
"control->affx->ercc->step"
"control->affx->polya_spike"
"control->bgp->antigenomic"
"main"
"normgene->exon"
"normgene->intron"

This is the naming scheme Affy has used for like forever and stuff but now, it is obviously critical to rename things for no apparent reason:

jmacdon$ awk -F, '{if($1 !~/#/) print $13}' HTA-2_0.na35.hg19.probeset.csv | sort | uniq
"Antigenomic background control"
"control->affx->bac_spike"
"control->affx->polya_spike"
"ERCC (External RNA Controls Consortium) step control"
"Exonic normalization control (Positive Control)"
"Intronic normalization control (Negative Control)"
"main"
"Positive Control"

But not everything, just some things. And not all arrays, just some of them. The mind, it boggles.

ADD REPLYlink written 3.9 years ago by James W. MacDonald51k

I was actually wrong about the problem with this annotation package. I put in code last release to get the type_dict based on what Affy was calling things, rather than based on a static set of types, which kept getting blown up because of Affy's penchant for changing things.

The new problem arose from a 'fix' that I put in to account for the fact that the Mouse Transcriptome array has 811 probes that are in the probeset.csv file but are not actually on the array. In other words, they designed the probes, found out that they were not performant for various reasons, and left them off the array. But they then neglected to exclude those probes from all of the annotation that they supply.

Unfortunately the fix for the MTA arrays then broke the newest version of the HTA package because Affy dumped out a bunch of the probeset IDs (man_fsetid) and replaced with probe IDs (fsetid). In other words, they are now mixing probe and probeset IDs in the column that formerly held only probeset IDs. Anyway, I re-patched pdInfoBuilder, and re-made the pd.hta.2.0 package and sent it up the line to the folks Roswell. It should appear in the next day or two - you are looking for version 2.12.1.
 

ADD REPLYlink written 3.9 years ago by James W. MacDonald51k

Thanks very much Jim!

 

I totally appreciate your series of fixes.  I'll watch for your new pd.hta.2.0 package!  

 

Thanks again,

Julja

ADD REPLYlink written 3.9 years ago by burchard0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 139 users visited in the last hour