Question: rma summarization error when normalizing using oligo and brainarray cdf
Moosa0 wrote (25 days ago):

Hello. I'm trying to read some raw (.cel) files generated on the Affymetrix U133 Plus 2.0 array using Brainarray custom CDFs. The code I'm using is:

install.packages("http://mbni.org/customcdf/23.0.0/entrezg.download/pd.hgu133plus2.hs.entrezg_23.0.0.tar.gz", repos = NULL, type = "source")
library(pd.hgu133plus2.hs.entrezg)
library(oligo)
path = "<.cel files path>"  # placeholder: replace with the paths to your .cel files
raw_data <- read.celfiles(path, pkgname = "pd.hgu133plus2.hs.entrezg")
normalized_data = oligo::rma(raw_data, target = "core")


Platform design info loaded.


When I don't pass the argument target = "core", the normalization runs without a problem, but passing target = "core" leads to the following error:

Background correcting... OK
Normalizing... OK
Available tables: featureSet1, mmfeature, mps1mm, mps1pm, pmfeature, table_info
Error in getMPSInfo(get(annotation(object)), substr(target, 4, 4), "fid",  :
Table mpsepm does not exist.


Thank you for your time. Regards.

Tags: oligo, brainarray, mbni

target = "core" is primarily for Gene or Exon arrays, which typically have 'Gene' or 'Exon' and 'ST' in their name (for example, HuGene-1_0-st or HuEx-1_0-st). core instructs the algorithm to summarise the probesets to gene or exon level. The U133 arrays (you are using the U133 Plus 2.0, the successor of the U133 A and B arrays) are fundamentally designed differently, so the usage of core is not valid for these. This is my understanding, at least. Please wait for another person to respond.
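A side note on the error message itself: the call it shows, getMPSInfo(get(annotation(object)), substr(target, 4, 4), "fid", ...), suggests that the table name is built from the fourth character of target. The base R sketch below only mimics that string handling. The helper name mps_table_name and the "mps" + level + "pm" pattern are assumptions inferred from the error message and the list of available tables, not oligo's actual source code.

```r
# Hypothetical helper: mimics how the table name in the error message
# appears to be derived from `target` (an inference, not oligo's source).
mps_table_name <- function(target) {
  level <- substr(target, 4, 4)   # same substr() call as in the error message
  paste0("mps", level, "pm")
}

# Tables reported as available in the Brainarray pd package (from the error output)
available <- c("featureSet1", "mmfeature", "mps1mm", "mps1pm",
               "pmfeature", "table_info")

mps_table_name("core")                  # "mpsepm", the table reported as missing
mps_table_name("core") %in% available   # FALSE
mps_table_name("mps1") %in% available   # TRUE
```

If that reading is right, the only summarization level this pd package offers is the one stored in mps1pm, which would be consistent with rma running fine when target is left out.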

That seems valid. I've also read your explanations of the target parameter on Biostars. target = "probeset" also yields the same error. So, if I just execute rma without any arguments, the result should be a normalized dataset summarized at the gene level? Am I correct?

Also, I've run read.celfiles with and without pkgname = "pd.hgu133plus2.hs.entrezg"; the raw and normalized objects from each run are shown below. Does the equal number of assayData features in the two raw objects mean that the .CEL files have been read correctly using the Brainarray CDF, and does the different number of assayData features (aka probes) in the two normalized objects reflect the different CDF design of Brainarray? I'm sorry to bother you with my rudimentary questions; I've done my searching and reading and am just checking to be sure.

Code 1:

raw_data <- read.celfiles(path, pkgname = "pd.hgu133plus2.hs.entrezg")
normalized_data = oligo::rma(raw_data)


The results:

> raw_data
GenericFeatureSet (storageMode: lockedEnvironment)
assayData: 1354896 features, 38 samples
....
Annotation: pd.hgu133plus2.hs.entrezg


and:

> normalized_data
ExpressionSet (storageMode: lockedEnvironment)
assayData: 20481 features, 38 samples
....
Annotation: pd.hgu133plus2.hs.entrezg


Code 2:

raw_data2 <- read.celfiles(path)
normalized_data2 = oligo::rma(raw_data2)


The results:

> raw_data2
ExpressionFeatureSet (storageMode: lockedEnvironment)
assayData: 1354896 features, 38 samples
...
Annotation: pd.hg.u133.plus.2


and:

> normalized_data2
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 38 samples
...
Annotation: pd.hg.u133.plus.2


Yes, you are correct!

raw_data: The assayData of this object reflects the number of probes (not probesets) present on the array. This number is independent of the 'chip definition file' (i.e. the probe-to-probeset mapping) that is used. Hence it is the same for raw_data and raw_data2.
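As a quick sanity check on that number: the HG-U133 Plus 2.0 array is laid out as a square grid of 1164 x 1164 features (this grid size comes from the array design, not from the output above), and every cell of that grid is one feature in the assayData of the raw object, whichever CDF is used.

```r
# Grid dimensions of the HG-U133 Plus 2.0 array (assumption stated in the text above)
n_rows <- 1164L
n_cols <- 1164L

n_features <- n_rows * n_cols
n_features                 # 1354896

# Matches the feature count printed for both raw_data and raw_data2
n_features == 1354896L     # TRUE
```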

In the case of normalized_data, you used modified probe-to-probeset mapping information based on up-to-date genome annotation (a so-called custom CDF from the MBNI group). Assuming you used the latest version (i.e. version 23), Manhong Dai (MBNI) generated these remapping files in September/October 2018. Since you used an Entrez Gene-based remapping file, each probeset in normalized_data now reflects the expression level of a single gene, as annotated by the NCBI Entrez database (status September 2018). The probeset ID corresponds to the Entrez ID with the suffix _at. In line with Affymetrix nomenclature, _at indicates that the probeset detects an antisense target.
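To work with the Entrez IDs directly, the _at suffix can be stripped from the probeset IDs. A minimal base R sketch follows; the IDs in it are illustrative examples in the Brainarray naming style, not values taken from your object. With the real data you would presumably start from featureNames(normalized_data) instead.

```r
# Illustrative probeset IDs in the Brainarray entrezg style (not taken from the thread)
probeset_ids <- c("1_at", "10_at", "100_at", "AFFX-BioB-5_at")

# Drop control probesets, which (if present) start with "AFFX" and are not genes
gene_ids <- probeset_ids[!grepl("^AFFX", probeset_ids)]

# Remove the trailing "_at" to obtain the Entrez gene IDs as character strings
entrez_ids <- sub("_at$", "", gene_ids)
entrez_ids   # "1" "10" "100"
```

The IDs stay character strings here, which is usually what annotation lookups expect.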

In code chunk 2 you used the probe-to-probeset mapping as defined by Affymetrix at the time they designed this array, which was in the early 2000s. FYI: U133 refers to UniGene version 133 (released April 20, 2001), the version of the UniGene database Affymetrix used to design the probes and probesets for this array. The probesets in normalized_data2 are therefore NOT always unique for a single gene (a gene can be detected by multiple probesets, and some probesets match more than one gene), and this can be inferred from the probeset name (e.g. whether _s or _x is present in the name).

Hence, your object normalized_data2 comprises more probesets (54675) than normalized_data (20481), but the number of uniquely detected genes should be roughly similar.

Lastly, to complete the story: the content of the two arrays hgu133A and hgu133B is combined on the single hgu133plus2 array. The first two arrays were manufactured with a photolithographic process in which each feature (the spot holding one probe) measured 18 µm, so not all 'required' probes could be 'printed' on a single array. Improved technology later reduced the feature size to 11 µm, which allowed all probes to be synthesized on a single array.

Dear Guido, I appreciate your informative response and great help. It was amazing and comprehensive. :)

Best regards