Question

getProbeInfo and exprs return matrices and data.frames of different dimensions

0

Entering edit mode

mark.aaron.fisher • 0

@markaaronfisher-9197

Last seen 7.4 years ago

United States

I'm trying to write a function that can summarize intensity matrices from Affymetrix array types that require the affy package alongside those that require the oligo package to read (incidentally, it's not clear to my why two separate packages are required to accomplish this).

The most straightforward way that I can think to do this is to pass the function three arguments: 1) the matrix of non-normalized probe intensities extracted from the ExpressionFeatureSet object using exprs(), 2) a vector of probe IDs (fids?) and 3) a vector of another identifier (e.g., probeset IDs) that correspond, element-for-element to the vector described in 2.

In the process of writing this function, I have noticed that the dimensions of the data.frame I get using `getProbeInfo()` doesn't correspond to the dimensions of the matrix of probe intensities I acquired using `exprs()`.

Why are these dimensions different, and how do I access probe and probeset (fset_id?)-level identifiers for each entry of my matrix of probe intensities?

To instantiate, I downloaded the CEL files for the chicken genome array [here](http://www.affymetrix.com/support/downloads/demo_data/Demo_Data_Chicken.zip).

path2CELfiles2 = "/Users/fishema/Desktop/CEL_files_for_testing/Demo_Data_Chicken//" #This would obviously be different for you.

microarrayProcessing_ls = processCELfiles(path2CELfiles2)
test1_mat = microarrayProcessing_ls$NonNormalizedMatrix

probeset_df = getProbeInfo(microarrayProcessing_ls$MicroarrayObject,field=c('fid','fsetid'),probeType = "pm",target = "probeset")

dim(test1_mat)
[1] 968256 2

dim(probeset_df)
[1] 424097 3

#The processCELfiles function

processCELfiles = function(pathToCELfiles){
#' Read in CEL files from a directory argument, using the oligo package first and then, failing that, trying the affy package instead. Requires the relevent array package to be installed from BioConductor before run. Not sure of a good strategy for installing these if they haven't already been. Also want more sophistocated error handling here, but SO recommends against nested tryCatches(), because they don't work as expected:
#' http://stackoverflow.com/questions/35716394/nested-try-catch-in-r
#'
#' @param pathToCELfiles a string containing the path to the directory containing
#' @return A list with two elements. The first element is the affy or oligo object (GeneFeatureSet). The second is a matrix with rownames and colnames that contains non-normalized intensities for each probe in a given array
#' @examples
#' path2CELfiles1 = "/Users/fishema/Desktop/CEL_files_for_testing/HTA/"
#' test1_ls = convertCELtoNonNormMat(path2CELfiles1)
#' path2CELfiles2 = "/Users/fishema/Desktop/CEL_files_for_testing/Rhesus/"
#' test2_ls = convertCELtoNonNormMat(path2CELfiles2)
#' path2CELfiles3 = "/Users/fishema/Desktop/CEL_files_for_testing/SNP6/"
#' test3_ls = convertCELtoNonNormMat(path2CELfiles3)
#' test3_mat = test3_ls[[2]]

require(oligo)
require(affy)
tryCatch({
eCELs = list.celfiles(pathToCELfiles,full.names=T)
Data_obj = read.celfiles(eCELs)
return_val = list(Data_obj, exprs(Data_obj))
names(return_val) = c("MicroarrayObject", "NonNormalizedMatrix")
}, warning = function(war) {
print(paste("Warning in convertCELtoNonNormMat function: ",war))
return_val = NULL
#return(return_val)
}, error = function(err) {
# Might be affy then
print(paste("Array type not suitable for oligoClasses package. Trying affy package: ",err))
Data_obj=ReadAffy(celfile.path=pathToCELfiles)
non_norm_mat = exprs(Data_obj)
return_val = list(Data_obj, non_norm_mat)
names(return_val) = c("MicroarrayObject", "NonNormalizedMatrix")
#return(return_val)
}, finally = {
print(paste("Done with convertCELtoNonNormMat"))
return(return_val)
}) # END tryCatch
} #END processCELfiles

oligo oligoclasses affy • 1.3k views

ADD COMMENT • link updated 7.5 years ago by James W. MacDonald 65k • written 7.5 years ago by mark.aaron.fisher • 0

score 1 · Answer 1 · 2016-10-26

The oligo package is intended to be the successor to the affy package, and can process any array type that affy can, plus a host of other arrays that the affy package isn't designed for. If you are simply planning to analyze Affymetrix arrays of any type, there is no reason to use affy any longer, and you can just start using oligo instead.

As to your question, you are doing an apples to oranges comparison, particularly with this array. In other words, when you get the exprs slot for the AffyBatch (or ExpressionFeatureSet if using oligo), you are getting the values for every little grid position on that array. But not all of those grids have anything in them! You can see this most clearly by doing

dat <- read.celfiles(list.celfiles(path2CELfiles2, full.names = TRUE))
image(dat[,1])

And you will see that there is a band at the bottom that has nothing at all in it. But it's even more complicated than that, as there are other empty grid positions all over that array that aren't apparent from just looking at the image. The array itself is a 984 X 984 grid, but even within the region that appears to have lots of data, there are positions that are being skipped, and some of the positions are used as 'landing lights' to help align the scanner. You can see that yourself if you do

library(pd.chicken)
con <- db(pd.chicken)
mat <- dbGetQuery(con, "select * from pmfeature;")
rle <- Rle(mat$y)
runLengths(rle)
runValue(rle)

The output from runLengths will give the number of probes that are in each row of the array, and the output from runValue will give the row number for each of the preceding counts. Note that it's every other row, except for the region between 575-595, where it's every row (that's where all those rows with just 18 probes come from).

In addition, this is a 3'-biased array, which means there are PM and MM probes! And when you use getProbeInfo and say you want the 'pm' probes, you are only asking for those grid positions that have PM probes, which is about half of them.

So what you get from the exprs slot is a value for all 968256 (984^2) grid positions, and what you are getting from getProbeInfo is based on the 424097 grid positions that A) have probes in them, and B) are PM probes.