Question

ReadAffy does not take into account parameter sampleNames with phenoData object

bastien_chassagnol • 0

@5de73a99

I am reading CEL files with affy::ReadAffy. To do so, I use a named vector files_URLs (built from sample information retrieved with GEOquery), whose values are the full file names and whose names are the row names I want for my exprs matrix and phenoData object (using the argument sampleNames = names(files_URLs)).

The operation itself works (the right files are read, and the names given by sampleNames are used for the phenoData and the exprs matrix). However, I get the following warning message (cf. picture) when I run this instruction:

raw_object <- affy::ReadAffy(filenames = files_URLs, celfile.path = destdir, compress = TRUE, sampleNames = names(files_URLs), phenoData = pheno_data, ...)

Warning message:
Mismatched phenoData and celfile names!

Please note that the row.names of your phenoData object should be identical to what you get from list.celfiles()!
Otherwise you are responsible for ensuring that the ordering of your phenoData object conforms to the ordering of the celfiles as they are read into the AffyBatch!
If not, errors may result from using the phenoData for subsetting or creating linear models, etc.

all.equal(names(files_URLs), row.names(pheno_data))
[1] TRUE

[Image: ReadAffy_warning_pheno]

It would be great if, in addition to the result of list.celfiles(), the correct ordering could also be inferred from the sampleNames argument when it is given.

sessionInfo( )

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /softhpc/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /softhpc/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bmkanalysis_1.0.0

affycoretools ReadAffy Biobase phenoData affy • 1.5k views

updated 3.6 years ago by James W. MacDonald 66k • written 3.6 years ago by bastien_chassagnol • 0

score 0 · Answer 1 · 2020-12-17

James W. MacDonald 66k

@james-w-macdonald-5106

United States

Pasting a picture of your console rather than the actual output is suboptimal. Why not just copy and paste like you already did with the first part? Anyway, you need to show your work better than that.

> getGEOSuppFiles("GSE23117")
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE23nnn/GSE23117/suppl//GSE23117_RAW.tar?tool=geoquery'
Content type 'application/x-tar' length 73072640 bytes (69.7 MB)
downloaded 69.7 MB

                                                       size isdir mode
C:/Users/jmacdon/Desktop/GSE23117/GSE23117_RAW.tar 73072640 FALSE  666
                                                                 mtime
C:/Users/jmacdon/Desktop/GSE23117/GSE23117_RAW.tar 2020-12-17 09:50:07
                                                                 ctime
C:/Users/jmacdon/Desktop/GSE23117/GSE23117_RAW.tar 2020-12-17 09:50:00
                                                                 atime exe
C:/Users/jmacdon/Desktop/GSE23117/GSE23117_RAW.tar 2020-12-17 09:50:07  no
> setwd("GSE23117/")
> untar("GSE23117_RAW.tar")
> library(affy)
> fn <- dir(".", "CEL.gz")
## reorder to not match output from list.celfiles()
> fn <- fn[sample(1:length(fn), length(fn))]
> all.equal(list.celfiles(), fn)
[1] "13 string mismatches"
## make a phenoData object
> pd <- AnnotatedDataFrame(data.frame(sample = 1:15, row.names = fn))
> pd
An object of class 'AnnotatedDataFrame'
  rowNames: GSM569481.CEL.gz GSM569478.CEL.gz ... GSM569485.CEL.gz (15
    total)
  varLabels: sample
  varMetadata: labelDescription
> z <- ReadAffy(filenames = fn, phenoData = pd)

So it doesn't matter what order you use (list.celfiles is only called if you don't tell ReadAffy which files to read in). But it does matter that the row.names of your phenoData object match the order of the files you read in, which is what the warning is telling you.
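The ordering requirement described above can be checked before any CEL file is read. What follows is only a minimal base-R sketch: `align_pheno_to_files` is a hypothetical helper (it is not part of affy or Biobase), `pheno` is assumed to be a plain data.frame whose row names are the CEL file names, and the three file names are made up for illustration.

```r
# Hypothetical helper (NOT part of affy): reorder phenotype rows so that
# they follow the order of the CEL files that will be read.
align_pheno_to_files <- function(pheno, files) {
  keys <- basename(files)
  idx <- match(keys, row.names(pheno))
  if (anyNA(idx)) {
    stop("no phenoData row for: ", paste(keys[is.na(idx)], collapse = ", "))
  }
  pheno[idx, , drop = FALSE]
}

# Illustrative inputs: files in reading order, phenotype rows in sorted order.
files <- c("GSM569473.CEL.gz", "GSM569471.CEL.gz", "GSM569472.CEL.gz")
pheno <- data.frame(
  sample = 1:3,
  row.names = c("GSM569471.CEL.gz", "GSM569472.CEL.gz", "GSM569473.CEL.gz")
)

aligned <- align_pheno_to_files(pheno, files)
identical(row.names(aligned), basename(files))  # TRUE
```

The aligned data.frame could then be wrapped in an AnnotatedDataFrame and passed as phenoData, so that its rows follow the same order as filenames.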

3.6 years ago • James W. MacDonald 66k

Thanks for your reply, but I don't want the row.names of my phenoData to match the files passed in filenames; I want them to match the sampleNames argument when it is provided. One way to do that properly would be to accept a named vector for the filenames argument: without names, the function would behave as usual, but when filenames is named, those names would be used in place of the sampleNames argument. This would ensure both that the sample names of the `AffyBatch` object are the names given in the named vector and that the order is respected.

This would also remove an argument from ReadAffy, replacing sampleNames with the names of the filenames argument, when available.
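The proposal can be roughly approximated outside the package. The sketch below is an assumption-laden illustration, not existing affy behaviour: `named_files_to_args` is a hypothetical helper, `pheno` is assumed to be a plain data.frame keyed by the desired sample names, and the file names are invented. It only prepares the arguments; nothing is read from disk.

```r
# Hypothetical helper illustrating the proposal; it is NOT part of affy.
# It turns a (possibly named) vector of CEL file names into matching
# filenames / sampleNames / phenotype rows, all in the same order.
named_files_to_args <- function(files, pheno = NULL) {
  nms <- names(files)
  if (is.null(nms) || anyNA(nms) || any(!nzchar(nms))) {
    # No usable names: fall back to the file names themselves.
    nms <- basename(files)
  }
  if (anyDuplicated(nms)) stop("sample names must be unique")
  if (!is.null(pheno)) {
    idx <- match(nms, row.names(pheno))
    if (anyNA(idx)) {
      stop("no phenoData row for: ", paste(nms[is.na(idx)], collapse = ", "))
    }
    pheno <- pheno[idx, , drop = FALSE]
  }
  list(filenames = unname(files), sampleNames = nms, phenoData = pheno)
}

# Illustrative inputs (made-up file names).
files_URLs <- c(GSM569472 = "GSM569472.CEL.gz", GSM569471 = "GSM569471.CEL.gz")
pheno <- data.frame(sample = 1:2, row.names = c("GSM569471", "GSM569472"))

call_args <- named_files_to_args(files_URLs, pheno)
call_args$sampleNames  # "GSM569472" "GSM569471"

# With affy and Biobase installed, the pieces could be passed on roughly as:
# raw <- affy::ReadAffy(filenames = call_args$filenames,
#                       sampleNames = call_args$sampleNames,
#                       phenoData = Biobase::AnnotatedDataFrame(call_args$phenoData))
```

Keeping this logic in a small wrapper leaves the ReadAffy interface unchanged while still letting a named vector drive both the sample names and the row order.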

3.6 years ago • bastien_chassagnol • 0

This isn't really an issue with the affy package, nor ReadAffy. The issue is that you are creating a phenoData object that won't match other data in your AffyBatch, and is getting rejected when it is tested for validity. In general I don't recommend making a phenoData object at the outset, which is why I added the warning that you saw in your original post, attempting to tell people exactly what the problem is.

In other words, if you want to make your own phenoData object and pass that into your AffyBatch, then you have taken on the responsibility of ensuring it matches up correctly or validObject will tell you that your object isn't valid, as it should.

Picking up where I left off,

> sampleNames <- gsub("\\.CEL.gz", "", fn)
> pd <- AnnotatedDataFrame(data.frame(sample = 1:15, row.names = sampleNames))
> pData(pd)
          sample
GSM569471      1
GSM569472      2
GSM569473      3
GSM569474      4
GSM569475      5
GSM569476      6
GSM569477      7
GSM569478      8
GSM569479      9
GSM569480     10
GSM569481     11
GSM569482     12
GSM569483     13
GSM569484     14
GSM569485     15
> z <- ReadAffy(filenames = fn, phenoData = pd, sampleNames = sampleNames)
> validObject(z)
[1] TRUE
> colnames(exprs(z))
 [1] "GSM569471" "GSM569472" "GSM569473" "GSM569474" "GSM569475" "GSM569476"
 [7] "GSM569477" "GSM569478" "GSM569479" "GSM569480" "GSM569481" "GSM569482"
[13] "GSM569483" "GSM569484" "GSM569485"
> pData(z)
          sample
GSM569471      1
GSM569472      2
GSM569473      3
GSM569474      4
GSM569475      5
GSM569476      6
GSM569477      7
GSM569478      8
GSM569479      9
GSM569480     10
GSM569481     11
GSM569482     12
GSM569483     13
GSM569484     14
GSM569485     15
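
In the transcript above, the sample names are obtained by stripping the .CEL.gz suffix from the file names. A slightly stricter variant is sketched here, assuming file names that end in .CEL or .CEL.gz in any letter case; `cel_sample_names` is a hypothetical helper and the example paths are made up.

```r
# Hypothetical helper: strip a trailing ".CEL" or ".CEL.gz" (any case) and
# any leading directory from CEL file names.
cel_sample_names <- function(files) {
  sub("\\.cel(\\.gz)?$", "", basename(files), ignore.case = TRUE)
}

fn_example <- c("GSM569481.CEL.gz", "data/GSM569478.cel", "GSM569485.CEL")
cel_sample_names(fn_example)
# "GSM569481" "GSM569478" "GSM569485"
```

Anchoring the pattern with `$` and escaping both dots avoids removing a matching substring from the middle of a file name.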

3.6 years ago • James W. MacDonald 66k