Question

Help! ExpressionSet function from Biobase not working! (issue with data structure??)

0

Entering edit mode

sa825 • 0

@sa825-23346

Last seen 4.8 years ago

Good afternoon,

The aim of my analysis is to perform Limma in order to generate p values and fold changes for each protein between alive and dead patients. I have created 3 csv files with information on these proteins:

1) ExpressionMatrix: Contains protein abundance values as rows (1552). And each sample (alive and dead patients) as columns (81) 2) FeatureData: Contains each protein as rows (1552). And one other column (Description: This is a description of each protein) 3) PhenotypeData: Contains each sample as rows (81). And two other columns with the headings : Patient ID and State.

I understand that I need to create a Class to hold these different objects in other to make the Limma analysis more straightforward (I know this from using the Limma Tutorial from DataCamp). I did the boxplot function which uses the 3 different objects I'm using in other to confirm that all 3 objects are in the correct data format as the boxplot produced the expected result. But when I try to create a Class of all 3 objects, I keep getting the error: Error in validObject(.Object) : invalid class “ExpressionSet” object: 1: sampleNames differ between assayData and phenoData invalid class “ExpressionSet” object: 2: sampleNames differ between phenoData and protocolData

I figure the issue is with the way I have structured objects "e" and "p"(ExpressionMatrix and PhenotypeData) but I am not sure how to fix it? Below is the code I have used from start to finish. Thank you for you help in advance!

Shimon

>setwd("D:/sa825/Using Limma")
> x<-read.csv("ExpressionMatrix.csv", stringsAsFactors = FALSE)
> f<-read.csv("FeatureData.csv")
> p<-read.csv("PhenotypeData.csv")
> e<-as.matrix(x)
> typeof(e)
[1] "double"
> class(e)
[1] "matrix"
> boxplot(e[1 , ] ~ p$State, main = f[1 , "Description"])
> source("https://bioconductor.org/biocLite.R")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of
Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/User/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
trying URL 'https://bioconductor.org/packages/3.7/bioc/bin/windows/contrib/3.5/BiocInstaller_1.30.0.zip'
Content type 'application/zip' length 102191 bytes (99 KB)
downloaded 99 KB

package ‘BiocInstaller’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\User\AppData\Local\Temp\RtmpkLuJuJ\downloaded_packages
Bioconductor version 3.7 (BiocInstaller 1.30.0),
  ?biocLite for help
A newer version of Bioconductor is available for this
  version of R, ?BiocUpgrade for help
> biocLite("Biobase")
BioC_mirror: https://bioconductor.org
Using Bioconductor 3.7 (BiocInstaller 1.30.0), R 3.5.3
  (2019-03-11).
Installing package(s) ‘Biobase’
also installing the dependency ‘BiocGenerics’

trying URL 'https://bioconductor.org/packages/3.7/bioc/bin/windows/contrib/3.5/BiocGenerics_0.26.0.zip'
Content type 'application/zip' length 745077 bytes (727 KB)
downloaded 727 KB

trying URL 'https://bioconductor.org/packages/3.7/bioc/bin/windows/contrib/3.5/Biobase_2.40.0.zip'
Content type 'application/zip' length 2413751 bytes (2.3 MB)
downloaded 2.3 MB

package ‘BiocGenerics’ successfully unpacked and MD5 sums checked
package ‘Biobase’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\User\AppData\Local\Temp\RtmpkLuJuJ\downloaded_packages
installation path not writeable, unable to update
  packages: boot, class, cluster, KernSmooth, lattice,
  MASS, Matrix, mgcv, nlme, nnet, rpart, spatial,
  survival
> library(Biobase)
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind,
    colMeans, colnames, colSums, dirname, do.call, duplicated,
    eval, evalq, Filter, Find, get, grep, grepl, intersect,
    is.unsorted, lapply, lengths, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position,
    rank, rbind, Reduce, rowMeans, rownames, rowSums, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit,
    which, which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages
    'citation("pkgname")'.

> eset <- ExpressionSet(assayData = e,
+                       phenoData = AnnotatedDataFrame(p),
+                       featureData = AnnotatedDataFrame(f))
Error in validObject(.Object) : 
  invalid class “ExpressionSet” object: 1: sampleNames differ between assayData and phenoData
invalid class “ExpressionSet” object: 2: sampleNames differ between phenoData and protocolData

> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
[1] Biobase_2.40.0      BiocGenerics_0.26.0

loaded via a namespace (and not attached):
[1] compiler_3.5.3 tools_3.5.3

limma biobase ExpressionSet • 3.0k views

ADD COMMENT • link updated 4.8 years ago by Gordon Smyth 52k • written 4.8 years ago by sa825 • 0

1

Entering edit mode

Firstly:

Since you are using R > 3.5.0 Could you try using BiocManager for updating packages instead of biocLite( ) . biocLite( ) was deprecated in favor of BiocManager::install( ) . Once this is installed, please check that all your packages are up-to-date and a valid version of Bioconductor.

install.packages("BiocManager")
BiocManager::valid( )
BiocManager::install( )   # choose 'a'  for all package update

From R > 3.5.0 you should be using BiocManager.

Since we don't have access to your files,

> eset <- ExpressionSet(assayData = e,
+                       phenoData = AnnotatedDataFrame(p),
+                       featureData = AnnotatedDataFrame(f))
Error in validObject(.Object) : 
  invalid class “ExpressionSet” object: 1: sampleNames differ between assayData and phenoData
invalid class “ExpressionSet” object: 2: sampleNames differ between phenoData and protocolData

You could check the rownames and colnames of the objects e , AnnotatedDataFrame(p) and AnnotatedDataFrame(f). As the ERROR indicates the sampleNames need to be consistent so you can make sure none are missing/excluded or that a transpose might be necessary.

ADD REPLY • link 4.8 years ago shepherl 4.1k

0

Entering edit mode

Hi Shepherl,

Thank you so much for your response. I have used BiocManager as you instructed:

> BiocManager::valid( )

* sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] BiocInstaller_1.32.1

loaded via a namespace (and not attached):
[1] BiocManager_1.30.10 compiler_3.5.3      tools_3.5.3        

Bioconductor version '3.8'

  * 2 packages out-of-date
  * 0 packages too new

create a valid installation with

  BiocManager::install(c(
    "Biobase", "BiocGenerics"
  ), update = TRUE, ask = FALSE)

more details: BiocManager::valid()$too_new, BiocManager::valid()$out_of_date

Warning message:
2 packages out-of-date; 0 packages too new 
> BiocManager::install( )
Bioconductor version 3.8 (BiocManager 1.30.10), R 3.5.3 (2019-03-11)
Installation path not writeable, unable to update packages: boot,
  class, cluster, KernSmooth, lattice, MASS, Matrix, mgcv, nlme, nnet,
  rpart, spatial, survival
Old packages: 'Biobase', 'BiocGenerics'
Update all/some/none? [a/s/n]: 
a
trying URL 'https://bioconductor.org/packages/3.8/bioc/bin/windows/contrib/3.5/Biobase_2.42.0.zip'
Content type 'application/zip' length 2415278 bytes (2.3 MB)
downloaded 2.3 MB

trying URL 'https://bioconductor.org/packages/3.8/bioc/bin/windows/contrib/3.5/BiocGenerics_0.28.0.zip'
Content type 'application/zip' length 749094 bytes (731 KB)
downloaded 731 KB

package ‘Biobase’ successfully unpacked and MD5 sums checked
package ‘BiocGenerics’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\User\AppData\Local\Temp\Rtmpg15kyZ\downloaded_packages
> library(Biobase)

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind,
    colMeans, colnames, colSums, dirname, do.call, duplicated,
    eval, evalq, Filter, Find, get, grep, grepl, intersect,
    is.unsorted, lapply, lengths, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position,
    rank, rbind, Reduce, rowMeans, rownames, rowSums, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit,
    which, which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages
    'citation("pkgname")'.

Then I re-tried the ExpressionSet function but it didn't work:

    eset <- ExpressionSet(assayData = e,
+                       phenoData = AnnotatedDataFrame(p),
+                       featureData = AnnotatedDataFrame(f))
Error in validObject(.Object) : 
  invalid class “ExpressionSet” object: 1: sampleNames differ between assayData and phenoData
invalid class “ExpressionSet” object: 2: sampleNames differ between phenoData and protocolData

I used colnames and rownames as you suggested and I can see that the colnames and rownames are different. So I transposed the csv files that match AnnotatedDataFrame(p) & AnnotatedDataFrame(f) to match "e". But with this new transposed files (p3 and f3), the boxplot function does not work and neither does the ExpressionSet function:

     f3<-read.csv("FeatureData3.csv")
> p3<-read.csv("PhenotypeData3.csv")
> boxplot(e[1 , ] ~ p3$State, main = f3[1 , "Description"])
Error in model.frame.default(formula = e[1, ] ~ p3$State) : 
  invalid type (NULL) for variable 'p3$State'
> eset <- ExpressionSet(assayData = e,
+                       phenoData = AnnotatedDataFrame(p3),
+                       featureData = AnnotatedDataFrame(f3))
Error in `featureNames<-`(`*tmp*`, value = featureNames(featureData)) : 
  'value' length (0) must equal feature number in AssayData
  (1552)'

I'm not really sure what to do. Can I send you access to the files I am working on? Or screenshots of how the data is structured

Thanks, Shimon

ADD REPLY • link updated 4.8 years ago by shepherl 4.1k • written 4.8 years ago by sa825 • 0

Gordon Smyth · Answer 1 · 2020-04-29

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 3 days ago

WEHI, Melbourne, Australia

You can make an ExpressionSet object if you want, but it does not make the limma analysis any more or less straightforward than it otherwise would be. limma can just as easily work with your data.frames x, f, and p directly. The limma User's Guide can be viewed by typing

library(limma)
limmaUsersGuide()

or from the limma landing page: https://www.bioconductor.org/packages/release/bioc/html/limma.html

You are reading the reading the data with read.csv, and that seems to be working, but it is important to look at the data.frames that have been read in to check that the dimensions and content are what you expect them to be. It is apparent from the error messages that there is a mismatch somewhere. What does your data show if you type the following?

dim(f)
head(f)
dim(p)
head(p)
dim(f)
head(f)

Also consider using more up-to-date software, although updating won't make any difference to the error messages you've reported so far. The current version of R is 4.0.0 and the current release of Bioconductor is 3.11. Your software is a few Bioc releases out of date.

ADD COMMENT • link 4.8 years ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon,

Thank you for your response. I have updated to the latest version of R and updated the Bioconductor package. I used the dim() and head() functions as described and got the following:

> dim(e)
[1] 1552   81
> dim(f)
[1] 1552    2
> dim(p)
[1] 81  3

Its clear that "e" and "p" do not match but I'm not sure they should as I thought "p" had to be that structure for the analysis to work? I'm not sure what other way I'm supposed to structure the data...

Below are the results of using the head() function for "f" and "p". Doing head(e) produced a result with too many characters to upload here so I attached an image for what that looked like:

> head(f)
       X
1 P09871
2 Q96NC0
3 Q15582
4 Q6ZMK1
5 Q8NHZ8
6 Q9NPP4
                                                                               Description
1                             Complement C1s subcomponent OS=Homo sapiens GN=C1S PE=1 SV=1
2                     Zinc finger matrin-type protein 2 OS=Homo sapiens GN=ZMAT2 PE=1 SV=1
3 Transforming growth factor-beta-induced protein ig-h3 OS=Homo sapiens GN=TGFBI PE=1 SV=1
4                 Cysteine and histidine-rich protein 1 OS=Homo sapiens GN=CYHR1 PE=2 SV=2
5              Anaphase-promoting complex subunit CDC26 OS=Homo sapiens GN=CDC26 PE=1 SV=1
6           NLR family CARD domain-containing protein 4 OS=Homo sapiens GN=NLRC4 PE=1 SV=2
> head(p)
                                X Patient.ID State
1        CHT11112016BiostatLRA_A2         A2 Alive
2        CHT11112016BiostatLRA_A5         A5 Alive
3        CHT11112016BiostatLRA_A7         A7 Alive
4       CHT11112016BiostatLRA_A10        A10 Alive
5 CHT11112016BiostatLRA_A14_500ng        A14 Alive
6 CHT11112016BiostatLRA_A15_500ng        A15 Alive

head(e): https://ibb.co/8XWdZH3

ADD REPLY • link updated 4.8 years ago by Gordon Smyth 52k • written 4.8 years ago by sa825 • 0

1

Entering edit mode

Thanks for showing the output. There's nothing wrong with your data and you could use e, f and p in a limma analysis just as they are. The only requirement is that e and f should have the same number of rows and that nrow(p) should match ncol(e).

It would be more normal to set row names for e instead of leaving them blank, and similarly you could set row names for p. Your peptide IDs are f$X and your sample IDs are p$X. You could set row.names by:

row.names(e) <- as.character(f$X)
p <- read.csv("PhenotypeData.csv", row.names=1)

If you do that, then is no further need for featureData and no need for an ExpressionSet.

For example you could proceed with

plotMDS(e)

or

design <- model.matrix(~ State, data=p)
fit <- lmFit(e, design)
fit <- eBayes(fit, robust=TRUE)
plotMD(fit, coef=2)
topTable(fit, coef=2)

ADD REPLY • link 4.8 years ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon,

Thank you for your response. The code worked perfectly! I just have one final question. When I do the topTable() function I can only see the results for the first few proteins, How do I access the remaining proteins? Thanks, Shimon

ADD REPLY • link 4.8 years ago sa825 • 0

1

Entering edit mode

If you read the doc, you will find that number might answer your question.

ADD REPLY • link 4.8 years ago SamGG ▴ 360

1

Entering edit mode

The doc mentioned by SamGG is accessed by ?topTable. Every function in limma has the same sort of help page and they are the first place to go to answer questions.

ADD REPLY • link 4.8 years ago Gordon Smyth 52k