Question

How to mport publically available RMA processed dataset (on ArrayExpress) into Limma?

0

Entering edit mode

Momeneh Foroutan ▴ 10

@momeneh-foroutan-7398

Last seen 8.2 years ago

Australia

Hi there,

I have a processed microarray dataset (HG-U133A) downloaded from the ArrayExpress. The authors have processed the data using RMA in affy package and only the processed data is available. What I have now is a spreadsheet containing a summary of normalized expression values. I know that for using limma, it is assumed that we have "eset" of class "exprSet", however, the dataset I downloaded is not definitely of that class. I need to do DE analysis on that dataset... how can I use limma for this purpose?

I checked these links as well but I did not find an answer for my case:

A: Importing RMA processed data into limma for eBayes

Importing quantile normalized .txt file to limma for differential expression

Thanks in advance for any help with this.

Regards,

Sepideh

limma microarray • 2.7k views

ADD COMMENT • link updated 9.2 years ago by Aaron Lun ★ 28k • written 9.2 years ago by Momeneh Foroutan ▴ 10

0

Entering edit mode

Note that you can download data directly from ArrayExpress, into convenient containers (ExpressionSets or AffyBatches, depending on the data) using the ArrayExpress package. As an example, say we want the E-GEOD-24068 dataset.

> library(ArrayExpress)

> library(affy)

> dat <- ArrayExpress("E-GEOD-24068")
trying URL 'http://www.ebi.ac.uk/arrayexpress/files/A-AFFY-44/A-AFFY-44.adf.txt'
Content type 'text/plain' length 4517088 bytes (4.3 Mb)
opened URL
==================================================
downloaded 4.3 Mb

trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-24068/E-GEOD-24068.sdrf.txt'
Content type 'text/plain' length 10304 bytes (10 Kb)
opened URL
==================================================
downloaded 10 Kb

trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-24068/E-GEOD-24068.idf.txt'
Content type 'text/plain' length 3742 bytes
opened URL
==================================================
downloaded 3742 bytes

Copying raw data files

trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-24068/E-GEOD-24068.raw.1.zip'
Content type 'application/zip' length 52828214 bytes (50.4 Mb)
opened URL
=================================================
downloaded 50.4 Mb

Unpacking data files
ArrayExpress: Reading pheno data from SDRF
ArrayExpress: Reading data files
Read 28 items

 E-GEOD-24068  was successfully loaded into  AffyBatch

> eset <- rma(dat)

Background correcting
Normalizing
Calculating Expression

And this comes with all sorts of information about the samples:

> names(pData(phenoData(eset)))
 [1] "Source.Name"                            
 [2] "Comment..Sample_description."           
 [3] "Comment..Sample_source_name."           
 [4] "Comment..Sample_title."                 
 [5] "Characteristics..organism."             
 [6] "Term.Source.REF"                        
 [7] "Term.Accession.Number"                  
 [8] "Characteristics..organism.part."        
 [9] "Characteristics..disease."              
[10] "Characteristics..xenograft."            
[11] "Protocol.REF"                           
[12] "Term.Source.REF.1"                      
[13] "Protocol.REF.1"                         
[14] "Term.Source.REF.2"                      
[15] "Protocol.REF.2"                         
[16] "Term.Source.REF.3"                      
[17] "Extract.Name"                           
[18] "Material.Type"                          
[19] "Protocol.REF.3"                         
[20] "Term.Source.REF.4"                      
[21] "Labeled.Extract.Name"                   
[22] "Label"                                  
[23] "Protocol.REF.4"                         
[24] "Term.Source.REF.5"                      
[25] "Assay.Name"                             
[26] "Array.Design.REF"                       
[27] "Term.Source.REF.6"                      
[28] "Technology.Type"                        
[29] "Protocol.REF.5"                         
[30] "Term.Source.REF.7"                      
[31] "Array.Data.File"                        
[32] "Comment..ArrayExpress.FTP.file."        
[33] "Protocol.REF.6"                         
[34] "Term.Source.REF.8"                      
[35] "Normalization.Name"                     
[36] "Derived.Array.Data.File"                
[37] "Comment..Derived.ArrayExpress.FTP.file."
[38] "FactorValue..compound."                 
[39] "Factor.Value..dose."                    
[40] "Unit.concentration.unit."

Which you can use to do the analysis

> trt <- pData(phenoData(eset))[,38]
> trt
[1] "parathyroid hormone" "parathyroid hormone" "parathyroid hormone"
[4] "parathyroid hormone" "parathyroid hormone" "saline"            
[7] "saline"              "saline"              "saline"            
[10] "saline"             

> library(limma)
> fit <- lmFit(eset, model.matrix(~trt))
> fit2 <- eBayes(fit)

> library(hgu133plus2.db)

> gns <- select(hgu133plus2.db, row.names(fit2$coef), c("ENTREZID","SYMBOL","GENENAME"))
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows
> gns <- gns[!duplicated(gns[,1]),]
> fit2$genes <- gns
> topTable(fit2,2)
                PROBEID ENTREZID  SYMBOL
218002_s_at 218002_s_at     9547  CXCL14
215506_s_at 215506_s_at     9077  DIRAS3
217821_s_at 217821_s_at    51729   WBP11
202388_at     202388_at     5997    RGS2
205870_at     205870_at      624  BDKRB2
209959_at     209959_at     8013   NR4A3
205100_at     205100_at     9945   GFPT2
222899_at     222899_at    22801  ITGA11
218935_at     218935_at    30845    EHD3
213004_at     213004_at    23452 ANGPTL2
                                                   GENENAME      logFC  AveExpr
218002_s_at               chemokine (C-X-C motif) ligand 14 -2.0147181 7.161729
215506_s_at            DIRAS family, GTP-binding RAS-like 3 -1.9505608 5.607836
217821_s_at                    WW domain binding protein 11  0.8534752 6.899359
202388_at                regulator of G-protein signaling 2 -2.0633228 9.550311
205870_at                            bradykinin receptor B2 -1.3754981 6.035307
209959_at   nuclear receptor subfamily 4, group A, member 3 -1.5975650 7.487489
205100_at     glutamine-fructose-6-phosphate transaminase 2 -0.8129612 6.008675
222899_at                                integrin, alpha 11 -1.3640798 7.776450
218935_at                            EH-domain containing 3 -1.9003569 7.110979
213004_at                               angiopoietin-like 2 -1.3629536 7.081965
                    t      P.Value adj.P.Val          B
218002_s_at -7.928451 9.362982e-06 0.5119211 -0.1054202
215506_s_at -7.253964 2.087725e-05 0.5707317 -0.3215610
217821_s_at  6.443773 5.874638e-05 0.6018951 -0.6369279
202388_at   -6.362932 6.543297e-05 0.6018951 -0.6722592
205870_at   -6.200199 8.150267e-05 0.6018951 -0.7457188
209959_at   -6.167721 8.519036e-05 0.6018951 -0.7607632
205100_at   -6.064561 9.813941e-05 0.6018951 -0.8094168
222899_at   -6.020962 1.042326e-04 0.6018951 -0.8303826
218935_at   -5.883415 1.262624e-04 0.6018951 -0.8981350
213004_at   -5.881011 1.266892e-04 0.6018951 -0.8993415

ADD REPLY • link 9.2 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks James for that all,

I downloaded the data in the same way you showed here, however, the data is RMA normalized and I cannot obtain eset by doing rma() on the dataset which is already normalized.

Regards,

Sepideh

ADD REPLY • link 9.2 years ago Momeneh Foroutan ▴ 10

0

Entering edit mode

I doubt you downloaded the data the same way, else you would have already had an ExpressionSet. And you cannot use ArrayExpress() if there are no raw data, so you could not have downloaded the way I showed. You would have to do something like this

> ae <- getAE("E-GEOD-56177", type = "processed")
trying URL 'http://www.ebi.ac.uk/arrayexpress/files/A-AFFY-44/A-AFFY-44.adf.txt'
Content type 'text/plain' length 4517088 bytes (4.3 Mb)
opened URL
==================================================
downloaded 4.3 Mb

trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-56177/E-GEOD-56177.sdrf.txt'
Content type 'text/plain' length 6155 bytes
opened URL
==================================================
downloaded 6155 bytes

trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-56177/E-GEOD-56177.idf.txt'
Content type 'text/plain' length 3303 bytes
opened URL
==================================================
downloaded 3303 bytes

Copying processed data files

trying URL 'http://www.ebi.ac.uk/arrayexpress/files/E-GEOD-56177/E-GEOD-56177.processed.1.zip'
Content type 'application/zip' length 3444865 bytes (3.3 Mb)
opened URL
==================================================
downloaded 3.3 Mb

Unpacking data files

> aeset <- procset(ae, getcolproc(ae)[2])
Read 2 items
ArrayExpress: Reading pheno data from SDRF
Reading processed File:  GSM1357190_sample_table.txt
Reading processed File:  GSM1357189_sample_table.txt
Reading processed File:  GSM1357188_sample_table.txt
Reading processed File:  GSM1357187_sample_table.txt
Reading processed File:  GSM1357186_sample_table.txt
Reading processed File:  GSM1357185_sample_table.txt
Reading processed File:  GSM1357184_sample_table.txt
Reading processed File:  GSM1357183_sample_table.txt
ArrayExpress: Reading feature metadata from ADF
Read 25 items

 E-GEOD-56177  processed data was successfully loaded into  ExpressionSet

> aeset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 8 samples
  element names: exprs
protocolData: none
phenoData
  rowNames: GSM1357190_sample_table GSM1357189_sample_table ...
    GSM1357183_sample_table (8 total)
  varLabels: Source.Name Comment..Sample_source_name. ...
    Term.Accession.Number.4 (43 total)
  varMetadata: labelDescription
featureData
  featureNames: AFFX-BioB-5_at AFFX-BioB-M_at ... 1570653_at (54675
    total)
  fvarLabels: Composite.Element.Name
    Composite.Element.Database.Entry.affymetrix_netaffx. ...
    Composite.Element.Database.Entry.cp450. (15 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

So my point remains. Just use the ArrayExpress package as intended, without exporting data and converting to a spreadsheet, and you can use limma directly.

ADD REPLY • link 9.2 years ago James W. MacDonald 65k

score 2 · Answer 1 · 2015-02-25

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 8 hours ago

The city by the bay

If you have a spreadsheet, you can convert it to CSV format using Excel, and load it into R using read.csv. This should give you a data.frame object, that you can just plug into lmFit at the start of the limma pipeline (see page 41 of the user's guide). Don't worry about the object class, as lmFit will convert the data.frame into a numeric matrix for you. I don't know what you mean by exprSet, though; check out ?getEAWP for the objects that lmFit can handle.

ADD COMMENT • link 9.2 years ago Aaron Lun ★ 28k

2

Entering edit mode

In addition, the OP doesn't even need to convert to a CSV from Excel because one could import directly from Excel using the read.xls function from the gdata package

ADD REPLY • link 9.2 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Thank you Aaron, I did not think that we can import something other than exprSet object into it because Gordon Smyth has mentioned in his paper "http://www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf" that "the object eset is assumed to be of class exprSet containing normalized probe-set log-intensities from an Affymetrix..."

I will check it to see if I can do what you have suggested,

Thank you so much,

Sepideh

ADD REPLY • link 9.2 years ago Momeneh Foroutan ▴ 10