Question

How to analyze correctly agilent two-colour microarray data from the ArrayExpress repository with the limma R package

1

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Community,

for an ongoing collaborative project, i want to analyze a transcriptomic dataset, which is not deposited in GEO, but in ArrayExpress:

https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6326/?page=1&pagesize=500

So, through the R package arrayexpress, i tried initially to download the relative raw data:

library(ArrayExpress)

rawset = ArrayExpress("E-MTAB-6326")

rawset NChannelSet (storageMode: lockedEnvironment) assayData: 62976 features, 119 samples element names: E, Eb protocolData: none phenoData rowNames: US84600244_253949426815_S01_GE1_107_Sep09_1_4.txt US84600244_253949426926_S01_GE1_107_Sep09_2_3.txt ... US84600244_253949426813_S01_GE1_107_Sep09_1_4.txt (119 total) varLabels: Source.Name Characteristics.organism. ... Factor.Value.RIDD.score. (29 total) varMetadata: labelDescription channel featureData featureNames: 1 2 ... 62976 (62976 total) fvarLabels: Row Col ... SystematicName (5 total) fvarMetadata: labelDescription experimentData: use 'experimentData(object)' Annotation:

However, my main concerns are the following:

1) Based on the manual of ArrayExpress, if the returned class object is an NChannelSet, then the experiment is a two-colour experiment-

but, when i checked the relative "Sample and data relationship" txt file, the column Label only reports the Cy3 label-

thus, this platform is indeed a two colour, or i have understood something wrong here ? Also this is evident, from the phenodata info from the downloaded object:

table(rawset$Label)

Cy3 119

head(table(rawset$Labeled.Extract.Name))

1428:Cy3 2948:Cy3 3174:Cy3 3178:Cy3 3183:Cy3 3280:Cy3 1 1 1 1 1 1

Just for simplicity, the platform is

the Agilent SurePrint G3 Human GE v2 8x60k Microarray (Array Design A-MEXP-2320)

2) In conjuction with my previous question, how should i continue with downstream analysis ? My main goals is primarily test a small gene signature on how the data are clustered, as also to perform a DE-analysis; however, how i should proceed with the NChannelSet ? in order to utilize limma for normalization , filtering etc ?

Thank you in advance,

Efstathios

agilent microarrays limma arrayexpress NChannelSet • 3.2k views

ADD COMMENT • link updated 5.6 years ago by Gordon Smyth 50k • written 5.6 years ago by svlachavas ▴ 830

score 2 · Answer 1 · 2018-09-17

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 5 hours ago

WEHI, Melbourne, Australia

This dataset is single color, so ArrayExpress's use of a NChannelSet seems inappropriate.

You could of course simply download the data and read it using limma directly, which is quick and easy:

> library(limma)
> x <- read.maimages(files,path="Raw",source="agilent",green.only=TRUE)
Read Raw/US84600244_253949426796_S01_GE1_107_Sep09_1_1.txt 
Read Raw/US84600244_253949426796_S01_GE1_107_Sep09_1_2.txt 
Read Raw/US84600244_253949426796_S01_GE1_107_Sep09_1_3.txt 
Read Raw/US84600244_253949426796_S01_GE1_107_Sep09_1_4.txt 
Read Raw/US84600244_253949426796_S01_GE1_107_Sep09_2_1.txt
...

The ArrayExpress package also runs limma's read.maimages function, but then bundles up the output in a way that might or might not be helpful.

ADD COMMENT • link 5.6 years ago Gordon Smyth 50k

1

Entering edit mode

Dear Gordon,

thank you one more time for your feedback-i would like just to add some crusial comments based on my downstream analysis, in which I'm facing some issues :

Initially, based on your suggestion, i downloaded the 3 zipped parts of the dataset, unzipped them, and then use the SDRF file of the phenotype information, to read with the same order each sample, in order also to have the additional sample information in the right order:

sdrf <- read.table("E-MTAB-6326.sdrf.txt",stringsAsFactors = FALSE,sep="\t",header=T)

x <- read.maimages(sdrf[,"Array.Data.File"], source="agilent", green.only=TRUE, other.columns="gIsWellAboveBG")

identical(sdrf$Assay.Name,colnames(x)) x$targets <- sdrf

### ANNOTATION PROCEDURE PRIOR DOWNSTEAM ANALYSIS ######### ??

y <- backgroundCorrect(x,method="normexp")

y <- normalizeBetweenArrays(y,method="quantile")

head(y$genes$ProbeName) [1] "GE_BrightCorner" "DarkCorner" "DarkCorner" "A_23_P117082" [5] "A_33_P3246448" "A_33_P3318220"

rownames(y) <-y$genes$ProbeName

Control <- y$genes$ControlType==1L IsExpr <- rowSums(y$other$gIsWellAboveBG > 0) >= 59

yfilt <- y[!Control & IsExpr, ]

As my main objectives here are:

Firstly, on the filtered dataset, with unique gene symbols on the row, create a heatmap plot of the dataset, based on a tested 38 gene signature, and how it separates the samples

Secondly, on the expression of a specific tested gene of interest, separate the samples on two groups (median-high and low) and perform DE analysis between these 2 groups using limma, with a simple two group factor (characterizing each sample as high or low)-the notion for this, is that we are searching for DE genes betweeen tumors that have differences in expression in this gene, and then map the DEgenes to biological processes and pathways...

1) My main first issue, is that i can't find an annotation R package for this specific Agilent type (Agilent SurePrint G3 Human GE v2-Agilent design ID 039494)-from ArrayExpress they provide a txt file with gene symbol mappings, but this is from 2014, and I'm a bit reluctant to use them- what is your opinion on this ?

2) In case the only solution is to use these gene symbols, and add them to a slot in the initial object

A) For both of my goals, i should have unique gene identifiers in the rows ? as desribed above ? and perform the annotation prior background correction and normalization ?

B) For the usual phenomenon of agilent arrays, i should also use the function avereps in the above yfilt object ?

final <- avereps(yfilt,ID=yfilt$genes$ProbeName)

Thank you,

Efstathios

ADD REPLY • link 5.6 years ago svlachavas ▴ 830

1

Entering edit mode

Dear Efstathios,

I can't give you detailed advice regarding this particular dataset, but here are some brief responses:

I don't see a big problem with using a array annotation file from 2014. You can update the gene symbols to current official symbols using the alias2SymbolTable() ro alias2SymbolUsingNCBI() functions in limma, although the process will be much more complicated if the file contains old RefSeq Ids or Ensemble transcript Ids, as I think it might. On the other hand, you can download the current annotation file for this Agilent array (last updated 13 Aug 2018) from the Agilent support website at https://earray.chem.agilent.com/earray/ (follow the link to Catalog Gene Lists).

I don't see any need to restrict to one probe for each gene symbol. Multiple probes cause no problems as at all in a limma analysis. If you do have a pressing reason to restrict to one probe per gene, then I don't recommend avereps(). Instead I recommend selecting the most highly expressed probe for each gene, see for example: Duplicate gene ID's returned from limma microarray analysis or Use probesets with highest baseline expression for differential gene

ADD REPLY • link 5.6 years ago Gordon Smyth 50k

0

Entering edit mode

Dear Gordon, thank you for your valuable comments and considerations-without the intention to disturb you more, i will just add some quick final comments based on your suggestions:

1) Firstly, based on your initial part of your answer, you meant something like the following ? based on the "old" annotation:

head(y$genes)
  Row Col ControlType       ProbeName  SystematicName
1   1   1           1 GE_BrightCorner GE_BrightCorner
2   1   2           1      DarkCorner      DarkCorner
3   1   3           1      DarkCorner      DarkCorner
4   1   4           0    A_23_P117082       NM_015987
5   1   5           0   A_33_P3246448       NM_080671
6   1   6           0   A_33_P3318220       NM_178466

head(ant2) # the annotation downloaded csv
  Comment.SystematicName. Comment.GENE_SYMBOL.
1         GE_BrightCorner                 <NA>
2              DarkCorner                 <NA>
3              DarkCorner                 <NA>
4               NM_015987                HEBP1
5               NM_080671                KCNE4
6               NM_178466               BPIFA3
                                             Comment.GENE_NAME. Comment.CYTOBAND.
1                                                          <NA>              <NA>
2                                                          <NA>              <NA>
3                                                          <NA>              <NA>
4                                        heme binding protein 1        hs|12p13.1
5 potassium voltage-gated channel, Isk-related family, member 4         hs|2q36.1
6                        BPI fold containing family A, member 3       hs|20q11.2

identical(y$genes$SystematicName,as.character(ant2$Comment.SystematicName.))
[1] TRUE

y$genes$Gene.Symbol <- as.character(ant2$Comment.GENE_SYMBOL.)
> head(y$genes)
  Row Col ControlType       ProbeName  SystematicName Gene.Symbol
1   1   1           1 GE_BrightCorner GE_BrightCorner        <NA>
2   1   2           1      DarkCorner      DarkCorner        <NA>
3   1   3           1      DarkCorner      DarkCorner        <NA>
4   1   4           0    A_23_P117082       NM_015987       HEBP1
5   1   5           0   A_33_P3246448       NM_080671       KCNE4
6   1   6           0   A_33_P3318220       NM_178466      BPIFA3

And then, after removing control probesets, probesets mapping to NA symbols, and filtering-in the last final step,

i should perform your suggestion and use alias2SymbolTable ? despite your concerns that various gene symbols might have NA values when quering the official gene symbol ?

B) Based on your suggestion, to use the agilent's link-when i found the specific platform gene list:

 agl <- read.csv("039494_D_GeneList_20180813.txt",header=T,stringsAsFactors = FALSE,sep="\t")
dim(agl)
[1] 50683     6

head(agl[,1:3])
                ProbeID TargetID GeneSymbol
1 A_33_P3671291 AA378382    SNORA12
2 A_33_P3272010 AA496144           
3 A_33_P3412560 AA504818           
4 A_33_P3696965 AA593742           
5 A_33_P3287731 AA626923      KAT6B
6 A_21_P0014788 AA682772

Thus, from an initial detailed search, it has less entries that the agilent dataset (50683 vs 62976), as also two other things:

the ProbeID, despite matching the ProbeName column from the y object above (A_21_P0003064 example) does not contain any control probes

on the other hand, the TargetID from the updated annotation, matches the SystematicName of the y object (mainly the prefixs like NM_139162, as also ENST00000414796 or non coding abbreviations TCONS_l2_00027818) but also it does not contain the control probes-

so, in your opinion, i should first remove the control probes from the EList limma object, and then match for example the ProbeID column from the updated annotation with the ProbeName column from the y object ? as the ProbeID has more entries than the TargetID ? and in the end, keep only the common probeIDs to continue ?

3) Finally, yes, our main reason at the end, is to restrict to a unique gene symbol signature, to perform exploratory heatmaps, as also for separating the samples based on the expression of one specific gene-but your advise seems more robust than avereps, so i will follow this approach for the duplicated probes.

Best,

Efstathios

ADD REPLY • link 5.6 years ago svlachavas ▴ 830

1

Entering edit mode

Your approaches sound ok.

ADD REPLY • link 5.6 years ago Gordon Smyth 50k