Dear Community,
for an ongoing collaborative project, i want to analyze a transcriptomic dataset, which is not deposited in GEO, but in ArrayExpress:
https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6326/?page=1&pagesize=500
So, through the R package arrayexpress, i tried initially to download the relative raw data:
library(ArrayExpress)
rawset = ArrayExpress("E-MTAB-6326")
rawset
NChannelSet (storageMode: lockedEnvironment)
assayData: 62976 features, 119 samples
element names: E, Eb
protocolData: none
phenoData
rowNames: US84600244_253949426815_S01_GE1_107_Sep09_1_4.txt
US84600244_253949426926_S01_GE1_107_Sep09_2_3.txt ...
US84600244_253949426813_S01_GE1_107_Sep09_1_4.txt (119 total)
varLabels: Source.Name Characteristics.organism. ...
Factor.Value.RIDD.score. (29 total)
varMetadata: labelDescription channel
featureData
featureNames: 1 2 ... 62976 (62976 total)
fvarLabels: Row Col ... SystematicName (5 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
However, my main concerns are the following:
1) Based on the manual of ArrayExpress, if the returned class object is an NChannelSet, then the experiment is a two-colour experiment-
but, when i checked the relative "Sample and data relationship" txt file, the column Label only reports the Cy3 label-
thus, this platform is indeed a two colour, or i have understood something wrong here ? Also this is evident, from the phenodata info from the downloaded object:
table(rawset$Label)
Cy3
119
head(table(rawset$Labeled.Extract.Name))
1428:Cy3 2948:Cy3 3174:Cy3 3178:Cy3 3183:Cy3 3280:Cy3
1 1 1 1 1 1
Just for simplicity, the platform is
the Agilent SurePrint G3 Human GE v2 8x60k Microarray (Array Design A-MEXP-2320)
2) In conjuction with my previous question, how should i continue with downstream analysis ? My main goals is primarily test a small gene signature on how the data are clustered, as also to perform a DE-analysis; however, how i should proceed with the NChannelSet ? in order to utilize limma for normalization , filtering etc ?
Thank you in advance,
Efstathios
Dear Gordon,
thank you one more time for your feedback-i would like just to add some crusial comments based on my downstream analysis, in which I'm facing some issues :
Initially, based on your suggestion, i downloaded the 3 zipped parts of the dataset, unzipped them, and then use the SDRF file of the phenotype information, to read with the same order each sample, in order also to have the additional sample information in the right order:
sdrf <- read.table("E-MTAB-6326.sdrf.txt",stringsAsFactors = FALSE,sep="\t",header=T)
x <- read.maimages(sdrf[,"Array.Data.File"],
source="agilent", green.only=TRUE, other.columns="gIsWellAboveBG")
identical(sdrf$Assay.Name,colnames(x))
x$targets <- sdrf
### ANNOTATION PROCEDURE PRIOR DOWNSTEAM ANALYSIS ######### ??
y <- backgroundCorrect(x,method="normexp")
y <- normalizeBetweenArrays(y,method="quantile")
head(y$genes$ProbeName)
[1] "GE_BrightCorner" "DarkCorner" "DarkCorner" "A_23_P117082"
[5] "A_33_P3246448" "A_33_P3318220"
rownames(y) <-y$genes$ProbeName
Control <- y$genes$ControlType==1L
IsExpr <- rowSums(y$other$gIsWellAboveBG > 0) >= 59
yfilt <- y[!Control & IsExpr, ]
As my main objectives here are:
Firstly, on the filtered dataset, with unique gene symbols on the row, create a heatmap plot of the dataset, based on a tested 38 gene signature, and how it separates the samples
Secondly, on the expression of a specific tested gene of interest, separate the samples on two groups (median-high and low) and perform DE analysis between these 2 groups using limma, with a simple two group factor (characterizing each sample as high or low)-the notion for this, is that we are searching for DE genes betweeen tumors that have differences in expression in this gene, and then map the DEgenes to biological processes and pathways...
1) My main first issue, is that i can't find an annotation R package for this specific Agilent type (Agilent SurePrint G3 Human GE v2-Agilent design ID 039494)-from ArrayExpress they provide a txt file with gene symbol mappings, but this is from 2014, and I'm a bit reluctant to use them- what is your opinion on this ?
2) In case the only solution is to use these gene symbols, and add them to a slot in the initial object
A) For both of my goals, i should have unique gene identifiers in the rows ? as desribed above ? and perform the annotation prior background correction and normalization ?
B) For the usual phenomenon of agilent arrays, i should also use the function avereps in the above yfilt object ?
final <- avereps(yfilt,ID=yfilt$genes$ProbeName)
Thank you,
Efstathios
Dear Efstathios,
I can't give you detailed advice regarding this particular dataset, but here are some brief responses:
I don't see a big problem with using a array annotation file from 2014. You can update the gene symbols to current official symbols using the alias2SymbolTable() ro alias2SymbolUsingNCBI() functions in limma, although the process will be much more complicated if the file contains old RefSeq Ids or Ensemble transcript Ids, as I think it might. On the other hand, you can download the current annotation file for this Agilent array (last updated 13 Aug 2018) from the Agilent support website at https://earray.chem.agilent.com/earray/ (follow the link to Catalog Gene Lists).
I don't see any need to restrict to one probe for each gene symbol. Multiple probes cause no problems as at all in a limma analysis. If you do have a pressing reason to restrict to one probe per gene, then I don't recommend avereps(). Instead I recommend selecting the most highly expressed probe for each gene, see for example: Duplicate gene ID's returned from limma microarray analysis or Use probesets with highest baseline expression for differential gene
Dear Gordon, thank you for your valuable comments and considerations-without the intention to disturb you more, i will just add some quick final comments based on your suggestions:
1) Firstly, based on your initial part of your answer, you meant something like the following ? based on the "old" annotation:
And then, after removing control probesets, probesets mapping to NA symbols, and filtering-in the last final step,
i should perform your suggestion and use alias2SymbolTable ? despite your concerns that various gene symbols might have NA values when quering the official gene symbol ?
B) Based on your suggestion, to use the agilent's link-when i found the specific platform gene list:
Thus, from an initial detailed search, it has less entries that the agilent dataset (50683 vs 62976), as also two other things:
the ProbeID, despite matching the ProbeName column from the y object above (A_21_P0003064 example) does not contain any control probes
on the other hand, the TargetID from the updated annotation, matches the SystematicName of the y object (mainly the prefixs like NM_139162, as also ENST00000414796 or non coding abbreviations TCONS_l2_00027818) but also it does not contain the control probes-
so, in your opinion, i should first remove the control probes from the EList limma object, and then match for example the ProbeID column from the updated annotation with the ProbeName column from the y object ? as the ProbeID has more entries than the TargetID ? and in the end, keep only the common probeIDs to continue ?
3) Finally, yes, our main reason at the end, is to restrict to a unique gene symbol signature, to perform exploratory heatmaps, as also for separating the samples based on the expression of one specific gene-but your advise seems more robust than avereps, so i will follow this approach for the duplicated probes.
Best,
Efstathios
Your approaches sound ok.