Question

Annotation Problem regarding double platform merged microarray eset using the inSilicoMerging R package, after differential expression analysis

1

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 7 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor Community,

in addition to one of my previous posts (Merge different datasets and perform differential expression analysis in limma) regarding the merging of two datasets with the R package inSilicoMerging and performing differential expression with limma , i have encoutered a problem with annotating the selected DEG probesets: because the annotation in my merged eset is double: hgu133plus2 hgu133a, i cant use any annotation platform like hgu133a.db or hgu133plus2.db which comprise the two datasets with any functions(i.e. select).

Thus, it would be grateful for any advice or solution about annotating my DEG probesets(genes) or deal with this specific problem with any possible tool !!

(* I have also send an email to the creators of the inSilicoMerging package, but as currently i havent got any response or feedback, i decided to also create here a question to get any possible feedback or suggestions)

inSilicoMerging doubleannotation affymetrix microarrays limma inSilicoDB • 1.9k views

ADD COMMENT • link 8.7 years ago svlachavas ▴ 830

score 1 · Answer 1 · 2015-08-24

Just annotate first, and then merge. Stick annotations into the featureData of the ExpressionSet object, and they should preserved in the output of the merge command. Note that only the featureData of the first ExpressionSet seems to be preserved by inSilicoMerging. This means that you don't really need to use annotation from both platforms, as the annotation in the second ExpressionSet won't be used anyway. At any rate, it's not clear how you would combine the annotation if they were different between platforms; and if they're the same, there's no need to combine them at all, so you might as well just use one or the other.

score 0 · Answer 2 · 2015-08-24

Dear Aaron,

thank you for your immediate response. Firstly i want to pinpoint two important things before asking some further questions about the annotations:

1. When i use the function merge from the package inSilicoMerging, it keeps only the common probesets between the two annotation platforms. So, in your opinion, these common probesets should have the same annotation(i.e SYMBOL) in both hgu133a & hgu133plus2 platforms ?

2. Secondly, regarding your comment " Note that only the featureData of the firstExpressionSet seems to be preserved by inSilicoMerging" ,

it seems that when i create a list of the datasets which i want to merge, the first dataset used for the list is this that you mention that seems to be preserved.

Thus, if my above points are on the right direction, regarding your proposal of first annotating and then merging the two datasets, should i follow the next lines of code-using the annotation of the hgu133plus2 dataset ??-

library(hgu133plus2.db)

gns <- select(hgu133plus2.db,featureNames(agcrma),"SYMBOL")

gns <- gns[!duplicated(gns[,1]),] # with the most naive way of removing probesets that return one to many mappings

head(gns)
PROBEID SYMBOL
1 1007_s_at DDR1
3 1053_at RFC2
4 117_at HSPA6
5 121_at PAX8
6 1255_g_at GUCA1A
7 1294_at UBA7

featureData(agcrma) <- new("AnnotatedDataFrame", data=gns) # possible way of annotating the ExpressionSet first(?)

agcrma
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 34 samples
element names: exprs
protocolData
sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... St_T_WM60.CEL (34 total)
varLabels: ScanDate
varMetadata: labelDescription
phenoData
sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... St_T_WM60.CEL (34 total)
varLabels: meta factor tissue
varMetadata: labelDescription
featureData
featureNames: 1 3 ... 58608 (54675 total)
fvarLabels: PROBEID SYMBOL
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2

AND then after merging:

eset_COMBAT
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22277 features, 60 samples
element names: exprs
protocolData: none
phenoData
sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... 1554_03_Gemmer_1.CEL (60
total)
varLabels: Disease meta factor ... tissue (5 total)
varMetadata: labelDescription
featureData
featureNames: NA NA.1 ... NA.22276 (22277 total)
fvarLabels: PROBEID SYMBOL
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2 hgu133a

So this is the appropriate way of moving forward ? Or the functions above about annotating first the one ExpressionSet is wrong ? Im asking because i have never annotated first an ExpressionSet, but only after differential expression with limma.

Finally, if the above functions are ok, how then after merging and performing DExpression with limma, can i access my annotations(Gene Symbols) ?

Thank you for your consideration on this matter !!

score 0 · Answer 3 · 2015-08-24

0

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 7 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Aaron,

thank you again for your recommendations. Firstly, regarding your answer in my first question : "Well, you'd hope so" , i thought with the following code maybe to get more confidence about using only the one of the two annotation platforms for the merged eset and use the common probesets of the two datasets:

eset.2 <- eset[rownames(eset_COMBAT),] # the common probesets after merging for the hgu133a dataset

agcrma.2 <- agcrma[rownames(eset_COMBAT),] # ..for the hgu133plus2 dataset

gns.hgu133a <- select(hgu133a.db, featureNames(eset.2), "SYMBOL")

gns.hgu133a <- gns.hgu133a[!duplicated(gns.hgu133a[,1]),]

gns.hgu133plus2 <- select(hgu133plus2.db, featureNames(agcrma.2), "SYMBOL")

gns.hgu133plus2 <- gns.hgu133plus2[!duplicated(gns.hgu133plus2[,1]),]

identical(gns.hgu133a,gns.hgu133plus2)
[1] TRUE

Secondly, when i used your above function: featureData(eset_COMBAT)$SYMBOL

featureData(eset_COMBAT)$SYMBOL
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[27] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[53] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[79] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[105] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA.........

[ reached getOption("max.print") -- omitted 12277 entries ]

So here i wonder where is possibly the problem ?

maybe when i type eset_COMBAT

eset_COMBAT
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22277 features, 60 samples
element names: exprs
protocolData: none
phenoData
sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... 1554_03_Gemmer_1.CEL (60
total)
varLabels: Disease meta factor ... tissue (5 total)
varMetadata: labelDescription
featureData
featureNames: NA NA.1 ... NA.22276 (22277 total) # Or this is irrelevant with the NA values returned ??
fvarLabels: PROBEID SYMBOL
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2 hgu133a

ADD COMMENT • link 8.7 years ago svlachavas ▴ 830

0

Entering edit mode

I can't reproduce your problem with the NA values. However, it seems like the merge function determines matching rows based on the featureNames of each ExpressionSet. You could try assigning the featureData(eset)$SYMBOL as the featureNames(eset) for each eset to be merged, and see if that helps. Obviously, if the symbols for one of the input ExpressionSet objects are all NA, then you should first figure out why that is occurring.

ADD REPLY • link 8.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

you mean something like this:

# for instanse for the hgu133plus2 dataset: library(hgu133plus2.db)

gns <- select(hgu133plus2.db,featureNames(agcrma),"SYMBOL")

gns <- gns[!duplicated(gns[,1]),]

featureNames(agcrma)<-featureData(agcrma)$SYMBOL

for each dataset before merging ?? or you meant something different ??

because i tried it but i got the following message:

Error in `row.names<-.data.frame`(`*tmp*`, value = c("DDR1", "RFC2", "HSPA6", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘A1CF’, ‘A2M’, ‘A2ML1......

ADD REPLY • link 8.7 years ago svlachavas ▴ 830

1

Entering edit mode

Try using the PROBEID as the featureNames instead, it looks like they match up between the two platforms. In any case, you need to make sure that the featureNames for each input ExpressionSet are meaningful, otherwise the merge function won't know how to stitch them together.

ADD REPLY • link 8.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you again for your idea !!!

Before merging, i tried as you proposed : featureNames(agcrma)<-featureData(agcrma)$PROBEID # for the hgu133plus2 dataset, as it is the dataset which is preserved by inSilicoMerging, without interfere to the other dataset

agcrma
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 34 samples
element names: exprs
protocolData
sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... St_T_WM60.CEL (34 total)
varLabels: ScanDate
varMetadata: labelDescription
phenoData
sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... St_T_WM60.CEL (34 total)
varLabels: Meta_factor Disease Study
varMetadata: labelDescription
featureData
featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
fvarLabels: PROBEID SYMBOL
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2

and then i continued with merging ...after merge:

eset_COMBAT
ExpressionSet (storageMode: lockedEnvironment)
assayData: 22277 features, 60 samples
element names: exprs
protocolData: none
phenoData
sampleNames: St_1_WL57.CEL St_2_WL57.CEL ... 1554_03_Gemmer_1.CEL (60 total)
varLabels: Disease Meta_factor Study
varMetadata: labelDescription
featureData
featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (22277 total)
fvarLabels: PROBEID SYMBOL
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: hgu133plus2 hgu133a

and when i used: head(featureData(eset_COMBAT)$SYMBOL)
[1] "DDR1" "RFC2" "HSPA6" "PAX8" "GUCA1A" "UBA7"

So it seems that now it works and probably i can access my annotations (especially after implementing limma), while in my above answer i have checked that regarding the common probesets, that they match to the same gene symbols in both affymetrix platforms.

One last idea i have but i have never implement it(but i believe it is general on annotation) is maybe i could also use the custom CDF from human Brain Array to get rid of duplicates and similar issues, but i have never used them (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp)

ADD REPLY • link 8.7 years ago svlachavas ▴ 830