Question: Error when restore rownames of Feature data of ExpressionSet
0
2.0 years ago by
peirinl0
peirinl0 wrote:

Dear all,

I do annotation and removing multiple mappings,

anno_dataSet <- AnnotationDbi::select(hgu133plus2.db,
keys=(featureNames(dataSet_filtered)),
columns = c("SYMBOL", "GENENAME"),
keytype="PROBEID")

probe_stats <- anno_dataSet %>% group_by(PROBEID) %>%
summarize(no_of_matches = n_distinct(SYMBOL)) %>%
filter(no_of_matches > 1)

ids_to_exlude <- ((featureNames(dataSet_filtered) %in% probe_stats$PROBEID) | featureNames(dataSet_filtered) %in% subset(anno_dataSet, is.na(SYMBOL))$PROBEID)

dataSet_final <- subset(dataSet_filtered, !ids_to_exlude)

fData(dataSet_final)$PROBEID <- rownames(fData(dataSet_final)) fData(dataSet_final) <- left_join(fData(dataSet_final), anno_dataSet) rownames(fData(dataSet_final)) <-fData(dataSet_final)$PROBEID

But I get

Error in row.names<-.data.frame(*tmp*, value = value) :
duplicate 'row.names' are not allowed
Besides: Warning message:
non-unique value when setting 'row.names': '227320_at'

My question is,

1. is it normal to have a duplicate probe unremoved after I removing multiple mappings?

2. can I remove the duplicate data, so that I can change the rownames using PROBEID? If so, what is the best way?

Thank you.

expressionset • 494 views
modified 2.0 years ago by Martin Morgan ♦♦ 23k • written 2.0 years ago by peirinl0

If I run your code with

k <- keys(hgu133plus2.db)
dataSet_filtered <- ExpressionSet(assayData=matrix(runif(length(k)), nrow=length(k), ncol=11))
featureNames(dataSet_filtered) <- k

as the dataSet_filtered then I get no errors, so I would advise double-checking the integrity of the original object - I can't offhand see an easy way that '227320_at' would have got through the filtering by ids_to_exlude (or be re-introduced by the left_join) - you might want to debug by tracing this particular IDs appearance (or any other ID that is duplicated in fData(dataSet_final)$PROBEID before the final line of your analysis). ADD REPLYlink written 2.0 years ago by Gavin Kelly560 I agree with Gavins's answer - the code will exclude all probe IDs with multiple mappings to gene symbols, so by definition there should not be any duplicate PROBEIDs in fData as I retain only the PROBEIDs matching to exactly one gene symbol. Are you using the correct annotation database for your array? Is it the hgu133plus2 platform? The code obviously won't work if the annotation database / package is not the right one for your array. ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Bernd Klaus560 You are right, the duplication is induced by left_join. Before left_join, > anyDuplicated(rownames(fData(dataSet_final))) [1] 0 > anyDuplicated(anno_dataSet$PROBEID)
[1] 2

> rownames(fData(riester_final))[28589]
[1] "227321_at"
> rownames(fData(riester_final))[28588]
[1] "227320_at"

After Left join,

> anyDuplicated(fData(riester_final)$PROBEID) [1] 28589 > fData(riester_final)$PROBEID[28589]
[1] "227320_at"
> fData(riester_final)\$PROBEID[28588]
[1] "227320_at"

Clearly, I dont know how to explain. Base on my understanding, it should not have such insertion on left_joint. How can I solve that?

I try another dataset, also using hgu133plus2.db. The outcome is same.

This is very strange.  2273020_at _is_ appearing in riester_final, so that implies it is not duplicated in the anno_dataSet, so the inner join - assuming it is the dplyr version and is being done on PROBEID (maybe set an explicit 'by' in the join to enforce this) - should only return one row, as you expect.  Could you report the result of doing anno_dataSet %>% filter(PROBEID=="2273020_at") and probe_stats %>% filter(PROBEID=="2273020_at") and similar with your riester_final before the join: I can only imagine that somehow the n_distinct isn't resolving all identical cases - maybe your annotation has two identical symbols with different gene-names, or something like that.

1

This is the behavior of left_join()

> left_join(data.frame(x=1), data.frame(x=1, y=1:2))
Joining, by = "x"
x y
1 1 1
2 1 2

and the documentation, consistent with this behavior

     'left_join()' return all rows from 'x', and all columns from 'x'
and 'y'. Rows in 'x' with no match in 'y' will have 'NA'
values in the new columns. If there are multiple matches
between 'x' and 'y', all combinations of the matches are
returned.

There are multiple symbols associated with key 227320_at

> AnnotationDbi::select(hgu133plus2.db, "227320_at", c("SYMBOL", "GENENAME"), "PROBEID")
'select()' returned 1:many mapping between keys and columns
PROBEID SYMBOL                 GENENAME
2 227320_at  RFLNA                refilin A


So there are no surprises in the code.

> anno_riester %>% filter(PROBEID=="227320_at")
PROBEID SYMBOL                 GENENAME
2 227320_at  RFLNA                refilin A

> probe_stats %>% filter(PROBEID=="227320_at")
# A tibble: 0 x 2
# ... with 2 variables: PROBEID <chr>, no_of_matches <int>
ADD REPLYlink modified 2.0 years ago by Martin Morgan ♦♦ 23k • written 2.0 years ago by peirinl0

It looks like the two instances aren't distinguished by the SYMBOL (RFLNA in both cases), so rather than use n_distinct(SYMBOL) (which will give 1), you may want to use n() instead, which will give the number of rows (2 in this case) in the group. Or use an alternative philosophy regarding multi-symbol probeids, such as just quoting the first (like James' annotateEset method will), or concatenating them...

Answer: Error when restore rownames of Feature data of ExpressionSet
0
2.0 years ago by
United States
James W. MacDonald51k wrote:

You might consider using annotateEset in my affycoretools package. Something like

anno_dataSet <- annotateEset(dataSet_filtered, hgu133plus2.db)

will do exactly what you are trying to do in one line of code.