Question

Error when restore rownames of Feature data of ExpressionSet

0

Entering edit mode

peirinl • 0

@peirinl-14078

Last seen 3.9 years ago

Dear all,

I do annotation and removing multiple mappings,

anno_dataSet <- AnnotationDbi::select(hgu133plus2.db,
                                      keys=(featureNames(dataSet_filtered)),
                                      columns = c("SYMBOL", "GENENAME"),
                                      keytype="PROBEID")

probe_stats <- anno_dataSet %>% group_by(PROBEID) %>%
  summarize(no_of_matches = n_distinct(SYMBOL)) %>%
  filter(no_of_matches > 1)

ids_to_exlude <- ((featureNames(dataSet_filtered) %in% probe_stats$PROBEID) | featureNames(dataSet_filtered) %in% subset(anno_dataSet, is.na(SYMBOL))$PROBEID)

dataSet_final <- subset(dataSet_filtered, !ids_to_exlude)

fData(dataSet_final)$PROBEID <- rownames(fData(dataSet_final))
fData(dataSet_final) <- left_join(fData(dataSet_final), anno_dataSet)
rownames(fData(dataSet_final)) <-fData(dataSet_final)$PROBEID

But I get

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
Besides: Warning message:
non-unique value when setting 'row.names': '227320_at'

My question is,

1. is it normal to have a duplicate probe unremoved after I removing multiple mappings?

2. can I remove the duplicate data, so that I can change the rownames using PROBEID? If so, what is the best way?

Thank you.

expressionset • 1.8k views

ADD COMMENT • link updated 6.6 years ago by Martin Morgan 25k • written 6.6 years ago by peirinl • 0

0

Entering edit mode

If I run your code with

k <- keys(hgu133plus2.db)
dataSet_filtered <- ExpressionSet(assayData=matrix(runif(length(k)), nrow=length(k), ncol=11))
featureNames(dataSet_filtered) <- k

as the dataSet_filtered then I get no errors, so I would advise double-checking the integrity of the original object - I can't offhand see an easy way that '227320_at' would have got through the filtering by ids_to_exlude (or be re-introduced by the left_join) - you might want to debug by tracing this particular IDs appearance (or any other ID that is duplicated in fData(dataSet_final)$PROBEID before the final line of your analysis).

ADD REPLY • link 6.6 years ago Gavin Kelly ▴ 680

0

Entering edit mode

I agree with Gavins's answer - the code will exclude all probe IDs with multiple mappings to gene symbols, so by definition there should not be any duplicate PROBEIDs in fData as I retain only the PROBEIDs matching to exactly one gene symbol.

Are you using the correct annotation database for your array? Is it the hgu133plus2 platform?

The code obviously won't work if the annotation database / package is not the right one for your array.

ADD REPLY • link 6.6 years ago Bernd Klaus ▴ 610

0

Entering edit mode

You are right, the duplication is induced by left_join. Before left_join,

> anyDuplicated(rownames(fData(dataSet_final)))
[1] 0

> anyDuplicated(anno_dataSet$PROBEID)
[1] 2

> rownames(fData(riester_final))[28589]
[1] "227321_at"
> rownames(fData(riester_final))[28588]
[1] "227320_at"

After Left join,

> anyDuplicated(fData(riester_final)$PROBEID)
[1] 28589

> fData(riester_final)$PROBEID[28589]
[1] "227320_at"
> fData(riester_final)$PROBEID[28588]
[1] "227320_at"

Clearly, I dont know how to explain. Base on my understanding, it should not have such insertion on left_joint. How can I solve that?

ADD REPLY • link 6.6 years ago peirinl • 0

0

Entering edit mode

I try another dataset, also using hgu133plus2.db. The outcome is same.

ADD REPLY • link 6.6 years ago peirinl • 0

0

Entering edit mode

This is very strange. 2273020_at _is_ appearing in riester_final, so that implies it is not duplicated in the anno_dataSet, so the inner join - assuming it is the dplyr version and is being done on PROBEID (maybe set an explicit 'by' in the join to enforce this) - should only return one row, as you expect. Could you report the result of doing anno_dataSet %>% filter(PROBEID=="2273020_at") and probe_stats %>% filter(PROBEID=="2273020_at") and similar with your riester_final before the join: I can only imagine that somehow the n_distinct isn't resolving all identical cases - maybe your annotation has two identical symbols with different gene-names, or something like that.

ADD REPLY • link 6.6 years ago Gavin Kelly ▴ 680

1

Entering edit mode

This is the behavior of left_join()

> left_join(data.frame(x=1), data.frame(x=1, y=1:2))
Joining, by = "x"
  x y
1 1 1
2 1 2

and the documentation, consistent with this behavior

     'left_join()' return all rows from 'x', and all columns from 'x'
          and 'y'. Rows in 'x' with no match in 'y' will have 'NA'
          values in the new columns. If there are multiple matches
          between 'x' and 'y', all combinations of the matches are
          returned.

There are multiple symbols associated with key 227320_at

> AnnotationDbi::select(hgu133plus2.db, "227320_at", c("SYMBOL", "GENENAME"), "PROBEID")
'select()' returned 1:many mapping between keys and columns
    PROBEID SYMBOL                 GENENAME
1 227320_at  RFLNA ZNF664-RFLNA readthrough
2 227320_at  RFLNA                refilin A

So there are no surprises in the code.

ADD REPLY • link 6.6 years ago Martin Morgan 25k

0

Entering edit mode

> anno_riester %>% filter(PROBEID=="227320_at")
    PROBEID SYMBOL                 GENENAME
1 227320_at  RFLNA ZNF664-RFLNA readthrough
2 227320_at  RFLNA                refilin A

> probe_stats %>% filter(PROBEID=="227320_at")
# A tibble: 0 x 2
# ... with 2 variables: PROBEID <chr>, no_of_matches <int>

ADD REPLY • link updated 6.6 years ago by Martin Morgan 25k • written 6.6 years ago by peirinl • 0

0

Entering edit mode

It looks like the two instances aren't distinguished by the SYMBOL (RFLNA in both cases), so rather than use n_distinct(SYMBOL) (which will give 1), you may want to use n() instead, which will give the number of rows (2 in this case) in the group. Or use an alternative philosophy regarding multi-symbol probeids, such as just quoting the first (like James' annotateEset method will), or concatenating them...

ADD REPLY • link 6.6 years ago Gavin Kelly ▴ 680

score 0 · Answer 1 · 2017-10-05

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 11 hours ago

United States

You might consider using annotateEset in my affycoretools package. Something like

anno_dataSet <- annotateEset(dataSet_filtered, hgu133plus2.db)

will do exactly what you are trying to do in one line of code.

ADD COMMENT • link 6.6 years ago James W. MacDonald 65k

0

Entering edit mode

Nice! So I can use anno_dataSet to do the subsequent LIMMA testing?

ADD REPLY • link 6.6 years ago peirinl • 0