Question: Error when restore rownames of Feature data of ExpressionSet
0
gravatar for peirinl
18 months ago by
peirinl0
peirinl0 wrote:

Dear all,

I do annotation and removing multiple mappings, 

anno_dataSet <- AnnotationDbi::select(hgu133plus2.db,
                                      keys=(featureNames(dataSet_filtered)),
                                      columns = c("SYMBOL", "GENENAME"),
                                      keytype="PROBEID")

probe_stats <- anno_dataSet %>% group_by(PROBEID) %>%
  summarize(no_of_matches = n_distinct(SYMBOL)) %>%
  filter(no_of_matches > 1)

ids_to_exlude <- ((featureNames(dataSet_filtered) %in% probe_stats$PROBEID) | featureNames(dataSet_filtered) %in% subset(anno_dataSet, is.na(SYMBOL))$PROBEID)

dataSet_final <- subset(dataSet_filtered, !ids_to_exlude)

fData(dataSet_final)$PROBEID <- rownames(fData(dataSet_final))
fData(dataSet_final) <- left_join(fData(dataSet_final), anno_dataSet)
rownames(fData(dataSet_final)) <-fData(dataSet_final)$PROBEID

But I get 

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
Besides: Warning message:
non-unique value when setting 'row.names': '227320_at'

My question is, 

1. is it normal to have a duplicate probe unremoved after I removing multiple mappings?

2. can I remove the duplicate data, so that I can change the rownames using PROBEID? If so, what is the best way?

Thank you.

expressionset • 413 views
ADD COMMENTlink modified 18 months ago by Martin Morgan ♦♦ 23k • written 18 months ago by peirinl0

If I run your code with 

k <- keys(hgu133plus2.db)
dataSet_filtered <- ExpressionSet(assayData=matrix(runif(length(k)), nrow=length(k), ncol=11))
featureNames(dataSet_filtered) <- k

as the dataSet_filtered then I get no errors, so I would advise double-checking the integrity of the original object - I can't offhand see an easy way that '227320_at' would have got through the filtering by ids_to_exlude (or be re-introduced by the left_join) - you might want to debug by tracing this particular IDs appearance (or any other ID that is duplicated in fData(dataSet_final)$PROBEID before the final line of your analysis).

 

ADD REPLYlink written 18 months ago by Gavin Kelly560

I agree with Gavins's answer - the code will exclude all probe IDs with multiple mappings to gene symbols, so by definition there should not be any duplicate PROBEIDs in fData as I retain only the PROBEIDs matching to exactly one gene symbol.

Are you using the correct annotation database for your array? Is it the hgu133plus2 platform?

The code obviously won't work if the annotation database / package is not the right one for your array.

ADD REPLYlink modified 18 months ago • written 18 months ago by Bernd Klaus540

You are right, the duplication is induced by left_join. Before left_join, 

> anyDuplicated(rownames(fData(dataSet_final)))
[1] 0

> anyDuplicated(anno_dataSet$PROBEID)
[1] 2

> rownames(fData(riester_final))[28589]
[1] "227321_at"
> rownames(fData(riester_final))[28588]
[1] "227320_at"

After Left join, 

> anyDuplicated(fData(riester_final)$PROBEID)
[1] 28589

> fData(riester_final)$PROBEID[28589]
[1] "227320_at"
> fData(riester_final)$PROBEID[28588]
[1] "227320_at"

Clearly, I dont know how to explain. Base on my understanding, it should not have such insertion on left_joint. How can I solve that?

ADD REPLYlink written 18 months ago by peirinl0

I try another dataset, also using hgu133plus2.db. The outcome is same. 

ADD REPLYlink written 18 months ago by peirinl0

This is very strange.  2273020_at _is_ appearing in riester_final, so that implies it is not duplicated in the anno_dataSet, so the inner join - assuming it is the dplyr version and is being done on PROBEID (maybe set an explicit 'by' in the join to enforce this) - should only return one row, as you expect.  Could you report the result of doing anno_dataSet %>% filter(PROBEID=="2273020_at") and probe_stats %>% filter(PROBEID=="2273020_at") and similar with your riester_final before the join: I can only imagine that somehow the n_distinct isn't resolving all identical cases - maybe your annotation has two identical symbols with different gene-names, or something like that.

ADD REPLYlink written 18 months ago by Gavin Kelly560
1

This is the behavior of left_join()

> left_join(data.frame(x=1), data.frame(x=1, y=1:2))
Joining, by = "x"
  x y
1 1 1
2 1 2

and the documentation, consistent with this behavior

     'left_join()' return all rows from 'x', and all columns from 'x'
          and 'y'. Rows in 'x' with no match in 'y' will have 'NA'
          values in the new columns. If there are multiple matches
          between 'x' and 'y', all combinations of the matches are
          returned.

There are multiple symbols associated with key 227320_at

> AnnotationDbi::select(hgu133plus2.db, "227320_at", c("SYMBOL", "GENENAME"), "PROBEID")
'select()' returned 1:many mapping between keys and columns
    PROBEID SYMBOL                 GENENAME
1 227320_at  RFLNA ZNF664-RFLNA readthrough
2 227320_at  RFLNA                refilin A

So there are no surprises in the code.

ADD REPLYlink modified 18 months ago • written 18 months ago by Martin Morgan ♦♦ 23k
> anno_riester %>% filter(PROBEID=="227320_at")
    PROBEID SYMBOL                 GENENAME
1 227320_at  RFLNA ZNF664-RFLNA readthrough
2 227320_at  RFLNA                refilin A

> probe_stats %>% filter(PROBEID=="227320_at")
# A tibble: 0 x 2
# ... with 2 variables: PROBEID <chr>, no_of_matches <int>
ADD REPLYlink modified 18 months ago by Martin Morgan ♦♦ 23k • written 18 months ago by peirinl0

It looks like the two instances aren't distinguished by the SYMBOL (RFLNA in both cases), so rather than use n_distinct(SYMBOL) (which will give 1), you may want to use n() instead, which will give the number of rows (2 in this case) in the group. Or use an alternative philosophy regarding multi-symbol probeids, such as just quoting the first (like James' annotateEset method will), or concatenating them...

ADD REPLYlink written 18 months ago by Gavin Kelly560
Answer: Error when restore rownames of Feature data of ExpressionSet
0
gravatar for James W. MacDonald
18 months ago by
United States
James W. MacDonald49k wrote:

You might consider using annotateEset in my affycoretools package. Something like

anno_dataSet <- annotateEset(dataSet_filtered, hgu133plus2.db)

will do exactly what you are trying to do in one line of code.

ADD COMMENTlink written 18 months ago by James W. MacDonald49k

Nice! So I can use anno_dataSet to do the subsequent LIMMA testing?

ADD REPLYlink modified 18 months ago • written 18 months ago by peirinl0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 154 users visited in the last hour