Question: Error when restore rownames of Feature data of ExpressionSet
0
gravatar for peirinl
2.0 years ago by
peirinl0
peirinl0 wrote:

Dear all,

I do annotation and removing multiple mappings, 

anno_dataSet <- AnnotationDbi::select(hgu133plus2.db,
                                      keys=(featureNames(dataSet_filtered)),
                                      columns = c("SYMBOL", "GENENAME"),
                                      keytype="PROBEID")

probe_stats <- anno_dataSet %>% group_by(PROBEID) %>%
  summarize(no_of_matches = n_distinct(SYMBOL)) %>%
  filter(no_of_matches > 1)

ids_to_exlude <- ((featureNames(dataSet_filtered) %in% probe_stats$PROBEID) | featureNames(dataSet_filtered) %in% subset(anno_dataSet, is.na(SYMBOL))$PROBEID)

dataSet_final <- subset(dataSet_filtered, !ids_to_exlude)

fData(dataSet_final)$PROBEID <- rownames(fData(dataSet_final))
fData(dataSet_final) <- left_join(fData(dataSet_final), anno_dataSet)
rownames(fData(dataSet_final)) <-fData(dataSet_final)$PROBEID

But I get 

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
Besides: Warning message:
non-unique value when setting 'row.names': '227320_at'

My question is, 

1. is it normal to have a duplicate probe unremoved after I removing multiple mappings?

2. can I remove the duplicate data, so that I can change the rownames using PROBEID? If so, what is the best way?

Thank you.

expressionset • 494 views
ADD COMMENTlink modified 2.0 years ago by Martin Morgan ♦♦ 23k • written 2.0 years ago by peirinl0

If I run your code with 

k <- keys(hgu133plus2.db)
dataSet_filtered <- ExpressionSet(assayData=matrix(runif(length(k)), nrow=length(k), ncol=11))
featureNames(dataSet_filtered) <- k

as the dataSet_filtered then I get no errors, so I would advise double-checking the integrity of the original object - I can't offhand see an easy way that '227320_at' would have got through the filtering by ids_to_exlude (or be re-introduced by the left_join) - you might want to debug by tracing this particular IDs appearance (or any other ID that is duplicated in fData(dataSet_final)$PROBEID before the final line of your analysis).

 

ADD REPLYlink written 2.0 years ago by Gavin Kelly560

I agree with Gavins's answer - the code will exclude all probe IDs with multiple mappings to gene symbols, so by definition there should not be any duplicate PROBEIDs in fData as I retain only the PROBEIDs matching to exactly one gene symbol.

Are you using the correct annotation database for your array? Is it the hgu133plus2 platform?

The code obviously won't work if the annotation database / package is not the right one for your array.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Bernd Klaus560

You are right, the duplication is induced by left_join. Before left_join, 

> anyDuplicated(rownames(fData(dataSet_final)))
[1] 0

> anyDuplicated(anno_dataSet$PROBEID)
[1] 2

> rownames(fData(riester_final))[28589]
[1] "227321_at"
> rownames(fData(riester_final))[28588]
[1] "227320_at"

After Left join, 

> anyDuplicated(fData(riester_final)$PROBEID)
[1] 28589

> fData(riester_final)$PROBEID[28589]
[1] "227320_at"
> fData(riester_final)$PROBEID[28588]
[1] "227320_at"

Clearly, I dont know how to explain. Base on my understanding, it should not have such insertion on left_joint. How can I solve that?

ADD REPLYlink written 2.0 years ago by peirinl0

I try another dataset, also using hgu133plus2.db. The outcome is same. 

ADD REPLYlink written 2.0 years ago by peirinl0

This is very strange.  2273020_at _is_ appearing in riester_final, so that implies it is not duplicated in the anno_dataSet, so the inner join - assuming it is the dplyr version and is being done on PROBEID (maybe set an explicit 'by' in the join to enforce this) - should only return one row, as you expect.  Could you report the result of doing anno_dataSet %>% filter(PROBEID=="2273020_at") and probe_stats %>% filter(PROBEID=="2273020_at") and similar with your riester_final before the join: I can only imagine that somehow the n_distinct isn't resolving all identical cases - maybe your annotation has two identical symbols with different gene-names, or something like that.

ADD REPLYlink written 2.0 years ago by Gavin Kelly560
1

This is the behavior of left_join()

> left_join(data.frame(x=1), data.frame(x=1, y=1:2))
Joining, by = "x"
  x y
1 1 1
2 1 2

and the documentation, consistent with this behavior

     'left_join()' return all rows from 'x', and all columns from 'x'
          and 'y'. Rows in 'x' with no match in 'y' will have 'NA'
          values in the new columns. If there are multiple matches
          between 'x' and 'y', all combinations of the matches are
          returned.

There are multiple symbols associated with key 227320_at

> AnnotationDbi::select(hgu133plus2.db, "227320_at", c("SYMBOL", "GENENAME"), "PROBEID")
'select()' returned 1:many mapping between keys and columns
    PROBEID SYMBOL                 GENENAME
1 227320_at  RFLNA ZNF664-RFLNA readthrough
2 227320_at  RFLNA                refilin A

So there are no surprises in the code.

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by Martin Morgan ♦♦ 23k
> anno_riester %>% filter(PROBEID=="227320_at")
    PROBEID SYMBOL                 GENENAME
1 227320_at  RFLNA ZNF664-RFLNA readthrough
2 227320_at  RFLNA                refilin A

> probe_stats %>% filter(PROBEID=="227320_at")
# A tibble: 0 x 2
# ... with 2 variables: PROBEID <chr>, no_of_matches <int>
ADD REPLYlink modified 2.0 years ago by Martin Morgan ♦♦ 23k • written 2.0 years ago by peirinl0

It looks like the two instances aren't distinguished by the SYMBOL (RFLNA in both cases), so rather than use n_distinct(SYMBOL) (which will give 1), you may want to use n() instead, which will give the number of rows (2 in this case) in the group. Or use an alternative philosophy regarding multi-symbol probeids, such as just quoting the first (like James' annotateEset method will), or concatenating them...

ADD REPLYlink written 2.0 years ago by Gavin Kelly560
Answer: Error when restore rownames of Feature data of ExpressionSet
0
gravatar for James W. MacDonald
2.0 years ago by
United States
James W. MacDonald51k wrote:

You might consider using annotateEset in my affycoretools package. Something like

anno_dataSet <- annotateEset(dataSet_filtered, hgu133plus2.db)

will do exactly what you are trying to do in one line of code.

ADD COMMENTlink written 2.0 years ago by James W. MacDonald51k

Nice! So I can use anno_dataSet to do the subsequent LIMMA testing?

ADD REPLYlink modified 2.0 years ago • written 2.0 years ago by peirinl0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 335 users visited in the last hour