Question: recount TCGA data one gene_id mapped to multiple symbols and vice versa
1
13 days ago by
Sheldon Pang20
Sheldon Pang20 wrote:

For example:

Gene ID ENSG00000283638 maps to multiple symbols: MIR106A, MIR19B2, MIR92A2, MIR363, MIR20B, MIR18B.
Gene IDs ENSG00000015479 and ENSG00000280987 both map to the symbol MATR3.

These look incorrect to me. Any idea why? Thanks.

library(recount)
recount_genes$symbol[57992]
# CharacterList of length 1
# [["ENSG00000283638"]] MIR106A MIR19B2 MIR92A2 MIR363 MIR20B MIR18B

recount_genes$symbol[362]
# CharacterList of length 1
# [["ENSG00000015479"]] MATR3
recount_genes$symbol[57163]
# CharacterList of length 1
# [["ENSG00000280987"]] MATR3

packageVersion('recount')
# [1] ‘1.10.13’
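
A quick way to list such cases across a whole annotation is sketched below in base R. It is only an illustration: `symbols` is a plain list used as a toy stand-in for `recount_genes$symbol`, filled with the three IDs from this post, and the names `multi_symbol_ids`, `ids_per_symbol` and `shared_symbols` are made up for the example.

```r
# Toy stand-in for recount_genes$symbol (assumption: a plain named list
# instead of a CharacterList), using the three gene IDs from this post
symbols <- list(
    ENSG00000283638 = c('MIR106A', 'MIR19B2', 'MIR92A2',
        'MIR363', 'MIR20B', 'MIR18B'),
    ENSG00000015479 = 'MATR3',
    ENSG00000280987 = 'MATR3'
)

# Gene IDs that carry more than one symbol
n_symbols <- lengths(symbols)
multi_symbol_ids <- names(n_symbols)[n_symbols > 1]

# Long table with one row per (gene ID, symbol) pair
long <- data.frame(
    gene_id = rep(names(symbols), n_symbols),
    symbol = unlist(symbols, use.names = FALSE),
    stringsAsFactors = FALSE
)

# Symbols that are shared by more than one gene ID
ids_per_symbol <- split(long$gene_id, long$symbol)
shared_symbols <- names(ids_per_symbol)[lengths(ids_per_symbol) > 1]

multi_symbol_ids
shared_symbols
ids_per_symbol[shared_symbols]
```

With the real object, `as.list()` on the CharacterList should give a list of the same shape, though that step is not shown here.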

recount • 104 views
ADD COMMENTlink modified 13 days ago by Leonardo Collado Torres710 • written 13 days ago by Sheldon Pang20
Answer: recount TCGA data one gene_id mapped to multiple symbols and vice versa
2
13 days ago by
United States
Leonardo Collado Torres710 wrote:

Hi Sheldon,

The gene symbols in recount were obtained with this code: https://github.com/leekgroup/recount/blob/master/R/reproduce_ranges.R#L95-L97. It relies on AnnotationDbi::mapIds() and org.Hs.eg.db::org.Hs.eg.db. I adapted it below for the cases you report.

genes <- c('ENSG00000283638', 'ENSG00000015479', 'ENSG00000280987')
AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
    genes, 'SYMBOL', 'ENSEMBL',
    multiVals = 'CharacterList')

options(width = 120)
sessioninfo::session_info()

Basically, the first gene no longer maps to those 6 miRNAs because org.Hs.eg.db has changed over time. In any case, you could run recount::reproduce_ranges() to get the latest symbols, or manually run the AnnotationDbi::mapIds() code as shown in the next code chunk.

library('recount')
rowRanges(rse_gene_SRP009615)$symbol_updated <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
    gsub('\\..*', '', rowRanges(rse_gene_SRP009615)$gene_id), 'SYMBOL', 'ENSEMBL',
    multiVals = 'CharacterList')

which will look like this:

> rowRanges(rse_gene_SRP009615)[c(57992, 362, 57163)]
GRanges object with 3 ranges and 4 metadata columns:
                     seqnames              ranges strand |            gene_id bp_length
                        <Rle>           <IRanges>  <Rle> |        <character> <integer>
   ENSG00000283638.1     chrX 134168911-134174089      - |  ENSG00000283638.1      3863
  ENSG00000015479.17     chr5 139293648-139331359      + | ENSG00000015479.17     11772
   ENSG00000280987.3     chr5 139273752-139331677      + |  ENSG00000280987.3      6640
                                          symbol  symbol_updated
                                 <CharacterList> <CharacterList>
   ENSG00000283638.1 MIR106A,MIR19B2,MIR92A2,...            <NA>
  ENSG00000015479.17                       MATR3           MATR3
   ENSG00000280987.3                       MATR3           MATR3
  -------
  seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths

## R session info is the one from R 3.6.1 further below
> packageVersion('recount')
[1] ‘1.12.0’
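
To flag which rows changed between `symbol` and `symbol_updated`, something like the following base R sketch may help. It uses plain lists as a toy stand-in for the two CharacterList columns, with values copied from the output above; the names `old_symbol`, `new_symbol` and `changed` are illustrative only.

```r
# Toy stand-ins for the symbol and symbol_updated columns shown above
# (assumption: plain named lists instead of CharacterList objects)
old_symbol <- list(
    ENSG00000283638.1 = c('MIR106A', 'MIR19B2', 'MIR92A2',
        'MIR363', 'MIR20B', 'MIR18B'),
    ENSG00000015479.17 = 'MATR3',
    ENSG00000280987.3 = 'MATR3'
)
new_symbol <- list(
    ENSG00000283638.1 = NA_character_,
    ENSG00000015479.17 = 'MATR3',
    ENSG00000280987.3 = 'MATR3'
)

# A row counts as changed when the two symbol sets differ;
# setequal() ignores order and duplicates within each set
changed <- !mapply(setequal, old_symbol, new_symbol)

# Gene IDs whose symbols differ between the two annotation versions
names(changed)[changed]
```

On the real RangedSummarizedExperiment, calling `as.list()` on each column first should give lists of the same shape, although that is not tested here.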

Best, Leonardo

Output over R versions

R 3.3.2

> genes <- c('ENSG00000283638', 'ENSG00000015479', 'ENSG00000280987')
> AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
+     genes, 'SYMBOL', 'ENSEMBL',
+     multiVals = 'CharacterList')

'select()' returned 1:many mapping between keys and columns
CharacterList of length 3
[["ENSG00000283638"]] MIR106A MIR19B2 MIR92A2 MIR363 MIR20B MIR18B
[["ENSG00000015479"]] MATR3
[["ENSG00000280987"]] MATR3
>
> options(width = 120)
> sessioninfo::session_info()
Error in loadNamespace(name) : there is no package called ‘sessioninfo’
> sessionInfo()
R version 3.3.2 RC (2016-10-26 r71594)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS  10.14.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] testthat_1.0.2  devtools_1.13.2 colorout_1.1-2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12         IRanges_2.8.2        digest_0.6.12        crayon_1.3.2         withr_1.0.2
 [6] R6_2.2.2             DBI_0.7              stats4_3.3.2         magrittr_1.5         RSQLite_2.0
[11] rlang_0.1.1          blob_1.1.0           S4Vectors_0.12.2     org.Hs.eg.db_3.4.0   bit64_0.9-7
[16] Biobase_2.34.0       bit_1.1-12           parallel_3.3.2       pkgconfig_2.0.1      BiocGenerics_0.20.0
[21] AnnotationDbi_1.36.2 memoise_1.1.0        tibble_1.3.3
ADD COMMENTlink written 13 days ago by Leonardo Collado Torres710

R 3.4.4

> genes <- c('ENSG00000283638', 'ENSG00000015479', 'ENSG00000280987')
> AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
+     genes, 'SYMBOL', 'ENSEMBL',
+     multiVals = 'CharacterList')

'select()' returned 1:1 mapping between keys and columns
CharacterList of length 3
[["ENSG00000283638"]] <NA>
[["ENSG00000015479"]] MATR3
[["ENSG00000280987"]] MATR3
>
> options(width = 120)
> sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.4.4 Patched (2018-03-19 r74624)
 os       macOS  10.14.6
 system   x86_64, darwin15.6.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
 date     2019-11-06

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package       * version date       source
 AnnotationDbi   1.40.0  2017-10-31 Bioconductor
 Biobase         2.38.0  2017-10-31 Bioconductor
 BiocGenerics    0.24.0  2017-10-31 Bioconductor
 bit             1.1-12  2014-04-09 CRAN (R 3.4.0)
 bit64           0.9-7   2017-05-08 CRAN (R 3.4.0)
 blob            1.1.1   2018-03-25 CRAN (R 3.4.4)
 clisymbols      1.2.0   2017-05-21 CRAN (R 3.4.0)
 colorout      * 1.2-0   2018-05-03 Github (jalvesaq/colorout@c42088d)
 DBI             1.0.0   2018-05-02 CRAN (R 3.4.4)
 devtools      * 1.13.6  2018-06-27 cran (@1.13.6)
 digest          0.6.18  2018-10-10 cran (@0.6.18)
 IRanges         2.12.0  2017-10-31 Bioconductor
 magrittr        1.5     2014-11-22 cran (@1.5)
 memoise         1.1.0   2017-04-21 CRAN (R 3.4.0)
 org.Hs.eg.db    3.5.0   2018-05-03 Bioconductor
 pkgconfig       2.0.1   2017-03-21 CRAN (R 3.4.0)
 R6              2.3.0   2018-10-04 cran (@2.3.0)
 Rcpp            0.12.19 2018-10-01 cran (@0.12.19)
 rlang           0.3.0.1 2018-10-25 cran (@0.3.0.1)
 RSQLite         2.1.0   2018-03-29 CRAN (R 3.4.4)
 S4Vectors       0.16.0  2017-10-31 Bioconductor
 sessioninfo     1.0.0   2017-06-21 CRAN (R 3.4.1)
 testthat      * 2.0.0   2017-12-13 CRAN (R 3.4.3)
 withr           2.1.2   2018-03-15 CRAN (R 3.4.4)

R 3.6.1 with BioC 3.10 (current release)

> genes <- c('ENSG00000283638', 'ENSG00000015479', 'ENSG00000280987')
> AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
+     genes, 'SYMBOL', 'ENSEMBL',
+     multiVals = 'CharacterList')

'select()' returned 1:1 mapping between keys and columns
CharacterList of length 3
[["ENSG00000283638"]] <NA>
[["ENSG00000015479"]] MATR3
[["ENSG00000280987"]] MATR3
>
> options(width = 120)
> sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 3.6.1 Patched (2019-09-23 r77210)
 os       macOS Mojave 10.14.6
 system   x86_64, darwin15.6.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2019-11-06

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package       * version date       lib source
 AnnotationDbi   1.48.0  2019-10-29 [1] Bioconductor
 assertthat      0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 backports       1.1.5   2019-10-02 [1] CRAN (R 3.6.0)
 Biobase         2.46.0  2019-10-29 [1] Bioconductor
 BiocGenerics    0.32.0  2019-10-29 [1] Bioconductor
 bit             1.1-14  2018-05-29 [1] CRAN (R 3.6.0)
 bit64           0.9-7   2017-05-08 [1] CRAN (R 3.6.0)
 blob            1.2.0   2019-07-09 [1] CRAN (R 3.6.0)
 cli             1.1.0   2019-03-19 [1] CRAN (R 3.6.0)
 crayon          1.3.4   2017-09-16 [1] CRAN (R 3.6.0)
 DBI             1.0.0   2018-05-02 [1] CRAN (R 3.6.0)
 digest          0.6.22  2019-10-21 [1] CRAN (R 3.6.1)
 IRanges         2.20.0  2019-10-29 [1] Bioconductor
 memoise         1.1.0   2017-04-21 [1] CRAN (R 3.6.0)
 org.Hs.eg.db    3.10.0  2019-10-31 [1] Bioconductor
 pillar          1.4.2   2019-06-29 [1] CRAN (R 3.6.0)
 pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 3.6.1)
 Rcpp            1.0.2   2019-07-25 [1] CRAN (R 3.6.0)
 rlang           0.4.1   2019-10-24 [1] CRAN (R 3.6.1)
 RSQLite         2.1.2   2019-07-24 [1] CRAN (R 3.6.0)
 S4Vectors       0.24.0  2019-10-29 [1] Bioconductor
 sessioninfo     1.1.1   2018-11-05 [1] CRAN (R 3.6.0)
 tibble          2.1.3   2019-06-06 [1] CRAN (R 3.6.0)
 vctrs           0.2.0   2019-07-05 [1] CRAN (R 3.6.0)
 withr           2.1.2   2018-03-15 [1] CRAN (R 3.6.0)
 zeallot         0.1.0   2018-01-28 [1] CRAN (R 3.6.0)

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
ADD REPLYlink written 13 days ago by Leonardo Collado Torres710

I noticed this:

which(id_symbol == "ATG5")
ENSG00000057663 ENSG00000283623
            777           57980

gene_expression[c(777,57980),1:2]
                   3DFF72D2-F292-497E-ACE3-6FAA9C884205 B1E54366-42B9-463C-8615-B34D52BD14DC
ENSG00000057663.13                                  438                                 1121
ENSG00000283623.1                                   114                                  245

The values are scaled gene expression data. The two Ensembl IDs both correspond to ATG5. If I want gene expression data for ATG5, which row should I use? They give different counts within the same patient.

ADD REPLYlink modified 12 days ago • written 12 days ago by dz16e0

Then you need to look at resources outside of recount, such as Ensembl, and use the gene structure to decide which gene ID is the one you want. It's not rare for two Ensembl IDs to map to the same gene symbol.
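
On the R side, one option is to keep the rows keyed by Ensembl ID rather than by symbol, so that the choice stays explicit. The base R sketch below is a toy version of that idea: `gene_expression` and `id_symbol` are rebuilt by hand from the two ATG5 rows quoted above, and `chosen_id` is a hypothetical pick for illustration, not a recommendation.

```r
# Toy copy of the two ATG5 rows reported above (scaled counts, two samples)
gene_expression <- matrix(
    c(438, 114, 1121, 245), nrow = 2,
    dimnames = list(
        c('ENSG00000057663.13', 'ENSG00000283623.1'),
        c('3DFF72D2-F292-497E-ACE3-6FAA9C884205',
            'B1E54366-42B9-463C-8615-B34D52BD14DC')
    )
)

# Symbol lookup keyed by unversioned Ensembl ID (toy stand-in for id_symbol)
id_symbol <- c(ENSG00000057663 = 'ATG5', ENSG00000283623 = 'ATG5')

# Drop the version suffix so the row names match the lookup keys
unversioned <- gsub('\\..*', '', rownames(gene_expression))

# All rows annotated as ATG5, still labelled by their Ensembl ID
atg5_rows <- gene_expression[id_symbol[unversioned] == 'ATG5', , drop = FALSE]

# After checking the IDs against Ensembl, keep only the one you settled on
chosen_id <- 'ENSG00000057663'  # hypothetical choice for illustration
atg5 <- gene_expression[unversioned == chosen_id, , drop = FALSE]
atg5
```

Using `drop = FALSE` keeps the result a matrix with its row name, so the Ensembl ID that was used stays visible downstream.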

ADD REPLYlink written 12 days ago by Leonardo Collado Torres710

If you do that, you will find that ENSG00000283623 has been removed from the Ensembl database.

This brings up an additional point: what we think of as the 'genome' is really not static, and it changes over time. Looking up the IDs for a set of static data and then coming here to ask 'Hey, what about this?' is probably less effective than doing your own homework first: look up the IDs you have in hand and make sure they are current.

This is particularly true here, since this support site is primarily intended as a place where people can get technical help with Bioconductor tools, rather than as a place for generalized questions about genes and annotation and what it all means, etc.

ADD REPLYlink written 12 days ago by James W. MacDonald51k
Answer: recount TCGA data one gene_id mapped to multiple symbols and vice versa
1
13 days ago by
United States
James W. MacDonald51k wrote:

You are pointing out things that Ensembl says about those genes!

There are 6 miRNAs (the ones you list) that are part of the first exon of ENSG00000283638.

And Ensembl says ENSG00000015479 and ENSG00000280987 are both MATR3.

So why do you think they are incorrect?

ADD COMMENTlink written 13 days ago by James W. MacDonald51k

Thanks for this answer James!

ADD REPLYlink written 13 days ago by Leonardo Collado Torres710