Hi Sheldon,
The gene symbols in recount were obtained using this code https://github.com/leekgroup/recount/blob/master/R/reproduce_ranges.R#L95-L97 that relies on AnnotationDbi::mapIds() and org.Hs.eg.db::org.Hs.eg.db, which I adapted below for the case you report.
genes <- c('ENSG00000283638', 'ENSG00000015479', 'ENSG00000280987')
AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
genes, 'SYMBOL', 'ENSEMBL',
multiVals = 'CharacterList')
options(width = 120)
sessioninfo::session_info()
Basically, the first gene no longer maps to those 6 miRNAs based on changes that happened over time to org.Hs.eg.db. In any case, you could run recount::reproduce_ranges() to get the latest symbols or manually run the AnnotationDbi::mapIds() code as shown in the next code chunk.
library('recount')
rowRanges(rse_gene_SRP009615)$symbol_updated <- AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
gsub('\\..*', '', rowRanges(rse_gene_SRP009615)$gene_id), 'SYMBOL', 'ENSEMBL',
multiVals = 'CharacterList')
which will look like this:
> rowRanges(rse_gene_SRP009615)[c(57992, 362, 57163)]
GRanges object with 3 ranges and 4 metadata columns:
                     seqnames              ranges strand |            gene_id bp_length
                        <Rle>           <IRanges>  <Rle> |        <character> <integer>
   ENSG00000283638.1     chrX 134168911-134174089      - |  ENSG00000283638.1      3863
  ENSG00000015479.17     chr5 139293648-139331359      + | ENSG00000015479.17     11772
   ENSG00000280987.3     chr5 139273752-139331677      + |  ENSG00000280987.3      6640
                                          symbol  symbol_updated
                                 <CharacterList> <CharacterList>
   ENSG00000283638.1 MIR106A,MIR19B2,MIR92A2,...            <NA>
  ENSG00000015479.17                       MATR3           MATR3
   ENSG00000280987.3                       MATR3           MATR3
  -------
  seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths
## R session info is the one from R 3.6.1 further below
> packageVersion('recount')
[1] ‘1.12.0’
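The gsub() call in the chunk above strips the version suffix from each gene ID before the lookup, since the lookup in this thread uses unversioned Ensembl IDs as keys. A minimal base-R sketch of just that step, using the three IDs from this thread (the variable names are only illustrative):

```r
# Versioned gene IDs as they appear in rowRanges(rse_gene_SRP009615)$gene_id
versioned_ids <- c('ENSG00000283638.1', 'ENSG00000015479.17', 'ENSG00000280987.3')

# Drop everything from the first '.' onward to get unversioned Ensembl IDs
unversioned_ids <- gsub('\\..*', '', versioned_ids)
unversioned_ids
```

The result is the same vector of IDs used in the first mapIds() call in this post.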
Best,
Leonardo
Output over R versions
R 3.3.2
> genes <- c('ENSG00000283638', 'ENSG00000015479', 'ENSG00000280987')
> AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
+ genes, 'SYMBOL', 'ENSEMBL',
+ multiVals = 'CharacterList')
'select()' returned 1:many mapping between keys and columns
CharacterList of length 3
[["ENSG00000283638"]] MIR106A MIR19B2 MIR92A2 MIR363 MIR20B MIR18B
[["ENSG00000015479"]] MATR3
[["ENSG00000280987"]] MATR3
>
> options(width = 120)
> sessioninfo::session_info()
Error in loadNamespace(name) : there is no package called ‘sessioninfo’
> sessionInfo()
R version 3.3.2 RC (2016-10-26 r71594)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS 10.14.6
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] testthat_1.0.2 devtools_1.13.2 colorout_1.1-2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.12 IRanges_2.8.2 digest_0.6.12 crayon_1.3.2 withr_1.0.2
[6] R6_2.2.2 DBI_0.7 stats4_3.3.2 magrittr_1.5 RSQLite_2.0
[11] rlang_0.1.1 blob_1.1.0 S4Vectors_0.12.2 org.Hs.eg.db_3.4.0 bit64_0.9-7
[16] Biobase_2.34.0 bit_1.1-12 parallel_3.3.2 pkgconfig_2.0.1 BiocGenerics_0.20.0
[21] AnnotationDbi_1.36.2 memoise_1.1.0 tibble_1.3.3
R 3.4.4
R 3.6.1 with BioC 3.10 (current release)
I noticed this:
ENSG00000057663 ENSG00000283623
            777           57980
                   3DFF72D2-F292-497E-ACE3-6FAA9C884205 B1E54366-42B9-463C-8615-B34D52BD14DC
ENSG00000057663.13                                  438                                 1121
ENSG00000283623.1                                   114                                  245
The values are scaled gene expression data. The two Ensembl IDs both correspond to ATG5. Say I want gene expression data for ATG5: which row should I use? They give different counts within the same patient.
Then you need to look at resources outside of recount, say, Ensembl, to check which particular gene ID you want based on the gene structure. It's not rare for two Ensembl IDs to match the same gene symbol.
And if you do look, you will find that ENSG00000283623 has been removed from the Ensembl database.
Which brings up an additional point; what we think of as the 'genome' is really not static, and it changes as we go through time. Doing something like looking up the IDs for a set of static data and coming here and asking 'Hey, what about this?' is probably not as optimal as doing your own homework first, looking up the IDs you have in hand and making sure they are current.
This is particularly true here, since this support site is primarily intended as a place where people can get technical help with Bioconductor tools, rather than as a place for generalized questions about genes and annotation and what it all means, etc.
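On the earlier point that two Ensembl IDs can match the same gene symbol: one way to spot such cases before picking a row is to group the IDs by symbol. This is only a sketch in base R, assuming you have already reduced the symbols to one per gene ID. The hand-written vector below stands in for that lookup. The ATG5 and MATR3 entries are the ones discussed in this thread, while 'ENSG_PLACEHOLDER' and 'GENE_X' are hypothetical filler.

```r
# Hand-written lookup: names are Ensembl IDs, values are gene symbols.
# ATG5 and MATR3 pairs come from this thread; the last entry is a
# hypothetical placeholder for a gene with a single Ensembl ID.
symbols <- c(
    ENSG00000057663 = 'ATG5',
    ENSG00000283623 = 'ATG5',
    ENSG00000015479 = 'MATR3',
    ENSG00000280987 = 'MATR3',
    ENSG_PLACEHOLDER = 'GENE_X'
)

# Group Ensembl IDs by symbol, then keep symbols with more than one ID
ids_by_symbol <- split(names(symbols), symbols)
shared <- ids_by_symbol[lengths(ids_by_symbol) > 1]
shared
```

This only flags the ambiguous symbols. Deciding which ID to keep still means checking each one against Ensembl, as described above.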