protein_domain_source from ensembldb and mapping with PFAM.db

bruce.moran

Hi,

I have been using the excellent ensembldb package to annotate genomic regions with protein domain information, as described in the package documentation.

I now have tables of occurrence of domains for the various protein_domain_source categories: c("pfscan", "scanprosite", "superfamily", "pfam", "smart", "prints").

Can someone comment on the availability of further annotation of these sources, and where specifically they are taken from?

I am using the PFAM.db package, from which I can get a Description for interpreting each protein_domain_id. This works by using mappedkeys(PFAMDE) and then mapping the other IDs back to the Pfam ID. But that is not available for c("superfamily"). Should I give up on that protein_domain_source? Or does anyone have information on mapping it to Pfam, or on another description of the domains?
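To make the lookup described above concrete, here is a minimal sketch. The helper `annotate_pfam()` and the toy rows are hypothetical and only illustrate the idea. The guarded part assumes that PFAM.db, ensembldb and EnsDb.Hsapiens.v86 are installed and that the `PFAMDE` keys are Pfam accessions in the same form as the `protein_domain_id` values.

```r
# Attach Pfam descriptions to a table of protein domains.
# `domains` needs columns protein_domain_id and protein_domain_source;
# `pfam_desc` is a named character vector (names = Pfam accessions).
# annotate_pfam() is a hypothetical helper, not part of any package.
annotate_pfam <- function(domains, pfam_desc) {
  # %in% treats NA sources as FALSE, so rows without a domain are skipped
  is_pfam <- domains$protein_domain_source %in% "pfam"
  domains$description <- NA_character_
  domains$description[is_pfam] <-
    unname(pfam_desc[domains$protein_domain_id[is_pfam]])
  domains
}

# Toy input (made-up rows, for illustration only)
toy <- data.frame(
  protein_domain_id = c("PF00001", "SSF81321", "PF99999"),
  protein_domain_source = c("pfam", "superfamily", "pfam"),
  stringsAsFactors = FALSE
)
toy_desc <- c(PF00001 = "toy description for PF00001")
annotated <- annotate_pfam(toy, toy_desc)

# With the real packages installed, the lookup vector could be built like this
if (requireNamespace("PFAM.db", quietly = TRUE) &&
    requireNamespace("ensembldb", quietly = TRUE) &&
    requireNamespace("EnsDb.Hsapiens.v86", quietly = TRUE)) {
  library(ensembldb)
  library(PFAM.db)
  edb <- EnsDb.Hsapiens.v86::EnsDb.Hsapiens.v86
  doms <- proteins(edb,
                   columns = c("protein_id", "protein_domain_id",
                               "protein_domain_source"),
                   return.type = "data.frame")
  pfam_ids <- unique(
    doms$protein_domain_id[doms$protein_domain_source %in% "pfam"])
  # ifnotfound = NA keeps IDs that PFAM.db does not know about
  found <- mget(pfam_ids, PFAMDE, ifnotfound = NA)
  pfam_desc <- vapply(found, function(x) as.character(x[1]), character(1))
  doms <- annotate_pfam(doms, pfam_desc)
}
```

Rows from other sources, such as superfamily, keep an `NA` description, which makes it easy to see how much of the table remains unannotated.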

Secondary question: I have the occurrence of each protein_domain_id over a set of GRanges, based on different sets of genes. I think it is necessary to scale occurrence by domain width (based on the mean of start->end) to determine whether a domain is over-represented. Does anyone have experience with this kind of analysis, or any links/papers that might be relevant? I have not worked on protein domains before, so this may be a standard thing and I just haven't found the right reference yet!
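The width scaling described above could be sketched as follows. This only illustrates the arithmetic and is not an established over-representation method. The helper `scale_by_width()` and the toy rows are hypothetical. The column names `prot_dom_start` and `prot_dom_end` are assumed to match those ensembldb returns for protein domains.

```r
# Summarise domain occurrence and scale it by mean domain width.
# `hits` needs columns protein_domain_id, prot_dom_start, prot_dom_end,
# with one row per domain instance overlapping the regions of interest.
# scale_by_width() is a hypothetical helper, not part of any package.
scale_by_width <- function(hits) {
  # Width in residues; start and end are both inclusive
  hits$width <- hits$prot_dom_end - hits$prot_dom_start + 1
  n <- tapply(hits$width, hits$protein_domain_id, length)
  mean_width <- tapply(hits$width, hits$protein_domain_id, mean)
  data.frame(
    protein_domain_id = names(n),
    n = as.integer(n),
    mean_width = as.numeric(mean_width),
    # Occurrences per 100 residues of mean domain width
    per_100_residues = as.numeric(100 * n / mean_width),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy input (made-up coordinates, for illustration only)
toy_hits <- data.frame(
  protein_domain_id = c("PF00001", "PF00001", "PF00002"),
  prot_dom_start = c(10, 20, 1),
  prot_dom_end = c(59, 69, 200),
  stringsAsFactors = FALSE
)
scaled <- scale_by_width(toy_hits)
```

Running the same summary on each gene set gives per-set values that can be compared side by side. Whether such scaling is statistically appropriate is exactly the open question here.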

Many thanks,

Bruce

>sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] PFAM.db_3.6.0             BiocManager_1.30.10
 [3] forcats_0.3.0             stringr_1.4.0
 [5] dplyr_0.8.5               purrr_0.3.3
 [7] readr_1.2.1               tidyr_0.8.2
 [9] tibble_2.1.3              ggplot2_3.1.0
[11] tidyverse_1.2.1           plyranges_1.0.3
[13] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.4.1
[15] AnnotationFilter_1.4.0    GenomicFeatures_1.32.3
[17] AnnotationDbi_1.42.1      Biobase_2.40.0
[19] GenomicRanges_1.32.7      GenomeInfoDb_1.16.0
[21] IRanges_2.14.12           S4Vectors_0.18.3
[23] BiocGenerics_0.26.0

loaded via a namespace (and not attached):
 [1] nlme_3.1-145                ProtGenerics_1.12.0
 [3] bitops_1.0-6                matrixStats_0.55.0
 [5] lubridate_1.7.4             bit64_0.9-7
 [7] progress_1.2.0              httr_1.4.1
 [9] tools_3.5.1                 backports_1.1.5
[11] utf8_1.1.4                  R6_2.4.1
[13] DBI_1.1.0                   lazyeval_0.2.2
[15] colorspace_1.4-1            withr_2.1.2
[17] tidyselect_0.2.5            prettyunits_1.1.1
[19] bit_1.1-15.2                curl_4.3
[21] compiler_3.5.1              cli_2.0.2
[23] rvest_0.3.2                 xml2_1.2.0
[25] DelayedArray_0.6.6          rtracklayer_1.40.6
[27] scales_1.0.0                digest_0.6.25
[29] Rsamtools_1.32.3            XVector_0.20.0
[31] pkgconfig_2.0.3             rlang_0.4.5
[33] readxl_1.1.0                rstudioapi_0.11
[35] RSQLite_2.1.1               jsonlite_1.6.1
[37] BiocParallel_1.14.2         RCurl_1.98-1.1
[39] magrittr_1.5                GenomeInfoDbData_1.1.0
[41] Matrix_1.2-18               Rcpp_1.0.3
[43] munsell_0.5.0               fansi_0.4.1
[45] stringi_1.4.6               SummarizedExperiment_1.10.1
[47] zlibbioc_1.26.0             plyr_1.8.6
[49] grid_3.5.1                  blob_1.1.1
[51] crayon_1.3.4                lattice_0.20-40
[53] Biostrings_2.48.0           haven_2.0.0
[55] hms_0.4.2                   pillar_1.4.3
[57] biomaRt_2.36.1              XML_3.99-0.3
[59] glue_1.3.1                  modelr_0.1.2
[61] vctrs_0.2.3                 cellranger_1.1.0
[63] gtable_0.3.0                assertthat_0.2.1
[65] broom_0.5.0                 GenomicAlignments_1.16.0
[67] memoise_1.1.0

ensembldb pfam.db

score 1 · Answer 1 · 2020-04-24

Dear Bruce,

All annotations in EnsDb databases are retrieved from Ensembl using the Ensembl Perl API. Ensembl provides protein domains predicted or defined by a variety of methods and sources. One source of protein domain definitions is Pfam, but there are others, such as SMART or scanprosite. For Pfam you can use the PFAM.db Bioconductor package to get descriptions etc. I don't know whether there are packages that provide descriptions for the other sources. Also, AFAIK, the protein domain definitions in the different databases/sources are to a large extent redundant, so I guess you would not lose much if you stick to Pfam.
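The redundancy mentioned above can be checked roughly on your own table. The sketch below computes the fraction of proteins with a domain from a given source that also carry at least one Pfam domain. The helper `pfam_overlap()` and the toy rows are hypothetical. The comparison is per protein and ignores domain positions, so it is only a coarse indication.

```r
# Fraction of proteins with a domain from `source` that also have a
# Pfam domain. `domains` needs columns protein_id and
# protein_domain_source. pfam_overlap() is a hypothetical helper.
pfam_overlap <- function(domains, source) {
  with_source <- unique(
    domains$protein_id[domains$protein_domain_source %in% source])
  with_pfam <- unique(
    domains$protein_id[domains$protein_domain_source %in% "pfam"])
  mean(with_source %in% with_pfam)
}

# Toy input (made-up rows, for illustration only)
toy_domains <- data.frame(
  protein_id = c("P1", "P1", "P2", "P3"),
  protein_domain_source = c("pfam", "superfamily", "superfamily", "pfam"),
  stringsAsFactors = FALSE
)
frac <- pfam_overlap(toy_domains, "superfamily")
```

A value close to 1 for a source would suggest that little is lost, at the protein level, by restricting the analysis to Pfam.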

I myself have never worked with these protein domains, so I cannot help you with the second part of your question.

cheers, jo