Hi,
I have been using the excellent ensembldb package to annotate genomic regions with protein domain information, as per the documentation.
I now have tables of domain occurrences for the various protein_domain_source categories: c("pfscan", "scanprosite", "superfamily", "pfam", "smart", "prints").
Can someone comment on the availability of further annotation of these sources, and where specifically they are taken from?
I am using the PFAM.db package, from which I can get the Description for interpreting the protein_domain_id. This works by using mappedkeys(PFAMDE) and then mapping other IDs back to the pfam ID. But that is not available for c("superfamily"). Do I give up on that protein_domain_source? Or does anyone have info on mapping it to pfam, or on another description of the domains?
Secondary question: I have occurrences of protein_domain_id over a certain set of GRanges based on different sets of genes. I think it is necessary to scale occurrence by domain width (based on mean start->end) to determine whether a domain is over-represented. Does anyone have experience of this kind of analysis, or any links/papers that might be relevant? I have not worked on protein domains before, so this may be a standard thing and I just haven't found the right reference yet!
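To make the scaling idea concrete, here is a rough sketch in base R on a made-up table. The values are invented for illustration, and the column names prot_dom_start and prot_dom_end are assumed to match the protein domain columns returned by ensembldb. It only counts hits per domain and divides by mean width; it is not a proper over-representation test.

```r
# Toy table of domain hits. Values are invented; the column names are
# assumed to follow ensembldb's protein domain columns.
hits <- data.frame(
  protein_domain_id = c("PF00001", "PF00001", "PF00002", "PF00002", "PF00002"),
  prot_dom_start    = c(10, 50, 5, 100, 200),
  prot_dom_end      = c(109, 149, 24, 119, 219),
  stringsAsFactors  = FALSE
)

# Width of each hit in amino acids (start and end are both inclusive).
hits$width <- hits$prot_dom_end - hits$prot_dom_start + 1

# Per-domain occurrence count and mean width.
n_hits     <- tapply(hits$width, hits$protein_domain_id, length)
mean_width <- tapply(hits$width, hits$protein_domain_id, mean)

scaled <- data.frame(
  protein_domain_id = names(n_hits),
  n_hits            = as.vector(n_hits),
  mean_width        = as.vector(mean_width),
  stringsAsFactors  = FALSE
)

# Occurrences per amino acid of mean domain width.
scaled$hits_per_aa <- scaled$n_hits / scaled$mean_width

print(scaled)
```

The same per-width rate would then need to be compared between the gene set of interest and a background set, which is the part I am unsure how to do properly.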
Many thanks,
Bruce
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /usr/local/lib64/R/lib/libRblas.so
LAPACK: /usr/local/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] PFAM.db_3.6.0 BiocManager_1.30.10
[3] forcats_0.3.0 stringr_1.4.0
[5] dplyr_0.8.5 purrr_0.3.3
[7] readr_1.2.1 tidyr_0.8.2
[9] tibble_2.1.3 ggplot2_3.1.0
[11] tidyverse_1.2.1 plyranges_1.0.3
[13] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.4.1
[15] AnnotationFilter_1.4.0 GenomicFeatures_1.32.3
[17] AnnotationDbi_1.42.1 Biobase_2.40.0
[19] GenomicRanges_1.32.7 GenomeInfoDb_1.16.0
[21] IRanges_2.14.12 S4Vectors_0.18.3
[23] BiocGenerics_0.26.0
loaded via a namespace (and not attached):
[1] nlme_3.1-145 ProtGenerics_1.12.0
[3] bitops_1.0-6 matrixStats_0.55.0
[5] lubridate_1.7.4 bit64_0.9-7
[7] progress_1.2.0 httr_1.4.1
[9] tools_3.5.1 backports_1.1.5
[11] utf8_1.1.4 R6_2.4.1
[13] DBI_1.1.0 lazyeval_0.2.2
[15] colorspace_1.4-1 withr_2.1.2
[17] tidyselect_0.2.5 prettyunits_1.1.1
[19] bit_1.1-15.2 curl_4.3
[21] compiler_3.5.1 cli_2.0.2
[23] rvest_0.3.2 xml2_1.2.0
[25] DelayedArray_0.6.6 rtracklayer_1.40.6
[27] scales_1.0.0 digest_0.6.25
[29] Rsamtools_1.32.3 XVector_0.20.0
[31] pkgconfig_2.0.3 rlang_0.4.5
[33] readxl_1.1.0 rstudioapi_0.11
[35] RSQLite_2.1.1 jsonlite_1.6.1
[37] BiocParallel_1.14.2 RCurl_1.98-1.1
[39] magrittr_1.5 GenomeInfoDbData_1.1.0
[41] Matrix_1.2-18 Rcpp_1.0.3
[43] munsell_0.5.0 fansi_0.4.1
[45] stringi_1.4.6 SummarizedExperiment_1.10.1
[47] zlibbioc_1.26.0 plyr_1.8.6
[49] grid_3.5.1 blob_1.1.1
[51] crayon_1.3.4 lattice_0.20-40
[53] Biostrings_2.48.0 haven_2.0.0
[55] hms_0.4.2 pillar_1.4.3
[57] biomaRt_2.36.1 XML_3.99-0.3
[59] glue_1.3.1 modelr_0.1.2
[61] vctrs_0.2.3 cellranger_1.1.0
[63] gtable_0.3.0 assertthat_0.2.1
[65] broom_0.5.0 GenomicAlignments_1.16.0
[67] memoise_1.1.0