drawProteins from Uniprot entry (missing chain information)
mblango • 0
@mblango-19625

Hi, I am using the drawProteins package to draw protein domains, as described nicely in several other places. My problem is that some Uniprot entries are missing CHAIN information, which is required for drawing the background chain in the plot. The CHAIN information essentially provides the length of a given protein. Is there a way to add this information to the data.frame produced by drawProteins::feature_to_dataframe? I am new to R, so there is probably an embarrassingly simple solution to this problem. I understand how to add rows to a data.frame, but I do not understand how to add this information to the slightly more complicated data.frame created by drawProteins. Alternatively, I could contact Uniprot.

Here is the code I am using. If you replace Uniprot ID Q4WXX3 with Q4WVE3 (a different protein), then you can see what is missing.

Thanks in advance!

library("drawProteins")
library("ggplot2")

prot <- drawProteins::get_features("Q4WXX3")

prot_data <- drawProteins::feature_to_dataframe(prot)

p <- draw_canvas(prot_data)
p <- draw_chains(p, prot_data,
                 labels = c("AgoA"))
p <- draw_domains(p, prot_data,
                  label_domains = FALSE)
p <- draw_regions(p, prot_data) 
p <- draw_repeat(p, prot_data)
p <- draw_motif(p, prot_data)
p <- draw_phospho(p, prot_data, size = 8)

p <- p + theme_bw(base_size = 20) + # white background
  theme(panel.grid.minor=element_blank(), 
        panel.grid.major=element_blank()) +
  theme(axis.ticks = element_blank(), 
        axis.text.y = element_blank()) +
  theme(panel.border = element_blank())
p <- p + theme(legend.position="bottom") + labs(fill="") 

prot_subtitle <- "source: Uniprot"
p <- p + labs(title = "Protein Domains",
              subtitle = prot_subtitle)
p

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomaRt_2.38.0       BiocInstaller_1.32.1 forcats_0.3.0        stringr_1.3.1       
 [5] dplyr_0.7.8          purrr_0.2.5          readr_1.3.1          tidyr_0.8.2         
 [9] tibble_2.0.1         tidyverse_1.2.1      ggplot2_3.1.0        drawProteins_1.2.0  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0           lubridate_1.7.4      lattice_0.20-38      prettyunits_1.0.2   
 [5] assertthat_0.2.0     digest_0.6.18        R6_2.3.0             cellranger_1.1.0    
 [9] plyr_1.8.4           backports_1.1.3      stats4_3.5.1         RSQLite_2.1.1       
[13] httr_1.4.0           pillar_1.3.1         rlang_0.3.1          progress_1.2.0      
[17] lazyeval_0.2.1       curl_3.3             readxl_1.2.0         rstudioapi_0.9.0    
[21] blob_1.1.1           S4Vectors_0.20.1     labeling_0.3         RCurl_1.95-4.11     
[25] bit_1.1-14           munsell_0.5.0        broom_0.5.1          compiler_3.5.1      
[29] modelr_0.1.2         pkgconfig_2.0.2      BiocGenerics_0.28.0  tidyselect_0.2.5    
[33] IRanges_2.16.0       XML_3.98-1.16        crayon_1.3.4         withr_2.1.2         
[37] bitops_1.0-6         grid_3.5.1           nlme_3.1-137         jsonlite_1.6        
[41] gtable_0.2.0         DBI_1.0.0            magrittr_1.5         scales_1.0.0        
[45] cli_1.0.1            stringi_1.2.4        bindrcpp_0.2.2       xml2_1.2.0          
[49] generics_0.0.2       tools_3.5.1          bit64_0.9-7          Biobase_2.42.0      
[53] glue_1.3.0           hms_0.4.2            parallel_3.5.1       yaml_2.2.0          
[57] AnnotationDbi_1.44.0 colorspace_1.4-0     rvest_0.3.2          memoise_1.1.0       
[61] bindr_0.1.1          haven_2.0.0         
drawProteins uniprot • 841 views
@james-w-macdonald-5106

When you query for Q4WXX3, the entry that comes back doesn't provide any chain information. In fact a lot of data is missing, presumably because this is a putative protein. If the annotation service doesn't have the data you need to make a plot, there isn't much that drawProteins can do to fix the situation. If you have the data from somewhere else, it isn't difficult to add it by hand. For example, I can get much of the protein drawn just by adding the chain information by hand:

> prot_data
                 type description begin end length accession    entryName
featuresTemp   DOMAIN         PAZ   302 391     89    Q4WXX3 Q4WXX3_ASPFU
featuresTemp.1 DOMAIN        Piwi   564 871    307    Q4WXX3 Q4WXX3_ASPFU
                taxid order
featuresTemp   330879     1
featuresTemp.1 330879     1

> prot_data <- rbind(data.frame(type = "CHAIN",
+   description = "Eukaryotic translation initiation factor eIF-2C4",
+   begin = 1, end = 320, length = 320,
+   accession = "Q4WXX3", entryName = "Q4WXX3_ASPFU",
+   taxid = 330879, order = 1), prot_data)
> prot_data
                 type                                      description begin
1               CHAIN Eukaryotic translation initiation factor eIF-2C4     1
featuresTemp   DOMAIN                                              PAZ   302
featuresTemp.1 DOMAIN                                             Piwi   564
               end length accession    entryName  taxid order
1              320    320    Q4WXX3 Q4WXX3_ASPFU 330879     1
featuresTemp   391     89    Q4WXX3 Q4WXX3_ASPFU 330879     1
featuresTemp.1 871    307    Q4WXX3 Q4WXX3_ASPFU 330879     1

But I have no idea if that is the correct protein length! If you have those data, you can easily add them. But if you don't, there's nothing to add.
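
The by-hand step above could be wrapped in a small function so it can be reused for other entries. This is only a sketch: add_chain_row is a hypothetical helper, not part of drawProteins, and the toy data frame and the length of 900 are made-up stand-ins for real feature_to_dataframe() output and a real protein length.

```r
# Hypothetical helper (not part of drawProteins): prepend a CHAIN row
# to a feature data frame when the Uniprot entry did not supply one.
add_chain_row <- function(prot_data, chain_length, description = "") {
  # Leave the data untouched if a CHAIN feature is already present.
  if (any(prot_data$type == "CHAIN")) {
    return(prot_data)
  }
  chain <- data.frame(
    type = "CHAIN",
    description = description,
    begin = 1,
    end = chain_length,
    length = chain_length,
    # Reuse the identifiers from the first existing feature row.
    accession = prot_data$accession[1],
    entryName = prot_data$entryName[1],
    taxid = prot_data$taxid[1],
    order = prot_data$order[1],
    stringsAsFactors = FALSE
  )
  rbind(chain, prot_data)
}

# Toy stand-in for feature_to_dataframe() output (domain values from this thread).
toy <- data.frame(
  type = c("DOMAIN", "DOMAIN"),
  description = c("PAZ", "Piwi"),
  begin = c(302, 564),
  end = c(391, 871),
  length = c(89, 307),
  accession = "Q4WXX3",
  entryName = "Q4WXX3_ASPFU",
  taxid = 330879,
  order = 1,
  stringsAsFactors = FALSE
)

# 900 is a made-up length, used only to exercise the helper.
toy_with_chain <- add_chain_row(toy, chain_length = 900)
```

Calling the helper a second time on toy_with_chain returns it unchanged, because a CHAIN row is already there.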


Hi James, Nice job. To add to your answer, we can use the amino acid sequence to calculate the protein length. Here is some code that will do that. Best wishes, Paul

# Load the package required to parse JSON.
library("rjson")

url <- "https://www.ebi.ac.uk/proteins/api/features?offset=0&size=100&accession=Q4WXX3"
# Read the response and collapse it into a single JSON string.
json_text <- paste(readLines(url, warn = FALSE), collapse = "")
# Parse the JSON.
result <- fromJSON(json_text)
# Here is the amino acid sequence.
sequence <- result[[1]]$sequence
# The protein length is the number of characters in the sequence.
# (Named prot_length so that it does not mask base::length.)
prot_length <- nchar(sequence)

# Prepend a CHAIN row that spans the whole protein.
prot_data <- rbind(data.frame(type = "CHAIN",
  description = "Eukaryotic translation initiation factor eIF-2C4",
  begin = 1,
  end = prot_length, length = prot_length,
  accession = "Q4WXX3",
  entryName = "Q4WXX3_ASPFU",
  taxid = 330879, order = 1), prot_data)
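
Whichever way the length is obtained, it may be worth checking that the CHAIN row reaches at least as far as the last annotated feature. The hand-typed end of 320 earlier in this thread, for instance, stops short of the Piwi domain, which ends at 871. A minimal sketch of such a check follows; chain_covers_features is a hypothetical name, not a drawProteins function, and the value 900 is made up.

```r
# Hypothetical check (not part of drawProteins): TRUE when the CHAIN row
# extends at least as far as the last non-CHAIN feature.
chain_covers_features <- function(prot_data) {
  is_chain <- prot_data$type == "CHAIN"
  # No CHAIN row at all: nothing covers the features.
  if (!any(is_chain)) {
    return(FALSE)
  }
  # Only CHAIN rows: there are no features to fall outside the chain.
  if (all(is_chain)) {
    return(TRUE)
  }
  max(prot_data$end[is_chain]) >= max(prot_data$end[!is_chain])
}

# Minimal stand-in with just the columns the check needs (ends from this thread).
features <- data.frame(
  type = c("CHAIN", "DOMAIN", "DOMAIN"),
  end = c(320, 391, 871),
  stringsAsFactors = FALSE
)

too_short <- chain_covers_features(features)   # chain end 320 vs. feature end 871

features$end[1] <- 900                         # made-up longer chain
long_enough <- chain_covers_features(features)
```

With these toy values the first call returns FALSE and the second returns TRUE. A FALSE on real data suggests the length was mistyped or came from a different entry.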
