Question

Retrieve GenBank file using RefSeq ID

0

Entering edit mode

heiko_kin ▴ 60

@heiko_kin-23266

Last seen 5.0 years ago

Dear all,

I am looking for a way to load the .gb GenBank file for a specific RefSeq ID directly into R. I had no success using the genbankr, rentrez and biomaRt package. Does anyone have a suggestion on how this can be done?

Thanks, Heiko

genbankr • 2.9k views

ADD COMMENT • link updated 5.1 years ago by Nicholas Cooley • 0 • written 5.1 years ago by heiko_kin ▴ 60

0

Entering edit mode

I typically accomplish these kinds of operations by either directly accessing EDirect in the terminal, or having R build a bunch of EDirect queries directly, passing those to the terminal, and capturing the output.

In the terminal this simply looks like:

$ esearch -db nuccore -query NM_130786 | efetch -format gp

This is pretty easily recreated in R with as long as R has access to edirect - check Sys.getenv("PATH"):

> query <- paste0("esearch -db ",
                  "nuccore ", # your specified DB
                  "-query ", 
                  "NM_130786", # your query ID
                  " | ",
                  "efetch -format",
                  " gp")
> z <- system(command = query,
              intern = TRUE,
              timeout = 300L)
> head(z)
[1] "LOCUS       NM_130786               3382 bp    mRNA    linear   PRI 12-DEC-2020"
[2] "DEFINITION  Homo sapiens alpha-1-B glycoprotein (A1BG), mRNA."                  
[3] "ACCESSION   NM_130786"                                                          
[4] "VERSION     NM_130786.4"                                                        
[5] "KEYWORDS    RefSeq; MANE Select."                                               
[6] "SOURCE      Homo sapiens (human)"                                               
> tail(z)
[1] "     3181 gccccagctt agttttttaa aaagtttatt ttagccattc taataggtat gtagacatat"
[2] "     3241 ctaatagtgg ttttccttgc gtttccttaa tgtctgatga tgttaagcat tttttcccca"
[3] "     3301 agtgcttagt tgccatctat atagcatctt tgatgaaatg tctgtttata tattttgcac"
[4] "     3361 actttaaaat attgggttgt tt"                                         
[5] "//"                                                                         
[6] ""

There are a few quirks with this process: If you run a loop across a whole bunch of system() calls R can eventually run out of connections, which is weird. Also depending on your version of RStudio, calling system() with the default timeout argument can cause R to hang indefinitely. This does not occur if you load up R from the terminal or just use the GUI, and appears to have been fixed in the most recent version of RStudio - but don't quote me on that.

I like running queries this way, but it does require a bit of background knowledge about edirect. If you're new to it, there's an old Cookbook for edirect that can be helpful.

ADD REPLY • link 5.1 years ago Nicholas Cooley • 0

score 0 · Answer 1 · 2020-12-15

You will have to be more explicit than that. What exactly do you want to do? What have you tried?

It seems pretty straightforward to download and read in a .gb file using rentrez (less so with genbankr, but that has more to do with the arbitrariness of .gb records than any fault with genbankr. I mean how do you programmatically prepare for random-ish inputs?).

> z <- entrez_search("nuccore","NM_130786[ACCNUM]")
> zz <- entrez_fetch("nuccore", z, rettype ="gbwithparts",retmode = "text")
> strsplit(zz, "\\n")[1:3]
[[1]]
  [1] "LOCUS       NM_130786               3382 bp    mRNA    linear   PRI 12-DEC-2020"  
  [2] "DEFINITION  Homo sapiens alpha-1-B glycoprotein (A1BG), mRNA."                    
  [3] "ACCESSION   NM_130786"                                                            
  [4] "VERSION     NM_130786.4"                                                          
  [5] "KEYWORDS    RefSeq; MANE Select."                                                 
  [6] "SOURCE      Homo sapiens (human)"                                                 
  [7] "  ORGANISM  Homo sapiens"                                                         
  [8] "            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;"    
  [9] "            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;"         
 [10] "            Catarrhini; Hominidae; Homo."                                         
 [11] "REFERENCE   1  (bases 1 to 3382)"                                                 
 [12] "  AUTHORS   Barallobre-Barreiro J, Gupta SK, Zoccarato A, Kitazume-Taneike R,"    
 [13] "            Fava M, Yin X, Werner T, Hirt MN, Zampetaki A, Viviano A, Chong M,"   
 [14] "            Bern M, Kourliouros A, Domenech N, Willeit P, Shah AM, Jahangiri M,"  
 [15] "            Schaefer L, Fischer JW, Iozzo RV, Viner R, Thum T, Heineke J,"        
 [16] "            Kichler A, Otsu K and Mayr M."                                        
 [17] "  TITLE     Glycoproteomics Reveals Decorin Peptides With Anti-Myostatin"     
<snip>