Is it possible to import many .tsr sequence files into R using Biostrings for use in DECIPHER?
@reubenmcgregor88-13722
Last seen 3.8 years ago

I have a series of amino acid sequences that I want to import into R and align, using DECIPHER, with some sequences I have already imported from FASTA files.

The FTP site I want to get them from is ftp://ftp.cdc.gov/pub/infectious_diseases/biotech/tstransl/. This folder contains two types of files: ".tsr" and ".pep". I only want to import the ".tsr" files, as these are the processed, trimmed versions of the ".pep" files.

When I open the files in my text editor, they are not in the FASTA format I have worked with before. They look like this:

emm1.0

M1 type-specific region: mature product residues 1-50.
Streptococcus pyogenes M type 1 gene (emm1) 5' partial sequence
                  CDC reference strain= SS745
emm1 isolates are usually T antigen type 1, opacity factor negative. 
emm1 is a very common emm type from sterile site isolates . . . 

emm1.0 Length: 50  August 23, 2002 09:47  Type: P  Check: 4148  ..

       1  NGDGNPREVI EDLAANNPAI QNIRLRHENK DLKARLENAM EVAGRDFKRA 

I would like them to be imported with the identifier "emm1.0" (taken either from the first line or from the "emm1.0 Length: 50 ..." line) and with the amino acid sequence being the residues that follow the leading 1.

Is this possible? Or do the files have to be in a more recognisable format? Ideally I would like to import both the ".pep" and ".tsr" files and keep them in separate databases, one with the full-length sequences and one with the trimmed sequences as above, but maybe this is too optimistic?

 

Just for reference the ".pep" files look like this:

!!AA_SEQUENCE 1.0
TRANSLATE of: emm1 check: 1997 from: 1 to: 398
generated symbols 1 to: 132.

Streptococcus pyogenes M type 1 gene (emm1) 5' partial sequence
                  CDC reference strain= SS745
emm1 isolates are usually T antigen type 1, opacity factor negative. 
emm1 is a very common emm type from sterile site isolates
 within the United States and other countries.
 The emm1 amplicon (see emm typing protocol) almost always gives . . . 

 emm1.pep  Length: 132  February 9, 1998 14:36  Type: P  Check: 757  ..

       1  ASVAVALTVL GAGFANQTEV KANGDGNPRE VIEDLAANNP AIQNIRLRHE 

      51  NKDLKARLEN AMEVAGRDFK RAEELEKAKQ ALEDQRKDLE TKLKELQQDY 

     101  DLAKESTSWD RQRLEKELEE KKEALELAID QA

I have tried the very simplistic call:

Seqs2DB("ftp://ftp.cdc.gov/pub/infectious_diseases/biotech/tstransl/", type = "XStringSet", identifier = "")

But this obviously does not work.

 

r biostrings decipher • 1.2k views

Sorry, this format is not supported by either R package. However, you could write your own code to read and parse the text files.
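As a rough starting point, here is a minimal base-R sketch of such a parser. It assumes every file follows the layout shown in the question: free-text comments, then one divider line that contains "Length:" and ends in "..", then numbered sequence lines. The function names `parse_gcg_record` and `read_gcg_files` are made up for this illustration and are not part of Biostrings or DECIPHER, and the code has only been checked against the emm1.0 record quoted above.

```r
# Parse one record (the lines of a single .tsr or .pep file) into a
# named character vector: name = identifier, value = sequence.
# Assumption: the divider line contains "Length:" and ends with "..".
parse_gcg_record <- function(lines) {
  divider <- grep("Length:.*\\.\\.[[:space:]]*$", lines)
  if (length(divider) == 0) stop("no 'Length: ... ..' divider line found")
  divider <- divider[1]

  header <- trimws(lines[divider])
  # Identifier is everything before "Length:" (e.g. "emm1.0" or "emm1.pep")
  id <- sub("[[:space:]]+Length:.*$", "", header)
  declared <- as.integer(sub("^.*Length:[[:space:]]*([0-9]+).*$", "\\1", header))

  # Sequence lines follow the divider; drop position numbers and spaces
  body <- lines[seq_along(lines) > divider]
  sequence <- paste(gsub("[^A-Za-z*]", "", body), collapse = "")

  # Sanity check against the length stated in the header
  if (!is.na(declared) && nchar(sequence) != declared) {
    warning("length mismatch for ", id, ": header says ", declared,
            ", parsed ", nchar(sequence))
  }
  stats::setNames(sequence, id)
}

# Read several files (local paths, one record per file) into one vector
read_gcg_files <- function(paths) {
  unlist(lapply(unname(paths), function(p) parse_gcg_record(readLines(p))))
}

# Try it on the emm1.0 record from the question
example_tsr <- c(
  "emm1.0",
  "",
  "M1 type-specific region: mature product residues 1-50.",
  "Streptococcus pyogenes M type 1 gene (emm1) 5' partial sequence",
  "                  CDC reference strain= SS745",
  "",
  "emm1.0 Length: 50  August 23, 2002 09:47  Type: P  Check: 4148  ..",
  "",
  "       1  NGDGNPREVI EDLAANNPAI QNIRLRHENK DLKARLENAM EVAGRDFKRA "
)
seqs <- parse_gcg_record(example_tsr)
```

If that works on your downloaded files, the named character vector should be convertible with `Biostrings::AAStringSet(seqs)`, and the resulting object can then be passed to `Seqs2DB()` with `type = "XStringSet"` together with a `dbFile` and an `identifier`. Running `read_gcg_files()` once on the ".tsr" files and once on the ".pep" files would give you the two separate sets you asked about.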


I thought this might be the case. That may be beyond my capabilities, but I will look into it. Thanks.
