There is a series of amino acid sequences I want to import into R for alignment, using DECIPHER, together with some sequences I have already imported as FASTA.
The FTP site I want to get them from is ftp://ftp.cdc.gov/pub/infectious_diseases/biotech/tstransl/. Within this folder there are two types of files: ".tsr" and ".pep". I only want to import the ".tsr" files, as these are the processed, trimmed versions of the ".pep" files.
When I open the files in my text editor, they are not in FASTA format, which I have worked with before. They look like this:
emm1.0
M1 type-specific region: mature product residues 1-50.
Streptococcus pyogenes M type 1 gene (emm1) 5' partial sequence
CDC reference strain= SS745
emm1 isolates are usually T antigen type 1, opacity factor negative.
emm1 is a very common emm type from sterile site isolates . . .
emm1.0 Length: 50 August 23, 2002 09:47 Type: P Check: 4148 ..
1 NGDGNPREVI EDLAANNPAI QNIRLRHENK DLKARLENAM EVAGRDFKRA
I would like them to be imported with the identifier "emm1.0" (taken either from the first line or from the "emm1.0 Length: 50 ..." line) and with the amino acid sequence being the residues that follow the leading 1.
Is this possible? Or do the files have to be in a more recognisable format? Ideally I would like to import both the ".pep" and ".tsr" files and have them in separate databases, one with the full-length sequences and one with the trimmed sequences as above, but maybe this is too optimistic?
Just for reference the ".pep" files look like this:
!!AA_SEQUENCE 1.0
TRANSLATE of: emm1 check: 1997 from: 1 to: 398
generated symbols 1 to: 132.
Streptococcus pyogenes M type 1 gene (emm1) 5' partial sequence
CDC reference strain= SS745
emm1 isolates are usually T antigen type 1, opacity factor negative.
emm1 is a very common emm type from sterile site isolates
within the United States and other countries.
The emm1 amplicon (see emm typing protocol) almost always gives . . .
emm1.pep Length: 132 February 9, 1998 14:36 Type: P Check: 757 ..
1 ASVAVALTVL GAGFANQTEV KANGDGNPRE VIEDLAANNP AIQNIRLRHE
51 NKDLKARLEN AMEVAGRDFK RAEELEKAKQ ALEDQRKDLE TKLKELQQDY
101 DLAKESTSWD RQRLEKELEE KKEALELAID QA
I have tried the very simplistic:
Seqs2DB("ftp://ftp.cdc.gov/pub/infectious_diseases/biotech/tstransl/", type = "XStringSet", identifier = "")
But this obviously does not work.
Sorry, this format is not supported by either R package. However, you could write your own code to read and parse the text files.
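As a rough starting point, here is a minimal sketch of such a parser in base R. It assumes that every file follows the layout in the two examples above: free-text comments, then a divider line containing "Length:" and ending in "..", then the numbered sequence lines. The function name parse_gcg is made up for illustration, and the sketch has only been checked against the example records quoted in the question, not against the whole FTP folder.

```r
# Hypothetical helper: parse one record in the layout shown in the question.
# `lines` is a character vector with one element per line of the file.
parse_gcg <- function(lines) {
  # Assumption: the header ends at the first line that contains "Length:"
  # and is terminated by ".."
  hdr <- grep("Length:.*\\.\\.\\s*$", lines, perl = TRUE)
  if (length(hdr) == 0L) stop("no 'Length: ... ..' divider line found")
  hdr <- hdr[1L]

  # The identifier is the first whitespace-delimited token on the divider line
  id <- sub("^\\s*(\\S+).*$", "\\1", lines[hdr], perl = TRUE)

  # Everything after the divider is sequence; drop position numbers and spaces
  body <- lines[seq_len(length(lines) - hdr) + hdr]
  residues <- gsub("[^A-Za-z*]", "", paste(body, collapse = ""))

  # Return a named character vector: name = identifier, value = sequence
  setNames(residues, id)
}

# Usage on a shortened copy of the ".tsr" example from the question
example <- c(
  "emm1.0",
  "M1 type-specific region: mature product residues 1-50.",
  "emm1.0 Length: 50 August 23, 2002 09:47 Type: P Check: 4148 ..",
  "1 NGDGNPREVI EDLAANNPAI QNIRLRHENK DLKARLENAM EVAGRDFKRA"
)
rec <- parse_gcg(example)
```

For real files you would call parse_gcg(readLines(path)) on each downloaded file and combine the results with unlist(). The resulting named character vector should be usable as input to Biostrings::AAStringSet(), which can then be passed to Seqs2DB() with type = "XStringSet" and a dbFile of your choice, although I have not run that last step against these files. Note that for the ".pep" files the identifier on the divider line is "emm1.pep" rather than "emm1.0", so you may want to strip the extension.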
I thought this might be the case. I think that may be beyond my capabilities, but I will look into it. Thanks.