Biostrings error when retrieving protein sequences
nromerov • 0
@3ef12fd6
Colombia

Hello, I was trying to extract the sequences of 1000 proteins corresponding to a set of differentially expressed genes using Biostrings. I obtained the protein_ids corresponding to those genes, but the code gives me this error:

library(Biostrings)
fastap <- "path_to_protein.faa"
protein_sequences <- readAAStringSet(fastap)
pids <- annot_degs0$protein_id
selected_proteins <- protein_sequences[pids]
Error: subscript contains invalid names

I think this may be because I only have the IDs in the form XP_057289302.1, while the headers in the .faa file carry extra information, for example >XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]

SequenceMatching Biostrings DifferentialExpression Proteome • 1.1k views
ATpoint ★ 4.5k
@atpoint-13662
Germany

Yes, this error comes up when you try to subset with names that do not exist in the AAStringSet, for example:

library(Biostrings)

aa <- AAStringSet(x="MGCCTGA")
names(aa) <- "peptide1"
aa["not_Existing_name"]
# Error: subscript contains invalid names

If the problem is just to get rid of this suffix like tubulinyl-Tyr carboxypeptidase 1-like (...) and all your names have a whitespace between the XP string and this suffix then simply do:

gsub(" .*", "", pids)

This removes everything after the whitespace.


Thanks a lot for your help.


If I write it like this:


gsub(" .*", "", names(protein_sequences))

will it delete all the extra information in the names of the AAStringSet?


It depends on what kind of whitespace it is. If the whitespace is just a space, then it will work. But if it's a tab, it won't.

> gsub(" .*", "", c("this and that", "that\tand\this", "other  and nother"))
[1] "this"           "that\tand\this" "other"      
## use perl whitespace identifier   
> gsub("\\s.*", "", c("this and that", "that\tand\this", "other  and nother"), perl = TRUE)
[1] "this"  "that"  "other"

Yes, but you need to put the clean names back on the AAStringSet object:

names(protein_sequences) <- gsub(" .*", "", names(protein_sequences))

Now all the protein ids in pids will hopefully exist in names(protein_sequences). You can quickly check this with:

table(pids %in% names(protein_sequences))

If everything is TRUE, then try to subset again:

selected_proteins <- protein_sequences[pids]

It should work now!
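
As a rough end-to-end sketch of the steps above, the following uses a small in-memory AAStringSet instead of the real .faa file. The accessions, header text, and the names `missing_ids` and `found_ids` are made up for illustration and are not from the original data. It uses the `\\s` pattern so that tab-separated headers are handled too, and it subsets only with IDs that are actually present, so one unmatched ID does not abort the whole extraction.

```r
library(Biostrings)

# Toy stand-in for the .faa file: names mimic full FASTA headers
# (hypothetical accessions; the third header uses a tab as separator).
protein_sequences <- AAStringSet(c(
  "XP_000000001.1 hypothetical protein A [Example species]" = "MGCCTGA",
  "XP_000000002.1 hypothetical protein B [Example species]" = "MKVLAAG",
  "XP_000000003.1\thypothetical protein C [Example species]" = "MSTNPKP"
))

# Hypothetical IDs of interest; the last one is deliberately absent.
pids <- c("XP_000000001.1", "XP_000000003.1", "XP_999999999.1")

# Keep only the accession: drop everything from the first whitespace onward.
names(protein_sequences) <- sub("\\s.*", "", names(protein_sequences), perl = TRUE)

# Split the requested IDs into those with and without a matching name.
missing_ids <- setdiff(pids, names(protein_sequences))
found_ids <- intersect(pids, names(protein_sequences))

# Subset using only IDs known to exist, which avoids the invalid-names error.
selected_proteins <- protein_sequences[found_ids]
```

Inspecting `missing_ids` afterwards shows which requested proteins are not in the FASTA file at all, which a plain `table()` of TRUE/FALSE does not tell you.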

@james-w-macdonald-5106
United States

You appear to have already answered your question. But to confirm, computers are dumb and will not match two things that aren't exactly identical.
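
To illustrate what "exactly identical" means here, this is a small base R sketch with a made-up header and made-up candidate IDs (none of them come from the original data). Strings that look equivalent to a person, such as an accession without its version suffix or with a trailing space, are still different strings to R.

```r
# Hypothetical header as it would appear in names() after reading a FASTA file.
header <- "XP_000000001.1 hypothetical protein [Example species]"

# Candidate IDs that look similar to a human but differ as strings.
candidates <- c(
  exact_accession = "XP_000000001.1",
  no_version      = "XP_000000001",
  trailing_space  = "XP_000000001.1 ",
  lower_case      = "xp_000000001.1"
)

# None of the candidates equals the full header, so none would match by name.
any_match_full_header <- any(candidates %in% header)

# After trimming the header to its accession, only the identical string matches.
accession <- sub("\\s.*", "", header, perl = TRUE)
matches <- candidates == accession
print(matches)
```

Only `exact_accession` comes back TRUE, so it is worth checking for version suffixes, stray whitespace, and case differences when some IDs still fail to match after cleaning the names.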
