Question

Biostrings error when retrieving protein sequences

nromerov • 0

@3ef12fd6

Colombia

Hello, I was trying to extract the sequences of 1000 proteins corresponding to a set of differentially expressed genes using Biostrings, after obtaining the protein_ids corresponding to those genes, but the code gives me this error:

library(Biostrings)
fastap <- "path_to_protein.faa"
protein_sequences <- readAAStringSet(fastap)
pids <- annot_degs0$protein_id
selected_proteins <- protein_sequences[pids]
Error: subscript contains invalid names

I think this may be because I only have the IDs in the form XP_057289302.1, while the .faa file has headers with extra information, such as >XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]

SequenceMatching Biostrings DifferentialExpression Proteome • 762 views

ADD COMMENT • link updated 6 months ago by Hervé Pagès 16k • written 6 months ago by nromerov • 0

score 1 · Answer 1 · 2023-10-02

ATpoint ★ 4.0k

@atpoint-13662

Germany

Yes, this error comes up when you try to subset with names that do not exist in the AAStringSet, for example:

library(Biostrings)

aa <- AAStringSet(x="MGCCTGA")
names(aa) <- "peptide1"
aa["not_Existing_name"]

If the problem is just getting rid of a suffix like tubulinyl-Tyr carboxypeptidase 1-like (...), and all your names have a space between the XP string and this suffix, then simply do:

gsub(" .*", "", pids)

This removes the first space and everything after it.
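
To see what the pattern does, here is a minimal sketch applied to the header quoted in the question. In the question it is the FASTA headers, not `pids`, that carry the description, so that is what the sketch cleans. The variable names `header` and `bare_id` are made up for illustration.

```r
# Header text copied from the question (without the leading ">", which
# readAAStringSet() does not keep in the names)
header <- "XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]"

# Drop the first space and everything after it
bare_id <- sub(" .*", "", header)
print(bare_id)

# A string without any space is returned unchanged
print(sub(" .*", "", "XP_057289302.1"))
```

`sub()` is enough here because the pattern is anchored by `.*` to the end of the string, so at most one match per element exists; `gsub()` gives the same result.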

ADD COMMENT • link 6 months ago ATpoint ★ 4.0k

Thanks a lot for your help

ADD REPLY • link 6 months ago nromerov • 0

If I write it like

gsub(" .*", "", names(protein_sequences))

will it delete all the extra information in the AAStringSet?

ADD REPLY • link 6 months ago nromerov • 0

It depends on what kind of whitespace it is. If the whitespace is just a space, then it will work. But if it's a tab, it won't.

> gsub(" .*", "", c("this and that", "that\tand\this", "other  and nother"))
[1] "this"           "that\tand\this" "other"      
## use perl whitespace identifier   
> gsub("\\s.*", "", c("this and that", "that\tand\this", "other  and nother"), perl = TRUE)
[1] "this"  "that"  "other"
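
As an alternative sketch, the POSIX class `[[:space:]]` also covers both spaces and tabs and does not need `perl = TRUE`. The vectors `x` and `first_token` are illustrative names only, and the tab check at the end is just one possible way to find out which kind of whitespace your headers contain.

```r
# Same test strings as above; note that "\t" inside "and\this" is also a tab
x <- c("this and that", "that\tand\this", "other  and nother")

# Remove the first whitespace character (space or tab) and everything after it
first_token <- sub("[[:space:]].*", "", x)
print(first_token)

# Check which elements contain a tab at all
has_tab <- grepl("\t", x, fixed = TRUE)
print(has_tab)
```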

ADD REPLY • link 6 months ago James W. MacDonald 65k

Yes, but you need to put the clean names back on the AAStringSet object:

names(protein_sequences) <- gsub(" .*", "", names(protein_sequences))

Now all the protein ids in pids will hopefully exist in names(protein_sequences). You can quickly check this with:

table(pids %in% names(protein_sequences))

If everything is TRUE, then try to subset again:

selected_proteins <- protein_sequences[pids]

It should work now!
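
If some IDs turn out to be FALSE in that check, subsetting with the full `pids` will fail again. Below is a minimal sketch of one way to handle that, using a toy AAStringSet in place of the real .faa file. The sequences, the second header, the ID XP_999999999.1, and the names `missing_ids` and `found_ids` are all made up for illustration.

```r
library(Biostrings)

# Toy stand-in for readAAStringSet("path_to_protein.faa"); the sequences and
# the second header are invented for this example
protein_sequences <- AAStringSet(c("MGCCTGA", "MKVLAAG"))
names(protein_sequences) <- c(
  "XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]",
  "XP_000000001.1 hypothetical protein [Hydractinia symbiolongicarpus]"
)

# Hypothetical ID vector; the second ID is deliberately absent from the set
pids <- c("XP_057289302.1", "XP_999999999.1")

# Keep only the accession (text before the first space or tab) as the name
names(protein_sequences) <- sub("[[:space:]].*", "", names(protein_sequences))

# IDs that cannot be matched, and IDs that can
missing_ids <- setdiff(pids, names(protein_sequences))
found_ids <- intersect(pids, names(protein_sequences))

# Subset only with names that exist, so no "invalid names" error is raised
selected_proteins <- protein_sequences[found_ids]
print(missing_ids)
print(selected_proteins)
```

Note that `intersect()` drops duplicated IDs, so if `pids` contains repeats and you want them kept, `pids[pids %in% names(protein_sequences)]` may be the better choice.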

ADD REPLY • link 6 months ago Hervé Pagès 16k

score 1 · Answer 2 · 2023-10-02

James W. MacDonald 65k

@james-w-macdonald-5106

United States

You appear to have already answered your question. But to confirm, computers are dumb and will not match two things that aren't exactly identical.

ADD COMMENT • link 6 months ago James W. MacDonald 65k