Question

Biostrings error when retrieving protein sequences

0

nromerov • 0

@3ef12fd6

Colombia

Hello, I was trying to extract the sequences of 1000 proteins corresponding to a set of differentially expressed genes using Biostrings, after obtaining the protein_ids corresponding to those genes, but the code gives me this error:

library(Biostrings)
fastap <- "path_to_protein.faa"
protein_sequences <- readAAStringSet(fastap)
pids <- annot_degs0$protein_id
selected_proteins <- protein_sequences[pids]
Error: subscript contains invalid names

I think this may be because my IDs only have the form XP_057289302.1, while the headers in the .faa file carry extra information, as in >XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]

SequenceMatching Biostrings DifferentialExpression Proteome • 1.7k views

ADD COMMENT • link updated 2.4 years ago by Hervé Pagès 16k • written 2.4 years ago by nromerov • 0

score 1 · Answer 1 · 2023-10-02

1

ATpoint ★ 5.0k

@atpoint-13662

Germany

Yes, this error comes up when you try to subset with names that do not exist in the AAStringSet, for example:

library(Biostrings)

aa <- AAStringSet(x="MGCCTGA")
names(aa) <- "peptide1"
aa["not_Existing_name"]

If the problem is just getting rid of the suffix, such as tubulinyl-Tyr carboxypeptidase 1-like (...), and all your names have a space between the XP accession and that suffix, then simply do:

gsub(" .*", "", pids)

This removes everything from the first space onward, leaving only the accession.
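If the full headers should be kept in the output, one alternative is to match on the cleaned accessions but subset by position. The following is only a minimal sketch: a named character vector stands in for the AAStringSet so that it runs in base R, the sequences are made up, and XP_000000001.1 is a hypothetical accession used purely for illustration. Integer subsetting works the same way on an AAStringSet.

```r
# Stand-in for the AAStringSet: a named character vector whose names mimic
# full FASTA headers (made-up sequences; the second accession is hypothetical).
protein_sequences <- c(
  "XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]" = "MGCCTGA",
  "XP_000000001.1 hypothetical protein [Hydractinia symbiolongicarpus]" = "MKLVA"
)
pids <- c("XP_057289302.1")

# Accession = header text before the first whitespace character.
accessions <- sub("\\s.*$", "", names(protein_sequences))

# Position of each ID among the accessions; NA marks IDs that are absent.
idx <- match(pids, accessions)

# Subset by integer position, so the original headers stay untouched.
selected_proteins <- protein_sequences[idx[!is.na(idx)]]
names(selected_proteins)
```

Compared with overwriting the names, this keeps the protein descriptions available when the selection is written back to a file.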

ADD COMMENT • link 2.4 years ago ATpoint ★ 5.0k

0

Thanks a lot for your help.

ADD REPLY • link 2.4 years ago nromerov • 0

0

If I write it like this:

gsub(" .*", "", names(protein_sequences))

will it delete all the extra information if it is in the AAStringSet?

ADD REPLY • link 2.4 years ago nromerov • 0

0

It depends on what kind of whitespace it is. If the whitespace is just a space, then it will work. But if it's a tab, it won't.

> gsub(" .*", "", c("this and that", "that\tand\this", "other  and nother"))
[1] "this"           "that\tand\this" "other"      
## use perl whitespace identifier   
> gsub("\\s.*", "", c("this and that", "that\tand\this", "other  and nother"), perl = TRUE)
[1] "this"  "that"  "other"
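As a side note, here is a small sketch of two equivalent spellings. R's default regex engine also documents `\s` as an extension, so `perl = TRUE` is optional here, and the POSIX class `[[:space:]]` covers spaces and tabs as well. Since only the first match matters, `sub()` is enough. The variable names below are illustrative only.

```r
headers <- c("this and that", "that\tand\this", "other  and nother")

# \\s with the default regex engine (no perl = TRUE needed)
ids_default <- sub("\\s.*", "", headers)

# POSIX character class; matches spaces, tabs and other whitespace
ids_posix <- sub("[[:space:]].*", "", headers)

ids_default
ids_posix
```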

ADD REPLY • link 2.4 years ago James W. MacDonald 68k

0

Yes, but you need to put the clean names back on the AAStringSet object:

names(protein_sequences) <- gsub(" .*", "", names(protein_sequences))

Now all the protein ids in pids will hopefully exist in names(protein_sequences). You can quickly check this with:

table(pids %in% names(protein_sequences))

If everything is TRUE, then try to subset again:

selected_proteins <- protein_sequences[pids]

It should work now!
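If the table still shows some FALSE values, it can help to list the offending IDs and subset with only the ones that are present. This is a rough sketch under stated assumptions: a named character vector again stands in for the AAStringSet (names already cleaned), the sequences are made up, and XP_000000001.1 and XP_999999999.1 are hypothetical accessions.

```r
# Stand-in for the AAStringSet after its names were cleaned (made-up data).
protein_sequences <- c("XP_057289302.1" = "MGCCTGA", "XP_000000001.1" = "MKLVA")
# One real-looking ID and one hypothetical ID that is not in the file.
pids <- c("XP_057289302.1", "XP_999999999.1")

# IDs requested but not found among the sequence names.
missing_ids <- setdiff(pids, names(protein_sequences))
# IDs that can be safely used for subsetting.
present_ids <- intersect(pids, names(protein_sequences))

if (length(missing_ids) > 0) {
  message("IDs not found: ", paste(missing_ids, collapse = ", "))
}

selected_proteins <- protein_sequences[present_ids]
names(selected_proteins)
```

Note that `setdiff()` and `intersect()` also drop duplicated IDs, which may or may not be what you want if the same protein appears for several genes.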

ADD REPLY • link 2.4 years ago Hervé Pagès 16k

score 1 · Answer 2 · 2023-10-02

1

James W. MacDonald 68k

@james-w-macdonald-5106

United States

You appear to have already answered your question. But to confirm, computers are dumb and will not match two things that aren't exactly identical.

ADD COMMENT • link 2.4 years ago James W. MacDonald 68k