Question

Biostrings error at retrieving protein sequences

nromerov • 0

Hello, I was trying to extract the sequences of 1000 proteins corresponding to a set of differentially expressed genes using Biostrings, after obtaining the protein_ids corresponding to those genes, but the code gives me this error:

library(Biostrings)
fastap <- "path_to_protein.faa"
protein_sequences <- readAAStringSet(fastap)
pids <- annot_degs0$protein_id
selected_proteins <- protein_sequences[pids]
Error: subscript contains invalid names

I think this may be because I only have the IDs in the form XP_057289302.1, while the headers in the .faa file carry extra information, as in >XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]

SequenceMatching Biostrings DifferentialExpression Proteome • 1.2k views

ADD COMMENT • link updated 14 months ago by Hervé Pagès 16k • written 15 months ago by nromerov • 0

score 1 · Answer 1 · 2023-10-02

ATpoint ★ 4.6k

Yes, this error comes up when you try to subset with names that do not exist in the AAStringSet, for example:

library(Biostrings)

aa <- AAStringSet(x="MGCCTGA")
names(aa) <- "peptide1"
aa["not_Existing_name"]

If the problem is just getting rid of the suffix, such as tubulinyl-Tyr carboxypeptidase 1-like (...), and all your names have a whitespace between the XP string and this suffix, then simply do:

gsub(" .*", "", pids)

This removes the first space and everything after it.

ADD COMMENT • link 15 months ago ATpoint ★ 4.6k

Thanks a lot for your help

ADD REPLY • link 15 months ago nromerov • 0

If I write it like

gsub(" .*", "", names(protein_sequences))

will it delete all the extra information if it is on the AAStringSet?

ADD REPLY • link 15 months ago nromerov • 0

It depends on what kind of whitespace it is. If the whitespace is just a space, then it will work. But if it's a tab, it won't.

> gsub(" .*", "", c("this and that", "that\tand\this", "other  and nother"))
[1] "this"           "that\tand\this" "other"      
## use perl whitespace identifier   
> gsub("\\s.*", "", c("this and that", "that\tand\this", "other  and nother"), perl = TRUE)
[1] "this"  "that"  "other"
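
As a hedged aside, the same space-versus-tab behaviour can be checked on header-like strings with the POSIX class `[[:space:]]`, which matches spaces and tabs alike in R's default regex engine. The `headers` vector below is made up for illustration (the second entry is a hypothetical tab-separated header), and `sub()` is used because only the first match in each string matters.

```r
# Hypothetical FASTA-style headers: one space-separated, one tab-separated.
headers <- c(
  "XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]",
  "XP_000000001.1\tplaceholder"
)

# A literal space in the pattern only trims the space-separated header;
# the tab-separated one contains no space and is returned unchanged.
space_only <- sub(" .*", "", headers)

# [[:space:]] matches any whitespace, so both headers are reduced to the ID.
any_space <- sub("[[:space:]].*", "", headers)

print(space_only)
print(any_space)
```

The cleaned vector would then be assigned back with `names(protein_sequences) <- ...` in the same way as for any other replacement of the names.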

ADD REPLY • link 15 months ago James W. MacDonald 67k

Yes, but you need to put the clean names back on the AAStringSet object:

names(protein_sequences) <- gsub(" .*", "", names(protein_sequences))

Now all the protein ids in pids will hopefully exist in names(protein_sequences). You can quickly check this with:

table(pids %in% names(protein_sequences))

If everything is TRUE, then try to subset again:

selected_proteins <- protein_sequences[pids]

It should work now!
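
If some values come back FALSE, it can help to see which IDs are missing before subsetting. The sketch below is only an illustration: it uses plain character vectors as stand-ins for `names(protein_sequences)` and `pids` (the values are invented), and it assumes the only mismatch is the description that follows the first whitespace.

```r
# Stand-ins for names(protein_sequences) and pids; values are illustrative only.
fasta_names <- c(
  "XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]",
  "XP_000000001.1 placeholder protein"
)
pids <- c("XP_057289302.1", "XP_999999999.1")

# Keep only the accession, i.e. everything before the first whitespace.
clean_names <- sub("[[:space:]].*", "", fasta_names)

# IDs that still have no matching sequence name after cleaning.
missing_ids <- setdiff(pids, clean_names)

# IDs that can safely be used for subsetting by name.
found_ids <- intersect(pids, clean_names)

print(missing_ids)
print(found_ids)
```

With the real objects, `protein_sequences[found_ids]` would subset only the IDs that exist, and `missing_ids` lists the ones that still need investigating.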

ADD REPLY • link 14 months ago Hervé Pagès 16k

score 1 · Answer 2 · 2023-10-02

James W. MacDonald 67k

You appear to have already answered your question. But to confirm, computers are dumb and will not match two things that aren't exactly identical.

ADD COMMENT • link 15 months ago James W. MacDonald 67k