Hello, I was trying to extract the sequences of 1000 proteins corresponding to a set of differentially expressed genes using Biostrings. I obtained the protein_ids corresponding to those genes, but the code gives me this error:
library(Biostrings)
fastap <- "path_to_protein.faa"
protein_sequences <- readAAStringSet(fastap)
pids <- annot_degs0$protein_id
selected_proteins <- protein_sequences[pids]
Error: subscript contains invalid names
I think this is maybe because I only have the IDs in the form XP_057289302.1, while the .faa file has headers with extra information, such as >XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]
Thanks a lot for your help
If I write it like
will it delete all the extra information if it is in the AAStringSet?
It depends on what kind of whitespace it is. If the whitespace is just a space, then it will work. But if it's a tab, it won't.
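To illustrate the difference, here is a small base R sketch. The snippet being asked about is not shown here, so the two patterns below are assumptions about what such a call might look like, and the second ID is made up for the example.

```r
# Two example FASTA headers: the first uses a space after the ID,
# the second (made-up ID) uses a tab.
headers <- c(
  "XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]",
  "XP_000000001.1\thypothetical"
)

# Pattern that looks for a literal space: the tab-separated header is left untouched.
space_only <- sub(" .*", "", headers)

# Pattern that looks for any whitespace character (space or tab): both headers are cleaned.
any_ws <- sub("[[:space:]].*", "", headers)

print(space_only)
print(any_ws)
```

Using the `[[:space:]]` class is the more defensive choice if you are not sure how the headers are delimited.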
Yes, but you need to put the clean names back on the AAStringSet object:
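A minimal sketch of that step, assuming the ID is everything before the first whitespace character in each header. The toy AAStringSet stands in for the result of readAAStringSet(fastap); its sequences and the second ID are invented for illustration.

```r
library(Biostrings)

# Toy stand-in for readAAStringSet(fastap); sequences and the second ID are made up.
protein_sequences <- AAStringSet(c(
  "XP_057289302.1 tubulinyl-Tyr carboxypeptidase 1-like [Hydractinia symbiolongicarpus]" = "MKTAYIAK",
  "XP_000000001.1 hypothetical protein" = "MSLLTEVE"
))

# Keep only the text before the first whitespace character,
# then assign the cleaned names back onto the object.
names(protein_sequences) <- sub("[[:space:]].*", "", names(protein_sequences))

print(names(protein_sequences))
```

Calling sub() on the names without the assignment only returns a cleaned character vector; the AAStringSet itself keeps its original headers until the result is assigned back.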
Now all the protein ids in pids will hopefully exist in names(protein_sequences). You can quickly check this with, for example, pids %in% names(protein_sequences). If everything is TRUE, then try to subset again with protein_sequences[pids]. It should work now!
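If some IDs still turn out to be absent, subsetting will fail again with the same error. The sketch below is one possible way to see which IDs are missing and subset only the ones that are present. The toy data and the variable names found and missing_ids are assumptions for illustration, not part of the original code.

```r
library(Biostrings)

# Toy data: sequences with already-cleaned names, plus one ID that is not in the file.
protein_sequences <- AAStringSet(c(
  "XP_057289302.1" = "MKTAYIAK",
  "XP_000000001.1" = "MSLLTEVE"
))
pids <- c("XP_057289302.1", "XP_999999999.1")

# Logical vector: TRUE where the ID exists among the sequence names.
found <- pids %in% names(protein_sequences)

# IDs that would trigger "subscript contains invalid names".
missing_ids <- pids[!found]
print(missing_ids)

# Subset using only the IDs that are actually present.
selected_proteins <- protein_sequences[pids[found]]
print(names(selected_proteins))
```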