Question: msa alignement: sorted sequences by sequence name
0
2.8 years ago by
Olorin50
Olorin50 wrote:

Hello,

I wonder if there is a way to use msaPrettyPrint() with sorted alignement by sequence name without rewriting the output fasta file. I know I can read the output fasta file with python, and re-order the sequences by the sequence name, but I'm pretty sure there is a quickest way to achieve that.

I mean i want to see in fasta file and pdf file (output from msaPrettyPrint):

>A

seq_A

>B

seq_B

Anybody can help?

output msa alignement sorted • 1.1k views
modified 2.8 years ago • written 2.8 years ago by Olorin50
Answer: msa alignement: sorted sequences by sequence name
1
2.8 years ago by
UBodenhofer250
Johannes Kepler University, Linz, Austria
UBodenhofer250 wrote:

TeXshade reorders the sequences by similarity. So changing the order in the FASTA file will not help. My suggestion is to enforce a certain order in the alignment object and then to force TeXshade to leave the order unchanged by an extra tweak.

msaPrettyPrint() uses exactly the same order as in the multiple alignment object. The order of the sequences in the multiple alignment object can be altered by all MSA methods in the 'msa' package: they have an argument 'order' which can either be "aligned" (i.e. as the algorithm reorders them) or "input" (as in the input sequences object). If this is not what you want, you can reorder the sequences as you like by simply using a permutation of the sequence. If you want to order sequences alphabetically by their names, you can do that as in the following example (re-order sequences before alignment and then use 'order="input"'):

library(msa)

filepath <- system.file("examples", "exampleAA.fasta", package="msa")

## order sequences alphabetically by names
permutation <- order(names(mySeqs))
myAlignment <- msa(mySeqs[permutation],
order="input")  ## forces msa() to leave order as it is
myAlignment

msaPrettyPrint(myAlignment, output="pdf",
code=paste0("\\orderseqs{1-", nrow(myAlignment), "}")) ## reordering off

Sorry that the solution is sort of complicated, but since TeXshade reorders by default, there seems to be no other way.

Answer: msa alignement: sorted sequences by sequence name
1
2.8 years ago by
Olorin50
Olorin50 wrote:

PROBLEM SOLVED

______________________________

for (i in 1:n){
permutation <- order(names(current_seq )
current_align <- msa(current_seq[permutation],order="input")
#I prefer to make tex file first
file = tex_file_1, alFile = fasta_file_1,
code=tex_code)
#compile tex file
texi2pdf(tex_file_1, clean=TRUE)
#move outputs files
file.rename(from = fasta_file_1, to = fasta_file_2)
file.rename(from = tex_file_1, to = tex_file_2)
file.rename(from = pdf_file_1, to = pdf_file_2
}

#This code need to be insert in the loop
#Tex_mis take basics customisations (logo consensus etc..)
#Put all basics stuff here
\\threshold{50}
\\showsequencelogo[chemical]{top}
\\hidelogoscale
\\shownumbering{right}
\\\showconsensus{}
\\showlegend"
tex_misc = unlist(strsplit(tex_misc, split="\n"))
tex_code = c(tex_misc)
#Put the full name for each sequences
for (i in 1:33){
ind = permutation[i]
print(names(current_seq[ind]))
new_name = gsub(pattern = "_", replacement = "-",names(current_seq[ind]))
tex_code = c(tex_code,paste("\\nameseq{",i,"}{",new_name,"}",sep =""))
}

________________________________________________________

Good things to know:

https://github.com/Bioconductor-mirror/msa/blob/master/R/msaPrettyPrint.R for helping retreving default customisations (logo, colors, consensus).

You can also use msaPrettyPrint() with only its own arguments (without changing the code), looks at the tex file to find how the basics stuff are written in tex file and copy/paste them in tex_misc in here (that's how I did)

LaTeX doesnt handle every character in sequence name. Any issues may be due to a wrong char in one sequence name.

/!\ for adding several LINES in tex code, the code arguments in msaPrettyPrint() need to have a list ( eg custom_tex_code = c(), where each elements in the list/vector will be a line in the tex file

The code shown above isnt ready to direct use. I use msaPrettyPrint() in a loop because i have several multifasta file to align seperately.

I can answer questions if you have some issues. Now i'm good in this awesome function.

Awesome, Olorin! May I add a few more bits:

• Not all of your customizations really need to be done in the LaTeX source. The most common customizations can be made using optional arguments of the msaPrettyPrint() function. See '?msaPrettyPrint' for details.

I saw right. you can see msaPrettyPrint help. If you add code arguments, it overrides all other arguments:

code
this argument can be used to specify the entire LaTeX code in the texshade environment. This overrides all arguments that customize the appearance of the output. Instead, all customizations must be done as LaTeX commands provided by the package texshade.sty directly. This option should only be used by expert users and for special applciations in which the possibilities of the customizations of the msaPrettyPrint function turn out to be insufficient.

That is true, 'code' overrides everything else. However, if you just want to add some customizations on top of what msaPrettyPrint() configures by itself, you can use the 'furtherCode' argument which only adds TeX code without overriding the rest.

Good to know. thank you

Answer: msa alignement: sorted sequences by sequence name
0
2.8 years ago by
Olorin50
Olorin50 wrote:

Thanks for you great help! Your solutions is tricky but I understand how you solve my problem =)

Your solution works fine. Sequences in fasta file are sorted by name and in the pdf too.

But there are 2 littles issues:

-My sequences names are like : Escherichia coli|gene_name|. In the pdf i get only Escherichia. It's a problem since I have n sequences from the same genus (e.g Escherichia strain_a, Escherichia strain_b, and so on..). I wonder how I can keep the full name.

-A very little issue is I lost the logo on the pdf.

________

There is not always the space to display full names if they are longer. I suggest to use some string processing to shorten the names properly, e.g. to remove "Escherichia coli" entirely (if you only have E.coli sequences anyway) or to abbreviate it by "E.coli". This can be done by altering the names of the original sequences of the row names of the alignment object.