I am trying to understand how the index from the synteny analysis in the DECIPHER package can be translated into chromosomes. I basically used what is in the official website:
# load the DECIPHER library in R library(DECIPHER) # specify the path to each FASTA file (in quotes) # each genome must be given a unique identifier here # for example: Genome1, Genome2, etc. fas <- c(Genome1="<<REPLACE WITH PATH TO FASTA FILE 1>>", Genome2="<<REPLACE WITH PATH TO FASTA FILE 2>>", Genome3="<<REPLACE WITH PATH TO FASTA FILE 3>>") # specify where to create the new sequence database db <- "<<REPLACE WITH PATH TO SEQUENCE DATABASE>>" # load the sequences from the file in a loop for (i in seq_along(fas)) { Seqs2DB(fas[i], "FASTA", db, names(fas[i])) } # map the syntenic regions between each genome pair synteny <- FindSynteny(db) # print a summary of the syntenic map (optional) synteny # view the syntenic regions (optional) pairs(synteny) # displays a dot plot of all pairs plot(synteny) # displays a bar plot of adjacent pairs # perform alignments of all pairs of genomes DNA <- AlignSynteny(synteny, db) # view the aligned syntenic blocks for each pair unlist(DNA[[1]]) # Genomes 1 and 2 unlist(DNA[[2]]) # Genomes 1 and 3 unlist(DNA[[3]]) # Genomes 2 and 3
Accessing the sequences of DNA[[3]]:
unlist(DNA[[3]][1])
A DNAStringSet instance of length 2
width seq names
[1] 1962640 CTAACTTGACACCTTTCCCTTG...TCATTCCCCTTCCCACACACAC 1.Genome2
[2] 1962640 CTAACTTGACACCTTTCCCTTG...TCATTCCTCTTCCCACACACAC 1.Genome3
The correspondent genomic region should be the first row of the following data-frame:
head(synteny[[6]]) index1 index2 strand score start1 start2 end1 end2 first_hit [1,] 10 12 0 307470 15003547 14701712 16785549 16531717 1 [2,] 6 8 0 153713 33444467 32224687 34316599 33159921 8622 [3,] 12 14 0 146558 15237602 15554593 16266481 16701047 11977 [4,] 5 7 0 130431 47010916 48608979 48610127 50288861 16388 [5,] 11 13 0 128074 3879383 9476868 4991057 10661494 21719 [6,] 8 10 0 124988 22515057 23515792 23999163 25084001 25947 last_hit [1,] 8621 [2,] 11976 [3,] 16387 [4,] 21718 [5,] 25946 [6,] 31151
Therefore, I imported the 'FASTA' file of the Genome3 to understand if the order of the sequences (indexing).
library("Biostrings") G = readDNAStringSet("GTgenome1.1.fa")
At first, I assumed that it would obey the same order of the G object (i.e. first row would be index = 1, second row index = 2 and so on):
A DNAStringSet instance of length 1674 width seq names [1] 20202851 CCCAGTTTTCCCCACTCTGT...AGATCTTACAACCGATTTT chr10 Parus_Major... [2] 20315886 AGCCGACGAGACTCACAGAA...ACAAACCCCCTCGGGAGGG chr11 Parus_Major... [3] 20466350 GATTAGACCTCCGAAAGGGG...TATTAATTATTAAATATTA chr12 Parus_Major... [4] 16480340 GTCTCCACTTGCCCCACAAC...ATGACGATGATGAAGATGA chr13 Parus_Major... [5] 16193477 CTCTGTGACATCACAGCCAT...GTTACACACGTTGTTTTTT chr14 Parus_Major...
If we take the sequence of the first row from synteny[[6]] object (index2 is = 12):
unlist(DNA[[3]][1]) A DNAStringSet instance of length 2 width seq names [1] 1962640 CTAACTTGACACCTTTCCCTTG...TCATTCCCCTTCCCACACACAC 1.Genome2 [2] 1962640 CTAACTTGACACCTTTCCCTTG...TCATTCCTCTTCCCACACACAC 1.Genome3
Checking the chromosome of the first part of the sequence of Genome3:
ali <- vmatchPattern("CTAACTTGACACCTTTCCCTTG", G, max.mismatch=0) forw <- unlist(ali) names(forw) <- gsub("Parus_Major_build_May2013", "", names(forw)) forw <- makeGRangesFromDataFrame(forw, seqnames.field=c("names"), keep.extra.columns=TRUE) forw GRanges object with 1 range and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr10 [14701712, 14701733] * ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
Based on that, I discovery that when the index2 is equal to 12, it corresponds to chromosome 10. However, it should be a easier way to understand how the index reflects the chromosome names in a FASTA file. I would be glad for any light in here.
Hi Erik, thank you for your prompt answer. In your example you used 'XStringSet; instead 'FASTA' directly. Running the analysis again using your example will work fine. However, would be nice to understand what I have already, where I used the 'FASTA' files. As far I understand, the 'Seqs2DB' function places an index automatically for each sequence within a 'FASTA' file. Each of these sequences correspond to a specific chromosome, that has a name. My main question is: How can I find out the correspondence between the index and the original sequence name (i.e chromosome name). Unfortunately I was unable to find it in your answer. I would be glad for any further help.
Thanks for clarifying your question. The sequences are numbered (indexed) according to their order in the imported file. You can query the sequence headers from the database, for example with:
Or search the database and request the sequences to be named by their description.