Search
Question: Transform index to chromosome in DECIPHER package
0
9 months ago by
Brazil

I am trying to understand how the index from the synteny analysis in the DECIPHER package can be translated into chromosomes. I basically used what is in the official website:

# load the DECIPHER library in R
library(DECIPHER)

# specify the path to each FASTA file (in quotes)
# each genome must be given a unique identifier here
# for example: Genome1, Genome2, etc.
fas <- c(Genome1="<<REPLACE WITH PATH TO FASTA FILE 1>>",
Genome2="<<REPLACE WITH PATH TO FASTA FILE 2>>",
Genome3="<<REPLACE WITH PATH TO FASTA FILE 3>>")

# specify where to create the new sequence database
db <- "<<REPLACE WITH PATH TO SEQUENCE DATABASE>>"

# load the sequences from the file in a loop
for (i in seq_along(fas)) {
Seqs2DB(fas[i], "FASTA", db, names(fas[i]))
}

# map the syntenic regions between each genome pair
synteny <- FindSynteny(db)

# print a summary of the syntenic map (optional)
synteny

# view the syntenic regions (optional)
pairs(synteny) # displays a dot plot of all pairs
plot(synteny) # displays a bar plot of adjacent pairs

# perform alignments of all pairs of genomes
DNA <- AlignSynteny(synteny, db)

# view the aligned syntenic blocks for each pair
unlist(DNA[[1]]) # Genomes 1 and 2
unlist(DNA[[2]]) # Genomes 1 and 3
unlist(DNA[[3]]) # Genomes 2 and 3

Accessing the sequences of DNA[[3]]:

 unlist(DNA[[3]][1])   A DNAStringSet instance of length 2       width seq                                             names [1] 1962640 CTAACTTGACACCTTTCCCTTG...TCATTCCCCTTCCCACACACAC 1.Genome2 [2] 1962640 CTAACTTGACACCTTTCCCTTG...TCATTCCTCTTCCCACACACAC 1.Genome3

The correspondent genomic region should be the first row of the following data-frame:

head(synteny[[6]])
index1 index2 strand  score   start1   start2     end1     end2 first_hit
[1,]     10     12      0 307470 15003547 14701712 16785549 16531717         1
[2,]      6      8      0 153713 33444467 32224687 34316599 33159921      8622
[3,]     12     14      0 146558 15237602 15554593 16266481 16701047     11977
[4,]      5      7      0 130431 47010916 48608979 48610127 50288861     16388
[5,]     11     13      0 128074  3879383  9476868  4991057 10661494     21719
[6,]      8     10      0 124988 22515057 23515792 23999163 25084001     25947
last_hit
[1,]     8621
[2,]    11976
[3,]    16387
[4,]    21718
[5,]    25946
[6,]    31151

Therefore, I imported the 'FASTA' file of the Genome3 to understand if the order of the sequences (indexing).

library("Biostrings")
G = readDNAStringSet("GTgenome1.1.fa")

At first, I assumed that it would obey the same order of the G object (i.e. first row would be index = 1, second row index = 2 and so on):

A DNAStringSet instance of length 1674
width seq                                        names
[1]  20202851 CCCAGTTTTCCCCACTCTGT...AGATCTTACAACCGATTTT chr10 Parus_Major...
[2]  20315886 AGCCGACGAGACTCACAGAA...ACAAACCCCCTCGGGAGGG chr11 Parus_Major...
[3]  20466350 GATTAGACCTCCGAAAGGGG...TATTAATTATTAAATATTA chr12 Parus_Major...
[4]  16480340 GTCTCCACTTGCCCCACAAC...ATGACGATGATGAAGATGA chr13 Parus_Major...
[5]  16193477 CTCTGTGACATCACAGCCAT...GTTACACACGTTGTTTTTT chr14 Parus_Major...



If we take the sequence of the first row from synteny[[6]] object (index2 is = 12):

unlist(DNA[[3]][1])
A DNAStringSet instance of length 2
width seq                                             names
[1] 1962640 CTAACTTGACACCTTTCCCTTG...TCATTCCCCTTCCCACACACAC 1.Genome2
[2] 1962640 CTAACTTGACACCTTTCCCTTG...TCATTCCTCTTCCCACACACAC 1.Genome3



Checking the chromosome of the first part of the sequence of Genome3:

ali <- vmatchPattern("CTAACTTGACACCTTTCCCTTG", G, max.mismatch=0)
forw <- unlist(ali)
names(forw) <- gsub("Parus_Major_build_May2013", "", names(forw))
forw <- makeGRangesFromDataFrame(forw, seqnames.field=c("names"), keep.extra.columns=TRUE)
forw

GRanges object with 1 range and 0 metadata columns:
seqnames               ranges strand
<Rle>            <IRanges>  <Rle>
[1]   chr10  [14701712, 14701733]      *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths



Based on that, I discovery that when the index2 is equal to 12, it corresponds to chromosome 10. However, it should be a easier way to understand how the index reflects the chromosome names in a FASTA file. I would be glad for any light in here.

modified 9 months ago by Erik Wright130 • written 9 months ago by Vinicius Henrique da Silva20
0
9 months ago by
Erik Wright130
Erik Wright130 wrote:

Thanks for your interest in DECIPHER.

The synteny object that is output by FindSynteny is defined in Synteny-class, see:

?Synteny-class

Since you have 3 genomes, your Synteny object will be a 3 x 3 matrix, each each cell containing a list.  Therefore, synteny[[6]] corresponds to the lower triangle of the matrix, genome 3 x genome 2 (row i = 3, column j = 2).  The lower triangle of the matrix contains syntenic blocks (i.e., collinear regions) and the upper triangle contains hits (i.e., exact k-mer matches in an alphabet).

In the (lower triangle) matrix that you showed, "index1" corresponds to the jth genome and "index2" the ith genome.  Hence, the 10th index of Genome2 (positions 15003547 - 16785549) is collinear with the 12th index of Genome3 (positions 14701712 - 16531717).

The easiest way to understand Synteny objects is through toy examples:

dbConn <- dbConnect(SQLite(), ":memory:")
s1 <- DNAStringSet("ACTAGACCCAGACCGATAAACGGACTGGACAAG")
s3 <- reverseComplement(s1)
s2 <- c(s1, s3)
Seqs2DB(c(c(s1, s2), s3),
"XStringSet",
dbConn,
c("s1", "s2", "s2", "s3"))
syn <- FindSynteny(dbConn, minScore=1)
dbDisconnect(dbConn) # release the in-memory database

We can now look at the resulting Synteny object:

> syn # Note:  > 100% hits because of sequence reuse across blocks
s1        s2        s3
s1    1 seq 200% hits 100% hits
s2 2 blocks    2 seqs 200% hits
s3  1 block  2 blocks     1 seq

Now we can see that s1 matches both indices (sequences) of s2:

> syn[2, 1][[1]] # i = 2 (s2) and j = 1 (s1)
index1 index2 strand score start1 start2 end1 end2 first_hit last_hit
[1,]      1      1      0    45      1      1   33   33         1        1
[2,]      1      2      1    45      1      1   33   33         2        2​

The identifier s2 is composed of two indices (sequences) and corresponds to index2 (i = 2) in the lower triangle.  The identifier s1 is composed of one index and corresponds to index1 (j=1) in the lower triangle.

Note that the role of i and j are flipped in the upper triangle (hits) of the Synteny object.  That is, in both the upper triangle and lower triangle index1 and index2 point to the same identifier, even though i and j are switched:

> syn[1, 2][[1]] # i = 1 (s1) and j = 2 (s2)
​     index1 index2 strand width start1 start2 frame1 frame2
[1,]      1      1      0    33      1      1      0      0
[2,]      1      2      1    33      1     33      0      0​

Finally, the index number only refers to the index of a sequence corresponding to an identifier -- it has nothing to do with the sequence names.  However, you can lookup the sequence names once you know the index.

I hope that helps clarify the structure of Synteny objects.

Hi Erik, thank you for your prompt answer. In your example you used 'XStringSet; instead 'FASTA' directly. Running the analysis again using your example will work fine. However, would be nice to understand what I have already, where I used the 'FASTA' files. As far I understand, the 'Seqs2DB' function places an index automatically for each sequence within a 'FASTA' file. Each of these sequences correspond to a specific chromosome, that has a name. My main question is: How can I find out the correspondence between the index and the original sequence name (i.e chromosome name). Unfortunately I was unable to find it in your answer. I would be glad for any further help.

1

Thanks for clarifying your question.  The sequences are numbered (indexed) according to their order in the imported file.  You can query the sequence headers from the database, for example with:

desc <- dbGetQuery(dbConn, 'select description from Seqs where identifier is "Genome2"')
desc\$description

Or search the database and request the sequences to be named by their description.

dna <- SearchDB(dbConn, identifier="Genome2", nameBy="description")
names(dna)