Question

RSVSim insertions from chr1 to chr2

1

Entering edit mode

FDG ▴ 70

@fdg-17954

Last seen 5.4 years ago

####edit

For the Stack exchange post, that eleborates more

https://bioinformatics.stackexchange.com/questions/5269/rsvsim-insertions-from-chr1-to-chr2

####edit

I have a question and I tried a lot of options already, but not with the desired result.

So I want to simulate some insertions. Because I don't want my insertions to be a random sequence, I made a DNAseq that is named chr1 (which has 100.000bp of similar DNA seq) and chr2 (which is the seq I want my insertions in).

So I want to take a DNA seq from chr1 and put them in chr2 as an insert.

If I simulate SVs, they either copy/cut from chr2 to chr2 (which essentially makes a duplication or DEL+INS) or if you don't specify anything, it is random. Sometimes from chr1 -> chr2 and sometimes chr2 -> chr1. This is really annoying. Trying to set chrA ="chr1" and chrB="chr2" doesn't work. Also the examples pasted below do not give the desired result.

Using translocations also wont give me the effect that I want, because you cannot generate more translocations than genomes/chromosomes that you add.

**

However, when processing INS, I found out that it just copies a DNA seq from one part, and puts it somewhere else in the same DNA seq (Bascially a DUP then), or if you specify another an extra DNAseq, to cut from, It randomly cuts from either chr1 or chr2 and puts it in the other. This creates deletions in the DNA seq I want to investigate, which is not what I want.

Does anyone know how to get this kind of result?

    width seq                                                                                                                           names               
    [1]    40 AAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTT    chr1
    [2]    40 GGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC    chr2

to this:

    width seq                                                                                                                           names               
    [1]    32 AAAAAAAAAAAAAAAATTTTTTTTTTTTTTTT                    chr1
    [2]    48 GGGGGGGGGGGGGGAAAATTTTGGGGGGCCCCCCCCCCCCCCCCCCCC    chr2

**

I have tried several things already over the past hours, I dont even have all the examples anymore. Could someone help with this? I need 30 INS on random places of chr2 of varying sizes.

   
#some things I tried, see stackexchange  post for more examples
seq_random <- readDNAStringSet("/path/random.fasta", "fasta")
seq_ref <- readDNAStringSet("/path/reference1MB.fasta", "fasta")
genome2 = DNAStringSet(c(seq_random, seq_ref))
names(genome2) = c("chr1","chr2")
length_seq = width(genome2[2])
knownInsertion = GRanges(IRanges(0,length_seq), seqnames="chr2")
knownInsertion = GRanges(IRanges(0,length_seq), seqnames="chr1", chrB="chr2", chrA="chr1")
knownInsertion = GRanges(seqnames="chr1", chrB="chr2")
sim = simulateSV(output='/folder/of/output/', genome=genome2, 
                 ins = 30, sizeIns=y, regionsIns=knownInsertion)

rsvsim • 1.9k views

ADD COMMENT • link 5.5 years ago FDG ▴ 70

0

Entering edit mode

Can someone elaborate if there is something missing or not clear in my question? I really need the answer, because if it is not possible I have to rewrite a lot of stuff, including code.

ADD REPLY • link 5.5 years ago FDG ▴ 70

0

Entering edit mode

Please, how am I able to contact BioConductor? The only way possible seems to be this forum and other posts seem to be answered within the hour.

ADD REPLY • link 5.5 years ago FDG ▴ 70

0

Entering edit mode

You've tagged your question with 'RSVSim' to indicate that you have a problem with the RSVSim package. If the maintainer of the RSVSim package does not monitor this forum (all maintainers are required to subscribe, but...) or if they are simply as busy as you and have not had a chance to respond in the last 2 days, or if other users of RSVSim are not able to help you, or if there are not that many users of RSVSim monitoring this forum, then you will not receive a response quickly...

You could try to reach out directly to maintainer("RSVSim").

ADD REPLY • link 5.5 years ago Martin Morgan 25k

score 3 · Accepted Answer · 2018-10-26

3

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 22 hours ago

United States

I don't know how to use the RSVSim package. In Biostrings, and here I am also not an expert, I did the following, partly using your more explicit example in the post you made to stackexchange.

Start with creating an example 'genome'

library(Biostrings)
library(GenomicRanges)

genome = DNAStringSet(
    c(chr1 = "AAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTT",
      chr2 = "GGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC")
)

Here's the sequence we'd like to delete from chr1 and insert into chr2

knownInsertion = GRanges(IRanges(16, 25), seqnames="chr1")
sequence <- genome[knownInsertion]

And then the insertion

> replaceAt(genome["chr2"], 20, sequence)
  50-letter "DNAString" instance
seq: GGGGGGGGGGGGGGGGGGGAAAAATTTTTGCCCCCCCCCCCCCCCCCCCC

and deletion

> replaceAt(genome[seqnames(knownInsertion)], ranges(knownInsertion), "")
  30-letter "DNAString" instance
seq: AAAAAAAAAAAAAAATTTTTTTTTTTTTTT

The genome could be modified by assignment, e.g., genome["chr2"] <- replaceAt(.... I believe that replaceAt() is vectorized, so can perform several insertions / deletions at a time. I'm not sure if that's a helpful starting point or not.

ADD COMMENT • link 5.5 years ago Martin Morgan 25k

0

Entering edit mode

I will have a look into it, thanks for the reply! Now it doesn't seem to be at random places though, but that is something I can implement later. I let you know how it goes.

ADD REPLY • link 5.5 years ago FDG ▴ 70

0

Entering edit mode

While it does give the desired result in the example, it does not give me the desired result overall. So now my code is:

library(Biostrings)
library(GenomicRanges)

genome = DNAStringSet(
  c(chr1 = "AAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTT",
    chr2 = "GGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC"))

knownInsertion = GRanges(IRanges(16, 25), seqnames="chr1")
sequence <- genome[knownInsertion]
genome["chr2"] <- replaceAt(genome["chr2"], 20, sequence)
genome["chr1"] <- replaceAt(genome[seqnames(knownInsertion)], ranges(knownInsertion), "")
genome

And the result is what you pasted above, I now don' t have any output file, specifying where all the modifications were made. RSVSim produces "insertions.csv" (same for other SVs). It specifies exactly where each SV is inserted into the reference genome (in this case chr2) (see also the sim(metadata) from the StackExchange). This is something I need downstream, or something similar. Even if I have the numbers only, I could put that in a CSV myself.

So while in theory it gives the result that I gave as an example, I cannot use it in this way, and also I see no possibility to randomly distribute 30 INS with varying sizes.

ADD REPLY • link 5.5 years ago FDG ▴ 70

1

Entering edit mode

This function samples from random positions in the genome (approximately; almost certainly there are 'off-by-one' errors)

sampleFromGenome <- function(genome, wd) {
    chrProb <- width(genome) / sum(width(genome))
    chr <- sample(names(genome), n, TRUE, prob = chrProb)
    start <- runif(n, 1, nchar(genome[chr]) - wd)
    GRanges(IRanges(start, width = wd), seqnames = chr)
}

so you could get random places to sample from and insert at with widths between, for instance, 10 and 15 with

n <- 30
wd <- sample(10:15, n, TRUE)
insertFrom <- sampleFromGenome(genome, wd)
insertAt <- sampleFromGenome(genome, 0)

these values are easily coerced to a data.frame and written to a file, e.g.,

writeTable(insertFrom, stdout()) . # replace stdout with a file name

I don't know what you're trying to do, 30 separate simulations of insertions, or a single simulation with 30 insertions. I think for the latter you could generate the sequences then split them by chromosome and replace; I think you want to remove the insertion sequences as well; you'd want to take some care in dealing with indels that occur at overlapping positions, and the order in which these are applied; maybe when the sample size is small compared to the entire genome one doesn't have to worry about this too much.... So a start at the insertion part is

sequence <- genome[insertFrom]
at <- split(insertAt, seqnames(insertAt))[names(genome)]
atSequence <- split(sequence, seqnames(insertAt))[names(genome)]

replaceAt(genome, unname(ranges(at)), atSequence)

with similar approach for the deletions. I guess RSVsim maintainer has thought through many of these issues and you've been benefiting from the maintainer's effort; now there is an opportunity to work through the detailed logic yourself.

Hope the above is helpful.

ADD REPLY • link 5.5 years ago Martin Morgan 25k

0

Entering edit mode

Sorry for the late reply. I will try this out today. thanks again for the help

ADD REPLY • link 5.5 years ago FDG ▴ 70

0

Entering edit mode

I looked at what you wrote, and it almost seems to do the trick.
The problem is that the output is now still the same as with simulateSV. Sometimes the insertion is in chr1 and sometimes in chr2. I really need it to go from chr1 --> chr2 (the deletions don't even matter, but that was something that simulate SV did).
So with a small example of your code (for testing)

sampleFromGenome <- function(genome, wd) {
  chrProb <- width(genome) / sum(width(genome))
  chr <- sample(names(genome), n, TRUE, prob = chrProb)
  start <- runif(n, 1, nchar(genome[chr]) - wd)
  GRanges(IRanges(start, width = wd), seqnames = chr)
}

n <- 2
wd <- sample(5:10, n, TRUE)
insertFrom <- sampleFromGenome(genome, wd)
insertAt <- sampleFromGenome(genome, 0)

sequence <- genome[insertFrom]
at <- split(insertAt, seqnames(insertAt))[names(genome)]
atSequence <- split(sequence, seqnames(insertAt))[names(genome)]
replaceAt(genome, unname(ranges(at)), atSequence)

  A DNAStringSet instance of length 2
    width seq                                                                                                                           names           
[1]    49 AAAAAAAAAAAAAAACCCCCCCCCAAAAATTTTTTTTTTTTTTTTTTTT                                                                             chr1
[2]    47 GGGGGGTTTTTTTGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC                                                                               chr2

> insertAt
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2  [ 7,  6]      *
  [2]     chr1  [16, 15]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

The problem lies here (insertAt) I think,because the Rle seems to be chr1 as well as chr2, while it should be chr2 only.

The other stuff (like getting the numbers I need downstream) I can get very well with your code, so that is perfect. I try some stuff myself also today, but I just wanted to put this here in case you find it much quicker or know the answer already. R statistics is not really my speciality.

ADD REPLY • link 5.5 years ago FDG ▴ 70

0

Entering edit mode

I think I got it now. Do you see any problems in this code?

genome = DNAStringSet(
  c(chr1 = "AAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTT",
    chr2 = "GGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC"))

sampleFromGenome <- function(genome, wd) {
  chrProb <- width(genome) / sum(width(genome))
  chr <- sample(names(genome), n, TRUE, prob = chrProb)
  start <- runif(n, 1, nchar(genome[chr]) - wd)
  GRanges(IRanges(start, width = wd), seqnames = chr)
}

n <- 2
wd <- sample(5:10, n, TRUE)
insertFrom <- sampleFromGenome(genome[1], wd)
insertAt <- sampleFromGenome(genome[2], 0)

#writeTable(insertFrom, stdout()) . # replace stdout with a file name

sequence <- genome[insertFrom]
at <- split(insertAt, seqnames(insertAt))[names(genome[2])]
atSequence <- split(sequence, seqnames(insertAt))[names(genome[2])]

replaceAt(genome[2], unname(ranges(at)), atSequence)

  A DNAStringSet instance of length 1
    width seq                                                                                                                           names               
[1]    54 GGGGGGGGGGGGGGGGGGTTTTTTTTTGGCCCCCCCCCCCCTTTTTCCCCCCCC                                                                        chr2

ADD REPLY • link 5.5 years ago FDG ▴ 70

0

Entering edit mode

yes this looks ok to me.

ADD REPLY • link 5.5 years ago Martin Morgan 25k

score 3 · Accepted Answer · 2018-10-30

Finally found a solution based on RSVSim

    genome = DNAStringSet(
      c(chr1 = "AAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTT",
        chr2 = "GGGGGGGGGGGGGGGGGGGGCCCCCCCCCCCCCCCCCCCC"))
    
    length_seq_chr1 = width(genome[1]) - 10
    length_seq_chr2 = width(genome[2]) - 10
    sizes <- c(3,3,5,5)
    start_chr1 <- c("list", length(sizes))
    start_chr2 <- c("list", length(sizes))
    end_chr1 <- c("list", length(sizes))
    index = 1
    for(i in sizes){
      newstartchr1 <- sample(1:length_seq_chr1,1)
      newstartchr2 <- sample(1:length_seq_chr2,1)
      start_chr1[[index]] <- newstartchr1
      start_chr2[[index]] <- newstartchr2
      end_chr1[[index]] <- newstartchr1 + (i-1)
      index <- index + 1}
    start_chr1 <- as.integer(start_chr1)
    end_chr1 <- as.integer(end_chr1)
    start_chr2 <- as.integer(start_chr2)
    
    knownInsertion2 = GRanges(IRanges(c(start_chr1),c(end_chr1)),seqnames="chr1", chrB="chr2", startB=c(start_chr2), copied=TRUE)
    knownInsertion2
    
    sim = simulateSV(output=NA, genome=genome, regionsIns=knownInsertion2, bpSeqSize=5, random=FALSE, verbose=FALSE)
    sim
    metadata(sim)

Gives as an result:

    > knownInsertion2
    GRanges object with 4 ranges and 3 metadata columns:
          seqnames    ranges strand |        chrB    startB    copied
             <Rle> <IRanges>  <Rle> | <character> <integer> <logical>
      [1]     chr1  [21, 24]      * |        chr2        30      TRUE
      [2]     chr1  [16, 19]      * |        chr2        21      TRUE
      [3]     chr1  [ 6, 11]      * |        chr2        10      TRUE
      [4]     chr1  [ 9, 14]      * |        chr2        27      TRUE
      -------
      seqinfo: 1 sequence from an unspecified genome; no seqlengths
    >
    > sim = simulateSV(output=NA, genome=genome, regionsIns=knownInsertion2, bpSeqSize=5, random=FALSE, verbose=FALSE)
    > sim
      A DNAStringSet instance of length 2
        width seq                                                                                                                           names               
    [1]    40 AAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTT                                                                                      chr1
    [2]    60 GGGGGGGGGAAAAAAGGGGGGGGGGGAAAACCCCCCAAAAAACCCTTTTCCCCCCCCCCC                                                                  chr2
    > metadata(sim)
    $insertions
      Name ChrA StartA EndA ChrB StartB EndB Size Copied BpSeqA BpSeqB_5prime BpSeqB_3prime
    1    1 chr1     21   24 chr2     30   33    4   TRUE                 CCTT          TTCC
    2    2 chr1     16   19 chr2     21   24    4   TRUE                 GGAA          AACC
    3    3 chr1      6   11 chr2     10   15    6   TRUE                 GGAA          AAGG
    4    4 chr1      9   14 chr2     27   32    6   TRUE                 CCAA          AACC