Question

calculate distance to a certain sequence in one gene

0

Entering edit mode

Marcus • 0

@5c953da8

Last seen 12 months ago

Germany

Hello everyone,

I do have a data set of analysed RNAseq experiments as a summarized experiment. Every raw is one specific triplet on a certain gene. What I want to do is calculate the distance of that position to the next GGUC sequence. So: how far is the certain triplet away from the next GGUC sequence? I have 0 idea how to start. I do know how to get the sequence of the certain gene but I thought there might me a smart shortcut, a function that can calculate distances?

I would be very happy if someone has an idea.



coorsID               gene_id                annotated          gene_name          Seq

1:7821354-782136:+     ENSG00000237491          Y                  LINC01409       GAG
1:782189-782191:+      ENSG00000237491          Y                  LINC01409       AAA  
1:783361-783363:+      ENSG00000237491          Y                 LINC01409        GAA

SummarizedExperiment Genetics • 924 views

ADD COMMENT • link updated 12 months ago by James W. MacDonald 65k • written 12 months ago by Marcus • 0

1

Entering edit mode

You can find the position of all GGTC in a given BSgenome object with something like:

Biostrings::vmatchPattern(pattern="GGTC", subject=BSgenome.Hsapiens.UCSC.hg38)

...and then use these positions together with James suggestions to find the nearest ones per coorsID.

ADD REPLY • link 12 months ago ATpoint ★ 4.0k

score 1 · Answer 1 · 2023-04-18

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

You can use distanceToNearest to find the distances. You would need your GGUC sequences in a GRanges object, and the other GGUC sequences you care about in another GRanges. Or if you are just asking for the distances between the ones you already know, you can just use the one. As an example

> library(GenomicRanges)
> gr <-GRanges(rep("chr1", 5), IRanges(c(2,57,172, 289, 4590), width = 3))
> distanceToNearest(gr)
Hits object with 5 hits and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           2 |        52
  [2]         2           1 |        52
  [3]         3           2 |       112
  [4]         4           3 |       114
  [5]         5           4 |      4298
  -------
  queryLength: 5 / subjectLength: 5

ADD COMMENT • link 12 months ago James W. MacDonald 65k

1

Entering edit mode

Also, note that you can make a GRanges using the 'coorsID', once you fix the first sequence, which is obviously incorrect.

## fix the first sequence, as chr1:7821354-782136:+ has a start > end
> coorsID <- c("1:782134-782136:+", "1:782189-782191:+","1:783361-783363:+")
> GRanges(coorsID)
GRanges object with 3 ranges and 0 metadata columns:
      seqnames        ranges strand
         <Rle>     <IRanges>  <Rle>
  [1]        1 782134-782136      +
  [2]        1 782189-782191      +
  [3]        1 783361-783363      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

ADD REPLY • link 12 months ago James W. MacDonald 65k

0

Entering edit mode

Hi everyone!

Thanks a lot for your help. It took me a few days to get familiar with the functions but now it works the way you suggested. However, I still have 2 problems that I cant solve. a) I am actually looking for a sequence in the RNA. Of course I use the DNA data cause they correspond anyway. But the problem is that its single stranded. So, if my sequence is on the + strand, I only want to get those GGTC sequences that are also on the + strand. What I did now is that I filtered first for + strand and - strand (GGTCpos are the positions on the + strand, GGTCneg those on the - strand), ran the distanceToNearest() function and then merged the GRs again. Is there a more elegant way of doing it?

coorsID <- c("1:782134-782136:+", "1:782189-782191:+","1:783361-783363:+")
df_pos <- data.frame(coorsID) %>%
   dplyr::filter(strand == "+")
df_neg <- data.frame(coorsID) %>%
   dplyr::filter(strand == "-")

gr_pos <- GRanges(df_pos)
gr_neg <- GRanges(df_neg)

distanceToNearest(gr_pos, GGTCpos)
distanceToNearest(gr_neg, GGTCneg)

b) Does someone know if there is a way to distinguish between upstream and downstream distance? I noticed that it gives you the same distance if it is eg 300nt down or upstream. But would there be a way to see from the output if the neared GGTC is found down or upstream of my sequence?

Thanks a lot !

ADD REPLY • link 12 months ago Marcus • 0

0

Entering edit mode

Please don't add an additional question as an answer! I have moved your question to a reply instead.

Note that by default the distances are only computed between positions on the same strand.

> coorsID <- c("1:782134-782136:+", "1:782189-782191:-","1:783361-783363:+")
> fakeranges <- c("1:782155-782157:-","1:782173-782175:+", "1:783350-783353:+")
> coorsgr <- GRanges(coorsID)
> fakegr <- GRanges(fakeranges)
> fo <- distanceToNearest(coorsgr, fakegr)
> fo
Hits object with 3 hits and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           2 |        36
  [2]         2           1 |        31
  [3]         3           3 |         7
  -------
  queryLength: 3 / subjectLength: 3

As compared to

> distanceToNearest(coorsgr, fakegr, ignore.strand = TRUE)
Hits object with 3 hits and 1 metadata column:
      queryHits subjectHits |  distance
      <integer>   <integer> | <integer>
  [1]         1           1 |        18
  [2]         2           2 |        13
  [3]         3           3 |         7
  -------
  queryLength: 3 / subjectLength: 3

As for directionality, there are precede and follow that you could use

precedeOrFollow <- function(x, y, hits){
    prec <- precede(x, y[subjectHits(hits)])
    fol <- follow(x, y[subjectHits(hits)])
    dir <- vector(length = length(hits))
    dir[!is.na(prec)] <- "precedes"
    dir[!is.na(fol)] <- "follows"
    mcols(hits)$dir <- dir
    hits
}

> fo <- precedeOrFollow(coorsgr, fakegr, fo)
> fo
Hits object with 3 hits and 2 metadata columns:
      queryHits subjectHits |  distance         dir
      <integer>   <integer> | <integer> <character>
  [1]         1           2 |        36    precedes
  [2]         2           1 |        31    precedes
  [3]         3           3 |         7     follows
  -------
  queryLength: 3 / subjectLength: 3

And you could add the sequence as well.

addSeq <- function(x, hits, bsgenome) {
    seqlevelsStyle(x) <- "UCSC"
    seq <- getSeq(bsgenome, x[subjectHits(hits)])
    mcols(hits)$seq <- seq
    hits
}

> library(BSgenome.Hsapiens.UCSC.hg38)
> fo <- addSeq(fakegr, fo, Hsapiens)
> fo
Hits object with 3 hits and 3 metadata columns:
      queryHits subjectHits |  distance         dir            seq
      <integer>   <integer> | <integer> <character> <DNAStringSet>
  [1]         1           2 |        36    precedes            GTA
  [2]         2           1 |        31    precedes            AAG
  [3]         3           3 |         7     follows           AGTA
  -------
  queryLength: 3 / subjectLength: 3

Do note that 'precedes' and 'follows' in this case indicates the directionality of the distance between the query and the subject. You might want to switch that, depending on how you want to think about it.

ADD REPLY • link 12 months ago James W. MacDonald 65k