Question: Dealing with HSP in blast output with GRanges
0
gravatar for Bio_conducta
12 weeks ago by
Bio_conducta 0 wrote:

Hi everyone, I have actually a blast file format such as:

qseqid  sseqid  pident  length  mismatch    gapopen qstart  qend    sstart  send    evalue  bitscore    strand
scaffold_1  YP_009352567.1  27.0    278 163 9   20  50  222 487 1.000000e-16    94.0    +
scaffold_1  YP_009352567.1  35.5    166 93  2   130 300 63  227 2.000000e-15    89.7    +
scaffold_1  NP_056889.1 42.5    106 56  4   906 1223    910 1010    2.600000e-12    79.3    -
scaffold_1  NP_955606.1 42.5    106 56  4   906 1223    601 701 2.600000e-12    79.3    -
scaffold_3  NP_077672.1 44.4    99  51  1   14000   15000   219 317 4.400000e-12    78.6    -
scaffold_3  NP_077672.1 32.2    115 70  1   16000   20300   119 225 1.200000e-09    70.5    -
scaffold_4  YP_009173615.1  29.9    174 84  4   800 1000    209 344 4.500000e-15    88.2    -
scaffold_8  YP_009482546.1  29.1    148 98  4   42822   43265   1531    1671    6.300000e-13    80.5    -
scaffold_8  YP_009481279.1  31.1    148 94  4   3   500 1488    1627    1.100000e-12    79.7    -
scaffold_8  YP_008319462.1  29.3    140 91  4   2   500 1466    1597    5.900000e-11    73.9    -
scaffold_8  YP_008437707.2  30.0    140 90  4   3   500 1473    1604    5.900000e-11    73.9    -

and I wanted to reduce all the HSP together and get something like:

qseqid  qstart  qend
scaffold_1  20  300
scaffold_1  906 1223
scaffold_1  906 1223
scaffold_3  14000   20300
scaffold_4  800 1000
scaffold_8  42822   43265
scaffold_8  3   500
scaffold_8  2   500
scaffold_8  3   500

As you can see the coordinates of start and end changed for some scaffold because of the HSP merging.

I know that GRanges can be helpful and I tried in R :

blast<-read.table("dataframe.txt",header=T,sep="")

genes <- GRanges(seqnames=qseqid,ranges=IRanges(start=qstart,end = qend),strand=strand)

gr1 = reduce(genes)

But it does not work..

Thank you for your help.

R grangers • 109 views
ADD COMMENTlink modified 12 weeks ago by Martin Morgan ♦♦ 23k • written 12 weeks ago by Bio_conducta 0

When I tried your above code I end up with something like the following:

> reduce(genes)
GRanges object with 8 ranges and 0 metadata columns:
        seqnames      ranges strand
           <Rle>   <IRanges>  <Rle>
  [1] scaffold_1       20-50      +
  [2] scaffold_1     130-300      +
  [3] scaffold_1    906-1223      -
  [4] scaffold_3 14000-15000      -
  [5] scaffold_3 16000-20300      -
  [6] scaffold_4    800-1000      -
  [7] scaffold_8       2-500      -
  [8] scaffold_8 42822-43265      -
  -------
  seqinfo: 4 sequences from an unspecified genome; no seqlengths

In what way do you feel it doesn't work?

ADD REPLYlink written 12 weeks ago by shepherl ♦♦ 1.3k

Because the coordinates are wrong, we should get:

qseqid  qstart  qend
scaffold_1  20  300
scaffold_1  906 1223
scaffold_1  906 1223
scaffold_3  14000   20300
scaffold_4  800 1000
scaffold_8  42822   43265
scaffold_8  3   500
scaffold_8  2   500
scaffold_8  3   500

I do not understand why there is for the scaffold1 a range from 20 to 50 since there are two matches with the sseqid YP_009352567.1 One from 20 to 50 and one from 130 to 300 so those two matches are 2 HSP and should be gathered together to form a new range from 20 (the minimum of the first HSP) to 300 (the max of the second HSP)...

ADD REPLYlink written 12 weeks ago by Bio_conducta 0
Answer: Dealing with HSP in blast output with GRanges
2
gravatar for Martin Morgan
12 weeks ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

It seems like what you mean by 'reduce' is to use sseqid to split the ranges into groups, then calculate the range of each group, and simplify that as much as possible. So I created a GRanges with sseqid as an additonal column

genes <- with(df, 
    GRanges(
        seqnames=qseqid,
        ranges=IRanges(start=qstart,end = qend),
        strand=strand, sseqid = sseqid
    )
)

then split the genes into groups based on sseqid and calculated the range of each group

gr <- range(split(genes, genes$sseqid))

and then simplified

> sort(unlist(gr))
GRanges object with 9 ranges and 0 metadata columns:
                       seqnames      ranges strand
                          <Rle>   <IRanges>  <Rle>
  YP_009352567.1     scaffold_1      20-300      +
     NP_056889.1     scaffold_1    906-1223      -
     NP_955606.1     scaffold_1    906-1223      -
     NP_077672.1     scaffold_3 14000-20300      -
  YP_009173615.1     scaffold_4    800-1000      -
  YP_008319462.1     scaffold_8       2-500      -
  YP_008437707.2     scaffold_8       3-500      -
  YP_009481279.1     scaffold_8       3-500      -
  YP_009482546.1     scaffold_8 42822-43265      -
  -------
ADD COMMENTlink written 12 weeks ago by Martin Morgan ♦♦ 23k

Thank you for your help it was exactly what I needed.

ADD REPLYlink written 12 weeks ago by Bio_conducta 0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 153 users visited in the last hour