Question

How to use renameSeqlevels with GenomicAlignments

teresati ▴ 10

Dear all,

I would like to ask whether it is possible to change the seqnames of a BAM file object by passing a character vector to the function renameSeqlevels. I need this because the seqnames have to match in order to use summarizeOverlaps or countOverlaps/findOverlaps.

From the bamfile object below, I have extracted the locus annotations from the seqnames (e.g. ERCC00002, NC_001133.9, etc.), since the seqnames are constructed differently from the ones in the gff file, and I have created a list (same length as the seqlevels of the bam file).

bamfile
GAlignments object with 6 alignments and 0 metadata columns:
                                                                     seqnames
                                                                        <Rle>
[1] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[2] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[3] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[4] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[5] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[6] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
      strand       cigar    qwidth     start       end     width     njunc
       <Rle> <character> <integer> <integer> <integer> <integer> <integer>
[1]      +     8M2D27M        35      1025      1061        37         0
[2]      +     8M2D27M        35      1025      1061        37         0
[3]      -         36M        36      1025      1060        36         0
[4]      -         36M        36      1026      1061        36         0
[5]      +         35M        35      1027      1061        35         0
[6]      +         35M        35      1027      1061        35         0
-------

gffile
GRanges object with 6 ranges and 12 metadata columns:
         seqnames           ranges strand |       source     type     score
            <Rle>        <IRanges> <Rle> |     <factor> <factor> <numeric>
[1] NC_001133.9 [ 24837, 25070]      + | s_cerevisiae     exon      <NA>
[2] NC_001133.9 [ 25048, 25394]      + | s_cerevisiae     exon      <NA>
[3] NC_001133.9 [ 27155, 27786]      + | s_cerevisiae     exon      <NA>
[4] NC_001133.9 [ 73431, 73792]      + | s_cerevisiae     exon      <NA>
[5] NC_001133.9 [165314, 165561]      + | s_cerevisiae     exon      <NA>
[6] NC_001133.9 [165388, 165781]      + | s_cerevisiae     exon      <NA>
          phase     gene_id transcript_id exon_number   gene_name
      <integer> <character>    <character> <character> <character>
[1]      <NA> XLOC_000040 TCONS_00000191           1        FLO9
[2]      <NA> XLOC_000040 TCONS_00000192           1        FLO9
[3]      <NA> XLOC_000041 TCONS_00000193           1        FLO9
[4]      <NA> XLOC_000055 TCONS_00000200           1   YAL037C-A
[5]      <NA> XLOC_000075 TCONS_00000100           1     YAR010C
[6]      <NA> XLOC_000075 TCONS_00000219           1     YAR010C
                                         oId nearest_ref class_code
                                 <character> <character> <character>
[1]   {TRINITY_GG_normal}16_c1_g1_i1.mrna1        rna8           x
[2]   {TRINITY_GG_normal}16_c0_g1_i1.mrna1        rna8           x
[3]   {TRINITY_GG_normal}12_c0_g1_i1.mrna1        rna8           x
[4]    {TRINITY_GG_normal}3_c3_g1_i1.mrna1       rna31           x
[5] {TRINITY_GG_normal}3479_c0_g1_i1.mrna1       rna77           x
[6]   {TRINITY_GG_normal}24_c0_g1_i1.mrna1       rna77           x
           tss_id
      <character>
[1]       TSS42
[2]       TSS43
[3]       TSS44
[4]       TSS71
[5]      TSS118
[6]      TSS118
-------

Is it possible to replace the seqlevels with the names in the character vector?
I have tried:

bamfile1 <- renameSeqlevels(seqlevels(bamfile), listx)

Thank you for any advice,

Kind regards

annotation GAlignments bamfileobject renameseqlevels • 3.3k views

8.9 years ago teresati ▴ 10

score 0 · Answer 1 · 2017-02-14

Yes, renameSeqlevels() should work, but you have to call it on your GAlignments object, not on seqlevels(bamfile). Also, I'm not sure what listx is in your code above or how you made it. Your description of what you did is confusing. It is generally more useful to show the exact code you used, including any error messages and/or warnings you got.

H.
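
As a sketch of the advice in the answer above: renameSeqlevels() takes the object itself as its first argument and a character vector of new names as its second. The toy GAlignments object and the shortened sequence names below are hypothetical stand-ins for the real BAM-derived data; a named vector (names are the old levels, values are the new ones) is used here so the mapping does not depend on the order of the levels.

```r
suppressPackageStartupMessages(library(GenomicAlignments))

# Hypothetical long sequence names, shortened from the style shown in the post.
old <- c("tx1_loc:ERCC00002|1-1061", "tx2_loc:ERCC00003|1-1023")

# Toy GAlignments object with three alignments (made-up positions and CIGARs).
gal <- GAlignments(
  seqnames = Rle(factor(old, levels = old), c(2L, 1L)),
  pos      = c(1025L, 1026L, 10L),
  cigar    = c("36M", "36M", "35M"),
  strand   = Rle(strand(c("+", "-", "+")))
)

# Named character vector: names = current seqlevels, values = new seqlevels.
map <- setNames(c("ERCC00002", "ERCC00003"), old)

# Call renameSeqlevels() on the object, not on seqlevels(gal).
gal2 <- renameSeqlevels(gal, map)
seqlevels(gal2)
```

The result is a new GAlignments object with the same alignments and the renamed levels; the original object is left unchanged.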

score 0 · Answer 2 · 2017-02-14

Apologies. I have loaded a BAM file (reads mapped to transcripts) and a gff file:

>library(GenomicAlignments)
>bamfileS1 <-list.files("/path/to/bam/S1.bam", pattern="bam$", full=TRUE)
GAlignments object
                                                                     seqnames
                                                                        <Rle>
  [1] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
  [2] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
      strand       cigar    qwidth     start       end     width     njunc
       <Rle> <character> <integer> <integer> <integer> <integer> <integer>
  [1]      +     8M2D27M        35      1025      1061        37         0..


>gff0 <- import("/path/gffFile")

GRanges object with 6 ranges and 12 metadata columns:
         seqnames           ranges strand |       source     type     score
            <Rle>        <IRanges>  <Rle> |     <factor> <factor> <numeric>
  [1] NC_001133.9 [ 24837,  25070]      + | s_cerevisiae     exon      <NA>
  [2] NC_001133.9 [ 25048,  25394]      + | s_cerevisiae     exon      <NA>..
          phase     gene_id  transcript_id exon_number   gene_name
      <integer> <character>    <character> <character> <character>
  [1]      <NA> XLOC_000040 TCONS_00000191           1        FLO9...

I would like to obtain a table of counts, but I get a warning.

>tr_hits <- summarizeOverlaps(gff0, bamfileS1, mode="Union",singleEnd=FALSE)

-->Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

> table(assay(tr_hits))

    0
26664

The problem is probably that the seqnames of the bam object and the gff are different. To fix this I could modify the seqnames of the bamfile object. An example of a seqname in the bamfile object:

[1] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
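
A quick way to confirm the diagnosis is to compare the two sets of sequence names directly. In the minimal sketch below, the two character vectors are hypothetical stand-ins for what seqlevels(bamfileS1) and seqlevels(gff0) would return; only one name from each object is shown in the post, so the real vectors are longer.

```r
# Stand-ins for seqlevels(bamfileS1) and seqlevels(gff0); assumed, not the full data.
bam_levels <- c("DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061")
gff_levels <- c("NC_001133.9")

# Names present in both objects; an empty result matches the warning
# "The 2 combined objects have no sequence levels in common".
shared <- intersect(bam_levels, gff_levels)
length(shared)
```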

I need to extract only "ERCC00002" (the part after "loc:") in order to obtain something like the names in the gff ("NC_001133.9").
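
The extraction described here can also be sketched in a single step with base R's sub() and a capture group, as an alternative to splitting the string twice. The first name is copied from the post; the second one uses a made-up prefix (ACC2) and is only there to show that the pattern works on more than one name.

```r
old <- c(
  "DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061",
  "ACC2_gene=ERCC00003_loc:ERCC00003|1-1023|+_exons:1-1023_segs:1-1023"  # hypothetical
)

# Keep only the text between "_loc:" and the first "|".
locus <- sub("^.*_loc:([^|]+)\\|.*$", "\\1", old)
locus
```

This assumes every name contains exactly one "_loc:" followed by a "|"; a name that does not match the pattern is returned unchanged by sub(), so it is worth checking the result before using it.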

>sname1<- bamfileS1@seqnames@values #create a list of seqnames of bam object

>library(stringr)

>sname2<-str_split_fixed(sname1, ":", 3)  #split at ':', 3 columns

> head(sname2)
              [,1]                          [,2]                     
[1,] "DQ459430_gene=ERCC00002_loc" "ERCC00002|1-1061|+_exons"..
     [,3]               
[1,] "1-1061_segs:1-1061"..

>sname3<- str_split_fixed(sname2[,2],fixed("|"),3) #split at "|"

     [,1]        [,2]     [,3]
[1,] "ERCC00002" "1-1061" "+_exons"
[2,] "ERCC00003" "1-1023" "+_exons"..

>sname4<- sname3[,-(2:3)] #eliminate last 2 columns

> head(sname4)
[1] "ERCC00002" "ERCC00003" "ERCC00004" "ERCC00009" "ERCC00012" "ERCC00013"....

And then I would like to rename the seqnames of the bam object with this vector (sname4).

>bamfileS1_mod <- renameSeqlevels(seqlevels(bamfileS1), sname4)

Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘seqinfo’ for signature ‘"character"’

> bamfileS1_mod <- renameSeqlevels(bamfileS1, sname4)
Error in getSeqlevelsReplacementMode(value, seqlevels(x)) :
  the supplied 'seqlevels' must be a character vector with no NAs and no
  duplicates
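
The second error message says that the supplied vector contains NAs or duplicates. One possible cause (an assumption, since the full data is not shown) is that the vector was built from the run values of the seqnames Rle. That gives one entry per run rather than one per sequence level, so a name is repeated whenever its alignments are not contiguous. The sketch below uses hypothetical toy names to show the difference and builds the vector from the levels instead.

```r
suppressPackageStartupMessages(library(S4Vectors))

# Hypothetical long names; the first one occurs in two separate runs.
old <- c("tx1_loc:ERCC00002|1-1061", "tx2_loc:ERCC00003|1-1023")
sn  <- Rle(factor(old[c(1, 2, 1)], levels = old))

# Run values: one entry per run, so the first name appears twice.
run_names <- as.character(runValue(sn))
anyDuplicated(run_names) > 0

# Levels: one entry per distinct sequence name.
lvl_names <- levels(sn)

# Derive the new names from the levels and check them before renaming.
new_names <- sub("^.*_loc:([^|]+)\\|.*$", "\\1", lvl_names)
stopifnot(!anyNA(new_names), anyDuplicated(new_names) == 0L)

# Named vector (old level -> new level) of the kind renameSeqlevels() accepts.
map <- setNames(new_names, lvl_names)
map
```

If the check still fails on the real data, several original names are collapsing to the same locus, and a plain renaming of the levels is not enough.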

Thank you for any suggestions.