how to renameSeqlevels-GenomicAlignments
2
0
Entering edit mode
teresati ▴ 10
@teresati-12364
Last seen 7.6 years ago

Dear all,
 

I would like to ask if it is possible to change the seqnames of a bam file object, giving a vector of character to the function renameSeqlevels. This is because in order to use the fuction summarizeOverlap or count/find, the seqnames have to match.

From the bamfile object below, I have extracted the locus annotations from the seqnames (i.e ERCC00002, NC_001133.9...etc), since the seqnames are constructed differently respect to the one in the gff file, and I have created a list (same length as the seqlevels of the bam file).
 


bamfile
GAlignments object with 6 alignments and 0 metadata columns:
                                                                     seqnames
                                                                        <Rle>
  [1] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
  [2] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
  [3] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
  [4] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
  [5] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
  [6] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
      strand       cigar    qwidth     start       end     width     njunc
       <Rle> <character> <integer> <integer> <integer> <integer> <integer>
  [1]      +     8M2D27M        35      1025      1061        37         0
  [2]      +     8M2D27M        35      1025      1061        37         0
  [3]      -         36M        36      1025      1060        36         0
  [4]      -         36M        36      1026      1061        36         0
  [5]      +         35M        35      1027      1061        35         0
  [6]      +         35M        35      1027      1061        35         0
  -------

gffile
GRanges object with 6 ranges and 12 metadata columns:
         seqnames           ranges strand |       source     type     score
            <Rle>        <IRanges>  <Rle> |     <factor> <factor> <numeric>
  [1] NC_001133.9 [ 24837,  25070]      + | s_cerevisiae     exon      <NA>
  [2] NC_001133.9 [ 25048,  25394]      + | s_cerevisiae     exon      <NA>
  [3] NC_001133.9 [ 27155,  27786]      + | s_cerevisiae     exon      <NA>
  [4] NC_001133.9 [ 73431,  73792]      + | s_cerevisiae     exon      <NA>
  [5] NC_001133.9 [165314, 165561]      + | s_cerevisiae     exon      <NA>
  [6] NC_001133.9 [165388, 165781]      + | s_cerevisiae     exon      <NA>
          phase     gene_id  transcript_id exon_number   gene_name
      <integer> <character>    <character> <character> <character>
  [1]      <NA> XLOC_000040 TCONS_00000191           1        FLO9
  [2]      <NA> XLOC_000040 TCONS_00000192           1        FLO9
  [3]      <NA> XLOC_000041 TCONS_00000193           1        FLO9
  [4]      <NA> XLOC_000055 TCONS_00000200           1   YAL037C-A
  [5]      <NA> XLOC_000075 TCONS_00000100           1     YAR010C
  [6]      <NA> XLOC_000075 TCONS_00000219           1     YAR010C
                                         oId nearest_ref  class_code
                                 <character> <character> <character>
  [1]   {TRINITY_GG_normal}16_c1_g1_i1.mrna1        rna8           x
  [2]   {TRINITY_GG_normal}16_c0_g1_i1.mrna1        rna8           x
  [3]   {TRINITY_GG_normal}12_c0_g1_i1.mrna1        rna8           x
  [4]    {TRINITY_GG_normal}3_c3_g1_i1.mrna1       rna31           x
  [5] {TRINITY_GG_normal}3479_c0_g1_i1.mrna1       rna77           x
  [6]   {TRINITY_GG_normal}24_c0_g1_i1.mrna1       rna77           x
           tss_id
      <character>
  [1]       TSS42
  [2]       TSS43
  [3]       TSS44
  [4]       TSS71
  [5]      TSS118
  [6]      TSS118
  -------

It is possible to replace the seqlevels names with the ones of the vector of character?
I have tried:

bamfile1 <- renameSeqlevels(seqlevels(bamfile), listx)

Thank you for any advice,

Kind regards

annotation GAlignments bamfileobject renameseqlevels • 2.8k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Yes renameSeqlevels() should work but you have to call it on your GAlignments object, not on seqlevels(bamfile). Also I'm not sure what listx is in your code above and how you made it. Your description of what you did is confused. It is generally more useful to show the exact code you used, including the error messages and/or warnings you got.

H.

ADD COMMENT
0
Entering edit mode
teresati ▴ 10
@teresati-12364
Last seen 7.6 years ago

Apologies, I have loaded a bam file (mapped on transcript) and a gff file:

>library(GenomicAlignments)
>bamfileS1 <-list.files("/path/to/bam/S1.bam", pattern="bam$", full=TRUE)
GAlignments object
                                                                     seqnames
                                                                        <Rle>
  [1] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
  [2] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
      strand       cigar    qwidth     start       end     width     njunc
       <Rle> <character> <integer> <integer> <integer> <integer> <integer>
  [1]      +     8M2D27M        35      1025      1061        37         0..


>gff0 <- import("/path/gffFile")

GRanges object with 6 ranges and 12 metadata columns:
         seqnames           ranges strand |       source     type     score
            <Rle>        <IRanges>  <Rle> |     <factor> <factor> <numeric>
  [1] NC_001133.9 [ 24837,  25070]      + | s_cerevisiae     exon      <NA>
  [2] NC_001133.9 [ 25048,  25394]      + | s_cerevisiae     exon      <NA>..
          phase     gene_id  transcript_id exon_number   gene_name
      <integer> <character>    <character> <character> <character>
  [1]      <NA> XLOC_000040 TCONS_00000191           1        FLO9...

I would like to obtain a table of counts but I have a warning.

>tr_hits <- summarizeOverlaps(gff0, bamfileS1, mode="Union",singleEnd=FALSE)

-->Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

> table(assay(tr_hits))

    0
26664

The problem should be that the seqnames of the bam object and the gff are different. To fix this I could modify the bamfile object seqnames. An example of a seqname row of the bamfile object:

[1] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061

I should extract only "ERCC00002" (the one after ...loc:ERCC00002), in order to obtain something like in the gff ("NC_001133.9").

>sname1<- bamfileS1@seqnames@values #create a list of seqnames of bam object

>library(stringr)

>sname2<-str_split_fixed(sname1),":",3)  #split at ':', 3 columns

> head(sname2)
              [,1]                          [,2]                     
[1,] "DQ459430_gene=ERCC00002_loc" "ERCC00002|1-1061|+_exons"..
     [,3]               
[1,] "1-1061_segs:1-1061"..

>sname3<- str_split_fixed(sname2[,2],fixed("|"),3) #split at "|"

                      [,1]        [,2]     [,3]     
[1,] "ERCC00002" "1-1061" "+_exons"
[2,] "ERCC00003" "1-1023" "+_exons"..

>sname4<- sname3[,-(2:3)] #eliminate last 2 columns

> head(sname4)
[1] "ERCC00002" "ERCC00003" "ERCC00004" "ERCC00009" "ERCC00012" "ERCC00013"....

And then I would like to rename the seqnames of the bam object with this vector(sname4).

>bamfileS1_mod <- renameSeqlevels(seqlevels(bamfileS1), sname4)

Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘seqinfo’ for signature ‘"character"’

> bamfileS1_mod <- renameSeqlevels(bamfileS1, sname4)
Error in getSeqlevelsReplacementMode(value, seqlevels(x)) :
  the supplied 'seqlevels' must be a character vector with no NAs and no
  duplicates

Thank you for any suggestion

 

ADD COMMENT
1
Entering edit mode

If by "bam file mapped on transcript" you mean that your reads have been aligned to the transcriptome, then you need to realize that the positions in your BAM file (and therefore in your GAlignments object) are relative to the transcriptome. OTOH GFF files annotate genomic features with respect to the reference genome, not to the transcriptome. That means your GAlignments object and your GRanges object gff0 use incompatible coordinate systems to report aligned reads and features. So trying to call summarizeOverlaps() on them doesn't make sense. No amount of seqlevels renaming will fix that.

If you've aligned your reads to the transcriptome, then it's easy to count the number of reads per transcript. Just do table(seqnames(.)) on your GAlignments object. You might need to repeat this for each of your BAM files. If you need a SummarizedExperiment object for your downstream analysis, it should not be hard to "manually" create such object from these counts and to stick the transcript ids (i.e. the seqlevels of your GAlignments objects) in the rowData part of the object.

Hope this helps,

H.

ADD REPLY

Login before adding your answer.

Traffic: 690 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6