Dear all,
I would like to ask if it is possible to change the seqnames of a bam file object, giving a vector of character to the function renameSeqlevels. This is because in order to use the fuction summarizeOverlap or count/find, the seqnames have to match.
From the bamfile object below, I have extracted the locus annotations from the seqnames (i.e ERCC00002, NC_001133.9...etc), since the seqnames are constructed differently respect to the one in the gff file, and I have created a list (same length as the seqlevels of the bam file).
bamfile
GAlignments object with 6 alignments and 0 metadata columns:
seqnames
<Rle>
[1] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[2] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[3] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[4] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[5] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
[6] DQ459430_gene=ERCC00002_loc:ERCC00002|1-1061|+_exons:1-1061_segs:1-1061
strand cigar qwidth start end width njunc
<Rle> <character> <integer> <integer> <integer> <integer> <integer>
[1] + 8M2D27M 35 1025 1061 37 0
[2] + 8M2D27M 35 1025 1061 37 0
[3] - 36M 36 1025 1060 36 0
[4] - 36M 36 1026 1061 36 0
[5] + 35M 35 1027 1061 35 0
[6] + 35M 35 1027 1061 35 0
-------
gffile
GRanges object with 6 ranges and 12 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] NC_001133.9 [ 24837, 25070] + | s_cerevisiae exon <NA>
[2] NC_001133.9 [ 25048, 25394] + | s_cerevisiae exon <NA>
[3] NC_001133.9 [ 27155, 27786] + | s_cerevisiae exon <NA>
[4] NC_001133.9 [ 73431, 73792] + | s_cerevisiae exon <NA>
[5] NC_001133.9 [165314, 165561] + | s_cerevisiae exon <NA>
[6] NC_001133.9 [165388, 165781] + | s_cerevisiae exon <NA>
phase gene_id transcript_id exon_number gene_name
<integer> <character> <character> <character> <character>
[1] <NA> XLOC_000040 TCONS_00000191 1 FLO9
[2] <NA> XLOC_000040 TCONS_00000192 1 FLO9
[3] <NA> XLOC_000041 TCONS_00000193 1 FLO9
[4] <NA> XLOC_000055 TCONS_00000200 1 YAL037C-A
[5] <NA> XLOC_000075 TCONS_00000100 1 YAR010C
[6] <NA> XLOC_000075 TCONS_00000219 1 YAR010C
oId nearest_ref class_code
<character> <character> <character>
[1] {TRINITY_GG_normal}16_c1_g1_i1.mrna1 rna8 x
[2] {TRINITY_GG_normal}16_c0_g1_i1.mrna1 rna8 x
[3] {TRINITY_GG_normal}12_c0_g1_i1.mrna1 rna8 x
[4] {TRINITY_GG_normal}3_c3_g1_i1.mrna1 rna31 x
[5] {TRINITY_GG_normal}3479_c0_g1_i1.mrna1 rna77 x
[6] {TRINITY_GG_normal}24_c0_g1_i1.mrna1 rna77 x
tss_id
<character>
[1] TSS42
[2] TSS43
[3] TSS44
[4] TSS71
[5] TSS118
[6] TSS118
-------
It is possible to replace the seqlevels names with the ones of the vector of character?
I have tried:
bamfile1 <- renameSeqlevels(seqlevels(bamfile), listx)
Thank you for any advice,
Kind regards
If by "bam file mapped on transcript" you mean that your reads have been aligned to the transcriptome, then you need to realize that the positions in your BAM file (and therefore in your GAlignments object) are relative to the transcriptome. OTOH GFF files annotate genomic features with respect to the reference genome, not to the transcriptome. That means your GAlignments object and your GRanges object
gff0
use incompatible coordinate systems to report aligned reads and features. So trying to callsummarizeOverlaps()
on them doesn't make sense. No amount of seqlevels renaming will fix that.If you've aligned your reads to the transcriptome, then it's easy to count the number of reads per transcript. Just do
table(seqnames(.))
on your GAlignments object. You might need to repeat this for each of your BAM files. If you need a SummarizedExperiment object for your downstream analysis, it should not be hard to "manually" create such object from these counts and to stick the transcript ids (i.e. the seqlevels of your GAlignments objects) in therowData
part of the object.Hope this helps,
H.