readGappedAlignments, package:GenomicRanges

0

Entering edit mode

Milica Krunic ▴ 40

@milica-krunic-5169

Last seen 9.6 years ago

Hello! I am working on RNA Seq. I mapped the reads to a genome using bwa. Now I want to get the count tables (number of reads per genomic feature), and one of the steps include conversion of .sam files into sorted .bam files, and creating a GRangesList out of sorted .bam files. Before the last mentioned step, I also applied readGappedAlignments to read in the sorted .bam files. Code, tail of tested .sam file, and contest of variable bwa_output are following: *Code*: fcatusDb=makeTranscriptDbFromBiomart(biomart="ensembl", dataset="fcatus_gene_ensembl") tx=transcriptsBy(fcatusDb) bwa_output=readGappedAlignments(file_name1, format="BAM") bwa_grglist=grglist(bwa_output) rc=countOverlaps(tx, bwa_grglist[countOverlaps(bwa_grglist,tx)==1]) *test.sam*: @SQ SN:GeneScaffold_295 LN:1999586 @SQ SN:GeneScaffold_5264 LN:2008247 @SQ SN:GeneScaffold_3594 LN:2061208 @SQ SN:GeneScaffold_3107 LN:2081283 HWUSI-EAS475:1:1:9547:3828#0 0 GeneScaffold_1986 3069 0 41M * 0 0 CCACATCTCTCCCAGTTTCTTTGCAACATCACCAATGGATA * XT:A:R NM:i:0 X0:i:3 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:41 XA:Z:scaffold_214963,-224762,41M,0;GeneScaffold_4609,+41298,41M,0;Gene Scaffold_886,-415265,41M,1; HWUSI-EAS475:1:1:5401:3825#0 16 scaffold_96256 459 0 41M * 0 0 GTTGGCTCTCAGGGGTACATGATGAACTGGGGAAGGAGTAT * XT:A:R NM:i:0 X0:i:3 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:41 XA:Z:scaffold_166015,+60068,41M,0;scaffold_5553,+4693,41M,0; HWUSI-EAS475:1:1:8428:7999#0 0 GeneScaffold_4103 98907 37 41M * 0 0 CCTGCACGTTCATTGTGTGTGTCTTGAGTTGATTTGTCACC * XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:11T29 HWUSI-EAS475:1:1:5814:7999#0 0 scaffold_75884 112 14 41M * 0 0 CCCAGAAGTCCCAGGTGTTCCCATTCTGTGATCAGCACAGA * XT:A:U NM:i:0 X0:i:1 X1:i:8 XM:i:0 XO:i:0 XG:i:0 MD:Z:41 XA:Z:scaffold_99844,+4698,41M,1;scaffold_20760,+365,41M,1;scaffold_957 72,+1923,41M,1;scaffold_16284,-3782,41M,1;scaffold_22266,-52,41M,1;sca ffold_82279,-5356,41M,1;scaffold_75127,-757,41M,1;scaffold_82292,-3298 ,41M,1; *bwa_output:* * * * GappedAlignments with 4 alignments and 0 elementMetadata values: seqnames strand cigar qwidth start end <rle> <rle> <character> <integer> <integer> <integer> [1] scaffold_75884 + 41M 41 112 152 [2] scaffold_96256 - 41M 41 459 499 [3] GeneScaffold_1986 + 41M 41 3069 3109 [4] GeneScaffold_4103 + 41M 41 98907 98947 width ngap <integer> <integer> [1] 41 0 [2] 41 0 [3] 41 0 [4] 41 0 --- * * * According the manual ("Multi-reads won't receive any special treatment i.e. the various SAM/BAM records corresponding to a multi-read will show up in the GappedAlignments object as if they were coming from different/unrelated queries.") and taken into account previously mentioned, I would be grateful if someone can confirm me that readGappedAlignments (output of bwa gives .sam files which contain all information about mapping locations of one read in one line) will only take the first hit, no matter what is following (whether that read is mapped to several equally scored positions...). If this is the case, would you filter the .sam files for uniquely mapped reads before reading it via readGappedAlignments? Or will you let the countOverlaps do it? Thank you! [[alternative HTML version deleted]]

• 1.3k views

ADD COMMENT • link updated 12.1 years ago by Hervé Pagès 16k • written 12.1 years ago by Milica Krunic ▴ 40

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 5 hours ago

Seattle, WA, United States

Hi Milica, On 03/16/2012 07:28 AM, Milica Krunic wrote: > Hello! > > I am working on RNA Seq. I mapped the reads to a genome using bwa. Now I > want to get the count tables (number of reads per genomic feature), and one > of the steps include conversion of .sam files into sorted .bam files, and > creating a GRangesList out of sorted .bam files. Before the last mentioned > step, I also applied readGappedAlignments to read in the sorted .bam files. > Code, tail of tested .sam file, and contest of variable bwa_output are > following: > > *Code*: > > fcatusDb=makeTranscriptDbFromBiomart(biomart="ensembl", > dataset="fcatus_gene_ensembl") > tx=transcriptsBy(fcatusDb) > > bwa_output=readGappedAlignments(file_name1, format="BAM") > > bwa_grglist=grglist(bwa_output) > > rc=countOverlaps(tx, bwa_grglist[countOverlaps(bwa_grglist,tx)==1]) > > > > *test.sam*: > > @SQ SN:GeneScaffold_295 LN:1999586 > @SQ SN:GeneScaffold_5264 LN:2008247 > @SQ SN:GeneScaffold_3594 LN:2061208 > @SQ SN:GeneScaffold_3107 LN:2081283 > HWUSI-EAS475:1:1:9547:3828#0 0 GeneScaffold_1986 3069 0 > 41M * 0 0 > CCACATCTCTCCCAGTTTCTTTGCAACATCACCAATGGATA * XT:A:R NM:i:0 > X0:i:3 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:41 > XA:Z:scaffold_214963,-224762,41M,0;GeneScaffold_4609,+41298,41M,0;Ge neScaffold_886,-415265,41M,1; > HWUSI-EAS475:1:1:5401:3825#0 16 scaffold_96256 459 0 41M > * 0 0 GTTGGCTCTCAGGGGTACATGATGAACTGGGGAAGGAGTAT > * XT:A:R NM:i:0 X0:i:3 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:41 > XA:Z:scaffold_166015,+60068,41M,0;scaffold_5553,+4693,41M,0; > HWUSI-EAS475:1:1:8428:7999#0 0 GeneScaffold_4103 98907 37 > 41M * 0 0 > CCTGCACGTTCATTGTGTGTGTCTTGAGTTGATTTGTCACC * XT:A:U NM:i:1 > X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:11T29 > HWUSI-EAS475:1:1:5814:7999#0 0 scaffold_75884 112 14 41M > * 0 0 CCCAGAAGTCCCAGGTGTTCCCATTCTGTGATCAGCACAGA > * XT:A:U NM:i:0 X0:i:1 X1:i:8 XM:i:0 XO:i:0 XG:i:0 MD:Z:41 > XA:Z:scaffold_99844,+4698,41M,1;scaffold_20760,+365,41M,1;scaffold_9 5772,+1923,41M,1;scaffold_16284,-3782,41M,1;scaffold_22266,-52,41M,1;s caffold_82279,-5356,41M,1;scaffold_75127,-757,41M,1;scaffold_82292,-32 98,41M,1; > > > > *bwa_output:* > * > * > * > GappedAlignments with 4 alignments and 0 elementMetadata values: > seqnames strand cigar qwidth start end > <rle> <rle> <character> <integer> <integer> <integer> > [1] scaffold_75884 + 41M 41 112 152 > [2] scaffold_96256 - 41M 41 459 499 > [3] GeneScaffold_1986 + 41M 41 3069 3109 > [4] GeneScaffold_4103 + 41M 41 98907 98947 > width ngap > <integer> <integer> > [1] 41 0 > [2] 41 0 > [3] 41 0 > [4] 41 0 > --- > > > * > * > * > > According the manual ("Multi-reads won't receive any special treatment i.e. > the various SAM/BAM records corresponding to a multi-read will show up in > the GappedAlignments object as if they were coming from different/unrelated > queries.") and taken into account previously mentioned, I would be grateful > if someone can confirm me that readGappedAlignments (output of bwa gives > .sam files which contain all information about mapping locations of one > read in one line) will only take the first hit, no matter what is following > (whether that read is mapped to several equally scored positions...). This part of the manual is trying to emphasize the fact that readGappedAlignments() is returning a GappedAlignments object with *one* element per record in the BAM file. readGappedAlignments() is pretty low-level i.e. it doesn't try to merge or to drop records when a read has multiple hits or when reads are paired-end. By default it loads all the records that can be loaded. The only exception is for unmapped reads: those are never loaded in the GappedAlignments object. So, by default, records corresponding to the same read (i.e. records with the same QNAME value, note that you can use use.names=TRUE to set this field as the names of the returned object) will be loaded. And you will typically end up with duplicated names on your GappedAlignments object. AFAIK the reasons for finding more than 1 record with the same QNAME value in a BAM file are because: (a) more than 1 hit were reported in the BAM file by the aligner; (b) or the reads are paired-end, then 2 records are needed to report a "paired-hit"; (c) or a combination of (a) and (b), e.g. if the aligner found 3 "paired-hits" for a given paired-end read, then 6 records are needed. Here are the 2 most common strategies for dealing with single-end multi-reads (i.e. single-end reads with more than 1 record): 1. AFAIU most aligner should/will set flag bit 0x100 to 1 for hits that are considered secondary. So you can filter out those records at load time with: library(Rsamtools) param <- ScanBamParam(flag=scanBamFlag(isNotPrimaryRead=FALSE)) bwa_output <- readGappedAlignments(file_name1, param=param) My understanding of the SAM spec is that, in the case of single-end reads, there should be only 1 primary record in the BAM file for a given QNAME. 2. Aligners will sometime set predefined tag NH (number of hits) for each record. You can load this tag with: param <- ScanBamParam(tag="NH") bwa_output <- readGappedAlignments(file_name1, param=param) and then subset 'bwa_output' based on the value of this tag: bwa_output[elementMetadata(bwa_output)$NH == 1L] Note that this filtering is more stringent than 1.: it will not only drop all the secondary records but it will also drop the primary record of a read that has secondary records. For paired-end reads the situation is a little bit more complicated but there are tools currently in development in Rsamtools/GenomicRanges that will make it easy to load them, keep the pairing information and filter out "secondary pairs". > > If this is the case, would you filter the .sam files for uniquely mapped > reads before reading it via readGappedAlignments? Or will you let the > countOverlaps do it? The precise meaning of "uniquely mapped" depends on the context. When we talk about reads with unique/multiple SAM records, it means "uniquely mapped to the reference genome". But when you do: countOverlaps(bwa_grglist,tx)==1 you are selecting alignments (i.e. elements in your GappedAlignments object 'bwa_grglist') that hit only 1 gene (because of how you made 'bwa_grglist', i.e. by grouping transcripts by gene, the hit must actually be in one of the gene exons in order to be counted). So the 2 filterings (i.e. filter the .sam files for uniquely mapped reads and filtering with countOverlaps(bwa_grglist,tx)==1) are independent. A read can have multiple hits in the reference genome while still hitting a single gene. And, inversely, a read can have a unique hit in the reference genome and hit more than 1 gene (e.g. if 2 genes have exons that overlap, probably a rare but not impossible event). So you can do either filtering or both of them. I'm not sure there is a "one size fits all" strategy. I would try different solutions and compare them... Note that there are also higher level tools available for doing this kind of counting e.g. summarizeOverlaps() in GenomicRanges (see the summarizeOverlaps.pdf vignette in that package) and the htseq-count script distributed with the HTSeq Python framework (see the DESeq.pdf vignette in the DESeq package). It's a tough question so it's always good to see how other people have tried to solve it. HTH, H. > > > > Thank you! > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 12.1 years ago Hervé Pagès 16k

0

Entering edit mode

A small correction. See below... On 03/16/2012 01:03 PM, Hervé Pagès wrote: > Hi Milica, > > On 03/16/2012 07:28 AM, Milica Krunic wrote: >> Hello! >> >> I am working on RNA Seq. I mapped the reads to a genome using bwa. Now I >> want to get the count tables (number of reads per genomic feature), >> and one >> of the steps include conversion of .sam files into sorted .bam files, and >> creating a GRangesList out of sorted .bam files. Before the last >> mentioned >> step, I also applied readGappedAlignments to read in the sorted .bam >> files. >> Code, tail of tested .sam file, and contest of variable bwa_output are >> following: >> >> *Code*: >> >> fcatusDb=makeTranscriptDbFromBiomart(biomart="ensembl", >> dataset="fcatus_gene_ensembl") >> tx=transcriptsBy(fcatusDb) >> >> bwa_output=readGappedAlignments(file_name1, format="BAM") >> >> bwa_grglist=grglist(bwa_output) >> >> rc=countOverlaps(tx, bwa_grglist[countOverlaps(bwa_grglist,tx)==1]) >> >> >> >> *test.sam*: >> >> @SQ SN:GeneScaffold_295 LN:1999586 >> @SQ SN:GeneScaffold_5264 LN:2008247 >> @SQ SN:GeneScaffold_3594 LN:2061208 >> @SQ SN:GeneScaffold_3107 LN:2081283 >> HWUSI-EAS475:1:1:9547:3828#0 0 GeneScaffold_1986 3069 0 >> 41M * 0 0 >> CCACATCTCTCCCAGTTTCTTTGCAACATCACCAATGGATA * XT:A:R NM:i:0 >> X0:i:3 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:41 >> XA:Z:scaffold_214963,-224762,41M,0;GeneScaffold_4609,+41298,41M,0;G eneScaffold_886,-415265,41M,1; >> >> HWUSI-EAS475:1:1:5401:3825#0 16 scaffold_96256 459 0 41M >> * 0 0 GTTGGCTCTCAGGGGTACATGATGAACTGGGGAAGGAGTAT >> * XT:A:R NM:i:0 X0:i:3 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:41 >> XA:Z:scaffold_166015,+60068,41M,0;scaffold_5553,+4693,41M,0; >> HWUSI-EAS475:1:1:8428:7999#0 0 GeneScaffold_4103 98907 37 >> 41M * 0 0 >> CCTGCACGTTCATTGTGTGTGTCTTGAGTTGATTTGTCACC * XT:A:U NM:i:1 >> X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:11T29 >> HWUSI-EAS475:1:1:5814:7999#0 0 scaffold_75884 112 14 41M >> * 0 0 CCCAGAAGTCCCAGGTGTTCCCATTCTGTGATCAGCACAGA >> * XT:A:U NM:i:0 X0:i:1 X1:i:8 XM:i:0 XO:i:0 XG:i:0 MD:Z:41 >> XA:Z:scaffold_99844,+4698,41M,1;scaffold_20760,+365,41M,1;scaffold_ 95772,+1923,41M,1;scaffold_16284,-3782,41M,1;scaffold_22266,-52,41M,1; scaffold_82279,-5356,41M,1;scaffold_75127,-757,41M,1;scaffold_82292,-3 298,41M,1; >> >> >> >> >> *bwa_output:* >> * >> * >> * >> GappedAlignments with 4 alignments and 0 elementMetadata values: >> seqnames strand cigar qwidth start end >> <rle> <rle> <character> <integer> <integer> <integer> >> [1] scaffold_75884 + 41M 41 112 152 >> [2] scaffold_96256 - 41M 41 459 499 >> [3] GeneScaffold_1986 + 41M 41 3069 3109 >> [4] GeneScaffold_4103 + 41M 41 98907 98947 >> width ngap >> <integer> <integer> >> [1] 41 0 >> [2] 41 0 >> [3] 41 0 >> [4] 41 0 >> --- >> >> >> * >> * >> * >> >> According the manual ("Multi-reads won't receive any special treatment >> i.e. >> the various SAM/BAM records corresponding to a multi-read will show up in >> the GappedAlignments object as if they were coming from >> different/unrelated >> queries.") and taken into account previously mentioned, I would be >> grateful >> if someone can confirm me that readGappedAlignments (output of bwa gives >> .sam files which contain all information about mapping locations of one >> read in one line) will only take the first hit, no matter what is >> following >> (whether that read is mapped to several equally scored positions...). > > This part of the manual is trying to emphasize the fact that > readGappedAlignments() is returning a GappedAlignments object with > *one* element per record in the BAM file. readGappedAlignments() > is pretty low-level i.e. it doesn't try to merge or to drop records > when a read has multiple hits or when reads are paired-end. > By default it loads all the records that can be loaded. The only > exception is for unmapped reads: those are never loaded in the > GappedAlignments object. > > So, by default, records corresponding to the same read (i.e. records > with the same QNAME value, note that you can use use.names=TRUE to > set this field as the names of the returned object) will be loaded. > And you will typically end up with duplicated names on your > GappedAlignments object. > > AFAIK the reasons for finding more than 1 record with the same QNAME > value in a BAM file are because: > (a) more than 1 hit were reported in the BAM file by the aligner; > (b) or the reads are paired-end, then 2 records are needed to > report a "paired-hit"; > (c) or a combination of (a) and (b), e.g. if the aligner found 3 > "paired-hits" for a given paired-end read, then 6 records > are needed. > > Here are the 2 most common strategies for dealing with single-end > multi-reads (i.e. single-end reads with more than 1 record): > > 1. AFAIU most aligner should/will set flag bit 0x100 to 1 > for hits that are considered secondary. So you can filter > out those records at load time with: > > library(Rsamtools) > param <- ScanBamParam(flag=scanBamFlag(isNotPrimaryRead=FALSE)) > bwa_output <- readGappedAlignments(file_name1, param=param) > > My understanding of the SAM spec is that, in the case of > single-end reads, there should be only 1 primary record in the > BAM file for a given QNAME. > > 2. Aligners will sometime set predefined tag NH (number of hits) > for each record. You can load this tag with: > > param <- ScanBamParam(tag="NH") > bwa_output <- readGappedAlignments(file_name1, param=param) > > and then subset 'bwa_output' based on the value of this tag: > > bwa_output[elementMetadata(bwa_output)$NH == 1L] > > Note that this filtering is more stringent than 1.: it will not > only drop all the secondary records but it will also drop the > primary record of a read that has secondary records. > > For paired-end reads the situation is a little bit more complicated > but there are tools currently in development in Rsamtools/GenomicRanges > that will make it easy to load them, keep the pairing information > and filter out "secondary pairs". > >> >> If this is the case, would you filter the .sam files for uniquely mapped >> reads before reading it via readGappedAlignments? Or will you let the >> countOverlaps do it? > > The precise meaning of "uniquely mapped" depends on the context. When > we talk about reads with unique/multiple SAM records, it means "uniquely > mapped to the reference genome". But when you do: > > countOverlaps(bwa_grglist,tx)==1 > > you are selecting alignments (i.e. elements in your GappedAlignments > object 'bwa_grglist') that hit only 1 gene (because of how you made > 'bwa_grglist', i.e. by grouping transcripts by gene, the hit must > actually be in one of the gene exons in order to be counted). Sorry that's not correct. I should say: the hit is counted if it's withing the boundaries of at least 1 of the transcripts of the gene. > > So the 2 filterings (i.e. filter the .sam files for uniquely mapped > reads and filtering with countOverlaps(bwa_grglist,tx)==1) are > independent. A read can have multiple hits in the reference genome > while still hitting a single gene. And, inversely, a read can have > a unique hit in the reference genome and hit more than 1 gene (e.g. > if 2 genes have exons that overlap, probably a rare but not if 2 genes have transcripts that overlap (looking at their boundaries only, regardless of the exon/intron structure) Cheers, H. > impossible event). So you can do either filtering or both of them. > > I'm not sure there is a "one size fits all" strategy. I would try > different solutions and compare them... Note that there are also > higher level tools available for doing this kind of counting e.g. > summarizeOverlaps() in GenomicRanges (see the summarizeOverlaps.pdf > vignette in that package) and the htseq-count script distributed > with the HTSeq Python framework (see the DESeq.pdf vignette in the > DESeq package). It's a tough question so it's always good to see how > other people have tried to solve it. > > HTH, > H. > >> >> >> >> Thank you! >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 12.1 years ago Hervé Pagès 16k

Login before adding your answer.