Rsubread featureCounts unexpected strand-specifc counts with Ensembl GTF
1
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Hi, I am trying to use featureCounts to perform strand-specifc RNA-Seq read counting with an Ensembl GTF file. However, it seems to count reads from each strand separately rather than for each gene from the strand where the gene is located. I have created a small BAM that contains reads from only two genes (one from each strand) to illustrate the problem. First, if I specify strandSpefic = 0 I get 97.7% of the reads aligned to two genes: a = featureCounts("merged.bam", annot.ext = "Homo_sapiens.GRCh38.76.gtf", strandSpecific = 0, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE) > a$stat Status counts.merged.bam 1 Assigned 16179 2 Unassigned_Ambiguity 0 3 Unassigned_MultiMapping 56 4 Unassigned_NoFeatures 335 And the reads counts for the two genes are: > a$counts[c("ENSG00000066336","ENSG00000136869"),] ENSG00000066336 ENSG00000136869 11759 4419 However, when I try to specifiy strandSpecific = 1 or 2 I get reads from only one strand counted in either case b = featureCounts("counts/merged.bam", annot.ext = "../../annotations/GRCh38/genes/Homo_sapiens.GRCh38.76.gtf", strandSpecific = 1, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE) > b$stat Status counts.merged.bam 1 Assigned 5295 2 Unassigned_Ambiguity 0 3 Unassigned_MultiMapping 56 4 Unassigned_NoFeatures 11219 > b$counts[c("ENSG00000066336","ENSG00000136869"),] ENSG00000066336 ENSG00000136869 952 4343 ENSG00000136869 is located on the forward strand, gets high counts with strandSpecific = 1. ##### c = featureCounts("counts/merged.bam", annot.ext = "../../annotations/GRCh38/genes/Homo_sapiens.GRCh38.76.gtf", strandSpecific = 2, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE) > c$stat Status counts.merged.bam 1 Assigned 10884 2 Unassigned_Ambiguity 0 3 Unassigned_MultiMapping 56 4 Unassigned_NoFeatures 5630 > c$counts[c("ENSG00000066336","ENSG00000136869"),] ENSG00000066336 ENSG00000136869 10807 76 ENSG00000066336 is located on the reverse strand, gets high counts with strandSpecific = 2. Furthermore, a$counts[c("ENSG00000066336","ENSG00000136869"),] = c$counts[c("ENSG00000066336","ENSG00000136869"),] + b$counts[c("ENSG00000066336","ENSG00000136869"),] I downloaded the Ensembl 76 GTF file from here: ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens The sample BAM files is available from here: https://dl.dropboxusercontent.com/u/6006832/merged.bam Is this a bug of featureCounts or am I doing something obviously wrong? Kind regards, Kaur Alasoo PhD student at Sanger Institute -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] Rsubread_1.12.6 BiocInstaller_1.12.1 loaded via a namespace (and not attached): [1] tools_3.0.2 -- Sent via the guest posting facility at bioconductor.org.
• 2.3k views
ADD COMMENT
1
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 14 days ago
Australia/Melbourne/Olivia Newton-John …
Dear Kaur, We had a close look at the reads in your supplied bam file and found that most of the reads were mapped to the positive strand. This is inconsistent with the annotation for the two genes you listed, ie. the read mapping results tell you that both genes are on the positive strand but the annotation data tell you that the two genes are on different strands. When strandSpecific is set to 1, featureCounts requires the strand of a read is the same as that of a gene before assigning the read to the gene. When strandSpecific is set to 2, featureCounts will flip the strand of each read and then try to match it to the strands of genes. Since your two genes are on different strands but the reads are almost all on the same strand, you will never get large counts (or small counts) for the two genes at the same time. This is not a problem of featureCounts. featureCounts just assigns the reads using the strand info provided in the mapping results. This is more like a problem with either the annotation or read mapping or maybe the sequencing experiment. Hope this helps. Best wishes, Wei On Aug 28, 2014, at 12:53 AM, Kaur Alasoo [guest] wrote: > Hi, > > I am trying to use featureCounts to perform strand-specifc RNA-Seq read counting with an Ensembl GTF file. However, it seems to count reads from each strand separately rather than for each gene from the strand where the gene is located. I have created a small BAM that contains reads from only two genes (one from each strand) to illustrate the problem. > > First, if I specify strandSpefic = 0 I get 97.7% of the reads aligned to two genes: > a = featureCounts("merged.bam", annot.ext = "Homo_sapiens.GRCh38.76.gtf", > strandSpecific = 0, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE) > >> a$stat > Status counts.merged.bam > 1 Assigned 16179 > 2 Unassigned_Ambiguity 0 > 3 Unassigned_MultiMapping 56 > 4 Unassigned_NoFeatures 335 > > And the reads counts for the two genes are: >> a$counts[c("ENSG00000066336","ENSG00000136869"),] > ENSG00000066336 ENSG00000136869 > 11759 4419 > > However, when I try to specifiy strandSpecific = 1 or 2 I get reads from only one strand counted in either case > b = featureCounts("counts/merged.bam", annot.ext = "../../annotations/GRCh38/genes/Homo_sapiens.GRCh38.76.gtf", strandSpecific = 1, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE) > >> b$stat > Status counts.merged.bam > 1 Assigned 5295 > 2 Unassigned_Ambiguity 0 > 3 Unassigned_MultiMapping 56 > 4 Unassigned_NoFeatures 11219 > >> b$counts[c("ENSG00000066336","ENSG00000136869"),] > ENSG00000066336 ENSG00000136869 > 952 4343 > > ENSG00000136869 is located on the forward strand, gets high counts with strandSpecific = 1. > > ##### > > c = featureCounts("counts/merged.bam", annot.ext = "../../annotations/GRCh38/genes/Homo_sapiens.GRCh38.76.gtf", strandSpecific = 2, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE) > >> c$stat > Status counts.merged.bam > 1 Assigned 10884 > 2 Unassigned_Ambiguity 0 > 3 Unassigned_MultiMapping 56 > 4 Unassigned_NoFeatures 5630 > >> c$counts[c("ENSG00000066336","ENSG00000136869"),] > ENSG00000066336 ENSG00000136869 > 10807 76 > > ENSG00000066336 is located on the reverse strand, gets high counts with strandSpecific = 2. > > Furthermore, > a$counts[c("ENSG00000066336","ENSG00000136869"),] = c$counts[c("ENSG00000066336","ENSG00000136869"),] + b$counts[c("ENSG00000066336","ENSG00000136869"),] > > I downloaded the Ensembl 76 GTF file from here: > ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens > > The sample BAM files is available from here: > https://dl.dropboxusercontent.com/u/6006832/merged.bam > > Is this a bug of featureCounts or am I doing something obviously wrong? > > Kind regards, > Kaur Alasoo > PhD student at Sanger Institute > > -- output of sessionInfo(): > > R version 3.0.2 (2013-09-25) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] Rsubread_1.12.6 BiocInstaller_1.12.1 > > loaded via a namespace (and not attached): > [1] tools_3.0.2 > > -- > Sent via the guest posting facility at bioconductor.org. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD COMMENT
0
Entering edit mode
Dear Wei, Thanks for your quick reply. I think the problem might be how the strand is represented in the BAM file. I used Tophat2 to perform the alignment and it uses XS:A:+ or XS:A:- tags to specify whether the read comes from the forward or the reverse strand, respectively. When I split the reads by chromosome and then count how many of them map to each strand I get the what is expected in the annotations: On chromosome 11 almost all reads are on reverse strand. ENSG00000066336 is also on on the reverse strand. $ samtools view merged.bam | awk '{if ($3 == "11") {print $0}}' | grep "XS:A:-" | wc -l 23410 $ samtools view merged.bam | awk '{if ($3 == "11") {print $0}}' | grep "XS:A:+" | wc -l 5 On chromosome 9 almost all read are on the forward strand. ENSG00000136869 is also on the fwd strand. $ samtools view merged.bam | awk '{if ($3 == "9") {print $0}}' | grep "XS:A:-" | wc -l 80 $ samtools view merged.bam | awk '{if ($3 == "9") {print $0}}' | grep "XS:A:+" | wc -l 8628 Which field of the BAM files does featureCounts use to infer the strand of the read? Best wishes, Kaur On 29 Aug 2014, at 03:59, Wei Shi <shi at="" wehi.edu.au=""> wrote: > Dear Kaur, > > We had a close look at the reads in your supplied bam file and found that most of the reads were mapped to the positive strand. This is inconsistent with the annotation for the two genes you listed, ie. the read mapping results tell you that both genes are on the positive strand but the annotation data tell you that the two genes are on different strands. > > When strandSpecific is set to 1, featureCounts requires the strand of a read is the same as that of a gene before assigning the read to the gene. When strandSpecific is set to 2, featureCounts will flip the strand of each read and then try to match it to the strands of genes. Since your two genes are on different strands but the reads are almost all on the same strand, you will never get large counts (or small counts) for the two genes at the same time. > > This is not a problem of featureCounts. featureCounts just assigns the reads using the strand info provided in the mapping results. This is more like a problem with either the annotation or read mapping or maybe the sequencing experiment. > > Hope this helps. > > Best wishes, > > Wei > > > > On Aug 28, 2014, at 12:53 AM, Kaur Alasoo [guest] wrote: > >> Hi, >> >> I am trying to use featureCounts to perform strand-specifc RNA-Seq read counting with an Ensembl GTF file. However, it seems to count reads from each strand separately rather than for each gene from the strand where the gene is located. I have created a small BAM that contains reads from only two genes (one from each strand) to illustrate the problem. >> >> First, if I specify strandSpefic = 0 I get 97.7% of the reads aligned to two genes: >> a = featureCounts("merged.bam", annot.ext = "Homo_sapiens.GRCh38.76.gtf", >> strandSpecific = 0, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE) >> >>> a$stat >> Status counts.merged.bam >> 1 Assigned 16179 >> 2 Unassigned_Ambiguity 0 >> 3 Unassigned_MultiMapping 56 >> 4 Unassigned_NoFeatures 335 >> >> And the reads counts for the two genes are: >>> a$counts[c("ENSG00000066336","ENSG00000136869"),] >> ENSG00000066336 ENSG00000136869 >> 11759 4419 >> >> However, when I try to specifiy strandSpecific = 1 or 2 I get reads from only one strand counted in either case >> b = featureCounts("counts/merged.bam", annot.ext = "../../annotations/GRCh38/genes/Homo_sapiens.GRCh38.76.gtf", strandSpecific = 1, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE) >> >>> b$stat >> Status counts.merged.bam >> 1 Assigned 5295 >> 2 Unassigned_Ambiguity 0 >> 3 Unassigned_MultiMapping 56 >> 4 Unassigned_NoFeatures 11219 >> >>> b$counts[c("ENSG00000066336","ENSG00000136869"),] >> ENSG00000066336 ENSG00000136869 >> 952 4343 >> >> ENSG00000136869 is located on the forward strand, gets high counts with strandSpecific = 1. >> >> ##### >> >> c = featureCounts("counts/merged.bam", annot.ext = "../../annotations/GRCh38/genes/Homo_sapiens.GRCh38.76.gtf", strandSpecific = 2, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE) >> >>> c$stat >> Status counts.merged.bam >> 1 Assigned 10884 >> 2 Unassigned_Ambiguity 0 >> 3 Unassigned_MultiMapping 56 >> 4 Unassigned_NoFeatures 5630 >> >>> c$counts[c("ENSG00000066336","ENSG00000136869"),] >> ENSG00000066336 ENSG00000136869 >> 10807 76 >> >> ENSG00000066336 is located on the reverse strand, gets high counts with strandSpecific = 2. >> >> Furthermore, >> a$counts[c("ENSG00000066336","ENSG00000136869"),] = c$counts[c("ENSG00000066336","ENSG00000136869"),] + b$counts[c("ENSG00000066336","ENSG00000136869"),] >> >> I downloaded the Ensembl 76 GTF file from here: >> ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens >> >> The sample BAM files is available from here: >> https://dl.dropboxusercontent.com/u/6006832/merged.bam >> >> Is this a bug of featureCounts or am I doing something obviously wrong? >> >> Kind regards, >> Kaur Alasoo >> PhD student at Sanger Institute >> >> -- output of sessionInfo(): >> >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] Rsubread_1.12.6 BiocInstaller_1.12.1 >> >> loaded via a namespace (and not attached): >> [1] tools_3.0.2 >> >> -- >> Sent via the guest posting facility at bioconductor.org. > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:14}}
ADD REPLY

Login before adding your answer.

Traffic: 739 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6