Question: Rsubread inbuilt annotation
0
12 weeks ago by
diego.carvalhoalvarez0 wrote:

Hi. I have been working with mouse RNA-Seq data and I aimed to use Rsubread package to align, annotate and quantify the reads. My issue is that whenever I use the inbuilt annotation for the mouse genome practically none of the reads are mappped:

align (index = "GRCm38.p4", readfile1 = all.fastq1, readfile2 = all.fastq2, input_format = "FASTQ", output_format="BAM", nthreads=8, output_file = all.BAM)
featuresC <- featureCounts(all.BAM, annot.inbuilt="mm10")

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file mm10_RefSeq_exon.txt ...                              ||
||    Features : 222996                                                       ||
||    Meta-features : 27179                                                   ||
||    Chromosomes/contigs : 43                                                ||
||                                                                            ||
|| Process BAM file 1A.bam...                                                 ||
||    Paired-end reads are included.                                          ||
||    Assign alignments to features...                                        ||
||    Total alignments : 93836764                                             ||
||    Successfully assigned alignments : 29766 (0,0%)                         ||
||    Running time : 3,96 minutes                                             ||
||                                                                            ||


I tried with the latest patch too (p6) and I obtained the same result. It only works by explicitly stating an external annotation file (.gtf). But in that case genes are being annotated by name and not by entrezid which I require for gene ontology analysis and so on.

Could somebody help me with this? Thank you very much in advance.

rsubread featurecounts • 189 views
modified 12 weeks ago by Gordon Smyth38k • written 12 weeks ago by diego.carvalhoalvarez0
3
12 weeks ago by
Gordon Smyth38k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth38k wrote:

The problem is that NCBI has renamed all the chromosomes in its recent genome releases. A few years ago, chromosome 1 was called "chr1" in NCBI's fasta files, but it is now called "NC_000067.6". The Rsubread in-built annotation uses "chr1", "chr2" etc, which is compatible with UCSC and Gencode but no longer with NCBI.

Note that it's not a matter of which patch (p1, p4, p6) was used, as all the reference chromosomes are unchanged between patches. It's just an arbitrary change of notation by NCBI.

The easiest solution is to use a NCBI fasta file from before they renamed the chromosomes and gene ids. That's what I do, and it would seem that James MacDonald is doing the same.

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by Gordon Smyth38k

So far I could not find an ncbi annotation file as desired so I decided to stick with Gencode, Ensembl assemblies and the inbuilt annotation. I tried both out of curiosity and they work as expected except for a detail...the percentage of successfully assigned alignments using Gencode primary assembly (release M18) is slightly better than with Ensembl (79,6% vs 77,6% for example). Regarding the detected features (after removal of zeros) they are almost the same 22392 using the Gencode assembly and 22399 using Emsembl. Thank you all for your help.

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by diego.carvalhoalvarez0
0
12 weeks ago by
United States
James W. MacDonald50k wrote:

Your cutting'n'pasting seems off - you are running featureCounts on 'all.BAM' but featureCounts appears to be running on 1A.bam.

Anyway, do you have a mismatch between the genomes you are using? The inbuilt annotation is based on the NCBI genome, which has chr1, chr2, etc. If you used an Ensembl genome to align, then all the chromosomes will be 1, 2, 3, etc. You can check using Rsamtools, doing something like

head(scanBamHeader("all.BAM")targets)  Where the names of the resulting vector are the chromosome names: > bf <- BamFile("../data/aligned_nodups/6-E-4Aligned.sortedByCoord.out.bam") > head(scanBamHeader(bf)targets)
chr1     chr10     chr11     chr12     chr13     chr14
195471971 130694993 122082543 120129022 120421639 124902244


And my BAM file is aligned against the mm10 genome from NCBI.

Thank you very much for your response. I used NCBI genome to build the index so they should match... (I only copied part of the output as the result is the same for all the files)...

I can't think of anything else that you could do to get such poor alignments. Even (not that I would know) using say the inbuilt mm10 file with human data will result in something like 10-15% of the reads aligning.

But I would caution against thinking things like they 'should' be anything. You should be inspecting the data from the BAM files to ensure that you are actually getting good alignments to the genome in the first place, and to ensure that the chromosome names match. Assuming something to be true that you can easily check is not really optimal.

ADD REPLYlink written 12 weeks ago by James W. MacDonald50k

I think it should be what you stated, a mismatch between ncbi genome and the inbuilt annotation. According to the documentation: "The in-built annotations are a modified version of NCBI RefSeq gene annotations. We downloaded the RefSeq gene annotations from NCBI ftp server (eg. RefSeq annotation for mm10 was downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/Mmusculus/ARCHIVE/BUILD.38.1/mapview/seqgene.md.gz)." So it seems the authors used patch1 of the mm10. I am not sure, but apart from the issue I found using a different patch would bring inaccurate results. In that case, which version (and source) are you using? Thanks.

James suggested that you run scanBamHeader to check your chromosome names, but you don't seem to have done that. If you ran it, you would find that your chromosome names are not the same as James has. It's always best to check rather than to assume!