MT reference differences between hg19/GRCH37
1
0
Entering edit mode
Julian Gehring ★ 1.3k
@julian-gehring-5818
Last seen 5.0 years ago
Hi, The UCSC hg19 and GRCH 37 reference genome use different reference sequences for the mitochondrium (MT) that differ in length and have mismatches at multiple positions. For a short explanation on this, see https://lists.soe.ucsc.edu/pipermail/genome/2009-July/019631.html. Bioconductor normally only provides the UCSC references (see e.g. the BSgenome.* or TxDb.* packages. When using data aligned to the GRCH reference, to what extend does using the UCSC reference influence the analysis of features located on the MT, and for which kinds of downstream analyses could this become critical? E.g. locating SNPs on the MT is such a critical case. The problem will likely be solved with the hg20/GRCH38 release, but data that requires the hg19/GRCH37 releases may still be relevant for several years. Would it be reasonable to provide alternative reference packages, such as a GRCH37 BSgenome package? Best wishes Julian
BSgenome BSgenome BSgenome BSgenome • 2.2k views
ADD COMMENT
0
Entering edit mode
Julian Gehring ★ 1.3k
@julian-gehring-5818
Last seen 5.0 years ago
Hi, The UCSC hg19 and GRCH 37 reference genome use different reference sequences for the mitochondrium (MT) that differ in length and have mismatches at multiple positions. For a short explanation on this, see https://lists.soe.ucsc.edu/pipermail/genome/2009-July/019631.html. Bioconductor normally only provides the UCSC references (see e.g. the BSgenome.* or TxDb.* packages. When using data aligned to the GRCH reference, to what extend does using the UCSC reference influence the analysis of features located on the MT, and for which kinds of downstream analyses could this become critical? E.g. locating SNPs on the MT is such a critical case. The problem will likely be solved with the hg20/GRCH38 release, but data that requires the hg19/GRCH37 releases may still be relevant for several years. Would it be reasonable to provide alternative reference packages, such as a GRCH37 BSgenome package? Best wishes Julian
ADD COMMENT
0
Entering edit mode
Hi Julian, On 03/17/2013 05:53 AM, Julian Gehring wrote: > Hi, > > The UCSC hg19 and GRCH 37 reference genome use different reference > sequences for the mitochondrium (MT) that differ in length and have > mismatches at multiple positions. For a short explanation on this, see > https://lists.soe.ucsc.edu/pipermail/genome/2009-July/019631.html. > > Bioconductor normally only provides the UCSC references (see e.g. the > BSgenome.* or TxDb.* packages. Because they're the most frequently used but other annotations can also be used. For example you can use makeTranscriptDbFromBiomart() to make your own TranscriptDb object from Ensembl for Human. In that case, you get annotations relative to GRCh37, I think. Or, more precisely, to the patched version of GRCh37 currently used by Ensembl (GRCh37.p10 for Ensembl release 70, see Ensembl release announcements for the details). Also, if you work with SNPs, a tool like predictCoding() (in VariantAnnotation) accepts a FaFile object for sequence extraction (as an alternative to a BSgenome object). Making a FaFile object for GRCh37 should be easy. Note that you can also make your own BSgenome data package for GRCh37 (see the BSgenomeForge vignette in the BSgenome software package). Probably a little more work than making a FaFile object but shouldn't be too hard. > When using data aligned to the GRCH > reference, to what extend does using the UCSC reference influence the > analysis of features located on the MT, and for which kinds of > downstream analyses could this become critical? E.g. locating SNPs on > the MT is such a critical case. Any ranged-based analysis (i.e. an analysis where the genomic ranges of the data are compared with the genomic ranges of the annotations) might produce wrong results. And AFAIK most HTS downstream analyses are range-based analyses. The point-of-view we took in the GenomicRanges package is that it's always critical. This is the reason why any range-based operation that involves more than 1 object (e.g. findOverlaps) starts by checking that the underlying sequences (i.e. the chromosomes) are compatible between the 2 objects (i.e. have the same length and circularity flag), and raises an error if they are not. Packages that provide functionalities built on top of GenomicRanges package also inherit that behavior. This is not an absolute protection though because the chromosomes could have the same length and circularity flag and still differ. So it's really the responsibility of the user to check that s/he uses annotations that match the aligned data. > > The problem will likely be solved with the hg20/GRCH38 release, I hope so. Unless some chromosome sequence in GRCh38 gets updated after the initial GRCh38 release (this is what happened for GRCh37). > but data > that requires the hg19/GRCH37 releases may still be relevant for several > years. Would it be reasonable to provide alternative reference > packages, such as a GRCH37 BSgenome package? Maybe we could do that, even though there was no demand so far. Possible reasons could be that: - Most people manage to use annotations that match their aligned data. I agree that maybe we provide more BSgenome.* and TxDb.* packages based on UCSC than on other annotation resources, but we also do provide annotations from other resources. For example the SNPlocs.* packages contain dbSNP snapshots based on GRCh37, not hg19. Anyway we don't want people to be stuck with UCSC so we try to provide flexible tools that allow people to use annotations from different sources. Please let us know if you are in a situation where you are stuck and cannot use annotations that match your aligned data. - Most people simply exclude the MT chromosome from their analysis and are happy with that. Hope this helps, H. > > Best wishes > Julian > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY

Login before adding your answer.

Traffic: 679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6