Question

Include and implement a new reference genome option in the R package BSgenome for downstream variant calling analysis


svlachavas ▴ 830


Dear Community,

I am analysing an exome sequencing project: 3 patients with small cell lung cancer, each with paired cancer and normal samples (genomic DNA captured with the Agilent in-solution enrichment methodology, paired-end 75-base massively parallel sequencing on an Illumina HiSeq4000).

Both for the alignment of the FASTQ files and for the variant calling, the following reference genome from GENCODE was used:

https://www.gencodegenes.org/releases/current.html (Genome sequence (GRCh38.p12)-Regions-ALL-fasta format)

For my next step, I want to use the R package MutationalPatterns to import the resulting VCF files and inspect common mutational patterns (for SNPs). However, this package relies on the R package BSgenome to load the reference genome:

https://bioconductor.org/packages/release/bioc/html/MutationalPatterns.html

library(BSgenome)

head(available.genomes())

For human, the available genomes include "BSgenome.Hsapiens.NCBI.GRCh38" and "BSgenome.Hsapiens.UCSC.hg38".

Thus, my crucial question is:

Is it possible to install, adapt, or otherwise use the GENCODE genome, in its current form, as a reference genome within the BSgenome framework, so that I can use it as the reference for my VCF files and compare the mutational patterns (SNPs) of these samples as described in the vignette of the package linked above?

Or, alternatively, could I still use one of these two options? For example the NCBI reference genome, since the corresponding UCSC package has had no updates since 2013? My concern is that this approach could introduce considerable bias, for example from differences in the genomic coordinates between the references.

Kind Regards,

Efstathios-Iason

bsgenome variant calling whole exome sequence gencode reference genome • 2.0k views

updated 5.9 years ago by Hervé Pagès • written 5.9 years ago by svlachavas

score 1 · Answer 1 · 2018-08-23

Hi,

As far as I know, the GRCh38.p12 reference genome is the same as GRCh38. The only difference is that GRCh38.p12 adds corrections to GRCh38 in the form of additional sequences on top of the 455 sequences present in GRCh38. So the set of sequences in GRCh38.p12 is a superset of the set of sequences in GRCh38. But the important thing to realize is that the original 455 sequences have not changed between GRCh38 and GRCh38.p12.

This can be checked with the unexported, undocumented internal helper fetch_assembly_report() from the GenomeInfoDb package:

library(GenomeInfoDb)

## Use the GenBank (or RefSeq) accessions of the assemblies to
## fetch the assembly reports from NCBI:
GRCh38 <- GenomeInfoDb:::fetch_assembly_report("GCA_000001405.15")
GRCh38.p12 <- GenomeInfoDb:::fetch_assembly_report("GCA_000001405.27")

## The assembly report is returned as a data.frame:
dim(GRCh38)
# [1] 455  10
dim(GRCh38.p12)
# [1] 595  10

head(GRCh38)
#   SequenceName       SequenceRole AssignedMolecule
# 1            1 assembled-molecule                1
# 2            2 assembled-molecule                2
# 3            3 assembled-molecule                3
# 4            4 assembled-molecule                4
# 5            5 assembled-molecule                5
# 6            6 assembled-molecule                6
#   AssignedMoleculeLocationOrType GenBankAccn Relationship   RefSeqAccn
# 1                     Chromosome  CM000663.2            = NC_000001.11
# 2                     Chromosome  CM000664.2            = NC_000002.12
# 3                     Chromosome  CM000665.2            = NC_000003.12
# 4                     Chromosome  CM000666.2            = NC_000004.12
# 5                     Chromosome  CM000667.2            = NC_000005.10
# 6                     Chromosome  CM000668.2            = NC_000006.12
#       AssemblyUnit SequenceLength UCSCStyleName
# 1 Primary Assembly      248956422          chr1
# 2 Primary Assembly      242193529          chr2
# 3 Primary Assembly      198295559          chr3
# 4 Primary Assembly      190214555          chr4
# 5 Primary Assembly      181538259          chr5
# 6 Primary Assembly      170805979          chr6

## GRCh38.p12 is a superset of GRCh38:
all(GRCh38$SequenceName %in% GRCh38.p12$SequenceName)
# [1] TRUE

## Display some of the 140 sequence names that are new in GRCh38.p12:
setdiff(GRCh38.p12$SequenceName, GRCh38$SequenceName)[1:30]
#  [1] "HG1342_HG2282_PATCH" "HSCHR1_5_CTG3"       "HG2095_PATCH"       
#  [4] "HSCHR1_4_CTG3"       "HG2058_PATCH"        "HSCHR1_8_CTG3"      
#  [7] "HG986_PATCH"         "HG460_PATCH"         "HSCHR1_3_CTG3"      
# [10] "HSCHR1_6_CTG3"       "HSCHR1_9_CTG3"       "HG2104_PATCH"       
# [13] "HG1832_PATCH"        "HG2002_PATCH"        "HSCHR1_5_CTG32_1"   
# [16] "HG2290_PATCH"        "HSCHR2_6_CTG7_2"     "HSCHR2_7_CTG7_2"    
# [19] "HSCHR2_8_CTG7_2"     "HG2233_PATCH"        "HG2232_PATCH"       
# [22] "HG2066_PATCH"        "HG126_PATCH"         "HG2235_PATCH"       
# [25] "HG2236_PATCH"        "HSCHR3_4_CTG1"       "HG2022_PATCH"       
# [28] "HG2237_PATCH"        "HG2133_PATCH"        "HSCHR3_7_CTG2_1"
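
To judge whether these extra sequences matter for a given set of VCF files, one option is to compare the sequence names and lengths used in the VCF files with those of an existing BSgenome package. The following is only a minimal sketch in base R: compare_seqlengths() is a hypothetical helper (not part of BSgenome or GenomeInfoDb), and the input is toy data. The chr1 to chr3 lengths are taken from the assembly report above, while the "example_patch" entry and its length are placeholders.

```r
## Hypothetical helper (not part of any package): compare two named
## vectors of sequence lengths, e.g. the contig lengths declared in a
## VCF header against the sequence lengths of a BSgenome object.
compare_seqlengths <- function(vcf_lengths, ref_lengths) {
  shared <- intersect(names(vcf_lengths), names(ref_lengths))
  list(
    shared      = shared,
    mismatched  = shared[vcf_lengths[shared] != ref_lengths[shared]],
    only_in_vcf = setdiff(names(vcf_lengths), names(ref_lengths))
  )
}

## Toy input. "example_patch" and its length are placeholders.
vcf_lengths <- c(chr1 = 248956422, chr2 = 242193529, example_patch = 1000)
ref_lengths <- c(chr1 = 248956422, chr2 = 242193529, chr3 = 198295559)

res <- compare_seqlengths(vcf_lengths, ref_lengths)
res$mismatched   # shared sequences whose lengths disagree
res$only_in_vcf  # sequences the reference does not provide
```

With real data, ref_lengths could come from calling seqlengths() on the BSgenome object, and vcf_lengths from the contig lines of the VCF header. If nothing is mismatched and no variants fall on the sequences listed in only_in_vcf, the existing package is probably sufficient.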

If you need these new sequences for your analysis, you can always make a BSgenome for GRCh38.p12. See the "How to forge a BSgenome data package" vignette in the BSgenome package for how to do this: http://bioconductor.org/packages/BSgenome

Cheers,

H.