Include and implement a new reference genome option in the R package BSgenome for downstream variant calling analysis
Entering edit mode
svlachavas ▴ 830
Last seen 4 months ago
Germany/Heidelberg/German Cancer Resear…

Dear Community,

based on the analysis results of an exome sequencing project (3 patients with paired cancer and normal samples-Small Cell Lung Cancer-Genomic DNA captured using Agilent in-solution enrichment methodology/paired-end 75 bases massively parallel sequencing on Illumina HiSeq4000)-

both for the alignment of fastq files, as also for the variant calling procedure, the following reference genome was selected and utilized from gencode: (Genome sequence (GRCh38.p12)-Regions-ALL-fasta format)

For my next step, i wanted to use the R package MutationalPatterns, in order to import the resulted vcf files, and inspect common mutational patterns (for SNPs). However, as this package utilizes the R package BSgenome for loading the reference genome:



which, for human includes the options: "BSgenome.Hsapiens.NCBI.GRCh38" and "BSgenome.Hsapiens.UCSC.hg38"

Thus, my crusial question is:

is possible, to also somehow install, modify and/or utilize also the gencode as a reference genome in the BSgenome R package ? with this structure ? in order to use it as a reference genome for my vcf files, in order to proceed for comparing the mutational patterns (SNPs) of these samples ? as described in the above vignette ?

Or alternatively, i could still use one of these two options ? For example the NCBI reference genome, as the relative from UCSC, has no updates from 2013 ? however, with this approach i could introduce considerable bias, from perhaps different annotations regarding the genomic coordinates ? 

Kind Regards,



bsgenome variant calling whole exome sequence gencode reference genome • 1.7k views
Entering edit mode
Last seen 23 hours ago
Seattle, WA, United States


As far as I know the GRCh38.p12 reference genome is the same as GRCh38. The only difference is that GRCh38.p12 adds corrections to GRCh38 in the form of additional sequences with respect to the 455 sequences present in GRCh38. So the set of sequences in GRCh38.p12 is a superset of the set of sequences in GRCh38. But the important thing to realize is that the sequences in the original set of 455 sequences has not changed between GRCh38 and GRCh38.p12.

Using unexported/undocumented internal helper fetch_assembly_report() from the GenomeInfoDb package:


## Use the GenBank (or RefSeq) accessions of the assemblies to
## fetch the assembly reports from NCBI:
GRCh38 <- GenomeInfoDb:::fetch_assembly_report("GCA_000001405.15")
GRCh38.p12 <- GenomeInfoDb:::fetch_assembly_report("GCA_000001405.27")

## The assembly report is returned as a data.frame:
# [1] 455  10
# [1] 595  10

#   SequenceName       SequenceRole AssignedMolecule
# 1            1 assembled-molecule                1
# 2            2 assembled-molecule                2
# 3            3 assembled-molecule                3
# 4            4 assembled-molecule                4
# 5            5 assembled-molecule                5
# 6            6 assembled-molecule                6
#   AssignedMoleculeLocationOrType GenBankAccn Relationship   RefSeqAccn
# 1                     Chromosome  CM000663.2            = NC_000001.11
# 2                     Chromosome  CM000664.2            = NC_000002.12
# 3                     Chromosome  CM000665.2            = NC_000003.12
# 4                     Chromosome  CM000666.2            = NC_000004.12
# 5                     Chromosome  CM000667.2            = NC_000005.10
# 6                     Chromosome  CM000668.2            = NC_000006.12
#       AssemblyUnit SequenceLength UCSCStyleName
# 1 Primary Assembly      248956422          chr1
# 2 Primary Assembly      242193529          chr2
# 3 Primary Assembly      198295559          chr3
# 4 Primary Assembly      190214555          chr4
# 5 Primary Assembly      181538259          chr5
# 6 Primary Assembly      170805979          chr6

## GRCh38.p12 is a superset of GRCh38:
all(GRCh38$SequenceName %in% GRCh38.p12$SequenceName)
# [1] TRUE

## Display some of the 140 sequence names that are new in GRCh38.p12:
setdiff(GRCh38.p12$SequenceName, GRCh38$SequenceName)[1:30]
#  [1] "HG1342_HG2282_PATCH" "HSCHR1_5_CTG3"       "HG2095_PATCH"       
#  [4] "HSCHR1_4_CTG3"       "HG2058_PATCH"        "HSCHR1_8_CTG3"      
#  [7] "HG986_PATCH"         "HG460_PATCH"         "HSCHR1_3_CTG3"      
# [10] "HSCHR1_6_CTG3"       "HSCHR1_9_CTG3"       "HG2104_PATCH"       
# [13] "HG1832_PATCH"        "HG2002_PATCH"        "HSCHR1_5_CTG32_1"   
# [16] "HG2290_PATCH"        "HSCHR2_6_CTG7_2"     "HSCHR2_7_CTG7_2"    
# [19] "HSCHR2_8_CTG7_2"     "HG2233_PATCH"        "HG2232_PATCH"       
# [22] "HG2066_PATCH"        "HG126_PATCH"         "HG2235_PATCH"       
# [25] "HG2236_PATCH"        "HSCHR3_4_CTG1"       "HG2022_PATCH"       
# [28] "HG2237_PATCH"        "HG2133_PATCH"        "HSCHR3_7_CTG2_1"    

If you need these new sequences for your analysis, you can always make a BSgenome for GRCh38.p12. See the "How to forge a BSgenome data package" vignette in the BSgenome package for how to do this:



Entering edit mode

Dear Herve,

thank you very much for your answer and comments !! in order to verify that i fully understood your notion:

1) When you mentioned the GRCh38.p12 reference genome, you meant the relative for the gencode reference genome version that i have utilized, correct ? Thus, you believe that even i used the gencode version GRCh38.p12 for the alignment and variant calling, for my above approach, i could still utilize the NCBI "BSgenome.Hsapiens.NCBI.GRCh38" ? in order to inspect the mutational patterns of interest between the vcf files ? and could have a small impact/bias in my specific analysis ?

2) Or the only option would be to forge a new BSgenome data package ? And if so, in this scenario the fasta files would suffice ?

(*Please excuse me for any naive questions, but i pinpointed the first question, as it would be quite convinient if i could use as a solution the 1rst option)



Entering edit mode

I tried to explain the difference between GRCh38.p12 and GRCh38 the best I could. You need to understand this difference and decide whether it's important or not for your analysis. In other words, you need to evaluate if the 140 additional sequences provided by GRCh38.p12 are important for your analysis or not, by assessing how their presence or absence would impact the analysis. I don't know the details of your analysis so I can't do that for you. If their presence is important, forge a BSgenome package for GRCh38.p12. If they are not, use BSgenome.Hsapiens.NCBI.GRCh38 or BSgenome.Hsapiens.UCSC.hg38 (they are the same, except for the naming convention used for the sequence names).

You said you were planning to use the MutationalPatterns package for your analysis so maybe you could ask the MutationalPatterns authors if they can help with this. I would suggest that you ask a new question for this, with a clear post title, and that you tag the new post with MutationalPatterns.

Hope this helps,


Entering edit mode

Dear Herve,

thank you for your updated comments and please excuse me for any misunderstanding, as i thought initially that the version of gencode, includes except these sequences that you mentioned above, some extra sequence information about scaffolds, etc-indeed i initially asked the authors about the impact of this issue for their R package and the reference genome, and they suggested that i should contact you or create a post-thus, one small last question concerning how to forge a new BSgenome package:

based on the section 2, the fa.gz file from gencode, containing all the fasta sequence files, would be sufficient to start the process ?



Entering edit mode

Not sure what you mean by a "fa.gz file containing all the fasta sequence files". A fa.gz file is just a single compressed FASTA file so it does not contain other files. Maybe you mean a fa.tgz or a fa.tar.gz? Those are compressed tar archives (a.k.a. tarballs). They contain a collection of files that have been put together in a single file and that can be extracted with tar zxf myfile.tar.gz. Or maybe you mean a fa.gz file containing all the fasta sequences (in which case all the sequences are in the same file).

It's not the same and the difference is important. The former (i.e. a single fa.gz file with all the sequences in it) is not supported. Here is what the vignette says:

The sequence data must be in a single twoBit file (e.g. musFur1.2bit) or in a collection of FASTA files (possibly gzip-compressed). If the latter, then you need 1 FASTA file per sequence that you want to put in the target package. In that case the name of each FASTA file must be of the form where is the name of the sequence in it and and are a prefix and a suffix (possibly empty) that are the same for all the FASTA files.

Even though FASTA files can be used (but only if you manage to get one file per sequence), using a 2bit file is much easier. It will also produce a BSgenome package that allows much faster random access (i.e. fast access to any part of any sequence).

If you cannot find a 2bit file for GRCh38.p12, it should be easy to make one using the following 3 steps: (1) download a FASTA file containing the full assembly, (2) read it into R with Biostrings::readDNAStringSet() (this will load the 595 DNA sequences in a big DNAStringSet object), then export this DNAStringSet object in 2bit format with rtracklayer::export.2bit(). When doing so, you have an opportunity to reorder the sequences if you wish so.




Login before adding your answer.

Traffic: 542 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6