Dear Community,
based on the analysis results of an exome sequencing project (3 patients with paired cancer and normal samples-Small Cell Lung Cancer-Genomic DNA captured using Agilent in-solution enrichment methodology/paired-end 75 bases massively parallel sequencing on Illumina HiSeq4000)-
both for the alignment of fastq files, as also for the variant calling procedure, the following reference genome was selected and utilized from gencode:
https://www.gencodegenes.org/releases/current.html (Genome sequence (GRCh38.p12)-Regions-ALL-fasta format)
For my next step, i wanted to use the R package MutationalPatterns, in order to import the resulted vcf files, and inspect common mutational patterns (for SNPs). However, as this package utilizes the R package BSgenome for loading the reference genome:
https://bioconductor.org/packages/release/bioc/html/MutationalPatterns.html
library(BSgenome)
head(available.genomes())
which, for human includes the options: "BSgenome.Hsapiens.NCBI.GRCh38" and "BSgenome.Hsapiens.UCSC.hg38"
Thus, my crusial question is:
is possible, to also somehow install, modify and/or utilize also the gencode as a reference genome in the BSgenome R package ? with this structure ? in order to use it as a reference genome for my vcf files, in order to proceed for comparing the mutational patterns (SNPs) of these samples ? as described in the above vignette ?
Or alternatively, i could still use one of these two options ? For example the NCBI reference genome, as the relative from UCSC, has no updates from 2013 ? however, with this approach i could introduce considerable bias, from perhaps different annotations regarding the genomic coordinates ?
Kind Regards,
Efstathios-Iason
Dear Herve,
thank you very much for your answer and comments !! in order to verify that i fully understood your notion:
1) When you mentioned the GRCh38.p12 reference genome, you meant the relative for the gencode reference genome version that i have utilized, correct ? Thus, you believe that even i used the gencode version GRCh38.p12 for the alignment and variant calling, for my above approach, i could still utilize the NCBI "BSgenome.Hsapiens.NCBI.GRCh38" ? in order to inspect the mutational patterns of interest between the vcf files ? and could have a small impact/bias in my specific analysis ?
2) Or the only option would be to forge a new BSgenome data package ? And if so, in this scenario the fasta files would suffice ?
(*Please excuse me for any naive questions, but i pinpointed the first question, as it would be quite convinient if i could use as a solution the 1rst option)
Best,
Efstathios
I tried to explain the difference between GRCh38.p12 and GRCh38 the best I could. You need to understand this difference and decide whether it's important or not for your analysis. In other words, you need to evaluate if the 140 additional sequences provided by GRCh38.p12 are important for your analysis or not, by assessing how their presence or absence would impact the analysis. I don't know the details of your analysis so I can't do that for you. If their presence is important, forge a BSgenome package for GRCh38.p12. If they are not, use BSgenome.Hsapiens.NCBI.GRCh38 or BSgenome.Hsapiens.UCSC.hg38 (they are the same, except for the naming convention used for the sequence names).
You said you were planning to use the MutationalPatterns package for your analysis so maybe you could ask the MutationalPatterns authors if they can help with this. I would suggest that you ask a new question for this, with a clear post title, and that you tag the new post with MutationalPatterns.
Hope this helps,
H.
Dear Herve,
thank you for your updated comments and please excuse me for any misunderstanding, as i thought initially that the version of gencode, includes except these sequences that you mentioned above, some extra sequence information about scaffolds, etc-indeed i initially asked the authors about the impact of this issue for their R package and the reference genome, and they suggested that i should contact you or create a post-thus, one small last question concerning how to forge a new BSgenome package:
http://bioconductor.org/packages/release/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf
based on the section 2, the fa.gz file from gencode, containing all the fasta sequence files, would be sufficient to start the process ?
Best,
Efstathios
Not sure what you mean by a
"fa.gz
file containing all the fasta sequence files". Afa.gz
file is just a single compressed FASTA file so it does not contain other files. Maybe you mean afa.tgz
or afa.tar.gz
? Those are compressed tar archives (a.k.a. tarballs). They contain a collection of files that have been put together in a single file and that can be extracted withtar zxf myfile.tar.gz
. Or maybe you mean afa.gz
file containing all the fasta sequences (in which case all the sequences are in the same file).It's not the same and the difference is important. The former (i.e. a single
fa.gz
file with all the sequences in it) is not supported. Here is what the vignette says:The sequence data must be in a single twoBit file (e.g. musFur1.2bit) or in a collection of FASTA files (possibly gzip-compressed). If the latter, then you need 1 FASTA file per sequence that you want to put in the target package. In that case the name of each FASTA file must be of the form where is the name of the sequence in it and and are a prefix and a suffix (possibly empty) that are the same for all the FASTA files.
Even though FASTA files can be used (but only if you manage to get one file per sequence), using a 2bit file is much easier. It will also produce a BSgenome package that allows much faster random access (i.e. fast access to any part of any sequence).
If you cannot find a 2bit file for GRCh38.p12, it should be easy to make one using the following 3 steps: (1) download a FASTA file containing the full assembly, (2) read it into R with
Biostrings::
readDNAStringSet()
(this will load the 595 DNA sequences in a big DNAStringSet object), then export this DNAStringSet object in 2bit format withrtracklayer::export.2bit()
. When doing so, you have an opportunity to reorder the sequences if you wish so.H.