Question: Create protein sequences including variants from a VCF file

Dear all,

I am investigating the proteome of human cancer samples and want to insert their genetic variations into the reference proteome fasta sequences to increase the sensitivity of my peptide/protein quantification.

Could a function along the lines of "proteomeVariantInsertion()" be implemented in the VariantAnnotation package?

The VariantAnnotation::predictCoding() function already translates codons at variant positions from a reference BSgenome object to assess the consequences of a variant. I would like to take all coding variants (or just non-synonymous SNVs for a start) and insert them into the reference proteome, then save the modified fasta file.

On customProDB: In principle, the customProDB package already does this job. However, of 11,000 genes carrying ~40k non-synonymous SNVs (extracted using VariantAnnotation::predictCoding()), only ~2k proteins end up changed by at least one variant, which is too much loss. The customProDB package also works mostly on custom data.frames and could make much more use of the maintained Bioconductor objects for variants and sequences.

I would highly appreciate a "Bioconductor-native" solution for the customized proteome challenge.

Thanks, Daniel

modified 5 months ago • written 5 months ago by daniel.magnus.bader30

This sounds like a feature request. Could you please open it as an issue on the GitHub page for the package: https://github.com/Bioconductor/VariantAnnotation/issues

This is potentially doable with BSgenome::injectSNPs().

Thanks for the reply, sheperl. I did not want to start right away with an issue, but I have now posted it here: https://github.com/Bioconductor/VariantAnnotation/issues/24

Meanwhile, I will look at BSgenome::injectSNPs(), which indeed sounds very interesting. Thanks Michael!

FWIW you might also want to take a look at Biostrings::replaceAt(). It's lower level and more flexible than BSgenome::injectSNPs() (the former works on AAString/AAStringSet/DNAString/DNAStringSet objects while the latter only works on a BSgenome object).
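
To make the idea of position-based replacement concrete, here is a minimal sketch in plain Python. It is not the Biostrings API, only an illustration of the concept: the helper name `replace_at` and the toy sequences are hypothetical, ranges are assumed to be 1-based, inclusive, and non-overlapping.

```python
def replace_at(seq, edits):
    """Return seq with each (start, end, value) edit applied.

    start and end are 1-based and inclusive; edits are assumed not to overlap.
    Works on any string, so it applies to DNA and amino-acid sequences alike.
    """
    out = seq
    # Apply edits from right to left so that earlier coordinates stay valid
    # even when a replacement changes the sequence length.
    for start, end, value in sorted(edits, reverse=True):
        if not (1 <= start <= end <= len(seq)):
            raise ValueError(f"range {start}-{end} lies outside the sequence")
        out = out[:start - 1] + value + out[end:]
    return out


# Toy protein with two amino-acid substitutions (positions 2 and 5).
protein = "MAKTLV"
variant_protein = replace_at(protein, [(2, 2, "V"), (5, 5, "P")])

# The same helper on a toy DNA sequence: a single-base substitution at position 5.
dna = "ATGGCCAAA"
variant_dna = replace_at(dna, [(5, 5, "T")])
```

Because the helper operates on the sequence itself rather than on a whole genome, it can be applied either at the DNA level before translation or directly at the protein level.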

Thanks Herve,

However, I think injectSNPs should be sufficient, since I start from genome coordinates in a VCF.

Best, Daniel

Answer: Create protein sequences including variants from a VCF file

Thanks to the suggestions from Michael Lawrence and Herve Pages, I guess it should work as follows:

1. Identify all coding SNVs, e.g. via VariantAnnotation::predictCoding()
2. Inject the coding SNVs into the genome, e.g. via BSgenome::injectSNPs()
3. Concatenate the exons of each protein isoform of a gene harboring a coding SNV to obtain all relevant coding sequences (already modified)
4. Translate these into AAString objects, e.g. via Biostrings::translate()

Should work. Use GenomicFeatures::extractTranscriptSeqs() for #3.
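
The four steps can be sketched end to end on toy data. This is a plain-Python illustration of the logic only, not the Bioconductor functions named above: every function name, the toy chromosome, the exon coordinates, and the transcript name `toy_tx1` are hypothetical. Coordinates are assumed to be 1-based and inclusive, and the exon ranges are assumed to cover coding sequence only.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def coding_only(snvs, exons):
    """Step 1: keep SNVs (pos, ref, alt) that fall inside a coding exon."""
    return [v for v in snvs if any(s <= v[0] <= e for s, e in exons)]


def inject_snvs(chrom_seq, snvs):
    """Step 2: substitute each SNV into the chromosome sequence."""
    bases = list(chrom_seq)
    for pos, ref, alt in snvs:
        if bases[pos - 1] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        bases[pos - 1] = alt
    return "".join(bases)


def extract_cds(chrom_seq, exons, strand="+"):
    """Step 3: concatenate exon sequences; reverse-complement on the minus strand."""
    cds = "".join(chrom_seq[s - 1:e] for s, e in sorted(exons))
    if strand == "-":
        cds = cds.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return cds


def translate(cds):
    """Step 4: translate complete codons; '*' marks a stop codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


# Toy chromosome: two coding exons (3-8 and 15-20) separated by an intron.
chrom = "CC" + "ATGGCC" + "GTAAGT" + "AAATAA" + "GG"
exons = [(3, 8), (15, 20)]
snvs = [(7, "C", "T"),   # coding: GCC -> GTC
        (10, "T", "C")]  # intronic: dropped in step 1

coding_snvs = coding_only(snvs, exons)
reference_protein = translate(extract_cds(chrom, exons))
variant_protein = translate(extract_cds(inject_snvs(chrom, coding_snvs), exons))

# FASTA-style record for the modified protein.
fasta_record = f">toy_tx1\n{variant_protein}\n"
```

Injecting at the genome level before extracting the transcript means every isoform overlapping the variant picks it up automatically, which is why the order of steps 2 and 3 matters.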