I was wondering if it is possible to use VariantAnnotation on a non-model species. The example uses a pre-built human genome. I have a fasta file with genome and a vcf file with variants. Would it be possible to somehow read the fasta file and use it within VariantAnnotation.
Yes, you can use VariantAnnotation with other species. The key is to have the appropriate TxDb (annotations) and BSgenome (sequences) packages. A fasta file can be used instead of a BSgenome for predictCoding, see the 'seqsource' argument on the ?predictCoding man page.
To find the appropriate TxDb package, search the biocViews. Enter 'TxDb' in the search box:
I have used makeTranscriptDb as you suggested and it worked fine. I have pasted the code below if anyone would find it helpful. I have one more question. I have a data frame with number of coding variants per gene, I would also like to know the number of exons per gene (in my case exons and cds are the same) and the total length of exons (there is only one transcript per gene). Do you happen to know how I could do that?
I can use this to get the exons:
exons.list.per.gene <- exonsBy(txdb,by="gene")
But I don't know how to count them or calculate total length.
Agnieszka
#Code to get variants based on gff3 and vcf files
library(GenomicFeatures)
# create transcript database
txdb <- makeTranscriptDbFromGFF("variantannotation.gff3")
# read vcf file
vcf1<-readVcf("all.vcf","bolep")
rd <- rowData(vcf1)
# locate variants
loc <- locateVariants(rd, txdb, CodingVariants())
splt <- split(mcols(loc)$QUERYID, mcols(loc)$GENEID)
# get named list of counts of variants per gene
l <- sapply(splt, function(x) length(unique(x)))
# transform list to data frame
df <- stack(l)
If you only have 1 transcript per gene then the "total length of exons per gene" is just the length of each transcript right? And the length of a transcript is always the sum of the lengths of its exons, even if these exons are coming from different strands (a rare event) or from overlapping regions (a very very rare event). If that's what you want, then there is no need to use range() or to worry about the strand:
Calling length() on a GRangesList of exons by gene returns the number of genes:
elementLengths() counts the number of elements in a List, in this case it's the number of exons per gene:
The range of the exons in each gene is computed with range(). See ?range in GenomicRanges for details.
width() returns the length of the range of exons for each gene.
Be careful with strand. Some genes appear on both strands:
To avoid the strand problem, extract the genes separately. See ?genes and the 'single.strand.genes.only' argument.
Valerie
Great, thanks!
Hi,
If you only have 1 transcript per gene then the "total length of exons per gene" is just the length of each transcript right? And the length of a transcript is always the sum of the lengths of its exons, even if these exons are coming from different strands (a rare event) or from overlapping regions (a very very rare event). If that's what you want, then there is no need to use
range()
or to worry about the strand:Then:
HTH,
H.
Thanks! Arrived at something similar myself and worked well.
Agnieszka