I was wondering if it is possible to use VariantAnnotation on a non-model species. The example uses a pre-built human genome. I have a fasta file with genome and a vcf file with variants. Would it be possible to somehow read the fasta file and use it within VariantAnnotation.
Yes, you can use VariantAnnotation with other species. The key is to have the appropriate TxDb (annotations) and BSgenome (sequences) packages. A fasta file can be used instead of a BSgenome for predictCoding, see the 'seqsource' argument on the ?predictCoding man page.
To find the appropriate TxDb package, search the biocViews. Enter 'TxDb' in the search box:
I have used makeTranscriptDb as you suggested and it worked fine. I have pasted the code below if anyone would find it helpful. I have one more question. I have a data frame with number of coding variants per gene, I would also like to know the number of exons per gene (in my case exons and cds are the same) and the total length of exons (there is only one transcript per gene). Do you happen to know how I could do that?
I can use this to get the exons:
exons.list.per.gene <- exonsBy(txdb,by="gene")
But I don't know how to count them or calculate total length.
Agnieszka
#Code to get variants based on gff3 and vcf files
library(GenomicFeatures)
# create transcript database
txdb <- makeTranscriptDbFromGFF("variantannotation.gff3")
# read vcf file
vcf1<-readVcf("all.vcf","bolep")
rd <- rowData(vcf1)
# locate variants
loc <- locateVariants(rd, txdb, CodingVariants())
splt <- split(mcols(loc)$QUERYID, mcols(loc)$GENEID)
# get named list of counts of variants per gene
l <- sapply(splt, function(x) length(unique(x)))
# transform list to data frame
df <- stack(l)
If you only have 1 transcript per gene then the "total length of exons per gene" is just the length of each transcript right? And the length of a transcript is always the sum of the lengths of its exons, even if these exons are coming from different strands (a rare event) or from overlapping regions (a very very rare event). If that's what you want, then there is no need to use range() or to worry about the strand:
Calling length() on a GRangesList of exons by gene returns the number of genes:
elementLengths() counts the number of elements in a List, in this case it's the number of exons per gene:
> head(elementLengths(exons)) 1 10 100 1000 10000 100008586 15 2 12 17 17 4The range of the exons in each gene is computed with range(). See ?range in GenomicRanges for details.
> head(range(exons)) GRangesList object of length 6: $1 GRanges object with 1 range and 0 metadata columns: seqnames ranges strand <Rle> <IRanges> <Rle> [1] chr19 [58858172, 58874214] - $10 GRanges object with 1 range and 0 metadata columns: seqnames ranges strand [1] chr8 [18248755, 18258723] +width() returns the length of the range of exons for each gene.
Be careful with strand. Some genes appear on both strands:
> unique(elementLengths(runValue(strand(exons)))) [1] 1 2 3 4 The count of 4 is for '+' and '-' on both chr4 and chr4_ctg9_hap1.  > exons[which(elementLengths(runValue(strand(exons))) == 4)] GRangesList object of length 1: $7365 GRanges object with 26 ranges and 2 metadata columns: seqnames ranges strand | exon_id exon_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr4 [69681713, 69682203] + | 65621 <NA> [2] chr4 [69681713, 69682455] + | 65622 <NA> [3] chr4 [69683747, 69683895] + | 65623 <NA> [4] chr4 [69687989, 69688120] + | 65624 <NA> [5] chr4 [69692128, 69692215] + | 65625 <NA> ... ... ... ... ... ... ... [22] chr4_ctg9_hap1 [405298, 405517] + | 277857 <NA> [23] chr4_ctg9_hap1 [408569, 409992] + | 277858 <NA> [24] chr4_ctg9_hap1 [582546, 583104] - | 277889 <NA> [25] chr4_ctg9_hap1 [586826, 587045] - | 277890 <NA> [26] chr4_ctg9_hap1 [587885, 587974] - | 277891 <NA> ------- seqinfo: 93 sequences (1 circular) from hg19 genomeTo avoid the strand problem, extract the genes separately. See ?genes and the 'single.strand.genes.only' argument.
Valerie
Great, thanks!
Hi,
If you only have 1 transcript per gene then the "total length of exons per gene" is just the length of each transcript right? And the length of a transcript is always the sum of the lengths of its exons, even if these exons are coming from different strands (a rare event) or from overlapping regions (a very very rare event). If that's what you want, then there is no need to use
range()or to worry about the strand:Then:
> head(tx_lengths) uc001aaa.3 uc010nxq.1 uc010nxr.1 uc001aal.1 uc001aaq.2 uc001aar.2 1652 1488 1595 918 32 62HTH,
H.
Thanks! Arrived at something similar myself and worked well.
Agnieszka