VariantAnnotation non-model species
2
0
Entering edit mode
@miss-agnieszka-aleksandra-golicz-6531
Last seen 18 months ago
Australia

Hello,

I was wondering if it is possible to use VariantAnnotation on a non-model species. The example uses a pre-built human genome. I have a fasta file with genome and a vcf file with variants. Would it be possible to somehow read the fasta file and use it within VariantAnnotation​.

Best wishes,

Agnieszka

variantannotation • 2.0k views
3
Entering edit mode
@valerie-obenchain-4275
Last seen 2.9 years ago
United States

Hello,

Yes, you can use VariantAnnotation with other species. The key is to have the appropriate TxDb (annotations) and BSgenome (sequences) packages. A fasta file can be used instead of a BSgenome for predictCoding, see the 'seqsource' argument on the ?predictCoding man page.

To find the appropriate TxDb package, search the biocViews. Enter 'TxDb' in the search box:

http://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData

If you don't see your organism there, you can use functions in GenomicFeatures to create your own from UCSC, BioMart or a GFF file:

library(GenomicFeatures)

?makeTranscriptDb

Valerie

ADD COMMENT
0
Entering edit mode
@miss-agnieszka-aleksandra-golicz-6531
Last seen 18 months ago
Australia

Thanks very much.

I have used makeTranscriptDb as you suggested and it worked fine. I have pasted the code below if anyone would find it helpful. I have one more question. I have a data frame with number of coding variants per gene, I would also like to know the number of exons per gene (in my case exons and cds are the same) and the total length of exons (there is only one transcript per gene). Do you happen to know how I could do that?

I can use this to get the exons:

exons.list.per.gene <- exonsBy(txdb,by="gene")

But I don't know how to count them or calculate total length.

Agnieszka

#Code to get variants based on gff3 and vcf files 
library(GenomicFeatures) 

# create transcript database
txdb <- makeTranscriptDbFromGFF("variantannotation.gff3")

# read vcf file
vcf1<-readVcf("all.vcf","bolep")
rd <- rowData(vcf1)

# locate variants
loc <- locateVariants(rd, txdb, CodingVariants())
splt <- split(mcols(loc)$QUERYID, mcols(loc)$GENEID)

# get named list of counts of variants per gene
l <- sapply(splt, function(x) length(unique(x)))

# transform list to data frame
df <- stack(l)

 

ADD COMMENT
0
Entering edit mode
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)

> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

> exons <- exonsBy(txdb, "gene")

Calling length() on a GRangesList of exons by gene returns the number of genes:

> length(exons)

[1] 23459

elementLengths() counts the number of elements in a List, in this case it's the number of exons per gene:

> head(elementLengths(exons))
        1        10       100      1000     10000 100008586 
       15         2        12        17        17         4 

The range of the exons in each gene is computed with range(). See ?range in GenomicRanges for details.

> head(range(exons))
GRangesList object of length 6:
$1 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]    chr19 [58858172, 58874214]      -

$10 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
  [1]     chr8 [18248755, 18258723]      +

width() returns  the length of the range of exons for each gene.

> head(width(range(exons)))
IntegerList of length 6
[["1"]] 16043
[["10"]] 9969
[["100"]] 32214
[["1000"]] 226516
[["10000"]] 355352
[["100008586"]] 15729

Be careful with strand. Some genes appear on both strands:

> unique(elementLengths(runValue(strand(exons))))
[1] 1 2 3 4

The count of 4 is for '+' and '-' on both chr4 and chr4_ctg9_hap1.
​
> exons[which(elementLengths(runValue(strand(exons))) == 4)]
GRangesList object of length 1:
$7365
GRanges object with 26 ranges and 2 metadata columns:
             seqnames               ranges strand   |   exon_id   exon_name
                <Rle>            <IRanges>  <Rle>   | <integer> <character>
   [1]           chr4 [69681713, 69682203]      +   |     65621        <NA>
   [2]           chr4 [69681713, 69682455]      +   |     65622        <NA>
   [3]           chr4 [69683747, 69683895]      +   |     65623        <NA>
   [4]           chr4 [69687989, 69688120]      +   |     65624        <NA>
   [5]           chr4 [69692128, 69692215]      +   |     65625        <NA>
   ...            ...                  ...    ... ...       ...         ...
  [22] chr4_ctg9_hap1     [405298, 405517]      +   |    277857        <NA>
  [23] chr4_ctg9_hap1     [408569, 409992]      +   |    277858        <NA>
  [24] chr4_ctg9_hap1     [582546, 583104]      -   |    277889        <NA>
  [25] chr4_ctg9_hap1     [586826, 587045]      -   |    277890        <NA>
  [26] chr4_ctg9_hap1     [587885, 587974]      -   |    277891        <NA>

-------
seqinfo: 93 sequences (1 circular) from hg19 genome


To avoid the strand problem, extract the genes separately. See ?genes and the 'single.strand.genes.only' argument.

> genes <- genes(txdb)  ## single strand
> length(genes)
[1] 23056
> head(width(genes))
[1]  16043   9969  32214 226516 355352  15729

 

Valerie

 

 

ADD REPLY
0
Entering edit mode

Great, thanks!

1
Entering edit mode

Hi,

If you only have 1 transcript per gene then the "total length of exons per gene" is just the length of each transcript right? And the length of a transcript is always the sum of the lengths of its exons, even if these exons are coming from different strands (a rare event) or from overlapping regions (a very very rare event). If that's what you want, then there is no need to use range() or to worry about the strand:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE)
tx_lengths <- sum(width(ex_by_tx))

Then:

> head(tx_lengths)
uc001aaa.3 uc010nxq.1 uc010nxr.1 uc001aal.1 uc001aaq.2 uc001aar.2 
      1652       1488       1595        918         32         62 

HTH,

H.

ADD REPLY
0
Entering edit mode

Thanks! Arrived at something similar myself and worked well.

 

Agnieszka

Login before adding your answer.

Traffic: 489 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6