Question

VariantAnnotation non-model species

0

Entering edit mode

Miss Agnieszka Aleksandra Golicz ▴ 40

@miss-agnieszka-aleksandra-golicz-6531

Last seen 11 months ago

Australia

Hello,

I was wondering if it is possible to use VariantAnnotation on a non-model species. The example uses a pre-built human genome. I have a fasta file with genome and a vcf file with variants. Would it be possible to somehow read the fasta file and use it within VariantAnnotation.

Best wishes,

Agnieszka

variantannotation • 1.9k views

ADD COMMENT • link 9.2 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

Miss Agnieszka Aleksandra Golicz ▴ 40

@miss-agnieszka-aleksandra-golicz-6531

Last seen 11 months ago

Australia

Thanks very much.

I have used makeTranscriptDb as you suggested and it worked fine. I have pasted the code below if anyone would find it helpful. I have one more question. I have a data frame with number of coding variants per gene, I would also like to know the number of exons per gene (in my case exons and cds are the same) and the total length of exons (there is only one transcript per gene). Do you happen to know how I could do that?

I can use this to get the exons:

exons.list.per.gene <- exonsBy(txdb,by="gene")

But I don't know how to count them or calculate total length.

Agnieszka

#Code to get variants based on gff3 and vcf files 
library(GenomicFeatures) 

# create transcript database
txdb <- makeTranscriptDbFromGFF("variantannotation.gff3")

# read vcf file
vcf1<-readVcf("all.vcf","bolep")
rd <- rowData(vcf1)

# locate variants
loc <- locateVariants(rd, txdb, CodingVariants())
splt <- split(mcols(loc)$QUERYID, mcols(loc)$GENEID)

# get named list of counts of variants per gene
l <- sapply(splt, function(x) length(unique(x)))

# transform list to data frame
df <- stack(l)

ADD COMMENT • link updated 9.2 years ago by Martin Morgan 25k • written 9.2 years ago by Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)

> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

> exons <- exonsBy(txdb, "gene")

Calling length() on a GRangesList of exons by gene returns the number of genes:

> length(exons)

[1] 23459

elementLengths() counts the number of elements in a List, in this case it's the number of exons per gene:

> head(elementLengths(exons))
        1        10       100      1000     10000 100008586 
       15         2        12        17        17         4

The range of the exons in each gene is computed with range(). See ?range in GenomicRanges for details.

> head(range(exons))
GRangesList object of length 6:
$1 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]    chr19 [58858172, 58874214]      -

$10 
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
  [1]     chr8 [18248755, 18258723]      +

width() returns the length of the range of exons for each gene.

> head(width(range(exons)))
IntegerList of length 6
[["1"]] 16043
[["10"]] 9969
[["100"]] 32214
[["1000"]] 226516
[["10000"]] 355352
[["100008586"]] 15729

Be careful with strand. Some genes appear on both strands:

> unique(elementLengths(runValue(strand(exons))))
[1] 1 2 3 4

The count of 4 is for '+' and '-' on both chr4 and chr4_ctg9_hap1.

> exons[which(elementLengths(runValue(strand(exons))) == 4)]
GRangesList object of length 1:
$7365
GRanges object with 26 ranges and 2 metadata columns:
             seqnames               ranges strand   |   exon_id   exon_name
                <Rle>            <IRanges>  <Rle>   | <integer> <character>
   [1]           chr4 [69681713, 69682203]      +   |     65621        <NA>
   [2]           chr4 [69681713, 69682455]      +   |     65622        <NA>
   [3]           chr4 [69683747, 69683895]      +   |     65623        <NA>
   [4]           chr4 [69687989, 69688120]      +   |     65624        <NA>
   [5]           chr4 [69692128, 69692215]      +   |     65625        <NA>
   ...            ...                  ...    ... ...       ...         ...
  [22] chr4_ctg9_hap1     [405298, 405517]      +   |    277857        <NA>
  [23] chr4_ctg9_hap1     [408569, 409992]      +   |    277858        <NA>
  [24] chr4_ctg9_hap1     [582546, 583104]      -   |    277889        <NA>
  [25] chr4_ctg9_hap1     [586826, 587045]      -   |    277890        <NA>
  [26] chr4_ctg9_hap1     [587885, 587974]      -   |    277891        <NA>

-------
seqinfo: 93 sequences (1 circular) from hg19 genome

To avoid the strand problem, extract the genes separately. See ?genes and the 'single.strand.genes.only' argument.

> genes <- genes(txdb)  ## single strand
> length(genes)
[1] 23056
> head(width(genes))
[1]  16043   9969  32214 226516 355352  15729

Valerie

ADD REPLY • link 9.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Great, thanks!

ADD REPLY • link 9.2 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

1

Entering edit mode

Hi,

If you only have 1 transcript per gene then the "total length of exons per gene" is just the length of each transcript right? And the length of a transcript is always the sum of the lengths of its exons, even if these exons are coming from different strands (a rare event) or from overlapping regions (a very very rare event). If that's what you want, then there is no need to use range() or to worry about the strand:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE)
tx_lengths <- sum(width(ex_by_tx))

Then:

> head(tx_lengths)
uc001aaa.3 uc010nxq.1 uc010nxr.1 uc001aal.1 uc001aaq.2 uc001aar.2 
      1652       1488       1595        918         32         62

HTH,

H.

ADD REPLY • link 9.2 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks! Arrived at something similar myself and worked well.

Agnieszka

ADD REPLY • link 9.2 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

score 3 · Accepted Answer · 2015-02-09

Hello,

Yes, you can use VariantAnnotation with other species. The key is to have the appropriate TxDb (annotations) and BSgenome (sequences) packages. A fasta file can be used instead of a BSgenome for predictCoding, see the 'seqsource' argument on the ?predictCoding man page.

To find the appropriate TxDb package, search the biocViews. Enter 'TxDb' in the search box:

http://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData

If you don't see your organism there, you can use functions in GenomicFeatures to create your own from UCSC, BioMart or a GFF file:

library(GenomicFeatures)

?makeTranscriptDb

Valerie