Search
Question: Annotation for CNVs
0
12 months ago by
Haiying.Kong100
Germany
Haiying.Kong100 wrote:

Is there any Bioconductor tool for annotating CNVs? I could not find any.

modified 12 months ago by Martin Morgan ♦♦ 21k • written 12 months ago by Haiying.Kong100
2
12 months ago by
Martin Morgan ♦♦ 21k
United States
Martin Morgan ♦♦ 21k wrote:

How do you want to 'annotate' your CNVs? Maybe rtracklayer::import() and GenomicRanges::findOverlaps() are sufficient.

Thank you so much  :)

so do you know how to do this now？

The goal is to make a GRanges object that contains your copy number regions, a a GRanges object that contains your annotations.

For the first, maybe you have a plain text file with chromosome, start, end, strand information. You could read that in to R as a data.frame, then use

cnv = GenomicRanges::makeGRangesFromDataFrame()

For the second, the annotations may be available in a 'TxDb' package, or in a gtf (maybe from Ensembl and available through AnnotationHub, maybe from some other source). If from a gtf file, I'd suggest creating a TxDb object

txdb = makeTxDbFromGFF()  # see other makeTxDb

For use of AnnotationHub, see it's vignette. For use of TxDb packages, the flow is to install the package via biocLite() then use it

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg38.knownGene

For either txdb, I would then extract the GRanges corresponding to the features I wish to annotate, e.g.,

genes = genes(txdb)

find the overlaps between the 'query' cnv ranges and the 'subject' genes

olaps = GenomicRanges::findOverlaps(cnv, genes)

Create a 'long form' version of the data

long_annotated = cnv[queryHits(olaps)]
long_annotated$gene_id = genes[subjectHits(olaps)]$gene_id

Alternatively annotate each overlap with the gene id

mcols(olaps)$gene_id = genes$gene_id[subjectHits(olaps)]

and summarize the genes in each cnv

cnv_factor = factor(queryHits(olaps), levels=seq_len(queryHits(olaps)))
cvn$gene_id = IRanges::splitAsList(mcols(olaps)$gene_id, cnv_factor)

A function summarizing these steps is

annotateCnvs <-
function(cnv, txdb)
{
stopifnot(is(cnv, "GRanges"), is(txdb, "TxDb"))
anno = genes(txdb)
olaps = findOverlaps(cnv, anno)
mcols(olaps)$gene_id = genes$gene_id[subjectHits(olaps)]
cnv_factor = factor(queryHits(olaps), levels=seq_len(queryLength(olaps)))
cnv$gene_id = IRanges::splitAsList(mcols(olaps)$gene_id, cnv_factor)
cnv
}        

If I had a couple of copy number regions, created 'by hand' rather than using makeGRangesFrom... or rtracklayer::import()

cnv = GRanges(c("chr1:10000000-20000000", "chr20:1-1000000"))

I could get

> annotateCnvs(cnv, txdb)
GRanges object with 2 ranges and 1 metadata column:
seqnames               ranges strand |                           gene_id
<Rle>            <IRanges>  <Rle> |                   <CharacterList>
[1]     chr1 [10000000, 20000000]      * | 100133301,100302276,100379251,...
[2]    chr20 [       1,  1000000]      * |        100507459,10616,113278,...
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths

The gene_ids are Entrez gene identifiers, because that is the form of the annotation in the UCSC knownGene track.

Be careful that the coordinate system of the copy number variants are the same as the coordinate system of the annotation (e.g., 1-based, open intervals).

I have data from control-freec，foramt：chr start end num cnvtype（gain or loss），so how i annotate this ？

0
12 months ago by
Haiying.Kong100
Germany
Haiying.Kong100 wrote:

Probably this is too simple to make a package, I guess.

I probably just use bedtools to compare the regions I found and gene regions that I downloaded from UCSC.